Tuesday, May 2, 2023

So I asked ChatGPT and Bard to analyze the data for me...

Unless you've lived under a rock for the last 5+ months, you have probably heard about OpenAI's ChatGPT and Google's Bard, two generative AI chatbots built on Large Language Models (LLMs).  ChatGPT's training data only goes up to September of 2021, and as of this writing it is not connected to the Internet to pull information.  Google's Bard was recently (April 21, 2023) updated with the ability to help with code (generation / debugging / explanation) (article).

I played with ChatGPT when it first came out, after my wife asked me about it.  I had read about Natural Language Processing (NLP) a few years back, but I hadn't kept up with the recent advancements.  I didn't think it would be that unique, considering what I knew about NLP, our ability to google for an answer, and the data cutoff of September 2021.  The surprise came only after I used it for coding help - Excel, Python, Power BI, etc.  It was able to write code, or generate lines of code, for a specific request!  It saved me the time of reading through all the Google search results / Stack Overflow answers to determine whether one of them is the correct solution to my issue.  The key to getting the reply you are looking for is prompt engineering: you may need to provide certain structure, information, or examples in your prompt to get ChatGPT to generate the intended reply.

So, what prompt did I use?

please conduct a data analysis of the data from the attached link and provide a 4 paragraph summary - https://grouplens.org/datasets/movielens/1m/

From ChatGPT -

As an AI language model, I do not have access to the attached link. However, I can provide a general summary of the MovieLens 1M dataset provided by GroupLens and discuss potential insights that can be gained from conducting data analysis on the dataset.

The MovieLens 1M dataset contains movie ratings and demographic information for approximately 1 million users, along with information on 3,900 movies. The dataset includes information on user demographics such as age, gender, and occupation, as well as movie information such as genre, release year, and title. Ratings are provided on a scale of 1 to 5, with higher ratings indicating greater enjoyment of the movie.

By analyzing the MovieLens 1M dataset, one can gain valuable insights into user preferences for movies and identify trends and patterns in movie ratings. For example, one could examine the distribution of ratings across different genres of movies to identify which genres are more popular among users. Additionally, one could use clustering techniques to group users with similar tastes and identify common characteristics of those groups. These insights can be valuable for movie producers and distributors in understanding which types of movies are likely to be successful among different segments of the population.

Moreover, the MovieLens 1M dataset can be used to develop recommendation systems for movies. By analyzing the ratings of individual users and identifying similar users with similar ratings, one can develop personalized recommendations for users based on their individual tastes. This can improve the user experience on movie streaming platforms and increase engagement with the platform.

In conclusion, the MovieLens 1M dataset is a valuable resource for conducting data analysis on user preferences for movies. The dataset provides rich information on movie ratings and user demographics, and can be used to develop insights into user preferences, identify trends and patterns in movie ratings, and develop recommendation systems for movies. These insights can be valuable for movie producers and distributors in developing successful movies and improving the user experience on movie streaming platforms.

To say I'm disappointed is an understatement.  I guess ChatGPT still does not connect to the internet, and it did not have historical information from the internet about this old dataset.

Since Bard's recent update showed it is now capable of connecting to the internet, I figured why not give it a try as well.

Using the same prompt, this is from Bard -

Sure, I can do that. Here is a 4 paragraph summary of the data from the attached link:

The MovieLens 1M dataset is a collection of 1 million movie ratings from 6,000 users. The dataset includes information about the users, the movies, and the ratings. The users are identified by a unique ID number, and the movies are identified by a unique ID number and a title. The ratings are on a scale of 1 to 5 stars.

The data analysis of the MovieLens 1M dataset revealed some interesting insights about movie ratings. For example, the average rating for all movies in the dataset is 3.5 stars. The most popular movies in the dataset are those that have been rated highly by many users. These movies include The Shawshank Redemption, The Godfather, and The Dark Knight. The movies that are most highly rated in the dataset are those that have been rated highly by a small number of users. These movies include The Silence of the Lambs, Schindler's List, and Pulp Fiction.

The data analysis also revealed some trends in movie ratings over time. For example, the average rating for movies has increased over time. This is likely due to the fact that movies are becoming more and more popular. The popularity of different genres of movies has also changed over time. For example, comedies were more popular in the 1980s, while dramas were more popular in the 1990s.

The MovieLens 1M dataset is a valuable resource for anyone who wants to learn more about movie ratings. It is a large and diverse dataset that can be used to answer a wide variety of questions about movies.

Here are some additional insights that can be gained from the MovieLens 1M dataset:

The most popular movies are typically action, comedy, or drama films.

The most highly rated movies are typically drama films.

The average rating for movies has increased over time.

The popularity of different genres of movies has changed over time.

The most popular movies among men are typically action, comedy, or thriller films.

The most popular movies among women are typically drama, comedy, or romance films.

The most highly rated movies among men are typically drama films.

The most highly rated movies among women are typically drama films.

The MovieLens 1M dataset is a valuable resource for anyone who wants to learn more about movie ratings. It is a large and diverse dataset that can be used to answer a wide variety of questions about movies.

Wow.  That's good for a basic data analysis.  I need to step up my game; otherwise companies can simply use AI to produce analyses with findings this obvious.  With prompt engineering, you can ask ChatGPT / Bard to state explicitly the assumptions it used, or include your own assumptions in your prompt so the AI takes them into account when preparing your answer.
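Here is a minimal sketch of what such a prompt could look like if you assemble it in Python before pasting it into ChatGPT or Bard.  The function name build_analysis_prompt, the section labels, and the "fewer than 50 ratings" assumption are all hypothetical, made up for illustration; only the dataset link and the 1 to 5 rating scale come from this post.

```python
def build_analysis_prompt(task, data_source, assumptions, paragraphs=4):
    """Assemble a structured prompt that spells out assumptions.

    Hypothetical helper for illustration; it only builds text and does
    not call ChatGPT or Bard.
    """
    lines = [
        f"Task: {task}",
        f"Data source: {data_source}",
        f"Output: a {paragraphs}-paragraph summary.",
    ]
    if assumptions:
        # Assumptions you want the AI to take into account.
        lines.append("Use these assumptions:")
        lines.extend(f"- {a}" for a in assumptions)
    # Ask the AI to disclose any assumptions of its own.
    lines.append(
        "List every additional assumption you make, explicitly, "
        "at the end of your answer."
    )
    return "\n".join(lines)


prompt = build_analysis_prompt(
    task="Conduct a data analysis of the dataset",
    data_source="https://grouplens.org/datasets/movielens/1m/",
    assumptions=[
        "Ratings are on a 1 to 5 scale",
        # Made-up example assumption, not from the dataset docs.
        "Ignore movies with fewer than 50 ratings",
    ],
)
print(prompt)
```

The idea is simply to keep the task, the data source, and the assumptions in separate labeled lines, so it is easy to change one of them and compare the replies.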

I feel that Bard has a slight advantage over ChatGPT when you want to feed external data to the AI for analysis, assuming you are OK with OpenAI or Google getting to review your data.  If you are not OK with someone else seeing your data, then you're stuck with using AI for coding support for the time being.  😓

You can sign up for ChatGPT here, and Bard here.
