
Sunday, April 30, 2023

Movie Rating Data Review

While finishing my reading of Python for Data Analysis (2nd Edition), I came to a data analysis exercise on the MovieLens movie review data.  I followed along with the author's examples in most cases, but tried to expand on some areas as he suggested.

I thought everything would be smooth sailing since I was "following along" with the exercise, except I was using the 2nd Edition dated from 2017.  I figured I might as well work through the exercise, do a data analysis of the dataset, and write up some interesting issues I ran into.

Goal: practice data analysis with a movie review dataset from 2000 and learn from this experience.

The first task is to download the data and review the README file to understand the data structures and their relationships.  👍


Next, let's take a look at the data in Notepad and compare it with the README.  It looks simple enough, with double colons (::) as the column separators.  👍


Let's import and review the data.  The author used the read_table function, so I'll use the same code and see what happens...
  
Warning message


Error message

Interesting.  Why am I getting a warning and an error?
    - ParserWarning - looks like we can add engine='python' to eliminate the warning.

    - Codec issue - something about the text encoding of the movies.dat file not being standard UTF-8.  Googling around, I found at least a couple of solutions:

  1. Using Notepad, I can save the file as UTF-8.
  2. Use chardet, the Python character encoding detector package, to read part of the file and determine its encoding; then pass that encoding when reading the file (a sketch of this approach follows below).
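
Here is a minimal sketch of the second approach, assuming pandas and chardet are installed, the files sit in a local ml-1m folder, and the column names follow the README (adjust paths and names to match your setup):

import chardet
import pandas as pd

# Detect the encoding from a sample of the raw bytes
with open('ml-1m/movies.dat', 'rb') as f:
    detected = chardet.detect(f.read(100000))
print(detected)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': ...}

# engine='python' handles the multi-character '::' separator (and silences the ParserWarning);
# the detected encoding takes care of the codec error
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None,
                       names=['Movie_ID', 'Movie_Name', 'Genre'],
                       engine='python', encoding=detected['encoding'])
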
After making the changes, the datasets load successfully.  Let's review the data and see if there is anything we should be aware of.

Users dataset - there are no null values within the 6040 rows and no duplicated User_ID.  I'm not concerned about duplicates in the other columns because they serve as grouping / category variables.  There are roughly 2.5 times more male users than female users.

Users dataset preview / examination

Movies dataset - Since Movie_ID is the only column with numerical values, the describe() call does not reveal much useful information.  The movie release year is included in the Movie_Name column inside parentheses.  The Genre column can hold multiple genres separated by the pipe (|) symbol.  There are no duplicates or null values in the Movie_ID or Movie_Name fields.
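
A quick sketch of how the release year and the individual genres could be pulled out of those columns (assuming the movies DataFrame loaded above, with the year always in trailing parentheses):

# Extract the 4-digit year from the trailing "(YYYY)" in Movie_Name
movies['Year'] = movies['Movie_Name'].str.extract(r'\((\d{4})\)\s*$', expand=False).astype(float)

# Split the pipe-separated Genre field into a list of individual genres
movies['Genre_list'] = movies['Genre'].str.split('|')

# Sanity checks: duplicates and nulls
print(movies['Movie_ID'].duplicated().sum(), movies['Movie_Name'].duplicated().sum())
print(movies.isnull().sum())
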

Movies dataset preview / examination

Ratings dataset - It looks like there are no null values in the data.  A check for duplicate User_ID / Movie_ID pairs shows there are none.  The average rating is 3.58 but the median rating is 4, which implies a long left tail - low ratings (1s and 2s) pull the average down below the median.
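
These checks boil down to a few one-liners (a sketch, assuming the ratings DataFrame was loaded with User_ID, Movie_ID, Rating and Timestamp columns):

print(ratings.isnull().sum())                                    # null check per column
print(ratings.duplicated(subset=['User_ID', 'Movie_ID']).sum())  # duplicate user/movie pairs
print(ratings['Rating'].mean(), ratings['Rating'].median())      # 3.58 vs. 4.0
ratings['Rating'].value_counts().sort_index().plot(kind='bar')   # rating distribution
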
Ratings dataset review


Rating distribution

Before doing any further analysis, it will be easier to combine the datasets.  Users and ratings share the common column User_ID to merge on, and ratings and movies share Movie_ID.  We can combine all three by nesting two merge calls - pd.merge(pd.merge(ratings, users), movies) - because pandas identifies the common keys to merge on.
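
For clarity, the same merge can be written with the keys spelled out explicitly (a sketch using the column names assigned at load time):

# pandas infers the shared keys automatically, but naming them makes the intent explicit
data = pd.merge(
    pd.merge(ratings, users, on='User_ID'),
    movies, on='Movie_ID'
)
print(data.shape)   # one row per rating, with user and movie attributes attached
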

 
Dataset quick comparison of keys

Combined dataset

A few observations from the data once they are combined -
  • There are more male ratings than female ratings.
Ratings count by gender
  • The age-25 group contributes more ratings than any other age group.
Ratings by age group
  • "Comedy" genre movies are the most popular, with the most number of ratings (333,823 ratings).
  • "Documentary" genre movies are the least popular, with the least number of ratings (5,705 ratings).
Ratings average and count by genre
  • There are many movies with only 1 rating, and there are also movies with more than 2,000 ratings.  The mean number of ratings per movie (269) is greater than the median (123), implying that a relatively small number of heavily rated movies skews the mean to the right of the median.
Rating count distribution plot

The author focuses on comparing ratings between male and female users, and suggests that readers look into movie genres.  Because there are many movies with only a single rating, the author filters the dataset to movies with at least 250 ratings.  Based on how many ratings each movie actually has, I use the median (123 ratings) as the cutoff instead of 250.
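
The gender comparison comes down to a pivot table of mean ratings plus a filter on rating counts - roughly the approach in the book, sketched here with my median cutoff (Gender holds 'F' / 'M' from users.dat; other column names are the ones I assigned at load time):

# Mean rating per movie, split by gender
mean_ratings = data.pivot_table('Rating', index='Movie_Name', columns='Gender', aggfunc='mean')

# Keep only movies with at least the median number of ratings (123)
ratings_by_title = data.groupby('Movie_Name').size()
active_titles = ratings_by_title.index[ratings_by_title >= ratings_by_title.median()]
mean_ratings = mean_ratings.loc[active_titles]

# Top 10 by female and by male average rating
print(mean_ratings.sort_values('F', ascending=False).head(10))
print(mean_ratings.sort_values('M', ascending=False).head(10))
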

The top 10 rated movies (overall), the top 10 male-rated movies, and the top 10 female-rated movies are all different!  Is that a surprise? 😵

Top 10 rated movies M+F

Top 10 male rated movies

Top 10 female rated movies

As for the genres, how many different genres are there?  It turns out there are 18 individual genres; a quick way to count them is sketched below the list.

Action, Animation, Comedy, Children's, Sci-Fi, Documentary, War, Film-Noir, Horror, Western, Fantasy, Mystery, Musical, Crime, Adventure, Thriller, Romance, Drama.
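
Since each movie can carry several pipe-separated genres, one way to count the distinct genres is to split the Genre column and flatten the result (a sketch):

# Split each Genre string on '|' and collect the unique values
all_genres = set()
for g in movies['Genre'].dropna():
    all_genres.update(g.split('|'))
print(len(all_genres))      # 18
print(sorted(all_genres))
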


What can we do with the genres?  Personally, I would like to see the top 10 rated movies in a given genre.  We know that Comedy has the most ratings and Film-Noir has the highest average rating, so let's look at the top 10 movies for those two genres.
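
One way to rank movies within a single genre is to explode the genre list so each (rating, genre) combination gets its own row, then filter and aggregate (a sketch; explode() requires pandas 0.25 or later):

# One row per (rating, genre) combination
exploded = data.assign(Genre=data['Genre'].str.split('|')).explode('Genre')

def top_rated(genre, min_ratings=123):
    subset = exploded[exploded['Genre'] == genre]
    stats = subset.groupby('Movie_Name')['Rating'].agg(['mean', 'count'])
    return stats[stats['count'] >= min_ratings].sort_values('mean', ascending=False).head(10)

print(top_rated('Comedy'))
print(top_rated('Film-Noir'))
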

Top 10 rated comedy movies

Top 10 rated film-noir movies

Conclusions:
There is a lot more I could explore with this dataset.  For example, I could look at how ratings evolve over time, how genre preferences evolve over time, group users into different categories for movie research, or use the data to train a movie recommender based on a user's ratings of a few movies.  Using the newer dataset, I could also review movie preference trends over time.  I could also repeat the analysis in Excel or Power BI to compare my own skills across different tools.  This dataset is full of opportunities to practice data analysis skills.

Lessons Learned:
  • Keep the dataframe as simple as possible.
  • There are different plotting options within Python - calling the plot method on a DataFrame seems more efficient than calling matplotlib directly.  I ran into an issue where it took over 20 minutes to plot a histogram, only to find that Python was trying to draw an individual x-axis tick label for every value, and the labels overwrote each other in the limited space.
  • I need to review and practice multi-index DataFrames more.  Or maybe just simplify - do I really need a multi-index?
  • Don't underestimate the complexity of a dataset even when it looks simple.  This dataset took me two weeks to sort through the issues I ran into, and the write-up took much more time than the analysis itself.
  • Practice, practice, and more practice.  😞

Python Jupyter notebook file: here

Citation:

The dataset is downloaded from https://grouplens.org/datasets/movielens/1m/.  This is the old dataset from 2/2003.  

Acknowledgment for use of the dataset:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

Example code from book author -

https://github.com/wesm/pydata-book/tree/2nd-edition


Wednesday, September 21, 2022

Datacamp Case Study - Databel Churn investigation

While doing my Tableau lessons on Datacamp, one of the courses was a case study analyzing customer churn in Tableau (Case Study: Analyzing Customer Churn in Tableau).  To practice my Python skills, I decided to use the dataset from the course, analyze it with Python, and show my analysis here.

Databel is a fictitious cell phone company.  The company executives want to understand the customer churn situation and get recommendations to reduce churn.

Top findings -

  • The current churn rate is 26.86% across 6,687 customers (a quick pandas sketch of the churn calculation follows this list).
  • Top churn reasons from the provided dataset - 
    1. Competitor made better offer
    2. Competitor had better devices
    3. Attitude of support person
    4. Don't know
    5. Competitor offered more data
  • The majority of customers do not belong to a group plan.  Customers without a group contract have a higher-than-average churn rate (32%).
  • Customers are price sensitive to the international plan charge and international call usage.  
    • The churn rate is higher for customers who make many international calls but are not on the international calling plan, as well as for customers on the international calling plan who are not actively making international calls.
  • More than half of the customers are on a month-to-month contract.
    • Customers on the month-to-month contract type account for the majority of churn (87%).
  • Customers age 65 and above have a higher churn rate than other age groups.  However, seniors account for less than 20% of the total customer base.

  • Customers on the unlimited data plan have a higher churn rate than customers who are not on the unlimited data plan.
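
As an illustration of the churn calculation, here is a rough pandas sketch.  The column name 'Churn Label', its 'Yes' / 'No' values, and the 'Group' column are my assumptions about the Databel CSV, so they would need to be adjusted to match the actual metadata:

import pandas as pd

df = pd.read_csv('Databel.csv')

# Overall churn rate: share of customers flagged as churned
churned = df['Churn Label'].eq('Yes')          # assumed column name and value
print(f'{churned.mean():.2%} of {len(df)} customers churned')

# Churn rate broken down by a categorical column, e.g. group membership (assumed column name)
print(df.groupby('Group')['Churn Label'].apply(lambda s: s.eq('Yes').mean()))
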

Recommendation -

  • Review the current customer base who are not part of a group or on a contract.  Develop marketing and sales promotions to encourage these customers to join a group or sign a contract.
  • Review the current customer base who are on the international calling plan, along with their international call usage.  Develop a new pricing plan that turns the international calling plan on / off automatically based on monthly usage.
  • Review the current customer relationship management system to ensure an automatic number identification system is implemented and a call history tracking system is in place.  When a customer calls the support or customer service department, all previous interactions with the customer should be listed for easy access, so customers do not need to call back and repeat unresolved issues.
  • Conduct a full review of customer support and customer service procedures for quality assurance.  Provide additional training as needed to ensure staff are equipped with what they need to support customers.
  • The churn rate for customers on the unlimited data plan is higher than for customers who are not on the unlimited data plan.  The dataset does not include current device offerings or peak data download speeds.  This will require further research to ensure the company offers the latest devices and comparable data download speeds.

Assumptions made based on the project document, background and provided data -

  • The churn rate calculated is a yearly figure.

Lessons learned from doing this exercise -

  • Be mindful of grouping details - whether the upper and lower limits are inclusive when doing the calculation.  Test, verify, and adjust the group titles accordingly.
  • Data is not always complete.  Review the data before conducting any in-depth analysis.
  • Blogger is not a good site for posting Jupyter notebook files.  The formatting takes some time to fix.  I will have to look into posting my files on GitHub in the future.

Possible additional work in the future

  • Map churn rate by state and display on a map, similar to what can be done with Tableau.
  • Research further on cell phone industry churn rate and compare the statistics. 

Information about the data -

Databel meta data description - Databel metadata description from Datacamp

Databel data from Datacamp - Databel.csv data from Datacamp 

Tableau Public analysis based on Datacamp's course can be seen here - Databel Tableau Analysis

Jupyter notebook for the Python file and analysis - Databel Jupyter notebook

Tools I used -

Anaconda3 and Python 3.7


Tuesday, July 5, 2022

Comparing two homes' electricity usage

We are looking into solar panel costs because electricity costs have been increasing, and our wonderful provider SCE has had more blackouts in the past 12 months than I care to remember.  SCE provides our home's electricity usage data, up to 13 months' worth, for download from their site.  I figured it would be a good project to review our electricity usage and see if I can make any sense out of the data.  After all, what's the point of doing data analytics if we can't apply it to our daily life?

Data download - I got ours from the SCE site under the Data Sharing & Download menu on the left of the screen.  They provide options to download in CSV or XML format.  I used the CSV format since I am familiar with it.  SCE website

After reviewing the data, I noticed a few things about the data -

1. The usage data has 13 rows of header information that include the address, start date, and duration.  Each day's worth of data is divided into 24 hourly rows.  Date sections are separated by an empty row for spacing, a date row, and a column header row.


2. Any household with Net Energy Metering would have a section with usage and a section with electricity production.  (credit: SCE.com)


3. Each hourly record has the date, the from / to hour, and the usage, starting in the left-most column.

Steps to clean up the data -

Before working on the file, I made a copy of the worksheet to keep as a backup of the original data.  The header information, as well as the filler information between each date section, is not needed for the analysis.  I deleted those rows and added a filter to remove the remaining filler rows; with the filter applied, I deleted all rows not showing dates.  For customers with NEM, a quick solution is to add a column and mark solar production rows with "S" for easy filtering later.  I will need to figure out how to import the data properly with an R script in the future.
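
As an alternative to the manual Excel steps, the same cleanup could in principle be scripted.  This is only a pandas sketch built on my assumptions about the layout described above (filler and header rows mixed in, with date, hours, and usage in the first three fields of each data row); the field positions and the file name are placeholders to verify against the actual export:

import pandas as pd

rows = []
with open('sce_usage.csv') as f:                 # placeholder file name
    for line in f:
        parts = line.strip().split(',')
        # Keep only data rows, i.e. rows whose first field parses as a date
        date = pd.to_datetime(parts[0], errors='coerce') if parts and parts[0] else pd.NaT
        if pd.notna(date):
            rows.append({'Date': date,
                         'Hours': parts[1] if len(parts) > 1 else None,
                         'Usage': pd.to_numeric(parts[2], errors='coerce') if len(parts) > 2 else None})

usage = pd.DataFrame(rows)
print(usage.head())
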

I wasn't sure how Power BI or Tableau would treat date / time information for sorting purposes, so I created additional columns for date, hour, month, and weekday from column A.  I later learned that Power BI can use the month from the date / time data to organize and display information.  Tableau Public is able to use the date / time data directly, so I didn't need to create any additional columns for date / time / weekday.

Since the time period is the same for both households, I consolidated the usage data into one file and labeled each household accordingly.  This way it is easier to plot the graphs using the same axes for both households.

Power BI Desktop visualization -

This visual compares the total usage between two households.

 

This visual compares the average usage between two households.


Tableau Public visualization -

This visual compares the electricity usage between two households, and the average usages for each household.


This visual shows the total usage hourly comparison between two households.


Conclusion based on available information -

  • Home 1 uses more electricity in total and in the summer (June - September).  Home 2 uses more electricity during other months in the year (October - May).  Home 2 uses more power in the morning (7AM - noon).
  • Home 1's full-year usage is 10,547 kWh.  Home 2's full-year usage is 9,014 kWh.

Assumption made - 

  • Data downloaded from SCE is accurate and we don't have anyone stealing electricity from our homes.

Additional information I would like to have for further analysis - 

  • I wish SCE would provide two separate files - one for the "delivered" portion and one for the "received" portion of the electricity - if the household has NEM.  I think that would make it easier to load the data for analysis, but I will continue looking into other methods of cleaning / transforming the data.

Lessons learned from doing this exercise - 

  • "Average" measurements in Power BI can be deceiving.  The first sets  of average daily usage visuals were wrong because the average function used total number of data as the numerator in the calculation, instead of days in a month or number of "Sundays."  I created a table to do the calculation separately to ensure I understood / found the issue.

Possible additional work in the future - 

  • Create same type of visuals from Tableau Public and Power BI Desktop.
  • Analyze the data and get solar quotes to understand breakeven.
  • Investigate R or Python scripts to import the SCE data and process portions of it automatically instead of adjusting it manually.
    • Updated 7/6/22 - I practiced using R to clean the downloaded CSV file.  
    • R script to load SCE's NEM CSV
      • This method adds a "remark" column and marks each row's usage as either "DLVR" or "RCV."  "DLVR" is "delivered," or what SCE notes as delivered to the home.  "RCV" is "received," or what SCE notes as received from the home.
      • Originally I assumed each day has 24 hours, but I forgot about daylight saving time, which adds or subtracts an hour on two days of the year.  Therefore I had to add a new column and use a different method to clean the data.

Tools I used

Data wrangling - Microsoft Excel

Visualization - Microsoft Power BI Desktop and Tableau Public

Date range - 2021 June through 2022 May (12 months)


Friday, June 24, 2022

Google Data Analytics capstone project

I decided to use this blog as the venue to publish my Google Data Analytics learning because I wasn't sure how else I could upload the findings, the code, the visualizations, and the slides all to the same location.

The project I chose was the Chicago bike share data, and the question to answer was "How do annual members and casual riders use Cyclistic bikes differently?"

My findings from the analysis -

1. Annual members use the bikes more than casual riders in terms of total rides, but casual riders ride longer in terms of total ride time and average ride time.

2. Annual members take fewer rides than casual riders during weekends and the summer season, but more during weekdays and the other seasons.




3. Annual members' average trip time does not vary much across days of the week (2.6 minutes) or months of the year (3.7 minutes).



4. Both annual members and casual riders use classic bikes the most, in terms of total rides and total ride time.


Conclusion based on available information -
  • Casual riders' average ride time is about twice as long as annual members' for the classic bike category (28.9 minutes vs. 13.9 minutes), and about 50% longer for electric bikes (19.2 minutes vs. 12.2 minutes).
  • Additional analysis of casual riders is needed if we can obtain customers' ZIP codes to divide the customer base into locals versus visitors.  If casual riders are mostly from outside Cyclistic's service area, it may not be effective to run an ad campaign to try to convert them to annual members.
Assumptions made based on the project document, background and provided data -
  • Annual members are more profitable than casual riders.
  • There should not be any negative trip duration.
Additional information I would like to have for further analysis -
  • Request clarification on "docked_bike" category within the rideable_type variable.  My guess is that "docked_bike" is the casual rider's Day Pass purchase.
  • Request enough identifiable information to determine whether casual riders live in the service area or have purchased multiple single passes, without violating personal privacy.
  • Request enough identifiable information about annual members to determine personal daily usage patterns, without violating personal privacy.
Lessons learned from doing this exercise -
  • Document the process.  Comment the code while exploring the data so I remember what I'm trying to accomplish at each step.
  • For someone like me who is used to Excel, it took more steps to manipulate data and build visualizations in R.  However, R gives the user more control to do exactly what they want with the visualization.
  • Be careful about silly mistakes like typos and referring to the wrong data frame.  😞
  • Practice, practice, practice!
Possible additional work in the future - 
  • The summary calculation done through R and Power BI came out different.  This will require further analysis.
  • Create the visualization using Tableau Public and share the visuals.
    • Updated 6/30/22 - Unfortunately the dataset might be too big for Tableau Public to process consistently.  The program kept giving me error messages, but I was able to create one visual - Tableau Public link
  • Conduct the exercise with Python.
  • Include a distance analysis using the available start and end station longitude and latitude information, and see if there is any difference in terms of trip distance (vs. average trip time).  A sketch of the distance calculation follows this list.
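
For the distance idea, a haversine helper like the one below could approximate straight-line trip distance from the station coordinates (a Python sketch, since redoing the exercise in Python is already on the list; note this measures as-the-crow-flies distance, not the actual route):

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Example: rough distance between two downtown Chicago points
print(haversine_km(41.8781, -87.6298, 41.8917, -87.6086))
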
Tools I used -
Data wrangling - RStudio Desktop
Visualization - Microsoft Power BI Desktop

Data from Divvy - https://divvy-tripdata.s3.amazonaws.com/index.html
Data license agreement - https://ride.divvybikes.com/data-license-agreement
Data range - 2021 June through 2022 May (12 months)
