Friday, June 30, 2023

Movie Data Analysis using Power BI

I wanted to compare Power BI and Python (and Tableau if I ever get to it) on the data analysis tasks so I figured it would be a good comparison using the movie data (see citation / acknowledgement of the dataset credit at the end) while the data is relatively fresh on my mind. 😅

I'm going through the data similar to what I did analyzing it with Python.  A summary of my findings first -

  • Power BI makes it easier to load / convert / review data.
  • Power BI is easier to create interactive graphical representation.
    • Be mindful the interactive nature of Power BI.  It can also lead to different interpretations based on what filters are set.
    • User can select the field and see the top rated movies or movies with most ratings.
Action Movie Rating Trend & Highest Rated Action Movies
    • There maybe interactive graphical representation Python packages that I'm not familiar with.
  • If you want to see the numerical representation of the 1Q, median, average, 3Q, you will need to create some measures using DAX, and drag in the necessary fields and select the appropriate calculation method.
  • You can only merge two tables / queries at a time.  To merge more than two tables / queries, we need to create a interim table / query then merge with the next table / query.  Although it is the same with Python merge function limit to with two tables, we can nest the merge together, creating a multiple table merge in Python.
  • Unfortunately, after loading in the DAT files and creating multiple tables, the Power Bi file is quite large as compared to Python Jupyter Notebook file (93 meg Power BI file versus 440 k Jupyter Notebook file).  For comparison purpose, the merged data file in CSV format is about 74 meg.
Detail findings - 
  • Data import is as easy as clicking a few buttons.  The encoding issue with Python, as well as separator between each data point, can be easily resolved by selecting from a drop-down menu!

  • Power BI has build-in tools to examine data quality and data distribution, making reviewing of individual columns easier.  However, you need to be mindful of the steps to ensure you review each and every columns, as well as combinations of columns if needed.

  • Note there are options under View - column quality / distribution / profile to show these details in Power Query Editor.  The information is usually based on the first 1,000 rows, so it is best to change the setting to profile the entire data set (see below left lower corner).  They do not provide table level summaries but they provide column level details.  This certainly makes the initial data review easier through graphical review, except you do not get the detail numerical value.


  • To verify if a combination of columns / fields have any duplicates, you can combine the columns together in Power Query Editor, and include a separator if the columns are numeric.  It is not as simple as a line of Python code, but it is easy enough to get done.  Review the column profile / distribution to verify any duplicates.  Side note - I noticed my Intel CPU usage went up while the column profiling was going through.  Is it possible that Power BI uses GPU for processing?  I will have to test it on my desktop computer with dedicated GPU to find out...  😁
Steps to Create a New Merge Column

Column Profiling
  • To split the genres into multiple rows, it is a click of a button within Query Editor using "Split Column" button and click the "By Delimiter," and selecting appropriate options (Advance option to split into rows instead of columns).


  • To see the statistical details of the Ratings table, I have to go through a few steps.
1. I need to create a table to group the Ratings by Movie_ID.
2. I need to create a few measures to calculate the different percentiles, IQR, upper and lower whiskers using DAX.
3. I need to create a table and drag in the appropriate measures and fields and use the appropriate calculation (min/max/standard deviation/etc).


4. Alternatively, you can search the Power BI add-in by clicking the 3 dots for more visuals and search for box plot and third party visual add-in.


5. I found a Box and Whisker by MAQ software to provide a good box plot visual with all the details.


  • Part of the initial EDA is to recognize that there are many high ratings (rated 5) movies with single user ratings.  On average a movie would have ~269 ratings with the median ~ 123 ratings.  On my Python analysis I went with the median instead of the average to filter out movies which do not have enough user ratings.  To do the same within Power BI, I add in a slider and use the Rating count within the Rating_group table.
    • Slider makes it easy to change the boundary in case the user prefer to use average instead of median, or if the user wants to see the unfiltered result.


With some drag and drop and copy and paste, it is easy to create a few interactive summaries - Top 10 movies by rating and count by gender, Top genres by rating and count by gender, and # of rating and average rating over time, all with a slider to control the minimum number of ratings to prevent low number of ratings skewing the results.

Top 10 Movies
Top Genres
Genre and Rating over Time

Issues:

Found an interesting issue that when I use Python to plot the box plot, the plot is different when I have one or two variables in the x-axis.  I'm not quite sure what the problem is, but it does not match my calculation (shown in red box below) or Python's estimate.  


File upload:

You can find my Power BI file here.

Citation:

The dataset is downloaded from https://grouplens.org/datasets/movielens/1m/.  This is the old dataset from 2/2003.  

Acknowledge use of the dataset:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

No comments:

Post a Comment

Google Data Analytics capstone project

Google Data Analytics capstone project

I decided to use blog as the venue to publish my Google Data Analytics learning because I wasn't sure how I can upload the findings, the...