Showing posts with label powerBI. Show all posts

Friday, June 30, 2023

Movie Data Analysis using Power BI

I wanted to compare Power BI and Python (and Tableau, if I ever get to it) on data analysis tasks, so I figured the movie data would make a good comparison while it is still relatively fresh on my mind (see the citation / acknowledgement of the dataset at the end). 😅

I'm going through the data similar to how I analyzed it with Python.  A summary of my findings first -

  • Power BI makes it easier to load / convert / review data.
  • Power BI makes it easier to create interactive graphical representations.
    • Be mindful of the interactive nature of Power BI.  It can lead to different interpretations depending on which filters are set.
    • The user can select a field and see the top rated movies or the movies with the most ratings.
Action Movie Rating Trend & Highest Rated Action Movies
    • There may be interactive graphing Python packages that I'm not familiar with.
  • If you want to see the numerical values of the 1Q, median, average, and 3Q, you will need to create some measures using DAX, drag in the necessary fields, and select the appropriate calculation method.
  • You can only merge two tables / queries at a time.  To merge more than two tables / queries, you need to create an interim table / query and then merge it with the next table / query.  Python's merge function is also limited to two tables, but the merge calls can be chained, effectively merging multiple tables in one expression.
  • Unfortunately, after loading in the DAT files and creating multiple tables, the Power BI file is quite large compared to the Python Jupyter Notebook file (93 MB Power BI file versus 440 KB Jupyter Notebook file).  For comparison, the merged data file in CSV format is about 74 MB.
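The chained pandas merge mentioned above can be sketched like this (the DataFrames are toy stand-ins; the column names follow the MovieLens 1M layout used in my Python analysis):

```python
import pandas as pd

# Toy stand-ins for the MovieLens tables (movies, ratings, users).
movies = pd.DataFrame({"Movie_ID": [1, 2], "Title": ["A", "B"]})
ratings = pd.DataFrame({"User_ID": [10, 10], "Movie_ID": [1, 2], "Rating": [5, 3]})
users = pd.DataFrame({"User_ID": [10], "Gender": ["F"]})

# pandas merge() also joins only two tables at a time, but the calls can
# be chained so three tables collapse into one expression:
merged = ratings.merge(movies, on="Movie_ID").merge(users, on="User_ID")
print(merged.shape)  # (2, 5)
```

No interim table is needed; each `.merge()` feeds its result directly into the next one.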
Detail findings - 
  • Data import is as easy as clicking a few buttons.  The encoding issue I ran into with Python, as well as the separator between data points, can be resolved by selecting from a drop-down menu!

  • Power BI has built-in tools to examine data quality and distribution, making it easier to review individual columns.  However, you need to be mindful of the steps to ensure you review each and every column, as well as combinations of columns if needed.

  • Note there are options under View - column quality / distribution / profile to show these details in the Power Query Editor.  The information is based on the first 1,000 rows by default, so it is best to change the setting to profile the entire data set (see the lower left corner below).  These do not provide table-level summaries, but they do provide column-level details.  This certainly makes the initial data review easier through graphics, except you do not get the detailed numerical values.
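For comparison, the detailed numerical values that Power BI's column profile leaves out are a single call in pandas (toy ratings column):

```python
import pandas as pd

ratings = pd.Series([1, 2, 3, 4, 5, 5, 5])
# describe() gives count, mean, std, min, the 25% / 50% / 75% quartiles,
# and max for the whole column, not just the first 1,000 rows.
print(ratings.describe())
```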


  • To verify whether a combination of columns / fields has any duplicates, you can combine the columns together in the Power Query Editor, including a separator if the columns are numeric.  It is not as simple as a line of Python code, but it is easy enough to get done.  Review the column profile / distribution to check for duplicates.  Side note - I noticed my Intel GPU usage went up while the column profiling was running.  Is it possible that Power BI uses the GPU for processing?  I will have to test it on my desktop computer with a dedicated GPU to find out...  😁
Steps to Create a New Merge Column

Column Profiling
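That "line of Python code" for checking duplicates across a column combination looks like this (toy DataFrame; column names follow the MovieLens layout):

```python
import pandas as pd

ratings = pd.DataFrame({"User_ID": [10, 10, 11], "Movie_ID": [1, 1, 1]})
# duplicated() over a subset of columns flags repeated combinations --
# no merged helper column or separator needed.
dupes = ratings.duplicated(subset=["User_ID", "Movie_ID"])
print(dupes.sum())  # 1 duplicate pair
```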
  • To split the genres into multiple rows, it is a click of a few buttons within the Query Editor: use the "Split Column" button, click "By Delimiter," and select the appropriate options (the Advanced option splits into rows instead of columns).
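The Python equivalent of "Split Column > By Delimiter > Rows" is a split followed by an explode (toy data; MovieLens genres are pipe-delimited):

```python
import pandas as pd

movies = pd.DataFrame({"Movie_ID": [1], "Genres": ["Action|Comedy|Drama"]})
# Split the pipe-delimited genre string into a list, then explode the
# list so each genre gets its own row.
movies["Genres"] = movies["Genres"].str.split("|")
movies = movies.explode("Genres")
print(len(movies))  # 3 rows
```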


  • To see the statistical details of the Ratings table, I have to go through a few steps.
1. I need to create a table to group the Ratings by Movie_ID.
2. I need to create a few measures to calculate the different percentiles, IQR, upper and lower whiskers using DAX.
3. I need to create a table and drag in the appropriate measures and fields and use the appropriate calculation (min/max/standard deviation/etc).


4. Alternatively, you can browse the Power BI add-ins by clicking the 3 dots under Visualizations for more visuals and searching for a third-party box plot visual.


5. I found the Box and Whisker visual by MAQ Software to provide a good box plot with all the details.
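As a cross-check on the DAX measures in steps 1-2, the same box-plot pieces (quartiles, IQR, whiskers) in Python look like this (toy ratings; a grouping by Movie_ID as in step 1):

```python
import pandas as pd

ratings = pd.DataFrame({"Movie_ID": [1, 1, 1, 2, 2], "Rating": [3, 4, 5, 2, 5]})
# Step 1: group the ratings by Movie_ID (here: rating count per movie).
counts = ratings.groupby("Movie_ID")["Rating"].count()

# Step 2: quartiles, IQR, and the standard 1.5*IQR whisker bounds.
q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(q1, q3, iqr, lower, upper)
```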


  • Part of the initial EDA is recognizing that there are many highly rated (rated 5) movies with only a single user rating.  On average a movie has ~269 ratings, with the median at ~123 ratings.  In my Python analysis I went with the median instead of the average to filter out movies that do not have enough user ratings.  To do the same within Power BI, I added a slider and used the rating count within the Rating_group table.
    • The slider makes it easy to change the boundary in case the user prefers the average instead of the median, or wants to see the unfiltered result.
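The median-count filter behind the slider can be reproduced in pandas (toy data; the real Ratings table has a mean of ~269 and a median of ~123 ratings per movie):

```python
import pandas as pd

ratings = pd.DataFrame({"Movie_ID": [1]*5 + [2]*2 + [3],
                        "Rating":   [5, 4, 5, 3, 4, 2, 3, 5]})
# Count ratings per movie, then keep only movies at or above the median
# count -- the same cutoff the slider defaults to.
counts = ratings.groupby("Movie_ID")["Rating"].count()
threshold = counts.median()
kept = counts[counts >= threshold].index
print(list(kept))  # movies 1 and 2 survive the filter
```

Swapping `counts.median()` for `counts.mean()` mirrors moving the slider to the average instead.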


With some drag and drop and copy and paste, it is easy to create a few interactive summaries - top 10 movies by rating and count by gender, top genres by rating and count by gender, and number of ratings and average rating over time - all with a slider to control the minimum number of ratings so that movies with few ratings do not skew the results.

Top 10 Movies
Top Genres
Genre and Rating over Time

Issues:

I found an interesting issue: when I use Python to plot the box plot, the plot is different depending on whether I have one or two variables on the x-axis.  I'm not quite sure what the problem is, but it matches neither my calculation (shown in the red box below) nor Python's estimate.


File upload:

You can find my Power BI file here.

Citation:

The dataset is downloaded from https://grouplens.org/datasets/movielens/1m/.  This is the old dataset from 2/2003.  

Acknowledge use of the dataset:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

Tuesday, July 5, 2022

Comparing two homes' electricity usage

We are looking into solar panel costs because electricity costs have been increasing, and our wonderful provider SCE has had more blackouts in the past 12 months than I would care to remember.  SCE provides our home's electricity usage data, up to 13 months, for download from their site.  I figured it's a good project to review our electricity usage and see if I can make any sense out of the data.  After all, what's the point of doing data analytics if we can't apply it to our daily life?

Data download - I got ours from the SCE site under the data sharing & download menu on the left of the screen.  They provide options to download in CSV or XML format.  I used the CSV format since I am familiar with it.  SCE website

After reviewing the data, I noticed a few things about the data -

1. The usage data has 13 rows of header information that include the address, start date, and duration.  Each date's worth of data is divided into 24 hourly sections.  The date sections are separated by an empty row for spacing, date information, and a header column.


2. Any household with Net Energy Metering would have a section with usage and a section with electricity production.  (credit: SCE.com)


3. Each date-hour usage row has the date, from / to hour, and usage in the leftmost column.

Steps to clean up the data -

Before working on the file, I made a copy of the worksheet to keep as a backup of the original data.  The header information, as well as the filler information between each date section, is not necessary for the analysis.  I deleted those rows and added a filter to remove the unnecessary filler information.  With the filter, I deleted all rows not showing dates.  For customers with NEM, a quick solution is to add a column and mark solar production with "S" for easy filtering in the future.  I will need to figure out how to import the data properly with an R script in the future.
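The filter step above (keep only rows that start with a date) can also be sketched in pandas.  The CSV text below is a simplified stand-in for the real SCE file, which has 13 header rows and filler rows between date sections:

```python
import io
import pandas as pd

# Toy stand-in for the downloaded SCE CSV: header rows, a date section,
# a blank spacer row, a repeated header column, then more data.
raw_text = """Address,123 Main St
Start date,2021-06-01
2021-06-01 00:00,1.2
2021-06-01 01:00,0.9

Date,Usage
2021-06-02 00:00,1.1
"""
raw = pd.read_csv(io.StringIO(raw_text), header=None, names=["datetime", "usage"])

# Keep only rows whose first column parses as a date/time -- the same
# effect as the Excel filter that removed header and filler rows.
parsed = pd.to_datetime(raw["datetime"], errors="coerce")
clean = raw[parsed.notna()]
print(len(clean))  # 3 usage rows remain
```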

I wasn't sure how Power BI or Tableau would treat date / time information for sorting purposes, so I created additional columns for date, hour, month, and weekday from column A.  I later learned that Power BI can use the month from the date / time data to organize / display information.  Tableau Public is able to use the date / time data directly, so I didn't need to create any additional columns for date / time / weekday.
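For the record, those helper columns can also be derived in one pass with pandas instead of spreadsheet formulas (the `datetime` column name is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({"datetime": pd.to_datetime(["2021-06-01 07:00",
                                               "2022-05-31 23:00"])})
# Derive date / hour / month / weekday columns from the timestamp
# instead of maintaining them by hand in the spreadsheet.
df["date"] = df["datetime"].dt.date
df["hour"] = df["datetime"].dt.hour
df["month"] = df["datetime"].dt.month
df["weekday"] = df["datetime"].dt.day_name()
print(df[["hour", "month", "weekday"]])
```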

Since the time period is the same for both households, I consolidated the usage data into one file and noted accordingly.  This way it is easier to plot the graphs using the same axis for both households.

Power BI Desktop visualization -

This visual compares the total usage between two households.

 

This visual compares the average usage between two households.


Tableau Public visualization -

This visual compares the electricity usage between two households, and the average usages for each household.


This visual shows the total usage hourly comparison between two households.


Conclusion based on available information -

  • Home 1 uses more electricity in total and in the summer (June - September).  Home 2 uses more electricity during the other months of the year (October - May).  Home 2 uses more power in the morning (7AM - noon).
  • Home 1's full year usage is 10,547 kWh.  Home 2's full year usage is 9,014 kWh.

Assumption made - 

  • Data downloaded from SCE is accurate and we don't have anyone stealing electricity from our homes.

Additional information I would like to have for further analysis - 

  • I wish SCE would provide two separate files for the "delivered" portion of the electricity if the household has NEM.  I think it would make loading data for analysis easier, but I will continue looking into other methods of cleaning / transforming the data.

Lessons learned from doing this exercise - 

  • "Average" measurements in Power BI can be deceiving.  The first sets  of average daily usage visuals were wrong because the average function used total number of data as the numerator in the calculation, instead of days in a month or number of "Sundays."  I created a table to do the calculation separately to ensure I understood / found the issue.

Possible additional work in the future - 

  • Create same type of visuals from Tableau Public and Power BI Desktop.
  • Analyze the data and get solar quotes to understand breakeven.
  • Investigate R or Python scripts to import the SCE data and automate part of the processing instead of adjusting it manually.
    • Updated 7/6/22 - I practiced using R to clean the downloaded CSV file.
    • R script to load SCE's NEM CSV
      • This method adds a "remark" column and denotes the usage as either "DLVR" or "RCV."  "DLVR" is "delivered," or what SCE notes as delivered to the home.  "RCV" is "received," or what SCE notes as received from the home.
      • Originally I assumed each day has 24 hours, except I forgot about daylight saving time, which adds/subtracts an hour on two days of the year.  Therefore I had to add a new column and use a different method to clean the data.

Tools I used

Data wrangling - Microsoft Excel

Visualization - Microsoft Power BI Desktop and Tableau Public

Date range - 2021 June through 2022 May (12 months)

