Friday, June 24, 2022

Google Analytics Cyclistic bike-share R script

Here is the R script I used to load, review, merge, add additional variables, filter and export to a new csv file.

#here are the R script I used to import the 12 months worth of CSV files

#review their structures to ensure all files have the same structures to merge together

#note - need to review each file's variables individually, which is not done in this script

install.packages("tidyverse")

install.packages("readr")

install.packages("janitor")

install.packages("lubridate")

library(tidyverse)

library(readr)

library(janitor)

library(lubridate)


df1 <- read_csv("202106-divvy-tripdata.csv")

df2 <- read_csv("202107-divvy-tripdata.csv")

df3 <- read_csv("202108-divvy-tripdata.csv")

df4 <- read_csv("202109-divvy-tripdata.csv")

df5 <- read_csv("202110-divvy-tripdata.csv")

df6 <- read_csv("202111-divvy-tripdata.csv")

df7 <- read_csv("202112-divvy-tripdata.csv")

df8 <- read_csv("202201-divvy-tripdata.csv")

df9 <- read_csv("202202-divvy-tripdata.csv")

df10 <- read_csv("202203-divvy-tripdata.csv")

df11 <- read_csv("202204-divvy-tripdata.csv")

df12 <- read_csv("202205-divvy-tripdata.csv")

df <- rbind(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12)


#I ran below scripts for df1 to df12 to ensure all variables match before combining them. 

#there should be an easier / more automatic way to compare them instead of manual comparison.


str(df)

glimpse(df)

colnames(df)

summary(df)

table(df$rideable_type)


#convert variable started_at and ended_at to POSIXlt type

#calculate weekdays to see daily pattern

#calculate month of year to see monthly pattern

#calculate duration (ended_at - started_at) and convert the result to numeric

df$started_at <- as.POSIXlt(df$started_at)

df$ended_at <- as.POSIXlt(df$ended_at)

df$start_weekday <- weekdays(df$started_at)

df$start_month <- format(as.Date(df$started_at), "%m")

df$duration_min <- as.numeric((df$ended_at - df$started_at)/60)

summary(df)


#filter out any data that has duration less than or equal to 0

df_temp <- filter(df, duration_min > 0)


#select a smaller size of data

#after review, decided to keep only ride_id, rideable_type, started_at, ended_at, member_casual,

# start_weekday, duration_min and start_month

df1 <- select(df_temp, ride_id, rideable_type, started_at, ended_at, member_casual, start_weekday, 

              duration_min, start_month)


#save the file into file name "202106-202205-updatedbikedata.csv"

write.csv(df1,"202106-202205-updatedbikedata.csv",row.names=FALSE)

Google Data Analytics capstone project

I decided to use blog as the venue to publish my Google Data Analytics learning because I wasn't sure how I can upload the findings, the code, the visualization and the slides all to the same site / location.

The project I chose was the Chicago bike share data and the question to answer was "How do annual members and casual riders use Cyclistic bikes differently."

My findings from the analysis -

1. Annual members use the bikes more than casual riders in terms of total rides, but casual riders ride longer in terms of total ride time and average ride time.

2. Annual members use the bikes less, in terms of number of rides, than casual riders during weekends and summer season, but more during weekday and other seasons.




3. Annual members average trip time does not vary much from day of the week (2.6 minutes) to month of the year (3.7 minutes).  



4. Both annual members and casual riders use more classic bikes in terms of total rides and total ride time.


Conclusion based on available information -
  • Casual riders average ride time is about twice as long as annual members for the classic bike category (28.9 minutes vs. 13.9 minutes), and about 50% longer on average ride for electric bike (19.2 minutes vs. 12.2 minutes).
  • Need additional analysis on casual riders if we are able to obtain customers' ZIP code to further divide customer base into local versus visitor for better identification.  If casual riders are mostly from outside of Cyclistic bike's service area, it may not be effective to run ad campaign to try to convert them to annual members.
Assumption made based on the project document, background and provided data - 
  • Annual members are more profitable than casual riders.
  • There should not be any negative trip duration.
Additional information I would like to have for further analysis -
  • Request clarification on "docked_bike" category within the rideable_type variable.  My guess is that "docked_bike" is the casual rider's Day Pass purchase.
  • Request enough identifiable information to determine if casual riders live in the service area or if they purchased multiple single passes, without violating personal privacy concern.
  • Request enough annual member's identifiable information to determine personal daily usage pattern without violating personal privacy concern.
Lessons learned from doing this exercise -
  • Document the process.  Comment codes while exploring the data so I remember what I'm trying to accomplish at each step.
  • For someone like me who is used to Excel, it took more steps to manipulate data and visualization using R.  However, R gives user more control to do exactly what one wants to do with the visualization.
  • Be careful on stupid mistakes like typo's and referring to the wrong data frame.  These are silly mistakes.  😞
  • Practice, practice, practice!
Possible additional work in the future - 
  • The summary calculation done through R and Power BI came out different.  This will require further analysis.
  • Create the visualization using Tableau Public and share the visuals.
    • Updated 6/30/22 - Unfortunately the dataset might be to big for Tableau Public to process on a consistent basis.  The program kept giving me error messages but I was able to create one visual - Tableau Public link 
  • Conduct the exercise with Python.
  • Include distance analysis with available start and end station longitude and latitude information and see if there is any difference in terms of learning (vs. average time)
Tools I used -
Data wrangling - RStudio Desktop
Visualization - Microsoft Power BI Desktop

Data from Divvy - https://divvy-tripdata.s3.amazonaws.com/index.html
Data license agreement - https://ride.divvybikes.com/data-license-agreement
Data range - 2021 June through 2022 May (12 months)

Google Data Analytics capstone project

Google Data Analytics capstone project

I decided to use blog as the venue to publish my Google Data Analytics learning because I wasn't sure how I can upload the findings, the...