New York City Taxi and Limousine Commission (TLC) Trip Record Data Analysis

Motivation

You’ll know which city I am talking about if I start describing it as the most populous city in the States, one of the biggest and best places to be, the Big Apple, the city that (literally) never sleeps. You got it right! It’s NYC! I am very sure that out of the many things that pop into your mind when you think of New York, the yellow taxicab 🚖 is definitely on the list. New York is home to this iconic yellow cab, which is commonly identified as a symbol of the city. The history of these cabs, licensed under the New York City Taxi and Limousine Commission (TLC), goes way back to the 70’s [1]. These cabs can be hailed from the streets anywhere and serve as a means of commute for thousands of New Yorkers on a daily basis. In a world where technology has advanced to the extent that almost everything is just a click away, this cab system has functioned in a conventional non-tech manner, until the recent launch of an app. This spiked my interest in taking a look at the backend of this system. Being a techie, I could not resist the temptation of uncovering what lies beneath the humongous heap of data that must have accumulated over all these years: which modes of payment were largely preferred by customers, how generously they tipped their cabbies, what the vendor preferences were, [2] and why people prefer Lyft and Uber when a taxicab system was already in place. I have managed to acquire a small data set from the past few years which will help me gain deep, meaningful insights into this popular system. Since such a large and valuable amount of data is offered in a very basic format, it is essential that it be processed meticulously and represented in a way that is easy to understand and pleasing to the eye.

About the Data

During my search for a data set, I found that Amazon hosts open datasets on AWS. When data is shared on AWS, anyone can analyze it and build services on top of it using a broad range of compute and data analytics products. This helped me a lot, as it allowed me to spend more time on data analysis rather than data acquisition. The Registry of Open Data on AWS helped me find datasets that are made publicly available through AWS services. The New York City Taxi and Limousine Commission (TLC) Trip Record Data is one of those open datasets on AWS, and it is the one I chose for this project. Below are a few details about this dataset:

Description

Obtaining the Data & Preprocessing


import requests
from time import time
urls = ["https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2019-01.csv",
        "https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2019-02.csv",
        "https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2019-03.csv",
        "https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2019-04.csv",
        "https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2019-05.csv",
        "https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2019-06.csv",
        "https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2019-07.csv",
        "https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2019-08.csv",
        "https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2019-09.csv",
        "https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2019-10.csv",
        "https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2019-11.csv",
        "https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2019-12.csv", ] 
  
def download_csv(url):
    # Derive the file name from the URL so each month is saved
    # separately instead of overwriting a single file
    file_name = url.rsplit('/', 1)[-1]
    path = 'C:/Nilay/' + file_name
    print(url)
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        # Stream the response in 1 MB chunks to keep memory usage low
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)

start = time()
for x in urls:
    download_csv(x)
    print(f"Time to download: {time() - start:.1f}s")
code ref: https://likegeeks.com/downloading-files-using-python/

import os
import pandas as pd
col_to_put = ['VendorID', 'tpep_pickup_datetime',
               'tpep_dropoff_datetime', 'passenger_count', 
               'trip_distance', 'RatecodeID', 'PULocationID',
               'DOLocationID', 'payment_type', 'fare_amount',
               'tip_amount', 'tolls_amount', 'total_amount'] 
chunksize = 10 ** 5
my_path = "C:/Nilay/Data Processing in Cloud/Personal Project DataSet/2019"

for r, d, f in os.walk(my_path):
    for curr_file in f:
        print(curr_file)
        # Reset per file so every output CSV gets its header exactly once
        add_header = True
        for chunk in pd.read_csv(os.path.join(r, curr_file), chunksize=chunksize, usecols=col_to_put):
            chunk.to_csv("C:/Nilay/Data Processing in Cloud/" + curr_file, mode='a', index=False, header=add_header)
            if add_header:
                # The header should not be written more than once
                add_header = False

BigQuery

Analytics

Graphical Analysis with the help of Google Data Studio

Analysis of Data

Challenges

  1. Skewed data: The yellow taxi data for 2019 was not clean; it also contained trips dated 2018 and earlier, which skewed my analysis. Many fields were also null, which disturbed the overall analysis. I overcame this by manually creating a schema for the tables and by dropping the offending rows and columns with a Python script to reduce the impact on the analysis.
  2. Incorrect data types: In the green taxi data, vendor IDs were of type integer for 2019 but of type float for 2018 and 2017, so the data could not be merged into the same table. I had to create the 2019 table separately (2017 and 2018 were merged into one table) and obtain the aggregated data by joining these two tables.
  3. Missing field values: The data for payment modes was inconsistent; for a few payment modes I could not find any data from 2017–2019. Those modes may simply never have been used, but the discrepancies remained. I merged such data into an "Other" category to minimize the impact on my analysis.
  4. Data Studio issue with array aggregators: Google Data Studio does not seem to handle data aggregated by array aggregators, so I had to include screenshots of the query output instead of the fancier charts that would have visualized it better.
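The fixes for challenges 2 and 3 can also be applied in pandas before loading the data; here is a minimal sketch, where the two small frames and the list of common payment modes are illustrative stand-ins for the real yearly files:

```python
import pandas as pd

# Illustrative frames standing in for 2017/2018 (float VendorID) and 2019 (int VendorID)
old = pd.DataFrame({"VendorID": [1.0, 2.0], "payment_type": ["Credit card", "Voucher"]})
new = pd.DataFrame({"VendorID": [1, 2], "payment_type": ["Cash", "Unknown"]})

# Challenge 2: cast VendorID to a common nullable integer type before merging
for df in (old, new):
    df["VendorID"] = df["VendorID"].astype("Int64")
merged = pd.concat([old, new], ignore_index=True)

# Challenge 3: fold sparsely used payment modes into an "Other" bucket
common_modes = {"Credit card", "Cash"}
merged["payment_type"] = merged["payment_type"].where(
    merged["payment_type"].isin(common_modes), "Other")
print(merged["payment_type"].value_counts().to_dict())
```

With both frames on the same `Int64` dtype, the yearly tables concatenate cleanly, and the rare payment modes collapse into "Other" so they no longer fragment the aggregates.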

Future Work

Query Execution Details

Below are the execution details for retrieving the Table 1-2 and Table 3-4 data from BigQuery.

  1. Table 1 [ Green Taxis] Image
  2. Table 2 [Yellow Taxis] Image
  3. Table 3 [Green Taxis] Image
  4. Table 4 [Yellow Taxis] Image

Why Big Query ?

BigQuery provides many useful features, including several optimizations; some of them are listed below:

  1. Complex analysis can be done via SQL and user-defined functions (UDFs)
  2. BigQuery is suitable for calculating metrics/stats/aggregates over a large amount of data
  3. Faster and user-friendly web UI for querying and data loading.
  4. Multiple ways to load data, such as Google Cloud Storage, local storage, and other cloud storage.
  5. Easy to integrate with Google Data Studio to visualize the data.
  6. Streaming inserts are available for ingesting live data into tables.
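As a sketch of point 1, an aggregate like "trips and average tip per payment type" can be expressed in standard SQL and run through the `google-cloud-bigquery` Python client. The table name below is illustrative, and `run_query` assumes a configured Google Cloud project with credentials:

```python
def build_payment_summary_sql(table: str) -> str:
    """Standard SQL aggregating trip counts and average tip per payment type.

    `table` is a fully qualified BigQuery table, e.g. "my-project.tlc.yellow_2019"
    (an illustrative name, not the real table in this project).
    """
    return f"""
        SELECT payment_type,
               COUNT(*) AS trips,
               ROUND(AVG(tip_amount), 2) AS avg_tip
        FROM `{table}`
        GROUP BY payment_type
        ORDER BY trips DESC
    """

def run_query(table: str):
    # Requires `pip install google-cloud-bigquery` and application default credentials
    from google.cloud import bigquery
    client = bigquery.Client()
    return client.query(build_payment_summary_sql(table)).to_dataframe()

print(build_payment_summary_sql("my-project.tlc.yellow_2019"))
```

Keeping the SQL in a small builder function makes it easy to run the same aggregate against the separate 2019 and 2017-2018 tables mentioned in the Challenges section.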