Sunday, December 6, 2015

Prediction model for Taxi and Limousine Commission

Executive Summary:
In recent years, several critical developments have made analysis of the NYC Taxi and Limousine Commission’s (TLC) data an opportunity for both incumbent and developing businesses.  First, the proliferation of alternative livery services such as Uber and Lyft has introduced competition to a space that has traditionally been monopolized by state-regulated “yellow-cabs.”  Uber and Lyft cars are “hailed” via smartphone and provide up-front fare estimation for customers.  Additionally, Uber and Lyft vehicles are not subject to the same strict provisions that state-regulated “yellow-cabs” must endure.  In fact, in the short time since Uber’s inception, it has had the effect of steadily reducing the value of regulated taxi-medallions in many major cities throughout the United States.[1]  This newfound competition has created an opportunity to apply data-mining and statistical analysis techniques to publicly available data in order to create or defend a competitive advantage in a market long devoid of any real competition.

The analysis of data relating to taxi-fares has implications for both the TLC and services such as Lyft and Uber.  For example, smartphone penetration has reached an incredible 58% of the entire United States adult population.[2]  Smartphones are generally required to “hail” an alternative livery vehicle (such as Uber and Lyft) and thus present a good avenue for the NYC Taxi and Limousine Commission to mitigate the loss of customers to alternative services.  By creating “apps” that can mimic some of the convenience associated with the Uber/Lyft “experience,” such as providing up-front fare estimation, the TLC may be able to retain some revenue that it would otherwise lose to competing services.  Additionally, Uber and Lyft could use the same data to benchmark their own fares and ensure that they maintain parity with their competition with respect to rates.  Finally, analysis of the data may provide drivers of regulated “yellow cabs” with insights into how they can maximize their fares, or even minimize wasted time.

We employed a “SEMMA” framework for this study which consisted of sampling the available data, exploring it for patterns, modifying variables where necessary, modelling our hypotheses and then assessing the results.

Description of Topic:
New York City is one of several major cities in the United States where licensed taxi-cabs actively “cruise” the city looking for fares.  This is in contrast to how taxicabs operate in most American cities.  Typically, a cab dispatcher is called by a passenger, the dispatcher records information (pickup location, drop-off location and number of passengers) and a taxi is sent to pick up the passenger.  This allows for the dispatcher / driver to know, with reasonable certainty, what a given fare will be before they actually pick up a given passenger.

In New York, and particularly, in Manhattan, this is not the case.  Drivers do not have a way to predict what their fares will be ahead of time since they are simply picking up passengers off of the street.  Moreover, it is prohibited by law for drivers to refuse to undertake a given trip, provided that it terminates within the city’s five boroughs.  This makes it difficult for a driver to estimate their revenue for a given day or trip.  It also makes it difficult for a driver to know where they will end up after picking up a fare in a given location.  By examining detailed data collected from several hundred-thousand taxi trips in NYC, we seek to apply Business Intelligence techniques in order to aid drivers and passengers alike with respect to both predicting the fare that will result from any given “pickup” in addition to optimizing “cruising” locations in order to maximize the probability of picking up “high-revenue” fares.  Additionally, given the introduction of competition from Uber and Lyft, who do not operate under the same regulations as “yellow-cabs,”  we seek to provide an estimation of taxable revenue lost due to the underreporting of tips received by cab-drivers for fares paid in cash.  Since Uber and Lyft operate according to a “credit-card only” model, their fares are not subject to the same type of “tip deflation.”

Previous Studies of Similar Data:
Similar data has been used in the past to identify events of interest such as shift-changes (which result in a shortage of cabs “cruising,” since drivers are not allowed to pick up passengers while returning to the livery).  In fact, the NYC Taxi and Limousine Commission has been using the data gathered from their cabs to generate infographics for release via Twitter in what they have dubbed “metric Mondays.”  It was the infographic shown directly below that eventually inspired the FOIA request that has made our dataset available to the public.


 [3]

The above infographic inspired programmer, blogger and self-described “data-junkie,” Chris Whong to request the NYC TLC dataset that we are using for our project via the Freedom of Information Act, since the TLC is a public commission.  The image below was created by Chris Whong and is a screenshot from an app which Chris created that shows exactly how a given driver in the dataset earned each of their fares for a given day via an interactive visualization. 




However, the information depicted in Chris Whong’s app is not very helpful for an individual driver since it simply breaks down a given day in another driver’s life without any real predictive capabilities.  Our goal is to create less interactive but more useful models that would enable a driver or passenger to predict, ahead of time, the fare for a given trip.  Although taxi-fares are deterministically calculated, a major component of that calculation is the actual time that a trip takes, measured in seconds.  Since traffic conditions are highly variable depending on time of day and day of week, our goal is to segment the data in such a way that we can use historical observations as a proxy for traffic-conditions between two points at a given time of day and day of week. 

Additionally, we seek to calculate the effect of underreporting tips for taxi-cab fares paid in cash as an alternative method for mitigating lost revenue for the TLC as a result of increased competition.  Finally, by profiling NYC’s various zip-codes according to day of week, time of day and the types of trips that originate during those times, we hope to generate models that can aid in maximizing a given driver’s revenue by reducing the number of fares that result in trips that terminate outside of Manhattan.  Because taxis licensed by the TLC are governed by strict regulations, cabs licensed to pick up “hailed fares” in Manhattan must return to Manhattan without a fare if their trip terminates in any of the other four NYC boroughs.  Thus, if we could accurately forecast the probability that a given zip-code at a given time of day and day of week will generate a trip that terminates outside of Manhattan, taxi-drivers could avoid those locations.

Description of Data:
Our group has chosen to use data obtained from the New York City Taxi and Limousine Commission.  The Commission requires that all legal drivers report certain information pertaining to both the trip itself and the fares charged for all legal taxi-rides.  The data also includes the zip-code from which a passenger was picked up and the zip-code where the passenger was dropped off.  GPS coordinates for pickup and drop-off locations are also included and would be preferable to zip-codes, since more specific information can be inferred from more granular location data.  However, given the scope of this project, we will focus only on zip-codes and collections of zip-codes that correspond to major neighborhoods with respect to the location of pickups and drop-offs.

Because the initial dataset encompassed every single legal taxi-ride tendered for the entire year of 2013 we have chosen to only look at the month of January due to limitations in processing power.  From the month of January, we took a random sample of 1,000,000 rows from the roughly 13,000,000 rows available.   We then further reduced the sample to exclude any observations with missing or incomplete data in the pickup and drop-off zip-code fields since zip-codes will be a major portion of our analysis.  This left us with roughly 865,000 rows of data for the month of January.  Next, in order to further reduce the data, we decided to only look at rides that originate within Manhattan.  This is due to the fact that there are myriad rules for where taxis licensed for specific boroughs can and cannot pick up fares.  Thus, we chose to restrict our study to the busiest borough in NYC for taxi-cab traffic, Manhattan.  This self-imposed constraint leaves us with 759,159 observations with which to construct our training, validation and testing sets.
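
The reduction steps described above could be sketched roughly as follows. This is only an illustration in plain Python, not the workflow we actually ran; the dictionary keys (`pickup_zipcode`, `dropoff_zipcode`, `pickup_borough`) and the helper `reduce_trips` are hypothetical names chosen for the example.

```python
import random

def reduce_trips(rows, sample_size, seed=0):
    """Sample rows, drop incomplete zip-codes, keep Manhattan pickups.

    `rows` is a list of dicts; the keys used here are assumed names,
    not the actual column names of the TLC export.
    """
    rng = random.Random(seed)
    # Step 1: simple random sample without replacement.
    sample = rng.sample(rows, min(sample_size, len(rows)))
    # Step 2: exclude observations with a missing pickup or drop-off zip-code.
    complete = [r for r in sample
                if r.get("pickup_zipcode") and r.get("dropoff_zipcode")]
    # Step 3: keep only rides that originate within Manhattan.
    return [r for r in complete if r.get("pickup_borough") == "Manhattan"]

# Tiny made-up example (not real TLC data).
trips = [
    {"pickup_zipcode": "10001", "dropoff_zipcode": "10003", "pickup_borough": "Manhattan"},
    {"pickup_zipcode": "", "dropoff_zipcode": "10003", "pickup_borough": "Manhattan"},
    {"pickup_zipcode": "11201", "dropoff_zipcode": "10003", "pickup_borough": "Brooklyn"},
    {"pickup_zipcode": "10016", "dropoff_zipcode": None, "pickup_borough": "Manhattan"},
]
kept = reduce_trips(trips, sample_size=4)
```

Only the first made-up trip survives all three filters, mirroring how each step in the text shrinks the working dataset.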





The following variables are included in our dataset:

·         Medallion: the license for a specific vehicle, allowing for it to be used as a “yellow taxi” in NYC.
·         Hack License: the license given to an individual driver granting them the right to operate a “medallion” taxi in NYC.
·         Pickup Date time: the date and time that a passenger was picked up.
·         Vendor ID: the vendor who supplied the fare-meter to the particular taxi-cab.
·         Payment Type: cash or credit.
·         Fare Amount: the amount charged for the time / distance driven, independent of any taxes or surcharges.
·         Surcharge: there is a mandatory surcharge applied to trips that occur during peak hours (between 4 and 6pm on weekdays and after 8pm on weekends).
·         MTA Tax:  there is a flat $0.50 tax on every fare payable to New York State.
·         Tip Amount: the amount that a driver was tipped by the occupants of a ride.
·         Tolls Amount: the amount of tolls paid by the driver during a particular fare.
·         Total Amount: sum of all charges levied to the customer, including tip.
·         Rate Code: depending on how far outside of NYC a trip goes, differing rates apply. 
·         Dropoff Datetime: the date and time that a passenger was dropped off for a given fare.
·         Passenger Count: the number of passengers in the vehicle during the trip.
·         Trip Time in Seconds: the number of seconds that passed between pickup and drop-off.
·         Trip Distance: the distance (in miles) travelled during a trip.
·         Year: 2013 for all observations.
·         Month: January (1) for all observations.
·         Date (Day): the numeric day of the month ranging from 1 to 31.
·         Day of Week: day of week (Monday through Sunday) for a given pickup.
·         Pickup Lat/Long: GPS coordinates for pickup.
·         Pickup Zipcode: zipcode for pickup.
·         Pickup Direction: the cardinal direction that the cab was travelling during pickup.
·         Dropoff Lat/Long: GPS coordinates for drop-off.
·         Dropoff Zipcode: the zipcode in which the trip terminated.
·         Dropoff Direction: the cardinal direction of the cab at drop-off.
·         Pickup Neighborhood: a less granular measure of location than zip-code.  While there are 41 Manhattan zip-codes in our dataset, there are only 10 broader neighborhoods represented in this categorical variable.
·         Dropoff Neighborhood: same as above, but with respect to the neighborhoods where a given ride terminates.  If a ride terminates outside of Manhattan, no specific neighborhood is captured, just that the neighborhood is “not in Manhattan.”


Model #1 (Forecasting Fares):

Before beginning the modelling process to forecast fares for both passengers and livery services seeking a benchmark for their own fares, we ran our data through a “DMDB node” to assess the distributions of our interval data for reasonable skewness, the results of which are shown below:



Based on the results, we transformed all interval data to their corresponding logged values, resulting in distributions much closer to normal.
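
To illustrate why a log transform helps, the sketch below computes the skewness of a small right-skewed sample before and after taking logs. The fare values are made up for the example and the `skewness` helper is our own illustrative function, not output from the DMDB node.

```python
import math

def skewness(values):
    """Population skewness: mean cubed deviation over std-dev cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up, right-skewed fare amounts in dollars (not taken from the TLC data).
fares = [4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 8.0, 9.5, 12.0, 18.0, 30.0, 52.0]
logged = [math.log(f) for f in fares]

raw_skew = skewness(fares)    # skewness of the raw values
log_skew = skewness(logged)   # skewness after the log transform
```

For a long right tail like this one, the logged values are noticeably less skewed than the raw values, which is the effect we were after.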


Additionally, we created “bins” for “pickup_zipcode” that correspond to broader neighborhoods within Manhattan in order to determine whether pickup locations actually need to be broken down to a level as granular as zip-code in order to predict a fare-amount with a high degree of accuracy.  We also grouped hours of the day into distinct categories that include “morning rush,” “late morning,” “afternoon,” “afternoon rush,” “evening,” and “night/late-night” in order to minimize the number of two-factor interactions between pickup location and time of pickup.
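
A binning step of this kind might look like the sketch below. The category names come from the text, but the hour boundaries, the `time_of_day` helper and the placeholder neighborhood labels are assumptions made purely for illustration.

```python
# Hour boundaries are illustrative assumptions; the report names the
# categories but does not state the cut-offs that were used.
TIME_BUCKETS = [
    (6, 10, "morning rush"),
    (10, 12, "late morning"),
    (12, 16, "afternoon"),
    (16, 19, "afternoon rush"),
    (19, 22, "evening"),
]

def time_of_day(hour):
    """Map an hour (0-23) to one of the six time-of-day categories."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    for start, end, label in TIME_BUCKETS:
        if start <= hour < end:
            return label
    return "night/late-night"  # every hour outside the listed ranges

# Hypothetical zip-code bins; real neighborhood names are not given here.
ZIP_TO_NEIGHBORHOOD = {"10001": "hood_A", "10011": "hood_A", "10002": "hood_B"}

def pickup_neighborhood(zipcode):
    """Collapse a zip-code into its broader neighborhood bin."""
    return ZIP_TO_NEIGHBORHOOD.get(zipcode, "unknown")
```

With six time categories and 10 neighborhoods, a location-by-time interaction has 60 cells instead of the 41 x 24 = 984 cells that raw zip-codes and hours would produce.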

In order to deterministically calculate the fare for any given taxi-trip, one must first know the number of passengers, the distance of the trip and, finally, the time that the trip will actually take from inception to termination.  As shown below, when using this information, we can predict a fare with an R-squared > .999. 




However, since traffic conditions vary throughout the week, and throughout the day, it is not possible to know with absolute certainty how long a given trip will take to complete.  Thus, by using zip-codes and times of day as variables in a model, we can essentially create a proxy for average traffic-patterns and accurately predict the amount of any given fare. 

Once we had reasonably skewed distributions, we approached the task of predicting fare-amount using three models consisting of two linear regressions and a neural-network.



·         Linear by Hood = linear regression specified by the formula given below:
o   log(fare_amount) = log(trip_distance)+log(passenger_count)+day_of_week+trip_time_of_day+pickup_neighborhood
·         Linear by Zip = linear regression specified by the formula given below:
o   log(fare_amount) = log(trip_distance)+log(passenger_count)+trip_time_of_day+pickup_zipcode+day_of_week
·         Neural Network = neural network using the same inputs as “linear by zip.”
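
As a simplified illustration of the regression formulas above, the sketch below fits only the log(trip_distance) term by ordinary least squares and back-transforms predictions to dollars. The class variables and their interactions are deliberately omitted, the data is synthetic, and the function names are hypothetical; this is not the model we fitted.

```python
import math

def fit_log_log(distances, fares):
    """Least-squares fit of log(fare) = intercept + slope * log(distance)."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(f) for f in fares]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def predict_fare(intercept, slope, distance):
    """Back-transform the log-scale prediction to dollars."""
    return math.exp(intercept + slope * math.log(distance))

# Synthetic data generated from fare = 3.0 * distance ** 0.8 (made up).
miles = [0.5, 1.0, 2.0, 4.0, 8.0]
dollars = [3.0 * m ** 0.8 for m in miles]
b0, b1 = fit_log_log(miles, dollars)
```

Because the synthetic fares follow an exact power law, the fit recovers the exponent 0.8 as the slope and log(3.0) as the intercept. Real fares are noisy, so real fits would not be exact.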

Each model included two-factor interactions for all class-variables.  We used “Average Square Error” on the validation set as our selection criterion.  As shown below, all models performed roughly equally.  Interestingly, “Reg2,” which corresponds to the linear model using zip-codes aggregated at the neighborhood level, actually slightly outperformed the linear model using a more granular approach to measuring location of trip inception.
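
The selection criterion itself is simple to state in code. The sketch below computes the average squared error for each candidate on a validation set and picks the lowest; the target values and predictions are made up for illustration and do not reproduce our actual fit statistics.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def select_model(actual, predictions_by_model):
    """Return the model name with the lowest validation ASE, plus all scores."""
    scores = {name: average_squared_error(actual, preds)
              for name, preds in predictions_by_model.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Made-up validation targets and predictions on the log(fare) scale.
validation = [2.0, 2.5, 3.0, 3.5]
candidates = {
    "Linear by Hood": [2.1, 2.4, 3.0, 3.4],
    "Linear by Zip": [2.2, 2.4, 3.1, 3.3],
    "Neural Network": [2.1, 2.5, 3.0, 3.4],
}
best_name, ase = select_model(validation, candidates)
```

When the scores are as close as ours were, the lowest ASE need not be the model one recommends; interpretability can reasonably break the tie.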



Given that the results of each model were incredibly similar, we would recommend the use of “Reg2,” also known as “Linear by Neighborhood,” for the final model, despite the extremely modest fit improvement yielded by a neural-network, simply for ease of interpretability for both driver and passenger.  In any event, something as simple as a table of fares for each neighborhood at each hour of the day for a range of distances would be of great use to a driver to determine optimal areas to “cruise” during specific times of day and days of the week.

Model fit statistics for the suggested regression model are shown below:


With an adjusted R-Squared value of .8808, we feel confident that this level of granularity with respect to pickup location is sufficient to provide an accurate estimation for any given fare originating in Manhattan.

Model #2 (Predicting Drop-off Location):
Due to the regulations that cabs licensed to “cruise” for fares in Manhattan face with respect to picking up fares in other boroughs, it is highly desirable for a driver to avoid locations that are likely to result in a trip terminating outside of Manhattan.  We sought to use pickup-locations, time of day and day of week to determine whether any relevant patterns could be discerned with respect to trips terminating outside of Manhattan.

Overall, 86% of all trips in our dataset terminated within Manhattan.  That means that in order for a model to outperform the simple assumption that all trips terminate within Manhattan, it needs to achieve a misclassification rate significantly less than 14%.
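
The 14% benchmark is just the error rate of a classifier that always predicts the majority class. A minimal sketch, using made-up labels that mirror the 86% / 14% split (the helper names are ours):

```python
def misclassification_rate(actual, predicted):
    """Share of cases where the predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

def majority_baseline(actual):
    """Predict the most frequent class for every case."""
    majority = max(set(actual), key=actual.count)
    return [majority] * len(actual)

# Made-up labels mirroring the 86% / 14% split reported above.
labels = ["Manhattan"] * 86 + ["Other"] * 14
baseline_rate = misclassification_rate(labels, majority_baseline(labels))
```

Any candidate model has to beat `baseline_rate` on the validation set to be worth using.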

The following diagram shows the models that we attempted to fit to the data.  The left-hand branches consist of models run on the 60/30/10 data-partition.  The models on the right-hand branch were run on a sample taken of the partitioned data that gave us a 50/50 split between trips that terminated in Manhattan and those that did not, with original level proportions specified at 86% and 14% for trips terminating in Manhattan and elsewhere respectively.



Below are the results for the left-hand branches, which used non-modified data consisting of an 86%-14% split between trips terminating in Manhattan vs. those that terminated elsewhere:


All models performed identically with respect to misclassification rate of the validation set.  Moreover, the models were, at best, only as good as simply predicting that every single trip would terminate within Manhattan.  This means that none of these models are useful with respect to aiding drivers in avoiding specific combinations of pickup locations and times of day that are likely to yield a fare that terminates outside of Manhattan.

Because of the low percentage of trips terminating outside of Manhattan relative to the overall number of trips in the dataset, we figured that perhaps our models simply did not have enough “information” to work with.  In other words, with only 14% of the dataset consisting of “trips of interest,” there simply may not be enough cases with which to differentiate from trips that do terminate within Manhattan.  To combat this problem we created a dataset that had a 50/50 split of trips terminating within Manhattan and trips terminating in other boroughs.  We then applied a decision-node in order to adjust the prior-probabilities to conform to the non-sampled data. 
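
The prior adjustment can be illustrated with a standard Bayes-rule rescaling of the probabilities produced by a model trained on the balanced sample. Whether the decision node applies exactly this formula internally is an assumption on our part, and the function name is hypothetical.

```python
def adjust_for_priors(p_sample, sample_prior, true_prior):
    """Rescale an event probability from an oversampled model to true priors.

    p_sample     : predicted event probability on the balanced sample
    sample_prior : event share in the balanced sample (0.50 here)
    true_prior   : event share in the original data (0.14 here)
    """
    event = p_sample * true_prior / sample_prior
    non_event = (1 - p_sample) * (1 - true_prior) / (1 - sample_prior)
    return event / (event + non_event)

# Event = trip terminates outside Manhattan.
neutral = adjust_for_priors(0.50, sample_prior=0.50, true_prior=0.14)
strong = adjust_for_priors(0.90, sample_prior=0.50, true_prior=0.14)
```

Under this rescaling, a balanced-sample score of 0.50 maps back to the 14% base rate, and a trip needs a balanced-sample score above roughly 0.86 before its adjusted probability of leaving Manhattan exceeds 0.5. If the inputs carry little signal, few trips clear that bar, which would be consistent with a misclassification rate that stays near the baseline.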







With an adjusted sample that contains a 50/50 distribution of trips terminating in Manhattan vs. trips that terminate in other Boroughs, the results are the same as before in that each model has a misclassification rate of roughly 14.3%.



The Neural-Network, shown below in green, had the best sensitivity, or “true-positive rate.”  However, that rate is still equal to the one obtained on the original data.

While we were excited at the prospect of providing drivers with a prediction that would help them avoid unprofitable trips terminating outside of Manhattan, the split between trips terminating outside and within Manhattan appears to hold steady at roughly 14% and 86%, respectively, across zip-codes and times of day.  Unfortunately, our hypothesis was not supported by the data.

Model #3 (Assessing lost Taxable-Revenue):
When examining our dataset, one thing was immediately clear.  Almost every single trip in which a customer paid with cash had a tip-amount of $0.00.  However, there were almost no cases in which a fare paid with credit-card did not have a tip associated with it.  Since credit-card tips are automatically reported, our assumption is that drivers are simply pocketing cash tips to reduce their taxable income.  Given the influx of competition from services like Uber and Lyft, we sought to determine the overall impact of the lost taxable revenue that the dishonest reporting of cash-fare tips was driving.

Cash Fares: mean tip = $0.0003; mean tip % = 0
Credit Card Fares: mean tip = $2.18; mean tip % = 15.19

In order to model the lost taxable-revenue, we created a dataset consisting of only fares paid via credit-card.  Using various combinations of variables, we modelled whether tips were dependent solely on fare-amount, or whether certain zip-code and time-of-day combinations led to increased or decreased tips. 



The models on the left-hand side of the node array shown above have logged interval-terms, while the models on the right-hand side do not.  The regression nodes labelled with interactions include all two factor interactions between pickup location and time, while the neural networks and non-labelled regression do not. 

Results for logged models:

Results for non-logged models:

As shown above, the neural-network with no logged independent interval terms slightly outperformed the neural network with logged interval terms.  Recall that the mean tip amount across the domain of our dataset is $2.18.  The non-logged neural-network achieved a mean predicted tip-value of $2.17 and a mean residual of $.02. 




Applying that mean tip to every trip in the month of January 2013 (~14,000,000 trips) implies that the city and state of New York missed out on as much as 14,000,000*2.178751 = ~$30,500,000 in taxable revenue!  This is an upper bound, since only fares paid in cash go unreported.  Over the course of a year, that is more than $300,000,000 in lost taxable revenue, which would amount to upwards of $45,000,000 in actual taxes not collected, assuming a generous 15% marginal rate.
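
The arithmetic behind these figures can be laid out explicitly. The sketch below uses the trip count, mean tip and 15% rate quoted in the text and treats January as representative of every month. The `cash_share` parameter is a hypothetical addition that is not estimated anywhere in this report; its default of 1.0 reproduces the figure in the text, which applies the mean tip to every trip.

```python
def lost_revenue_estimate(trips_per_month, mean_tip, tax_rate,
                          cash_share=1.0, months=12):
    """Return (monthly tips, annual tips, annual tax) under the stated inputs."""
    monthly_tips = trips_per_month * cash_share * mean_tip
    annual_tips = monthly_tips * months
    annual_tax = annual_tips * tax_rate
    return monthly_tips, annual_tips, annual_tax

monthly, annual, tax = lost_revenue_estimate(
    trips_per_month=14_000_000,  # approximate January 2013 trip count from the text
    mean_tip=2.178751,           # mean tip used in the text
    tax_rate=0.15,               # assumed marginal rate from the text
)
```

Passing a `cash_share` below 1.0 scales all three figures down proportionally, which is how the estimate could be refined once the share of trips paid in cash is known.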



Use Cases / Business Implications:
·         Predicting Fare Amount:


o   By creating an app for passengers of “yellow-cab” taxis, the TLC of New York City may be able to mitigate some of the losses in revenue to competitors such as Uber and Lyft by mimicking some of their “convenience factor.”  By using location, time of day and day of week as proxies for traffic information, passengers could quickly and accurately predict a given cab-fare before embarking on their trip.
o   Additionally, services such as Uber and Lyft could actually use the same model as a way to benchmark their own fares.  Since the model predicts any given fare with a high degree of accuracy, alternative livery services could choose to either match or undercut “yellow-cab” rates according to the model’s output.

·         Predicting lost taxable-revenue:
o   Because fares that are paid in cash are almost always absent a reported tip, and fares paid with credit-cards reliably have a tip associated with them, our assumption is that drivers are simply pocketing cash-tips and consequently reducing the city and state of New York’s taxable revenue.  By modelling tip amount according to spatial and time variables, in addition to fare-amount, we produce a model that can be used to estimate the amount of taxable revenue that is being held by drivers of yellow-cabs.
o   This information can be used by the TLC to determine whether it is worth their time and effort to pursue this lost revenue.  One potential solution is to institute a mechanical device into which drivers would feed cash and then enter the amount which they wanted to leave as a tip.  This would effectively take the cash “out of the drivers’ hands,” removing their discretion with respect to reporting tips.


Conclusion:

While working on this project, we found that there are a few areas for improvement in the taxi business regulated by the TLC. As of now, these problems are not handled properly by the commission. We found three major areas that need to be addressed:
·         Huge loss of taxable revenue due to non-reporting of tips on fares paid in cash.
·         Inconvenience faced by passengers due to fares that are hard to predict under varying traffic conditions.
·         Loss of profits for the drivers due to trips ending outside Manhattan.
After applying the prediction models described above to each problem, we can predict the following:
·         Amount of lost revenue.
·         Actual fares for trips originating in Manhattan, using pickup location, day of week and time of day as a proxy for traffic conditions.
We could not, however, reliably predict unfavorable pickup locations for drivers: none of the drop-off models outperformed the assumption that every trip terminates within Manhattan.
Considering this, the government could start a pilot program to mitigate lost revenue. In addition, the commission could create interactive apps which could address the problems faced by customers and taxi drivers.

Learnings:
·         Preprocessing and Data Filtration.
Given that we had a huge input dataset, we found many missing values and anomalies in several variables. We were able to replace these with normal values. We learnt to identify the important variables to use as input and target variables before starting the modeling process.
·         Training and validation of data.
In order to run a successful model, we learnt how to train a model in SAS on training data so that it could predict values when given validation data.
·         How to use prediction models.
We learnt how to predict values by building linear regression, neural network and decision tree models.
·         How to interpret results.
Using mean square error, misclassification rate, odds ratio output, histograms, fit statistics and cumulative lift, we were able to interpret the different models and compare them in order to select the best one.




[1] http://www.businessinsider.com/uber-destroying-value-of-taxi-monopolies-cartels-2014-11
[2] http://www.pewinternet.org/fact-sheets/mobile-technology-fact-sheet/
[3] NYC Taxi and Limousine Commission Official Twitter Account (@NYCtaxi).  March 10, 2014. 
[4] http://chriswhong.com/open-data/taxi-techblog-2-leaflet-d3-and-other-frontend-fun/
