Executive Summary:
In recent years, there have been
several critical developments that make analysis of the NYC Taxi and Limousine
Commission’s data an opportunity for both incumbent and developing
businesses. Firstly, the proliferation
of alternative livery services such as Uber and Lyft has introduced
competition to a space that has traditionally been monopolized by
state-regulated “yellow-cabs.” Uber and
Lyft cars are “hailed” via smartphone, and provide up-front fare estimation for
customers. Additionally, Uber and Lyft
vehicles are not subject to the same strict provisions that state-regulated
“yellow-cabs” must endure. In fact, in
the short time since Uber’s inception, it has had the effect of steadily
reducing the value of regulated taxi-medallions in many major cities throughout
the United States.[1] This newfound competition has created an
opportunity to apply data-mining and statistical analysis techniques to
publicly available data in order to create or defend a competitive advantage in
a market long devoid of any real competition.
The analysis of data relating to
taxi-fares has implications for both the TLC and services such as Lyft and
Uber. For example, smartphone
penetration has reached an incredible 58% of the entire United States adult
population.[2] Smartphones are generally required to “hail”
an alternative livery vehicle (such as Uber and Lyft) and thus present a good
avenue for the NYC Taxi and Limousine Commission to mitigate lost customers to
alternative services. By creating “apps”
that can mimic some of the convenience associated with the Uber/Lyft
“experience” such as providing up-front fare estimation, the TLC may be able to
retain some revenue that it would otherwise lose to competing services. Uber and Lyft could use the
same data to benchmark their own fares and ensure that they maintain
parity with their competition with respect to rates. Finally, analysis of the data may
provide drivers of regulated “yellow cabs” with insights into how they can
maximize their fares, or even minimize wasted time.
We employed a “SEMMA” framework
for this study which consisted of sampling the available data, exploring it for
patterns, modifying variables where necessary, modelling our hypotheses and
then assessing the results.
Description of Topic:
New York City is one of several
major cities in the United States where licensed taxi-cabs actively “cruise”
the city looking for fares. This is in
contrast to how taxicabs operate in most American cities. Typically, a cab dispatcher is called by a
passenger, the dispatcher records information (pickup location, drop-off
location and number of passengers) and a taxi is sent to pick up the
passenger. This allows for the
dispatcher / driver to know, with reasonable certainty, what a given fare will
be before they actually pick up a given passenger.
In New York, and particularly in
Manhattan, this is not the case. Drivers
do not have a way to predict what their fares will be ahead of time since they
are simply picking up passengers off of the street. Moreover, it is prohibited by law for drivers
to refuse to undertake a given trip, provided that it terminates within the
city’s five boroughs. This makes it
difficult for a driver to estimate their revenue for a given day or trip. It also makes it difficult for a driver to
know where they will end up after picking up a fare in a given location. By examining detailed data collected from
several hundred thousand taxi trips in NYC, we seek to apply Business
Intelligence techniques to aid drivers and passengers alike, both by
predicting the fare that will result from any given “pickup” and by
optimizing “cruising” locations in order to maximize the
probability of picking up “high-revenue” fares.
Additionally, given the introduction of competition from Uber and Lyft,
who do not operate under the same regulations as “yellow-cabs,” we seek to provide an estimation of taxable
revenue lost due to the underreporting of tips received by cab-drivers for
fares paid in cash. Since Uber and Lyft
operate according to a “credit-card only” model, their fares are not subject to
the same type of “tip deflation.”
Previous Studies of Similar Data:
Similar data has been used in the
past to identify events of interest such as shift-changes (which result in a
shortage of cabs “cruising” since drivers are not allowed to pick up passengers
while returning to the livery). In fact,
the NYC Taxi and Limousine Commission has been using the data gathered from
its cabs to generate infographics for release via Twitter in what it has
dubbed “metric Mondays.” It was the
infographic shown directly below that eventually inspired the FOIA request that
has made our dataset available to the public.
The above infographic inspired
programmer, blogger and self-described “data-junkie,” Chris Whong to request
the NYC TLC dataset that we are using for our project via the Freedom of
Information Act, since the TLC is a public commission. The image below was created by Chris Whong
and is a screenshot from an app which Chris created that shows exactly how a given
driver in the dataset earned each of their fares for a given day via an
interactive visualization.
However, the information depicted
in Chris Whong’s app is not very helpful for an individual driver since it
simply breaks down a given day in another driver’s life without any real
predictive capabilities. Our goal is to
create less interactive but more useful models that would enable a driver or
passenger to predict, ahead of time, the fare for a given trip. Although taxi-fares are deterministically
calculated, a major component of that calculation is the actual time that a
trip takes, measured in seconds. Since
traffic conditions are highly variable depending on time of day and day of
week, our goal is to segment the data in such a way that we can use historical
observations as a proxy for traffic-conditions between two points at a given
time of day and day of week.
Additionally, we seek to
calculate the effect of underreporting tips for taxi-cab fares paid in cash as
an alternative method for mitigating lost-revenue for the TLC as a result of
increased competition. Finally, by
profiling NYC’s various zip-codes according to day of week, time of day and the
types of trips that originate during those times, we hope to generate models
that can aid in maximizing a given driver’s revenue by reducing the number of
fares that result in trips that terminate outside of Manhattan. Because taxis licensed by the TLC are
governed by strict regulations, cabs licensed to pick up “hailed fares” in
Manhattan must return to Manhattan without a fare if their trip terminates in
any of the other four NYC boroughs.
Thus, if we could accurately forecast the probability that a given zip-code
at a given time of day and day of week will generate a trip that terminates
outside of Manhattan, taxi-drivers could avoid those locations.
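The quantity described above can be estimated directly from historical counts before any model is fitted. The Python below is only an illustrative sketch (the study itself was carried out in SAS); the field names, including the `hour` field, and the sample rows are assumptions made up for the example.

```python
from collections import Counter

def outside_manhattan_rate(trips):
    """Share of trips ending outside Manhattan per (zip-code, day of week, hour)."""
    totals, outside = Counter(), Counter()
    for trip in trips:
        key = (trip["pickup_zipcode"], trip["day_of_week"], trip["hour"])
        totals[key] += 1
        if trip["dropoff_neighborhood"] == "not in Manhattan":
            outside[key] += 1
    return {key: outside[key] / totals[key] for key in totals}

# Hypothetical rows for illustration only (not taken from the TLC data).
trips = [
    {"pickup_zipcode": "10001", "day_of_week": "Friday", "hour": 23,
     "dropoff_neighborhood": "not in Manhattan"},
    {"pickup_zipcode": "10001", "day_of_week": "Friday", "hour": 23,
     "dropoff_neighborhood": "Chelsea"},
    {"pickup_zipcode": "10001", "day_of_week": "Friday", "hour": 23,
     "dropoff_neighborhood": "Midtown"},
    {"pickup_zipcode": "10001", "day_of_week": "Friday", "hour": 23,
     "dropoff_neighborhood": "Midtown"},
    {"pickup_zipcode": "10019", "day_of_week": "Monday", "hour": 9,
     "dropoff_neighborhood": "Midtown"},
]
rates = outside_manhattan_rate(trips)
```

Cells with only a handful of trips will give noisy rates, so in practice a minimum count per cell would be a sensible extra filter.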
Description of Data:
Our group has chosen to use data
obtained from the New York City Taxi and Limousine Commission. The Taxi and Limousine Commission requires
that all legal drivers report certain information pertaining to both the trip
itself, and the fares charged, for all legal taxi-rides. The data also includes the zip-code from
which a passenger was picked up, and the zip-code in which the passenger was
dropped off. GPS coordinates for pickup
and drop-off locations are also included, and are preferable to zip-codes since
more specific information can be inferred from more granular location data. However, given
the scope of this project, we will focus only on zip-codes and
collections of zip-codes that correspond to major neighborhoods with respect to
the location of pickups and drop-offs.
Because the initial dataset
encompassed every single legal taxi-ride tendered for the entire year of 2013,
we have chosen to only look at the month of January due to limitations in
processing power. From the month of
January, we took a random sample of 1,000,000 rows from the roughly 13,000,000
rows available. We then further reduced
the sample to exclude any observations with missing or incomplete data in the
pickup and drop-off zip-code fields since zip-codes will be a major portion of
our analysis. This left us with roughly
865,000 rows of data for the month of January.
Next, in order to further reduce the data, we decided to only look at
rides that originate within Manhattan.
This is due to the fact that there are myriad rules for where taxis
licensed for specific boroughs can and cannot pick up fares. Thus, we chose to restrict our study to the
busiest borough in NYC for taxi-cab traffic, Manhattan. This self-imposed constraint leaves us with
759,159 observations with which to construct our training, validation and
testing sets.
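The reduction steps above (random sample, drop rows with missing zip-codes, keep Manhattan pickups) can be sketched in a few lines. This is a minimal Python illustration rather than the SAS process we used; the function name, the field names and the tiny example rows are assumptions.

```python
import random

def sample_and_filter(rows, sample_size, manhattan_zips, seed=0):
    """Randomly sample trips, drop incomplete rows, keep Manhattan pickups."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(rows, min(sample_size, len(rows)))
    # Drop observations with a missing pickup or drop-off zip-code.
    complete = [r for r in sampled
                if r.get("pickup_zipcode") and r.get("dropoff_zipcode")]
    # Keep only rides that originate within Manhattan.
    return [r for r in complete if r["pickup_zipcode"] in manhattan_zips]

# Hypothetical rows for illustration only.
trips = [
    {"pickup_zipcode": "10001", "dropoff_zipcode": "10011"},
    {"pickup_zipcode": "10001", "dropoff_zipcode": None},     # incomplete row
    {"pickup_zipcode": "11201", "dropoff_zipcode": "10001"},  # Brooklyn pickup
    {"pickup_zipcode": "10019", "dropoff_zipcode": "11101"},
]
kept = sample_and_filter(trips, sample_size=4,
                         manhattan_zips={"10001", "10011", "10019"})
```

Note that trips ending outside Manhattan are kept, since only the pickup location is restricted.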
The following variables are
included in our dataset:
· Medallion: the license for a specific vehicle, allowing it to be used as a “yellow taxi” in NYC.
· Hack License: the license given to an individual driver granting them the right to operate a “medallion” taxi in NYC.
· Pickup Datetime: the date and time that a passenger was picked up.
· Vendor ID: the vendor who supplied the fare-meter to the particular taxi-cab.
· Payment Type: cash or credit.
· Fare Amount: the amount charged for the time / distance driven, independent of any taxes or surcharges.
· Surcharge: a mandatory surcharge applied to trips that occur during peak hours (between 4 and 6pm on weekdays and after 8pm on weekends).
· MTA Tax: a flat $0.50 tax on every fare, payable to New York State.
· Tip Amount: the amount that a driver was tipped by the occupants of a ride.
· Tolls Amount: the amount of tolls paid by the driver during a particular fare.
· Total Amount: sum of all charges levied to the customer, including tip.
· Rate Code: depending on how far outside of NYC a trip goes, differing rates apply.
· Dropoff Datetime: the date and time that a passenger was dropped off for a given fare.
· Passenger Count: the number of passengers in the vehicle during the trip.
· Trip Time in Seconds: the number of seconds that passed between pickup and drop-off.
· Trip Distance: the distance (in miles) travelled during a trip.
· Year: 2013 for all observations.
· Month: January (1) for all observations.
· Date (Day): the numeric day of the month ranging from 1 to 31.
· Day of Week: day of week (Monday, Tuesday, Wednesday, Thursday, Friday or Saturday) for a given pickup.
· Pickup Lat/Long: GPS coordinates for pickup.
· Pickup Zipcode: zipcode for pickup.
· Pickup Direction: the cardinal direction that the cab was travelling during pickup.
· Dropoff Lat/Long: GPS coordinates for drop-off.
· Dropoff Zipcode: the zipcode in which the trip terminated.
· Dropoff Direction: the cardinal direction of the cab at drop-off.
· Pickup Neighborhood: a less granular measure of location than zip-code. While there are 41 Manhattan zip-codes in our dataset, there are only 10 broader neighborhoods represented in this categorical variable.
· Dropoff Neighborhood: same as above, but with respect to the neighborhoods where a given ride terminates. If a ride terminates outside of Manhattan, no specific neighborhood is captured, just that the neighborhood is “not in Manhattan.”
Model #1 (Forecasting Fares):
Before beginning the modelling
process to forecast fares for both passengers and livery services seeking a
benchmark for their own fares, we ran our data through a “DMDB node” to assess
the distributions of our interval data for reasonable skewness, the results of
which are shown below:
Based on the results, we
transformed all interval data to their corresponding logged values, resulting in
distributions much closer to normal:
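For readers without SAS, the effect of the log transformation on skewness can be checked with a short script. This is a minimal Python sketch, not the DMDB node itself; the skewness formula is the plain population version and the fare values are made up for illustration.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical fares with a long right tail, as is typical for trip data.
fares = [4.5, 5.0, 6.5, 7.0, 8.0, 9.5, 12.0, 18.0, 52.0]
logged = [math.log(f) for f in fares]

raw_skew = skewness(fares)
log_skew = skewness(logged)
```

Variables that can be exactly zero (tips, tolls) cannot be logged directly; `math.log1p` is one common workaround, though the original write-up does not say how such values were handled.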
In order to deterministically
calculate the fare for any given taxi-trip, one must first know the number of
passengers, the distance of the trip and, finally, the time that the trip will
actually take from inception to termination.
As shown below, when using this information, we can predict a fare with
an R-squared > .999.
However, since traffic conditions
vary throughout the week, and throughout the day, it is not possible to know
with absolute certainty how long a given trip will take to complete. Thus, by using zip-codes and times of day as
variables in a model, we can essentially create a proxy for average
traffic-patterns and accurately predict the amount of any given fare.
Once we had reasonably skewed
distributions, we approached the task of predicting fare-amount using three
models consisting of two linear regressions and a neural-network.
· Linear by Hood = linear regression specified by the formula given below:
o log(fare_amount) = log(trip_distance) + log(passenger_count) + day_of_week + trip_time_of_day + pickup_neighborhood
· Linear by Zip = linear regression specified by the formula given below:
o log(fare_amount) = log(trip_distance) + log(passenger_count) + trip_time_of_day + pickup_zipcode + day_of_week
· Neural Network = neural network using the same inputs as “Linear by Zip.”
Each model included two-factor
interactions for all class variables.
We used “Average Square Error” on the validation set as our selection
criterion. As shown below, all models
performed roughly equally.
Interestingly, “Reg2,” which corresponds to the linear model using
zip-codes aggregated at the neighborhood level, actually slightly outperformed
the linear model using a more granular approach to measuring the location of trip
inception.
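The selection rule itself is simple to state in code: compute the average squared error of each candidate on the validation set and keep the smallest. The sketch below is illustrative Python; the model names `model_a` and `model_b` and all numbers are hypothetical placeholders, not our results.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical validation targets (log fares) and candidate predictions.
validation_log_fares = [2.0, 2.5, 3.0]
candidate_predictions = {
    "model_a": [2.1, 2.4, 3.2],
    "model_b": [1.8, 2.5, 3.3],
}

scores = {
    name: average_squared_error(validation_log_fares, preds)
    for name, preds in candidate_predictions.items()
}
best_model = min(scores, key=scores.get)  # lowest validation error wins
```

Because the targets are logged, the errors are on the log scale and should be compared only between models fitted to the same transformed target.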
Given that the results of each
model were incredibly similar, we would recommend the use of “Reg2,” also known
as “Linear by Hood,” for the final model, despite the extremely modest
fit improvement yielded by a neural-network, simply for ease of
interpretability for both driver and passenger. In any event, something as simple as a table
of fares for each neighborhood at each hour of the day for a range of distances
would be of great use to a driver to determine optimal areas to “cruise” during
specific times of day and days of the week.
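Such a lookup table could be built by averaging historical fares within each cell. The following is a minimal Python sketch under assumed field names (`pickup_neighborhood`, `hour`, `trip_distance`, `fare_amount`) and an assumed one-mile distance bucket; the rows are invented for illustration.

```python
from collections import defaultdict
from statistics import fmean

def fare_table(trips, bucket_miles=1.0):
    """Mean fare per (neighborhood, hour of day, distance bucket)."""
    groups = defaultdict(list)
    for trip in trips:
        # Bucket 0 covers [0, 1) miles, bucket 1 covers [1, 2) miles, and so on.
        bucket = int(trip["trip_distance"] // bucket_miles)
        key = (trip["pickup_neighborhood"], trip["hour"], bucket)
        groups[key].append(trip["fare_amount"])
    return {key: round(fmean(fares), 2) for key, fares in groups.items()}

# Hypothetical rows for illustration only.
trips = [
    {"pickup_neighborhood": "Midtown", "hour": 8, "trip_distance": 1.2, "fare_amount": 7.5},
    {"pickup_neighborhood": "Midtown", "hour": 8, "trip_distance": 1.8, "fare_amount": 9.5},
    {"pickup_neighborhood": "Midtown", "hour": 8, "trip_distance": 3.4, "fare_amount": 14.0},
]
table = fare_table(trips)
```

A day-of-week field could be added to the key in the same way if enough trips fall into each cell.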
Model fit statistics for the
suggested regression model are shown below:
Model #2 (Predicting Drop-off Location):
Because cabs licensed to “cruise” for fares in Manhattan face regulations
prohibiting them from picking up fares in other boroughs, it is highly desirable for a driver to
avoid locations that are likely to result in a trip terminating outside of
Manhattan. We sought to use pickup locations,
time of day and day of week to determine whether there were any discernible patterns
with respect to trips terminating outside of Manhattan.
Overall, 86% of all trips in our
dataset terminated within Manhattan.
That means that in order for a model to outperform the simple assumption
that all trips terminate within Manhattan, it needs to achieve a
misclassification rate significantly less than 14%.
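The 14% benchmark is the error rate of a classifier that always predicts the majority class. A minimal Python sketch of that baseline, using a hypothetical list of 100 labels built to match the 86%/14% split reported above:

```python
from collections import Counter

def majority_baseline_error(labels):
    """Misclassification rate of always predicting the most common label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(labels)

# 86 trips ending in Manhattan and 14 ending elsewhere, mirroring the split above.
labels = ["Manhattan"] * 86 + ["Other"] * 14
baseline_error = majority_baseline_error(labels)
```

Any candidate model's validation misclassification rate can then be compared against `baseline_error` before it is considered useful.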
The following diagram shows the
models that we attempted to fit to the data.
The left-hand branches consist of models run on the 60/30/10
data-partition. The models on the
right-hand branch were run on a sample taken of the partitioned data that gave
us a 50/50 split between trips that terminated in Manhattan and those that did
not, with original level proportions specified at 86% and 14% for trips
terminating in Manhattan and elsewhere respectively.
Below are the results for the
left-hand branches, which used non-modified data consisting of an 86%-14% split
between trips terminating in Manhattan vs. those that terminated elsewhere:
All models performed identically
with respect to misclassification rate of the validation set. Moreover, the models were, at best, only as
good as simply predicting that every single trip would terminate within
Manhattan. This means that none of these
models are useful with respect to aiding drivers in avoiding specific
combinations of pickup locations and times of day that are likely to yield a
fare that terminates outside of Manhattan.
Because of the low percentage of
trips terminating outside of Manhattan relative to the overall number of trips
in the dataset, we figured that perhaps our models simply did not have enough
“information” to work with. In other
words, with only 14% of the dataset consisting of “trips of interest,” there
simply may not be enough cases with which to differentiate from trips that do
terminate within Manhattan. To combat
this problem we created a dataset that had a 50/50 split of trips terminating
within Manhattan and trips terminating in other boroughs. We then applied a decision-node in order to
adjust the prior-probabilities to conform to the non-sampled data.
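To make the prior adjustment concrete, the sketch below applies the standard correction for a resampled training set: each class probability is rescaled by the ratio of its true prior to its sample prior and then renormalized. This is an assumption about what the decision node does, written in Python for illustration; the function name and example scores are hypothetical.

```python
def adjust_for_priors(p_sample, sample_prior, true_prior):
    """Rescale a probability learned on a resampled set to the true class prior.

    p_sample:     model probability of the rare class on the balanced sample
    sample_prior: share of the rare class in the balanced sample (0.5 here)
    true_prior:   share of the rare class in the original data (0.14 here)
    """
    pos = p_sample * true_prior / sample_prior
    neg = (1 - p_sample) * (1 - true_prior) / (1 - sample_prior)
    return pos / (pos + neg)

# A score of 0.5 on the 50/50 sample maps back to the 14% base rate.
adjusted_even = adjust_for_priors(0.5, sample_prior=0.5, true_prior=0.14)
# A confident score of 0.9 on the 50/50 sample is pulled down substantially.
adjusted_high = adjust_for_priors(0.9, sample_prior=0.5, true_prior=0.14)
```

Under this formula, a score from the 50/50 sample has to exceed 0.86 before the adjusted probability crosses 0.5, so a model needs very strong signal to predict a trip outside Manhattan at the default cutoff.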
With an adjusted sample that
contains a 50/50 distribution of trips terminating in Manhattan vs. trips that
terminate in other boroughs, the results are the same as before in that each
model has a misclassification rate of roughly 14.3%.
The neural network, shown below
in green, had the best sensitivity or “true-positive rate”; however, the rate
is still equal to that of the original data.
Model #3 (Assessing lost Taxable-Revenue):
When examining our dataset, one
thing was immediately clear. Almost
every single trip in which a customer paid with cash had a tip-amount of
$0.00. However, there were almost no
cases in which a fare paid with credit-card did not have a tip associated with
it. Since credit-card tips are
automatically reported, our assumption is that drivers are simply pocketing
cash tips without reporting them in order to reduce their taxable income.
Given the influx of competition from services like Uber and Lyft, we
sought to determine the overall amount of taxable revenue lost to the dishonest
reporting of cash-fare tips.
Cash Fares: | Credit Card Fares:
Mean tip = $0.0003 | Mean tip = $2.18
Mean tip % = 0 | Mean tip % = 15.19
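Summary statistics like those above can be reproduced by grouping trips on payment type. The Python below is a minimal sketch; it assumes the tip percentage is computed against the fare amount (the write-up does not state the denominator), and the field names and rows are made up for illustration.

```python
from collections import defaultdict
from statistics import fmean

def tip_summary(trips):
    """Mean tip and mean tip percentage (of fare amount) per payment type."""
    by_type = defaultdict(list)
    for trip in trips:
        by_type[trip["payment_type"]].append(trip)
    summary = {}
    for payment_type, rows in by_type.items():
        tips = [r["tip_amount"] for r in rows]
        # Skip zero fares to avoid dividing by zero.
        pcts = [100 * r["tip_amount"] / r["fare_amount"]
                for r in rows if r["fare_amount"] > 0]
        summary[payment_type] = {"mean_tip": fmean(tips),
                                 "mean_tip_pct": fmean(pcts)}
    return summary

# Hypothetical rows for illustration only.
trips = [
    {"payment_type": "cash", "fare_amount": 10.0, "tip_amount": 0.0},
    {"payment_type": "cash", "fare_amount": 8.0, "tip_amount": 0.0},
    {"payment_type": "credit", "fare_amount": 10.0, "tip_amount": 2.0},
    {"payment_type": "credit", "fare_amount": 15.0, "tip_amount": 1.5},
]
summary = tip_summary(trips)
```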
In order to model the lost
taxable-revenue, we created a dataset consisting of only fares paid via
credit-card. Using various combinations
of variables, we modelled whether tips were dependent solely on fare-amount, or
whether certain zip-code and time-of-day combinations led to increased or
decreased tips.
The models on the left-hand side
of the node array shown above have logged interval-terms, while the models on
the right-hand side do not. The
regression nodes labelled with interactions include all two-factor interactions
between pickup location and time, while the neural networks and non-labelled
regression do not.
Results for logged models:
Results for non-logged models:
As shown above, the
neural-network with no logged independent interval terms slightly outperformed
the neural network with logged interval terms.
Recall that the mean tip amount across the domain of our dataset is
$2.18. The non-logged neural-network
achieved a mean predicted tip-value of $2.17 and a mean residual of $.02.
That means that, for the month of
January 2013 (~14,000,000 trips), the city and state of New York missed out on
approximately 14,000,000 × $2.178751 ≈ $30,500,000 in taxable revenue! Over the course of a year, that is more than
$300,000,000 in lost taxable revenue, which would amount to upwards of $45,000,000
in actual taxes not collected, assuming a generous 15% marginal rate.
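The arithmetic above can be laid out as a small calculation so that each assumption is explicit. This is an illustrative Python sketch; the `cash_share` parameter is an addition of ours for sensitivity checks and defaults to 1.0, which reproduces the figure above by applying the mean tip to every trip.

```python
def lost_taxable_revenue(trips_per_month, mean_tip, months=12,
                         tax_rate=0.15, cash_share=1.0):
    """Return (monthly unreported tips, annual unreported tips, annual tax)."""
    # cash_share is a hypothetical knob: the fraction of trips paid in cash.
    monthly = trips_per_month * cash_share * mean_tip
    annual = monthly * months
    return monthly, annual, annual * tax_rate

# Inputs taken from the text: ~14,000,000 trips and a $2.178751 mean tip.
monthly, annual, taxes = lost_taxable_revenue(14_000_000, 2.178751)
```

Setting `cash_share` to the observed fraction of cash fares would scale all three figures down proportionally, since only cash trips can carry unreported tips.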
Use Cases / Business Implications:
· Predicting Fare Amount:
o By creating an app for passengers of “yellow-cab” taxis, the TLC of New York City may be able to mitigate some of the losses in revenue to competitors such as Uber and Lyft by mimicking some of their “convenience factor.” By using location, time of day and day of week as proxies for traffic information, passengers could quickly and accurately predict a given cab-fare before embarking on their trip.
o Additionally, services such as Uber and Lyft could use the same model as a way to benchmark their own fares. Since the model predicts any given fare with a high degree of accuracy, alternative livery services could choose to either match or undercut “yellow-cab” rates according to the model’s output.
· Predicting lost taxable-revenue:
o Because fares that are paid in cash are almost always absent a reported tip, and fares paid with credit-cards reliably have a tip associated with them, our assumption is that drivers are simply pocketing cash-tips and consequently reducing the city and state of New York’s taxable revenue. By modelling tip amount according to spatial and time variables, in addition to fare-amount, we produce a model that can be used to estimate the amount of taxable revenue that is being withheld by drivers of yellow-cabs.
o This information can be used by the TLC to determine whether it is worth their time and effort to pursue this lost revenue. One potential solution is to institute a mechanical device into which drivers would feed cash and then enter the amount which the passenger wanted to leave as a tip. This would effectively take the cash “out of the driver’s hands,” removing their discretion with respect to reporting tips.
Conclusion:
While working on this project, we
found that there are a few areas for improvement in the taxi business regulated
by the TLC. As of now, these problems are not handled properly by the
commission. We found three major areas that need to be addressed:
· Huge loss of taxable revenue due to non-reporting of tips on fares paid in cash.
· Inconvenience faced by passengers due to inconsistent taxi fares, given the traffic conditions.
· Loss of profits for drivers due to trips ending outside Manhattan.
Using the prediction models described above for each major problem, we can estimate the following:
· Amount of lost revenue.
· Actual fares and unfavorable pickup locations for drivers, regardless of traffic conditions, for any destination.
Considering this, the government
could start a pilot program to mitigate lost revenue. In addition, the commission
could create interactive apps that address the problems faced by
customers and taxi drivers.
Learnings:
· Preprocessing and data filtration. Given that we had a huge input dataset, we located many missing values and anomalies in several variables. We were able to replace these with normal values. We learnt to identify the important variables to treat as input and target variables in order to start the modeling process.
· Training and validation of data. In order to run a successful model, we learnt how to train our SAS application using training data so that the application could predict values when we provided validation data.
· How to use prediction models. We learnt how to predict values by creating linear regression, neural network and decision tree models.
· How to interpret results. Using mean square error, misclassification rate, odds ratio output, histograms, fit statistics and cumulative lift, we were able to interpret different models and compare them in order to select the best one.
[1] http://www.businessinsider.com/uber-destroying-value-of-taxi-monopolies-cartels-2014-11
[2] http://www.pewinternet.org/fact-sheets/mobile-technology-fact-sheet/
[3] NYC Taxi and Limousine Commission Official Twitter Account (@NYCtaxi). March 10, 2014.
[4] http://chriswhong.com/open-data/taxi-techblog-2-leaflet-d3-and-other-frontend-fun/