Citibike Classification Challenge

Malte Bonart

October 16, 2019

This work and the underlying source code are available on GitHub.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


moderately imbalanced data

Day Pass / Three Day Pass    Annual Membership
Customer (11%)               Subscriber (89%)
$12.00 / day                 $169.00 / year
max 30 min                   max 45 min
$4.00 / 15 min               $2.50 / 15 min

customers bike for a longer period

Trips with a duration > 2 hours or < 20 seconds have been removed from the analysis (~0.3%). Customers ride 1441 seconds on average, subscribers 721 seconds.
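The filtering and averaging described above can be sketched as follows. This is a minimal illustration on made-up records, not the original analysis code; the record layout and the names `filter_trips` and `mean_duration_by_usertype` are assumptions.

```python
from statistics import mean

# Hypothetical trip records: (usertype, tripduration in seconds).
trips = [
    ("Customer", 1500),
    ("Customer", 9000),    # longer than 2 hours, removed
    ("Subscriber", 10),    # shorter than 20 seconds, removed
    ("Subscriber", 700),
    ("Subscriber", 740),
    ("Customer", 1380),
]

MIN_SECONDS = 20
MAX_SECONDS = 2 * 60 * 60


def filter_trips(records):
    """Keep trips whose duration lies within [MIN_SECONDS, MAX_SECONDS]."""
    return [(u, d) for u, d in records if MIN_SECONDS <= d <= MAX_SECONDS]


def mean_duration_by_usertype(records):
    """Return the mean trip duration in seconds for each user type."""
    by_type = {}
    for usertype, duration in records:
        by_type.setdefault(usertype, []).append(duration)
    return {u: mean(ds) for u, ds in by_type.items()}


kept = filter_trips(trips)
averages = mean_duration_by_usertype(kept)
```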

anomaly in the age distribution for customers

6% of all trips from customers have an age value of 49.

gender is mostly unknown for customers

more customers on the weekend

Weekdays are coded from 0 (Sunday) to 6 (Saturday).

more customers during summer months

more subscribers during rush hour

model features

  • tripduration
  • starttime, stoptime
  • hour, weekday, month
  • start (lat,lon), end (lat,lon)
  • age, close_to_fifty
  • gender
  • NYC neighbourhoods (via geocoding API)
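Some of the features listed above can be derived directly from the raw trip fields. The sketch below shows one possible way to do this. The weekday shift follows the 0 = Sunday coding used in this deck. The definition of `close_to_fifty` (age within one year of 50, which covers the anomalous value 49) is an assumption, since the slides do not define it.

```python
from datetime import datetime


def time_features(starttime):
    """Derive hour, weekday (0 = Sunday ... 6 = Saturday) and month."""
    # datetime.weekday() uses 0 = Monday, so shift by one to get 0 = Sunday.
    weekday = (starttime.weekday() + 1) % 7
    return {"hour": starttime.hour, "weekday": weekday, "month": starttime.month}


def close_to_fifty(age, tolerance=1):
    """Assumed definition: 1 if age is within `tolerance` years of 50."""
    return int(abs(age - 50) <= tolerance)


# 1 July 2018 was a Sunday.
features = time_features(datetime(2018, 7, 1, 8, 30))
```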

the baseline performs well - nearest neighbour classifier is worse

The baseline was constructed by classifying all trips with unknown gender as customers. Nearest neighbour classification is based on time, start location, end location and tripduration.
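The baseline rule is simple enough to write down in a few lines. The sketch below applies it to made-up data and computes an F-score by hand. Treating "Customer" as the positive class and encoding unknown gender as the string "unknown" are assumptions made for illustration.

```python
def baseline_predict(gender):
    """Classify a trip as Customer if gender is unknown, else Subscriber."""
    return "Customer" if gender == "unknown" else "Subscriber"


def f_score(y_true, y_pred, positive="Customer"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example data.
genders = ["unknown", "male", "female", "unknown", "male", "unknown"]
truth = ["Customer", "Subscriber", "Subscriber",
         "Subscriber", "Customer", "Customer"]

predictions = [baseline_predict(g) for g in genders]
score = f_score(truth, predictions)
```

In this toy data the rule finds two of three customers and raises one false alarm, so precision and recall are both 2/3.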

classification does not outperform baseline

features        dimensions    logistic regression (f-score)
tripduration    1             0.15
+ gender        3             0.70
+ age           5             0.71
+ time          45            0.71
+ area          173           0.72

Due to resource and time constraints, the training is based on a random sample of n=5,000,000 trips.
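One possible accounting for the dimension counts in the table is sketched below. It assumes that categorical features are one-hot encoded with one reference level dropped, and that the time block consists of hour, weekday and month. Neither assumption is stated in the slides, but together they reproduce the counts 1, 3, 5 and 45.

```python
from itertools import accumulate

# Assumed encoding (not stated in the slides): dummy variables with one
# reference level dropped per categorical feature.
added_columns = [
    ("tripduration", 1),                       # one numeric column
    ("gender", 3 - 1),                         # three levels
    ("age", 2),                                # age and close_to_fifty
    ("time", (24 - 1) + (7 - 1) + (12 - 1)),   # hour, weekday, month
]

cumulative = list(accumulate(n for _, n in added_columns))

# Under the same reading, the area block would add the remaining columns.
area_columns = 173 - cumulative[-1]
```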

random forest does not improve the classification

comparison of trip duration with biking

biking vs.             driving    driving (traffic)    transit
Average differences    -107       -105                 209
Median differences     2          -9                   263
Biking faster          50%        48%                  83%

Based on a random sample of n=2000 trips, collected with the Google Maps Directions API. The Wilcoxon signed-rank test and the paired t-test are both significant.

Plot legend: biking faster | biking slower
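The summary rows of the comparison table can be computed from paired durations as sketched below. The sign convention (other mode minus biking, so that a positive difference means biking is faster) is inferred from the table rather than stated, and the durations are made up.

```python
from statistics import mean, median


def summarize(bike_seconds, other_seconds):
    """Summarize paired duration differences (other mode minus biking)."""
    # Assumed sign convention: positive difference means biking is faster.
    diffs = [o - b for b, o in zip(bike_seconds, other_seconds)]
    return {
        "average_difference": mean(diffs),
        "median_difference": median(diffs),
        "biking_faster": sum(d > 0 for d in diffs) / len(diffs),
    }


# Hypothetical paired durations in seconds for four trips.
bike = [600, 900, 1200, 300]
transit = [900, 1000, 1100, 700]

summary = summarize(bike, transit)
```

With these values biking is faster on three of the four trips.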

4726 injured bikers, 10 deaths in NYC car accidents 2018

driver inattention
failure to yield right of way
confusion of bicyclist
traffic control disregarded
passing or lane usage improper


Top reasons for NYC motor vehicle collisions where at least one biker was injured.