diff --git a/notebooks/cogo_trip_analysis.ipynb b/notebooks/cogo_trip_analysis.ipynb
index 110ab4a..be411fe 100644
--- a/notebooks/cogo_trip_analysis.ipynb
+++ b/notebooks/cogo_trip_analysis.ipynb
@@ -1,306 +1,314 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import statistics\n",
"import folium\n",
"import numpy as np\n",
"import pandas as pd\n",
"import os\n",
"import sys\n",
"import logging\n",
"import warnings\n",
"from pathlib import Path\n",
"from IPython.display import Image, display\n",
"\n",
"from cogo import plotting, data_prep, simulation\n",
"\n",
"\n",
"logger = logging.getLogger(__name__)\n",
"logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysis overview\n",
"\n",
"At a high level, we're looking at rider volume of COGO bikeshare bikes. The plan of analysis I'm taking is to view this almost as a queueing problem, and to simulate inbound and outbound volume at each station over some arbitrary time period, and with respect to the simultaneous activity at every other station. Ultimately, I'm interested in identifying stations that can become bottlenecks, either by virtue of having no remaining bikes to lend, or by having no available docks to which to return a bike. I chose this approach for several reasons:\n",
" - Predicting overall ridership volume, and forecasting out to some future point, in the context of, e.g., an ARIMA, is a relatively straightforward problem. I assume it has already been solved.\n",
" - Predicting ridership volume on a per-station basis is only marginally useful: this type of analysis is likely to ignore the fact that we are dealing with a closed system in which there are a limited and pre-determined number of bikes available, of which only a subset can be rented out from a given station at any given time. Any model that is agnostic to the capacity of a given station, or to the volume of inbound traffic it receives, is not liable to provide useful or actionable predictions. (E.g., if we predict that a 15-dock station will have 30 departures over the course of an hour, how can we respond? Do we double the capacity of the station? Do we send a van to fill any empty docks? Will the inbound volume over that time period be enough to meet the demand without any intervention?)\n",
" - Treating this as a queueing problem allows us to integrate both seasonal changes in departure volume (hourly, daily, monthly — although this proof-of-concept implementation only looks at ridership on an hourly basis per station) and a transition matrix of probabilities that a prototypical rider will transit from any station `N` to any other station `M`, or will return the bike back to station `N`. This specifically solves for the fact that we are operating in a closed system with a limited number of bikes, each station having access only to an even more limited subset thereof at any given point in time. It also gives us the ability to easily take action based on simulations we run: we can quickly flag cases where a station is about to run out of either available bikes or free docks, allowing us to proactively move bikes from overfull stations to underfull ones.\n",
" \n",
"All that said though, the underlying problem that motivated me to take this track in my analysis is that I, as a rider, am always going to want a bike to be available when I go to rent one, and for a dock to be available when I return one, regardless of where I want to return the bike to. Solving for overall trends in volume is nice, and can help inform expansion and purchasing decisions, but it really doesn't do anything to benefit the customer experience.\n",
"\n",
"\n",
"# Exploring the data\n",
"\n",
"Initially I'm just going to overlay the stations with their cumulative lifetime arrival and departure counts onto a map of downtown Columbus. If we wanted to tweak this to help intuit what our later simulations might tell us, we could look at these data on the basis of average daily (hourly, weekly, whatever) arrivals, departures, and net change: this might help visually identify which stations were net consumers or producers of bikes, and would need to be manually re-balanced more frequently. But for my purposes here, just getting a sense of where everything is located geographically is enough for right now."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"APP_ROOT = Path(os.path.realpath(os.path.expanduser(os.getcwd()))).parents[0]\n",
"\n",
"cogo_data, cogo_stations = data_prep.load_datasets(APP_ROOT)\n",
"hourly_trips = data_prep.prepare_hourly_trips(cogo_data)\n",
"\n",
"# \"Station cross/interlinks\" is basically just a transition\n",
"# matrix representing the probability of a rider transitioning\n",
"# between any two arbitrary bike stations\n",
"station_crosslinks = data_prep.build_station_interlinks(cogo_data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Really, this should be normalized to control for\n",
"# number of days a station was in service, but for my\n",
"# purposes, I'm just interested in the geographic\n",
"# distribution of stations right now\n",
"df_agg = plotting.counts_by_hexagon(df=cogo_data, resolution=9)\n",
"\n",
"m_hex = plotting.choropleth_map(\n",
" df_agg=df_agg,\n",
" name='Departure Count',\n",
" value_col='departure_count',\n",
" with_legend=True)\n",
"m_hex = plotting.choropleth_map(\n",
" df_agg=df_agg,\n",
" name='Arrival Count',\n",
" value_col='arrival_count',\n",
" initial_map=m_hex,\n",
" with_legend=True,\n",
" kind='outlier'\n",
")\n",
"folium.map.LayerControl('bottomright', collapsed=False).add_to(m_hex)\n",
"m_hex.save(str(APP_ROOT / 'output' / 'choropleth_counts.html'))\n",
"m_hex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simulate ridership volumes\n",
"\n",
"For the actual analysis, I'm building out a simulation that will, starting at any arbitrary hour of the day and running for any arbitrary number of minutes, use historic ridership data to simulate arrivals to and departures from each station. To achieve this, I need to first identify, for each station:\n",
" 1. How many docks are available?\n",
" 1. How many departures are there, on average, within a given hour?\n",
" 1. What is the average time between departures within a given hour?\n",
" 1. What is the probability that a rider will take a rented bike to any other individual station `M`?\n",
"\n",
"Given this information, we can then\n",
" - For each tick:\n",
" - For each station:\n",
" - determine if a departure should occur (sampling from geometric distribution given that station's inter-departure interval for that time of day)\n",
" - If a departure should occur:\n",
" - If `n > 0` bikes are available:\n",
" - determine the destination station, based on the probabilities taken from our Station `N`:Station `M` transition matrix\n",
" - undock a bike and assign it to a to global 'in transit' list w/estimated transit time\n",
" - reset station's time since last depart counter (This is just used for internal tracking to verify that the geometric distribution we're sampling from is behaving reasonably)\n",
" - If no bike is available\n",
" - The customer is filled with a deep and lingering sadness\n",
" - Otherwise:\n",
" - increment station's time since last departure counter (again, for internal tracking purposes)\n",
" - For each undocked bike:\n",
" - Decrement remaining travel time by 1 tick (1 minute)\n",
" - If remaining travel time <= 0:\n",
" - If `n > 0` available docks at destination station:\n",
" - dock the bike\n",
" - Otherwise:\n",
" - The customer is filled with a deep and lingering sadness\n",
- " - The bike is thrown into the Olentangy, never to be seen again"
+ " - The bike is thrown into the Olentangy, never to be seen again (although actually I'm assuming the customer will wait until a dock becomes available to avoid being charged the cost of the bike)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"orchestrator = simulation.Orchestrator(\n",
" cogo_stations,\n",
" station_crosslinks,\n",
" hourly_trips,\n",
" bike_count=408 # max bike id in COGO data\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"orchestrator.run_simulation(\n",
" start_hour=12,\n",
" num_ticks=120)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then inspect all of the delays caused by over- or underfull bike stations. Note that will currently overstate problems when they crop up since any customer who's trying to return a bike is treated as an exceptionally upstanding citizen who will patiently wait at the station with her bike until a dock becomes available. This results in the same customer trying to return the same bike being logged once for every minute she is waiting."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"orchestrator.delays[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model evaluation\n",
"\n",
"To evaluate this model's performance, we can just do a straight comparison of the actual arrival and departure counts\\[1\\] predicted by our simulation on an hour-by-station basis against the average volumes of the same from our historical data.\\[2\\] I'll primarily be judging model accuracy on the basis of Median Absolute Difference (as small of a scale as we're operating on, looking at percentage differences is not super meaningful: if I simulate 5 departures for a station in an hour, but the historic average is 2, that's an 80-ish percent difference for a small delta in absolute terms).\n",
"\n",
"\n",
"\\[1\\] --- You'll see below that I am restricting the dataset of historical observations to hours 12 and 13. This is purely a function of the constraints I specified above in terms of the starting hour of the simulation and the number of ticks to advance. You are welcome to adjust as you see fit to model a wider range of time.\n",
"\n",
"\\[2\\] --- Note that we could also hold an out-of-band sample to generate actual predictions against; however, given that this model is just intended to prove the concept and does not have important seasonalities integrated (day of week, month of year), that would only be of limited use and so I will decline to do so here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sim_res = pd.DataFrame()\n",
"for _, group in orchestrator.simulation_statistics.groupby(['station_id']):\n",
" group['delta'] = group['available_docks'] - group['available_docks'].shift(-1)\n",
" sim_res = pd.concat([sim_res, group])\n",
"sim_res['departures'] = np.abs(sim_res['delta'].apply(lambda x: min(0, x)))\n",
"sim_res['arrivals'] = sim_res['delta'].apply(lambda x: max(0, x))\n",
"\n",
"sim_res['hour'] = sim_res['timestamp'].apply(lambda x: x[:2])\n",
"sim_res = sim_res.groupby(['hour', 'station_id']).agg('sum').reset_index()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cogo_data['arrival_hour'] = cogo_data['stop_timestamp'].dt.strftime('%H')\n",
"cogo_data['arrival_date'] = cogo_data['stop_timestamp'].dt.strftime('%Y-%m-%d')\n",
"\n",
"cogo_data['departure_hour'] = cogo_data['start_timestamp'].dt.strftime('%H')\n",
"cogo_data['departure_date'] = cogo_data['start_timestamp'].dt.strftime('%%Y-%m-%d')\n",
"\n",
"out = (\n",
" cogo_data\n",
" .groupby(['arrival_hour', 'stop_station_id'])\n",
" .agg({\n",
" 'bike_id': 'count',\n",
" 'arrival_date': pd.Series.nunique\n",
" })\n",
" .rename_axis(['hour', 'station_id'])\n",
" .rename({\n",
" 'bike_id': 'arrival_count',\n",
" 'arrival_date': 'n_arrival_days'\n",
" }, axis=1)\n",
")\n",
"out = out.join(\n",
" cogo_data\n",
" .groupby(['departure_hour', 'start_station_id'])\n",
" .agg({\n",
" 'bike_id': 'count',\n",
" 'departure_date': pd.Series.nunique\n",
" })\n",
" .rename_axis(['hour', 'station_id'])\n",
" .rename({\n",
" 'bike_id': 'departure_count',\n",
" 'departure_date': 'n_departure_days'\n",
" }, axis=1)\n",
")\n",
"out['average_arrivals'] = round(\n",
" out['arrival_count'] / out['n_arrival_days'], 0)\n",
"out['average_departures'] = round(\n",
" out['departure_count'] / out['n_departure_days'], 0)\n",
"out = out.reset_index()\n",
"out = out.loc[out['hour'].isin(['12', '13'])]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"final = sim_res.merge(out, on=['hour', 'station_id'], how='outer')\n",
"# final['arrival_deviance'] = (\n",
"# np.abs(final['arrivals'] - final['average_arrivals']) / (\n",
"# (final['arrivals'] + final['average_arrivals']) / 2\n",
"# ) * 100\n",
"# )\n",
"# final['departure_deviance'] = (\n",
"# np.abs(final['departures'] - final['average_departures']) / (\n",
"# (final['departures'] + final['average_departures']) / 2\n",
"# ) * 100\n",
"# )\n",
"\n",
"final['arrival_delta'] = np.abs(final['arrivals'] - final['average_arrivals'])\n",
"final['departure_delta'] = np.abs(final['departures'] - final['average_departures'])\n",
"\n",
"final.agg({\n",
" 'departure_delta': ['mean', 'median'],\n",
" 'arrival_delta': ['mean', 'median']\n",
"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusions\n",
"\n",
- "For both arrivals and departures, by both mean and median absolute difference, my simulation is on average within about a single departure or arrival of actuals on a per-station by-hour basis. For having ignored substantial seasonal components in my current implementation, this actually kind of impressed me: it's definitely an encouraging result if we were to extend things to incorporate some of those additional variables. Looking at some of these differences on a row-by-row basis, there are a few instances where we see things diverge a little bit more (on the order of +- 3 or so arrivals or departures), especially as we get into the winter months and ridership volumes plummet. But, again, that's expected given the limited scope of the model in its current implementation."
+ "For both arrivals and departures, by both mean and median absolute difference, my simulation is on average within about a single departure or arrival of actuals on a per-station by-hour basis. For having ignored substantial seasonal components in my current implementation, this actually kind of impressed me: it's definitely an encouraging result if we were to extend things to incorporate some of those additional variables. Looking at some of these differences on a row-by-row basis, there are a few instances where we see things diverge a little bit more (on the order of +- 3 or so arrivals or departures), especially as we get into the winter months and ridership volumes plummet. But, again, that's expected given the limited scope of the model in its current implementation.\n",
+ "\n",
+ "The important takeaway from this, I think, is that without intervention the system is not able to effectively self-regulate. Manual re-balancing of bikes among over- and underfull stations is necessary throughout the day for continued smooth operation. Although this analysis gets at the question of volume a little obliquely, it does so in a way that (by looking at the effects of supply and demand at each individual station on the overall network health) demonstrates obvious interventions needed to better support the riders COGO serves in a way that other analyses focused purely on gross ridership volume would not.\n",
+ "\n",
+ "\n",
+ "Some obvious areas for improvement beyond those already mentioned include:\n",
+ " - Transit times between stations are currently deterministic. For a better view of actual ridership behaviors, I should probably consider sampling them from a normal distribution.\n",
+ " - It would be nice to build out a slightly better monitor to proactively alert when docks hit certain minimum/maximum thresholds of available bikes or docks. It's easy enough to get this information retrospectively, but it might be a nice enhancement to surface that detail in real-time as the simulation is running.\n",
+ " - I'd love to have visualized the simulation geotemporally, showing the simulated rides taking place in realtime overlayed on a map of downtown. The tabular results are fine and dandy, but that sort of visualization would make it much easier to grok where the bottlenecks are occurring and when."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}