UN Global Pulse Take Home Assignment

I'll keep this notebook as a running diary of my exploration of the data. Hopefully it will give you some insight into my thought process and work habits.

Project 2: Graph Visualization

Typically, I divide projects into 3 phases: exploratory data analysis, an early draft, and then client feedback / iteration.

Step 1: Explore the Data

In [2]:
%matplotlib notebook
In [3]:
import pandas as pd
import requests
import time
In [12]:
dataurl = 'http://139.59.230.55/frontend/api/odpair'
travel = pd.DataFrame(requests.get(dataurl).json())
travel.head()
Out[12]:
   count  data                                               from    to
0  69     [{'count': 7, 'to': 'Kalideres', 'from': 'Pulo...  Line 2  Line 3
1  102    [{'count': 4, 'to': 'Cempaka Timur', 'from': '...  Line 2  Line 2
2  176    [{'count': 2, 'to': 'Monas', 'from': 'Pulo Gad...  Line 2  Line 1
3  5      [{'count': 5, 'to': 'PGC 2', 'from': 'Cempaka ...  Line 2  Line 7
4  61     [{'count': 5, 'to': 'Senen Sentral', 'from': '...  Line 2  Line 5

OK, so this appears to be a graph of people travelling between a set of destinations. I'm not sure what 'Lines' is referring to, possibly rail or bus lines? I'll ask.

First thoughts are:

  • Assuming this is a representation of rail or bus lines, it may be most helpful to do it as a geo-visualisation. Planners are likely to be familiar with their own city, and seeing fat / thin lines on a map is probably the most intuitive way to understand this data.
  • A particle simulation animation à la SimCity might be cool too, but is maybe out of scope for this, depending on time constraints (http://bigbytes.mobyus.com/commute.aspx).
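
To make the fat / thin lines idea a bit more concrete, here is a minimal sketch of how a trip count could be turned into a stroke width. The function name `line_width`, the width range and the square-root scaling are all my own assumptions, not something the data or the client has specified; square-root scaling is just one common way to stop the busiest pair from drowning out everything else.

```python
import math

def line_width(count, max_count, min_width=0.5, max_width=8.0):
    """Map a trip count to a stroke width using square-root scaling."""
    if max_count <= 0 or count <= 0:
        return min_width
    share = math.sqrt(count / max_count)
    return min_width + share * (max_width - min_width)

# Per-pair counts visible in the preview of the 'data' column above
counts = [2, 4, 5, 7]
widths = [line_width(c, max(counts)) for c in counts]
```

The busiest pair gets `max_width`, and everything else falls between the two limits.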

Let's start by disaggregating the line data. It will give us more points to work with for a visualisation.

In [9]:
raw = requests.get(dataurl).json()
trip_list = list()
for line in raw:
    trip_list.extend(line['data'])

alltrips = pd.DataFrame(trip_list)
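
One thing the loop above drops is which pair of lines each trip record came from. If that turns out to matter later, a variant like the sketch below could carry the labels along. The column names `line_from` and `line_to` are hypothetical, and the sample record uses made-up station names and counts in the same shape as the API response.

```python
def flatten_trips(raw):
    """Flatten line-level records into one dict per station pair, keeping the line labels."""
    trips = []
    for line in raw:
        for trip in line['data']:
            row = dict(trip)  # copy so the input is not mutated
            row['line_from'] = line['from']  # hypothetical column name
            row['line_to'] = line['to']      # hypothetical column name
            trips.append(row)
    return trips

# Made-up record in the same shape as one element of the API response
sample = [{
    'count': 3, 'from': 'Line 2', 'to': 'Line 3',
    'data': [{'count': 3, 'to': 'Station B', 'from': 'Station A'}],
}]
rows = flatten_trips(sample)
```

Passing the result to `pd.DataFrame` should give the same frame as before plus the two extra columns.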

OK, so that's the disaggregated trip data. Now we'll need to geocode those stations.

In [10]:
from geopy import geocoders
g = geocoders.GoogleV3(api_key='AIzaSyAq8t-hz_DRUcQaM5an1FQCDoMvUtvKOO0',timeout=2)
from tqdm import tqdm_notebook, tqdm

tqdm_notebook().pandas(desc="progress")
backoff = 2 # Set a 2 second delay for Geocoding (global so we can parallelize this)

In [11]:
def getgeo(location):
    global backoff #If we decide to parallelize this
    locationstring = str(location) +" busway"
    try:
        # geocode() returns None when nothing matches; treat that as an empty list
        loclist = g.geocode(locationstring, exactly_one=False,region='ID',bounds=[106.3903, -6.3725, 106.9743, -5.2017]) or []
        for loc in loclist:
            if ('transit_station' in loc.raw['types']) or ('bus_station' in loc.raw['types']):
                #Only return the co-ordinates if Google thinks it's a bus stop
                return (loc.latitude,loc.longitude)
        return None # No result that Google classifies as a bus stop
    except Exception as e:
        print(e) # TODO - Better exception handling.
        backoff = backoff * 2
        time.sleep(backoff)
        return None
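
Note that getgeo sleeps after an error but never retries the failed station. If I come back to this, a retry wrapper along the lines of the sketch below might be a better fit. The name `with_backoff` and its parameters are my own invention, the `flaky` function only simulates a geocoder that times out twice, and the coordinates it returns are placeholders rather than a real station.

```python
import time

def with_backoff(func, arg, retries=3, delay=2, sleep=time.sleep):
    """Call func(arg), retrying with a doubling delay when it raises."""
    for attempt in range(retries):
        try:
            return func(arg)
        except Exception as exc:  # broad on purpose, mirroring getgeo
            print(f"attempt {attempt + 1} failed: {exc}")
            if attempt < retries - 1:
                sleep(delay)
                delay *= 2
    return None  # give up after the last attempt

# Simulated geocoder that fails twice and then succeeds
calls = {'n': 0}

def flaky(name):
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('simulated timeout')
    return (0.0, 0.0)  # placeholder coordinates, not a real station

waits = []  # record the delays instead of actually sleeping
result = with_backoff(flaky, 'Station A', sleep=waits.append)
```

In the notebook this would be used as something like `with_backoff(getgeo, station)`, assuming getgeo were changed to raise on errors instead of swallowing them.
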
In [94]:
station_list = list()
station_list.extend(alltrips['from'].values)
station_list.extend(alltrips['to'].values)
station_list = list(set(station_list))

geolocs = pd.DataFrame()
geolocs['station'] = station_list
geolocs['latlon'] = geolocs['station'].progress_apply(getgeo)
In [98]:
geolocs[geolocs['latlon'].isnull()]
Out[98]:
     station                        latlon
12   Monas                          None
15   Simpang Blv klp gading         None
34   RS. Puri Medika Plumpang       None
69   Latumenten St. K.A arah Pluit  None
76   RS.Harapan kita arah P.Ranti   None
106  Karet Kuningan                 None
116  Makro                          None

There are 7 stations we can't find a lat-lon for.

If we weren't time-constrained, we'd hand-code these or write a better geocoder. For the purposes of this exercise, we're just going to throw away trips to and from those stations and visualise the rest.
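
For the record, the hand-coding option could be as simple as a dictionary of overrides applied on top of the geocoder output. This is only a sketch: `fill_missing` is a hypothetical helper, and the station names and coordinates below are made-up placeholders, not looked-up values for the 7 missing stations.

```python
def fill_missing(latlons, overrides):
    """Return a copy of latlons with None entries replaced by hand-coded values where available."""
    filled = dict(latlons)
    for station, latlon in latlons.items():
        if latlon is None and station in overrides:
            filled[station] = overrides[station]
    return filled

# Placeholder data: station -> (lat, lon) or None when geocoding failed
geocoded = {'Station A': (1.0, 2.0), 'Station B': None, 'Station C': None}
manual = {'Station B': (3.0, 4.0)}  # hand-coded override (made-up values)

filled = fill_missing(geocoded, manual)
still_missing = sorted(s for s, ll in filled.items() if ll is None)
```

Anything left in `still_missing` would still be dropped, so the overrides can be filled in gradually.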

In [63]:
def geolookup(station):
    try:
        return geolocs[geolocs['station'] == station]['latlon'].values[0]
    except Exception as e:
        print(e)
        return

alltrips['from_latlon'] = alltrips['from'].apply(geolookup)
alltrips['to_latlon'] = alltrips['to'].apply(geolookup)

Step 2: Preliminary Visualisation

In [5]:
alltrips = pd.read_pickle('alltrips.pickle')
In [6]:
alltrips = alltrips.dropna()
In [29]:
from bokeh.charts import output_file, Chord
from bokeh.io import show, output_notebook
output_notebook()
Loading BokehJS ...
In [38]:
chartframe = alltrips[['from','to','count']][alltrips['count'] > 6]
chartframe = chartframe[chartframe['from'] != chartframe['to']]
In [39]:
all_trips_chart = Chord(chartframe, source="from", target="to", value="count")
In [ ]:
show(all_trips_chart)
In [42]:
output_file('chord.html')

The disaggregated data is rich, but this chord diagram is a bit overwhelming. It also doesn't take into account spatial relationships between stops.
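
One possible way to tame the clutter, which I haven't applied here, would be to merge both directions of each station pair and keep only the busiest few. The sketch below does that in plain Python; `top_pairs` is a hypothetical helper and the trips are toy data using the same 'from', 'to' and 'count' fields.

```python
from collections import Counter

def top_pairs(trips, n):
    """Sum counts over both directions of each station pair and keep the n busiest."""
    totals = Counter()
    for trip in trips:
        if trip['from'] == trip['to']:
            continue  # skip self-loops, as the chord filter above does
        pair = tuple(sorted((trip['from'], trip['to'])))
        totals[pair] += trip['count']
    return totals.most_common(n)

# Toy trips, not real stations
trips = [
    {'from': 'A', 'to': 'B', 'count': 7},
    {'from': 'B', 'to': 'A', 'count': 5},
    {'from': 'A', 'to': 'C', 'count': 4},
    {'from': 'C', 'to': 'C', 'count': 9},
    {'from': 'B', 'to': 'C', 'count': 2},
]
busiest = top_pairs(trips, 2)
```

This throws away direction, so it only makes sense if planners care more about corridor load than about which way people are travelling.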

Just for reference, here is the same kind of chord diagram aggregated by line, as the data came out of the API.

In [43]:
lineframe = travel[['from','to','count']][travel['count'] > 0]
shortframe = lineframe[lineframe['from'] != lineframe['to']]
linechart = Chord(shortframe, source="from", target="to", value="count")
show(linechart)