Archive

Practical Pandas Part 3 - Exploratory Data Analysis

Welcome back. As a reminder:

  • In part 1 we got dataset with my cycling data from last year merged and stored in an HDF5 store
  • In part 2 we did some cleaning and augmented the cycling data with data from http://forecast.io.

You can find the full source code and data at this project's GitHub repo.

Today we'll use pandas, seaborn, and matplotlib to do some exploratory data analysis. For fun, we'll make some maps at the end using folium.

%matplotlib inline

import os
import datetime

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_hdf(os.path.join('data', 'cycle_store.h5'), key='with_weather')
df.head()
time ride_time_secs stopped_time_secs latitude longitude elevation_feet distance_miles speed_mph pace_secs average_speed_mph average_pace_secs ascent_feet descent_feet calories ride_id time_adj apparentTemperature cloudCover dewPoint humidity icon precipIntensity precipProbability precipType pressure summary temperature visibility windBearing windSpeed
0 2013-08-01 12:07:10 1.1 0 41.703753 -91.609892 963 0.00 2.88 1251 0.00 0 0 0 0 0 2013-08-01 07:07:10 61.62 0 58.66 0.9 clear-day 0 0 NaN 1017.62 Clear 61.62 8.89 282 2.77
1 2013-08-01 12:07:17 8.2 0 41.703825 -91.609835 852 0.01 2.88 1251 2.56 1407 0 129 0 0 2013-08-01 07:07:17 61.62 0 58.66 0.9 clear-day 0 0 NaN 1017.62 Clear 61.62 8.89 282 2.77
2 2013-08-01 12:07:22 13.2 0 41.703858 -91.609814 789 0.01 2.88 1251 2.27 1587 0 173 0 0 2013-08-01 07:07:22 61.62 0 58.66 0.9 clear-day 0 0 NaN 1017.62 Clear 61.62 8.89 282 2.77
3 2013-08-01 12:07:27 18.2 0 41.703943 -91.610090 787 0.02 6.60 546 4.70 767 0 173 1 0 2013-08-01 07:07:27 61.62 0 58.66 0.9 clear-day 0 0 NaN 1017.62 Clear 61.62 8.89 282 2.77
4 2013-08-01 12:07:40 31.2 0 41.704381 -91.610258 788 0.06 9.50 379 6.37 566 0 173 2 0 2013-08-01 07:07:40 61.62 0 58.66 0.9 clear-day 0 0 NaN 1017.62 Clear 61.62 8.89 282 2.77

Upon further inspection, it looks like some of our rows are duplicated.

df.duplicated().sum()
2

The problem is actually a bit more severe than that. The app I used to collect the data sometimes records multiple observations per second, but only reports the results at the second frequency.

df.time.duplicated().sum()
114

What to do here? We could drop change the frequency to micro- or nano-second resolution, and add, say, a half-second onto the duplicated observations.

Since this is just for fun though, I'm going to do the easy thing and throw out the duplicates (in real life you'll want to make sure this doesn't affect your analysis). Then we can set the time column to be our index, which will make our later analysis a bit simpler.

df = df.drop_duplicates(subset=['time']).set_index('time')
df.index.is_unique
True

Because of a bug in pandas, we lost our timzone information when we filled in our missing values. Until that's fixed we'll have to manually add back the timezone info and convert. The actual values stored were UTC (which is good practice whenever you have timezone-aware timestamps), it just doesn't know that it's UTC.

df = df.tz_localize('UTC').tz_convert('US/Central')
df.head()
ride_time_secs stopped_time_secs latitude longitude elevation_feet distance_miles speed_mph pace_secs average_speed_mph average_pace_secs ascent_feet descent_feet calories ride_id time_adj apparentTemperature cloudCover dewPoint humidity icon precipIntensity precipProbability precipType pressure summary temperature visibility windBearing windSpeed
time
2013-08-01 07:07:10-05:00 1.1 0 41.703753 -91.609892 963 0.00 2.88 1251 0.00 0 0 0 0 0 2013-08-01 07:07:10 61.62 0 58.66 0.9 clear-day 0 0 NaN 1017.62 Clear 61.62 8.89 282 2.77
2013-08-01 07:07:17-05:00 8.2 0 41.703825 -91.609835 852 0.01 2.88 1251 2.56 1407 0 129 0 0 2013-08-01 07:07:17 61.62 0 58.66 0.9 clear-day 0 0 NaN 1017.62 Clear 61.62 8.89 282 2.77
2013-08-01 07:07:22-05:00 13.2 0 41.703858 -91.609814 789 0.01 2.88 1251 2.27 1587 0 173 0 0 2013-08-01 07:07:22 61.62 0 58.66 0.9 clear-day 0 0 NaN 1017.62 Clear 61.62 8.89 282 2.77
2013-08-01 07:07:27-05:00 18.2 0 41.703943 -91.610090 787 0.02 6.60 546 4.70 767 0 173 1 0 2013-08-01 07:07:27 61.62 0 58.66 0.9 clear-day 0 0 NaN 1017.62 Clear 61.62 8.89 282 2.77
2013-08-01 07:07:40-05:00 31.2 0 41.704381 -91.610258 788 0.06 9.50 379 6.37 566 0 173 2 0 2013-08-01 07:07:40 61.62 0 58.66 0.9 clear-day 0 0 NaN 1017.62 Clear 61.62 8.89 282 2.77

Timelines

We'll store the time part of the DatetimeIndex in a column called time.

df['time'] = df.index.time

With these, let's plot how far along I was in my ride (distance_miles) at the time of day.

ax = df.plot(x='time', y='distance_miles')

png

There's a couple problems. First of all, the data are split into morning and afternoon. Let's create a new, boolean, column indicating whether the ride took place in the morning or afternoon.

df['is_morning'] = df.time < datetime.time(12)
ax = df[df.is_morning].plot(x='time', y='distance_miles')

png

Better, but this still isn't quite what we want. When we call .plot(x=..., y=...) the data are sorted before being plotted. This means that an observation from one ride gets mixed up with another. So we'll need to group by ride, and then plot.

axes = df[df.is_morning].groupby(df.ride_id).plot(x='time',
                                                  y='distance_miles',
                                                  color='k',
                                                  figsize=(12, 5))

png

Much better. Groupby is one of the most powerful operations in pandas, and it pays to understand it well. Here's the same thing for the evening.

axes = df[~df.is_morning].groupby(df.ride_id).plot(x='time',
                                                   y='distance_miles',
                                                   color='k',
                                                   figsize=(12, 5))

png

Fun. The horizontal distance is the length of time it took me to make the ride. The starting point on the horizontal axis conveys the time that I set out. I like this chart because it also conveys the start time of each ride. The plot shows that the morning ride typically took longer, but we can verify that.

ride_time = df.groupby(['ride_id', 'is_morning'])['ride_time_secs'].agg('max')
mean_time = ride_time.groupby(level=1).mean().rename(
    index={True: 'morning', False: 'evening'})
mean_time / 60
is_morning
evening       30.761667
morning       29.362716
Name: ride_time_secs, dtype: float64

So the morning ride is typically shorter! But I think I know what's going on. We were misleading with our plot earlier since the range of the horizontal axis weren't identical. Always check the axis!

At risk of raising the ire of Hadley Whickham, we'll plot these on the same plot, with a secondary x-axis. (I think its OK in this case since the second is just a transformation – a 10 hour or so shift – of the first).

We'll plot evening first, use matplotlib's twinx method, and plot the morning on the second axes.

fig, ax = plt.subplots()

morning_color = sns.xkcd_rgb['amber']
evening_color = sns.xkcd_rgb['dusty purple']

_ = df[~df.is_morning].groupby(df.ride_id).plot(x='time', y='distance_miles',
    color=evening_color, figsize=(12, 5),
    ax=ax, alpha=.9, grid=False)
ax2 = ax.twiny()
_ = df[df.is_morning].groupby(df.ride_id).plot(x='time', y='distance_miles',
    color=morning_color, figsize=(12, 5),
    ax=ax2, alpha=.9, grid=False)

# Create fake lines for our custom legend.
morning_legend = plt.Line2D([0], [0], color=morning_color)
evening_legend = plt.Line2D([0], [0], color=evening_color)

ax.legend([morning_legend, evening_legend], ['Morning', 'Evening'])
<matplotlib.legend.Legend at 0x115640198>

png

There's a bit of boilerplate at the end. pandas tries to add a legend element for each ride ID. It doesn't know that we only care whether it's morning or evening. So instead we just fake it, creating two lines thate are invisible, and labeling them appropriately.

Anyway, we've accomplished our original goal. The steeper slope on the evening rides show that they typically took me less time. I guess I wasn't too excited to get to school in the morning. The joys of being a grad student.

I'm sure I'm not the only one noticing that long evening ride sticking out from the rest. Let's note it's ride ID and follow up. We need the ride_id so groupby that. It's the longest ride so take the max of the distance. And we want the ride_id of the maximum distance, so take the argmax of that. These last three sentances can be beautifully chained together into a single line that reads like poetry.

long_ride_id = df.groupby('ride_id')['distance_miles'].max().argmax()
long_ride_id
22

Cartography

We'll use Folium to do a bit of map plotting. If you're using python3 (like I am) you'll need to use this pull request from tbicr, or just clone the master of my fork, where I've merged the changes.

Since this is a practical pandas post, and not an intro to folium, I won't delve into the details here. The basics are that we initialize a Map with some coordinates and tiles, and then add lines to that map. The lines will come from the latitude and longitude columns of our DataFrame.

Here's a small helper function from birdage to inline the map in the notebook. This allows it to be viewable (and interactive) on nbviewer. For the blog post I'm linking them to `

def inline_map(map):
    """
    Embeds the HTML source of the map directly into the IPython notebook.

    This method will not work if the map depends on any files (json data). Also this uses
    the HTML5 srcdoc attribute, which may not be supported in all browsers.
    """
    from IPython.display import HTML
    map._build_map()
    return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 510px; border: none"></iframe>'.format(srcdoc=map.HTML.replace('"', '&quot;')))

I've plotted two rides, a hopefully representative ride (#42) and the long ride from above.

import folium
folium.initialize_notebook()

lat, lon = df[['latitude', 'longitude']].mean()
mp = folium.Map(location=(lat, lon), tiles='OpenStreetMap', zoom_start=13)
mp.line(locations=df.loc[df.ride_id == 42, ['latitude', 'longitude']].values)
mp.line(locations=df.loc[df.ride_id == long_ride_id, ['latitude', 'longitude']].values,
        line_color='#800026')
inline_map(mp)

So you pan around a bit, it looks like the GPS receiver on my phone was just going crazy. But without visualizing the data (as a map), there'd be no way to know that.

For fun, we can plot all the rides.

mp_all = folium.Map(location=(lat, lon), tiles='OpenStreetMap', zoom_start=13)

for ride_id in df.ride_id.unique():
    mp_all.line(locations=df.loc[df.ride_id == ride_id, ['latitude', 'longitude']].values,
            line_weight=1, line_color='#111', line_opacity=.3)

inline_map(mp_all)

You can barely make out that I changed my path partway through the year to take Old Hospital Road instead of the North Ridge Trail (North boundry of my path).

Folium is cool; you should check it out (really, just use anything made by Rob).