Practical Pandas--Part 1

This is the first post in a series where I'll show how I use pandas on real-world datasets.

For this post, we'll look at data I collected with Cyclemeter on my daily bike ride to and from school last year. I had to manually start and stop the tracking at the beginning and end of each ride. There may have been times where I forgot to do that, so we'll see if we can find those.

Let's begin in the usual fashion, a bunch of imports and loading our data.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from IPython import display

Each day has data recorded in two formats, CSVs and KMLs. For now I've just uploaded the CSVs to the data/ directory. We'll start with the those, and come back to the KMLs later.

!ls data | head -n 5
Cyclemeter-Cycle-20130801-0707.csv
Cyclemeter-Cycle-20130801-1720.csv
Cyclemeter-Cycle-20130805-0819.csv
Cyclemeter-Cycle-20130806-0750.csv
Cyclemeter-Cycle-20130806-1724.csv

Take a look at the first one to see how the file's laid out.

df = pd.read_csv('data/Cyclemeter-Cycle-20130801-0707.csv')
df.head()
Time Ride Time Ride Time (secs) Stopped Time Stopped Time (secs) Latitude Longitude Elevation (feet) Distance (miles) Speed (mph) Pace Pace (secs) Average Speed (mph) Average Pace Average Pace (secs) Ascent (feet) Descent (feet) Calories
0 2013-08-01 07:07:10 0:00:01 1.1 0:00:00 0.0 41.703753 -91.609892 963 0.00 2.88 0:20:51 1251 0.00 0:00:00 0 0 0 0
1 2013-08-01 07:07:17 0:00:08 8.2 0:00:00 0.0 41.703825 -91.609835 852 0.01 2.88 0:20:51 1251 2.56 0:23:27 1407 0 129 0
2 2013-08-01 07:07:22 0:00:13 13.2 0:00:00 0.0 41.703858 -91.609814 789 0.01 2.88 0:20:51 1251 2.27 0:26:27 1587 0 173 0
3 2013-08-01 07:07:27 0:00:18 18.2 0:00:00 0.0 41.703943 -91.610090 787 0.02 6.60 0:09:06 546 4.70 0:12:47 767 0 173 1
4 2013-08-01 07:07:40 0:00:31 31.2 0:00:00 0.0 41.704381 -91.610258 788 0.06 9.50 0:06:19 379 6.37 0:09:26 566 0 173 2
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 18 columns):
Time                   252 non-null object
Ride Time              252 non-null object
Ride Time (secs)       252 non-null float64
Stopped Time           252 non-null object
Stopped Time (secs)    252 non-null float64
Latitude               252 non-null float64
Longitude              252 non-null float64
Elevation (feet)       252 non-null int64
Distance (miles)       252 non-null float64
Speed (mph)            252 non-null float64
Pace                   252 non-null object
Pace (secs)            252 non-null int64
Average Speed (mph)    252 non-null float64
Average Pace           252 non-null object
Average Pace (secs)    252 non-null int64
Ascent (feet)          252 non-null int64
Descent (feet)         252 non-null int64
Calories               252 non-null int64
dtypes: float64(7), int64(6), object(5)
memory usage: 35.5+ KB

Pandas has automatically parsed the headers, but it could use a bit of help on some dtypes. We can see that the Time column is a datetime but it's been parsed as an object dtype. This is pandas' fallback dtype that can store anything, but its operations won't be optimized like they would on an float or bool or datetime[64]. read_csv takes a parse_dates parameter, which we'll give a list of column names.

date_cols = ["Time", "Ride Time", "Stopped Time", "Pace", "Average Pace"]

df = pd.read_csv("data/Cyclemeter-Cycle-20130801-0707.csv",
                 parse_dates=date_cols)
display.display_html(df.head())
df.info()
Time Ride Time Ride Time (secs) Stopped Time Stopped Time (secs) Latitude Longitude Elevation (feet) Distance (miles) Speed (mph) Pace Pace (secs) Average Speed (mph) Average Pace Average Pace (secs) Ascent (feet) Descent (feet) Calories
0 2013-08-01 07:07:10 2016-03-12 00:00:01 1.1 2016-03-12 0.0 41.703753 -91.609892 963 0.00 2.88 2016-03-12 00:20:51 1251 0.00 2016-03-12 00:00:00 0 0 0 0
1 2013-08-01 07:07:17 2016-03-12 00:00:08 8.2 2016-03-12 0.0 41.703825 -91.609835 852 0.01 2.88 2016-03-12 00:20:51 1251 2.56 2016-03-12 00:23:27 1407 0 129 0
2 2013-08-01 07:07:22 2016-03-12 00:00:13 13.2 2016-03-12 0.0 41.703858 -91.609814 789 0.01 2.88 2016-03-12 00:20:51 1251 2.27 2016-03-12 00:26:27 1587 0 173 0
3 2013-08-01 07:07:27 2016-03-12 00:00:18 18.2 2016-03-12 0.0 41.703943 -91.610090 787 0.02 6.60 2016-03-12 00:09:06 546 4.70 2016-03-12 00:12:47 767 0 173 1
4 2013-08-01 07:07:40 2016-03-12 00:00:31 31.2 2016-03-12 0.0 41.704381 -91.610258 788 0.06 9.50 2016-03-12 00:06:19 379 6.37 2016-03-12 00:09:26 566 0 173 2
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 18 columns):
Time                   252 non-null datetime64[ns]
Ride Time              252 non-null datetime64[ns]
Ride Time (secs)       252 non-null float64
Stopped Time           252 non-null datetime64[ns]
Stopped Time (secs)    252 non-null float64
Latitude               252 non-null float64
Longitude              252 non-null float64
Elevation (feet)       252 non-null int64
Distance (miles)       252 non-null float64
Speed (mph)            252 non-null float64
Pace                   252 non-null datetime64[ns]
Pace (secs)            252 non-null int64
Average Speed (mph)    252 non-null float64
Average Pace           252 non-null datetime64[ns]
Average Pace (secs)    252 non-null int64
Ascent (feet)          252 non-null int64
Descent (feet)         252 non-null int64
Calories               252 non-null int64
dtypes: datetime64[ns](5), float64(7), int64(6)
memory usage: 35.5 KB

One minor issue is that some of the dates are parsed as datetimes when they're really just times. We'll take care of that later. Pandas store everything as datetime64. For now we'll keep them as datetimes, and remember that they're really just times.

Now let's do the same thing, but for all the files.

Use a generator expression to filter down to just csv's that match the simple condition of having the correct naming style. I try to use lazy generators instead of lists wherever possible. In this case the list is so small that it really doesn't matter, but it's a good habit.

import os
csvs = (f for f in os.listdir('data') if f.startswith('Cyclemeter')
        and f.endswith('.csv'))

I see a potential problem: We'll potentailly want to concatenate each csv together into a single DataFrame. However we'll want to retain some idea of which specific ride an observation came from. So let's create a ride_id variable, which will just be an integar ranging from \(0 \ldots N\), where \(N\) is the number of rides.

Make a simple helper function to do this, and apply it to each csv.

def read_ride(path_, i):
    """
    read in csv at path, and assign the `ride_id` variable to i.
    """
    date_cols = ["Time", "Ride Time", "Stopped Time", "Pace", "Average Pace"]

    df = pd.read_csv(path_, parse_dates=date_cols)
    df['ride_id'] = i
    return df

dfs = (read_ride(os.path.join('data', csv), i)
       for (i, csv) in enumerate(csvs))

Now concatenate together. The original indicies are meaningless, so we'll ignore them in the concat.

df = pd.concat(dfs, ignore_index=True)
df.head()
Time Ride Time Ride Time (secs) Stopped Time Stopped Time (secs) Latitude Longitude Elevation (feet) Distance (miles) Speed (mph) Pace Pace (secs) Average Speed (mph) Average Pace Average Pace (secs) Ascent (feet) Descent (feet) Calories ride_id
0 2013-08-01 07:07:10 2016-03-12 00:00:01 1.1 2016-03-12 0.0 41.703753 -91.609892 963 0.00 2.88 2016-03-12 00:20:51 1251.0 0.00 2016-03-12 00:00:00 0 0 0 0 0
1 2013-08-01 07:07:17 2016-03-12 00:00:08 8.2 2016-03-12 0.0 41.703825 -91.609835 852 0.01 2.88 2016-03-12 00:20:51 1251.0 2.56 2016-03-12 00:23:27 1407 0 129 0 0
2 2013-08-01 07:07:22 2016-03-12 00:00:13 13.2 2016-03-12 0.0 41.703858 -91.609814 789 0.01 2.88 2016-03-12 00:20:51 1251.0 2.27 2016-03-12 00:26:27 1587 0 173 0 0
3 2013-08-01 07:07:27 2016-03-12 00:00:18 18.2 2016-03-12 0.0 41.703943 -91.610090 787 0.02 6.60 2016-03-12 00:09:06 546.0 4.70 2016-03-12 00:12:47 767 0 173 1 0
4 2013-08-01 07:07:40 2016-03-12 00:00:31 31.2 2016-03-12 0.0 41.704381 -91.610258 788 0.06 9.50 2016-03-12 00:06:19 379.0 6.37 2016-03-12 00:09:26 566 0 173 2 0

Great! The data itself is clean enough that we didn't have to do too much munging.

Let's persist the merged DataFrame. Writing it out to a csv would be fine, but I like to use pandas' HDF5 integration (via pytables) for personal projects.

df.to_hdf('data/cycle_store.h5', key='merged',
          format='table')

I used the table format in case we want to do some querying on the HDFStore itself, but we'll save that for next time.

That's it for this post. Next time, we'll do some exploratry data analysis on the data.