Tidy Data in Action

Hadley Whickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis. I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren't language specific.

A tidy dataset must satisfy three criteria1:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

In this StackOverflow post, the asker had some data NBA games, and wanted to know the number of days since a team last played. Here's the example data:

import datetime

import pandas as pd

df = pd.DataFrame({'HomeTeam': ['HOU', 'CHI', 'DAL', 'HOU'],
                   'AwayTeam' : ['CHI', 'DAL', 'CHI', 'DAL'],
                   'HomeGameNum': [1, 2, 2, 2],
                   'AwayGameNum' : [1, 1, 3, 3],
                   'Date' : [datetime.date(2014,3,11), datetime.date(2014,3,12),
                             datetime.date(2014,3,14), datetime.date(2014,3,15)]})
AwayGameNum AwayTeam Date HomeGameNum HomeTeam
0 1 CHI 2014-03-11 1 HOU
1 1 DAL 2014-03-12 2 CHI
2 3 CHI 2014-03-14 2 DAL
3 3 DAL 2014-03-15 2 HOU

I want to focus on the second of the three criteria: Each observation forms a row. Realize that the structure your dataset should take reflects the question you're trying to answer. For the SO question, we want to answer "How many days has it been since this team's last game?" Given this context what is an observation?

We'll define an observation as a team playing on a day. Does the original dataset in df satisfy the criteria for tidy data? No, it doesn't since each row contains 2 observations, one for the home team and one for the away team.

Let's tidy up the dataset, using the melt function taken from R's reshape package (also via Hadley Wickham).

df['Date'] = pd.to_datetime(df.Date)
tidy = (df.reset_index()
          .pipe(pd.melt, id_vars=['game_number', 'Date'],
                value_vars=['AwayTeam', 'HomeTeam'])
game_number Date variable value
0 0 2014-03-11 AwayTeam CHI
4 0 2014-03-11 HomeTeam HOU
1 1 2014-03-12 AwayTeam DAL
5 1 2014-03-12 HomeTeam CHI
2 2 2014-03-14 AwayTeam CHI
6 2 2014-03-14 HomeTeam DAL
7 3 2014-03-15 HomeTeam HOU

Now we group, and apply a lambda function that calculates the days of rest between games.

tidy.groupby('value').Date.apply(lambda x: x.diff() - pd.Timedelta(days=1)).to_frame(name='rest')
0 NaT
4 NaT
1 NaT
5 0 days
2 1 days
6 1 days
3 0 days
7 3 days

I planned on comparing that one line solution to the code needed with the messy data. But honestly, I'm having trouble writing the messy data version. You don't really have anything to group on, so you'd need to keep track of the row where you last saw this team (either in AwayTeam or HomeTeam). And then each row will have two answers, one for each team. It's certainly possible to write the necessary code, but the fact that I'm struggling so much to write the messy version is pretty good evidence for the importance of tidy data.

  1. page 4 in Whickham's paper