Archive

Tidy Data in Action

Hadley Whickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis. I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren't language specific.

A tidy dataset must satisfy three criteria (page 4 in Whickham's paper):

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

In this StackOverflow post, the asker had some data NBA games, and wanted to know the number of days since a team last played. Here's the example data:

import datetime

import pandas as pd

df = pd.DataFrame({'HomeTeam': ['HOU', 'CHI', 'DAL', 'HOU'],
                   'AwayTeam' : ['CHI', 'DAL', 'CHI', 'DAL'],
                   'HomeGameNum': [1, 2, 2, 2],
                   'AwayGameNum' : [1, 1, 3, 3],
                   'Date' : [datetime.date(2014,3,11), datetime.date(2014,3,12),
                             datetime.date(2014,3,14), datetime.date(2014,3,15)]})
df
AwayGameNum AwayTeam Date HomeGameNum HomeTeam
0 1 CHI 2014-03-11 1 HOU
1 1 DAL 2014-03-12 2 CHI
2 3 CHI 2014-03-14 2 DAL
3 3 DAL 2014-03-15 2 HOU

4 rows × 5 columns

I want to focus on the second of the three criteria: Each observation forms a row. Realize that the structure your dataset should take reflects the question you're trying to answer. For the SO question, we want to answer "How many days has it been since this team's last game?" Given this context what is an observation?


We'll define an observation as a team playing on a day. Does the original dataset in df satisfy the criteria for tidy data? No, it doesn't since each row contains 2 observations, one for the home team and one for the away team.

Let's tidy up the dataset.

  • I repeat each row (once for each team) and drop the game numbers (I don't need them for this example)
  • Select just the new rows (the one with odd indicies, % is the modulo operator in python)
  • Overwrite the value of Team for the new rows, keeping the existing value for the old rows
  • rename the HomeTeam column to is_home and make it a boolen column (True when the team is home)
s = df[['Date', 'HomeTeam', 'AwayTeam']].reindex_axis(df.index.repeat(2)).reset_index(drop=True)
s = s.rename(columns={'AwayTeam': 'Team'})

new = s[(s.index % 2).astype(bool)]

s.loc[new.index, 'Team'] = new.loc[:, 'HomeTeam']

s = s.rename(columns={'HomeTeam': 'is_home'})
s['is_home'] = s['Team'] == s['is_home']
s
Date is_home Team
0 2014-03-11 False CHI
1 2014-03-11 True HOU
2 2014-03-12 False DAL
3 2014-03-12 True CHI
4 2014-03-14 False CHI
5 2014-03-14 True DAL
6 2014-03-15 False DAL
7 2014-03-15 True HOU

8 rows × 3 columns

Now that we have a 1:1 correspondance between rows and observations, answering the question is simple.

We'll just group by each team and find the difference between each consecutive Date for that team. Then subtract one day so that back to back games reflect 0 days of rest.

s['rest'] = s.groupby('Team')['Date'].diff() - datetime.timedelta(1)
s
Date is_home Team rest
0 2014-03-11 False CHI NaT
1 2014-03-11 True HOU NaT
2 2014-03-12 False DAL NaT
3 2014-03-12 True CHI 0 days
4 2014-03-14 False CHI 1 days
5 2014-03-14 True DAL 1 days
6 2014-03-15 False DAL 0 days
7 2014-03-15 True HOU 3 days

8 rows × 4 columns

I planned on comparing that one line solution to the code needed with the messy data. But honestly, I'm having trouble writing the messy data version. You don't really have anything to group on, so you'd need to keep track of the row where you last saw this team (either in AwayTeam or HomeTeam). And then each row will have two answers, one for each team. It's certainly possible to write the necessary code, but the fact that I'm struggling so much to write the messy version is pretty good evidence for the importance of tidy data.