## Modern Pandas (Part 7): Time Series

This is part 7 in my series on writing modern idiomatic pandas.

This post is available as a Jupyter notebook.

Pandas started out in the financial world, so naturally it has strong time series support. The first half of this post will look at pandas' capabilities for manipulating time series data. The second half will discuss modeling time series data with statsmodels.

%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='ticks', context='talk')


Let's grab some stock data for Goldman Sachs using the pandas-datareader package, which spun off of pandas:

from pandas_datareader import data as web

gs = web.DataReader("GS", data_source='yahoo', start='2006-01-01',
                    end='2010-01-01')
gs.head()

Open High Low Close Volume Adj Close
Date
2006-01-03 126.699997 129.440002 124.230003 128.869995 6188700 114.660688
2006-01-04 127.349998 128.910004 126.379997 127.089996 4861600 113.076954
2006-01-05 126.000000 127.320000 125.610001 127.040001 3717400 113.032471
2006-01-06 127.290001 129.250000 127.290001 128.839996 4319600 114.633997
2006-01-09 128.500000 130.619995 128.000000 130.389999 4723500 116.013096

There isn't a special data container just for time series in pandas; they're just Series or DataFrames with a DatetimeIndex. That said, DataFrames and Series with a DatetimeIndex do gain some special behaviors and additional methods.
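For example, pd.date_range makes it easy to build a DatetimeIndex from scratch (the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Five daily observations on a DatetimeIndex
idx = pd.date_range('2006-01-01', periods=5, freq='D')
s = pd.Series(np.arange(5.0), index=idx)

s.index  # a DatetimeIndex, so the slicing and resampling below all apply
```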

## Special Slicing

Looking at the elements of gs.index, we see that DatetimeIndexes are made up of pandas.Timestamps:

gs.index[0]

Timestamp('2006-01-03 00:00:00')


A Timestamp is mostly compatible with the built-in datetime.datetime class, but much more amenable to storage in arrays.

Working with Timestamps can be awkward, so Series and DataFrames with DatetimeIndexes have some special slicing rules. The first special case is partial-string indexing. Say we wanted to select all the days in 2006. Even with Timestamp's convenient constructors, it's a pain.

gs.loc[pd.Timestamp('2006-01-01'):pd.Timestamp('2006-12-31')].head()

Open High Low Close Volume Adj Close
Date
2006-01-03 126.699997 129.440002 124.230003 128.869995 6188700 114.660688
2006-01-04 127.349998 128.910004 126.379997 127.089996 4861600 113.076954
2006-01-05 126.000000 127.320000 125.610001 127.040001 3717400 113.032471
2006-01-06 127.290001 129.250000 127.290001 128.839996 4319600 114.633997
2006-01-09 128.500000 130.619995 128.000000 130.389999 4723500 116.013096

Thanks to partial-string indexing, it's as simple as

gs.loc['2006'].head()

Open High Low Close Volume Adj Close
Date
2006-01-03 126.699997 129.440002 124.230003 128.869995 6188700 114.660688
2006-01-04 127.349998 128.910004 126.379997 127.089996 4861600 113.076954
2006-01-05 126.000000 127.320000 125.610001 127.040001 3717400 113.032471
2006-01-06 127.290001 129.250000 127.290001 128.839996 4319600 114.633997
2006-01-09 128.500000 130.619995 128.000000 130.389999 4723500 116.013096

Since label slicing is inclusive, this slice selects any observation where the year is 2006 (and partial-string indexing isn't limited to just years).
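Partial-string indexing goes down to finer resolutions too: month-level strings, or ranges of them, slice the same way. A quick sketch on a synthetic daily series (the data is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(365.0),
              index=pd.date_range('2006-01-01', periods=365, freq='D'))

feb = s.loc['2006-02']              # all 28 days of February 2006
febmar = s.loc['2006-02':'2006-03']  # February through March, inclusive
```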

The second "convenience" is __getitem__ (square-bracket) fallback indexing. I'm only going to mention it here, with the caveat that you should never use it. DataFrame __getitem__ typically looks in the columns: gs['2006'] would search gs.columns for '2006', not find it, and raise a KeyError. But DataFrames with a DatetimeIndex catch that KeyError and try to slice the index instead. If the slice succeeds, the result, the same as gs.loc['2006'], is returned; if it fails, the KeyError is re-raised. This is confusing because in pretty much every other case1 DataFrame.__getitem__ works on columns, and it's fragile because if you happened to have a column named '2006' you would get just that column, with no fallback indexing. Just use gs.loc['2006'] when slicing DataFrame indexes.

## Special Methods

### Resampling

Resampling is similar to a groupby: you split the time series into groups (5-day buckets below), apply a function to each group (mean), and combine the result (one row per group).

gs.resample("5d").mean().head()

Open High Low Close Volume Adj Close
Date
2006-01-03 126.834999 128.730002 125.877501 127.959997 4771825 113.851027
2006-01-08 130.349998 132.645000 130.205002 131.660000 4664300 117.143065
2006-01-13 131.510002 133.395005 131.244995 132.924995 3258250 118.268581
2006-01-18 132.210002 133.853333 131.656667 132.543335 4997766 118.001965
2006-01-23 133.771997 136.083997 133.310001 135.153998 3968500 120.476883

gs.resample("W").agg(['mean', 'sum']).head()

Open High Low Close Volume Adj Close
mean sum mean sum mean sum mean sum mean sum mean sum
Date
2006-01-08 126.834999 507.339996 128.730002 514.920006 125.877501 503.510002 127.959997 511.839988 4771825 19087300 113.851027 455.404110
2006-01-15 130.684000 653.419998 132.848001 664.240006 130.544000 652.720001 131.979999 659.899994 4310420 21552100 117.427781 587.138903
2006-01-22 131.907501 527.630005 133.672501 534.690003 131.389999 525.559998 132.555000 530.220000 4653725 18614900 117.994103 471.976414
2006-01-29 133.771997 668.859986 136.083997 680.419983 133.310001 666.550003 135.153998 675.769989 3968500 19842500 120.476883 602.384416
2006-02-05 140.900000 704.500000 142.467999 712.339996 139.937998 699.689988 141.618002 708.090011 3920120 19600600 126.238926 631.194630

You can up-sample to convert to a higher frequency. The new points are filled with NaNs.

gs.resample("6H").mean().head()

Open High Low Close Volume Adj Close
Date
2006-01-03 00:00:00 126.699997 129.440002 124.230003 128.869995 6188700.0 114.660688
2006-01-03 06:00:00 NaN NaN NaN NaN NaN NaN
2006-01-03 12:00:00 NaN NaN NaN NaN NaN NaN
2006-01-03 18:00:00 NaN NaN NaN NaN NaN NaN
2006-01-04 00:00:00 127.349998 128.910004 126.379997 127.089996 4861600.0 113.076954
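If NaNs aren't what you want, the resampler can fill the new points instead. A sketch on a tiny synthetic series (the values are made up):

```python
import pandas as pd

s = pd.Series([1.0, 2.0],
              index=pd.to_datetime(['2006-01-03', '2006-01-04']))

filled = s.resample('6h').ffill()        # carry the last observation forward
smooth = s.resample('6h').interpolate()  # or interpolate linearly
```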

### Rolling / Expanding / EW

These methods aren't unique to DatetimeIndexes, but they often make sense with time series, so I'll show them here.

gs.Close.plot(label='Raw')
gs.Close.rolling(28).mean().plot(label='28D MA')
gs.Close.expanding().mean().plot(label='Expanding')
gs.Close.ewm(alpha=0.03).mean().plot(label='EWMA($\\alpha=.03$)')

plt.legend(bbox_to_anchor=(1.25, .5))
plt.tight_layout()
sns.despine()


Each of .rolling, .expanding, and .ewm returns a deferred object, similar to a GroupBy.

roll = gs.Close.rolling(30, center=True)
roll

Rolling [window=30,center=True,axis=0]

m = roll.agg(['mean', 'std'])
ax = m['mean'].plot()
ax.fill_between(m.index, m['mean'] - m['std'], m['mean'] + m['std'],
alpha=.25)
plt.tight_layout()
sns.despine()


## Grab Bag

### Offsets

These are similar to dateutil.relativedelta, but they work with arrays.

gs.index + pd.DateOffset(months=3, days=-2)

DatetimeIndex(['2006-04-01', '2006-04-02', '2006-04-03', '2006-04-04',
'2006-04-07', '2006-04-08', '2006-04-09', '2006-04-10',
'2006-04-11', '2006-04-15',
...
'2010-03-15', '2010-03-16', '2010-03-19', '2010-03-20',
'2010-03-21', '2010-03-22', '2010-03-26', '2010-03-27',
'2010-03-28', '2010-03-29'],
dtype='datetime64[ns]', name='Date', length=1007, freq=None)
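Pandas also ships anchored offsets, like MonthEnd, BMonthEnd, and Week, which snap dates to calendar points rather than shifting by a fixed amount. A small sketch:

```python
import pandas as pd

dates = pd.to_datetime(['2006-01-03', '2006-01-17'])

# Roll forward to the last business day of the month
month_ends = dates + pd.offsets.BMonthEnd()
# Roll forward to the first day of the next month
month_starts = dates + pd.offsets.MonthBegin()
```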


### Holiday Calendars

There are a whole bunch of special calendars, useful for traders probably.

from pandas.tseries.holiday import USColumbusDay

USColumbusDay.dates('2015-01-01', '2020-01-01')

DatetimeIndex(['2015-10-12', '2016-10-10', '2017-10-09', '2018-10-08',
'2019-10-14'],
dtype='datetime64[ns]', freq='WOM-2MON')
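Calendars plug into offsets too: a CustomBusinessDay built from a holiday calendar skips both weekends and the calendar's holidays. A sketch with the built-in US federal calendar:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())

# July 4, 2015 fell on a Saturday and was observed Friday July 3,
# so one business day after Thursday July 2 is Monday July 6.
next_bday = pd.Timestamp('2015-07-02') + bday_us
```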


### Timezones

Pandas works with pytz for nice timezone-aware datetimes. The typical workflow is

1. localize timezone-naive timestamps to some timezone
2. convert to desired timezone

If you already have timezone-aware Timestamps, there's no need for step one.

# tz naive -> tz aware -> to desired UTC
gs.tz_localize('US/Eastern').tz_convert('UTC').head()

Open High Low Close Volume Adj Close
Date
2006-01-03 05:00:00+00:00 126.699997 129.440002 124.230003 128.869995 6188700 114.660688
2006-01-04 05:00:00+00:00 127.349998 128.910004 126.379997 127.089996 4861600 113.076954
2006-01-05 05:00:00+00:00 126.000000 127.320000 125.610001 127.040001 3717400 113.032471
2006-01-06 05:00:00+00:00 127.290001 129.250000 127.290001 128.839996 4319600 114.633997
2006-01-09 05:00:00+00:00 128.500000 130.619995 128.000000 130.389999 4723500 116.013096
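The same two-step workflow on a small self-contained index (the dates are made up):

```python
import pandas as pd

naive = pd.date_range('2006-01-03', periods=3, freq='D')  # no timezone attached

eastern = naive.tz_localize('US/Eastern')  # 1. localize to some timezone
utc = eastern.tz_convert('UTC')            # 2. convert to the desired one
```

In January, US/Eastern is UTC-5, so midnight Eastern becomes 05:00 UTC.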

# Modeling Time Series

The rest of this post will focus on time series in the econometric sense. My intended reader for this section isn't all that clear, so I apologize upfront for any sudden shifts in complexity. I'm roughly targeting material that could be presented in a first or second semester applied statistics course, but with more hand-waving and less formality. I've put a whole bunch of resources at the end for people eager to learn more.

We'll focus on modeling Average Monthly Flights. If you've been following along in the series, you've seen most of this code for downloading the data before, so feel free to skip this next block.

import os
import io
import glob
import zipfile

import requests
import statsmodels.api as sm

def download_timeseries(date):
    '''
    Download one month of on-time flight data from the BTS.

    NOTE: this cell was garbled in extraction; the function names and
    some plumbing here are reconstructed -- check the notebook.
    '''
    month = date.month
    year = date.year
    month_name = date.strftime('%B')
    headers = {
        'Pragma': 'no-cache',
        'Origin': 'http://www.transtats.bts.gov',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-US,en;q=0.8',
        'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/49.0.2623.87 Safari/537.36'),
        'Content-Type': 'application/x-www-form-urlencoded',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Cache-Control': 'no-cache',
        'Referer': 'http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time',
        'Connection': 'keep-alive',
        'DNT': '1',
    }
    os.makedirs('timeseries', exist_ok=True)
    # long URL truncated, check the notebook
    data = 'UserTableName=On_Time_Performance&DBShortName=On_Time&RawDataTable=T_ONTIME&sqlstr=+SELECT+'
    # endpoint reconstructed; the full query string is in the notebook
    r = requests.post('http://www.transtats.bts.gov/DownLoad_Table.asp',
                      headers=headers, data=data, stream=True)
    fp = os.path.join('timeseries', '{}-{}.zip'.format(year, month))

    with open(fp, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
    return fp

def download_many(start, end):
    months = pd.date_range(start, end=end, freq='M')
    # We could easily parallelize this loop.
    for i, month in enumerate(months):
        download_timeseries(month)

def unzip_one(fp):
    zf = zipfile.ZipFile(fp)
    csv = zf.extract(zf.filelist[0])
    return csv

def time_to_datetime(df, columns):
    '''
    Combine all time items into datetimes.

    2014-01-01,1149.0 -> 2014-01-01T11:49:00
    '''
    def converter(col):
        # body below converter's first line is reconstructed: pad times
        # like 1149.0 to '1149', then combine with the flight date
        timepart = (col.astype(str)
                       .str.replace(r'\.0$', '', regex=True)  # NaNs force float dtype
                       .str.pad(4, fillchar='0'))
        return pd.to_datetime(df['fl_date'] + ' ' +
                              timepart.str.slice(0, 2) + ':' +
                              timepart.str.slice(2, 4),
                              errors='coerce')
    df[columns] = df[columns].apply(converter)
    return df



## Resources

This is a collection of links for those interested.

# Conclusion

Congratulations if you made it this far; this piece just kept growing (and I still had to cut stuff). The main thing cut was a discussion of how SARIMAX is implemented on top of statsmodels' statespace framework. The statespace framework, developed mostly by Chad Fulton over the past couple of years, is really nice. You can pretty easily extend it with custom models, but still get all the benefits of the framework's estimation and results facilities. I'd recommend reading the notebooks. We also didn't get to talk at all about Skipper Seabold's work on VARs, but maybe some other time.

As always, feedback is welcome.

1. The only other one is boolean indexing. I think that one is fine since you're passing in an array of booleans, not a label.