This is part 7 in my series on writing modern idiomatic pandas.
This post is available as a Jupyter notebook
Pandas started out in the financial world, so naturally it has strong time series support. The first half of this post will look at pandas' capabilities for manipulating time series data. The second half will discuss modeling time series data with statsmodels.
%matplotlib inline

import numpy as np
import pandas as pd
import pandas_datareader.data as web
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style='ticks', context='talk')
Let's grab some stock data for Goldman Sachs using the pandas-datareader package, which spun off of pandas:
gs = web.DataReader("GS", data_source='yahoo', start='2006-01-01', end='2010-01-01')
gs.head()
Open  High  Low  Close  Volume  Adj Close  

Date  
2006-01-03  126.699997  129.440002  124.230003  128.869995  6188700  114.660688 
2006-01-04  127.349998  128.910004  126.379997  127.089996  4861600  113.076954 
2006-01-05  126.000000  127.320000  125.610001  127.040001  3717400  113.032471 
2006-01-06  127.290001  129.250000  127.290001  128.839996  4319600  114.633997 
2006-01-09  128.500000  130.619995  128.000000  130.389999  4723500  116.013096 
There isn't a special data container just for time series in pandas; they're just Series or DataFrames with a DatetimeIndex.
That said, DataFrames and Series with a DatetimeIndex do gain some special behaviors and additional methods.
Special Slicing
Looking at the elements of gs.index, we see that DatetimeIndexes are made up of pandas.Timestamps:
gs.index[0]
Timestamp('2006-01-03 00:00:00')
A Timestamp is mostly compatible with the built-in datetime.datetime class, but much more amenable to storage in arrays.
Working with Timestamps can be awkward, so Series and DataFrames with DatetimeIndexes have some special slicing rules.
The first special case is partial-string indexing. Say we wanted to select all the days in 2006. Even with Timestamp's convenient constructors, it's a pain.
gs.loc[pd.Timestamp('2006-01-01'):pd.Timestamp('2006-12-31')].head()
Open  High  Low  Close  Volume  Adj Close  

Date  
2006-01-03  126.699997  129.440002  124.230003  128.869995  6188700  114.660688 
2006-01-04  127.349998  128.910004  126.379997  127.089996  4861600  113.076954 
2006-01-05  126.000000  127.320000  125.610001  127.040001  3717400  113.032471 
2006-01-06  127.290001  129.250000  127.290001  128.839996  4319600  114.633997 
2006-01-09  128.500000  130.619995  128.000000  130.389999  4723500  116.013096 
Thanks to partial-string indexing, it's as simple as
gs.loc['2006'].head()
Open  High  Low  Close  Volume  Adj Close  

Date  
2006-01-03  126.699997  129.440002  124.230003  128.869995  6188700  114.660688 
2006-01-04  127.349998  128.910004  126.379997  127.089996  4861600  113.076954 
2006-01-05  126.000000  127.320000  125.610001  127.040001  3717400  113.032471 
2006-01-06  127.290001  129.250000  127.290001  128.839996  4319600  114.633997 
2006-01-09  128.500000  130.619995  128.000000  130.389999  4723500  116.013096 
Since label slicing is inclusive, this slice selects any observation where the year is 2006 (and partial-string indexing isn't limited to just years).
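For example, partial-string indexing works at finer resolutions too; selecting a single month, or a range of months, is just as easy:
gs.loc['2006-01'].head()           # just January 2006
gs.loc['2006-01':'2006-03'].head() # January through March 2006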
The second "convenience" is __getitem__
(squarebracket) fallback indexing. I'm only going to mention it here, with the caveat that you should never use it.
DataFrame __getitem__
typically looks in the column: gs['2006']
would search gs.columns
for '2006'
, not find it, and raise a KeyError
. But DataFrames with a DatetimeIndex
catch that KeyError
and try to slice the index.
If it succeeds in slicing the index, the result like gs.loc['2006']
is returned.
If it fails, the KeyError
is reraised.
This is confusing because in pretty much every other case^{1} DataFrame.__getitem__
works on columns, and it's fragile because if you happened to have a column '2006'
you would get just that column, and no fallback indexing would occur. Just use gs.loc['2006']
when slicing DataFrame indexes.
Special Methods
Resampling
Resampling is similar to a groupby: you split the time series into groups (5-day buckets below), apply a function to each group (mean), and combine the result (one row per group).
gs.resample("5d").mean().head()
Open  High  Low  Close  Volume  Adj Close  

Date  
2006-01-03  126.834999  128.730002  125.877501  127.959997  4771825  113.851027 
2006-01-08  130.349998  132.645000  130.205002  131.660000  4664300  117.143065 
2006-01-13  131.510002  133.395005  131.244995  132.924995  3258250  118.268581 
2006-01-18  132.210002  133.853333  131.656667  132.543335  4997766  118.001965 
2006-01-23  133.771997  136.083997  133.310001  135.153998  3968500  120.476883 
gs.resample("W").agg(['mean', 'sum']).head()
Open  High  Low  Close  Volume  Adj Close  

mean  sum  mean  sum  mean  sum  mean  sum  mean  sum  mean  sum  
Date  
2006-01-08  126.834999  507.339996  128.730002  514.920006  125.877501  503.510002  127.959997  511.839988  4771825  19087300  113.851027  455.404110 
2006-01-15  130.684000  653.419998  132.848001  664.240006  130.544000  652.720001  131.979999  659.899994  4310420  21552100  117.427781  587.138903 
2006-01-22  131.907501  527.630005  133.672501  534.690003  131.389999  525.559998  132.555000  530.220000  4653725  18614900  117.994103  471.976414 
2006-01-29  133.771997  668.859986  136.083997  680.419983  133.310001  666.550003  135.153998  675.769989  3968500  19842500  120.476883  602.384416 
2006-02-05  140.900000  704.500000  142.467999  712.339996  139.937998  699.689988  141.618002  708.090011  3920120  19600600  126.238926  631.194630 
You can upsample to convert to a higher frequency. The new points are filled with NaNs.
gs.resample("6H").mean().head()
Open  High  Low  Close  Volume  Adj Close  

Date  
2006-01-03 00:00:00  126.699997  129.440002  124.230003  128.869995  6188700.0  114.660688 
2006-01-03 06:00:00  NaN  NaN  NaN  NaN  NaN  NaN 
2006-01-03 12:00:00  NaN  NaN  NaN  NaN  NaN  NaN 
2006-01-03 18:00:00  NaN  NaN  NaN  NaN  NaN  NaN 
2006-01-04 00:00:00  127.349998  128.910004  126.379997  127.089996  4861600.0  113.076954 
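If you'd rather not have the missing values, you can fill them as part of the resample; both of these are standard resample / Series methods:
gs.resample("6H").ffill().head()               # carry the last observation forward
gs.resample("6H").mean().interpolate().head()  # or interpolate between observations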
Rolling / Expanding / EW
These methods aren't unique to DatetimeIndexes, but they often make sense with time series, so I'll show them here.
gs.Close.plot(label='Raw')
gs.Close.rolling(28).mean().plot(label='28D MA')
gs.Close.expanding().mean().plot(label='Expanding')
gs.Close.ewm(alpha=0.03).mean().plot(label='EWMA($\\alpha=.03$)')
plt.legend(bbox_to_anchor=(1.25, .5))
plt.tight_layout()
sns.despine()
Each of .rolling, .expanding, and .ewm returns a deferred object, similar to a GroupBy.
roll = gs.Close.rolling(30, center=True)
roll
Rolling [window=30,center=True,axis=0]
m = roll.agg(['mean', 'std'])
ax = m['mean'].plot()
ax.fill_between(m.index, m['mean'] - m['std'], m['mean'] + m['std'], alpha=.25)
plt.tight_layout()
sns.despine()
Grab Bag
Offsets
These are similar to dateutil.relativedelta, but they work with arrays.
gs.index + pd.DateOffset(months=3, days=-2)
DatetimeIndex(['2006-04-01', '2006-04-02', '2006-04-03', '2006-04-04', '2006-04-07', '2006-04-08', '2006-04-09', '2006-04-10', '2006-04-11', '2006-04-15', ... '2010-03-15', '2010-03-16', '2010-03-19', '2010-03-20', '2010-03-21', '2010-03-22', '2010-03-26', '2010-03-27', '2010-03-28', '2010-03-29'], dtype='datetime64[ns]', name='Date', length=1007, freq=None)
Holiday Calendars
There are a whole bunch of special calendars, useful for traders probably.
from pandas.tseries.holiday import USColumbusDay

USColumbusDay.dates('2015-01-01', '2020-01-01')
DatetimeIndex(['2015-10-12', '2016-10-10', '2017-10-09', '2018-10-08', '2019-10-14'], dtype='datetime64[ns]', freq='WOM-2MON')
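They compose with pandas' date offsets too. As a quick sketch, here's a custom business-day offset built from one of those calendars:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

# business days that skip US federal holidays
bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
pd.date_range('2015-12-31', periods=5, freq=bday_us)  # skips New Year's Day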
Timezones
Pandas works with pytz for nice timezone-aware datetimes.
The typical workflow is
1. localize timezone-naive timestamps to some timezone
2. convert to the desired timezone
If you already have timezone-aware Timestamps, there's no need for step one.
# tz-naive -> tz-aware -> convert to the desired timezone (UTC)
gs.tz_localize('US/Eastern').tz_convert('UTC').head()
Open  High  Low  Close  Volume  Adj Close  

Date  
2006-01-03 05:00:00+00:00  126.699997  129.440002  124.230003  128.869995  6188700  114.660688 
2006-01-04 05:00:00+00:00  127.349998  128.910004  126.379997  127.089996  4861600  113.076954 
2006-01-05 05:00:00+00:00  126.000000  127.320000  125.610001  127.040001  3717400  113.032471 
2006-01-06 05:00:00+00:00  127.290001  129.250000  127.290001  128.839996  4319600  114.633997 
2006-01-09 05:00:00+00:00  128.500000  130.619995  128.000000  130.389999  4723500  116.013096 
Modeling Time Series
The rest of this post will focus on time series in the econometric sense. My intended reader for this section isn't all that clear, so I apologize upfront for any sudden shifts in complexity. I'm roughly targeting material that could be presented in a first or second semester applied statistics course, but with more handwaving and less formality. I've put a whole bunch of resources at the end for people eager to learn more.
We'll focus on modeling Average Monthly Flights. If you've been following along in the series, you've seen most of this code for downloading the data before, so feel free to skip this next block.
import os
import io
import glob
import zipfile

import requests
import statsmodels.api as sm


def download_one(date):
    '''
    Download a single month's flights
    '''
    month = date.month
    year = date.year
    month_name = date.strftime('%B')
    headers = {
        'Pragma': 'no-cache',
        'Origin': 'http://www.transtats.bts.gov',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-US,en;q=0.8',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Cache-Control': 'no-cache',
        'Referer': 'http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time',
        'Connection': 'keep-alive',
        'DNT': '1',
    }
    os.makedirs('timeseries', exist_ok=True)
    # long URL truncated, check the notebook
    data = 'UserTableName=On_Time_Performance&DBShortName=On_Time&RawDataTable=T_ONTIME&sqlstr=+SELECT+'
    r = requests.post('http://www.transtats.bts.gov/DownLoad_Table.asp?Table_ID=236&Has_Group=3&Is_Zipped=0',
                      headers=headers,
                      data=data.format(year=year, month=month, month_name=month_name),
                      stream=True)
    fp = os.path.join('timeseries', '{}{}.zip'.format(year, month))
    with open(fp, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
    return fp


def download_many(start, end):
    months = pd.date_range(start, end=end, freq='M')
    # We could easily parallelize this loop.
    for i, month in enumerate(months):
        download_one(month)


def unzip_one(fp):
    zf = zipfile.ZipFile(fp)
    csv = zf.extract(zf.filelist[0])
    return csv


def time_to_datetime(df, columns):
    '''
    Combine all time items into datetimes.

    2014-01-01,1149.0 -> 2014-01-01T11:49:00
    '''
    def converter(col):
        timepart = (col.astype(str)
                       .str.replace('\.0$', '')  # NaNs force float dtype
                       .str.pad(4, fillchar='0'))
        datetime_part = pd.to_datetime(df['fl_date'] + ' ' +
                                       timepart.str.slice(0, 2) + ':' +
                                       timepart.str.slice(2, 4),
                                       errors='coerce')
        return datetime_part

    df[columns] = df[columns].apply(converter)
    return df


def read_one(fp):
    df = (pd.read_csv(fp, encoding='latin1')
            .rename(columns=str.lower)
            .drop('unnamed: 21', axis=1)
            .pipe(time_to_datetime, ['dep_time', 'arr_time',
                                     'crs_arr_time', 'crs_dep_time'])
            .assign(fl_date=lambda x: pd.to_datetime(x['fl_date'])))
    return df
download_many('2000-01-01', '2016-01-01')

zips = glob.glob(os.path.join('timeseries', '*.zip'))
csvs = [unzip_one(fp) for fp in zips]
dfs = [read_one(fp) for fp in csvs]
df = pd.concat(dfs, ignore_index=True)

cat_cols = ['unique_carrier', 'carrier', 'tail_num', 'origin', 'dest']
df[cat_cols] = df[cat_cols].apply(pd.Categorical)
df.to_hdf('ts.hdf5', 'ts', format='table')
df = pd.read_hdf('ts.hdf5', 'ts')
We can calculate the historical values with a resample.
daily = df.fl_date.value_counts().sort_index()
y = daily.resample('MS').mean()
y.head()
2000-01-01    1882.387097
2000-02-01    1926.896552
2000-03-01    1951.000000
2000-04-01    1944.400000
2000-05-01    1957.967742
Freq: MS, Name: fl_date, dtype: float64
Note that I use the "MS" frequency code there. Pandas defaults to end of month (or end of year). Append an 'S' to get the start.
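To see the difference with the daily counts from above:
daily.resample('M').mean().head()   # labels on month ends:   2000-01-31, 2000-02-29, ...
daily.resample('MS').mean().head()  # labels on month starts: 2000-01-01, 2000-02-01, ...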
ax = y.plot()
ax.set(ylabel='Average Monthly Flights')
sns.despine()
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
One note of warning: I'm using the development version of statsmodels (commit de15ec8 to be precise). Not all of the items I've shown here are available in the currently-released version.
Think back to a typical regression problem, ignoring anything to do with time series for now. The usual task is to predict some value \(y\) using a linear combination of features in \(X\).
When working with time series, some of the most important (and sometimes only) features are the previous, or lagged, values of \(y\).
Let's start by trying just that "manually": running a regression of y on lagged values of itself.
We'll see that this regression suffers from a few problems: multicollinearity, autocorrelation, non-stationarity, and seasonality.
I'll explain what each of those are in turn and why they're problems.
Afterwards, we'll use a second model, seasonal ARIMA, which handles those problems for us.
First, let's create a dataframe with our lagged values of y using the .shift method, which shifts the index i periods, so it lines up with that observation.
X = (pd.concat([y.shift(i) for i in range(6)], axis=1,
               keys=['y'] + ['L%s' % i for i in range(1, 6)])
       .dropna())
X.head()
y  L1  L2  L3  L4  L5  

2000-06-01  1976.133333  1957.967742  1944.400000  1951.000000  1926.896552  1882.387097 
2000-07-01  1937.032258  1976.133333  1957.967742  1944.400000  1951.000000  1926.896552 
2000-08-01  1960.354839  1937.032258  1976.133333  1957.967742  1944.400000  1951.000000 
2000-09-01  1900.533333  1960.354839  1937.032258  1976.133333  1957.967742  1944.400000 
2000-10-01  1931.677419  1900.533333  1960.354839  1937.032258  1976.133333  1957.967742 
We can fit the lagged model using statsmodels (which uses patsy to translate the formula string to a design matrix).
mod_lagged = smf.ols('y ~ trend + L1 + L2 + L3 + L4 + L5',
                     data=X.assign(trend=np.arange(len(X))))
res_lagged = mod_lagged.fit()
res_lagged.summary()
Dep. Variable:  y  Rsquared:  0.881 

Model:  OLS  Adj. Rsquared:  0.877 
Method:  Least Squares  Fstatistic:  221.7 
Date:  Fri, 13 May 2016  Prob (Fstatistic):  2.40e80 
Time:  16:14:16  LogLikelihood:  1076.6 
No. Observations:  187  AIC:  2167. 
Df Residuals:  180  BIC:  2190. 
Df Model:  6  
Covariance Type:  nonrobust 
coef  std err  t  P>t  [0.025  0.975]  

Intercept  208.2440  65.495  3.180  0.002  79.008  337.480 
trend  0.1123  0.106  1.055  0.293  0.322  0.098 
L1  1.0489  0.075  14.052  0.000  0.902  1.196 
L2  0.0001  0.108  0.001  0.999  0.213  0.213 
L3  0.1450  0.108  1.346  0.180  0.358  0.068 
L4  0.0393  0.109  0.361  0.719  0.254  0.175 
L5  0.0506  0.074  0.682  0.496  0.096  0.197 
Omnibus:  55.872  DurbinWatson:  2.009 

Prob(Omnibus):  0.000  JarqueBera (JB):  322.488 
Skew:  0.956  Prob(JB):  9.39e71 
Kurtosis:  9.142  Cond. No.  5.97e+04 
There are a few problems with this approach though. Since our lagged values are highly correlated with each other, our regression suffers from multicollinearity. That ruins our estimates of the slopes.
sns.heatmap(X.corr())
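One common way to put a number on this, rather than eyeballing the heatmap, is the variance inflation factor, which statsmodels implements. A quick sketch (the names lags, exog, and vifs are just illustrative):
from statsmodels.stats.outliers_influence import variance_inflation_factor

lags = X.drop('y', axis=1)
exog = sm.add_constant(lags)
vifs = pd.Series([variance_inflation_factor(exog.values, i)
                  for i in range(1, exog.shape[1])],
                 index=lags.columns)
vifs  # values far above ~10 indicate problematic collinearity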
Second, we'd intuitively expect the \(\beta_i\)s to gradually decline to zero. The immediately preceding period should be most important (\(\beta_1\) is the largest coefficient in absolute value), followed by \(\beta_2\), and \(\beta_3\)... Looking at the regression summary and the bar graph below, this isn't the case (the cause is related to multicollinearity).
res_lagged.params.drop(['Intercept', 'trend']).plot.bar(rot=0)
plt.ylabel('Coefficient')
sns.despine()
Finally, our degrees of freedom drop since we lose two for each variable (one for estimating the coefficient, one for the lost observation as a result of the shift).
At least in (macro)econometrics, each observation is precious and we're loath to throw them away, though sometimes that's unavoidable.
Autocorrelation
Another problem our lagged model suffered from is autocorrelation (also known as serial correlation).
Roughly speaking, autocorrelation is when there's a clear pattern in the residuals of your regression (the observed minus the predicted).
Let's fit a simple model of \(y = \beta_0 + \beta_1 T + \epsilon\), where \(T\) is the time trend (np.arange(len(y))).
# `Results.resid` is a Series of residuals: y - ŷ
mod_trend = sm.OLS.from_formula(
    'y ~ trend', data=y.to_frame(name='y')
                       .assign(trend=np.arange(len(y))))
res_trend = mod_trend.fit()
Residuals (the observed minus the expected, or \(\hat{e_t} = y_t - \hat{y_t}\)) are supposed to be white noise. That's one of the assumptions many of the properties of linear regression are founded upon. In this case there's a correlation between one residual and the next: if the residual at time \(t\) was above expectation, then the residual at time \(t + 1\) is much more likely to be above average as well (\(e_t > 0 \implies E_t[e_{t+1}] > 0\)).
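statsmodels can also quantify this. As a quick check (not a replacement for the plots below): the Durbin-Watson statistic should be near 2 for uncorrelated residuals, and the Ljung-Box test gives p-values for the null hypothesis of no autocorrelation.
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_ljungbox

durbin_watson(res_trend.resid)            # values near 0 indicate positive autocorrelation
acorr_ljungbox(res_trend.resid, lags=12)  # small p-values reject "no autocorrelation"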
We'll define a helper function to plot the residuals time series, and some diagnostics about them.
def tsplot(y, lags=None, figsize=(10, 8)):
    fig = plt.figure(figsize=figsize)
    layout = (2, 2)
    ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
    acf_ax = plt.subplot2grid(layout, (1, 0))
    pacf_ax = plt.subplot2grid(layout, (1, 1))

    y.plot(ax=ts_ax)
    smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
    smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
    [ax.set_xlim(1.5) for ax in [acf_ax, pacf_ax]]
    sns.despine()
    plt.tight_layout()
    return ts_ax, acf_ax, pacf_ax
Calling it on the residuals from the linear trend:
tsplot(res_trend.resid, lags=36)
The top subplot shows the time series of our residuals \(e_t\), which should be white noise (but it isn't). The bottom-left shows the autocorrelation of the residuals as a correlogram. It measures the correlation between a value and its lagged self, e.g. \(corr(e_t, e_{t-1}), corr(e_t, e_{t-2}), \ldots\). The partial autocorrelation plot in the bottom-right shows a similar concept. It's partial in the sense that the value for \(corr(e_t, e_{t-k})\) is the correlation between those two periods, after controlling for the values at all shorter lags.
Autocorrelation is a problem in regular regressions like above, but we'll use it to our advantage when we set up an ARIMA model below. The basic idea is pretty sensible: if your regression residuals have a clear pattern, then there's clearly some structure in the data that you aren't taking advantage of. If a positive residual today means you'll likely have a positive residual tomorrow, why not incorporate that information into your forecast, and raise your forecasted value for tomorrow? That's pretty much what ARIMA does.
Stationarity
It's important that your dataset be stationary, otherwise you run the risk of finding spurious correlations. A common example is the relationship between number of TVs per person and life expectancy. It's not likely that there's an actual causal relationship there. Rather, there could be a third variable that's driving both (wealth, say). Granger and Newbold (1974) had some stern words for the econometrics literature on this.
We find it very curious that whereas virtually every textbook on econometric methodology contains explicit warnings of the dangers of autocorrelated errors, this phenomenon crops up so frequently in wellrespected applied work.
(:fire:), but in that academic passive-aggressive way.
The typical way to handle non-stationarity is to difference the non-stationary variable until it is stationary.
y.to_frame(name='y').assign(Δy=lambda x: x.y.diff()).plot(subplots=True)
sns.despine()
Our original series actually doesn't look that bad. It doesn't look like nominal GDP, say, where there's a clearly rising trend. But we have more rigorous methods for detecting whether a series is non-stationary than simply plotting and squinting at it. One popular method is the Augmented Dickey-Fuller test. It's a statistical hypothesis test that roughly says:
- \(H_0\) (null hypothesis): \(y\) is non-stationary, needs to be differenced
- \(H_A\) (alternative hypothesis): \(y\) is stationary, doesn't need to be differenced
I don't want to get into the weeds on exactly what the test statistic is, and what the distribution looks like.
This is implemented in statsmodels as smt.adfuller. The return type is a bit busy for me, so we'll wrap it in a namedtuple.
from collections import namedtuple

ADF = namedtuple("ADF", "adf pvalue usedlag nobs critical icbest")
ADF(*smt.adfuller(y))._asdict()  # for pretty-printing
OrderedDict([('adf', -1.9904608794641487), ('pvalue', 0.29077127047555601), ('usedlag', 15), ('nobs', 176), ('critical', {'1%': -3.4680615871598537, '10%': -2.5756015922004134, '5%': -2.8781061899535128}), ('icbest', 1987.6605732826176)])
So we failed to reject the null hypothesis that the original series was non-stationary. Let's difference it.
ADF(*smt.adfuller(y.diff().dropna()))._asdict()
OrderedDict([('adf', -3.5862361055645211), ('pvalue', 0.0060296818910968268), ('usedlag', 14), ('nobs', 176), ('critical', {'1%': -3.4680615871598537, '10%': -2.5756015922004134, '5%': -2.8781061899535128}), ('icbest', 1979.6445486427308)])
This looks better. We'll fit another OLS model of \(\Delta y = \beta_0 + \beta_1 L \Delta y_{t-1} + e_t\)
data = (y.to_frame(name='y')
         .assign(Δy=lambda df: df.y.diff())
         .assign(LΔy=lambda df: df.Δy.shift()))
mod_stationary = smf.ols('Δy ~ LΔy', data=data.dropna())
res_stationary = mod_stationary.fit()
tsplot(res_stationary.resid, lags=24)
This is better, but we still see a regular cycle in the residuals. The cause is seasonality.
Seasonality
We have strong monthly seasonality:
smt.seasonal_decompose(y).plot()
There are a few ways to handle seasonality (one simple alternative is sketched below).
We'll just rely on the SARIMAX method to do it for us.
For now, recognize that it's a problem to be solved.
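Just as a sketch of one alternative (not what we do below), you could add month-of-year dummies to a regression; patsy's C() handles the encoding:
mod_dummies = smf.ols('y ~ trend + C(month)',
                      data=y.to_frame(name='y')
                            .assign(trend=np.arange(len(y)),
                                    month=y.index.month))
res_dummies = mod_dummies.fit()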
ARIMA
So, we've sketched the problems with regular old regression: multicollinearity, autocorrelation, non-stationarity, and seasonality.
Our tool of choice, smt.SARIMAX, which stands for Seasonal ARIMA with eXogenous regressors, can handle all these.
We'll walk through the components in pieces.
ARIMA stands for AutoRegressive Integrated Moving Average. It's a relatively simple yet flexible way of modeling univariate time series. It's made up of three components, and is typically written as \(\mathrm{ARIMA}(p, d, q)\).
AutoRegressive
The idea is to predict a variable by a linear combination of its lagged values (autoregressive as in regressing a value on its past self). An AR(p), where \(p\) represents the number of lagged values used, is written as
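\[ y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_p y_{t-p} + e_t \]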
\(c\) is a constant and \(e_t\) is white noise. This looks a lot like a linear regression model with multiple predictors, but the predictors happen to be lagged values of \(y\) (though they are estimated differently).
Integrated
Integrated is like the opposite of differencing, and is the part that deals with stationarity. If you have to difference your dataset 1 time to get it stationary, then \(d=1\), and your series is integrated of order 1. We'll introduce one bit of notation for differencing: \(\Delta y_t = y_t  y_{t1}\) for \(d=1\).
Moving Average
MA models look somewhat similar to the AR component, but it's dealing with different values.
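\[ y_t = c + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \ldots + \theta_q e_{t-q} + e_t \]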
\(c\) again is a constant and \(e_t\) again is white noise. But now the coefficients are multiplying the residuals from previous predictions, not the actual (or differenced) values.
Combining
Putting that together, an ARIMA(1, 1, 1) process is written as
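\[ \Delta y_t = c + \phi_1 \Delta y_{t-1} + \theta_1 e_{t-1} + e_t \]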
Using lag notation, where \(L y_t = y_{t-1}\), i.e. y.shift() in pandas, we can rewrite that as
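\[ (1 - \phi_1 L)(1 - L) y_t = c + (1 + \theta_1 L) e_t \]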
That was for our specific \(\mathrm{ARIMA}(1, 1, 1)\) model. For the general \(\mathrm{ARIMA}(p, d, q)\), that becomes
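\[ (1 - \phi_1 L - \ldots - \phi_p L^p)(1 - L)^d y_t = c + (1 + \theta_1 L + \ldots + \theta_q L^q) e_t \]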
We went through that extremely quickly, so don't feel bad if things aren't clear. Fortunately, the model is pretty easy to use with statsmodels.
mod = smt.SARIMAX(y, trend='c', order=(1, 1, 1))
res = mod.fit()
tsplot(res.resid[2:], lags=24)
res.summary()
Dep. Variable:  fl_date  No. Observations:  192 

Model:  SARIMAX(1, 1, 1)  Log Likelihood  1104.663 
Date:  Fri, 13 May 2016  AIC  2217.326 
Time:  16:16:27  BIC  2230.356 
Sample:  01012000  HQIC  2222.603 
 12012015  
Covariance Type:  opg 
coef  std err  z  P>z  [0.025  0.975]  

intercept  0.7993  4.959  0.161  0.872  8.921  10.519 
ar.L1  0.3515  0.564  0.623  0.533  0.754  1.457 
ma.L1  0.2310  0.577  0.400  0.689  1.361  0.899 
sigma2  6181.2832  350.439  17.639  0.000  5494.435  6868.131 
LjungBox (Q):  209.30  JarqueBera (JB):  424.36 

Prob(Q):  0.00  Prob(JB):  0.00 
Heteroskedasticity (H):  0.86  Skew:  1.15 
Prob(H) (twosided):  0.54  Kurtosis:  9.93 
There's a bunch of output there with various tests, estimated parameters, and information criteria. Let's just say that things are looking better, but we still haven't accounted for seasonality.
A seasonal ARIMA model is written as \(\mathrm{ARIMA}(p,d,q)×(P,D,Q)_s\). Lowercase letters are for the nonseasonal component, just like before. Uppercase letters are a similar specification for the seasonal component, where \(s\) is the periodicity (4 for quarterly, 12 for monthly).
It's like we have two processes, one for the non-seasonal component and one for the seasonal component, and we multiply them together with regular algebra rules.
The general form of that looks like (quoting the statsmodels docs here)
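\[ \phi_p(L)\,\tilde{\phi}_P(L^s)\,\Delta^d \Delta_s^D y_t = A(t) + \theta_q(L)\,\tilde{\theta}_Q(L^s)\, e_t \]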
where
- \(\phi_p(L)\) is the non-seasonal autoregressive lag polynomial
- \(\tilde{\phi}_P(L^s)\) is the seasonal autoregressive lag polynomial
- \(\Delta^d\Delta_s^D\) is the time series, differenced \(d\) times, and seasonally differenced \(D\) times
- \(A(t)\) is the trend polynomial (including the intercept)
- \(\theta_q(L)\) is the non-seasonal moving average lag polynomial
- \(\tilde{\theta}_Q(L^s)\) is the seasonal moving average lag polynomial
I don't find that to be very clear, but maybe an example will help. We'll fit a seasonal ARIMA\((1, 1, 2)\times(0, 1, 2)_{12}\).
So the non-seasonal component is
- \(p=1\): one autoregressive term: use \(y_{t-1}\)
- \(d=1\): first-difference the data once
- \(q=2\): use the two previous non-seasonal residuals, \(e_{t-1}\) and \(e_{t-2}\), to forecast
And the seasonal component is
- \(P=0\): Don't use any previous seasonal values
- \(D=1\): Difference the series 12 periods back: y.diff(12)
- \(Q=2\): Use the two previous seasonal residuals
mod_seasonal = smt.SARIMAX(y, trend='c',
                           order=(1, 1, 2),
                           seasonal_order=(0, 1, 2, 12))
res_seasonal = mod_seasonal.fit()
res_seasonal.summary()
Dep. Variable:  fl_date  No. Observations:  192 

Model:  SARIMAX(1, 1, 2)x(0, 1, 2, 12)  Log Likelihood  992.148 
Date:  Fri, 13 May 2016  AIC  1998.297 
Time:  16:16:31  BIC  2021.099 
Sample:  01012000  HQIC  2007.532 
 12012015  
Covariance Type:  opg 
coef  std err  z  P>z  [0.025  0.975]  

intercept  0.7824  5.279  0.148  0.882  9.564  11.129 
ar.L1  0.9880  0.374  2.639  0.008  1.722  0.254 
ma.L1  0.9905  0.437  2.265  0.024  0.133  1.847 
ma.L2  0.0041  0.091  0.045  0.964  0.174  0.182 
ma.S.L12  0.7869  0.066  11.972  0.000  0.916  0.658 
ma.S.L24  0.2121  0.063  3.366  0.001  0.089  0.336 
sigma2  3645.3299  219.296  16.623  0.000  3215.518  4075.142 
LjungBox (Q):  47.28  JarqueBera (JB):  464.42 

Prob(Q):  0.20  Prob(JB):  0.00 
Heteroskedasticity (H):  0.29  Skew:  1.30 
Prob(H) (twosided):  0.00  Kurtosis:  10.45 
tsplot(res_seasonal.resid[12:], lags=24)
Things look much better now.
One thing I didn't really talk about is order selection. How to choose \(p, d, q, P, D\) and \(Q\).
R's forecast package does have a handy auto.arima function that does this for you.
Python / statsmodels don't have that at the moment.
The alternative seems to be experience (boo), intuition (boo), and good-old grid-search.
You can fit a bunch of models for a bunch of combinations of the parameters and use the AIC or BIC to choose the best.
Here is a useful reference, and this StackOverflow answer recommends a few options.
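A minimal grid-search sketch along those lines (the parameter ranges here are arbitrary, and fitting every combination can take a while):
import itertools

best_aic, best_order = np.inf, None
for p, q, P, Q in itertools.product(range(3), range(3), range(2), range(2)):
    try:
        res_ = smt.SARIMAX(y, trend='c', order=(p, 1, q),
                           seasonal_order=(P, 1, Q, 12)).fit(disp=False)
    except Exception:
        continue  # some combinations fail to converge or estimate
    if res_.aic < best_aic:
        best_aic, best_order = res_.aic, ((p, 1, q), (P, 1, Q, 12))
best_order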
Forecasting
Now that we fit that model, let's put it to use. First, we'll make a bunch of one-step-ahead forecasts. At each point (month), we take the history up to that point and make a forecast for the next month. So the forecast for January 2014 has available all the data up through December 2013.
pred = res_seasonal.get_prediction(start='2001-03-01')
pred_ci = pred.conf_int()
ax = y.plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='Forecast', alpha=.7)
ax.fill_between(pred_ci.index, pred_ci.iloc[:, 0], pred_ci.iloc[:, 1],
                color='k', alpha=.2)
plt.legend()
sns.despine()
There are a few places where the observed series slips outside the 95% confidence interval. The series seems especially unstable before 2005.
Alternatively, we can make dynamic forecasts as of some month (January 2013 in the example below). That means the forecast from that point forward only uses information available as of January 2013. The predictions are generated in a similar way: a bunch of one-step forecasts. Only instead of plugging in the actual values beyond January 2013, we plug in the forecast values.
pred_dy = res_seasonal.get_prediction(start='2002-03-01', dynamic='2013-01-01')
pred_dy_ci = pred_dy.conf_int()
ax = y.plot(label='observed')
pred_dy.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_dy_ci.index, pred_dy_ci.iloc[:, 0], pred_dy_ci.iloc[:, 1],
                color='k', alpha=.25)
ax.fill_betweenx(ax.get_ylim(), pd.Timestamp('2013-01-01'), y.index[-1],
                 alpha=.1, zorder=-1)
ax.annotate('Dynamic $\\longrightarrow$', (pd.Timestamp('2013-02-01'), 550))
plt.legend()
sns.despine()
Resources
This is a collection of links for those interested.
Time series modeling in Python
General Textbooks
- Forecasting: Principles and Practice: A great introduction
- Stock and Watson: Readable undergraduate resource, has a few chapters on time series
- Greene's Econometric Analysis: My favorite PhD-level textbook
- Hamilton's Time Series Analysis: A classic
- Lütkepohl's New Introduction to Multiple Time Series Analysis: Extremely dry, but useful if you're implementing this stuff
Conclusion
Congratulations if you made it this far; this piece just kept growing (and I still had to cut stuff).
The main thing cut was talking about how SARIMAX is implemented on top of statsmodels' statespace framework.
The statespace framework, developed mostly by Chad Fulton over the past couple years, is really nice.
You can pretty easily extend it with custom models, but still get all the benefits of the framework's estimation and results facilities.
I'd recommend reading the notebooks.
We also didn't get to talk at all about Skipper Seabold's work on VARs, but maybe some other time.
As always, feedback is welcome.

The only other one is boolean indexing. I think that one is fine since you're passing in an array of booleans, not a label. ↩