My favorite feature of scikit-learn is its pipelines. These are a nice convenience for putting together a chain of operations from raw data to classifier. More importantly, they help prevent training data from leaking into your validation, so I use them whenever possible.
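As a quick illustration of the leakage point (a toy sketch with made-up data, not part of this post's workflow): because the scaler lives inside the pipeline, each cross-validation fold re-fits it on that fold's training portion only.

```python
import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, purely for illustration.
rng = np.random.RandomState(0)
X_toy, y_toy = rng.randn(100, 3), rng.randn(100)

# The StandardScaler is re-fit on the training portion of each fold,
# so the held-out fold never leaks into the scaling parameters.
pipe = make_pipeline(StandardScaler(), Lasso())
scores = cross_val_score(pipe, X_toy, y_toy, cv=5)
```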

Pandas somewhat recently added a `Categorical` dtype (*factors* if you're coming from R-land). These are great for representing data (often non-numeric) that comes from a *fixed, discrete* set of possibly ordered categories.
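
If you haven't used them before, here's a minimal example (the values are made up) showing the two pieces we'll lean on later: the integer `codes` and the `categories` they index into.

```python
import pandas as pd

# An ordered categorical with a fixed set of levels.
s = pd.Series(pd.Categorical(['low', 'high', 'low'],
                             categories=['low', 'medium', 'high'],
                             ordered=True))
s.cat.codes       # 0, 2, 0 -- the integer codes
s.cat.categories  # Index(['low', 'medium', 'high'], dtype='object')
```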

I frequently see questions on how to use `Categoricals` with scikit-learn. I sat down today to write a quick post on *a* way to accomplish this.

This post grew to be a bit longer than I intended so I'm assuming some familiarity with

- dummy variables
- Pandas Categoricals (`codes` vs. `categories`)
- Scikit-learn Estimator API
- Scikit-learn Pipelines

## The Data

We'll use the ggplot2 builtin `diamonds` dataset (via Vincent Arel-Bundock's great `Rdatasets` repository).

```python
import pandas as pd

url = ('https://github.com/vincentarelbundock/Rdatasets/raw/master/'
       'csv/ggplot2/diamonds.csv')
df = pd.read_csv(url, index_col=0)
df.head()
```

|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|-------|-----|-------|---------|-------|-------|-------|---|---|---|
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
| 3 | 0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
| 4 | 0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
| 5 | 0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |

## Preprocessing

I'm going to assume here that you know your categories ahead of time.

```python
cat_columns = ['cut', 'color', 'clarity']
cuts = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color = list("DEFGHIJ")
clarity = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]
```

And we'll use the `Categorical` constructor to convert the strings to categoricals. I'm not a gemologist, but I didn't make the `color` field ordered. It doesn't matter for our classifier either way, and we get to see a bit of variety in the code.

```python
df['cut'] = pd.Categorical(df.cut, categories=cuts, ordered=True)
df['color'] = pd.Categorical(df.color, categories=color)
df['clarity'] = pd.Categorical(df.clarity, categories=clarity, ordered=True)
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53940 entries, 1 to 53940
Data columns (total 10 columns):
carat      53940 non-null float64
cut        53940 non-null category
color      53940 non-null category
clarity    53940 non-null category
depth      53940 non-null float64
table      53940 non-null float64
price      53940 non-null int64
x          53940 non-null float64
y          53940 non-null float64
z          53940 non-null float64
dtypes: category(3), float64(6), int64(1)
memory usage: 4.5+ MB
```

## The Transformer

Our pipeline will be pretty basic, just enough to motivate writing a proper transformer. We'll fit a Lasso, which means we should probably scale our features before fitting. So our pipeline will be

- raw data
- dummy variables / OneHotEncoded
- Scaled
- Lasso

Our imports

```python
from itertools import chain

import numpy as np
from sklearn.base import TransformerMixin
from sklearn.linear_model import Lasso
import sklearn.cross_validation as cv
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
```

We're predicting price, so let's drop that from the exogenous variables. And we'll throw in a train-test split for fun.

```python
X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = cv.train_test_split(X, y)
```

And now for the transformer itself. I encourage you to skim the implementation; I highlight the interesting sections later on.

```python
class CategoricalTransformer(TransformerMixin):

    def fit(self, X, y=None, *args, **kwargs):
        # Record the label -> position bookkeeping we'll need later.
        self.columns_ = X.columns
        self.cat_columns_ = X.select_dtypes(include=['category']).columns
        self.non_cat_columns_ = X.columns.drop(self.cat_columns_)

        self.cat_map_ = {col: X[col].cat.categories
                         for col in self.cat_columns_}
        self.ordered_ = {col: X[col].cat.ordered
                         for col in self.cat_columns_}

        self.dummy_columns_ = {col: ["_".join([col, v])
                                     for v in self.cat_map_[col]]
                               for col in self.cat_columns_}
        self.transformed_columns_ = pd.Index(
            self.non_cat_columns_.tolist() +
            list(chain.from_iterable(self.dummy_columns_[k]
                                     for k in self.cat_columns_))
        )
        return self

    def transform(self, X, y=None, *args, **kwargs):
        return (pd.get_dummies(X)
                  .reindex(columns=self.transformed_columns_)
                  .fillna(0))

    def inverse_transform(self, X):
        X = np.asarray(X)
        series = []
        non_cat_cols = (self.transformed_columns_
                            .get_indexer(self.non_cat_columns_))
        non_cat = pd.DataFrame(X[:, non_cat_cols],
                               columns=self.non_cat_columns_)
        for col, cat_cols in self.dummy_columns_.items():
            locs = self.transformed_columns_.get_indexer(cat_cols)
            codes = X[:, locs].argmax(1)
            cats = pd.Categorical.from_codes(codes, self.cat_map_[col],
                                             ordered=self.ordered_[col])
            series.append(pd.Series(cats, name=col))
        # concat sorts; we want the original column order
        df = (pd.concat([non_cat] + series, axis=1)
                .reindex(columns=self.columns_))
        return df


self = CategoricalTransformer().fit(X)  # fit an instance, for testing later
```

Well, that's quite a bit of code. First, some higher-level thoughts.

This is *much* more code than necessary for just exploring things.
You could get away with something like

```python
X_train_ = StandardScaler().fit_transform(pd.get_dummies(X_train))
lm = Lasso()
lm.fit(X_train_, y_train)
```

Most of the extra code up there is dealing with things that will come up when you go to actually use the estimator in production: things like rows being in a different order, or training samples not seeing every code. Between pandas Categoricals and all of the bookkeeping in the `.fit` method, we can handle these cases.
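
For instance (a quick check of my own, not from the original workflow), feature columns arriving in a different order at predict time are re-aligned by `transform`, since it reindexes to the column order learned during `fit`:

```python
# Scramble the column order to mimic messy production input.
scrambled = X_test[list(X_test.columns[::-1])]

# transform still emits columns in the order learned during fit.
(self.transform(scrambled).columns == self.transformed_columns_).all()  # True
```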

## In Detail

First, notice that we don't have an `__init__` method. You could easily include one and explicitly assign your categories there; I think that's how scikit-learn's `OneHotEncoder` does it. Instead, we'll rely on that information being stored in the DataFrame.
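
For the curious, an explicit version might look something like this (just a sketch; the `categories` parameter and the class name are mine, not scikit-learn's):

```python
class ExplicitCategoricalTransformer(TransformerMixin):
    # Hypothetical variant: take the category mapping up front
    # instead of reading it off the DataFrame's dtypes in fit.
    def __init__(self, categories=None):
        # e.g. {'cut': cuts, 'color': color, 'clarity': clarity}
        self.categories = categories
```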

## In Detail: `fit`

The `fit` step really is just bookkeeping about how to go from labels (in pandas-land) to positions (in NumPy-land).

The most interesting thing is the underused `select_dtypes` method, though that doesn't have much to do with our task at hand. The main takeaway from `.fit` is that we can reliably

- transform a DataFrame to a dummy-encoded feature matrix
- invert that transformation back to a DataFrame (with our categories intact)

And to repeat, those transformations will work regardless of the ordering of the features in our training dataset. We don't even need all of the categories to actually be realized.
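
Concretely, `select_dtypes` is how `fit` discovers which columns are categorical without being told:

```python
# Sniff out the categorical columns by dtype.
X_train.select_dtypes(include=['category']).columns
# Index(['cut', 'color', 'clarity'], dtype='object')
```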

## In Detail: `transform`

```python
def transform(self, X, y=None, *args, **kwargs):
    return (pd.get_dummies(X)
              .reindex(columns=self.transformed_columns_)
              .fillna(0))
```

The `transform` method uses the `get_dummies` function from pandas to go from our categories to a dummy-variable representation. By default it only operates on `category` or `object` dtypes, appending the dummy variables as new columns at the end of the DataFrame while dropping the original categorical columns. The column names are the original column name joined with the actual category by an underscore. We `.reindex` by `self.transformed_columns_` in case a specific category isn't realized in the data being transformed (say `color_I` just didn't have any observations); the `fillna(0)` then fills in those missing dummy columns with zeros.

```python
>>> self.transformed_columns_
Index(['carat', 'depth', 'table', 'x', 'y', 'z', 'cut_Fair', 'cut_Good',
       'cut_Very Good', 'cut_Premium', 'cut_Ideal', 'color_D', 'color_E',
       'color_F', 'color_G', 'color_H', 'color_I', 'color_J', 'clarity_I1',
       'clarity_SI2', 'clarity_SI1', 'clarity_VS2', 'clarity_VS1',
       'clarity_VVS2', 'clarity_VVS1', 'clarity_IF'],
      dtype='object')
```

This is our feature matrix, which will be fed to the next stage of the pipeline.
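
To see the robustness in action (my own quick check, using the transformer we fit above), transforming even a couple of rows still yields the full set of columns:

```python
# A tiny batch still produces all 26 training-time columns,
# with zeros for any categories that don't appear.
self.transform(X.head(2)).shape  # (2, 26)
```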

## In Detail: `inverse_transform`

The `inverse_transform` method isn't used as often, but it's nice to have around. We use our mappings from NumPy positions back to pandas labels to reconstruct the arrays.

There are two methods worth pointing out.

`Index.get_indexer` is a lower-level method for getting the index *positions* when you have an array of *labels*. We use this since we need to know where the column for, say, `carat` ended up in the transformed NumPy array.

`Categorical.from_codes` is an alternative constructor to the usual one. It's the method you want when you've got a set of `codes` (integers) and want to construct a `Categorical`.
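
A tiny illustration of both, with made-up values:

```python
# get_indexer: labels -> integer positions
idx = pd.Index(['a', 'b', 'c'])
idx.get_indexer(['c', 'a'])  # array([2, 0])

# from_codes: integer codes + categories -> Categorical
pd.Categorical.from_codes([0, 1, 0], ['low', 'high'], ordered=True)
# [low, high, low]
# Categories (2, object): [low < high]
```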

Pandas will someday add a `from_dummies` or `invert_dummies` (and we'll do it quicker if you ask for it, and even more quickly if you implement it), but for now we write our own.

The trick is to recognize that once we've dummy-encoded the values, a row's `argmax` represents the position of the category. For example, if we look at `cut`:

```python
self.transform(X)[self.dummy_columns_['cut']].head()
```

|   | cut_Fair | cut_Good | cut_Very Good | cut_Premium | cut_Ideal |
|---|----------|----------|---------------|-------------|-----------|
| 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 |
| 5 | 0 | 1 | 0 | 0 | 0 |

We see that the `argmax` for the first row is `4`, the code for `Ideal`.
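
We can verify that across the first few rows (my own sanity check):

```python
# argmax recovers the category positions: Ideal, Premium, Good, ...
self.transform(X)[self.dummy_columns_['cut']].values.argmax(1)[:5]
# array([4, 3, 1, 3, 1])
```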

In this section:

```python
locs = self.transformed_columns_.get_indexer(cat_cols)
codes = X[:, locs].argmax(1)
cats = pd.Categorical.from_codes(codes, self.cat_map_[col],
                                 ordered=self.ordered_[col])
```

`codes`, the `argmax` for each row, is an array of integers. We use those codes along with our original categories and ordering to reconstruct the original Categorical.

The rest is some more handling of non-categorical columns, concatenation, and column sorting.

## Check your work

Pandas has a bunch of useful methods for testing.

```python
import pandas.util.testing as tm

result = self.inverse_transform(self.transform(X))
result.index = X.index

tm.assert_frame_equal(result, X)
```

No `AssertionError` is good news :)

## Use it

And lets "deploy" it to "production"

pipe = make_pipeline(CategoricalTransformer(), StandardScaler(), Lasso()) pipe.fit(X_train, y_train) print("train", pipe.score(X_train, y_train)) print("test", pipe.score(X_test, y_test))

outputs

```
train 0.92170189249
test 0.914089736447
```

If you care about things like multicollinearity, you'd want to either drop the intercept or drop one of the categories per `Categorical`.
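
If you go the drop-one-category route, newer versions of pandas can do that for you via the `drop_first` keyword to `get_dummies` (a sketch; you'd thread this through the transformer rather than call it directly):

```python
# k categories -> k-1 dummy columns per categorical.
pd.get_dummies(X, drop_first=True).shape  # (53940, 23): 26 - 3 columns
```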

## Soap Box

In an ideal world, writing your own transformer here wouldn't be necessary (technically you don't have to write it yourself, just take my code). But we don't live in an ideal world. Scikit-learn uses the NumPy array as its fundamental data type. Pandas has recently introduced several dtypes of its own, implemented in pandas rather than at the NumPy level. There are some very interesting conversations on why pandas went this route. Whether or not that was the right path (I think it was, but hey) is mostly academic. The real-world consequence is that pandas has developed dialects that many other PyData-ecosystem projects don't understand.

Andreas Mueller, a core developer for scikit-learn, has said he wants to explore this space a bit more this year, so hopefully things will get better. I'm looking forward to seeing where it goes, and how we can help.