Pipelines and Categoricals

My favorite feature of scikit-learn is its pipelines. These are a nice convenience for putting together a chain of operations from raw data to classifier. More importantly, they help prevent training data from leaking into your validation, so I use them whenever possible.

Pandas somewhat recently added a Categorical dtype (factors if you're coming from R-land). These are great for representing data (often non-numeric) that comes from a fixed, discrete set of possibly ordered categories.

I frequently see questions on how to use Categoricals with scikit-learn. I sat down today to write a quick post on a way to accomplish this.

This post grew to be a bit longer than I intended so I'm assuming some familiarity with

  1. dummy variables
  2. Pandas Categoricals (codes vs. categories)
  3. Scikit-learn Estimator API
  4. Scikit-learn Pipelines

The Data

We'll use the ggplot2 builtin diamonds dataset (via Vincent Arel-Bundock's great Rdatasets repository)

import pandas as pd
url = ('https://github.com/vincentarelbundock/Rdatasets/raw/master/'
df = pd.read_csv(url, index_col=0)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75


I'm going to assume here that you know your categories ahead of time.

cat_columns = ['cut', 'color', 'clarity']
cuts = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color = list("DEFGHIJ")
clarity = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

And we'll use the Categorical constructor to convert the strings to categoricals. I'm not a gemologist, but I didn't make the color field ordered. It doesn't matter for our classifier either way, and we get to see a bit of variety in the code.

df['cut'] = pd.Categorical(df.cut, categories=cuts, ordered=True)
df['color'] = pd.Categorical(df.color, categories=color)
df['clarity'] = pd.Categorical(df.clarity, categories=clarity,
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53940 entries, 1 to 53940
Data columns (total 10 columns):
carat      53940 non-null float64
cut        53940 non-null object
color      53940 non-null object
clarity    53940 non-null object
depth      53940 non-null float64
table      53940 non-null float64
price      53940 non-null int64
x          53940 non-null float64
y          53940 non-null float64
z          53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.5+ MB

The Transformer

Our pipeline will be pretty basic, just enough to motivate writing a proper transformer. We'll fit a Lasso, which means we should probably scale our features before fitting. So our pipeline will be

  1. raw data
  2. dummy variables / OneHotEncoded
  3. Scaled
  4. Lasso

Our imports

from itertools import chain

from sklearn.linear_model import Lasso
import sklearn.cross_validation as cv
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline, TransformerMixin

We're predicting price, so let's drop that from the exogenous variables. And we'll throw in a train-test split for fun.

X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = cv.train_test_split(
    X, y

And now for the transformer itself. I encourage you to skim the implementation; I highlight the interesting sections later on.

class CategoricalTransformer(TransformerMixin):

    def fit(self, X, y=None, *args, **kwargs):
        self.columns_ = X.columns
        self.cat_columns_ = X.select_dtypes(include=['category']).columns
        self.non_cat_columns_ = X.columns.drop(self.cat_columns_)

        self.cat_map_ = {col: X[col].cat.categories
                         for col in self.cat_columns_}
        self.ordered_ = {col: X[col].cat.ordered
                         for col in self.cat_columns_}

        self.dummy_columns_ = {col: ["_".join([col, v])
                                     for v in self.cat_map_[col]]
                               for col in self.cat_columns_}
        self.transformed_columns_ = pd.Index(
            self.non_cat_columns_.tolist() +
                                     for k in self.cat_columns_))

    def transform(self, X, y=None, *args, **kwargs):
        return (pd.get_dummies(X)

    def inverse_transform(self, X):
        X = np.asarray(X)
        series = []
        non_cat_cols = (self.transformed_columns_
        non_cat = pd.DataFrame(X[:, non_cat_cols],
        for col, cat_cols in self.dummy_columns_.items():
            locs = self.transformed_columns_.get_indexer(cat_cols)
            codes = X[:, locs].argmax(1)
            cats = pd.Categorical.from_codes(codes, self.cat_map_[col],
            series.append(pd.Series(cats, name=col))
        # concats sorts, we want the original order
        df = (pd.concat([non_cat] + series, axis=1)
        return df

self = CategoricalTransformer()  # for testing later

Well, that's quite a bit of code. First, some higher-level thoughts.

This is much more code than necessary for just exploring things. You could get away with something like

X_train_ = StandardScaler().fit_transform(pd.get_dummies(X_train))
lm = Lasso()
lm.fit(X_train_, y_train)

Most of the extra code up there is dealing with things that will come up when you go to actually use the estimator in production. Things like rows being in a different order or training samples not seeing every code. Between pandas Categoricals and all of the bookkeeping in the .fit method, we can handle these cases.

In Detail

First, notice that we don't have an __init__ method. You could easily include one, and explicitly assign your categories here. I think that's how scikit-learn's OneHotEncoder does it. We'll rely on that information being stored in the DataFrame.

In Detail: fit

The fit step really is just bookkeeping about how to go from labels (in pandas-land) to positions (in NumPy land).

The most interesting thing is the underused select_dtypes method, but that's doesn't have much to do with our task at hand. The main takeaway from .fit is that we can reliably

  1. transform a DataFrame to a dummy encoded feature Matrix
  2. invert that transformation back to a DataFrame (with our categories intact)

And to repeat, those transformations will work regardless of the ordering of the features in our training dataset. We don't even need all of the categories to actually be realized.

In Detail: transform

def transform(self, X, y=None, *args, **kwargs):
    return (pd.get_dummies(X)

The transform method uses the get_dummies function from pandas to go from our categories to a dummy-variable representation. By default it only operates on categorical or object dtypes, appending the dummy variables as new columns on the end of the DataFrame, while dropping the categorical column. The column names are the original column name joined with the actual category by an underscore. We .reindex by self.transformed_columns_ in case a specific category was not realized in the training dataset we fit on (say color_I just didn't have any observations)

Index(['carat', 'depth', 'table', 'x', 'y', 'z', 'cut_Fair', 'cut_Good',
       'cut_Very Good', 'cut_Premium', 'cut_Ideal', 'color_D', 'color_E',
       'color_F', 'color_G', 'color_H', 'color_I', 'color_J', 'clarity_I1',
       'clarity_SI2', 'clarity_SI1', 'clarity_VS2', 'clarity_VS1',
       'clarity_VVS2', 'clarity_VVS1', 'clarity_IF'],

This is our feature matrix, that will be fed to the next stage of the pipeline.

In Detail: inverse_transform

The inverse_transform method isn't used as often, but is nice to have around. We use our mappings from NumPy positions back to pandas labels to reconstruct our arrays.

There are two methods worth pointing out

This is a lower level method for getting the index positions when you have an array of labels. We use this since we need to know where the label for, say, caret ended up in the transformed NumPy array.

This is an alternative constructor to the usual. It's the method you want when you've got a set of codes (integers) and want to construct a Categorical.

Pandas will someday add a from_dummies or invert_dummies (and we'll do it quicker if you ask for it, and even more quickly if you implement it), but for now we write our own.

The trick is to recognize that once we've dummy encoded the values, a row's argmax represents the position of the category. For example, if we look at cut

cut_Fair cut_Good cut_Very Good cut_Premium cut_Ideal
1 0 0 0 0 1
2 0 0 0 1 0
3 0 1 0 0 0
4 0 0 0 1 0
5 0 1 0 0 0

We see that the argmax for the first row is 4.

In this section:

locs = self.transformed_columns_.get_indexer(cat_cols)
codes = X[:, locs].argmax(1)
cats = pd.Categorical.from_codes(codes, self.cat_map_[col],

codes, the argmax for each row, is an array of integers. We use those codes along with our original categories and ordering to reconstruct the original Categorical.

The rest is some more handling of non-categorical columns, concatination, and column sorting.

Check your work

Pandas has a bunch of useful methods for testing.

import pandas.util.testing as tm

result = self.inverse_transform(self.transform(X))
result.index = X.index

tm.assert_frame_equal(result, X)

No AssertionError is good news :)

Use it

And lets "deploy" it to "production"

pipe = make_pipeline(CategoricalTransformer(), StandardScaler(), Lasso())

pipe.fit(X_train, y_train)

print("train", pipe.score(X_train, y_train))

print("test", pipe.score(X_test, y_test))


train 0.92170189249
test 0.914089736447

If you care about things like multicolinearity you'd want to either drop the intercept or drop one of the categories per Categorical.

Soap Box

In an ideal world, writing your own transformer here wouldn't be necessary (technically you don't have to write it yourself, just take my code). But we don't live in an ideal world. Scikit-learn uses the NumPy array as its fundamental data type. Pandas has recently introduced several dtypes on our own, instead of at the NumPy level. There are some very interesting conversations on why pandas went this route. Whether or not that's the right path (I think it was, but hey) is mostly academic. The real-world consequence is that pandas has developed dialects that many other PyData-ecosystem projects don't understand.

Andreas Mueller, a core developer for scikit-learn, has said he wants to explore this space a bit more this year, so hopefully things will get better. I'm looking forward to seeing where it goes, and how we can help.