1. Pipelines and Categoricals

    My favorite feature of scikit-learn is its pipelines. They are a convenient way to chain together operations, from raw data through to a classifier. More importantly, they help prevent data leakage between your training and validation sets, so I use them whenever possible.

    Pandas somewhat recently added a Categorical dtype ...

2. Tidy Data in Action

    Hadley Wickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis. I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren't language ...