Maintaining Performance

As pandas' documentation claims, pandas provides high-performance data structures. But how do we verify that the claim is correct? And how do we ensure that it stays correct over many releases? This post describes

  1. pandas' current setup for monitoring performance
  2. My personal debugging strategy for understanding and fixing performance regressions …

pandas + binder

This post describes the start of a journey to get pandas' documentation running on Binder. The end result is a nice button that launches the documentation on Binder.


For a while now I've been jealous of Dask's examples repository. That's a repository containing a collection of Jupyter notebooks demonstrating Dask in action. It stitches together some …

Tabular Data in Scikit-Learn and Dask-ML

Scikit-Learn 0.20.0 will contain some nice new features for working with tabular data. This blog post introduces those improvements with a small demo. We'll then see how Dask-ML was able to piggyback on the work done by scikit-learn to offer a version that works well with Dask Arrays …

Distributed Auto-ML with TPOT with Dask

This work is supported by Anaconda Inc.

This post describes a recent improvement made to TPOT. TPOT is an automated machine learning library for Python. It does some feature engineering and hyper-parameter optimization for you. TPOT uses genetic algorithms to evaluate which models are performing well and how to choose …

Moral Philosophy for pandas or: What is .values?

The other day, I put up a Twitter poll asking a simple question: What's the type of series.values?

I was a bit limited for space, so I'll expand on …

Modern Pandas (Part 8): Scaling

This is part 8 in my series on writing modern idiomatic pandas.

As I sit down to write this, the third-most popular pandas question on StackOverflow covers how to use pandas for large datasets. This is in …

dask-ml 0.4.1 Released

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

dask-ml 0.4.1 was released today with a few enhancements. See the changelog for all the changes from 0.4.0.

Conda packages are available on conda-forge:

$ conda install -c conda-forge dask-ml …

Extension Arrays for Pandas

This is a status update on some enhancements for pandas. The goal of the work is to store things that are sufficiently array-like in a pandas DataFrame, even if they aren't a regular NumPy array. Pandas already does this in a few places for some blessed types (like Categorical); we'd …

Easy distributed training with Joblib and dask

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I'm thankful to them for hosting me …


Today we released the first version of dask-ml, a library for parallel and distributed machine learning. Read the documentation or install it with:

pip install dask-ml

Packages are currently building for conda-forge and will be up later today:

conda install -c conda-forge dask-ml

The Goals

dask is, to quote the …