Archive

What's Next?

Some personal news: Last Friday was my last day at Anaconda. Next week, I'm joining Microsoft's AI for Earth team. This is a very bittersweet transition. While I loved working at Anaconda and with all the great people there, I'm extremely excited about what I'll be working on at Microsoft.

Reflections …


Maintaining Performance

As pandas' documentation claims: pandas provides high-performance data structures. But how do we verify that the claim is correct? And how do we ensure that it stays correct over many releases? This post describes:

  1. pandas' current setup for monitoring performance
  2. My personal debugging strategy for understanding and fixing performance regressions …
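
The excerpt doesn't show pandas' actual monitoring setup, so the following is only a generic sketch of the underlying idea: time a baseline and a candidate implementation and compare them. It uses the standard library's timeit module; the functions sum_builtin, sum_loop, and best_time are hypothetical names made up for this illustration.

```python
import timeit


def sum_builtin(values):
    """Baseline implementation (hypothetical): the built-in sum."""
    return sum(values)


def sum_loop(values):
    """Candidate implementation (hypothetical): an explicit Python loop."""
    total = 0
    for v in values:
        total += v
    return total


def best_time(func, data, repeat=5, number=100):
    """Return the best wall time in seconds for `number` calls, over `repeat` runs."""
    timer = timeit.Timer(lambda: func(data))
    # Taking the minimum reduces noise from other processes on the machine.
    return min(timer.repeat(repeat=repeat, number=number))


data = list(range(10_000))
baseline = best_time(sum_builtin, data)
candidate = best_time(sum_loop, data)

# A ratio well above 1 suggests the candidate is slower than the baseline.
ratio = candidate / baseline
print(f"baseline={baseline:.4f}s candidate={candidate:.4f}s ratio={ratio:.2f}")
```

The exact numbers depend on the machine, which is why a comparison like this is usually tracked as a ratio over time rather than as an absolute threshold.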

pandas + binder

This post describes the start of a journey to get pandas' documentation running on Binder. The end result is this nice button:

Binder


For a while now I've been jealous of Dask's examples repository. That's a repository containing a collection of Jupyter notebooks demonstrating Dask in action. It stitches together some …


Tabular Data in Scikit-Learn and Dask-ML

Scikit-Learn 0.20.0 will contain some nice new features for working with tabular data. This blog post will introduce those improvements with a small demo. We'll then see how Dask-ML was able to piggyback on the work done by scikit-learn to offer a version that works well with Dask Arrays …
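
The teaser doesn't name the features, so treat this as an assumption: one tabular-data addition in scikit-learn 0.20 is ColumnTransformer, which applies different preprocessing to different columns of a DataFrame. Here is a minimal sketch with a made-up two-column dataset; the column names and values are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical column, one numeric column.
df = pd.DataFrame(
    {
        "city": ["a", "b", "a", "c"],
        "temp": [1.0, 2.0, 3.0, 4.0],
    }
)

# One-hot encode the categorical column and standardize the numeric one.
# sparse_threshold=0 asks for a dense array regardless of the encoder output.
transformer = ColumnTransformer(
    [
        ("categorical", OneHotEncoder(), ["city"]),
        ("numeric", StandardScaler(), ["temp"]),
    ],
    sparse_threshold=0,
)

features = transformer.fit_transform(df)
print(features.shape)
```

The three distinct cities become three indicator columns, and the scaled temperature adds a fourth.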


Distributed Auto-ML with TPOT with Dask

This work is supported by Anaconda Inc.

This post describes a recent improvement made to TPOT. TPOT is an automated machine learning library for Python. It does some feature engineering and hyper-parameter optimization for you. TPOT uses genetic algorithms to evaluate which models are performing well and how to choose …


Moral Philosophy for pandas or: What is .values?

The other day, I put up a Twitter poll asking a simple question: What's the type of series.values?

I was a bit limited for space, so I'll expand on …
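
To see why the question is less simple than it looks, here is a small sketch (not taken from the post itself) comparing .values for two dtypes. The Series contents are made up for the example.

```python
import numpy as np
import pandas as pd

# An integer Series is backed by a regular NumPy array.
ints = pd.Series([1, 2, 3])
ints_values = ints.values

# A categorical Series is not: .values hands back a pandas Categorical.
cats = pd.Series(["a", "b", "a"], dtype="category")
cats_values = cats.values

print(type(ints_values).__name__)
print(type(cats_values).__name__)
```

So the type of series.values depends on the dtype of the Series, which is the ambiguity the poll was poking at.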


Modern Pandas (Part 8): Scaling


This is part 8 in my series on writing modern idiomatic pandas.


As I sit down to write this, the third-most popular pandas question on StackOverflow covers how to use pandas for large datasets. This is in …


dask-ml 0.4.1 Released

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation.

dask-ml 0.4.1 was released today with a few enhancements. See the changelog for all the changes from 0.4.0.

Conda packages are available on conda-forge:

$ conda install -c conda-forge dask-ml …

Extension Arrays for Pandas

This is a status update on some enhancements for pandas. The goal of the work is to store things that are sufficiently array-like in a pandas DataFrame, even if they aren't a regular NumPy array. Pandas already does this in a few places for some blessed types (like Categorical); we'd …


Easy distributed training with Joblib and dask

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation.

This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I'm thankful to them for hosting me …