Today we released the first version of
dask-ml, a library for parallel and
distributed machine learning. Read the documentation or install it with pip:

```
pip install dask-ml
```
Packages are currently building for conda-forge, and will be up later today.

```
conda install -c conda-forge dask-ml
```
The Goals

dask is, to quote the docs, "a flexible parallel computing library for
analytic computing." dask.array and dask.dataframe have done a great job
scaling NumPy arrays and pandas dataframes;
dask-ml hopes to do the same in
the machine learning domain.
Put simply, we want

```python
est = MyEstimator()
est.fit(X, y)
```

to work well in parallel and potentially distributed across a cluster. dask
provides us with the building blocks to do that.
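To make that concrete, here's a minimal sketch of the pattern using dask-ml's LogisticRegression (which builds on dask-glm) fit against dask arrays; the shapes, chunk sizes, and random data are illustrative assumptions, not part of the release.

```python
import dask.array as da
from dask_ml.linear_model import LogisticRegression  # built on dask-glm

# Chunked arrays: each block can live on a different worker in a cluster.
# The shapes and chunk sizes here are arbitrary, chosen for illustration.
X = da.random.random((1_000_000, 20), chunks=(100_000, 20))
y = (da.random.random(1_000_000, chunks=100_000) > 0.5).astype(int)

est = LogisticRegression()
est.fit(X, y)  # the familiar scikit-learn API, computed in parallel
```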
What's Been Done
dask-ml collects some efforts that others already built:
- distributed joblib: scaling out some scikit-learn operations to clusters
  (from distributed.joblib); a sketch follows this list.
- hyper-parameter search: Some drop-in replacements for scikit-learn's
  hyper-parameter search classes, like GridSearchCV, that work well with dask
  collections (from dask-searchcv).
- distributed GLMs: Fit
  large Generalized Linear Models on your cluster (from dask-glm).
- dask + xgboost: Peer a
  dask.distributed cluster with XGBoost running in distributed mode (from dask-xgboost).
- dask + tensorflow: Peer a
  dask.distributed cluster with TensorFlow running in distributed mode (from dask-tensorflow).
- Out-of-core learning in pipelines: Reuse scikit-learn's out-of-core
  .partial_fit API in pipelines.
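For the distributed joblib piece, here's roughly what the pattern looks like with recent versions of dask and joblib (importing dask.distributed registers the backend); the estimator and toy dataset are assumptions for illustration.

```python
import joblib
from dask.distributed import Client  # importing this registers the 'dask' joblib backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client()  # connect to (or start) a dask.distributed cluster

X, y = make_classification(n_samples=10_000, n_features=20)
clf = RandomForestClassifier(n_estimators=200)

# joblib-parallel work inside fit() is shipped to the cluster's workers.
with joblib.parallel_backend('dask'):
    clf.fit(X, y)
```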
In addition to providing a single home for these existing efforts, we've implemented some algorithms that are designed to run in parallel and distributed across a cluster.
- KMeans: Uses the
  k-means|| algorithm for initialization, and a parallelized Lloyd's algorithm for the EM step (see the sketch after this list).
- Preprocessing: These are estimators that can be dropped into scikit-learn Pipelines, but they operate in parallel on dask collections. They'll work well on datasets in distributed memory, and may be faster for NumPy arrays (depending on the overhead from parallelizing, and whether or not the scikit-learn implementation is already parallel).
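Here's a hedged sketch of those pieces together: a dask-ml preprocessing step and KMeans dropped into a scikit-learn pipeline and fit against a dask array. The data and chunking are made up for illustration.

```python
import dask.array as da
from dask_ml.cluster import KMeans
from dask_ml.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# A blocked dask array: each 250,000-row chunk can sit on a different worker.
X = da.random.random((1_000_000, 4), chunks=(250_000, 4))

# k-means|| initialization, then parallelized Lloyd's iterations for the EM step.
pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=8, init='k-means||'))
pipe.fit(X)
```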
Help Wanted!

Scikit-learn is a robust, mature, stable library.
dask-ml is... not. Which
means there are plenty of places to contribute! Dask makes writing parallel and
distributed implementations of algorithms fun. For the most part, you don't even
have to think about "where's my data? How do I parallelize this?" Dask does
that for you.
Have a look at the issues or propose a new one. I'd love to hear about issues
you've run into when scaling the "traditional" scientific Python stack out to
larger problems.