Conda packages are available on conda-forge
$ conda install -c conda-forge dask-ml
and wheels and the source are available on PyPI
$ pip install dask-ml
I wanted to highlight one change, that touches on a topic I mentioned in my first post on scalable Machine Learning. I discussed how, in my limited experience, a common workflow was to train on a small batch of data and predict for a much larger set of data. The training data easily fits in memory on a single machine, but the full dataset does not.
A new meta-estimator,
ParallelPostFit helps with this
common case. It's a meta-estimator that wraps a regular scikit-learn estimator,
similar to how
GridSearchCV wraps an estimator. The
.fit method is very
simple; it just calls the underlying estimator's
.fit method and copies over
the learned attributes. This means
ParalellPostFit is not suitable for
training on large datasets. It is, however, perfect for post-fit tasks like
As an example, we'll fit a scikit-learn
GradientBoostingClassifier on a small
>>> from sklearn.ensemble import GradientBoostingClassifier >>> import sklearn.datasets >>> import dask_ml.datasets >>> X, y = sklearn.datasets.make_classification(n_samples=1000, ... random_state=0) >>> clf = ParallelPostFit(estimator=GradientBoostingClassifier()) >>> clf.fit(X, y) ParallelPostFit(estimator=GradientBoostingClassifier(...))
Nothing special so far. But now, let's suppose we had a "large" dataset for
prediction. We'll use
dask_ml.datasets.make_classification, but in practice
you would read this from a file system or database.
>>> X_big, y_big = dask_ml.datasets.make_classification(n_samples=100000, chunks=1000, random_state=0)
In this case we have a dataset with 100,000 samples split into blocks of 1,000. We can now predict for this large dataset.
>>> clf.predict(X) dask.array<predict, shape=(10000,), dtype=int64, chunksize=(1000,)>
Now things are different.
.transform, all return dask arrays instead of immediately computing the
result. We've built up task graph of computations to be performed, which allows
dask to step in and compute things in parallel. When you're ready for the
>>> clf.predict_proba(X).compute() array([[0.99141094, 0.00858906], [0.93178389, 0.06821611], [0.99129105, 0.00870895], ..., [0.97996652, 0.02003348], [0.98087444, 0.01912556], [0.99407016, 0.00592984]])
At that point the dask scheduler comes in and executes your compute in parallel, using all the cores of your laptop or workstation, or all the machines on your cluster.
ParallelPostFit "fixes" a couple of issues in scikit-learn outside of
If you're able to depend on dask and dask-ml, consider giving
a shot and let me know how it turns out. For estimators whose predict is
relatively expensive and not already parallelized,
ParallelPostFit can give
a nice performance boost.
Even if the underlying estimator's
tranform method is cheap or
ParallelPostFit does still help with distributed the work on all
the machines in your cluster, or doing the operation out-of-core.
Thanks to all the contributors who worked on this release.