Working with Collections
Working with Collections¶
Often we want to do a bit of custom work with dask.delayed
(for example,
for complex data ingest), then leverage the algorithms in dask.array
or
dask.dataframe
, and then switch back to custom work. To this end, all
collections support from_delayed
functions and to_delayed
methods.
As an example, consider the case where we store tabular data in a custom format
not known by Dask DataFrame. This format is naturally broken apart into
pieces and we have a function that reads one piece into a Pandas DataFrame.
We use dask.delayed
to lazily read these files into Pandas DataFrames,
use dd.from_delayed
to wrap these pieces up into a single
Dask DataFrame, use the complex algorithms within the DataFrame
(groupby, join, etc.), and then switch back to dask.delayed
to save our results
back to the custom format:
import dask.dataframe as dd
from dask.delayed import delayed
from my_custom_library import load, save
filenames = ...
dfs = [delayed(load)(fn) for fn in filenames]
df = dd.from_delayed(dfs)
df = ... # do work with dask.dataframe
dfs = df.to_delayed()
writes = [delayed(save)(df, fn) for df, fn in zip(dfs, filenames)]
dd.compute(*writes)
Data science is often complex, and dask.delayed
provides a release valve for
users to manage this complexity on their own, and solve the last mile problem
for custom formats and complex situations.