dask.bag.Bag.groupby
- Bag.groupby(grouper, method=None, npartitions=None, blocksize=1048576, max_branch=None, shuffle=None)
Group collection by key function
This requires a full dataset read, serialization, and shuffle, which is expensive. If possible you should use foldby instead (see the sketch at the end of the Examples).
- Parameters
- grouper: function
Function on which to group elements
- shuffle: str
Either ‘disk’ for an on-disk shuffle or ‘tasks’ to use the task scheduling framework. Use ‘disk’ if you are on a single machine and ‘tasks’ if you are on a distributed cluster (a usage sketch follows this parameter list).
- npartitions: int
If using the disk-based shuffle, the number of output partitions
- blocksize: int
If using the disk-based shuffle, the size of shuffle blocks (bytes)
- max_branch: int
If using the task-based shuffle, the amount of splitting each partition undergoes. Increase this for fewer copies but more scheduler overhead.
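As a minimal sketch of selecting the shuffle backend explicitly; the key function, npartitions, and max_branch values below are illustrative choices, not defaults:
>>> import dask.bag as db
>>> b = db.from_sequence(range(1000), npartitions=4)
>>> # single machine: on-disk shuffle, with an explicit output partition count
>>> grouped = b.groupby(lambda x: x % 3, shuffle='disk', npartitions=8)
>>> # distributed cluster: task-based shuffle; a larger max_branch means
>>> # fewer copies but more scheduler overhead
>>> grouped = b.groupby(lambda x: x % 3, shuffle='tasks', max_branch=4)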
See also
Bag.foldby
Examples
>>> import dask.bag as db
>>> b = db.from_sequence(range(10))
>>> iseven = lambda x: x % 2 == 0
>>> dict(b.groupby(iseven))
{True: [0, 2, 4, 6, 8], False: [1, 3, 5, 7, 9]}
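Since groupby materializes every group, an aggregation is often cheaper with foldby. Reusing b and iseven from above, here is a minimal sketch of the same parity split expressed as a per-key count (the accumulator and combine lambdas are illustrative):
>>> # binop folds elements into a running count; combine merges partial counts
>>> counts = b.foldby(iseven, lambda acc, x: acc + 1, 0, lambda a, b: a + b, 0)
>>> sorted(counts.compute())
[(False, 5), (True, 5)]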