dask.bag.Bag.groupby
- Bag.groupby(grouper, method=None, npartitions=None, blocksize=1048576, max_branch=None, shuffle=None)
Group collection by key function
This requires a full dataset read, serialization, and shuffle, which is expensive. If possible you should use foldby instead (see the sketch at the end of the Examples).
- Parameters
- grouper: function
Function on which to group elements
- shuffle: str
Either ‘disk’ for an on-disk shuffle or ‘tasks’ to use the task scheduling framework. Use ‘disk’ if you are on a single machine and ‘tasks’ if you are on a distributed cluster (a usage sketch follows this parameter list).
- npartitions: int
If using the disk-based shuffle, the number of output partitions
- blocksize: int
If using the disk-based shuffle, the size of shuffle blocks (bytes)
- max_branch: int
If using the task-based shuffle, the amount of splitting each partition undergoes. Increase this for fewer copies but more scheduler overhead.
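As a minimal sketch of selecting the shuffle backend explicitly; the key function, npartitions, and max_branch values below are illustrative choices, not defaults:
>>> import dask.bag as db
>>> b = db.from_sequence(range(1000), npartitions=4)
>>> # single machine: on-disk shuffle, with an explicit output partition count
>>> grouped = b.groupby(lambda x: x % 3, shuffle='disk', npartitions=8)
>>> # distributed cluster: task-based shuffle; a larger max_branch means
>>> # fewer copies but more scheduler overhead
>>> grouped = b.groupby(lambda x: x % 3, shuffle='tasks', max_branch=4)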
See also
Bag.foldby
Examples
>>> import dask.bag as db
>>> b = db.from_sequence(range(10))
>>> iseven = lambda x: x % 2 == 0
>>> dict(b.groupby(iseven))
{True: [0, 2, 4, 6, 8], False: [1, 3, 5, 7, 9]}
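Since groupby materializes every group, an aggregation is often cheaper with foldby. Reusing b and iseven from above, here is a minimal sketch of the same parity split expressed as a per-key count (the accumulator and combine lambdas are illustrative):
>>> # binop folds elements into a running count; combine merges partial counts
>>> counts = b.foldby(iseven, lambda acc, x: acc + 1, 0, lambda a, b: a + b, 0)
>>> sorted(counts.compute())
[(False, 5), (True, 5)]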