dask.dataframe.DataFrame.shuffle
- DataFrame.shuffle(on, npartitions=None, max_branch=None, shuffle=None, ignore_index=False, compute=None)
Rearrange DataFrame into new partitions
Uses hashing of on to map rows to output partitions. After this operation, rows with the same value of on will be in the same partition.
- Parameters
- on : str, list of str, Series, Index, or DataFrame
Column(s) or index to be used to map rows to output partitions
- npartitions : int, optional
Number of output partitions. By default, the partition count is unchanged.
- max_branch : int, optional
The maximum number of splits per input partition. Used within the staged shuffling algorithm.
- shuffle : {'disk', 'tasks', 'p2p'}, optional
Either 'disk' for single-node operation, or 'tasks' and 'p2p' for distributed operation. Inferred from your current scheduler if not given; see the sketch after this parameter list.
- ignore_index : bool, default False
Ignore the index during the shuffle. If True, performance may improve, but index values will not be preserved.
- compute : bool
Whether to trigger an immediate computation. Defaults to False.
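A minimal sketch of combining these keyword arguments. It uses dask.datasets.timeseries, Dask's bundled demo dataset, which includes an integer "id" column; the choice of shuffle method and ignore_index here is illustrative, not a recommendation.
>>> import dask.datasets
>>> ddf = dask.datasets.timeseries()  # demo frame with an "id" column
>>> # Request the task-based shuffle explicitly, drop the index for speed,
>>> # and leave the result lazy (compute defaults to False).
>>> out = ddf.shuffle("id", shuffle="tasks", ignore_index=True)
>>> out.npartitions == ddf.npartitions  # partition count unchanged by default
True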
Notes
This does not preserve a meaningful index or partitioning scheme, and the result is not deterministic when computed in parallel.
Examples
>>> df = df.shuffle(df.columns[0])
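A slightly fuller, self-contained sketch; the small frame and its "id" column are made up for illustration. After the shuffle, every row with a given "id" value lives in the same output partition, and npartitions coarsens the output to two partitions.
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({"id": [0, 1, 2, 3] * 25, "value": range(100)})
>>> ddf = dd.from_pandas(pdf, npartitions=4)
>>> shuffled = ddf.shuffle("id", npartitions=2)
>>> shuffled.npartitions
2
>>> # Rows sharing an "id" value are now co-located in one of the two
>>> # output partitions, so per-partition work sees whole key groups.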