dask.dataframe.DataFrame.shuffle
- DataFrame.shuffle(on, npartitions=None, max_branch=None, shuffle=None, ignore_index=False, compute=None)
Rearrange DataFrame into new partitions
Uses hashing of on to map rows to output partitions. After this operation, rows with the same value of on will be in the same partition.
- Parameters
- on : str, list of str, Series, Index, or DataFrame
Column(s) or index to be used to map rows to output partitions
- npartitions : int, optional
Number of output partitions. By default, the partition count is unchanged.
- max_branch : int, optional
The maximum number of splits per input partition. Used within the staged shuffling algorithm.
- shuffle : {'disk', 'tasks', 'p2p'}, optional
Either 'disk' for single-node operation, or 'tasks' and 'p2p' for distributed operation. Inferred from your current scheduler if not given; see the sketch after this parameter list.
- ignore_index : bool, default False
Ignore the index during the shuffle. If True, performance may improve, but index values will not be preserved.
- compute : bool
Whether to trigger an immediate computation. Defaults to False.
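A minimal sketch of combining these keyword arguments. It uses dask.datasets.timeseries, Dask's bundled demo dataset, which includes an integer "id" column; the choice of shuffle method and ignore_index here is illustrative, not a recommendation.
>>> import dask.datasets
>>> ddf = dask.datasets.timeseries()  # demo frame with an "id" column
>>> # Request the task-based shuffle explicitly, drop the index for speed,
>>> # and leave the result lazy (compute defaults to False).
>>> out = ddf.shuffle("id", shuffle="tasks", ignore_index=True)
>>> out.npartitions == ddf.npartitions  # partition count unchanged by default
True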
Notes
This does not preserve a meaningful index or partitioning scheme, and the result is not deterministic when computed in parallel.
Examples
>>> df = df.shuffle(df.columns[0])
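A slightly fuller, self-contained sketch; the small frame and its "id" column are made up for illustration. After the shuffle, every row with a given "id" value lives in the same output partition, and npartitions coarsens the output to two partitions.
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({"id": [0, 1, 2, 3] * 25, "value": range(100)})
>>> ddf = dd.from_pandas(pdf, npartitions=4)
>>> shuffled = ddf.shuffle("id", npartitions=2)
>>> shuffled.npartitions
2
>>> # Rows sharing an "id" value are now co-located in one of the two
>>> # output partitions, so per-partition work sees whole key groups.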