dask.bag.Bag

class dask.bag.Bag(dsk: Graph, name: str, npartitions: int)
Parallel collection of Python objects
Examples
Create Bag from sequence
>>> import dask.bag as db
>>> b = db.from_sequence(range(5))
>>> list(b.filter(lambda x: x % 2 == 0).map(lambda x: x * 10))
[0, 20, 40]
Create Bag from filename or globstring of filenames
>>> import json
>>> b = db.read_text('/path/to/mydata.*.json.gz').map(json.loads)
Create manually (expert use)
>>> dsk = {('x', 0): (range, 5),
...        ('x', 1): (range, 5),
...        ('x', 2): (range, 5)}
>>> b = db.Bag(dsk, 'x', npartitions=3)

>>> sorted(b.map(lambda x: x * 10))
[0, 0, 0, 10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]

>>> int(b.fold(lambda x, y: x + y))
30
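A further sketch, not part of the original docstring: Bag methods chain in the usual functional style. The records below are made-up illustrations.

>>> import dask.bag as db
>>> people = db.from_sequence([
...     {'name': 'Alice', 'age': 30},
...     {'name': 'Bob', 'age': 25},
... ], npartitions=2)
>>> people.filter(lambda r: r['age'] > 26).pluck('name').compute()
['Alice']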
Methods
__init__(dsk, name, npartitions)
accumulate(binop[, initial])  Repeatedly apply binary function to a sequence, accumulating results.
all([split_every])  Are all elements truthy?
any([split_every])  Are any of the elements truthy?
compute(**kwargs)  Compute this dask collection
count([split_every])  Count the number of elements.
distinct([key])  Distinct elements of collection
filter(predicate)  Filter elements in collection by a predicate function.
flatten()  Concatenate nested lists into one long list.
fold(binop[, combine, initial, split_every, ...])  Parallelizable reduction
foldby(key, binop[, initial, combine, ...])  Combined reduction and groupby.
frequencies([split_every, sort])  Count number of occurrences of each distinct element.
groupby(grouper[, method, npartitions, ...])  Group collection by key function
join(other, on_self[, on_other])  Joins collection with another collection.
map(func, *args, **kwargs)  Apply a function elementwise across one or more bags.
map_partitions(func, *args, **kwargs)  Apply a function to every partition across one or more bags.
max([split_every])  Maximum element
mean()  Arithmetic mean
min([split_every])  Minimum element
persist(**kwargs)  Persist this dask collection into memory
pluck(key[, default])  Select item from all tuples/dicts in collection.
product(other)  Cartesian product between two bags.
random_sample(prob[, random_state])  Return elements from bag with probability of prob.
reduction(perpartition, aggregate[, ...])  Reduce collection with reduction operators.
remove(predicate)  Remove elements in collection that match predicate.
repartition([npartitions, partition_size])  Repartition Bag across new divisions.
starmap(func, **kwargs)  Apply a function using argument tuples from the given bag.
std([ddof])  Standard deviation
sum([split_every])  Sum all elements
take(k[, npartitions, compute, warn])  Take the first k elements.
to_avro(filename, schema[, name_function, ...])  Write bag to set of avro files
to_dataframe([meta, columns, optimize_graph])  Create Dask Dataframe from a Dask Bag.
to_delayed([optimize_graph])  Convert into a list of dask.delayed objects, one per partition.
to_textfiles(path[, name_function, ...])  Write dask Bag to disk, one filename per partition, one line per element.
topk(k[, key, split_every])  K largest elements in collection
unzip(n)  Transform a bag of tuples to n bags of their elements.
var([ddof])  Variance
visualize([filename, format, optimize_graph])  Render the computation of this object's task graph using graphviz.
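To make a few of the reductions above concrete, here is a short sketch (not part of the original reference; the word list is illustrative) combining frequencies, topk, and foldby:

>>> import dask.bag as db
>>> words = db.from_sequence(['apple', 'bee', 'apple', 'cat'], npartitions=2)
>>> sorted(words.frequencies().compute())
[('apple', 2), ('bee', 1), ('cat', 1)]
>>> words.topk(2, key=len).compute()
['apple', 'apple']
>>> # count words by length; combine merges the per-partition counts
>>> sorted(words.foldby(len, lambda n, w: n + 1, 0, combine=lambda a, b: a + b).compute())
[(3, 2), (5, 2)]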
Attributes
str  String processing functions
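For illustration, a minimal sketch of the str accessor, assuming it forwards ordinary Python string methods element-wise (consistent with the description above; the names are made up):

>>> import dask.bag as db
>>> names = db.from_sequence(['Alice', 'Bob'])
>>> names.str.upper().compute()
['ALICE', 'BOB']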