dask.bag.Bag

class dask.bag.Bag(dsk: Graph, name: str, npartitions: int)
Parallel collection of Python objects
Examples
Create Bag from sequence
>>> import dask.bag as db
>>> b = db.from_sequence(range(5))
>>> list(b.filter(lambda x: x % 2 == 0).map(lambda x: x * 10))
[0, 20, 40]
Create Bag from filename or globstring of filenames
>>> import json
>>> b = db.read_text('/path/to/mydata.*.json.gz').map(json.loads)
Create manually (expert use)
>>> dsk = {('x', 0): (range, 5),
...        ('x', 1): (range, 5),
...        ('x', 2): (range, 5)}
>>> b = db.Bag(dsk, 'x', npartitions=3)

>>> sorted(b.map(lambda x: x * 10))
[0, 0, 0, 10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]

>>> int(b.fold(lambda x, y: x + y))
30
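A further sketch, not part of the original docstring: Bag methods chain in the usual functional style. The records below are made-up illustrations.

>>> import dask.bag as db
>>> people = db.from_sequence([
...     {'name': 'Alice', 'age': 30},
...     {'name': 'Bob', 'age': 25},
... ], npartitions=2)
>>> people.filter(lambda r: r['age'] > 26).pluck('name').compute()
['Alice']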
Methods
__init__(dsk, name, npartitions)
accumulate(binop[, initial])  Repeatedly apply binary function to a sequence, accumulating results.
all([split_every])  Are all elements truthy?
any([split_every])  Are any of the elements truthy?
compute(**kwargs)  Compute this dask collection
count([split_every])  Count the number of elements.
distinct([key])  Distinct elements of collection
filter(predicate)  Filter elements in collection by a predicate function.
flatten()  Concatenate nested lists into one long list.
fold(binop[, combine, initial, split_every, ...])  Parallelizable reduction
foldby(key, binop[, initial, combine, ...])  Combined reduction and groupby.
frequencies([split_every, sort])  Count number of occurrences of each distinct element.
groupby(grouper[, method, npartitions, ...])  Group collection by key function
join(other, on_self[, on_other])  Joins collection with another collection.
map(func, *args, **kwargs)  Apply a function elementwise across one or more bags.
map_partitions(func, *args, **kwargs)  Apply a function to every partition across one or more bags.
max([split_every])  Maximum element
mean()  Arithmetic mean
min([split_every])  Minimum element
persist(**kwargs)  Persist this dask collection into memory
pluck(key[, default])  Select item from all tuples/dicts in collection.
product(other)  Cartesian product between two bags.
random_sample(prob[, random_state])  Return elements from bag with probability of prob.
reduction(perpartition, aggregate[, ...])  Reduce collection with reduction operators.
remove(predicate)  Remove elements in collection that match predicate.
repartition([npartitions, partition_size])  Repartition Bag across new divisions.
starmap(func, **kwargs)  Apply a function using argument tuples from the given bag.
std([ddof])  Standard deviation
sum([split_every])  Sum all elements
take(k[, npartitions, compute, warn])  Take the first k elements.
to_avro(filename, schema[, name_function, ...])  Write bag to set of avro files
to_dataframe([meta, columns, optimize_graph])  Create Dask Dataframe from a Dask Bag.
to_delayed([optimize_graph])  Convert into a list of dask.delayed objects, one per partition.
to_textfiles(path[, name_function, ...])  Write dask Bag to disk, one filename per partition, one line per element.
topk(k[, key, split_every])  K largest elements in collection
unzip(n)  Transform a bag of tuples to n bags of their elements.
var([ddof])  Variance
visualize([filename, format, optimize_graph])  Render the computation of this object's task graph using graphviz.
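To make a few of the reductions above concrete, here is a short sketch (not part of the original reference; the word list is illustrative) combining frequencies, topk, and foldby:

>>> import dask.bag as db
>>> words = db.from_sequence(['apple', 'bee', 'apple', 'cat'], npartitions=2)
>>> sorted(words.frequencies().compute())
[('apple', 2), ('bee', 1), ('cat', 1)]
>>> words.topk(2, key=len).compute()
['apple', 'apple']
>>> # count words by length; combine merges the per-partition counts
>>> sorted(words.foldby(len, lambda n, w: n + 1, 0, combine=lambda a, b: a + b).compute())
[(3, 2), (5, 2)]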
Attributes
str  String processing functions
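For illustration, a minimal sketch of the str accessor, assuming it forwards ordinary Python string methods element-wise (consistent with the description above; the names are made up):

>>> import dask.bag as db
>>> names = db.from_sequence(['Alice', 'Bob'])
>>> names.str.upper().compute()
['ALICE', 'BOB']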