dask.bag.Bag
class dask.bag.Bag(dsk: Graph, name: str, npartitions: int)
Parallel collection of Python objects
Examples
Create Bag from sequence
>>> import dask.bag as db
>>> b = db.from_sequence(range(5))
>>> list(b.filter(lambda x: x % 2 == 0).map(lambda x: x * 10))
[0, 20, 40]
Create Bag from filename or globstring of filenames
>>> import json
>>> b = db.read_text('/path/to/mydata.*.json.gz').map(json.loads)
Create manually (expert use)
>>> dsk = {('x', 0): (range, 5),
...        ('x', 1): (range, 5),
...        ('x', 2): (range, 5)}
>>> b = db.Bag(dsk, 'x', npartitions=3)

>>> sorted(b.map(lambda x: x * 10))
[0, 0, 0, 10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]

>>> int(b.fold(lambda x, y: x + y))
30
Methods
__init__(dsk, name, npartitions)

accumulate(binop[, initial])
    Repeatedly apply binary function to a sequence, accumulating results.
all([split_every])
    Are all elements truthy?
any([split_every])
    Are any of the elements truthy?
compute(**kwargs)
    Compute this dask collection.
count([split_every])
    Count the number of elements.
distinct([key])
    Distinct elements of collection.
filter(predicate)
    Filter elements in collection by a predicate function.
flatten()
    Concatenate nested lists into one long list.
fold(binop[, combine, initial, split_every, ...])
    Parallelizable reduction.
foldby(key, binop[, initial, combine, ...])
    Combined reduction and groupby.
frequencies([split_every, sort])
    Count number of occurrences of each distinct element.
groupby(grouper[, method, npartitions, ...])
    Group collection by key function.
join(other, on_self[, on_other])
    Joins collection with another collection.
map(func, *args, **kwargs)
    Apply a function elementwise across one or more bags.
map_partitions(func, *args, **kwargs)
    Apply a function to every partition across one or more bags.
max([split_every])
    Maximum element.
mean()
    Arithmetic mean.
min([split_every])
    Minimum element.
persist(**kwargs)
    Persist this dask collection into memory.
pluck(key[, default])
    Select item from all tuples/dicts in collection.
product(other)
    Cartesian product between two bags.
random_sample(prob[, random_state])
    Return elements from bag with probability of prob.
reduction(perpartition, aggregate[, ...])
    Reduce collection with reduction operators.
remove(predicate)
    Remove elements in collection that match predicate.
repartition([npartitions, partition_size])
    Repartition Bag across new divisions.
starmap(func, **kwargs)
    Apply a function using argument tuples from the given bag.
std([ddof])
    Standard deviation.
sum([split_every])
    Sum all elements.
take(k[, npartitions, compute, warn])
    Take the first k elements.
to_avro(filename, schema[, name_function, ...])
    Write bag to set of avro files.
to_dataframe([meta, columns, optimize_graph])
    Create Dask DataFrame from a Dask Bag.
to_delayed([optimize_graph])
    Convert into a list of dask.delayed objects, one per partition.
to_textfiles(path[, name_function, ...])
    Write dask Bag to disk, one filename per partition, one line per element.
topk(k[, key, split_every])
    K largest elements in collection.
unzip(n)
    Transform a bag of tuples to n bags of their elements.
var([ddof])
    Variance.
visualize([filename, format, optimize_graph])
    Render the computation of this object's task graph using graphviz.
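
As a quick illustration of how the reductions above compose, here is a minimal sketch using foldby, frequencies, and topk with the signatures listed in the table; the sample data is invented for illustration:

>>> import dask.bag as db
>>> b = db.from_sequence(range(10), npartitions=2)
>>> is_even = lambda x: x % 2 == 0
>>> sorted(b.foldby(is_even, lambda acc, x: acc + x, 0).compute())  # per-key sums
[(False, 25), (True, 20)]
>>> sorted(db.from_sequence(['cat', 'dog', 'cat']).frequencies().compute())
[('cat', 2), ('dog', 1)]
>>> b.topk(3).compute()  # three largest elements, descending
[9, 8, 7]

For per-key reductions, foldby is usually preferred over groupby: it combines values within each partition before any data moves between partitions, whereas groupby triggers a full shuffle.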
Attributes
str
    String processing functions.
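
A brief sketch of the str accessor in use, assuming it mirrors the built-in str methods plus fnmatch-style matching as described in the dask docstring; the sample strings are invented:

>>> import dask.bag as db
>>> b = db.from_sequence(['Alice Smith', 'Bob Jones'])
>>> list(b.str.lower())  # element-wise str.lower
['alice smith', 'bob jones']
>>> list(b.str.match('*Smith'))  # fnmatch-style pattern
['Alice Smith']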