dask.bag.Bag.to_avro
- Bag.to_avro(filename, schema, name_function=None, storage_options=None, codec='null', sync_interval=16000, metadata=None, compute=True, **kwargs)
Write bag to a set of Avro files.
The schema is a complex dictionary describing the data; see https://avro.apache.org/docs/1.8.2/gettingstartedpython.html#Defining+a+schema and https://fastavro.readthedocs.io/en/latest/writer.html . Its structure is as follows:
{'name': 'Test',
 'namespace': 'Test',
 'doc': 'Descriptive text',
 'type': 'record',
 'fields': [
     {'name': 'a', 'type': 'int'},
 ]}
where the “name” field is required, but “namespace” and “doc” are optional descriptors; “type” must always be “record”. The list of fields should have an entry for every key of the input records, and the types correspond to the primitive, complex, or logical types of the Avro spec ( https://avro.apache.org/docs/1.8.2/spec.html ).
Results in one Avro file per input partition.
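As a sketch only (the field names below are hypothetical, not part of the dask API), a schema can combine these type kinds: a union with 'null' makes a field optional, and a logical type annotates a primitive:

{'name': 'Event',
 'type': 'record',
 'fields': [
     {'name': 'id', 'type': 'string'},                 # primitive type
     {'name': 'score', 'type': ['null', 'double']},    # complex type: a union with null makes the field nullable
     {'name': 'ts', 'type': {'type': 'long',
                             'logicalType': 'timestamp-millis'}},  # logical type on a primitive
 ]}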
- Parameters
- b: dask.bag.Bag
- filename: list of str or str
Filenames to write to. If a list, its length must match the number of partitions. If a string, it must include the glob character “*”, which will be expanded using name_function.
- schema: dict
Avro schema dictionary, see above
- name_function: None or callable
Expands integers into strings; see dask.bytes.utils.build_name_function. A custom name_function is sketched after the examples below.
- storage_options: None or dict
Extra key/value options to pass to the backend file-system
- codec: ‘null’, ‘deflate’, or ‘snappy’
Compression algorithm
- sync_interval: int
Number of records to include in each block within a file
- metadata: None or dict
Included in the file header
- compute: bool
If True, files are written immediately and the call blocks. If False, returns delayed objects, which the user can compute when convenient (see the sketch after the examples below).
- kwargs: passed to compute(), if compute=True
Examples
>>> import dask.bag as db
>>> b = db.from_sequence([{'name': 'Alice', 'value': 100},
...                       {'name': 'Bob', 'value': 200}])
>>> schema = {'name': 'People', 'doc': "Set of people's scores",
...           'type': 'record',
...           'fields': [
...               {'name': 'name', 'type': 'string'},
...               {'name': 'value', 'type': 'int'}]}
>>> b.to_avro('my-data.*.avro', schema)
['my-data.0.avro', 'my-data.1.avro']
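As a further sketch beyond the original example (the zero-padding lambda is a hypothetical choice, not a dask default), name_function controls how the “*” is expanded, and compute=False defers the writes until dask.compute is called:

>>> import dask
>>> out = b.to_avro('my-data.*.avro', schema,
...                 name_function=lambda i: str(i).zfill(3),  # partition index -> '000', '001', ...
...                 compute=False)                            # return delayed objects instead of writing now
>>> dask.compute(*out)                                        # trigger the writes

With compute=False the call returns delayed objects rather than writing immediately; computing them produces files such as my-data.000.avro under the naming scheme above. If fastavro is installed, the resulting files can be read back into a bag with db.read_avro.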