dask.bag.to_textfiles
dask.bag.to_textfiles(b, path, name_function=None, compression='infer', encoding='utf-8', compute=True, storage_options=None, last_endline=False, **kwargs)
Write dask Bag to disk, one filename per partition, one line per element.
Paths: This will create one file for each partition in your bag. You can specify the filenames in a variety of ways.
Use a globstring
>>> b.to_textfiles('/path/to/data/*.json.gz')
The * will be replaced by the increasing sequence 0, 1, 2, …
/path/to/data/0.json.gz
/path/to/data/1.json.gz
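For a self-contained sketch (the scratch directory, bag contents, and .txt extension are illustrative, not part of the API), a two-partition bag yields one numbered file per partition:
>>> import os, tempfile
>>> import dask.bag as db
>>> outdir = tempfile.mkdtemp()  # illustrative scratch directory
>>> b = db.from_sequence(['alice', 'bob', 'charlie', 'dan'], npartitions=2)
>>> _ = b.to_textfiles(os.path.join(outdir, '*.txt'))
>>> sorted(os.listdir(outdir))
['0.txt', '1.txt']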
Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.
>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'
>>> b.to_textfiles('/path/to/data/*.json.gz', name_function=name)
/path/to/data/2015-01-01.json.gz
/path/to/data/2015-01-02.json.gz
...
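One way to satisfy the ordering requirement is to zero-pad the partition index, so the produced names sort in the same order as the indices (plain str(i) would place '10' before '2'). A sketch using the same illustrative path:
>>> def name(i):
...     return 'part-%04d' % i
>>> name(0)
'part-0000'
>>> name(15)
'part-0015'
>>> b.to_textfiles('/path/to/data/*.json.gz', name_function=name)
/path/to/data/part-0000.json.gz
/path/to/data/part-0001.json.gz
...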
You can also provide an explicit list of paths.
>>> paths = ['/path/to/data/alice.json.gz', '/path/to/data/bob.json.gz', ...]
>>> b.to_textfiles(paths)
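Because one file is written per partition, an explicit list should supply one path per partition; a brief sketch (paths and contents are illustrative):
>>> import dask.bag as db
>>> b2 = db.from_sequence(['alice', 'bob'], npartitions=2)  # two partitions
>>> b2.to_textfiles(['/path/to/data/alice.txt', '/path/to/data/bob.txt'])  # two paths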
Compression: Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly.
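For example (paths and contents are illustrative), writing to names ending in .gz produces gzip-compressed files, and dask.bag.read_text infers the same compression from the extension when reading them back:
>>> import os, tempfile
>>> import dask.bag as db
>>> outdir = tempfile.mkdtemp()
>>> b = db.from_sequence(['one', 'two', 'three'], npartitions=1)
>>> _ = b.to_textfiles(os.path.join(outdir, '*.txt.gz'))
>>> lines = db.read_text(os.path.join(outdir, '*.txt.gz')).compute()  # decompressed transparently
>>> len(lines)
3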
Bag Contents: The bag calling to_textfiles must be a bag of text strings. For example, a bag of dictionaries could be written to JSON text files by mapping json.dumps on to the bag first, and then calling to_textfiles:
>>> b_dict.map(json.dumps).to_textfiles("/path/to/data/*.json")
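A self-contained version of that pattern (directory and records are illustrative): map json.dumps over the bag so every element is a string, then write:
>>> import json, os, tempfile
>>> import dask.bag as db
>>> outdir = tempfile.mkdtemp()
>>> b_dict = db.from_sequence([{'name': 'alice', 'x': 1}, {'name': 'bob', 'x': 2}], npartitions=2)
>>> _ = b_dict.map(json.dumps).to_textfiles(os.path.join(outdir, '*.json'))
>>> sorted(os.listdir(outdir))
['0.json', '1.json']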
Last endline: By default the last line does not end with a newline character. Pass last_endline=True to invert the default.
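A small sketch (scratch directory is illustrative): with last_endline=True the file gains a trailing newline after its final element, which the default omits:
>>> import os, tempfile
>>> import dask.bag as db
>>> outdir = tempfile.mkdtemp()
>>> b = db.from_sequence(['a', 'b'], npartitions=1)
>>> _ = b.to_textfiles(os.path.join(outdir, '*.txt'), last_endline=True)
>>> open(os.path.join(outdir, '0.txt')).read()
'a\nb\n'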