dask.dataframe.to_hdf
dask.dataframe.to_hdf(df, path, key, mode='a', append=False, scheduler=None, name_function=None, compute=True, lock=None, dask_kwargs=None, **kwargs)
Store a Dask DataFrame to Hierarchical Data Format (HDF) files.
This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.
This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0, or with the result of calling name_function on each of those integers.
This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.
- Parameters
  - path : string, pathlib.Path
    Path to a target filename. Supports strings, pathlib.Path, or any object implementing the __fspath__ protocol. May contain a * to denote many filenames.
  - key : string
    Datapath within the files. May contain a * to denote many locations.
  - name_function : function
    A function to convert the * in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (See examples below.)
  - compute : bool
    Whether or not to execute immediately. If False then this returns a dask.Delayed value.
  - lock : bool, Lock, optional
    Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.
  - scheduler : string
    The scheduler to use, like "threads" or "processes".
  - **other
    See pandas.to_hdf for more information.
- Returns
  - filenames : list
    Returned if compute is True. List of file names that each partition is saved to.
  - delayed : dask.Delayed
    Returned if compute is False. Delayed object to execute to_hdf when computed.
Examples
Save data to a single file:
>>> df.to_hdf('output.hdf', '/data')
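Because path also accepts pathlib.Path objects (or any object implementing __fspath__), the target can be built with pathlib; a small sketch, reusing the same placeholder filename:
>>> from pathlib import Path
>>> df.to_hdf(Path('output.hdf'), '/data')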
Save data to multiple datapaths within the same file:
>>> df.to_hdf('output.hdf', '/data-*')
Save data to multiple files:
>>> df.to_hdf('output-*.hdf', '/data')
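With compute left at its default of True, the call above executes immediately and returns the list of file names the partitions were saved to (a sketch; the exact names depend on the number of partitions):
>>> filenames = df.to_hdf('output-*.hdf', '/data')
>>> # filenames is a list such as ['output-0.hdf', 'output-1.hdf', ...]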
Save data to multiple files, using the multiprocessing scheduler:
>>> df.to_hdf('output-*.hdf', '/data', scheduler='processes')
Specify a custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc.:
>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     '''Convert integer 0 to n to a string'''
...     return str(base + timedelta(days=i))
>>> df.to_hdf('*.hdf', '/data', name_function=name_function)
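Defer the write by passing compute=False. The returned dask.Delayed object performs the write only when it is computed, for example via dask.compute (a minimal sketch, reusing the DataFrame from the examples above):
>>> import dask
>>> out = df.to_hdf('output-*.hdf', '/data', compute=False)
>>> dask.compute(out)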