dask.dataframe.to_hdf
dask.dataframe.to_hdf(df, path, key, mode='a', append=False, scheduler=None, name_function=None, compute=True, lock=None, dask_kwargs=None, **kwargs)
Store a Dask DataFrame to Hierarchical Data Format (HDF) files.
This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.
This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0, or with the result of calling name_function on each of those integers.
This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.
- Parameters
  - path : string, pathlib.Path
    Path to a target filename. Supports strings, pathlib.Path, or any object implementing the __fspath__ protocol. May contain a * to denote many filenames.
  - key : string
    Datapath within the files. May contain a * to denote many locations.
  - name_function : function
    A function to convert the * in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (See examples below.)
  - compute : bool
    Whether or not to execute immediately. If False then this returns a dask.Delayed value.
  - lock : bool, Lock, optional
    Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.
  - scheduler : string
    The scheduler to use, like "threads" or "processes".
  - **other
    See pandas.to_hdf for more information.
- Returns
  - filenames : list
    Returned if compute is True. List of file names that each partition is saved to.
  - delayed : dask.Delayed
    Returned if compute is False. Delayed object to execute to_hdf when computed.
Examples
Save data to a single file:
>>> df.to_hdf('output.hdf', '/data')
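Because path also accepts pathlib.Path objects (or any object implementing __fspath__), the target can be built with pathlib; a small sketch, reusing the same placeholder filename:
>>> from pathlib import Path
>>> df.to_hdf(Path('output.hdf'), '/data')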
Save data to multiple datapaths within the same file:
>>> df.to_hdf('output.hdf', '/data-*')
Save data to multiple files:
>>> df.to_hdf('output-*.hdf', '/data')
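With compute left at its default of True, the call above executes immediately and returns the list of file names the partitions were saved to (a sketch; the exact names depend on the number of partitions):
>>> filenames = df.to_hdf('output-*.hdf', '/data')
>>> # filenames is a list such as ['output-0.hdf', 'output-1.hdf', ...]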
Save data to multiple files, using the multiprocessing scheduler:
>>> df.to_hdf('output-*.hdf', '/data', scheduler='processes')
Specify a custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc.:
>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     '''Convert integer 0 to n to a string'''
...     return str(base + timedelta(days=i))
>>> df.to_hdf('*.hdf', '/data', name_function=name_function)
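Defer the write by passing compute=False. The returned dask.Delayed object performs the write only when it is computed, for example via dask.compute (a minimal sketch, reusing the DataFrame from the examples above):
>>> import dask
>>> out = df.to_hdf('output-*.hdf', '/data', compute=False)
>>> dask.compute(out)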