dask.dataframe.to_hdf
- dask.dataframe.to_hdf(df, path, key, mode='a', append=False, scheduler=None, name_function=None, compute=True, lock=None, dask_kwargs=None, **kwargs)
Store a Dask DataFrame to Hierarchical Data Format (HDF) files.
This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.
This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0 or with the result of calling name_function on each of those integers.
This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.
- Parameters
- path : string, pathlib.Path
Path to a target filename. Supports strings, pathlib.Path, or any object implementing the __fspath__ protocol. May contain a * to denote many filenames.
- key : string
Datapath within the files. May contain a * to denote many locations.
- name_function : function
A function to convert the * in the above options to a string. Should take an integer from 0 to the number of partitions minus one and return a string (see the examples below).
- compute : bool
Whether or not to execute immediately. If False then this returns a dask.Delayed value.
- lock : bool, Lock, optional
Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.
- scheduler : string
The scheduler to use, like “threads” or “processes”.
- **other :
See pandas.DataFrame.to_hdf for more information.
- Returns
- filenames : list
Returned if compute is True. List of file names that each partition is saved to.
- delayed : dask.Delayed
Returned if compute is False. Delayed object to execute to_hdf when computed.
Examples
Save data to a single file:
>>> df.to_hdf('output.hdf', '/data')
Save data to multiple datapaths within the same file:
>>> df.to_hdf('output.hdf', '/data-*')
Save data to multiple files:
>>> df.to_hdf('output-*.hdf', '/data')
Save data to multiple files, using the multiprocessing scheduler:
>>> df.to_hdf('output-*.hdf', '/data', scheduler='processes')
Specify a custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc.:
>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     ''' Convert a partition index (0, 1, 2, ...) to a date string '''
...     return str(base + timedelta(days=i))
>>> df.to_hdf('*.hdf', '/data', name_function=name_function)
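To see which names a given pattern would produce before writing anything, the * substitution described above can be previewed in plain Python. This is only a minimal sketch and not Dask's own implementation: expand_pattern is a hypothetical helper introduced here for illustration, and it assumes the default names are plain integers starting from 0, whereas the library may format them differently (for example, with zero-padding).

```python
from datetime import date, timedelta


def expand_pattern(pattern, npartitions, name_function=None):
    """Hypothetical helper: list the names a '*' pattern would expand to.

    Not part of Dask; it only mirrors the substitution rule described
    in the documentation above.
    """
    if name_function is None:
        # Assumption: default names are plain integers "0", "1", ...
        # The real library may format them differently.
        name_function = str
    if "*" not in pattern:
        # No wildcard: every partition targets the same name.
        return [pattern] * npartitions
    # Replace the wildcard with one name per partition index.
    return [pattern.replace("*", name_function(i)) for i in range(npartitions)]


base = date(year=2000, month=1, day=1)


def name_function(i):
    """Convert a partition index to an ISO date string."""
    return str(base + timedelta(days=i))


# File names for three partitions with the custom naming scheme.
files = expand_pattern("*.hdf", 3, name_function)
# Datapaths for three partitions with the default integer names.
keys = expand_pattern("/data-*", 3)

print(files)  # ['2000-01-01.hdf', '2000-01-02.hdf', '2000-01-03.hdf']
print(keys)   # ['/data-0', '/data-1', '/data-2']
```

Because name_function must return a string, the date is passed through str() before it is substituted into the pattern.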