dask.dataframe.read_json
- dask.dataframe.read_json(url_path, orient='records', lines=None, storage_options=None, blocksize=None, sample=1048576, encoding='utf-8', errors='strict', compression='infer', meta=None, engine=<function read_json>, include_path_column=False, path_converter=None, **kwargs)
Create a dataframe from a set of JSON files
This utilises pandas.read_json(), and most parameters are passed through - see its docstring.

Differences: orient is ‘records’ by default, with lines=True; this is appropriate for line-delimited “JSON-lines” data, the kind of JSON output that is most common in big-data scenarios, and which can be chunked when reading (see read_json()). All other options require blocksize=None, i.e., one partition per input file.

- Parameters
- url_path: str, list of str
Location to read from. If a string, can include a glob character to find a set of file names. Supports protocol specifications such as "s3://".
- encoding, errors:
The text encoding to implement, e.g., “utf-8”, and how to respond to errors in the conversion (see str.encode() and bytes.decode()).
- orient, lines, kwargs
Passed to pandas; if not specified, lines=True when orient=’records’, False otherwise.
- storage_options: dict
Passed to backend file-system implementation
- blocksize: None or int
If None, files are not blocked, and you get one partition per input file. If int, which can only be used for line-delimited JSON files, each partition will be approximately this size in bytes, to the nearest newline character.
- sample: int
Number of bytes to pre-load, to provide an empty dataframe structure to any blocks without data. Only relevant when using blocksize.
- compression: string or None
String like ‘gzip’ or ‘xz’.
- engine: callable or str, default pd.read_json
The underlying function that dask will use to read JSON files. By default, this will be the pandas JSON reader (pd.read_json). If a string is specified, this value will be passed under the engine keyword argument to pd.read_json (only supported for pandas>=2.0). See the engine example under Examples below.
- include_path_column: bool or str, optional
Include a column with the file path where each row in the dataframe originated. If True, a new column is added to the dataframe called path. If str, sets new column name. Default is False. An example combining include_path_column and path_converter is shown under Examples below.
- path_converter: function or None, optional
A function that takes one argument and returns a string. Used to convert paths in the path column, for instance, to strip a common prefix from all the paths.
- meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta. An example passing meta as a dict is shown under Examples below.
- Returns
- dask.DataFrame
Examples
Load single file
>>> dd.read_json('myfile.1.json')
Load multiple files
>>> dd.read_json('myfile.*.json')
>>> dd.read_json(['myfile.1.json', 'myfile.2.json'])
Load large line-delimited JSON files using partitions of approx 256MB size
>>> dd.read_json('data/file*.json', blocksize=2**28)
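Specify metadata explicitly rather than letting dask infer it (a sketch; the file name and the column names/dtypes are illustrative, not part of the API)
>>> dd.read_json('myfile.1.json', meta={'id': 'int64', 'value': 'float64'})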
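Record each row's source file and strip the directory prefix from it (a sketch with hypothetical file names; any one-argument function returning a string can serve as path_converter)
>>> import os
>>> dd.read_json('data/file*.json', include_path_column='source_file',
...              path_converter=os.path.basename)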
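Choose the underlying JSON parser (string engine names are forwarded to pd.read_json and require pandas>=2.0; 'pyarrow' is one of pandas' own engine options, used here for illustration)
>>> dd.read_json('myfile.1.json', engine='pyarrow')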