dask.dataframe.read_json
- dask.dataframe.read_json(url_path, orient='records', lines=None, storage_options=None, blocksize=None, sample=1048576, encoding='utf-8', errors='strict', compression='infer', meta=None, engine=<function read_json>, include_path_column=False, path_converter=None, **kwargs)
Create a dataframe from a set of JSON files
This utilises pandas.read_json(), and most parameters are passed through - see its docstring.

Differences: orient is ‘records’ by default, with lines=True; this is appropriate for line-delimited “JSON-lines” data, the kind of JSON output that is most common in big-data scenarios, and which can be chunked when reading (see read_json()). All other options require blocksize=None, i.e., one partition per input file.

- Parameters
- url_path: str, list of str
Location to read from. If a string, can include a glob character to find a set of file names. Supports protocol specifications such as "s3://".
- encoding, errors:
The text encoding to implement, e.g., “utf-8”, and how to respond to errors in the conversion (see str.encode() and bytes.decode()).
- orient, lines, kwargs
Passed to pandas; if not specified, lines=True when orient=’records’, False otherwise.
- storage_options: dict
Passed to backend file-system implementation
- blocksize: None or int
If None, files are not blocked, and you get one partition per input file. If int, which can only be used for line-delimited JSON files, each partition will be approximately this size in bytes, to the nearest newline character.
- sample: int
Number of bytes to pre-load, to provide an empty dataframe structure to any blocks without data. Only relevant when using blocksize.
- compression: string or None
String like ‘gzip’ or ‘xz’.
- engine: callable or str, default pd.read_json
The underlying function that dask will use to read JSON files. By default, this will be the pandas JSON reader (pd.read_json). If a string is specified, this value will be passed under the engine keyword argument to pd.read_json (only supported for pandas>=2.0). See the engine example under Examples below.
- include_path_column: bool or str, optional
Include a column with the file path where each row in the dataframe originated. If True, a new column is added to the dataframe called path. If str, sets new column name. Default is False. An example combining include_path_column and path_converter is shown under Examples below.
- path_converter: function or None, optional
A function that takes one argument and returns a string. Used to convert paths in the path column, for instance, to strip a common prefix from all the paths.
- meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta. An example passing meta as a dict is shown under Examples below.
- Returns
- dask.DataFrame
Examples
Load single file
>>> dd.read_json('myfile.1.json')
Load multiple files
>>> dd.read_json('myfile.*.json')
>>> dd.read_json(['myfile.1.json', 'myfile.2.json'])
Load large line-delimited JSON files using partitions of approx 256MB size
>>> dd.read_json('data/file*.json', blocksize=2**28)
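Specify metadata explicitly rather than letting dask infer it (a sketch; the file name and the column names/dtypes are illustrative, not part of the API)
>>> dd.read_json('myfile.1.json', meta={'id': 'int64', 'value': 'float64'})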
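Record each row's source file and strip the directory prefix from it (a sketch with hypothetical file names; any one-argument function returning a string can serve as path_converter)
>>> import os
>>> dd.read_json('data/file*.json', include_path_column='source_file',
...              path_converter=os.path.basename)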
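Choose the underlying JSON parser (string engine names are forwarded to pd.read_json and require pandas>=2.0; 'pyarrow' is one of pandas' own engine options, used here for illustration)
>>> dd.read_json('myfile.1.json', engine='pyarrow')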