dask.dataframe.read_parquet
dask.dataframe.read_parquet(path, columns=None, filters=None, categories=None, index=None, storage_options=None, engine='auto', use_nullable_dtypes: bool | None = None, dtype_backend=None, calculate_divisions=None, ignore_metadata_file=False, metadata_task_size=None, split_row_groups='infer', blocksize='default', aggregate_files=None, parquet_file_extension=('.parq', '.parquet', '.pq'), filesystem=None, **kwargs)
Read a Parquet file into a Dask DataFrame
This reads a directory of Parquet data into a Dask.dataframe, one file per partition. It selects the index among the sorted columns if any exist.
Parameters
- path : str or list
Source directory for data, or path(s) to individual parquet files. Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
- columns : str or list, default None
Field name(s) to read in as columns in the output. By default all non-index fields will be read (as determined by the pandas parquet metadata, if present). Provide a single field name instead of a list to read in the data as a Series.
- filters : Union[List[Tuple[str, str, Any]], List[List[Tuple[str, str, Any]]]], default None
List of filters to apply, like [[('col1', '==', 0), ...], ...]. Using this argument will NOT result in row-wise filtering of the final partitions unless engine="pyarrow" is also specified. For other engines, filtering is only performed at the partition level, that is, to prevent the loading of some row-groups and/or files.
For the “pyarrow” engine, predicates can be expressed in disjunctive normal form (DNF). This means that the inner-most tuple describes a single column predicate. These inner predicates are combined with an AND conjunction into a larger predicate. The outer-most list then combines all of the combined filters with an OR disjunction.
Predicates can also be expressed as a List[Tuple]. These are evaluated as an AND conjunction. To express OR in predicates, one must use the (preferred for “pyarrow”) List[List[Tuple]] notation. Note that the “fastparquet” engine does not currently support DNF for the filtering of partitioned columns (List[Tuple] is required). See the filtering example under Examples below.
- index : str, list or False, default None
Field name(s) to use as the output frame index. By default will be inferred from the pandas parquet file metadata, if present. Use False to read all fields as columns.
- categories : list or dict, default None
For any fields listed here, if the parquet encoding is Dictionary, the column will be created with dtype category. Use only if it is guaranteed that the column is encoded as dictionary in all row-groups. If a list, assumes up to 2**16-1 labels; if a dict, specify the number of labels expected; if None, will load categories automatically for data written by dask/fastparquet, not otherwise.
- storage_options : dict, default None
Key/value pairs to be passed on to the file-system backend, if any. Note that the default file-system backend can be configured with the filesystem argument, described below.
- open_file_options : dict, default None
Key/value arguments to be passed along to AbstractFileSystem.open when each parquet data file is opened for reading. Experimental (optimized) “precaching” for remote file systems (e.g. S3, GCS) can be enabled by adding {"method": "parquet"} under the "precache_options" key. Also, a custom file-open function can be used (instead of AbstractFileSystem.open) by specifying the desired function under the "open_file_func" key.
- engine : {‘auto’, ‘pyarrow’, ‘fastparquet’}, default ‘auto’
Parquet library to use. Defaults to ‘auto’, which uses pyarrow if it is installed, and falls back to fastparquet otherwise.
- use_nullable_dtypes : {False, True}
Whether to use extension dtypes for the resulting DataFrame. use_nullable_dtypes=True is only supported when engine="pyarrow".
Note: This option is deprecated. Use dtype_backend instead.
- dtype_backend : {‘numpy_nullable’, ‘pyarrow’}, defaults to NumPy-backed DataFrames
Which dtype backend to use. If ‘numpy_nullable’ is set, nullable dtypes are used for all dtypes that have a nullable implementation; if ‘pyarrow’ is set, pyarrow-backed dtypes are used for all dtypes. dtype_backend="pyarrow" requires pandas 1.5+.
- calculate_divisions : bool, default False
Whether to use min/max statistics from the footer metadata (or global _metadata file) to calculate divisions for the output DataFrame collection. Divisions will not be calculated if statistics are missing. This option will be ignored if index is not specified and there is no physical index column specified in the custom “pandas” Parquet metadata. Note that calculate_divisions=True may be extremely slow when no global _metadata file is present, especially when reading from remote storage. Set this to True only when known divisions are needed for your workload (see Partitions).
- ignore_metadata_file : bool, default False
Whether to ignore the global _metadata file (when one is present). If True, or if the global _metadata file is missing, the parquet metadata may be gathered and processed in parallel. Parallel metadata processing is currently supported for ArrowDatasetEngine only.
- metadata_task_size : int, default configurable
If parquet metadata is processed in parallel (see the ignore_metadata_file description above), this argument can be used to specify the number of dataset files to be processed by each task in the Dask graph. If this argument is set to 0, parallel metadata processing will be disabled. The default values for local and remote filesystems can be specified with the “metadata-task-size-local” and “metadata-task-size-remote” config fields, respectively (see “dataframe.parquet”).
- split_row_groups : ‘infer’, ‘adaptive’, bool, or int, default ‘infer’
If True, then each output dataframe partition will correspond to a single parquet-file row-group. If False, each partition will correspond to a complete file. If a positive integer value is given, each dataframe partition will correspond to that number of parquet row-groups (or fewer). If ‘adaptive’, the metadata of each file will be used to ensure that every partition satisfies blocksize. If ‘infer’ (the default), the uncompressed storage-size metadata in the first file will be used to automatically set split_row_groups to either ‘adaptive’ or False.
- blocksize : int or str, default ‘default’
The desired size of each output DataFrame partition in terms of total (uncompressed) parquet storage space. This argument is currently used to set the default value of split_row_groups (using row-group metadata from a single file), and will be ignored if split_row_groups is not set to ‘infer’ or ‘adaptive’. The default may be engine-dependent, but is 256 MiB for the ‘pyarrow’ and ‘fastparquet’ engines.
- aggregate_files : bool or str, default None
WARNING: Passing a string argument to aggregate_files will result in experimental behavior. This behavior may change in the future.
Whether distinct file paths may be aggregated into the same output partition. This parameter is only used when split_row_groups is set to ‘infer’, ‘adaptive’ or to an integer >1. A setting of True means that any two file paths may be aggregated into the same output partition, while False means that inter-file aggregation is prohibited.
For “hive-partitioned” datasets, a “partition” column name can also be specified. In this case, we allow the aggregation of any two files sharing a file path up to, and including, the corresponding directory name. For example, if aggregate_files is set to "section" for the directory structure below, 03.parquet and 04.parquet may be aggregated together, but 01.parquet and 02.parquet cannot be. If, however, aggregate_files is set to "region", 01.parquet may be aggregated with 02.parquet, and 03.parquet may be aggregated with 04.parquet (see the aggregation example under Examples below):

dataset-path/
├── region=1/
│   ├── section=a/
│   │   └── 01.parquet
│   └── section=b/
│       └── 02.parquet
└── region=2/
    └── section=a/
        ├── 03.parquet
        └── 04.parquet

Note that the default behavior of aggregate_files is False.
- parquet_file_extension : str, tuple[str], or None, default (“.parq”, “.parquet”, “.pq”)
A file extension or an iterable of extensions to use when discovering parquet files in a directory. Files that don’t match these extensions will be ignored. This argument only applies when path corresponds to a directory and no _metadata file is present (or ignore_metadata_file=True). Passing in parquet_file_extension=None will treat all files in the directory as parquet files.
The purpose of this argument is to ensure that the engine will ignore unsupported metadata files (like Spark’s ‘_SUCCESS’ and ‘crc’ files). It may be necessary to change this argument if the data files in your parquet dataset do not end in “.parq”, “.parquet”, or “.pq”.
- filesystem : “fsspec”, “arrow”, or fsspec.AbstractFileSystem backend to use
Note that the “fastparquet” engine only supports “fsspec” or an explicit fsspec.AbstractFileSystem object. Default is “fsspec”.
- dataset : dict, default None
Dictionary of options to use when creating a pyarrow.dataset.Dataset or fastparquet.ParquetFile object. These options may include a “filesystem” key (or “fs” for the “fastparquet” engine) to configure the desired file-system backend. However, the top-level filesystem argument will always take precedence.
NOTE: For the “pyarrow” engine, the dataset options may include a “partitioning” key. However, since pyarrow.dataset.Partitioning objects cannot be serialized, the value can be a dict of key-word arguments for the pyarrow.dataset.partitioning API (e.g. dataset={"partitioning": {"flavor": "hive", "schema": ...}}). Note that partitioned columns will not be converted to categorical dtypes when a custom partitioning schema is specified in this way.
- read : dict, default None
Dictionary of options to pass through to engine.read_partitions using the read key-word argument.
- arrow_to_pandas : dict, default None
Dictionary of options to use when converting from pyarrow.Table to a pandas DataFrame object. Only used by the “arrow” engine.
- **kwargs : dict (of dicts)
Options to pass through to engine.read_partitions as stand-alone key-word arguments. Note that these options will be ignored by the engines defined in dask.dataframe, but may be used by other custom implementations.
Examples
>>> import dask.dataframe as dd
>>> df = dd.read_parquet('s3://bucket/my-parquet-data')
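To read only a subset of the fields, or to keep the stored index as a regular column, combine the columns and index arguments (the column names below are illustrative):
>>> df = dd.read_parquet(
...     's3://bucket/my-parquet-data',
...     columns=['col1', 'col2'],
...     index=False,
... )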
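A sketch of filtering with the “pyarrow” engine, using the List[List[Tuple]] (DNF) notation described above; inner tuples are combined with AND and the outer lists with OR (column names are illustrative):
>>> df = dd.read_parquet(
...     's3://bucket/my-parquet-data',
...     engine='pyarrow',
...     filters=[[('col1', '==', 0), ('col2', '>', 10)],
...              [('col1', '==', 5)]],
... )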
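storage_options is forwarded to the fsspec backend. For example, assuming a public bucket served through the s3fs backend, anonymous access can be requested like this:
>>> df = dd.read_parquet(
...     's3://bucket/my-parquet-data',
...     storage_options={'anon': True},
... )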
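A sketch of reading with pyarrow-backed dtypes rather than the default NumPy-backed dtypes (requires pandas 1.5+ and the “pyarrow” engine):
>>> df = dd.read_parquet(
...     's3://bucket/my-parquet-data',
...     engine='pyarrow',
...     dtype_backend='pyarrow',
... )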
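When known divisions are needed (e.g. for efficient index-based selection), calculate_divisions=True derives them from the row-group statistics, at the cost of reading more metadata up front. This sketch assumes the dataset has a sorted 'timestamp' column (illustrative name):
>>> df = dd.read_parquet(
...     's3://bucket/my-parquet-data',
...     index='timestamp',
...     calculate_divisions=True,
... )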
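Partition sizing can be tuned with blocksize together with split_row_groups='adaptive', which packs row-groups into partitions of roughly the requested uncompressed size:
>>> df = dd.read_parquet(
...     's3://bucket/my-parquet-data',
...     split_row_groups='adaptive',
...     blocksize='128MiB',
... )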
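For the hive-partitioned layout shown under aggregate_files above, passing the 'section' directory name allows files within the same section directory (e.g. 03.parquet and 04.parquet) to be combined into one partition. Note that string values for aggregate_files are experimental:
>>> df = dd.read_parquet(
...     'dataset-path/',
...     split_row_groups='adaptive',
...     aggregate_files='section',
... )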
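The dataset options carry backend-specific configuration. As noted above, for the “pyarrow” engine a "partitioning" entry can be given as a dict of keyword arguments for pyarrow.dataset.partitioning instead of a Partitioning object; the hive flavor shown here is one possibility:
>>> df = dd.read_parquet(
...     'dataset-path/',
...     engine='pyarrow',
...     filesystem='arrow',
...     dataset={'partitioning': {'flavor': 'hive'}},
... )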