dask.dataframe.read_csv
dask.dataframe.read_csv(urlpath, blocksize='default', lineterminator=None, compression='infer', sample=256000, sample_rows=10, enforce=False, assume_missing=False, storage_options=None, include_path_column=False, **kwargs)
Read CSV files into a Dask.DataFrame
This parallelizes the pandas.read_csv() function in the following ways:
It supports loading many files at once using globstrings:
>>> df = dd.read_csv('myfiles.*.csv')
In some cases it can break up large files:
>>> df = dd.read_csv('largefile.csv', blocksize=25e6) # 25MB chunks
It can read CSV files from external resources (e.g. S3, HDFS) by providing a URL:
>>> df = dd.read_csv('s3://bucket/myfiles.*.csv')
>>> df = dd.read_csv('hdfs:///myfiles.*.csv')
>>> df = dd.read_csv('hdfs://namenode.example.com/myfiles.*.csv')
Internally dd.read_csv uses pandas.read_csv() and supports many of the same keyword arguments with the same performance guarantees. See the docstring for pandas.read_csv() for more information on available keyword arguments.
Parameters
- urlpath : string or list
Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
- blocksize : str, int or None, optional
Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores, up to a maximum of 64MB. Can be a number like 64000000 or a string like "64MB". If None, a single block is used for each file.
- sample : int, optional
Number of bytes to use when determining dtypes.
- assume_missing : bool, optional
If True, all integer columns that aren’t specified in dtype are assumed to contain missing values, and are converted to floats. Default is False.
- storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.
- include_path_column : bool or str, optional
Whether or not to include the path to each particular file. If True, a new column called path is added to the dataframe. If str, sets the new column name. Default is False.
- **kwargs
Extra keyword arguments to forward to pandas.read_csv().
Notes
Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time. To fix this, you have a few options:
- Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution.
- Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats.
- Increase the size of the sample using the sample keyword.
It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.
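To see why quoted line terminators are a problem for any splitting scheme based on line boundaries, consider the following illustration, which uses only the Python standard library. It is not Dask's implementation; it only contrasts a quote-aware parse of a whole (hypothetical) CSV text with a naive split on the line terminator.

```python
import csv
import io

# Hypothetical CSV text: the second record has a quoted field containing "\n".
raw = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# A CSV parser that sees the whole text keeps the quoted field in one record.
rows = list(csv.reader(io.StringIO(raw, newline="")))

# Splitting on the line terminator alone cuts the quoted field in two, so a
# chunk boundary placed at that newline would land in the middle of a record.
physical_lines = raw.splitlines()
```

The parser yields a header plus two records, while the naive split yields four physical lines, one of which starts inside a quoted field. Reading each file as a single block with blocksize=None avoids placing a boundary there.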