Categoricals
Dask DataFrame divides categorical data into two types:

- Known categoricals have the categories known statically (on the _meta attribute). Each partition must have the same categories as found on the _meta attribute.
- Unknown categoricals don't know the categories statically, and may have different categories in each partition. Internally, unknown categoricals are indicated by the presence of dd.utils.UNKNOWN_CATEGORIES in the categories on the _meta attribute. Since most DataFrame operations propagate the categories, the known/unknown status should propagate through operations (similar to how NaN propagates).

For metadata specified as a description (option 2 above), unknown categoricals are created.
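As a rough illustration (a sketch, assuming ddf has an unknown categorical column named col), the sentinel can be seen directly on the metadata:

>>> import dask.dataframe as dd
>>> dd.utils.UNKNOWN_CATEGORIES in ddf._meta['col'].cat.categories  # unknown categorical
True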
Certain operations are only available for known categoricals. For example, df.col.cat.categories would only work if df.col has known categories, since the categorical mapping is only known statically on the metadata of known categoricals.
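For example (a minimal sketch, assuming ddf.col holds the values 'a', 'b', and 'c'), the categories are only accessible once they are known; on an unknown categorical the same access typically raises an error asking you to convert the column first:

>>> ddf.col.cat.as_known().cat.categories   # known: the mapping lives on the metadata
Index(['a', 'b', 'c'], dtype='object')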
The known/unknown status for a categorical column can be found using the known property on the categorical accessor:
>>> ddf.col.cat.known
False
Additionally, an unknown categorical can be converted to known using .cat.as_known(). If you have multiple categorical columns in a DataFrame, you may instead want to use df.categorize(columns=...), which will convert all specified columns to known categoricals. Since getting the categories requires a full scan of the data, using df.categorize() is more efficient than calling .cat.as_known() for each column (which would result in multiple scans):
>>> col_known = ddf.col.cat.as_known() # use for single column
>>> col_known.cat.known
True
>>> ddf_known = ddf.categorize() # use for multiple columns
>>> ddf_known.col.cat.known
True
To convert a known categorical to an unknown categorical, there is also the .cat.as_unknown() method. This requires no computation, as it is just a change in the metadata.
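A rough sketch of the round trip, reusing ddf_known from the example above:

>>> col_unknown = ddf_known.col.cat.as_unknown()  # metadata-only, no data is read
>>> col_unknown.cat.known
False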
Non-categorical columns can be converted to categoricals in a few different ways:
# astype operates lazily, and results in unknown categoricals
ddf = ddf.astype({'mycol': 'category', ...})
# or
ddf['mycol'] = ddf.mycol.astype('category')
# categorize requires computation, and results in known categoricals
ddf = ddf.categorize(columns=['mycol', ...])
Additionally, with Pandas 0.19.2 and up, dd.read_csv and dd.read_table can read data directly into unknown categorical columns by specifying a column dtype as 'category':
>>> ddf = dd.read_csv(..., dtype={col_name: 'category'})
Moreover, with Pandas 0.21.0 and up, dd.read_csv and dd.read_table can read data directly into known categoricals by specifying instances of pd.api.types.CategoricalDtype:
>>> dtype = {'col': pd.api.types.CategoricalDtype(['a', 'b', 'c'])}
>>> ddf = dd.read_csv(..., dtype=dtype)
If you write to and read from Parquet, Dask will forget known categories. This happens because, due to performance concerns, the categories are saved in every individual partition rather than in the Parquet metadata. It is possible to manually load the categories:
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df = pd.DataFrame(data=list('abcaabbcc'), columns=['col'])
>>> df.col = df.col.astype('category')
>>> ddf = dd.from_pandas(df, npartitions=1)
>>> ddf.col.cat.known
True
>>> ddf.to_parquet('tmp')
>>> ddf2 = dd.read_parquet('tmp')
>>> ddf2.col.cat.known
False
>>> ddf2.col = ddf2.col.cat.set_categories(ddf2.col.head(1).cat.categories)
>>> ddf2.col.cat.known
True