Categoricals
Dask DataFrame divides categorical data into two types:
- Known categoricals have the categories known statically (on the _meta attribute). Each partition must have the same categories as found on the _meta attribute.
- Unknown categoricals don’t know the categories statically, and may have different categories in each partition. Internally, unknown categoricals are indicated by the presence of dd.utils.UNKNOWN_CATEGORIES in the categories on the _meta attribute. Since most DataFrame operations propagate the categories, the known/unknown status should propagate through operations (similar to how NaN propagates).
For metadata specified as a description (option 2 above), unknown categoricals are created.
Certain operations are only available for known categoricals. For example,
df.col.cat.categories works only if df.col has known categories,
since the categorical mapping is known statically only on the metadata of known
categoricals.
The known/unknown status for a categorical column can be found using the
known property on the categorical accessor:
>>> ddf.col.cat.known
False
Additionally, an unknown categorical can be converted to known using
.cat.as_known(). If you have multiple categorical columns in a DataFrame,
you may instead want to use df.categorize(columns=...), which will convert
all specified columns to known categoricals. Since getting the categories
requires a full scan of the data, using df.categorize() is more efficient
than calling .cat.as_known() for each column (which would result in
multiple scans):
>>> col_known = ddf.col.cat.as_known() # use for single column
>>> col_known.cat.known
True
>>> ddf_known = ddf.categorize() # use for multiple columns
>>> ddf_known.col.cat.known
True
To convert a known categorical to an unknown categorical, there is also the
.cat.as_unknown() method. This requires no computation as it’s just a
change in the metadata.
Non-categorical columns can be converted to categoricals in a few different ways:
# astype operates lazily, and results in unknown categoricals
ddf = ddf.astype({'mycol': 'category', ...})
# or
ddf['mycol'] = ddf.mycol.astype('category')
# categorize requires computation, and results in known categoricals
ddf = ddf.categorize(columns=['mycol', ...])
Additionally, with Pandas 0.19.2 and up, dd.read_csv and dd.read_table
can read data directly into unknown categorical columns by specifying a column
dtype as 'category':
>>> ddf = dd.read_csv(..., dtype={col_name: 'category'})
Moreover, with Pandas 0.21.0 and up, dd.read_csv and dd.read_table can read
data directly into known categoricals by specifying instances of
pd.api.types.CategoricalDtype:
>>> dtype = {'col': pd.api.types.CategoricalDtype(['a', 'b', 'c'])}
>>> ddf = dd.read_csv(..., dtype=dtype)
If you write a DataFrame to parquet and read it back, Dask will forget known categories. This happens because, due to performance concerns, all the categories are saved in every partition rather than in the parquet metadata. Since every partition carries the full set of categories, it is possible to load them manually from the first row:
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df = pd.DataFrame(data=list('abcaabbcc'), columns=['col'])
>>> df.col = df.col.astype('category')
>>> ddf = dd.from_pandas(df, npartitions=1)
>>> ddf.col.cat.known
True
>>> ddf.to_parquet('tmp')
>>> ddf2 = dd.read_parquet('tmp')
>>> ddf2.col.cat.known
False
>>> ddf2.col = ddf2.col.cat.set_categories(ddf2.col.head(1).cat.categories)
>>> ddf2.col.cat.known
True