On-disk format
Note
These docs are written for anndata 0.8. Files written before this version may differ in some conventions, but will still be read by newer versions of the library.
AnnData objects are saved on disk to hierarchical array stores like HDF5 (via H5py) and Zarr-Python. This allows us to have very similar structures on disk and in memory.
As an example we’ll look into a typical .h5ad object that’s been through an analysis.
This structure should be largely equivalent to the Zarr structure, though there are a few minor differences.
>>> import h5py
>>> f = h5py.File("02_processed.h5ad", "r")
>>> list(f.keys())
['X', 'layers', 'obs', 'obsm', 'uns', 'var', 'varm']
In general, AnnData objects are composed of various types of elements.
Each element is encoded as either an Array (or Dataset in HDF5 terminology) or a collection of elements (e.g. a Group) in the store.
We record the type of an element using the encoding-type and encoding-version keys in its attributes.
For example, we can see that this file represents an AnnData object from this metadata:
>>> dict(f.attrs)
{'encoding-type': 'anndata', 'encoding-version': '0.1.0'}
Using this information, we’re able to dispatch to readers for the different element types that you’d find in an AnnData object.
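The dispatch described above can be sketched with a small registry keyed by the two attributes. This is an illustrative, pure-Python sketch and not anndata's implementation: `READERS`, `register_reader`, and `read_elem_sketch` are hypothetical names, and plain dicts stand in for HDF5 or Zarr elements.

```python
# Hypothetical registry mapping (encoding-type, encoding-version) to a reader.
READERS = {}


def register_reader(encoding_type, encoding_version):
    """Register a reader function for one element encoding."""
    def decorator(func):
        READERS[(encoding_type, encoding_version)] = func
        return func
    return decorator


@register_reader("array", "0.2.0")
def read_array(elem):
    # A dense array is returned as a plain list in this sketch.
    return list(elem["values"])


@register_reader("string", "0.2.0")
def read_string(elem):
    return str(elem["values"])


def read_elem_sketch(elem):
    """Look up the reader from the element's attributes and call it."""
    attrs = elem["attrs"]
    key = (attrs["encoding-type"], attrs["encoding-version"])
    if key not in READERS:
        raise NotImplementedError(f"no reader registered for {key}")
    return READERS[key](elem)


# Plain dicts stand in for stored elements and their attributes.
stored_array = {
    "attrs": {"encoding-type": "array", "encoding-version": "0.2.0"},
    "values": (1.0, 2.0, 3.0),
}
stored_string = {
    "attrs": {"encoding-type": "string", "encoding-version": "0.2.0"},
    "values": "euclidean",
}
decoded_array = read_elem_sketch(stored_array)
decoded_string = read_elem_sketch(stored_string)
```

Keying on both the type and the version lets a reader for an older encoding coexist with a newer one, which is one way older files could remain readable.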
Dense arrays
Dense numeric arrays have the simplest representation on disk,
as they have native equivalents in H5py Datasets and Zarr Arrays.
We can see an example of this with dimensionality reductions stored in the obsm group:
>>> f["obsm"].visititems(print)
X_pca <HDF5 dataset "X_pca": shape (38410, 50), type "<f4">
X_umap <HDF5 dataset "X_umap": shape (38410, 2), type "<f4">
>>> dict(f["obsm"]["X_pca"].attrs)
{'encoding-type': 'array', 'encoding-version': '0.2.0'}
Sparse arrays
Sparse arrays don’t have a native representation in HDF5 or Zarr,
so we’ve defined our own based on their in-memory structure.
Currently two sparse data formats are supported by AnnData objects, CSC and CSR (corresponding to scipy.sparse.csc_matrix and scipy.sparse.csr_matrix respectively).
These formats represent a two-dimensional sparse array with three one-dimensional arrays, indptr, indices, and data.
Note
A full description of these formats is out of scope for this document, but one is easy to find.
We represent a sparse array as a Group on disk,
where the kind and shape of the sparse array are defined in the Group’s attributes:
>>> dict(f["X"].attrs)
{'encoding-type': 'csr_matrix',
'encoding-version': '0.1.0',
'shape': array([38410, 27899])}
Inside the group are the three constituent arrays:
>>> f["X"].visititems(print)
data <HDF5 dataset "data": shape (41459314,), type "<f4">
indices <HDF5 dataset "indices": shape (41459314,), type "<i4">
indptr <HDF5 dataset "indptr": shape (38411,), type "<i4">
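To make the role of the three arrays concrete, here is a minimal pure-Python sketch of how a CSR layout maps back to a dense matrix, following the standard CSR convention: the entries of row i are `data[indptr[i]:indptr[i + 1]]`, and their column positions are the matching slice of `indices`. The function name `csr_to_dense` and the toy values are made up for illustration; in practice scipy.sparse does this work.

```python
def csr_to_dense(data, indices, indptr, shape):
    """Expand CSR arrays into a dense list-of-lists matrix."""
    n_rows, n_cols = shape
    if len(indptr) != n_rows + 1:
        raise ValueError("indptr must have one more entry than there are rows")
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        # indptr[row]:indptr[row + 1] delimits this row's stored entries.
        for k in range(indptr[row], indptr[row + 1]):
            dense[row][indices[k]] += data[k]
    return dense


# Toy 3 x 3 matrix with three stored values; the middle row is empty.
data = [1.0, 2.0, 3.0]
indices = [0, 2, 1]
indptr = [0, 2, 2, 3]
dense = csr_to_dense(data, indices, indptr, (3, 3))
```

This also explains the shapes in the listing above: `indptr` has one more entry than the number of rows (38411 for 38410 rows), while `data` and `indices` have one entry per stored value.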
DataFrames
DataFrames are saved in a columnar format in a group, so each column of a DataFrame is saved as a separate array. We save a little more information in the attributes here.
>>> dict(f["obs"].attrs)
{'_index': 'Cell',
'column-order': array(['sample', 'cell_type', 'n_genes_by_counts',
'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts',
'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes',
'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes',
'total_counts_mito', 'log1p_total_counts_mito', 'pct_counts_mito',
'label_by_score'], dtype=object),
'encoding-type': 'dataframe',
'encoding-version': '0.2.0'}
These attributes identify the index of the dataframe, as well as the original order of the columns. Each column in this dataframe is encoded as its own array.
>>> dict(f["obs"]["total_counts"].attrs)
{'encoding-type': 'array', 'encoding-version': '0.2.0'}
>>> dict(f["obs"]["cell_type"].attrs)
{'encoding-type': 'categorical', 'encoding-version': '0.2.0', 'ordered': False}
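As a rough sketch of how these attributes could be used when reading, the following pure-Python example reassembles the columns in their recorded order. It assumes that the index values are stored as an array in the group under the name given by `_index`; the function `read_dataframe_sketch`, the dict layout, and the toy values are hypothetical stand-ins for a real store.

```python
def read_dataframe_sketch(group):
    """Return (index, columns) with columns in their original order."""
    attrs = group["attrs"]
    arrays = group["arrays"]
    # Assumption: the index lives in the group under the name in "_index".
    index = list(arrays[attrs["_index"]])
    # Dicts preserve insertion order, so iterating "column-order" restores it.
    columns = {name: list(arrays[name]) for name in attrs["column-order"]}
    for name, values in columns.items():
        if len(values) != len(index):
            raise ValueError(f"column {name!r} does not match the index length")
    return index, columns


obs_group = {
    "attrs": {
        "_index": "Cell",
        "column-order": ["sample", "total_counts"],
        "encoding-type": "dataframe",
        "encoding-version": "0.2.0",
    },
    # A store makes no promise about the order of a group's members.
    "arrays": {
        "total_counts": [1200.0, 950.0],
        "Cell": ["cell_0", "cell_1"],
        "sample": ["a", "b"],
    },
}
index, columns = read_dataframe_sketch(obs_group)
```

Recording the column order as an attribute matters because the order in which a group lists its members need not match the order of the original DataFrame.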
Mappings
Mappings are simply stored as Groups on disk.
These are distinct from DataFrames and sparse arrays since they don’t have any special attributes.
A Group is created for any Mapping in the AnnData object,
including the standard obsm, varm, layers, and uns.
Notably, this definition is used recursively within uns:
>>> f["uns"].visititems(print)
[...]
pca <HDF5 group "/uns/pca" (2 members)>
pca/variance <HDF5 dataset "variance": shape (50,), type "<f4">
pca/variance_ratio <HDF5 dataset "variance_ratio": shape (50,), type "<f4">
[...]
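The recursive structure can be illustrated with nested dicts standing in for Groups. This is only a sketch: `iter_paths` is a hypothetical helper that mimics the path listing printed above, and the values are made up.

```python
def iter_paths(group, prefix=""):
    """Yield the path of every member of a nested mapping, depth first."""
    for name, child in group.items():
        path = f"{prefix}{name}"
        yield path
        if isinstance(child, dict):
            # A nested mapping is itself a group, so recurse into it.
            yield from iter_paths(child, prefix=f"{path}/")


# Nested dicts stand in for the Groups stored under uns.
uns = {
    "pca": {
        "variance": [4.2, 2.1],
        "variance_ratio": [0.5, 0.25],
    },
}
paths = list(iter_paths(uns))
```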
Scalars
Zero-dimensional arrays are used for scalar values (i.e. single values like strings, numbers or booleans).
These should only occur inside of uns, and are common inside of saved parameters:
>>> f["uns/neighbors/params"].visititems(print)
method <HDF5 dataset "method": shape (), type "|O">
metric <HDF5 dataset "metric": shape (), type "|O">
n_neighbors <HDF5 dataset "n_neighbors": shape (), type "<i8">
>>> f["uns/neighbors/params/metric"][()]
'euclidean'
>>> dict(f["uns/neighbors/params/metric"].attrs)
{'encoding-type': 'string', 'encoding-version': '0.2.0'}
Categorical arrays
>>> categorical = f["obs"]["cell_type"]
>>> dict(categorical.attrs)
{'encoding-type': 'categorical', 'encoding-version': '0.2.0', 'ordered': False}
Discrete labels can be efficiently represented with categorical arrays (similar to factors in R).
These arrays encode the labels as small-width integers (codes), which map to the original label set (categories).
We store these two arrays separately:
>>> categorical.visititems(print)
categories <HDF5 dataset "categories": shape (22,), type "|O">
codes <HDF5 dataset "codes": shape (38410,), type "|i1">
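Decoding is a lookup of each code in the categories array, as in this small pure-Python sketch. It assumes the pandas convention that a code of -1 marks a missing value; `decode_categorical` and the example labels are hypothetical.

```python
def decode_categorical(codes, categories):
    """Map integer codes back to their labels."""
    # Assumption: a negative code (-1 in pandas) marks a missing label.
    return [categories[code] if code >= 0 else None for code in codes]


categories = ["B cell", "T cell", "monocyte"]
codes = [1, 1, 0, -1, 2]
labels = decode_categorical(codes, categories)
```

With 22 categories, every code fits in a single byte, which is why the codes above are stored with type `|i1` while the labels themselves are stored only once.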
String arrays
Arrays of strings are handled differently than numeric arrays since numpy doesn’t really have a good way of representing arrays of unicode strings.
anndata assumes strings are text-like data, so it treats them as variable length.
>>> dict(categorical["categories"].attrs)
{'encoding-type': 'string-array', 'encoding-version': '0.2.0'}
Nullable integers and booleans
We support IO with Pandas nullable integer and boolean arrays.
We represent these on disk similarly to numpy masked arrays, julia nullable arrays, or arrow validity bitmaps (see issue 504 for more discussion).
That is, we store an indicator array (or mask) of null values alongside the array of all values.
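>>> import pandas as pd
>>> from anndata.experimental import write_elem  # assumed import path; it varies across anndata versions
>>> h5_file = h5py.File("anndata_format.h5", "a")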
>>> int_array = pd.array([1, None, 3, 4])
>>> int_array
<IntegerArray>
[1, <NA>, 3, 4]
Length: 4, dtype: Int64
>>> write_elem(h5_file, "nullable_integer", int_array)
>>> h5_file["nullable_integer"].visititems(print)
mask <HDF5 dataset "mask": shape (4,), type "|b1">
values <HDF5 dataset "values": shape (4,), type "<i8">
>>> dict(h5_file["nullable_integer"].attrs)
{'encoding-type': 'nullable-integer', 'encoding-version': '0.1.0'}
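Reading such an element back amounts to combining the two arrays, as in this pure-Python sketch. It follows the description above, where a true entry in the mask marks a null value; the helper name `apply_mask` and the placeholder stored under the masked position are assumptions.

```python
def apply_mask(values, mask):
    """Combine stored values with a null mask; True in the mask means null."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    return [None if is_null else value for value, is_null in zip(values, mask)]


# The value stored under a masked position is a placeholder and is ignored.
values = [1, 0, 3, 4]
mask = [False, True, False, False]
restored = apply_mask(values, mask)
```

Keeping the values as a plain integer array means they stay in a native integer dtype on disk (`<i8` above), with the nulls carried entirely by the mask.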