anndata.AnnData.concatenate
- AnnData.concatenate(*adatas, join='inner', batch_key='batch', batch_categories=None, uns_merge=None, index_unique='-', fill_value=None)
Concatenate along the observations axis.
The
uns
,varm
andobsm
attributes are ignored.Currently, this works only in
'memory'
mode.Note
For more flexible and efficient concatenation, see:
concat()
.- Parameters
- adatas :
AnnData
AnnData matrices to concatenate with. Each matrix is referred to as a “batch”.
- join :
str
(default:'inner'
) Use intersection (
'inner'
) or union ('outer'
) of variables.- batch_key :
str
(default:'batch'
) Add the batch annotation to
obs
using this key.- batch_categories :
Sequence
[Any
] |None
Optional
[Sequence
[Any
]] (default:None
) Use these as categories for the batch annotation. By default, use increasing numbers.
- uns_merge :
str
|None
Optional
[str
] (default:None
) Strategy to use for merging entries of uns. These strategies are applied recusivley. Currently implemented strategies include:
None
: The default. The concatenated object will just have an empty dict foruns
."same"
: Only entries which have the same value in all AnnData objects are kept."unique"
: Only entries which have one unique value in all AnnData objects are kept."first"
: The first non-missing value is used."only"
: A value is included if only one of the AnnData objects has a value at this path.
- index_unique :
str
|None
Optional
[str
] (default:'-'
) Make the index unique by joining the existing index names with the batch category, using
index_unique='-'
, for instance. ProvideNone
to keep existing indices.- fill_value
Scalar value to fill newly missing values in arrays with. Note: only applies to arrays and sparse matrices (not dataframes) and will only be used if
join="outer"
.Note
If not provided, the default value is
0
for sparse matrices andnp.nan
for numpy arrays. See the examples below for more information.
- adatas :
- Returns
The concatenated
AnnData
, whereadata.obs[batch_key]
stores a categorical variable labeling the batch.- Return type
Notes
Warning
If you use
join='outer'
this fills 0s for sparse data when variables are absent in a batch. Use this with care. Dense data is filled withNaN
. See the examples.Examples
Joining on intersection of variables.
>>> adata1 = AnnData( ... np.array([[1, 2, 3], [4, 5, 6]]), ... dict(obs_names=['s1', 's2'], anno1=['c1', 'c2']), ... dict(var_names=['a', 'b', 'c'], annoA=[0, 1, 2]), ... ) >>> adata2 = AnnData( ... np.array([[1, 2, 3], [4, 5, 6]]), ... dict(obs_names=['s3', 's4'], anno1=['c3', 'c4']), ... dict(var_names=['d', 'c', 'b'], annoA=[0, 1, 2]), ... ) >>> adata3 = AnnData( ... np.array([[1, 2, 3], [4, 5, 6]]), ... dict(obs_names=['s1', 's2'], anno2=['d3', 'd4']), ... dict(var_names=['d', 'c', 'b'], annoA=[0, 2, 3], annoB=[0, 1, 2]), ... ) >>> adata = adata1.concatenate(adata2, adata3) >>> adata AnnData object with n_obs × n_vars = 6 × 2 obs: 'anno1', 'anno2', 'batch' var: 'annoA-0', 'annoA-1', 'annoA-2', 'annoB-2' >>> adata.X array([[2., 3.], [5., 6.], [3., 2.], [6., 5.], [3., 2.], [6., 5.]], dtype=float32) >>> adata.obs anno1 anno2 batch s1-0 c1 NaN 0 s2-0 c2 NaN 0 s3-1 c3 NaN 1 s4-1 c4 NaN 1 s1-2 NaN d3 2 s2-2 NaN d4 2 >>> adata.var.T b c annoA-0 1 2 annoA-1 2 1 annoA-2 3 2 annoB-2 2 1
Joining on the union of variables.
>>> outer = adata1.concatenate(adata2, adata3, join='outer') >>> outer AnnData object with n_obs × n_vars = 6 × 4 obs: 'anno1', 'anno2', 'batch' var: 'annoA-0', 'annoA-1', 'annoA-2', 'annoB-2' >>> outer.var.T a b c d annoA-0 0.0 1.0 2.0 NaN annoA-1 NaN 2.0 1.0 0.0 annoA-2 NaN 3.0 2.0 0.0 annoB-2 NaN 2.0 1.0 0.0 >>> outer.var_names Index(['a', 'b', 'c', 'd'], dtype='object') >>> outer.X array([[ 1., 2., 3., nan], [ 4., 5., 6., nan], [nan, 3., 2., 1.], [nan, 6., 5., 4.], [nan, 3., 2., 1.], [nan, 6., 5., 4.]], dtype=float32) >>> outer.X.sum(axis=0) array([nan, 25., 23., nan], dtype=float32) >>> import pandas as pd >>> Xdf = pd.DataFrame(outer.X, columns=outer.var_names) >>> Xdf a b c d 0 1.0 2.0 3.0 NaN 1 4.0 5.0 6.0 NaN 2 NaN 3.0 2.0 1.0 3 NaN 6.0 5.0 4.0 4 NaN 3.0 2.0 1.0 5 NaN 6.0 5.0 4.0 >>> Xdf.sum() a 5.0 b 25.0 c 23.0 d 10.0 dtype: float32
One way to deal with missing values is to use masked arrays:
>>> from numpy import ma >>> outer.X = ma.masked_invalid(outer.X) >>> outer.X masked_array( data=[[1.0, 2.0, 3.0, --], [4.0, 5.0, 6.0, --], [--, 3.0, 2.0, 1.0], [--, 6.0, 5.0, 4.0], [--, 3.0, 2.0, 1.0], [--, 6.0, 5.0, 4.0]], mask=[[False, False, False, True], [False, False, False, True], [ True, False, False, False], [ True, False, False, False], [ True, False, False, False], [ True, False, False, False]], fill_value=1e+20, dtype=float32) >>> outer.X.sum(axis=0).data array([ 5., 25., 23., 10.], dtype=float32)
The masked array is not saved but has to be reinstantiated after saving.
>>> outer.write('./test.h5ad') >>> from anndata import read_h5ad >>> outer = read_h5ad('./test.h5ad') >>> outer.X array([[ 1., 2., 3., nan], [ 4., 5., 6., nan], [nan, 3., 2., 1.], [nan, 6., 5., 4.], [nan, 3., 2., 1.], [nan, 6., 5., 4.]], dtype=float32)
For sparse data, everything behaves similarly, except that for
join='outer'
, zeros are added.>>> from scipy.sparse import csr_matrix >>> adata1 = AnnData( ... csr_matrix([[0, 2, 3], [0, 5, 6]], dtype=np.float32), ... dict(obs_names=['s1', 's2'], anno1=['c1', 'c2']), ... dict(var_names=['a', 'b', 'c']), ... ) >>> adata2 = AnnData( ... csr_matrix([[0, 2, 3], [0, 5, 6]], dtype=np.float32), ... dict(obs_names=['s3', 's4'], anno1=['c3', 'c4']), ... dict(var_names=['d', 'c', 'b']), ... ) >>> adata3 = AnnData( ... csr_matrix([[1, 2, 0], [0, 5, 6]], dtype=np.float32), ... dict(obs_names=['s5', 's6'], anno2=['d3', 'd4']), ... dict(var_names=['d', 'c', 'b']), ... ) >>> adata = adata1.concatenate(adata2, adata3, join='outer') >>> adata.var_names Index(['a', 'b', 'c', 'd'], dtype='object') >>> adata.X.toarray() array([[0., 2., 3., 0.], [0., 5., 6., 0.], [0., 3., 2., 0.], [0., 6., 5., 0.], [0., 0., 2., 1.], [0., 6., 5., 0.]], dtype=float32)