Gene annotation data

Data sources

We currently obtain the gene annotation data from several public data resources and keep them up-to-date, so that you don’t have to do it:

Source	Update frequency	Notes
NCBI Entrez	weekly snapshot
Ensembl	whenever a new release is available	Ensembl Pre! and EnsemblGenomes are not included at the moment
UniProt	whenever a new release is available
NetAffx	whenever a new release is available	For “reporter” field
PharmGKB	whenever a new release is available
UCSC	whenever a new release is available	For “exons” field
CPDB	whenever a new release is available	For “pathway” field

The most up-to-date data information can be accessed here.

Gene object

Gene annotation data are both stored and returned as a gene object, which is essentially a collection of fields (attributes) and their values:

{
    "_id": "1017",
    "_score": 20.4676,
    "taxid": 9606,
    "symbol": "CDK2",
    "entrezgene": 1017,
    "name": "cyclin-dependent kinase 2",
    "genomic_pos": {
        "start": 55966769,
        "chr": "12",
        "end": 55972784,
        "strand": 1
    }
}

The example above omits most of the available fields. For full examples, check out a few gene objects: CDK2, ADA. You can also try our interactive API page.
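Because a gene object is a collection of fields and values, it maps directly onto a dictionary once the JSON is parsed. The sketch below reads the abbreviated CDK2 object shown above. The helper name `describe_gene` is hypothetical, and the code assumes that "genomic_pos" is a single object and that a "strand" of 1 denotes the forward strand.

```python
import json

# The abbreviated gene object shown above, as it might arrive in a response body.
GENE_JSON = """
{
    "_id": "1017",
    "_score": 20.4676,
    "taxid": 9606,
    "symbol": "CDK2",
    "entrezgene": 1017,
    "name": "cyclin-dependent kinase 2",
    "genomic_pos": {
        "start": 55966769,
        "chr": "12",
        "end": 55972784,
        "strand": 1
    }
}
"""


def describe_gene(gene: dict) -> str:
    """Build a one-line summary from a gene object (hypothetical helper)."""
    summary = f'{gene["symbol"]} ({gene["name"]}), taxid {gene["taxid"]}'
    pos = gene.get("genomic_pos")
    # Assumption: "genomic_pos" is a single object, as in the example above.
    if isinstance(pos, dict):
        # Assumption: strand 1 is the forward strand, anything else is reverse.
        strand = "+" if pos["strand"] == 1 else "-"
        summary += f', chr{pos["chr"]}:{pos["start"]}-{pos["end"]} ({strand})'
    return summary


gene = json.loads(GENE_JSON)
print(describe_gene(gene))
```

Fields that are absent from a given gene object are simply missing keys, so `gene.get(...)` is the safer way to read anything beyond the core fields.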

_id field

Each individual gene object contains an “_id” field as the primary key. The value of the “_id” field is the NCBI gene ID (the same as the “entrezgene” field, but as a string) if one is available for the gene object; otherwise, the Ensembl gene ID is used (e.g. for Ensembl-only genes). Here is an example. We recommend using the “entrezgene” field for the NCBI gene ID and the “ensembl.gene” field for the Ensembl gene ID, instead of relying on the “_id” field.

Note

Regardless of what the value of the “_id” field looks like, either the NCBI gene ID or the Ensembl gene ID always works for our gene annotation service /v3/gene/<geneid>.
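The recommendation above can be sketched as a small helper that prefers “entrezgene”, falls back to “ensembl.gene”, and only then uses “_id”. The function names are hypothetical, the Ensembl-style ID in the usage lines is a made-up placeholder, and the code assumes “ensembl” is a nested object with a “gene” key.

```python
def preferred_gene_id(gene: dict) -> str:
    """Pick an explicit gene ID instead of relying on "_id" (hypothetical helper)."""
    if "entrezgene" in gene:
        # "entrezgene" is numeric; the service path needs a string.
        return str(gene["entrezgene"])
    ensembl = gene.get("ensembl")
    # Assumption: "ensembl" is a single nested object holding a "gene" key.
    if isinstance(ensembl, dict) and "gene" in ensembl:
        return ensembl["gene"]
    return gene["_id"]


def annotation_path(gene_id) -> str:
    """Build the gene annotation service path for an NCBI or Ensembl gene ID."""
    return f"/v3/gene/{gene_id}"


# A gene with an NCBI gene ID, and a placeholder Ensembl-only gene.
ncbi_gene = {"_id": "1017", "entrezgene": 1017}
ensembl_only = {"_id": "ENSG_PLACEHOLDER", "ensembl": {"gene": "ENSG_PLACEHOLDER"}}

print(annotation_path(preferred_gene_id(ncbi_gene)))
print(annotation_path(preferred_gene_id(ensembl_only)))
```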

_score field

You will often see a “_score” field in the returned gene object, which is the internal score representing how well the query matches the returned gene object. It carries little meaning in the gene annotation service, where only one gene object is returned. In the gene query service, the returned gene hits are sorted by score in descending order by default.
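If you merge or filter hits on the client side, the default ordering can be restored by sorting on “_score”. This is a minimal sketch: the hit list, its IDs, and all scores other than the CDK2 value above are made-up illustrations, and `sort_hits` is a hypothetical helper.

```python
def sort_hits(hits: list) -> list:
    """Return hits ordered by "_score", highest first (hypothetical helper)."""
    # Hits without a "_score" are treated as 0.0 so they sort last.
    return sorted(hits, key=lambda hit: hit.get("_score", 0.0), reverse=True)


# Illustrative hits only; "hit-a" and "hit-b" are placeholders, not real genes.
hits = [
    {"_id": "hit-a", "_score": 3.5},
    {"_id": "1017", "_score": 20.4676},
    {"_id": "hit-b"},
]

for hit in sort_hits(hits):
    print(hit["_id"], hit.get("_score"))
```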

Species

We support ALL species annotated by NCBI and Ensembl. All of our services allow you to pass a “species” parameter to limit the query results. The “species” parameter accepts taxonomy ids as input. You can look up the taxonomy ids for your favorite species in NCBI Taxonomy.

For convenience, we allow you to pass these common names for commonly used species (e.g. “species=human,mouse,rat”):

Common name	Scientific name	Taxonomy id
human	Homo sapiens	9606
mouse	Mus musculus	10090
rat	Rattus norvegicus	10116
fruitfly	Drosophila melanogaster	7227
nematode	Caenorhabditis elegans	6239
zebrafish	Danio rerio	7955
thale-cress	Arabidopsis thaliana	3702
frog	Xenopus tropicalis	8364
pig	Sus scrofa	9823

If needed, you can pass “species=all” to query against all available species, although we recommend passing only the specific species you need for a faster response.
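As an illustration of how the “species” value can be assembled, the sketch below turns a mix of common names and taxonomy ids into a comma-separated string of taxonomy ids. The mapping is copied from the table above. The helper name `species_param` is hypothetical, and resolving names on the client is only one option, since the service accepts the common names directly.

```python
# Common names and taxonomy ids from the table above.
COMMON_NAME_TO_TAXID = {
    "human": 9606,
    "mouse": 10090,
    "rat": 10116,
    "fruitfly": 7227,
    "nematode": 6239,
    "zebrafish": 7955,
    "thale-cress": 3702,
    "frog": 8364,
    "pig": 9823,
}


def species_param(species) -> str:
    """Build a "species" value from common names and/or taxonomy ids (hypothetical helper)."""
    taxids = []
    for item in species:
        key = str(item).strip().lower()
        if key == "all":
            # "all" queries every available species, so nothing else is needed.
            return "all"
        if key in COMMON_NAME_TO_TAXID:
            taxids.append(str(COMMON_NAME_TO_TAXID[key]))
        elif key.isdigit():
            taxids.append(key)
        else:
            raise ValueError(f"unknown species: {item!r}")
    return ",".join(taxids)


print(species_param(["human", "mouse", "rat"]))
```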

Genome assemblies

Our gene query service supports genome interval queries. We import genomic location data from Ensembl, so all species available there are supported. You can find their reference genome assembly information here.

This table lists the genome assemblies for commonly used species:

Common name	Scientific name	Genome assembly
human	Homo sapiens	GRCh38 (hg38); hg19 is also supported
mouse	Mus musculus	GRCm38 (mm10); mm9 is also supported
rat	Rattus norvegicus	Rnor_6.0 (rn6)
fruitfly	Drosophila melanogaster	BDGP6 (dm6)
nematode	Caenorhabditis elegans	WBcel235 (ce11)
zebrafish	Danio rerio	GRCz10 (danRer10)
frog	Xenopus tropicalis	JGI_7.0 (xenTro7)
pig	Sus scrofa	Sscrofa10.2 (susScr3)
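The interval matching itself happens on the server, but the idea can be illustrated locally with the “genomic_pos” fields of a gene object. This is a sketch under assumptions: `overlaps_interval` is a hypothetical helper, coordinates are treated as inclusive on both ends, and the interval is taken to be on the same assembly as the stored positions.

```python
def overlaps_interval(genomic_pos: dict, chrom, start: int, end: int) -> bool:
    """Check whether a gene's position overlaps chrom:start-end (hypothetical helper)."""
    if start > end:
        raise ValueError("start must not be greater than end")
    # Chromosome names are compared as strings, e.g. "12" or "X".
    if str(genomic_pos["chr"]) != str(chrom):
        return False
    # Assumption: both the gene span and the query interval are inclusive.
    return genomic_pos["start"] <= end and start <= genomic_pos["end"]


# "genomic_pos" of the CDK2 example shown earlier on this page.
cdk2_pos = {"start": 55966769, "chr": "12", "end": 55972784, "strand": 1}

print(overlaps_interval(cdk2_pos, "12", 55970000, 55980000))
print(overlaps_interval(cdk2_pos, "11", 55970000, 55980000))
```

Coordinates differ between assemblies, so an interval taken from one assembly should only be compared against positions from the same assembly.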

Available fields

The table below lists all of the possible fields that could be in a gene object.

Field	Indexed	Type	Notes