Gene annotation data

Data sources

We currently obtain the gene annotation data from several public data resources and keep them up-to-date, so that you don’t have to do it:

Source        Update frequency                      Notes
-----------   -----------------------------------   --------------------------------------------------------------
NCBI Entrez   weekly snapshot
Ensembl       whenever a new release is available   Ensembl Pre! and EnsemblGenomes are not included at the moment
Uniprot       whenever a new release is available
NetAffx       whenever a new release is available   For “reporter” field
PharmGKB      whenever a new release is available
UCSC          whenever a new release is available   For “exons” field
CPDB          whenever a new release is available   For “pathway” field

The most up-to-date information about the data can be accessed here.

Gene object

Gene annotation data are both stored and returned as a gene object, which is essentially a collection of fields (attributes) and their values:

{
    "_id": "1017",
    "_score": 20.4676,
    "taxid": 9606,
    "symbol": "CDK2",
    "entrezgene": 1017,
    "name": "cyclin-dependent kinase 2",
    "genomic_pos": {
        "start": 55966769,
        "chr": "12",
        "end": 55972784,
        "strand": 1
    }
}

The example above omits most of the available fields. For a full example, you can check out a few gene examples: CDK2, ADA. Or, did you try our interactive API page yet?
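As a minimal sketch of working with a gene object on the client side, the snippet below parses the trimmed CDK2 example and reads a few of its fields. The helper name `describe_gene` is hypothetical, and the code assumes “genomic_pos” is a single object as in the example; it falls back to a placeholder when that is not the case.

```python
import json

# The trimmed gene object from the example in this section.
GENE_JSON = """
{
    "_id": "1017",
    "_score": 20.4676,
    "taxid": 9606,
    "symbol": "CDK2",
    "entrezgene": 1017,
    "name": "cyclin-dependent kinase 2",
    "genomic_pos": {
        "start": 55966769,
        "chr": "12",
        "end": 55972784,
        "strand": 1
    }
}
"""


def describe_gene(gene):
    """Return a one-line summary of a gene object (hypothetical helper)."""
    pos = gene.get("genomic_pos")
    if isinstance(pos, dict):
        strand = "+" if pos["strand"] == 1 else "-"
        location = "chr{}:{}-{} ({})".format(
            pos["chr"], pos["start"], pos["end"], strand
        )
    else:
        # Assumption: anything other than a single object is left uninterpreted.
        location = "location unavailable"
    return "{} ({}), taxid {}, {}".format(
        gene["symbol"], gene["name"], gene["taxid"], location
    )


gene = json.loads(GENE_JSON)
summary = describe_gene(gene)
print(summary)
```

Because a gene object is just a collection of fields and values, it maps directly onto a Python dictionary, and fields missing from a given gene can be handled with `dict.get`.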

_id field

Each individual gene object contains an “_id” field as the primary key. The value of the “_id” field is the NCBI gene ID (the same as the “entrezgene” field, but as a string) if one is available for the gene object; otherwise, the Ensembl gene ID is used (e.g. for Ensembl-only genes). Here is an example. We recommend using the “entrezgene” field for the NCBI gene ID and the “ensembl.gene” field for the Ensembl gene ID, instead of the “_id” field.

Note

Regardless of what the value of the “_id” field looks like, either the NCBI gene ID or the Ensembl gene ID always works for our gene annotation service /v3/gene/<geneid>.
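The recommendation above could be followed on the client side roughly as sketched here. The helper names `preferred_ids` and `annotation_path` are hypothetical, the sample gene objects are illustrative, and the code assumes that “ensembl.gene” denotes a “gene” key nested inside a single “ensembl” object.

```python
def preferred_ids(gene):
    """Return (ncbi_id, ensembl_id) read from the recommended fields.

    Either value is None when the gene object lacks that field.
    """
    ncbi_id = gene.get("entrezgene")
    ensembl = gene.get("ensembl")
    # Assumption: "ensembl.gene" means a "gene" key nested under "ensembl".
    ensembl_id = ensembl.get("gene") if isinstance(ensembl, dict) else None
    return ncbi_id, ensembl_id


def annotation_path(gene):
    """Build the /v3/gene/<geneid> path, preferring the NCBI gene ID."""
    ncbi_id, ensembl_id = preferred_ids(gene)
    gene_id = ncbi_id if ncbi_id is not None else ensembl_id
    if gene_id is None:
        raise ValueError("gene object has neither an NCBI nor an Ensembl gene ID")
    return "/v3/gene/{}".format(gene_id)


# Illustrative gene objects, trimmed to the ID fields.
ncbi_gene = {"_id": "1017", "entrezgene": 1017,
             "ensembl": {"gene": "ENSG00000123374"}}
# Made-up Ensembl ID standing in for an Ensembl-only gene.
ensembl_only_gene = {"_id": "ENSG00000000000",
                     "ensembl": {"gene": "ENSG00000000000"}}

ncbi_path = annotation_path(ncbi_gene)
ensembl_path = annotation_path(ensembl_only_gene)
print(ncbi_path)
print(ensembl_path)
```

Reading the two dedicated fields keeps the ID type explicit, so downstream code never has to guess whether an “_id” value is an NCBI or an Ensembl identifier.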

_score field

You will often see a “_score” field in the returned gene object. It is an internal score representing how well the query matches the returned gene object. It carries little meaning in the gene annotation service when only one gene object is returned. In the gene query service, the returned gene hits are sorted by score in descending order by default.
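The query service already returns hits in score order, but the same ordering may need to be reapplied on the client, for example after merging hits from several queries. A minimal sketch, with made-up hits and a hypothetical helper name `rank_hits`:

```python
# Made-up hits for illustration; only "_score" matters for the ordering.
hits = [
    {"_id": "1", "symbol": "GENE_A", "_score": 3.2},
    {"_id": "2", "symbol": "GENE_B", "_score": 20.4},
    {"_id": "3", "symbol": "GENE_C", "_score": 11.7},
]


def rank_hits(hits):
    """Sort gene hits by "_score" in descending order.

    Hits without a "_score" field are treated as scoring 0.0 (an assumption
    made for this sketch) and therefore sort last.
    """
    return sorted(hits, key=lambda hit: hit.get("_score", 0.0), reverse=True)


ranked = rank_hits(hits)
top_symbols = [hit["symbol"] for hit in ranked]
print(top_symbols)
```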

Species

We support ALL species annotated by NCBI and Ensembl. All of our services allow you to pass a “species” parameter to limit the query results. The “species” parameter accepts taxonomy ids as input. You can look up the taxonomy ids for your favorite species in NCBI Taxonomy.

For convenience, we allow you to pass these common names for commonly used species (e.g. “species=human,mouse,rat”):

Common name   Genus name                Taxonomy id
-----------   -----------------------   -----------
human         Homo sapiens              9606
mouse         Mus musculus              10090
rat           Rattus norvegicus         10116
fruitfly      Drosophila melanogaster   7227
nematode      Caenorhabditis elegans    6239
zebrafish     Danio rerio               7955
thale-cress   Arabidopsis thaliana      3702
frog          Xenopus tropicalis        8364
pig           Sus scrofa                9823

If needed, you can pass “species=all” to query against all available species, although we recommend passing only the specific species you need for a faster response.
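A client that accepts both common names and taxonomy ids might normalize them into a single “species” value, as in the sketch below. The helper name `species_param` is hypothetical, and passing several taxonomy ids as a comma-separated list is an assumption modeled on the “species=human,mouse,rat” form shown above.

```python
# Common names and taxonomy ids from the table in this section.
COMMON_NAME_TO_TAXID = {
    "human": 9606,
    "mouse": 10090,
    "rat": 10116,
    "fruitfly": 7227,
    "nematode": 6239,
    "zebrafish": 7955,
    "thale-cress": 3702,
    "frog": 8364,
    "pig": 9823,
}


def species_param(species):
    """Build a value for the "species" parameter (hypothetical helper).

    Accepts a list of common names and/or integer taxonomy ids.
    Returns "all" as soon as "all" is requested.
    """
    parts = []
    for item in species:
        if isinstance(item, int):
            parts.append(str(item))
        elif item == "all":
            return "all"
        elif item in COMMON_NAME_TO_TAXID:
            parts.append(str(COMMON_NAME_TO_TAXID[item]))
        else:
            raise ValueError("unknown species name: {!r}".format(item))
    # Assumption: multiple taxonomy ids are joined with commas.
    return ",".join(parts)


model_organisms = species_param(["human", "mouse", "rat"])
print(model_organisms)
```

Rejecting unknown names early keeps a typo from silently turning into an unrestricted or empty query.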

Genome assemblies

Our gene query service supports genome interval queries. We import genomic location data from Ensembl, so all species available there are supported. You can find their reference genome assembly information here.

This table lists the genome assemblies for commonly used species:

Common name   Genus name                Genome assembly
-----------   -----------------------   -------------------------------
human         Homo sapiens              GRCh38 (hg38), also support hg19
mouse         Mus musculus              GRCm38 (mm10), also support mm9
rat           Rattus norvegicus         Rnor_6.0 (rn6)
fruitfly      Drosophila melanogaster   BDGP6 (dm6)
nematode      Caenorhabditis elegans    WBcel235 (ce11)
zebrafish     Danio rerio               GRCz10 (danRer10)
frog          Xenopus tropicalis        JGI_7.0 (xenTro7)
pig           Sus scrofa                Sscrofa10.2 (susScr3)
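
To illustrate what an interval query conceptually matches, the sketch below checks on the client side whether a “genomic_pos” value overlaps a given interval. This is not the service's query syntax. The helper name `overlaps` is hypothetical, and treating a match as any overlap with inclusive coordinates is an assumption. Coordinates are only comparable when they refer to the same assembly.

```python
def overlaps(pos, chrom, start, end):
    """Return True if a genomic_pos dict overlaps chrom:start-end.

    Assumption: both ranges use inclusive start and end coordinates
    on the same genome assembly.
    """
    return pos["chr"] == chrom and pos["start"] <= end and start <= pos["end"]


# The "genomic_pos" value from the CDK2 gene object example.
cdk2_pos = {"start": 55966769, "chr": "12", "end": 55972784, "strand": 1}

hit_partial = overlaps(cdk2_pos, "12", 55960000, 55970000)
hit_far_away = overlaps(cdk2_pos, "12", 56000000, 56100000)
hit_other_chrom = overlaps(cdk2_pos, "1", 55960000, 55970000)
print(hit_partial, hit_far_away, hit_other_chrom)
```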

Available fields

The table below lists all of the possible fields that could be in a gene object.

Field Indexed Type Notes