Gene annotation data¶
Data sources¶
We currently obtain gene annotation data from several public data resources and keep them up to date, so that you don’t have to:
Source | Update frequency | Notes |
---|---|---|
NCBI Entrez | weekly snapshot | |
Ensembl | whenever a new release is available | Ensembl Pre! and EnsemblGenomes are not included at the moment |
Uniprot | whenever a new release is available | |
NetAffx | whenever a new release is available | For “reporter” field |
PharmGKB | whenever a new release is available | |
UCSC | whenever a new release is available | For “exons” field |
CPDB | whenever a new release is available | For “pathway” field |
The most up-to-date information about the data sources can be accessed here.
Gene object¶
Gene annotation data are both stored and returned as a gene object, which is essentially a collection of fields (attributes) and their values:
{
  "_id": "1017",
  "_score": 20.4676,
  "taxid": 9606,
  "symbol": "CDK2",
  "entrezgene": 1017,
  "name": "cyclin-dependent kinase 2",
  "genomic_pos": {
    "start": 55966769,
    "chr": "12",
    "end": 55972784,
    "strand": 1
  }
}
The example above omits most of the available fields. For full examples, check out a few genes: CDK2, ADA. You can also try our interactive API page.
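As a minimal sketch of how a gene object can be consumed on the client side, the Python snippet below parses the trimmed CDK2 example and reads a few fields defensively. The `describe` helper is hypothetical and not part of the service; it only illustrates that any field other than the ones you explicitly check for may be absent.

```python
import json

# The trimmed gene object from the example above.
raw = """
{
  "_id": "1017",
  "_score": 20.4676,
  "taxid": 9606,
  "symbol": "CDK2",
  "entrezgene": 1017,
  "name": "cyclin-dependent kinase 2",
  "genomic_pos": {"start": 55966769, "chr": "12", "end": 55972784, "strand": 1}
}
"""

gene = json.loads(raw)


def describe(gene):
    """Hypothetical helper: one-line summary that tolerates missing fields."""
    parts = [gene.get("symbol", "?"), f"taxid={gene.get('taxid', '?')}"]
    pos = gene.get("genomic_pos")
    if isinstance(pos, dict):
        parts.append(f"chr{pos['chr']}:{pos['start']}-{pos['end']}")
    return " ".join(parts)


summary = describe(gene)
print(summary)
```

Using `dict.get` with a default keeps the helper usable on objects that carry only a subset of fields.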
_id field¶
Each individual gene object contains an “_id” field as the primary key. The value of the “_id” field is the NCBI gene ID (the same as the “entrezgene” field, but as a string) if one is available for the gene object; otherwise, the Ensembl gene ID is used (e.g. for Ensembl-only genes). Here is an example. We recommend using the “entrezgene” field for the NCBI gene ID and the “ensembl.gene” field for the Ensembl gene ID, instead of the “_id” field.
Note
Regardless of what the value of the “_id” field looks like, either an NCBI gene ID or an Ensembl gene ID always works with our gene annotation service /v3/gene/<geneid>.
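The recommendation above can be sketched in Python as follows. Both helpers are hypothetical. The sketch assumes that “ensembl.gene” is exposed as a nested “ensembl” object with a “gene” key, and the Ensembl ID used in the sample data is a made-up placeholder, not a real identifier.

```python
def preferred_ids(gene):
    """Hypothetical helper: read explicit ID fields instead of relying on "_id"."""
    ensembl = gene.get("ensembl")
    # Assumption: "ensembl.gene" is a nested object; anything else is ignored.
    ensembl_gene = ensembl.get("gene") if isinstance(ensembl, dict) else None
    return {"entrezgene": gene.get("entrezgene"), "ensembl_gene": ensembl_gene}


def annotation_path(gene_id):
    """Build the /v3/gene/<geneid> path for an NCBI or Ensembl gene ID."""
    return f"/v3/gene/{gene_id}"


# Gene with an NCBI gene ID: "_id" is the same value, but as a string.
ncbi_gene = {"_id": "1017", "entrezgene": 1017}

# Ensembl-only gene; "ENSG00000000001" is a placeholder for illustration.
ensembl_only = {"_id": "ENSG00000000001", "ensembl": {"gene": "ENSG00000000001"}}

ncbi_ids = preferred_ids(ncbi_gene)
ensembl_ids = preferred_ids(ensembl_only)
ncbi_path = annotation_path(ncbi_ids["entrezgene"])
ensembl_path = annotation_path(ensembl_ids["ensembl_gene"])
print(ncbi_path, ensembl_path)
```

Keeping the two ID types in separate keys avoids having to guess from the shape of “_id” which kind of identifier a gene object carries.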
_score field¶
You will often see a “_score” field in the returned gene object. It is an internal score representing how well the query matches the returned gene object. It carries little meaning in the gene annotation service, where only one gene object is returned. In the gene query service, the returned gene hits are sorted by score in descending order by default.
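The default ordering described above can be reproduced locally, for example after merging hits from several queries. This is a minimal Python sketch: only the CDK2 entry and its score come from the example gene object, the other hits are made up, and `rank_hits` is a hypothetical helper.

```python
# Hypothetical hits: only the CDK2 entry and its score come from the example
# gene object; the other two entries are made up for illustration.
hits = [
    {"_id": "example-b", "_score": 3.2},
    {"_id": "1017", "symbol": "CDK2", "_score": 20.4676},
    {"_id": "example-a", "_score": 11.0},
]


def rank_hits(hits):
    """Return hits ordered by "_score", highest first; unscored hits go last."""
    return sorted(hits, key=lambda hit: hit.get("_score", float("-inf")), reverse=True)


ranked = rank_hits(hits)
top_ids = [hit["_id"] for hit in ranked]
print(top_ids)
```

Scores are only comparable within one query, so a merged ranking like this is a convenience rather than a meaningful cross-query measure.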
Species¶
We support ALL species annotated by NCBI and Ensembl. All of our services allow you to pass a “species” parameter to limit the query results. The “species” parameter accepts taxonomy ids as input. You can look up the taxonomy ids for your favorite species in NCBI Taxonomy.
For convenience, we allow you to pass these common names for commonly used species (e.g. “species=human,mouse,rat”):
Common name | Genus name | Taxonomy id |
---|---|---|
human | Homo sapiens | 9606 |
mouse | Mus musculus | 10090 |
rat | Rattus norvegicus | 10116 |
fruitfly | Drosophila melanogaster | 7227 |
nematode | Caenorhabditis elegans | 6239 |
zebrafish | Danio rerio | 7955 |
thale-cress | Arabidopsis thaliana | 3702 |
frog | Xenopus tropicalis | 8364 |
pig | Sus scrofa | 9823 |
If needed, you can pass “species=all” to query against all available species, although we recommend passing only the specific species you need for a faster response.
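A small client-side lookup can catch misspelled common names before a request is sent. The Python sketch below mirrors the table above; the helper names are hypothetical, and the comma-separated form follows the “species=human,mouse,rat” example.

```python
# Common names and taxonomy ids, copied from the table above.
COMMON_SPECIES = {
    "human": 9606,
    "mouse": 10090,
    "rat": 10116,
    "fruitfly": 7227,
    "nematode": 6239,
    "zebrafish": 7955,
    "thale-cress": 3702,
    "frog": 8364,
    "pig": 9823,
}


def species_value(names):
    """Hypothetical helper: validate common names and join them with commas."""
    unknown = [name for name in names if name not in COMMON_SPECIES]
    if unknown:
        raise ValueError(f"not a supported common name: {unknown}")
    return ",".join(names)


def taxonomy_ids(names):
    """Hypothetical helper: translate common names into taxonomy ids."""
    return [COMMON_SPECIES[name] for name in names]


param = "species=" + species_value(["human", "mouse", "rat"])
ids = taxonomy_ids(["human", "mouse", "rat"])
print(param, ids)
```

Species without a common name in the table are still queryable by taxonomy id; this lookup only covers the shortcuts.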
Genome assemblies¶
Our gene query service supports genome interval queries. We import genomic location data from Ensembl, so all species available there are supported. You can find information about their reference genome assemblies here.
This table lists the genome assemblies for commonly used species:
Common name | Genus name | Genome assembly |
---|---|---|
human | Homo sapiens | GRCh38 (hg38); hg19 is also supported |
mouse | Mus musculus | GRCm38 (mm10); mm9 is also supported |
rat | Rattus norvegicus | Rnor_6.0 (rn6) |
fruitfly | Drosophila melanogaster | BDGP6 (dm6) |
nematode | Caenorhabditis elegans | WBcel235 (ce11) |
zebrafish | Danio rerio | GRCz10 (danRer10) |
frog | Xenopus tropicalis | JGI_7.0 (xenTro7) |
pig | Sus scrofa | Sscrofa10.2 (susScr3) |
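To illustrate what a genome interval match means, here is a minimal Python sketch that checks locally whether an interval overlaps the “genomic_pos” of the CDK2 example gene object. It does not show the service's own query syntax. The `overlaps` helper is hypothetical and assumes both intervals use inclusive coordinates on the same assembly.

```python
# "genomic_pos" copied from the example CDK2 gene object.
cdk2_pos = {"start": 55966769, "chr": "12", "end": 55972784, "strand": 1}


def overlaps(genomic_pos, chrom, start, end):
    """Hypothetical helper: True if [start, end] overlaps the gene's position.

    Assumption: coordinates are inclusive and refer to the same assembly.
    """
    return (
        str(genomic_pos["chr"]) == str(chrom)
        and genomic_pos["start"] <= end
        and start <= genomic_pos["end"]
    )


inside = overlaps(cdk2_pos, "12", 55970000, 55980000)
downstream = overlaps(cdk2_pos, "12", 55972785, 55980000)
other_chrom = overlaps(cdk2_pos, "1", 55970000, 55980000)
print(inside, downstream, other_chrom)
```

Coordinates differ between assemblies, so an interval taken from hg19 data should not be compared against positions reported for hg38.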
Available fields¶
The table below lists all of the possible fields that could be in a gene object.
Field | Indexed | Type | Notes |
---|