[ensembl-dev] Many Ensembl-Ids lack annotation?

Amonida Zadissa amonida at sanger.ac.uk
Tue Aug 16 17:33:58 BST 2011


Dear Colin,

ENSG00000257107 (AC104389.2) and ENSG00000255592 (AC104389.1) were
both removed from the gene set in Ensembl release 63, because the
evidence these genes were built from was on the wrong strand.

If you look at the region [1] where these two genes resided in Ensembl
release 62, you can see that the HBG1 and HBG2 genes are both on the
opposite strand to these genes, yet their coordinates fully match the
coordinates of the removed genes. This is a strong indication that the
underlying evidence for ENSG00000257107 and ENSG00000255592 was placed
on the wrong strand, especially since both HBG1 and HBG2 are part of
the CCDS gene set, which supports their annotations being the correct
ones rather than those for AC104389.2 and AC104389.1.
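
The check described above (identical coordinates, opposite strand) can
be sketched in Python as follows. This is only an illustration: the
Gene class, the function name and the coordinates are hypothetical
placeholders, not the real positions of HBG1, HBG2 or the removed
genes, and it is not how the Genebuild pipeline itself is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Minimal gene record; strand is +1 (forward) or -1 (reverse)."""
    name: str
    chrom: str
    start: int
    end: int
    strand: int


def same_span_opposite_strand(a: Gene, b: Gene) -> bool:
    """True if both genes cover exactly the same interval on opposite strands."""
    return (
        a.chrom == b.chrom
        and a.start == b.start
        and a.end == b.end
        and a.strand == -b.strand
    )


# Placeholder coordinates for illustration only (not real annotation data).
trusted_gene = Gene("trusted_gene", "11", 1000, 2500, -1)
suspect_model = Gene("suspect_model", "11", 1000, 2500, +1)
unrelated_gene = Gene("unrelated_gene", "11", 4000, 5000, +1)

print(same_span_opposite_strand(trusted_gene, suspect_model))   # True
print(same_span_opposite_strand(trusted_gene, unrelated_gene))  # False
```

A model flagged this way is only a candidate for removal; which of the
two annotations to trust still depends on independent support such as
CCDS membership.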

It is worth mentioning that during and after each full annotation
of the human genome, the gene set undergoes a full quality
assessment, including removal of genes with poor evidence. The two
removed genes belong to this category. Ensembl release 62 was a full
reannotation of the human genome using the GRCh37 assembly.

The other genes that you have listed are all present in the e!63
release; please see below.

* ENSG00000196565 HBG2 11 hemoglobin, gamma G [Source:HGNC Symbol;Acc:4832]
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000196565;r=11:5274420-5667019

* ENSG00000188536 HBA2 16 hemoglobin, alpha 2 [Source:HGNC Symbol;Acc:4824]
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000188536;r=16:222846-223709

* ENSG00000244734 HBB 11 hemoglobin, beta [Source:HGNC Symbol;Acc:4827]
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000244734;r=11:5246694-5250625

* ENSG00000206172 HBA1 16 hemoglobin, alpha 1 [Source:HGNC Symbol;Acc:4823]
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000206172;r=16:226679-227521

* ENSG00000257107 - Removed
* ENSG00000255592 - Removed

* ENSG00000210082
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000210082;r=MT:1671-3229;t=ENST00000387347

* ENSG00000211459
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000211459;r=MT:648-1601;t=ENST00000389680

* ENSG00000105372 RPS19 19 ribosomal protein S19 [Source:HGNC Symbol;Acc:10402]
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000105372;r=19:42363988-42375482;t=ENST00000221975

* ENSG00000198712 MT-CO2 MT mitochondrially encoded cytochrome c oxidase II [Source:HGNC Symbol;Acc:7421]
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000198712;r=MT:7586-8269;t=ENST00000361739
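
To repeat this kind of check for a longer list, the IDs from an
analysis can be compared against an ID list exported for a given
release. A minimal Python sketch, using only the ten IDs from this
thread; the function name is made up for illustration, and
release_63_ids is a stand-in for a real export and holds just the
eight IDs confirmed above:

```python
def split_by_release(query_ids, release_ids):
    """Partition query_ids into (present, missing), keeping input order."""
    release = set(release_ids)
    present = [gene_id for gene_id in query_ids if gene_id in release]
    missing = [gene_id for gene_id in query_ids if gene_id not in release]
    return present, missing


# The ten IDs from the original question, in the order they were listed.
query_ids = [
    "ENSG00000196565", "ENSG00000188536", "ENSG00000244734",
    "ENSG00000206172", "ENSG00000257107", "ENSG00000255592",
    "ENSG00000210082", "ENSG00000211459", "ENSG00000105372",
    "ENSG00000198712",
]

# Stand-in for an exported release 63 ID list: only the eight IDs
# confirmed as present in this thread, not the full gene set.
release_63_ids = [
    "ENSG00000196565", "ENSG00000188536", "ENSG00000244734",
    "ENSG00000206172", "ENSG00000210082", "ENSG00000211459",
    "ENSG00000105372", "ENSG00000198712",
]

present, missing = split_by_release(query_ids, release_63_ids)
print(missing)  # ['ENSG00000257107', 'ENSG00000255592']
```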

Using BioMart, you can get the full list of genes in Ensembl, which
comes to 53893 in total. This set includes all gene types, coding and
non-coding. If you are interested in only the protein-coding genes,
BioMart returns 21494 genes. These queries are based on the latest
Ensembl BioMart, that is, Ensembl release 63.
Hope this information is useful.
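
For anyone who prefers to run such queries from a script, here is a
rough Python sketch that builds a BioMart XML query and the matching
request URL. The martservice endpoint, the dataset name
hsapiens_gene_ensembl, the attribute ensembl_gene_id and the filter
biotype are assumptions about the BioMart configuration and may differ
between releases. The sketch only builds the query and does not
contact the server, and it has not been tested against the live
service.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint; check the BioMart page of the release you need.
MART_URL = "http://www.ensembl.org/biomart/martservice"


def build_gene_query(biotype=None):
    """Return a BioMart XML query listing gene IDs, optionally for one biotype."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset = ET.SubElement(query, "Dataset", {
        "name": "hsapiens_gene_ensembl",  # assumed dataset name
        "interface": "default",
    })
    if biotype is not None:
        # Assumed filter name; restricts the result to one gene biotype.
        ET.SubElement(dataset, "Filter", {"name": "biotype", "value": biotype})
    ET.SubElement(dataset, "Attribute", {"name": "ensembl_gene_id"})
    return ET.tostring(query, encoding="unicode")


def query_url(xml_query):
    """Encode the XML query as the 'query' parameter of a martservice URL."""
    return MART_URL + "?" + urlencode({"query": xml_query})


all_genes_xml = build_gene_query()
coding_xml = build_gene_query("protein_coding")
print(query_url(coding_xml))
```

Fetching either URL (for example with urllib.request.urlopen) should
return one gene ID per line, so the number of lines gives the gene
count. The totals depend on the release being queried.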

Cheers,
Amonida

--
Amonida Zadissa
Ensembl Genebuild team


[1] http://apr2011.archive.ensembl.org/Homo_sapiens/Location/View?db=core;r=11:5266091-5285463


On Tue, Aug 16, 2011 at 04:55:47PM +0200, Colin Davenport wrote:
> Dear Ensembl users,
> 
> firstly, congratulations. Ensembl is a nice resource which we don't have in
> the bacterial world!
> 
> I have a question about some very highly expressed genes which are lacking
> annotations in the current Ensembl database v62.
> 
> I am using the edgeR bioconductor package to analyse human RNA-seq data.
> Some of the most important genes in the dataset
> have Ensembl IDs, but no annotation attached (see examples below,
> eg. ENSG00000257107).
> 
> Are these old, too new or am I missing something here?
> 
> If I look up the gene on the Ensembl website I get. (IDHistory_gene)
> Ensembl gene ENSG00000257107 is no longer in the database and has not been
> mapped to any newer identifiers
> 
> In fact, the edgeR ensembl database has about 52000 entries, but the bioMart
> export only gives me about 22000 entries with annotation.
> Surely at least the important highly expressed genes must have been mapped
> to other identifiers if they have been removed ?
> 
> 
> 
> 
> Thanks for any help!
> Regards,
> Colin
> 
> ENSG00000196565 HBG2 11 hemoglobin, gamma G [Source:HGNC Symbol;Acc:4832]
> ENSG00000188536 HBA2 16 hemoglobin, alpha 2 [Source:HGNC Symbol;Acc:4824]
> ENSG00000244734 HBB 11 hemoglobin, beta [Source:HGNC Symbol;Acc:4827]
> ENSG00000206172 HBA1 16 hemoglobin, alpha 1 [Source:HGNC Symbol;Acc:4823]
> ENSG00000257107
> ENSG00000255592
> ENSG00000210082
> ENSG00000211459
> ENSG00000105372 RPS19 19 ribosomal protein S19 [Source:HGNC Symbol;Acc:10402]
> ENSG00000198712 MT-CO2 MT mitochondrially encoded cytochrome c oxidase II [Source:HGNC Symbol;Acc:7421]

> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/




