[ensembl-dev] Why is there a 3-fold size difference between Ensembl and RefSeq human transcriptome?

Amonida Zadissa amonida at sanger.ac.uk
Tue Jul 10 10:56:00 BST 2012


Hi Holger,

The human gene set displayed in Ensembl is a combination of Ensembl and 
Havana annotations. This merged gene set is the GENCODE gene set; the 
human gene set in Ensembl release 67 corresponds to GENCODE release 12.

You're right that a large share of the transcripts in the human set are 
pseudogenes and non-coding models. In release 67, there are 21946 
protein coding genes and 88554 protein coding transcripts in human. If 
you're interested in fetching only the protein coding genes and/or 
transcripts, filtering on their type (biotype) in conjunction with the 
status can be useful. You can get the sets either through BioMart or the 
Ensembl API.
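
As a rough illustration of the BioMart route, here is a minimal Python 
sketch that only builds a BioMart XML query filtering on type and status; 
it does not contact any server. The dataset name (hsapiens_gene_ensembl), 
the filter names (biotype, status) and the attribute names are 
assumptions on my part, so please check them against what the BioMart 
interface lists for the release you are using.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart XML query string (nothing is sent over the network)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # One <Filter> element per restriction, e.g. type and status.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # One <Attribute> element per output column.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed dataset, filter and attribute names; verify them in BioMart.
query_xml = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"biotype": "protein_coding", "status": "KNOWN"},
    attributes=["ensembl_gene_id", "ensembl_transcript_id"],
)
print(query_xml)
```

The resulting XML can be pasted into the BioMart query interface or 
submitted to a BioMart service of your choice.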

You may also be interested in the CCDS set [1]. This is a set of coding 
models that has been agreed upon by Ensembl, Havana, UCSC and NCBI and 
is maintained jointly by these groups. The aim is to have a consensus 
set of gene annotations. In the latest update of the CCDS set there are 
18473 multi-transcript genes and 26473 transcripts. Again, you can 
access the CCDS set either via BioMart or by using our API.
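
If you export a transcript table as tab-separated text, restricting it to 
the CCDS models could look roughly like the Python sketch below. The 
column names (transcript_id, ccds_id) and the example rows are made-up 
placeholders, not real identifiers; adjust them to whatever headers your 
export actually contains.

```python
import csv
import io


def ccds_transcripts(tsv_text, id_column="transcript_id", ccds_column="ccds_id"):
    """Return the IDs of rows whose CCDS column is non-empty."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row[id_column] for row in reader if row[ccds_column].strip()]


# Placeholder rows for illustration only; these are not real identifiers.
example = (
    "transcript_id\tccds_id\n"
    "TX_A\tCCDS_PLACEHOLDER_1\n"
    "TX_B\t\n"
    "TX_C\tCCDS_PLACEHOLDER_2\n"
)
print(ccds_transcripts(example))
```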

Hope this is helpful. Please get in touch if you need more information.

Cheers,
Amonida

[1] http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi

On 09/07/2012 16:29, Holger Brandl wrote:
> Hello Ensembl,
>
> I've compared the number of human transcripts in Ensembl vs RefSeq.
> Ensembl has ~195k transcripts, of which 150k are tagged as KNOWN.
> In comparison the current RefSeq transcriptome for human contains
> just around 43k transcripts. What is causing this huge difference in
> transcriptome size? If it were just about different cutoffs in the
> annotation pipelines I would not expect such a dramatic difference.
>
> When comparing gene numbers there also seems to be a 2-fold
> difference (54k in Ensembl; 23k in RefSeq). To some extent this
> seems to be due to more non-coding and more putative and predicted
> entries in Ensembl, but I'm still surprised about the huge
> difference.
>
> Is there any way to filter (in addition to transcript-status) the
> Ensembl transcripts to get a more conservative set of transcripts
> (similar to the one from NCBI)?
>
> Best regards, Holger Brandl
>
> --
> Dr. Holger Brandl
> Bioinformatics Service
> Max Planck Institute of Molecular Cell Biology and Genetics
> Pfotenhauerstrasse 108
> 01307 Dresden, Germany
>
> Tel.: +49/351/210-2738
> Fax:  +49 351 210 2000
> www:  http://www.mpi-cbg.de
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> List admin (including subscribe/unsubscribe):
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

-- 
Amonida Zadissa Ph.D.
Deputy team leader
EnsEMBL Genebuild team
Wellcome Trust Sanger Institute
England



