[ensembl-dev] Why is there a 3-fold size difference between the Ensembl and RefSeq human transcriptomes?

Mon Jul 9 16:29:21 BST 2012

Hello Ensembl,

I've compared the number of human transcripts in Ensembl vs. RefSeq. Ensembl contains ~195k transcripts, of which 150k are tagged as KNOWN. In comparison, the current RefSeq transcriptome for human contains only around 43k transcripts. What is causing this large difference in transcriptome size? If it were just a matter of different cutoffs in the annotation pipelines, I would not expect such a dramatic difference.

Gene numbers also seem to differ about 2-fold (54k in Ensembl vs. 23k in RefSeq). To some extent this seems to be due to more non-coding, putative, and predicted entries in Ensembl, but I'm still surprised by the size of the difference.
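
To make the fold differences concrete, here is a quick sketch that just divides the rounded counts quoted in this message. The variable names are only illustrative, and the inputs are the approximate figures above, not exact database counts.

```python
# Rounded counts as quoted in this message (approximate, not exact database counts).
counts = {
    "ensembl_transcripts_all": 195_000,
    "ensembl_transcripts_known": 150_000,
    "refseq_transcripts": 43_000,
    "ensembl_genes": 54_000,
    "refseq_genes": 23_000,
}

# Fold difference = Ensembl count divided by the corresponding RefSeq count.
ratios = {
    "all_transcripts_vs_refseq": counts["ensembl_transcripts_all"] / counts["refseq_transcripts"],
    "known_transcripts_vs_refseq": counts["ensembl_transcripts_known"] / counts["refseq_transcripts"],
    "genes_vs_refseq": counts["ensembl_genes"] / counts["refseq_genes"],
}

for name, value in ratios.items():
    print(f"{name}: {value:.1f}-fold")
```

With these rounded inputs, the KNOWN subset is roughly 3.5 times the size of RefSeq, the full Ensembl set is roughly 4.5 times, and the gene counts differ by a factor of about 2.3.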

Is there any way to filter the Ensembl transcripts (in addition to filtering by transcript status) to get a more conservative set of transcripts, similar to the one from NCBI?

Best regards,
Holger Brandl

--
Dr. Holger Brandl
Bioinformatics Service
Max Planck Institute of Molecular Cell Biology and Genetics
Pfotenhauerstrasse 108
01307 Dresden, Germany

Tel.:   +49/351/210-2738
Fax:    +49 351 210 2000
www:  http://www.mpi-cbg.de