[ensembl-dev] Short canonical transcripts
ktaylor at ebi.ac.uk
Thu Nov 29 11:18:12 GMT 2012
Thank you for your report. Unfortunately, not all canonical transcript assignments for human in Ensembl 68 follow the rules we normally apply. This is a known bug, and we have declared it on our website. The problem was corrected in release 69, where the examples you give are assigned correctly. Please use release 69 if at all possible; if you are unable to update, release 67 should also be correct.
If you are curious about the current rules for assignment, feel free to have a look at ensembl/Bio/EnsEMBL/Utils/TranscriptSelector.pm
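If you want to rerun your own check after switching release, the screen you describe (canonical protein shorter than 1/4 of the longest protein for the same gene) can be sketched roughly as below. This is only an illustration in Python over the four rows you posted, not the logic in TranscriptSelector.pm. The names ROWS and short_canonicals are made up for the example, and it assumes the lengths and canonical flags have already been exported from the API.

```python
from collections import defaultdict

# Rows copied from the report below:
# (HGNC symbol, transcript ID, protein length in AA, canonical flag)
ROWS = [
    ("COL3A1", "ENST00000450867", 90, True),
    ("COL3A1", "ENST00000317840", 1163, False),
    ("DMD", "ENST00000378705", 238, True),
    ("DMD", "ENST00000541735", 1115, False),
]


def short_canonicals(rows, ratio=0.25):
    """Return genes whose canonical protein is shorter than ratio * longest.

    Hypothetical helper for illustration; maps each flagged gene to
    (canonical transcript ID, canonical length, longest length).
    """
    by_gene = defaultdict(list)
    for gene, transcript, length, canonical in rows:
        by_gene[gene].append((transcript, length, canonical))

    flagged = {}
    for gene, transcripts in by_gene.items():
        longest = max(length for _, length, _ in transcripts)
        for transcript, length, canonical in transcripts:
            if canonical and length < ratio * longest:
                flagged[gene] = (transcript, length, longest)
    return flagged


flagged = short_canonicals(ROWS)
for gene, (transcript, length, longest) in sorted(flagged.items()):
    print(f"{gene}: canonical {transcript} is {length} AA, longest is {longest} AA")
```

Both of your example genes are flagged with the release 68 data shown. Running the same screen on lengths exported from release 69 is a quick way to confirm that the assignments you care about have changed.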
Kieron Taylor PhD.
Ensembl Core software developer
EMBL - European Bioinformatics Institute
On 28 Nov 2012, at 23:08, Aliz Raksi Rao wrote:
> I was using the ENSEMBL API to get protein lengths for all genes, and I have noticed that occasionally, the canonical transcript that is returned using $gene->canonical_transcript() is not the most widely "accepted" transcript from the literature, and neither is it the longest one. For example, DMD is a well-studied gene, and encodes a very large protein, yet the canonical transcript is 238 AAs in length and is obviously too short. The longest transcript is 1115 AAs. This happens in other genes as well, e.g. COL3A1. (Altogether, 200+ genes have canonical transcripts whose length is < 1/4 of the longest transcript.)
> ENSG             ENST             ENSP             HGNC    ProteinLength(AA)  CANONICAL?
> ENSG00000168542  ENST00000450867  ENSP00000415346  COL3A1  90                 1
> ENSG00000168542  ENST00000317840  ENSP00000315243  COL3A1  1163               0
> ENSG00000198947  ENST00000378705  ENSP00000367977  DMD     238                1
> ENSG00000198947  ENST00000541735  ENSP00000444119  DMD     1115               0
> My question is: how is canonical defined? I thought it was either curated information, or if this wasn't available, it's the longest transcript. The API version used is 68. Thanks ahead for your reply.
> Aliz R. Rao
> UCLA Geffen School of Medicine
> Department of Human Genetics, Nelson Lab
> 695 Charles E Young Drive S
> Gonda 5554A
> Los Angeles CA 90095-8348 USA
> alizrrao at gmail.com
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/