[ensembl-dev] Short canonical transcripts

Kieron Taylor ktaylor at ebi.ac.uk
Thu Nov 29 11:18:12 GMT 2012

Dear Aliz,

Thank you for your report. Unfortunately, not all canonical transcript assignments for human in Ensembl 68 follow the rules we normally apply. This is a known bug, which we have declared on our website. It was corrected in release 69, where the examples you give are assigned correctly. Please use the newer release if at all possible; if you are unable to update, release 67 should also be correct.

If you are curious about the current rules for assignment, feel free to have a look at ensembl/Bio/EnsEMBL/Utils/TranscriptSelector.pm
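
To check an exported transcript table for the symptom described in the quoted message (a canonical protein shorter than a quarter of the longest protein for the same gene), something like the following minimal sketch may help. It is plain Python and not part of the Ensembl API. The column order (gene ID, transcript ID, protein ID, gene name, protein length, canonical flag) is assumed from the rows in the report, and the function name suspicious_genes and its ratio parameter are hypothetical.

```python
from collections import defaultdict

# Rows copied from the quoted report. Assumed columns: gene ID, transcript ID,
# protein ID, gene name, protein length (AA), canonical flag (1 = canonical).
ROWS = (
    "ENSG00000168542\tENST00000450867\tENSP00000415346\tCOL3A1\t90\t1\n"
    "ENSG00000168542\tENST00000317840\tENSP00000315243\tCOL3A1\t1163\t0\n"
    "ENSG00000198947\tENST00000378705\tENSP00000367977\tDMD\t238\t1\n"
    "ENSG00000198947\tENST00000541735\tENSP00000444119\tDMD\t1115\t0\n"
)


def suspicious_genes(text, ratio=0.25):
    """Map gene name -> (canonical length, longest length) for genes whose
    canonical protein is shorter than ratio * the longest protein."""
    lengths = defaultdict(list)  # gene name -> all protein lengths seen
    canonical = {}               # gene name -> canonical protein length
    for line in text.strip().splitlines():
        _gene_id, _tx_id, _prot_id, name, length, flag = line.split("\t")
        lengths[name].append(int(length))
        if flag == "1":
            canonical[name] = int(length)

    flagged = {}
    for name, canon_len in canonical.items():
        longest = max(lengths[name])
        if canon_len < ratio * longest:
            flagged[name] = (canon_len, longest)
    return flagged


if __name__ == "__main__":
    for name, (canon_len, longest) in sorted(suspicious_genes(ROWS).items()):
        print(f"{name}: canonical {canon_len} AA, longest {longest} AA")
```

Running the same check against a table exported from release 69 would be one way to confirm that the affected genes no longer appear.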


Kieron Taylor PhD.
Ensembl Core software developer

EMBL - European Bioinformatics Institute

On 28 Nov 2012, at 23:08, Aliz Raksi Rao wrote:

> Hello,
> I was using the ENSEMBL API to get protein lengths for all genes, and I have noticed that occasionally, the canonical transcript that is returned using $gene->canonical_transcript() is not the most widely "accepted" transcript from the literature, and neither is it the longest one. For example, DMD is a well-studied gene, and encodes a very large protein, yet the canonical transcript is 238 AAs in length and is obviously too short. The longest transcript is 1115 AAs. This happens in other genes as well, e.g. COL3A1. (Altogether, 200+ genes have canonical transcripts whose length is < 1/4 of the longest transcript.)
> Gene ID	Transcript ID	Protein ID	Gene	Length (AA)	Canonical
> ENSG00000168542	ENST00000450867	ENSP00000415346	COL3A1	90	1
> ENSG00000168542	ENST00000317840	ENSP00000315243	COL3A1	1163	0
> ENSG00000198947	ENST00000378705	ENSP00000367977	DMD	238	1
> ENSG00000198947	ENST00000541735	ENSP00000444119	DMD	1115	0
> My question is: how is canonical defined? I thought it was either curated information, or if this wasn't available, it's the longest transcript. The API version used is 68. Thanks ahead for your reply.
> Best,
> Aliz
> Aliz R. Rao
> UCLA Geffen School of Medicine
> Department of Human Genetics, Nelson Lab
> 695 Charles E Young Drive S
> Gonda 5554A
> Los Angeles CA 90095-8348 USA
> alizrrao at gmail.com
> 714.548.1133