[ensembl-dev] Short canonical transcripts

Aliz Raksi Rao alizrrao at gmail.com
Wed Nov 28 23:08:21 GMT 2012


I was using the Ensembl API to get protein lengths for all genes, and I
have noticed that occasionally the canonical transcript returned by
$gene->canonical_transcript() is neither the most widely "accepted"
transcript from the literature nor the longest one. For example, DMD is a
well-studied gene that encodes a very large protein, yet its canonical
transcript encodes only 238 AAs, which is obviously too short. The longest
one is 1115 AAs. This happens in other genes as well, e.g. COL3A1.
(Altogether, 200+ genes have a canonical transcript whose protein is less
than 1/4 the length of the gene's longest protein.)

ENSG             ENST             ENSP             HGNC    ProteinLength(AA)  CANONICAL?
ENSG00000168542  ENST00000450867  ENSP00000415346  COL3A1                 90  1
ENSG00000168542  ENST00000317840  ENSP00000315243  COL3A1               1163  0
ENSG00000198947  ENST00000378705  ENSP00000367977  DMD                   238  1
ENSG00000198947  ENST00000541735  ENSP00000444119  DMD                  1115  0

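
The "< 1/4 of the longest" check described above can be sketched as
follows. This is only a minimal, API-independent illustration in Python,
not the script that produced the counts: the function name
short_canonical_genes, the ROWS list, and the ratio parameter are
hypothetical, and the data are just the four rows from the table.

```python
from collections import defaultdict

# Rows copied from the table above:
# (gene ID, transcript ID, protein ID, HGNC symbol, protein length in AA, canonical?)
ROWS = [
    ("ENSG00000168542", "ENST00000450867", "ENSP00000415346", "COL3A1", 90, True),
    ("ENSG00000168542", "ENST00000317840", "ENSP00000315243", "COL3A1", 1163, False),
    ("ENSG00000198947", "ENST00000378705", "ENSP00000367977", "DMD", 238, True),
    ("ENSG00000198947", "ENST00000541735", "ENSP00000444119", "DMD", 1115, False),
]


def short_canonical_genes(rows, ratio=0.25):
    """Return {symbol: (canonical_length, longest_length)} for every gene whose
    canonical protein is shorter than `ratio` times its longest protein.

    Hypothetical helper for illustration; it is not part of the Ensembl API.
    """
    # Group (length, canonical flag) pairs by gene.
    by_gene = defaultdict(list)
    for gene_id, _transcript_id, _protein_id, symbol, length, is_canonical in rows:
        by_gene[(gene_id, symbol)].append((length, is_canonical))

    flagged = {}
    for (_gene_id, symbol), entries in by_gene.items():
        canonical_lengths = [length for length, is_canonical in entries if is_canonical]
        if not canonical_lengths:
            continue  # no canonical transcript recorded for this gene
        canonical_length = canonical_lengths[0]
        longest_length = max(length for length, _ in entries)
        if canonical_length < ratio * longest_length:
            flagged[symbol] = (canonical_length, longest_length)
    return flagged


if __name__ == "__main__":
    for symbol, (canonical_length, longest_length) in short_canonical_genes(ROWS).items():
        print(symbol, canonical_length, longest_length)
```

With the four rows above, both COL3A1 (90 vs. 1163) and DMD (238 vs. 1115)
fall under the 1/4 threshold.
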
My question is: how is the canonical transcript defined? I thought it was
either taken from curated information or, if that wasn't available, the
longest transcript. The API version used is 68. Thanks in advance for your
reply.


