[ensembl-dev] Question regarding canonical transcripts

Andrew Yates ayates at ebi.ac.uk
Tue Jul 26 17:15:09 BST 2016


Hi Duarte

No, we are not saying there are two possible canonical transcripts because of their curated/predicted status.

I did a quick search and found a relevant bit of information on UCSC's genome mailing list. The knownCanonical table is populated by UCSC [1] and not by RefSeq. The rules Ensembl has used to select a canonical transcript from our own gene set [2] and the rules UCSC [3] has used to select from the RefSeq set are not the same.

Neither Ensembl nor UCSC claims that this is a canonical transcript assigned by RefSeq. In both cases it is the application of each group's own rules to an externally imported gene set.
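
As a rough illustration of how two rule sets can disagree on the same gene, here is a minimal Python sketch. It is not Ensembl's TranscriptSelector logic and not UCSC's knownCanonical procedure. The `Candidate` class and both selection functions are hypothetical names made up for this example, and the "curated first" rule is only an assumed alternative. The accessions and translation lengths are the ones quoted further down this thread for SKI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one RefSeq product of a gene."""
    protein_acc: str     # RefSeq protein accession
    translation_aa: int  # translation length in amino acids


def is_curated(candidate: Candidate) -> bool:
    # RefSeq accession convention: NP_ is a curated protein record,
    # XP_ is a predicted (model) protein record.
    return candidate.protein_acc.startswith("NP_")


def pick_longest_translation(candidates):
    # Rule A (assumed simplification): the longest translation wins.
    return max(candidates, key=lambda c: c.translation_aa)


def pick_curated_first(candidates):
    # Rule B (hypothetical): prefer curated records, then longest translation.
    return max(candidates, key=lambda c: (is_curated(c), c.translation_aa))


# Values taken from the SKI example discussed in this thread.
candidates = [
    Candidate("XP_005244832.1", 730),
    Candidate("NP_003027.1", 728),
]

print("Rule A picks:", pick_longest_translation(candidates).protein_acc)
print("Rule B picks:", pick_curated_first(candidates).protein_acc)
```

Under these assumptions Rule A selects the predicted XP_ record because its translation is two residues longer, while Rule B selects the curated NP_ record. Both answers follow from the rules rather than from anything RefSeq itself asserts.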

Andy

1 - https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/_6asF5KciPc/ANihqywjAwAJ
2 - https://github.com/Ensembl/ensembl/blob/release/85/modules/Bio/EnsEMBL/Utils/TranscriptSelector.pm#L46
3 - http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene

------------
Andrew Yates - Genomics Technology Infrastructure Team Leader
The European Bioinformatics Institute (EMBL-EBI)
Wellcome Genome Campus
Hinxton, Cambridge
CB10 1SD, United Kingdom
Tel: +44-(0)1223-492538
Fax: +44-(0)1223-494468
Skype: andy.yates.ebi
http://www.ebi.ac.uk/
http://www.ensembl.org/

> On 26 Jul 2016, at 16:44, Duarte Molha <duartemolha at gmail.com> wrote:
> 
> Now I am really confused!
> 
> Even the UCSC tables link NM_003036.3 as the canonical transcript. Does this mean there can be two possible canonical transcripts, one for curated annotations and one for predicted?
> 
> 
> Here is the table linkage of refseq transcripts in the knownCanonical table
> 
> #filter: kgXref.geneSymbol = 'SKI'
> #hg19.knownCanonical.chrom	hg19.knownCanonical.chromStart	hg19.knownCanonical.chromEnd	hg19.knownCanonical.clusterId	hg19.knownCanonical.transcript	hg19.knownCanonical.protein	hg19.kgXref.geneSymbol	hg19.kgXref.refseq	hg19.kgXref.protAcc	hg19.kgXref.description
> chr1	2160133	2241652	98	uc001aja.4	uc001aja.4	SKI	NM_003036	NP_003027	Homo sapiens v-ski sarcoma viral oncogene homolog (avian) (SKI), mRNA.
> 
> 
> On 26 July 2016 at 16:06, mag <mr6 at ebi.ac.uk> wrote:
> Hi Duarte,
> 
> A canonical transcript is usually the transcript with the longest translation for a given gene:
> http://www.ensembl.org/Help/Glossary?id=346
> 
> In your example, XP_005244832.1 has a translation of 730 aa while NP_003027.1 has only 728.
> Hence XP_005244832.1 is chosen as the canonical transcript.
> 
> As Kieron mentioned, if you want specifically curated RefSeq annotation, it might be better to fetch all external annotations then filter out the ones you are interested in.
> 
> 
> Regards,
> Magali
> 
> 
> On 25/07/2016 17:07, Duarte Molha wrote:
>> I will try and produce here the relevant parts of the script.
>> 
>> But I am still at a loss as to why XP_005244832.1 <http://www.ncbi.nlm.nih.gov/protein/XP_005244832.1> has been tagged as canonical.
>> 
>> Are you saying that I simply might not have cycled through all of the RefSeq transcripts? But is there going to be more than one RefSeq transcript tagged as canonical for each gene?
>> 
>> Not sure I follow!
>> 
>> Thanks
>> 
>> Duarte
>> 
>> Duarte Molha
>> about.me/duarte
>> 
>> On 25 July 2016 at 11:58, Kieron Taylor <ktaylor at ebi.ac.uk> wrote:
>> Hi Duarte,
>> 
>> Can you send us a snippet of code that accesses the external database adaptor (DBEntryAdaptor?)? It sounds like you may not be reading enough of your results to get the RefSeq ID you expect. We have all of the RefSeq IDs you mention associated at some level with the transcript, but some are from "RefSeq peptide predicted", for example.
>> 
>> Kieron
>> 
>> 
>> 
>> Kieron Taylor PhD.
>> Ensembl Developer
>> 
>> EMBL, European Bioinformatics Institute
>> 
>> > On 22 Jul 2016, at 10:47, Duarte Molha <duartemolha at gmail.com> wrote:
>> >
>> > Hi Guys
>> >
>> > I have a script that, given a gene symbol, connects to Ensembl and retrieves the canonical transcript, and then uses the external database adaptor to get the canonical RefSeq transcript.
>> >
>> > However, this does not seem to give me the correct result.
>> >
>> > Take for example the gene SKI (I am using the GRCh37 assembly, by the way).
>> >
>> > If you open this gene on the Ensembl browser:
>> >
>> > http://grch37.ensembl.org/Homo_sapiens/Location/View?db=core;g=ENSG00000157933;r=1:2159997-2161343
>> >
>> >
>> > For SKI, Ensembl annotates ENST00000378536 as the canonical transcript.
>> >
>> > However, using my script, the external database adaptor returns the RefSeq XP_005244832.1 as the RefSeq canonical transcript, even though the correct canonical transcript is NM_003036.3.
>> >
>> > http://www.ncbi.nlm.nih.gov/gene/6497
>> >
>> > Unless I am misunderstanding, if the coding region is the same length in two transcripts, the longer transcript should be the canonical one.
>> >
>> > The longer RefSeq is NM_003036.3 (it has a longer 5' UTR).
>> >
>> > Can you help me understand this?
>> >
>> > Many thanks
>> >
>> > Duarte
>> > _______________________________________________
>> > Dev mailing list    Dev at ensembl.org
>> > Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> > Ensembl Blog: http://www.ensembl.info/
