[ensembl-dev] question regarding RefSeq exons retrieval

Wed Mar 11 16:32:38 GMT 2015

Thanks Magali

Are there any plans to correct the naming structure of your RefSeq imports?

Take as an example the transcript NM_001164603, associated with the gene ASXL1.

In your main (core) database you have associated it with transcript ID ENST00000497249,
which has a transcript length of 296 bp and 5 exons:
ENST00000497249
<http://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000171456;r=20:32358811-32372242;t=ENST00000497249>

However, the true RefSeq transcript is stored in your otherfeatures database as a
novel transcript with the correct identifier:
NM_001164603.1
<http://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=otherfeatures;g=171023;r=20:32358344-32372549;t=NM_001164603.1>

It also contains 5 exons, but has a transcript length of 1070 bp.

However, although that database labels the transcript with the correct RefSeq ID,
at the exon level you use different identifiers for exons 2, 3 and 4 of
this transcript!

NM_001164603.1.1
<http://www.ensembl.org/Homo_sapiens/Location/View?db=otherfeatures;g=171023;r=20:32358294-32358882;t=NM_001164603.1>
XM_006723729.1.2
<http://www.ensembl.org/Homo_sapiens/Location/View?db=otherfeatures;g=171023;r=20:32366334-32366516;t=NM_001164603.1>
XM_006723729.1.3
<http://www.ensembl.org/Homo_sapiens/Location/View?db=otherfeatures;g=171023;r=20:32367677-32367779;t=NM_001164603.1>
XM_006723729.1.4
<http://www.ensembl.org/Homo_sapiens/Location/View?db=otherfeatures;g=171023;r=20:32368965-32369173;t=NM_001164603.1>
NM_001164603.1.5
<http://www.ensembl.org/Homo_sapiens/Location/View?db=otherfeatures;g=171023;r=20:32372114-32372599;t=NM_001164603.1>
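The mismatch in the list above can be checked mechanically. The following is a minimal sketch, assuming the exon identifiers always follow the accession.version.rank pattern seen in these five examples; that pattern is inferred from this message only, and both helper names (split_exon_id, mismatched_exon_ids) are made up for illustration. As the quoted replies further down point out, stable IDs should not be assumed to carry meaning, so this is a diagnostic rather than a way to recover RefSeq membership.

```perl
use strict;
use warnings;

# Split an exon ID such as "XM_006723729.1.2" into accession, version and
# exon rank. Returns an empty list if the ID does not follow that pattern.
# (The pattern is an assumption based on the examples in this message.)
sub split_exon_id {
    my ($exon_id) = @_;
    if ( $exon_id =~ /^([A-Z]{2}_\d+)\.(\d+)\.(\d+)$/ ) {
        return ( $1, $2, $3 );
    }
    return ();
}

# Return the exon IDs whose accession.version prefix differs from the
# transcript ID they are attached to. IDs that do not parse are skipped.
sub mismatched_exon_ids {
    my ( $transcript_id, @exon_ids ) = @_;
    my @mismatched;
    for my $exon_id (@exon_ids) {
        my @parts = split_exon_id($exon_id);
        next unless @parts;
        push @mismatched, $exon_id
            if "$parts[0].$parts[1]" ne $transcript_id;
    }
    return @mismatched;
}

# The five exon IDs listed for NM_001164603.1 in this message.
my @exon_ids = qw(
    NM_001164603.1.1 XM_006723729.1.2 XM_006723729.1.3
    XM_006723729.1.4 NM_001164603.1.5
);
print "$_\n" for mismatched_exon_ids( 'NM_001164603.1', @exon_ids );
# prints XM_006723729.1.2, XM_006723729.1.3 and XM_006723729.1.4
```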

Many thanks

Duarte

=========================
     Duarte Miguel Paulo Molha
         http://about.me/duarte
=========================

On 11 March 2015 at 09:35, mag <mr6 at ebi.ac.uk> wrote:

>  Hi Duarte,
>
> I am not convinced all genes in Ensembl will have at least one mapping to
> RefSeq, but your snippet of code should work regardless.
>
>
> Regards,
> Magali
>
> On 10/03/2015 17:05, Duarte Molha wrote:
>
> Thanks, I think I have understood.
>
>  Just confirm one thing for me:
>
>  If I get all Ensembl transcripts of any given gene, at least one of those
> transcripts will have a database mapping to RefSeq, correct?
>
>  For example, consider the code:
>
>  $transcripts = $gene->get_all_Transcripts();
>  while ( my $transcript = shift @{$transcripts} ) {
>      my %transcripts_refseq_ids = ();
>      foreach my $dbe ( @{ $transcript->get_all_DBEntries() } ) {
>          if ( $dbe->dbname() eq "RefSeq_mRNA" ) {
>              $transcripts_refseq_ids{ $dbe->display_id() } = 1;
>          }
>      }
>  }
>
>  So by cycling through all Ensembl transcripts of a gene and checking each
> one for a RefSeq mRNA entry, I should be able to pull out all transcripts
> that map to RefSeq. Correct?
>
>  Thanks
>
>  Duarte
>
>
>  =========================
>      Duarte Miguel Paulo Molha
>           http://about.me/duarte
> =========================
>
> On 10 March 2015 at 16:20, mag <mr6 at ebi.ac.uk> wrote:
>
>>  Hi Duarte,
>>
>> It is important to bear in mind that Ensembl and RefSeq transcripts are
>> different objects.
>>
>> There is a large overlap between the two resources, but small differences
>> in coding sequence and UTRs mean that there is not always a one-to-one
>> mapping between an Ensembl transcript and a RefSeq transcript.
>> This also means that an Ensembl transcript might overlap some RefSeq
>> exons, but not all.
>>
>> In your use-case however, you should be able to get the information you
>> want by replacing the following call:
>> $gene->get_all_DBLinks( 'RefSeq_mRNA')
>> with $transcript->get_all_DBEntries('RefSeq_mRNA')
>>
>> RefSeq_mRNA corresponds to RefSeq transcripts, which we consequently map
>> to Ensembl transcripts.
>> With your current script, you are fetching all genes where at least one
>> transcript is mapped to a RefSeq transcript.
>> Instead, you can directly fetch only the transcripts which have a mapping
>> to RefSeq.
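A minimal sketch of the transcript-level lookup suggested above. It assumes $gene is a Bio::EnsEMBL::Gene, or any object offering the same get_all_Transcripts, get_all_DBEntries and stable_id methods; the helper name refseq_mrna_by_transcript is made up for illustration and is not part of the Ensembl API.

```perl
use strict;
use warnings;

# Map each Ensembl transcript stable ID to the RefSeq mRNA accessions
# attached to it as external references. Transcripts without a
# RefSeq_mRNA cross-reference are left out of the result.
# Assumption: $gene behaves like a Bio::EnsEMBL::Gene.
sub refseq_mrna_by_transcript {
    my ($gene) = @_;
    my %refseq_for;
    for my $transcript ( @{ $gene->get_all_Transcripts } ) {
        # Ask only for entries from the RefSeq_mRNA external database.
        my @accessions = map { $_->display_id }
            @{ $transcript->get_all_DBEntries('RefSeq_mRNA') };
        $refseq_for{ $transcript->stable_id } = \@accessions
            if @accessions;
    }
    return \%refseq_for;
}
```

With a live connection, $gene would typically come from a gene adaptor obtained through Bio::EnsEMBL::Registry, for instance via fetch_all_by_external_name('ASXL1'). A gene with no RefSeq mapping at all simply yields an empty hash.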
>>
>>
>> Hope that helps,
>> Magali
>>
>> On 10/03/2015 15:30, Duarte Molha wrote:
>>
>> Thanks Keiron
>>
>>  But this still leaves me with a question.
>>
>>  Say that I have a gene, and I retrieve the correct gene object from the
>> Ensembl database. How can I output only the transcripts that are referenced
>> in RefSeq, if not the way I have done it?
>>
>>  If I go the normal way, the $gene->get_all_Transcripts() method will
>> retrieve all Ensembl transcripts. How can I limit it to only get
>> transcripts that are in RefSeq?
>>
>>  Thanks
>>
>>  Duarte
>>
>>  =========================
>>      Duarte Miguel Paulo Molha
>>           http://about.me/duarte
>> =========================
>>
>>  On 10 March 2015 at 15:22, Kieron Taylor <ktaylor at ebi.ac.uk> wrote:
>>
>>> Dear Duarte,
>>>
>>> The issue you have exposed is subtle. You seem to be printing “exon
>>> stable IDs” but expecting them to be RefSeq accessions. Our mistake was to
>>> use the RefSeq IDs as arbitrary identifiers for internal use, but I must
>>> stress that what Ensembl calls a Stable ID must never be assumed to have any
>>> meaning outside of an Ensembl database. What you want are display labels.
>>> The exon labels were generated by picking only the first of any possible
>>> RefSeq IDs, hence you cannot get everything you want in this way.
>>>
>>> The correct way to handle this in your code is to fetch the transcript
>>> name and print that in each exon, as RefSeq IDs refer to transcripts and
>>> not exons.
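A minimal sketch of that approach: each exon is labelled with the name of the transcript it was reached through, so an exon shared by two transcripts is listed once for each. It assumes $gene behaves like a Bio::EnsEMBL::Gene. The helper name exon_rows_by_transcript is made up for illustration, and taking the label from external_name with a fallback to stable_id is an assumption about where a given database keeps the RefSeq accession.

```perl
use strict;
use warnings;

# Build one tab-separated row per (transcript, exon) pair, labelled by the
# transcript rather than by the exon stable ID.
# Assumption: $gene behaves like a Bio::EnsEMBL::Gene.
sub exon_rows_by_transcript {
    my ($gene) = @_;
    my @rows;
    for my $transcript ( @{ $gene->get_all_Transcripts } ) {
        # Assumption: the transcript name carries the RefSeq accession;
        # fall back to the stable ID when no name is set.
        my $label = $transcript->external_name || $transcript->stable_id;
        my $rank  = 0;
        # Exons are numbered here by their position in the returned list.
        for my $exon ( @{ $transcript->get_all_Exons } ) {
            $rank++;
            push @rows, join "\t", $label, $rank,
                $exon->seq_region_name, $exon->seq_region_start,
                $exon->seq_region_end,  $exon->strand;
        }
    }
    return @rows;
}
```

Because the label comes from the transcript, the shared leading exons of two transcripts appear under both names, which is the listing the original question expected.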
>>>
>>>
>>> Regards,
>>>
>>> Kieron
>>>
>>>
>>> Kieron Taylor PhD.
>>> Ensembl Core senior software developer
>>>
>>> EMBL, European Bioinformatics Institute
>>>
>>>
>>>
>>>
>>>
>>> > On 10 Mar 2015, at 11:57, Duarte Molha <duartemolha at gmail.com> wrote:
>>> >
>>> > Dear developers
>>> >
>>> > I have a script that I wrote (attached) that gets me the RefSeq exons for a given input gene.
>>> >
>>> > However, the output when I run this code with the gene ASXL1 as an example is:
>>> >
>>> > test_query.pl ASXL1
>>> >
>>> > QueryName  feature_type  common_name  Biotype         id                chr    start     end       strand
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_001164603.1.1  chr20  30946147  30946635  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_001164603.1.2  chr20  30954187  30954269  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_001164603.1.3  chr20  30955530  30955532  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_001164603.1.4  chr20  30956818  30956926  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.5     chr20  31015931  31016051  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.6     chr20  31016128  31016225  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.7     chr20  31017141  31017234  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.8     chr20  31017704  31017856  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.9     chr20  31019124  31019287  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.10    chr20  31019386  31019482  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.11    chr20  31020683  31020788  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.12    chr20  31021087  31021720  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.13    chr20  31022235  31027122  +
>>> >
>>> >
>>> > As you can see, I am missing some of the exons for transcript NM_015338.5.
>>> > In this case, the first 3 exons of transcript NM_015338.5 are identical to those of NM_001164603.1, but I would expect to have them listed as:
>>> >
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.1     chr20  30946147  30946635  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.2     chr20  30954187  30954269  +
>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.3     chr20  30955530  30955532  +
>>> >
>>> > Can you tell me what is wrong with my approach and how I can retrieve the missing data?
>>> >
>>> > Best regards
>>> >
>>> > Duarte
>>>  > <test_query.pl>_______________________________________________
>>> > Dev mailing list    Dev at ensembl.org
>>> > Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> > Ensembl Blog: http://www.ensembl.info/