[ensembl-dev] question regarding refseq exons retrieval

Duarte Molha duartemolha at gmail.com
Fri Mar 13 13:24:35 GMT 2015


Sorry for the crazy formatted text... ;)


=========================
     Duarte Miguel Paulo Molha
         http://about.me/duarte
=========================

On 11 March 2015 at 16:32, Duarte Molha <duartemolha at gmail.com> wrote:

> Thanks Magali
>
> Are there any plans to correct the naming structure of your RefSeq imports?
>
> Take as an example the transcript NM_001164603, associated with gene ASXL1.
>
> In your main database you have associated it with transcript ID ENST00000497249,
> which has a transcript length of 296 bp and 5 exons:
> ENST00000497249
> <http://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000171456;r=20:32358811-32372242;t=ENST00000497249>
>
> However, the true RefSeq transcript is stored in your otherfeatures database
> as a novel transcript with the correct identifier:
> NM_001164603.1
> <http://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=otherfeatures;g=171023;r=20:32358344-32372549;t=NM_001164603.1>
>
> It also contains 5 exons, but has a transcript length of 1070 bp.
>
> However, although that table labels the transcript with the correct RefSeq
> ID, at the exon level you use different identifiers for exons 2, 3 and 4 of
> this transcript:
>
> NM_001164603.1.1
> <http://www.ensembl.org/Homo_sapiens/Location/View?db=otherfeatures;g=171023;r=20:32358294-32358882;t=NM_001164603.1>
> XM_006723729.1.2
> <http://www.ensembl.org/Homo_sapiens/Location/View?db=otherfeatures;g=171023;r=20:32366334-32366516;t=NM_001164603.1>
> XM_006723729.1.3
> <http://www.ensembl.org/Homo_sapiens/Location/View?db=otherfeatures;g=171023;r=20:32367677-32367779;t=NM_001164603.1>
> XM_006723729.1.4
> <http://www.ensembl.org/Homo_sapiens/Location/View?db=otherfeatures;g=171023;r=20:32368965-32369173;t=NM_001164603.1>
> NM_001164603.1.5
> <http://www.ensembl.org/Homo_sapiens/Location/View?db=otherfeatures;g=171023;r=20:32372114-32372599;t=NM_001164603.1>
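The mismatch in the listing above can be checked mechanically. What follows is only a minimal Python sketch, not Ensembl API code: it assumes the exon identifiers follow the `<accession>.<version>.<rank>` pattern visible in the listing, and the helper names `split_exon_id` and `mismatched_ranks` are hypothetical.

```python
# Transcript and exon identifiers copied from the otherfeatures listing above.
TRANSCRIPT_ID = "NM_001164603.1"
EXON_IDS = [
    "NM_001164603.1.1",
    "XM_006723729.1.2",
    "XM_006723729.1.3",
    "XM_006723729.1.4",
    "NM_001164603.1.5",
]


def split_exon_id(exon_id):
    """Split '<accession>.<version>.<rank>' into (transcript_id, rank).

    Assumption: the last dot-separated field is the exon rank.
    """
    transcript_id, rank = exon_id.rsplit(".", 1)
    return transcript_id, int(rank)


def mismatched_ranks(transcript_id, exon_ids):
    """Return the ranks of exons whose ID prefix differs from transcript_id."""
    return [
        rank
        for prefix, rank in map(split_exon_id, exon_ids)
        if prefix != transcript_id
    ]


if __name__ == "__main__":
    print(mismatched_ranks(TRANSCRIPT_ID, EXON_IDS))
```

Run on the five identifiers above, this reports exons 2, 3 and 4 as carrying a prefix other than the transcript's own accession.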
>
>
>
> Many thanks
>
> Duarte
>
>
> On 11 March 2015 at 09:35, mag <mr6 at ebi.ac.uk> wrote:
>
>>  Hi Duarte,
>>
>> I am not convinced all genes in Ensembl will have at least one mapping to
>> RefSeq, but your snippet of code should work regardless.
>>
>>
>> Regards,
>> Magali
>>
>> On 10/03/2015 17:05, Duarte Molha wrote:
>>
>> Thanks, I think I have understood.
>>
>>  Just confirm one thing for me:
>>
>>  If I get all Ensembl transcripts of any given gene, at least one of
>> those transcripts will have a database mapping to RefSeq, correct?
>>
>>  For example, consider this code:
>>
>>  $transcripts = $gene->get_all_Transcripts();
>>  while ( my $transcript = shift @{$transcripts} ) {
>>      my %transcripts_refseq_ids = ();
>>      foreach my $dbe ( @{ $transcript->get_all_DBEntries() } ) {
>>          if ( $dbe->dbname() eq "RefSeq_mRNA" ) {
>>              $transcripts_refseq_ids{ $dbe->display_id() } = 1;
>>          }
>>      }
>>  }
>>
>>  By cycling through all Ensembl transcripts of a gene and checking each
>> for a RefSeq mRNA entry, I should be able to pull out all transcripts
>> that map to RefSeq. Correct?
>>
>>  Thanks
>>
>>  Duarte
>>
>>
>> On 10 March 2015 at 16:20, mag <mr6 at ebi.ac.uk> wrote:
>>
>>>  Hi Duarte,
>>>
>>> It is important to bear in mind that Ensembl and RefSeq transcripts are
>>> different objects.
>>>
>>> There is a large overlap between the two resources, but small
>>> differences in coding sequence and UTRs mean that there is not always a
>>> one-to-one mapping between an Ensembl transcript and a RefSeq transcript.
>>> This also means that an Ensembl transcript might overlap some RefSeq
>>> exons, but not all.
>>>
>>> In your use case, however, you should be able to get the information you
>>> want by replacing the following call:
>>>     $gene->get_all_DBLinks('RefSeq_mRNA')
>>> with:
>>>     $transcript->get_all_DBEntries('RefSeq_mRNA')
>>>
>>> RefSeq_mRNA corresponds to RefSeq transcripts, which we consequently map
>>> to Ensembl transcripts.
>>> With your current script, you are fetching all genes where at least one
>>> transcript is mapped to a RefSeq transcript.
>>> Instead, you can directly fetch only the transcripts which have a
>>> mapping to RefSeq.
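The transcript-level filtering described above can be modelled without a database connection. This is a hedged Python sketch of the logic only, not the Ensembl Perl API: the `Xref` and `Transcript` classes are hypothetical stand-ins for the objects returned by `get_all_DBEntries` and `get_all_Transcripts`, and the second transcript and the `OtherDB` database name are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Xref:
    dbname: str      # external database name, e.g. "RefSeq_mRNA"
    display_id: str  # accession shown to users


@dataclass
class Transcript:
    stable_id: str
    xrefs: list = field(default_factory=list)


def refseq_mrna_ids(transcript):
    """Return the distinct RefSeq_mRNA accessions attached to one transcript."""
    return sorted(
        {x.display_id for x in transcript.xrefs if x.dbname == "RefSeq_mRNA"}
    )


def transcripts_with_refseq(transcripts):
    """Map stable ID -> accessions, keeping only transcripts with a mapping."""
    result = {}
    for transcript in transcripts:
        ids = refseq_mrna_ids(transcript)
        if ids:
            result[transcript.stable_id] = ids
    return result


# Only the ENST00000497249 / NM_001164603 pairing comes from this thread;
# the second entry is a made-up transcript without a RefSeq mapping.
EXAMPLE = [
    Transcript("ENST00000497249", [Xref("RefSeq_mRNA", "NM_001164603")]),
    Transcript("HYPOTHETICAL_TRANSCRIPT", [Xref("OtherDB", "X0001")]),
]

if __name__ == "__main__":
    print(transcripts_with_refseq(EXAMPLE))
```

Filtering per transcript, rather than per gene, means a gene with one mapped transcript no longer drags in all of its unmapped transcripts.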
>>>
>>>
>>> Hope that helps,
>>> Magali
>>>
>>> On 10/03/2015 15:30, Duarte Molha wrote:
>>>
>>> Thanks Kieron
>>>
>>>  But this still leaves me with a question.
>>>
>>>  Say that I have a gene, and I retrieve the correct gene object from
>>> the Ensembl database. How can I output only the transcripts that are
>>> referenced in RefSeq, if not the way I have done it?
>>>
>>>  If I go the normal way, the $gene->get_all_Transcripts() method will
>>> retrieve all Ensembl transcripts. How can I limit it to only get
>>> transcripts that are in RefSeq?
>>>
>>>  Thanks
>>>
>>>  Duarte
>>>
>>>
>>>  On 10 March 2015 at 15:22, Kieron Taylor <ktaylor at ebi.ac.uk> wrote:
>>>
>>>> Dear Duarte,
>>>>
>>>> The issue you have exposed is subtle. You seem to be printing “exon
>>>> stable IDs” but expecting them to be RefSeq accessions. Our mistake was to
>>>> use the RefSeq IDs as arbitrary identifiers for internal use, but I must
>>>> stress that what Ensembl calls a Stable ID must never be assumed to have any
>>>> meaning outside of an Ensembl database. What you want are display labels.
>>>> The exon labels were generated by picking only the first of any possible
>>>> RefSeq IDs, hence you cannot get everything you want in this way.
>>>>
>>>> The correct way to handle this in your code is to fetch the transcript
>>>> name and print that in each exon, as RefSeq IDs refer to transcripts and
>>>> not exons.
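The approach described above, labelling each exon with the name of the transcript it is reached through, can be sketched as follows. This is a minimal Python model, not the Ensembl Perl API: the `Exon` tuple, `TRANSCRIPTS` and `exon_rows` are hypothetical names, only the first three exons are modelled, and sharing them between both transcripts follows the original poster's description.

```python
from collections import namedtuple

Exon = namedtuple("Exon", "stable_id start end")

# Exon stable IDs and coordinates are taken from the output quoted in this
# thread. Each exon object exists once, even when two transcripts share it.
E1 = Exon("NM_001164603.1.1", 30946147, 30946635)
E2 = Exon("NM_001164603.1.2", 30954187, 30954269)
E3 = Exon("NM_001164603.1.3", 30955530, 30955532)

# Transcript name -> ordered exons (first three exons only).
TRANSCRIPTS = {
    "NM_001164603.1": [E1, E2, E3],
    "NM_015338.5": [E1, E2, E3],
}


def exon_rows(transcripts):
    """Build one row per (transcript, exon) pair.

    The label comes from the transcript name and the exon's rank within
    that transcript; the exon's own stable ID is deliberately ignored.
    """
    rows = []
    for name, exons in transcripts.items():
        for rank, exon in enumerate(exons, start=1):
            rows.append((f"{name}.{rank}", exon.start, exon.end))
    return rows


if __name__ == "__main__":
    for row in exon_rows(TRANSCRIPTS):
        print(*row, sep="\t")
```

Because the label is derived per transcript, a shared exon appears once under each transcript that contains it, instead of only under the one accession stored in its stable ID.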
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Kieron
>>>>
>>>>
>>>> Kieron Taylor PhD.
>>>> Ensembl Core senior software developer
>>>>
>>>> EMBL, European Bioinformatics Institute
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> > On 10 Mar 2015, at 11:57, Duarte Molha <duartemolha at gmail.com> wrote:
>>>> >
>>>> > Dear developers
>>>> >
>>>> > I have a script that I wrote (in attachment) that gets me the RefSeq
>>>> > exons for a given input gene.
>>>> >
>>>> > However, when I run it using the gene ASXL1 as an example, the output is:
>>>> >
>>>> > test_query.pl ASXL1
>>>> >
>>>> > QueryName  feature_type  common_name  Biotype         id                chr    start     end       strand
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_001164603.1.1  chr20  30946147  30946635  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_001164603.1.2  chr20  30954187  30954269  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_001164603.1.3  chr20  30955530  30955532  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_001164603.1.4  chr20  30956818  30956926  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.5     chr20  31015931  31016051  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.6     chr20  31016128  31016225  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.7     chr20  31017141  31017234  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.8     chr20  31017704  31017856  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.9     chr20  31019124  31019287  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.10    chr20  31019386  31019482  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.11    chr20  31020683  31020788  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.12    chr20  31021087  31021720  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.13    chr20  31022235  31027122  +
>>>> >
>>>> >
>>>> > As you can see, I am missing some of the exons for transcript NM_015338.5.
>>>> > In this case, the first 3 exons of transcript NM_015338.5 are identical to
>>>> > those of NM_001164603.1, but I would expect to have them listed as:
>>>> >
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.1     chr20  30946147  30946635  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.2     chr20  30954187  30954269  +
>>>> > ASXL1      Exon          ASXL1        protein_coding  NM_015338.5.3     chr20  30955530  30955532  +
>>>> >
>>>> > Can you tell me what is wrong with my approach and how I can retrieve
>>>> > the missing data?
>>>> >
>>>> > Best regards
>>>> >
>>>> > Duarte
>>>> > <test_query.pl>
>>>> > _______________________________________________
>>>> > Dev mailing list    Dev at ensembl.org
>>>> > Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> > Ensembl Blog: http://www.ensembl.info/