[ensembl-dev] Performance issues getting spliced sequence from Bio::EnsEMBL::Transcript

Matt Wood matt.wood at codifiedgenomics.com
Mon Nov 24 22:34:43 GMT 2014


I did download a FASTA with the installer and am already using the
--offline flag.

I've got a version mostly working now where I do the following:

1. Get start and end positions using
   Bio::EnsEMBL::Variation::TranscriptVariation::cdna_start and
   Bio::EnsEMBL::Variation::TranscriptVariation::cdna_end
2. Convert those to genomic coordinates using
   Bio::EnsEMBL::TranscriptMapper::cdna2genomic
3. Loop through the returned coords and concatenate the sequence using
   Bio::EnsEMBL::Slice::sub_Slice
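
To make the concatenation in step 3 concrete, here is a rough sketch in plain
Perl that does not touch the Ensembl API at all. It is only an illustration
under some assumptions: the hashes stand in for the coordinate objects that
cdna2genomic returns, the string stands in for the slice, the segments are
assumed to arrive in transcript order with 1-based inclusive positions, and
the names splice_segments and revcomp are made up for this example.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string (needed for minus-strand segments).
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Concatenate genomic pieces into one cDNA fragment.
#   $genome   : plain string standing in for the chromosome slice (assumption)
#   $segments : array ref of { start, end, strand } hashes, 1-based inclusive,
#               in transcript order; { gap => 1 } stands in for a position
#               range that does not map back to the genome (assumption)
sub splice_segments {
    my ($genome, $segments) = @_;
    my $cdna = '';
    for my $seg (@$segments) {
        next if $seg->{gap};    # skip unmapped ranges instead of dying
        my $length = $seg->{end} - $seg->{start} + 1;
        # substr is 0-based and takes a length, not an end position
        my $piece = substr($genome, $seg->{start} - 1, $length);
        $piece = revcomp($piece) if $seg->{strand} < 0;
        $cdna .= $piece;
    }
    return $cdna;
}

# Two "exons" on the forward strand, separated by an "intron".
my $genome = 'AAACCCGGGTTT';
print splice_segments($genome, [
    { start => 1, end => 3, strand => 1 },
    { start => 7, end => 9, strand => 1 },
]), "\n";
```

The two things this sketch handles explicitly, strand and unmapped gaps, are
the sort of corner cases I would expect to matter in the real version.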

I'm a little concerned there may be some corner cases I'm missing
that Bio::EnsEMBL::Transcript::spliced_seq was handling, but so far I
haven't run into any. And it's fast enough.

Matt Wood

On Thu, Nov 13, 2014 at 11:07 PM, Will McLaren <wm2 at ebi.ac.uk> wrote:

> Hi Matt,
>
> I assume you're already using a FASTA file (
> http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#fasta).
>
> There does seem to be some issue with the sequence fetching code, however.
> If you are using the VEP's --offline flag then the issue doesn't appear (at
> least for me). If you use --cache (which asks the code to prefer use of
> offline resources including the FASTA file but still allows connections to
> the DB) then in this particular case calling that method seems to bypass
> the code that fetches sequence from the FASTA file and instead fetches it
> from the DB.
>
> While we look into this issue, can you possibly try using --offline if you
> aren't already?
>
> Regards
>
> Will McLaren
> Ensembl Variation
>
> On 13 November 2014 07:39, Matt Wood <matt.wood at codifiedgenomics.com>
> wrote:
>
>> I'm working on a VEP plugin where I need to look at a section of cDNA
>> around the variant.
>>
>> In a previous plugin, where I needed to do something similar with genomic
>> DNA, I was able to get a slice from the VariationFeature and subslice it
>> like this:
>>
>> my $subseq = $vf->slice->sub_Slice($start, $end)->seq;
>>
>> That worked really well and performed really well.
>>
>> I can't find anything similar for the cDNA so I'm getting the spliced
>> sequence from the transcript and then using substr() to do what sub_Slice
>> did above.
>>
>> my $cdna_seq = $transcript->spliced_seq;
>> # substr takes (offset, length); assumes 1-based inclusive $start/$end
>> my $subseq = substr($cdna_seq, $start - 1, $end - $start + 1);
>>
>> It works well enough, but performance is too poor to be useful, taking 2
>> or 3 seconds to get $subseq per transcript. I'm wondering if I'm going
>> about things the wrong way and am skipping a cache or something with the
>> methods I'm using.
>>
>> Any ideas for how I can get better performance? Is there a better way to
>> get a chunk of a transcript's spliced sequence?
>>
>> Thanks,
>> Matt Wood
>> Codified Genomics
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/