[ensembl-dev] VEP Cache

Will McLaren wm2 at ebi.ac.uk
Wed Apr 9 09:14:08 BST 2014


Hi Konrad,

This is probably to be expected. There is no caching implemented at all in
the sequence fetching, so each time you call for the sequence, even if the
code has requested the same block before, it will go back to disk to ask
for it again.

It might be worth thinking about implementing some sort of sequence cache
in your plugin; if you add

$self->{has_cache} = 1;

to the new() method of your plugin, you can cache data on $self in run() and
retrieve it on each later call to run().
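
As a rough illustration of that pattern, here is a minimal sketch that runs
outside VEP. Everything in it other than the has_cache flag is hypothetical:
the package name SeqCacheSketch, the seq_cache hash, the cached_seq helper
and the region key format are made up for the example, and the anonymous sub
stands in for the expensive $intron->seq() call. A real plugin would inherit
from BaseVepPlugin and do the lookup inside run().

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for a VEP plugin. A real plugin would inherit from
# Bio::EnsEMBL::Variation::Utils::BaseVepPlugin instead of blessing directly.
package SeqCacheSketch;

sub new {
    my ($class) = @_;
    my $self = bless {}, $class;
    $self->{has_cache} = 1;     # flag suggested in the message above
    $self->{seq_cache} = {};    # hypothetical store: region key -> sequence
    return $self;
}

# Return the sequence for $key, calling $fetch only on a cache miss.
sub cached_seq {
    my ($self, $key, $fetch) = @_;
    my $cache = $self->{seq_cache};
    $cache->{$key} = $fetch->() unless exists $cache->{$key};
    return $cache->{$key};
}

package main;

my $disk_reads = 0;

# Stand-in for the slow sequence fetch; counts how often it is called.
my $slow_fetch = sub { $disk_reads++; return 'GTGAGTCACAG' };

my $plugin = SeqCacheSketch->new;
my $key    = '21:100-110:1';    # hypothetical region key

my $first  = $plugin->cached_seq($key, $slow_fetch);
my $second = $plugin->cached_seq($key, $slow_fetch);

print "disk reads: $disk_reads\n";    # prints "disk reads: 1"
```

In a plugin, one possible key would be built from the intron's region name,
start, end and strand, so that the same intron seen for several variants is
only read from disk once. An unbounded hash grows with every distinct intron,
so some form of eviction may be needed for large inputs.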

The Ensembl core API has similar functionality when fetching sequence from
the database to avoid redundant lookups.

HTH

Will


On 8 April 2014 19:28, Konrad Karczewski <konradk at broadinstitute.org> wrote:

> Hi Will,
>
> Thanks! This appears to work now. However, performance is noticeably
> decreased (one test case without the plugin is ~118 vars/sec, with the plugin:
> 16 vars/sec). I've narrowed down the slowdown to the intron sequence access
> step (takes on the order of a second) - is this expected, or might there be
> something wrong with the indexing?
>
> -Konrad
>
> On Apr 7, 2014, at 5:41 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
>
> Hi Konrad,
>
> Assuming you have the FASTA file available and functioning, this should
> work OK; you should see a message like this at VEP startup:
>
> 2014-04-07 10:32:11 - Read existing cache info
> 2014-04-07 10:32:12 - Auto-detected FASTA file in cache directory
> 2014-04-07 10:32:13 - Checking/creating FASTA index
>
> or just the final message if you are pointing manually to a FASTA file
> using --fasta.
>
> I just tested this with a rudimentary plugin and I can retrieve the intron
> sequence OK, no Ns. Let me know if you still have any problems.
>
> Plugin code:
>
> package IntronSeq;
> use Bio::EnsEMBL::Variation::Utils::BaseVepPlugin;
> use base qw(Bio::EnsEMBL::Variation::Utils::BaseVepPlugin);
>
> sub run {
>   my ($self, $tva) = @_;
>   print STDERR $tva->transcript->get_all_Introns->[0]->seq()."\n";
>   return {};
> }
>
> 1;
>
> Output:
>
> > perl variant_effect_predictor.pl -i example.vcf -force -plugin IntronSeq -offline -no_progress
> 2014-04-07 10:39:27 - Read existing cache info
> 2014-04-07 10:39:27 - Auto-detected FASTA file in cache directory
> 2014-04-07 10:39:27 - Checking/creating FASTA index
> 2014-04-07 10:39:27 - Loaded plugin: IntronSeq
> 2014-04-07 10:39:27 - Starting...
> 2014-04-07 10:39:27 - Detected format of input file as vcf
> 2014-04-07 10:39:27 - Read 173 variants into buffer
> 2014-04-07 10:39:27 - Reading transcript data from cache and/or database
> 2014-04-07 10:39:28 - Retrieved 3097 transcripts (0 mem, 3162 cached, 0 DB, 65 duplicates)
> 2014-04-07 10:39:28 - Analyzing chromosome 21
> 2014-04-07 10:39:28 - Analyzing variants
> 2014-04-07 10:39:28 - Calculating consequences
> Plugin 'IntronSeq' went wrong: Can't call method "seq" on an undefined value at /nfs/users/nfs_w/wm2/.vep/Plugins/IntronSeq.pm line 48, <GEN0> line 175.
>
> GTGAGTTTCAGAGGCCGTAGGGACAGGGAGCGAGGCCTAGATAGTGGTGTCTGTCTAGATTGGGTCTGAGGCGGGGCCGGGGAGGTCCCGCGGGGCAGAGGAAGGAGGAGGGTTTCTTAGTCCCTCCGCGGCGGTCGCTCTTGCACAGCTTGGGAGGACTAATTTATGGGAACGAGGGTCTGGCGGAGGGCAGGGGCAAGGGCAGGGGTCGGGGCCAGGGGTCGGAGCCAGGCCGCGGGAGGAGCTTGGGCCCGCCTCTGGGAAGCAGCGCACGTTCCGTGCACATCTGTCCATGTCTTCCCAAGGAATACTCGTACTTGCCTTGGCAGGTTCCCTGATTTGGCCTTTGGGATATAAACTCAGCATTTCTCATTCTGGATATTGATAGTTTCGGTGTGGGACCTTTGGTTTCCTGAAATTTTCTTGTTTTTCTTCAGACCCTGTCAAACCGACCACTTTGTTCACCTTCCCAATGACTCTAGTCCAGTTTTGACTCCGTTTCCTGGTTACTTTTTGCCCCTTATTGTAAAGCACTGATTGGAAACACGACACAGGAAATTGGTGGGAAATAGCGATCTGATGTGAAAGAGCCAAATTTAAAAGTAGAGGCACGTATCTGGGCCAGCTCTGTTTCTTCCGCTGGTGTTTGTTAATATTACAAATTGGTTTAATTTTACCTCTGAGCGCACTTTTGGCAGTACGTTAATCATTTTTTCAGTCTTCATATTTATTGTAACTTCTCCACAG
>
> GTGAGTTTCAGAGGCCGTAGGGACAGGGAGCGAGGCCTAGATAGTGGTGTCTGTCTAGATTGGGTCTGAGGCGGGGCCGGGGAGGTCCCGCGGGGCAGAGGAAGGAGGAGGGTTTCTTAGTCCCTCCGCGGCGGTCGCTCTTGCACAGCTTGGGAGGACTAATTTATGGGAACGAGGGTCTGGCGGAGGGCAGGGGCAAGGGCAGGGGTCGGGGCCAGGGGTCGGAGCCAGGCCGCGGGAGGAGCTTGGGCCCGCCTCTGGGAAGCAGCGCACGTTCCGTGCACATCTGTCCATGTCTTCCCAAGGAATACTCGTACTTGCCTTGGCAGGTTCCCTGATTTGGCCTTTGGGATATAAACTCAGCATTTCTCATTCTGGATATTGATAGTTTCGGTGTGGGACCTTTGGTTTCCTGAAATTTTCTTGTTTTTCTTCAGACCCTGTCAAACCGACCACTTTGTTCACCTTCCCAATGACTCTAGTCCAGTTTTGACTCCGTTTCCTGGTTACTTTTTGCCCCTTATTGTAAAGCACTGATTGGAAACACGACACAGGAAATTGGTGGGAAATAGCGATCTGATGTGAAAGAGCCAAATTTAAAAGTAGAGGCACGTATCTGGGCCAGCTCTGTTTCTTCCGCTGGTGTTTGTTAATATTACAAATTGGTTTAATTTTACCTCTGAGCGCACTTTTGGCAGTACGTTAATCATTTTTTCAGTCTTCATATTTATTGTAACTTCTCCACAG
>
> etc etc
>
> Regards
>
> Will McLaren
> Ensembl Variation
>
>
> On 7 April 2014 03:21, Konrad Karczewski <konradk at broadinstitute.org> wrote:
>
>> Hello!
>>
>> I've been developing a loss-of-function plugin for VEP and having some
>> implementation issues relating to the VEP cache. Specifically, when
>> accessing transcripts via the API (with the --offline flag set) it seems
>> the cache does not store intronic sequences. When I run the code below
>> without the --offline flag, it works as expected. With --offline, the
>> length prints properly, but the sequence is N repeated length times.
>>
>> # $transcript_variation is provided from VEP plugin "run" subroutine
>> my @gene_introns = @{$transcript_variation->transcript->get_all_Introns()};
>> my $intron_number = 0;
>> # Returns correct length for first intron of the transcript
>> print length($gene_introns[$intron_number]->seq()) . "\n";
>> # Returns "N"*length(intron)
>> print $gene_introns[$intron_number]->seq() . "\n";
>>
>> I can rebuild my cache if need be, but I was wondering if there were any
>> plans to integrate intron (and exon) sequence into the cache? (Seems like
>> it should be reasonably straightforward, since VEP requires the genome
>> fasta anyway, but I'm not sure about the details of how this part is
>> implemented). This would be very helpful for a number of reasons, including
>> detecting proper intron sequences (i.e. with a canonical splice motif).
>>
>> (This happens in API versions 74 and 75).
>>
>> Thanks!
>> -Konrad
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>