[ensembl-dev] VEP Cache

Konrad Karczewski konradk at broadinstitute.org
Tue Apr 8 19:28:18 BST 2014


Hi Will,

Thanks! This appears to work now. However, performance is noticeably lower: one test case runs at ~118 vars/sec without the plugin and 16 vars/sec with it. I've narrowed the slowdown down to the intron sequence access step (which takes on the order of a second). Is this expected, or might there be something wrong with the indexing?
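
A minimal sketch of how a single step like this can be timed, assuming Time::HiRes is available. time_call is a hypothetical helper, and the code ref below is only a stand-in for the real $intron->seq() call:

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical helper: run a code ref once and return (elapsed seconds, result).
sub time_call {
    my ($code) = @_;
    my $start  = time();
    my $result = $code->();
    return (time() - $start, $result);
}

# Stand-in for the intron sequence lookup; the sleep simulates a slow access.
my $fake_seq = sub { sleep(0.05); return 'GTGAGT'; };

my ($elapsed, $seq) = time_call($fake_seq);
printf STDERR "seq access took %.3f s for %d bp\n", $elapsed, length $seq;
```

Inside a plugin's run sub, the same helper would wrap something like sub { $intron->seq() }, with the message going to STDERR so it stays out of the VEP output.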

-Konrad

On Apr 7, 2014, at 5:41 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:

> Hi Konrad,
> 
> Assuming you have the FASTA file available and functioning, this should work OK; you should see a message like this at VEP startup:
> 
> 2014-04-07 10:32:11 - Read existing cache info
> 2014-04-07 10:32:12 - Auto-detected FASTA file in cache directory
> 2014-04-07 10:32:13 - Checking/creating FASTA index
> 
> or just the final message if you are pointing manually to a FASTA file using --fasta.
> 
> I just tested this with a rudimentary plugin and I can retrieve the intron sequence OK, no Ns. Let me know if you still have any problems.
> 
> Plugin code:
> 
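> package IntronSeq;
> 
> use strict;
> use warnings;
> 
> use Bio::EnsEMBL::Variation::Utils::BaseVepPlugin;
> use base qw(Bio::EnsEMBL::Variation::Utils::BaseVepPlugin);
> 
> # Called for each TranscriptVariationAllele; prints the sequence of the
> # transcript's first intron to STDERR and adds nothing to the VEP output.
> sub run {
>   my ($self, $tva) = @_;
>   # Note: ->[0] is undef for a transcript with no introns.
>   print STDERR $tva->transcript->get_all_Introns->[0]->seq()."\n";
>   return {};
> }
> 
> 1;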
> 
> Output:
> 
> > perl variant_effect_predictor.pl -i example.vcf -force -plugin IntronSeq -offline -no_progress
> 2014-04-07 10:39:27 - Read existing cache info
> 2014-04-07 10:39:27 - Auto-detected FASTA file in cache directory
> 2014-04-07 10:39:27 - Checking/creating FASTA index
> 2014-04-07 10:39:27 - Loaded plugin: IntronSeq
> 2014-04-07 10:39:27 - Starting...
> 2014-04-07 10:39:27 - Detected format of input file as vcf
> 2014-04-07 10:39:27 - Read 173 variants into buffer
> 2014-04-07 10:39:27 - Reading transcript data from cache and/or database
> 2014-04-07 10:39:28 - Retrieved 3097 transcripts (0 mem, 3162 cached, 0 DB, 65 duplicates)
> 2014-04-07 10:39:28 - Analyzing chromosome 21
> 2014-04-07 10:39:28 - Analyzing variants
> 2014-04-07 10:39:28 - Calculating consequences
> Plugin 'IntronSeq' went wrong: Can't call method "seq" on an undefined value at /nfs/users/nfs_w/wm2/.vep/Plugins/IntronSeq.pm line 48, <GEN0> line 175.
> GTGAGTTTCAGAGGCCGTAGGGACAGGGAGCGAGGCCTAGATAGTGGTGTCTGTCTAGATTGGGTCTGAGGCGGGGCCGGGGAGGTCCCGCGGGGCAGAGGAAGGAGGAGGGTTTCTTAGTCCCTCCGCGGCGGTCGCTCTTGCACAGCTTGGGAGGACTAATTTATGGGAACGAGGGTCTGGCGGAGGGCAGGGGCAAGGGCAGGGGTCGGGGCCAGGGGTCGGAGCCAGGCCGCGGGAGGAGCTTGGGCCCGCCTCTGGGAAGCAGCGCACGTTCCGTGCACATCTGTCCATGTCTTCCCAAGGAATACTCGTACTTGCCTTGGCAGGTTCCCTGATTTGGCCTTTGGGATATAAACTCAGCATTTCTCATTCTGGATATTGATAGTTTCGGTGTGGGACCTTTGGTTTCCTGAAATTTTCTTGTTTTTCTTCAGACCCTGTCAAACCGACCACTTTGTTCACCTTCCCAATGACTCTAGTCCAGTTTTGACTCCGTTTCCTGGTTACTTTTTGCCCCTTATTGTAAAGCACTGATTGGAAACACGACACAGGAAATTGGTGGGAAATAGCGATCTGATGTGAAAGAGCCAAATTTAAAAGTAGAGGCACGTATCTGGGCCAGCTCTGTTTCTTCCGCTGGTGTTTGTTAATATTACAAATTGGTTTAATTTTACCTCTGAGCGCACTTTTGGCAGTACGTTAATCATTTTTTCAGTCTTCATATTTATTGTAACTTCTCCACAG
> GTGAGTTTCAGAGGCCGTAGGGACAGGGAGCGAGGCCTAGATAGTGGTGTCTGTCTAGATTGGGTCTGAGGCGGGGCCGGGGAGGTCCCGCGGGGCAGAGGAAGGAGGAGGGTTTCTTAGTCCCTCCGCGGCGGTCGCTCTTGCACAGCTTGGGAGGACTAATTTATGGGAACGAGGGTCTGGCGGAGGGCAGGGGCAAGGGCAGGGGTCGGGGCCAGGGGTCGGAGCCAGGCCGCGGGAGGAGCTTGGGCCCGCCTCTGGGAAGCAGCGCACGTTCCGTGCACATCTGTCCATGTCTTCCCAAGGAATACTCGTACTTGCCTTGGCAGGTTCCCTGATTTGGCCTTTGGGATATAAACTCAGCATTTCTCATTCTGGATATTGATAGTTTCGGTGTGGGACCTTTGGTTTCCTGAAATTTTCTTGTTTTTCTTCAGACCCTGTCAAACCGACCACTTTGTTCACCTTCCCAATGACTCTAGTCCAGTTTTGACTCCGTTTCCTGGTTACTTTTTGCCCCTTATTGTAAAGCACTGATTGGAAACACGACACAGGAAATTGGTGGGAAATAGCGATCTGATGTGAAAGAGCCAAATTTAAAAGTAGAGGCACGTATCTGGGCCAGCTCTGTTTCTTCCGCTGGTGTTTGTTAATATTACAAATTGGTTTAATTTTACCTCTGAGCGCACTTTTGGCAGTACGTTAATCATTTTTTCAGTCTTCATATTTATTGTAACTTCTCCACAG
> 
> etc etc
> 
> Regards
> 
> Will McLaren
> Ensembl Variation
> 
> 
> On 7 April 2014 03:21, Konrad Karczewski <konradk at broadinstitute.org> wrote:
> Hello!
> 
> I've been developing a loss-of-function plugin for VEP and am having some implementation issues relating to the VEP cache. Specifically, when accessing transcripts via the API (with the --offline flag set), it seems the cache does not store intronic sequences. When I run the code below without the --offline flag, it works as expected. With --offline, the length prints properly, but the sequence is N repeated length times.
> 
> # $transcript_variation is provided from VEP plugin "run" subroutine
> my @gene_introns = @{$transcript_variation->transcript->get_all_Introns()};
> my $intron_number = 0;
> print length($gene_introns[$intron_number]->seq()) . "\n"; # Returns correct length for first intron of the transcript
> print $gene_introns[$intron_number]->seq() . "\n"; # Returns "N"*length(intron)
> 
> I can rebuild my cache if need be, but I was wondering if there were any plans to integrate intron (and exon) sequence into the cache? (Seems like it should be reasonably straightforward, since VEP requires the genome fasta anyway, but I'm not sure about the details of how this part is implemented). This would be very helpful for a number of reasons, including detecting proper intron sequences (i.e. with a canonical splice motif).
> 
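
As a minimal sketch of the kind of canonical splice motif check mentioned above: has_canonical_splice_motif is a hypothetical helper, it only looks for the GT...AG dinucleotides, and it assumes the sequence is given in transcript orientation. The strings below are stand-ins for what $intron->seq() would return:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the intron sequence starts with GT and
# ends with AG (case-insensitive), 0 otherwise.
sub has_canonical_splice_motif {
    my ($seq) = @_;
    return 0 unless defined $seq && length($seq) >= 4;
    my $donor    = uc substr($seq, 0, 2);   # first two bases of the intron
    my $acceptor = uc substr($seq, -2);     # last two bases of the intron
    return ($donor eq 'GT' && $acceptor eq 'AG') ? 1 : 0;
}

# Stand-in sequences; an all-N sequence never passes the check.
for my $seq ('GTGAGTTTCAGTCCACAG', 'NNNNNNNN') {
    my $label = has_canonical_splice_motif($seq) ? 'canonical' : 'non-canonical';
    print "$seq: $label\n";
}
```

A check like this only gives a meaningful answer when the real bases are available, which is why the N-filled sequences are a problem here.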
> (This happens in API versions 74 and 75).
> 
> Thanks!
> -Konrad
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
