[ensembl-dev] VEP upper limit on number of variations?

Emily Perry emily at ebi.ac.uk
Tue Jun 7 15:19:52 BST 2016


Hello,

Please use the link at the bottom of any dev email to unsubscribe.

All the best,

Emily


On 07/06/2016 15:09, Stamboulian, Mouses Hrag wrote:
>
> unsubscribe
>
> ------------------------------------------------------------------------
> *From:* dev-bounces at ensembl.org <dev-bounces at ensembl.org> on behalf of 
> Taylor, Sean <Sean.Taylor at seattlechildrens.org>
> *Sent:* Tuesday, June 7, 2016 10:00 AM
> *To:* dev at ensembl.org
> *Subject:* [ensembl-dev] VEP upper limit on number of variations?
>
> Hello,
>
> I have a library of about 93 million variants that I want to run VEP 
> on. I have downloaded VEP version 84 onto a local Linux machine 
> running CentOS Linux release 7.1.1503. The machine has 125G of RAM and 
> 40 cores @ 2.3GHz.
>
> My variant library is stored in an Impala table in HDFS. To generate 
> my input for VEP, I run a simple Impala query as follows:
>
> impala -B -o ~/vep/knownvariants.tsv --output_delimiter="\\t" -q 
> "select contig, pos, pos+length, concat(ref,'\/',alt), '+' from 
> ingest.central_variant_store where variant_type != 'COMPLEX'"
>
> This writes the variants to a tab-delimited file in Ensembl format:
>
> 10      57676885        57676886        C/G     +
> 7       34456697        34456698        A/G     +
> 8       62679252        62679253        G/A     +
> 7       9184853         9184854         C/A     +
> 3       29205854        29205855        C/T     +
> 10      42815272        42815273        C/T     +
> 8       117963405       117963406       C/T     +
> 12      53054550        53054551        C/T     +
> 6       105515195       105515196       T/C     +
> 20      665650          665651          G/C     +
>
> The entire library is about 93 million variants. I filtered out the 
> complex variants (i.e. CNVs) because, in my experience, these just 
> threw warnings in VEP and remained unannotated. My idea was to break 
> this output into smaller chunks for processing, ideally around 25M 
> variants each.
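
A minimal sketch of the chunking step described above, in Python. It assumes the Impala export is a plain text file with one variant per line. The function name `split_variants`, the default chunk size, and the output naming pattern are illustrative and not part of VEP or the original workflow.

```python
from pathlib import Path


def split_variants(src, out_dir, lines_per_chunk=2_500_000):
    """Split a one-variant-per-line file into numbered chunk files.

    Lines are copied unchanged and streamed one at a time, so the input
    never has to fit in memory. Returns the chunk paths in order.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunks = []
    out = None
    with src.open() as handle:
        for line_no, line in enumerate(handle):
            # Start a new chunk file every lines_per_chunk lines.
            if line_no % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                path = out_dir / f"{src.stem}.chunk{len(chunks):04d}{src.suffix}"
                chunks.append(path)
                out = path.open("w")
            out.write(line)
    if out is not None:
        out.close()
    return chunks
```

Each returned path can then be passed to VEP as a separate input file.
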
>
> I then feed each chunk into VEP using the following arguments:
>
> perlbrew switch 5.16.3
>
> perlbrew exec perl ./variant_effect_predictor.pl --cache --offline 
> --everything --json --buffer_size 25000000 --force_overwrite --verbose 
> -i knownvariants.tsv -o knownvariants.json 2>&1 | tee knownvariants.log
>
> VEP will crunch on this for a while. It will usually clear the 
> following steps:
>
> - load variants into buffer
> - check for existing variations
> - read transcript data from cache
> - read regulatory data from cache
> - begin analyzing chromosomes
>
> Somewhere in the last step, VEP will crash with the following error 
> message:
>
> Command [perl ./variant_effect_predictor.pl --cache --offline 
> --everything --json --buffer_size 25000000 --force_overwrite --verbose 
> -i knownvariants.tsv -o knownvariants.json] terminated with exit code 
> 0 ($? = 9) under the following perl environment:
>
> Command terminated with non-zero status.
>
> Current perl:
>   Name: perl-5.16.3
>   Path: /opt/perl5/perls/perl-5.16.3/bin/perl
>   Config: -de -Dprefix=/opt/perl5/perls/perl-5.16.3 -Aeval:scriptdir=/opt/perl5/perls/perl-5.16.3/bin
>   Compiled at: Apr 15 2016 06:06:22
>
> perlbrew:
>   version: 0.75
>   ENV:
>     PERLBREW_ROOT: /opt/perl5
>     PERLBREW_HOME:
>     PERLBREW_PATH: /opt/perl5/bin:/opt/perl5/perls/perl-5.16.3/bin
>     PERLBREW_MANPATH: /opt/perl5/perls/perl-5.16.3/man
>
> This exit code is not very specific, so it is hard to know what is 
> going on. Is it running out of memory? I lean away from that because 
> I believe I have seen specific memory-related error messages when 
> that happens. I have tried smaller inputs of 12.5M, 10M, 5M, 2.5M, 
> 1M, 100K, and 10K variants. So far, I can generally run up to 2.5M 
> variants without problems; anything bigger produces this same error 
> message. I also got this error once on a 2.5M run, but that time the 
> program crashed after it had analyzed all the chromosomes, while it 
> was actively writing the JSON output.
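
For what it is worth, the `$? = 9` in the perlbrew message can be decoded. In Perl, `$?` is the raw wait status: the low 7 bits hold the terminating signal number and the high byte holds the exit code, which is why the message shows "exit code 0" next to `$? = 9`. Signal 9 on Linux is SIGKILL, which a process cannot catch, so VEP would have had no chance to print an error of its own. The kernel's out-of-memory killer is one common sender of SIGKILL, though not the only one, so treat this as a hint rather than a diagnosis. The helper name below is illustrative.

```python
import signal


def decode_wait_status(status):
    """Split a raw 16-bit wait status (Perl's $?) into its parts."""
    sig = status & 0x7F                 # low 7 bits: terminating signal, 0 if none
    core_dumped = bool(status & 0x80)   # bit 7: set if a core dump was written
    exit_code = status >> 8             # high byte: value passed to exit()
    name = signal.Signals(sig).name if sig else None
    return {
        "exit_code": exit_code,
        "signal": sig,
        "signal_name": name,
        "core_dumped": core_dumped,
    }


# On Linux this reports exit code 0 and signal 9 (SIGKILL).
print(decode_wait_status(9))
```

If the out-of-memory killer was responsible, the kernel log (for example the output of `dmesg`) usually records which process it killed.
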
>
> This leads me to ask if there is a known upper limit to how many 
> variants one can practically push through at a time? Or perhaps a 
> timeout limit? Processing 93M in 2.5M chunks is a bit tedious. Any 
> thoughts on how to improve or optimize this would be appreciated. I 
> have attached the log files from several runs for reference.
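
One thing that may be worth trying, offered as a guess rather than a confirmed fix: `--buffer_size` sets how many variants VEP reads into memory at once, so a value of 25000000 asks it to hold a very large batch in memory before annotating anything. The sketch below builds a VEP command that keeps the other options from the original command but uses a much smaller buffer and adds `--fork` to spread the work across cores. The values 5000 and 8 are illustrative assumptions, not tuned recommendations, and the helper name `vep_command` is hypothetical. Please check both options against the documentation for the VEP version in use.

```python
from pathlib import Path


def vep_command(variants_file, buffer_size=5000, forks=8):
    """Build the argument list for one VEP run over a single input file.

    buffer_size and forks are assumed starting points, not tuned values.
    The output file is the input path with its suffix replaced by .json.
    """
    variants_file = Path(variants_file)
    return [
        "perl", "./variant_effect_predictor.pl",
        "--cache", "--offline", "--everything", "--json",
        "--buffer_size", str(buffer_size),
        "--fork", str(forks),
        "--force_overwrite", "--verbose",
        "-i", str(variants_file),
        "-o", str(variants_file.with_suffix(".json")),
    ]


print(" ".join(vep_command("knownvariants.tsv")))
```

Passing the list to `subprocess.run(..., check=True)` would raise `CalledProcessError` if VEP exits with a non-zero status or is killed by a signal, which makes a failed run easy to spot in a loop.
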
>
> Thanks,
>
> Sean Taylor
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

-- 
Dr Emily Perry (Pritchard)
Ensembl Outreach Project Leader

European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge
CB10 1SD
UK
