[ensembl-dev] VEP upper limit on number of variations?
Emily Perry
emily at ebi.ac.uk
Tue Jun 7 15:19:52 BST 2016
Hello
Please use the link at the bottom of any dev email to unsubscribe.
All the best
Emily
On 07/06/2016 15:09, Stamboulian, Mouses Hrag wrote:
>
> unsubscribe
>
> ------------------------------------------------------------------------
> *From:* dev-bounces at ensembl.org <dev-bounces at ensembl.org> on behalf of
> Taylor, Sean <Sean.Taylor at seattlechildrens.org>
> *Sent:* Tuesday, June 7, 2016 10:00 AM
> *To:* dev at ensembl.org
> *Subject:* [ensembl-dev] VEP upper limit on number of variations?
>
> Hello,
>
> I have a library of about 93 million variants that I want to run VEP
> on. I have downloaded VEP version 84 onto a local Linux machine
> running CentOS Linux release 7.1.1503. The machine has 125G of memory
> and 40 cores @ 2.3GHz.
>
> My variant library is stored in an Impala table in HDFS. To generate
> my input for VEP, I run a simple Impala query as follows:
>
> impala -B -o ~/vep/knownvariants.tsv --output_delimiter="\\t" -q
> "select contig, pos, pos+length, concat(ref,'\/',alt), '+' from
> ingest.central_variant_store where variant_type != 'COMPLEX'"
>
> This writes the variants to a tab-delimited file in Ensembl format:
>
> 10 57676885 57676886 C/G +
> 7 34456697 34456698 A/G +
> 8 62679252 62679253 G/A +
> 7 9184853 9184854 C/A +
> 3 29205854 29205855 C/T +
> 10 42815272 42815273 C/T +
> 8 117963405 117963406 C/T +
> 12 53054550 53054551 C/T +
> 6 105515195 105515196 T/C +
> 20 665650 665651 G/C +
>
> The entire library is about 93 million. I filtered out the complex
> variants (i.e. CNVs) as my experience showed that these just threw
> warnings in VEP and remained unannotated. My idea was to break this
> output into smaller chunks for processing, ideally around 25M variants
> each.
>
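The chunking step described above could be sketched in Python as follows. This is only an illustration: the function name `split_into_chunks` and the `chunk_NNNN.tsv` naming scheme are assumptions, not part of VEP or of the original workflow. It splits purely by line count and streams the input, so only one line at a time is held in memory.

```python
import itertools
from pathlib import Path


def split_into_chunks(src, out_dir, lines_per_chunk):
    """Split the text file `src` into files of at most `lines_per_chunk` lines.

    Returns the list of chunk paths in input order. The naming scheme
    (chunk_0000.tsv, chunk_0001.tsv, ...) is an illustrative assumption.
    """
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with open(src) as handle:
        for index in itertools.count():
            # Read one line first so that no empty trailing chunk is created.
            first = handle.readline()
            if not first:
                break
            path = out_dir / f"chunk_{index:04d}.tsv"
            with open(path, "w") as out:
                out.write(first)
                # Copy the rest of this chunk straight from the input stream.
                out.writelines(itertools.islice(handle, lines_per_chunk - 1))
            paths.append(path)
    return paths
```

For example, `split_into_chunks("knownvariants.tsv", "chunks", 2_500_000)` would write files of 2.5M variants each, matching the size reported below as generally working; each file can then be passed to VEP with `-i`.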
> I then feed each chunk into VEP using the following arguments:
>
> perlbrew switch 5.16.3
>
> perlbrew exec perl ./variant_effect_predictor.pl --cache --offline
> --everything --json --buffer_size 25000000 --force_overwrite --verbose
> -i knownvariants.tsv -o knownvariants.json 2>&1 | tee knownvariants.log
>
> VEP will crunch on this for a while. It will usually clear the
> following steps:
>
> - load variants into buffer
> - check for existing variations
> - read transcript data from cache
> - read regulatory data from cache
> - begin analyzing chromosomes
>
> Somewhere in the last step, VEP will crash with the following error
> message:
>
> Command [perl ./variant_effect_predictor.pl --cache --offline
> --everything --json --buffer_size 25000000 --force_overwrite --verbose
> -i knownvariants.tsv -o knownvariants.json] terminated with exit code
> 0 ($? = 9) under the following perl environment:
>
> Command terminated with non-zero status.
>
> Current perl:
>   Name: perl-5.16.3
>   Path: /opt/perl5/perls/perl-5.16.3/bin/perl
>   Config: -de -Dprefix=/opt/perl5/perls/perl-5.16.3
>     -Aeval:scriptdir=/opt/perl5/perls/perl-5.16.3/bin
>   Compiled at: Apr 15 2016 06:06:22
>
> perlbrew:
>   version: 0.75
>   ENV:
>     PERLBREW_ROOT: /opt/perl5
>     PERLBREW_HOME:
>     PERLBREW_PATH: /opt/perl5/bin:/opt/perl5/perls/perl-5.16.3/bin
>     PERLBREW_MANPATH: /opt/perl5/perls/perl-5.16.3/man
>
> This exit code is not very specific so it is hard to know what is
> going on. Is it running out of memory? I lean away from that because
> it seems that I have seen specific error messages related to memory
> usage when that happens. I have tried inputting smaller numbers of
> variants such as 12.5M, 10M, 5M, 2.5M, 1M, 100K, and 10K. So far, I
> can generally execute just fine on up to 2.5M variants. Anything
> bigger and I get this same error message. I also got this once on a
> 2.5M run, only this time the program crashed after analysis of all the
> chromosomes. It was actively writing to the json output when it died.
>
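The status in the error message can be decoded further. Perl's `$?` is the raw wait status: the exit code is `$? >> 8` and the low seven bits hold the number of the signal that ended the process. `$? = 9` therefore means exit code 0 and signal 9, which is why perlbrew prints "exit code 0" for a failed run. The sketch below shows the arithmetic in Python; the helper names are illustrative only.

```python
import signal


def decode_wait_status(status):
    """Split a 16-bit POSIX wait status (Perl's $?) into its components."""
    return {
        "exit_code": status >> 8,           # high byte: value passed to exit()
        "signal": status & 0x7F,            # low 7 bits: terminating signal, 0 if none
        "core_dumped": bool(status & 0x80), # bit 7: set if a core file was written
    }


def signal_name(number):
    """Return the symbolic name of a signal number, if this platform knows it."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"signal {number}"


decoded = decode_wait_status(9)
print(decoded, signal_name(decoded["signal"]))
```

On Linux, signal 9 is SIGKILL. A long-running process that dies from SIGKILL without printing an error of its own is often a victim of the kernel's out-of-memory killer, although nothing in this log confirms that; the kernel log (`dmesg`) would show it. It would also be consistent with `--buffer_size 25000000`, which asks VEP to hold that many variants in memory at once.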
> This leads me to ask if there is a known upper limit to how many
> variants one can practically push through at a time? Or perhaps a
> timeout limit? Processing 93M in 2.5M chunks is a bit tedious. Any
> thoughts on how to improve or optimize this would be appreciated. I
> have attached the log files from several runs for reference.
>
> Thanks,
>
> Sean Taylor
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
--
Dr Emily Perry (Pritchard)
Ensembl Outreach Project Leader
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge
CB10 1SD
UK