[ensembl-dev] Bad VCF file crashing VEP

Will McLaren wm2 at ebi.ac.uk
Wed May 25 09:27:21 BST 2016


Hi Stuart,

"-" is not a valid character in the VCF ALT field, see section 1.4.1 in
https://samtools.github.io/hts-specs/VCFv4.2.pdf

I'll add in a check to future versions, but unfortunately there will always
be situations where some weird/wrong input will kill VEP.

You could pass your VCF through a validator before sending it to VEP?
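
As a rough illustration of that kind of pre-check (a minimal sketch, not a full validator and not part of VEP), the Python below flags data lines whose ALT column is something other than comma-separated base strings, "*", ".", a symbolic allele such as <DEL>, or breakend notation. The function names and the simplified patterns are assumptions made for this example; it also assumes a tab-delimited file, as the VCF specification requires.

```python
import re

# Simplified allele patterns, loosely following section 1.4.1 of the VCF 4.2 spec.
# These are illustrative assumptions, not a complete grammar.
_BASES = re.compile(r"^[ACGTNacgtn]+$")
_SYMBOLIC = re.compile(r"^<[^<>\s]+>$")


def alt_is_valid(alt):
    """Return True if every comma-separated ALT allele looks well-formed."""
    if alt == ".":  # "." means no alternate allele
        return True
    for allele in alt.split(","):
        if allele == "*":
            continue
        if _BASES.match(allele) or _SYMBOLIC.match(allele):
            continue
        if "[" in allele or "]" in allele:  # breakend notation, not checked further
            continue
        return False
    return True


def find_bad_alt_lines(lines):
    """Yield (line_number, alt) for data lines whose ALT field fails the check.

    alt is None when the line has fewer than five tab-separated columns.
    """
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            yield number, None
            continue
        if not alt_is_valid(fields[4]):
            yield number, fields[4]
```

Run over an open file handle before calling VEP, find_bad_alt_lines would report the chr3 record because of its "CT/-T" ALT value, and those lines could then be fixed or dropped. A dedicated validator will catch far more than this sketch does.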

Regards

Will McLaren
Ensembl Variation

On 24 May 2016 at 19:26, Stuart Watt <morungos at gmail.com> wrote:

> Hi all
>
> I’ve hit an issue with some invalid VarScan2 VCF files crashing VEP
> extremely fatally. A VCF that triggers this is:
>
> ##fileformat=VCFv4.1
> ##source=VarScan2
> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of quality bases">
> ##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates if record is a somatic mutation">
> ##INFO=<ID=SS,Number=1,Type=String,Description="Somatic status of variant (0=Reference,1=Germline,2=Somatic,3=LOH, or 5=Unknown)">
> ##INFO=<ID=SSC,Number=1,Type=String,Description="Somatic score in Phred scale (0-255) derived from somatic p-value">
> ##INFO=<ID=GPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor+normal versus no variant for Germline calls">
> ##INFO=<ID=SPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor versus normal for Somatic/LOH calls">
> ##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
> ##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
> ##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
> ##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
> ##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
> ##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts: ref/fwd, ref/rev, var/fwd, var/rev">
> #CHROM  POS     ID REF     ALT     QUAL    FILTER  INFO    FORMAT  NORMAL  498_tissue
> chr2    242814072 . TG T . PASS    . GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:34:34:0:0%:18,16,0,0 0/1:.:77:73:2:2.67%:35,38,1,1
> chr3    239555  . C CT/-T   . PASS    . GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:77:29:19:39.58%:10,19,4,15        0/1:.:72:43:15:25.86%:19,24,4,11
>
>
> It’s the last line that does this. If the chr2 entry is missing, the file
> isn’t even detected as a VCF.
>
> The error is:
>
> MSG: start arg must be less than or equal to end arg + 1
> STACK Bio::EnsEMBL::TranscriptMapper::genomic2cds /mnt/work1/software/vep/83/Bio/EnsEMBL/TranscriptMapper.pm:397
> STACK Bio::EnsEMBL::Variation::BaseTranscriptVariation::cds_coords /mnt/work1/software/vep/83/Bio/EnsEMBL/Variation/BaseTranscriptVariation.pm:325
> STACK Bio::EnsEMBL::Variation::BaseVariationFeatureOverlapAllele::_pre_consequence_predicates /mnt/work1/software/vep/83/Bio/EnsEMBL/Variation/BaseVariationFeatureOverlapAllele.pm:393
> STACK Bio::EnsEMBL::Variation::BaseVariationFeatureOverlapAllele::get_all_OverlapConsequences /mnt/work1/software/vep/83/Bio/EnsEMBL/Variation/BaseVariationFeatureOverlapAllele.pm:237
> STACK Bio::EnsEMBL::Variation::Utils::VEP::tva_to_line /mnt/work1/software/vep/83/Bio/EnsEMBL/Variation/Utils/VEP.pm:2568
> STACK Bio::EnsEMBL::Variation::Utils::VEP::vfoa_to_line /mnt/work1/software/vep/83/Bio/EnsEMBL/Variation/Utils/VEP.pm:2504
> STACK Bio::EnsEMBL::Variation::Utils::VEP::vf_to_consequences /mnt/work1/software/vep/83/Bio/EnsEMBL/Variation/Utils/VEP.pm:2191
> STACK Bio::EnsEMBL::Variation::Utils::VEP::rejoin_variants /mnt/work1/software/vep/83/Bio/EnsEMBL/Variation/Utils/VEP.pm:1777
> STACK Bio::EnsEMBL::Variation::Utils::VEP::vf_list_to_cons /mnt/work1/software/vep/83/Bio/EnsEMBL/Variation/Utils/VEP.pm:1485
> STACK Bio::EnsEMBL::Variation::Utils::VEP::get_all_consequences /mnt/work1/software/vep/83/Bio/EnsEMBL/Variation/Utils/VEP.pm:1205
> STACK main::main /mnt/work1/software/vep/83/variant_effect_predictor.pl:321
> STACK toplevel /mnt/work1/software/vep/83/variant_effect_predictor.pl:148
> Date (localtime)    = Tue May 24 13:59:39 2016
> Ensembl API version = 83
> ---------------------------------------------------
> ERROR: Forked process(es) died
>
>
> We’re still trying to figure out the VarScan issue, but this shouldn’t
> really take out an entire VEP run. Even the issue where this line breaks
> recognition of VEP input is, I’d say, less than ideal, as the file contains
> about 7000 other valid records.
>
> All the best
> Stuart
>
> *Stuart Watt, PhD*
> Scientific Research Associate, Princess Margaret Cancer Centre
> MaRS Centre, 101 College Street
> Toronto Medical Discovery Tower, Room 9-302
> Toronto, Ontario, Canada M5G 1L7
> stuart.watt at uhnresearch.ca
> 416-634-8816