[ensembl-dev] VEP ignoring SNVs when called alongside an insertion or deletion

Will McLaren wm2 at ebi.ac.uk
Wed Sep 18 09:48:14 BST 2013


Hi Dave,

Thanks for spotting this.

The VEP's VCF parser assumes that if any alternate allele of a variant is
"unbalanced" (i.e. the reference is a different length to the alternate
allele), then the whole variant should be treated as such. Because Ensembl
treats unbalanced substitutions differently to VCF in terms of the position
and alleles (see http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#vcf),
the first base of every allele in your variant is getting trimmed off. The
substitution allele is then left identical to the trimmed reference
(-/-/TA and TA/-/TA in your two examples), so it is, as you rightly point
out, disappearing.
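
As a rough illustration of the behaviour described above (a sketch only, not
the VEP's actual parser code; the function name is made up for this example),
the following Python reproduces the allele strings seen in the two reported
outputs. It models only the multi-allelic case and assumes the trimming is
applied to every allele as soon as one alternate is unbalanced.

```python
def ensembl_style_id(chrom, pos, ref, alts):
    """Hypothetical helper: mimic the trimming described in this thread.

    Assumption: for a multi-allelic record, if ANY alternate allele differs
    in length from the reference, the first base is removed from ALL alleles
    and the start position is shifted by one.
    """
    unbalanced = any(len(alt) != len(ref) for alt in alts)
    if len(alts) > 1 and unbalanced:
        pos += 1
        # An allele that becomes empty is written as "-"
        ref = ref[1:] or "-"
        alts = [alt[1:] or "-" for alt in alts]
    alleles = "/".join([ref] + alts)
    return f"{chrom}_{pos}_{alleles}"


# The two records from the original report
print(ensembl_style_id("6", 32634300, "G", ["C", "CTA"]))    # 6_32634301_-/-/TA
print(ensembl_style_id("6", 32634300, "GTA", ["G", "CTA"]))  # 6_32634301_TA/-/TA
```

In both cases the substitution allele ends up equal to the trimmed reference
("-" in the first record, "TA" in the second), which is why no consequence is
reported for it.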

I will try and work on a fix for this - it should be possible to separate
them out - but in the meantime I think the best solution is to separate out
your indels from your substitutions. While I'm not a VCF format expert, I
would hope that such an expert would suggest this is the best way to encode
your variants anyway - both Ensembl and dbSNP, for example, have a policy
of treating SNVs and indels separately if they occur at the same position.
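
One possible way to do that separation before running the VEP is sketched
below in Python. This is an assumption-laden example rather than a
recommended tool: the function name is hypothetical, it expects tab-separated
VCF data lines, and it only handles site-only records (up to the INFO
column), because splitting alleles in a multisample file would also require
recoding each sample's genotype indices.

```python
def split_by_balance(vcf_line):
    """Hypothetical helper: split one VCF data line into at most two lines,
    one holding ALT alleles the same length as REF, one holding the rest."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) > 8:
        # FORMAT/sample columns would need their GT indices rewritten
        raise ValueError("sample columns are not handled by this sketch")
    ref, alts = fields[3], fields[4].split(",")
    balanced = [a for a in alts if len(a) == len(ref)]
    unbalanced = [a for a in alts if len(a) != len(ref)]
    records = []
    for group in (balanced, unbalanced):
        if group:
            # Rebuild the line with only this group's alleles in the ALT column
            records.append("\t".join(fields[:4] + [",".join(group)] + fields[5:]))
    return records


for record in split_by_balance("6\t32634300\t.\tG\tC,CTA"):
    print(record)
```

With the first record from the report, this prints one line for the G>C
substitution and a separate line for the G>CTA insertion, so neither allele
is affected by the other's trimming.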

Hope this helps, and thanks for using the VEP!

Will McLaren
Ensembl Variation


On 17 September 2013 10:22, David Parry <D.A.Parry at leeds.ac.uk> wrote:

> Hi,
>
> I apologize if I have misunderstood the caveats given regarding the VCF
> input format for the VEP but I am observing unexpected behavior that I
> don't think is covered by the documentation. If I provide a multiallelic
> variant with both an insertion and a deletion call at the same site, the
> VEP correctly outputs both consequences. However, if a variant contains
> either an insertion or a deletion alongside a substitution, the VEP
> ignores the substitution.  For example, the following variant in a
> VCF:
>
> 6       32634300        .       G       C,CTA
>
> gives the output:
>
> ## ENSEMBL VARIANT EFFECT PREDICTOR v73
> ## Output produced at 2013-09-17 09:57:41
> ## Connected to
> ## Using cache in /home/davidparry/.vep/homo_sapiens/73
> ## Using API version 73, DB version ?
> ## Extra column keys:
> ## DISTANCE : Shortest distance from variant to transcript
> #Uploaded_variation     Location        Allele  Gene    Feature Feature_type    Consequence     cDNA_position   CDS_position    Protein_position        Amino_acids     Codons  Existing_variation      Extra
> 6_32634301_-/-/TA       6:32634300-32634301     TA      ENSG00000179344 ENST00000484729 Transcript      frameshift_variant,NMD_transcript_variant,feature_elongation    115-116 84-85   28-29   -       -       -
> 6_32634301_-/-/TA       6:32634300-32634301     TA      ENSG00000179344 ENST00000399082 Transcript      frameshift_variant,feature_elongation   129-130 84-85   28-29   -       -       -
> 6_32634301_-/-/TA       6:32634300-32634301     TA      ENSG00000179344 ENST00000399084 Transcript      frameshift_variant,feature_elongation   263-264 84-85   28-29   -       -       -
> 6_32634301_-/-/TA       6:32634300-32634301     TA      ENSG00000179344 ENST00000434651 Transcript      frameshift_variant,feature_elongation   171-172 84-85   28-29   -       -       -
> 6_32634301_-/-/TA       6:32634300-32634301     TA      ENSG00000179344 ENST00000399079 Transcript      frameshift_variant,feature_elongation   141-142 84-85   28-29   -       -       -
> 6_32634301_-/-/TA       6:32634300-32634301     TA      ENSG00000179344 ENST00000374943 Transcript      frameshift_variant,feature_elongation   161-162 84-85   28-29   -       -       -
> 6_32634301_-/-/TA       6:32634300-32634301     TA      ENSG00000241287 ENST00000443574 Transcript      upstream_gene_variant   -       -       -       -       -       -       DISTANCE=4073
> 6_32634301_-/-/TA       6:32634300-32634301     TA      ENSG00000179344 ENST00000487676 Transcript      non_coding_exon_variant,nc_transcript_variant,feature_elongation        115-116 -       -       -       -       -
>
> In this case the substitution variant is ignored and we only get a
> consequence for the insertion.  Similarly, for a deletion at the same
> site as a substitution:
>
> 6       32634300        .       GTA     G,CTA
>
> gives:
>
> ## ENSEMBL VARIANT EFFECT PREDICTOR v73
> ## Output produced at 2013-09-17 09:51:08
> ## Connected to
> ## Using cache in /home/davidparry/.vep/homo_sapiens/73
> ## Using API version 73, DB version ?
> ## Extra column keys:
> ## DISTANCE : Shortest distance from variant to transcript
> #Uploaded_variation     Location        Allele  Gene    Feature Feature_type    Consequence     cDNA_position   CDS_position    Protein_position        Amino_acids     Codons  Existing_variation      Extra
> 6_32634301_TA/-/TA      6:32634301-32634302     -       ENSG00000179344 ENST00000484729 Transcript      frameshift_variant,NMD_transcript_variant,feature_truncation    114-115 83-84   28      -       -       -
> 6_32634301_TA/-/TA      6:32634301-32634302     -       ENSG00000179344 ENST00000399082 Transcript      frameshift_variant,feature_truncation   128-129 83-84   28      -       -       -
> 6_32634301_TA/-/TA      6:32634301-32634302     -       ENSG00000179344 ENST00000399084 Transcript      frameshift_variant,feature_truncation   262-263 83-84   28      -       -       -
> 6_32634301_TA/-/TA      6:32634301-32634302     -       ENSG00000179344 ENST00000434651 Transcript      frameshift_variant,feature_truncation   170-171 83-84   28      -       -       -
> 6_32634301_TA/-/TA      6:32634301-32634302     -       ENSG00000179344 ENST00000399079 Transcript      frameshift_variant,feature_truncation   140-141 83-84   28      -       -       -
> 6_32634301_TA/-/TA      6:32634301-32634302     -       ENSG00000179344 ENST00000374943 Transcript      frameshift_variant,feature_truncation   160-161 83-84   28      -       -       -
> 6_32634301_TA/-/TA      6:32634301-32634302     -       ENSG00000241287 ENST00000443574 Transcript      upstream_gene_variant   -       -       -       -       -       -       DISTANCE=4074
> 6_32634301_TA/-/TA      6:32634301-32634302     -       ENSG00000179344 ENST00000487676 Transcript      non_coding_exon_variant,nc_transcript_variant,feature_truncation        114-115 -       -       -       -       -
>
> ...we only get the consequence for the deletion.
>
> Generally I am processing multisample VCF files with VEP and outputting
> in VCF format.  I want to be able to assess the consequences for a given
> sample's genotype but this sometimes fails at sites like this where my
> script can't find an allele corresponding to the substitution in the VEP
> output.  A workaround would be to separate my indel and my substitution
> calls before running the VEP, but I wondered whether this is
> known/desired behaviour for this tool?
>
> The VEP is a really great tool, so it would be brilliant if there were a
> fix for this.
>
> Cheers,
>
> Dave