[ensembl-dev] How to handle N-content ?

João Eiras joao.eiras at gmail.com
Tue Sep 27 14:24:14 BST 2016


On 27 September 2016 at 11:07, Will McLaren <wm2 at ebi.ac.uk> wrote:
> Hi João,
>
> VEP does not currently support N as a valid REF or ALT allele. As you point
> out, in theory it's possible to make some deductions in some cases, but
> currently our code does not support this.
>
> If you have N as your REF allele, then you should be able to correct this by
> looking up the reference allele at each position. The --check_ref flag in
> VEP will report the correct REF allele for you.
>
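
For the REF case, the lookup can be done before the file ever reaches VEP. A minimal sketch, assuming the reference sequence is already in memory as a dict from chromosome name to sequence string, and that the VCF fields are tab-separated. The name fill_n_ref and the toy genome are hypothetical, not part of VEP:

```python
def fill_n_ref(vcf_line, reference):
    """Replace an all-N REF field with the bases found in `reference`.

    `reference` is a hypothetical in-memory genome: a dict mapping
    chromosome name to its sequence string.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref = fields[0], int(fields[1]), fields[3]
    if ref and set(ref.upper()) == {"N"}:
        start = pos - 1  # VCF positions are 1-based
        fields[3] = reference[chrom][start:start + len(ref)]
    return "\t".join(fields)


# Toy data for illustration only, not real genome coordinates.
toy_reference = {"chr1": "ACGTA"}
fixed = fill_n_ref("chr1\t3\t.\tN\tT\t5000\t.\t.\t.", toy_reference)
```

Lines whose REF is not entirely N are returned unchanged, so this could be run over a whole file.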

Currently, VEP behaves like this:

chr1 99772780 . N G 5000 . . .
# Nga/Gga, coding_sequence_variant

chr1 99772780 . A G 5000 . . .
# Aga/Gga, missense_variant, R/G

So N is effectively ignored, since coding_sequence_variant carries no
information about how the sequence is changed.

If I use --check_ref, then VEP complains:
WARNING: Could not fetch sub-slice from 1:99772780-99772780(1) on line 15
WARNING: Specified reference allele N does not match Ensembl reference
allele on line 15

And the whole variant is dropped from the output.

Neither of these is the behavior I described in my first message.

But the question is: would it be worthwhile to handle N content more
gracefully?

> If you have N as your ALT allele, you could spoof the annotation you might
> expect by substituting N for the remaining non-REF alleles e.g. if you have
> REF=A, then you could set ALT=C,G,T.

That obviously does not scale well to variants longer than one
nucleotide, and it would produce several annotations instead of
only one.
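
To make the scaling concern concrete, here is a rough sketch of the substitution in plain Python. It is only an illustration of the workaround, not anything VEP does, and expand_n_alt is a hypothetical helper name:

```python
from itertools import product

BASES = "ACGT"


def expand_n_alt(ref, alt):
    """Replace every N in `alt` with each concrete base.

    Any expansion identical to `ref` is dropped, since it would not
    be a variant at all.
    """
    n_positions = [i for i, base in enumerate(alt) if base == "N"]
    expanded = []
    for combo in product(BASES, repeat=len(n_positions)):
        chars = list(alt)
        for pos, base in zip(n_positions, combo):
            chars[pos] = base
        candidate = "".join(chars)
        if candidate != ref:
            expanded.append(candidate)
    return expanded


single = expand_n_alt("A", "N")      # the REF=A, ALT=C,G,T case
triple = expand_n_alt("AAA", "NNN")  # three unknown bases
```

One N gives 3 alternatives, but three Ns already give 4^3 - 1 = 63, each of which would get its own annotation.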



