[ensembl-dev] How to handle N-content ?

Will McLaren wm2 at ebi.ac.uk
Tue Sep 27 14:50:51 BST 2016


On 27 September 2016 at 14:24, João Eiras <joao.eiras at gmail.com> wrote:

> On 27 September 2016 at 11:07, Will McLaren <wm2 at ebi.ac.uk> wrote:
> > Hi João,
> >
> > VEP does not currently support N as a valid REF or ALT allele. As you point
> > out, in theory it's possible to make some deductions in some cases, but
> > currently our code does not support this.
> >
> > If you have N as your REF allele, then you should be able to correct this by
> > looking up the reference allele at each position. The --check_ref flag in
> > VEP will report the correct REF allele for you.
> >
>
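As a rough illustration of the "look up the reference allele at each position" step, here is a minimal Python sketch. It is not part of VEP: the function name fix_ref and the genome argument (a plain dict mapping chromosome name to sequence string, standing in for a real FASTA lookup) are hypothetical, and only a REF made up entirely of N is handled.

```python
def fix_ref(vcf_line, genome):
    """Replace an all-N REF with the bases found in `genome` at that position.

    `genome` is a hypothetical stand-in for a FASTA lookup: a dict mapping
    chromosome name to its sequence as a string.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref = fields[0], int(fields[1]), fields[3]
    if set(ref.upper()) == {"N"}:
        start = pos - 1  # VCF positions are 1-based, Python slices are 0-based
        fields[3] = genome[chrom][start:start + len(ref)].upper()
    return "\t".join(fields)


# Toy example: a 5-base "chromosome" whose fifth base is A.
toy_genome = {"chr1": "ACGTA"}
fixed = fix_ref("chr1\t5\t.\tN\tG\t5000\t.\t.\t.", toy_genome)
print(fixed.split("\t")[3])  # A
```

A REF that mixes N with real bases is left untouched here; deciding what to do in that case is a separate question.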
> Currently, VEP behaves like this:
>
> chr1 99772780 . N G 5000 . . .
> # Nga/Gga, coding_sequence_variant
>
> chr1 99772780 . A G 5000 . . .
> # Aga/Gga, missense_variant, R/G
>
> So N is effectively ignored, since coding_sequence_variant carries no
> information about how the sequence is changed.


> If I use --check_ref, then VEP complains:
> WARNING: Could not fetch sub-slice from 1:99772780-99772780(1) on line 15
> WARNING: Specified reference allele N does not match Ensembl reference
> allele on line 15
>

You will need to either connect to the database server (use --cache without
--offline) or make a FASTA file available (--fasta) for VEP to read
sequence data from.

http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#fasta

The variant will still be skipped, but the correct reference allele will be
reported in the warning message.


>
> And the whole variant is dropped from the output.
>
> Neither of these is the behavior I expected in my first message.
>
> But the question is, would it be worthwhile to handle N content more
> gracefully?
>
> > If you have N as your ALT allele, you could spoof the annotation you might
> > expect by substituting N for the remaining non-REF alleles e.g. if you have
> > REF=A, then you could set ALT=C,G,T.
>
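The substitution described above can be sketched in a few lines of Python. This is an illustration only, not VEP code: expand_alt is a hypothetical helper, and it deliberately handles just the single-nucleotide case (REF of length 1, ALT equal to N).

```python
def expand_alt(ref, alt):
    """Turn ALT=N into the comma-separated list of non-REF bases.

    Only the single-nucleotide case is handled; anything else is
    returned unchanged.
    """
    if alt.upper() != "N" or len(ref) != 1:
        return alt
    return ",".join(base for base in "ACGT" if base != ref.upper())


print(expand_alt("A", "N"))  # C,G,T
print(expand_alt("A", "G"))  # G
```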
> That obviously does not scale well with variants longer than one
> nucleotide, and it would produce several annotations instead of
> only one.
>

Of course, but this is only marginally worse than all the combinations that
would have to be computed if you did give N as the ALT. You cite the codons
that have N in the third position in the genetic code, but this doesn't
account for variants that fall in any other position in the codon. Nor
does it offer any better solution for variants longer than 1 nucleotide,
or for variants that fall in splicing or other non-coding regions.
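To make the codon-position point concrete, here is a small Python sketch that enumerates every concrete codon an N-containing codon could stand for and collects the resulting amino acids. It assumes the standard genetic code; translate_ambiguous is a hypothetical name and nothing here reflects how VEP itself works.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_ambiguous(codon):
    """Return the set of amino acids a codon containing N could encode."""
    choices = [BASES if base == "N" else base for base in codon.upper()]
    return {CODON_TABLE["".join(c)] for c in product(*choices)}


print(translate_ambiguous("GGN"))          # {'G'}
print(sorted(translate_ambiguous("NGA")))  # ['*', 'G', 'R']
```

With N in the third position of GGN, all four expansions give glycine, so a single annotation is possible. With N in the first position, as in the Nga/Gga example earlier in the thread, NGA could encode R, G or a stop, so no single consequence can be reported without enumerating the alternatives.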

If you can describe how to do what you are suggesting (or, even better,
write the code!), then feel free to contribute, but as it stands VEP will
continue to annotate only input given in definitive form.

Cheers

Will

