[ensembl-dev] How to handle N-content?

Will McLaren wm2 at ebi.ac.uk
Tue Sep 27 10:07:13 BST 2016


Hi João,

VEP does not currently support N as a valid REF or ALT allele. As you point
out, in theory it's possible to make some deductions in some cases, but
currently our code does not support this.

If you have N as your REF allele, then you should be able to correct this
by looking up the reference allele at each position. The --check_ref flag
in VEP will report the correct REF allele for you.
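
If you would rather patch the file yourself, a minimal sketch of the lookup
could look like the following. This is not VEP code: the function name
fix_ref_n and the in-memory "reference" dict (chromosome name to sequence
string) are made up for illustration, and in practice the bases would come
from your reference FASTA. It assumes tab-delimited VCF lines and 1-based POS.

```python
def fix_ref_n(vcf_line, reference):
    """Replace every N in the REF column with the base found in `reference`.

    `reference` is a hypothetical dict mapping chromosome name to its
    sequence as a plain string; VCF POS is 1-based, Python strings 0-based.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref = fields[0], int(fields[1]), fields[3]
    seq = reference[chrom]
    fixed = "".join(
        seq[pos - 1 + offset] if base.upper() == "N" else base
        for offset, base in enumerate(ref)
    )
    fields[3] = fixed
    return "\t".join(fields)


if __name__ == "__main__":
    toy_reference = {"chr1": "ACGTACGT"}  # toy sequence, not real genome data
    line = "chr1\t3\t.\tN\tT\t5000\t.\t.\t."
    print(fix_ref_n(line, toy_reference))
```

Bases that are already specified in REF are left untouched, so only the N
positions are filled in.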

If you have N as your ALT allele, you could spoof the annotation you might
expect by replacing N with the remaining non-REF alleles, e.g. if you have
REF=A, then you could set ALT=C,G,T.
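
As a rough sketch of that substitution (again not part of VEP; the name
expand_alt_n is hypothetical, and it only handles single-base REF alleles,
leaving anything else unchanged):

```python
BASES = "ACGT"


def expand_alt_n(ref, alt):
    """Expand an N in a comma-separated ALT field into all non-REF bases.

    Only single-base REF alleles are expanded; other ALT entries are kept
    as they are. Duplicates are dropped while preserving order.
    """
    expanded = []
    for allele in alt.split(","):
        if allele.upper() == "N" and len(ref) == 1 and ref.upper() in BASES:
            candidates = [b for b in BASES if b != ref.upper()]
        else:
            candidates = [allele]
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return ",".join(expanded)


if __name__ == "__main__":
    print(expand_alt_n("A", "N"))  # C,G,T
```

Each of the resulting ALT alleles then gets its own annotation, which you
would need to interpret together as "any of these".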

The VCF spec does not necessarily exist to support annotation tools, merely
the reporting of variants, so I don't believe it's a valid conclusion to
say this feature should be dropped from the spec.

Regards

Will McLaren
Ensembl Variation


On 27 September 2016 at 02:59, João Eiras <joao.eiras at gmail.com> wrote:

> Hi.
>
> The VCF spec [1] mentions that the REF and ALT fields can contain the
> N nucleotide.
>
> I was checking the COSMIC data, and the VCF files do have a bit of
> N-content in some variants.
>
> I've checked how VEP handles N-content.
>
> If the N-content is in the REF field, VEP will just report
> "coding_sequence_variant" as consequence term and that's it.
>
> Example (not from COSMIC) with the GRCm38 genome, on transcript
> ENSMUST00000086738, codon at position 6:
> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
> chr1 99772780 . N G 5000 . . .
> chr1 99772781 . N G 5000 . . .
> chr1 99772782 . N G 5000 . . .
>
> I would expect for N to be handled as a wildcard when comparing REF
> with the sequence in the database, so N would always match, and would
> just tell the annotation tool how long REF is.
>
> Then, if N is in the ALT column, VEP will not produce any annotations
> at all (transcript_consequences is empty).
>
> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
> chr1 99772780 . A N 5000 . . .
> chr1 99772781 . G N 5000 . . .
> chr1 99772782 . A N 5000 . . .
>
> I think even with N-content most variants can be called (small indels,
> frameshifts, stop codon change or gain), but the main issue is that
> amino-acid changes may not be callable, so VEP should just output X
> when translating the codons affected by N, which VEP already does for
> incomplete codons (tip of transcript or frameshifts). Looking at the
> genetic code table, the amino-acids Alanine (GCN), Arginine (CGN),
> Glycine (GGN), Leucine (CTN), Proline (CCN), Serine (TCN), Threonine
> (ACN) and Valine (GTN) can all be unambiguously called if there is N in
> the 3rd nucleotide of their respective codons.
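
The third-position observation above can be checked with a small sketch
under the standard genetic code. This is illustrative only and says nothing
about how VEP translates codons; translate_codon is a made-up name. A codon
containing N is expanded to every possible concrete codon, and an amino acid
is reported only if all expansions agree, otherwise X.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate_codon(codon):
    """Translate one codon, treating N as 'any base'.

    Returns the amino acid if every expansion of N gives the same one,
    and 'X' for ambiguous, incomplete, or unrecognised codons.
    """
    codon = codon.upper()
    if len(codon) != 3 or any(base not in BASES + "N" for base in codon):
        return "X"
    options = [BASES if base == "N" else base for base in codon]
    amino_acids = {CODON_TABLE["".join(c)] for c in product(*options)}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


if __name__ == "__main__":
    for codon in ("GCN", "CGN", "ATN", "NNN"):
        print(codon, translate_codon(codon))
```

With this table, exactly the eight prefixes listed above (GC, CG, GG, CT,
CC, TC, AC, GT) give a single amino acid when the third base is N.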
>
> If annotation tools do not produce usable output with N-content, then
> the spec should be changed.
>
> Thank you.
>
> [1] https://samtools.github.io/hts-specs/VCFv4.2.pdf, page 4