[ensembl-dev] How to handle N-content ?

João Eiras joao.eiras at gmail.com
Tue Sep 27 02:59:39 BST 2016


Hi.

The VCF spec [1] mentions that the REF and ALT fields can contain the
N nucleotide.

I was checking the COSMIC data, and the VCF files do have a bit of
N-content in some variants.

I've checked how VEP handles N-content.

If the N-content is in the REF field, VEP just reports
"coding_sequence_variant" as the consequence term and nothing else.

Example (not from cosmic) with the GRCm38 genome, on transcript
ENSMUST00000086738, codon at position 6:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 99772780 . N G 5000 . . .
chr1 99772781 . N G 5000 . . .
chr1 99772782 . N G 5000 . . .

I would expect N to be handled as a wildcard when comparing REF
with the sequence in the database, so that N always matches and only
tells the annotation tool how long REF is.
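
To make the wildcard idea concrete, here is a minimal sketch in Python.
It is not how VEP compares alleles; the function name ref_matches is
hypothetical and only illustrates the behaviour I would expect.

```python
def ref_matches(ref: str, genomic: str) -> bool:
    """Return True if REF agrees with the genomic sequence.

    Hypothetical helper: an N in REF matches any base, so it only
    constrains the length of the reference allele.
    """
    if len(ref) != len(genomic):
        return False
    return all(r == "N" or r == g for r, g in zip(ref.upper(), genomic.upper()))
```

With this rule, REF=N at a position where the genome has A would be
accepted, while REF=C at the same position would still be a mismatch.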

Then, if N is in the ALT column, VEP does not produce any annotations
at all (transcript_consequences is empty).

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 99772780 . A N 5000 . . .
chr1 99772781 . G N 5000 . . .
chr1 99772782 . A N 5000 . . .

I think even with N-content most variants can be called (small indels,
frameshifts, stop codon changes or gains). The main issue is that
amino-acid changes may not be callable, so VEP should just output X
when translating the codons affected by N, which VEP already does for
incomplete codons (tip of transcript or frameshifts). Looking at the
genetic code table, the amino acids Alanine (GCN), Arginine (CGN),
Glycine (GGN), Leucine (CTN), Proline (CCN), Serine (TCN), Threonine
(ACN) and Valine (GTN) can all be called unambiguously if there is an N
in the 3rd nucleotide of their respective codons.
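
The translation rule above can be sketched as follows. This is only an
illustration under my own assumptions, not VEP code: translate_codon is
a hypothetical name, and the table is the standard genetic code. A codon
containing N is expanded to every codon it could stand for, and an amino
acid is reported only if all of them agree; otherwise the result is X.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_codon(codon: str) -> str:
    """Translate one codon, returning X when N makes the result ambiguous."""
    codon = codon.upper()
    if len(codon) != 3 or any(b not in "ACGTN" for b in codon):
        return "X"  # incomplete or unrecognised codon
    # Each N stands for any of the four bases; other positions are fixed.
    choices = [BASES if b == "N" else b for b in codon]
    amino_acids = {CODON_TABLE["".join(c)] for c in product(*choices)}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"
```

For example, translate_codon("GCN") gives A (Alanine), while
translate_codon("ATN") gives X, because ATG codes for Methionine and the
other three codons code for Isoleucine.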

If annotation tools do not produce usable output with N-content, then
the spec should be changed.

Thank you.

[1] https://samtools.github.io/hts-specs/VCFv4.2.pdf, page 4



