[ensembl-dev] [VEP] Bogus annotation in variant from Cosmic

João Eiras joao.eiras at gmail.com
Sun Jan 29 17:37:05 GMT 2017


Hi.

I have the following variant from the COSMIC [1] VCF files:

chr1    2193996 .       ACCTGT  A

I'm using VEP 87 with the merged Ensembl + RefSeq reference.

I get 16 annotations. Two of them caught my eye. But before getting
into them, a quick breakdown of the variant.

The chunk between 2193997 and 2194001 is deleted. This location is
shared between many transcripts of the gene
ENSG00000162585/FAAP20/C1orf86. The gene (and its transcripts) is
on the reverse strand. The variant crosses over from the exon into the
intron. The exon has the range [2193998, 2194133]. So this variant
deletes 4 bases at the end of the exon, plus one nucleotide of the
splice donor site (note again, reverse strand).
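The arithmetic above can be checked with a minimal Python sketch. The
helper names (deleted_interval, overlap) are made up for illustration
and are not part of VEP; the sketch assumes a simple deletion where ALT
is the anchor base at the start of REF.

```python
def deleted_interval(pos, ref, alt):
    """Return the 1-based inclusive genomic range removed by a VCF deletion.

    Assumes ALT is a proper prefix of REF (the anchor base convention).
    """
    if not ref.startswith(alt) or len(ref) <= len(alt):
        raise ValueError("not a simple anchored deletion")
    start = pos + len(alt)      # first deleted base follows the anchor
    end = pos + len(ref) - 1    # last base covered by REF
    return start, end


def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two inclusive ranges."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


# Variant and exon coordinates taken from the post.
del_start, del_end = deleted_interval(2193996, "ACCTGT", "A")
exon_start, exon_end = 2193998, 2194133  # second exon, reverse strand

exonic = overlap(del_start, del_end, exon_start, exon_end)
intronic = (del_end - del_start + 1) - exonic

print(del_start, del_end)  # 2193997 2194001
print(exonic, intronic)    # 4 1
```

Because the transcript is on the reverse strand, the exon's lowest
genomic coordinate (2193998) is its 3' end, so the single intronic base
at 2193997 is the first base of the donor site.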

The JSON for the two annotations that seem bogus is:
{"cdna_end": 221,
"cdna_start": 221,
"consequence_terms": ["splice_donor_variant","5_prime_UTR_variant"],
"exons": [
[2194688, 2194775],
[2193998, 2194133], # <- variant starts here
[2193639, 2193910],
[2192845, 2192975],
[2189713, 2189781],
[2186838, 2187206],
[2186004, 2186249],
[2184460, 2185513]
],
"gene_id": 199990,
"gene_name": "'C1orf86",
"impact": "HIGH",
"refseq_match": "rseq_mrna_nonmatch,rseq_5p_mismatch",
"source": "RefSeq",
"strand": -1,
"transcript_biotype": "protein_coding",
"transcript_id": "NM_001282671.1"}

{"cdna_end": 221,
"cdna_start": 221,
"consequence_terms": ["splice_donor_variant","5_prime_UTR_variant"],
"exons": [
[2194688, 2194775],
[2193998, 2194133], # <- variant starts here
[2193639, 2193910],
[2189713, 2189781],
[2186838, 2187206],
[2186004, 2186249],
[2184460, 2185513]
],
"gene_id": 199990,
"gene_name": "'C1orf86",
"impact": "HIGH",
"refseq_match": "rseq_mrna_nonmatch,rseq_5p_mismatch",
"source": "RefSeq",
"strand": -1,
"transcript_biotype": "protein_coding",
"transcript_id": "NM_001282672.1"}

The transcript_biotype and exons fields are something I added with a
plugin ($tva->transcript->get_all_Exons()). The only difference between
these two transcripts is the extra exon (2192845, 2192975).
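A quick way to confirm that difference is to compare the two exon
lists as sets. This is only a sketch in Python using the coordinates
from the JSON above; the variable names are made up for illustration.

```python
# Exon coordinates copied from the two annotations in the post.
exons_671 = [  # NM_001282671.1
    (2194688, 2194775), (2193998, 2194133), (2193639, 2193910),
    (2192845, 2192975), (2189713, 2189781), (2186838, 2187206),
    (2186004, 2186249), (2184460, 2185513),
]
exons_672 = [  # NM_001282672.1
    (2194688, 2194775), (2193998, 2194133), (2193639, 2193910),
    (2189713, 2189781), (2186838, 2187206), (2186004, 2186249),
    (2184460, 2185513),
]

# Exons present in one transcript but not the other.
only_in_671 = sorted(set(exons_671) - set(exons_672))
only_in_672 = sorted(set(exons_672) - set(exons_671))

print(only_in_671)  # [(2192845, 2192975)]
print(only_in_672)  # []
```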

As you can see, the variant should go from the 2nd exon into the
intron, which is then followed by a 3rd exon and so forth. As such, the
consequence terms should be ["splice_donor_variant",
"coding_sequence_variant"], as they are for all the other non-intronic
annotations (in transcripts that preserve the affected exon). For
instance, this is the annotation obtained for transcript
NM_001256946.1.

I see that both of these annotations have the flag rseq_5p_mismatch,
and none of the others do.

Why do the consequence terms include "5_prime_UTR_variant"? It doesn't
seem to make sense, given that the variant is not over the 5' UTR and
this is a simple protein_coding transcript.

Thank you for your time.

[1] https://cancer.sanger.ac.uk/cosmic/download



