[ensembl-dev] Strange Ensembl/RefSeq VEP annotation for variant

Tue Dec 8 15:29:10 GMT 2015

Hi again,

Thanks for your reply! This clears things up, but for our use case we
still need to handle this somehow.

It would be nice if we could somehow get an indication of this in the
output. I've looked into the output from VEP, and if I understand
correctly, the "rseq_ens_no_match" flag in REFSEQ_MATCH gets us close to
flagging these kinds of transcripts. However, it will also include SNV
differences, not only indels (not to mention UTR differences). In our
case we don't care about the SNV differences, but the indels, potentially
causing frameshifts, are important to flag since the cDNA positions and
amino acid codons can be wrong.
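
To illustrate why a one-base shift matters, here is a minimal sketch of the
codon arithmetic for the positions from my first mail (quoted below). It
only assumes that c. positions are 1-based and count from the first base of
the start codon; the function name codon_position is made up for this
example.

```python
def codon_position(cds_pos):
    """Return (codon number, base within codon), both 1-based, for a c. position."""
    if cds_pos < 1:
        raise ValueError("coding positions are 1-based")
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3 + 1


# Positions taken from the VEP output and dbSNP as quoted in this thread.
positions = [
    ("RefSeq, as reported by VEP", 4200),
    ("RefSeq, as listed in dbSNP", 4201),
    ("Ensembl", 4198),
]

for label, pos in positions:
    codon, base = codon_position(pos)
    print(f"{label}: c.{pos} -> codon {codon}, base {base}")
```

c.4198 and c.4201 both fall on the first base of a codon (1400 and 1401),
which fits a three-base difference that keeps the frame. c.4200 falls on
the third base of codon 1400, so the codon and the amino acid change come
out differently.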

From the release notes of Release 79, under "Transcript attributes for
Refseq-genomic-to-mRNA comparison (Human)" [1], it says the following:

"Transcripts that do not have a perfect match between the mRNA and the 
genomic sequence will get additional attributes to define what regions 
(5' UTR, CDS, 3' UTR, or 'whole transcript' if there is no CDS defined) 
do not align perfectly, along with a summary of the information in the 
alignment (match,mismatch, indel count, total indel length)."
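
To make concrete what kind of summary I have in mind, here is a small
sketch that derives the four numbers from a gapped pairwise alignment. This
is not how Ensembl computes the attributes; the function name, the "-" gap
character and the toy sequences are my own assumptions for illustration.

```python
def summarise_alignment(aligned_a, aligned_b, gap="-"):
    """Count matches, mismatches, indel events and total indel length.

    Both inputs are rows of one pairwise alignment, so they must have
    equal length. A run of consecutive gap columns on the same row is
    counted as a single indel event.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    summary = {"match": 0, "mismatch": 0, "indel_count": 0, "indel_length": 0}
    gap_side = None  # row holding the currently open gap run, if any

    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            raise ValueError("alignment column has a gap in both sequences")
        side = "a" if a == gap else "b" if b == gap else None
        if side is None:
            key = "match" if a.upper() == b.upper() else "mismatch"
            summary[key] += 1
        else:
            if side != gap_side:
                summary["indel_count"] += 1  # a new gap run starts here
            summary["indel_length"] += 1
        gap_side = side
    return summary


# Toy alignment, not real transcript sequence.
print(summarise_alignment("ACGT-ACGTAC", "ACCTTAC--AC"))
```

With a summary like this, a transcript could be flagged whenever
indel_count is greater than zero, regardless of the number of mismatches.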

Is it possible to get this summary information from either VEP or
BioMart somehow? If not, would it be possible to add a new flag
indicating whether the differences are indels or SNVs? Either of these
(ideally in VEP) would go a long way towards solving our problem, as we
could then flag transcripts with indel differences.
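
In the meantime, the closest we can get is to filter on the existing flag.
A rough sketch of that filter follows. It assumes the annotations arrive as
a "KEY=value;KEY=value" string, as in the Extra column of VEP's
tab-delimited output, and that multiple REFSEQ_MATCH flags are separated by
"," or "&"; both separators are assumptions on my part. The function names
and the second flag in the example are hypothetical.

```python
import re


def refseq_match_flags(extra):
    """Return the set of REFSEQ_MATCH flags found in a KEY=value;... string."""
    for item in extra.split(";"):
        key, sep, value = item.partition("=")
        if sep and key.strip() == "REFSEQ_MATCH":
            # Assumed separators between multiple flags: "," or "&".
            return {flag for flag in re.split(r"[,&]", value.strip()) if flag}
    return set()


def needs_review(extra):
    """True if the RefSeq transcript is flagged as not matching Ensembl."""
    return "rseq_ens_no_match" in refseq_match_flags(extra)


# Made-up example strings; "some_other_flag" is a placeholder, not a VEP flag.
print(needs_review("IMPACT=MODERATE;REFSEQ_MATCH=rseq_ens_no_match,some_other_flag"))
print(needs_review("IMPACT=MODERATE"))
```

As noted above, this over-flags: it cannot tell an indel difference from an
SNV or UTR difference, which is why a more specific flag would help.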

Thanks again,
Svein Tore Koksrud Seljebotn

[1] http://www.ensembl.org/info/website/news.html?id=79#cat-genebuild

 >Hi Svein,
 >
 >In Ensembl, our annotation is based on the reference genome but RefSeq
 >transcripts can differ from the reference which causes problems like
 >this in occasional cases.
 >In this instance, the reference has a single base deletion with respect
 >to the RefSeq transcript. The absence of the base in the reference has
 >caused the Ensembl transcript to have a 2 base intron within what is a
 >contiguous exon in RefSeq - this accounts for the 3 base difference
 >between the transcripts. The RefSeq/reference mismatch also causes the
 >problems you observe for the RefSeq analysis.
 >
 >If you look at position 3124 in the alignment between the two
 >transcripts, you can see the Ensembl transcript has a 3 base deletion
 >with respect to the RefSeq transcript:
 >
 >http://grch37.ensembl.org/Homo_sapiens/Transcript/Similarity/Align?db=core;extdb=refseq_mrna;g=ENSG00000090006;r=19:41099072-41135725;sequence=NM_003573.2;t=ENST00000204005
 >
 >If you look at Intron 24-25 on the Exon view you can see the extra
 >intron this introduces:
 >
 >http://grch37.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;extdb=refseq_mrna;g=ENSG00000090006;r=19:41099072-41135725;sequence=NM_003573.2;t=ENST00000204005
 >
 >The GRCh37 site uses an old gene set, but for more recent sets we hold
 >information on when the RefSeq transcript differs from the Ensembl
 >transcript and report such discrepancies with command line VEP. We are
 >seeking to resolve this problem.
 >
 >Best wishes,
 >
 >Sarah
 >
 >
 >On 02/12/2015 11:55, Svein Tore Koksrud Seljebotn wrote:
 >> Hi,
 >>
 >> we encountered a variant that gives somewhat confusing annotation
 >> output from VEP (GRCh37, release 82).
 >>
 >> The variant is: 19:41133005 G>A (rs200607327).
 >>
 >> If it's still available, an online VEP annotation can be found here:
 >> http://grch37.ensembl.org/Homo_sapiens/Tools/VEP/Results?db=core;tl=1Vp8p7UidQSVCQfB-1297423
 >>
 >>
 >> We use Refseq transcript output for NM_003573.2, and got the following:
 >>
 >> NM_003573.2:c.4200G>A |NP_003564.2:p.Met1400Ile | ATG/ATA
 >>
 >> For the corresponding Ensembl transcript ENST00000204005 [1], we get
 >> the following:
 >>
 >> ENST00000204005.9:c.4198G>A | ENSP00000204005.9:p.Gly1400Arg | GGG/AGG
 >>
 >> In dbSNP and other databases, the correct cDNA position for the RefSeq
 >> transcript for this variant is 4201, not 4200.
 >>
 >> So I have two questions:
 >>
 >> 1. Why is there a three base difference between the two transcripts
 >> (4201 vs 4198)?
 >>
 >> 2. Is there something going wrong in the calculation of the RefSeq
 >> data? Note the frameshift for the codons, resulting in wrong protein
 >> as well.
 >>
 >>
 >> Best regards,
 >> Svein Tore Koksrud Seljebotn
 >>
 >>
 >> [1]
 >> http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000090006;r=19:41099072-41135725;t=ENST00000204005;tl=1Vp8p7UidQSVCQfB-1297423
 >>
 >