[ensembl-dev] Strange Ensembl/RefSeq VEP annotation for variant

Fri Dec 11 10:40:27 GMT 2015

Hi Svein,

One word of caution here is that when we import RefSeq annotation we use the accessions present in the gff files as stable ids. Unlike Ensembl stable ids, the RefSeq stable ids are not unique (as an mRNA may have been mapped to several places in the genome). It’s important to be sure you are matching the alignment information to the transcript you’re looking at (and not another mapping of the mRNA in a different region). The best way is to make sure the coordinates of the start and end of the transcript match the region you expect. From what I’ve seen in the cases where there is more than one mapping for an mRNA, there is often one perfect match and then several imperfect ones.
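
A minimal sketch of that coordinate check, assuming the mappings have already been loaded into simple records. The `Mapping` type, the function name, the placeholder accession and the example coordinates are all made up for illustration and are not part of the Ensembl API; only the query region is the one from this thread.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    """One genomic placement of a RefSeq accession (hypothetical record)."""
    accession: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def mappings_in_region(mappings, accession, chrom, region_start, region_end):
    """Return the placements of `accession` that lie fully inside the region."""
    return [
        m for m in mappings
        if m.accession == accession
        and m.chrom == chrom
        and m.start >= region_start
        and m.end <= region_end
    ]


# Illustrative data only: the same accession placed in two locations.
candidates = [
    Mapping("NM_000000.1", "19", 41_100_000, 41_135_000),
    Mapping("NM_000000.1", "7", 5_000_000, 5_030_000),
]

# Keep only the placement inside the region we expect (19:41099072-41135725).
hits = mappings_in_region(candidates, "NM_000000.1", "19", 41_099_072, 41_135_725)
```

If more than one placement survives the filter, the accession alone is still ambiguous in that region and the alignment attributes need to be checked per placement.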

Fergal.

On 10 Dec 2015, at 16:51, Sarah Hunt <seh at ebi.ac.uk> wrote:

> 
> Hi Svein,
> 
> We started calculating this information in release 79, so don't have it for GRCh37 as the gene set was frozen in release 75. Many of these differences should have been resolved with the improvements in the GRCh38 assembly.
> 
> The REFSEQ_MATCH 'rseq_cds_mismatch' flag will be most useful to you - this signifies a mismatch in the CDS of the RefSeq model, though it does not specify the type.
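
A rough sketch of filtering on that flag, assuming the REFSEQ_MATCH value has already been extracted from the VEP output as a string. The separator between multiple flags is an assumption here (the sketch accepts either `&` or `,`), so check it against your own output format; the function name is hypothetical.

```python
import re


def has_cds_mismatch(refseq_match: str) -> bool:
    """True if the REFSEQ_MATCH value includes the rseq_cds_mismatch flag."""
    # Assumption: flags are separated by '&' or ','; adjust for your output format.
    flags = {flag for flag in re.split(r"[&,]", refseq_match) if flag}
    return "rseq_cds_mismatch" in flags


# Example values, made up for illustration.
print(has_cds_mismatch("rseq_ens_no_match&rseq_cds_mismatch"))
print(has_cds_mismatch("rseq_ens_no_match"))
```

Splitting into a set of whole flags, rather than doing a substring search, avoids accidental matches if a longer flag name ever contains the same text.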
> 
> There are 232 RefSeq transcripts in our current GRCh38 release which have indels in the CDS with respect to the Ensembl transcript. I have attached an export of the information we hold in case this is useful. I've also attached an example script you could use to extract the full information by transcript. We will look at how best to add this additional detail to VEP.
> 
> All the best,
> 
> Sarah
> 
> On 08/12/2015 15:29, Svein Tore Koksrud Seljebotn wrote:
>> Hi again,
>> 
>> thanks for your reply! This clears things up and for our use case we need to solve this somehow.
>> 
>> It would be nice if we could somehow get an indication of this information. I've looked into the output from VEP, and if I understand correctly, the REFSEQ_MATCH "rseq_ens_no_match" flag can get us near to flagging these kinds of transcripts. However, this will include SNV differences and not only indels (not to mention UTR differences). In our case we don't care about the SNV differences, but the indels, potentially causing frameshifts, are important to flag since the cDNA positions and amino acid codons can be wrong.
>> 
>> From the release notes of Release 79, under "Transcript attributes for Refseq-genomic-to-mRNA comparison (Human)" [1], it says the following:
>> 
>> "Transcripts that do not have a perfect match between the mRNA and the genomic sequence will get additional attributes to define what regions (5' UTR, CDS, 3' UTR, or 'whole transcript' if there is no CDS defined) do not align perfectly, along with a summary of the information in the alignment (match,mismatch, indel count, total indel length)."
>> 
>> Is it possible to get this summary information from either VEP or BioMart somehow? Or would it be possible to add a new flag indicating whether the differences are indels or SNVs? Either of these sources (ideally VEP) would go a long way to solving our problems, as we could then flag transcripts with indel differences.
>> 
>> 
>> Thanks again,
>> Svein Tore Koksrud Seljebotn
>> 
>> 
>> [1] http://www.ensembl.org/info/website/news.html?id=79#cat-genebuild
>> 
>> >Hi Svein,
>> >
>> >In Ensembl, our annotation is based on the reference genome but RefSeq
>> >transcripts can differ from the reference, which causes problems like
>> >this in occasional cases.
>> >In this instance, the reference has a single base deletion with respect
>> >to the RefSeq transcript. The absence of the base in the reference has
>> >caused the Ensembl transcript to have a 2 base intron within what is a
>> >contiguous exon in RefSeq - this accounts for the 3 base difference
>> >between the transcripts. The RefSeq/reference mismatch also causes the
>> >problems you observe for the RefSeq analysis.
>> >
>> >If you look at position 3124 in the alignment between the two
>> >transcripts, you can see the Ensembl transcript has a 3 base deletion
>> >with respect to the RefSeq transcript:
>> >
>> >http://grch37.ensembl.org/Homo_sapiens/Transcript/Similarity/Align?db=core;extdb=refseq_mrna;g=ENSG00000090006;r=19:41099072-41135725;sequence=NM_003573.2;t=ENST00000204005 
>> >
>> >If you look at Intron 24-25 on the Exon view you can see the extra
>> >intron this introduces:
>> >
>> >http://grch37.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;extdb=refseq_mrna;g=ENSG00000090006;r=19:41099072-41135725;sequence=NM_003573.2;t=ENST00000204005 
>> >
>> >The GRCh37 site uses an old gene set, but for more recent sets we hold
>> >information on when the RefSeq transcript differs from the Ensembl
>> >transcript and report such discrepancies with command line VEP. We are
>> >seeking to resolve this problem.
>> >
>> >Best wishes,
>> >
>> >Sarah
>> >
>> >
>> >On 02/12/2015 11:55, Svein Tore Koksrud Seljebotn wrote:
>> >> Hi,
>> >>
>> >> we encountered one variant that gives a bit confusing annotation
>> >> output from VEP (GRCh37, release 82).
>> >>
>> >> The variant is: 19:41133005 G>A (rs200607327).
>> >>
>> >> If it's still available, an online VEP annotation can be found here:
>> >> http://grch37.ensembl.org/Homo_sapiens/Tools/VEP/Results?db=core;tl=1Vp8p7UidQSVCQfB-1297423 
>> >> .
>> >>
>> >>
>> >> We use Refseq transcript output for NM_003573.2, and got the following:
>> >>
>> >> NM_003573.2:c.4200G>A |NP_003564.2:p.Met1400Ile | ATG/ATA
>> >>
>> >> For the corresponding Ensembl transcript ENST00000204005 [1], we get
>> >> the following:
>> >>
>> >> ENST00000204005.9:c.4198G>A | ENSP00000204005.9:p.Gly1400Arg | GGG/AGG
>> >>
>> >> In dbSNP and other databases, the correct cDNA position for the RefSeq
>> >> transcript for this variant is 4201, not 4200.
>> >>
>> >> So I have two questions:
>> >>
>> >> 1. Why is there a three base difference between the two transcripts
>> >> (4201 vs 4198)?
>> >>
>> >> 2. Is there something going wrong in the calculation of the RefSeq
>> >> data? Note the frameshift for the codons, resulting in wrong protein
>> >> as well.
>> >>
>> >>
>> >> Best regards,
>> >> Svein Tore Koksrud Seljebotn
>> >>
>> >>
>> >> [1]
>> >> http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000090006;r=19:41099072-41135725;t=ENST00000204005;tl=1Vp8p7UidQSVCQfB-1297423
>> >>
>> >
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
> 
> <RefSeq_indel.txt><refseq_map_info.pl>