[ensembl-dev] Incorrect HGVS nomenclature

Thu Feb 5 13:59:49 GMT 2015

Hi Vasisht,
I’ve attached a couple of files you might find useful.

e79_mrna_mismatch_GRCh37.txt: This file contains a list of imported RefSeq transcripts where the underlying genomic sequence does not match that of the mRNA sequence it is based upon. This scenario occurs because RefSeq annotation is carried out at the transcript level and then mapped to the genome. Sometimes transcripts may not map perfectly to the genome.

e79_mrna_no_comparison_GRCh37.txt: This file contains a list of imported RefSeq transcripts where we were unable to compare the underlying genomic sequence with a corresponding mRNA. This usually happens when the accession has changed or been retired, or when we were not able to parse the mRNA accession from the gff3 file during the import. This is the case for a lot of the imported RefSeq transcripts in GRCh37. It does not mean that there is definitely a difference between the genomic sequence and whatever mRNA the transcript was annotated from, just that we were not able to carry out the comparison.

Some points to note about these files:

1) They are based on our upcoming update to GRCh37, so the comparison was not carried out using the publicly available homo_sapiens_otherfeatures_78_37 db. However, the information in the files should allow you to examine the transcripts you’re interested in by using the genomic coordinates and stable ids.

2) In our upcoming release (e79) we have added an analysis called ‘refseq_import’ to the GRCh37 otherfeatures db (an import of the publicly available RefSeq gff3 file for GRCh37). This is the set of models that the lists were generated from (not the currently available ‘refseq_human_import’).

3) The stable ids for ‘refseq_import’ models are not unique. This is because a transcript can sometimes be mapped to several places on the genome. It’s therefore important to use the genomic coordinates (these are in the files) to make sure you’re looking at the correct transcript model. As an aside, I’ve noticed that in cases where the transcript maps to multiple places, often there is one perfect mapping and the others do not match the genomic sequence.

4) The decision as to whether or not there’s a mismatch between the genomic and mRNA sequence is made through pairwise alignment. The two sequences are aligned and if they’re identical they are considered a perfect match. Failing this, the mRNA sequence undergoes polyA clipping and the alignment is carried out again. If the two sequences then align with 100 percent identity and coverage, this is also considered a perfect match. Otherwise they are flagged as a mismatch. Note that this is done across the entire length of the transcript (so UTR is included if present). I’ve noticed that occasionally the only difference is that the 5’ UTR of the mRNA is longer; in future we might consider trimming these cases to see if it gives a perfect match.

5) This will all be easier to investigate with the update to GRCh37, as you will be able to find all this information as transcript attributes for the ‘refseq_import’ models. The attributes will cover whether the match is perfect or imperfect and, in the case of imperfect matches, in which region the mismatch occurred (5’ UTR, CDS, 3’ UTR for coding models, or just whole transcript for non-coding models), or whether no comparison was possible because of failure to find a matching mRNA accession.
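
As a rough illustration of the decision described in point 4, here is a minimal Python sketch. It is not the code used in the import: plain string comparison stands in for the pairwise alignment, the polyA step is simplified to "the mRNA equals the genomic sequence followed by a run of A's", and the function name, labels and the min_tail threshold are all hypothetical.

```python
def classify_match(genomic: str, mrna: str, min_tail: int = 5) -> str:
    """Classify a transcript's genomic sequence against its mRNA.

    Assumptions (not from the real pipeline): exact string equality
    stands in for an alignment with 100 percent identity and coverage,
    and a polyA tail is any trailing run of at least `min_tail` A's
    that extends past the end of the genomic sequence.
    """
    genomic = genomic.upper()
    mrna = mrna.upper()

    # First pass: the two sequences are identical.
    if genomic == mrna:
        return "perfect"

    # Second pass: clip a polyA tail from the mRNA and compare again.
    tail = mrna[len(genomic):]
    if mrna.startswith(genomic) and len(tail) >= min_tail and set(tail) == {"A"}:
        return "perfect_after_polya_clip"

    # Anything else is flagged as a mismatch.
    return "mismatch"


if __name__ == "__main__":
    # Toy sequences, made up for illustration only.
    print(classify_match("ATGGCCTGA", "ATGGCCTGA"))
    print(classify_match("ATGGCCTGA", "ATGGCCTGAAAAAAAA"))
    print(classify_match("ATGGCCTGA", "ATGGCCCTGA"))  # 1bp insertion in the mRNA
```

Both of the first two outcomes count as a perfect match in the files; they are kept as separate labels here only to show which pass succeeded.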

Hope this is of some help,

Fergal.

On 4 Feb 2015, at 14:53, Vasisht Tadigotla <vasisht.tadigotla at courtagen.com> wrote:

> Hi Will,
> 
> Thanks for the clarification. Is there a list of transcripts where the RefSeq sequence doesn’t match the reference?
> 
> Regards,
> Vasisht 
> 
> On February 4, 2015 at 4:38:56 AM, Will McLaren (wm2 at ebi.ac.uk) wrote:
> 
>> Hi Vasisht,
>> 
>> Our RefSeq transcript set is imported from a file of coordinates provided by NCBI. In some cases the sequence of a RefSeq transcript does not match the reference sequence over which it is mapped, which causes a problem since Ensembl does not import the original RefSeq sequence, only the coordinates to which it maps.
>> 
>> This means that retrieving and manipulating the sequence in these transcripts can lead to misinterpretations, as is the case here. According to our analysis there is a 1bp insertion in the RefSeq sequence that does not appear in the reference; this would explain the out by 1 error here.
>> 
>> We hope to be able to provide information about these mismatches in the output from the next version of VEP. In the meantime we would always recommend using the Ensembl transcript set where possible, as (among other good reasons!) transcript sequences always match the underlying reference.
>> 
>> Regards
>> 
>> Will McLaren
>> Ensembl Variation
>> 
>> On 3 February 2015 at 23:27, Vasisht Tadigotla <vasisht.tadigotla at courtagen.com> wrote:
>> Hi,
>> 
>> I’m annotating a variant using GRCh37 and the VEP in the v78 release and the HGVS annotation of the refseq transcripts doesn’t seem to match up to the sequences for those transcripts.
>> 
>> The variant is in SLC37A4 (chr11:g.118895980CAG>C), the HGVS annotations are NM_001467.5:c.1043_1044delCT and NP_001458.1:p.Pro348ArgfsTer? and the amino acid is being annotated as CCT/C.  The aa change is the same for all refseq transcripts in the annotation. 
>> 
>> The count seems to be off by one - it’s a CTG/G change. The local sequence context is GCC CTG TTT with the TG being deleted. 
>> 
>> The correct HGVS description is  NM_001467.5:c.1042_1043delCT  and the protein is p.Leu348Valfs*53. This is annotated correctly in the Ensembl transcripts - ENST00000545985.1:c.1042_1043delCT, ENSP00000475241.1:p.Leu348ValfsTer53. 
>> 
>> The following options were used for the annotation:
>> 
>> --offline --everything --merged
>> 
>> The same issue exists with the web version of VEP:
>> 
>> http://grch37.ensembl.org/Homo_sapiens/Tools/VEP/Results?db=core;tl=Ba08mzoDSO2008gG-584627
>> 
>> 
>> Thanks,
>> Vasisht
>> 
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: reseq_comparison.zip
Type: application/zip
Size: 647798 bytes
Desc: not available
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20150205/053071c3/attachment.zip>