[ensembl-dev] Probably incorrect HGVS on GRCh37 RefSeq

Andrew Parton aparton at ebi.ac.uk
Tue Sep 1 14:07:54 BST 2020


Hi,

From release 100 onwards, VEP should detect any misalignment between a RefSeq transcript and the underlying reference genome, and correct for it so that the HGVS it reports is accurate. For example, the variant you reported in January, earlier in this thread, now has position 516 in its HGVSc: `perl vep -id '12 103249104 . C A' --database --hgvs --refseq --assembly GRCh37` -> HGVSc=NM_000277.3:c.516G>T
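
As a rough illustration of the comparison discussed in this thread, the Python sketch below extracts the coding position from an HGVSc string and compares it with a CDS position. The function names are hypothetical and are not part of VEP, and the pattern only handles simple substitutions at plain coding positions (no intronic offsets, UTR positions, or indels). A difference is only a hint that something needs checking; as noted further down the thread, 3' shifting can also make the two positions differ.

```python
import re

# Matches simple coding substitutions such as "NM_000277.3:c.516G>T".
# Assumption: anything else (indels, intronic offsets, UTR positions) is out of scope.
_HGVSC_SUBSTITUTION = re.compile(
    r"^(?P<transcript>[^:]+):c\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def hgvsc_position(hgvsc):
    """Return the coding position as an int, or None if the string is not a simple substitution."""
    match = _HGVSC_SUBSTITUTION.match(hgvsc)
    return int(match.group("pos")) if match else None


def positions_agree(hgvsc, cds_position):
    """Return True/False when the positions can be compared, None when they cannot."""
    position = hgvsc_position(hgvsc)
    if position is None:
        return None
    return position == cds_position


# Values taken from this thread: VEP 99 reported c.517 with CDS position 516.
print(positions_agree("NM_000277.1:c.517G>T", 516))  # False
# The HGVSc reported from release 100, against the same CDS position.
print(positions_agree("NM_000277.3:c.516G>T", 516))  # True
```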

The REFSEQ_MATCH column reports whether the RefSeq transcript matches the underlying reference sequence and/or an Ensembl transcript. You can read more about the possible statuses here: https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#other_fields

Regarding the BAM_EDIT field, your understanding is correct. Failed edits are extremely rare (<0.01% of transcripts), but annotations produced on these transcripts should be interpreted with caution.
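
For anyone filtering VEP output on this basis, here is a minimal Python sketch of the interpretation confirmed above. The three values ("-", "OK", "FAILED") are the ones listed in the quoted message below; the dictionary and function names are hypothetical, and how the BAM_EDIT value is read from your output format is left to the caller.

```python
# Interpretation of the BAM_EDIT column as summarised in this thread.
BAM_EDIT_NOTES = {
    "-": "No mismatch found; annotations and HGVS are fine.",
    "OK": "Mismatch found and edit applied; annotations are fine.",
    "FAILED": "Mismatch found but edit could not be applied; interpret with caution.",
}


def needs_caution(bam_edit):
    """Return True when annotations on the transcript should be interpreted with caution."""
    value = bam_edit.strip()
    if value not in BAM_EDIT_NOTES:
        # Assumption: any other value is unexpected and should be surfaced, not guessed at.
        raise ValueError(f"unexpected BAM_EDIT value: {bam_edit!r}")
    return value == "FAILED"


for status in ("-", "OK", "FAILED"):
    print(status, needs_caution(status), BAM_EDIT_NOTES[status])
```

Raising on an unknown value is a deliberate choice in this sketch, so that a change in the column's vocabulary is noticed rather than silently treated as safe.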

Kind Regards,
Andrew


> On 31 Aug 2020, at 09:19, Wallace Ko <myko at l3-bioinfo.com> wrote:
> 
> Hi Andrew,
> 
> Since VEP 100 (GRCh37) the REFSEQ_MATCH column is filled with content. Is it reliable to use this column to determine if the HGVS code is probably incorrect because of RefSeq alignment mismatch?
> 
> Or shall I simply use the BAM_EDIT column for the purpose?
> 
> And is my understanding of the BAM_EDIT value below correct (according to this GitHub issue)?
> 	• -: no mismatch is found. Annotations and HGVS code are both fine.
> 	• OK: mismatch is found and fix is applied. Annotations are fine. HGVS code is fixed too but could still be incorrect in some cases.
> 	• FAILED: mismatch is found and fix could not be applied. Both annotations and HGVS code could be incorrect.
> 
> Regards,
> Wallace Ko
> 
> 
> On Tue, Jan 21, 2020 at 7:50 PM Andrew Parton <aparton at ebi.ac.uk> wrote:
> Hi,
> 
> Yep, that’s correct.
> 
> One thing to be aware of, however, is that our HGVS code shifts variants reported in repeated regions in the 3’ direction by default, while our CDS position is not shifted in this way. This is the most common cause of a mismatch between CDS position and HGVSc position, although such a mismatch can also be caused by these RefSeq alignment mismatches.
> 
> Kind Regards,
> Andrew
> 
>> On 21 Jan 2020, at 11:08, Wallace Ko <myko at l3-bioinfo.com> wrote:
>> 
>> Hi Andrew,
>> 
>> Thanks for the prompt response.
>> May I assume that this is just a problem with the HGVS calculation, and that the CDS position is already corrected using the RefSeq alignment in such cases?
>> 
>> Regards,
>> Wallace Ko
>> 
>> 
>> On Tue, Jan 21, 2020 at 6:30 PM Andrew Parton <aparton at ebi.ac.uk> wrote:
>> Hi Wallace,
>> 
>> Thanks for this report, it is an issue we are aware of. As you identified, not all RefSeq transcripts completely match the reference genome. In cases where they don't, we are now using alignment files provided by NCBI to create a new reference, matching the transcript, and use this for consequence calling.
>> 
>> Our HGVS calculation does not currently use this reference modification, but it is something we are working on and aim to release later this year. VEP can report reference mismatches for GRCh38, but these data are not available for GRCh37.
>> 
>> More details on the differences to the reference genome and correcting transcript models using BAM can be found here:  https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#refseq
>> 
>> Let us know if there’s anything else we can do to help.
>> 
>> Kind Regards,
>> Andrew
>> 
>>> On 21 Jan 2020, at 09:23, Wallace Ko <myko at l3-bioinfo.com> wrote:
>>> 
>>> Hi Ensembl Developers,
>>> 
>>> The variant NC_000012.11:g.103249104C>A is annotated by online VEP and offline cached VEP (99, RefSeq, GRCh37) as:
>>> 	• HGVSc: NM_000277.1:c.517G>T
>>> 	• HGVSp: NP_000268.1:p.Gln172His
>>> 	• CDS Position: 516
>>> On the other hand, ClinVar reports the variant as NM_000277.3:c.516G>T (NP_000268.1:p.Gln172His). In addition, a BLAST alignment of NM_000277.1 against NC_000012.11 shows a 1-bp gap between c.303 and c.304. Even VEP itself reports the CDS position as 516.
>>> 
>>> All of this makes me believe that the HGVSc should be reported at c.516 instead of c.517.
>>> 
>>> Regards,
>>> Wallace Ko
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
>>> Ensembl Blog: http://www.ensembl.info/
>> 




