[ensembl-dev] protein position column for indels

Andrew Parton aparton at ebi.ac.uk
Fri Jul 26 11:42:56 BST 2019

Hi David,

Thank you for your query. There are a couple of reasons for these differences.

1) In HGVS notation, insertions and deletions are always described at their most 3’ position. So if, for example, you insert an A into a run of As, the HGVS output will report the insertion at the most 3’ position, whereas the protein position column will report the position as it was given to VEP. We are currently looking at shifting all variants 3’ by default, and will include this in a future release.
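To illustrate point 1, here is a minimal Python sketch of 3’ shifting for an insertion in a repeat. It is not VEP's implementation; the function names, the 0-based index convention and the example sequence are all assumptions made for illustration.

```python
def shift_insertion_3prime(reference: str, index: int, inserted: str) -> tuple[int, str]:
    """Slide a non-empty insertion as far 3' as possible without changing the result.

    `index` is the 0-based offset in `reference` before which `inserted` is placed
    (a hypothetical convention chosen for this sketch).
    """
    # While the next reference base equals the first inserted base, the
    # insertion can be rotated one base to the right and still give the
    # same final sequence.
    while index < len(reference) and reference[index] == inserted[0]:
        inserted = inserted[1:] + reference[index]
        index += 1
    return index, inserted


def apply_insertion(reference: str, index: int, inserted: str) -> str:
    """Return the sequence obtained by inserting `inserted` before `index`."""
    return reference[:index] + inserted + reference[index:]


# Made-up reference with a run of four As at offsets 2-5.
reference = "GCAAAATG"
given_index = 2  # insertion as supplied: an A before the first A of the run
shifted_index, shifted_seq = shift_insertion_3prime(reference, given_index, "A")

# Both descriptions produce the same mutated sequence; only the reported
# position differs (given position vs. most 3' position).
print(given_index, shifted_index)
print(apply_insertion(reference, given_index, "A")
      == apply_insertion(reference, shifted_index, shifted_seq))
```

The position as given and the 3’-shifted position describe the same change, which is why the two columns can disagree on position without disagreeing on the variant.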

2) The protein position column will cover all input locations (including the reference), while the HGVS output will use only a minimal allele string. For example, in the sequence


Then an insertion of a T at position [3,4], given in the standard VCF format as ‘chr 3 varName G GT’, would produce a range of 1-2 for the protein position (because it also considers the reference G that was given), while the HGVS output would recognise the insertion as affecting only position 2.
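The arithmetic in point 2 can be reproduced with a small Python sketch. This is a toy model, not VEP's algorithm: it assumes the VCF position is already a coding-sequence position, that coding position 1 is the first base of codon 1, and that the "as given" range runs from the first reference base to the base just after the reference allele. All function names are hypothetical.

```python
def codon_number(cds_pos: int) -> int:
    """Map a 1-based coding position to its 1-based codon (protein) position."""
    return (cds_pos - 1) // 3 + 1


def trim_shared_prefix(pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Drop leading bases shared by ref and alt, advancing the position."""
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def toy_protein_positions(pos: int, ref: str, alt: str) -> tuple[tuple[int, int], int]:
    """Return (range covering the input as given, position of the minimal allele)."""
    # As given: from the first reference base to the base after the
    # reference allele, i.e. both bases flanking the insertion point.
    as_given = (codon_number(pos), codon_number(pos + len(ref)))
    # Minimal allele: ignore the shared reference base(s) first.
    trimmed_pos, _, _ = trim_shared_prefix(pos, ref, alt)
    minimal = codon_number(trimmed_pos)
    return as_given, minimal


# The VCF-style record from the example: position 3, ref G, alt GT.
as_given, minimal = toy_protein_positions(3, "G", "GT")
print(as_given, minimal)
```

Under these assumptions the record gives a covered range of 1-2 and a minimal-allele position of 2, matching the numbers quoted above.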

I think these two cases cover all of your examples. If you have any more questions, or any particular examples that you’d like us to take a closer look at, please let us know.

Kind Regards,

> On 25 Jul 2019, at 16:11, David Tamborero <david.tamborero at gmail.com> wrote:
> Hi ensembl devs,
> I'm struggling to fully understand how the 'protein position' column is calculated when I check it against the variant's hgvsp representation.
> This happens only for indels; some examples (left = hgvsp entry; right = protein position entry):
> frameshift:
> ENSP00000277541.6:p.Gln2444ThrfsTer34   2444
> ENSP00000256474.2:p.Lys159ArgfsTer14   158-159
> ENSP00000324856.6:p.Tyr253SerfsTer32   252-254
> inframe deletions:
> ENSP00000356379.4:p.Tyr1373del   1373-1374
> ENSP00000361824.3:p.Glu2207del    2207
> ENSP00000339004.3:p.His57del      53-54
> ENSP00000268125.5:p.Phe96_Phe99del    96-99
> ENSP00000413720.3:p.Ala171_Ala174del    171-175
> ENSP00000368332.4:p.Ala114_Ala115del    110-112
> inframe insertions:
> ENSP00000369497.3:p.Glu238_Ser239insArg   239
> ENSP00000339867.2:p.Asp687_Gly688insPhe   687-688
> ENSP00000445920.1:p.Val188_Ala192dup   188-192
> ENSP00000361824.3:p.Arg2308_Met2309dup   2308-2310
> I'm guessing that this may be related in part to right/left alignment discrepancies in the reported coordinates between these two columns (e.g. ENSP00000368332.4:p.Ala114_Ala115del --> 110-112 or ENSP00000339004.3:p.His57del --> 53-54)?
> And that there is some issue that sometimes makes the protein position column report 'n' or 'n+1' positions, where n is the number of affected residues according to the HGVSp (e.g. ENSP00000277541.6:p.Gln2444ThrfsTer34 --> 2444 or ENSP00000445920.1:p.Val188_Ala192dup --> 188-192 report 'n', whereas ENSP00000413720.3:p.Ala171_Ala174del --> 171-175 or ENSP00000368332.4:p.Ala114_Ala115del --> 110-112 report 'n+1')?
> Apologies if this is documented somewhere; I have not been able to find the details of that entry.
> thanks in advance!
> d
