[ensembl-dev] Oddity with ENSP identifiers: same identifier for two unrelated sequences?

Andy Yates ayates at ebi.ac.uk
Tue Jun 12 09:53:33 BST 2012

Hi Matthew,

I can see where the confusion lies: you have to look at the archived transcripts behind these proteins to discover what has happened. Please bear in mind that a protein stable ID is derived from its transcript, and a transcript stable ID is derived from its exons. The only biological unit which is physically mapped is the exon.

If we check the archive sites for the transcript ENST00000447506 (the transcript behind ENSP00000400005), we can see that the protein sequence changed at a distinct point in time, between releases 64 and 65:

* 64 - http://sep2011.archive.ensembl.org/Homo_sapiens/Transcript/Sequence_Protein?db=core;g=ENSG00000236022;r=17:19109483-19110124;t=ENST00000447506
* 65 - http://dec2011.archive.ensembl.org/Homo_sapiens/Transcript/Sequence_Protein?db=core;g=ENSG00000236022;r=17:19109483-19110124;t=ENST00000447506

A quick check of the exon structure reveals that we still have the same exons in the transcript, but that there has been a 44bp contraction of the coding sequence at the 3' end. The actual sequence of the exons is still identical, therefore this is still the same transcript. Also note the change of phase in the 1st exon, as this will be important in a moment.

* 64 - http://sep2011.archive.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000236022;r=17:19109483-19110124;t=ENST00000447506
* 65 - http://dec2011.archive.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000236022;r=17:19109483-19110124;t=ENST00000447506

A final check of the cDNA shows that this change of phase has resulted in a coding frameshift which, when combined with the truncated CDS, has produced two proteins which appear to be unrelated.

* 64 - http://sep2011.archive.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000236022;r=17:19109483-19110124;t=ENST00000447506
* 65 - http://dec2011.archive.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000236022;r=17:19109483-19110124;t=ENST00000447506
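
To illustrate why a change of phase alone is enough to make two translations look unrelated, here is a minimal sketch in Python. The sequence TOY_CDNA and the helper translate() are made up for illustration; this is not the real ENST00000447506 cDNA and not part of the Ensembl API. It only assumes the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical toy cDNA, NOT the real transcript sequence
TOY_CDNA = "ATGGCCAAGCTGACCGGTTACGAATGA"


def translate(cdna, start):
    """Translate cdna from offset `start`, stopping at the first stop codon."""
    region = cdna[start:]
    protein = []
    for i in range(0, len(region) - 2, 3):
        aa = CODON_TABLE[region[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


in_frame = translate(TOY_CDNA, 0)  # original reading frame
shifted = translate(TOY_CDNA, 2)   # same bases, reading frame shifted by 2

print(in_frame)  # MAKLTGYE
print(shifted)   # GQADRLRM
```

The underlying bases are identical in both calls; only the starting offset differs, yet the two peptides share nothing recognisable. Truncating the CDS on top of that, as happened here, makes the two proteins diverge even further.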

In this situation the mapping pipeline will have incremented the versions on the transcript and protein to flag that there has been a change in the underlying spliced/translated sequence, so you should proceed with caution if you use this mapping.
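
If you track identifiers across releases yourself, one way to act on this is to compare the version as well as the stable ID. The sketch below assumes IDs are written in the usual "stable_id.version" form; the helper names and the version numbers in the example are hypothetical and are not the actual versions of ENSP00000400005.

```python
def split_stable_id(versioned_id):
    """Split 'ENSP00000400005.2' into ('ENSP00000400005', 2).

    Returns None as the version when no '.version' suffix is present.
    """
    stable_id, sep, version = versioned_id.partition(".")
    return stable_id, (int(version) if sep else None)


def same_sequence_expected(old_id, new_id):
    """True only if both the stable ID and the version match.

    A matching stable ID with a different (or missing) version means the
    underlying sequence may have changed and should be checked by hand.
    """
    old_base, old_version = split_stable_id(old_id)
    new_base, new_version = split_stable_id(new_id)
    return (
        old_base == new_base
        and old_version is not None
        and old_version == new_version
    )


# Version numbers below are placeholders for illustration only
print(same_sequence_expected("ENSP00000400005.1", "ENSP00000400005.1"))  # True
print(same_sequence_expected("ENSP00000400005.1", "ENSP00000400005.2"))  # False
```

Treating an unversioned ID as "not safe" is a deliberately cautious choice: without the version there is no way to tell whether the sequence behind the identifier has changed.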

A further complication seems to be that in release 67 this Havana transcript has been flagged as a processed transcript (a non-coding transcript without an ORF) and no longer as a protein-coding transcript. The protein should no longer be considered active, as indicated by the protein history interface.

Best regards,


Andrew Yates                   Ensembl Core Software Project Leader
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensembl.org/

On 11 Jun 2012, at 20:26, Healy, Matthew wrote:

> The first URL below appears to show two different sequences with the same accession number.
> http://useast.ensembl.org/Homo_sapiens/Transcript/Idhistory/Protein?p=ENSP00000400005;t=ENSP00000400005
> The second URL below shows one possible explanation: perhaps this is the correct accession number for one of the above sequences and there is a bug in the web interface?
> http://useast.ensembl.org/Homo_sapiens/Transcript/Idhistory/Protein?db=core;t=ENSP00000402579
