[ensembl-dev] Oddity with ENSP identifiers: same identifier for two unrelated sequences?

Andy Yates ayates at ebi.ac.uk
Wed Jun 13 16:24:32 BST 2012


Hi there,

We've gone through our code and found a pair of issues which caused the confusing history of these translations. Firstly, the peptide sequence was not used when deciding whether to increment the translation stable ID version; only the transcript's spliced sequence was used. The effect was that the translation stable ID version incremented at the same rate as the transcript stable ID version. We have changed this logic so that the translation stable ID version is incremented whenever the resulting peptide sequence differs.
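
To illustrate the rule change, here is a minimal Python sketch. It is not the actual mapping pipeline code: the function names (translate, old_rule, new_rule) are hypothetical, and the 12 bp sequence is a toy example rather than the real cDNA of ENST00000447506. It shows how an unchanged spliced sequence with a shifted coding start yields a different peptide, which the old comparison would miss and the new comparison would catch.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def translate(spliced, cds_start):
    """Translate from cds_start (0-based) to the end of the sequence.

    A trailing partial codon is ignored.
    """
    peptide = []
    for i in range(cds_start, len(spliced) - 2, 3):
        peptide.append(CODON_TABLE[spliced[i:i + 3]])
    return "".join(peptide)


def old_rule(version, old_spliced, new_spliced):
    """Hypothetical old behaviour: only the spliced sequence is compared."""
    return version + 1 if old_spliced != new_spliced else version


def new_rule(version, old_peptide, new_peptide):
    """Hypothetical new behaviour: the peptide sequence is compared."""
    return version + 1 if old_peptide != new_peptide else version


# Toy data (assumption): same spliced sequence, coding start shifted by 1 bp.
spliced = "ATGGCCAAATTT"
peptide_before = translate(spliced, 0)
peptide_after = translate(spliced, 1)

print(peptide_before, peptide_after)
print(old_rule(1, spliced, spliced))
print(new_rule(1, peptide_before, peptide_after))
```

With the old comparison the translation version stays at 1 because the spliced sequence has not changed; with the peptide comparison it moves to 2.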

The second issue was related to the splicing of an exon between releases. This resulted in a penalty, which meant that even though the exons had a perfect location match we still attempted to match them with exonerate. This penalty has been removed.

All the best,

Andy

Andrew Yates                   Ensembl Core Software Project Leader
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensembl.org/

On 12 Jun 2012, at 11:53, Liu, Mingyi wrote:

> Hi, Andy,
> 
> Thanks for explaining the reason for the change, that'll be helpful when we have to explain the results to our internal customers, if needed.
> 
> However, our main confusion was that we assumed any sequence change of ENSP00000400005 from v64 to v65 should have resulted in either a new stable ID or a new version of the same stable ID.  But both v64 and v65 list this protein's latest version as ENSP00000400005.1, meaning that despite the sequence change the ID/version stayed the same. This causes issues in our internal sequence storage, analysis, and ID-based linking of results (we did notice that in this particular case the ID disappeared in v67).  A colleague of ours seemed to believe this issue affected more than this one ID, although we haven't had time to track it down yet.
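
One way to look for other affected identifiers is to compare sequences release by release for every ID.version pair that appears in both. The following is a minimal Python sketch under stated assumptions: each release has already been loaded into a dict keyed by (stable ID, version), the function name find_silent_changes is hypothetical, and the sequences shown are placeholders rather than the real peptides.

```python
import hashlib


def seq_checksum(sequence):
    """MD5 of the upper-cased sequence, used only as a compact fingerprint."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def find_silent_changes(old_release, new_release):
    """Return 'ID.version' strings present in both releases whose
    sequences differ even though the ID and version are unchanged.

    Each release is a dict mapping (stable_id, version) -> peptide sequence.
    """
    changed = []
    for key, old_seq in old_release.items():
        new_seq = new_release.get(key)
        if new_seq is None:
            continue  # ID.version not present in the newer release
        if seq_checksum(old_seq) != seq_checksum(new_seq):
            changed.append(f"{key[0]}.{key[1]}")
    return sorted(changed)


# Placeholder data (assumption): not the real sequences from either release.
release_64 = {("ENSP00000400005", 1): "MAKF", ("ENSP00000402579", 1): "MKV"}
release_65 = {("ENSP00000400005", 1): "WPN", ("ENSP00000402579", 1): "MKV"}

print(find_silent_changes(release_64, release_65))
```

Storing the checksum alongside ID.version in a local database would also let such changes be detected at load time.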
> 
> Thanks,
> 
> Mingyi
> 
>> -----Original Message-----
>> From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf
>> Of Andy Yates
>> Sent: Tuesday, June 12, 2012 4:54 AM
>> To: Ensembl developers list
>> Subject: Re: [ensembl-dev] Oddity with ENSP identifiers: same identifier
>> for two unrelated sequences?
>> 
>> Hi Matthew,
>> 
>> I can see where your confusion lies here as you have to investigate the
>> archived transcripts of these proteins to discover what has occurred.
>> Please bear in mind that protein stable IDs are derived from the
>> transcript & transcript stable IDs are derived from its exons. The only
>> biological unit which is physically mapped is the exon.
>> 
>> If we check the archive sites for the protein identifier
>> ENSP00000400005 we can see a distinct point in time when the protein
>> sequence changed, which was between releases 64 and 65:
>> 
>> * 64 - http://sep2011.archive.ensembl.org/Homo_sapiens/Transcript/Sequence_Protein?db=core;g=ENSG00000236022;r=17:19109483-19110124;t=ENST00000447506
>> * 65 - http://dec2011.archive.ensembl.org/Homo_sapiens/Transcript/Sequence_Protein?db=core;g=ENSG00000236022;r=17:19109483-19110124;t=ENST00000447506
>> 
>> A quick check of the exon structure reveals that we still have the same
>> exons in the transcript but that we've had a 44bp contraction of the
>> coding sequence at the 3' end. The actual sequence of the exons is still
>> identical, therefore this is still the same transcript. Also note the
>> change of phase in the 1st exon, as this will be important in a second.
>> 
>> * 64 - http://sep2011.archive.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000236022;r=17:19109483-19110124;t=ENST00000447506
>> * 65 - http://dec2011.archive.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000236022;r=17:19109483-19110124;t=ENST00000447506
>> 
>> A final check of the cDNA shows that this change of phase has resulted
>> in a coding frameshift which, when combined with the truncated CDS, has
>> resulted in two proteins which seem to be unrelated.
>> 
>> * 64 - http://sep2011.archive.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000236022;r=17:19109483-19110124;t=ENST00000447506
>> * 65 - http://dec2011.archive.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000236022;r=17:19109483-19110124;t=ENST00000447506
>> 
>> In this situation the mapping pipeline will have incremented the
>> versions on the transcript and protein to flag that there has been a
>> change in the underlying spliced/translated sequence and you should
>> proceed with caution if you were to use this mapping.
>> 
>> 
>> A further complication seems to be that in release 67 this Havana transcript
>> has been flagged as a processed transcript (non-coding transcript
>> without an ORF) & no longer as a protein coding transcript. The protein
>> should no longer be considered active as indicated by the protein
>> history interface.
>> 
>> 
>> Best regards,
>> 
>> Andy
>> 
>> Andrew Yates                   Ensembl Core Software Project Leader
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensembl.org/
>> 
>> On 11 Jun 2012, at 20:26, Healy, Matthew wrote:
>> 
>>> 
>>> The first URL below appears to show two different sequences with the same accession number.
>>> 
>>> http://useast.ensembl.org/Homo_sapiens/Transcript/Idhistory/Protein?p=ENSP00000400005;t=ENSP00000400005
>>> 
>>> The second URL below shows one possible explanation: perhaps this is the correct accession number for one of the above sequences and there is a bug in the web interface?
>>> 
>>> http://useast.ensembl.org/Homo_sapiens/Transcript/Idhistory/Protein?db=core;t=ENSP00000402579
>>> 
>>> This message (including any attachments) may contain confidential,
>> proprietary, privileged and/or private information.  The information is
>> intended to be for the use of the individual or entity designated above.
>> If you are not the intended recipient of this message, please notify the
>> sender immediately, and delete the message and any attachments.  Any
>> disclosure, reproduction, distribution or other use of this message or
>> any attachments by an individual or entity other than the intended
>> recipient is prohibited.
>>> 
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> List admin (including subscribe/unsubscribe):
>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>> 
>> 
> 




