[ensembl-dev] translateable_seq returning sequences that don't appear to be translateable

Tue Jan 25 13:58:14 GMT 2011

Without looking into it in too much detail, I think this is conceivably part of the explanation.

More importantly, we should emphasize that the human, mouse and, in the near future, zebrafish gene sets presented in Ensembl are the product of a merge of the manually curated Havana set (where such annotation is available) and the automatically annotated Ensembl gene set. This merge is also the basis of the Gencode gene set.

http://www.gencodegenes.org/

As far as I am aware, IPI is going to be phased out at some stage, but UniProtKB has imported Ensembl-Havana translations that were not reflected in the original UniProtKB set. This procedure has been completed for human, mouse, rat, chicken, cow and dog. It is important to note that records imported into UniProtKB from Ensembl are properly labelled and have their protein existence lines (PE lines) set appropriately. This avoids circular references of biological support.

You can find records based on Ensembl-Havana in UniProtKB with the following query:

http://www.uniprot.org/uniprot/?query=author%3a%22Ensembl%22&by=taxonomy#9347,314146,39107,314145

Therefore, for the above species, UniProtKB should provide you with good coverage of the proteome, much in the same way IPI aimed to.

Best regards,
Michael

> Are these details related to the observation that the number of IPI
> entries pointing only to Vega is 6871? This suggests Havana is picking
> up or extending ORFs beyond the Ensembl set (or at least that
> cluster-out in IPI). Is the situation vice versa in IPI for the 3983
> Ensembl ORFs not picked up by Vega (or any other pipe for that matter)?
> 
> Yours,  Chris
> 
> 
>  
> -----Original Message-----
> From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf
> Of Michael Schuster
> Sent: 25 January 2011 12:13
> To: Jeff Hussmann
> Cc: dev at ensembl.org
> Subject: Re: [ensembl-dev] translateable_seq returning sequences that
> don't appear to be translateable
> 
> Hi Jeff,
> 
> Just to clarify this a bit more. The transcripts you see are mainly
> arising from the manual genome annotation effort by the Havana group at
> the Wellcome Trust Sanger Institute. As these transcripts are manually
> annotated and hand checked, curators extend the transcript as far as
> they find evidence for in cDNAs and ESTs. This may lead to cases where
> the end of the transcript and thus its translation no longer ends at a
> codon boundary. Truncating the translation back to a codon boundary
> would imply a UTR, which is not true. These cases rather indicate that
> the translation must biologically extend, yet there is currently no
> support for its complete annotation.
> 
> The Ensembl API handles these cases perfectly fine and upon getting the
> translated sequence you will find an X either at the beginning or end of
> those transcripts. Looking for an X in the translated sequence will
> therefore tell you that the sequence is not complete and biological
> evidence is missing.
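
The X check described above can be sketched outside the API. This is only an illustration in Python, not the Ensembl Perl API and not a description of how Ensembl implements translation. The function names translate_cds and is_incomplete are hypothetical, and the sketch assumes that any codon that cannot be resolved (a partial codon at the end, or one containing N) is reported as X.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def translate_cds(cds):
    """Translate a coding sequence; unresolvable codons become 'X'."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        # A trailing partial codon or a codon containing N is not in the
        # table, so it falls back to 'X' (assumption of this sketch).
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


def is_incomplete(cds):
    """True if the translation contains an X, i.e. evidence is missing."""
    return "X" in translate_cds(cds)


if __name__ == "__main__":
    for cds in ("ATGGCCTAA", "ATGGCCTA"):
        print(cds, len(cds) % 3, translate_cds(cds), is_incomplete(cds))
```

A length that is not a multiple of three, or an N in the sequence, shows up as an X in the output, so a single test on the peptide covers both cases raised in the original question.
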
> 
> In the automated Ensembl genome annotation pipeline these cases are
> handled differently so that all transcripts and their translations are
> truncated back to a codon boundary irrespective of a short overhang.
> 
> Sequence edits are completely separate from this issue, and the Ensembl
> API supports them on both levels, for Transcripts and for Translations.
> Ensembl strictly transcribes transcripts off the genome sequence: a
> transcript is just the concatenation of all exon sequences, and a
> translation is the translated part of the coding region. Sequence
> edits therefore let us override the genome sequence locally.
> 
> As far as I am aware, we are not using RNA edits at this stage, but we
> could use them to override polymorphic pseudogenes where the reference
> genome has either a missense mutation or a stop codon, while other
> populations clearly have a functional gene. With such a sequence edit we
> could patch the mRNA resulting from the reference genome into a
> functional molecule.
> 
> An example of a sequence edit on the translation level would be
> selenocysteines, where the symbol for a stop codon (*) gets replaced by
> the symbol for selenocysteine (U).
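
A translation-level edit of this kind can be pictured as a simple string substitution. The following Python sketch is an illustration only and does not use the Ensembl API: the function name apply_seq_edits and the (start, end, replacement) tuple layout with 1-based inclusive coordinates are assumptions made for the example.

```python
def apply_seq_edits(seq, edits):
    """Apply (start, end, alt_seq) edits; coordinates are 1-based, inclusive."""
    # Work from the highest start downwards so that coordinates of the
    # remaining edits stay valid even if an edit changes the length.
    for start, end, alt_seq in sorted(edits, reverse=True):
        seq = seq[:start - 1] + alt_seq + seq[end:]
    return seq


if __name__ == "__main__":
    peptide = "MSGA*LK*"  # internal '*' at position 5 is really a selenocysteine
    print(apply_seq_edits(peptide, [(5, 5, "U")]))
```

The nucleotide sequence is left untouched in this picture; only the peptide string changes, which matches the point that the UGA codon itself is not altered in the mRNA.
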
> 
> I hope this clarifies your observations.
> 
> Best regards,
> Michael
> 
> 
>> Hello. I recently asked a question about the EnsEMBL Perl API on the
>> biostar stackexchange site -
>> http://biostar.stackexchange.com/questions/5044/ensembl-perl-api-translateable-seq-returns-sequences-that-arent-multiples-of-3-n.
>> I have some questions about Giulietta's response to my question, and
>> this list seemed a more appropriate place to continue discussion than
>> in comments on biostar.
>> 
>> 1) Could anyone elaborate on Giulietta's point involving "all defined
>> RNA edits" and selenocysteine? My (very limited) understanding of
>> selenocysteine incorporation is that in eukaryotes, nothing in the mRNA
>> in the immediate vicinity of a UGA codon is changed by the fact that the
>> UGA will eventually be translated into selenocysteine. The database
>> would need to know about this in order to return the correct amino acid
>> sequence for a transcript, but translateable_seq doesn't return an amino
>> acid sequence. It returns a nucleotide sequence.
>> 
>> 2) The focus on ENSMUSG00000064363 in the biostar thread is
>> unfortunate. I was pressed for a specific example and chose one
>> randomly. I am more concerned with the issue of whether I have realistic
>> expectations for the translateable_seq method. A sequence that isn't a
>> whole number of codons long or that contains an 'N' character doesn't
>> seem translateable in a strict sense of the word. Is it consistent with
>> the design intent for the method for these sequences to be returned by
>> it?
>> 
>> - Jeff
>> _______________________________________________
>> Dev mailing list
>> Dev at ensembl.org
>> http://lists.ensembl.org/mailman/listinfo/dev
> 
> --
> Michael Schuster
> Ensembl Genome Browser
> EMBL - European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton, Cambridgeshire CB10 1SD
> United Kingdom
> 
> URL: http://www.ensembl.org/
> 
> 
> 

--
Michael Schuster
Ensembl Genome Browser
EMBL - European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridgeshire CB10 1SD
United Kingdom

URL: http://www.ensembl.org/