[ensembl-dev] translateable_seq returning sequences that don't appear to be translateable

Tue Jan 25 11:50:13 GMT 2011

Are these details related to the observation that the number of IPI
entries pointing only to Vega is 6871? This suggests Havana is picking
up or extending ORFs beyond the Ensembl set (or at least that
cluster-out in IPI). Is the situation vice versa in IPI for the 3983
Ensembl ORFs not picked up by Vega (or any other pipe for that matter)?

Yours,  Chris

-----Original Message-----
From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf
Of Michael Schuster
Sent: den 25 januari 2011 12:13
To: Jeff Hussmann
Cc: dev at ensembl.org
Subject: Re: [ensembl-dev] translateable_seq returning sequences that
don't appear to be translateable

Hi Jeff,

Just to clarify this a bit more. The transcripts you see are mainly
arising from the manual genome annotation effort by the Havana group at
the Wellcome Trust Sanger Institute. As these transcripts are manually
annotated and hand checked, curators extend the transcript as far as
they find evidence for in cDNAs and ESTs. This may lead to cases where
the end of the transcript and thus its translation no longer ends at a
codon boundary. Truncating the translation back to a codon boundary
would imply a UTR, which is not true. These cases rather indicate that
the translation must biologically extend, yet there is currently no
support for its complete annotation.

The Ensembl API handles these cases perfectly fine and upon getting the
translated sequence you will find an X either at the beginning or end of
those transcripts. Looking for an X in the translated sequence will
therefore tell you that the sequence is not complete and biological
evidence is missing.
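
A minimal sketch of this check, written in Python rather than against the
Ensembl Perl API. The names translate and looks_incomplete are
hypothetical, and the translation is deliberately simplified: any codon
that is incomplete or contains a non-ACGT character becomes X, whereas a
real translator may still resolve some ambiguous codons. It only models an
overhang at the 3' end.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a coding sequence; unresolvable codons become 'X'.

    Simplifying assumption: a trailing partial codon or any codon
    containing a character other than A/C/G/T is reported as 'X'.
    """
    seq = seq.upper()
    peptide = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        peptide.append(CODON_TABLE.get(codon, "X"))
    return "".join(peptide)


def looks_incomplete(peptide):
    """Return True if the peptide contains an 'X' placeholder."""
    return "X" in peptide
```

For example, translate("ATGGCCTA") ends in X because the final two bases
do not form a complete codon, so looks_incomplete reports True for it.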

In the automated Ensembl genome annotation pipeline these cases are
handled differently so that all transcripts and their translations are
truncated back to a codon boundary irrespective of a short overhang.
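
The truncation described above can be illustrated on a plain string. This
is only a sketch under stated assumptions: trim_to_codon_boundary is a
hypothetical helper, it assumes the reading frame starts at the first base,
and the actual pipeline works on annotation coordinates rather than on
sequence strings.

```python
def trim_to_codon_boundary(coding_seq):
    """Drop a 1- or 2-base overhang at the 3' end of a coding sequence.

    Assumes the reading frame begins at the first base of coding_seq.
    """
    overhang = len(coding_seq) % 3
    return coding_seq[:len(coding_seq) - overhang]
```

With this helper, "ATGGCCTA" is shortened to "ATGGCC", while a sequence
whose length is already a multiple of three is returned unchanged.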

Sequence edits are completely separate from this issue and the Ensembl
API supports them on both levels, for Transcripts and for Translations.
Ensembl strictly derives transcripts from the genome sequence: a
transcript is just the concatenation of all exon sequences, and a
translation is the translated part of the coding region. We can
therefore use these sequence edits to override the genome sequence
locally.

As far as I am aware, we are not using RNA edits at this stage, but we
could use them to override polymorphic pseudogenes where the reference
genome has either a missense mutation or a stop codon, while other
populations clearly have a functional gene. With such a sequence edit we
could patch the mRNA resulting from the reference genome into a
functional molecule.

An example of a sequence edit on the translation level would be
selenocysteines, where the symbol for a stop codon (*) gets replaced by
the symbol for selenocysteine (U).
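
As a rough illustration of such a translation-level edit, the sketch below
replaces a stop symbol with U at a given peptide position. It is written
in Python, not against the Ensembl Perl API; the SeqEdit class, its field
names, and apply_edits are assumptions made for this example, and
coordinates are taken to be 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass
class SeqEdit:
    """Hypothetical edit: replace positions start..end with alt_seq."""
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    alt_seq: str  # replacement residues


def apply_edits(seq, edits):
    """Apply edits to seq and return the edited sequence.

    Edits are applied from right to left so that an edit which changes
    the length does not shift the coordinates of edits further left.
    """
    for edit in sorted(edits, key=lambda e: e.start, reverse=True):
        seq = seq[:edit.start - 1] + edit.alt_seq + seq[edit.end:]
    return seq
```

Applying SeqEdit(3, 3, "U") to the peptide "MA*GK*" turns the internal
stop into a selenocysteine and leaves the terminal stop in place.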

I hope this clarifies your observations.

Best regards,
Michael

> Hello. I recently asked a question about the EnsEMBL Perl API on the
biostar stackexchange site -
http://biostar.stackexchange.com/questions/5044/ensembl-perl-api-translateable-seq-returns-sequences-that-arent-multiples-of-3-n.
I have some questions about Giulietta's response to my question, and this
list seemed a more appropriate place to continue discussion than in
comments on biostar.
> 
> 1) Could anyone elaborate on Giulietta's point involving "all defined
RNA edits" and selenocysteine? My (very limited) understanding of
selenocysteine incorporation is that in eukaryotes, nothing in the mRNA
in the immediate vicinity of a UGA codon is changed by the fact that the
UGA will eventually be translated into selenocysteine. The database
would need to know about this in order to return the correct amino acid
sequence for a transcript, but translateable_seq doesn't return an amino
acid sequence. It returns a nucleotide sequence.
> 
> 2) The focus on ENSMUSG00000064363 in the biostar thread is
unfortunate. I was pressed for a specific example and chose one
randomly. I am more concerned with the issue of whether I have realistic
expectations for the translateable_seq method. A sequence that isn't a
whole number of codons long or that contains an 'N' character doesn't
seem translateable in a strict sense of the word. Is it consistent with
the design intent for the method for these sequences to be returned by
it?
> 
>  - Jeff
> _______________________________________________
> Dev mailing list
> Dev at ensembl.org
> http://lists.ensembl.org/mailman/listinfo/dev

--
Michael Schuster
Ensembl Genome Browser
EMBL - European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridgeshire CB10 1SD
United Kingdom

URL: http://www.ensembl.org/
