[ensembl-dev] Transcripts with stop codon ?

Thu Feb 2 09:52:55 GMT 2012

We have to deal with predictions, this is part of the game.

Nevertheless, in such cases, you could select as the representative 
transcript the one with the most supporting evidence, which is not always the longest one.

This would not help for ENSCJAT00000065209, though: both transcripts appear to be 
predictions.

Sébastien

>> Dear Sébastien
>>
>> This particular example is likely a prediction artefact, as the stop codon is very close to the splice site in orthologous proteins (http://tinyurl.com/7k2lfsf).
>>
>> As pointed out by Lukasz, not all stop codons are wrong. Selenocysteines are encoded by the UGA codon, which normally acts as a stop codon. This is why you may find them in our compara alignments.
>
> Our study focused on FANTOM3 mouse cDNAs, while ENSEMBL starts predictions from the genome and mostly relies on protein homology as evidence, so it's difficult to compare directly.
>
> Still, intron retention was a common cause of internal STOP codons in the F3 collection, and the same might happen with gene predictions if substantial weight was attached to full-length cDNAs as evidence.
>
>>
>> Kind regards
>>
>> Javier
>>
>> On 01/02/12 16:25, Lukasz Huminiecki wrote:
>>> Dear ENSEMBL,
>>> While it sounds like the problem discussed here is an experimental or prediction artefact, please also note that there are some genuine (if not necessarily always functional) transcripts with stop codons. For example:
>>>
>>> Pseudo-messenger RNA: phantoms of the transcriptome.
>>> Frith MC, Wilming LG, Forrest A, Kawaji H, Tan SL, Wahlestedt C, Bajic VB, Kai C, Kawai J, Carninci P, Hayashizaki Y, Bailey TL, Huminiecki L.
>>> PLoS Genet. 2006 Apr;2(4):e23. Epub 2006 Apr 28.
>>> PMID: 16683022
>>>
>>> kind regards, Lukasz
>>>
>>> On Feb 1, 2012, at 4:11 PM, Michael Paulini wrote:
>>>
>>>> On 01/02/12 14:54, Moretti Sébastien wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> I have just noticed that some transcripts have stop codon(s) in their
>>>>>>> sequence, e.g. ENSCJAT00000065209.
>>>>>>>
>>>>>>> Is this normal?
>>>>>>>
>>>>>>>
>>>>>>> These stop codons and, more problematically, the "fake" codons that follow
>>>>>>> the stop are included in compara alignments.
>>>>>>>
>>>>>> you mean translations?
>>>>>> Since the translation also lacks a start codon, I would
>>>>>> put that down as a prediction artefact, similar to what you see in
>>>>>> many low-coverage gene sets, where you get fragments of genes and
>>>>>> in-frame stops.
>>>>> I fully agree about prediction artefacts.
>>>>> But in the case of ENSCJAT00000065209, there are 2 predicted amino acids after the stop codon. These are included in your alignment and tree-building processes.
>>>>> Two aa should not disturb the phylogeny too much, but what happens if there are 40 untranslated aa?
>>>> I think the usual rule applies here: "bad protein predictions make bad phylogenetic trees".
>>>> In the case of our nematode genomes, we fix them when we see them, but in more hands-off operations, if it can't be scripted, it will not happen.
>>>> I also had a long discussion about in-frame stops in models on next-gen assemblies, and as most of these genomes will not get fixed in the foreseeable future, there are two options: allow them to cover genomic errors, or count them as real stops and lose potential parts of a gene (which will also mess up phylogenetic trees).
>>>>
>>>> M
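
The two options described above for in-frame stops, and the question of how many residues sit after the stop, can be illustrated with a minimal Python sketch. It assumes the translated peptide is available as a plain string in which "*" marks a stop codon (a common convention, but an assumption here); the function names are hypothetical and are not part of the Ensembl or Compara APIs. It does not distinguish a selenocysteine-encoding UGA from a real stop.

```python
STOP = "*"  # assumed stop symbol in the translated sequence


def residues_after_first_stop(peptide: str) -> int:
    """Count amino acids that follow the first in-frame (internal) stop.

    A stop at the very end of the peptide is treated as the normal
    terminator and is ignored.
    """
    body = peptide.rstrip(STOP)
    idx = body.find(STOP)
    if idx == -1:
        return 0
    # Do not count any further stop symbols as residues.
    return len(body[idx + 1:].replace(STOP, ""))


def mask_internal_stops(peptide: str) -> str:
    """Option 1: tolerate in-frame stops (e.g. as possible genomic errors).

    Internal stops are replaced with the unknown residue 'X', so the
    full-length prediction is kept.
    """
    return peptide.rstrip(STOP).replace(STOP, "X")


def truncate_at_first_stop(peptide: str) -> str:
    """Option 2: treat the first stop as real and drop everything after it."""
    return peptide.split(STOP, 1)[0]


if __name__ == "__main__":
    example = "MKT*LV"  # made-up peptide with 2 residues after the stop
    print(residues_after_first_stop(example))
    print(mask_internal_stops(example))
    print(truncate_at_first_stop(example))
```

A check like residues_after_first_stop could be used to flag predictions where many residues follow the stop (the "40 aa" case) before they go into alignment and tree building; which of the two options is preferable depends on whether the stop is believed to reflect an assembly error or a genuine truncation.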