[ensembl-dev] how to calculate transcript length ?

Tue Aug 28 11:36:10 BST 2012

Dear Amonida,
Thank you very much for your answer.
On the exonic untranslated regions, I found an article:

Statistical features of human exons and their flanking regions
M. Q. Zhang

Regarding the algorithm employed to estimate UTRs in Ensembl, should I refer to
http://asia.ensembl.org/Homo_sapiens/2011_09_human_genebuild.pdf

and to
http://vega.sanger.ac.uk/info/index.html

or could you suggest a more specific citation?

Thanks,

Enrico

________________________________
 From: Amonida Zadissa <amonida at sanger.ac.uk>
To: enrico1970 at yahoo.com; Ensembl developers list <dev at ensembl.org> 
Cc: "danielchen06 at gmail.com" <danielchen06 at gmail.com>; amonida at sanger.ac.uk 
Sent: Tuesday, 28 August 2012, 10:53
Subject: Re: [ensembl-dev]  how to calculate transcript length ?

Hi Enrico,

Please note that not all exons may be coding and some exons may be partially coding, like the example you have described.

The transcript ENST00000419234 is indeed 3179 bases long, but that length includes the untranslated regions (UTRs) at both the 5' and 3' ends. The first exon (ENSE00002313575) has 276 bases but only 82 bases are coding. The last exon (ENSE00001355929) is 1570 bases long but only the first 364 bases are coding. Excluding the UTRs (194 bases at the 5' end and 1206 at the 3' end) leaves 1779 coding bases, including the final stop codon. That is 593 codons, the last of which is the stop codon, giving a translation of 592 amino acids.
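
For anyone following the thread, the arithmetic above can be checked with a short Python sketch. It only uses the figures quoted in this message; the variable names are illustrative and nothing here queries Ensembl.

```python
# Figures quoted in this message for ENST00000419234.
# Variable names are illustrative, not Ensembl API names.
transcript_length = 3179          # sum of all exon lengths (spliced transcript)
five_prime_utr = 276 - 82         # first exon: 276 bases, only 82 coding
three_prime_utr = 1570 - 364      # last exon: 1570 bases, first 364 coding

# Coding sequence = transcript minus both UTRs (stop codon still included).
cds_length = transcript_length - five_prime_utr - three_prime_utr

# Each codon is 3 bases; the stop codon does not become an amino acid.
codons, remainder = divmod(cds_length, 3)
protein_length = codons - 1

print(cds_length, remainder, protein_length)
```

So dividing the full transcript length by 3 overestimates the protein length; the division should be applied to the coding bases only.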

Hope this clarifies the scenario.

Best regards,
Amonida

On 27/08/2012 23:56, enrico1970 at yahoo.com wrote:
> Dear Thibaut and Jay,
> I really appreciate your suggestions to the list, but I have a query similar to Gang's.
> On the page http://asia.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000089234;r=12:112080797-112123790;t=ENST00000419234
> 
> the transcript ENST00000419234 has a length of 3179 nucleotides, which corresponds to the sum of its exons, so the protein would be expected to have 3179/3 = 1059 amino acids,
> 
> but it has only 592 amino acids. The same happens for other transcripts.
> 
> How are the length of the transcript and the length of the protein defined?
> Kind regards,
> 
> Enrico Rubagotti
> 
> Hi,
> I would recommend using the Perl API for all the information you
> want to retrieve from Ensembl.
> It's quite easy to use, and if there is a schema change you will not
> have to change all your SQL queries.
> Here is some documentation:
> http://www.ensembl.org/info/docs/api/core/core_tutorial.html
> http://www.ensembl.org/info/docs/Doxygen/core-api/index.html
> http://www.ensembl.org/info/docs/api/index.html
> 
> Here is some code for your query:
> 
> use strict;
> use warnings;
> use Bio::EnsEMBL::Registry;
> 
> # Connect to the public Ensembl database server.
> my $registry = 'Bio::EnsEMBL::Registry';
> $registry->load_registry_from_db(
>     -host => 'ensembldb.ensembl.org',    # alternatively 'useastdb.ensembl.org'
>     -user => 'anonymous'
> );
> 
> # Fetch the gene and print the length of each of its transcripts.
> my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
> my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000089234');
> foreach my $transcript ( @{ $gene->get_all_Transcripts() } ) {
>     print STDOUT 'Length of ', $transcript->display_id, ': ', $transcript->length, "\n";
> }
> 
> Regards
> Thibaut
> 
> On 31/07/12 12:49, Jay Humphrey wrote:
>> Length is end - start + 1.
>> 
>> 1 2 3 [4 5 6 7 8] 9
>> start = 4, end = 8
>> 8 - 4 = 4, actually there are 5 residues.
>> 
>> On 31/07/2012 10:39, ?? wrote:
>>> Hi All
>>> I am wondering how to calculate transcript length within the Ensembl database.
>>> I tried to sum the exons' lengths:
>>> 
>>> SELECT tp.stable_id, SUM( e.seq_region_end ) - SUM( e.seq_region_start )
>>> FROM gene g
>>> JOIN transcript tp ON ( g.gene_id = tp.gene_id )
>>> JOIN exon_transcript et ON ( et.transcript_id = tp.transcript_id )
>>> JOIN exon e ON ( e.exon_id = et.exon_id )
>>> WHERE g.stable_id = 'ENSG00000089234'
>>> GROUP BY tp.stable_id
>>> 
>>> But the result is inconsistent with the official Ensembl data:
>>> http://asia.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000089234;r=12:112080797-112123790
>>> 
>>> If you know how to dig out the data for variation, orthologue, paralogue
>>> and regulation, please also tell me.
>>> 
>>> Thanks a million
>>> --
>>> Gang Chen
>>> TILSI
>>> Taicang Institute For Life Science Information
>>> Address: A2/162, Renmin South Road, Taicang, 215400, Jiangsu Province, P.R.China
>>> Phone: (+86)512-82782588
>>> 
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>> 
>> --
>> Jay Humphrey                   Ensembl Genomes Web Developer
>> EMBL-EBI                       Tel: +44-(0)1223-492682
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
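
As a footnote to Jay's point in the quoted thread that length is end - start + 1: the sketch below, in Python, shows why SUM(end) - SUM(start) comes out short by exactly one base per exon. The exon coordinates are made up for illustration and are not real Ensembl data; the function names are hypothetical too.

```python
def exon_length(start, end):
    """Length of a feature given 1-based, inclusive start and end."""
    return end - start + 1


def transcript_length(exons):
    """Sum of exon lengths for a list of (start, end) pairs."""
    return sum(exon_length(start, end) for start, end in exons)


# Hypothetical (start, end) pairs; the first one is Jay's [4..8] example.
exons = [(4, 8), (20, 29), (50, 52)]

# What SUM(seq_region_end) - SUM(seq_region_start) computes.
naive = sum(end for _, end in exons) - sum(start for start, _ in exons)
correct = transcript_length(exons)

# The naive figure is short by one base for every exon.
print(naive, correct, correct - naive)
```

In the SQL from the original question, the matching change would be to sum (e.seq_region_end - e.seq_region_start + 1) per exon rather than subtracting the two sums.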

-- 
Amonida Zadissa Ph.D.
Deputy team leader
EnsEMBL Genebuild team
Wellcome Trust Sanger Institute
England
