[ensembl-dev] 5 prime UTR coordinates

Nick Fankhauser lists at nyk.ch
Tue Mar 13 14:37:16 GMT 2012


Hi Bert

thanks a lot for the information! I assumed it should have, because it
looked like a test case. :)

I was able to finish my script now!

#!/usr/bin/perl
# Retrieves five and three prime UTRs for all transcripts of a list of genes
use lib "/opt/ensembl/modules";
use Bio::EnsEMBL::Registry;
Bio::EnsEMBL::Registry->load_registry_from_db(-HOST =>
'ensembldb.ensembl.org',-USER => 'anonymous');
my $dba = Bio::EnsEMBL::Registry->get_DBAdaptor('human', 'core');
my $ta = $dba->get_TranscriptAdaptor();

sub transcript_UTR {
	my $t = shift;
	my $strand = $t->strand();
	my $seq_start = $t->seq_region_start();
	return(-1,-1,-1,-1) if not defined $t->cdna_coding_start();
	my $seq_end = $t->seq_region_end();
	my ($five_prime_coordinate) = $t->cdna2genomic($t->cdna_coding_start(),
$t->cdna_coding_start());
	my ($three_prime_coordinate) = $t->cdna2genomic($t->cdna_coding_end(),
$t->cdna_coding_end());
	my ($five_start, $five_end, $three_start, $three_end) =( (0)x4 );
	if($strand == 1) {
		if($seq_start != $five_prime_coordinate->start()) {
			$five_start = $seq_start;
			$five_end = $five_prime_coordinate->start()-1;
		}
		if($seq_end != $three_prime_coordinate->start()) {
			$three_start = $three_prime_coordinate->start()+1;
			$three_end = $seq_end;
		}
	}
	else {
		if($seq_start != $five_prime_coordinate->start()) {
			$five_start = $seq_end;
			$five_end = $five_prime_coordinate->start()+1;
		}
		if($seq_end != $three_prime_coordinate->start()) {
			$three_start = $three_prime_coordinate->start()-1;
			$three_end = $seq_start;
		}
	}
	return ($five_start, $five_end, $three_start, $three_end);
}

sub gene_UTRs {
	my $gene_symbol=shift;
	my $gene_adaptor = $dba->get_GeneAdaptor();
	my $gene=$gene_adaptor->fetch_by_display_label($gene_symbol);
	return("") if not defined $gene;
	my $chromosome = $gene->slice->seq_region_name();
	my $text="";
	foreach my $t ( @{ $gene->get_all_Transcripts() } ) {
		my $tid	= $t->stable_id();
		my ($five_start, $five_end, $three_start, $three_end) =
transcript_UTR($t);
		next if $five_start==-1;
		$text.="chr$chromosome\t$five_start\t$five_end\t$gene_symbol\t$tid\t5\n";
	
$text.="chr$chromosome\t$three_start\t$three_end\t$gene_symbol\t$tid\t3\n";
	}
	return($text);
}
		

my $fn=shift;	# get filename from first command line argument
my $ofn=$fn.".utr.bed";
open (FILE, $fn);	# open the file
open (OUT, ">$ofn");	# open output file
close OUT;
while (<FILE>) {	# loop though the file
	chomp();
	next if length($_)<2;
	open (OUT, ">>$ofn");	# open output file
	print OUT gene_UTRs($_);
	close OUT;
}


On 13/03/12 12:20, Bert Overduin wrote:
> Hello Nick,
> 
> That is because the third transcript does not have any UTRs annotated:
> 
> http://www.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000177275;r=1:248097071-248098057;t=ENST00000318244
> 
> Cheers,
> Bert
> 
> On Tue, Mar 13, 2012 at 11:12 AM, Nick Fankhauser <lists at nyk.ch> wrote:
>> Hi Andy
>>
>> thanks for your helpful reply!
>>
>> But when I try this example script, it only works on the first two
>> transcripts:
>>
>> 5' UTR 248790491-248790430 | 3' UTR 248789478-248789420
>> 5' UTR 249120832-249119135 | 3' UTR 249106098-249104650
>> 5' UTR 0-0 | 3' UTR 0-0
>>
>> For the last one, it doesn't seem to work. Is this method maybe not
>> compatible with the current version of the API?
>>
>> Yours,
>> Nick
>>
>> On 09/03/12 16:55, Andy Yates wrote:
>>> Hi Nick,
>>>
>>> The method five_prime_utr() returns a Bio::Seq object & is meant to be a way of accessing the sequence data not the coordinates. Take a look at this script hosted at GitHub's gist:
>>>
>>> https://gist.github.com/1237394
>>>
>>> It should give you all the information you need to extract the coordinates you need
>>>
>>> All the best,
>>>
>>> Andy
>>>
>>> Andrew Yates                   Ensembl Core Software Project Leader
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensembl.org/
>>>
>>> On 9 Mar 2012, at 10:51, Nick Fankhauser wrote:
>>>
>>>> Hi!
>>>>
>>>> I'm trying to get the coordinates of the 5' UTRs of all transcripts of a
>>>> given gene. But the "start" method can't be located for the UTR object.
>>>> Is there a different approach that makes it possible to retrieve the
>>>> UTR's coordinates?
>>>>
>>>> This is what I tried:
>>>>
>>>> foreach my $transcript ( @{ $gene->get_all_Transcripts() } ) {
>>>>      my $utr5        = $transcript->five_prime_utr();
>>>>      next if not defined $utr5;
>>>>      my $start5      = $utr5->start();
>>>>      my $end5        = $utr5->end();
>>>> }
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> Nick Fankhauser
>>>> ---
>>>> http://www.ccspmd.ethz.ch/people/fniklaus
>>>>
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
> 
> 
> 




More information about the Dev mailing list