[ensembl-dev] 5 prime UTR coordinates
Nick Fankhauser
lists at nyk.ch
Tue Mar 13 14:37:16 GMT 2012
Hi Bert
thanks a lot for the information! I assumed it should have, because it
looked like a test case. :)
I was able to finish my script now!
#!/usr/bin/perl
# Retrieves five and three prime UTRs for all transcripts of a list of genes
use lib "/opt/ensembl/modules";
use Bio::EnsEMBL::Registry;
Bio::EnsEMBL::Registry->load_registry_from_db(-HOST =>
'ensembldb.ensembl.org',-USER => 'anonymous');
my $dba = Bio::EnsEMBL::Registry->get_DBAdaptor('human', 'core');
my $ta = $dba->get_TranscriptAdaptor();
sub transcript_UTR {
my $t = shift;
my $strand = $t->strand();
my $seq_start = $t->seq_region_start();
return(-1,-1,-1,-1) if not defined $t->cdna_coding_start();
my $seq_end = $t->seq_region_end();
my ($five_prime_coordinate) = $t->cdna2genomic($t->cdna_coding_start(),
$t->cdna_coding_start());
my ($three_prime_coordinate) = $t->cdna2genomic($t->cdna_coding_end(),
$t->cdna_coding_end());
my ($five_start, $five_end, $three_start, $three_end) =( (0)x4 );
if($strand == 1) {
if($seq_start != $five_prime_coordinate->start()) {
$five_start = $seq_start;
$five_end = $five_prime_coordinate->start()-1;
}
if($seq_end != $three_prime_coordinate->start()) {
$three_start = $three_prime_coordinate->start()+1;
$three_end = $seq_end;
}
}
else {
if($seq_start != $five_prime_coordinate->start()) {
$five_start = $seq_end;
$five_end = $five_prime_coordinate->start()+1;
}
if($seq_end != $three_prime_coordinate->start()) {
$three_start = $three_prime_coordinate->start()-1;
$three_end = $seq_start;
}
}
return ($five_start, $five_end, $three_start, $three_end);
}
sub gene_UTRs {
my $gene_symbol=shift;
my $gene_adaptor = $dba->get_GeneAdaptor();
my $gene=$gene_adaptor->fetch_by_display_label($gene_symbol);
return("") if not defined $gene;
my $chromosome = $gene->slice->seq_region_name();
my $text="";
foreach my $t ( @{ $gene->get_all_Transcripts() } ) {
my $tid = $t->stable_id();
my ($five_start, $five_end, $three_start, $three_end) =
transcript_UTR($t);
next if $five_start==-1;
$text.="chr$chromosome\t$five_start\t$five_end\t$gene_symbol\t$tid\t5\n";
$text.="chr$chromosome\t$three_start\t$three_end\t$gene_symbol\t$tid\t3\n";
}
return($text);
}
my $fn=shift; # get filename from first command line argument
my $ofn=$fn.".utr.bed";
open (FILE, $fn); # open the file
open (OUT, ">$ofn"); # open output file
close OUT;
while (<FILE>) { # loop though the file
chomp();
next if length($_)<2;
open (OUT, ">>$ofn"); # open output file
print OUT gene_UTRs($_);
close OUT;
}
On 13/03/12 12:20, Bert Overduin wrote:
> Hello Nick,
>
> That is because the third transcript does not have any UTRs annotated:
>
> http://www.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000177275;r=1:248097071-248098057;t=ENST00000318244
>
> Cheers,
> Bert
>
> On Tue, Mar 13, 2012 at 11:12 AM, Nick Fankhauser <lists at nyk.ch> wrote:
>> Hi Andy
>>
>> thanks for your helpful reply!
>>
>> But when I try this example script, it only works on the first two
>> transcripts:
>>
>> 5' UTR 248790491-248790430 | 3' UTR 248789478-248789420
>> 5' UTR 249120832-249119135 | 3' UTR 249106098-249104650
>> 5' UTR 0-0 | 3' UTR 0-0
>>
>> For the last one, it doesn't seem to work. Is this method maybe not
>> compatible with the current version of the API?
>>
>> Yours,
>> Nick
>>
>> On 09/03/12 16:55, Andy Yates wrote:
>>> Hi Nick,
>>>
>>> The method five_prime_utr() returns a Bio::Seq object & is meant to be a way of accessing the sequence data not the coordinates. Take a look at this script hosted at GitHub's gist:
>>>
>>> https://gist.github.com/1237394
>>>
>>> It should give you all the information you need to extract the coordinates you need
>>>
>>> All the best,
>>>
>>> Andy
>>>
>>> Andrew Yates Ensembl Core Software Project Leader
>>> EMBL-EBI Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK http://www.ensembl.org/
>>>
>>> On 9 Mar 2012, at 10:51, Nick Fankhauser wrote:
>>>
>>>> Hi!
>>>>
>>>> I'm trying to get the coordinates of the 5' UTRs of all transcripts of a
>>>> given gene. But the "start" method can't be located for the UTR object.
>>>> Is there a different approach that makes it possible to retrieve the
>>>> UTR's coordinates?
>>>>
>>>> This is what I tried:
>>>>
>>>> foreach my $transcript ( @{ $gene->get_all_Transcripts() } ) {
>>>> my $utr5 = $transcript->five_prime_utr();
>>>> next if not defined $utr5;
>>>> my $start5 = $utr5->start();
>>>> my $end5 = $utr5->end();
>>>> }
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> Nick Fankhauser
>>>> ---
>>>> http://www.ccspmd.ethz.ch/people/fniklaus
>>>>
>>>> _______________________________________________
>>>> Dev mailing list Dev at ensembl.org
>>>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list Dev at ensembl.org
>>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>
>> _______________________________________________
>> Dev mailing list Dev at ensembl.org
>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>
>
>
More information about the Dev
mailing list