[ensembl-dev] protein coordinates of domains and exons

Tue Jun 2 11:14:54 BST 2015

Dear Leila,

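File formatting can be tricky, especially if you’re using UTF-8. The first thing you can try is opening the file with an explicit encoding layer, for example open(my $TX, '<:encoding(UTF-8)', $txinput). Note that ‘use utf8;’ only tells Perl that the script’s own source code is UTF-8; it does not change how input files are read. Perl can read UTF-8 natively, as well as the usual Latin sets that are common defaults.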
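As a rough illustration, here is a minimal sketch of reading the ID list defensively. It assumes the trouble comes from a byte-order mark or non-Unix line endings left behind by the spreadsheet export, which this thread does not confirm. If the script slurps the file into an array as in the version posted earlier, the `<$TX> line 1` in the error message would mean Perl saw the whole file as a single line, which would fit carriage-return-only line endings. The helper name `read_ids`, the file name `tx_demo.txt` and the sample IDs are made up for the example; the demo writes its own input file only so that it runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a list of clean IDs from a text file,
# whatever line endings (LF, CRLF or CR) the file uses.
sub read_ids {
    my ($path) = @_;

    # Decode the bytes on disk as UTF-8 while reading.
    open my $fh, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    local $/;    # slurp mode: read the whole file in one go
    my $content = <$fh>;
    close $fh;

    return () unless defined $content;
    $content =~ s/^\x{FEFF}//;    # drop a leading byte-order mark, if any

    my @ids;
    foreach my $line ( split /\r\n|\r|\n/, $content ) {
        $line =~ s/\s+//g;        # remove spaces, tabs and other whitespace
        push @ids, $line if length $line;
    }
    return @ids;
}

# Demo with placeholder IDs: write a file with a BOM and CR-only line
# endings, then read it back.
my $demo = 'tx_demo.txt';
open my $out, '>:raw', $demo or die "Cannot write $demo: $!";
print {$out} "\xEF\xBB\xBFENSMUST00000000001\rENSMUST00000000002 \r";
close $out;

my @ids = read_ids($demo);
print scalar(@ids), " IDs read\n";
print "[$_]\n" for @ids;
unlink $demo;
```

Each cleaned ID can then be passed to fetch_by_stable_id. Since the error above shows that call returning an undefined value for an ID it cannot find, it is worth checking the result is defined before calling methods on it.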

If that doesn’t immediately help, then you should consider examining your file in a different text editor. Some editors hide formatting characters or other details from your sight. Good ones for seeing hidden characters are TextWrangler, Emacs and Vi (in no particular order). There are many more too, but you’ll have to experiment to find what is wrong with your input file.
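If no suitable editor is to hand, hidden characters can also be revealed from Perl itself. The sketch below is one possible way to do that and is not part of the Ensembl API: the helper name `show_hidden` is made up for illustration, and it works on raw bytes, replacing anything outside printable ASCII with a hex escape.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: make invisible bytes visible by replacing everything
# outside printable ASCII (0x20-0x7E) with a \xNN escape.
sub show_hidden {
    my ($bytes) = @_;
    $bytes =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $bytes;
}

# Example input: a UTF-8 byte-order mark, a placeholder ID, and a Windows
# line ending.
print show_hidden("\xEF\xBB\xBFENSMUST00000000001\r\n"), "\n";
# prints \xEF\xBB\xBFENSMUST00000000001\x0D\x0A
```

To inspect a real file, open it with the ':raw' layer and print show_hidden($_) for each line read. A leading \xEF\xBB\xBF is a byte-order mark, and \x0D is a carriage return.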

Regards,

Kieron

Kieron Taylor PhD.
Ensembl Core senior software developer

EMBL, European Bioinformatics Institute

> On 2 Jun 2015, at 11:02, Leila Alieh <alieh.leila at gmail.com> wrote:
> 
> Hi!
> 
> I'm having a stupid problem with my input file. I made a list of transcript IDs in a csv file, opened it, copied the list and pasted it into a plain-text file (UTF-8), one transcript ID per line, without any commas or quotes. The code gives me this error:
> 
> Can't call method "get_all_translateable_Exons" on an undefined value at ./transdom_RR.pl line 53, <$TX> line 1.
> 
> The same code runs on a previous text file of transcript IDs that I used for a trial. If I copy and paste one of the transcript IDs from the "old" file into the "new" one, the code runs until it reaches the first "new" transcript ID. The transcript IDs that I used for the trial are also present in my new csv list, and if I copy and paste them from the csv into a new text file, the code doesn't work. So I think the problem is somehow in the format of the transcript IDs in the Excel file. I tried converting the csv file to xlsx and also changing the cell format to General and to Text, but it didn't work.
> Do you have any suggestions? How should I prepare the txt input file for the Perl code?
> 
> thanks!
> 
> 
> 2015-05-19 16:23 GMT+02:00 Leila Alieh <alieh.leila at gmail.com>:
> Thank you very much!!!
> 
> Magali, your code has been really helpful; I just modified it to read the list of transcript IDs from a text file. Here is the version I'm using, in case you want to check it (it seems to be working fine) or someone else needs it:
> 
> #!/usr/bin/perl
> 
> use strict;
> use warnings;
> use Bio::EnsEMBL::Registry;
> 
> my $registry = 'Bio::EnsEMBL::Registry' ;
> 
> $registry->load_registry_from_db(
> -host => 'ensembldb.ensembl.org' ,
> -user => 'anonymous' ,
> -port => '3306'
> );
> 
> my $transcript_adaptor = $registry->get_adaptor( 'mouse', 'core', 'Transcript');
> my $txinput= 'tx_test.txt' ;
> open my $TX, '<', $txinput or die "Cannot open $txinput: $!";
> my @data= <$TX> ;
> foreach my $line(@data) 
> 
> {
> $line=~s/ //g;
> $line=~s/\t//g;
> $line=~s/\n//g;
> 
> my $transcript = $transcript_adaptor->fetch_by_stable_id($line);
> 
> my $exons = $transcript->get_all_translateable_Exons();
> foreach my $exon (@$exons) {
>   print "Transcript " . $transcript->stable_id . "\t" ."Exon " . $exon->stable_id . ":" . $exon->start . "-" . $exon->end. "\t";
>   my @pep_coords = $transcript->genomic2pep($exon->start, $exon->end, $exon->strand);
>   foreach my $pep (@pep_coords) {
>   
>     print $pep->start() . "-" . $pep->end() . "\n";
>   }
> }
> my $translation = $transcript->translation;
> 
> if ($translation) {
>   my $pfs = $translation->get_all_ProteinFeatures();
>  
>   foreach my $pf (@$pfs) {
>     print "Transcript " . $transcript->stable_id ."\t" . "Domain ". $pf->hseqname . ":" .  $pf->start . "-" . $pf->end . "\n";
>   }
> }
> }
> close $TX;
> 
> Thanks again!
> 
> 2015-05-18 16:01 GMT+02:00 mag <mr6 at ebi.ac.uk>:
> Hi Leila,
> 
> For a given transcript, you can access all its exons and its translation (when available) with related protein features.
> 
> This snippet of code shows how you can display protein coordinates for all exons and protein domains for the related translation, starting from a given transcript:
> 
> use Bio::EnsEMBL::Registry;
> 
> my $registry = 'Bio::EnsEMBL::Registry';
> $registry->load_registry_from_db(
> -host => 'ensembldb.ensembl.org',
> -user => 'anonymous',
> -port => '3306'
> );
> 
> my $transcript_adaptor = $registry->get_adaptor('human', 'core', 'Transcript');
> my $stable_id = 'ENST00000380152';
> my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id);
> 
> # Only get exons within the coding region
> my $exons = $transcript->get_all_translateable_Exons();
> foreach my $exon (@$exons) {
>   # Print the genomic coordinates for each exon
>   print "Exon " . $exon->stable_id . ":" . $exon->start . "-" . $exon->end. "\t";
>   my @pep_coords = $transcript->genomic2pep($exon->start, $exon->end, $exon->strand);
>   foreach my $pep (@pep_coords) {
>     # Print the protein coordinates for each exon
>     print $pep->start() . "-" . $pep->end() . "\n";
>   }
> }
> 
> my $translation = $transcript->translation;
> # Check if there is a translation
> if ($translation) {
>   my $pfs = $translation->get_all_ProteinFeatures();
>   # Display all protein features
>   foreach my $pf (@$pfs) {
>     print $pf->hseqname . ":" .  $pf->start . "-" . $pf->end . "\n";
>   }
> }
> 
> 
> If you only have exon coordinates to start with, you will need to create a slice for each set of coordinates, then retrieve transcripts overlapping that slice and use the process described above.
> 
> my $slice_adaptor = $registry->get_adaptor('human', 'core', 'Slice');
> my $slice = $slice_adaptor->fetch_by_region('chromosome', $chromosome, $exon_start, $exon_end);
> my $transcripts = $slice->get_all_Transcripts();
> 
> 
> Hope that helps,
> Magali
> 
> 
> On 16/05/2015 00:32, Leila Alieh wrote:
>> Hi all!
>> 
>> I have a list of genomic coordinates of exons and I want to transform them into protein coordinates of the different protein isoforms these exons belong to. Moreover, I want to find the protein coordinates of the domains of these proteins, and then overlap the two sets of information to find exons which encode protein domains. From what I read, the (only?) way to do so is to use the Ensembl Perl API, and in particular I should use TranscriptMapper and ProteinFeature, right? I read the tutorial and the documentation, but I still find it very difficult to understand the API and I don't know how to write the code so that the query is restricted to my list of exons/proteins. Could you please show me some examples? In particular I'd like to know what Greg did to find the protein coordinates of the protein domains (http://lists.ensembl.org/pipermail/dev/2015-April/011013.html).
>> 
>> Thank you in advance, and I apologize if I made a mistake in the thread; it's the first time that I'm using the Ensembl mailing list.
>> 
>> P.S. Please, please, please, make the protein coordinates accessible in Ensembl gene mart as soon as possible, it would save a lot of work/time
>> 
>> Thanks again!
>> 
>> 
>> _______________________________________________
>> Dev mailing list    
>> Dev at ensembl.org
>> 
>> Posting guidelines and subscribe/unsubscribe info: 
>> http://lists.ensembl.org/mailman/listinfo/dev
>> 
>> Ensembl Blog: 
>> http://www.ensembl.info/