[ensembl-dev] Strange behavior on querying External transcript Identifiers

Andy Yates ayates at ebi.ac.uk
Thu Jun 7 16:32:31 BST 2012


Hi Duarte,

I'm more surprised at the hanging nature of your script than anything else. We are currently aware of an issue in the API with refseq identifiers where their _ in the 3rd position is seen as a wild card and the API ends up performing some quite lengthy & costly searches. In some cases we've see fetch_all_by_external_name() take 0.5 seconds to complete when the speed of the method should be around 50ms.

You can test this theory out by doing a regex before querying with refseq ids e.g.

my $sub_id = $query_transcript;
$sub_id =~ s/_/\\_/;
$transcript_adaptor->fetch_all_by_external_name($sub_id);

As for the tables involved we consult the following (along with the core tables):

* xref
* object_xref
* external_synonym

Please make sure all are populated

Andy

Andrew Yates                   Ensembl Core Software Project Leader
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensembl.org/

On 6 Jun 2012, at 16:53, Duarte Molha wrote:

> Dear Developers
>  
> I am having an issue running a query on my internal ensembl database.
> I have a script that retrieves all the exons and introns for a given transcript ID.
>  
> When I query the remote ensembl servers I get back the expected data, however if I query my internal database the script just hangs and retrieves nothing.
> The script runs fine for ENSEMBL transcript Ids but fails on NM ids.
> Can you tell me if we need to download any additional tables in order to support external Ids?
>  
> In this case, if I do :
> getIntronsOrExonsByTranscript.pl -i -e --remote NM_000016
>  
> I get back:
> NM_000016       chr1    76190036        76190502        ENST00000370841 ENSE00001844831 0       (+1)
> NM_000016       chr1    76194086        76194173        ENST00000370841 ENSE00000830661 0       (+1)
> NM_000016       chr1    76198329        76198426        ENST00000370841 ENSE00000830662 0       (+1)
> NM_000016       chr1    76198538        76198607        ENST00000370841 ENSE00000830663 0       (+1)
> NM_000016       chr1    76199213        76199313        ENST00000370841 ENSE00000830664 0       (+1)
> NM_000016       chr1    76200476        76200556        ENST00000370841 ENSE00000830665 0       (+1)
> NM_000016       chr1    76205665        76205795        ENST00000370841 ENSE00000830666 0       (+1)
> NM_000016       chr1    76211491        76211599        ENST00000370841 ENSE00000830671 0       (+1)
> NM_000016       chr1    76215104        76215244        ENST00000370841 ENSE00000931972 0       (+1)
> NM_000016       chr1    76216136        76216231        ENST00000370841 ENSE00000931973 0       (+1)
> NM_000016       chr1    76226807        76227055        ENST00000370841 ENSE00000931974 0       (+1)
> NM_000016       chr1    76228377        76229364        ENST00000370841 ENSE00001513957 0       (+1)
> NM_000016       chr1    76190503        76194085        ENST00000370841 ENST00000370841-intron_1        0       (+1)
> NM_000016       chr1    76194174        76198328        ENST00000370841 ENST00000370841-intron_2        0       (+1)
> NM_000016       chr1    76198427        76198537        ENST00000370841 ENST00000370841-intron_3        0       (+1)
> NM_000016       chr1    76198608        76199212        ENST00000370841 ENST00000370841-intron_4        0       (+1)
> NM_000016       chr1    76199314        76200475        ENST00000370841 ENST00000370841-intron_5        0       (+1)
> NM_000016       chr1    76200557        76205664        ENST00000370841 ENST00000370841-intron_6        0       (+1)
> NM_000016       chr1    76205796        76211490        ENST00000370841 ENST00000370841-intron_7        0       (+1)
> NM_000016       chr1    76211600        76215103        ENST00000370841 ENST00000370841-intron_8        0       (+1)
> NM_000016       chr1    76215245        76216135        ENST00000370841 ENST00000370841-intron_9        0       (+1)
> NM_000016       chr1    76216232        76226806        ENST00000370841 ENST00000370841-intron_10       0       (+1)
> NM_000016       chr1    76227056        76228376        ENST00000370841 ENST00000370841-intron_11       0       (+1)
>  
> However when querying our internal database the script hangs in  line 121 (see script bellow)
>  
> Can you tell me what I am doing wrong? I am very confused why this  works remotely but not on my local database.
>  
> Here is the script I’ve been using:
>  
> #!/usr/bin/perl -w
>  
> use lib 'v67_ensembl_api/ensembl/modules/';
>  
> use strict;
> use warnings;
> use Term::ProgressBar;
> use Getopt::Long;
> use Bio::EnsEMBL::Registry;
>  
> my $transcript_list   = "";
> my $help              = "";
> my $remote;
> my $db_version;
> my $get_introns;
> my $get_exons = 1;
>  
> GetOptions(
>     'remote'           => \$remote,
>     'i|introns'        => \$get_introns,
>     'e|exons!'          => \$get_exons,
>     'f|file:s'         => \$transcript_list,
>     'db|db_version:i'   => \$db_version,
>     'h|?|help'         => \$help
> );
>  
> my $transcript_of_interest = shift;
>  
> if ( $help || ( !($transcript_of_interest) && !($transcript_list) ) ) {
>     die "Usage: $0 <Options> [TrancriptName/ID]
>         --remote        Query remote database instead of local
>         --db or --db_version    Use this to specify an alternative database to query defaults to the same database as the api being used
>         -i or --introns        => get all exons for transcript (default on)
>         -e or --exons          => get all introns for transcript (default off)
>         Note: both last 2 options and negatable... no if you want to exclude exons or introns you can specify --noexons or --nointrons;
>         -f or --file        File with a list of transcripts - 1 per line
>         -h or --help or -?  This message\n";
> }
>  
> my $registry = 'Bio::EnsEMBL::Registry';
> unless ($db_version){
>     $db_version = $registry->software_version;
> }
>  
> unless ($remote) {
>     $registry->load_registry_from_db(
>         -db_version => $db_version,
>         -host => 'internal_host',
>         -user => 'ensembl',
>         -pass => 'password',
>         -port => 3306,
>         -verbose=> 1,
>     );
>     my $core_mca      = $registry->get_adaptor("Homo sapiens", 'core'            , 'metacontainer');
>     if ($core_mca) {
>         print STDERR "Connected to core version ", $core_mca->get_schema_version, " database";
>     }else{
>         print STDERR "Failed to connect to local database -- connecting to remote instead\n";
>         $registry->load_registry_from_db(
>             -db_version => $db_version,
>             -host => 'ensembldb.ensembl.org',
>             -user => 'anonymous',
>             #-verbose=> 1,
>         );
>         unless ($core_mca) {
>                 die "Could not connect to core database either locally or remotely";
>                 }
>     }
> }
> else {
>     $registry->load_registry_from_db(
>         -host => 'ensembldb.ensembl.org',
>         -user => 'anonymous',
>         #-verbose=> 1,
>     );
>     my $core_mca      = $registry->get_adaptor("Homo sapiens", 'core' , 'metacontainer');
>     if ($core_mca) {
>         print STDERR "Connected to core version ", $core_mca->get_schema_version, " database";
>     }else{
>         die "Could not connect to core database";
>     }
> }
>  
> my $transcript_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Transcript' );
> my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
>  
> my @transcripts_of_interest = ();
> if ($transcript_list) {
>     open( my $list,"<", $transcript_list ) || die "Could not open $transcript_list file for reading $!\n";
>     @transcripts_of_interest = <$list>;
>     close($list);
> }
> else {
>     push( @transcripts_of_interest, $transcript_of_interest );
> }
>  
> my $analysis_name;
> if ($get_introns && $get_exons){
>     $analysis_name = "Gathering Exons and Introns" ;
> }elsif($get_introns){
>     $analysis_name = "Gathering Introns";
> }elsif($get_exons){
>     $analysis_name = "Gathering Exons";
> }
>  
> my $number_of_entries = scalar @transcripts_of_interest;
> my $progress = Term::ProgressBar->new ({    count => $number_of_entries ,
>                                             name => $analysis_name,
>                                             ETA => 'linear',
>                                     });
> $progress->max_update_rate(1);
> my $next_update = 0;
> my $j = 0;
> foreach my $query_transcript (@transcripts_of_interest) {
>     chomp $query_transcript;
>     my $transcript = "";
>     if ($query_transcript =~ /ENST/i){
>         $transcript =   $transcript_adaptor->fetch_by_stable_id("$query_transcript");
>     }
>     else{
>         ($transcript) = @{ $transcript_adaptor->fetch_all_by_external_name("$query_transcript") };
>     }
>    
>     if ($transcript){
>         fetch_data($get_exons, $get_introns, $query_transcript, $transcript);
>     }
>     else{
>         #$progress->message("Query: $query_transcript failed >> using Fetch all method to retrieve gene associated with this transcript");
>         my ($gene) = @{ $gene_adaptor->fetch_all_by_external_name("$query_transcript") };
>         unless ($gene){
>             $progress->message("Query: $query_transcript failed!");
>             next;
>         }
>         my @transcripts = @{ $gene->get_all_Transcripts() };
>         fetch_data($get_exons, $get_introns, $query_transcript, @transcripts);
>     }
>     $next_update = $progress->update() if (++$j > $next_update);
>    
> }
> $progress->update($number_of_entries) if ($number_of_entries >= $next_update);
>  
>  
> sub feature2string
> {
>     my $feature = shift;
>     my $trans_id = shift;
>     my $i = shift;
>  
>     my $seq_region = $feature->slice->seq_region_name();
>     my $start      = $feature->start();
>     my $end        = $feature->end();
>     my $strand     = $feature->strand();
>     my $stable_id;
>     if ($i){
>         $stable_id = $trans_id."-intron_".$i;
>     }else{
>         $stable_id = $feature->stable_id();
>     }
>  
>     return sprintf( "%s\t%d\t%d\t%s\t%s\t%d\t(%+d)", "chr".$seq_region, $start, $end, $trans_id, $stable_id, "0", $strand );
> }
>  
> sub fetch_data{
>     my $get_exons = shift;
>     my $get_introns = shift;
>     my $query_transcript = shift;
>     my @transcripts = @_;
>     foreach my $transcript (@transcripts){
>         my $i = 0;
>         if ($get_exons){
>             foreach my $exon ( @{ $transcript->get_all_Exons() } ) {
>                 my $estring = feature2string($exon, $transcript->stable_id);
>                 print $query_transcript."\t".$estring."\n";
>             }
>         }
>         if ($get_introns){
>             foreach my $intron ( @{ $transcript->get_all_Introns() } ) {
>                 my $estring = feature2string($intron, $transcript->stable_id, ++$i);
>                 print $query_transcript."\t".$estring."\n";
>             }
>         }
>     }
> }
>  
>  
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/





More information about the Dev mailing list