[ensembl-dev] Strange behavior on querying External transcript Identifiers
Andy Yates
ayates at ebi.ac.uk
Thu Jun 7 16:32:31 BST 2012
Hi Duarte,
I'm more surprised at the hanging nature of your script than anything else. We are currently aware of an issue in the API with refseq identifiers where their _ in the 3rd position is seen as a wild card and the API ends up performing some quite lengthy & costly searches. In some cases we've see fetch_all_by_external_name() take 0.5 seconds to complete when the speed of the method should be around 50ms.
You can test this theory out by doing a regex before querying with refseq ids e.g.
my $sub_id = $query_transcript;
$sub_id =~ s/_/\\_/;
$transcript_adaptor->fetch_all_by_external_name($sub_id);
As for the tables involved we consult the following (along with the core tables):
* xref
* object_xref
* external_synonym
Please make sure all are populated
Andy
Andrew Yates Ensembl Core Software Project Leader
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensembl.org/
On 6 Jun 2012, at 16:53, Duarte Molha wrote:
> Dear Developers
>
> I am having an issue running a query on my internal ensembl database.
> I have a script that retrieves all the exons and introns for a given transcript ID.
>
> When I query the remote ensembl servers I get back the expected data, however if I query my internal database the script just hangs and retrieves nothing.
> The script runs fine for ENSEMBL transcript Ids but fails on NM ids.
> Can you tell me if we need to download any additional tables in order to support external Ids?
>
> In this case, if I do :
> getIntronsOrExonsByTranscript.pl -i -e --remote NM_000016
>
> I get back:
> NM_000016 chr1 76190036 76190502 ENST00000370841 ENSE00001844831 0 (+1)
> NM_000016 chr1 76194086 76194173 ENST00000370841 ENSE00000830661 0 (+1)
> NM_000016 chr1 76198329 76198426 ENST00000370841 ENSE00000830662 0 (+1)
> NM_000016 chr1 76198538 76198607 ENST00000370841 ENSE00000830663 0 (+1)
> NM_000016 chr1 76199213 76199313 ENST00000370841 ENSE00000830664 0 (+1)
> NM_000016 chr1 76200476 76200556 ENST00000370841 ENSE00000830665 0 (+1)
> NM_000016 chr1 76205665 76205795 ENST00000370841 ENSE00000830666 0 (+1)
> NM_000016 chr1 76211491 76211599 ENST00000370841 ENSE00000830671 0 (+1)
> NM_000016 chr1 76215104 76215244 ENST00000370841 ENSE00000931972 0 (+1)
> NM_000016 chr1 76216136 76216231 ENST00000370841 ENSE00000931973 0 (+1)
> NM_000016 chr1 76226807 76227055 ENST00000370841 ENSE00000931974 0 (+1)
> NM_000016 chr1 76228377 76229364 ENST00000370841 ENSE00001513957 0 (+1)
> NM_000016 chr1 76190503 76194085 ENST00000370841 ENST00000370841-intron_1 0 (+1)
> NM_000016 chr1 76194174 76198328 ENST00000370841 ENST00000370841-intron_2 0 (+1)
> NM_000016 chr1 76198427 76198537 ENST00000370841 ENST00000370841-intron_3 0 (+1)
> NM_000016 chr1 76198608 76199212 ENST00000370841 ENST00000370841-intron_4 0 (+1)
> NM_000016 chr1 76199314 76200475 ENST00000370841 ENST00000370841-intron_5 0 (+1)
> NM_000016 chr1 76200557 76205664 ENST00000370841 ENST00000370841-intron_6 0 (+1)
> NM_000016 chr1 76205796 76211490 ENST00000370841 ENST00000370841-intron_7 0 (+1)
> NM_000016 chr1 76211600 76215103 ENST00000370841 ENST00000370841-intron_8 0 (+1)
> NM_000016 chr1 76215245 76216135 ENST00000370841 ENST00000370841-intron_9 0 (+1)
> NM_000016 chr1 76216232 76226806 ENST00000370841 ENST00000370841-intron_10 0 (+1)
> NM_000016 chr1 76227056 76228376 ENST00000370841 ENST00000370841-intron_11 0 (+1)
>
> However when querying our internal database the script hangs in line 121 (see script bellow)
>
> Can you tell me what I am doing wrong? I am very confused why this works remotely but not on my local database.
>
> Here is the script I’ve been using:
>
> #!/usr/bin/perl -w
>
> use lib 'v67_ensembl_api/ensembl/modules/';
>
> use strict;
> use warnings;
> use Term::ProgressBar;
> use Getopt::Long;
> use Bio::EnsEMBL::Registry;
>
> my $transcript_list = "";
> my $help = "";
> my $remote;
> my $db_version;
> my $get_introns;
> my $get_exons = 1;
>
> GetOptions(
> 'remote' => \$remote,
> 'i|introns' => \$get_introns,
> 'e|exons!' => \$get_exons,
> 'f|file:s' => \$transcript_list,
> 'db|db_version:i' => \$db_version,
> 'h|?|help' => \$help
> );
>
> my $transcript_of_interest = shift;
>
> if ( $help || ( !($transcript_of_interest) && !($transcript_list) ) ) {
> die "Usage: $0 <Options> [TrancriptName/ID]
> --remote Query remote database instead of local
> --db or --db_version Use this to specify an alternative database to query defaults to the same database as the api being used
> -i or --introns => get all exons for transcript (default on)
> -e or --exons => get all introns for transcript (default off)
> Note: both last 2 options and negatable... no if you want to exclude exons or introns you can specify --noexons or --nointrons;
> -f or --file File with a list of transcripts - 1 per line
> -h or --help or -? This message\n";
> }
>
> my $registry = 'Bio::EnsEMBL::Registry';
> unless ($db_version){
> $db_version = $registry->software_version;
> }
>
> unless ($remote) {
> $registry->load_registry_from_db(
> -db_version => $db_version,
> -host => 'internal_host',
> -user => 'ensembl',
> -pass => 'password',
> -port => 3306,
> -verbose=> 1,
> );
> my $core_mca = $registry->get_adaptor("Homo sapiens", 'core' , 'metacontainer');
> if ($core_mca) {
> print STDERR "Connected to core version ", $core_mca->get_schema_version, " database";
> }else{
> print STDERR "Failed to connect to local database -- connecting to remote instead\n";
> $registry->load_registry_from_db(
> -db_version => $db_version,
> -host => 'ensembldb.ensembl.org',
> -user => 'anonymous',
> #-verbose=> 1,
> );
> unless ($core_mca) {
> die "Could not connect to core database either locally or remotely";
> }
> }
> }
> else {
> $registry->load_registry_from_db(
> -host => 'ensembldb.ensembl.org',
> -user => 'anonymous',
> #-verbose=> 1,
> );
> my $core_mca = $registry->get_adaptor("Homo sapiens", 'core' , 'metacontainer');
> if ($core_mca) {
> print STDERR "Connected to core version ", $core_mca->get_schema_version, " database";
> }else{
> die "Could not connect to core database";
> }
> }
>
> my $transcript_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Transcript' );
> my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
>
> my @transcripts_of_interest = ();
> if ($transcript_list) {
> open( my $list,"<", $transcript_list ) || die "Could not open $transcript_list file for reading $!\n";
> @transcripts_of_interest = <$list>;
> close($list);
> }
> else {
> push( @transcripts_of_interest, $transcript_of_interest );
> }
>
> my $analysis_name;
> if ($get_introns && $get_exons){
> $analysis_name = "Gathering Exons and Introns" ;
> }elsif($get_introns){
> $analysis_name = "Gathering Introns";
> }elsif($get_exons){
> $analysis_name = "Gathering Exons";
> }
>
> my $number_of_entries = scalar @transcripts_of_interest;
> my $progress = Term::ProgressBar->new ({ count => $number_of_entries ,
> name => $analysis_name,
> ETA => 'linear',
> });
> $progress->max_update_rate(1);
> my $next_update = 0;
> my $j = 0;
> foreach my $query_transcript (@transcripts_of_interest) {
> chomp $query_transcript;
> my $transcript = "";
> if ($query_transcript =~ /ENST/i){
> $transcript = $transcript_adaptor->fetch_by_stable_id("$query_transcript");
> }
> else{
> ($transcript) = @{ $transcript_adaptor->fetch_all_by_external_name("$query_transcript") };
> }
>
> if ($transcript){
> fetch_data($get_exons, $get_introns, $query_transcript, $transcript);
> }
> else{
> #$progress->message("Query: $query_transcript failed >> using Fetch all method to retrieve gene associated with this transcript");
> my ($gene) = @{ $gene_adaptor->fetch_all_by_external_name("$query_transcript") };
> unless ($gene){
> $progress->message("Query: $query_transcript failed!");
> next;
> }
> my @transcripts = @{ $gene->get_all_Transcripts() };
> fetch_data($get_exons, $get_introns, $query_transcript, @transcripts);
> }
> $next_update = $progress->update() if (++$j > $next_update);
>
> }
> $progress->update($number_of_entries) if ($number_of_entries >= $next_update);
>
>
> sub feature2string
> {
> my $feature = shift;
> my $trans_id = shift;
> my $i = shift;
>
> my $seq_region = $feature->slice->seq_region_name();
> my $start = $feature->start();
> my $end = $feature->end();
> my $strand = $feature->strand();
> my $stable_id;
> if ($i){
> $stable_id = $trans_id."-intron_".$i;
> }else{
> $stable_id = $feature->stable_id();
> }
>
> return sprintf( "%s\t%d\t%d\t%s\t%s\t%d\t(%+d)", "chr".$seq_region, $start, $end, $trans_id, $stable_id, "0", $strand );
> }
>
> sub fetch_data{
> my $get_exons = shift;
> my $get_introns = shift;
> my $query_transcript = shift;
> my @transcripts = @_;
> foreach my $transcript (@transcripts){
> my $i = 0;
> if ($get_exons){
> foreach my $exon ( @{ $transcript->get_all_Exons() } ) {
> my $estring = feature2string($exon, $transcript->stable_id);
> print $query_transcript."\t".$estring."\n";
> }
> }
> if ($get_introns){
> foreach my $intron ( @{ $transcript->get_all_Introns() } ) {
> my $estring = feature2string($intron, $transcript->stable_id, ++$i);
> print $query_transcript."\t".$estring."\n";
> }
> }
> }
> }
>
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
More information about the Dev
mailing list