[ensembl-dev] perl API slow script

Wed Sep 18 11:55:37 BST 2019

Hi Nicolas,
I just want to caution you that talking directly to the database can be risky as there are occasionally (but not recently) changes to the schema. Also, there are sometimes relevant biological phenomena encoded in the database such as post transcriptional modifications that are easy to overlook that the perl API knows how to query and process. Most bulk queries can usually be answered through the biomart interface, but after a quick look around, I couldn’t see a way to get the canonical transcripts from there.
Andrew

On Sep 18, 2019, at 4:12 AM, Nicolas Thierry-Mieg <Nicolas.Thierry-Mieg at univ-grenoble-alpes.fr<mailto:Nicolas.Thierry-Mieg at univ-grenoble-alpes.fr>> wrote:

Hello Thibaut, Andrew, and list members.

thank you both for your guidance!

Thibaut's fixes speed up the code by at least 4x in my hands... Also, thanks Thibaut for explaining the rationale behind your fixes. Conclusion: if you need to use the perl API, dig into the database schema to optimize the queries, and use slices.

The real deal though comes from Andrew's suggestion: by directly connecting to the mysql database I get a 10000x speedup... Yes that's 4 orders of magnitude!
Conclusion: don't use the perl API if you can avoid it. Sad conclusion because I love perl and use it daily, but something seems severely broken in the API.

For posterity here is the code I used based on Andrew's suggestion, it completes in a few seconds in my hands.

#!/bin/sh
DATABASE=homo_sapiens_core_97_38
echo "USE $DATABASE ; select t.stable_id from transcript t, gene g where t.transcript_id = g.canonical_transcript_id ;" | mysql -u anonymous -h ensembldb.ensembl.org<http://ensembldb.ensembl.org/>

Best regards,
Nicolas

On 09/17/2019 06:12 PM, Thibaut Hourlier wrote:
Hi Nicolas,
In the current release there are 248,916 transcript in the human database so the API fetched all of them before processing them. Then the gene knows which transcript is canonical but a transcript doesn’t knows if it’s canonical which means more queries from the API.
Because of the way the API works it is usually faster to use a slice object to get your gene/transcripts or any other object.
Unless you are really restricted by memory, I would use a foreach loop instead of the while loop with shift.
my $slice_adaptor = $reg->get_adaptor(‘human’, ‘core’, ’slice’);
foreach my $slice (@{$slice_adaptor->fetch_all(’toplevel’)}) {
  foreach my $gene (@{$slice->get_all_Genes}) {
    my $transcript = $gene->canonical_transcript;
    print $transcript->stable_id, “\n”;
  }
}
We are close to a new release so the servers can also be a bit overloaded.
Thanks
Thibaut
On 17 Sep 2019, at 16:15, Olson, Andrew <olson at cshl.edu<mailto:olson at cshl.edu> <mailto:olson at cshl.edu>> wrote:

Hi Nicolas,
For bulk operations that are pretty easy, I like to just query the database directly.

echo "select t.* from transcript t, gene g where t.transcript_id = g.canonical_transcript_id and g.is_current = 1” | mysql … > canonicalTranscripts.txt

Andrew

On Sep 17, 2019, at 10:49 AM, Nicolas Thierry-Mieg <Nicolas.Thierry-Mieg at univ-grenoble-alpes.fr<mailto:Nicolas.Thierry-Mieg at univ-grenoble-alpes.fr> <mailto:Nicolas.Thierry-Mieg at univ-grenoble-alpes.fr>> wrote:

Hi list,

I want to obtain the list of Ensembl Human "canonical" transcripts.
As far as I can see this is not available in the GTF or GFF files that can be downloaded from ftp.ensembl.org<http://ftp.ensembl.org/> <https://urldefense.proofpoint.com/v2/url?u=http-3A__ftp.ensembl.org&d=DwIGaQ&c=mkpgQs82XaCKIwNV8b32dmVOmERqJe4bBOtF0CetP9Y&r=ic-pQ08gnhTpvpqfp3_6Uw&m=YzU072eQhrqoU0o6hm_z8tTq9td__gm4LKlP74-gBFQ&s=glthCrkP6MmFR2Y3CJe0B4f7WtTW_CGqbcHag8V6tI0&e= > .

So, I wrote the following small script that uses the perl API to connect to ensembl. My script works, but it's very slow: it took more than 16 hours, just to obtain 66832 ENST identifiers... I'ld expect it to take seconds or minutes, not hours. I must be doing something very wrong but I can't see it.
Please help, what is wrong with the code below?
Or if the issue is permanently saturated ensembl servers, is there some other way I could obtain the ensembl canonical transcripts? I tried using the UCSC Table Browser, but there are discrepancies between their "knownCanonical" table and the ensembl canonical transcripts. I also tried biomart but couldn't find "canonical" anywhere.

use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
  -host => 'ensembldb.ensembl.org<http://ensembldb.ensembl.org/> <https://urldefense.proofpoint.com/v2/url?u=http-3A__ensembldb.ensembl.org&d=DwIGaQ&c=mkpgQs82XaCKIwNV8b32dmVOmERqJe4bBOtF0CetP9Y&r=ic-pQ08gnhTpvpqfp3_6Uw&m=YzU072eQhrqoU0o6hm_z8tTq9td__gm4LKlP74-gBFQ&s=IIAnyMwBESTsdqmhR88VQrDLNK7NkHX5HYygZeNDcYU&e= >',
  -user => 'anonymous',
  -species => 'homo sapiens'
  );
my $transcripts_adaptor = $reg->get_adaptor('human', 'core', 'transcript');
my $transcripts = $transcripts_adaptor->fetch_all;

while(my $transcript = shift @{$transcripts}) {
  ($transcript->is_canonical) || next;
  print $transcript->stable_id."\n" ;
}

Thanks!
Regards,
Nicolas

_______________________________________________
Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.ensembl.org_mailman_listinfo_dev-5Fensembl.org&d=DwIGaQ&c=mkpgQs82XaCKIwNV8b32dmVOmERqJe4bBOtF0CetP9Y&r=ic-pQ08gnhTpvpqfp3_6Uw&m=YzU072eQhrqoU0o6hm_z8tTq9td__gm4LKlP74-gBFQ&s=dEhnoCRjiALyFEzuM8194OCrDLtyEURtGUAvWBljJBw&e= Ensembl Blog: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.ensembl.info_&d=DwIGaQ&c=mkpgQs82XaCKIwNV8b32dmVOmERqJe4bBOtF0CetP9Y&r=ic-pQ08gnhTpvpqfp3_6Uw&m=YzU072eQhrqoU0o6hm_z8tTq9td__gm4LKlP74-gBFQ&s=Ep3XkcBELwpCfB-_pXxXywcP80fZqMEo4vRg-SRuhSY&e=