[ensembl-dev] perl API slow script

Wed Sep 18 14:53:01 BST 2019

Hi Nicolas,

I wanted to add a little more onto the discussion here and the relative merits/demerits of using the API. Andrew is quite right in highlighting a number of risks associated with direct database access. Most of our data can be extracted from the alternative methods you've previously described such as GTFs/GFFs, BioMart or REST. Canonical is an attribute we do not expose through these alternatives.

Our API's speed is partly governed by how it is used, Thibaut's suggestion here is a perfect case in point, but also partly how it is implemented. Many methods/objects are lazy loaded as a balance between keeping memory consumption under control and speed* and assumes you want more than just a single attribute from an object. However, API performance is dependent on factors such as distance from the source database server being used (MySQL's protocol can degrade as distance increases), the load of the database in question and the number of SQL queries being sent (a compound issue based on the previous two points). Switching to a single direct SQL query has helped to negate a number of these issues and allowed you to answer your question far faster than using the API alone.

Using direct SQL will result in a component of your analysis workflow becoming tightly coupled to our backend schema, which we do reserve the right to modify. The API aims to maintain the interface and has a robust deprecation policy behind it. So long as you're happy with the possible risks of direct database access then in this situation it appears to be the best solution available. Though I would caution using direct access instead API access in all situations as eventually you will encounter an edge case the API negates/handles that you would have to reimplement.

Cheers,

Andy

* As aside you could change the call to get all genes to @{$slice->get_all_Genes(undef, undef, 1)}. The last boolean flag tells the API to load transcripts in bulk and not on-demand and should offer a speed increase.

------------
Andrew Yates - Genomics Technology Infrastructure Team Leader
The European Bioinformatics Institute (EMBL-EBI)
Wellcome Genome Campus
Hinxton, Cambridge
CB10 1SD, United Kingdom
Tel: +44-(0)1223-492538
Fax: +44-(0)1223-494468
Skype: andy.yates.ebi
http://www.ebi.ac.uk/
http://www.ensembl.org/

> On 18 Sep 2019, at 12:55, Olson, Andrew <olson at cshl.edu> wrote:
> 
> Hi Nicolas,
> I just want to caution you that talking directly to the database can be risky as there are occasionally (but not recently) changes to the schema. Also, there are sometimes relevant biological phenomena encoded in the database such as post transcriptional modifications that are easy to overlook that the perl API knows how to query and process. Most bulk queries can usually be answered through the biomart interface, but after a quick look around, I couldn’t see a way to get the canonical transcripts from there.
> Andrew
> 
> On Sep 18, 2019, at 4:12 AM, Nicolas Thierry-Mieg <Nicolas.Thierry-Mieg at univ-grenoble-alpes.fr<mailto:Nicolas.Thierry-Mieg at univ-grenoble-alpes.fr>> wrote:
> 
> Hello Thibaut, Andrew, and list members.
> 
> thank you both for your guidance!
> 
> Thibaut's fixes speed up the code by at least 4x in my hands... Also, thanks Thibaut for explaining the rationale behind your fixes. Conclusion: if you need to use the perl API, dig into the database schema to optimize the queries, and use slices.
> 
> The real deal though comes from Andrew's suggestion: by directly connecting to the mysql database I get a 10000x speedup... Yes that's 4 orders of magnitude!
> Conclusion: don't use the perl API if you can avoid it. Sad conclusion because I love perl and use it daily, but something seems severely broken in the API.
> 
> For posterity here is the code I used based on Andrew's suggestion, it completes in a few seconds in my hands.
> 
> #!/bin/sh
> DATABASE=homo_sapiens_core_97_38
> echo "USE $DATABASE ; select t.stable_id from transcript t, gene g where t.transcript_id = g.canonical_transcript_id ;" | mysql -u anonymous -h ensembldb.ensembl.org<http://ensembldb.ensembl.org/>
> 
> 
> Best regards,
> Nicolas
> 
> 
> 
> On 09/17/2019 06:12 PM, Thibaut Hourlier wrote:
> Hi Nicolas,
> In the current release there are 248,916 transcript in the human database so the API fetched all of them before processing them. Then the gene knows which transcript is canonical but a transcript doesn’t knows if it’s canonical which means more queries from the API.
> Because of the way the API works it is usually faster to use a slice object to get your gene/transcripts or any other object.
> Unless you are really restricted by memory, I would use a foreach loop instead of the while loop with shift.
> my $slice_adaptor = $reg->get_adaptor(‘human’, ‘core’, ’slice’);
> foreach my $slice (@{$slice_adaptor->fetch_all(’toplevel’)}) {
>  foreach my $gene (@{$slice->get_all_Genes}) {
>    my $transcript = $gene->canonical_transcript;
>    print $transcript->stable_id, “\n”;
>  }
> }
> We are close to a new release so the servers can also be a bit overloaded.
> Thanks
> Thibaut
> On 17 Sep 2019, at 16:15, Olson, Andrew <olson at cshl.edu<mailto:olson at cshl.edu> <mailto:olson at cshl.edu>> wrote:
> 
> Hi Nicolas,
> For bulk operations that are pretty easy, I like to just query the database directly.
> 
> echo "select t.* from transcript t, gene g where t.transcript_id = g.canonical_transcript_id and g.is_current = 1” | mysql … > canonicalTranscripts.txt
> 
> Andrew
> 
> On Sep 17, 2019, at 10:49 AM, Nicolas Thierry-Mieg <Nicolas.Thierry-Mieg at univ-grenoble-alpes.fr<mailto:Nicolas.Thierry-Mieg at univ-grenoble-alpes.fr> <mailto:Nicolas.Thierry-Mieg at univ-grenoble-alpes.fr>> wrote:
> 
> Hi list,
> 
> I want to obtain the list of Ensembl Human "canonical" transcripts.
> As far as I can see this is not available in the GTF or GFF files that can be downloaded from ftp.ensembl.org<http://ftp.ensembl.org/> <https://urldefense.proofpoint.com/v2/url?u=http-3A__ftp.ensembl.org&d=DwIGaQ&c=mkpgQs82XaCKIwNV8b32dmVOmERqJe4bBOtF0CetP9Y&r=ic-pQ08gnhTpvpqfp3_6Uw&m=YzU072eQhrqoU0o6hm_z8tTq9td__gm4LKlP74-gBFQ&s=glthCrkP6MmFR2Y3CJe0B4f7WtTW_CGqbcHag8V6tI0&e= > .
> 
> So, I wrote the following small script that uses the perl API to connect to ensembl. My script works, but it's very slow: it took more than 16 hours, just to obtain 66832 ENST identifiers... I'ld expect it to take seconds or minutes, not hours. I must be doing something very wrong but I can't see it.
> Please help, what is wrong with the code below?
> Or if the issue is permanently saturated ensembl servers, is there some other way I could obtain the ensembl canonical transcripts? I tried using the UCSC Table Browser, but there are discrepancies between their "knownCanonical" table and the ensembl canonical transcripts. I also tried biomart but couldn't find "canonical" anywhere.
> 
> 
> use Bio::EnsEMBL::Registry;
> my $reg = "Bio::EnsEMBL::Registry";
> $reg->load_registry_from_db(
>  -host => 'ensembldb.ensembl.org<http://ensembldb.ensembl.org/> <https://urldefense.proofpoint.com/v2/url?u=http-3A__ensembldb.ensembl.org&d=DwIGaQ&c=mkpgQs82XaCKIwNV8b32dmVOmERqJe4bBOtF0CetP9Y&r=ic-pQ08gnhTpvpqfp3_6Uw&m=YzU072eQhrqoU0o6hm_z8tTq9td__gm4LKlP74-gBFQ&s=IIAnyMwBESTsdqmhR88VQrDLNK7NkHX5HYygZeNDcYU&e= >',
>  -user => 'anonymous',
>  -species => 'homo sapiens'
>  );
> my $transcripts_adaptor = $reg->get_adaptor('human', 'core', 'transcript');
> my $transcripts = $transcripts_adaptor->fetch_all;
> 
> while(my $transcript = shift @{$transcripts}) {
>  ($transcript->is_canonical) || next;
>  print $transcript->stable_id."\n" ;
> }
> 
> 
> Thanks!
> Regards,
> Nicolas
> 
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
> Posting guidelines and subscribe/unsubscribe info: https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.ensembl.org_mailman_listinfo_dev-5Fensembl.org&d=DwIGaQ&c=mkpgQs82XaCKIwNV8b32dmVOmERqJe4bBOtF0CetP9Y&r=ic-pQ08gnhTpvpqfp3_6Uw&m=YzU072eQhrqoU0o6hm_z8tTq9td__gm4LKlP74-gBFQ&s=dEhnoCRjiALyFEzuM8194OCrDLtyEURtGUAvWBljJBw&e= Ensembl Blog: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.ensembl.info_&d=DwIGaQ&c=mkpgQs82XaCKIwNV8b32dmVOmERqJe4bBOtF0CetP9Y&r=ic-pQ08gnhTpvpqfp3_6Uw&m=YzU072eQhrqoU0o6hm_z8tTq9td__gm4LKlP74-gBFQ&s=Ep3XkcBELwpCfB-_pXxXywcP80fZqMEo4vRg-SRuhSY&e=
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
> Ensembl Blog: http://www.ensembl.info/