[ensembl-dev] perl API slow script

Wed Sep 18 09:12:31 BST 2019

Hello Thibaut, Andrew, and list members.

Thank you both for your guidance!

Thibaut's fixes speed up the code by at least 4x in my hands... Also, 
thanks Thibaut for explaining the rationale behind your fixes. 
Conclusion: if you need to use the perl API, dig into the database 
schema to optimize the queries, and use slices.

The real deal though comes from Andrew's suggestion: by directly 
connecting to the mysql database I get a 10000x speedup... Yes that's 4 
orders of magnitude!
Conclusion: don't use the perl API if you can avoid it. Sad conclusion 
because I love perl and use it daily, but something seems severely 
broken in the API.

For posterity, here is the code I used, based on Andrew's suggestion. It 
completes in a few seconds in my hands.

#!/bin/sh
DATABASE=homo_sapiens_core_97_38
echo "USE $DATABASE ; select t.stable_id from transcript t, gene g
where t.transcript_id = g.canonical_transcript_id ;" \
  | mysql -u anonymous -h ensembldb.ensembl.org
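
A slightly more defensive variant of the script above, as a sketch only. It 
adds the g.is_current = 1 filter from Andrew's original query and passes 
--skip-column-names, because mysql prints the column name (stable_id) as a 
first line when its output is piped. The function name canonical_query and 
the RUN_QUERY switch are invented for this example; with RUN_QUERY unset the 
script only defines the query and never contacts the server.

```shell
#!/bin/sh
# Sketch only: canonical_query and RUN_QUERY are names invented for this example.
DATABASE=${DATABASE:-homo_sapiens_core_97_38}

# Print the SQL on stdout so it can be inspected without touching the server.
canonical_query() {
  cat <<'SQL'
SELECT t.stable_id
FROM transcript t
JOIN gene g ON t.transcript_id = g.canonical_transcript_id
WHERE g.is_current = 1;
SQL
}

# Contact the public server only when explicitly requested.
if [ "${RUN_QUERY:-0}" = "1" ]; then
  canonical_query | mysql --batch --skip-column-names \
    -u anonymous -h ensembldb.ensembl.org "$DATABASE"
fi
```

Saved under a hypothetical name such as canonical.sh, it would be run as: 
RUN_QUERY=1 sh canonical.sh > canonicalTranscripts.txt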

Best regards,
Nicolas

On 09/17/2019 06:12 PM, Thibaut Hourlier wrote:
> Hi Nicolas,
> In the current release there are 248,916 transcripts in the human 
> database, so the API fetched all of them before processing them. In 
> addition, the gene knows which transcript is canonical, but a transcript 
> doesn’t know whether it is canonical, which means more queries from the API.
> 
> Because of the way the API works it is usually faster to use a slice 
> object to get your gene/transcripts or any other object.
> Unless you are really restricted by memory, I would use a foreach loop 
> instead of the while loop with shift.
> 
> my $slice_adaptor = $reg->get_adaptor('human', 'core', 'slice');
> foreach my $slice (@{$slice_adaptor->fetch_all('toplevel')}) {
>    foreach my $gene (@{$slice->get_all_Genes}) {
>      my $transcript = $gene->canonical_transcript;
>      print $transcript->stable_id, "\n";
>    }
> }
> 
> We are close to a new release so the servers can also be a bit overloaded.
> 
> Thanks
> Thibaut
> 
>> On 17 Sep 2019, at 16:15, Olson, Andrew <olson at cshl.edu> wrote:
>>
>> Hi Nicolas,
>> For bulk operations that are pretty easy, I like to just query the 
>> database directly.
>>
>> echo "select t.* from transcript t, gene g
>>   where t.transcript_id = g.canonical_transcript_id
>>   and g.is_current = 1" | mysql … > canonicalTranscripts.txt
>>
>> Andrew
>>
>>> On Sep 17, 2019, at 10:49 AM, Nicolas Thierry-Mieg 
>>> <Nicolas.Thierry-Mieg at univ-grenoble-alpes.fr> wrote:
>>>
>>> Hi list,
>>>
>>> I want to obtain the list of Ensembl Human "canonical" transcripts.
>>> As far as I can see this is not available in the GTF or GFF files 
>>> that can be downloaded from ftp.ensembl.org.
>>>
>>> So, I wrote the following small script that uses the perl API to 
>>> connect to ensembl. My script works, but it's very slow: it took more 
>>> than 16 hours, just to obtain 66832 ENST identifiers... I'd expect 
>>> it to take seconds or minutes, not hours. I must be doing something 
>>> very wrong but I can't see it.
>>> Please help, what is wrong with the code below?
>>> Or if the issue is permanently saturated ensembl servers, is there 
>>> some other way I could obtain the ensembl canonical transcripts? I 
>>> tried using the UCSC Table Browser, but there are discrepancies 
>>> between their "knownCanonical" table and the ensembl canonical 
>>> transcripts. I also tried biomart but couldn't find "canonical" anywhere.
>>>
>>>
>>> use Bio::EnsEMBL::Registry;
>>> my $reg = "Bio::EnsEMBL::Registry";
>>> $reg->load_registry_from_db(
>>>   -host => 'ensembldb.ensembl.org',
>>>   -user => 'anonymous',
>>>   -species => 'homo sapiens'
>>>   );
>>> my $transcripts_adaptor = $reg->get_adaptor('human', 'core', 
>>> 'transcript');
>>> my $transcripts = $transcripts_adaptor->fetch_all;
>>>
>>> while(my $transcript = shift @{$transcripts}) {
>>>   ($transcript->is_canonical) || next;
>>>   print $transcript->stable_id."\n" ;
>>> }
>>>
>>>
>>> Thanks!
>>> Regards,
>>> Nicolas
>>>