[ensembl-dev] Disconnecting from Ensembl at random points and running in parallel

Mon Sep 15 14:43:50 BST 2014

Dear Eli,

On 15/09/2014 14:19, Bio Sequence wrote:
> Hello all,
>
> I have a couple of questions:
>
> 1) My code goes over many genes and extracts data for each of them.
> When running my code, I have noticed that the connection to Ensembl breaks which naturally kills my script.
> This happens at random points and rerunning the code solves the problem (which leads me to believe it is not a specific problem with the script or the genes I am running my script over).
> Is there some kind of known solution for this problem?

We have an option for this called -RECONNECT_WHEN_LOST => 1, which gives 
the script a chance at reconnecting to the server.

You can also call Bio::EnsEMBL::Registry->set_reconnect_when_lost(1) to 
turn this on manually.

>
> The errors printed to screen when the connection breaks vary and include, for example:
> (example a) DBD::mysql::st execute failed: Out of resources when opening file './homo_sapiens_core_76_38/xref.MYD' (Errcode: 24) at ~/src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482, <$source_species_genes> line 80.
>
> (example b) STACK Bio::EnsEMBL::DBSQL::BaseAdaptor::generic_fetch ~/src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm:483
> STACK Bio::EnsEMBL::DBSQL::GeneAdaptor::fetch_by_stable_id ~/src/ensembl/modules/Bio/EnsEMBL/DBSQL/GeneAdaptor.pm:249
> STACK toplevel ~/get_orthologs_for_gene.pl:56
> Date (localtime)    = Mon Sep 15 12:27:12 2014
> Ensembl API version = 76
>
> (example c) DBI connect('host=ensembldb.ensembl.org;port=3306','anonymous',...) failed: Can't connect to MySQL server on 'ensembldb.ensembl.org' (111) at ~/src/ensembl/modules/Bio/EnsEMBL/Registry.pm line 1624.

These errors all point to technical issues with our database server. 
It's possible that it was overloaded when you ran your script. We can 
check whether anything is wrong on our side, but sometimes the server 
simply has too many users, sorry about that.

> 2) Does anyone here have any experience with running several jobs on a
> queue in parallel to obtain data from Ensembl? Is it possible to have
> multiple connections simultaneously? Specifically, I am interested in
> running the above mentioned code on different gene lists in parallel.
> Would that be possible?

We do exactly this with our own pipelines that generate the many 
downloadable files we provide on our FTP site. It is uncommon to create 
several connections from a single script, but forks or parallel 
processes can be used. Our eHive software is designed for complex 
parallel workflows, but you can also run a much simpler parallel job. 
Take a look at [1] and [2] to determine whether you want the 
higher engineering cost of using an existing

If you intend to fork within a single Perl instance, be aware that the 
Ensembl Registry is a singleton that shares its connection with various 
adaptors and may get in the way of your efforts unless you localise it 
appropriately within threads.

In addition to this, you can also consider accessing the information via 
our REST API: rest.ensembl.org

We have endpoints for homology, genes, transcripts etc. which would make 
parallel access much easier in terms of coding complexity.
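
As a rough illustration of the REST route, here is a minimal Python 
sketch that looks up several stable IDs concurrently. It is only a 
sketch under assumptions: the server URL, the choice of the lookup/id 
endpoint, the helper names (lookup_url, fetch_json, fetch_genes), the 
retry count and the number of workers are all illustrative and not 
taken from this thread. Please check the REST documentation for the 
endpoint and rate limits that apply to your case.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: public REST server address; adjust if yours differs.
REST_SERVER = "https://rest.ensembl.org"


def lookup_url(stable_id):
    """Build a URL for the lookup/id endpoint (illustrative endpoint choice)."""
    return f"{REST_SERVER}/lookup/id/{stable_id}?content-type=application/json"


def fetch_json(url, retries=3, wait=1.0):
    """Fetch and decode one URL, retrying a few times on network errors."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return json.load(response)
        except OSError:  # URLError and socket timeouts are OSError subclasses
            if attempt == retries:
                raise
            time.sleep(wait * attempt)  # simple linear back-off


def fetch_genes(stable_ids, fetch=fetch_json, workers=4):
    """Look up many IDs concurrently; returns {stable_id: decoded JSON}."""
    ids = list(stable_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each ID with its result
        results = pool.map(lambda sid: fetch(lookup_url(sid)), ids)
        return dict(zip(ids, results))
```

You would call it as fetch_genes(["ENSG00000157764"]), or pass one gene 
list per job. Keeping the number of workers small is probably wise, 
since the service is shared, and the retry loop gives a dropped request 
a second chance rather than killing the whole run.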

Regards,

Kieron
-- 
Kieron Taylor PhD.
Ensembl Core team
EBI

[1] https://github.com/Ensembl/ensembl-production/tree/release/76/modules/Bio/EnsEMBL/Production/Pipeline/FASTA
[2] http://www.ensembl.org/info/docs/eHive/index.html