[ensembl-dev] Disconnecting from Ensembl at random points and running in parallel

Ed Gray gray_ed at hotmail.com
Mon Sep 15 17:58:23 BST 2014


Hi Eli,

What Kieron suggested may work for you, but it wasn't the solution for our
problems.  We do very similar stuff to your pseudocode, and our problem was
solved by increasing the open-files limit in MySQL.
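
For reference, the setting we raised was the MySQL server variable
open_files_limit; the value below is only an example, and the right number
depends on how many tables you have and on the OS file-descriptor limit.
(Errcode 24 is the OS "too many open files" error, which matches your
example a.)

    [mysqld]
    open_files_limit = 65535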

Regarding your second question, "Does anyone here have any experience with
running several jobs on a queue in parallel to obtain data from Ensembl?"
Yes, this is exactly what we did.  One run was nearly 100,000 jobs over 200+
cores using spot instances on Amazon Web Services.  Our setup was very cost
effective and secure.

We chose not to install the full Ensembl database and system on our cluster;
that was too costly and time consuming -- we did not want to become experts
in that.  Like your example, we used the Ensembl Perl API in scripts that ran
in parallel.  We did try the REST interface early on but backed off of it
because it was difficult to parallelize due to request-rate (gating) limits.
We also found differences between the results the REST API gave and those
from the mature Ensembl Perl API.

I'd be happy to discuss this with you off-list so we can keep this list
specific to the needs of the many.

Ed

gray_ed at hotmail.com

-----Original Message-----
From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of
Kieron Taylor
Sent: Monday, September 15, 2014 9:44 AM
To: dev at ensembl.org
Subject: Re: [ensembl-dev] Disconnecting from Ensembl at random points and
running in parallel

Dear Eli,

On 15/09/2014 14:19, Bio Sequence wrote:
> Hello all,
>
> I have a couple of questions:
>
> 1) My code goes over many genes and extracts data for each of them.
> When running my code, I have noticed that the connection to Ensembl breaks,
> which naturally kills my script.
> This happens at random points and rerunning the code solves the problem
> (which leads me to believe it is not a specific problem with the script or
> the genes I am running my script over).
> Is there some kind of known solution for this problem?

We have an option for this, -RECONNECT_WHEN_LOST => 1, which gives the
script a chance to reconnect to the server.

You can also call Bio::EnsEMBL::Registry->set_reconnect_when_lost(1) to turn
this on manually.
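
For example, a minimal sketch of a connection set up with automatic
reconnection (the gene stable ID is only an illustration):

    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    # Load the registry from the public Ensembl MySQL server and ask the
    # API to re-establish the connection if it is dropped mid-run.
    Bio::EnsEMBL::Registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    Bio::EnsEMBL::Registry->set_reconnect_when_lost(1);

    my $gene_adaptor =
        Bio::EnsEMBL::Registry->get_adaptor('homo_sapiens', 'core', 'gene');
    my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000139618');  # illustrative ID
    printf "%s\t%s\n", $gene->stable_id, $gene->external_name;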

>
> The errors printed to screen when the connection breaks vary and include,
> for example:
> (example a) DBD::mysql::st execute failed: Out of resources when opening
> file './homo_sapiens_core_76_38/xref.MYD' (Errcode: 24) at
> ~/src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482,
> <$source_species_genes> line 80.
>
> (example b) STACK Bio::EnsEMBL::DBSQL::BaseAdaptor::generic_fetch 
> ~/src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm:483
> STACK Bio::EnsEMBL::DBSQL::GeneAdaptor::fetch_by_stable_id 
> ~/src/ensembl/modules/Bio/EnsEMBL/DBSQL/GeneAdaptor.pm:249
> STACK toplevel ~/get_orthologs_for_gene.pl:56
> Date (localtime)    = Mon Sep 15 12:27:12 2014
> Ensembl API version = 76
>
> (example c) DBI
> connect('host=ensembldb.ensembl.org;port=3306','anonymous',...) failed:
> Can't connect to MySQL server on 'ensembldb.ensembl.org' (111) at
> ~/src/ensembl/modules/Bio/EnsEMBL/Registry.pm line 1624.

These errors all point to technical issues with our database server. It's
possible that it was overloaded while you were running your script. We can
check whether anything bad is happening on our side, but sometimes the server
simply has too many users -- sorry about that.

> 2) Does anyone here have any experience with running several jobs on a
> queue in parallel to obtain data from Ensembl? Is it possible to have
> multiple connections simultaneously? Specifically, I am interested in
> running the above mentioned code on different gene lists in parallel.
> Would that be possible?

We do exactly this with our own pipelines that generate the many
downloadable files we provide on our FTP site. It is uncommon to create
several connections from a single script, but forks or parallel processes
can be used. Our eHive software is designed for complex parallel workflows,
but you can run a much simpler parallel job as well. Take a look at [1]
and [2] to decide whether you want to take on the higher engineering cost of
using an existing workflow system such as eHive.

If you intend to fork within a single Perl instance, be aware that the
Ensembl Registry is a singleton that shares its connection with various
adaptors, and it may get in the way of your efforts unless you localise it
appropriately within each child process or thread.
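
As a rough sketch of one way to do this (Parallel::ForkManager is just one
convenient CPAN module, not part of the Ensembl API, and the gene-list file
names are invented for the example), each child can load its own Registry so
it gets a private database connection:

    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;
    use Parallel::ForkManager;   # assumed CPAN helper for managing forks

    my @gene_lists = ('gene_list_1.txt', 'gene_list_2.txt');   # hypothetical inputs
    my $pm = Parallel::ForkManager->new(scalar @gene_lists);

    for my $list (@gene_lists) {
        $pm->start and next;   # parent continues the loop; the child runs below

        # Each child loads its own Registry so it does not share the
        # parent's MySQL connection through the singleton.
        Bio::EnsEMBL::Registry->load_registry_from_db(
            -host => 'ensembldb.ensembl.org',
            -user => 'anonymous',
        );
        Bio::EnsEMBL::Registry->set_reconnect_when_lost(1);

        my $ga = Bio::EnsEMBL::Registry->get_adaptor('homo_sapiens', 'core', 'gene');
        # ... fetch the genes listed in $list using $ga ...

        $pm->finish;
    }
    $pm->wait_all_children;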

In addition, you could consider accessing the information via our REST API:
rest.ensembl.org

We have endpoints for homology, genes, transcripts etc., which would make
parallel access much easier in terms of coding complexity.
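
A rough sketch of a single REST lookup (HTTP::Tiny and JSON are generic CPAN
modules rather than part of the Ensembl API, and the stable ID is again only
an illustration):

    use strict;
    use warnings;
    use HTTP::Tiny;
    use JSON qw(decode_json);

    # Look up one gene by stable ID via the public REST server.
    my $server = 'http://rest.ensembl.org';
    my $resp   = HTTP::Tiny->new->get(
        "$server/lookup/id/ENSG00000139618?content-type=application/json"
    );
    die "REST request failed with status $resp->{status}\n" unless $resp->{success};

    my $gene = decode_json($resp->{content});
    print "$gene->{id}\t$gene->{display_name}\n";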

Regards,

Kieron
--
Kieron Taylor PhD.
Ensembl Core team
EBI

[1] https://github.com/Ensembl/ensembl-production/tree/release/76/modules/Bio/EnsEMBL/Production/Pipeline/FASTA
[2] http://www.ensembl.org/info/docs/eHive/index.html


_______________________________________________
Dev mailing list    Dev at ensembl.org
Posting guidelines and subscribe/unsubscribe info:
http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/




