[ensembl-dev] Disconnecting from Ensembl at random points and running in parallel
Kieron Taylor
ktaylor at ebi.ac.uk
Mon Sep 15 14:43:50 BST 2014
Dear Eli,
On 15/09/2014 14:19, Bio Sequence wrote:
> Hello all,
>
> I have a couple of questions:
>
> 1) My code goes over many genes and extracts data for each of them.
> When running my code, I have noticed that the connection to Ensembl breaks which naturally kills my script.
> This happens at random points and rerunning the code solves the problem (which leads me to believe it is not a specific problem with the script or the genes I am running my script over).
> Is there some kind of known solution for this problem?
We have an option for this called -RECONNECT_WHEN_LOST => 1, which gives
the script a chance at reconnecting to the server.
You can also call Bio::EnsEMBL::Registry->set_reconnect_when_lost(1) to
turn this on manually.
>
> The errors printed to screen when the connection breaks vary and include, for example:
> (example a) DBD::mysql::st execute failed: Out of resources when opening file './homo_sapiens_core_76_38/xref.MYD' (Errcode: 24) at ~/src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482, <$source_species_genes> line 80.
>
> (example b) STACK Bio::EnsEMBL::DBSQL::BaseAdaptor::generic_fetch ~/src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm:483
> STACK Bio::EnsEMBL::DBSQL::GeneAdaptor::fetch_by_stable_id ~/src/ensembl/modules/Bio/EnsEMBL/DBSQL/GeneAdaptor.pm:249
> STACK toplevel ~/get_orthologs_for_gene.pl:56
> Date (localtime) = Mon Sep 15 12:27:12 2014
> Ensembl API version = 76
>
> (example c) DBI connect('host=ensembldb.ensembl.org;port=3306','anonymous',...) failed: Can't connect to MySQL server on 'ensembldb.ensembl.org' (111) at ~/src/ensembl/modules/Bio/EnsEMBL/Registry.pm line 1624.
These errors all point to technical issues with our database
server. It's possible that it was overloaded during your attempts to run
your script. We can check if anything bad is happening, but sometimes it
just has too many users, sorry about that.
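
Because overload failures like these are transient, a client-side retry with a growing pause between attempts can complement the reconnect option. The following is a minimal, language-agnostic sketch written in Python; it is not part of the Ensembl API, and the name `retry` and its parameters are hypothetical.

```python
import time


def retry(func, attempts=5, base_delay=1.0, sleep=time.sleep,
          retry_on=(Exception,)):
    """Call func() until it succeeds or `attempts` tries have failed.

    The pause doubles after each failure: base_delay, 2 * base_delay, ...
    `sleep` is injectable so the behaviour can be checked without waiting.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A call such as `retry(lambda: fetch_gene_data(stable_id))` would wrap a single per-gene query, where `fetch_gene_data` stands in for whatever your script does for each gene. Narrowing `retry_on` to connection-type errors avoids retrying genuine bugs.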
> 2) Does anyone here have any experience with running several jobs on a
> queue in parallel to obtain data from Ensembl? Is it possible to have
> multiple connections simultaneously? Specifically, I am interested in
> running the above mentioned code on different gene lists in parallel.
> Would that be possible?
We do exactly this with our own pipelines that generate the many
downloadable files we provide on our FTP site. It is uncommon to create
several connections from a single script, but forks or parallel
processes can be used. Our eHive software is designed for complex
parallel workflows, but you can do a much simpler parallel job as
well. Take a look at [1] and [2] to determine whether you want the
higher engineering cost of using an existing
If you intend to fork within a single Perl instance, be aware that the
Ensembl Registry is a singleton that shares its connection with various
adaptors and may get in the way of your efforts unless you localise it
appropriately within threads.
In addition to this, you can also consider accessing the information via
our REST API: rest.ensembl.org
We have endpoints for homology, genes, transcripts etc. which would make
parallel access much easier in terms of coding complexity.
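
As a rough illustration of parallel REST access, here is a minimal Python sketch that looks up several genes concurrently. It assumes the `lookup/id` endpoint served over https with a `content-type=application/json` query parameter; the function names are hypothetical, and the exact paths for the homology endpoints should be taken from the documentation at rest.ensembl.org.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# The host comes from the email; the https scheme is an assumption.
SERVER = "https://rest.ensembl.org"


def lookup_url(stable_id, server=SERVER):
    """Build the URL for the lookup/id endpoint of one stable ID."""
    return f"{server}/lookup/id/{stable_id}?content-type=application/json"


def fetch_gene(stable_id, timeout=30):
    """Fetch one gene record as a dict (performs a network request)."""
    with urllib.request.urlopen(lookup_url(stable_id), timeout=timeout) as resp:
        return json.load(resp)


def fetch_many(stable_ids, fetch=fetch_gene, workers=4):
    """Fetch several IDs concurrently and return {stable_id: result}.

    Each worker issues independent HTTP requests, so no connection
    state is shared between them.
    """
    stable_ids = list(stable_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs
        return dict(zip(stable_ids, pool.map(fetch, stable_ids)))


# Example use (needs network access):
#   genes = fetch_many(["ENSG00000157764", "ENSG00000139618"])
```

Keeping `workers` small is a deliberate choice here: a public service is likely to limit request rates, so a handful of concurrent requests is a safer starting point than dozens.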
Regards,
Kieron
--
Kieron Taylor PhD.
Ensembl Core team
EBI
[1] https://github.com/Ensembl/ensembl-production/tree/release/76/modules/Bio/EnsEMBL/Production/Pipeline/FASTA
[2] http://www.ensembl.org/info/docs/eHive/index.html