[ensembl-dev] Disconnecting from Ensembl at random points and running in parallel

Bio Sequence biosequence at tauex.tau.ac.il
Mon Sep 15 14:19:10 BST 2014


Hello all,

I have a couple of questions:

1) My code goes over many genes and extracts data for each of them.
While the script runs, the connection to Ensembl sometimes breaks, which kills the script.
This happens at random points, and rerunning the code solves the problem, which leads me to believe the cause is neither the script itself nor the particular genes it processes.
Is there some kind of known solution for this problem?

Here is the basic structure of my script:

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org', 
        -user => 'anonymous'
);

my $gene_adaptor = $registry->get_adaptor( $species, 'Core', 'Gene' );
my $tr_adaptor = $registry->get_adaptor( $species, 'Core', 'Transcript' );

foreach my $gene_id ( @gene_ids )
{

        # get current gene object
        my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);

        # get current transcripts associated with the current gene
        my $transcripts_ref = $tr_adaptor->fetch_all_by_Gene($gene);

        # call transcript_retrieval_function with ref to transcript array... 
        ...

        # call ortholog_retrieval_function...
        ...
}

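sub transcript_retrieval_function
{

        my ( $transcripts_ref ) = @_;

        # walk over every transcript of the current gene
        foreach my $transcript ( @{$transcripts_ref} )
        {
                # do some things like:
                $transcript->status();
                $transcript->translateable_seq();
        }

        return ($transcripts_ref);
}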

sub ortholog_retrieval_function
{
        my ( $orthologs_ref ) = @_;
        
        # dereference the array ref passed in by the caller
        foreach my $ortho ( @{$orthologs_ref} )
        {
                #do some things on each ortholog
        }
}


The errors printed to screen when the connection breaks vary and include, for example:
(example a) DBD::mysql::st execute failed: Out of resources when opening file './homo_sapiens_core_76_38/xref.MYD' (Errcode: 24) at ~/src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482, <$source_species_genes> line 80.

(example b) STACK Bio::EnsEMBL::DBSQL::BaseAdaptor::generic_fetch ~/src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm:483
STACK Bio::EnsEMBL::DBSQL::GeneAdaptor::fetch_by_stable_id ~/src/ensembl/modules/Bio/EnsEMBL/DBSQL/GeneAdaptor.pm:249
STACK toplevel ~/get_orthologs_for_gene.pl:56
Date (localtime)    = Mon Sep 15 12:27:12 2014
Ensembl API version = 76

(example c) DBI connect('host=ensembldb.ensembl.org;port=3306','anonymous',...) failed: Can't connect to MySQL server on 'ensembldb.ensembl.org' (111) at ~/src/ensembl/modules/Bio/EnsEMBL/Registry.pm line 1624.
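Since all of these failures seem to be transient, one stopgap I can imagine is to wrap each database call in a retry loop. The sketch below is only that: with_retries, max_tries and delay are names I made up for illustration, not part of the Ensembl API, and I do not know whether retrying is the recommended approach.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): run a code ref and,
# if it dies, wait and try again, doubling the pause after each failure.
sub with_retries {
    my ( $code, %opt ) = @_;
    my $max_tries = $opt{max_tries} // 5;    # assumed default
    my $delay     = $opt{delay}     // 1;    # seconds before the first retry

    for my $try ( 1 .. $max_tries ) {
        my @result;
        my $ok = eval { @result = $code->(); 1 };
        return wantarray ? @result : $result[0] if $ok;

        # Give up and rethrow the last error once the attempts are used up.
        die $@ if $try == $max_tries;

        warn "Attempt $try failed, retrying in ${delay}s: $@";
        sleep $delay;
        $delay *= 2;
    }
}
```

In the loop above, the fetch would then become something like my $gene = with_retries( sub { $gene_adaptor->fetch_by_stable_id($gene_id) } ); and a gene that still fails after the last attempt would kill the script as before.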


2) Does anyone here have experience with running several jobs on a queue in parallel to obtain data from Ensembl? Is it possible to have multiple connections open simultaneously? Specifically, I would like to run the above-mentioned code on different gene lists in parallel. Would that be possible?
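To make the question concrete, this is roughly what I have in mind, written with plain fork rather than a queue. It is a sketch under assumptions: split_into_chunks and run_in_parallel are hypothetical helpers of my own, and each worker is assumed to load the registry itself after the fork so that no connection is shared between processes.

```perl
use strict;
use warnings;

# Hypothetical helper: deal a list of IDs into at most $n chunks (round-robin).
sub split_into_chunks {
    my ( $n, @ids ) = @_;
    my @chunks = map { [] } 1 .. $n;
    push @{ $chunks[ $_ % $n ] }, $ids[$_] for 0 .. $#ids;
    return grep { @$_ } @chunks;    # drop chunks that stayed empty
}

# Hypothetical driver: fork one child per chunk, wait for all of them,
# and return the number of children that failed.
# $worker is a code ref that receives one chunk as an array ref.
sub run_in_parallel {
    my ( $worker, @chunks ) = @_;
    my @pids;
    for my $chunk (@chunks) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {
            # Child process: the worker should open its own registry
            # connection here rather than reuse one made before the fork.
            my $ok = eval { $worker->($chunk); 1 };
            exit( $ok ? 0 : 1 );
        }
        push @pids, $pid;
    }
    my $failures = 0;
    for my $pid (@pids) {
        waitpid( $pid, 0 );
        $failures++ if $? != 0;
    }
    return $failures;
}
```

For example, run_in_parallel( \&process_gene_list, split_into_chunks( 4, @gene_ids ) ) would start four workers, where process_gene_list stands for the loop from question 1 applied to one chunk. On a cluster queue, each chunk would instead be written to a file and submitted as a separate job.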



Many thanks in advance,
Eli


