[ensembl-dev] [Compara] Memory leak?

Javier Herrero jherrero at ebi.ac.uk
Wed Jan 25 14:31:38 GMT 2012


Hi Giuseppe

The cache in the member adaptor was added for a specific pipeline and 
does not follow the same procedure as the other adaptors.

On 24/01/12 19:44, Giuseppe G. wrote:
> Hi,
>
> I'm running a pipeline composed of three blocks. Block 1 uses the 
> Ensembl API, Block 2 uses the output of Block 1 but not the Ensembl 
> API, and Block 3 uses the output of Block 2 and processes it with the 
> Ensembl API again. My code is structured as follows:
>
>
> -set up registry
> -pass registry to block 1; create relevant adapters and do something
> -registry->clear
> -pass output_block_1 to block_2; do something
> -set up registry
> -pass registry and output_block_2 to block_3; create relevant adapters 
> and do something
> -registry->clear
> -finish
>
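
The lifecycle described above, sketched in Perl against the public 
Registry API (the run_block_* subroutines are placeholders for your own 
code, and the connection parameters assume the public Ensembl mirror):

    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    my $reg = 'Bio::EnsEMBL::Registry';

    # Block 1: uses the API
    $reg->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',  # public mirror; swap in your own server
        -user => 'anonymous',
    );
    my $out1 = run_block_1($reg);   # placeholder: creates adaptors, does something
    $reg->clear;                    # disconnect and drop the registry's adaptors

    # Block 2: no API use
    my $out2 = run_block_2($out1);  # placeholder

    # Block 3: uses the API again
    $reg->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    run_block_3($reg, $out2);       # placeholder
    $reg->clear;
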
> Both block_1 and block_3 operate on a text input file, creating as 
> many different Ensembl adapters as needed, cycling over each text 
> entry and then writing to an output file.
>
> Now this has worked well for a number of years (since rel. 58, I'd 
> say). It has been used on genome-wide lists of Ensembl gene IDs 
> without any problems.
>
> Since rel. 64, however, I'm having problems completing the process, 
> even for relatively small input files (~1500 IDs). The pipeline will 
> not finish block_3 and quits with an "out of memory!" Perl error. 
> Memory usage approaches 60% halfway through block_3 on an i686 4 GB 
> RAM machine running Unix.
>
> I'm currently puzzled by the following behaviour: if I run the third 
> block alone (i.e. I comment out the code from block_1 and block_2 in 
> my script, and give block_3 the completed output from block_2), 
> block_3 completes. But, if I understand correctly, the clear() method 
> should disconnect and release all memory used by the Ensembl 
> connection in block_1. So why does the presence of two pipeline 
> blocks, both using the API, crash my script, when it completes fine 
> if I run only one block at a time? Is there perhaps a more appropriate 
> method to call at the end of a registry session than clear()? Or does 
> Perl, for some reason unknown to me, not fully release memory during 
> runtime?
>
> I've started doing some memory profiling in block_3, using 
> Devel::Size, Devel::Gladiator, etc. In block_3 I create the following 
> adapters:
>
> gene
> genomeDB
> member
> homology
> proteintree
> methodlinkspeciesset
> NCBItaxon
>
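
Polling each adaptor with Devel::Size between batches of input lines is 
a reasonable way to watch this; a minimal sketch, assuming a 
hypothetical %adaptors hash mapping the names above to the adaptor 
objects:

    use Devel::Size qw(total_size);

    # %adaptors: hypothetical name => adaptor-object hash built from the list above
    for my $name (sort keys %adaptors) {
        printf "%-22s %12d bytes\n", $name, total_size($adaptors{$name});
    }

Since total_size() walks everything reachable from the reference, a 
growing internal cache inside an adaptor shows up here even though the 
adaptor variable itself never changes.
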
> Iterating through the input lines, I'm checking the total size of 
> each of these. By the time I've reached approximately 10% of my input 
> file, all of them have stayed constant in size, apart from the 
> member_adaptor, which has grown to 145 times its initial size (I'm 
> talking about the size of the full data structures here).
>
> I was wondering whether the member adaptor's size increase is to be 
> expected, and whether the reason for my out-of-memory errors lies 
> somewhere else. I did attempt deactivating caching through the 
> registry calls, but no luck. Your help would be greatly appreciated, 
> as usual. Thanks a lot!
>
> Giuseppe
>
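
Regarding the cache deactivation mentioned above: the Registry accepts 
a -no_cache flag at load time, though, as noted, the member adaptor's 
cache does not follow the same procedure as the other adaptors and may 
not honour it. A minimal sketch (connection parameters are 
placeholders):

    use Bio::EnsEMBL::Registry;

    # -no_cache asks adaptors to skip their internal caching where supported
    Bio::EnsEMBL::Registry->load_registry_from_db(
        -host     => 'ensembldb.ensembl.org',  # placeholder: public mirror
        -user     => 'anonymous',
        -no_cache => 1,
    );
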

-- 
Javier Herrero, PhD
Ensembl Coordinator and Ensembl Compara Project Leader
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton
Cambridge - CB10 1SD - UK
