[ensembl-dev] [Compara] Memory leak?

Giuseppe G. G.Gallone at sms.ed.ac.uk
Tue Jan 24 19:44:33 GMT 2012


Hi,

I'm running a pipeline composed of three blocks. Block 1 uses the 
Ensembl API, Block 2 uses the output of Block 1 but not the Ensembl 
API, and Block 3 takes the output of Block 2 and processes it using the 
Ensembl API again. My code is structured as follows:


-set up registry
-pass registry to block_1; create relevant adaptors and do something
-registry->clear
-pass output_block_1 to block_2; do something
-set up registry
-pass registry and output_block_2 to block_3; create relevant adaptors 
and do something
-registry->clear
-finish

Both block_1 and block_3 operate on a text input file, creating as many 
different Ensembl adaptors as needed, cycling over each text entry and 
then writing to an output file.
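In skeleton form, that looks roughly like this (a sketch with 
placeholder connection details; I've shown only the gene and member 
adaptors as representatives of the ones I actually create):

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Connection details are placeholders for our own server.
sub setup_registry {
    Bio::EnsEMBL::Registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    return 'Bio::EnsEMBL::Registry';
}

# Block 1: fetch through the API, write output_block_1.
my $registry = setup_registry();
my $gene_adaptor = $registry->get_adaptor('Human', 'Core', 'Gene');
# ... cycle over the input IDs, write output_block_1 ...
$registry->clear();

# Block 2: plain-Perl processing of output_block_1, no API.
# ... produce output_block_2 ...

# Block 3: fresh registry session, this time with Compara adaptors.
$registry = setup_registry();
my $member_adaptor = $registry->get_adaptor('Multi', 'Compara', 'Member');
# ... cycle over output_block_2, write the final output ...
$registry->clear();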

Now, this has worked well for a number of years (since release 58, I'd 
say). It has been used on genome-wide lists of Ensembl gene IDs without 
any problems.

Since release 64, however, I'm having problems completing the process, 
even for relatively small input files (~1500 IDs). The pipeline will 
not finish running block_3 and quits with an "Out of memory!" Perl 
error. Memory usage approaches 60% about halfway through block_3 on an 
i686 machine with 4 GB of RAM running Unix.

I'm currently puzzled by the following behaviour: if I run the third 
block alone (i.e. I comment out the code for blocks 1 and 2 in my 
script and give block_3 the completed output from block_2), block_3 
will complete. But, if I understand correctly, the clear() method 
should disconnect and release all memory used by the Ensembl connection 
in block_1. So why does the presence of two pipeline blocks, both using 
the API, crash my script, when a single block run on its own completes 
fine? Is there perhaps a more appropriate method to call on completion 
of a registry session than clear()? Or does Perl, for some reason 
unknown to me, not fully release memory during runtime?
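For reference, something like the following is what I mean; would 
adding disconnect_all() before clear() make a difference? 
(disconnect_all() and the -no_cache flag are what I understood from the 
Registry documentation, so apologies if I've misread their intended 
use.)

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# -no_cache should stop the adaptors keeping their object caches; this
# is what I meant below by "deactivating caching". Host/user are
# placeholders for our own server.
Bio::EnsEMBL::Registry->load_registry_from_db(
    -host     => 'ensembldb.ensembl.org',
    -user     => 'anonymous',
    -no_cache => 1,
);

# ... run a block ...

# Close the database handles explicitly, then drop the registry entries.
Bio::EnsEMBL::Registry->disconnect_all();
Bio::EnsEMBL::Registry->clear();

(I do realise Perl normally keeps freed memory inside the process for 
reuse rather than returning it to the OS, so the reported footprint can 
only grow; but if the memory were really being freed and reused I 
wouldn't expect to actually run out.)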

I've started doing some memory profiling in block_3, using Devel::Size, 
Devel::Gladiator, etc. In block_3 I create the following adaptors:

gene
genomeDB
member
homology
proteintree
methodlinkspeciesset
NCBItaxon

Iterating through the input lines, I'm checking the total size of each 
of these. By roughly 10% of the way through my input file, all of them 
have stayed constant in size, apart from the member adaptor, which has 
grown to 145 times its initial size (I'm talking about the size of the 
full data structures here).
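For what it's worth, this is roughly how I'm taking the measurements (a 
sketch; the adaptor variables and the input filehandle come from the 
surrounding script):

use strict;
use warnings;
use Devel::Size qw(total_size);

# Print each adaptor's deep size and its growth relative to the first
# measurement taken.
my %baseline;

sub snapshot_sizes {
    my ($line_no, $adaptors) = @_;   # $adaptors: { name => adaptor object }
    for my $name (sort keys %$adaptors) {
        my $bytes = total_size($adaptors->{$name});
        $baseline{$name} //= $bytes;
        printf "line %d: %-20s %12d bytes (%.1fx baseline)\n",
            $line_no, $name, $bytes, $bytes / $baseline{$name};
    }
}

# Inside the block_3 loop, every 100 input lines:
#
#   snapshot_sizes($line_count, {
#       gene   => $gene_adaptor,
#       member => $member_adaptor,
#       # ... and the other five ...
#   }) if ++$line_count % 100 == 0;

It is this output that shows the member adaptor's structure growing 
while the others stay flat.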

I was wondering whether the member adaptor's size increase is to be 
expected, and whether the reason for my out-of-memory errors lies 
elsewhere. I did attempt to deactivate caching through the registry 
calls, but no luck. Your help would be greatly appreciated, as usual. 
Thanks a lot!

Giuseppe

-- 

The University of Edinburgh is a charitable body, registered in 
Scotland, with registration number SC005336.



