[ensembl-dev] [Compara] Memory leak?

Javier Herrero jherrero at ebi.ac.uk
Wed Jan 25 14:35:48 GMT 2012


Apologies, my fingers slipped on the keyboard and sent the email before 
it was ready... :-(

So, the cache in the member adaptor was added for a specific
pipeline. Unfortunately, its implementation differs from that of the
other adaptors, and the usual ways of clearing the cache do not work.

We don't have a final solution right now. However, you can manually 
clear the cache by using:

$member_adaptor->{'_member_cache'} = undef;
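
For example, if you loop over many members, clearing the cache at the
end of each iteration should keep memory flat. This is an untested
sketch (the fetch call and variable names are just illustrative):

   foreach my $stable_id (@gene_stable_ids) {
       my $member = $member_adaptor->fetch_by_source_stable_id(
           'ENSEMBLGENE', $stable_id);
       # ... work with $member ...

       # drop the internal cache so its entries can be freed
       $member_adaptor->{'_member_cache'} = undef;
   }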

Sorry for the inconvenience. I hope this works for you.

Javier

On 25/01/12 14:31, Javier Herrero wrote:
> Hi Giuseppe
>
> The cache in the member adaptor has been added for a specific 
> pipeline, but does not follow the same procedure as other adaptors
>
> On 24/01/12 19:44, Giuseppe G. wrote:
>> Hi,
>>
>> I'm running a pipeline composed of three blocks. Block 1 uses the 
>> Ensembl API, Block 2 uses the output of Block 1 but not the Ensembl 
>> API, and Block 3 uses the output of Block 2 and processes it with 
>> the Ensembl API again. My code is structured as follows:
>>
>>
>> -set up registry
>> -pass registry to block_1; create relevant adaptors and do something
>> -registry->clear
>> -pass output_block_1 to block_2; do something
>> -set up registry
>> -pass registry and output_block_2 to block_3; create relevant 
>> adaptors and do something
>> -registry->clear
>> -finish
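>>
>> In code, the registry handling looks roughly like this (connection 
>> details and species names are illustrative):
>>
>>    use Bio::EnsEMBL::Registry;
>>
>>    # block 1: load the registry, fetch adaptors, process input
>>    Bio::EnsEMBL::Registry->load_registry_from_db(
>>        -host => 'ensembldb.ensembl.org', -user => 'anonymous');
>>    my $gene_adaptor =
>>        Bio::EnsEMBL::Registry->get_adaptor('Human', 'Core', 'Gene');
>>    # ... block 1 work ...
>>
>>    # release connections and cached objects before block 2
>>    Bio::EnsEMBL::Registry->clear();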
>>
>> Both block_1 and block_3 operate on a text input file, creating as 
>> many different Ensembl adaptors as needed, cycling over each text 
>> entry and then writing to an output file.
>>
>> Now, this has worked well for a number of years (since rel. 58, I'd 
>> say). It has been used on genome-wide lists of Ensembl gene IDs 
>> without any problems.
>>
>> Since rel. 64, however, I'm having problems completing the process, 
>> even for relatively small input files (~1500 IDs). The pipeline will 
>> not complete block_3 and dies with an "Out of memory!" Perl error. 
>> Memory usage approaches 60% about halfway through block_3 on an i686 
>> machine with 4 GB of RAM running Unix.
>>
>> I'm puzzled as to the reason for the following behaviour: if I run 
>> the third block alone (i.e. I comment out the code for blocks 1 and 
>> 2 in my script and give block_3 the completed output from block_2), 
>> block_3 completes fine. But, if I understand correctly, the clear() 
>> method should disconnect and release all the memory used by the 
>> Ensembl connections in block_1. So why does the presence of two 
>> pipeline blocks, both using the API, crash my script, which instead 
>> completes OK if I run only one block at a time? Is there a more 
>> appropriate method to call on completion of a registry session than 
>> clear()? Or does Perl, for some reason unknown to me, not fully 
>> release memory during runtime?
>>
>> I've started doing some memory profiling in block_3, using 
>> Devel::Size, Devel::Gladiator, etc. In block_3 I create the 
>> following adaptors:
>>
>> gene
>> genomeDB
>> member
>> homology
>> proteintree
>> methodlinkspeciesset
>> NCBItaxon
>>
>> While iterating through the input lines, I check the total size of 
>> each of these. By the time I've reached approximately 10% of my 
>> input file, all of them have stayed constant in size, apart from the 
>> member adaptor, which has grown to 145 times its initial size (I'm 
>> talking about the size of the full data structures here).
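>>
>> The check itself is roughly this (assuming the adaptors are kept in 
>> a hash, %adaptors; names are illustrative):
>>
>>    use Devel::Size qw(total_size);
>>
>>    # after each input line, log the deep size of every adaptor
>>    foreach my $name (sort keys %adaptors) {
>>        printf "%s: %d bytes\n", $name, total_size($adaptors{$name});
>>    }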
>>
>> I was wondering whether the member adaptor's size increase is to be 
>> expected, and whether the reason for my out-of-memory errors lies 
>> somewhere else. I did try deactivating caching through the registry 
>> calls, but with no luck. Your help would be greatly appreciated, as 
>> usual. Thanks a lot!
>>
>> Giuseppe
>>
>

-- 
Javier Herrero, PhD
Ensembl Coordinator and Ensembl Compara Project Leader
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton
Cambridge - CB10 1SD - UK




