[ensembl-dev] (more) memory efficient LD calculation possible via API?

Tue Sep 27 04:34:09 BST 2016

Hi Anja,

Thanks for the suggestion; however, the major memory-usage bottleneck for me right now is upstream of the get_all_ld_values() call.

$LDFC->fetch_by_VariationFeature() is blowing up to ~3 GB of RAM when I use:
	$LDFC->db->use_vcf(1)
	$LDFC->max_snp_distance(500000)
	against population 1000GENOMES:phase_3:EUR

Tracing fetch_by_VariationFeature() to $self->fetch_by_Slice() to $self->_ld_calc() suggests that all genotypes are loaded and processed simultaneously. What I'm looking for is a way to process such a call in pieces:
	If I'm looking for variants in high-r^2 with rs123 with a max snp distance of 500,000, then I'd like to break it into pieces:
	  1) Look for variants in high-r^2 with rs123 that are located between -500,000 and -400,000 away.
	  2) Look for variants in high-r^2 with rs123 that are located between -400,000 and -300,000 away.
	  3) Look for variants in high-r^2 with rs123 that are located between -300,000 and -200,000 away.
	  4) Look for variants in high-r^2 with rs123 that are located between -200,000 and -100,000 away.
	...
	  9) Look for variants in high-r^2 with rs123 that are located between +300,000 and +400,000 away.
	  10) Look for variants in high-r^2 with rs123 that are located between +400,000 and +500,000 away.

This way, I only incur the memory footprint of 100 kb of variants at any one time.
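
To make the window arithmetic concrete, here is a minimal sketch in plain Perl. It only computes the coordinate pieces; offset_windows(), genomic_windows() and the commented-out process_window() are hypothetical names of my own, not Ensembl API calls, and whether the LD adaptor can actually be restricted to one such piece relative to a distant index variant is exactly the open question.

```perl
use strict;
use warnings;

# Split the range [-$max_distance, +$max_distance] around an index variant
# into consecutive offset windows of at most $step bases each.
# Returns a list of [from_offset, to_offset] pairs, ordered left to right.
sub offset_windows {
    my ($max_distance, $step) = @_;
    die "step must be positive\n" unless $step > 0;
    my @windows;
    for (my $from = -$max_distance; $from < $max_distance; $from += $step) {
        my $to = $from + $step;
        $to = $max_distance if $to > $max_distance;   # shorten the last window
        push @windows, [$from, $to];
    }
    return @windows;
}

# Convert offset windows into chromosome coordinates around $index_pos,
# dropping windows that lie entirely before position 1 and clamping the
# start of a partially overlapping window to 1.
sub genomic_windows {
    my ($index_pos, $max_distance, $step) = @_;
    my @regions;
    for my $w (offset_windows($max_distance, $step)) {
        my ($start, $end) = ($index_pos + $w->[0], $index_pos + $w->[1]);
        next if $end < 1;
        $start = 1 if $start < 1;
        push @regions, [$start, $end];
    }
    return @regions;
}

# Hypothetical usage: process_window() is a placeholder, not an Ensembl call.
# for my $region (genomic_windows(1_000_000, 500_000, 100_000)) {
#     process_window($index_vf, @$region);   # one 100 kb piece at a time
# }
```

Adjacent windows share a boundary position, as in the list above, so a variant sitting exactly on a boundary would be seen twice unless one end is treated as exclusive.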

Does that make sense?

Is such a thing possible?

Thanks for any guidance.

Best,

Andrew

> On Sep 26, 2016, at 6:47 AM, Anja Thormann <anja at ebi.ac.uk> wrote:
> 
> Hi Andrew,
> 
> Try using $ldFeatureContainer->get_all_ld_values(1). Passing the argument 1 to get_all_ld_values() prevents fetching objects for all the variants in the result set. Instead you get the variant names as strings, and if you are interested in more attributes you need to create the objects yourself.
> 
> You could do the following:
> 
> foreach my $ld_hash (@{$LDFC->get_all_ld_values(1)}) {
>   my $d_prime = $ld_hash->{d_prime};
>   my $r2 = $ld_hash->{r2};
>   my $variation_name1 = $ld_hash->{variation_name1};
>   my $variation_name2 = $ld_hash->{variation_name2};
> 
> ...
>  }
> 
> HTH,
> Anja
> 
> 
>> On 21 Sep 2016, at 22:36, andrew126 at mac.com wrote:
>> 
>> Hi,
>> 
>> I'm using version 84 of the API on 64-bit Ubuntu.
>> 
>> I'm using the $ldFeatureContainerAdaptor->fetch_by_VariationFeature() method against particular index SNPs and particular human populations/datasets (e.g. 1000GENOMES:phase_3:EUR)
>> 
>> I'm using these relevant options:
>> 	$ldFeatureContainerAdaptor->db->use_vcf(1);
>> 	$ldFeatureContainerAdaptor->max_snp_distance(300000);
>> 
>> I've noticed that memory usage of $ldFeatureContainerAdaptor->fetch_by_VariationFeature() appears to scale roughly linearly with max_snp_distance:
>> 	roughly 7 MB of RAM per kb of max_snp_distance against the 1000GENOMES:phase_3:EUR population/dataset.
>> 
>> I'm assuming all data get loaded and processed simultaneously?
>> 
>> The problem I'm facing is that a max_snp_distance of 500 kb (less usual, but sometimes meaningful) requires ~3 GB of RAM to process, which can be prohibitive.
>> 
>> Is there a way to reduce the method's memory usage, for example by not loading everything simultaneously, even at the cost of some CPU efficiency?
>> 
>> Thanks for any suggestions.
>> 
>> Best regards,
>> 
>> Andrew
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
> 
> 