[ensembl-dev] (more) memory efficient LD calculation possible via API?

andrew126 at mac.com
Thu Oct 6 09:17:36 BST 2016


Hi Anja,

Thanks very much .. no worries at all .. I'm sure you are all very busy.

That's great news .. I think it will be a very useful feature .. I'm not sure how I can help, but if you think of a way, please don't hesitate to let me know. :)

Thanks again,

Andrew

> On Oct 5, 2016, at 12:39 PM, Anja Thormann <anja at ebi.ac.uk> wrote:
> 
> Hi Andrew,
> 
> I’m very sorry for the delay in responding to your question. At the moment, this seems like a very good idea for keeping the memory footprint low. It might even be best achieved internally, without requiring the user to specify the distance, if we only keep variants in high LD.
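> 
> From the user side, a similar effect can already be approximated by discarding low-r^2 pairs as soon as each container is read. A sketch only: the 0.8 cutoff is just an illustrative threshold, and $ldfc is assumed to be the fetched LDFeatureContainer:
> 
> 	my @high_ld = grep { defined $_->{r2} && $_->{r2} >= 0.8 }
> 		@{ $ldfc->get_all_ld_values(1) };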
> 
> I will start implementing your idea and let you know as soon as I have any results for you.
> 
> Thank you very much for your feedback.
> 
> Anja
> 
> 
>> On 27 Sep 2016, at 04:34, andrew126 at mac.com wrote:
>> 
>> Hi Anja,
>> 
>> Thanks for the suggestion; however, the major memory-usage bottleneck for me right now is upstream of the get_all_ld_values() call.
>> 
>> $LDFC->fetch_by_VariationFeature() is blowing up to ~3 GB RAM when I use:
>> 	$LDFC->db->use_vcf(1)
>> 	$LDFC->max_snp_distance(500000)
>> 	against population 1000GENOMES:phase_3:EUR
>> 
>> Tracing fetch_by_VariationFeature() to self->fetch_by_Slice() to self->_ld_calc() suggests that all genotypes are loaded and calculated simultaneously .. what I'm looking for is a way to process such a call in pieces:
>> 	If I'm looking for variants in high-r^2 with rs123 with a max snp distance of 500,000, then I'd like to break it into pieces:
>> 	  1) Look for variants in high-r^2 with rs123 that are located between -500,000 and -400,000 away.
>> 	  2) Look for variants in high-r^2 with rs123 that are located between -400,000 and -300,000 away.
>> 	  3) Look for variants in high-r^2 with rs123 that are located between -300,000 and -200,000 away.
>> 	  4) Look for variants in high-r^2 with rs123 that are located between -200,000 and -100,000 away.
>> 	...
>> 	  9) Look for variants in high-r^2 with rs123 that are located between +300,000 and +400,000 away.
>> 	  10) Look for variants in high-r^2 with rs123 that are located between +400,000 and +500,000 away.
>> 
>> .. this way, I'm only getting the memory footprint of 100kb of variants at any one time.
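>> 
>> A manual version of that chunking might look like the sketch below. This is only an approximation under some assumptions: it uses the core SliceAdaptor via a registry, the fetch_by_Slice($slice, $population) form, and $vf as the VariationFeature for rs123 .. and each container would still need filtering down to pairs that involve rs123:
>> 
>> 	my $chunk = 100_000;
>> 	my $max   = 500_000;
>> 	my $sa = $registry->get_adaptor('human', 'core', 'Slice');
>> 	for (my $offset = -$max; $offset < $max; $offset += $chunk) {
>> 		my $start = $vf->seq_region_start + $offset;
>> 		my $slice = $sa->fetch_by_region('chromosome',
>> 			$vf->seq_region_name, $start, $start + $chunk - 1);
>> 		my $ldfc = $LDFC->fetch_by_Slice($slice, $population);
>> 		# keep only pairs involving rs123, then let $ldfc go out of scope
>> 	}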
>> 
>> Does that make sense?
>> 
>> Is such a thing possible?
>> 
>> Thanks for any guidance.
>> 
>> Best,
>> 
>> Andrew
>> 
>> 
>>> On Sep 26, 2016, at 6:47 AM, Anja Thormann <anja at ebi.ac.uk> wrote:
>>> 
>>> Hi Andrew,
>>> 
>>> try using $ldFeatureContainer->get_all_ld_values(1). Passing the argument 1 to get_all_ld_values() prevents fetching objects for all the variants in the result set. Instead, you get the variant names as strings; if you are interested in more attributes, you need to create the objects yourself.
>>> 
>>> You could do the following:
>>> 
>>> foreach my $ld_hash (@{$LDFC->get_all_ld_values(1)}) {
>>> 	my $d_prime = $ld_hash->{d_prime};
>>> 	my $r2 = $ld_hash->{r2};
>>> 	my $variation_name1 = $ld_hash->{variation_name1};
>>> 	my $variation_name2 = $ld_hash->{variation_name2};
>>> 
>>> 	...
>>> }
>>> 
>>> HTH,
>>> Anja
>>> 
>>> 
>>>> On 21 Sep 2016, at 22:36, andrew126 at mac.com wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I'm using version 84 of the API on 64-bit Ubuntu.
>>>> 
>>>> I'm using the $ldFeatureContainerAdaptor->fetch_by_VariationFeature() method against particular index SNPs and particular human populations/datasets (e.g. 1000GENOMES:phase_3:EUR)
>>>> 
>>>> I'm using these relevant options:
>>>> 	$ldFeatureContainerAdaptor->db->use_vcf(1);
>>>> 	$ldFeatureContainerAdaptor->max_snp_distance(300000);
>>>> 
>>>> I've noticed that memory usage of $ldFeatureContainerAdaptor->fetch_by_VariationFeature() appears to scale roughly linearly with max_snp_distance:
>>>> 	roughly 7 MB of RAM per kb of max_snp_distance against the 1000GENOMES:phase_3:EUR population/dataset.
>>>> 
>>>> I'm assuming all data get loaded and processed simultaneously?
>>>> 
>>>> The problem I'm facing is that for a max_snp_distance of 500 kb (less common, but not unheard of) it requires ~3 GB of RAM, which can get prohibitive.
>>>> 
>>>> Is there a way for the method to decrease its memory usage, e.g. by not loading everything simultaneously, even at the cost of a bit of CPU efficiency?
>>>> 
>>>> Thanks for any suggestions.
>>>> 
>>>> Best regards,
>>>> 
>>>> Andrew
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>> 
>> 
> 
