[ensembl-dev] RsIDs consistently fail with variation API, v83

Fri Mar 18 17:02:20 GMT 2016

Dear Johanne,
thank you very much for sharing your results. You are absolutely correct: the threshold of 100K is artificial. The parameter is used in our script for computing LD data, and I have increased its value to 500K. Could you please refresh your checkout of the ensembl-variation API? You also need to recompile the C code in ensembl-variation/C_code; it is safest to delete calc_genotypes and calc_genotypes.o before doing so. I'm rerunning your script with the new setting because I'm very interested in how this will change the distribution.

Thank you,
Anja 
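
One way to confirm that the raised limit has taken effect is to recompute the largest tag-to-LD-variant distance from the script's output. The following is only a sketch: it assumes the tab-separated layout written by expandSNPs.pl (a three-column line for each tag SNP, followed by four-column lines for its LD variants). The function name max_ld_distance and the example rows are made up for illustration and are not real results.

```perl
use strict;
use warnings;

# Return the largest distance (in bases) between a tag SNP and any of its
# LD variants, given the lines of one output file.
sub max_ld_distance {
    my @lines = @_;
    my ($tag_region, $tag_start);
    my $max = 0;
    foreach my $line (@lines) {
        chomp $line;
        my @f = split /\t/, $line;
        if (@f == 3) {
            # Tag SNP line: region, rsID, start
            ($tag_region, $tag_start) = @f[0, 2];
        }
        elsif (@f == 4 && defined $tag_start && $f[0] eq $tag_region) {
            # LD variant line: region, rsID, start, r2=<value>
            my $distance = abs($f[2] - $tag_start);
            $max = $distance if $distance > $max;
        }
    }
    return $max;
}

# Made-up rows in the assumed layout
my @example = (
    "6\trsTAG\t1000000",
    "6\trsA\t1099986\tr2=0.9",
    "6\trsB\t900000\tr2=0.5",
);
print max_ld_distance(@example), "\n";
```

If the maximum still sits at exactly 100,000 after recompiling, the old limit is probably still being applied somewhere.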

> On 18 Mar 2016, at 14:12, Johanne Håøy Horn <johannhh at ifi.uio.no> wrote:
> 
> Hello,
> 
> Thank you for your help!
> 
> I have a list of ~2300 SNPs which I expanded with max_snp_distance(500_000). The API is at v84, and I am using hg38 as my assembly. The run took longer, as expected, and I got more variants in LD. However, the maximum distance is still 100KB. I find this suspicious and believe it is due to a technicality rather than to none of my SNPs having LD partners further away than 100K. Either my code has a bug, or I am using the Ensembl API incorrectly. Could you please help me find out how to get distances > 100K?
> 
> I attach my code, a file of the tag SNPs I have expanded, and some histograms for different r2 thresholds to show the distribution of distances. The file of the actual expanded LD SNPs is large, ~45MB, but I can send it if you want to take a look. The plots are inline images; let me know if you cannot see them. The x axis shows the distance between a tag SNP and a variant in LD. The plot for r2 = 0 in particular makes it quite clear that a cutoff at 100K is present. Such a sharply defined threshold seems more human than biological.
> 
> The max distances for the different r2 thresholds are as follows:
> r2 = 0: max distance of 100,000
> Number of expanded SNPs in LD: 931,299
> 
> r2 = 0.5: max distance of 99,998
> Number of expanded SNPs in LD: 146,881
> 
> r2 = 0.7: max distance of 99,994
> Number of expanded SNPs in LD: 88,862
> 
> r2 = 0.8: max distance of 99,986
> Number of expanded SNPs in LD: 67,744
> 
> r2 = 0.9: max distance of 99,986
> Number of expanded SNPs in LD: 50,025
> 
> r2 = 1: max distance of 99,986
> Number of expanded SNPs in LD: 28,046
> 
> Inline plots:
> r2 = 0:
> <ld_distance_histogram0.png>
> r2 = 0.5:
> <ld_distance_histogram0.5.png>
> r2 = 0.7:
> <ld_distance_histogram0.7.png>
> r2 = 0.8:
> <ld_distance_histogram0.8.png>
> r2 = 0.9:
> <ld_distance_histogram0.9.png>
> r2 = 1:
> <ld_distance_histogram1.png>
> Best, 
> Johanne Håøy Horn
> 
> <expandSNPs.pl>
> <farh_hg38_diseases.txt>
> 
> 
> 
>> On 11 Mar 2016, at 14:00, Anja Thormann <anja at ebi.ac.uk> wrote:
>> 
>>> 
>>> When you say 100 MB, do you mean 100KB? As in 100,000 bases? 
>> 
>> Yes, you are correct, I meant 100KB. I'm very sorry for the confusion.
>> 
>> 
>>> I noticed that with the script I have used so far, the longest physical distance between a SNP and one of its LD variants, is 100K. Is this because with the slice, my script will only find SNPs in LD 100K away in either direction? So theoretically, there can be other LD variants further away from the SNP I am expanding?
>> 
>> Yes, there could be variants that are further than 100KB away from your given variant and still in LD with the same. You could increase the max SNP distance which would however also increase the computation time. I need to check the literature for information on the distance at which LD decreases (variants are less likely to be in LD).
>> 
>> 
>>> 
>>> I had mistakenly thought that the default for getting LD SNPs used the entire chromosome as the slice. Is this possible (especially given time considerations)? Or is there no point in doing it, as LD only occurs within 100K blocks anyway?
>> 
>> It is not possible to pass an entire chromosome for computing LD. I would recommend a slice of length 300 to 500KB. However, 500KB could be almost too large for regions with many variants.
>> 
>> 
>>> 
>>> Best,
>>> Johanne
>>> 
>>>> On 11 Mar 2016, at 12:50, Anja Thormann <anja at ebi.ac.uk> wrote:
>>>> 
>>>> Dear Johanne,
>>>> 
>>>> we have released ensembl 84 and you can now make use of our improvements for LD data computation. Just as a reminder, you would like to retrieve all variants that are in LD with a given variant. Depending on how many variants you have for which you want to compute LD data there are two ways:
>>>> 
>>>> - If your input list is small (< 1000) you can use fetch_by_VariationFeature($vf, $population). The default setting will create a slice (100MB + variant length + 100MB) expanding the input variant. The API computes LD values between the variant of interest and all variants surrounding the variant and not further away than 100MB. You could speed this up by decreasing the distance to 50MB:
>>>> 
>>>> my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
>>>> $ldfc_adaptor->max_snp_distance(50_000);
>>>> my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
>>>> 
>>>> - If your list of input variants is much longer I would recommend using fetch_by_Slice. You could create overlapping slice objects that cover your chromosomal region of interest and then the API computes all pairwise LD values for all variants in the region. You need to filter the results for your variants of interest. This requires a little bit more work because you have to choose the correct overlap size to make sure you allow for the same max distance for which you want to compute LD data for each of your input variants.
>>>> 
>>>> my $slice_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'slice');
>>>> my $chr = 6;
>>>> my $start = 32_102_292;
>>>> my $end = 32_402_292;
>>>> my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr, $start, $end);
>>>> my $ldfc = $ldfc_adaptor->fetch_by_Slice($slice, $ld_population);
>>>> 
>>>> The second approach should be faster because it can combine some of the computations that have to be done: if your input variants are in close proximity, you can compute all their LD values in one computation by providing a slice that covers all of them, instead of creating a region for each of them individually and then computing LD values.
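
The overlapping slices described above could be generated along these lines. This is a rough sketch rather than part of the Ensembl API: the function name overlapping_windows and the coordinates are made up, and it assumes the overlap is chosen to be at least the maximum distance for which LD should be computed, so that any two variants within that distance fall together in at least one window.

```perl
use strict;
use warnings;

# Split the region [$start, $end] into windows of $window bases, where
# consecutive windows share $overlap bases. The last window is truncated
# at $end.
sub overlapping_windows {
    my ($start, $end, $window, $overlap) = @_;
    die "window must be larger than overlap\n" unless $window > $overlap;
    my $step = $window - $overlap;
    my @windows;
    for (my $s = $start; ; $s += $step) {
        my $e = $s + $window - 1;
        $e = $end if $e > $end;
        push @windows, [$s, $e];
        last if $e >= $end;
    }
    return @windows;
}

# Hypothetical region: 500KB windows that overlap by 100KB
foreach my $w (overlapping_windows(32_000_000, 33_000_000, 500_000, 100_000)) {
    my ($s, $e) = @$w;
    print "$s-$e\n";
}
```

Each start/end pair would then be passed to fetch_by_region and fetch_by_Slice as in the snippet above, and the results filtered for the variants of interest.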
>>>> 
>>>> 
>>>> Independent of the two approaches you can further speed up your computation: When asking for all ld values pass 1 as an argument. This will prevent the method (get_all_ld_values) from creating variation feature objects and instead will only return the variant names and their LD values:
>>>> 
>>>> my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
>>>> my $ld_values = $ldfc->get_all_ld_values(1);
>>>> foreach my $ld_hash (@$ld_values) {
>>>>   my $r2 = $ld_hash->{r2};
>>>>   my $d_prime = $ld_hash->{d_prime};
>>>>   my $variation_name1 = $ld_hash->{variation_name1};
>>>>   my $variation_name2 = $ld_hash->{variation_name2};
>>>> 
>>>> }
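
With only names and LD values returned, filtering for one variant of interest becomes a plain hash lookup. The sketch below assumes the hash keys shown above (r2, variation_name1, variation_name2); the function name ld_partners and the example values are invented for illustration.

```perl
use strict;
use warnings;

# From a list of LD hashes, keep the pairs that involve $rsid and reach
# $min_r2. Returns a hash reference mapping partner name to r2.
sub ld_partners {
    my ($ld_values, $rsid, $min_r2) = @_;
    my %partners;
    foreach my $ld_hash (@$ld_values) {
        next if $ld_hash->{r2} < $min_r2;
        if ($ld_hash->{variation_name1} eq $rsid) {
            $partners{ $ld_hash->{variation_name2} } = $ld_hash->{r2};
        }
        elsif ($ld_hash->{variation_name2} eq $rsid) {
            $partners{ $ld_hash->{variation_name1} } = $ld_hash->{r2};
        }
    }
    return \%partners;
}

# Made-up values standing in for the result of get_all_ld_values(1)
my $ld_values = [
    { r2 => 0.9, d_prime => 1.0, variation_name1 => 'rsX', variation_name2 => 'rsA' },
    { r2 => 0.3, d_prime => 0.6, variation_name1 => 'rsB', variation_name2 => 'rsX' },
    { r2 => 1.0, d_prime => 1.0, variation_name1 => 'rsA', variation_name2 => 'rsB' },
];
my $partners = ld_partners($ld_values, 'rsX', 0.8);
print "$_\t$partners->{$_}\n" foreach sort keys %$partners;
```

This is mainly useful with fetch_by_Slice, where the result contains all pairwise values in the region and not only those for the input variants.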
>>>> 
>>>> Don't forget to switch to the new release/84 branches for all ensembl repos.
>>>> 
>>>> Best,
>>>> Anja
>>>> 
>>>> 
>>>> 
>>>>> On 7 Mar 2016, at 20:02, Anja Thormann <anja at ebi.ac.uk> wrote:
>>>>> 
>>>>> Dear Johanne,
>>>>> 
>>>>> I was able to reproduce the lost connection error using the following list of 31 variants (concatenating all your input files):
>>>>> rs3117098
>>>>> rs9268516
>>>>> rs3129890
>>>>> rs9272346
>>>>> rs9273349
>>>>> rs7775228
>>>>> rs9275698
>>>>> rs148203517
>>>>> rs3853601
>>>>> rs41268896
>>>>> rs12153855
>>>>> rs176095
>>>>> rs9469099
>>>>> rs4722404
>>>>> rs7000782
>>>>> rs7815944
>>>>> rs549182
>>>>> rs9268480
>>>>> rs9268853
>>>>> rs9268877
>>>>> rs9268923
>>>>> rs2395185
>>>>> rs9271366
>>>>> rs6927022
>>>>> rs1063355
>>>>> rs943072
>>>>> rs6911490
>>>>> rs2858829
>>>>> rs6920220
>>>>> rs798502
>>>>> rs11764116
>>>>> 
>>>>> The chromosome names with a different format are called alternate loci. They are stand-alone, accessioned sequences. Several human chromosomal regions exhibit sufficient variability to prevent adequate representation by a single sequence. To address this, the GRCh38 assembly provides alternate sequence for selected variant regions through the inclusion of alternate loci scaffolds. You can read more about this here:
>>>>> http://genomeref.blogspot.co.uk/2013/12/announcing-grch38.html
>>>>> 
>>>>> One example is variant rs3129890. It maps to the reference sequence and to 6 alternate loci:
>>>>> http://www.ensembl.org/Homo_sapiens/Variation/Explore?db=core;r=CHR_HSCHR6_8_CTG1:32392790-32393790;v=rs3129890;vdb=variation;vf=146716554
>>>>> 
>>>>> The error seems to be caused by variants mapping to alternate loci. Until I have figured out what is going on, it would be good if you could exclude them from your computations and use only the mappings to the reference sequence:
>>>>> 
>>>>> foreach my $vf (@var_features) {
>>>>>   next unless ($vf->slice->is_reference);
>>>>>   ...
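
The same check can be wrapped in a small helper so that the main loop only ever sees reference mappings. This is a sketch: reference_only is a made-up name, and it relies only on the slice and is_reference calls used in the snippet above.

```perl
use strict;
use warnings;

# Keep only the variation features that map to the reference sequence,
# dropping mappings to alternate loci.
sub reference_only {
    my @var_features = @_;
    return grep { $_->slice->is_reference } @var_features;
}

# Intended use inside the existing loop (needs the Ensembl API):
#   foreach my $vf (reference_only(@var_features)) { ... }
```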
>>>>> 
>>>>> 
>>>>> I hope that helps for now. I will get back to you as soon as I have more information. I will also let you know how you can speed up your computation as soon as we release Ensembl 84.
>>>>> 
>>>>> Best,
>>>>> Anja
>>>>> 
>>>>> 
>>>>>> On 6 Mar 2016, at 12:32, Johanne Håøy Horn <johannhh at ifi.uio.no> wrote:
>>>>>> 
>>>>>> Dear ensembl team,
>>>>>> 
>>>>>> I have previously received quite some help from you with the variation API. I wish to expand variations to include the variants they are in LD with. My script works for the most part, but I am unable to run it on some of my rsID files. I have broken the files into increasingly smaller parts, but the runs still fail. Could you help me find out what I am doing wrong?
>>>>>> 
>>>>>> Part of the issue seems to be connected to the chromosome the rsID is on. For many of the failing rsID files, this is what the script produces (example taken from out_0_Asthma3.txt, full output file attached to this email):
>>>>>> 6 rs3129969
>>>>>> 32414832 r2=0.538176
>>>>>> 6 rs3763312
>>>>>> 32408571 r2=0.074106
>>>>>> 6 rs553071835
>>>>>> 32487947 r2=0.058328
>>>>>> CHR_HSCHR6_MHC_DBB_CTG1
>>>>>> rs3129890 32421214
>>>>>> CHR_HSCHR6_MHC_MANN_CTG1
>>>>>> rs3129890 32485127
>>>>>> CHR_HSCHR6_MHC_MCF_CTG1
>>>>>> rs3129890 32521873
>>>>>> CHR_HSCHR6_MHC_QBL_CTG1
>>>>>> rs3129890 32403957
>>>>>> CHR_HSCHR6_MHC_SSTO_CTG1
>>>>>> rs3129890 32453108
>>>>>> CHR_HSCHR6_8_CTG1
>>>>>> rs3129890 32393290
>>>>>> 6 rs9272346
>>>>>> 32636595
>>>>>> 
>>>>>> Instead of getting a line of <chromosome> <rsID> <position> <r2> for the variation, the same SNP is repeated multiple times with a chromosome/region in a different format from what I get in the runs that do not fail.
>>>>>> 
>>>>>> I have attached some of the GWAS disease files of rsIDs I wish to expand but which seem to produce errors, and some of the error printouts I have gotten are pasted below. Two output files belonging to the erroneous input files, named out_0_<inputfilename>, are also attached.
>>>>>> 
>>>>>> Is it something related to the rsIDs I am using? Are you able to run the script with these input files?
>>>>>> I noticed that longer files of rsIDs were more prone to fail, as the connection to the database was broken. Therefore, I divided the files into multiple small files. For most of the GWAS data this worked fine, and I get output files in the format I want. But I am now left with 10-15 files of 10-20 rsIDs which fail with various error messages. I have tried calling my script with them multiple times, and as they fail consistently, I do not think my internet connection is the problem.
>>>>>> 
>>>>>> _START TERMINAL PRINT ERROR 1_
>>>>>> inputfile Asthma3.txt
>>>>>> outputfile out_0_Asthma3.txt
>>>>>> DBD::mysql::st execute failed: Lost connection to MySQL server during query at /Users/Johanne/src/ensembl/modules//Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482, <OUT> line 3739.
>>>>>> 
>>>>>> -------------------- EXCEPTION --------------------
>>>>>> MSG: Detected an error whilst executing SQL 'SELECT  v.variation_id, v.name AS v_name, v.class_attrib_id AS v_class_attrib_id, v.source_id AS v_source_id, v.somatic AS v_somatic, v.flipped AS v_flipped, v.ancestral_allele AS v_ancestral_allele, vs.moltype AS vs_moltype, vs.name AS vs_name, s2.name AS vs_source_name, v.minor_allele, v.minor_allele_freq, v.minor_allele_count, v.clinical_significance, v.evidence_attribs
>>>>>> FROM (( (variation v, source s1)
>>>>>>   LEFT JOIN variation_synonym vs ON v.variation_id = vs.variation_id )
>>>>>>   LEFT JOIN source s2 ON vs.source_id = s2.source_id )
>>>>>>  WHERE v.name = ?  AND
>>>>>> 
>>>>>>         s1.source_id = v.source_id
>>>>>>      AND v.display = 1
>>>>>> ': DBD::mysql::st execute failed: Lost connection to MySQL server during query at /Users/Johanne/src/ensembl/modules//Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482, <OUT> line 3739.
>>>>>> 
>>>>>> STACK Bio::EnsEMBL::DBSQL::BaseAdaptor::generic_fetch /Users/Johanne/src/ensembl/modules//Bio/EnsEMBL/DBSQL/BaseAdaptor.pm:483
>>>>>> STACK Bio::EnsEMBL::Variation::DBSQL::VariationAdaptor::fetch_by_name /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/VariationAdaptor.pm:497
>>>>>> STACK Bio::EnsEMBL::Variation::DBSQL::LDFeatureContainerAdaptor::_ld_calc /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm:865
>>>>>> STACK Bio::EnsEMBL::Variation::DBSQL::LDFeatureContainerAdaptor::fetch_by_Slice /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm:177
>>>>>> STACK Bio::EnsEMBL::Variation::DBSQL::LDFeatureContainerAdaptor::fetch_by_VariationFeature /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm:246
>>>>>> STACK toplevel expandSNPs.pl:42
>>>>>> Date (localtime)    = Sat Mar  5 18:03:25 2016
>>>>>> Ensembl API version = 83
>>>>>> ---------------------------------------------------
>>>>>> rs3117098
>>>>>> rs3117098
>>>>>> rs3117098
>>>>>> rs3117098
>>>>>> rs3117098
>>>>>> rs9268516
>>>>>> rs9268516
>>>>>> rs9268516
>>>>>> rs9268516
>>>>>> rs9268516
>>>>>> rs9268516
>>>>>> rs9268516
>>>>>> rs9268516
>>>>>> rs9268516
>>>>>> rs3129890
>>>>>> rs3129890
>>>>>> rs3129890
>>>>>> rs3129890
>>>>>> rs3129890
>>>>>> rs3129890
>>>>>> rs3129890
>>>>>> rs9272346
>>>>>> _END TERMINAL PRINT ERROR 1_
>>>>>> 
>>>>>> _START TERMINAL PRINT ERROR 2_
>>>>>> inputfile Atopicdermatitis3.txt
>>>>>> outputfile out_0_Atopicdermatitis3.txt
>>>>>> connect: Network is unreachable
>>>>>> rs1050654
>>>>>> _END TERMINAL PRINT ERROR 2_
>>>>>> 
>>>>>> _START TERMINAL PRINT ERROR 3_
>>>>>> inputfile textfiles/Atopicdermatitis3.txt
>>>>>> outputfile textfiles/out_0_Atopicdermatitis3.txt
>>>>>> Can't call method "get_all_VariationFeatures" on an undefined value at expandSNPs.pl line 33, <IN> line 3.
>>>>>> rs1050654
>>>>>> rs3853601
>>>>>> rs3853601
>>>>>> rs3853601
>>>>>> rs3853601
>>>>>> rs3853601
>>>>>> rs3853601
>>>>>> rs3853601
>>>>>> rs3853601
>>>>>> _END TERMINAL PRINT ERROR 3_
>>>>>> 
>>>>>> _START SCRIPT_
>>>>>> use strict;
>>>>>> use warnings;
>>>>>> use Bio::EnsEMBL::Registry;
>>>>>> 
>>>>>> my $inputfile = $ARGV[0];
>>>>>> my $outputfile = $ARGV[1];
>>>>>> my $r_limit = $ARGV[2];
>>>>>> my $start_run = time();
>>>>>> 
>>>>>> open (IN, "<$inputfile") or die "Cannot open $inputfile: $!";
>>>>>> open (OUT, ">$outputfile") or die "Cannot open $outputfile: $!";
>>>>>> 
>>>>>> my $registry = 'Bio::EnsEMBL::Registry';
>>>>>> 
>>>>>> $registry->load_registry_from_db(
>>>>>> -host => 'ensembldb.ensembl.org',
>>>>>> -user => 'anonymous'
>>>>>> );
>>>>>> 
>>>>>> # Connect to the databases:
>>>>>> my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
>>>>>> my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
>>>>>> my $pop_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
>>>>>> $variation_adaptor->db->use_vcf(1);
>>>>>> 
>>>>>> my $ld_population = $pop_adaptor->fetch_by_name('1000GENOMES:phase_3:CEU');
>>>>>> 
>>>>>> # Loop through all SNPs available and find SNPs in LD
>>>>>> while(<IN>) {
>>>>>>   chomp;
>>>>>>   my $variation_name = $_;
>>>>>>   my $variation = $variation_adaptor->fetch_by_name($variation_name);
>>>>>>   my @var_features;
>>>>>> 
>>>>>>   if ($variation) {
>>>>>>     @var_features = @{ $variation->get_all_VariationFeatures() };
>>>>>>   } else {
>>>>>>     print 'failing variation name: ', $variation_name, "\n";
>>>>>>     next;
>>>>>>   }
>>>>>> 
>>>>>>   foreach my $vf (@var_features) {
>>>>>>     my $rsid = $vf->name;
>>>>>>     print $rsid, "\n";
>>>>>>     my $start = $vf->start;
>>>>>>     my $region = $vf->seq_region_name;
>>>>>>     print OUT "$region\t$rsid\t$start\n";
>>>>>>     my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
>>>>>>     my @ld_values = @{ $ldfc->get_all_ld_values() };
>>>>>> 
>>>>>>     foreach my $ld_hash (@ld_values) {
>>>>>>       my $r2 = $ld_hash->{r2};
>>>>>>       if ($r2 >= $r_limit) {
>>>>>>         my $variation_name1 = $ld_hash->{variation1}->variation_name;
>>>>>>         my $variation_name2 = $ld_hash->{variation2}->variation_name;
>>>>>>         my $pos1 = $ld_hash->{variation1}->seq_region_name;
>>>>>>         my $pos2 = $ld_hash->{variation2}->seq_region_name;
>>>>>>         my $start1 = $ld_hash->{variation1}->start;
>>>>>>         my $start2 = $ld_hash->{variation2}->start;
>>>>>> 
>>>>>>         if ($variation_name1 eq $rsid) {
>>>>>>           print OUT "$pos2\t$variation_name2\t$start2\tr2=$r2\n";
>>>>>>         } else {
>>>>>>           print OUT "$pos1\t$variation_name1\t$start1\tr2=$r2\n";
>>>>>>         }
>>>>>>       }
>>>>>>     }
>>>>>>   }
>>>>>> }
>>>>>> 
>>>>>> print "done\n";
>>>>>> close OUT;
>>>>>> close IN;
>>>>>> my $end_run = time();
>>>>>> my $run_time = $end_run - $start_run;
>>>>>> print "Job took $run_time seconds\n";
>>>>>> _END SCRIPT_
>>>>>> <Asthma3.txt>
>>>>>> <out_0_Asthma3.txt>
>>>>>> <Atopicdermatitis3.txt>
>>>>>> <out_0_Atopicdermatitis3.txt>
>>>>>> <Ulcerativecolitis3.txt>
>>>>>> <Ulcerativecolitis5.txt>
>>>>>> _______________________________________________
>>>>>> Dev mailing list    Dev at ensembl.org
>>>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>>>> Ensembl Blog: http://www.ensembl.info/