[ensembl-dev] RsIDs consistently fail with variation API, v83

Fri Mar 11 11:50:46 GMT 2016

Dear Johanne,

We have released Ensembl 84, and you can now make use of our improvements to LD data computation. As a reminder, you would like to retrieve all variants that are in LD with a given variant. Depending on how many input variants you want to compute LD data for, there are two ways to do this:

- If your input list is small (< 1000 variants) you can use fetch_by_VariationFeature($vf, $population). By default this creates a slice that extends the input variant by 100 kb on each side (100 kb + variant length + 100 kb). The API computes LD values between the variant of interest and all surrounding variants that are no more than 100 kb away. You can speed this up by decreasing the distance, for example to 50 kb (max_snp_distance takes the distance in base pairs):

my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
$ldfc_adaptor->max_snp_distance(50_000);
my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);

- If your list of input variants is much longer, I would recommend using fetch_by_Slice. You can create overlapping slice objects that cover your chromosomal region of interest; the API then computes all pairwise LD values for the variants in each slice, and you filter the results for your variants of interest. This requires a little more work, because you have to choose the overlap size so that each of your input variants is covered up to the maximum distance for which you want LD data.

my $slice_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'slice');
my $chr = 6;
my $start = 32_102_292;
my $end = 32_402_292;
my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr, $start, $end);
my $ldfc = $ldfc_adaptor->fetch_by_Slice($slice, $ld_population);
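As a rough sketch of the overlap arithmetic (plain Perl, not part of the API; the helper overlapping_windows and the 250 kb window size are made up for illustration): if you want LD data up to a distance d from each input variant, padding the region by d on both ends and letting consecutive slices overlap by 2 * d ensures that every variant falls in at least one slice that contains its full neighbourhood of d on both sides.

```perl
use strict;
use warnings;

# Hypothetical helper: tile [$start, $end] with windows of $window bp that
# overlap by 2 * $max_dist, after padding the region by $max_dist per side.
sub overlapping_windows {
    my ($start, $end, $window, $max_dist) = @_;
    my $overlap = 2 * $max_dist;
    die "window must be larger than the overlap\n" if $window <= $overlap;

    my $step = $window - $overlap;
    my $from = $start - $max_dist;
    $from = 1 if $from < 1;         # slices cannot start before base 1
    my $to = $end + $max_dist;      # not clipped to the chromosome length here

    my @windows;
    for (my $s = $from; ; $s += $step) {
        my $e = $s + $window - 1;
        $e = $to if $e > $to;       # shorten the last window
        push @windows, [$s, $e];
        last if $e >= $to;
    }
    return @windows;
}

# Region from the example above, 250 kb windows, 50 kb maximum distance.
my @windows = overlapping_windows(32_102_292, 32_402_292, 250_000, 50_000);
printf "%d-%d\n", @$_ for @windows;
```

Each start/end pair could then be passed to fetch_by_region and fetch_by_Slice as in the example above. Pairs of variants that sit in an overlap will be reported by two slices, so the results need de-duplicating.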

The second approach should be faster because it can combine some of the computations that have to be done: if your input variants are in close proximity, you can compute all their LD values in one computation by providing a slice that covers all of them, instead of creating a region for each variant individually and computing LD values for each.
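One way to group nearby input variants into shared regions could look like the sketch below. It is plain Perl and does not call the API; regions_for_positions is a hypothetical helper, the 50 kb distance is only an example, and the positions are sample numbers taken from the chromosome 6 lines of the output quoted further down. A new region is started whenever the gap to the previous variant is more than twice the maximum distance, because the padded neighbourhoods would no longer touch.

```perl
use strict;
use warnings;

# Hypothetical helper: merge positions on one chromosome into regions.
# A new region starts when the gap to the previous variant exceeds
# 2 * $max_dist; every region is padded by $max_dist on both sides.
sub regions_for_positions {
    my ($max_dist, @positions) = @_;
    my @sorted = sort { $a <=> $b } @positions;
    return () unless @sorted;

    my @regions;
    my ($first, $last) = ($sorted[0], $sorted[0]);
    for my $pos (@sorted[1 .. $#sorted]) {
        if ($pos - $last > 2 * $max_dist) {
            push @regions, [$first - $max_dist, $last + $max_dist];
            $first = $pos;
        }
        $last = $pos;
    }
    push @regions, [$first - $max_dist, $last + $max_dist];

    # Slices cannot start before base 1.
    $_->[0] = 1 for grep { $_->[0] < 1 } @regions;
    return @regions;
}

my @regions = regions_for_positions(50_000,
    32_414_832, 32_408_571, 32_487_947, 32_636_595);
printf "%d-%d\n", @$_ for @regions;
```

Each resulting region would then need one fetch_by_Slice call, followed by filtering for the input variants.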

Independent of which approach you use, you can further speed up your computation: when asking for all LD values, pass 1 as an argument. This prevents the method (get_all_ld_values) from creating variation feature objects; it returns only the variant names and their LD values:

my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
my $ld_values = $ldfc->get_all_ld_values(1);
foreach my $ld_hash (@$ld_values) {
  my $r2 = $ld_hash->{r2};
  my $d_prime = $ld_hash->{d_prime};
  my $variation_name1 = $ld_hash->{variation_name1};
  my $variation_name2 = $ld_hash->{variation_name2};

}
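When you use the slice approach, the filtering step could be sketched as follows. This is plain Perl working on hashes shaped like the ones in the loop above; filter_ld_values is a hypothetical helper, and the variant names and r2 values in the mock data are made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: keep LD pairs that involve a variant of interest and
# reach the r2 threshold. Returns [query, partner, r2] triples.
sub filter_ld_values {
    my ($ld_values, $wanted, $r2_min) = @_;
    my @hits;
    for my $ld (@$ld_values) {
        next if $ld->{r2} < $r2_min;
        my ($name1, $name2) = @{$ld}{qw(variation_name1 variation_name2)};
        push @hits, [$name1, $name2, $ld->{r2}] if $wanted->{$name1};
        push @hits, [$name2, $name1, $ld->{r2}] if $wanted->{$name2};
    }
    return @hits;
}

# Mock data (made up), standing in for the result of get_all_ld_values(1).
my $ld_values = [
    { variation_name1 => 'rs_query', variation_name2 => 'rs_near',  r2 => 0.9, d_prime => 1.0 },
    { variation_name1 => 'rs_other', variation_name2 => 'rs_query', r2 => 0.6, d_prime => 0.8 },
    { variation_name1 => 'rs_other', variation_name2 => 'rs_near',  r2 => 0.7, d_prime => 0.9 },
    { variation_name1 => 'rs_query', variation_name2 => 'rs_far',   r2 => 0.1, d_prime => 0.3 },
];
my %wanted = map { $_ => 1 } qw(rs_query);

for my $hit (filter_ld_values($ld_values, \%wanted, 0.5)) {
    print join("\t", @$hit), "\n";
}
```

If both variants of a pair are in your input list, the pair is reported once for each of them.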

Don't forget to switch to the new release/84 branches for all ensembl repos.

Best,
Anja

> On 7 Mar 2016, at 20:02, Anja Thormann <anja at ebi.ac.uk> wrote:
> 
> Dear Johanne,
> 
> I was able to reproduce the lost connection error using the following list of 31 variants (concatenating all your input files):
> rs3117098
> rs9268516
> rs3129890
> rs9272346
> rs9273349
> rs7775228
> rs9275698
> rs148203517
> rs3853601
> rs41268896
> rs12153855
> rs176095
> rs9469099
> rs4722404
> rs7000782
> rs7815944
> rs549182
> rs9268480
> rs9268853
> rs9268877
> rs9268923
> rs2395185
> rs9271366
> rs6927022
> rs1063355
> rs943072
> rs6911490
> rs2858829
> rs6920220
> rs798502
> rs11764116
> 
> The chromosome names with a different format are called alternate loci. They are stand-alone, accessioned sequences. Several human chromosomal regions exhibit sufficient variability to prevent adequate representation by a single sequence. To address this, the GRCh38 assembly provides alternate sequence for selected variant regions through the inclusion of alternate loci scaffolds. You can read more about this here:
> http://genomeref.blogspot.co.uk/2013/12/announcing-grch38.html
> 
> One example is variant rs3129890. It maps to the reference sequence and to 6 alternate loci:
> http://www.ensembl.org/Homo_sapiens/Variation/Explore?db=core;r=CHR_HSCHR6_8_CTG1:32392790-32393790;v=rs3129890;vdb=variation;vf=146716554
> 
> The error seems to be caused by variants mapping to alternate loci. Until I have figured out what is going on, it would be good if you could exclude them from your computations and use only the mappings to the reference sequence:
> 
> foreach my $vf (@var_features) {
>   next unless ($vf->slice->is_reference);
>   ...
> 
> 
> I hope that helps for now. I will get back to you as soon as I have more information. I will also let you know how you can speed up your computation as soon as we release Ensembl 84.
> 
> Best,
> Anja
> 
> 
>> On 6 Mar 2016, at 12:32, Johanne Håøy Horn <johannhh at ifi.uio.no> wrote:
>> 
>> Dear ensembl team,
>> 
>> I have gotten quite some help from you previously with the variation API. I wish to expand variations to include the variants they are in LD with. My script works for the most part, but I am unable to run it on some of my rsID files. I have broken the files into increasingly smaller parts, but the runs still fail. Could you help me with what I am doing wrong?
>> 
>> Some of the issue seems to be connected to the chromosome the rsID is on. In many of the failing files of rsIDs, this is what the script produces (example taken from out_0_Asthma3.txt, full output file attached to this email):
>> 6 rs3129969 32414832 r2=0.538176
>> 6 rs3763312 32408571 r2=0.074106
>> 6 rs553071835 32487947 r2=0.058328
>> CHR_HSCHR6_MHC_DBB_CTG1 rs3129890 32421214
>> CHR_HSCHR6_MHC_MANN_CTG1 rs3129890 32485127
>> CHR_HSCHR6_MHC_MCF_CTG1 rs3129890 32521873
>> CHR_HSCHR6_MHC_QBL_CTG1 rs3129890 32403957
>> CHR_HSCHR6_MHC_SSTO_CTG1 rs3129890 32453108
>> CHR_HSCHR6_8_CTG1 rs3129890 32393290
>> 6 rs9272346 32636595
>> 
>> Instead of getting a line of <chromosome> <rsID> <position> <r2> for the variation, the same SNP is repeated multiple times with a chromosome/region in a different format from what I get in the runs that do not fail.
>> 
>> I have attached some of the GWAS disease files of rsIDs that I wish to expand but which seem to produce errors, and some error printouts I have gotten are pasted below. Two output files belonging to the erroneous input files, with out_0_<inputfilename> as filename, are also attached.
>> 
>> Is it something related to the rsIDs I am using? Are you able to run the script with these input files?
>> I noticed that longer files of rsIDs were more prone to fail, as the connection to the database was broken. Therefore, I have divided the files into multiple small files. For most of the GWAS data this worked fine, and I get output files in the format I want. But I am now left with 10-15 files of 10-20 rsIDs which fail with various error messages. I have tried calling my script with them multiple times, and as they fail consistently, I do not think my internet connection is the problem.
>> 
>> _START TERMINAL PRINT ERROR 1_
>> inputfile Asthma3.txt
>> outputfile out_0_Asthma3.txt
>> DBD::mysql::st execute failed: Lost connection to MySQL server during query at /Users/Johanne/src/ensembl/modules//Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482, <OUT> line 3739.
>> 
>> -------------------- EXCEPTION --------------------
>> MSG: Detected an error whilst executing SQL 'SELECT  v.variation_id, v.name AS v_name, v.class_attrib_id AS v_class_attrib_id, v.source_id AS v_source_id, v.somatic AS v_somatic, v.flipped AS v_flipped, v.ancestral_allele AS v_ancestral_allele, vs.moltype AS vs_moltype, vs.name AS vs_name, s2.name AS vs_source_name, v.minor_allele, v.minor_allele_freq, v.minor_allele_count, v.clinical_significance, v.evidence_attribs
>> FROM (( (variation v, source s1)
>>   LEFT JOIN variation_synonym vs ON v.variation_id = vs.variation_id )
>>   LEFT JOIN source s2 ON vs.source_id = s2.source_id )
>>  WHERE v.name = ?  AND
>> 
>>         s1.source_id = v.source_id
>>      AND v.display = 1
>> ': DBD::mysql::st execute failed: Lost connection to MySQL server during query at /Users/Johanne/src/ensembl/modules//Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482, <OUT> line 3739.
>> 
>> STACK Bio::EnsEMBL::DBSQL::BaseAdaptor::generic_fetch /Users/Johanne/src/ensembl/modules//Bio/EnsEMBL/DBSQL/BaseAdaptor.pm:483
>> STACK Bio::EnsEMBL::Variation::DBSQL::VariationAdaptor::fetch_by_name /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/VariationAdaptor.pm:497
>> STACK Bio::EnsEMBL::Variation::DBSQL::LDFeatureContainerAdaptor::_ld_calc /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm:865
>> STACK Bio::EnsEMBL::Variation::DBSQL::LDFeatureContainerAdaptor::fetch_by_Slice /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm:177
>> STACK Bio::EnsEMBL::Variation::DBSQL::LDFeatureContainerAdaptor::fetch_by_VariationFeature /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm:246
>> STACK toplevel expandSNPs.pl:42
>> Date (localtime)    = Sat Mar  5 18:03:25 2016
>> Ensembl API version = 83
>> ---------------------------------------------------
>> rs3117098
>> rs3117098
>> rs3117098
>> rs3117098
>> rs3117098
>> rs9268516
>> rs9268516
>> rs9268516
>> rs9268516
>> rs9268516
>> rs9268516
>> rs9268516
>> rs9268516
>> rs9268516
>> rs3129890
>> rs3129890
>> rs3129890
>> rs3129890
>> rs3129890
>> rs3129890
>> rs3129890
>> rs9272346
>> _END TERMINAL PRINT ERROR 1_
>> 
>> _START TERMINAL PRINT ERROR 2_
>> inputfile Atopicdermatitis3.txt
>> outputfile out_0_Atopicdermatitis3.txt
>> connect: Network is unreachable
>> rs1050654
>> _END TERMINAL PRINT ERROR 2_
>> 
>> _START TERMINAL PRINT ERROR 3_
>> inputfile textfiles/Atopicdermatitis3.txt
>> outputfile textfiles/out_0_Atopicdermatitis3.txt
>> Can't call method "get_all_VariationFeatures" on an undefined value at expandSNPs.pl line 33, <IN> line 3.
>> rs1050654
>> rs3853601
>> rs3853601
>> rs3853601
>> rs3853601
>> rs3853601
>> rs3853601
>> rs3853601
>> rs3853601
>> _END TERMINAL PRINT ERROR 3_
>> 
>> _START SCRIPT_
>> use strict;
>> use warnings;
>> use Bio::EnsEMBL::Registry;
>> 
>> my $inputfile = $ARGV[0];
>> my $outputfile = $ARGV[1];
>> my $r_limit = $ARGV[2];
>> my $start_run = time();
>> 
>> open (IN, "<$inputfile");
>> open (OUT, ">$outputfile");
>> 
>> my $registry = 'Bio::EnsEMBL::Registry';
>> 
>> $registry->load_registry_from_db(
>> -host => 'ensembldb.ensembl.org',
>> -user => 'anonymous'
>> );
>> 
>> # Connect to the databases:
>> my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
>> my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
>> my $pop_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
>> $variation_adaptor->db->use_vcf(1);
>> 
>> my $ld_population = $pop_adaptor->fetch_by_name('1000GENOMES:phase_3:CEU');
>> 
>> # Loop through all SNPs available and find SNPs in LD
>> while(<IN>) {
>>   chomp;
>>   my $variation_name = $_;
>>   my $variation = $variation_adaptor->fetch_by_name($variation_name);
>>   my @var_features;
>> 
>>   if ($variation) {
>>     @var_features = @{ $variation->get_all_VariationFeatures() };
>>   } else {
>>     print 'failing variation name: ', $variation_name, "\n";
>>     next;
>>   }
>> 
>>   foreach my $vf (@var_features) {
>>     my $rsid = $vf->name;
>>     print $rsid, "\n";
>>     my $start = $vf->start;
>>     my $region = $vf->seq_region_name;
>>     print OUT "$region\t$rsid\t$start\n";
>>     my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
>>     my @ld_values = @{ $ldfc->get_all_ld_values() };
>> 
>>     foreach my $ld_hash (@ld_values) {
>>       my $r2 = $ld_hash->{r2};
>> 
>>       if ($r2 >= $r_limit) {
>>         my $variation_name1 = $ld_hash->{variation1}->variation_name;
>>         my $variation_name2 = $ld_hash->{variation2}->variation_name;
>>         my $pos1 = $ld_hash->{variation1}->seq_region_name;
>>         my $pos2 = $ld_hash->{variation2}->seq_region_name;
>>         my $start1 = $ld_hash->{variation1}->start;
>>         my $start2 = $ld_hash->{variation2}->start;
>> 
>>         if ($variation_name1 eq $rsid) {
>>           print OUT "$pos2\t$variation_name2\t$start2\tr2=$r2\n";
>>         } else {
>>           print OUT "$pos1\t$variation_name1\t$start1\tr2=$r2\n";
>>         }
>>       }
>>     }
>>   }
>> }
>> 
>> print "done\n";
>> close OUT;
>> close IN;
>> my $end_run = time();
>> my $run_time = $end_run - $start_run;
>> print "Job took $run_time seconds\n";
>> _END SCRIPT_
>> <Asthma3.txt>
>> <out_0_Asthma3.txt>
>> <Atopicdermatitis3.txt>
>> <out_0_Atopicdermatitis3.txt>
>> <Ulcerativecolitis3.txt>
>> <Ulcerativecolitis5.txt>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
