[ensembl-dev] RsIDs consistently fail with variation API, v83

Anja Thormann anja at ebi.ac.uk
Mon Mar 7 20:02:17 GMT 2016


Dear Johanne,

I was able to reproduce the lost connection error using the following list of 31 variants (concatenating all your input files):
rs3117098
rs9268516
rs3129890
rs9272346
rs9273349
rs7775228
rs9275698
rs148203517
rs3853601
rs41268896
rs12153855
rs176095
rs9469099
rs4722404
rs7000782
rs7815944
rs549182
rs9268480
rs9268853
rs9268877
rs9268923
rs2395185
rs9271366
rs6927022
rs1063355
rs943072
rs6911490
rs2858829
rs6920220
rs798502
rs11764116

The chromosome names with a different format are called alternate loci. They are stand-alone, accessioned sequences. Several human chromosomal regions exhibit sufficient variability to prevent adequate representation by a single sequence. To address this, the GRCh38 assembly provides alternate sequence for selected variant regions through the inclusion of alternate loci scaffolds. You can read more about this here:
http://genomeref.blogspot.co.uk/2013/12/announcing-grch38.html

One example is variant s3129890. It maps to the reference sequence and to 6 alternate loci:
http://www.ensembl.org/Homo_sapiens/Variation/Explore?db=core;r=CHR_HSCHR6_8_CTG1:32392790-32393790;v=rs3129890;vdb=variation;vf=146716554

The error seems to be caused by variants mapping alternate loci. Until I have figured out  what is going on it would be good if you could exclude them from your computations and just use the mappings to the reference sequence:

foreach my $vf (@var_features) {
  next unless ($vf->slice->is_reference);
  ...


I hope that helps for now. I will get back to you as soon as I have more information. I will also let you know how you can speed up your computation as soon as we release Ensembl 84.

Best,
Anja


> On 6 Mar 2016, at 12:32, Johanne Håøy Horn <johannhh at ifi.uio.no> wrote:
> 
> Dear ensembl team,
> 
> I have gotten quite some help from you previously with the variation api. I wish to expand variations to include the variants they are in LD with. My script works for the most part, but I am unable to run it on some of my rsID files. I have broken the files into increasingly smaller parts, but the runs still fails. Could you help me with what I am doing wrong?
> 
> Some of the issue seems to be connected to the chromosome the rsID is on. In many of the failing files of rsIDs, this is what the script produce (example taken from out_0_Asthma3.txt, full output file attached to this email): 
> 6 rs3129969
> 32414832 r2=0.538176
> 6 rs3763312
> 32408571 r2=0.074106
> 6 rs553071835
> 32487947 r2=0.058328
> CHR_HSCHR6_MHC_DBB_CTG1
> rs3129890 32421214
> CHR_HSCHR6_MHC_MANN_CTG1
> rs3129890 32485127
> CHR_HSCHR6_MHC_MCF_CTG1
> rs3129890 32521873
> CHR_HSCHR6_MHC_QBL_CTG1
> rs3129890 32403957
> CHR_HSCHR6_MHC_SSTO_CTG1
> rs3129890 32453108
> CHR_HSCHR6_8_CTG1
> rs3129890 32393290
> 6 rs9272346
> 32636595
> 
> In stead of getting a line of <chromosome> <rsID> <position> <r2> for the variation, the same SNP is repeated multiple times with a chromosome/region of a different format than what I get in the scripts that do not fail.
> 
> I have attached some of the GWAS disease files of rsIDs I wish to expand, but which seem to produce errors, and some error printouts I have gotten is pasted below. Two output files belonging to the erroneous input files, with out_0_<inputfilename> as filename is also attached. 
> 
> Is it something related to the rsIDs I am using? Are you able to run the script with these input files?
> I noticed that longer files of rsIDs were more prone to fail, as the connection to the database was broken. Therefore, I have divided the files into multiple small files. For most of the GWAS data, this worked fine, and I get output files on the format I want. But I am now left with 10-15 files of 10-20 rsIDs which fail with various error messages. I have tried calling my script with them multiple times, and as they fail consistently, I do not think it is my internet connection that is the problem. 
> 
> _START TERMINAL PRINT ERROR 1_
> inputfile Asthma3.txt
> outputfile out_0_Asthma3.txt
> DBD::mysql::st execute failed: Lost connection to MySQL server during query at /Users/Johanne/src/ensembl/modules//Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482, <OUT> line 3739.
> 
> -------------------- EXCEPTION --------------------
> MSG: Detected an error whilst executing SQL 'SELECT  v.variation_id, v.name AS v_name, v.class_attrib_id AS v_class_attrib_id, v.source_id AS v_source_id, v.somatic AS v_somatic, v.flipped AS v_flipped, v.ancestral_allele AS v_ancestral_allele, vs.moltype AS vs_moltype, vs.name AS vs_name, s2.name AS vs_source_name, v.minor_allele, v.minor_allele_freq, v.minor_allele_count, v.clinical_significance, v.evidence_attribs
> FROM (( (variation v, source s1)
>   LEFT JOIN variation_synonym vs ON v.variation_id = vs.variation_id )
>   LEFT JOIN source s2 ON vs.source_id = s2.source_id )
>  WHERE v.name = ?  AND
> 
>         s1.source_id = v.source_id
>      AND v.display = 1
> ': DBD::mysql::st execute failed: Lost connection to MySQL server during query at /Users/Johanne/src/ensembl/modules//Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482, <OUT> line 3739.
> 
> STACK Bio::EnsEMBL::DBSQL::BaseAdaptor::generic_fetch /Users/Johanne/src/ensembl/modules//Bio/EnsEMBL/DBSQL/BaseAdaptor.pm:483
> STACK Bio::EnsEMBL::Variation::DBSQL::VariationAdaptor::fetch_by_name /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/VariationAdaptor.pm:497
> STACK Bio::EnsEMBL::Variation::DBSQL::LDFeatureContainerAdaptor::_ld_calc /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm:865
> STACK Bio::EnsEMBL::Variation::DBSQL::LDFeatureContainerAdaptor::fetch_by_Slice /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm:177
> STACK Bio::EnsEMBL::Variation::DBSQL::LDFeatureContainerAdaptor::fetch_by_VariationFeature /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm:246
> STACK toplevel expandSNPs.pl:42
> Date (localtime)    = Sat Mar  5 18:03:25 2016
> Ensembl API version = 83
> ---------------------------------------------------
> rs3117098
> rs3117098
> rs3117098
> rs3117098
> rs3117098
> rs9268516
> rs9268516
> rs9268516
> rs9268516
> rs9268516
> rs9268516
> rs9268516
> rs9268516
> rs9268516
> rs3129890
> rs3129890
> rs3129890
> rs3129890
> rs3129890
> rs3129890
> rs3129890
> rs9272346
> _END TERMINAL PRINT ERROR 1_
> 
> _START TERMINAL PRINT ERROR 2_
> inputfile Atopicdermatitis3.txt
> outputfile out_0_Atopicdermatitis3.txt
> connect: Network is unreachable
> rs1050654
> _END TERMINAL PRINT ERROR 2_
> 
> _START TERMINAL PRINT ERROR 3_
> inputfile textfiles/Atopicdermatitis3.txt
> outputfile textfiles/out_0_Atopicdermatitis3.txt
> Can't call method "get_all_VariationFeatures" on an undefined value at expandSNPs.pl line 33, <IN> line 3.
> rs1050654
> rs3853601
> rs3853601
> rs3853601
> rs3853601
> rs3853601
> rs3853601
> rs3853601
> rs3853601
> _END TERMINAL PRINT ERROR 3_
> 
> _START SCRIPT_
> use strict;
> use warnings;
> use Bio::EnsEMBL::Registry;
> 
> my $inputfile = $ARGV[0];
> my $outputfile = $ARGV[1];
> my $r_limit = $ARGV[2];
> my $start_run = time();
> 
> open (IN, "<$inputfile");
> open (OUT, ">$outputfile");
> 
> my $registry = 'Bio::EnsEMBL::Registry';
> 
> $registry->load_registry_from_db(
> -host => 'ensembldb.ensembl.org <http://ensembldb.ensembl.org/>',
> -user => 'anonymous'
> );
> 
> # Connect to the databases:
> my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
> my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
> my $pop_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
> $variation_adaptor->db->use_vcf(1);
> 
> my $ld_population = $pop_adaptor->fetch_by_name('1000GENOMES:phase_3:CEU');
> 
> # Loop through all SNPs available and find SNPs in LD
> while(<IN>) {
> chomp;
> my $variation_name = $_;
> my $variation = $variation_adaptor->fetch_by_name($variation_name);
> my @var_features;
> 
> if ($variation) {
> @var_features = @{ $variation->get_all_VariationFeatures() };
> } else {
> print 'failing variation name: ', $variation_name, "\n";
> next;
> }
> 
> foreach my $vf (@var_features) {
> my $rsid = $vf->name;
> print $rsid, "\n";
> my $start = $vf->start;
> my $region = $vf->seq_region_name;
> print OUT "$region\t$rsid\t$start\n";
> my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
> my @ld_values = @{ $ldfc->get_all_ld_values() };
>       
> foreach my $ld_hash (@ld_values) {
>         
> my $r2 = $ld_hash->{r2};
>         
> if ($r2 >= $r_limit) {
>    
> my $variation_name1 = $ld_hash->{variation1}->variation_name;
>    
> my $variation_name2 = $ld_hash->{variation2}->variation_name;
>    
> my $pos1 = $ld_hash->{variation1}->seq_region_name;
>    
> my $pos2 = $ld_hash->{variation2}->seq_region_name;
>    
> my $start1 = $ld_hash->{variation1}->start;
>    
> my $start2 = $ld_hash->{variation2}->start;
> 
>    
> if ($variation_name1 eq $rsid) {
>    
> print OUT "$pos2\t$variation_name2\t$start2\tr2=$r2\n";
>    
> } else {
>    
> print OUT "$pos1\t$variation_name1\t$start1\tr2=$r2\n";        
>    
> }
>         
> }
>       
> }
> }
> }
> 
> print "done\n";
> close OUT;
> close IN;
> my $end_run = time();
> my $run_time = $end_run - $start_run;
> print "Job took $run_time seconds\n";
> _END SCRIPT_
> <Asthma3.txt>
> <out_0_Asthma3.txt>
> <Atopicdermatitis3.txt>
> <out_0_Atopicdermatitis3.txt>
> <Ulcerativecolitis3.txt>
> <Ulcerativecolitis5.txt>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20160307/29511b0b/attachment.html>


More information about the Dev mailing list