[ensembl-dev] RsIDs consistently fail with variation API, v83

Fri Mar 11 12:05:38 GMT 2016

Thank you for this information!

When you say 100 MB, do you mean 100KB? As in 100,000 bases?
I noticed that with the script I have used so far, the longest physical distance between a SNP and one of its LD variants, is 100K. Is this because with the slice, my script will only find SNPs in LD 100K away in either direction? So theoretically, there can be other LD variants further away from the SNP I am expanding?

I have mistakenly thought that the default of getting ld SNPs used the entire chromosome as slice. Is this possible (especially given time considerations)? Or is it not any point in doing it, as LD only happen within 100K-blocks anyway?

Best,
Johanne

11. mar. 2016 kl. 12.50 skrev Anja Thormann <anja at ebi.ac.uk<mailto:anja at ebi.ac.uk>>:

Dear Johanne,

we have released ensembl 84 and you can now make use of our improvements for LD data computation. Just as a reminder, you would like to retrieve all variants that are in LD with a given variant. Depending on how many variants you have for which you want to compute LD data there are two ways:

- If your input list is small (< 1000) you can use fetch_by_VariationFeature($vf, $population). The default setting will create a slice (100MB + variant length + 100MB) expanding the input variant. The API computes LD values between the variant of interest and all variants surrounding the variant and not further away than 100MB. You could speed this up by decreasing the distance to 50MB:

my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
$ldfc_adaptor->max_snp_distance(50_000);
my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);

- If your list of input variants is much longer I would recommend using fetch_by_Slice. You could create overlapping slice objects that cover your chromosomal region of interest and then the API computes all pairwise LD values for all variants in the region. You need to filter the results for your variants of interest. This requires a little bit more work because you have to choose the correct overlap size to make sure you allow for the same max distance for which you want to compute LD data for each of your input variants.

my $slice_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'slice');
my $chr = 6;
my $start = 32_102_292;
my $end = 32_402_292;
my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr, $start, $end);
my $ldfc = $ldfc_adaptor->fetch_by_Slice($slice, $ld_population);

The second approach should be faster because it can summarise some of the computations that have do be done: If your input variants are in close proximity you can compute all their LD values in one computation by providing the slice that covers all of them instead of creating a region for each of them individually and then compute LD values.

Independent of the two approaches you can further speed up your computation: When asking for all ld values pass 1 as an argument. This will prevent the method (get_all_ld_values) from creating variation feature objects and instead will only return the variant names and their LD values:

my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
my $ld_values = $ldfc->get_all_ld_values(1);
foreach my $ld_hash (@$ld_values) {
  my $r2 = $ld_hash->{r2};
  my $d_prime = $ld_hash->{d_prime};
  my $variation_name1 = $ld_hash->{variation_name1};
  my $variation_name2 = $ld_hash->{variation_name2};

}

Don't forget to switch to the new release/84 branches for all ensembl repos.

Best,
Anja

On 7 Mar 2016, at 20:02, Anja Thormann <anja at ebi.ac.uk<mailto:anja at ebi.ac.uk>> wrote:

Dear Johanne,

I was able to reproduce the lost connection error using the following list of 31 variants (concatenating all your input files):
rs3117098
rs9268516
rs3129890
rs9272346
rs9273349
rs7775228
rs9275698
rs148203517
rs3853601
rs41268896
rs12153855
rs176095
rs9469099
rs4722404
rs7000782
rs7815944
rs549182
rs9268480
rs9268853
rs9268877
rs9268923
rs2395185
rs9271366
rs6927022
rs1063355
rs943072
rs6911490
rs2858829
rs6920220
rs798502
rs11764116

The chromosome names with a different format are called alternate loci. They are stand-alone, accessioned sequences. Several human chromosomal regions exhibit sufficient variability to prevent adequate representation by a single sequence. To address this, the GRCh38 assembly provides alternate sequence for selected variant regions through the inclusion of alternate loci scaffolds. You can read more about this here:
http://genomeref.blogspot.co.uk/2013/12/announcing-grch38.html

One example is variant s3129890. It maps to the reference sequence and to 6 alternate loci:
http://www.ensembl.org/Homo_sapiens/Variation/Explore?db=core;r=CHR_HSCHR6_8_CTG1:32392790-32393790;v=rs3129890;vdb=variation;vf=146716554

The error seems to be caused by variants mapping alternate loci. Until I have figured out  what is going on it would be good if you could exclude them from your computations and just use the mappings to the reference sequence:

foreach my $vf (@var_features) {
  next unless ($vf->slice->is_reference);
  ...

I hope that helps for now. I will get back to you as soon as I have more information. I will also let you know how you can speed up your computation as soon as we release Ensembl 84.

Best,
Anja

On 6 Mar 2016, at 12:32, Johanne Håøy Horn <johannhh at ifi.uio.no<mailto:johannhh at ifi.uio.no>> wrote:

Dear ensembl team,

I have gotten quite some help from you previously with the variation api. I wish to expand variations to include the variants they are in LD with. My script works for the most part, but I am unable to run it on some of my rsID files. I have broken the files into increasingly smaller parts, but the runs still fails. Could you help me with what I am doing wrong?

Some of the issue seems to be connected to the chromosome the rsID is on. In many of the failing files of rsIDs, this is what the script produce (example taken from out_0_Asthma3.txt, full output file attached to this email):
6 rs3129969 32414832 r2=0.538176
6 rs3763312 32408571 r2=0.074106
6 rs553071835 32487947 r2=0.058328
CHR_HSCHR6_MHC_DBB_CTG1 rs3129890 32421214
CHR_HSCHR6_MHC_MANN_CTG1 rs3129890 32485127
CHR_HSCHR6_MHC_MCF_CTG1 rs3129890 32521873
CHR_HSCHR6_MHC_QBL_CTG1 rs3129890 32403957
CHR_HSCHR6_MHC_SSTO_CTG1 rs3129890 32453108
CHR_HSCHR6_8_CTG1 rs3129890 32393290
6 rs9272346 32636595

In stead of getting a line of <chromosome> <rsID> <position> <r2> for the variation, the same SNP is repeated multiple times with a chromosome/region of a different format than what I get in the scripts that do not fail.

I have attached some of the GWAS disease files of rsIDs I wish to expand, but which seem to produce errors, and some error printouts I have gotten is pasted below. Two output files belonging to the erroneous input files, with out_0_<inputfilename> as filename is also attached.

Is it something related to the rsIDs I am using? Are you able to run the script with these input files?
I noticed that longer files of rsIDs were more prone to fail, as the connection to the database was broken. Therefore, I have divided the files into multiple small files. For most of the GWAS data, this worked fine, and I get output files on the format I want. But I am now left with 10-15 files of 10-20 rsIDs which fail with various error messages. I have tried calling my script with them multiple times, and as they fail consistently, I do not think it is my internet connection that is the problem.

_START TERMINAL PRINT ERROR 1_
inputfile Asthma3.txt
outputfile out_0_Asthma3.txt
DBD::mysql::st execute failed: Lost connection to MySQL server during query at /Users/Johanne/src/ensembl/modules//Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482, <OUT> line 3739.

-------------------- EXCEPTION --------------------
MSG: Detected an error whilst executing SQL 'SELECT  v.variation_id, v.name AS v_name, v.class_attrib_id AS v_class_attrib_id, v.source_id AS v_source_id, v.somatic AS v_somatic, v.flipped AS v_flipped, v.ancestral_allele AS v_ancestral_allele, vs.moltype AS vs_moltype, vs.name AS vs_name, s2.name AS vs_source_name, v.minor_allele, v.minor_allele_freq, v.minor_allele_count, v.clinical_significance, v.evidence_attribs
FROM (( (variation v, source s1)
  LEFT JOIN variation_synonym vs ON v.variation_id = vs.variation_id )
  LEFT JOIN source s2 ON vs.source_id = s2.source_id )
 WHERE v.name = ?  AND

        s1.source_id = v.source_id
     AND v.display = 1
': DBD::mysql::st execute failed: Lost connection to MySQL server during query at /Users/Johanne/src/ensembl/modules//Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482, <OUT> line 3739.

STACK Bio::EnsEMBL::DBSQL::BaseAdaptor::generic_fetch /Users/Johanne/src/ensembl/modules//Bio/EnsEMBL/DBSQL/BaseAdaptor.pm:483
STACK Bio::EnsEMBL::Variation::DBSQL::VariationAdaptor::fetch_by_name /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/VariationAdaptor.pm:497
STACK Bio::EnsEMBL::Variation::DBSQL::LDFeatureContainerAdaptor::_ld_calc /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm:865
STACK Bio::EnsEMBL::Variation::DBSQL::LDFeatureContainerAdaptor::fetch_by_Slice /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm:177
STACK Bio::EnsEMBL::Variation::DBSQL::LDFeatureContainerAdaptor::fetch_by_VariationFeature /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm:246
STACK toplevel expandSNPs.pl:42
Date (localtime)    = Sat Mar  5 18:03:25 2016
Ensembl API version = 83
---------------------------------------------------
rs3117098
rs3117098
rs3117098
rs3117098
rs3117098
rs9268516
rs9268516
rs9268516
rs9268516
rs9268516
rs9268516
rs9268516
rs9268516
rs9268516
rs3129890
rs3129890
rs3129890
rs3129890
rs3129890
rs3129890
rs3129890
rs9272346
_END TERMINAL PRINT ERROR 1_

_START TERMINAL PRINT ERROR 2_
inputfile Atopicdermatitis3.txt
outputfile out_0_Atopicdermatitis3.txt
connect: Network is unreachable
rs1050654
_END TERMINAL PRINT ERROR 2_

_START TERMINAL PRINT ERROR 3_
inputfile textfiles/Atopicdermatitis3.txt
outputfile textfiles/out_0_Atopicdermatitis3.txt
Can't call method "get_all_VariationFeatures" on an undefined value at expandSNPs.pl line 33, <IN> line 3.
rs1050654
rs3853601
rs3853601
rs3853601
rs3853601
rs3853601
rs3853601
rs3853601
rs3853601
_END TERMINAL PRINT ERROR 3_

_START SCRIPT_
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $inputfile = $ARGV[0];
my $outputfile = $ARGV[1];
my $r_limit = $ARGV[2];
my $start_run = time();

open (IN, "<$inputfile");
open (OUT, ">$outputfile");

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org<http://ensembldb.ensembl.org/>',
-user => 'anonymous'
);

# Connect to the databases:
my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
my $pop_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
$variation_adaptor->db->use_vcf(1);

my $ld_population = $pop_adaptor->fetch_by_name('1000GENOMES:phase_3:CEU');

# Loop through all SNPs available and find SNPs in LD
while(<IN>) {
chomp;
my $variation_name = $_;
my $variation = $variation_adaptor->fetch_by_name($variation_name);
my @var_features;

if ($variation) {
@var_features = @{ $variation->get_all_VariationFeatures() };
} else {
print 'failing variation name: ', $variation_name, "\n";
next;
}

foreach my $vf (@var_features) {
my $rsid = $vf->name;
print $rsid, "\n";
my $start = $vf->start;
my $region = $vf->seq_region_name;
print OUT "$region\t$rsid\t$start\n";
my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
my @ld_values = @{ $ldfc->get_all_ld_values() };
      foreach my $ld_hash (@ld_values) {
        my $r2 = $ld_hash->{r2};
        if ($r2 >= $r_limit) {
    my $variation_name1 = $ld_hash->{variation1}->variation_name;
    my $variation_name2 = $ld_hash->{variation2}->variation_name;
    my $pos1 = $ld_hash->{variation1}->seq_region_name;
    my $pos2 = $ld_hash->{variation2}->seq_region_name;
    my $start1 = $ld_hash->{variation1}->start;
    my $start2 = $ld_hash->{variation2}->start;

    if ($variation_name1 eq $rsid) {
    print OUT "$pos2\t$variation_name2\t$start2\tr2=$r2\n";
    } else {
    print OUT "$pos1\t$variation_name1\t$start1\tr2=$r2\n";
    }
        }
      }
}
}

print "done\n";
close OUT;
close IN;
my $end_run = time();
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";
_END SCRIPT_
<Asthma3.txt>
<out_0_Asthma3.txt>
<Atopicdermatitis3.txt>
<out_0_Atopicdermatitis3.txt>
<Ulcerativecolitis3.txt>
<Ulcerativecolitis5.txt>
_______________________________________________
Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/

_______________________________________________
Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/

_______________________________________________
Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20160311/075c7473/attachment.html>