[ensembl-dev] Issues with VariationFeature

Johanne Håøy Horn johannhh at ifi.uio.no
Fri Feb 26 14:30:29 GMT 2016


Ah, I missed that part of your previous email. Thank you for great support, it works now :)

26. feb. 2016 kl. 14.57 skrev Anja Thormann <anja at ebi.ac.uk<mailto:anja at ebi.ac.uk>>:

Updating (git pull) your ensembl-io repo should fix this.

Best,
Anja
On 26 Feb 2016, at 13:53, Johanne Håøy Horn <johannhh at ifi.uio.no<mailto:johannhh at ifi.uio.no>> wrote:

Thank you for the fix and explanation!

I now get a different error message when fetching LDFeatureContainer, however. I have attached the input file of rsIDs I pass to the script. With the script pasted at the end of this email, I get this error:
$ perl expandSNPs.pl ldSNPs.txt
1000GENOMES:phase_3:CEU
after ldfc
after ld values
1000GENOMES:phase_3:CEU
TabixIterator::tabix_iter_free: iter is not of type ti_iter_t at /Users/Johanne/src/ensembl-io/modules/Bio/EnsEMBL/IO/TabixParser.pm line 91, <> line 2.

The error seems to occur if $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population); is called more than once, as I get the exact same error message if I use a larger number of SNPs.

__SCRIPT__:
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $start_run = time();
open (OUT, ">expandedSNPS.txt");

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org<http://ensembldb.ensembl.org/>',
-user => 'anonymous'
);

# Connect to the databases:
my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
my $pop_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
$variation_adaptor->db->use_vcf(1);

# Print out all available populations:
#my $ld_populations = $pop_adaptor->fetch_all_LD_Populations();
#foreach my $ld_pop (@$ld_populations) {
# print $ld_pop->name, "\n";
#}

my $ld_population = $pop_adaptor->fetch_by_name('1000GENOMES:phase_3:CEU');

# Loop through all SNPs available and find SNPs in LD
while(<>) {
chomp;
my $variation_name = $_;
my $variation = $variation_adaptor->fetch_by_name($variation_name);
my @var_features = @{ $variation->get_all_VariationFeatures() };

foreach my $vf (@var_features) {
my $rsid = $vf->name;
my $start = $vf->start;
my $region = $vf->seq_region_name;
print OUT "$rsid\t$start\t$region\n";
print $ld_population->name, "\n";

my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
print "after ldfc\n";
my @ld_values = @{ $ldfc->get_all_ld_values() };
print "after ld values\n";
      foreach my $ld_hash (@ld_values) {
        my $r2 = $ld_hash->{r2};
        my $variation_name1 = $ld_hash->{variation1}->variation_name;
        my $variation_name2 = $ld_hash->{variation2}->variation_name;
        my $pos1 = $ld_hash->{variation1}->seq_region_name;
        my $pos2 = $ld_hash->{variation2}->seq_region_name;
        print OUT "\t$variation_name1 $pos1\t$variation_name2 $pos2\tr2=$r2\n";
      }
}
}

print "done\n";
close OUT;
my $end_run = time();
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n»;

Best,
Johanne



<ldSNPs.txt>

26. feb. 2016 kl. 12.41 skrev Anja Thormann <anja at ebi.ac.uk<mailto:anja at ebi.ac.uk>>:

Hi Johanne,

Thank you very much for the detailed bug report. I could reproduce the problem and we pushed another bug fix to the release/83 branch dealing with the number of open connections. Please update ensembl-variation and ensembl-io repos both for release/83.

The files you see in your working directory are binary index files that are created locally for remote VCF files that store the genotypes. The index files speed up location based lookups and you can delete them after your computations.

We don't support computation of LD values bewteen SNPs located on different chromosomes. The assumption is that different chromosomes segregate independently during meiosis and only genes/alleles located on the same chromosome can be linked.

HTH,
Anja

On 25 Feb 2016, at 18:00, Johanne Håøy Horn <johannhh at ifi.uio.no<mailto:johannhh at ifi.uio.no>> wrote:

Hello again!

Thank you for the bug fix. My apologies, I read this old correspondence and mistakenly thought you still calculated LD on the fly in stead of having it in the database: https://www.biostars.org/p/109785/#147784

I use all LD populations for now, as you suggested. However, a couple more questions appeared:

1) When I call my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);, lots of vcf files are downloaded. The terminal prints out «[get_local_version] downloading the index file…» when the files are downloaded, and the following files appear in my folder:

ALL.chr1.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr10.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr11.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr12.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr14.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr15.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr16.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr19.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr2.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr3.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr4.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr5.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr8.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr9.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi

What are these files? Why are they downloaded? Will release 84 have a different way of doing it?

2) When I try to use a large file of SNPs (example attached in this email), I get the following message after all the 14 vcf-files posted above have been downloaded:
[kftp_connect_file]
[main] fail to open the data file.
[get_local_version] downloading the index file...
[kftp_connect_file] 530 You may not have more than 15 concurrent connections at any time.
[download_from_remote] fail to open remote file.

If it tries to download another file at the same time the restriction threshold is met, I get this error message:
[get_local_version] downloading the index file...
[kftp_connect_file] 530 You may not have more than 15 concurrent connections at any time.
[download_from_remote] fail to open remote file.
[tabix] failed to load the index file.
[get_local_version] downloading the index file...
Can't use an undefined value as an ARRAY reference at /Users/Johanne/src/ensembl-io/modules/Bio/EnsEMBL/IO/Parser/BaseVCF4.pm line 570, <> line 22.

Does  this mean I cannot connect to all chromosomes at the same time? Does LD only get computed between variants on the same chromosome? What is the reason behind this restriction? I am not too familiar with reasoning behind choices of structure etc in LD data sets, so these questions are based on my impression that variants can be in LD even though they are located on different chromosomes. If you have a different opinion on this, I would be happy to hear it!

My script is the following (basically a copy of your previous suggestions :) )
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $start_run = time();
open (OUT, ">expandedSNPS.txt");

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org<http://ensembldb.ensembl.org/>',
-user => 'anonymous'
);

# Connect to the databases:
my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
my $pop_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
$variation_adaptor->db->use_vcf(1);

my $ld_population = $pop_adaptor->fetch_by_name('1000GENOMES:phase_3:CEU');

# Loop through all SNPs available and find SNPs in LD
while(<>) {
chomp;
my $variation_name = $_;
my $variation = $variation_adaptor->fetch_by_name($variation_name);
my @var_features = @{ $variation->get_all_VariationFeatures() };

foreach my $vf (@var_features) {
my $rsid = $vf->name;
my $start = $vf->start;
my $region = $vf->seq_region_name;
print OUT "$rsid\t$start\t$region\n";
print $ld_population->name, "\n";

my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
my @ld_values = @{ $ldfc->get_all_ld_values() };

      foreach my $ld_hash (@ld_values) {
        my $r2 = $ld_hash->{r2};
        my $variation_name1 = $ld_hash->{variation1}->variation_name;
        my $variation_name2 = $ld_hash->{variation2}->variation_name;
        print OUT "\t$variation_name1\t$variation_name2\tr2=$r2\n";
      }
    }
}

close OUT;
my $end_run = time();
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";

<longLD.txt>

25. feb. 2016 kl. 14.24 skrev Anja Thormann <anja at ebi.ac.uk<mailto:anja at ebi.ac.uk>>:

Hi Johanne,

thank you for reporting this. The problem was caused by an error in the sql statement. I fixed the wrong statement and pushed the fix again to the release/83 branch. Please pull the changes. However, the call is not going to return any data at the moment. We recently switched from using a database for storing genotypes to reading the the genotypes from VCF files instead. This functionality is currently not supported by the method ($vf->get_all_LD_Populations()). We will add this for release/84. In the  meantime please use the LDFeatureContainerAdaptor and loop over all the populations for which we could compute LD data. The populations can be retrieved from the population adaptor using fetch_all_LD_Populations. We compute LD data on 1000 Genomes phase 3 data. In addition to computing LD data for a given SNP you can also specify a region or a a set of SNPs for which you want to compute LD data. Use the following methods from the LDFeatureContainerAdaptor (http://www.ensembl.org/info/docs/Doxygen/variation-api/classBio_1_1EnsEMBL_1_1Variation_1_1DBSQL_1_1LDFeatureContainerAdaptor.html): fetch_by_VariationFeatures, fetch_by_Slice.

Best,
Anja

On 25 Feb 2016, at 09:57, Johanne Håøy Horn <johannhh at ifi.uio.no<mailto:johannhh at ifi.uio.no>> wrote:

Dear Ensembl team,

When I do the following call: my @vf_pops = @{ $vf->get_all_LD_Populations() }; I get this error:
DBD::mysql::st execute failed: Unknown column 'ip.population_id' in 'field list' at /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/VariationFeature.pm line 1429, <> line 1.DBD::mysql::st execute failed: Unknown column 'ip.population_id' in 'field list' at /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/VariationFeature.pm line 1429, <> line 1.

Here’s the full script:
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $start_run = time();

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org<http://ensembldb.ensembl.org/>',
  -user => 'anonymous'
);
my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation' );
my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
my $population_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
$variation_adaptor->db->use_vcf(1); # To get 1000G phase 3 data also

my $ld_populations = $population_adaptor->fetch_all_LD_Populations();
foreach my $ld_population (@$ld_populations) {
    print $ld_population->name, "\n";
}

my $variation_name = 'rs157580';
my $variation = $variation_adaptor->fetch_by_name($variation_name);
my @vfs = @{ $variation->get_all_VariationFeatures() };

foreach my $vf (@vfs) {

  print $vf->name, "\n";
  my @vf_pops = @{ $vf->get_all_LD_Populations() };
  foreach my $ld_population (@$ld_populations) {
    print $ld_population->name, "\n";
    my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
    foreach my $ld_hash (@{$ldfc->get_all_ld_values}) {
      my $d_prime = $ld_hash->{d_prime};
      my $r2 = $ld_hash->{r2};
my $variation_name1 = $ld_hash->{variation1}->variation_name;
      my $variation_name2 = $ld_hash->{variation2}->variation_name;
      print "$variation_name1 $variation_name2 d_prime=$d_prime r2=$r2\n";
    }
  }
}

my $end_run = time();
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";

If I remove the call to get_all_LD_Populations, the script runs fine again. Do you have any idea on what I am doing wrong? Could it be a bug in the code, like the error I reported yesterday?

Also, I have visited a lot of forums where LD calculation is discussed. Many users ask for a database one can query to find LD between SNPs, and genomic LD tracks, but all such services are only available on HapMap data. Do you know why there is none for all SNPs and LD produced up until 1000G phase 3? What kind of restrictions is there that makes it easier to compute LD on the fly, for instance? Space maybe?
(If there actually does exist databases/tracks of LD, I would be happy to know!)

Best,
Johanne

_______________________________________________
Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/

_______________________________________________
Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/

_______________________________________________
Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/

_______________________________________________
Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/

_______________________________________________
Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/

_______________________________________________
Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20160226/9335811f/attachment.html>


More information about the Dev mailing list