[ensembl-dev] Issues with VariationFeature

Johanne Håøy Horn johannhh at ifi.uio.no
Thu Feb 25 18:00:36 GMT 2016


Hello again!

Thank you for the bug fix. My apologies, I read this old correspondence and mistakenly thought you still calculated LD on the fly in stead of having it in the database: https://www.biostars.org/p/109785/#147784

I use all LD populations for now, as you suggested. However, a couple more questions appeared:

1) When I call my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);, lots of vcf files are downloaded. The terminal prints out «[get_local_version] downloading the index file…» when the files are downloaded, and the following files appear in my folder:

ALL.chr1.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr10.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr11.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr12.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr14.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr15.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr16.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr19.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr2.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr3.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr4.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr5.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr8.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi
ALL.chr9.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz.tbi

What are these files? Why are they downloaded? Will release 84 have a different way of doing it?

2) When I try to use a large file of SNPs (example attached in this email), I get the following message after all the 14 vcf-files posted above have been downloaded:
[kftp_connect_file]
[main] fail to open the data file.
[get_local_version] downloading the index file...
[kftp_connect_file] 530 You may not have more than 15 concurrent connections at any time.
[download_from_remote] fail to open remote file.

If it tries to download another file at the same time the restriction threshold is met, I get this error message:
[get_local_version] downloading the index file...
[kftp_connect_file] 530 You may not have more than 15 concurrent connections at any time.
[download_from_remote] fail to open remote file.
[tabix] failed to load the index file.
[get_local_version] downloading the index file...
Can't use an undefined value as an ARRAY reference at /Users/Johanne/src/ensembl-io/modules/Bio/EnsEMBL/IO/Parser/BaseVCF4.pm line 570, <> line 22.

Does  this mean I cannot connect to all chromosomes at the same time? Does LD only get computed between variants on the same chromosome? What is the reason behind this restriction? I am not too familiar with reasoning behind choices of structure etc in LD data sets, so these questions are based on my impression that variants can be in LD even though they are located on different chromosomes. If you have a different opinion on this, I would be happy to hear it!

My script is the following (basically a copy of your previous suggestions :) )
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $start_run = time();
open (OUT, ">expandedSNPS.txt");

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org<http://ensembldb.ensembl.org>',
-user => 'anonymous'
);

# Connect to the databases:
my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
my $pop_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
$variation_adaptor->db->use_vcf(1);

my $ld_population = $pop_adaptor->fetch_by_name('1000GENOMES:phase_3:CEU');

# Loop through all SNPs available and find SNPs in LD
while(<>) {
chomp;
my $variation_name = $_;
my $variation = $variation_adaptor->fetch_by_name($variation_name);
my @var_features = @{ $variation->get_all_VariationFeatures() };

foreach my $vf (@var_features) {
my $rsid = $vf->name;
my $start = $vf->start;
my $region = $vf->seq_region_name;
print OUT "$rsid\t$start\t$region\n";
print $ld_population->name, "\n";

my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
my @ld_values = @{ $ldfc->get_all_ld_values() };

      foreach my $ld_hash (@ld_values) {
        my $r2 = $ld_hash->{r2};
        my $variation_name1 = $ld_hash->{variation1}->variation_name;
        my $variation_name2 = $ld_hash->{variation2}->variation_name;
        print OUT "\t$variation_name1\t$variation_name2\tr2=$r2\n";
      }
    }
}

close OUT;
my $end_run = time();
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20160225/c7e80744/attachment.html>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: longLD.txt
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20160225/c7e80744/attachment.txt>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20160225/c7e80744/attachment.htm>


More information about the Dev mailing list