[ensembl-dev] Questions about the ENSEMBL population

Johanne Håøy Horn johannhh at ifi.uio.no
Sun Feb 28 14:50:54 GMT 2016


Dear Ensembl dev team,

I try to see what populations are available from your ENSEMBL api, and with the following script I get the populations listed below:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org<http://ensembldb.ensembl.org>',
-user => 'anonymous'
);

# Connect to the databases:
my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
my $pop_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
$variation_adaptor->db->use_vcf(1);
$pop_adaptor->db->use_vcf(1);

# Print out all available populations:
my $ld_populations = $pop_adaptor->fetch_all_LD_Populations();
foreach my $ld_pop (@$ld_populations) {
print $ld_pop->name, "\n";
}

my $ld_population = $pop_adaptor->fetch_by_name('EUR');

POPULATIONS:
1000GENOMES:phase_3:ACB
1000GENOMES:phase_3:ASW
1000GENOMES:phase_3:BEB
1000GENOMES:phase_3:CDX
1000GENOMES:phase_3:CEU
1000GENOMES:phase_3:CHB
1000GENOMES:phase_3:CHS
1000GENOMES:phase_3:CLM
1000GENOMES:phase_3:ESN
1000GENOMES:phase_3:FIN
1000GENOMES:phase_3:GBR
1000GENOMES:phase_3:GIH
1000GENOMES:phase_3:IBS
1000GENOMES:phase_3:ITU
1000GENOMES:phase_3:JPT
1000GENOMES:phase_3:KHV
1000GENOMES:phase_3:LWK
1000GENOMES:phase_3:MAG # <— This is the population I cannot find documentation for
1000GENOMES:phase_3:MSL
1000GENOMES:phase_3:MXL
1000GENOMES:phase_3:PEL
1000GENOMES:phase_3:PJL
1000GENOMES:phase_3:PUR
1000GENOMES:phase_3:STU
1000GENOMES:phase_3:TSI
1000GENOMES:phase_3:YRI

This page explains the population codes: http://www.1000genomes.org/category/population/
However, I can’t find the extended version of the MAG abbreviation (marked above), not when searching elsewhere either. Which population is it, and what is its super population?
Also, do you have populations from other research initiatives than 1000G?

I have also been wondering if I should tell all adaptors, that is LDFeatureContainerAdaptor, VariationAdaptor and PopulationAdaptor, to use vcf(1) when I want both database and vcf data? I mainly use it to get out variants in LD with each other. Will setting different vcfs on the different adaptors affect the variants extracted? Or is it enough to set vcf(1) on the VariationAdaptor, for instance?

Even though the super populations are not on the printed list when using the population adaptor to print out all LD populations, is it possible to use $population_adaptor->fetch_by_name(‘EUR’) for instance? I tried this, but got no variants in LD. Do you have any ways this is possible to do, or a alternative list of super populations one can generate in some way? Or should I simply gather all subpopulations, such as EUR = (CEU, TSI, FIN, GBR, IBS) and loop through all these to get the super population LD?

Here is a script where I tried to use the super population «EUR» directly, without success (but no error message either, it just finds no LD variants):

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $inputfile = $ARGV[0];
my $outputfile = $ARGV[1];
my $r_limit = $ARGV[2];

open (IN, "<$inputfile");
open (OUT, ">$outputfile");

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org<http://ensembldb.ensembl.org>',
-user => 'anonymous'
);

# Connect to the databases:
my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
my $pop_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
$variation_adaptor->db->use_vcf(1);

my $ld_population = $pop_adaptor->fetch_by_name('EUR');

# Loop through all SNPs available and find SNPs in LD
while(<IN>) {
chomp;
my $variation_name = $_;
my $variation = $variation_adaptor->fetch_by_name($variation_name);
my @var_features = @{ $variation->get_all_VariationFeatures() };

foreach my $vf (@var_features) {
my $rsid = $vf->name;
print $rsid, "\n";
my $start = $vf->start;
my $region = $vf->seq_region_name;
print OUT "$region\t$rsid\t$start\n";
my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
my @ld_values = @{ $ldfc->get_all_ld_values() };
      foreach my $ld_hash (@ld_values) {
        my $r2 = $ld_hash->{r2};
        if ($r2 >= $r_limit) {
    my $variation_name1 = $ld_hash->{variation1}->variation_name;
    my $variation_name2 = $ld_hash->{variation2}->variation_name;
    my $pos1 = $ld_hash->{variation1}->seq_region_name;
    my $pos2 = $ld_hash->{variation2}->seq_region_name;
    my $start1 = $ld_hash->{variation1}->start;
    my $start2 = $ld_hash->{variation2}->start;

    if ($variation_name1 eq $rsid) {
    print OUT "$pos2\t$variation_name2\t$start2\tr2=$r2\n";
    } else {
    print OUT "$pos1\t$variation_name1\t$start1\tr2=$r2\n";
    }
        }
      }
}
}
close OUT;
close IN;

Best,
Johanne Håøy Horn




-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20160228/599bc882/attachment.html>


More information about the Dev mailing list