[ensembl-dev] Questions about the ENSEMBL population

Anja Thormann anja at ebi.ac.uk
Mon Feb 29 12:44:43 GMT 2016


Hi Johanne,

The name for the European super population is 1000GENOMES:phase_3:EUR. You can use the name for fetching the population object like this:

my $ld_population = $pop_adaptor->fetch_by_name('1000GENOMES:phase_3:EUR');

However, the European super population contains 503 samples and exceeds the number of genotypes supported by our script for computing LD data. Therefore, you need to loop over all European populations and we will need to check if we can increase the number of allowed populations. This could become availabe in release 85 which is planned for summer.

MAG stands for Mandinka in The Gambia and is part of the African super population. The 1000 Genomes website is using the correct name which is GWD (Gambian in Western Divisions in the Gambia). We will move to using GWD in release 84. You can find a list of all 1000 Genomes phase 3 populations here: http://www.ensembl.org/info/genome/variation/data_description.html#populations <http://www.ensembl.org/info/genome/variation/data_description.html#populations>.

It doesn't matter which adaptor you use to set the use_vcf flag. The flag is passed on to the DBAdaptor. All other adaptors will have access to the flag and check the value if needed.

You can set and update use_vcf accordingly: non-zero value instructs the API to look in configured VCF files for genotype-based data. Setting to 1 tells the API to use both the DB and VCF; setting to 2 tells the API to use only VCF.

Best,
Anja

> On 28 Feb 2016, at 14:50, Johanne Håøy Horn <johannhh at ifi.uio.no> wrote:
> 
> Dear Ensembl dev team,
> 
> I try to see what populations are available from your ENSEMBL api, and with the following script I get the populations listed below:
> 
> use strict;
> use warnings;
> use Bio::EnsEMBL::Registry;
> 
> my $registry = 'Bio::EnsEMBL::Registry';
> 
> $registry->load_registry_from_db(
> -host => 'ensembldb.ensembl.org <http://ensembldb.ensembl.org/>',
> -user => 'anonymous'
> );
> 
> # Connect to the databases:
> my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
> my $pop_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
> $variation_adaptor->db->use_vcf(1);
> $pop_adaptor->db->use_vcf(1);
> 
> # Print out all available populations:
> my $ld_populations = $pop_adaptor->fetch_all_LD_Populations();
> foreach my $ld_pop (@$ld_populations) {
> print $ld_pop->name, "\n";
> }
> 
> my $ld_population = $pop_adaptor->fetch_by_name('EUR');
> 
> POPULATIONS:
> 1000GENOMES:phase_3:ACB
> 1000GENOMES:phase_3:ASW
> 1000GENOMES:phase_3:BEB
> 1000GENOMES:phase_3:CDX
> 1000GENOMES:phase_3:CEU
> 1000GENOMES:phase_3:CHB
> 1000GENOMES:phase_3:CHS
> 1000GENOMES:phase_3:CLM
> 1000GENOMES:phase_3:ESN
> 1000GENOMES:phase_3:FIN
> 1000GENOMES:phase_3:GBR
> 1000GENOMES:phase_3:GIH
> 1000GENOMES:phase_3:IBS
> 1000GENOMES:phase_3:ITU
> 1000GENOMES:phase_3:JPT
> 1000GENOMES:phase_3:KHV
> 1000GENOMES:phase_3:LWK
> 1000GENOMES:phase_3:MAG # <— This is the population I cannot find documentation for
> 1000GENOMES:phase_3:MSL
> 1000GENOMES:phase_3:MXL
> 1000GENOMES:phase_3:PEL
> 1000GENOMES:phase_3:PJL
> 1000GENOMES:phase_3:PUR
> 1000GENOMES:phase_3:STU
> 1000GENOMES:phase_3:TSI
> 1000GENOMES:phase_3:YRI
> 
> This page explains the population codes: http://www.1000genomes.org/category/population/ <http://www.1000genomes.org/category/population/>
> However, I can’t find the extended version of the MAG abbreviation (marked above), not when searching elsewhere either. Which population is it, and what is its super population? 
> Also, do you have populations from other research initiatives than 1000G?
> 
> I have also been wondering if I should tell all adaptors, that is LDFeatureContainerAdaptor, VariationAdaptor and PopulationAdaptor, to use vcf(1) when I want both database and vcf data? I mainly use it to get out variants in LD with each other. Will setting different vcfs on the different adaptors affect the variants extracted? Or is it enough to set vcf(1) on the VariationAdaptor, for instance?
> 
> Even though the super populations are not on the printed list when using the population adaptor to print out all LD populations, is it possible to use $population_adaptor->fetch_by_name(‘EUR’) for instance? I tried this, but got no variants in LD. Do you have any ways this is possible to do, or a alternative list of super populations one can generate in some way? Or should I simply gather all subpopulations, such as EUR = (CEU, TSI, FIN, GBR, IBS) and loop through all these to get the super population LD?
> 
> Here is a script where I tried to use the super population «EUR» directly, without success (but no error message either, it just finds no LD variants):
> 
> use strict;
> use warnings;
> use Bio::EnsEMBL::Registry;
> 
> my $inputfile = $ARGV[0];
> my $outputfile = $ARGV[1];
> my $r_limit = $ARGV[2];
> 
> open (IN, "<$inputfile");
> open (OUT, ">$outputfile");
> 
> my $registry = 'Bio::EnsEMBL::Registry';
> 
> $registry->load_registry_from_db(
> -host => 'ensembldb.ensembl.org <http://ensembldb.ensembl.org/>',
> -user => 'anonymous'
> );
> 
> # Connect to the databases:
> my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
> my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
> my $pop_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
> $variation_adaptor->db->use_vcf(1);
> 
> my $ld_population = $pop_adaptor->fetch_by_name('EUR');
> 
> # Loop through all SNPs available and find SNPs in LD
> while(<IN>) {
> chomp;
> my $variation_name = $_;
> my $variation = $variation_adaptor->fetch_by_name($variation_name);
> my @var_features = @{ $variation->get_all_VariationFeatures() };
> 
> foreach my $vf (@var_features) {
> my $rsid = $vf->name;
> print $rsid, "\n";
> my $start = $vf->start;
> my $region = $vf->seq_region_name;
> print OUT "$region\t$rsid\t$start\n";
> my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
> my @ld_values = @{ $ldfc->get_all_ld_values() };
>       
> foreach my $ld_hash (@ld_values) {
>         
> my $r2 = $ld_hash->{r2};
>         if ($r2 >= $r_limit) {
>    
> my $variation_name1 = $ld_hash->{variation1}->variation_name;
>    
> my $variation_name2 = $ld_hash->{variation2}->variation_name;
>    
> my $pos1 = $ld_hash->{variation1}->seq_region_name;
>    
> my $pos2 = $ld_hash->{variation2}->seq_region_name;
>    
> my $start1 = $ld_hash->{variation1}->start;
>    
> my $start2 = $ld_hash->{variation2}->start;
> 
>    
> if ($variation_name1 eq $rsid) {
>    
> print OUT "$pos2\t$variation_name2\t$start2\tr2=$r2\n";
>    
> } else {
>    
> print OUT "$pos1\t$variation_name1\t$start1\tr2=$r2\n";        
>    
> }
>         
> }
>       
> }
> }
> }
> close OUT;
> close IN;
> 
> Best,
> Johanne Håøy Horn
> 
> 
> 
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20160229/c3fdf465/attachment.html>


More information about the Dev mailing list