[ensembl-dev] 1KGp3 markers under-annotated for human chr 22 in Ensembl API v96?

andrew126 at mac.com andrew126 at mac.com
Mon Sep 30 06:25:33 BST 2019


Hi,

We have a local install of Ensembl 96 (API and human db), and the below perl script returns the following marker counts for 1KGp3:EUR, chromosomes 20, 21, and 22:

	20 .. 1747663
	21 .. 1066903
	22 .. 3487

We see the exact same counts regardless of if we use locally downloaded 1KG vcfs or the remote vcfs (as config'd in ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json).

I can't find a separate resource with summary numbers, but chr 22 being three orders of magnitude below chr 20 or 21 seems unexpected.  Is that value correct?

Further, gunzip'ing the ALL.chr###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz files for chroms 20-22 gives more expected consistency via 'cut -f3 FILE | grep rs | grep -v "#" | wc -l':

	20: 1,811,392
	21: 1,100,065
	22: 1,100,429

All of the chrom 22 variants are QUAL=100, FILTER=PASS.

Any help understanding why the chr-22 1KGp3:EUR marker count is so low would be greatly appreciated.  I'm not sure if this impacts other chromosomes or not.

Trying to reproduce the problem/result against remote Ensembl 96 (ensembldb.ensembl.org) has exposed another issue:  the same script returns 0 counts for each chromosome, regardless of if is using remote or local 1KG vcfs.  Further, no errors are reported/thrown.

	20 .. 0
	21 .. 0
	22 .. 0

Can you also clarify why we see that behavior, given that the script reports values against a local Ensembl 96 install?

Many thanks for any help/information.

Best,

Andrew


Here is the perl script:

use strict;
$|=1;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::ApiVersion;

my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_all();
$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
    );

my $vs_adaptor = $registry->get_adaptor('human','variation','variationset');
$vs_adaptor->db->use_vcf(1);
my $vs = $vs_adaptor->fetch_by_short_name('1kg_3_eur');

foreach my $chr (20..22) {
    my $slice_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'slice');
    my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr);
    my $s_start = $slice->start;
    my $s_end = $slice->end;
    my $tmp_start = $s_start;
    my $tmp_end = -1;
    my $tcnt=0;
    while ($tmp_start<=$s_end) {
        $tmp_end = $tmp_start + 1000;
        $slice = $slice_adaptor->fetch_by_region('chromosome', $chr, $tmp_start, $tmp_end);
        my $vfs = $vs->get_all_VariationFeatures_by_Slice($slice);
        my @tmp = @{$vfs};
        $tcnt+=($#tmp+1);
        $tmp_start = $tmp_end+1;
    }
    print "$chr .. $tcnt\n";
}





More information about the Dev mailing list