[ensembl-dev] 1KGp3 markers under-annotated for human chr 22 in Ensembl API v96?
andrew126 at mac.com
andrew126 at mac.com
Mon Sep 30 06:25:33 BST 2019
Hi,
We have a local install of Ensembl 96 (API and human db), and the below perl script returns the following marker counts for 1KGp3:EUR, chromosomes 20, 21, and 22:
20 .. 1747663
21 .. 1066903
22 .. 3487
We see the exact same counts regardless of if we use locally downloaded 1KG vcfs or the remote vcfs (as config'd in ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json).
I can't find a separate resource with summary numbers, but chr 22 being three orders of magnitude below chr 20 or 21 seems unexpected. Is that value correct?
Further, gunzip'ing the ALL.chr###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz files for chroms 20-22 gives more expected consistency via 'cut -f3 FILE | grep rs | grep -v "#" | wc -l':
20: 1,811,392
21: 1,100,065
22: 1,100,429
All of the chrom 22 variants are QUAL=100, FILTER=PASS.
Any help understanding why the chr-22 1KGp3:EUR marker count is so low would be greatly appreciated. I'm not sure if this impacts other chromosomes or not.
Trying to reproduce the problem/result against remote Ensembl 96 (ensembldb.ensembl.org) has exposed another issue: the same script returns 0 counts for each chromosome, regardless of if is using remote or local 1KG vcfs. Further, no errors are reported/thrown.
20 .. 0
21 .. 0
22 .. 0
Can you also clarify why we see that behavior, given that the script reports values against a local Ensembl 96 install?
Many thanks for any help/information.
Best,
Andrew
Here is the perl script:
use strict;
$|=1;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::ApiVersion;
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_all();
$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org',
-user => 'anonymous'
);
my $vs_adaptor = $registry->get_adaptor('human','variation','variationset');
$vs_adaptor->db->use_vcf(1);
my $vs = $vs_adaptor->fetch_by_short_name('1kg_3_eur');
foreach my $chr (20..22) {
my $slice_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'slice');
my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr);
my $s_start = $slice->start;
my $s_end = $slice->end;
my $tmp_start = $s_start;
my $tmp_end = -1;
my $tcnt=0;
while ($tmp_start<=$s_end) {
$tmp_end = $tmp_start + 1000;
$slice = $slice_adaptor->fetch_by_region('chromosome', $chr, $tmp_start, $tmp_end);
my $vfs = $vs->get_all_VariationFeatures_by_Slice($slice);
my @tmp = @{$vfs};
$tcnt+=($#tmp+1);
$tmp_start = $tmp_end+1;
}
print "$chr .. $tcnt\n";
}
More information about the Dev
mailing list