[ensembl-dev] 1KGp3 markers under-annotated for human chr 22 in Ensembl API v96?

Helen Schuilenburg helens at ebi.ac.uk
Mon Sep 30 18:40:20 BST 2019


Hi Andrew

Thank you for reporting these issues.

We are looking into why the marker counts for chr22 are low and will 
correct this in a future release.

We are also investigating why the ensembldb.ensembl.org is returning 0 
counts.

Regards

Helen

On 30/09/2019 06:25, andrew126 at mac.com wrote:
> Hi,
>
> We have a local install of Ensembl 96 (API and human db), and the below perl script returns the following marker counts for 1KGp3:EUR, chromosomes 20, 21, and 22:
>
> 	20 .. 1747663
> 	21 .. 1066903
> 	22 .. 3487
>
> We see the exact same counts regardless of if we use locally downloaded 1KG vcfs or the remote vcfs (as config'd in ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json).
>
> I can't find a separate resource with summary numbers, but chr 22 being three orders of magnitude below chr 20 or 21 seems unexpected.  Is that value correct?
>
> Further, gunzip'ing the ALL.chr###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz files for chroms 20-22 gives more expected consistency via 'cut -f3 FILE | grep rs | grep -v "#" | wc -l':
>
> 	20: 1,811,392
> 	21: 1,100,065
> 	22: 1,100,429
>
> All of the chrom 22 variants are QUAL=100, FILTER=PASS.
>
> Any help understanding why the chr-22 1KGp3:EUR marker count is so low would be greatly appreciated.  I'm not sure if this impacts other chromosomes or not.
>
> Trying to reproduce the problem/result against remote Ensembl 96 (ensembldb.ensembl.org) has exposed another issue:  the same script returns 0 counts for each chromosome, regardless of if is using remote or local 1KG vcfs.  Further, no errors are reported/thrown.
>
> 	20 .. 0
> 	21 .. 0
> 	22 .. 0
>
> Can you also clarify why we see that behavior, given that the script reports values against a local Ensembl 96 install?
>
> Many thanks for any help/information.
>
> Best,
>
> Andrew
>
>
> Here is the perl script:
>
> use strict;
> $|=1;
> use Bio::EnsEMBL::Registry;
> use Bio::EnsEMBL::ApiVersion;
>
> my $registry = 'Bio::EnsEMBL::Registry';
> $registry->load_all();
> $registry->load_registry_from_db(
>      -host => 'ensembldb.ensembl.org',
>      -user => 'anonymous'
>      );
>
> my $vs_adaptor = $registry->get_adaptor('human','variation','variationset');
> $vs_adaptor->db->use_vcf(1);
> my $vs = $vs_adaptor->fetch_by_short_name('1kg_3_eur');
>
> foreach my $chr (20..22) {
>      my $slice_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'slice');
>      my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr);
>      my $s_start = $slice->start;
>      my $s_end = $slice->end;
>      my $tmp_start = $s_start;
>      my $tmp_end = -1;
>      my $tcnt=0;
>      while ($tmp_start<=$s_end) {
>          $tmp_end = $tmp_start + 1000;
>          $slice = $slice_adaptor->fetch_by_region('chromosome', $chr, $tmp_start, $tmp_end);
>          my $vfs = $vs->get_all_VariationFeatures_by_Slice($slice);
>          my @tmp = @{$vfs};
>          $tcnt+=($#tmp+1);
>          $tmp_start = $tmp_end+1;
>      }
>      print "$chr .. $tcnt\n";
> }
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
> Ensembl Blog: http://www.ensembl.info/





More information about the Dev mailing list