[ensembl-dev] 1KGp3 markers under-annotated for human chr 22 in Ensembl API v96?

Helen Schuilenburg helens at ebi.ac.uk
Fri Oct 4 16:30:07 BST 2019


Hi Andrew

Thank you for the list.  We will correct chr22 in the next release. 
Thank you for reporting this.

The issue with the database on ensembldb.ensembl.org is fixed. Counts 
should be returned if you run your script.

Regards

Helen

On 01/10/2019 10:22, andrew126 at mac.com wrote:
> Hi Helen,
>
> Thanks for the response.
>
> I ran the numbers for all autosomes.  Chr 9 seems to show a bit of a deficit relative to its neighbors, but nothing as obvious as chr 22.
>
> 1 .. 6123396
> 2 .. 6873996
> 3 .. 5642383
> 4 .. 5585136
> 5 .. 5110655
> 6 .. 4842774
> 7 .. 4559179
> 8 .. 4465690
> 9 .. 3433163
> 10 .. 3847658
> 11 .. 3912747
> 12 .. 3647900
> 13 .. 2772148
> 14 .. 2556819
> 15 .. 2345835
> 16 .. 2609023
> 17 .. 2235443
> 18 .. 2192269
> 19 .. 1752444
> 20 .. 1747663
> 21 .. 1066903
> 22 .. 3487
>
> Thanks for the help.
>
> Best,
>
> Andrew
>
>
>
>
>> On Sep 30, 2019, at 10:40 AM, Helen Schuilenburg <helens at ebi.ac.uk> wrote:
>>
>> Hi Andrew
>>
>> Thank you for reporting these issues.
>>
>> We are looking into why the marker counts for chr22 are low and will correct this in a future release.
>>
>> We are also investigating why the ensembldb.ensembl.org is returning 0 counts.
>>
>> Regards
>>
>> Helen
>>
>> On 30/09/2019 06:25, andrew126 at mac.com wrote:
>>> Hi,
>>>
>>> We have a local install of Ensembl 96 (API and human db), and the below perl script returns the following marker counts for 1KGp3:EUR, chromosomes 20, 21, and 22:
>>>
>>> 	20 .. 1747663
>>> 	21 .. 1066903
>>> 	22 .. 3487
>>>
>>> We see the exact same counts regardless of if we use locally downloaded 1KG vcfs or the remote vcfs (as config'd in ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json).
>>>
>>> I can't find a separate resource with summary numbers, but chr 22 being three orders of magnitude below chr 20 or 21 seems unexpected.  Is that value correct?
>>>
>>> Further, gunzip'ing the ALL.chr###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz files for chroms 20-22 gives more expected consistency via 'cut -f3 FILE | grep rs | grep -v "#" | wc -l':
>>>
>>> 	20: 1,811,392
>>> 	21: 1,100,065
>>> 	22: 1,100,429
>>>
>>> All of the chrom 22 variants are QUAL=100, FILTER=PASS.
>>>
>>> Any help understanding why the chr-22 1KGp3:EUR marker count is so low would be greatly appreciated.  I'm not sure if this impacts other chromosomes or not.
>>>
>>> Trying to reproduce the problem/result against remote Ensembl 96 (ensembldb.ensembl.org) has exposed another issue:  the same script returns 0 counts for each chromosome, regardless of if is using remote or local 1KG vcfs.  Further, no errors are reported/thrown.
>>>
>>> 	20 .. 0
>>> 	21 .. 0
>>> 	22 .. 0
>>>
>>> Can you also clarify why we see that behavior, given that the script reports values against a local Ensembl 96 install?
>>>
>>> Many thanks for any help/information.
>>>
>>> Best,
>>>
>>> Andrew
>>>
>>>
>>> Here is the perl script:
>>>
>>> use strict;
>>> $|=1;
>>> use Bio::EnsEMBL::Registry;
>>> use Bio::EnsEMBL::ApiVersion;
>>>
>>> my $registry = 'Bio::EnsEMBL::Registry';
>>> $registry->load_all();
>>> $registry->load_registry_from_db(
>>>      -host => 'ensembldb.ensembl.org',
>>>      -user => 'anonymous'
>>>      );
>>>
>>> my $vs_adaptor = $registry->get_adaptor('human','variation','variationset');
>>> $vs_adaptor->db->use_vcf(1);
>>> my $vs = $vs_adaptor->fetch_by_short_name('1kg_3_eur');
>>>
>>> foreach my $chr (20..22) {
>>>      my $slice_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'slice');
>>>      my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr);
>>>      my $s_start = $slice->start;
>>>      my $s_end = $slice->end;
>>>      my $tmp_start = $s_start;
>>>      my $tmp_end = -1;
>>>      my $tcnt=0;
>>>      while ($tmp_start<=$s_end) {
>>>          $tmp_end = $tmp_start + 1000;
>>>          $slice = $slice_adaptor->fetch_by_region('chromosome', $chr, $tmp_start, $tmp_end);
>>>          my $vfs = $vs->get_all_VariationFeatures_by_Slice($slice);
>>>          my @tmp = @{$vfs};
>>>          $tcnt+=($#tmp+1);
>>>          $tmp_start = $tmp_end+1;
>>>      }
>>>      print "$chr .. $tcnt\n";
>>> }
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
>>> Ensembl Blog: http://www.ensembl.info/
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
>> Ensembl Blog: http://www.ensembl.info/
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
> Ensembl Blog: http://www.ensembl.info/





More information about the Dev mailing list