[ensembl-dev] bgzf_read_block message in stderr

andrew126 at mac.com andrew126 at mac.com
Thu Dec 7 11:59:31 GMT 2017


Hi Will,

Thanks for the response.

Yes, downloading the vcfs locally seems to resolve the issue, but I guess my main point/request is that if an API call suffers from truncated data, it seems that event should be fatal to any API script, and not just generate a "you're probably missing data" message to stderr and keep on running.  Or at least it would be great to have a way for that behavior to be an option.

Thanks again.

Best,

Andrew

> On Dec 7, 2017, at 3:18 AM, William McLaren <wm2 at ebi.ac.uk> wrote:
> 
> Hi Andrew,
> 
> I’m not able to recreate the issue you’re experiencing, but my guess is that there’s a network issue causing incomplete data to be transmitted. This particular part of the API uses a perl library that interfaces with VCFs via htslib (https://github.com/samtools/htslib <https://github.com/samtools/htslib>); we hope to able to address this and a couple of other developments to this library in the new year.
> 
> For now, a possible solution is for you to download the VCFs to your local machine. You will find them listed in the JSON configuration file found in ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json:
> 
> $ grep vcf ~/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json | grep GRCh38
>       "filename_template": "ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/ALL.chr###CHR###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz <ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/ALL.chr###CHR###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz>",
>       "filename_template": "ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz <ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz>",
>       "filename_template": "ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz <ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz>",
>       "filename_template": "ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/TOPMED_GRCh38.vcf.gz <ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/TOPMED_GRCh38.vcf.gz>",
>       "filename_template": "ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/UK10K_COHORT.20160215.sites.GRCh38.vcf.gz <ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/UK10K_COHORT.20160215.sites.GRCh38.vcf.gz>”,
> 
> Note that for the 1000 genomes files there is one VCF per chromosome. You’ll need to download the listed files as well as the tabix index [file].tbi .
> 
> Once they are downloaded you can edit the JSON file in place to point to the paths to which you have downloaded the files.
> 
> You should then be able to run your API scripts as before.
> 
> Regards
> 
> Will
> 
> On 5 December 2017 at 12:29:55 pm, andrew126 at mac.com <mailto:andrew126 at mac.com> (andrew126 at mac.com <mailto:andrew126 at mac.com>) wrote:
> 
>> Hi Will,
>> 
>> Thanks for the response.
>> 
>> Extra information:
>> 
>> *There is no marker that reliably recreates the problem, but rs887204533 is one which has previously shown the issue.  My current work-around is to have a master script run a sub-script job which collects allele frequencies for a small chunk of the region of interest .. upon the completion of the sub-script, the master script scans the sub-script's stderr for the bgzf message .. if found, the sub-script runs its job over again.  This repeats until no bgzf message is found.  Rerunning the sub-script a small number of times on the same region/chunk always seems to create a run with no bgzf message, fortunately.
>> 
>> *These vcfs are being accessed remotely (I had no idea the vcfs carried frequency information (vrs. just genotypes)).
>> 
>> Below shows the recovered allele frequencies for rs887204533 without/with the bgzf message.
>> Below that is a simplified version of the sub-script.
>> 
>> Please let me know if anything else would be helpful.
>> 
>> Thanks for the assistance.
>> 
>> Best,
>> 
>> Andrew
>> 
>> 
>> When no bgzf message found:
>> 
>>         rs887204533
>>         G/A
>>         Chromosome: 16
>>         Start-End: 5985270-5985270
>>         Consequences: non_coding_transcript_variant,intron_variant
>>         Allele Frequencies:
>>                 Allele A has frequency: 3.4343e-005 in population TOPMed.
>>                 Allele G has frequency: 0.999965657 in population TOPMed.
>>                 Allele A has frequency: 0.00000e+00 in population gnomADg:AMR.
>>                 Allele G has frequency: 1 in population gnomADg:AMR.
>>                 Allele A has frequency: 0.00000e+00 in population gnomADg:NFE.
>>                 Allele G has frequency: 1 in population gnomADg:NFE.
>>                 Allele A has frequency: 1.14837e-04 in population gnomADg:AFR.
>>                 Allele G has frequency: 0.999885163 in population gnomADg:AFR.
>>                 Allele A has frequency: 0.00000e+00 in population gnomADg:FIN.
>>                 Allele G has frequency: 1 in population gnomADg:FIN.
>>                 Allele A has frequency: 0.00000e+00 in population gnomADg:EAS.
>>                 Allele G has frequency: 1 in population gnomADg:EAS.
>>                 Allele A has frequency: 0.00000e+00 in population gnomADg:OTH.
>>                 Allele G has frequency: 1 in population gnomADg:OTH.
>>                 Allele A has frequency: 3.24570e-05 in population gnomADg:ALL.
>>                 Allele G has frequency: 0.999967543 in population gnomADg:ALL.
>>                 Allele A has frequency: 0.00000e+00 in population gnomADg:ASJ.
>>                 Allele G has frequency: 1 in population gnomADg:ASJ.
>> 
>> When bgzf message found:
>> 
>>         rs887204533
>>         G/A
>>         Chromosome: 16
>>         Start-End: 5985270-5985270
>>         Consequences: non_coding_transcript_variant,intron_variant
>>         Allele Frequencies:
>>                 Allele A has frequency: 3.4343e-005 in population TOPMed.
>>                 Allele G has frequency: 0.999965657 in population TOPMed.
>> 
>> 
>> 
>> use strict;
>> $|=1;
>> use Bio::EnsEMBL::Registry;
>> use Bio::EnsEMBL::ApiVersion;
>> 
>> my $registry = 'Bio::EnsEMBL::Registry';
>> $registry->load_all();
>> $registry->load_registry_from_db(
>>     -host => 'erismysql01.eisai.local', # alternatively 'useastdb.ensembl.org <http://useastdb.ensembl.org/>'
>>     -user => 'ensembl_ro',
>>     -pass => 'ensembl_ro'
>>     );
>> 
>> $registry->set_reconnect_when_lost(1);
>> 
>> my $chr = $ARGV[0];
>> my $lw = $ARGV[1];
>> my $hg = $ARGV[2];
>> 
>> my $gene_adaptor = $registry->get_adaptor('homo_sapiens','core','gene');
>> my $slice_adaptor = $registry->get_adaptor('homo_sapiens','core','slice');
>> my $vf_adaptor = $registry->get_adaptor( 'human', 'variation', 'variationfeature' );
>> $vf_adaptor->db->use_vcf(1);
>> 
>> my $slice2 = $slice_adaptor->fetch_by_region( 'chromosome', "$chr", $lw, $hg);
>> my $varfs = $vf_adaptor->fetch_all_by_Slice($slice2); #return ALL variationfeatures defined in $slice
>> foreach my $vf (@{$varfs}){
>>     print "\t".$vf->variation_name(),"\n"; # print rsID
>>     print "\t".$vf->allele_string(),"\n"; # print alleles
>>     print "\tChromosome: ".$slice2->seq_region_name."\n";
>>     print "\tStart-End: ".$vf->seq_region_start."-".$vf->seq_region_end."\n";
>>     print "\tConsequences: ".join(",",@{$vf->consequence_type()}),"\n"; # print consequenceType
>>     my $akc = 1;
>>     print "\tAllele Frequencies:\n";
>>     my $alleles = $vf->variation->get_all_Alleles();
>>     foreach my $allele (@{$alleles}) {
>>         next unless (defined $allele->population);
>>         my $allele_string   = $allele->allele;
>>         my $frequency       = $allele->frequency || 'NA';
>>         my $population_name = $allele->population->name;
>>         printf("\t\tAllele %s has frequency: %s in population %s.\n", $allele_string, $frequency, $population_name);
>>     }
>>     print "\n";
>> }
>> 
>> 
>> 
>> 
>> 
>>> On Dec 5, 2017, at 1:41 AM, William McLaren <wm2 at ebi.ac.uk <mailto:wm2 at ebi.ac.uk>> wrote:
>>> 
>>> Hi Andrew,
>>> 
>>> A few of pieces of information would be useful:
>>> 
>>> - a particular variant or location you’re querying that recreates the issue
>>> - the code and context from which you’re making the call
>>> - whether you’re using remotely accessed VCFs or locally downloaded (if you don’t know then it’s likely the VCFs are being accessed remotely)
>>> 
>>> Regards
>>> 
>>> Will McLaren
>>> Ensembl Variation
>>> 
>>> 
>>> On 4 December 2017 at 5:31:07 pm, andrew126 at mac.com <mailto:andrew126 at mac.com> (andrew126 at mac.com <mailto:andrew126 at mac.com>) wrote:
>>> 
>>>> Hi, 
>>>> 
>>>> In using API 90, when recovering all available allele frequencies for a particular variation feature, occassionally this message is shown in stderr: 
>>>> 
>>>> [W::bgzf_read_block] EOF marker is absent. The input is probably truncated 
>>>> 
>>>> That message seems to correlate with some allele frequencies not being found/reported by my script (in my case, the gnomAD frequencies). 
>>>> 
>>>> That is, when that message is not seen in stderr, I recover gnomAD frequencies for a particular variation feature; but when the message is seen, then it can look like that same variation feature doesn't have any gnomAD frequencies (even though it does). 
>>>> 
>>>> Is there any way to make such an event fatal to any API script? I'd much rather have my script fail/exit under such errors, than to not have it fail and risk believing such allele frequencies don't exist. 
>>>> 
>>>> Thanks. 
>>>> 
>>>> Please let me know if any other information would be useful. 
>>>> 
>>>> Best, 
>>>> 
>>>> Andrew 
>>>> 
>>>> 
>>>> _______________________________________________ 
>>>> Dev mailing list  Dev at ensembl.org <mailto:Dev at ensembl.org> 
>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev <http://lists.ensembl.org/mailman/listinfo/dev> 
>>>> Ensembl Blog: http://www.ensembl.info/ <http://www.ensembl.info/> 
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org <mailto:Dev at ensembl.org>
>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev <http://lists.ensembl.org/mailman/listinfo/dev>
>>> Ensembl Blog: http://www.ensembl.info/ <http://www.ensembl.info/>
>> _______________________________________________ 
>> Dev mailing list  Dev at ensembl.org <mailto:Dev at ensembl.org> 
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev <http://lists.ensembl.org/mailman/listinfo/dev> 
>> Ensembl Blog: http://www.ensembl.info/ <http://www.ensembl.info/> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org <mailto:Dev at ensembl.org>
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev <http://lists.ensembl.org/mailman/listinfo/dev>
> Ensembl Blog: http://www.ensembl.info/ <http://www.ensembl.info/>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20171207/735dfa83/attachment.html>


More information about the Dev mailing list