[ensembl-dev] bgzf_read_block message in stderr

William McLaren wm2 at ebi.ac.uk
Thu Dec 7 11:18:47 GMT 2017


Hi Andrew,

I’m not able to recreate the issue you’re experiencing, but my guess is that there’s a network issue causing incomplete data to be transmitted. This particular part of the API uses a perl library that interfaces with VCFs via htslib (https://github.com/samtools/htslib); we hope to able to address this and a couple of other developments to this library in the new year.

For now, a possible solution is for you to download the VCFs to your local machine. You will find them listed in the JSON configuration file found in ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json:

$ grep vcf ~/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json | grep GRCh38
      "filename_template": "ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/ALL.chr###CHR###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz",
      "filename_template": "ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/gnomad.genomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz",
      "filename_template": "ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz",
      "filename_template": "ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/TOPMED_GRCh38.vcf.gz",
      "filename_template": "ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/UK10K_COHORT.20160215.sites.GRCh38.vcf.gz”,

Note that for the 1000 genomes files there is one VCF per chromosome. You’ll need to download the listed files as well as the tabix index [file].tbi .

Once they are downloaded you can edit the JSON file in place to point to the paths to which you have downloaded the files.

You should then be able to run your API scripts as before.

Regards

Will

On 5 December 2017 at 12:29:55 pm, andrew126 at mac.com (andrew126 at mac.com) wrote:

Hi Will,

Thanks for the response.

Extra information:

*There is no marker that reliably recreates the problem, but rs887204533 is one which has previously shown the issue.  My current work-around is to have a master script run a sub-script job which collects allele frequencies for a small chunk of the region of interest .. upon the completion of the sub-script, the master script scans the sub-script's stderr for the bgzf message .. if found, the sub-script runs its job over again.  This repeats until no bgzf message is found.  Rerunning the sub-script a small number of times on the same region/chunk always seems to create a run with no bgzf message, fortunately.

*These vcfs are being accessed remotely (I had no idea the vcfs carried frequency information (vrs. just genotypes)).

Below shows the recovered allele frequencies for rs887204533 without/with the bgzf message.
Below that is a simplified version of the sub-script.

Please let me know if anything else would be helpful.

Thanks for the assistance.

Best,

Andrew


When no bgzf message found:

        rs887204533
        G/A
        Chromosome: 16
        Start-End: 5985270-5985270
        Consequences: non_coding_transcript_variant,intron_variant
        Allele Frequencies:
                Allele A has frequency: 3.4343e-005 in population TOPMed.
                Allele G has frequency: 0.999965657 in population TOPMed.
                Allele A has frequency: 0.00000e+00 in population gnomADg:AMR.
                Allele G has frequency: 1 in population gnomADg:AMR.
                Allele A has frequency: 0.00000e+00 in population gnomADg:NFE.
                Allele G has frequency: 1 in population gnomADg:NFE.
                Allele A has frequency: 1.14837e-04 in population gnomADg:AFR.
                Allele G has frequency: 0.999885163 in population gnomADg:AFR.
                Allele A has frequency: 0.00000e+00 in population gnomADg:FIN.
                Allele G has frequency: 1 in population gnomADg:FIN.
                Allele A has frequency: 0.00000e+00 in population gnomADg:EAS.
                Allele G has frequency: 1 in population gnomADg:EAS.
                Allele A has frequency: 0.00000e+00 in population gnomADg:OTH.
                Allele G has frequency: 1 in population gnomADg:OTH.
                Allele A has frequency: 3.24570e-05 in population gnomADg:ALL.
                Allele G has frequency: 0.999967543 in population gnomADg:ALL.
                Allele A has frequency: 0.00000e+00 in population gnomADg:ASJ.
                Allele G has frequency: 1 in population gnomADg:ASJ.

When bgzf message found:

        rs887204533
        G/A
        Chromosome: 16
        Start-End: 5985270-5985270
        Consequences: non_coding_transcript_variant,intron_variant
        Allele Frequencies:
                Allele A has frequency: 3.4343e-005 in population TOPMed.
                Allele G has frequency: 0.999965657 in population TOPMed.



use strict;
$|=1;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::ApiVersion;

my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_all();
$registry->load_registry_from_db(
    -host => 'erismysql01.eisai.local', # alternatively 'useastdb.ensembl.org'
    -user => 'ensembl_ro',
    -pass => 'ensembl_ro'
    );

$registry->set_reconnect_when_lost(1);

my $chr = $ARGV[0];
my $lw = $ARGV[1];
my $hg = $ARGV[2];

my $gene_adaptor = $registry->get_adaptor('homo_sapiens','core','gene');
my $slice_adaptor = $registry->get_adaptor('homo_sapiens','core','slice');
my $vf_adaptor = $registry->get_adaptor( 'human', 'variation', 'variationfeature' );
$vf_adaptor->db->use_vcf(1);

my $slice2 = $slice_adaptor->fetch_by_region( 'chromosome', "$chr", $lw, $hg);
my $varfs = $vf_adaptor->fetch_all_by_Slice($slice2); #return ALL variationfeatures defined in $slice
foreach my $vf (@{$varfs}){
    print "\t".$vf->variation_name(),"\n"; # print rsID
    print "\t".$vf->allele_string(),"\n"; # print alleles
    print "\tChromosome: ".$slice2->seq_region_name."\n";
    print "\tStart-End: ".$vf->seq_region_start."-".$vf->seq_region_end."\n";
    print "\tConsequences: ".join(",",@{$vf->consequence_type()}),"\n"; # print consequenceType
    my $akc = 1;
    print "\tAllele Frequencies:\n";
    my $alleles = $vf->variation->get_all_Alleles();
    foreach my $allele (@{$alleles}) {
        next unless (defined $allele->population);
        my $allele_string   = $allele->allele;
        my $frequency       = $allele->frequency || 'NA';
        my $population_name = $allele->population->name;
        printf("\t\tAllele %s has frequency: %s in population %s.\n", $allele_string, $frequency, $population_name);
    }
    print "\n";
}





On Dec 5, 2017, at 1:41 AM, William McLaren <wm2 at ebi.ac.uk> wrote:

Hi Andrew,

A few of pieces of information would be useful:

- a particular variant or location you’re querying that recreates the issue
- the code and context from which you’re making the call
- whether you’re using remotely accessed VCFs or locally downloaded (if you don’t know then it’s likely the VCFs are being accessed remotely)

Regards

Will McLaren
Ensembl Variation


On 4 December 2017 at 5:31:07 pm, andrew126 at mac.com (andrew126 at mac.com) wrote:

Hi, 

In using API 90, when recovering all available allele frequencies for a particular variation feature, occassionally this message is shown in stderr: 

[W::bgzf_read_block] EOF marker is absent. The input is probably truncated 

That message seems to correlate with some allele frequencies not being found/reported by my script (in my case, the gnomAD frequencies). 

That is, when that message is not seen in stderr, I recover gnomAD frequencies for a particular variation feature; but when the message is seen, then it can look like that same variation feature doesn't have any gnomAD frequencies (even though it does). 

Is there any way to make such an event fatal to any API script? I'd much rather have my script fail/exit under such errors, than to not have it fail and risk believing such allele frequencies don't exist. 

Thanks. 

Please let me know if any other information would be useful. 

Best, 

Andrew 


_______________________________________________ 
Dev mailing list  Dev at ensembl.org 
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev 
Ensembl Blog: http://www.ensembl.info/ 
_______________________________________________
Dev mailing list    Dev at ensembl.org
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/

_______________________________________________  
Dev mailing list Dev at ensembl.org  
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev  
Ensembl Blog: http://www.ensembl.info/  
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20171207/c46fff5b/attachment.html>


More information about the Dev mailing list