[ensembl-dev] Variant Effect Predictor not using HTS to read gzipped fasta - SOLVED

Hollizeck, Sebastian Sebastian.Hollizeck at med.uni-muenchen.de
Wed Dec 7 15:17:33 GMT 2016


Hi,

thank you so much, this pointed me to the right direction.
The module did not change the LD_LIBRARY_PATH to enable the htslib.so.1 to be found

Now it works like a charm!
Great help


Sebastian Hollizeck
Bioinformatics

Dr. von Hauner Children's Hospital
Kubus Research Center
Lindwurmstraße 2a
D-80337 München
phone +49 89-4400-57488
fax +49 89-4400-57979


----------------------------------------------------------------------

Message: 1
Date: Wed, 7 Dec 2016 14:45:04 +0000
From: Will McLaren <wm2 at ebi.ac.uk>
Subject: Re: [ensembl-dev] Variant Effect Predictor not using HTS to
        read gzipped fata
To: Ensembl developers list <dev at ensembl.org>
Message-ID:
        <CAMVEDX3g40CXj8HT-8etD0aZr3TDVR-rBibrv6ozwouS9f_cCA at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi Sebastian,

For some reason or other it looks as though your installation of
Bio::DB::HTS is not working. You can double check by executing:

$ perl -MBio::DB::HTS -e1

This should give you an error if the module is not installed properly, and
nothing if it is working OK.

In the case of the error, I'd try setting up Bio::DB::HTS again according
to the README here: http://search.cpan.org/dist/Bio-DB-HTS/

Having said this, it may be that you can avoid having to use it. From your
command it looks as though you are not using a VEP flag that requires
reading sequence data from a FASTA file (the role of Bio::DB::HTS) - those
that require this are --hgvs and --check_ref.

You may therefore try one of two solutions:

1) unarchive the bgzipped FASTA file that VEP is finding, something like:

$ gzip -d /opt/modules/i12g/ensembl-
tools/85/cachedir/homo_sapiens/85_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

This will allow the code to use Bio::DB::Fasta to index the unarchived
file, meaning that if you wish to use --hgvs, for example, it should work
OK.

2) move the FASTA file so that it is not picked up by VEP for indexing:

$ mv /opt/modules/i12g/ensembl-
tools/85/cachedir/homo_sapiens/85_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
/opt/modules/i12g/ensembl-
tools/85/cachedir/homo_sapiens/85_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz.bak

You could then move it back again in future if you do require it.

Hope that solves the issue for you!

Will McLaren
Ensembl Variation

On 7 December 2016 at 13:12, Hollizeck, Sebastian <
Sebastian.Hollizeck at med.uni-muenchen.de> wrote:

> Hi,
>
> I installed the standalone script in my module system for usage on our
> server, but i cant seem to get it working with the cache.
> The script works with just using the --database option, but when i use the
> --cache i get the following error
>
> MSG: ERROR: Cannot index bgzipped FASTA file with Bio::DB::Fasta
>
> this is the full output with verbose
>
> [hollizeck at kkf0f6ee kleinPipeline]$ variant_effect_predictor.pl
> --check_existing --gmaf --maf_1kg --maf_esp --maf_exac  --pubmed
> --regulatory --species homo_sapiens --port 3337 --buffer_size 40000 --cache
> --minimal  --sift b --polyphen b --ccds --symbol --biotype --gene_phenotype
> --variant_class  --fork 6 --force_overwrite --filter_common --dir_cache
> /opt/modules/i12g/ensembl-tools/85/cachedir -i
> /archive/sample_data/scn281/scn281pa_fa.recalibrated_variants.vcf.gz
> --vcf -o /archive/sample_data/scn281/scn281pa_fa.recalibrated_variants_vep.vcf.gz
> -v#----------------------------------#
> # ENSEMBL VARIANT EFFECT PREDICTOR #
> #----------------------------------#
>
> version 85
>
> By Will McLaren (wm2 at ebi.ac.uk)
>
> Configuration options:
>
> biotype            1
> buffer_size        40000
> cache              1
> ccds               1
> check_existing     1
> core_type          core
> dir                /opt/modules/i12g/ensembl-tools/85/cachedir
> dir_cache          /opt/modules/i12g/ensembl-tools/85/cachedir
> dir_plugins        /home/hollizeck/.vep/Plugins
> filter_common      1
> force_overwrite    1
> fork               6
> gene_phenotype     1
> gmaf               1
> host               ensembldb.ensembl.org
> input_file         /archive/sample_data/scn281/scn281pa_fa.recalibrated_
> variants.vcf.gz
> maf_1kg            1
> maf_esp            1
> maf_exac           1
> minimal            1
> numbers            1
> output_file        /archive/sample_data/scn281/scn281pa_fa.recalibrated_
> variants_vep.vcf.gz
> polyphen           b
> port               3337
> pubmed             1
> regulatory         1
> sift               b
> species            homo_sapiens
> stats              HASH(0x58ccbc0)
> symbol             1
> variant_class      1
> vcf                1
> verbose            1
>
> --------------------
>
> Will only load v85 databases
> Species 'homo_sapiens' loaded from database 'homo_sapiens_core_85_37'
> Species 'homo_sapiens' loaded from database 'homo_sapiens_cdna_85_37'
> Species 'homo_sapiens' loaded from database 'homo_sapiens_vega_85_37'
> Species 'homo_sapiens' loaded from database 'homo_sapiens_otherfeatures_
> 85_37'
> Species 'homo_sapiens' loaded from database 'homo_sapiens_rnaseq_85_37'
> homo_sapiens_variation_85_37 loaded
> homo_sapiens_funcgen_85_37 loaded
> No ancestral database found
> No ontology database found
> No taxonomy database found
> No ensembl_metadata database found
> No production database or adaptor found
> 2016-12-07 14:04:31 - Connected to core version 85 database and variation
> version 85 database
> 2016-12-07 14:04:31 - Read existing cache info
> 2016-12-07 14:04:32 - Auto-detected FASTA file in cache directory
>
> -------------------- EXCEPTION --------------------
> MSG: ERROR: Cannot index bgzipped FASTA file with Bio::DB::Fasta
>
> STACK Bio::EnsEMBL::Variation::Utils::FastaSequence::setup_fasta
> /opt/modules/i12g/ensembl-tools/85/modules/Bio/EnsEMBL/
> Variation/Utils/FastaSequence.pm:194
> STACK main::configure /opt/modules/i12g/ensembl-tools/85/bin/
> variant_effect_predictor.pl:835
> STACK toplevel /opt/modules/i12g/ensembl-tools/85/bin/variant_effect_
> predictor.pl:146
> Date (localtime)    = Wed Dec  7 14:04:32 2016
> Ensembl API version = 85
> ---------------------------------------------------
>
> This looks like the program does not select the Bio:DB:HTS module but it
> is present and also in the perl path
>
> [hollizeck at kkf0f6ee kleinPipeline]$ ll /opt/modules/i12g/ensembl-
> tools/85/modules/Bio/DB/HTS.pm
> -rw-rw-rw- 1 root root 77700 Sep 21 18:44 /opt/modules/i12g/ensembl-
> tools/85/modules/Bio/DB/HTS.pm
> [hollizeck at kkf0f6ee kleinPipeline]$ echo $PERL5LIB
> /opt/modules/i12g/perl/5.24.0/perl5lib/lib/perl5:/opt/
> modules/i12g/ensembl-tools/85/modules
>
> I do not know, if I even look at the right spot.
> Can you please help me to debug this further?
>
> Is there any way to specify which plugin to use?
>
>
> Sebastian Hollizeck
> Bioinformatics
>
> Dr. von Hauner Children's Hospital
> Kubus Research Center
> Lindwurmstra?e 2a
> D-80337 M?nchen
> phone +49 89-4400-57488
> fax +49 89-4400-57979
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ensembl.org/pipermail/dev/attachments/20161207/6069b85d/attachment-0001.htm>

------------------------------

Message: 2
Date: Wed, 7 Dec 2016 15:08:47 +0000
From: Helen Sparrow <hsparrow at ebi.ac.uk>
Subject: Re: [ensembl-dev] Fwd: Ensembl API - pseudoautosomal regions
To: Ensembl developers list <dev at ensembl.org>
Message-ID: <BDDD502C-FE35-4C71-A289-1CEAB1826E05 at ebi.ac.uk>
Content-Type: text/plain; charset="utf-8"

Dear Julia,

I have just responded to your similar query on Biostars: https://www.biostars.org/p/225802/#226037 <https://www.biostars.org/p/225802/#226037>

But to directly answer your three questions:

-How come the positions are identical for some of the genes?
They have identical coordinates in some cases, as this is how the PARs are mapped between X & Y:
Y:10001-2781479 to X:10001-2781479
and
Y:56887903-57217415 to X:155701383-156030895

-Why do I get these duplicate gene entries?
You are seeing duplicates as these genes occur on both the X & Y chromosomes, in the PARs.

-How can I prevent this?
If you do not want to see these duplicate genes, but would rather see one copy, for example on the X chromosome, you can run your query to only search the Y chromosome outside of these coordinates which map to the X.


Let me know if you have further questions.

Best wishes,

Helen
_ _

Helen Sparrow
Ensembl Outreach Officer
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton, Cambridge
CB10 1SD
UK




> On 7 Dec 2016, at 13:33, Julia S?llner <julia.f.soellner at gmail.com> wrote:
>
> Dear Ensembl developers,
>
> I access Ensembl data via the Perl API and retrieve information on genes, transcripts etc. I have made the observation that if I get data from the database's gene table there are genes which occur twice, once on the X and once on the Y chromosome. This affects 45 human genes, for 34/45 genes the start and end positions on X and Y are identical.
>
> Two examples:
>
> geneID        biotype chromosome      start   end
> ENSG00000002586       protein_coding  X       2691179 2741309
> ENSG00000002586       protein_coding  Y       2691179 2741309
> ENSG00000124333       protein_coding  X       155881293       155943769
> ENSG00000124333       protein_coding  Y       57067813        57130289
>
> When querying some of these genes via the Ensembl website it turned out that they are mapped to pseudoautosomal regions (identical sequence on X and Y).
>
>
> Some more information on how I retrieve the data:
>
> I use the API version 86.
>
> To speed things up I iterate over chromosomes in parallel and retrieve all genes as follows:
>
> $slice = $slice_adaptor -> fetch_by_region('chromosome', $chr_name);
> my @genes = @{$slice -> get_all_Genes()};
> So basically ENSG00000124333 is in @genes when querying information on X and when querying information on Y. If I, however, go via the gene I only get the X chromosome:
>
> my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
>
> my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000124333');
>
> print $gene->seq_region_name(); # => X
> On http://lists.ensembl.org/pipermail/dev/2010-October/000214.html <http://lists.ensembl.org/pipermail/dev/2010-October/000214.html> they say that a gene might exceed a pseudoautosomal region and thus extend into a region unique to the Y chromosome. This could be a reason why a gene shows up for X and Y. However, I checked this and there is no overlap between unique regions of Y and the gene coordinates. PAR-Coordinates from http://www.ensembl.org/info/genome/genebuild/assembly.html <http://www.ensembl.org/info/genome/genebuild/assembly.html> were used.
>
> Questions
>
> How come the positions are identical for some of the genes?
> Why do I get these duplicate gene entries?
> How can I prevent this?
>
> Thanks in advance and kind regards,
> Julia
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ensembl.org/pipermail/dev/attachments/20161207/984ccdca/attachment.htm>

------------------------------

_______________________________________________
Dev mailing list    Dev at ensembl.org
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/


End of Dev Digest, Vol 78, Issue 7
**********************************




More information about the Dev mailing list