[ensembl-dev] Variant effect predictor - write_cache option

Will McLaren wm2 at ebi.ac.uk
Wed Mar 4 09:25:15 GMT 2015


Hi Jan,

On 4 March 2015 at 03:40, Jan Vogel <jan.vogel at gmail.com> wrote:

>
>
> Hello Will,
>
> I’m annotating some large-scale data and was thinking of creating my own
> cache with the --write_cache option. My idea is that every time I annotate a
> VCF, I add the variations that have not been cached previously to the
> cache. *Is this what the write_cache option is intended to do?*
>
>
No, this is not how the cache works. The cache only contains data read from
the Ensembl DBs. The --write_cache option is an alternative to e.g.
"--build all" that creates the cache bit by bit according to your input:

1) Read variants

2) Get overlapped genomic region

3) If the overlapped region is found in the cache, read data (transcripts,
regulatory features, known variant locations) from the cache

4) If the region is not found in the cache, read data from the database, then
write it to the cache (the next time the same region comes up, step 3 applies)
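
As an illustration of steps 2-4, here is a minimal Python sketch. It is not
VEP's code (VEP is written in Perl). The names region_for, load_region and
fetch_from_db, and the JSON-in-gzip file format, are made up for the example.
The 1 Mb region size is an assumption that matches the cache file names quoted
further down in this thread.

```python
import gzip
import json
import os

REGION_SIZE = 1_000_000  # assumed chunk size (hypothetical constant)


def region_for(pos, size=REGION_SIZE):
    """Return (start, end) of the fixed-size chunk containing a 1-based position."""
    start = ((pos - 1) // size) * size + 1
    return start, start + size - 1


def load_region(cache_dir, chrom, pos, fetch_from_db, write_cache=True):
    """Read a region from the cache, or from the database on a cache miss."""
    start, end = region_for(pos)  # step 2: which region does the variant overlap?
    path = os.path.join(cache_dir, str(chrom), f"{start}-{end}.json.gz")
    if os.path.exists(path):  # step 3: region already cached
        with gzip.open(path, "rt") as fh:
            return json.load(fh)
    data = fetch_from_db(chrom, start, end)  # step 4: read from the database
    if write_cache:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with gzip.open(path, "wt") as fh:
            json.dump(data, fh)  # store it so the next lookup takes step 3
    return data
```

Note that what gets stored is per-region database content, not per-variant
results, which is why the cache does not grow with the variants you annotate.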

I don't expect many people to use this flag and in fact I'm considering
removing it as it does seem to cause confusion amongst users. My initial
idea was that it would be of use to people whose input only ever spans a
particular genomic region or regions (for example those doing targeted
sequencing).

I think now it is just easier for everyone to download the whole genome
cache file (or generate it using --build all if you have custom data in
your Ensembl DBs) and not have to worry about any of the above.


> I ran into a bit of trouble when using it, as
>  - I had the same CODE ref exception from Storable.pm (fixed it with
> your --no_adaptor_cache option) - might be a good idea to add this to
> http://uswest.ensembl.org/info/docs/tools/vep/script/vep_example.html
>

As I said, this is fixed in newer versions, to which the documentation
refers.


>
> I also ran into trouble when forking the script - it seems to me that
> there is a race condition, and that the forked processes are modifying the
> same cache files - so I end up with a corrupted cache and error messages
> like:
>
> gzip: /.vep/homo_sapiens/77_GRCh38/1/2000001-3000000_var.gz: unexpected
> end of file
> gzip: /.vep/homo_sapiens/77_GRCh38/1/3000001-4000000_var.gz: unexpected
> end of file
> gzip: /.vep/homo_sapiens/77_GRCh38/1/11000001-12000000_var.gz: unexpected
> end of file
>
> Have you seen this before ?
>

I'm not surprised this happens: there is no file-level locking, so using
--fork together with --write_cache would indeed give you this issue. I'll add
them to the list of incompatible option pairs.
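
For anyone writing their own cache files from several processes, one common
way to avoid truncated files is to write to a temporary file and rename it
into place. The Python sketch below shows that general technique only; it is
not something VEP does, and write_cache_file_atomically is a hypothetical
helper name.

```python
import gzip
import os
import tempfile


def write_cache_file_atomically(path, payload):
    """Write gzip-compressed bytes to `path` so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as raw, gzip.GzipFile(fileobj=raw, mode="wb") as gz:
            gz.write(payload)
        # The rename is atomic on POSIX: readers see the old file or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If two processes write the same region at once, the last rename wins, but
each version is a complete gzip file, so the "unexpected end of file" errors
should not occur.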


>
> Also, I can’t get VEP to work with two different cache files - my ideal
> setup would be
>
> a) *a system-wide cache* with pre-computed cache data from EnsEMBL
> b) *a “by-user” cache* - once a user has computed a single variation which
> is not in the system-wide cache, it would be great to add it to the
> user cache, so it does not get re-computed.
>

I've looked into this before but there are a number of issues with the
by-user cache:

1) the cache would need to contain all possible annotations for each
variant, or be updatable if a user requests e.g. HGVS when they didn't
request it the first time the variant was cached

2) the cache would become out of date if the user updated VEP (transcripts
etc may change in a newer version of Ensembl)

3) the cache would need to operate fast enough so that the performance
benefit would outweigh just re-computing every time

Other VEP users have created pipelines using VCFs or a NoSQL DB as the VEP
results cache, with a pre-VEP stage that looks up results from those
resources and then runs VEP only on the novel variants. A super simple way
would be to use VCF output from VEP, then use vcf-annotate to copy
results to any subsequent input VCFs.
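
A pre-VEP lookup stage of that kind could look roughly like the Python sketch
below. It is an illustration only: variant_key and split_known_and_novel are
hypothetical names, and results_cache stands in for whatever store (a VCF, a
NoSQL DB) holds earlier results. A plain dict is assumed here.

```python
def variant_key(vcf_line):
    """Build a (chrom, pos, ref, alt) key from one tab-separated VCF data line."""
    chrom, pos, _id, ref, alt = vcf_line.rstrip("\n").split("\t")[:5]
    return chrom, int(pos), ref, alt


def split_known_and_novel(vcf_lines, results_cache):
    """Return (known, novel): cached results by key, and lines that still need VEP."""
    known, novel = {}, []
    for line in vcf_lines:
        if line.startswith("#"):
            # Keep header lines so the novel subset is still a valid VCF.
            novel.append(line)
            continue
        key = variant_key(line)
        if key in results_cache:
            known[key] = results_cache[key]  # reuse the earlier result
        else:
            novel.append(line)  # only these lines go to VEP
    return known, novel
```

To cover points 1 and 2 above, the key would in practice also need to include
the Ensembl release and the set of annotation options requested. The sketch
also treats a multi-allelic ALT field as a single string.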

HTH

Will


> Ideally, it would also be *possible to merge both caches* (user +
> system-wide) - so other users can benefit from pre-calculated variations.
>
>
> I’m in a multi-user environment, that’s why I am hesitant to have all
> users write to the same system-wide cache.
>
> Do such options currently exist and did I just not find them? Or am I
> running VEP the wrong way? I was hoping that the --dir and --dir_cache
> options could be used this way ...
>
> Here’s my command line:
>
> perl ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
>     --write_cache \
>     --verbose \
>     --cache \
>     --force_overwrite \
>     -i test.vcf -o test.out \
>     --dir_cache /gne/research/workspace/vogelj4/variant_effect_predictor/jensenmann/igis_cache/new_cache \
>     --cache_version 77 \
>     --species homo_sapiens \
>     --db_version 77 \
>     --dir /gne/research/workspace/vogelj4/variant_effect_predictor/jensenmann/igis_cache/e77.1/VEP/ \
>     --fork 12
>
>
> Thanks for this great tool !
>
>    Jan