[ensembl-dev] Variant effect predictor - write_cache option

Fri Mar 6 09:27:50 GMT 2015

We extracted all variants, uniqued them, ran VEP, then tabix-indexed the VCF files.
These are your cache. You can then annotate your VCFs from this cache using a script that looks up variants in your VEP VCFs.
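
A minimal sketch of the lookup step, not our actual script: it assumes plain tab-separated VCF lines, one ALT allele per line, and that the VEP annotation sits in the INFO column (column 8). The function names are made up for illustration, and a real version would query the tabix index rather than load everything into memory.

```python
def variant_key(vcf_line):
    """Return (chrom, pos, ref, alt) for one tab-separated VCF data line."""
    chrom, pos, _id, ref, alt = vcf_line.rstrip("\n").split("\t")[:5]
    return (chrom, int(pos), ref, alt)


def load_cache(annotated_lines):
    """Map each variant key to the INFO column of a VEP-annotated VCF."""
    cache = {}
    for line in annotated_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        cache[variant_key(line)] = fields[7]
    return cache


def split_known_novel(input_lines, cache):
    """Separate input variants into cached ones and ones still needing VEP."""
    known, novel = [], []
    for line in input_lines:
        if line.startswith("#") or not line.strip():
            continue
        key = variant_key(line)
        if key in cache:
            known.append((key, cache[key]))
        else:
            novel.append(key)
    return known, novel
```

Only the `novel` list would then need to go through VEP; its output can be added to the cache VCF before re-indexing.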

On 6 Mar 2015, at 07:25, Jan Vogel <jan.vogel at gmail.com> wrote:

> 
> Hi Will, 
> 
> thank you for the explanation of --write_cache. As the option does not work together with --fork, its use is pretty much limited to small regions. 
> 
> Concerning the idea of a pre-VEP stage: I like your idea of working with vcf-annotate; it is a nice solution that should scale pretty well, as it doesn't have a DB bottleneck.
> 
> To summarize:
> 
> 1) run VEP on a few samples
> 2) collect the VEP output files and write a (unique) file for vcf-annotate
> 3) write a script that combines vcf-annotate for known variants and VEP for missing variants (the output has the same format as produced by VEP)
> 
> If anyone has done this before and has some code on github, I’d love to take a peek. 
> 
> Cheers, 
>  
>    Jan 
> 
> On Mar 4, 2015, at 1:25 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
> 
>> Hi Jan,
>> 
>> On 4 March 2015 at 03:40, Jan Vogel <jan.vogel at gmail.com> wrote:
>> 
>> 
>> Hello Will, 
>> 
>> I'm annotating some large-scale data and I was thinking of creating my own cache with the --write_cache option. My idea is that every time I annotate a VCF, I add the variations that have not been cached previously to the cache. Is this what the write_cache option is intended to do? 
>> 
>> 
>> No, this is not how the cache works. The cache only contains data read from the Ensembl DBs. The --write_cache option is an alternative to e.g. "--build all" that creates the cache bit by bit according to your input:
>> 
>> 1) Read variants
>> 
>> 2) Get overlapped genomic region
>> 
>> 3) If the overlapped region is found in the cache, read data (transcripts, regulatory features, known variant locations) from the cache
>> 
>> 4) If the region is not found in the cache, read data from the database, then write the data to the cache (the next time the same region is found, step 3 applies)
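
The four steps above amount to a read-through cache keyed by genomic region. The following is only an illustration of that pattern, not VEP's code: the 1 Mb region size is an assumption taken from the cache file names quoted further down (e.g. 2000001-3000000), and `fetch_from_db` is a hypothetical stand-in for the database query.

```python
REGION_SIZE = 1_000_000  # assumption, inferred from file names like 2000001-3000000


def region_for(pos, size=REGION_SIZE):
    """Return the 1-based (start, end) of the fixed-size region containing pos."""
    start = ((pos - 1) // size) * size + 1
    return start, start + size - 1


def get_region_data(cache, chrom, pos, fetch_from_db):
    """Step 3: read from the cache if present. Step 4: otherwise fetch and store."""
    key = (chrom,) + region_for(pos)
    if key not in cache:
        # Cache miss: fetch_from_db(chrom, start, end) is a hypothetical callback.
        cache[key] = fetch_from_db(*key)
    return cache[key]
```

Note that what gets stored is per-region reference data, not per-variant results, which is why this does not act as a results cache.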
>> 
>> I don't expect many people to use this flag, and in fact I'm considering removing it as it does seem to cause confusion amongst users. My initial idea was that it would be of use to people whose input always spans only a particular genomic region or regions (for example, those doing targeted sequencing).
>> 
>> I think now it is just easier for everyone to download the whole genome cache file (or generate it using --build all if you have custom data in your Ensembl DBs) and not have to worry about any of the above.
>> 
>> 
>> I ran into a bit of trouble when using it, as 
>>  - I had the same CODE ref exception from Storable.pm (fixed it with your --no_adaptor_cache option); it might be a good idea to add this to http://uswest.ensembl.org/info/docs/tools/vep/script/vep_example.html 
>> 
>> As I said, this is fixed in newer versions, to which the documentation refers.
>>  
>> 
>> I also ran into trouble when forking the script. It seems to me that there is a race condition and that the forked processes are modifying the same cache files, so I end up with a corrupted cache and error messages like: 
>> 
>> gzip: /.vep/homo_sapiens/77_GRCh38/1/2000001-3000000_var.gz: unexpected end of file
>> gzip: /.vep/homo_sapiens/77_GRCh38/1/3000001-4000000_var.gz: unexpected end of file
>> gzip: /.vep/homo_sapiens/77_GRCh38/1/11000001-12000000_var.gz: unexpected end of file
>> 
>> Have you seen this before ? 
>> 
>> I'm not surprised this happens, there is no file-level locking so using fork and write_cache would indeed give you this issue. I'll add them to the list of incompatible option pairs.
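
For anyone writing their own shared cache files, one common way to avoid readers seeing truncated gzip files is to write to a temporary file and rename it into place. This is a generic sketch and not something VEP does; `atomic_write_gz` is a made-up name, and the approach only prevents half-written files, not two processes overwriting each other's updates.

```python
import gzip
import os
import tempfile


def atomic_write_gz(path, text):
    """Write text as gzip to path so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as raw, gzip.GzipFile(fileobj=raw, mode="wb") as gz:
            gz.write(text.encode("utf-8"))
        # os.replace swaps the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Preventing lost updates between forked processes would still need a lock (or one writer per region).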
>>  
>> 
>> Also, I can’t get VEP to work with two different cache files - my ideal setup would be 
>> 
>> a) a system-wide cache with pre-computed cache data from EnsEMBL 
>> b) a “by-user”  cache - once a user computed a single variation which is not in the system-wide cache, it would be great to add it to the user-cache - so it does not get re-computed. 
>> 
>> I've looked into this before but there are a number of issues with the by-user cache:
>> 
>> 1) the cache would need to contain all possible annotations for each variant, or be able to be updated if user requests e.g. HGVS when they didn't the first time it was cached
>> 
>> 2) the cache would become out of date if the user updated VEP (transcripts etc may change in a newer version of Ensembl)
>> 
>> 3) the cache would need to operate fast enough so that the performance benefit would outweigh just re-computing every time
>>  
>> Other VEP users have created pipelines using VCFs or a NoSQL DB as the VEP results cache, with a pre-VEP stage that looks up results from those resources and then runs VEP only on the novel variants. A super simple way would be to use VCF output from VEP, then use vcf-annotate to copy results to any subsequent input VCFs.
>> 
>> HTH
>> 
>> Will
>> 
>> 
>> Ideally, it would also be possible to merge both caches ( user + system-wide) - so other users can benefit from pre-calculated variations. 
>> 
>> 
>> I’m in a multi-user environment, that’s why I am hesitant to have all users write to the same system-wide cache. 
>> 
>> Do such options currently exist and did I just not find them? Or am I running VEP the wrong way? I was hoping that the --dir and --dir_cache options could be used this way ... 
>> 
>> Here’s my command line : 
>> 
>> perl ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
>>     --write_cache \
>>     --verbose \
>>     --cache \
>>     --force_overwrite \
>>     -i test.vcf -o test.out \
>>     --dir_cache /gne/research/workspace/vogelj4/variant_effect_predictor/jensenmann/igis_cache/new_cache --cache_version 77 \
>>     --species homo_sapiens \
>>     --db_version 77 \
>>     --dir /gne/research/workspace/vogelj4/variant_effect_predictor/jensenmann/igis_cache/e77.1/VEP/ \
>>     --fork 12
>> 
>> 
>> Thanks for this great tool ! 
>> 
>>    Jan 
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>> 
>> 
> 
