[ensembl-dev] build cache from gtf file

Schmucki, Roland roland.schmucki at roche.com
Wed Nov 11 09:22:10 GMT 2015


Hello!

I would like to build a VEP cache from a GTF file which I downloaded from
Ensembl (Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.2.29.gtf)

The following commands were used to create the cache and were applied on a
test vcf file that includes all sorts of variants (missense and silent
SNPs, short indels, etc):

set VEPDIR=variant_effect_predictor_version79
set REF=Escherichia_coli_str_k_12_substr_mg1655.fa
set species=Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.2.29
perl $VEPDIR/gtf2vep.pl -i $species.gtf \
    -f Escherichia_coli_str_k_12_substr_mg1655.fa -d 79 \
    -species $species --dir cache_files
rm -rf ${species}
mv cache_files/${species} ${species}
perl $VEPDIR/variant_effect_predictor.pl --force_overwrite -offline \
    -i test.vcf -o test_${species}_vep.txt -species $species --dir .

Both steps run without any warnings.
Building the cache:

2015-11-11 10:09:08 - Checking/creating FASTA index
2015-11-11 10:09:08 - Processing chromosome Chromosome
2015-11-11 10:09:17 - All done!


Applying it to test.vcf gives:

2015-11-11 10:09:50 - Starting...
2015-11-11 10:09:50 - Detected format of input file as vcf
2015-11-11 10:09:50 - Read 387 variants into buffer
2015-11-11 10:09:50 - Reading transcript data from cache and/or database
[================================================================================================================================================================================================================================]
 [ 100% ]
2015-11-11 10:09:51 - Retrieved 4497 transcripts (0 mem, 4497 cached, 0 DB, 0 duplicates)
2015-11-11 10:09:51 - Analyzing chromosome Chromosome
2015-11-11 10:09:51 - Analyzing variants
[================================================================================================================================================================================================================================]
 [ 100% ]
2015-11-11 10:09:53 - Calculating consequences
[================================================================================================================================================================================================================================]
 [ 100% ]
2015-11-11 10:09:55 - Processed 387 total variants (77 vars/sec, 77 vars/sec total)
2015-11-11 10:09:55 - Wrote stats summary to test_Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.2.29_vep.txt_summary.html
2015-11-11 10:09:55 - Finished!




However, the output contains neither the amino acid changes and codons nor
the position of the change in the protein:

Chromosome_66528_T/C    Chromosome:66528    C    b0061    AAC73172    Transcript    non_coding_transcript_exon_variant,non_coding_transcript_variant    23    -    -    -    -    -    IMPACT=MODIFIER;STRAND=-1
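
Since every transcript is reported as non-coding, one thing I could check is what the GTF actually declares in its source and feature-type columns. A minimal sketch (Bourne shell rather than the csh used above; the function name gtf_tally is made up for illustration, and I have not confirmed which of these columns gtf2vep.pl relies on):

```shell
# Tally column 2 (source) and column 3 (feature type) of a GTF file.
# Header/comment lines starting with '#' are skipped.
gtf_tally() {
    awk -F '\t' '
        /^#/ { next }
        NF >= 9 { src[$2]++; feat[$3]++ }
        END {
            for (s in src)  printf "source\t%s\t%d\n",  s, src[s]
            for (f in feat) printf "feature\t%s\t%d\n", f, feat[f]
        }
    ' "$1" | sort
}

# Usage:
#   gtf_tally Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.2.29.gtf
```

Each output line is "source" or "feature", the value, and how many GTF lines carry it.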


On the other hand, I do get all this information when I download the
pre-built cache (escherichia_coli_str_k_12_substr_mg1655) from
ftp://ftp.ensemblgenomes.org/pub/bacteria/current/ and run VEP on the
command line with it:

set species=escherichia_coli_str_k_12_substr_mg1655
perl $VEPDIR/variant_effect_predictor.pl --force_overwrite -offline \
    -i test.vcf -o test_${species}_vep.txt -species $species --dir .

Chromosome_66528_T/C    Chromosome:66528    C    b0061    AAC73172    Transcript    missense_variant    23    23    8    Q/R    cAg/cGg    -    IMPACT=MODERATE;STRAND=-1
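
To compare the two runs across all 387 variants rather than one row, the consequence column (column 7 of the default tab-delimited VEP output) can be tallied per file. A rough sketch in Bourne shell; the function name vep_consequences is made up for illustration:

```shell
# Count how often each value occurs in the Consequence column (column 7)
# of a default tab-delimited VEP output file; '#' header lines are skipped.
# Output is "count<TAB>consequence", most frequent first.
vep_consequences() {
    awk -F '\t' '
        /^#/ { next }
        { n[$7]++ }
        END { for (c in n) printf "%d\t%s\n", n[c], c }
    ' "$1" | sort -k1,1nr -k2,2
}

# Usage (file names as produced by the commands in this mail):
#   vep_consequences test_escherichia_coli_str_k_12_substr_mg1655_vep.txt
```

Running it on both output files and comparing the two tallies shows whether any coding consequences appear at all with the GTF-built cache.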



Does anyone know how to build/apply the cache from a GTF file so that I get
the same output as from the pre-built cache?
I want to compare the downloaded GTF file with the one that was used to
generate the pre-built cache files (in order to fully understand the
required format).
Moreover, I would like to understand how to make a valid GTF for other
genome assemblies and annotations (which are not in Ensembl) so that I can
create my own VEP cache files.
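
As a sanity check on a hand-made GTF, one could at least list the transcripts that have exon lines but no CDS lines. This is only a sketch in Bourne shell: the function name is made up, and whether gtf2vep.pl needs CDS lines to report coding consequences is an assumption on my part, not something I have confirmed.

```shell
# Print transcript_id values that appear on "exon" lines but never on
# "CDS" lines of a GTF file. '#' header lines are skipped.
gtf_transcripts_without_cds() {
    awk -F '\t' '
        /^#/ { next }
        {
            # Pull the transcript_id value out of the attribute column.
            if (match($9, /transcript_id "[^"]+"/)) {
                id = substr($9, RSTART + 15, RLENGTH - 16)
                if ($3 == "exon") exon[id] = 1
                if ($3 == "CDS")  cds[id] = 1
            }
        }
        END { for (id in exon) if (!(id in cds)) print id }
    ' "$1" | sort
}

# Usage:
#   gtf_transcripts_without_cds my_annotation.gtf
```

An empty result means every transcript with exons also has at least one CDS line.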

Thanks for any help and suggestions!

Roland



-- 

Roland Schmucki, PhD
Computational Biologist, Pharmaceutical Sciences
Roche Pharma Research and Early Development


Roche Innovation Center Basel

F. Hoffmann-La Roche Ltd
Grenzacherstrasse 124
4070 Basel

Switzerland
Phone +41 61 687 13 30



