[ensembl-dev] VEP: building cache with custom transcript on reverse strand, bug?

Nicolas Thierry-Mieg Nicolas.Thierry-Mieg at imag.fr
Wed Oct 1 12:08:03 BST 2014


Hello developers,

we would like to use VEP with custom transcript definitions. To this end 
we are building a VEP cache, following instructions found here:
http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html

For some transcripts things seem to work fine, but for others the SNP 
effects reported by VEP are incorrect.

We have constructed a small example that exhibits the problem. I am 
attaching the GFF but also copying it inline, in case attachments don't 
work on this ML:
3	protein_coding	exon	113063155	113063561	.	-	.	gene_id toto; 
transcript_id toto_a; exon_number 2;
3	protein_coding	stop_codon	113063352	113063354	.	-	.	gene_id toto; 
transcript_id toto_a; product_id toto_a;
3	protein_coding	CDS	113063355	113063561	.	-	0	gene_id toto; 
transcript_id toto_a; product_id toto_a; exon_number 2;
3	protein_coding	exon	113077591	113077946	.	-	.	gene_id toto; 
transcript_id toto_a; exon_number 1;
3	protein_coding	CDS	113077591	113077746	.	-	0	gene_id toto; 
transcript_id toto_a; product_id toto_a; exon_number 1;
3	protein_coding	start_codon	113077744	113077746	.	-	.	gene_id toto; 
transcript_id toto_a; product_id toto_a;

This is a (fabricated) two-exon transcript on the reverse strand of 
chromosome 3, using hg19 coordinates.
I have checked and re-checked, the two CDS features indeed represent a 
CDS (no STOPs in-frame).
For reference, the CDS sequences from exons 1 and 2 (hg19, chrom3, 
reverse strand) are respectively:
atgaacagatcttcttatctgcaggattttgatgtagattcccaaatccgtgctgagatgcacagaaaaacagctttcaaaattcaacaagtggaaaaggaattagcttgggaaaaagagaaacatgaactcggcctaatgaagctaaagaatcgg
agatttcgagatccactggaaagtgatactattgtggttcatgccatactgagtgaccacaagatatcctcttacaggctggtgcagccctctaagtactccaaattcaaacgagctagtcaatcagagagaaaaccaagcaaattggacaggtttgaaaaagagggacctggaagaaaggacagccagagagatgcaggtagccta


Using this GFF, I build the cache with:
bgzip -c transcript.gff > transcript.gff.gz
tabix -p gff transcript.gff.gz
perl gtf2vep.pl -i transcript.gff -f hs.chrom_3.fasta --dir cache/ -d 1 
-s homo_sapiens


I then have a single SNV in a VCF file:
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
3	113063450	.	G	A	.	PASS

This SNV falls within the CDS, it is a nonsense mutation (stop-gained). 
However, when I run VEP it sees it as a "3_prime_UTR_variant".


Specifically, I run VEP with:
perl variant_effect_predictor.pl --custom 
transcript.gff.gz,MyGene,gff,overlap,0 --offline -i snp.vcf --vcf 
--dir_cache cache/ --cache_version 1

I obtain:
##VEP=v76 cache=cache//homo_sapiens/1 db=.
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as 
predicted by VEP. Format: 
Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND">
##INFO=<ID=MyGene,Number=.,Type=String,Description="transcript.gff.gz 
(overlap)">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
3	113063450	.	G	A	.	PASS 
CSQ=A|toto|toto_a|Transcript|3_prime_UTR_variant|468|||||||-1;MyGene=CDS_3:113063355-113063561,exon_3:113063155-113063561



This is with the latest release of VEP (76).


Please let me know if more info is needed to debug this. It would be 
great if VEP could be used with custom annotations!


Regards,
Nicolas Thierry-Mieg



-- 
-----------------------------------------------------------
Nicolas Thierry-Mieg
Laboratoire TIMC-IMAG/BCM, CNRS UMR 5525
Pavillon Taillefer, Faculte de Medecine
38700 La Tronche, France
tel: (+33)456.520.067, fax: (+33)456.520.055
------------------------------------------------------------

-------------- next part --------------
3	protein_coding	exon	113063155	113063561	.	-	.	gene_id toto; transcript_id toto_a; exon_number 2;
3	protein_coding	stop_codon	113063352	113063354	.	-	.	gene_id toto; transcript_id toto_a; product_id toto_a;
3	protein_coding	CDS	113063355	113063561	.	-	0	gene_id toto; transcript_id toto_a; product_id toto_a; exon_number 2;
3	protein_coding	exon	113077591	113077946	.	-	.	gene_id toto; transcript_id toto_a; exon_number 1;
3	protein_coding	CDS	113077591	113077746	.	-	0	gene_id toto; transcript_id toto_a; product_id toto_a; exon_number 1;
3	protein_coding	start_codon	113077744	113077746	.	-	.	gene_id toto; transcript_id toto_a; product_id toto_a;


More information about the Dev mailing list