[ensembl-dev] VEP: building cache with custom transcript on reverse strand, bug?
Nicolas Thierry-Mieg
Nicolas.Thierry-Mieg at imag.fr
Wed Oct 1 12:08:03 BST 2014
Hello developers,
we would like to use VEP with custom transcript definitions. To this end
we are building a VEP cache, following instructions found here:
http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html
For some transcripts things seem to work fine, but for others the SNP
effects reported by VEP are incorrect.
We have constructed a small example that exhibits the problem. I am
attaching the GFF but also copying it inline, in case attachments don't
work on this ML:
3 protein_coding exon 113063155 113063561 . - . gene_id toto;
transcript_id toto_a; exon_number 2;
3 protein_coding stop_codon 113063352 113063354 . - . gene_id toto;
transcript_id toto_a; product_id toto_a;
3 protein_coding CDS 113063355 113063561 . - 0 gene_id toto;
transcript_id toto_a; product_id toto_a; exon_number 2;
3 protein_coding exon 113077591 113077946 . - . gene_id toto;
transcript_id toto_a; exon_number 1;
3 protein_coding CDS 113077591 113077746 . - 0 gene_id toto;
transcript_id toto_a; product_id toto_a; exon_number 1;
3 protein_coding start_codon 113077744 113077746 . - . gene_id toto;
transcript_id toto_a; product_id toto_a;
This is a (fabricated) two-exon transcript on the reverse strand of
chromosome 3, using hg19 coordinates.
I have checked and re-checked, the two CDS features indeed represent a
CDS (no STOPs in-frame).
For reference, the CDS sequences from exons 1 and 2 (hg19, chrom3,
reverse strand) are respectively:
atgaacagatcttcttatctgcaggattttgatgtagattcccaaatccgtgctgagatgcacagaaaaacagctttcaaaattcaacaagtggaaaaggaattagcttgggaaaaagagaaacatgaactcggcctaatgaagctaaagaatcgg
agatttcgagatccactggaaagtgatactattgtggttcatgccatactgagtgaccacaagatatcctcttacaggctggtgcagccctctaagtactccaaattcaaacgagctagtcaatcagagagaaaaccaagcaaattggacaggtttgaaaaagagggacctggaagaaaggacagccagagagatgcaggtagccta
Using this GFF, I build the cache with:
bgzip -c transcript.gff > transcript.gff.gz
tabix -p gff transcript.gff.gz
perl gtf2vep.pl -i transcript.gff -f hs.chrom_3.fasta --dir cache/ -d 1
-s homo_sapiens
I then have a single SNV in a VCF file:
#CHROM POS ID REF ALT QUAL FILTER INFO
3 113063450 . G A . PASS
This SNV falls within the CDS, it is a nonsense mutation (stop-gained).
However, when I run VEP it sees it as a "3_prime_UTR_variant".
Specifically, I run VEP with:
perl variant_effect_predictor.pl --custom
transcript.gff.gz,MyGene,gff,overlap,0 --offline -i snp.vcf --vcf
--dir_cache cache/ --cache_version 1
I obtain:
##VEP=v76 cache=cache//homo_sapiens/1 db=.
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
predicted by VEP. Format:
Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND">
##INFO=<ID=MyGene,Number=.,Type=String,Description="transcript.gff.gz
(overlap)">
#CHROM POS ID REF ALT QUAL FILTER INFO
3 113063450 . G A . PASS
CSQ=A|toto|toto_a|Transcript|3_prime_UTR_variant|468|||||||-1;MyGene=CDS_3:113063355-113063561,exon_3:113063155-113063561
This is with the latest release of VEP (76).
Please let me know if more info is needed to debug this. It would be
great if VEP could be used with custom annotations!
Regards,
Nicolas Thierry-Mieg
--
-----------------------------------------------------------
Nicolas Thierry-Mieg
Laboratoire TIMC-IMAG/BCM, CNRS UMR 5525
Pavillon Taillefer, Faculte de Medecine
38700 La Tronche, France
tel: (+33)456.520.067, fax: (+33)456.520.055
------------------------------------------------------------
-------------- next part --------------
3 protein_coding exon 113063155 113063561 . - . gene_id toto; transcript_id toto_a; exon_number 2;
3 protein_coding stop_codon 113063352 113063354 . - . gene_id toto; transcript_id toto_a; product_id toto_a;
3 protein_coding CDS 113063355 113063561 . - 0 gene_id toto; transcript_id toto_a; product_id toto_a; exon_number 2;
3 protein_coding exon 113077591 113077946 . - . gene_id toto; transcript_id toto_a; exon_number 1;
3 protein_coding CDS 113077591 113077746 . - 0 gene_id toto; transcript_id toto_a; product_id toto_a; exon_number 1;
3 protein_coding start_codon 113077744 113077746 . - . gene_id toto; transcript_id toto_a; product_id toto_a;
More information about the Dev
mailing list