[ensembl-dev] VEP: building cache with custom transcript on reverse strand, bug?
Nicolas Thierry-Mieg
Nicolas.Thierry-Mieg at imag.fr
Thu Oct 9 13:48:09 BST 2014
Hi Will,
this is great, thanks a lot!
We were a little confused and thought that to use custom annotations, we
needed both to build a custom cache, and to use the --custom switch to
variant_effect_predictor.pl with a tabix-indexed version of the annotations.
It turns out that the following much simpler procedure works (at least
on my small test):
perl gtf2vep.pl -i transcript_reorder.gff -f hs.chrom_3.fasta --dir
cache/ -d 1 -s homo_sapiens
perl variant_effect_predictor.pl --offline -i snp.vcf --vcf --dir_cache
cache/ --cache_version 1 --force --no_stats
So: a single GTF file, and no messing with bgzip and tabix.
Thanks a lot for your help!
Nicolas
On 10/09/2014 12:25 PM, Will McLaren wrote:
> Hi Nicolas,
>
> You don't need to have the tabix indexed one - this serves a different
> purpose for the VEP, in that it just finds any overlaps between features
> in that file and your input variants. Correct me if I'm wrong but I
> don't think this adds anything to the output that you can't get from the
> custom cache files.
>
> If you have created cache files from the GTF then this will give you
> essentially the same location-based annotation and of course the
> additional sequence-based annotation.
>
> Will
>
> On 9 October 2014 10:03, Nicolas Thierry-Mieg
> <Nicolas.Thierry-Mieg at imag.fr <mailto:Nicolas.Thierry-Mieg at imag.fr>> wrote:
>
> Hi Will,
>
> thanks for your reply.
>
> So, we need two different GTF files: one with exons and CDS's in
> transcript order for building the VEP cache, and one in genomic
> position order (required by tabix and therefore by
> variant_effect_predictor.pl <http://variant_effect_predictor.pl>,
> since the latter requires a tabix-indexed bgzipped gff).
>
> I can confirm that the following works on my small example:
> transcript.gff is my original file (sorted by increasing genomic
> positions), and transcript_reorder.gff is similar but sorted in
> transcript order.
>
> perl gtf2vep.pl <http://gtf2vep.pl> -i transcript_reorder.gff -f
> hs.chrom_3.fasta --dir cache/ -d 1 -s homo_sapiens
>
> bgzip -c transcript.gff > transcript.gff.gz
> tabix -p gff transcript.gff.gz
> perl variant_effect_predictor.pl
> <http://variant_effect_predictor.pl> --custom
> transcript.gff.gz,MyGene,gff,__overlap,0 --offline -i snp.vcf --vcf
> --dir_cache cache/ --cache_version 1 --force --no_stats
>
> We'll now script sortGFF4VepCache.pl and test the above process on a
> genome scale. I'll report back to the list with our findings.
>
> You may want to update the following page, which incorrectly asserts
> that "The GTF file have its features sorted in chromosomal order" to
> build a cache:
> http://www.ensembl.org/info/__docs/tools/vep/script/vep___cache.html#gtf
> <http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#gtf>
>
>
> Thanks again.
>
> Regards,
> Nicolas
>
>
> On 10/08/2014 03:04 PM, Will McLaren wrote:
>
> Hi Nicolas,
>
> Apologies for the delay in replying to this.
>
> The gtf2vep.pl <http://gtf2vep.pl> <http://gtf2vep.pl> parser at
> the moment is somewhat
> limited, and it has very strict expectations on the format of
> the data
> coming in.
>
> One of these is that the exon and CDS lines appear in
> _transcript_ order
> (i.e. 5' -> 3'), even if this means the first exon listed for a
> transcript in a gene has a genomic position larger (more 3' on the
> genome) than the last.
>
> If I change the order of your input to the following (the
> start_codon
> and stop_codon lines are actually ignored by the parser):
>
> 3protein___codingexon113077591113077946.-__.gene_id toto;
> transcript_id
> toto_a; exon_number 1;
> 3protein___codingCDS113077591113077746.-__0gene_id toto;
> transcript_id
> toto_a; product_id toto_a; exon_number 1;
> 3protein___codingexon113063155113063561.-__.gene_id toto;
> transcript_id
> toto_a; exon_number 2;
> 3protein___codingCDS113063355113063561.-__0gene_id toto;
> transcript_id
> toto_a; product_id toto_a; exon_number 2;
>
> it works OK, and I get the expected stop_gained consequence for your
> input variant:
>
> #Uploaded_variation Location Allele Gene Feature
> Feature_type Consequence cDNA_position CDS_position
> Protein_position Amino_acids Codons
> Existing_variation Extra
> 3_113063450_G/A 3:113063450 A toto toto_a Transcript
> stop_gained 468 268 90 R/* Cga/Tga -
> STRAND=-1
>
> The parser is something we intend to revisit and improve
> sometime soon,
> so thanks for your patience in the meantime!
>
> Regards
>
> Will McLaren
> Ensembl Variation
>
>
> On 1 October 2014 12:08, Nicolas Thierry-Mieg
> <Nicolas.Thierry-Mieg at imag.fr
> <mailto:Nicolas.Thierry-Mieg at imag.fr>
> <mailto:Nicolas.Thierry-Mieg at __imag.fr
> <mailto:Nicolas.Thierry-Mieg at imag.fr>>> wrote:
>
> Hello developers,
>
> we would like to use VEP with custom transcript
> definitions. To this
> end we are building a VEP cache, following instructions
> found here:
> http://www.ensembl.org/info/____docs/tools/vep/script/vep_____cache.html
> <http://www.ensembl.org/info/__docs/tools/vep/script/vep___cache.html>
>
>
> <http://www.ensembl.org/info/__docs/tools/vep/script/vep___cache.html
> <http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html>>
>
> For some transcripts things seem to work fine, but for
> others the
> SNP effects reported by VEP are incorrect.
>
> We have constructed a small example that exhibits the
> problem. I am
> attaching the GFF but also copying it inline, in case
> attachments
> don't work on this ML:
> 3 protein_coding exon 113063155 113063561
> .
> - . gene_id toto; transcript_id toto_a;
> exon_number 2;
> 3 protein_coding stop_codon 113063352
> 113063354
> . - . gene_id toto; transcript_id toto_a;
> product_id toto_a;
> 3 protein_coding CDS 113063355 113063561
> .
> - 0 gene_id toto; transcript_id toto_a;
> product_id
> toto_a; exon_number 2;
> 3 protein_coding exon 113077591 113077946
> .
> - . gene_id toto; transcript_id toto_a;
> exon_number 1;
> 3 protein_coding CDS 113077591 113077746
> .
> - 0 gene_id toto; transcript_id toto_a;
> product_id
> toto_a; exon_number 1;
> 3 protein_coding start_codon 113077744
> 113077746
> . - . gene_id toto; transcript_id toto_a;
> product_id toto_a;
>
> This is a (fabricated) two-exon transcript on the reverse
> strand of
> chromosome 3, using hg19 coordinates.
> I have checked and re-checked, the two CDS features indeed
> represent
> a CDS (no STOPs in-frame).
> For reference, the CDS sequences from exons 1 and 2 (hg19,
> chrom3,
> reverse strand) are respectively:
>
> atgaacagatcttcttatctgcaggatttt______gatgtagattcccaaatccgtgctgagatg______cacagaaaaacagctttcaaaattcaacaa______gtggaaaaggaattagcttgggaaaaagag______aaacatgaactcggcctaatgaagctaaag____aatcgg
>
> agatttcgagatccactggaaagtgatact______attgtggttcatgccatactgagtgaccac______aagatatcctcttacaggctggtgcagccc______tctaagtactccaaattcaaacgagctagt______caatcagagagaaaaccaagcaaattggac______aggtttgaaaaagagggacctggaagaaag____gacagccagagagatgcaggtagccta
>
>
> Using this GFF, I build the cache with:
> bgzip -c transcript.gff > transcript.gff.gz
> tabix -p gff transcript.gff.gz
> perl gtf2vep.pl <http://gtf2vep.pl> <http://gtf2vep.pl> -i
> transcript.gff -f
> hs.chrom_3.fasta --dir cache/ -d 1 -s homo_sapiens
>
>
> I then have a single SNV in a VCF file:
> #CHROM POS ID REF ALT QUAL FILTER INFO
> 3 113063450 . G A . PASS
>
> This SNV falls within the CDS, it is a nonsense mutation
> (stop-gained). However, when I run VEP it sees it as a
> "3_prime_UTR_variant".
>
>
> Specifically, I run VEP with:
> perl variant_effect_predictor.pl
> <http://variant_effect_predictor.pl>
> <http://variant_effect___predictor.pl
> <http://variant_effect_predictor.pl>> --custom
> transcript.gff.gz,MyGene,gff,____overlap,0 --offline -i
> snp.vcf --vcf
> --dir_cache cache/ --cache_version 1
>
> I obtain:
> ##VEP=v76 cache=cache//homo_sapiens/1 db=.
>
> ##INFO=<ID=CSQ,Number=.,Type=____String,Description="____Consequence
> type as predicted by VEP. Format:
>
> Allele|Gene|Feature|Feature_____type|Consequence|cDNA_____position|CDS_position|Protein_____position|Amino_acids|Codons|____Existing_variation|DISTANCE|____STRAND">
>
> ##INFO=<ID=MyGene,Number=.,____Type=String,Description="____transcript.gff.gz
> (overlap)">
> #CHROM POS ID REF ALT QUAL FILTER INFO
> 3 113063450 . G A . PASS
>
> CSQ=A|toto|toto_a|Transcript|____3_prime_UTR_variant|468||||||__|__-1;MyGene=CDS_3:113063355-____113063561,exon_3:113063155-____113063561
>
>
>
> This is with the latest release of VEP (76).
>
>
> Please let me know if more info is needed to debug this. It
> would be
> great if VEP could be used with custom annotations!
>
>
> Regards,
> Nicolas Thierry-Mieg
>
>
More information about the Dev
mailing list