[ensembl-dev] VEP: building cache with custom transcript on reverse strand, bug?

Thu Oct 9 13:48:09 BST 2014

Hi Will,

this is great, thanks a lot!

We were a little confused and thought that to use custom annotations, we 
needed both to build a custom cache, and to use the --custom switch to 
variant_effect_predictor.pl with a tabix-indexed version of the annotations.

It turns out that the following much simpler procedure works (at least 
on my small test):

perl gtf2vep.pl -i transcript_reorder.gff -f hs.chrom_3.fasta --dir 
cache/ -d 1 -s homo_sapiens
perl variant_effect_predictor.pl --offline -i snp.vcf --vcf --dir_cache 
cache/ --cache_version 1 --force --no_stats

So: a single GTF file, and no messing with bgzip and tabix.

Thanks a lot for your help!

Nicolas

On 10/09/2014 12:25 PM, Will McLaren wrote:
> Hi Nicolas,
>
> You don't need to have the tabix indexed one - this serves a different
> purpose for the VEP, in that it just finds any overlaps between features
> in that file and your input variants. Correct me if I'm wrong but I
> don't think this adds anything to the output that you can't get from the
> custom cache files.
>
> If you have created cache files from the GTF then this will give you
> essentially the same location-based annotation and of course the
> additional sequence-based annotation.
>
> Will
>
> On 9 October 2014 10:03, Nicolas Thierry-Mieg
> <Nicolas.Thierry-Mieg at imag.fr <mailto:Nicolas.Thierry-Mieg at imag.fr>> wrote:
>
>     Hi Will,
>
>     thanks for your reply.
>
>     So, we need two different GTF files: one with exons and CDS's in
>     transcript order for building the VEP cache, and one in genomic
>     position order (required by tabix and therefore by
>     variant_effect_predictor.pl <http://variant_effect_predictor.pl>,
>     since the latter requires a tabix-indexed bgzipped gff).
>
>     I can confirm that the following works on my small example:
>     transcript.gff is my original file (sorted by increasing genomic
>     positions), and transcript_reorder.gff is similar but sorted in
>     transcript order.
>
>     perl gtf2vep.pl <http://gtf2vep.pl> -i transcript_reorder.gff -f
>     hs.chrom_3.fasta --dir cache/ -d 1 -s homo_sapiens
>
>     bgzip -c transcript.gff > transcript.gff.gz
>     tabix -p gff transcript.gff.gz
>     perl variant_effect_predictor.pl
>     <http://variant_effect_predictor.pl> --custom
>     transcript.gff.gz,MyGene,gff,__overlap,0 --offline -i snp.vcf --vcf
>     --dir_cache cache/ --cache_version 1 --force --no_stats
>
>     We'll now script sortGFF4VepCache.pl and test the above process on a
>     genome scale. I'll report back to the list with our findings.
>
>     You may want to update the following page, which incorrectly asserts
>     that "The GTF file have its features sorted in chromosomal order" to
>     build a cache:
>     http://www.ensembl.org/info/__docs/tools/vep/script/vep___cache.html#gtf
>     <http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#gtf>
>
>
>     Thanks again.
>
>     Regards,
>     Nicolas
>
>
>     On 10/08/2014 03:04 PM, Will McLaren wrote:
>
>         Hi Nicolas,
>
>         Apologies for the delay in replying to this.
>
>         The gtf2vep.pl <http://gtf2vep.pl> <http://gtf2vep.pl> parser at
>         the moment is somewhat
>         limited, and it has very strict expectations on the format of
>         the data
>         coming in.
>
>         One of these is that the exon and CDS lines appear in
>         _transcript_ order
>         (i.e. 5' -> 3'), even if this means the first exon listed for a
>         transcript in a gene has a genomic position larger (more 3' on the
>         genome) than the last.
>
>         If I change the order of your input to the following (the
>         start_codon
>         and stop_codon lines are actually ignored by the parser):
>
>         3protein___codingexon113077591113077946.-__.gene_id toto;
>         transcript_id
>         toto_a; exon_number 1;
>         3protein___codingCDS113077591113077746.-__0gene_id toto;
>         transcript_id
>         toto_a; product_id toto_a; exon_number 1;
>         3protein___codingexon113063155113063561.-__.gene_id toto;
>         transcript_id
>         toto_a; exon_number 2;
>         3protein___codingCDS113063355113063561.-__0gene_id toto;
>         transcript_id
>         toto_a; product_id toto_a; exon_number 2;
>
>         it works OK, and I get the expected stop_gained consequence for your
>         input variant:
>
>         #Uploaded_variation     Location        Allele  Gene    Feature
>         Feature_type    Consequence     cDNA_position   CDS_position
>            Protein_position    Amino_acids     Codons
>         Existing_variation      Extra
>         3_113063450_G/A 3:113063450     A       toto    toto_a  Transcript
>            stop_gained     468     268     90      R/*     Cga/Tga -
>         STRAND=-1
>
>         The parser is something we intend to revisit and improve
>         sometime soon,
>         so thanks for your patience in the meantime!
>
>         Regards
>
>         Will McLaren
>         Ensembl Variation
>
>
>         On 1 October 2014 12:08, Nicolas Thierry-Mieg
>         <Nicolas.Thierry-Mieg at imag.fr
>         <mailto:Nicolas.Thierry-Mieg at imag.fr>
>         <mailto:Nicolas.Thierry-Mieg at __imag.fr
>         <mailto:Nicolas.Thierry-Mieg at imag.fr>>> wrote:
>
>              Hello developers,
>
>              we would like to use VEP with custom transcript
>         definitions. To this
>              end we are building a VEP cache, following instructions
>         found here:
>         http://www.ensembl.org/info/____docs/tools/vep/script/vep_____cache.html
>         <http://www.ensembl.org/info/__docs/tools/vep/script/vep___cache.html>
>
>
>         <http://www.ensembl.org/info/__docs/tools/vep/script/vep___cache.html
>         <http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html>>
>
>              For some transcripts things seem to work fine, but for
>         others the
>              SNP effects reported by VEP are incorrect.
>
>              We have constructed a small example that exhibits the
>         problem. I am
>              attaching the GFF but also copying it inline, in case
>         attachments
>              don't work on this ML:
>              3       protein_coding  exon    113063155       113063561
>               .
>                  -       .       gene_id toto; transcript_id toto_a;
>         exon_number 2;
>              3       protein_coding  stop_codon      113063352
>           113063354
>                  .       -       .       gene_id toto; transcript_id toto_a;
>              product_id toto_a;
>              3       protein_coding  CDS     113063355       113063561
>               .
>                  -       0       gene_id toto; transcript_id toto_a;
>         product_id
>              toto_a; exon_number 2;
>              3       protein_coding  exon    113077591       113077946
>               .
>                  -       .       gene_id toto; transcript_id toto_a;
>         exon_number 1;
>              3       protein_coding  CDS     113077591       113077746
>               .
>                  -       0       gene_id toto; transcript_id toto_a;
>         product_id
>              toto_a; exon_number 1;
>              3       protein_coding  start_codon     113077744
>           113077746
>                  .       -       .       gene_id toto; transcript_id toto_a;
>              product_id toto_a;
>
>              This is a (fabricated) two-exon transcript on the reverse
>         strand of
>              chromosome 3, using hg19 coordinates.
>              I have checked and re-checked, the two CDS features indeed
>         represent
>              a CDS (no STOPs in-frame).
>              For reference, the CDS sequences from exons 1 and 2 (hg19,
>         chrom3,
>              reverse strand) are respectively:
>
>         atgaacagatcttcttatctgcaggatttt______gatgtagattcccaaatccgtgctgagatg______cacagaaaaacagctttcaaaattcaacaa______gtggaaaaggaattagcttgggaaaaagag______aaacatgaactcggcctaatgaagctaaag____aatcgg
>
>         agatttcgagatccactggaaagtgatact______attgtggttcatgccatactgagtgaccac______aagatatcctcttacaggctggtgcagccc______tctaagtactccaaattcaaacgagctagt______caatcagagagaaaaccaagcaaattggac______aggtttgaaaaagagggacctggaagaaag____gacagccagagagatgcaggtagccta
>
>
>              Using this GFF, I build the cache with:
>              bgzip -c transcript.gff > transcript.gff.gz
>              tabix -p gff transcript.gff.gz
>              perl gtf2vep.pl <http://gtf2vep.pl> <http://gtf2vep.pl> -i
>         transcript.gff -f
>              hs.chrom_3.fasta --dir cache/ -d 1 -s homo_sapiens
>
>
>              I then have a single SNV in a VCF file:
>              #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>              3       113063450       .       G       A       .       PASS
>
>              This SNV falls within the CDS, it is a nonsense mutation
>              (stop-gained). However, when I run VEP it sees it as a
>              "3_prime_UTR_variant".
>
>
>              Specifically, I run VEP with:
>              perl variant_effect_predictor.pl
>         <http://variant_effect_predictor.pl>
>              <http://variant_effect___predictor.pl
>         <http://variant_effect_predictor.pl>> --custom
>              transcript.gff.gz,MyGene,gff,____overlap,0 --offline -i
>         snp.vcf --vcf
>              --dir_cache cache/ --cache_version 1
>
>              I obtain:
>              ##VEP=v76 cache=cache//homo_sapiens/1 db=.
>
>         ##INFO=<ID=CSQ,Number=.,Type=____String,Description="____Consequence
>              type as predicted by VEP. Format:
>
>         Allele|Gene|Feature|Feature_____type|Consequence|cDNA_____position|CDS_position|Protein_____position|Amino_acids|Codons|____Existing_variation|DISTANCE|____STRAND">
>
>         ##INFO=<ID=MyGene,Number=.,____Type=String,Description="____transcript.gff.gz
>              (overlap)">
>              #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>              3       113063450       .       G       A       .       PASS
>
>         CSQ=A|toto|toto_a|Transcript|____3_prime_UTR_variant|468||||||__|__-1;MyGene=CDS_3:113063355-____113063561,exon_3:113063155-____113063561
>
>
>
>              This is with the latest release of VEP (76).
>
>
>              Please let me know if more info is needed to debug this. It
>         would be
>              great if VEP could be used with custom annotations!
>
>
>              Regards,
>              Nicolas Thierry-Mieg
>
>