[ensembl-dev] VEP: building cache with custom transcript on reverse strand, bug?

Thu Oct 9 10:03:46 BST 2014

Hi Will,

thanks for your reply.

So, we need two different GTF files: one with exons and CDS's in 
transcript order for building the VEP cache, and one in genomic position 
order (required by tabix and therefore by variant_effect_predictor.pl, 
since the latter requires a tabix-indexed bgzipped gff).

I can confirm that the following works on my small example: 
transcript.gff is my original file (sorted by increasing genomic 
positions), and transcript_reorder.gff is similar but sorted in 
transcript order.

perl gtf2vep.pl -i transcript_reorder.gff -f hs.chrom_3.fasta --dir 
cache/ -d 1 -s homo_sapiens

bgzip -c transcript.gff > transcript.gff.gz
tabix -p gff transcript.gff.gz
perl variant_effect_predictor.pl --custom 
transcript.gff.gz,MyGene,gff,overlap,0 --offline -i snp.vcf --vcf 
--dir_cache cache/ --cache_version 1 --force --no_stats

We'll now script sortGFF4VepCache.pl and test the above process on a 
genome scale. I'll report back to the list with our findings.

You may want to update the following page, which incorrectly asserts 
that "The GTF file have its features sorted in chromosomal order" to 
build a cache:
http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#gtf

Thanks again.

Regards,
Nicolas

On 10/08/2014 03:04 PM, Will McLaren wrote:
> Hi Nicolas,
>
> Apologies for the delay in replying to this.
>
> The gtf2vep.pl <http://gtf2vep.pl> parser at the moment is somewhat
> limited, and it has very strict expectations on the format of the data
> coming in.
>
> One of these is that the exon and CDS lines appear in _transcript_ order
> (i.e. 5' -> 3'), even if this means the first exon listed for a
> transcript in a gene has a genomic position larger (more 3' on the
> genome) than the last.
>
> If I change the order of your input to the following (the start_codon
> and stop_codon lines are actually ignored by the parser):
>
> 3protein_codingexon113077591113077946.-.gene_id toto; transcript_id
> toto_a; exon_number 1;
> 3protein_codingCDS113077591113077746.-0gene_id toto; transcript_id
> toto_a; product_id toto_a; exon_number 1;
> 3protein_codingexon113063155113063561.-.gene_id toto; transcript_id
> toto_a; exon_number 2;
> 3protein_codingCDS113063355113063561.-0gene_id toto; transcript_id
> toto_a; product_id toto_a; exon_number 2;
>
> it works OK, and I get the expected stop_gained consequence for your
> input variant:
>
> #Uploaded_variation     Location        Allele  Gene    Feature
> Feature_type    Consequence     cDNA_position   CDS_position
>   Protein_position    Amino_acids     Codons  Existing_variation      Extra
> 3_113063450_G/A 3:113063450     A       toto    toto_a  Transcript
>   stop_gained     468     268     90      R/*     Cga/Tga -  STRAND=-1
>
> The parser is something we intend to revisit and improve sometime soon,
> so thanks for your patience in the meantime!
>
> Regards
>
> Will McLaren
> Ensembl Variation
>
>
> On 1 October 2014 12:08, Nicolas Thierry-Mieg
> <Nicolas.Thierry-Mieg at imag.fr <mailto:Nicolas.Thierry-Mieg at imag.fr>> wrote:
>
>     Hello developers,
>
>     we would like to use VEP with custom transcript definitions. To this
>     end we are building a VEP cache, following instructions found here:
>     http://www.ensembl.org/info/__docs/tools/vep/script/vep___cache.html
>     <http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html>
>
>     For some transcripts things seem to work fine, but for others the
>     SNP effects reported by VEP are incorrect.
>
>     We have constructed a small example that exhibits the problem. I am
>     attaching the GFF but also copying it inline, in case attachments
>     don't work on this ML:
>     3       protein_coding  exon    113063155       113063561       .
>         -       .       gene_id toto; transcript_id toto_a; exon_number 2;
>     3       protein_coding  stop_codon      113063352       113063354
>         .       -       .       gene_id toto; transcript_id toto_a;
>     product_id toto_a;
>     3       protein_coding  CDS     113063355       113063561       .
>         -       0       gene_id toto; transcript_id toto_a; product_id
>     toto_a; exon_number 2;
>     3       protein_coding  exon    113077591       113077946       .
>         -       .       gene_id toto; transcript_id toto_a; exon_number 1;
>     3       protein_coding  CDS     113077591       113077746       .
>         -       0       gene_id toto; transcript_id toto_a; product_id
>     toto_a; exon_number 1;
>     3       protein_coding  start_codon     113077744       113077746
>         .       -       .       gene_id toto; transcript_id toto_a;
>     product_id toto_a;
>
>     This is a (fabricated) two-exon transcript on the reverse strand of
>     chromosome 3, using hg19 coordinates.
>     I have checked and re-checked, the two CDS features indeed represent
>     a CDS (no STOPs in-frame).
>     For reference, the CDS sequences from exons 1 and 2 (hg19, chrom3,
>     reverse strand) are respectively:
>     atgaacagatcttcttatctgcaggatttt__gatgtagattcccaaatccgtgctgagatg__cacagaaaaacagctttcaaaattcaacaa__gtggaaaaggaattagcttgggaaaaagag__aaacatgaactcggcctaatgaagctaaag__aatcgg
>     agatttcgagatccactggaaagtgatact__attgtggttcatgccatactgagtgaccac__aagatatcctcttacaggctggtgcagccc__tctaagtactccaaattcaaacgagctagt__caatcagagagaaaaccaagcaaattggac__aggtttgaaaaagagggacctggaagaaag__gacagccagagagatgcaggtagccta
>
>
>     Using this GFF, I build the cache with:
>     bgzip -c transcript.gff > transcript.gff.gz
>     tabix -p gff transcript.gff.gz
>     perl gtf2vep.pl <http://gtf2vep.pl> -i transcript.gff -f
>     hs.chrom_3.fasta --dir cache/ -d 1 -s homo_sapiens
>
>
>     I then have a single SNV in a VCF file:
>     #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>     3       113063450       .       G       A       .       PASS
>
>     This SNV falls within the CDS, it is a nonsense mutation
>     (stop-gained). However, when I run VEP it sees it as a
>     "3_prime_UTR_variant".
>
>
>     Specifically, I run VEP with:
>     perl variant_effect_predictor.pl
>     <http://variant_effect_predictor.pl> --custom
>     transcript.gff.gz,MyGene,gff,__overlap,0 --offline -i snp.vcf --vcf
>     --dir_cache cache/ --cache_version 1
>
>     I obtain:
>     ##VEP=v76 cache=cache//homo_sapiens/1 db=.
>     ##INFO=<ID=CSQ,Number=.,Type=__String,Description="__Consequence
>     type as predicted by VEP. Format:
>     Allele|Gene|Feature|Feature___type|Consequence|cDNA___position|CDS_position|Protein___position|Amino_acids|Codons|__Existing_variation|DISTANCE|__STRAND">
>     ##INFO=<ID=MyGene,Number=.,__Type=String,Description="__transcript.gff.gz
>     (overlap)">
>     #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>     3       113063450       .       G       A       .       PASS
>     CSQ=A|toto|toto_a|Transcript|__3_prime_UTR_variant|468|||||||__-1;MyGene=CDS_3:113063355-__113063561,exon_3:113063155-__113063561
>
>
>
>     This is with the latest release of VEP (76).
>
>
>     Please let me know if more info is needed to debug this. It would be
>     great if VEP could be used with custom annotations!
>
>
>     Regards,
>     Nicolas Thierry-Mieg
>
>
>
>     --
>     ------------------------------__-----------------------------
>     Nicolas Thierry-Mieg
>     Laboratoire TIMC-IMAG/BCM, CNRS UMR 5525
>     Pavillon Taillefer, Faculte de Medecine
>     38700 La Tronche, France
>     tel: (+33)456.520.067 <tel:%28%2B33%29456.520.067>, fax:
>     (+33)456.520.055 <tel:%28%2B33%29456.520.055>
>     ------------------------------__------------------------------
>
>
>     _______________________________________________
>     Dev mailing list Dev at ensembl.org <mailto:Dev at ensembl.org>
>     Posting guidelines and subscribe/unsubscribe info:
>     http://lists.ensembl.org/mailman/listinfo/dev
>     Ensembl Blog: http://www.ensembl.info/
>
>
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>

-- 
-----------------------------------------------------------
Nicolas Thierry-Mieg
Laboratoire TIMC-IMAG/BCM, CNRS UMR 5525
Pavillon Taillefer, Faculte de Medecine
38700 La Tronche, France
tel: (+33)456.520.067, fax: (+33)456.520.055
------------------------------------------------------------