[ensembl-dev] VEP: building cache with custom transcript on reverse strand, bug?

Will McLaren wm2 at ebi.ac.uk
Wed Oct 8 14:04:31 BST 2014


Hi Nicolas,

Apologies for the delay in replying to this.

The gtf2vep.pl parser at the moment is somewhat limited, and it has very
strict expectations on the format of the data coming in.

One of these is that the exon and CDS lines appear in _transcript_ order
(i.e. 5' -> 3'), even if this means the first exon listed for a transcript
in a gene has a genomic position larger (more 3' on the genome) than the
last.

If I change the order of your input to the following (the start_codon and
stop_codon lines are actually ignored by the parser):

3 protein_coding exon 113077591 113077946 . - . gene_id toto; transcript_id
toto_a; exon_number 1;
3 protein_coding CDS 113077591 113077746 . - 0 gene_id toto; transcript_id
toto_a; product_id toto_a; exon_number 1;
3 protein_coding exon 113063155 113063561 . - . gene_id toto; transcript_id
toto_a; exon_number 2;
3 protein_coding CDS 113063355 113063561 . - 0 gene_id toto; transcript_id
toto_a; product_id toto_a; exon_number 2;

it works OK, and I get the expected stop_gained consequence for your input
variant:

#Uploaded_variation     Location        Allele  Gene    Feature
Feature_type    Consequence     cDNA_position   CDS_position
 Protein_position    Amino_acids     Codons  Existing_variation      Extra
3_113063450_G/A 3:113063450     A       toto    toto_a  Transcript
 stop_gained     468     268     90      R/*     Cga/Tga -  STRAND=-1

The parser is something we intend to revisit and improve sometime soon, so
thanks for your patience in the meantime!

Regards

Will McLaren
Ensembl Variation


On 1 October 2014 12:08, Nicolas Thierry-Mieg <Nicolas.Thierry-Mieg at imag.fr>
wrote:

> Hello developers,
>
> we would like to use VEP with custom transcript definitions. To this end
> we are building a VEP cache, following instructions found here:
> http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html
>
> For some transcripts things seem to work fine, but for others the SNP
> effects reported by VEP are incorrect.
>
> We have constructed a small example that exhibits the problem. I am
> attaching the GFF but also copying it inline, in case attachments don't
> work on this ML:
> 3       protein_coding  exon    113063155       113063561       .       -
>      .       gene_id toto; transcript_id toto_a; exon_number 2;
> 3       protein_coding  stop_codon      113063352       113063354       .
>      -       .       gene_id toto; transcript_id toto_a; product_id toto_a;
> 3       protein_coding  CDS     113063355       113063561       .       -
>      0       gene_id toto; transcript_id toto_a; product_id toto_a;
> exon_number 2;
> 3       protein_coding  exon    113077591       113077946       .       -
>      .       gene_id toto; transcript_id toto_a; exon_number 1;
> 3       protein_coding  CDS     113077591       113077746       .       -
>      0       gene_id toto; transcript_id toto_a; product_id toto_a;
> exon_number 1;
> 3       protein_coding  start_codon     113077744       113077746       .
>      -       .       gene_id toto; transcript_id toto_a; product_id toto_a;
>
> This is a (fabricated) two-exon transcript on the reverse strand of
> chromosome 3, using hg19 coordinates.
> I have checked and re-checked, the two CDS features indeed represent a CDS
> (no STOPs in-frame).
> For reference, the CDS sequences from exons 1 and 2 (hg19, chrom3, reverse
> strand) are respectively:
> atgaacagatcttcttatctgcaggattttgatgtagattcccaaatccgtgctgagatg
> cacagaaaaacagctttcaaaattcaacaagtggaaaaggaattagcttgggaaaaagag
> aaacatgaactcggcctaatgaagctaaagaatcgg
> agatttcgagatccactggaaagtgatactattgtggttcatgccatactgagtgaccac
> aagatatcctcttacaggctggtgcagccctctaagtactccaaattcaaacgagctagt
> caatcagagagaaaaccaagcaaattggacaggtttgaaaaagagggacctggaagaaag
> gacagccagagagatgcaggtagccta
>
>
> Using this GFF, I build the cache with:
> bgzip -c transcript.gff > transcript.gff.gz
> tabix -p gff transcript.gff.gz
> perl gtf2vep.pl -i transcript.gff -f hs.chrom_3.fasta --dir cache/ -d 1
> -s homo_sapiens
>
>
> I then have a single SNV in a VCF file:
> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
> 3       113063450       .       G       A       .       PASS
>
> This SNV falls within the CDS, it is a nonsense mutation (stop-gained).
> However, when I run VEP it sees it as a "3_prime_UTR_variant".
>
>
> Specifically, I run VEP with:
> perl variant_effect_predictor.pl --custom transcript.gff.gz,MyGene,gff,overlap,0
> --offline -i snp.vcf --vcf --dir_cache cache/ --cache_version 1
>
> I obtain:
> ##VEP=v76 cache=cache//homo_sapiens/1 db=.
> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
> predicted by VEP. Format: Allele|Gene|Feature|Feature_
> type|Consequence|cDNA_position|CDS_position|Protein_
> position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND">
> ##INFO=<ID=MyGene,Number=.,Type=String,Description="transcript.gff.gz
> (overlap)">
> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
> 3       113063450       .       G       A       .       PASS
> CSQ=A|toto|toto_a|Transcript|3_prime_UTR_variant|468|||||||
> -1;MyGene=CDS_3:113063355-113063561,exon_3:113063155-113063561
>
>
>
> This is with the latest release of VEP (76).
>
>
> Please let me know if more info is needed to debug this. It would be great
> if VEP could be used with custom annotations!
>
>
> Regards,
> Nicolas Thierry-Mieg
>
>
>
> --
> -----------------------------------------------------------
> Nicolas Thierry-Mieg
> Laboratoire TIMC-IMAG/BCM, CNRS UMR 5525
> Pavillon Taillefer, Faculte de Medecine
> 38700 La Tronche, France
> tel: (+33)456.520.067, fax: (+33)456.520.055
> ------------------------------------------------------------
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20141008/185f6a86/attachment.html>


More information about the Dev mailing list