[ensembl-dev] Codon called wrong in VEP when using custom build cache

Will McLaren wm2 at ebi.ac.uk
Tue Mar 26 14:04:21 GMT 2013


Hi Heidi,

Thanks for your patience, I've had a chance to look at this now.

If I build a cache file from the following files:

ftp://ftp.ensembl.org/pub/release-70/gtf/gasterosteus_aculeatus/Gasterosteus_aculeatus.BROADS1.70.gtf.gz

and

ftp://ftp.ensembl.org/pub/release-70/fasta/gasterosteus_aculeatus/dna/Gasterosteus_aculeatus.BROADS1.70.dna.toplevel.fa.gz

I get (I think!) the correct output from the VEP:

perl gtf2vep.pl -i Gasterosteus_aculeatus.BROADS1.70.gtf.gz -fasta
Gasterosteus_aculeatus.BROADS1.70.dna.toplevel.fa -species
gasterosteus_aculeatus -dir test/ -db 70
perl variant_effect_predictor.pl -i gastero_in.txt -species
gasterosteus_aculeatus -force -off -dir test/ -db 70
grep -v # variant_effect_output.txt

groupXIX_2822477_C/T    groupXIX:2822477        T       ENSGACG00000003129
     ENSGACT00000004109      Transcript      missense_variant  67       49
     17      A/T     Gcg/Acg -
groupXIX_2822500_T/C    groupXIX:2822500        C       ENSGACG00000003129
     ENSGACT00000004109      Transcript      missense_variant  44       26
     9       D/G     gAc/gGc -
groupXIX_2822523_C/T    groupXIX:2822523        T       ENSGACG00000003129
     ENSGACT00000004109      Transcript      initiator_codon_variant    21
     3       1       M/I     atG/atA -
groupXIX_2822541_T/A    groupXIX:2822541        A       ENSGACG00000003129
     ENSGACT00000004109      Transcript      5_prime_UTR_variant
        3       -       -       -       -       -

This works the same if I use the version 67 files as it appears you have.

So I suspect there is something different about your FASTA file - you could
check that the sequence of the groupXIX file matches that in the file I
link to above (do an md5sum or some such thing).

It is also possible that an issue with older versions of BioPerl is to
blame - there was a known bug in the way BioPerl indexes large FASTA file.
Normally for Ensembl we recommend using BioPerl 1.2.3 (which contains the
bug), but VEP works fine with the latest version. I'd try updating your
BioPerl install to the latest version, remove the *.fa.index file that is
generated next to your .fa file, and try re-running gtf2vep.pl

Beyond this it's hard to say what's happening without seeing the contents
of your GTF and FASTA files. If the problem persists, perhaps you could
just pull out the lines in the GTF for ENSGACT00000004109 and the sequence
for groupXIX and if that still gives you the same problem, send them to me
so I can debug.

Hope this helps!

Will


On 19 March 2013 12:07, Heidi Viitaniemi <hmviit at utu.fi> wrote:

>  Hi Will,
>
> And thank you for your response. I'll wait for the solution. I like the
> idea that you can incorporate your own data to run VEP.
>
> Thanks,
> Heidi Viitaniemi
>
>
>
> 19.3.2013 13:42, Will McLaren kirjoitti:
>
> Hello Heidi,
>
>  Thanks for finding this - the causes of this bug are I believe somewhat
> complex so may take a while to get to the bottom of it.
>
>  Just wanted to let you know that your mail is not being ignored!
>
>  Regards
>
>  Will McLaren
> Ensembl Variation
>
>
> On 18 March 2013 13:48, Heidi Viitaniemi <hmviit at utu.fi> wrote:
>
>>  Hi,
>>
>> I'm running version 2.7 on a unix server. I want to create a custom cache
>> using my own gtf and fasta with gtf2vep.pl. This works without problem
>> and also running VEP seems to go fine. The problem is that, in the output
>> it seems that the cDNA_position, CDS_position and Protein_position are
>> correct given my input gtf file but the calls for Amino_acids and Codons
>> seem completely random. If I run against the cache retrieved from ensembl
>> these are all correct. The version of the genome didn't have an effect on
>> the output, the gtf's haven't changed. The gtf and the fasta that I'm using
>> for the custom originate from the ensembl reference so I don't see any
>> reason why the custom cache shouldn't perform the same way as the reference
>> from ensembl cache. Could there be bug that somehow messes up the link
>> between the custom gtf and fasta in my run? Below are the commands I ran
>> and a snippet of the output's I got.
>>
>> Thanks,
>> Heidi Viitaniemi
>>
>> For custom cache I'm running (wrong output for Amino_acids and Codons)
>> perl gtf2vep.pl -i GasAcu1.67_group_xixflip.gtf -f
>> gasAcu_group_withoutbac_inv7.fa -d 67 -s
>> Gasterosteus_aculeatus_XIXflipped_18032013
>> perl variant_effect_predictor.pl -offline 1 -dir $HOME/.vep -i
>> ens_realigned_AK_F.var.vcf -format vcf -fork 4 -db_version 67 -species
>> Gasterosteus_aculeatus_XIXflipped_18032013 -numbers -per_gene -buffer_size
>> 10000 -o VEP_18032013_exon_pergene_AK_F.var.vcf.txt
>>
>>        groupXIX_2822477_C/T groupXIX:2822477 T ENSGACG00000003129
>> ENSGACT00000004109 Transcript missense_variant 67 49 17 G/R Gga/Aga -
>> EXON=1/2  groupXIX_2822500_T/C groupXIX:2822500 C ENSGACG00000003129
>> ENSGACT00000004109 Transcript missense_variant 44 26 9 Y/C tAt/tGt -
>> EXON=1/2  groupXIX_2822523_C/T groupXIX:2822523 T ENSGACG00000003129
>> ENSGACT00000004109 Transcript synonymous_variant 21 3 1 R cgG/cgA -
>> EXON=1/2  groupXIX_2822541_T/A groupXIX:2822541 A ENSGACG00000003129
>> ENSGACT00000004109 Transcript 5_prime_UTR_variant 3 - - - - - EXON=1/2
>>
>> For ensembl cache I'm running (correct output for Amino_acids and Codons)
>> perl variant_effect_predictor.pl -offline -dir $HOME/.vep -i
>> ens_realigned_AK_F.var.vcf -format vcf -fork 4 -db_version 69 -species
>> gasterosteus_aculeatus -numbers -per_gene -buffer_size 10000 -o
>> ensVEP_18032013_exon_pergene_AK_F.var.vcf.txt
>>
>>         groupXIX_2822477_C/T groupXIX:2822477 T ENSGACG00000003129
>> ENSGACT00000004109 Transcript missense_variant 67 49 17 A/T Gcg/Acg -
>> EXON=1/2  groupXIX_2822500_T/C groupXIX:2822500 C ENSGACG00000003129
>> ENSGACT00000004109 Transcript missense_variant 44 26 9 D/G gAc/gGc -
>> EXON=1/2  groupXIX_2822523_C/T groupXIX:2822523 T ENSGACG00000003129
>> ENSGACT00000004109 Transcript initiator_codon_variant 21 3 1 M/I atG/atA
>> - EXON=1/2  groupXIX_2822541_T/A groupXIX:2822541 A ENSGACG00000003129
>> ENSGACT00000004109 Transcript 5_prime_UTR_variant 3 - - - - - EXON=1/2
>>
>> --
>> ______________________________________________
>>
>> Heidi Viitaniemi
>> PhD student
>> Division of Genetics and Physiology
>> Department of Biology
>> Itäinen Pitkäkatu 4A, 7th floor (Pharmacity)
>> University of Turku
>> 20520 Turku
>>
>> FINLAND
>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>>
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
> --
> ______________________________________________
>
> Heidi Viitaniemi
> PhD student
> Division of Genetics and Physiology
> Department of Biology
> Itäinen Pitkäkatu 4A, 7th floor (Pharmacity)
> University of Turku
> 20520 Turku
>
> FINLAND
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20130326/a18b58a0/attachment.html>


More information about the Dev mailing list