[ensembl-dev] Request to add one species to VEP pre-built cache

Dan Sun meredithfy at gmail.com
Tue Aug 4 15:52:51 BST 2015


Thank you, Fiona. That is very helpful!

Dan

On Tue, Aug 4, 2015 at 10:37 AM, Fiona Cunningham <fiona at ebi.ac.uk> wrote:

> Hi Dan,
>
> Thanks for getting in touch. The VEP considers each variant separately,
> even if they are in the same codon. This is because the variants may be on
> different strands. You can add information on this using a plugin e.g.
> https://github.com/ensembl-variation/VEP_plugins/blob/master/SameCodon.pm
>
> See more info here:
> http://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html
>
> Fiona
> -----------------------------------------------------------------
> Fiona Cunningham, PhD
> Variation Annotation Coordinator,
> European Bioinformatics Institute (EMBL-EBI)
> www.ensembl.org || www.lrg-sequence.org
>
> On 30 July 2015 at 23:00, Dan Sun <meredithfy at gmail.com> wrote:
>
>> Hi Will,
>>
>> Thanks again! I have another minor bug to report.
>>
>> For mutations in the same codon, VEP annotates them separately. This
>> could sometimes cause problems. The following is the VEP output for two
>> mutations in the same codon:
>>
>> NW_005081561.1_649917_G/A NW_005081561.1:649917 A 102066196 XM_005485125.1 Transcript missense_variant 6577 6577 2193 D/N Gat/Aat - IMPACT=MODERATE;STRAND=1
>> NW_005081561.1_649918_A/G NW_005081561.1:649918 G 102066196 XM_005485125.1 Transcript missense_variant 6578 6578 2193 D/G gAt/gGt - IMPACT=MODERATE;STRAND=1
>>
>> However, instead of GAT -> AAT or GAT -> GGT, the true mutation is GAT ->
>> AGT. The amino acid changes from D to S, not to N or G. I think an output
>> like this might make more sense:
>>
>> NW_005081561.1_649917_GA/AG NW_005081561.1:649917-649918 AG 102066196 XM_005485125.1 Transcript missense_variant 6577-6578 6577-6578 2193 D/S Gat/AGt - IMPACT=MODERATE;STRAND=1
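
The difference between the two kinds of output can be illustrated with a minimal Python sketch. It is not part of VEP or of the SameCodon plugin; the function name `apply_snvs` and the four-entry codon table are illustrative assumptions. It also assumes that both alternate alleles sit on the same haplotype, which unphased input does not guarantee, and that the transcript is on the forward strand (STRAND=1 in the output above), so codon offsets map directly onto genomic order.

```python
# Tiny codon table covering only the codons in this example (assumption:
# standard genetic code; a real tool would use a complete table).
CODON_TABLE = {"GAT": "D", "AAT": "N", "GGT": "G", "AGT": "S"}


def apply_snvs(ref_codon, snvs):
    """Return the codon after applying SNVs given as (offset, ref, alt) tuples.

    offset is the 0-based position of the base within the codon.
    """
    bases = list(ref_codon)
    for offset, ref, alt in snvs:
        if bases[offset] != ref:
            raise ValueError(f"reference mismatch at codon offset {offset}")
        bases[offset] = alt
    return "".join(bases)


ref = "GAT"            # reference codon at protein position 2193
snv1 = (0, "G", "A")   # NW_005081561.1:649917 G/A
snv2 = (1, "A", "G")   # NW_005081561.1:649918 A/G

# Each variant annotated on its own, as in the reported VEP output
separate = [CODON_TABLE[apply_snvs(ref, [snv])] for snv in (snv1, snv2)]

# Both variants applied together (assumes they are on the same haplotype)
combined = CODON_TABLE[apply_snvs(ref, [snv1, snv2])]

print(separate, combined)  # ['N', 'G'] S
```

Taken separately the codon becomes AAT (N) or GGT (G); taken together it becomes AGT (S), which matches the D/S change suggested above.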
>>
>> Thanks,
>> Dan
>>
>> On Thu, Jul 30, 2015 at 10:21 AM, Dan Sun <meredithfy at gmail.com> wrote:
>>
>>> Hi Will,
>>>
>>> Thank you! It works like a charm.
>>>
>>> Have a great day!
>>>
>>> Dan
>>>
>>>
>>> On Thu, Jul 30, 2015 at 5:49 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> Thanks for the report, we are still working on ironing out some issues
>>>> in the GFF parser.
>>>>
>>>> I've added some fixes to the release/81 version of gtf2vep.pl which
>>>> should correct the problems you are seeing.
>>>>
>>>> Regards
>>>>
>>>> Will
>>>>
>>>> On 29 July 2015 at 22:21, Dan Sun <meredithfy at gmail.com> wrote:
>>>>
>>>>> Hi Will and Christian,
>>>>>
>>>>> Thank you both for your help.
>>>>>
>>>>> I have an additional question. After annotating my VCF file with your
>>>>> cache, I noticed that non-coding variants are marked "intergenic variant"
>>>>> instead of something like "non coding exon variant". For example,
>>>>> NW_005081553.1:4008346G->T is located in an exon of non-coding
>>>>> transcripts of the gene KHDRBS2 (XR_270793.1, XR_270792.1, XR_270795.1,
>>>>> XR_270797.1, XR_270794.1). Do you have any ideas about how to improve the
>>>>> annotation of SNPs in exons of non-coding genes for this species? You can
>>>>> find these non-coding transcripts in the GFF3 file you downloaded from NCBI.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Best,
>>>>> Dan
>>>>>
>>>>> On Tue, Jul 28, 2015 at 5:52 AM, Christian Cole (Staff) <C.Cole at dundee.ac.uk> wrote:
>>>>>
>>>>>> Sorry, I couldn't leave this alone. I don't think I've done enough
>>>>>> coding lately ;)
>>>>>>
>>>>>> You can shorten it a fair bit further with the magic -a (auto-split)
>>>>>> and -p (auto-print) switches:
>>>>>> gzip -dc 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz \
>>>>>>   | perl -F'/\|/' -lape 's/^>.*/>$F[3]/' \
>>>>>>   > 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa
>>>>>>
>>>>>> -a splits each line by the pattern given by -F (whitespace by
>>>>>> default) and puts the fields into @F
>>>>>> -p wraps your code in while (<>) { ...; print }
>>>>>>
>>>>>> Using a substitution rather than an if() simplifies the defline fix,
>>>>>> although it's a lot less legible.
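
For readers who find the one-liner hard to follow, roughly the same defline rewrite can be sketched in Python. This is not from the thread: the function name `shorten_defline` and the example header are illustrative assumptions, based on NCBI-style deflines in which the accession is the fourth '|'-separated field. Unlike the Perl version, this sketch leaves a header untouched if it has fewer than four fields.

```python
def shorten_defline(line):
    """Rewrite a FASTA header to '>' plus its fourth '|'-separated field.

    Sequence lines, and headers with fewer than four fields, are returned
    unchanged.
    """
    if line.startswith(">"):
        fields = line.rstrip("\n").split("|")
        if len(fields) > 3:
            return ">" + fields[3] + "\n"
    return line


# Hypothetical header in the gi|...|ref|ACCESSION| style
example = ">gi|000000000|ref|NW_005081561.1| Zonotrichia albicollis scaffold\n"
print(shorten_defline(example), end="")  # >NW_005081561.1
```

Applied line by line to the decompressed FASTA, this should give the same headers as the Perl commands in the thread.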
>>>>>>
>>>>>> OK. I feel better now...
>>>>>> Cheers,
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>> From: <dev-bounces at ensembl.org> on behalf of Will McLaren
>>>>>> Reply-To: Ensembl developers list
>>>>>> Date: Tuesday, 28 July 2015 10:16
>>>>>>
>>>>>> To: Ensembl developers list
>>>>>> Subject: Re: [ensembl-dev] Request to add one species to VEP
>>>>>> pre-built cache
>>>>>>
>>>>>> Thanks Chris - always good to shorten one-liners.
>>>>>>
>>>>>> And you're correct, the space is not intentional; the command should
>>>>>> be:
>>>>>>
>>>>>> gzip -dc 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz \
>>>>>>   | perl -lne 'if(/^\>/) { $id = (split /\|/, $_)[3]; print ">$id";} else {print}' \
>>>>>>   > 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Will
>>>>>>
>>>>>> On 28 July 2015 at 10:09, Christian Cole (Staff) <C.Cole at dundee.ac.uk> wrote:
>>>>>>
>>>>>>> Hi Will,
>>>>>>>
>>>>>>> Just a quick tip. Using the perl -n switch avoids 'while(<>) { }'
>>>>>>> and -l switch avoids having to terminate print statements with '\n'. So
>>>>>>> your code can be tidied up a touch with:
>>>>>>> gzip -dc 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz \
>>>>>>>   | perl -lne 'if(/^\>/) { $id = (split /\|/, $_)[3]; print "> $id";} else {print}' \
>>>>>>>   > 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa
>>>>>>>
>>>>>>> Also, is the space in '> $id' intentional? That's not typical
>>>>>>> behaviour for fasta files.
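
To illustrate why the space matters: many FASTA readers take the sequence ID to be the text between '>' and the first whitespace character. That is an assumption about the reader; the thread does not say how the VEP indexer parses headers. Under that convention, a header written as '> ID' yields an empty ID, as this small Python sketch shows (the function name `fasta_id` is illustrative).

```python
import re


def fasta_id(defline):
    """Return the text between '>' and the first whitespace character."""
    match = re.match(r">(\S*)", defline)
    if match is None:
        raise ValueError("not a FASTA header line")
    return match.group(1)


print(repr(fasta_id(">NW_005081561.1")))   # 'NW_005081561.1'
print(repr(fasta_id("> NW_005081561.1")))  # ''
```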
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>> From: <dev-bounces at ensembl.org> on behalf of Will McLaren
>>>>>>> Reply-To: Ensembl developers list
>>>>>>> Date: Monday, 27 July 2015 17:27
>>>>>>> To: Ensembl developers list
>>>>>>> Subject: Re: [ensembl-dev] Request to add one species to VEP
>>>>>>> pre-built cache
>>>>>>>
>>>>>>> Hi Dan,
>>>>>>>
>>>>>>> We have in fact just updated our GTF converter script to support GFF
>>>>>>> too (get the new release, 81, for this capability).
>>>>>>>
>>>>>>> However, giving it a go just now with that file I noticed the FASTA
>>>>>>> file supplied doesn't play nicely with our indexer, so I tweaked the FASTA
>>>>>>> to get it to run. Long story short, here's the cache:
>>>>>>>
>>>>>>>
>>>>>>> https://dl.dropboxusercontent.com/u/12936195/zonotrichia_albicollis.tar.gz
>>>>>>>
>>>>>>> And here's the long story, i.e. what I did to generate it if you
>>>>>>> want to do the same:
>>>>>>>
>>>>>>> wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Zonotrichia_albicollis/GFF/ref_Zonotrichia_albicollis-1.0.1_scaffolds.gff3.gz
>>>>>>> wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Zonotrichia_albicollis/CHR_Un/44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz
>>>>>>> gzip -dc 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz \
>>>>>>>   | perl -e 'while(<>) { if(/^\>/) { $id = (split /\|/, $_)[3]; print "> $id\n";} else {print}}' \
>>>>>>>   > 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa
>>>>>>> perl gtf2vep.pl -i ref_Zonotrichia_albicollis-1.0.1_scaffolds.gff3.gz \
>>>>>>>   -fasta 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa \
>>>>>>>   -species zonotrichia_albicollis
>>>>>>>
>>>>>>> Then run the VEP as follows:
>>>>>>>
>>>>>>> perl variant_effect_predictor.pl -offline -species zonotrichia_albicollis -i variants.vcf
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Will McLaren
>>>>>>> Ensembl Variation
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 27 July 2015 at 16:49, Dan Sun <meredithfy at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I was trying to build a cache from GTF for white-throated sparrow
>>>>>>>> by myself following the tutorial, but was not successful. If possible,
>>>>>>>> could you please add this species to the download list? I would really
>>>>>>>> appreciate that!
>>>>>>>>
>>>>>>>> You may download the GFF3 annotation for this species from NCBI ftp
>>>>>>>> (ftp://ftp.ncbi.nlm.nih.gov/genomes/Zonotrichia_albicollis/GFF/ref_Zonotrichia_albicollis-1.0.1_scaffolds.gff3.gz)
>>>>>>>> and convert it to GTF.
>>>>>>>>
>>>>>>>> Thank you very much!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dan
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Dev mailing list    Dev at ensembl.org
>>>>>>>> Posting guidelines and subscribe/unsubscribe info:
>>>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> The University of Dundee is a registered Scottish Charity, No:
>>>>>>> SC015096
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Dan Sun
>>>>> Graduate student of Bioinformatics
>>>>> School of Biology
>>>>> Georgia Institute of Technology
>>>>>


-- 
Dan Sun
Graduate student of Bioinformatics
School of Biology
Georgia Institute of Technology