[ensembl-dev] Request to add one species to VEP pre-built cache

Dan Sun meredithfy at gmail.com
Thu Jul 30 23:00:48 BST 2015


Hi Will,

Thanks again! I have another minor bug to report.

VEP annotates mutations in the same codon separately. This can produce a
misleading amino acid change when both mutations are present together. The
following is the VEP output for two mutations in the same codon:

NW_005081561.1_649917_G/A NW_005081561.1:649917 A 102066196 XM_005485125.1
Transcript missense_variant 6577 6577 2193 D/N Gat/Aat -
IMPACT=MODERATE;STRAND=1
NW_005081561.1_649918_A/G NW_005081561.1:649918 G 102066196 XM_005485125.1
Transcript missense_variant 6578 6578 2193 D/G gAt/gGt -
IMPACT=MODERATE;STRAND=1

However, instead of GAT -> AAT or GAT -> GGT, the true mutation is GAT ->
AGT. The amino acid changes from D to S, not to N or G. I think an output
like this might make more sense:

NW_005081561.1_649917_GA/AG NW_005081561.1:649917-649918 AG 102066196
XM_005485125.1 Transcript missense_variant 6577-6578 6577-6578 2193 D/S
Gat/AGt - IMPACT=MODERATE;STRAND=1
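
To illustrate the combined annotation suggested above, here is a minimal
Python sketch. It is not VEP code: the function name merge_codon_snvs and
the abbreviated codon table are assumptions made for illustration, and it
assumes both variants lie on the same haplotype and are given as CDS
positions on the transcript strand.

```python
# Illustrative sketch only, not part of VEP. Assumes both SNVs are on the
# same haplotype and are given as (1-based CDS position, alt base).

# Abbreviated codon table: only the codons needed for this example.
CODON_TABLE = {"GAT": "D", "AAT": "N", "GGT": "G", "AGT": "S"}


def merge_codon_snvs(ref_codon, codon_number, snvs):
    """Apply all SNVs that fall in one codon and return (alt_codon, aa_change).

    ref_codon    -- reference codon, e.g. "GAT"
    codon_number -- 1-based codon index in the CDS, e.g. 2193
    snvs         -- iterable of (cds_position, alt_base), 1-based positions
    """
    codon_start = (codon_number - 1) * 3 + 1  # CDS position of the first base
    bases = list(ref_codon.upper())
    for cds_pos, alt in snvs:
        offset = cds_pos - codon_start
        if not 0 <= offset < 3:
            raise ValueError(f"position {cds_pos} is outside codon {codon_number}")
        bases[offset] = alt.upper()
    alt_codon = "".join(bases)
    aa_change = f"{CODON_TABLE[ref_codon.upper()]}/{CODON_TABLE[alt_codon]}"
    return alt_codon, aa_change


# The two variants from the VEP output above, merged into one change.
print(merge_codon_snvs("GAT", 2193, [(6577, "A"), (6578, "G")]))
```

Applied one at a time, the same function gives D/N and D/G, which matches
the separate annotations VEP currently reports.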

Thanks,
Dan

On Thu, Jul 30, 2015 at 10:21 AM, Dan Sun <meredithfy at gmail.com> wrote:

> Hi Will,
>
> Thank you! It works like a charm.
>
> Have a great day!
>
> Dan
>
>
> On Thu, Jul 30, 2015 at 5:49 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
>
>> Hi Dan,
>>
>> Thanks for the report, we are still working on ironing out some issues in
>> the GFF parser.
>>
>> I've added some fixes to the release/81 version of gtf2vep.pl which
>> should correct the problems you are seeing.
>>
>> Regards
>>
>> Will
>>
>> On 29 July 2015 at 22:21, Dan Sun <meredithfy at gmail.com> wrote:
>>
>>> Hi Will and Christian,
>>>
>>> Thank you both for your help.
>>>
>>> I have an additional question. After annotating my VCF file using your
>>> cache, I noticed that non-coding variants are marked "intergenic variant"
>>> instead of something like "non coding exon variant". For example,
>>> NW_005081553.1:4008346G->T is a variant located in an exon of non-coding
>>> transcripts of gene KHDRBS2 (XR_270793.1, XR_270792.1, XR_270795.1,
>>> XR_270797.1, XR_270794.1). Do you have any ideas about how to improve the
>>> annotation of SNPs in exons of non-coding genes for this species? You can
>>> find these non-coding transcripts in the GFF3 file you downloaded from NCBI.
>>>
>>> Thanks!
>>>
>>> Best,
>>> Dan
>>>
>>> On Tue, Jul 28, 2015 at 5:52 AM, Christian Cole (Staff) <
>>> C.Cole at dundee.ac.uk> wrote:
>>>
>>>>   Sorry, I couldn't leave this alone. I don't think I've done enough
>>>> coding lately ;)
>>>>
>>>>  You can shorten it a fair bit further with the magic -a (auto-split)
>>>> and -p (auto-print) switches:
>>>> gzip -dc 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz | perl
>>>> -F'/\|/' -lape 's/^>.*/>$F[3]/' >
>>>> 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa
>>>>
>>>>  -a splits each line by the pattern given by -F (whitespace by
>>>> default) and puts it into @F
>>>> -p wraps your code in while (<>) { ... } and prints $_ after each line
>>>>
>>>>  Using substitution rather than an if() simplifies the defline fix.
>>>> Although, it's a lot less legible.
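
For comparison, the same defline fix can be written more legibly in Python.
This is only a sketch: the function name simplify_defline and the script
name in the usage comment are made up for illustration, and it assumes
NCBI-style headers in which the accession is the fourth pipe-separated
field, as in the one-liners above.

```python
import gzip
import sys


def simplify_defline(line):
    """Return a FASTA line with the header reduced to its 4th '|' field.

    Sequence lines, and headers with fewer than four fields, are returned
    unchanged (an assumption of this sketch, not of the Perl one-liner).
    """
    if line.startswith(">"):
        fields = line.rstrip("\n").split("|")
        if len(fields) > 3:
            return ">" + fields[3] + "\n"
    return line


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical script name): python fix_defline.py in.fa.gz out.fa
    with gzip.open(sys.argv[1], "rt") as src, open(sys.argv[2], "w") as dst:
        for line in src:
            dst.write(simplify_defline(line))
```

Unlike the substitution, this version leaves a header untouched when it has
too few fields, rather than writing an empty identifier.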
>>>>
>>>>  OK. I feel better now...
>>>> Cheers,
>>>>
>>>>  Chris
>>>>
>>>>   From: <dev-bounces at ensembl.org> on behalf of Will McLaren
>>>> Reply-To: Ensembl developers list
>>>> Date: Tuesday, 28 July 2015 10:16
>>>>
>>>> To: Ensembl developers list
>>>> Subject: Re: [ensembl-dev] Request to add one species to VEP pre-built
>>>> cache
>>>>
>>>>   Thanks Chris - always good to shorten one-liners.
>>>>
>>>>  And you're correct, the space is not intentional; the command should
>>>> be:
>>>>
>>>> gzip -dc 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz | perl -lne
>>>> 'if(/^\>/) { $id = (split /\|/, $_)[3]; print ">$id";} else {print}' >
>>>> 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa
>>>>
>>>>  Regards
>>>>
>>>> Will
>>>>
>>>> On 28 July 2015 at 10:09, Christian Cole (Staff) <C.Cole at dundee.ac.uk>
>>>> wrote:
>>>>
>>>>>   Hi Will,
>>>>>
>>>>>  Just a quick tip. Using the perl -n switch avoids 'while(<>) { }'
>>>>> and -l switch avoids having to terminate print statements with '\n'. So
>>>>> your code can be tidied up a touch with:
>>>>> gzip -dc 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz | perl
>>>>> -lne 'if(/^\>/) { $id = (split /\|/, $_)[3]; print "> $id";} else {print}'
>>>>> > 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa
>>>>>
>>>>>    Also, is the space in '> $id' intentional? That's not typical
>>>>> behaviour for fasta files.
>>>>> Cheers,
>>>>>
>>>>>  Chris
>>>>>
>>>>>   From: <dev-bounces at ensembl.org> on behalf of Will McLaren
>>>>> Reply-To: Ensembl developers list
>>>>> Date: Monday, 27 July 2015 17:27
>>>>> To: Ensembl developers list
>>>>> Subject: Re: [ensembl-dev] Request to add one species to VEP
>>>>> pre-built cache
>>>>>
>>>>>   Hi Dan,
>>>>>
>>>>>  We have in fact just updated our GTF converter script to support GFF
>>>>> too (get the new release, 81, for this capability).
>>>>>
>>>>>  However, giving it a go just now with that file I noticed the FASTA
>>>>> file supplied doesn't play nicely with our indexer, so I tweaked the FASTA
>>>>> to get it to run. Long story short, here's the cache:
>>>>>
>>>>>
>>>>> https://dl.dropboxusercontent.com/u/12936195/zonotrichia_albicollis.tar.gz
>>>>>
>>>>>  And here's the long story, i.e. what I did to generate it if you
>>>>> want to do the same:
>>>>>
>>>>>  wget
>>>>> ftp://ftp.ncbi.nlm.nih.gov/genomes/Zonotrichia_albicollis/GFF/ref_Zonotrichia_albicollis-1.0.1_scaffolds.gff3.gz
>>>>>  wget
>>>>> ftp://ftp.ncbi.nlm.nih.gov/genomes/Zonotrichia_albicollis/CHR_Un/44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz
>>>>>  gzip -dc 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz | perl
>>>>> -e 'while(<>) { if(/^\>/) { $id = (split /\|/, $_)[3]; print "> $id\n";}
>>>>> else {print}}' > 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa
>>>>> perl gtf2vep.pl -i ref_Zonotrichia_albicollis-1.0.1_scaffolds.gff3.gz
>>>>> -fasta 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa -species
>>>>> zonotrichia_albicollis
>>>>>
>>>>>  Then run the VEP as follows:
>>>>>
>>>>>  perl variant_effect_predictor.pl -offline -species
>>>>> zonotrichia_albicollis -i variants.vcf
>>>>>
>>>>>  Regards
>>>>>
>>>>>  Will McLaren
>>>>> Ensembl Variation
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 27 July 2015 at 16:49, Dan Sun <meredithfy at gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>  I was trying to build a cache from GTF for white-throated sparrow
>>>>>> by myself following the tutorial, but was not successful. If possible,
>>>>>> could you please add this species to the download list? I would really
>>>>>> appreciate that!
>>>>>>
>>>>>>  You may download the GFF3 annotation for this species from NCBI ftp
>>>>>> (
>>>>>> ftp://ftp.ncbi.nlm.nih.gov/genomes/Zonotrichia_albicollis/GFF/ref_Zonotrichia_albicollis-1.0.1_scaffolds.gff3.gz)
>>>>>> and convert it to GTF.
>>>>>>
>>>>>>  Thank you very much!
>>>>>>
>>>>>>  --
>>>>>>  Dan
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dev mailing list    Dev at ensembl.org
>>>>>> Posting guidelines and subscribe/unsubscribe info:
>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>>
>>>>>>
>>>>>
>>>>> The University of Dundee is a registered Scottish Charity, No: SC015096
>>> --
>>> Dan Sun
>>> Graduate student of Bioinformatics
>>> School of Biology
>>> Georgia Institute of Technology



-- 
Dan Sun
Graduate student of Bioinformatics
School of Biology
Georgia Institute of Technology