[ensembl-dev] Request to add one species to VEP pre-built cache

Will McLaren wm2 at ebi.ac.uk
Thu Jul 30 10:49:37 BST 2015


Hi Dan,

Thanks for the report. We are still ironing out some issues in the GFF
parser.

I've added some fixes to the release/81 version of gtf2vep.pl which should
correct the problems you are seeing.

Regards

Will

On 29 July 2015 at 22:21, Dan Sun <meredithfy at gmail.com> wrote:

> Hi Will and Christian,
>
> Thank you both for your help.
>
> I have an additional question. After annotating my VCF file with your
> cache, I noticed that non-coding variants are marked "intergenic variant"
> instead of something like "non coding exon variant". For example,
> NW_005081553.1:4008346G->T is a variant located in an exon of non-coding
> transcripts of the gene KHDRBS2 (XR_270793.1, XR_270792.1, XR_270795.1,
> XR_270797.1, XR_270794.1). Do you have any ideas about how to improve the
> annotation of SNPs in exons of non-coding genes for this species? You can
> find these non-coding transcripts in the GFF3 file you downloaded from NCBI.
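
One way to confirm that a variant position really falls inside an annotated exon is to scan the GFF3 directly. The following is a minimal sketch on a toy file: the exon coordinates, feature IDs and the `transcript_id` attribute are made up for illustration, and only the nine-column tab-separated layout is taken from the GFF3 format itself.

```shell
# Toy GFF3 with made-up coordinates; only the column layout follows GFF3.
gff=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  NW_005081553.1 Gnomon exon 4008300 4008400 . + . 'ID=id1;Parent=rna1;transcript_id=XR_270793.1' \
  NW_005081553.1 Gnomon exon 4009000 4009100 . + . 'ID=id2;Parent=rna1;transcript_id=XR_270793.1' \
  NW_005081553.1 Gnomon CDS 100 200 . + 0 'ID=cds1;Parent=rna2' \
  > "$gff"

# Print the attributes of exon features on the given sequence that span
# the given 1-based position (columns 4 and 5 are start and end).
hits=$(awk -F'\t' -v seq=NW_005081553.1 -v pos=4008346 \
  '$1 == seq && $3 == "exon" && $4 <= pos && $5 >= pos { print $9 }' "$gff")
echo "$hits"
rm -f "$gff"
```

On the real annotation the same awk filter could read from `gzip -dc ref_Zonotrichia_albicollis-1.0.1_scaffolds.gff3.gz` instead of the toy file.
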
>
> Thanks!
>
> Best,
> Dan
>
> On Tue, Jul 28, 2015 at 5:52 AM, Christian Cole (Staff) <
> C.Cole at dundee.ac.uk> wrote:
>
>>   Sorry, I couldn't leave this alone. I don't think I've done enough
>> coding lately ;)
>>
>> You can shorten it a fair bit further with the magic -a (auto-split)
>> and -p (auto-print) switches:
>>
>> gzip -dc 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz \
>>   | perl -F'/\|/' -lape 's/^>.*/>$F[3]/' \
>>   > 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa
>>
>> -a splits each line on the pattern given by -F (whitespace by default)
>> and puts the fields into @F.
>> -p wraps your code in while (<>) { ... } and prints $_ after each line.
>>
>> Using substitution rather than an if() simplifies the defline fix,
>> although it's a lot less legible.
>>
>>  OK. I feel better now...
>> Cheers,
>>
>>  Chris
>>
>> From: <dev-bounces at ensembl.org> on behalf of Will McLaren
>> Reply-To: Ensembl developers list
>> Date: Tuesday, 28 July 2015 10:16
>> To: Ensembl developers list
>> Subject: Re: [ensembl-dev] Request to add one species to VEP pre-built cache
>>
>>   Thanks Chris - always good to shorten one-liners.
>>
>>  And you're correct, the space is not intentional; the command should be:
>>
>> gzip -dc 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz \
>>   | perl -lne 'if(/^\>/) { $id = (split /\|/, $_)[3]; print ">$id";} else {print}' \
>>   > 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa
>>
>>  Regards
>>
>> Will
>>
>> On 28 July 2015 at 10:09, Christian Cole (Staff) <C.Cole at dundee.ac.uk>
>> wrote:
>>
>>>   Hi Will,
>>>
>>>  Just a quick tip. Using the perl -n switch avoids 'while(<>) { }' and
>>> -l switch avoids having to terminate print statements with '\n'. So your
>>> code can be tidied up a touch with:
>>> gzip -dc 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz \
>>>   | perl -lne 'if(/^\>/) { $id = (split /\|/, $_)[3]; print "> $id";} else {print}' \
>>>   > 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa
>>>
>>>    Also, is the space in '> $id' intentional? That's not typical
>>> behaviour for fasta files.
>>> Cheers,
>>>
>>>  Chris
>>>
>>>   From: <dev-bounces at ensembl.org> on behalf of Will McLaren
>>> Reply-To: Ensembl developers list
>>> Date: Monday, 27 July 2015 17:27
>>> To: Ensembl developers list
>>> Subject: Re: [ensembl-dev] Request to add one species to VEP pre-built
>>> cache
>>>
>>>   Hi Dan,
>>>
>>>  We have in fact just updated our GTF converter script to support GFF
>>> too (get the new release, 81, for this capability).
>>>
>>>  However, giving it a go just now with that file I noticed the FASTA
>>> file supplied doesn't play nicely with our indexer, so I tweaked the FASTA
>>> to get it to run. Long story short, here's the cache:
>>>
>>>
>>> https://dl.dropboxusercontent.com/u/12936195/zonotrichia_albicollis.tar.gz
>>>
>>>  And here's the long story, i.e. what I did to generate it if you want
>>> to do the same:
>>>
>>> wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Zonotrichia_albicollis/GFF/ref_Zonotrichia_albicollis-1.0.1_scaffolds.gff3.gz
>>> wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Zonotrichia_albicollis/CHR_Un/44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz
>>> gzip -dc 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa.gz \
>>>   | perl -e 'while(<>) { if(/^\>/) { $id = (split /\|/, $_)[3]; print "> $id\n";} else {print}}' \
>>>   > 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa
>>> perl gtf2vep.pl -i ref_Zonotrichia_albicollis-1.0.1_scaffolds.gff3.gz \
>>>   -fasta 44394_ref_Zonotrichia_albicollis-1.0.1_chrUn.fa \
>>>   -species zonotrichia_albicollis
>>>
>>>  Then run the VEP as follows:
>>>
>>> perl variant_effect_predictor.pl -offline -species zonotrichia_albicollis -i variants.vcf
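
Because the defline rewrite leaves bare accessions in the FASTA, one sanity check before building the cache is to confirm that every sequence name in column 1 of the GFF3 has a matching FASTA record. Below is a minimal sketch on toy files. The file names and the accession NW_005099999.1 are hypothetical, and the idea that the converter needs these names to agree is an assumption here, not something stated in this thread.

```shell
# Hypothetical toy inputs standing in for the real NCBI files.
tmp=$(mktemp -d)
printf '>NW_005081553.1\nACGT\n>NW_005081554.1\nGGCC\n' > "$tmp/toy.fa"
printf '##gff-version 3\nNW_005081553.1\tGnomon\texon\t1\t4\t.\t+\t.\tID=id1\nNW_005099999.1\tGnomon\texon\t1\t4\t.\t+\t.\tID=id2\n' \
  > "$tmp/toy.gff3"

# Sequence IDs in the FASTA: header lines, '>' stripped, first word only.
grep '^>' "$tmp/toy.fa" | sed 's/^>//; s/[[:space:]].*//' | sort -u > "$tmp/fa.ids"

# Sequence IDs in the GFF3: column 1 of every non-comment line.
grep -v '^#' "$tmp/toy.gff3" | cut -f1 | sort -u > "$tmp/gff.ids"

# comm -13 keeps lines found only in the second file:
# IDs annotated in the GFF3 but absent from the FASTA.
missing=$(comm -13 "$tmp/fa.ids" "$tmp/gff.ids")
echo "missing from FASTA: $missing"
rm -rf "$tmp"
```

An empty result would mean every annotated sequence has a FASTA record under the same name.
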
>>>
>>>  Regards
>>>
>>>  Will McLaren
>>> Ensembl Variation
>>>
>>>
>>>
>>>
>>> On 27 July 2015 at 16:49, Dan Sun <meredithfy at gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>  I was trying to build a cache from GTF for white-throated sparrow by
>>>> myself following the tutorial, but was not successful. If possible, could
>>>> you please add this species to the download list? I would really appreciate
>>>> that!
>>>>
>>>>  You may download the GFF3 annotation for this species from NCBI ftp (
>>>> ftp://ftp.ncbi.nlm.nih.gov/genomes/Zonotrichia_albicollis/GFF/ref_Zonotrichia_albicollis-1.0.1_scaffolds.gff3.gz)
>>>> and convert it to GTF.
>>>>
>>>>  Thank you very much!
>>>>
>>>>  --
>>>>  Dan
>>>>
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>>
>>>
>>> The University of Dundee is a registered Scottish Charity, No: SC015096
>
>
> --
> Dan Sun
> Graduate student of Bioinformatics
> School of Biology
> Georgia Institute of Technology