[ensembl-dev] VEP on 37, but Gencode 25?

Will McLaren wm2 at ebi.ac.uk
Fri Sep 30 11:03:48 BST 2016


We're straying into the realm of the hack here!

I'm struggling to get it working in the manner you're trying: I can get
past the error you're seeing, but I'm now encountering new ones.

Is there a reason you can't use the new code? The following works nicely
for me:

perl vep.pl -gff gencode.v24lift37.annotation.gff3.gz \
  -fa Homo_sapiens.GRCh37.dna.toplevel.fa.gz -i example_GRCh37.vcf -force \
  -sift b -poly b -database -port 3337 -db 85 \
  -transcript_filter "_source_cache"

The final flag filters out transcripts loaded from the DB so the variants
are not annotated against these too.
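
For anyone scripting this, the command above can be assembled as an argument list. This is only a minimal sketch: it assumes vep.pl and the input files sit in the working directory, the flags are copied verbatim from the command above, and the helper name build_command is made up for illustration.

```python
import shlex

# Flags and file names copied from the command above; nothing here is new VEP syntax.
vep_args = [
    "perl", "vep.pl",
    "-gff", "gencode.v24lift37.annotation.gff3.gz",
    "-fa", "Homo_sapiens.GRCh37.dna.toplevel.fa.gz",
    "-i", "example_GRCh37.vcf",
    "-force",
    "-sift", "b",
    "-poly", "b",
    "-database",
    "-port", "3337",
    "-db", "85",
    "-transcript_filter", "_source_cache",
]


def build_command(args):
    """Join an argument list into one shell-safe command line (hypothetical helper)."""
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Print the command for inspection rather than running it.
    print(build_command(vep_args))
```

Passing the list itself to subprocess.run(vep_args, check=True) would run it without going through a shell, which sidesteps the quoting of the filter argument.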

I'm working on getting a plugin going to work with a separate cache of
SIFT/PolyPhen data, but it may take a little while to get published.

Will

On 29 September 2016 at 15:48, Konrad Karczewski <konradk at broadinstitute.org> wrote:

> Hmm, that's interesting. When I added info.txt, now everything failed with:
>
> Can't call method "db" on unblessed reference at
> /humgen/atgu1/fs03/DM-Lab/vep/ensembl-tools-release-85/scripts/variant_effect_predictor/Bio/EnsEMBL/Variation/TranscriptVariation.pm
> line 324.
>
> -Konrad
>
> On September 29, 2016 at 9:34:18 AM, Will McLaren (wm2 at ebi.ac.uk) wrote:
>
> We might be able to write a plugin to read the data from a pair of table
> dump files.
>
> Let me have a go at doing that, as you are not the only person requesting
> similar at the moment!
>
> Will
>
> On 29 September 2016 at 14:21, Konrad Karczewski <
> konradk at broadinstitute.org> wrote:
>
>> Great, thanks! Will check that out.
>>
>> Is that to say there's no way to get the SIFT and PolyPhen annotations
>> locally? Happy to do some legwork if it means I can recreate the entire
>> thing with this new annotation set!
>>
>> -Konrad
>>
>> On September 29, 2016 at 3:52:47 AM, Will McLaren (wm2 at ebi.ac.uk) wrote:
>>
>> You'd also need to copy over the homo_sapiens/85_GRCh37/info.txt file.
>> This contains the column headers for the _var files, hence the warnings
>> when it finds data that doesn't match its best guess of those headers.
>>
>> RE: SIFT and PolyPhen, if you use --cache instead of --offline you
>> *might* find that it is able to retrieve SIFT and PolyPhen matrices from
>> the database server. I've tested this with the new code but not the version
>> you're on. You might also want to use "--host useastdb.ensembl.org";
>> assuming you're on the East Coast, this will give you the fastest (public)
>> DB connection.
>>
>> Will
>>
>> On 28 September 2016 at 21:00, Konrad Karczewski <
>> konradk at broadinstitute.org> wrote:
>>
>>> Ok, I think I got that mostly working (sorted it properly and converted
>>> transcript_type to transcript_biotype, appears to have worked). I then
>>> pulled the _var and _reg caches over as-is from 85 (not sure if wise).
>>>
>>> Now when I run it, it appears to complete without error, but I'm running
>>> into many of these warnings:
>>>
>>> Use of uninitialized value in list assignment at
>>> /humgen/atgu1/fs03/DM-Lab/vep/ensembl-tools-release-85/scripts/variant_effect_predictor/Bio/EnsEMBL/Variation/Utils/VEP.pm
>>> line 5344, <DUMP> line 1.
>>>
>>> Also, SIFT and PolyPhen don't appear to get output alongside it. Is that
>>> expected (or perhaps related to above warnings)? Anything I can do to get
>>> those in there?
>>>
>>> -Konrad
>>>
>>> On September 27, 2016 at 10:58:54 AM, Will McLaren (wm2 at ebi.ac.uk)
>>> wrote:
>>>
>>> You can try running it with --verbose, it will give you some error
>>> logging.
>>>
>>> Will
>>>
>>> On 27 September 2016 at 15:56, Konrad Karczewski <
>>> konradk at broadinstitute.org> wrote:
>>>
>>>> Ok good to know - I actually tried it, but I think something is being
>>>> odd. It gets through the whole thing (going back and forth between
>>>> chromosomes like you said, so I can try to fix that), but then appears to
>>>> finish:
>>>>
>>>> 2016-09-26 16:12:30 - Processing chromosome Y
>>>> WARNING: Could not find chromosome named M in FASTA file
>>>> 2016-09-26 16:12:52 - All done!
>>>>
>>>> But the output directory (either ~/.vep or the directory I pointed to
>>>> with --dir) is empty. Is this a related issue? Thought you might want to
>>>> know to add a bit of error logging if so.
>>>>
>>>> -Konrad
>>>>
>>>> On September 27, 2016 at 8:30:15 AM, Will McLaren (wm2 at ebi.ac.uk)
>>>> wrote:
>>>>
>>>> In theory this should work, but the gtf2vep.pl script doesn't seem to
>>>> work too well with this particular GFF (it was designed really to work with
>>>> GFF/GTFs as produced by Ensembl or NCBI). Probably with some tweaks it
>>>> could be made to work - I believe the major issues are caused by features
>>>> being out of the order that the script expects.
>>>>
>>>> The new code uses a much more robust system for constructing
>>>> transcripts and has been tested with GFFs from Ensembl, NCBI and GENCODE.
>>>>
>>>> Will
>>>>
>>>> On 27 September 2016 at 13:22, Konrad Karczewski <
>>>> konradk at broadinstitute.org> wrote:
>>>>
>>>>> I just also realized - would creating a cache from this gff file
>>>>> (using gtf2vep.pl) not be recommended?
>>>>>
>>>>> -Konrad
>>>>>
>>>>> On September 27, 2016 at 5:16:42 AM, Will McLaren (wm2 at ebi.ac.uk)
>>>>> wrote:
>>>>>
>>>>> Hi Konrad,
>>>>>
>>>>> The beta ensembl-vep code [1] supports annotation directly from a GFF
>>>>> file, such as the one available from the GENCODE website [2].
>>>>>
>>>>> $ curl ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/gencode.v25lift37.annotation.gff3.gz \
>>>>>     | gzip -dc | grep -v "#" | sort -k1,1 -k4,4n -k5,5n \
>>>>>     | bgzip -c > gencode.v25lift37.annotation.gff3.gz
>>>>> $ tabix -p gff gencode.v25lift37.annotation.gff3.gz
>>>>> $ perl vep.pl -i variants.vcf -gff gencode.v25lift37.annotation.gff3.gz \
>>>>>     -fasta homo_sapiens.fa
>>>>>
>>>>> This comes with limitations, as the GFF file contains only the
>>>>> transcript structure and not any of the additional annotations. However, I
>>>>> do know of someone successfully using LOFTEE with this exact setup.
>>>>>
>>>>> Of course, the usual beta caveats apply, so if you do use it and find bugs,
>>>>> please report them on the GitHub page.
>>>>>
>>>>> Regards
>>>>>
>>>>> Will McLaren
>>>>> Ensembl Variation
>>>>>
>>>>> [1] : https://github.com/willmclaren/ensembl-vep
>>>>> [2] : http://www.gencodegenes.org/releases/25lift37.html
>>>>>
>>>>> On 26 September 2016 at 20:40, Konrad Karczewski <
>>>>> konradk at broadinstitute.org> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> When running VEP 85 on GRCh37, I believe the process has been to
>>>>>> annotate against Gencode 19 (the info.txt seems to confirm this). Realizing
>>>>>> the ridiculousness of my request, is there any chance there is a cache
>>>>>> floating around for Gencode 25lift37? Would go a long way for ExAC
>>>>>> releases.
>>>>>>
>>>>>> Thanks!
>>>>>> -Konrad
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dev mailing list    Dev at ensembl.org
>>>>>> Posting guidelines and subscribe/unsubscribe info:
>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>> Ensembl Blog: http://www.ensembl.info/