[ensembl-dev] Problem with gtf2vep.pl

Harris, Ronald Alan rharris1 at bcm.edu
Fri Jan 24 03:19:05 GMT 2014


Hi Will,

Everything looks to be working correctly now. Thanks a lot for helping me with this.

Alan

________________________________
From: dev-bounces at ensembl.org [dev-bounces at ensembl.org] On Behalf Of Will McLaren [wm2 at ebi.ac.uk]
Sent: Thursday, January 23, 2014 6:09 AM
To: Ensembl developers list
Subject: Re: [ensembl-dev] Problem with gtf2vep.pl

Hi Alan,

I got a chance to look at this again, and I've found a couple of other issues with the script. I've fixed these and updated the script again on GitHub.

There was also an issue with your GTF file, in that you had transcript entries denoted as being protein_coding but without CDS entries to define the coding sequence region.

I've added a fix to the script such that if these are found the transcript is converted to the pseudogene biotype, but of course it would be best to fix the input file.
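
To find the affected transcripts before building a cache, a small standalone check could look like the following. This is only a sketch in Python and is not part of gtf2vep.pl: the function name transcripts_missing_cds is hypothetical, and it assumes the biotype is stored in the GTF source column (column 2) and that every row carries a transcript_id attribute.

```python
import re
from collections import defaultdict


def transcripts_missing_cds(gtf_lines, coding_label="protein_coding"):
    """Return sorted IDs of transcripts labelled as coding that have no CDS rows."""
    features = defaultdict(set)  # transcript_id -> feature types seen
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip comments, blank lines and malformed rows
        source, feature, attributes = cols[1], cols[2], cols[8]
        # Assumption: the biotype lives in the source column (column 2).
        if source != coding_label:
            continue
        match = re.search(r'transcript_id "([^"]+)"', attributes)
        if match:
            features[match.group(1)].add(feature)
    return sorted(tid for tid, seen in features.items() if "CDS" not in seen)
```

Passing an open file handle for the GTF as gtf_lines works, since the function only iterates over lines.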

I've tested a cache built with the initial GTF you sent to me and it works across a random set of 250k variants from dbSNP, so I think it's working OK now.

Cheers

Will


On 22 January 2014 05:39, Harris, Ronald Alan <rharris1 at bcm.edu> wrote:
Hi Will,

Thanks for your help with this. Yes, I am aware of the Ensembl rhesus gene annotations, but the RhesusBase annotations I am using are newer and include RNA-Seq data, so they are supposed to be better than the Ensembl predictions.

Your modified version of gtf2vep.pl does indeed make the cache files, but I am now running into problems when running vep with them. I am getting the errors below. I followed your approach of stopping each error by checking for undefined values and handling them; each successive error appeared after I fixed the previous one.

Can't call method "strand" on an undefined value at /home/rharris1/work/ensembl/vep/variant_effect_predictor//Bio/EnsEMBL/Transcript.pm line 1141, <GEN0> line 5035.

Can't call method "strand" on an undefined value at /home/rharris1/work/ensembl/vep/variant_effect_predictor//Bio/EnsEMBL/Transcript.pm line 1095, <GEN0> line 5035.

Can't call method "phase" on an undefined value at /home/rharris1/work/ensembl/vep/variant_effect_predictor//Bio/EnsEMBL/Transcript.pm line 930, <GEN0> line 45035.

Can't call method "strand" on an undefined value at /home/rharris1/work/ensembl/vep/variant_effect_predictor//Bio/EnsEMBL/Translation.pm line 399, <GEN0> line 45035

Can't call method "strand" on an undefined value at /home/rharris1/work/ensembl/vep/variant_effect_predictor//Bio/EnsEMBL/Translation.pm line 428, <GEN0> line 45035.

With each successive fix, vep runs a bit farther, but I get a lot of the following warnings:

Use of uninitialized value in addition (+) at /home/rharris1/work/ensembl/vep/variant_effect_predictor//Bio/EnsEMBL/Variation/Utils/VariationEffect.pm line 536, <GEN0> line 5035.
Use of uninitialized value in subtraction (-) at /home/rharris1/work/ensembl/vep/variant_effect_predictor//Bio/EnsEMBL/Variation/Utils/VariationEffect.pm line 519, <GEN0> line 5035.

The vep results I do get look to be correct for intergenic, upstream, downstream, and intronic variants, but variants in exons are sometimes not being reported correctly. In several cases, a variant that should be identified as synonymous or missense is identified as "5_prime_UTR_variant,3_prime_UTR_variant".

I just noticed that in the Ensembl gtf file (which I ran through gtf2vep.pl and vep and it worked) the negative strand genes are in reverse chromosome order. I changed my gtf to that order and it ran through the original gtf2vep.pl (before your fixes) without throwing an error, but I still get errors using vep. Should the negative strand genes be in reverse chromosome order?
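
One way to see which row order a given gtf file actually uses for negative strand transcripts is sketched below in Python. It only reports the order and makes no claim about which order gtf2vep.pl requires. The function name minus_strand_exon_order is hypothetical, and the sketch assumes tab-separated rows with a transcript_id attribute.

```python
import re
from collections import defaultdict


def minus_strand_exon_order(gtf_lines):
    """Map each minus-strand transcript to the order of its exon rows.

    Values are 'descending', 'ascending', 'mixed', or 'single' (one exon).
    """
    starts = defaultdict(list)  # transcript_id -> exon starts in file order
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip comments, blank lines and malformed rows
        if cols[2] != "exon" or cols[6] != "-":
            continue
        match = re.search(r'transcript_id "([^"]+)"', cols[8])
        if match:
            starts[match.group(1)].append(int(cols[3]))

    order = {}
    for tid, positions in starts.items():
        if len(positions) < 2:
            order[tid] = "single"
        elif positions == sorted(positions, reverse=True):
            order[tid] = "descending"
        elif positions == sorted(positions):
            order[tid] = "ascending"
        else:
            order[tid] = "mixed"
    return order
```

Running it on both the Ensembl gtf and the custom gtf would show whether the two files follow the same convention.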

Please let me know if you have any ideas about this.

Thanks,

Alan
________________________________
From: dev-bounces at ensembl.org [dev-bounces at ensembl.org] On Behalf Of Will McLaren [wm2 at ebi.ac.uk]
Sent: Wednesday, January 08, 2014 7:23 AM
To: Ensembl developers list
Subject: Re: [ensembl-dev] Problem with gtf2vep.pl

Hello Alan,

Thanks for the detailed report. There's an odd bug happening here which I can't get to the bottom of at the moment.

I've added a fix for now which stops the error happening, and the cache builds fine for me from your input.

Since we're in the process of switching our code hosting to Git, for now I have only pushed the fix to our GitHub - you can get the fixed script here:

https://github.com/Ensembl/ensembl-tools/blob/release/74/scripts/variant_effect_predictor/gtf2vep.pl

Let me know if this isn't convenient and I can get the fix pushed to our CVS tree too.

PS I assume you are aware we build a cache file for macaque already? ftp://ftp.ensembl.org/pub/release-74/variation/VEP/

Thanks again

Will McLaren
Ensembl Variation


On 8 January 2014 05:38, Harris, Ronald Alan <rharris1 at bcm.edu> wrote:
Hi,

I have been trying to use gtf2vep.pl to generate a cache file based on RhesusBase (http://www.rhesusbase.org/) gene annotations on the UCSC rheMac2/Ensembl MMUL_1 assembly. I downloaded their rb2 gene predictions as a gtf file through their UCSC mirror, changed the source column to "protein_coding", added "exon_number" and the appropriate number in the description field, and sorted the annotations by chromosome position. The gtf file can be downloaded from here:

https://bigfile.bcm.edu/download.php?claimID=tnwUAesf9rRRH3u5&claimPasscode=B8mm8RNVZG4Ub6Xy&fid=52811&emailAddr=rharris1@bcm.edu
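
The preparation steps described above (setting the source column, adding exon_number, sorting by position) can be sketched roughly as follows. This is an illustrative Python sketch, not the commands actually used: the function name prepare_gtf is hypothetical, exons are numbered in transcription order (an assumption about what "the appropriate number" means), only exon rows are numbered, and chromosomes are sorted as plain text.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def prepare_gtf(lines, source="protein_coding"):
    """Set column 2 to `source`, add exon_number to exon rows, sort by position."""
    rows = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip comments, blank lines and malformed rows
        cols[1] = source
        rows.append(cols)

    # Group exon rows by transcript so they can be numbered.
    exons = defaultdict(list)
    for cols in rows:
        match = TRANSCRIPT_ID.search(cols[8])
        if cols[2] == "exon" and match:
            exons[match.group(1)].append(cols)

    for transcript_exons in exons.values():
        # Assumption: number exons in transcription order, so minus-strand
        # transcripts count down from the highest coordinate.
        reverse = transcript_exons[0][6] == "-"
        transcript_exons.sort(key=lambda c: int(c[3]), reverse=reverse)
        for number, cols in enumerate(transcript_exons, start=1):
            if "exon_number" not in cols[8]:
                cols[8] = cols[8].rstrip().rstrip(";") + f'; exon_number "{number}";'

    # Sort by chromosome name (as text), then start and end coordinate.
    rows.sort(key=lambda c: (c[0], int(c[3]), int(c[4])))
    return ["\t".join(cols) for cols in rows]
```

The returned lines can be written back out with a trailing newline each to produce the input file for gtf2vep.pl.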

When I run gtf2vep.pl I get this error:

Can't call method "start" on an undefined value at gtf2vep.pl line 376.

This error occurs after generating some of the cache files in the .vep directory. I tried to run gtf2vep.pl using gtf files with only a single chromosome and it looks like the error consistently occurs when trying to generate the 1-1000000 cache file. Oddly, if I just run gtf2vep.pl on the annotations from 1-1000000 on a single chromosome I do not get this error.

I don't think this is due to the "chr" prefix in the chromosome names, because the fasta file I am using uses the same "chr"-prefixed names.

I would appreciate any help you could give me with this.

Thanks,

Alan

_______________________________________________
Dev mailing list    Dev at ensembl.org
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/


