[ensembl-dev] variant effect predictor 2.0 question

Mon Jul 18 11:05:46 BST 2011

Hi Mark,

I've been taking a look at the input file that you sent to Graham. The
format of your file is a little odd: it loosely resembles VCF (you
basically have the first 5 columns).

Taking a small chunk of your file:

1       1900186 KIAA1751.11     T       T
1       1900186 KIAA1751.11     T       C
1       1900232 KIAA1751.11     T       T
1       1900232 KIAA1751.11     T       C

I'm not quite sure what you're trying to represent here. You have two
lines for each of the loci, and you have given all four lines the same
identifier (column 3 in the VCF format is assumed to be a
user-provided identifier for the variant). Furthermore, lines 1 and 3
above are not even variant - assuming T is the reference allele at
both loci, you are reporting a variant allele of T too (column 4
should be the reference allele, and column 5 a list of variant
alleles).
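
As a rough illustration of the two problems described above, here is a
minimal Python sketch that flags lines whose only "variant" allele
equals the reference allele, and identifiers used on more than one
line. It is not part of the VEP; the names parse_line and find_problems
are made up for this example, and it assumes whitespace-delimited input
with only the first five VCF columns (chromosome, position, identifier,
reference allele, comma-separated variant alleles).

```python
from collections import Counter

# The four example lines from above, tab-delimited.
SAMPLE = """\
1\t1900186\tKIAA1751.11\tT\tT
1\t1900186\tKIAA1751.11\tT\tC
1\t1900232\tKIAA1751.11\tT\tT
1\t1900232\tKIAA1751.11\tT\tC
"""


def parse_line(line):
    """Split one line into the first five VCF columns (hypothetical helper)."""
    chrom, pos, var_id, ref, alt = line.split()[:5]
    return chrom, int(pos), var_id, ref, alt.split(",")


def find_problems(text):
    """Return (non_variant_line_numbers, repeated_ids) for VCF-like text."""
    non_variant = []
    id_counts = Counter()
    for number, line in enumerate(text.splitlines(), start=1):
        # Skip blank lines and VCF header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        _chrom, _pos, var_id, ref, alts = parse_line(line)
        # A line is non-variant if every listed allele equals the reference.
        if all(alt == ref for alt in alts):
            non_variant.append(number)
        # "." means "no identifier" in VCF, so it is never counted as repeated.
        if var_id != ".":
            id_counts[var_id] += 1
    repeated = sorted(i for i, n in id_counts.items() if n > 1)
    return non_variant, repeated


non_variant, repeated = find_problems(SAMPLE)
print(non_variant)  # [1, 3]
print(repeated)     # ['KIAA1751.11']
```

On the chunk above this reports lines 1 and 3 as non-variant and
KIAA1751.11 as a repeated identifier.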

You should either make sure the variant identifiers are different, or
just make the column blank (replace with "." in VCF).

Furthermore, while the VEP will happily process the non-variant lines,
you will get odd things in the output - for example, assuming the
first line above falls in a coding region, it would be reported as
synonymous coding, even though technically you haven't even provided a
variant.
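
If it helps, the two suggestions above (blank the identifier column and
drop the non-variant lines) could be applied before running the VEP
with something like the following Python sketch. This is only an
illustration under assumptions: clean_lines is a made-up name, the
input is assumed to be whitespace-delimited with at least five columns,
and the output is written tab-delimited.

```python
# The four example lines from above, tab-delimited.
SAMPLE_LINES = [
    "1\t1900186\tKIAA1751.11\tT\tT",
    "1\t1900186\tKIAA1751.11\tT\tC",
    "1\t1900232\tKIAA1751.11\tT\tT",
    "1\t1900232\tKIAA1751.11\tT\tC",
]


def clean_lines(lines):
    """Yield cleaned lines: identifier blanked, non-variant alleles removed.

    Hypothetical helper, not part of the VEP.
    """
    for line in lines:
        # Pass blank lines and VCF header/comment lines through unchanged.
        if not line.strip() or line.startswith("#"):
            yield line.rstrip("\n")
            continue
        fields = line.split()
        ref = fields[3]
        # Keep only alleles that actually differ from the reference.
        alts = [alt for alt in fields[4].split(",") if alt != ref]
        if not alts:
            continue  # nothing variant left on this line, so drop it
        fields[2] = "."  # "." is the VCF convention for a missing identifier
        fields[4] = ",".join(alts)
        yield "\t".join(fields)


for cleaned in clean_lines(SAMPLE_LINES):
    print(cleaned)
```

For the chunk above this leaves only the two T-to-C lines, each with "."
in the identifier column.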

By default, the VEP does not check that the reference allele is correct
(you can force it to with --check_ref).

Please see the following web page for information on the VCF format:

http://www.1000genomes.org/wiki/Analysis/vcf4.0

Hope this helps,

Will

On 14 July 2011 14:14, Mark Aquino <aquinom85 at me.com> wrote:
> Hi,
>
>        I've been trying to switch over to the new version of the VEP (2.x) but have been getting some strange output.  Basically, it seems like the normal output is multiplied 4 or 5 times (or more).  For one run that produced 70K lines of output with the original variant effect predictor, I got over 1 million lines of output, with each transcript being analyzed and/or output 10+ times, and on a subsequent re-run I got 330K lines of output with each line repeated 5 times.  (I checked this by doing a simple grep on the output with line numbers, to make sure it wasn't simply that there were more transcripts per variation.)
>
> Has anyone else experienced this issue?
>
>
> Best,
> Mark Aquino