[ensembl-dev] VEP75 input

Will McLaren wm2 at ebi.ac.uk
Mon Mar 2 15:55:50 GMT 2015


Hi Eva,

I can't be certain from a single example of your input, but it looks like
your best option is to convert your input to a VCF-like string; that way
the VEP takes care of the position and allele transformations that you're
attempting:

print "$al[0] $al[1] . $al2[0] $al2[1]\n";

which should result in:

8 133984814 . G GTT

and should be interpreted by the VEP as valid VCF (use --format vcf if it
doesn't detect the format automatically).
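
As a slightly fuller sketch of that conversion, assuming each input line is
whitespace-separated with chr:pos in the first field and ref>alt in the last
(the helper name to_vcf_like is made up for illustration and is not part of
the VEP):

```perl
use strict;
use warnings;

# Hypothetical helper: turn one input line such as
#   8:133984814 c.6056-29 G>GTT
# into a minimal VCF-like line "chr pos . ref alt".
# Assumes chr:pos is the first field and ref>alt is the last field.
sub to_vcf_like {
    my ($line) = @_;
    my @fields = split ' ', $line;          # split on any whitespace
    return unless @fields >= 2;             # not enough fields to parse
    my ($chr, $pos) = split /:/, $fields[0];
    my ($ref, $alt) = split />/, $fields[-1];
    return unless defined $pos && defined $alt;
    return join(' ', $chr, $pos, '.', $ref, $alt);
}

print to_vcf_like('8:133984814 c.6056-29 G>GTT'), "\n";
# prints: 8 133984814 . G GTT
```

Lines that do not match the assumed layout return nothing, so they can be
skipped or reported rather than passed on to the VEP.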


Regards

Will McLaren
Ensembl Variation



A more thorough explanation:

You are seeing the error because the start coordinate should never
be > end + 1; in most cases start <= end, but in the specific case of
an insertion start = end + 1 (see
http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#vcf). This
example is an insertion of TT between bases 133984814 and 133984815, so
should be described in Ensembl format as:

8 133984815 133984814 -/TT

Producing this yourself means checking for and stripping off any leading
bases that match between the reference and alternate alleles, then shifting
the coordinates to match. The VEP's VCF parser expects to see these leading
bases (any unbalanced substitution in VCF is represented this way), so it
does the allele and position conversion for you.
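
If you did want to do the conversion by hand, the stripping step could be
sketched roughly as below. This is an illustration only, not the VEP's actual
parser; the helper name to_ensembl is hypothetical, and it assumes
forward-strand input with plain base strings for both alleles.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a VCF-style (chr, pos, ref, alt) record to an
# Ensembl default-format line by stripping leading bases shared by both
# alleles and shifting the start coordinate accordingly.
sub to_ensembl {
    my ($chr, $pos, $ref, $alt) = @_;

    # Strip matching leading bases; each stripped base moves the start right.
    while (length($ref) && length($alt)
           && substr($ref, 0, 1) eq substr($alt, 0, 1)) {
        $ref = substr($ref, 1);
        $alt = substr($alt, 1);
        $pos++;
    }

    # End covers only the remaining reference bases; an empty reference
    # allele (insertion) gives end = start - 1.
    my $end = $pos + length($ref) - 1;

    $ref = '-' if $ref eq '';
    $alt = '-' if $alt eq '';
    return "$chr $pos $end $ref/$alt";
}

print to_ensembl(8, 133984814, 'G', 'GTT'), "\n";
# prints: 8 133984815 133984814 -/TT
print to_ensembl(8, 133984814, 'C', 'GTT'), "\n";
# prints: 8 133984814 133984814 C/GTT
```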

If your example did not have matching leading bases in the reference and
alternate alleles, then it would not be considered an insertion, but rather
an unbalanced substitution, e.g.

8:133984814 c.6056-29 C>GTT

would become

8 133984814 133984814 C/GTT

in Ensembl format. The key point to note is that the start and end
coordinates represent the bases overlapped by the reference allele alone;
the alternate allele can be any length or valid string and this won't
affect the coordinates, i.e.

8 133984814 133984814 C/GTTACGGGAT

8 133984814 133984816 CTC/GTTACGGGAT
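
To make the coordinate rule concrete, here is a minimal sketch that derives
the end coordinate from the start and the reference allele only. The helper
name ensembl_end is hypothetical, and it assumes "-" denotes an empty
reference allele.

```perl
use strict;
use warnings;

# Hypothetical helper: the end coordinate depends only on the start and the
# length of the reference allele; '-' is treated as length 0.
sub ensembl_end {
    my ($start, $ref) = @_;
    my $ref_len = $ref eq '-' ? 0 : length($ref);
    return $start + $ref_len - 1;
}

my $start = 133984814;
for my $alleles ('C/GTT', 'C/GTTACGGGAT', 'CTC/GTTACGGGAT') {
    my ($ref, $alt) = split m{/}, $alleles;   # $alt is never used for coordinates
    printf "8 %d %d %s\n", $start, ensembl_end($start, $ref), $alleles;
}
```

The alternate allele is split out but deliberately ignored, which mirrors the
point that its length has no effect on start or end.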


On 2 March 2015 at 15:23, Eva Goncalves Serra <egs at sanger.ac.uk> wrote:

>   Hi,
>
>  I am trying to use vep75 (with cache) and had to re-format my input
> (which was not in vcf/or other compatible formats) to the ensembl input
> format. Thought I have done this successfully but I get an error in a
> specific insertion:
>
>  Original file entry:
> 8:133984814 c.6056-29 G>GTT
>
>  Formatted to ensembl format:
> 8 133984816 133984814 G/GTT +
>
>  Error I get:
> WARNING: start > end+1 : (START=133984816, END=133984814) on line 19.
>
>  My code to reformat the input was this:
>
>        my @split = split(/\t/); # splitting file by tabs
>        my @al = split(':',$split[1]); # getting the chr:pos
>       my @al2 = split('>',$split[3]); # getting the ref>alt
>
>        if ((length $al2[0]==1) && (length $al2[1]==1)) {
>         print "$al[0] $al[1] $al[1] $al2[0]/$al2[1] +\n";
>       } elsif (length $al2[1] > length $al2[0]) {
>             my $sub = (length $al2[1])-(length $al2[0]);
>             my $new = $al[1]+$sub;
>             print "$al[0] $new $al[1] $al2[0]/$al2[1] +\n";
>       } elsif (length $al2[1] < length $al2[0]) {
>              my $sub2 = (length $al2[0])-(length $al2[1]);
>              my $new2 = $al[1]-$sub2;
>              print "$al[0] $new2 $al[1] $al2[0]/$al2[1] +\n";
>       }
>
>  Am I missing something?
>
>  Thanks a lot!
>
>  Eva
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>