[ensembl-dev] Question regarding the Variant Effect Predictor script

Will McLaren wm2 at ebi.ac.uk
Wed Apr 20 11:20:37 BST 2011


Hello,

This relates to the different ways in which VCF and Ensembl represent
insertions and deletions.

In VCF format, the base immediately before the change is included in
the reference sequence column, and the start coordinate represents
this base.

In Ensembl format, we only include the bases affected by the change,
so the script has to trim off that extra base. The API can then
interpret the variant in the same way as variants given to the script
in other formats.

Here is an example of a deletion:

Reference: AACTG
Variant: AACG

In VCF, this would be denoted with the reference sequence CT, variant
sequence C, and a coordinate of 3.

In Ensembl, this is denoted with reference sequence T, variant
sequence - (where "-" represents the absence of any sequence), and a
start coordinate of 4, end coordinate 4.

Hence to convert between the two we trim off the first base of the
reference and variant sequence columns (using substr($string, 1),
which returns everything after the first character) and increment the
coordinate by 1. In the case of the variant sequence, trimming off the
first base leaves an empty string, so we substitute this with "-".
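The deletion conversion described above can be sketched roughly as
follows. This is only an illustration, not the script's actual code:
the sub name trim_vcf_indel is made up for this example, and it
assumes the VCF reference and variant alleles share their first base.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the VEP script): trims the shared
# leading base from a VCF indel and shifts the start coordinate by one.
sub trim_vcf_indel {
    my ($pos, $ref, $alt) = @_;

    # substr($string, 1) returns everything after the first character
    $ref = substr($ref, 1);
    $alt = substr($alt, 1);

    # an empty string means "no sequence", which Ensembl writes as "-"
    $ref = '-' if $ref eq '';
    $alt = '-' if $alt eq '';

    return ($pos + 1, $ref, $alt);
}

# The deletion example: VCF gives position 3, reference CT, variant C
my ($start, $ref, $alt) = trim_vcf_indel(3, 'CT', 'C');
print "start=$start ref=$ref alt=$alt\n";   # prints: start=4 ref=T alt=-
```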

Here is an example of an insertion:

Reference: AACTA
Variant: AACGGTA

In VCF, this is denoted with reference sequence C, variant sequence
CGG, and a coordinate of 3.

In Ensembl, the same variant is denoted reference sequence -, variant
sequence GG, start 4 and end 3 (start is greater than end to denote an
insertion).

Again, to convert between the two we trim off the first base of both
sequences and increment the coordinate by 1; in this case the
reference sequence is now an empty string, so it is replaced by "-".
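As a rough sketch of how the start and end coordinates in both
examples could be derived, the snippet below assumes that the end is
the start plus the number of reference bases left after trimming,
minus 1. That rule is inferred from the two examples in this message
rather than copied from the script, and the sub name
vcf_indel_to_ensembl is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not the script's actual code): converts a VCF
# indel into Ensembl-style start, end, reference and variant values.
sub vcf_indel_to_ensembl {
    my ($pos, $ref, $alt) = @_;

    # drop the base that precedes the change and move past it
    my $start   = $pos + 1;
    my $ref_seq = substr($ref, 1);
    my $alt_seq = substr($alt, 1);

    # Assumed rule: end covers the remaining reference bases. With no
    # reference bases left (an insertion) this gives end = start - 1.
    my $end = $start + length($ref_seq) - 1;

    # represent missing sequence as "-"
    $ref_seq = '-' if $ref_seq eq '';
    $alt_seq = '-' if $alt_seq eq '';

    return ($start, $end, $ref_seq, $alt_seq);
}

# Insertion example: VCF position 3, reference C, variant CGG
print join(' ', vcf_indel_to_ensembl(3, 'C', 'CGG')), "\n";   # prints: 4 3 - GG

# Deletion example: VCF position 3, reference CT, variant C
print join(' ', vcf_indel_to_ensembl(3, 'CT', 'C')), "\n";    # prints: 4 4 T -
```

Under that assumption a longer indel works the same way: only the
first base is removed, and whatever remains is kept in full.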

I hope this makes sense!

Cheers

Will McLaren
Ensembl Variation

On 19 April 2011 21:53, Duarte Molha <duartemolha at gmail.com> wrote:
> Hello
> I have been looking at the source code for the variant predictor script and
> I have a question regarding a section of the code:
> 828   else {
> 829     $ref = substr($ref, 1);
> 830     $ref = '-' if $ref eq '';
> 831     $start++;
> 832
> 833     foreach my $alt_allele(split /\,/, $alt) {
> 834       $alt_allele = substr($alt_allele, 1);
> 835       $alt_allele = '-' if $alt_allele eq '';
> 836       push @alts, $alt_allele;
> 837     }
> 838   }
> 839
> 840   $alt = join "/", @alts;
> It does not make sense to me that you define $alt_allele as a substring of
> $alt_allele in line 834
> Indels can be bigger than 1 bp so why are you doing this?
> You do the same for variations that have only 1 alternative allele (line
> 871):
> 868   else {
> 869     # chop off first base
> 870     $ref = substr($ref, 1);
> 871     $alt = substr($alt, 1);
> Can you explain why this is necessary or indeed correct?
> Best regards,
>      Duarte Molha
> =========================
>      Duarte Miguel Paulo Molha
>          Tel: +447772111304
>   Email: duartemolha at gmail.com
> =========================