[ensembl-dev] how to get original VCF file's POS from variant effect predictor output?

Will McLaren wm2 at ebi.ac.uk
Thu Aug 4 09:46:20 BST 2011


Hi Michael,

While this is possible to do, I wouldn't necessarily recommend it.

Far easier is to add an identifier to your VCF file (in the third
column) - this will be reproduced as the first column (in place of the
made-up identifier) in the output.

If you really want to do it, the rule generally is:

start = POS
end = POS + length(REF) - 1

This is different for unbalanced variations (insertions, deletions,
unbalanced substitutions), since in this case VCF specifies that the
base before the variant should be included. We need to chop off this
base to make the variant work as expected in the VEP. So:

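if( length(REF) != length(ALT) ) {
  end = POS + length(REF) - 1      # same as the general rule
  start = POS + 1                  # skip the leading base that VCF includes
  REF = substring(REF, 1) OR '-'   # drop the first base of REF
  ALT = substring(ALT, 1) OR '-'   # drop the first base of ALT
}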

The OR '-' means that an allele left as an empty string after the
leading base is removed is replaced with '-'. This happens to ALT in
the case of a deletion and to REF in the case of an insertion.
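
The rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name vcf_to_vep is hypothetical, POS is taken to be 1-based, and each record is assumed to have a single ALT allele.

```python
def vcf_to_vep(pos, ref, alt):
    """Return (start, end, ref, alt) for one VCF record, per the rule above.

    Assumes 1-based POS and a single ALT allele.
    """
    end = pos + len(ref) - 1
    if len(ref) == len(alt):
        # Balanced variant (SNV or equal-length substitution): no change.
        return pos, end, ref, alt
    # Unbalanced variant: drop the leading base; an allele that becomes
    # empty is written as '-'.
    return pos + 1, end, ref[1:] or "-", alt[1:] or "-"


if __name__ == "__main__":
    print(vcf_to_vep(100, "A", "G"))    # (100, 100, 'A', 'G')   SNV
    print(vcf_to_vep(100, "AT", "A"))   # (101, 101, 'T', '-')   deletion
    print(vcf_to_vep(100, "A", "AT"))   # (101, 100, '-', 'T')   insertion
```

Note that a pure insertion comes out with start = end + 1. Under this rule, POS equals start for balanced variants and start - 1 for unbalanced ones.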

As I say though, far easier to just put an identifier into the third
column of your VCF :-)

Will

On 3 August 2011 21:01, Michael Yourshaw <myourshaw at ucla.edu> wrote:
> In order to associate the output of the variant effect predictor back to the
> original VCF 4.0 file, I need to be able to determine the value of the POS
> field of the VCF file from data in the VEP output. How can I do this?
> At the risk of revealing my ignorance of VCF format and algebra, I think the
> following works, but it depends on there never being a VCF where len(REF) ==
> len(ALT) == 2 — I am not sure this is a safe assumption:
> get chromStart and chromEnd from the VEP Location field (chromEnd=chromStart
> if not chromEnd). Can’t use Uploaded variation, which might get turned into
> rs ID.
> if chromStart == chromEnd:  # SNV or indel with len(REF) == 2
>     POS = chromStart
> elif chromStart == chromEnd - 1:
>     # indel with len(REF) == 1 and len(ALT) > 2
>     # (if len(REF) == len(ALT) == 2, POS would be chromStart)
>     POS = chromEnd
> else:
>     POS = chromStart - 1
>
> Michael Yourshaw