As you've seen, this data can be somewhat confusing.

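The GMAF field always reports the frequency of the minor allele, whichever
allele that happens to be (in your example it is the REF allele, A), whereas
the other frequency fields report the frequency of the ALT allele (or of
each ALT allele if there is more than one).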
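
If you need the global frequency of your ALT allele in the meantime, a
minimal Python sketch along these lines may help. It is not part of VEP; the
function names parse_freq_field and alt_frequency are hypothetical, and the
"1 - GMAF" step is only valid under the assumption that the site is
biallelic in 1000 Genomes.

```python
def parse_freq_field(value):
    """Parse a VEP frequency value such as 'A:0.0803' or 'G:0.7065&G:0.7065'.

    Returns a dict mapping allele -> frequency. Repeated entries for the
    same allele collapse into a single key.
    """
    freqs = {}
    if not value:
        return freqs
    for entry in value.split("&"):
        allele, _, freq = entry.partition(":")
        if allele and freq:
            freqs[allele] = float(freq)
    return freqs


def alt_frequency(gmaf_value, ref, alt):
    """Return the global frequency of `alt` derived from a GMAF value.

    ASSUMPTION: the site is biallelic, so when GMAF names the REF allele
    the ALT frequency is taken as 1 - GMAF. Returns None when neither
    allele appears in the GMAF value.
    """
    freqs = parse_freq_field(gmaf_value)
    if alt in freqs:
        return freqs[alt]
    if ref in freqs:
        return round(1.0 - freqs[ref], 4)
    return None
```

For the variant in your example you would call
alt_frequency("A:0.0803", "A", "G"). If the site has more than two alleles
this shortcut does not hold and the result should not be trusted.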

Ideally the VEP would report the frequency of the ALT allele that you input
in your VCF, but this raises further problems if the ALT allele you report
does not match either the REF or ALT alleles from the 1000 Genomes VCF. It
is something we are hoping to improve in a future VEP release.

For your second question, it looks like frequencies have been mistakenly
assigned to both of the reported co-located variants (rs3902057 and
RISN_CRB1:c.1410A>G), so the frequencies appear twice. We'll look into a
fix for this.
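
Until that is fixed, one possible workaround is to collapse the repeated
entries when you post-process the output. The sketch below is only an
illustration: collapse_duplicates is a hypothetical helper, not a VEP
feature, and it assumes you apply it to a single frequency field value that
you have already split out of the CSQ string.

```python
def collapse_duplicates(value, sep="&"):
    """Drop repeated entries from one '&'-separated VEP field value.

    The first occurrence of each entry is kept and the original order is
    preserved. Empty values are returned unchanged.
    """
    if not value:
        return value
    # dict.fromkeys keeps insertion order, so it works as an ordered set.
    return sep.join(dict.fromkeys(value.split(sep)))
```

With this, "G:0.7065&G:0.7065" becomes "G:0.7065", while a value that lists
two different alleles is left as it is.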


> Hi,
> I am trying to figure out some of the output I get from VEP (version 79)
> when annotating vcf files. See end of email for input and command. Please
> note, I am new to this field, so I might misunderstand a few concepts...
> For the variant (1   197390368   rs3902057   A   G) I get the following
> output:
> CSQ=G|upstream_gene_variant|MODIFIER|CRB1|ENSG00000134376|Transcript|ENST00000480086|processed_transcript||||||||||rs3902057&RISN_CRB1:c.1410A>G|1|1573|1|HGNC|2343||||||||A:0.0803|G:0.7065&G:0.7065|G:0.9813&G:0.9813||G:1&G:1|G:0.999&G:0.999|G:1&G:1|G:0.7696&G:0.7696|G:0.9986&G:0.9986|||19339744||||
> {rest of transcripts omitted...}
> - This might be a silly question, but why is GMAF given for REF, while the
> subpopulations are given for ALT? In my case I'm interested in the
> frequency for the ALT, not the REF. I assume it's giving the minor allele
> frequency always? But why is there a difference in the allele given for
> GMAF vs e.g. AFR_MAF?
> Looking at a later transcript for same variant, I see the following:
> G|synonymous_variant|LOW|CRB1|23418|Transcript|NM_001193640.1|protein_coding|4/10||NM_001193640.1:c.1074A>G|NM_001193640.1:c.1074A>G(p.=)|1283|1074|358|L|ctA/ctG|rs3902057&RISN_CRB1:c.1410A>G|1||1|||||NP_001180569.1|rseq_mrna_nonmatch&rseq_cds_mismatch&rseq_ens_match_cds||||A:0.0803|G:0.7065&G:0.7065|G:0.9813&G:0.9813||G:1&G:1|G:0.999&G:0.999|G:1&G:1|G:0.7696&G:0.7696|G:0.9986&G:0.9986|||19339744||||,G|5_prime_UTR_variant|MODIFIER|CRB1|ENSG00000134376|Transcript|ENST00000367397|protein_coding|2/6||ENST00000367397.1:c.-448A>G||411|||||rs3902057&RISN_CRB1:c.1410A>G|1||1|HGNC|2343|||ENSP00000356367|||||A:0.0803|G:0.7065&G:0.7065|G:0.9813&G:0.9813||G:1&G:1|G:0.999&G:0.999|G:1&G:1|G:0.7696&G:0.7696|G:0.9986&G:0.9986|||19339744||||
> - Why is the frequency for the subpopulation alleles repeated twice with
> same value? Why not always give the frequency for all alleles?
> Best regards,
> Svein Tore Koksrud Seljebotn
> **** Example VCF: *****
> ##fileformat=VCFv4.1
> ##INFO=<ID=class,Number=.,Type=String,Description="class">
> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
> #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO FORMAT    H02
> 1   197390368   rs3902057   A   G   7128.77 .
> AC=2;AF=1.00;AN=2;DB;DP=193;Dels=0.00;FS=0.000;HaplotypeScore=4.6974;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;QD=29.21
> GT:AD:DP:GQ:PL  1/1:0,192:193:99:7157,518,0
> ***** Command: *****
> vep --cache --dir_cache=/work/VEP/cache/
> --fasta=/work/human_g1k_v37_decoy.fasta --offline --sift=b --polyphen=b
> --ccds --hgvs --numbers --domains --regulatory --canonical --protein
> --biotype --gmaf --maf_1kg --maf_esp --pubmed --allow_non_variant --fork=4
> --vcf --allele_number --no_escape --failed=1 --no_stats --merged --symbol
> -i testfile.vcf -o testfile.annotated.vcf
