[ensembl-dev] Question regarding MAF frequencies from VEP
Svein Tore Koksrud Seljebotn
s.t.seljebotn at medisin.uio.no
Fri May 29 15:42:35 BST 2015
Hi again,
thanks for your reply. That clears it up!
Isn't calling the subpopulation fields *_MAF a bit misleading in that case?
In any case, why not always include both REF and ALT(s), like
G:0.002&C:0.998, and call the frequencies *_AF? Sometimes one or more
will not be available in 1000g, but at least you would provide all the
data the user is likely to need.
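
To make the suggested format concrete, here is a minimal sketch of how a
user could read such a combined field. It is plain Python and not part of
VEP; the helper name parse_allele_freqs is made up for illustration. It
assumes entries are separated by "&" and written as ALLELE:FREQUENCY, as in
the output quoted below.

```python
def parse_allele_freqs(field):
    """Parse a string like 'G:0.002&C:0.998' into {'G': 0.002, 'C': 0.998}.

    Hypothetical helper, not a VEP function. Repeated entries such as
    'G:0.7065&G:0.7065' collapse to a single key.
    """
    freqs = {}
    for entry in field.split("&"):
        if not entry:
            continue  # empty field: nothing reported for this population
        allele, _, value = entry.partition(":")
        freqs[allele] = float(value)
    return freqs


print(parse_allele_freqs("G:0.002&C:0.998"))    # {'G': 0.002, 'C': 0.998}
print(parse_allele_freqs("G:0.7065&G:0.7065"))  # {'G': 0.7065}
```

With both REF and ALT present in the field, the user can look up whichever
allele they care about instead of guessing which one was reported.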
Another question: in some regions, especially intronic ones, I have quite a
lot of variants where the GMAF allele is the REF allele. Does this sound
plausible? Shouldn't the reference genome normally contain the major
alleles? The reference genome I use is from the GATK bundle (v37).
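
For illustration, this is roughly how the ALT frequency could be recovered
when GMAF names the REF allele. It is only a sketch: the function name
alt_freq_from_gmaf is made up, and it assumes a biallelic site, so that the
REF and ALT frequencies sum to 1. For sites with several ALTs this
shortcut does not hold.

```python
def alt_freq_from_gmaf(gmaf, ref, alt):
    """Return the ALT frequency implied by a GMAF entry such as 'A:0.0803'.

    Hypothetical helper; assumes a biallelic site (REF + ALT = 1).
    Returns None if the GMAF allele matches neither REF nor ALT.
    """
    allele, _, value = gmaf.partition(":")
    maf = float(value)
    if allele == alt:
        return maf        # minor allele is the ALT itself
    if allele == ref:
        return 1.0 - maf  # minor allele is REF, so ALT is the major allele
    return None


# rs3902057 from the output quoted below: REF=A, ALT=G, GMAF=A:0.0803
print(round(alt_freq_from_gmaf("A:0.0803", ref="A", alt="G"), 4))  # 0.9197
```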
Thanks again!
Svein Tore Koksrud Seljebotn
>Hello,
>
>As you've seen, this data can be somewhat confusing.
>
>The GMAF field always reports the minor allele frequency, whereas the other
>frequency fields report the frequencies of the ALT (or ALTs if there is
>more than one).
>
>Ideally the VEP would report the frequency of the ALT allele that you input
>in your VCF, but this raises further problems if the ALT allele you report
>does not match either the REF or ALT alleles from the 1000 Genomes VCF. It
>is something we are hoping to improve in a future VEP release.
>
>For your second question, it looks like frequencies have been mistakenly
>assigned to the two reported co-located variants (rs3902057 and
>RISN_CRB1:c.1410A>G),
>so the frequencies appear twice. We'll look into a fix for this.
>
>Regards
>
>Will McLaren
>Ensembl Variation
>
>On 29 May 2015 at 14:29, Svein Tore Koksrud Seljebotn <
>s.t.seljebotn at medisin.uio.no> wrote:
>
>> Hi,
>>
>> I am trying to figure out some of the output I get from VEP (version 79)
>> when annotating vcf files. See end of email for input and command. Please
>> note, I am new to this field, so I might misunderstand a few concepts...
>>
>> For the variant (1 197390368 rs3902057 A G) I get the following
>> output:
>>
>>
CSQ=G|upstream_gene_variant|MODIFIER|CRB1|ENSG00000134376|Transcript|ENST00000480086|processed_transcript||||||||||rs3902057&RISN_CRB1:c.1410A>G|1|1573|1|HGNC|2343||||||||A:0.0803|G:0.7065&G:0.7065|G:0.9813&G:0.9813||G:1&G:1|G:0.999&G:0.999|G:1&G:1|G:0.7696&G:0.7696|G:0.9986&G:0.9986|||19339744||||
>> {rest of transcripts omitted...}
>>
>> - This might be a silly question, but why is GMAF given for REF, while the
>> subpopulations are given for ALT? In my case I'm interested in the
>> frequency for the ALT, not the REF. I assume it's giving the minor allele
>> frequency always? But why is there a difference in the allele given for
>> GMAF vs e.g. AFR_MAF?
>>
>> Looking at a later transcript for same variant, I see the following:
>>
>>
>>
G|synonymous_variant|LOW|CRB1|23418|Transcript|NM_001193640.1|protein_coding|4/10||NM_001193640.1:c.1074A>G|NM_001193640.1:c.1074A>G(p.=)|1283|1074|358|L|ctA/ctG|rs3902057&RISN_CRB1:c.1410A>G|1||1|||||NP_001180569.1|rseq_mrna_nonmatch&rseq_cds_mismatch&rseq_ens_match_cds||||A:0.0803|G:0.7065&G:0.7065|G:0.9813&G:0.9813||G:1&G:1|G:0.999&G:0.999|G:1&G:1|G:0.7696&G:0.7696|G:0.9986&G:0.9986|||19339744||||,G|5_prime_UTR_variant|MODIFIER|CRB1|ENSG00000134376|Transcript|ENST00000367397|protein_coding|2/6||ENST00000367397.1:c.-448A>G||411|||||rs3902057&RISN_CRB1:c.1410A>G|1||1|HGNC|2343|||ENSP00000356367|||||A:0.0803|G:0.7065&G:0.7065|G:0.9813&G:0.9813||G:1&G:1|G:0.999&G:0.999|G:1&G:1|G:0.7696&G:0.7696|G:0.9986&G:0.9986|||19339744||||
>>
>> - Why is the frequency for the subpopulation alleles repeated twice with
>> same value? Why not always give the frequency for all alleles?
>>
>>
>> Best regards,
>> Svein Tore Koksrud Seljebotn
>>
>>
>>
>>
>> **** Example VCF: *****
>>
>> ##fileformat=VCFv4.1
>> ##INFO=<ID=class,Number=.,Type=String,Description="class">
>> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
>> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT H02
>> 1 197390368 rs3902057 A G 7128.77 . AC=2;AF=1.00;AN=2;DB;DP=193;Dels=0.00;FS=0.000;HaplotypeScore=4.6974;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;QD=29.21 GT:AD:DP:GQ:PL 1/1:0,192:193:99:7157,518,0
>>
>> ***** Command: *****
>> vep --cache --dir_cache=/work/VEP/cache/
>> --fasta=/work/human_g1k_v37_decoy.fasta --offline --sift=b --polyphen=b
>> --ccds --hgvs --numbers --domains --regulatory --canonical --protein
>> --biotype --gmaf --maf_1kg --maf_esp --pubmed --allow_non_variant --fork=4
>> --vcf --allele_number --no_escape --failed=1 --no_stats --merged --symbol
>> -i testfile.vcf -o testfile.annotated.vcf
>>
>>
>>