[ensembl-dev] Question regarding MAF frequencies from VEP

Will McLaren wm2 at ebi.ac.uk
Fri May 29 15:50:25 BST 2015


Hello,

On 29 May 2015 at 15:42, Svein Tore Koksrud Seljebotn <
s.t.seljebotn at medisin.uio.no> wrote:

> Hi again,
>
> thanks for your reply. That clears it up!
>
> Calling the subpopulations *_MAF is a bit misleading in that case?
>

Yes, strictly speaking they should be called *_AF, but the naming has stuck.


>
> Anyways, in either case, why not always include both REF and ALT(s), like
> G:0.002&C:0.998 and call the frequencies *_AF? Sometimes one or more will
> not be available in 1000g, but at least you provide all the data the user
> is likely to need.
>

This is one solution, yes.
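
To illustrate what such a combined field could look like, here is a minimal
Python sketch (not VEP code; the function names are made up for this example).
It parses an allele:frequency string as it appears in the output quoted below,
collapses repeated entries such as G:0.7065&G:0.7065, and fills in the REF
frequency as one minus the ALT frequency. That last step assumes a biallelic
site, which will not hold everywhere.

```python
def parse_freqs(field):
    """Parse 'G:0.7065&G:0.7065' into {'G': 0.7065}, collapsing repeats."""
    freqs = {}
    if not field:
        return freqs
    for item in field.split("&"):
        allele, _, value = item.partition(":")
        freqs[allele] = float(value)
    return freqs


def with_ref(field, ref):
    """Return frequencies for both ALT and REF.

    Assumption: the site is biallelic, so the allele that is not
    reported has frequency 1 minus the reported one.
    """
    freqs = parse_freqs(field)
    if len(freqs) == 1 and ref not in freqs:
        (alt_freq,) = freqs.values()
        freqs[ref] = round(1.0 - alt_freq, 4)
    return freqs


def format_freqs(freqs):
    """Join a frequency dict back into 'ALLELE:FREQ&ALLELE:FREQ' form."""
    return "&".join(f"{allele}:{freq}" for allele, freq in freqs.items())


print(format_freqs(with_ref("G:0.7065&G:0.7065", "A")))  # G:0.7065&A:0.2935
```

For multi-allelic sites the missing frequencies cannot be derived this way
and would have to come from the source data.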


>
> Another question: in some regions, especially intronic ones, I have quite a
> lot of variants where the GMAF allele is the REF allele. Does this sound
> plausible? Shouldn't the reference genome normally contain major alleles?
> The reference genome I use is from the GATK bundle (v37).
>

It's definitely plausible, yes. For GRCh38 the reference genome was corrected
at many loci, using 1000 Genomes frequencies as a guide. I believe that, for
various reasons, the original reference genome ended up not being especially
representative of the most common alleles at many loci. The b37 reference in
the GATK bundle is GRCh37-based, so it predates those corrections.

If you are finding an abnormally large number, we'd be interested in taking
a look at some of these cases to see if there's any systematic error
anywhere.
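
If it helps to put a number on it, a rough Python sketch along these lines
could count how often the GMAF allele is the REF allele in an annotated VCF.
This is only an illustration with made-up function names. It assumes the CSQ
field names can be read from the "Format: ..." part of the CSQ header line,
that a field named GMAF is present (as with --gmaf), and it only looks at the
first transcript entry of each record.

```python
def csq_fields(header_line):
    """Extract CSQ field names from a '##INFO=<ID=CSQ,...Format: A|B|C">' line."""
    fmt = header_line.split("Format: ", 1)[1]
    return fmt.rstrip().rstrip('">').split("|")


def gmaf_is_ref(record_line, fields):
    """Return True if the GMAF allele of the first CSQ entry equals REF."""
    cols = record_line.rstrip("\n").split("\t")
    ref, info = cols[3], cols[7]
    for item in info.split(";"):
        if item.startswith("CSQ="):
            # Only the first transcript entry is inspected (a simplification).
            first = item[len("CSQ="):].split(",")[0]
            entry = dict(zip(fields, first.split("|")))
            gmaf = entry.get("GMAF", "")
            return bool(gmaf) and gmaf.split(":")[0] == ref
    return False


def count_ref_minor(lines):
    """Return (records where the GMAF allele is REF, total records)."""
    fields, hits, total = [], 0, 0
    for line in lines:
        if line.startswith("##INFO=<ID=CSQ"):
            fields = csq_fields(line)
        elif not line.startswith("#") and line.strip():
            total += 1
            hits += gmaf_is_ref(line, fields)
    return hits, total
```

Passing an open file handle for the annotated VCF to count_ref_minor would
give the two counts; the records it flags are the kind of cases worth sending
on.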

Will


>
> Thanks again!
> Svein Tore Koksrud Seljebotn
>
> >Hello,
> >
> >As you've seen, this data can be somewhat confusing.
> >
> >The GMAF field always reports the minor allele frequency, whereas the other
> >frequency fields report the frequencies of the ALT (or ALTs if there is
> >more than one).
> >
> >Ideally the VEP would report the frequency of the ALT allele that you input
> >in your VCF, but this raises further problems if the ALT allele you report
> >does not match either the REF or ALT alleles from the 1000 Genomes VCF. It
> >is something we are hoping to improve in a future VEP release.
> >
> >For your second question, it looks like frequencies have been mistakenly
> >assigned to the two reported co-located variants (rs3902057 and
> >RISN_CRB1:c.1410A>G), so the frequencies appear twice. We'll look into a
> >fix for this.
> >
> >Regards
> >
> >Will McLaren
> >Ensembl Variation
> >
> >On 29 May 2015 at 14:29, Svein Tore Koksrud Seljebotn <
> >s.t.seljebotn at medisin.uio.no> wrote:
> >
> >> Hi,
> >>
> >> I am trying to figure out some of the output I get from VEP (version 79)
> >> when annotating vcf files. See end of email for input and command. Please
> >> note, I am new to this field, so I might misunderstand a few concepts...
> >>
> >> For the variant (1   197390368   rs3902057   A   G) I get the following
> >> output:
> >>
> >>
> >> CSQ=G|upstream_gene_variant|MODIFIER|CRB1|ENSG00000134376|Transcript|ENST00000480086|processed_transcript||||||||||rs3902057&RISN_CRB1:c.1410A>G|1|1573|1|HGNC|2343||||||||A:0.0803|G:0.7065&G:0.7065|G:0.9813&G:0.9813||G:1&G:1|G:0.999&G:0.999|G:1&G:1|G:0.7696&G:0.7696|G:0.9986&G:0.9986|||19339744||||
> >> {rest of transcripts omitted...}
> >>
> >> - This might be a silly question, but why is GMAF given for REF, while the
> >> subpopulations are given for ALT? In my case I'm interested in the
> >> frequency for the ALT, not the REF. I assume it's giving the minor allele
> >> frequency always? But why is there a difference in the allele given for
> >> GMAF vs e.g. AFR_MAF?
> >>
> >> Looking at a later transcript for the same variant, I see the following:
> >>
> >>
> >>
> >> G|synonymous_variant|LOW|CRB1|23418|Transcript|NM_001193640.1|protein_coding|4/10||NM_001193640.1:c.1074A>G|NM_001193640.1:c.1074A>G(p.=)|1283|1074|358|L|ctA/ctG|rs3902057&RISN_CRB1:c.1410A>G|1||1|||||NP_001180569.1|rseq_mrna_nonmatch&rseq_cds_mismatch&rseq_ens_match_cds||||A:0.0803|G:0.7065&G:0.7065|G:0.9813&G:0.9813||G:1&G:1|G:0.999&G:0.999|G:1&G:1|G:0.7696&G:0.7696|G:0.9986&G:0.9986|||19339744||||,G|5_prime_UTR_variant|MODIFIER|CRB1|ENSG00000134376|Transcript|ENST00000367397|protein_coding|2/6||ENST00000367397.1:c.-448A>G||411|||||rs3902057&RISN_CRB1:c.1410A>G|1||1|HGNC|2343|||ENSP00000356367|||||A:0.0803|G:0.7065&G:0.7065|G:0.9813&G:0.9813||G:1&G:1|G:0.999&G:0.999|G:1&G:1|G:0.7696&G:0.7696|G:0.9986&G:0.9986|||19339744||||
> >>
> >> - Why is the frequency for the subpopulation alleles given twice with the
> >> same value? Why not always give the frequency for all alleles?
> >>
> >>
> >> Best regards,
> >> Svein Tore Koksrud Seljebotn
> >>
> >>
> >>
> >>
> >> **** Example VCF: *****
> >>
> >> ##fileformat=VCFv4.1
> >> ##INFO=<ID=class,Number=.,Type=String,Description="class">
> >> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
> >> #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    H02
> >> 1   197390368   rs3902057   A   G   7128.77 .   AC=2;AF=1.00;AN=2;DB;DP=193;Dels=0.00;FS=0.000;HaplotypeScore=4.6974;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;QD=29.21   GT:AD:DP:GQ:PL  1/1:0,192:193:99:7157,518,0
> >>
> >> ***** Command: *****
> >> vep --cache --dir_cache=/work/VEP/cache/
> >> --fasta=/work/human_g1k_v37_decoy.fasta --offline --sift=b --polyphen=b
> >> --ccds --hgvs --numbers --domains --regulatory --canonical --protein
> >> --biotype --gmaf --maf_1kg --maf_esp --pubmed --allow_non_variant --fork=4
> >> --vcf --allele_number --no_escape --failed=1 --no_stats --merged --symbol
> >> -i testfile.vcf -o testfile.annotated.vcf
> >>
> >>
> >>
>