[ensembl-dev] Confused by --gene_phenotype and PHENO in VEP v82

Tue Dec 15 21:26:03 GMT 2015

Each "&" separated value corresponds to an ID in the Existing_variation
field as there may be more than one per input position.

Will
On 15 Dec 2015 17:21, "JESSICA X. CHONG" <jxchong at uw.edu> wrote:

> Thanks! And what does it mean when the PHENO field’s value is something
> like "0&1&1”?  I got that when annotated chr1:1412699 C>T (hg19)
>
>
> > On Dec 15, 2015, at 1:49 AM, William McLaren <wm2 at ebi.ac.uk> wrote:
> >
> > Hi Jessica
> >
> > The PHENO field is populated for known variants that have associated
> phenotypes; use --check_existing to enable looking up known variants.
> >
> > Apologies, I see this is not clear in the docs.
> >
> > Will
> >
> >
> >
> > On 14 Dec 2015 23:48, "JESSICA X. CHONG" <jxchong at uw.edu> wrote:
> >>
> >> Hi Will,
> >>
> >> It looks like --gene_phenotype (GENE_PHENO) now gets populated with VEP
> v83. Yay!
> >>
> >> However PHENO does not (at least when I tested annotated CFTR dF508). I
> apologize if I’m being dense but despite searching the documentation, I
> can’t figure out which --option populates the PHENO field.
> >>
> >> Thanks,
> >> Jessica
> >>
> >>
> >>> On Nov 20, 2015, at 2:33 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
> >>>
> >>> Hi Cyriac, Jessica,
> >>>
> >>> We are planning on writing a phenotype plugin in the near future
> exactly as you describe. We'd be very happy if you'd be willing to
> contribute to this, in the form of testing, feedback or code.
> >>>
> >>> Regarding output format, this is the part that is difficult with
> unstructured and expansive data like this, particularly trying to jam it
> into our already jammed line/separator-based text output formats. Do you
> use the VEP's JSON output format already (--json)? If so, it will be easy
> to extend this output with phenotype data, and being JSON will make it nice
> and easy to process afterwards.
> >>>
> >>> If you don't use --json, and you are thinking the JSON should be some
> sidecar output file, how do you intend on integrating this with the "main"
> VEP results file?
> >>>
> >>> The gene phenotype flag data will be fixed for release 83, due next
> month.
> >>>
> >>> Regards
> >>>
> >>> Will McLaren
> >>> Ensembl Variation
> >>>
> >>> On 19 November 2015 at 18:01, Cyriac Kandoth <kandoth at cbio.mskcc.org>
> wrote:
> >>>>
> >>>> Hi Will, sorry to fork this thread. Do you have any immediate plans
> to write a VEP plugin that will poop out all this phenotype information
> into a nicely organized JSON? That will be super useful for web front-ends
> like cBioPortal. If not, we plan to do this by mid-December, and submit a
> pull request to https://github.com/ensembl-variation/VEP_plugins
> >>>>
> >>>> Thanks,
> >>>> Cyriac
> >>>>
> >>>> On Tue, Nov 10, 2015 at 5:06 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
> >>>>>
> >>>>> Hi Jessica,
> >>>>>
> >>>>> Here's a way you can get the phenotype names in VEP without me
> writing any new code!
> >>>>>
> >>>>> 1) Create a BED file from Ensembl's database of phenotype
> annotations. This will take a few minutes:
> >>>>>
> >>>>> mysql -hensembldb.ensembl.org -uanonymous
> -Dhomo_sapiens_variation_82_38 -e'select sr.name,
> pf.seq_region_start-1,pf.seq_region_end, concat("\"", concat_ws(":",
> pf.type, s.name, pf.object_id, p.description), "\"") from seq_region sr,
> source s, phenotype p, phenotype_feature pf where pf.seq_region_id =
> sr.seq_region_id and pf.source_id = s.source_id and pf.phenotype_id =
> p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n -k3,3n | bgzip -c >
> phenotypes.bed.gz
> >>>>>
> >>>>> 2) Index it with tabix:
> >>>>>
> >>>>> tabix -p bed phenotypes.bed.gz
> >>>>>
> >>>>> 3) Use it as a custom annotation in VEP (see
> http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):
> >>>>>
> >>>>> echo "rs533747784" | perl variant_effect_predictor.pl -force -cache
> -custom phenotypes.bed.gz,phenotypes,bed,overlap
> >>>>>
> >>>>> Then have fun parsing the output :-)
> >>>>>
> >>>>> You could change the MySQL query above to limit it to one source of
> data, or remove the structural variants for example (there are a lot of
> them and they tend to overlap a significant portion of the genome).
> >>>>>
> >>>>> Regards
> >>>>>
> >>>>> Will
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 9 November 2015 at 19:45, JESSICA X. CHONG <jxchong at uw.edu>
> wrote:
> >>>>>>
> >>>>>> Could you maybe consider making it an option to output the
> phenotype names? At least for me, in most cases, if I’m dealing with a
> small number of samples/variants, it’s much more useful to be able to see
> the phenotype names within the output file rather than have to see a “1”
> and then have to start searching each of the individual phenotype sources
> (OMIM, clinvar, DDG2P, etc etc etc) to figure out which database had a
> phenotype for the gene and maybe the phenotype might not even be relevant.
> >>>>>>
> >>>>>> The GEMINI developers were considering adding this as a feature (
> https://github.com/arq5x/gemini/issues/571) but obviously it’s much
> better to rely on VEP to do this when VEP is so close to already having
> this implemented.
> >>>>>>
> >>>>>> Thanks again,
> >>>>>> Jessica
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Nov 9, 2015, at 9:41 AM, JESSICA X. CHONG <jxchong at uw.edu>
> wrote:
> >>>>>>>
> >>>>>>> Hi Will,
> >>>>>>>
> >>>>>>>> On Nov 9, 2015, at 1:44 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
> >>>>>>>>
> >>>>>>>> Hi Jessica,
> >>>>>>>>
> >>>>>>>> On 9 November 2015 at 00:20, Jessica Chong <jxchong at uw.edu>
> wrote:
> >>>>>>>>>
> >>>>>>>>> I am trying to use the --gene_phenotype option in VEP v82 but I
> am having problems.
> >>>>>>>>>
> >>>>>>>>> 1) if a variant is in a gene that is associated with a
> particular phenotype (e.g. CFTR and cystic fibrosis), where does this
> information get stored in the resulting annotated vcf? I don’t see a
> corresponding field name listed as a possible extras column output on this
> page http://uswest.ensembl.org/info/docs/tools/vep/vep_formats.html#output
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The phenotype name does not get stored in the output. Firstly,
> many phenotype names are long and contain odd characters that can break
> file format encoding. Secondly, many genes and/or variants have several
> phenotypes associated with them, so to list all of them (and multiple times
> in the case where you are reporting e.g. the same gene multiple times)
> would cause the output file size to increase hugely.
> >>>>>>>>
> >>>>>>>> Example for CFTR:
> http://www.ensembl.org/Homo_sapiens/Gene/Phenotype?g=ENSG00000001626
> >>>>>>>
> >>>>>>>
> >>>>>>> Ok, I see. Thanks.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 2) if a variant itself is associated with a particular
> “phenotype, disease, or trait” then my understanding from the VEP output
> documentation is that I should expect this information to show up under the
> PHENO field?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I don't think this is stated anywhere in the documentation.
> http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output shows
> the output fields and their descriptions.
> >>>>>>>
> >>>>>>>
> >>>>>>> Ahhh, ok. Because GENE_PHENO isn’t listed in the VEP documentation
> as a possible output field, the closest I could find was “PHENO,” so I was
> just guessing.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>> I tried annotating a tiny vcf that just includes a variant in
> CFTR (and the variant is dF508, so it is definitely pathogenic and CFTR
> should certainly be associated with a phenotype as well on the gene level)
> but PHENO is always blank (and I don’t see any field mentioning cystic
> fibrosis as a phenotype/disease name).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Apologies, there seems to be an issue with the v82 caches in that
> they are missing the necessary information to populate the fields for
> --gene_phenotype. However, the field you want is GENE_PHENO, which should
> show a binary flag indicating whether the gene is associated with a
> phenotype.
> >>>>>>>>
> >>>>>>>> I will look into why this information is missing.
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>>
> >>>>>>>> Will McLaren
> >>>>>>>> Ensembl Variation
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Here is what I ran:
> >>>>>>>>> perl variant_effect_predictor/variant_effect_predictor.pl \
> >>>>>>>>> -i CFTR.VT.vcf \
> >>>>>>>>> -o CFTR.VT.VEP.vcf \
> >>>>>>>>> --vcf --offline --cache \
> >>>>>>>>> --dir_cache variant_effect_predictor/cache/ \
> >>>>>>>>> --species homo_sapiens --assembly GRCh37 \
> >>>>>>>>> --fasta Homo_sapiens_assembly19.fasta \
> >>>>>>>>> --fork 8 --force_overwrite \
> >>>>>>>>> --compress 'gunzip -c' \
> >>>>>>>>> --sift b --polyphen b --symbol --numbers --biotype \
> >>>>>>>>> --total_length --canonical --ccds --hgvs --shift_hgvs 1
> --gene_phenotype \
> >>>>>>>>> --fields
> Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,CANONICAL,CCDS,HGVSc,HGVSp,PHENO
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> The resulting vcf contains these lines:
> >>>>>>>>> ##VEP=v82
> cache=/ensembl-tools/82/Linux/RHEL6/x86_64/variant_effect_predictor/cache//homo_sapiens/82_GRCh37
> db=. polyphen=2.2.2 sift=sift5.2.2 COSMIC=71 ESP=20141103 gencode=GENCODE
> 19 HGMD-PUBLIC=20152 genebuild=2011-04 regbuild=13 assembly=GRCh37.p13
> dbSNP=144 ClinVar=201507
> >>>>>>>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence
> annotations from Ensembl VEP. Format:
> Consequence|Codons|Amino_acids|Gene|SYMBOL|Feature|EXON|PolyPhen|SIFT|Protein_position|BIOTYPE|CANONICAL|CCDS|HGVSc|HGVSp|PHENO”>
> >>>>>>>>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
> FORMAT  sample1A3       sample1A4       sample1A5       sample1A6
>  sample1A7       sample1A8       sample1A3       sample1A4       sample1A5
>      sample1A6       sample1A7       sample1A8
> >>>>>>>>> 7       117199644       rs199826652     ATCT    A       3996.41
> PASS
> AC=4;AF=0.023;AN=24;BaseQRankSum=-0.941;ClippingRankSum=-0.033;DB;DP=7203;FS=2.556;InbreedingCoeff=-0.0257;MLEAC=5;MLEAF=0.023;MQ0=0;MQ=60.36;MQRankSum=0.129;QD=21.37;ReadPosRankSum=0.673;SOR=0.921;VQSLOD=1.51;culprit=SOR;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000426809|10/26|||477-478/1438|protein_coding|||ENST00000426809.1:c.1431_1433delCTT|ENSP00000389119.1:p.Phe478del|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000454343|10/26|||446-447/1419|protein_coding|||ENST00000454343.1:c.1338_1340delCTT|ENSP00000403677.1:p.Phe447del|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000003084|11/27|||507-508/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1521_1523delCTT|ENSP00000003084.6:p.Phe508del|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||
>      GT:AD:DP:GQ:PL  0/0:4,0:4:9:0,9,135     0/0:10,0:10:24:0,24,360
> 0/0:3,0:3:5:0,5,86      0/1:10,12:22:99:441,0,474
>  0/1:9,18:27:99:729,0,424        0/0:23,0:23:63:0,63,945
> 0/0:22,0:22:66:0,66,714 0/0:33,0:33:84:0,84,1096
> 0/0:22,0:22:62:0,62,736 0/1:17,14:31:99:537,0,881
>  0/1:13,16:29:99:620,0,632       0/0:24,0:24:60:0,60,791
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Dev mailing list    Dev at ensembl.org
> >>>>>>>>> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>>>>>>>> Ensembl Blog: http://www.ensembl.info/
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> Dev mailing list    Dev at ensembl.org
> >>>>>>>> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>>>>>>> Ensembl Blog: http://www.ensembl.info/
> >>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> Dev mailing list    Dev at ensembl.org
> >>>>>>> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>>>>>> Ensembl Blog: http://www.ensembl.info/
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Dev mailing list    Dev at ensembl.org
> >>>>>> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>>>>> Ensembl Blog: http://www.ensembl.info/
> >>>>>>
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Dev mailing list    Dev at ensembl.org
> >>>>> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>>>> Ensembl Blog: http://www.ensembl.info/
> >>>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Dev mailing list    Dev at ensembl.org
> >>>> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>>> Ensembl Blog: http://www.ensembl.info/
> >>>>
> >>>
> >>> _______________________________________________
> >>> Dev mailing list    Dev at ensembl.org
> >>> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>> Ensembl Blog: http://www.ensembl.info/
> >>
> >>
> > _______________________________________________
> > Dev mailing list    Dev at ensembl.org
> > Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> > Ensembl Blog: http://www.ensembl.info/
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20151215/72d57b80/attachment.html>