[ensembl-dev] Confused by --gene_phenotype and PHENO in VEP v82

Cyriac Kandoth kandoth at cbio.mskcc.org
Fri Nov 20 17:00:41 GMT 2015


Great to hear about the fixes coming in v83.

I imagined the plugin would initially only support JSON output, after
filling up a nested Perl hash with all the phenotype data. It could also be
squeezed into a VCF or VEP's TSV format, but like you said, that's a
stretch. People who took the trouble to install the VEP plugin, likely
already have the dev experience to parse JSONs, and pull out the important
bits into a TSV or in a web frontend. In the future, the plugin can
optionally do this to add select phenotype info into the TSV or the VCF,
depending on how many people want that feature.

~Cyriac

On Fri, Nov 20, 2015 at 5:33 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:

> Hi Cyriac, Jessica,
>
> We are planning on writing a phenotype plugin in the near future exactly
> as you describe. We'd be very happy if you'd be willing to contribute to
> this, in the form of testing, feedback or code.
>
> Regarding output format, this is the part that is difficult with
> unstructured and expansive data like this, particularly trying to jam it
> into our already jammed line/separator-based text output formats. Do you
> use the VEP's JSON output format already (--json)? If so, it will be easy
> to extend this output with phenotype data, and being JSON will make it nice
> and easy to process afterwards.
>
> If you don't use --json, and you are thinking the JSON should be some
> sidecar output file, how do you intend on integrating this with the "main"
> VEP results file?
>
> The gene phenotype flag data will be fixed for release 83, due next month.
>
> Regards
>
> Will McLaren
> Ensembl Variation
>
> On 19 November 2015 at 18:01, Cyriac Kandoth <kandoth at cbio.mskcc.org>
> wrote:
>
>> Hi Will, sorry to fork this thread. Do you have any immediate plans to
>> write a VEP plugin that will poop out all this phenotype information into a
>> nicely organized JSON? That will be super useful for web front-ends like
>> cBioPortal. If not, we plan to do this by mid-December, and submit a pull
>> request to https://github.com/ensembl-variation/VEP_plugins
>>
>> Thanks,
>> Cyriac
>>
>> On Tue, Nov 10, 2015 at 5:06 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>
>>> Hi Jessica,
>>>
>>> Here's a way you can get the phenotype names in VEP without me writing
>>> any new code!
>>>
>>> 1) Create a BED file from Ensembl's database of phenotype annotations.
>>> This will take a few minutes:
>>>
>>> mysql -hensembldb.ensembl.org -uanonymous
>>> -Dhomo_sapiens_variation_82_38 -e'select sr.name,
>>> pf.seq_region_start-1,pf.seq_region_end, concat("\"", concat_ws(":",
>>> pf.type, s.name, pf.object_id, p.description), "\"") from seq_region
>>> sr, source s, phenotype p, phenotype_feature pf where pf.seq_region_id =
>>> sr.seq_region_id and pf.source_id = s.source_id and pf.phenotype_id =
>>> p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n -k3,3n | bgzip -c >
>>> phenotypes.bed.gz
>>>
>>> 2) Index it with tabix:
>>>
>>> tabix -p bed phenotypes.bed.gz
>>>
>>> 3) Use it as a custom annotation in VEP (see
>>> http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):
>>>
>>> echo "rs533747784" | perl variant_effect_predictor.pl -force -cache
>>> -custom phenotypes.bed.gz,phenotypes,bed,overlap
>>>
>>> Then have fun parsing the output :-)
>>>
>>> You could change the MySQL query above to limit it to one source of
>>> data, or remove the structural variants for example (there are a lot of
>>> them and they tend to overlap a significant portion of the genome).
>>>
>>> Regards
>>>
>>> Will
>>>
>>>
>>>
>>> On 9 November 2015 at 19:45, JESSICA X. CHONG <jxchong at uw.edu> wrote:
>>>
>>>> Could you maybe consider making it an option to output the phenotype
>>>> names? At least for me, in most cases, if I’m dealing with a small number
>>>> of samples/variants, it’s much more useful to be able to see the phenotype
>>>> names within the output file rather than have to see a “1” and then have to
>>>> start searching each of the individual phenotype sources (OMIM, clinvar,
>>>> DDG2P, etc etc etc) to figure out which database had a phenotype for the
>>>> gene and maybe the phenotype might not even be relevant.
>>>>
>>>> The GEMINI developers were considering adding this as a feature (
>>>> https://github.com/arq5x/gemini/issues/571) but obviously it’s much
>>>> better to rely on VEP to do this when VEP is so close to already having
>>>> this implemented.
>>>>
>>>> Thanks again,
>>>> Jessica
>>>>
>>>>
>>>>
>>>> On Nov 9, 2015, at 9:41 AM, JESSICA X. CHONG <jxchong at uw.edu> wrote:
>>>>
>>>> Hi Will,
>>>>
>>>> On Nov 9, 2015, at 1:44 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>>>
>>>> Hi Jessica,
>>>>
>>>> On 9 November 2015 at 00:20, Jessica Chong <jxchong at uw.edu> wrote:
>>>>
>>>>> I am trying to use the --gene_phenotype option in VEP v82 but I am
>>>>> having problems.
>>>>>
>>>>> 1) if a variant is in a gene that is associated with a particular
>>>>> phenotype (e.g. CFTR and cystic fibrosis), where does this information get
>>>>> stored in the resulting annotated vcf? I don’t see a corresponding field
>>>>> name listed as a possible extras column output on this page
>>>>> http://uswest.ensembl.org/info/docs/tools/vep/vep_formats.html#output
>>>>
>>>>
>>>> The phenotype name does not get stored in the output. Firstly, many
>>>> phenotype names are long and contain odd characters that can break file
>>>> format encoding. Secondly, many genes and/or variants have several
>>>> phenotypes associated with them, so to list all of them (and multiple times
>>>> in the case where you are reporting e.g. the same gene multiple times)
>>>> would cause the output file size to increase hugely.
>>>>
>>>> Example for CFTR:
>>>> http://www.ensembl.org/Homo_sapiens/Gene/Phenotype?g=ENSG00000001626
>>>>
>>>>
>>>> Ok, I see. Thanks.
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> 2) if a variant itself is associated with a particular “phenotype,
>>>>> disease, or trait” then my understanding from the VEP output documentation
>>>>> is that I should expect this information to show up under the PHENO field?
>>>>>
>>>>
>>>> I don't think this is stated anywhere in the documentation.
>>>> http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output shows
>>>> the output fields and their descriptions.
>>>>
>>>>
>>>> Ahhh, ok. Because GENE_PHENO isn’t listed in the VEP documentation as a
>>>> possible output field, the closest I could find was “PHENO,” so I was just
>>>> guessing.
>>>>
>>>>
>>>>
>>>> I tried annotating a tiny vcf that just includes a variant in CFTR (and
>>>>> the variant is dF508, so it is definitely pathogenic and CFTR should
>>>>> certainly be associated with a phenotype as well on the gene level) but
>>>>> PHENO is always blank (and I don’t see any field mentioning cystic fibrosis
>>>>> as a phenotype/disease name).
>>>>>
>>>>
>>>> Apologies, there seems to be an issue with the v82 caches in that they
>>>> are missing the necessary information to populate the fields for
>>>> --gene_phenotype. However, the field you want is GENE_PHENO, which should
>>>> show a binary flag indicating whether the gene is associated with a
>>>> phenotype.
>>>>
>>>> I will look into why this information is missing.
>>>>
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> Regards
>>>>
>>>> Will McLaren
>>>> Ensembl Variation
>>>>
>>>>
>>>>>
>>>>>
>>>>> Here is what I ran:
>>>>> perl variant_effect_predictor/variant_effect_predictor.pl \
>>>>> -i CFTR.VT.vcf \
>>>>> -o CFTR.VT.VEP.vcf \
>>>>> --vcf --offline --cache \
>>>>> --dir_cache variant_effect_predictor/cache/ \
>>>>> --species homo_sapiens --assembly GRCh37 \
>>>>> --fasta Homo_sapiens_assembly19.fasta \
>>>>> --fork 8 --force_overwrite \
>>>>> --compress 'gunzip -c' \
>>>>> --sift b --polyphen b --symbol --numbers --biotype \
>>>>> --total_length --canonical --ccds --hgvs --shift_hgvs 1
>>>>> --gene_phenotype \
>>>>> --fields
>>>>> Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,CANONICAL,CCDS,HGVSc,HGVSp,PHENO
>>>>>
>>>>>
>>>>> The resulting vcf contains these lines:
>>>>> ##VEP=v82
>>>>> cache=/ensembl-tools/82/Linux/RHEL6/x86_64/variant_effect_predictor/cache//homo_sapiens/82_GRCh37
>>>>> db=. polyphen=2.2.2 sift=sift5.2.2 COSMIC=71 ESP=20141103 gencode=GENCODE
>>>>> 19 HGMD-PUBLIC=20152 genebuild=2011-04 regbuild=13 assembly=GRCh37.p13
>>>>> dbSNP=144 ClinVar=201507
>>>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence
>>>>> annotations from Ensembl VEP. Format:
>>>>> Consequence|Codons|Amino_acids|Gene|SYMBOL|Feature|EXON|PolyPhen|SIFT|Protein_position|BIOTYPE|CANONICAL|CCDS|HGVSc|HGVSp|PHENO”>
>>>>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>>>>> FORMAT  sample1A3       sample1A4       sample1A5       sample1A6
>>>>>  sample1A7       sample1A8       sample1A3       sample1A4       sample1A5
>>>>>      sample1A6       sample1A7       sample1A8
>>>>> 7       117199644       rs199826652     ATCT    A       3996.41 PASS
>>>>>
>>>>> AC=4;AF=0.023;AN=24;BaseQRankSum=-0.941;ClippingRankSum=-0.033;DB;DP=7203;FS=2.556;InbreedingCoeff=-0.0257;MLEAC=5;MLEAF=0.023;MQ0=0;MQ=60.36;MQRankSum=0.129;QD=21.37;ReadPosRankSum=0.673;SOR=0.921;VQSLOD=1.51;culprit=SOR;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000426809|10/26|||477-478/1438|protein_coding|||ENST00000426809.1:c.1431_1433delCTT|ENSP00000389119.1:p.Phe478del|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000454343|10/26|||446-447/1419|protein_coding|||ENST00000454343.1:c.1338_1340delCTT|ENSP00000403677.1:p.Phe447del|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000003084|11/27|||507-508/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1521_1523delCTT|ENSP00000003084.6:p.Phe508del|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||
>>>>>      GT:AD:DP:GQ:PL  0/0:4,0:4:9:0,9,135     0/0:10,0:10:24:0,24,360
>>>>> 0/0:3,0:3:5:0,5,86      0/1:10,12:22:99:441,0,474
>>>>>  0/1:9,18:27:99:729,0,424        0/0:23,0:23:63:0,63,945
>>>>> 0/0:22,0:22:66:0,66,714 0/0:33,0:33:84:0,84,1096
>>>>> 0/0:22,0:22:62:0,62,736 0/1:17,14:31:99:537,0,881
>>>>>  0/1:13,16:29:99:620,0,632       0/0:24,0:24:60:0,60,791
>>>>>
>>>>>
>>>>> Thanks!
>>>>> _______________________________________________
>>>>> Dev mailing list    Dev at ensembl.org
>>>>> Posting guidelines and subscribe/unsubscribe info:
>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>
>>>>
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>>
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20151120/ae46845c/attachment.html>


More information about the Dev mailing list