[ensembl-dev] Confused by --gene_phenotype and PHENO in VEP v82

Will McLaren wm2 at ebi.ac.uk
Fri Nov 20 10:33:17 GMT 2015


Hi Cyriac, Jessica,

We are planning on writing a phenotype plugin in the near future exactly as
you describe. We'd be very happy if you'd be willing to contribute to this,
in the form of testing, feedback or code.

Regarding output format, this is the part that is difficult with
unstructured and expansive data like this, particularly trying to jam it
into our already jammed line/separator-based text output formats. Do you
use the VEP's JSON output format already (--json)? If so, it will be easy
to extend this output with phenotype data, and being JSON will make it nice
and easy to process afterwards.

If you don't use --json, and you are thinking the JSON should be some
sidecar output file, how do you intend on integrating this with the "main"
VEP results file?

The gene phenotype flag data will be fixed for release 83, due next month.

Regards

Will McLaren
Ensembl Variation

On 19 November 2015 at 18:01, Cyriac Kandoth <kandoth at cbio.mskcc.org> wrote:

> Hi Will, sorry to fork this thread. Do you have any immediate plans to
> write a VEP plugin that will poop out all this phenotype information into a
> nicely organized JSON? That will be super useful for web front-ends like
> cBioPortal. If not, we plan to do this by mid-December, and submit a pull
> request to https://github.com/ensembl-variation/VEP_plugins
>
> Thanks,
> Cyriac
>
> On Tue, Nov 10, 2015 at 5:06 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
>
>> Hi Jessica,
>>
>> Here's a way you can get the phenotype names in VEP without me writing
>> any new code!
>>
>> 1) Create a BED file from Ensembl's database of phenotype annotations.
>> This will take a few minutes:
>>
>> mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38
>> -e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat("\"",
>> concat_ws(":", pf.type, s.name, pf.object_id, p.description), "\"") from
>> seq_region sr, source s, phenotype p, phenotype_feature pf where
>> pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and
>> pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n
>> -k3,3n | bgzip -c > phenotypes.bed.gz
>>
>> 2) Index it with tabix:
>>
>> tabix -p bed phenotypes.bed.gz
>>
>> 3) Use it as a custom annotation in VEP (see
>> http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):
>>
>> echo "rs533747784" | perl variant_effect_predictor.pl -force -cache
>> -custom phenotypes.bed.gz,phenotypes,bed,overlap
>>
>> Then have fun parsing the output :-)
>>
>> You could change the MySQL query above to limit it to one source of data,
>> or remove the structural variants for example (there are a lot of them and
>> they tend to overlap a significant portion of the genome).
>>
>> Regards
>>
>> Will
>>
>>
>>
>> On 9 November 2015 at 19:45, JESSICA X. CHONG <jxchong at uw.edu> wrote:
>>
>>> Could you maybe consider making it an option to output the phenotype
>>> names? At least for me, in most cases, if I’m dealing with a small number
>>> of samples/variants, it’s much more useful to be able to see the phenotype
>>> names within the output file rather than have to see a “1” and then have to
>>> start searching each of the individual phenotype sources (OMIM, clinvar,
>>> DDG2P, etc etc etc) to figure out which database had a phenotype for the
>>> gene and maybe the phenotype might not even be relevant.
>>>
>>> The GEMINI developers were considering adding this as a feature (
>>> https://github.com/arq5x/gemini/issues/571) but obviously it’s much
>>> better to rely on VEP to do this when VEP is so close to already having
>>> this implemented.
>>>
>>> Thanks again,
>>> Jessica
>>>
>>>
>>>
>>> On Nov 9, 2015, at 9:41 AM, JESSICA X. CHONG <jxchong at uw.edu> wrote:
>>>
>>> Hi Will,
>>>
>>> On Nov 9, 2015, at 1:44 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>>
>>> Hi Jessica,
>>>
>>> On 9 November 2015 at 00:20, Jessica Chong <jxchong at uw.edu> wrote:
>>>
>>>> I am trying to use the --gene_phenotype option in VEP v82 but I am
>>>> having problems.
>>>>
>>>> 1) if a variant is in a gene that is associated with a particular
>>>> phenotype (e.g. CFTR and cystic fibrosis), where does this information get
>>>> stored in the resulting annotated vcf? I don’t see a corresponding field
>>>> name listed as a possible extras column output on this page
>>>> http://uswest.ensembl.org/info/docs/tools/vep/vep_formats.html#output
>>>
>>>
>>> The phenotype name does not get stored in the output. Firstly, many
>>> phenotype names are long and contain odd characters that can break file
>>> format encoding. Secondly, many genes and/or variants have several
>>> phenotypes associated with them, so to list all of them (and multiple times
>>> in the case where you are reporting e.g. the same gene multiple times)
>>> would cause the output file size to increase hugely.
>>>
>>> Example for CFTR:
>>> http://www.ensembl.org/Homo_sapiens/Gene/Phenotype?g=ENSG00000001626
>>>
>>>
>>> Ok, I see. Thanks.
>>>
>>>
>>>
>>>>
>>>>
>>>> 2) if a variant itself is associated with a particular “phenotype,
>>>> disease, or trait” then my understanding from the VEP output documentation
>>>> is that I should expect this information to show up under the PHENO field?
>>>>
>>>
>>> I don't think this is stated anywhere in the documentation.
>>> http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output shows
>>> the output fields and their descriptions.
>>>
>>>
>>> Ahhh, ok. Because GENE_PHENO isn’t listed in the VEP documentation as a
>>> possible output field, the closest I could find was “PHENO,” so I was just
>>> guessing.
>>>
>>>
>>>
>>> I tried annotating a tiny vcf that just includes a variant in CFTR (and
>>>> the variant is dF508, so it is definitely pathogenic and CFTR should
>>>> certainly be associated with a phenotype as well on the gene level) but
>>>> PHENO is always blank (and I don’t see any field mentioning cystic fibrosis
>>>> as a phenotype/disease name).
>>>>
>>>
>>> Apologies, there seems to be an issue with the v82 caches in that they
>>> are missing the necessary information to populate the fields for
>>> --gene_phenotype. However, the field you want is GENE_PHENO, which should
>>> show a binary flag indicating whether the gene is associated with a
>>> phenotype.
>>>
>>> I will look into why this information is missing.
>>>
>>>
>>> Thanks!
>>>
>>>
>>> Regards
>>>
>>> Will McLaren
>>> Ensembl Variation
>>>
>>>
>>>>
>>>>
>>>> Here is what I ran:
>>>> perl variant_effect_predictor/variant_effect_predictor.pl \
>>>> -i CFTR.VT.vcf \
>>>> -o CFTR.VT.VEP.vcf \
>>>> --vcf --offline --cache \
>>>> --dir_cache variant_effect_predictor/cache/ \
>>>> --species homo_sapiens --assembly GRCh37 \
>>>> --fasta Homo_sapiens_assembly19.fasta \
>>>> --fork 8 --force_overwrite \
>>>> --compress 'gunzip -c' \
>>>> --sift b --polyphen b --symbol --numbers --biotype \
>>>> --total_length --canonical --ccds --hgvs --shift_hgvs 1
>>>> --gene_phenotype \
>>>> --fields
>>>> Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,CANONICAL,CCDS,HGVSc,HGVSp,PHENO
>>>>
>>>>
>>>> The resulting vcf contains these lines:
>>>> ##VEP=v82
>>>> cache=/ensembl-tools/82/Linux/RHEL6/x86_64/variant_effect_predictor/cache//homo_sapiens/82_GRCh37
>>>> db=. polyphen=2.2.2 sift=sift5.2.2 COSMIC=71 ESP=20141103 gencode=GENCODE
>>>> 19 HGMD-PUBLIC=20152 genebuild=2011-04 regbuild=13 assembly=GRCh37.p13
>>>> dbSNP=144 ClinVar=201507
>>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence
>>>> annotations from Ensembl VEP. Format:
>>>> Consequence|Codons|Amino_acids|Gene|SYMBOL|Feature|EXON|PolyPhen|SIFT|Protein_position|BIOTYPE|CANONICAL|CCDS|HGVSc|HGVSp|PHENO”>
>>>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
>>>> sample1A3       sample1A4       sample1A5       sample1A6       sample1A7
>>>>      sample1A8       sample1A3       sample1A4       sample1A5
>>>>  sample1A6       sample1A7       sample1A8
>>>> 7       117199644       rs199826652     ATCT    A       3996.41 PASS
>>>> AC=4;AF=0.023;AN=24;BaseQRankSum=-0.941;ClippingRankSum=-0.033;DB;DP=7203;FS=2.556;InbreedingCoeff=-0.0257;MLEAC=5;MLEAF=0.023;MQ0=0;MQ=60.36;MQRankSum=0.129;QD=21.37;ReadPosRankSum=0.673;SOR=0.921;VQSLOD=1.51;culprit=SOR;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000426809|10/26|||477-478/1438|protein_coding|||ENST00000426809.1:c.1431_1433delCTT|ENSP00000389119.1:p.Phe478del|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000454343|10/26|||446-447/1419|protein_coding|||ENST00000454343.1:c.1338_1340delCTT|ENSP00000403677.1:p.Phe447del|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000003084|11/27|||507-508/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1521_1523delCTT|ENSP00000003084.6:p.Phe508del|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||
>>>>      GT:AD:DP:GQ:PL  0/0:4,0:4:9:0,9,135     0/0:10,0:10:24:0,24,360
>>>> 0/0:3,0:3:5:0,5,86      0/1:10,12:22:99:441,0,474
>>>>  0/1:9,18:27:99:729,0,424        0/0:23,0:23:63:0,63,945
>>>> 0/0:22,0:22:66:0,66,714 0/0:33,0:33:84:0,84,1096
>>>> 0/0:22,0:22:62:0,62,736 0/1:17,14:31:99:537,0,881
>>>>  0/1:13,16:29:99:620,0,632       0/0:24,0:24:60:0,60,791
>>>>
>>>>
>>>> Thanks!
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20151120/3f0a1865/attachment.html>


More information about the Dev mailing list