[ensembl-dev] Confused by --gene_phenotype and PHENO in VEP v82

Cyriac Kandoth kandoth at cbio.mskcc.org
Thu Nov 19 18:01:20 GMT 2015


Hi Will, sorry to fork this thread. Do you have any immediate plans to
write a VEP plugin that will poop out all this phenotype information into a
nicely organized JSON? That will be super useful for web front-ends like
cBioPortal. If not, we plan to do this by mid-December, and submit a pull
request to https://github.com/ensembl-variation/VEP_plugins

Thanks,
Cyriac

On Tue, Nov 10, 2015 at 5:06 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:

> Hi Jessica,
>
> Here's a way you can get the phenotype names in VEP without me writing any
> new code!
>
> 1) Create a BED file from Ensembl's database of phenotype annotations.
> This will take a few minutes:
>
> mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38
> -e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat("\"",
> concat_ws(":", pf.type, s.name, pf.object_id, p.description), "\"") from
> seq_region sr, source s, phenotype p, phenotype_feature pf where
> pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and
> pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n
> -k3,3n | bgzip -c > phenotypes.bed.gz
>
> 2) Index it with tabix:
>
> tabix -p bed phenotypes.bed.gz
>
> 3) Use it as a custom annotation in VEP (see
> http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):
>
> echo "rs533747784" | perl variant_effect_predictor.pl -force -cache
> -custom phenotypes.bed.gz,phenotypes,bed,overlap
>
> Then have fun parsing the output :-)
>
> You could change the MySQL query above to limit it to one source of data,
> or remove the structural variants for example (there are a lot of them and
> they tend to overlap a significant portion of the genome).
>
> Regards
>
> Will
>
>
>
> On 9 November 2015 at 19:45, JESSICA X. CHONG <jxchong at uw.edu> wrote:
>
>> Could you maybe consider making it an option to output the phenotype
>> names? At least for me, in most cases, if I’m dealing with a small number
>> of samples/variants, it’s much more useful to be able to see the phenotype
>> names within the output file rather than have to see a “1” and then have to
>> start searching each of the individual phenotype sources (OMIM, clinvar,
>> DDG2P, etc etc etc) to figure out which database had a phenotype for the
>> gene and maybe the phenotype might not even be relevant.
>>
>> The GEMINI developers were considering adding this as a feature (
>> https://github.com/arq5x/gemini/issues/571) but obviously it’s much
>> better to rely on VEP to do this when VEP is so close to already having
>> this implemented.
>>
>> Thanks again,
>> Jessica
>>
>>
>>
>> On Nov 9, 2015, at 9:41 AM, JESSICA X. CHONG <jxchong at uw.edu> wrote:
>>
>> Hi Will,
>>
>> On Nov 9, 2015, at 1:44 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>
>> Hi Jessica,
>>
>> On 9 November 2015 at 00:20, Jessica Chong <jxchong at uw.edu> wrote:
>>
>>> I am trying to use the --gene_phenotype option in VEP v82 but I am
>>> having problems.
>>>
>>> 1) if a variant is in a gene that is associated with a particular
>>> phenotype (e.g. CFTR and cystic fibrosis), where does this information get
>>> stored in the resulting annotated vcf? I don’t see a corresponding field
>>> name listed as a possible extras column output on this page
>>> http://uswest.ensembl.org/info/docs/tools/vep/vep_formats.html#output
>>
>>
>> The phenotype name does not get stored in the output. Firstly, many
>> phenotype names are long and contain odd characters that can break file
>> format encoding. Secondly, many genes and/or variants have several
>> phenotypes associated with them, so to list all of them (and multiple times
>> in the case where you are reporting e.g. the same gene multiple times)
>> would cause the output file size to increase hugely.
>>
>> Example for CFTR:
>> http://www.ensembl.org/Homo_sapiens/Gene/Phenotype?g=ENSG00000001626
>>
>>
>> Ok, I see. Thanks.
>>
>>
>>
>>>
>>>
>>> 2) if a variant itself is associated with a particular “phenotype,
>>> disease, or trait” then my understanding from the VEP output documentation
>>> is that I should expect this information to show up under the PHENO field?
>>>
>>
>> I don't think this is stated anywhere in the documentation.
>> http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output shows
>> the output fields and their descriptions.
>>
>>
>> Ahhh, ok. Because GENE_PHENO isn’t listed in the VEP documentation as a
>> possible output field, the closest I could find was “PHENO,” so I was just
>> guessing.
>>
>>
>>
>> I tried annotating a tiny vcf that just includes a variant in CFTR (and
>>> the variant is dF508, so it is definitely pathogenic and CFTR should
>>> certainly be associated with a phenotype as well on the gene level) but
>>> PHENO is always blank (and I don’t see any field mentioning cystic fibrosis
>>> as a phenotype/disease name).
>>>
>>
>> Apologies, there seems to be an issue with the v82 caches in that they
>> are missing the necessary information to populate the fields for
>> --gene_phenotype. However, the field you want is GENE_PHENO, which should
>> show a binary flag indicating whether the gene is associated with a
>> phenotype.
>>
>> I will look into why this information is missing.
>>
>>
>> Thanks!
>>
>>
>> Regards
>>
>> Will McLaren
>> Ensembl Variation
>>
>>
>>>
>>>
>>> Here is what I ran:
>>> perl variant_effect_predictor/variant_effect_predictor.pl \
>>> -i CFTR.VT.vcf \
>>> -o CFTR.VT.VEP.vcf \
>>> --vcf --offline --cache \
>>> --dir_cache variant_effect_predictor/cache/ \
>>> --species homo_sapiens --assembly GRCh37 \
>>> --fasta Homo_sapiens_assembly19.fasta \
>>> --fork 8 --force_overwrite \
>>> --compress 'gunzip -c' \
>>> --sift b --polyphen b --symbol --numbers --biotype \
>>> --total_length --canonical --ccds --hgvs --shift_hgvs 1 --gene_phenotype
>>> \
>>> --fields
>>> Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,CANONICAL,CCDS,HGVSc,HGVSp,PHENO
>>>
>>>
>>> The resulting vcf contains these lines:
>>> ##VEP=v82
>>> cache=/ensembl-tools/82/Linux/RHEL6/x86_64/variant_effect_predictor/cache//homo_sapiens/82_GRCh37
>>> db=. polyphen=2.2.2 sift=sift5.2.2 COSMIC=71 ESP=20141103 gencode=GENCODE
>>> 19 HGMD-PUBLIC=20152 genebuild=2011-04 regbuild=13 assembly=GRCh37.p13
>>> dbSNP=144 ClinVar=201507
>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations
>>> from Ensembl VEP. Format:
>>> Consequence|Codons|Amino_acids|Gene|SYMBOL|Feature|EXON|PolyPhen|SIFT|Protein_position|BIOTYPE|CANONICAL|CCDS|HGVSc|HGVSp|PHENO”>
>>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
>>> sample1A3       sample1A4       sample1A5       sample1A6       sample1A7
>>>      sample1A8       sample1A3       sample1A4       sample1A5
>>>  sample1A6       sample1A7       sample1A8
>>> 7       117199644       rs199826652     ATCT    A       3996.41 PASS
>>> AC=4;AF=0.023;AN=24;BaseQRankSum=-0.941;ClippingRankSum=-0.033;DB;DP=7203;FS=2.556;InbreedingCoeff=-0.0257;MLEAC=5;MLEAF=0.023;MQ0=0;MQ=60.36;MQRankSum=0.129;QD=21.37;ReadPosRankSum=0.673;SOR=0.921;VQSLOD=1.51;culprit=SOR;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000426809|10/26|||477-478/1438|protein_coding|||ENST00000426809.1:c.1431_1433delCTT|ENSP00000389119.1:p.Phe478del|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000454343|10/26|||446-447/1419|protein_coding|||ENST00000454343.1:c.1338_1340delCTT|ENSP00000403677.1:p.Phe447del|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000003084|11/27|||507-508/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1521_1523delCTT|ENSP00000003084.6:p.Phe508del|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||
>>>      GT:AD:DP:GQ:PL  0/0:4,0:4:9:0,9,135     0/0:10,0:10:24:0,24,360
>>> 0/0:3,0:3:5:0,5,86      0/1:10,12:22:99:441,0,474
>>>  0/1:9,18:27:99:729,0,424        0/0:23,0:23:63:0,63,945
>>> 0/0:22,0:22:66:0,66,714 0/0:33,0:33:84:0,84,1096
>>> 0/0:22,0:22:62:0,62,736 0/1:17,14:31:99:537,0,881
>>>  0/1:13,16:29:99:620,0,632       0/0:24,0:24:60:0,60,791
>>>
>>>
>>> Thanks!
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20151119/a9738e6c/attachment.html>


More information about the Dev mailing list