[ensembl-dev] Confused by --gene_phenotype and PHENO in VEP v82

Will McLaren wm2 at ebi.ac.uk
Tue Nov 10 10:06:50 GMT 2015


Hi Jessica,

Here's a way you can get the phenotype names in VEP without me writing any
new code!

1) Create a BED file from Ensembl's database of phenotype annotations. This
will take a few minutes:

mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38
-e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat("\"",
concat_ws(":", pf.type, s.name, pf.object_id, p.description), "\"") from
seq_region sr, source s, phenotype p, phenotype_feature pf where
pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and
pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n
-k3,3n | bgzip -c > phenotypes.bed.gz

2) Index it with tabix:

tabix -p bed phenotypes.bed.gz

3) Use it as a custom annotation in VEP (see
http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):

echo "rs533747784" | perl variant_effect_predictor.pl -force -cache -custom
phenotypes.bed.gz,phenotypes,bed,overlap

Then have fun parsing the output :-)

You could change the MySQL query above to limit it to one source of data,
or remove the structural variants for example (there are a lot of them and
they tend to overlap a significant portion of the genome).

Regards

Will



On 9 November 2015 at 19:45, JESSICA X. CHONG <jxchong at uw.edu> wrote:

> Could you maybe consider making it an option to output the phenotype
> names? At least for me, in most cases, if I’m dealing with a small number
> of samples/variants, it’s much more useful to be able to see the phenotype
> names within the output file rather than have to see a “1” and then have to
> start searching each of the individual phenotype sources (OMIM, clinvar,
> DDG2P, etc etc etc) to figure out which database had a phenotype for the
> gene and maybe the phenotype might not even be relevant.
>
> The GEMINI developers were considering adding this as a feature (
> https://github.com/arq5x/gemini/issues/571) but obviously it’s much
> better to rely on VEP to do this when VEP is so close to already having
> this implemented.
>
> Thanks again,
> Jessica
>
>
>
> On Nov 9, 2015, at 9:41 AM, JESSICA X. CHONG <jxchong at uw.edu> wrote:
>
> Hi Will,
>
> On Nov 9, 2015, at 1:44 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
>
> Hi Jessica,
>
> On 9 November 2015 at 00:20, Jessica Chong <jxchong at uw.edu> wrote:
>
>> I am trying to use the --gene_phenotype option in VEP v82 but I am having
>> problems.
>>
>> 1) if a variant is in a gene that is associated with a particular
>> phenotype (e.g. CFTR and cystic fibrosis), where does this information get
>> stored in the resulting annotated vcf? I don’t see a corresponding field
>> name listed as a possible extras column output on this page
>> http://uswest.ensembl.org/info/docs/tools/vep/vep_formats.html#output
>
>
> The phenotype name does not get stored in the output. Firstly, many
> phenotype names are long and contain odd characters that can break file
> format encoding. Secondly, many genes and/or variants have several
> phenotypes associated with them, so to list all of them (and multiple times
> in the case where you are reporting e.g. the same gene multiple times)
> would cause the output file size to increase hugely.
>
> Example for CFTR:
> http://www.ensembl.org/Homo_sapiens/Gene/Phenotype?g=ENSG00000001626
>
>
> Ok, I see. Thanks.
>
>
>
>>
>>
>> 2) if a variant itself is associated with a particular “phenotype,
>> disease, or trait” then my understanding from the VEP output documentation
>> is that I should expect this information to show up under the PHENO field?
>>
>
> I don't think this is stated anywhere in the documentation.
> http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output shows
> the output fields and their descriptions.
>
>
> Ahhh, ok. Because GENE_PHENO isn’t listed in the VEP documentation as a
> possible output field, the closest I could find was “PHENO,” so I was just
> guessing.
>
>
>
> I tried annotating a tiny vcf that just includes a variant in CFTR (and
>> the variant is dF508, so it is definitely pathogenic and CFTR should
>> certainly be associated with a phenotype as well on the gene level) but
>> PHENO is always blank (and I don’t see any field mentioning cystic fibrosis
>> as a phenotype/disease name).
>>
>
> Apologies, there seems to be an issue with the v82 caches in that they are
> missing the necessary information to populate the fields for
> --gene_phenotype. However, the field you want is GENE_PHENO, which should
> show a binary flag indicating whether the gene is associated with a
> phenotype.
>
> I will look into why this information is missing.
>
>
> Thanks!
>
>
> Regards
>
> Will McLaren
> Ensembl Variation
>
>
>>
>>
>> Here is what I ran:
>> perl variant_effect_predictor/variant_effect_predictor.pl \
>> -i CFTR.VT.vcf \
>> -o CFTR.VT.VEP.vcf \
>> --vcf --offline --cache \
>> --dir_cache variant_effect_predictor/cache/ \
>> --species homo_sapiens --assembly GRCh37 \
>> --fasta Homo_sapiens_assembly19.fasta \
>> --fork 8 --force_overwrite \
>> --compress 'gunzip -c' \
>> --sift b --polyphen b --symbol --numbers --biotype \
>> --total_length --canonical --ccds --hgvs --shift_hgvs 1 --gene_phenotype \
>> --fields
>> Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,CANONICAL,CCDS,HGVSc,HGVSp,PHENO
>>
>>
>> The resulting vcf contains these lines:
>> ##VEP=v82
>> cache=/ensembl-tools/82/Linux/RHEL6/x86_64/variant_effect_predictor/cache//homo_sapiens/82_GRCh37
>> db=. polyphen=2.2.2 sift=sift5.2.2 COSMIC=71 ESP=20141103 gencode=GENCODE
>> 19 HGMD-PUBLIC=20152 genebuild=2011-04 regbuild=13 assembly=GRCh37.p13
>> dbSNP=144 ClinVar=201507
>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations
>> from Ensembl VEP. Format:
>> Consequence|Codons|Amino_acids|Gene|SYMBOL|Feature|EXON|PolyPhen|SIFT|Protein_position|BIOTYPE|CANONICAL|CCDS|HGVSc|HGVSp|PHENO”>
>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
>> sample1A3       sample1A4       sample1A5       sample1A6       sample1A7
>>      sample1A8       sample1A3       sample1A4       sample1A5
>>  sample1A6       sample1A7       sample1A8
>> 7       117199644       rs199826652     ATCT    A       3996.41 PASS
>> AC=4;AF=0.023;AN=24;BaseQRankSum=-0.941;ClippingRankSum=-0.033;DB;DP=7203;FS=2.556;InbreedingCoeff=-0.0257;MLEAC=5;MLEAF=0.023;MQ0=0;MQ=60.36;MQRankSum=0.129;QD=21.37;ReadPosRankSum=0.673;SOR=0.921;VQSLOD=1.51;culprit=SOR;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000426809|10/26|||477-478/1438|protein_coding|||ENST00000426809.1:c.1431_1433delCTT|ENSP00000389119.1:p.Phe478del|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000454343|10/26|||446-447/1419|protein_coding|||ENST00000454343.1:c.1338_1340delCTT|ENSP00000403677.1:p.Phe447del|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000003084|11/27|||507-508/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1521_1523delCTT|ENSP00000003084.6:p.Phe508del|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||
>>      GT:AD:DP:GQ:PL  0/0:4,0:4:9:0,9,135     0/0:10,0:10:24:0,24,360
>> 0/0:3,0:3:5:0,5,86      0/1:10,12:22:99:441,0,474
>>  0/1:9,18:27:99:729,0,424        0/0:23,0:23:63:0,63,945
>> 0/0:22,0:22:66:0,66,714 0/0:33,0:33:84:0,84,1096
>> 0/0:22,0:22:62:0,62,736 0/1:17,14:31:99:537,0,881
>>  0/1:13,16:29:99:620,0,632       0/0:24,0:24:60:0,60,791
>>
>>
>> Thanks!
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20151110/7f64fb96/attachment.html>


More information about the Dev mailing list