[ensembl-dev] VEP plugins for intergenic variants

Will McLaren wm2 at ebi.ac.uk
Thu May 8 16:12:51 BST 2014


Hello again,

I've fixed a bug that prevented UpDownDistance from functioning correctly.
It hadn't been tested with distances as large as the one you specified,
which broke some assumptions in the core VEP code.
You will need to update your ensembl-variation module or re-run the VEP
INSTALL.pl script to pick up the new API code.

As far as the other plugins go, I think you are misunderstanding how some
of them work:

TSSDistance - this gives the distance between a variant and the annotated
transcript start site. If a variant is annotated as intergenic, there is no
transcript to give the distance to! Changing the code to force it to assess
intergenic variants will therefore break here. If, however, you increase the
up/down-stream distance using UpDownDistance so that VEP finds a transcript
in range, the plugin will work as expected without modification. It seems
to me that you are expecting this plugin to find the shortest distance to
_any_ transcript start site, which is not the intended purpose of the code.
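
To make the distinction concrete, here is a minimal standalone sketch (in
Python, not VEP plugin code) of the "nearest TSS to any transcript"
calculation that I think you are expecting. The transcript tuples, their
coordinates and the function names are all hypothetical and exist only for
illustration; TSSDistance itself only ever looks at the one transcript that
VEP has already paired with the variant.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical transcript record: (name, start, end, strand), 1-based coordinates.
Transcript = Tuple[str, int, int, int]


def tss_position(tr: Transcript) -> int:
    """The TSS is the start on the forward strand and the end on the reverse strand."""
    _, start, end, strand = tr
    return start if strand > 0 else end


def nearest_tss(pos: int, transcripts: Iterable[Transcript]) -> Optional[Tuple[str, int]]:
    """Return (name, distance) for the closest TSS, or None if there are no transcripts."""
    best = None
    for tr in transcripts:
        dist = abs(pos - tss_position(tr))
        if best is None or dist < best[1]:
            best = (tr[0], dist)
    return best


# Made-up coordinates, purely for illustration.
transcripts = [
    ("tx_fwd", 1000, 5000, 1),   # forward strand, TSS at 1000
    ("tx_rev", 7000, 9000, -1),  # reverse strand, TSS at 9000
]
print(nearest_tss(6000, transcripts))
```

A search like this needs a list of candidate transcripts around the variant,
which is exactly what an intergenic annotation does not provide.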

Condel & dbNSFP - these two plugins work exclusively on missense AKA
non-synonymous SNVs (hence the NS in the name dbNSFP). While dbNSFP carries
scores for CADD, and CADD gives scores for any genomic position, the CADD
scores in dbNSFP are only for missense variants.

The feature_types() subroutine should be used when writing your own plugin
to determine which kind of variant/feature combinations are considered by
the plugin, since the run() sub is executed once for each variant/feature
overlap found by the core VEP code. Modifying existing plugins like this
should be done only if you are confident that the modification achieves
what you intend.
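
As a rough illustration of that dispatch model, the toy sketch below (plain
Python, not the real VEP plugin API) calls run() once per variant/feature
overlap and skips any overlap whose type is not listed by feature_types().
The class, the dispatch() helper and the overlap tuples are hypothetical
names invented for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical overlap record: (variant name, feature type, feature name).
Overlap = Tuple[str, str, str]


class ToyPlugin:
    """Stand-in for a plugin; it only mimics the feature_types()/run() pairing."""

    def feature_types(self) -> List[str]:
        # Only transcript overlaps are passed to run().
        return ["Transcript"]

    def run(self, variant: str, feature_type: str, feature: str) -> Dict[str, str]:
        return {"Seen": f"{variant}/{feature}"}


def dispatch(plugin: ToyPlugin, overlaps: List[Overlap]) -> List[Dict[str, str]]:
    """Call plugin.run() once for each overlap whose type the plugin asked for."""
    wanted = set(plugin.feature_types())
    results = []
    for variant, ftype, feature in overlaps:
        if ftype in wanted:
            results.append(plugin.run(variant, ftype, feature))
    return results


overlaps = [
    ("var1", "Transcript", "tx1"),
    ("var1", "Transcript", "tx2"),
    ("var2", "Intergenic", ""),  # no transcript attached to this one
]
results = dispatch(ToyPlugin(), overlaps)
print(results)
```

Adding "Intergenic" to the returned list would make run() fire for var2 as
well, but run() would then have no transcript to work with, which is the
situation behind the TSSDistance error you saw.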

Hope that all helps

Will


On 7 May 2014 17:59, Genomeo Dev <genomeodev at gmail.com> wrote:

> Thanks Will.
>
> I am working with non-coding and intergenic variants and wanted to run VEP
> with the following plugins:
>
> --plugin UpDownDistance,100000 \
> --plugin TSSDistance \
> --plugin Condel,/media/sf_D_DRIVE/Projects/Databases/ensembl/Plugins/Condel/config,b \
> --plugin CADD,/media/sf_D_DRIVE/Projects/Databases/CADD/v1.0/1000G.tsv.gz \
> --plugin Gwava,tss,/media/sf_D_DRIVE/Projects/Databases/gwava/gwava_scores.bed.gz \
> --plugin Conservation,GERP_CONSERVATION_SCORE,mammals \
> --plugin dbNSFP,/media/sf/data/dbNSFP/dbNSFP2.4.gz,GERP++_NR,GERP++_RS,LRT_score,LRT_pred,MutationTaster_score,MutationTaster_pred,MutationAssessor_score,MutationAssessor_pred,FATHMM_score,FATHMM_pred,RadialSVM_score,RadialSVM_pred,LR_score,LR_pred,Reliability_index,SiPhy_29way_logOdds,Polyphen2_HVAR_score,Polyphen2_HVAR_pred,SIFT_score,SIFT_pred,CADD_raw,CADD_phred
>
>
> As shown in the output below, apart from CADD.pm and Gwava.pm, no scores
> are returned by the other plugins. dbNSFP.pm should return at least CADD
> scores because these exist. As recommended, I tried using:
>
> sub feature_types {
>     return ['Feature', 'Intergenic'];
> }
>
> or
>
> sub feature_types {
>    return ['Transcript', 'Intergenic'];
> }
>
> in dbNSFP.pm but it does not help. When I tried that in TSSDistance.pm I
> got this error:
>
> Plugin 'TSSDistance' went wrong: Can't locate object method "transcript"
> via package "Bio::EnsEMBL::Variation::IntergenicVariationAllele" at
> /media/sf_D_DRIVE/Projects/Databases/ensembl/Plugins//TSSDistance.pm line
> 56.
>
> UpDownDistance.pm does not seem to work: for instance, rs140931361 is
> 58298 bp from ENSG00000198822, but this gene is not returned.
>
>
> OUTPUT:
>
> ## ENSEMBL VARIANT EFFECT PREDICTOR v75
> ## Output produced at 2014-05-07 17:28:44
> ## Connected to homo_sapiens_core_75_37 on ensembldb.ensembl.org
> ## Using cache in /media/sf_D_DRIVE/Projects/Databases/ensembl//homo_sapiens/75
> ## Using API version 75, DB version 75
> ## sift version sift5.0.2
> ## polyphen version 2.2.2
> ## Extra column keys:
> ## BIOTYPE : Biotype of transcript
> ## CANONICAL : Indicates if transcript is canonical for this gene
> ## CELL_TYPE : List of cell types and classifications for regulatory feature
> ## CLIN_SIG : Clinical significance of variant from dbSNP
> ## DISTANCE : Shortest distance from variant to transcript
> ## DOMAINS : The source and identifer of any overlapping protein domains
> ## ENSP : Ensembl protein identifer
> ## EXON : Exon number(s) / total
> ## HIGH_INF_POS : A flag indicating if the variant falls in a high information position of the TFBP
> ## INTRON : Intron number(s) / total
> ## MOTIF_NAME : The source and identifier of a transcription factor binding profile (TFBP) aligned at this position
> ## MOTIF_POS : The relative position of the variation in the aligned TFBP
> ## MOTIF_SCORE_CHANGE : The difference in motif score of the reference and variant sequences for the TFBP
> ## PUBMED : Pubmed ID(s) of publications that cite existing variant
> ## PolyPhen : PolyPhen prediction and/or score
> ## SIFT : SIFT prediction and/or score
> ## SYMBOL : Gene symbol (e.g. HGNC)
> ## SYMBOL_SOURCE : Source of gene symbol
> ## TSSDistance : Distance from the transcription start site
> ## Condel : Consensus deleteriousness score for an amino acid substitution based on SIFT and PolyPhen-2
> ## CADD_RAW : Raw CADD score
> ## CADD_PHRED : PHRED-like scaled CADD score
> ## GWAVA : Genome Wide Annotation of VAriants score (tss model)
> ## Conservation : The conservation score for this site (method_link_type="GERP_CONSERVATION_SCORE", species_set="mammals")
> ## MutationTaster_score : MutationTaster_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## Polyphen2_HVAR_score : Polyphen2_HVAR_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## LRT_pred : LRT_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## MutationAssessor_score : MutationAssessor_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## FATHMM_pred : FATHMM_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## LR_score : LR_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## MutationTaster_pred : MutationTaster_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## SiPhy_29way_logOdds : SiPhy_29way_logOdds from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## CADD_phred : CADD_phred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## Polyphen2_HVAR_pred : Polyphen2_HVAR_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## RadialSVM_pred : RadialSVM_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## Reliability_index : Reliability_index from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## GERP++_NR : GERP++_NR from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## MutationAssessor_pred : MutationAssessor_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## LRT_score : LRT_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## CADD_raw : CADD_raw from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## LR_pred : LR_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## FATHMM_score : FATHMM_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## SIFT_score : SIFT_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## GERP++_RS : GERP++_RS from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## SIFT_pred : SIFT_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> ## RadialSVM_score : RadialSVM_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
> #Uploaded_variation Location Allele Existing_variation SYMBOL SYMBOL_SOURCE Gene ENSP Feature Feature_type BIOTYPE STRAND CANONICAL EXON INTRON DISTANCE TSSDistance Consequence cDNA_position CDS_position Protein_position Amino_acids Codons PolyPhen SIFT Condel CELL_TYPE SV PUBMED CLIN_SIG HIGH_INF_POS MOTIF_NAME MOTIF_POS MOTIF_SCORE_CHANGE TSSDistance CADD_RAW CADD_PHRED GWAVA Conservation GERP++_NR GERP++_RS LRT_score LRT_pred MutationTaster_score MutationTaster_pred MutationAssessor_score MutationAssessor_pred FATHMM_score FATHMM_pred RadialSVM_score RadialSVM_pred LR_score LR_pred Reliability_index SiPhy_29way_logOdds Polyphen2_HVAR_score Polyphen2_HVAR_pred SIFT_score SIFT_pred CADD_raw CADD_phred Extra
> rs13247133 7:86199080 A rs13247133 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - -0.25769 2.762 0.11 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=-0.257691;CADD_PHRED=2.762;GWAVA=0.11
> rs13244782 7:86202665 T rs13244782 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - 1.957591 12.5 0.15 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=1.957591;CADD_PHRED=12.50;GWAVA=0.15
> rs12704267 7:86206830 T rs12704267 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - 0.111018 4.597 0.16 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=0.111018;CADD_PHRED=4.597;GWAVA=0.16
> rs140931361 7:86214933-86214937 - rs140931361 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - -0.42024 2.04 - - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=-0.420243;CADD_PHRED=2.040
> rs34536358 7:86222651 G rs34536358 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - -0.31002 2.524 0.18 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=-0.310016;CADD_PHRED=2.524;GWAVA=0.18
> rs36006360 7:86224933 T rs36006360 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - 2.513017 14.36 0.36 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=2.513017;CADD_PHRED=14.36;GWAVA=0.36
> rs13244678 7:86232583 T rs13244678 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - -0.52024 1.626 0.05 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=-0.520238;CADD_PHRED=1.626;GWAVA=0.05
> rs12704279 7:86238294 T rs12704279 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - 0.454708 6.469 0.16 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=0.454708;CADD_PHRED=6.469;GWAVA=0.16
> rs13228078 7:86240691 C rs13228078 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - 0.980262 9.002 0.1 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=0.980262;CADD_PHRED=9.002;GWAVA=0.1
> rs140931361 7:86214933-86214937 - rs140931361 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - -0.42024 2.04 - - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=-0.420243;CADD_PHRED=2.040
>
> Thanks,
>
> G.
>
> On 7 May 2014 16:13, Will McLaren <wm2 at ebi.ac.uk> wrote:
>
>> Hello,
>>
>> Correct, the plugin was intended to work with the whole_genome_SNVs.tsv
>> file, which only contains data for SNVs.
>>
>> I've modified the plugin so that it should be able to cope with indel
>> data files such as you have; please do let me know if you have any problems
>> as I've only sparingly tested it on made-up data!
>>
>> Regards
>>
>> Will McLaren
>> Ensembl Variation
>>
>>
>> On 7 May 2014 15:37, Genomeo Dev <genomeodev at gmail.com> wrote:
>>
>>> Hi,
>>>
> >>> There seems to be a discrepancy between the CADD score calculated using
> >>> VEP with the CADD.pm plugin and the direct tabix output:
>>>
>>> For example using this 1000G variant:
>>>
>>> #CHROM POS ID REF ALT QUAL FILTER INFO
>>> 7 86214932 rs140931361 TTACTC T . PASS .
>>>
>>> variant_effect_predictor.pl -i input.txt --format vcf --plugin
>>> CADD,/media/sf_D_DRIVE/Projects/Databases/CADD/v1.0/1000G.tsv.gz
>>> does not return any CADD score
>>>
>>> whereas
>>> $ tabix -p vcf 1000G.tsv.gz 7:86214932-86214932
>>> 7 86214932 TTACTC T -0.420243 2.040
>>>
> >>> This seems to affect indels and not SNVs. I could see in the plugin that
> >>> there is a rule to ignore indels. Do you have any suggestions on how to
> >>> change that safely?
>>>
>>> Also, in the plugin, I assume there is a test to ensure the alleles are
>>> identical between the input file and the 1000G.tsv.gz file. Is this correct?
>>>
>>> Thanks.
>>>
>>> --
>>> G.
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>
>
>
> --
> G.