[ensembl-dev] UpDownDistance using

Genomeo Dev genomeodev at gmail.com
Mon Jun 2 16:56:55 BST 2014


OK thanks will try that.

My remaining question from the above is whether the original attributes of
the input variant (including original coordinate) are passed on to the
plug-in?  In the run() subroutine, ($self, $tva) = @_ does not seem to
contain that.

Thanks,

G.







On 2 June 2014 14:18, Will McLaren <wm2 at ebi.ac.uk> wrote:

> I think if I were presented with this issue I would be going about it in a
> different way.
>
> If your goal is to find all genes within a set of genomic intervals (as
> defined by the variants and the required up/downstream distance), you'd be
> better off downloading a list of the genes and their coordinates and
> writing a short script, or bending some piece of software, to find the
> overlaps between your search areas and the genes.
>
> The quickest way I can think to do this off the top of my head is to use
> something like a tabix-indexed BED or GTF file of genes, then do a tabix
> lookup on this file for each of your intervals.
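
For reference, a rough sketch of the overlap idea in plain Perl, with no tabix or Ensembl dependency. The gene coordinates, variant positions and per-variant distances are made-up illustration values, and find_genes_in_window is a hypothetical helper name. For a genome-wide gene list, a tabix lookup would presumably replace the linear scan used here.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the names of all genes overlapping the window
# [pos - up, pos + down] on the given chromosome.
# $genes is an array ref of hashes with keys name, chr, start, end
# (1-based, inclusive). Strand is ignored: "up" simply means lower
# coordinates and "down" higher coordinates.
sub find_genes_in_window {
    my ($genes, $chr, $pos, $up, $down) = @_;

    my $win_start = $pos - $up;
    $win_start = 1 if $win_start < 1;    # clamp at chromosome start
    my $win_end = $pos + $down;

    return map  { $_->{name} }
           grep {    $_->{chr} eq $chr
                  && $_->{start} <= $win_end
                  && $_->{end}   >= $win_start }
           @$genes;
}

# Made-up gene list (illustration only)
my @genes = (
    { name => 'geneA', chr => '2', start => 1_000,  end => 2_000 },
    { name => 'geneB', chr => '2', start => 50_000, end => 60_000 },
    { name => 'geneC', chr => '7', start => 1_500,  end => 2_500 },
);

# Made-up variants, each with its own upstream/downstream distance
my @variants = (
    { id => 'var1', chr => '2', pos => 5_000,  up => 5_000, down => 10_000 },
    { id => 'var2', chr => '2', pos => 45_000, up => 1_000, down => 10_000 },
);

for my $v (@variants) {
    my @hits = find_genes_in_window(\@genes, @{$v}{qw(chr pos up down)});
    print join("\t", $v->{id}, @hits ? join(',', @hits) : '-'), "\n";
}
```

Because each variant carries its own window, this sidesteps the single global cutoff that UpDownDistance sets.
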
>
> Trying to bend the VEP to do this is in my opinion a bit overkill, as all
> the VEP will be calling for each of those genes is e.g.
> upstream_gene_variant.
>
> Will
>
>
> On 2 June 2014 13:44, Genomeo Dev <genomeodev at gmail.com> wrote:
>
>> Thanks Will. Yes, that is what I want to achieve. So much variant data is
>> coming out of GWAS studies that, because of LD structure, the approach of
>> considering specific loci for variant analysis is becoming increasingly
>> common.
>>
>> I have tried writing an individual VEP input file for each variant, but
>> since there are on the order of 100K of them this is very slow to run. I
>> don't know how easy it would be to adjust the VEP API to take a fixed range
>> of coordinates as well as the relative upstream and downstream coordinates.
>> That way, I could set the fixed limits in a new() method and VEP would
>> process all the variants at once.
>>
>>
>> Thanks,
>>
>> G.
>>
>>
>> On 2 June 2014 13:24, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>
>>> Hello again,
>>>
>>> I think I'm getting a little lost here with exactly what you're trying
>>> to do with the plugin.
>>>
>>> If I understand correctly, you are trying to adjust the upstream and
>>> downstream search distance on a per-variant basis? e.g. use 5kb for
>>> variant1, 10kb for variant2?
>>>
>>> This is not possible using a plugin currently as the plugin code in the
>>> run() method is executed after the transcripts are found and the
>>> consequences are called. It is possible to set it globally, as in the
>>> UpDownDistance plugin, as this is done in the new() method which is
>>> executed once when the script starts up.
>>>
>>> I think the only way to achieve this would be to group your variants
>>> into different input files by the distance cutoff required, and run them as
>>> separate VEP commands with different distances passed to the UpDownDistance
>>> plugin.
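
A minimal sketch of that grouping step, assuming each variant already has its required upstream and downstream distance attached. The variant records and the output file names are made up, group_by_cutoff and vep_command are hypothetical helper names, and any other VEP options a real run needs are left out of the command string.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Group variant IDs by their required "up,down" distance pair.
# Returns a hash ref: { "up,down" => [ variant ids ... ] }.
sub group_by_cutoff {
    my (@variants) = @_;
    my %groups;
    for my $v (@variants) {
        my $key = join(',', $v->{up}, $v->{down});
        push @{ $groups{$key} }, $v->{id};
    }
    return \%groups;
}

# Build the VEP command line for one group. Only the input file and
# the UpDownDistance plugin arguments are filled in here.
sub vep_command {
    my ($key, $input_file) = @_;
    return "variant_effect_predictor.pl -i $input_file --plugin UpDownDistance,$key";
}

# Made-up variants (illustration only)
my @variants = (
    { id => 'var1', up => 5000,  down => 100000 },
    { id => 'var2', up => 5000,  down => 100000 },
    { id => 'var3', up => 20000, down => 20000 },
);

my $groups = group_by_cutoff(@variants);
for my $key (sort keys %$groups) {
    (my $tag = $key) =~ tr/,/_/;          # e.g. "5000,100000" -> "5000_100000"
    my $file = "variants_$tag.txt";       # hypothetical file name
    # A real script would write this group's variants to $file here.
    print vep_command($key, $file), "\n";
}
```

The number of VEP runs then equals the number of distinct distance pairs rather than the number of variants.
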
>>>
>>> HTH
>>>
>>> Will
>>>
>>>
>>>
>>>
>>> On 2 June 2014 09:34, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding my question about specifying UPSTREAM_DISTANCE and
>>>> DOWNSTREAM_DISTANCE limits for each variant in an input list: if that is
>>>> not possible because of how VEP interacts with plug-ins, would it be
>>>> possible to introduce UPSTREAM_COORDINATE and DOWNSTREAM_COORDINATE
>>>> variables, declarable within a plugin, which would then restrict where
>>>> VEP looks for consequences?
>>>>
>>>> Regards,
>>>>
>>>> G.
>>>>
>>>>
>>>>
>>>> On 30 May 2014 17:49, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>
>>>>> I did eventually figure out the answer to the first question: my
>>>>> ($self, $tva) = @_; $self->params()
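
Spelled out, that might look roughly like the sketch below. It is written to run without Ensembl installed, so MockBasePlugin is a hypothetical stand-in for Bio::EnsEMBL::Variation::Utils::BaseVepPlugin that provides only new() and params(), and the package name UpDownDistanceReport is made up. In the real plugin the "use base" line from UpDownDistance.pm would stay instead of the mock.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for BaseVepPlugin: it keeps the plugin
# arguments in an array ref and exposes them via params().
package MockBasePlugin;

sub new {
    my ($class, $config, @params) = @_;
    return bless { config => $config, params => \@params }, $class;
}

sub params { return $_[0]->{params} }

package UpDownDistanceReport;
our @ISA = ('MockBasePlugin');

sub new {
    my $class = shift;
    my $self  = $class->SUPER::new(@_);

    # Same defaults as in UpDownDistance.pm: 5000 upstream,
    # downstream falls back to the upstream value.
    my $up   = $self->params->[0] || 5000;
    my $down = $self->params->[1] || $up;

    # Remember the resolved values so run() can report them.
    @{$self}{qw(up down)} = ($up, $down);
    return $self;
}

sub run {
    my ($self, $tva) = @_;    # $tva is unused in this sketch
    return {
        UPDIST_CUTOFF   => $self->{up},
        DOWNDIST_CUTOFF => $self->{down},
    };
}

package main;

my $plugin = UpDownDistanceReport->new({}, 5000, 100000);
my $out    = $plugin->run(undef);
print "UPDIST_CUTOFF=$out->{UPDIST_CUTOFF} DOWNDIST_CUTOFF=$out->{DOWNDIST_CUTOFF}\n";
```

Storing the resolved values in new() is just one option; calling $self->params directly inside run() and applying the same defaults there should work equally well.
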
>>>>>
>>>>> For my second question, more specifically, what I want is to use the
>>>>> original input coordinate of each individual input variant to set the
>>>>> UPSTREAM_DISTANCE and DOWNSTREAM_DISTANCE limits per variant in
>>>>> UpDownDistance.pm. The reason is that I have a large group of variants
>>>>> for which I want to consider consequences within the same physical
>>>>> range, which I can already pass on to the plugin as arguments. Running
>>>>> VEP per variant is not efficient, hence the question.
>>>>>
>>>>> Regards,
>>>>>
>>>>> G.
>>>>>
>>>>>
>>>>> On 30 May 2014 15:48, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>
>>>>>> A related question is how to get the input variant's attributes
>>>>>> (e.g. position, reference ID) so that I can process them within the
>>>>>> subroutine.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> G.
>>>>>>
>>>>>>
>>>>>> On 30 May 2014 13:01, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Will. It is working fine now.
>>>>>>>
>>>>>>> I wanted to modify UpDownDistance.pm to produce two separate
>>>>>>> columns in the VEP output showing the UPDIST_CUTOFF and DOWNDIST_CUTOFF
>>>>>>> parameters (see below). How do I fetch the plugin arguments into the
>>>>>>> run subroutine, please?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> G.
>>>>>>>
>>>>>>>
>>>>>>> use strict;
>>>>>>> use warnings;
>>>>>>> use base qw(Bio::EnsEMBL::Variation::Utils::BaseVepPlugin);
>>>>>>>
>>>>>>> sub feature_types {
>>>>>>>     return ['Feature', 'Intergenic'];
>>>>>>> }
>>>>>>>
>>>>>>> sub get_header_info {
>>>>>>>     return {
>>>>>>>         UPDIST_CUTOFF   => "distance cutoff upstream variant where consequences are calculated",
>>>>>>>         DOWNDIST_CUTOFF => "distance cutoff downstream variant where consequences are calculated"
>>>>>>>     };
>>>>>>> }
>>>>>>>
>>>>>>> sub new {
>>>>>>>
>>>>>>>   my $class = shift;
>>>>>>>   my $self = $class->SUPER::new(@_);
>>>>>>>
>>>>>>>   # change up/down
>>>>>>>   my $up = $self->params->[0] || 5000;
>>>>>>>
>>>>>>>   my $down = $self->params->[1] || $up;
>>>>>>>
>>>>>>>   $Bio::EnsEMBL::Variation::Utils::VariationEffect::UPSTREAM_DISTANCE = $up;
>>>>>>>
>>>>>>>   $Bio::EnsEMBL::Variation::Utils::VariationEffect::DOWNSTREAM_DISTANCE = $down;
>>>>>>>
>>>>>>>   return $self;
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> sub run {
>>>>>>>     my $upstream_distance = ?
>>>>>>>     my $downstream_distance = ?
>>>>>>>     return {
>>>>>>>         UPDIST_CUTOFF   => $upstream_distance,
>>>>>>>         DOWNDIST_CUTOFF => $downstream_distance
>>>>>>>     };
>>>>>>> }
>>>>>>>
>>>>>>> 1;
>>>>>>>
>>>>>>> On 29 May 2014 09:57, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've patched a fix in for the UpDownDistance issue, the fix is in
>>>>>>>> the main ensembl-variation API.
>>>>>>>>
>>>>>>>> Regarding the DISTANCE field, perhaps you could write a plugin that
>>>>>>>> does exactly what you want? Changing the behaviour of this field may not be
>>>>>>>> compatible with other people's pipelines, and the plugin system is the
>>>>>>>> perfect way for you to have annotations customised to your requirements.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Will
>>>>>>>>
>>>>>>>>
>>>>>>>> On 28 May 2014 18:58, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> When using different up and down arguments in UpDownDistance.pm,
>>>>>>>>> VEP returns genes outside the specified range, as shown in the example
>>>>>>>>> below (MIR1302-4 is 94161 bp upstream of rs17808606 but is still
>>>>>>>>> reported using UpDownDistance,5000,100000). For the genes outside the
>>>>>>>>> range, the DISTANCE and Consequence columns are empty while, for
>>>>>>>>> example, TSSDistance is not, which might indicate that the up and down
>>>>>>>>> arguments are not being processed correctly.
>>>>>>>>>
>>>>>>>>> It would be helpful to only return genes whose coordinates satisfy
>>>>>>>>> the specified range. Also, it would immensely help as well if DISTANCE is
>>>>>>>>> set to 0 for variants falling within genes and is otherwise calculated even
>>>>>>>>> for non-transcript feature types.
>>>>>>>>>
>>>>>>>>> Note that I am using Ensembl 75 with the recently updated
>>>>>>>>> ensembl-variation module, which allowed UpDownDistance.pm to work for
>>>>>>>>> distances beyond 5kb.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> G.
>>>>>>>>>
>>>>>>>>> ##UpDownDistance,5000,100000
>>>>>>>>> ##TSSDistance
>>>>>>>>> #Uploaded_variation Location Allele Existing_variation SYMBOL SYMBOL_SOURCE Gene ENSP Feature Feature_type BIOTYPE STRAND CANONICAL EXON INTRON DISTANCE TSSDistance Consequence
>>>>>>>>> rs17808606 2:208228309 T rs17808606 AC007879.5 Clone_based_vega_gene ENSG00000223725 - ENST00000412387 Transcript antisense -1 - - 3/4 - - intron_variant,nc_transcript_variant
>>>>>>>>> rs17808606 2:208228309 T rs17808606 MIR1302-4 HGNC ENSG00000221628 - ENST00000408701 Transcript miRNA -1 YES - - - 94161
>>>>>>>>> rs17808606 2:208228309 T rs17808606 AC007879.6 Clone_based_vega_gene ENSG00000225064 - ENST00000438824 Transcript lincRNA 1 YES - - 92895 - downstream_gene_variant
>>>>>>>>> rs17808606 2:208228309 T rs17808606 AC007879.5 Clone_based_vega_gene ENSG00000223725 - ENST00000418850 Transcript antisense -1 YES - 4/5 - - intron_variant,nc_transcript_variant
>>>>>>>>> ##UpDownDistance,100000
>>>>>>>>> ##TSSDistance
>>>>>>>>> #Uploaded_variation Location Allele Existing_variation SYMBOL SYMBOL_SOURCE Gene ENSP Feature Feature_type BIOTYPE STRAND CANONICAL EXON INTRON DISTANCE TSSDistance Consequence
>>>>>>>>> rs17808606 2:208228309 T rs17808606 AC007879.5 Clone_based_vega_gene ENSG00000223725 - ENST00000412387 Transcript antisense -1 - - 3/4 - - intron_variant,nc_transcript_variant
>>>>>>>>> rs17808606 2:208228309 T rs17808606 MIR1302-4 HGNC ENSG00000221628 - ENST00000408701 Transcript miRNA -1 YES - - 94161 94161 upstream_gene_variant
>>>>>>>>> rs17808606 2:208228309 T rs17808606 AC007879.6 Clone_based_vega_gene ENSG00000225064 - ENST00000438824 Transcript lincRNA 1 YES - - 92895 - downstream_gene_variant
>>>>>>>>> rs17808606 2:208228309 T rs17808606 AC007879.5 Clone_based_vega_gene ENSG00000223725 - ENST00000418850 Transcript antisense -1 YES - 4/5 - - intron_variant,nc_transcript_variant
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 27 May 2014 11:03, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry seems the plug-in already does that thanks!
>>>>>>>>>>
>>>>>>>>>> G.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 23 May 2014 19:14, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Will,
>>>>>>>>>>>
>>>>>>>>>>> Thanks very much. That worked nicely.
>>>>>>>>>>>
>>>>>>>>>>> I am working with a set of variants within a locus where I know
>>>>>>>>>>> that they are LD-independent with other genes from outside this locus.
>>>>>>>>>>> Therefore, I want only to focus on genes inside this physically defined
>>>>>>>>>>> locus.
>>>>>>>>>>>
>>>>>>>>>>> Rarely do these variants fall exactly at the centre of the locus
>>>>>>>>>>> so distances to the right and left boundaries are not equal. Would it be
>>>>>>>>>>> possible to alter UpDownDistance.pm to be able to specify a
>>>>>>>>>>> start and end coordinate within which VEP should be constrained instead of
>>>>>>>>>>> the current distance cutoff?
>>>>>>>>>>>
>>>>>>>>>>> Many thanks,
>>>>>>>>>>>
>>>>>>>>>>> G.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8 May 2014 16:12, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello again,
>>>>>>>>>>>>
>>>>>>>>>>>> I've fixed a bug that prevented UpDownDistance functioning
>>>>>>>>>>>> correctly - it hadn't been tested with larger distances such as you
>>>>>>>>>>>> specified which broke some assumptions in the core VEP code.
>>>>>>>>>>>> You will need to update your ensembl-variation module or re-run
>>>>>>>>>>>> the VEP INSTALL.pl script to pick up the new API code.
>>>>>>>>>>>>
>>>>>>>>>>>> As far as the other plugins go, I think you are
>>>>>>>>>>>> misunderstanding how some of them work:
>>>>>>>>>>>>
>>>>>>>>>>>> TSSDistance - this gives the distance between a variant and the
>>>>>>>>>>>> annotated transcript start site. If a variant is annotated as intergenic,
>>>>>>>>>>>> there is no transcript to give the distance to! Changing the code to force
>>>>>>>>>>>> it to assess intergenic variants will of course break here. Of course if
>>>>>>>>>>>> you alter the up/down-stream distance using UpDownDistance such that this
>>>>>>>>>>>> then finds a transcript in range, the plugin will then work as expected
>>>>>>>>>>>> without modification. It seems to me that you are expecting that this
>>>>>>>>>>>> plugin will find the shortest distance to _any_ transcript start site,
>>>>>>>>>>>> which is not the intended purpose of the code.
>>>>>>>>>>>>
>>>>>>>>>>>> Condel & dbNSFP - these two plugins work exclusively on
>>>>>>>>>>>> missense AKA non-synonymous SNVs (hence the NS in the name dbNSFP). While
>>>>>>>>>>>> dbNSFP carries scores for CADD, and CADD gives scores for any genomic
>>>>>>>>>>>> position, the CADD scores in dbNSFP are only for missense variants.
>>>>>>>>>>>>
>>>>>>>>>>>> The feature_types() subroutine should be used when writing your
>>>>>>>>>>>> own plugin to determine which kind of variant/feature combinations are
>>>>>>>>>>>> considered by the plugin, since the run() sub is executed once for each
>>>>>>>>>>>> variant/feature overlap found by the core VEP code. Modifying existing
>>>>>>>>>>>> plugins like this should be done only if you are confident that the
>>>>>>>>>>>> modification achieves what you intend.
>>>>>>>>>>>>
>>>>>>>>>>>> Hope that all helps
>>>>>>>>>>>>
>>>>>>>>>>>> Will
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 7 May 2014 17:59, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Will.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am working with non-coding and intergenic variants and
>>>>>>>>>>>>> wanted to run VEP with the following plugins:
>>>>>>>>>>>>>
>>>>>>>>>>>>> --plugin UpDownDistance,100000 \
>>>>>>>>>>>>> --plugin TSSDistance \
>>>>>>>>>>>>> --plugin Condel,/media/sf_D_DRIVE/Projects/Databases/ensembl/Plugins/Condel/config,b \
>>>>>>>>>>>>> --plugin CADD,/media/sf_D_DRIVE/Projects/Databases/CADD/v1.0/1000G.tsv.gz \
>>>>>>>>>>>>> --plugin Gwava,tss,/media/sf_D_DRIVE/Projects/Databases/gwava/gwava_scores.bed.gz \
>>>>>>>>>>>>> --plugin Conservation,GERP_CONSERVATION_SCORE,mammals \
>>>>>>>>>>>>> --plugin dbNSFP,/media/sf/data/dbNSFP/dbNSFP2.4.gz,GERP++_NR,GERP++_RS,LRT_score,LRT_pred,MutationTaster_score,MutationTaster_pred,MutationAssessor_score,MutationAssessor_pred,FATHMM_score,FATHMM_pred,RadialSVM_score,RadialSVM_pred,LR_score,LR_pred,Reliability_index,SiPhy_29way_logOdds,Polyphen2_HVAR_score,Polyphen2_HVAR_pred,SIFT_score,SIFT_pred,CADD_raw,CADD_phred
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> As shown in the output below, apart from CADD.pm and Gwava.pm,
>>>>>>>>>>>>> no scores are returned for the others. dbNSFP.pm should get at least CADD
>>>>>>>>>>>>> scores because these exist. As recommended, I tried using:
>>>>>>>>>>>>>
>>>>>>>>>>>>> sub feature_types {
>>>>>>>>>>>>>     return ['Feature', 'Intergenic'];
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> or
>>>>>>>>>>>>>
>>>>>>>>>>>>> sub feature_types {
>>>>>>>>>>>>>    return ['Transcript', 'Intergenic'];
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> in dbNSFP.pm, but that does not help. When I tried that in
>>>>>>>>>>>>> TSSDistance.pm I got this error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Plugin 'TSSDistance' went wrong: Can't locate object method
>>>>>>>>>>>>> "transcript" via package
>>>>>>>>>>>>> "Bio::EnsEMBL::Variation::IntergenicVariationAllele" at
>>>>>>>>>>>>> /media/sf_D_DRIVE/Projects/Databases/ensembl/Plugins//TSSDistance.pm line
>>>>>>>>>>>>> 56.
>>>>>>>>>>>>>
>>>>>>>>>>>>> UpDownDistance.pm does not seem to work: for instance,
>>>>>>>>>>>>> rs140931361 is 58298 bp from ENSG00000198822 but
>>>>>>>>>>>>> this gene is not returned.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> OUTPUT:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ## ENSEMBL VARIANT EFFECT PREDICTOR v75
>>>>>>>>>>>>> ## Output produced at 2014-05-07 17:28:44
>>>>>>>>>>>>> ## Connected to homo_sapiens_core_75_37 on ensembldb.ensembl.org
>>>>>>>>>>>>> ## Using cache in /media/sf_D_DRIVE/Projects/Databases/ensembl//homo_sapiens/75
>>>>>>>>>>>>> ## Using API version 75, DB version 75
>>>>>>>>>>>>> ## sift version sift5.0.2
>>>>>>>>>>>>> ## polyphen version 2.2.2
>>>>>>>>>>>>> ## Extra column keys:
>>>>>>>>>>>>> ## BIOTYPE : Biotype of transcript
>>>>>>>>>>>>> ## CANONICAL : Indicates if transcript is canonical for this gene
>>>>>>>>>>>>> ## CELL_TYPE : List of cell types and classifications for regulatory feature
>>>>>>>>>>>>> ## CLIN_SIG : Clinical significance of variant from dbSNP
>>>>>>>>>>>>> ## DISTANCE : Shortest distance from variant to transcript
>>>>>>>>>>>>> ## DOMAINS : The source and identifer of any overlapping protein domains
>>>>>>>>>>>>> ## ENSP : Ensembl protein identifer
>>>>>>>>>>>>> ## EXON : Exon number(s) / total
>>>>>>>>>>>>> ## HIGH_INF_POS : A flag indicating if the variant falls in a high information position of the TFBP
>>>>>>>>>>>>> ## INTRON : Intron number(s) / total
>>>>>>>>>>>>> ## MOTIF_NAME : The source and identifier of a transcription factor binding profile (TFBP) aligned at this position
>>>>>>>>>>>>> ## MOTIF_POS : The relative position of the variation in the aligned TFBP
>>>>>>>>>>>>> ## MOTIF_SCORE_CHANGE : The difference in motif score of the reference and variant sequences for the TFBP
>>>>>>>>>>>>> ## PUBMED : Pubmed ID(s) of publications that cite existing variant
>>>>>>>>>>>>> ## PolyPhen : PolyPhen prediction and/or score
>>>>>>>>>>>>> ## SIFT : SIFT prediction and/or score
>>>>>>>>>>>>> ## SYMBOL : Gene symbol (e.g. HGNC)
>>>>>>>>>>>>> ## SYMBOL_SOURCE : Source of gene symbol
>>>>>>>>>>>>> ## TSSDistance : Distance from the transcription start site
>>>>>>>>>>>>> ## Condel : Consensus deleteriousness score for an amino acid substitution based on SIFT and PolyPhen-2
>>>>>>>>>>>>> ## CADD_RAW : Raw CADD score
>>>>>>>>>>>>> ## CADD_PHRED : PHRED-like scaled CADD score
>>>>>>>>>>>>> ## GWAVA : Genome Wide Annotation of VAriants score (tss model)
>>>>>>>>>>>>> ## Conservation : The conservation score for this site (method_link_type="GERP_CONSERVATION_SCORE", species_set="mammals")
>>>>>>>>>>>>> ## MutationTaster_score : MutationTaster_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## Polyphen2_HVAR_score : Polyphen2_HVAR_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## LRT_pred : LRT_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## MutationAssessor_score : MutationAssessor_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## FATHMM_pred : FATHMM_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## LR_score : LR_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## MutationTaster_pred : MutationTaster_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## SiPhy_29way_logOdds : SiPhy_29way_logOdds from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## CADD_phred : CADD_phred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## Polyphen2_HVAR_pred : Polyphen2_HVAR_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## RadialSVM_pred : RadialSVM_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## Reliability_index : Reliability_index from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## GERP++_NR : GERP++_NR from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## MutationAssessor_pred : MutationAssessor_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## LRT_score : LRT_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## CADD_raw : CADD_raw from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## LR_pred : LR_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## FATHMM_score : FATHMM_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## SIFT_score : SIFT_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## GERP++_RS : GERP++_RS from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## SIFT_pred : SIFT_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> ## RadialSVM_score : RadialSVM_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>> #Uploaded_variation Location Allele Existing_variation SYMBOL SYMBOL_SOURCE Gene ENSP Feature Feature_type BIOTYPE STRAND CANONICAL EXON INTRON DISTANCE TSSDistance Consequence cDNA_position CDS_position Protein_position Amino_acids Codons PolyPhen SIFT Condel CELL_TYPE SV PUBMED CLIN_SIG HIGH_INF_POS MOTIF_NAME MOTIF_POS MOTIF_SCORE_CHANGE TSSDistance CADD_RAW CADD_PHRED GWAVA Conservation GERP++_NR GERP++_RS LRT_score LRT_pred MutationTaster_score MutationTaster_pred MutationAssessor_score MutationAssessor_pred FATHMM_score FATHMM_pred RadialSVM_score RadialSVM_pred LR_score LR_pred Reliability_index SiPhy_29way_logOdds Polyphen2_HVAR_score Polyphen2_HVAR_pred SIFT_score SIFT_pred CADD_raw CADD_phred Extra
>>>>>>>>>>>>> rs13247133 7:86199080 A rs13247133 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - -0.25769 2.762 0.11 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=-0.257691;CADD_PHRED=2.762;GWAVA=0.11
>>>>>>>>>>>>> rs13244782 7:86202665 T rs13244782 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - 1.957591 12.5 0.15 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=1.957591;CADD_PHRED=12.50;GWAVA=0.15
>>>>>>>>>>>>> rs12704267 7:86206830 T rs12704267 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - 0.111018 4.597 0.16 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=0.111018;CADD_PHRED=4.597;GWAVA=0.16
>>>>>>>>>>>>> rs140931361 7:86214933-86214937 - rs140931361 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - -0.42024 2.04 - - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=-0.420243;CADD_PHRED=2.040
>>>>>>>>>>>>> rs34536358 7:86222651 G rs34536358 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - -0.31002 2.524 0.18 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=-0.310016;CADD_PHRED=2.524;GWAVA=0.18
>>>>>>>>>>>>> rs36006360 7:86224933 T rs36006360 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - 2.513017 14.36 0.36 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=2.513017;CADD_PHRED=14.36;GWAVA=0.36
>>>>>>>>>>>>> rs13244678 7:86232583 T rs13244678 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - -0.52024 1.626 0.05 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=-0.520238;CADD_PHRED=1.626;GWAVA=0.05
>>>>>>>>>>>>> rs12704279 7:86238294 T rs12704279 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - 0.454708 6.469 0.16 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=0.454708;CADD_PHRED=6.469;GWAVA=0.16
>>>>>>>>>>>>> rs13228078 7:86240691 C rs13228078 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - 0.980262 9.002 0.1 - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=0.980262;CADD_PHRED=9.002;GWAVA=0.1
>>>>>>>>>>>>> rs140931361 7:86214933-86214937 - rs140931361 - - - - - - - - - - - - - intergenic_variant - - - - - - - - - - - - - - - - - -0.42024 2.04 - - - - - - - - - - - - - - - - - - - - - - - - CADD_RAW=-0.420243;CADD_PHRED=2.040
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> G.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7 May 2014 16:13, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Correct, the plugin was intended to work with
>>>>>>>>>>>>>> the whole_genome_SNVs.tsv file, which only contains data for SNVs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've modified the plugin so that it should be able to cope
>>>>>>>>>>>>>> with indel data files such as you have; please do let me know if you have
>>>>>>>>>>>>>> any problems as I've only sparingly tested it on made-up data!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Will McLaren
>>>>>>>>>>>>>> Ensembl Variation
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7 May 2014 15:37, Genomeo Dev <genomeodev at gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There seems to be a discrepancy between the CADD score
>>>>>>>>>>>>>>> returned by VEP with the CADD.pm plugin and the direct tabix output:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For example using this 1000G variant:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #CHROM POS ID REF ALT QUAL FILTER INFO
>>>>>>>>>>>>>>> 7 86214932 rs140931361 TTACTC T . PASS .
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> variant_effect_predictor.pl -i input.txt --format vcf
>>>>>>>>>>>>>>> --plugin CADD,/media/sf_D_DRIVE/Projects/Databases/CADD/v1.0/1000G.tsv.gz
>>>>>>>>>>>>>>> does not return any CADD score
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> whereas
>>>>>>>>>>>>>>> $ tabix -p vcf 1000G.tsv.gz 7:86214932-86214932
>>>>>>>>>>>>>>> 7 86214932 TTACTC T -0.420243 2.040
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This seems to affect indels and not SNVs. I can see in the
>>>>>>>>>>>>>>> plugin that there is a rule to ignore indels. Any suggestions on how to
>>>>>>>>>>>>>>> safely change that, please?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, in the plugin, I assume there is a test to ensure the
>>>>>>>>>>>>>>> alleles are identical between the input file and the 1000G.tsv.gz file. Is
>>>>>>>>>>>>>>> this correct?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> G.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> Dev mailing list    Dev at ensembl.org
>>>>>>>>>>>>>>> Posting guidelines and subscribe/unsubscribe info:
>>>>>>>>>>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>>>>>>>>>>> Ensembl Blog: http://www.ensembl.info/


-- 
G.