[ensembl-dev] UpDownDistance using

Will McLaren wm2 at ebi.ac.uk
Tue Jun 3 11:03:08 BST 2014


The $tva object gives you access to, amongst many other things, the
VariationFeature object, which carries the input variant's coordinates.

my $vf = $tva->variation_feature;

printf("Coordinates: %s:%i-%i\n", $vf->seq_region_name,
$vf->seq_region_start, $vf->seq_region_end);
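
As an illustration only (not taken from any existing plugin), here is a
minimal sketch of how a run() method might report those coordinates.
MockVF and MockTVA are hypothetical stand-ins that mimic just the
accessors shown above, so the snippet runs without the Ensembl API, and
INPUT_COORDS is a made-up output key. In a real plugin, run() would live
in a package based on BaseVepPlugin and VEP would supply $tva.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for a VariationFeature: only the three accessors
# used above are mimicked, so the sketch runs without the Ensembl API.
package MockVF;
sub new { my ($class, %args) = @_; return bless {%args}, $class }
sub seq_region_name  { $_[0]{seq_region_name} }
sub seq_region_start { $_[0]{seq_region_start} }
sub seq_region_end   { $_[0]{seq_region_end} }

# Hypothetical stand-in for the $tva object that VEP passes to run().
package MockTVA;
sub new { my ($class, $vf) = @_; return bless { vf => $vf }, $class }
sub variation_feature { $_[0]{vf} }

package main;

# In a real plugin this would be the plugin's run() method.
# INPUT_COORDS is an assumed key name, not an existing VEP field.
sub run {
    my ($self, $tva) = @_;
    my $vf = $tva->variation_feature;
    return {
        INPUT_COORDS => sprintf("%s:%i-%i",
            $vf->seq_region_name,
            $vf->seq_region_start,
            $vf->seq_region_end),
    };
}

# Coordinates of rs140931361 as reported earlier in this thread.
my $tva = MockTVA->new(MockVF->new(
    seq_region_name  => '7',
    seq_region_start => 86214933,
    seq_region_end   => 86214937,
));
print run(undef, $tva)->{INPUT_COORDS}, "\n";   # prints 7:86214933-86214937
```

Any key returned from run() would also need a matching entry in
get_header_info() so that it is described in the output header.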

Have a look at some existing plugins to see how these and other objects can
be used, e.g.

https://github.com/ensembl-variation/VEP_plugins/blob/master/dbNSFP.pm

The documentation for the object types is found here:

http://www.ensembl.org/info/docs/Doxygen/variation-api/

Will


On 2 June 2014 16:56, Genomeo Dev <genomeodev at gmail.com> wrote:

> OK thanks will try that.
>
> My remaining question from the above is whether the original attributes of
> the input variant (including original coordinate) are passed on to the
> plug-in?  In the run() subroutine, ($self, $tva) = @_ does not seem to
> contain that.
>
> Thanks,
>
> G.
>
>
> On 2 June 2014 14:18, Will McLaren <wm2 at ebi.ac.uk> wrote:
>
>> I think if I were presented with this issue I would be going about it in
>> a different way.
>>
>> If your goal is to find all genes within a set of genomic intervals (as
>> defined by the variants and the required up/downstream distance), you'd be
>> better off downloading a list of the genes and their coordinates and
>> writing a short script, or bending some piece of software, to find the
>> overlaps between your search areas and the genes.
>>
>> The quickest way I can think to do this off the top of my head is to use
>> something like a tabix-indexed BED or GTF file of genes, then do a tabix
>> lookup on this file for each of your intervals.
>>
>> Trying to bend the VEP to do this is in my opinion a bit overkill, as all
>> the VEP will be calling for each of those genes is e.g.
>> upstream_gene_variant.
>>
>> Will
>>
>>
>> On 2 June 2014 13:44, Genomeo Dev <genomeodev at gmail.com> wrote:
>>
>>> Thanks Will. Yes that is what I want to achieve. Given so much variant
>>> data is coming out of  GWAS studies, because of LD structure the approach
>>> of considering specific loci for variant analysis is becoming increasingly
>>> common.
>>>
>>> I have tried to write an individual input file for VEP for each variant
>>> but since there are on the order of 100K of them this is very slow to run.
>>> I don't know how easy it would be to adjust the VEP API to take in a fixed
>>> range of coordinates as well as the relative upstream and downstream coordinates.
>>> This way, I could set the fixed limits up in a new() method and VEP
>>> would process all the variants at once.
>>>
>>>
>>> Thanks,
>>>
>>> G.
>>>
>>>
>>> On 2 June 2014 13:24, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>>
>>>> Hello again,
>>>>
>>>> I think I'm getting a little lost here with exactly what you're trying
>>>> to do with the plugin.
>>>>
>>>> If I understand correctly, you are trying to adjust the upstream and
>>>> downstream search distance on a per-variant basis? e.g. use 5kb for
>>>> variant1, 10kb for variant2?
>>>>
>>>> This is not possible using a plugin currently as the plugin code in the
>>>> run() method is executed after the transcripts are found and the
>>>> consequences are called. It is possible to set it globally, as in the
>>>> UpDownDistance plugin, as this is done in the new() method which is
>>>> executed once when the script starts up.
>>>>
>>>> I think the only way to achieve this would be to group your variants
>>>> into different input files by the distance cutoff required, and run them as
>>>> separate VEP commands with different distances passed to the UpDownDistance
>>>> plugin.
>>>>
>>>> HTH
>>>>
>>>> Will
>>>>
>>>>
>>>>
>>>>
>>>> On 2 June 2014 09:34, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> With regard to the question about specifying UPSTREAM_DISTANCE
>>>>> and DOWNSTREAM_DISTANCE limits for each variant in an input list of
>>>>> variants, if it is not possible to achieve that because of how VEP
>>>>> interacts with plug-ins, would it be possible to introduce UPSTREAM_COORDINATE
>>>>> and DOWNSTREAM_COORDINATE variables declarable within a plugin which
>>>>> would then allow to restrict where VEP looks for consequences?
>>>>>
>>>>> Regards,
>>>>>
>>>>> G.
>>>>>
>>>>>
>>>>>
>>>>> On 30 May 2014 17:49, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>
>>>>>> I did eventually figure out the answer to the first question: my
>>>>>> ($self, $tva) = @_; $self->params()
>>>>>>
>>>>>> For my second question, more specifically, what I want to do is to be
>>>>>> able to use the original input coordinate for each individual input variant
>>>>>> to then specify the UPSTREAM_DISTANCE and DOWNSTREAM_DISTANCE limits
>>>>>> per variant in UpDownDistance.pm. The reason for that is I have a
>>>>>> large group of variants for which I want to consider consequences within
>>>>>> the same physical range which I can already pass on to the plugin as
>>>>>> arguments. Running VEP per variant is not efficient hence the question.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> G.
>>>>>>
>>>>>>
>>>>>> On 30 May 2014 15:48, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>>
>>>>>>> A related question is how to get the input variant
>>>>>>> attributes (e.g. position, reference ID) so to process that within the
>>>>>>> subroutine.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> G.
>>>>>>>
>>>>>>>
>>>>>>> On 30 May 2014 13:01, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Will. It is working fine now.
>>>>>>>>
>>>>>>>> I wanted to modify the UpDownDistance.pm to produce two separate
>>>>>>>> columns in the VEP output showing the UPDIST_CUTOFF and DOWNDIST_CUTOFF
>>>>>>>> parameters (See below). Please how do I fetch the plugin arguments into the
>>>>>>>> run subroutine?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> G.
>>>>>>>>
>>>>>>>>
>>>>>>>> use strict;
>>>>>>>> use warnings;
>>>>>>>> use base qw(Bio::EnsEMBL::Variation::Utils::BaseVepPlugin);
>>>>>>>>
>>>>>>>> sub feature_types {
>>>>>>>>     return ['Feature', 'Intergenic'];
>>>>>>>> }
>>>>>>>>
>>>>>>>> sub get_header_info {
>>>>>>>>     return {
>>>>>>>>         UPDIST_CUTOFF => "distance cutoff upstream variant where consequences are calculated",
>>>>>>>>         DOWNDIST_CUTOFF => "distance cutoff downstream variant where consequences are calculated"
>>>>>>>>     };
>>>>>>>> }
>>>>>>>>
>>>>>>>> sub new {
>>>>>>>>
>>>>>>>>   my $class = shift;
>>>>>>>>   my $self = $class->SUPER::new(@_);
>>>>>>>>
>>>>>>>>   # change up/down
>>>>>>>>   my $up = $self->params->[0] || 5000;
>>>>>>>>
>>>>>>>>   my $down = $self->params->[1] || $up;
>>>>>>>>
>>>>>>>>   $Bio::EnsEMBL::Variation::Utils::VariationEffect::UPSTREAM_DISTANCE = $up;
>>>>>>>>
>>>>>>>>   $Bio::EnsEMBL::Variation::Utils::VariationEffect::DOWNSTREAM_DISTANCE = $down;
>>>>>>>>
>>>>>>>>   return $self;
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> sub run {
>>>>>>>>         my $upstream_distance = ?
>>>>>>>>         my $downstream_distance = ?
>>>>>>>>  return {
>>>>>>>> UPDIST_CUTOFF => $upstream_distance,
>>>>>>>> DOWNDIST_CUTOFF => $downstream_distance
>>>>>>>>  }
>>>>>>>> };
>>>>>>>>
>>>>>>>> 1;
>>>>>>>>
>>>>>>>>
>>>>>>>> On 29 May 2014 09:57, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've patched a fix in for the UpDownDistance issue, the fix is in
>>>>>>>>> the main ensembl-variation API.
>>>>>>>>>
>>>>>>>>> Regarding the DISTANCE field, perhaps you could write a plugin
>>>>>>>>> that does exactly what you want? Changing the behaviour of this field may
>>>>>>>>> not be compatible with other people's pipelines, and the plugin system is
>>>>>>>>> the perfect way for you to have annotations customised to your requirements.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Will
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 28 May 2014 18:58, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> When using different up and down arguments in UpDownDistance.pm,
>>>>>>>>>> VEP returns genes outside the specified range as shown in the example below
>>>>>>>>>> (MIR1302-4 is 94161 upstream of rs17808606 but is still reported
>>>>>>>>>> using UpDownDistance,5000,100000). For the genes which are
>>>>>>>>>> outside the range, the DISTANCE and Consequence columns are empty while for
>>>>>>>>>> example TSSDistance is not empty which might indicate the up and down
>>>>>>>>>> arguments may not be processed correctly.
>>>>>>>>>>
>>>>>>>>>> It would be helpful to only return genes whose coordinates
>>>>>>>>>> satisfy the specified range. Also, it would immensely help as well if
>>>>>>>>>> DISTANCE is set to 0 for variants falling within genes and is otherwise
>>>>>>>>>> calculated even for non-transcript feature types.
>>>>>>>>>>
>>>>>>>>>> Note that I am using Ensembl 75 updated with the recently updated
>>>>>>>>>> ensembl-variation module which allowed UpDownDistance.pm to work for
>>>>>>>>>> distances beyond 5kb.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> G.
>>>>>>>>>>
>>>>>>>>>> ##UpDownDistance,5000,100000
>>>>>>>>>> ##TSSDistance
>>>>>>>>>> #Uploaded_variation Location Allele Existing_variation SYMBOL SYMBOL_SOURCE Gene ENSP Feature Feature_type BIOTYPE STRAND CANONICAL EXON INTRON DISTANCE TSSDistance Consequence
>>>>>>>>>> rs17808606 2:208228309 T rs17808606 AC007879.5 Clone_based_vega_gene ENSG00000223725 - ENST00000412387 Transcript antisense -1 - - 3/4 - - intron_variant,nc_transcript_variant
>>>>>>>>>> rs17808606 2:208228309 T rs17808606 MIR1302-4 HGNC ENSG00000221628 - ENST00000408701 Transcript miRNA -1 YES - - - 94161
>>>>>>>>>> rs17808606 2:208228309 T rs17808606 AC007879.6 Clone_based_vega_gene ENSG00000225064 - ENST00000438824 Transcript lincRNA 1 YES - - 92895 - downstream_gene_variant
>>>>>>>>>> rs17808606 2:208228309 T rs17808606 AC007879.5 Clone_based_vega_gene ENSG00000223725 - ENST00000418850 Transcript antisense -1 YES - 4/5 - - intron_variant,nc_transcript_variant
>>>>>>>>>>
>>>>>>>>>> ##UpDownDistance,100000
>>>>>>>>>> ##TSSDistance
>>>>>>>>>> #Uploaded_variation Location Allele Existing_variation SYMBOL SYMBOL_SOURCE Gene ENSP Feature Feature_type BIOTYPE STRAND CANONICAL EXON INTRON DISTANCE TSSDistance Consequence
>>>>>>>>>> rs17808606 2:208228309 T rs17808606 AC007879.5 Clone_based_vega_gene ENSG00000223725 - ENST00000412387 Transcript antisense -1 - - 3/4 - - intron_variant,nc_transcript_variant
>>>>>>>>>> rs17808606 2:208228309 T rs17808606 MIR1302-4 HGNC ENSG00000221628 - ENST00000408701 Transcript miRNA -1 YES - - 94161 94161 upstream_gene_variant
>>>>>>>>>> rs17808606 2:208228309 T rs17808606 AC007879.6 Clone_based_vega_gene ENSG00000225064 - ENST00000438824 Transcript lincRNA 1 YES - - 92895 - downstream_gene_variant
>>>>>>>>>> rs17808606 2:208228309 T rs17808606 AC007879.5 Clone_based_vega_gene ENSG00000223725 - ENST00000418850 Transcript antisense -1 YES - 4/5 - - intron_variant,nc_transcript_variant
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 27 May 2014 11:03, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry seems the plug-in already does that thanks!
>>>>>>>>>>>
>>>>>>>>>>> G.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 23 May 2014 19:14, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Will,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks very much. That worked nicely.
>>>>>>>>>>>>
>>>>>>>>>>>> I am working with a set of variants within a locus where I know
>>>>>>>>>>>> that they are LD-independent with other genes from outside this locus.
>>>>>>>>>>>> Therefore, I want only to focus on genes inside this physically defined
>>>>>>>>>>>> locus.
>>>>>>>>>>>>
>>>>>>>>>>>> Rarely do these variants fall exactly at the centre of the
>>>>>>>>>>>> locus so distances to the right and left boundaries are not equal. Would it
>>>>>>>>>>>> be possible to alter UpDownDistance.pm to be able to specify a
>>>>>>>>>>>> start and end coordinate within which VEP should be constrained instead of
>>>>>>>>>>>> the current distance cutoff?
>>>>>>>>>>>>
>>>>>>>>>>>> Many thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> G.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 8 May 2014 16:12, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello again,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've fixed a bug that prevented UpDownDistance functioning
>>>>>>>>>>>>> correctly - it hadn't been tested with larger distances such as you
>>>>>>>>>>>>> specified which broke some assumptions in the core VEP code.
>>>>>>>>>>>>> You will need to update your ensembl-variation module or
>>>>>>>>>>>>> re-run the VEP INSTALL.pl script to pick up the new API code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As far as the other plugins go, I think you are
>>>>>>>>>>>>> misunderstanding how some of them work:
>>>>>>>>>>>>>
>>>>>>>>>>>>> TSSDistance - this gives the distance between a variant and
>>>>>>>>>>>>> the annotated transcript start site. If a variant is annotated as
>>>>>>>>>>>>> intergenic, there is no transcript to give the distance to! Changing the
>>>>>>>>>>>>> code to force it to assess intergenic variants will of course break here.
>>>>>>>>>>>>> Of course if you alter the up/down-stream distance using UpDownDistance such
>>>>>>>>>>>>> that this then finds a transcript in range, the plugin will then work as
>>>>>>>>>>>>> expected without modification. It seems to me that you are expecting that
>>>>>>>>>>>>> this plugin will find the shortest distance to _any_ transcript start site,
>>>>>>>>>>>>> which is not the intended purpose of the code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Condel & dbNSFP - these two plugins work exclusively on
>>>>>>>>>>>>> missense AKA non-synonymous SNVs (hence the NS in the name dbNSFP). While
>>>>>>>>>>>>> dbNSFP carries scores for CADD, and CADD gives scores for any genomic
>>>>>>>>>>>>> position, the CADD scores in dbNSFP are only for missense variants.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The feature_types() subroutine should be used when writing
>>>>>>>>>>>>> your own plugin to determine which kind of variant/feature combinations are
>>>>>>>>>>>>> considered by the plugin, since the run() sub is executed once for each
>>>>>>>>>>>>> variant/feature overlap found by the core VEP code. Modifying existing
>>>>>>>>>>>>> plugins like this should be done only if you are confident that the
>>>>>>>>>>>>> modification achieves what you intend.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hope that all helps
>>>>>>>>>>>>>
>>>>>>>>>>>>> Will
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7 May 2014 17:59, Genomeo Dev <genomeodev at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Will.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am working with non-coding and intergenic variants and
>>>>>>>>>>>>>> wanted to run VEP with the following plugins:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --plugin UpDownDistance,100000 \
>>>>>>>>>>>>>> --plugin TSSDistance \
>>>>>>>>>>>>>> --plugin Condel,/media/sf_D_DRIVE/Projects/Databases/ensembl/Plugins/Condel/config,b \
>>>>>>>>>>>>>> --plugin CADD,/media/sf_D_DRIVE/Projects/Databases/CADD/v1.0/1000G.tsv.gz \
>>>>>>>>>>>>>> --plugin Gwava,tss,/media/sf_D_DRIVE/Projects/Databases/gwava/gwava_scores.bed.gz \
>>>>>>>>>>>>>> --plugin Conservation,GERP_CONSERVATION_SCORE,mammals \
>>>>>>>>>>>>>> --plugin dbNSFP,/media/sf/data/dbNSFP/dbNSFP2.4.gz,GERP++_NR,GERP++_RS,LRT_score,LRT_pred,MutationTaster_score,MutationTaster_pred,MutationAssessor_score,MutationAssessor_pred,FATHMM_score,FATHMM_pred,RadialSVM_score,RadialSVM_pred,LR_score,LR_pred,Reliability_index,SiPhy_29way_logOdds,Polyphen2_HVAR_score,Polyphen2_HVAR_pred,SIFT_score,SIFT_pred,CADD_raw,CADD_phred
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As shown in the output below, apart from CADD.pm and
>>>>>>>>>>>>>> Gwava.pm, no scores are returned for the others. dbNSFP.pm should  get at
>>>>>>>>>>>>>> least CADD scores because these exist. As recommended I tried using:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sub feature_types {
>>>>>>>>>>>>>>     return ['Feature', 'Intergenic'];
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sub feature_types {
>>>>>>>>>>>>>>    return ['Transcript', 'Intergenic'];
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> in dbNSFP.pm but it does not help. When I tried that in
>>>>>>>>>>>>>> TSSDistance.pm I get this error:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Plugin 'TSSDistance' went wrong: Can't locate object method
>>>>>>>>>>>>>> "transcript" via package
>>>>>>>>>>>>>> "Bio::EnsEMBL::Variation::IntergenicVariationAllele" at
>>>>>>>>>>>>>> /media/sf_D_DRIVE/Projects/Databases/ensembl/Plugins//TSSDistance.pm line
>>>>>>>>>>>>>> 56.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For UpDownDistance.pm, it does not seem to work as for
>>>>>>>>>>>>>> instance rs140931361 is 58298 bp from ENSG00000198822 but
>>>>>>>>>>>>>> this gene is not returned.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> OUTPUT:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ## ENSEMBL VARIANT EFFECT PREDICTOR v75
>>>>>>>>>>>>>> ## Output produced at 2014-05-07 17:28:44
>>>>>>>>>>>>>> ## Connected to homo_sapiens_core_75_37 on ensembldb.ensembl.org
>>>>>>>>>>>>>> ## Using cache in /media/sf_D_DRIVE/Projects/Databases/ensembl//homo_sapiens/75
>>>>>>>>>>>>>> ## Using API version 75, DB version 75
>>>>>>>>>>>>>> ## sift version sift5.0.2
>>>>>>>>>>>>>> ## polyphen version 2.2.2
>>>>>>>>>>>>>> ## Extra column keys:
>>>>>>>>>>>>>> ## BIOTYPE : Biotype of transcript
>>>>>>>>>>>>>> ## CANONICAL : Indicates if transcript is canonical for this gene
>>>>>>>>>>>>>> ## CELL_TYPE : List of cell types and classifications for regulatory feature
>>>>>>>>>>>>>> ## CLIN_SIG : Clinical significance of variant from dbSNP
>>>>>>>>>>>>>> ## DISTANCE : Shortest distance from variant to transcript
>>>>>>>>>>>>>> ## DOMAINS : The source and identifer of any overlapping protein domains
>>>>>>>>>>>>>> ## ENSP : Ensembl protein identifer
>>>>>>>>>>>>>> ## EXON : Exon number(s) / total
>>>>>>>>>>>>>> ## HIGH_INF_POS : A flag indicating if the variant falls in a high information position of the TFBP
>>>>>>>>>>>>>> ## INTRON : Intron number(s) / total
>>>>>>>>>>>>>> ## MOTIF_NAME : The source and identifier of a transcription factor binding profile (TFBP) aligned at this position
>>>>>>>>>>>>>> ## MOTIF_POS : The relative position of the variation in the aligned TFBP
>>>>>>>>>>>>>> ## MOTIF_SCORE_CHANGE : The difference in motif score of the reference and variant sequences for the TFBP
>>>>>>>>>>>>>> ## PUBMED : Pubmed ID(s) of publications that cite existing variant
>>>>>>>>>>>>>> ## PolyPhen : PolyPhen prediction and/or score
>>>>>>>>>>>>>> ## SIFT : SIFT prediction and/or score
>>>>>>>>>>>>>> ## SYMBOL : Gene symbol (e.g. HGNC)
>>>>>>>>>>>>>> ## SYMBOL_SOURCE : Source of gene symbol
>>>>>>>>>>>>>> ## TSSDistance : Distance from the transcription start site
>>>>>>>>>>>>>> ## Condel : Consensus deleteriousness score for an amino acid substitution based on SIFT and PolyPhen-2
>>>>>>>>>>>>>> ## CADD_RAW : Raw CADD score
>>>>>>>>>>>>>> ## CADD_PHRED : PHRED-like scaled CADD score
>>>>>>>>>>>>>> ## GWAVA : Genome Wide Annotation of VAriants score (tss model)
>>>>>>>>>>>>>> ## Conservation : The conservation score for this site (method_link_type="GERP_CONSERVATION_SCORE", species_set="mammals")
>>>>>>>>>>>>>> ## MutationTaster_score : MutationTaster_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## Polyphen2_HVAR_score : Polyphen2_HVAR_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## LRT_pred : LRT_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## MutationAssessor_score : MutationAssessor_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## FATHMM_pred : FATHMM_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## LR_score : LR_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## MutationTaster_pred : MutationTaster_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## SiPhy_29way_logOdds : SiPhy_29way_logOdds from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## CADD_phred : CADD_phred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## Polyphen2_HVAR_pred : Polyphen2_HVAR_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## RadialSVM_pred : RadialSVM_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## Reliability_index : Reliability_index from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## GERP++_NR : GERP++_NR from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## MutationAssessor_pred : MutationAssessor_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## LRT_score : LRT_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## CADD_raw : CADD_raw from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## LR_pred : LR_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## FATHMM_score : FATHMM_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## SIFT_score : SIFT_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## GERP++_RS : GERP++_RS from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## SIFT_pred : SIFT_pred from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> ## RadialSVM_score : RadialSVM_score from dbNSFP file /media/sf_Psychiatric_Genetics_yo2/data/dbNSFP/dbNSFP2.4.gz
>>>>>>>>>>>>>> #Uploaded_variation Location Allele Existing_variation SYMBOL
>>>>>>>>>>>>>> SYMBOL_SOURCE Gene ENSP Feature Feature_type BIOTYPE STRAND
>>>>>>>>>>>>>> CANONICAL EXON INTRON DISTANCE TSSDistance Consequence
>>>>>>>>>>>>>> cDNA_position CDS_position Protein_position Amino_acids
>>>>>>>>>>>>>> Codons PolyPhen SIFT Condel CELL_TYPE SV PUBMED CLIN_SIG
>>>>>>>>>>>>>> HIGH_INF_POS MOTIF_NAME MOTIF_POS MOTIF_SCORE_CHANGE
>>>>>>>>>>>>>> TSSDistance CADD_RAW CADD_PHRED GWAVA Conservation GERP++_NR
>>>>>>>>>>>>>> GERP++_RS LRT_score LRT_pred MutationTaster_score
>>>>>>>>>>>>>> MutationTaster_pred MutationAssessor_score
>>>>>>>>>>>>>> MutationAssessor_pred FATHMM_score FATHMM_pred
>>>>>>>>>>>>>> RadialSVM_score RadialSVM_pred LR_score LR_pred
>>>>>>>>>>>>>> Reliability_index SiPhy_29way_logOdds Polyphen2_HVAR_score
>>>>>>>>>>>>>> Polyphen2_HVAR_pred SIFT_score SIFT_pred CADD_raw CADD_phred
>>>>>>>>>>>>>> Extra  rs13247133 7:86199080 A rs13247133 - - - - - - - - - -
>>>>>>>>>>>>>> - - - intergenic_variant - - - - - - - - - - - - - - - - -
>>>>>>>>>>>>>> -0.25769 2.762 0.11 - - - - - - - - - - - - - - - - - - - - -
>>>>>>>>>>>>>> - - CADD_RAW=-0.257691;CADD_PHRED=2.762;GWAVA=0.11
>>>>>>>>>>>>>> rs13244782 7:86202665 T rs13244782 - - - - - - - - - - - - -
>>>>>>>>>>>>>> intergenic_variant - - - - - - - - - - - - - - - - - 1.957591
>>>>>>>>>>>>>> 12.5 0.15 - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>>>>>>>>> CADD_RAW=1.957591;CADD_PHRED=12.50;GWAVA=0.15  rs12704267
>>>>>>>>>>>>>> 7:86206830 T rs12704267 - - - - - - - - - - - - -
>>>>>>>>>>>>>> intergenic_variant - - - - - - - - - - - - - - - - - 0.111018
>>>>>>>>>>>>>> 4.597 0.16 - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>>>>>>>>> CADD_RAW=0.111018;CADD_PHRED=4.597;GWAVA=0.16  rs140931361
>>>>>>>>>>>>>> 7:86214933-86214937 - rs140931361 - - - - - - - - - - - - -
>>>>>>>>>>>>>> intergenic_variant - - - - - - - - - - - - - - - - - -0.42024
>>>>>>>>>>>>>> 2.04 - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>>>>>>>>> CADD_RAW=-0.420243;CADD_PHRED=2.040  rs34536358 7:86222651 G
>>>>>>>>>>>>>> rs34536358 - - - - - - - - - - - - - intergenic_variant - - -
>>>>>>>>>>>>>> - - - - - - - - - - - - - - -0.31002 2.524 0.18 - - - - - - -
>>>>>>>>>>>>>> - - - - - - - - - - - - - - - -
>>>>>>>>>>>>>> CADD_RAW=-0.310016;CADD_PHRED=2.524;GWAVA=0.18  rs36006360
>>>>>>>>>>>>>> 7:86224933 T rs36006360 - - - - - - - - - - - - -
>>>>>>>>>>>>>> intergenic_variant - - - - - - - - - - - - - - - - - 2.513017
>>>>>>>>>>>>>> 14.36 0.36 - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>>>>>>>>> CADD_RAW=2.513017;CADD_PHRED=14.36;GWAVA=0.36  rs13244678
>>>>>>>>>>>>>> 7:86232583 T rs13244678 - - - - - - - - - - - - -
>>>>>>>>>>>>>> intergenic_variant - - - - - - - - - - - - - - - - - -0.52024
>>>>>>>>>>>>>> 1.626 0.05 - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>>>>>>>>> CADD_RAW=-0.520238;CADD_PHRED=1.626;GWAVA=0.05  rs12704279
>>>>>>>>>>>>>> 7:86238294 T rs12704279 - - - - - - - - - - - - -
>>>>>>>>>>>>>> intergenic_variant - - - - - - - - - - - - - - - - - 0.454708
>>>>>>>>>>>>>> 6.469 0.16 - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>>>>>>>>> CADD_RAW=0.454708;CADD_PHRED=6.469;GWAVA=0.16  rs13228078
>>>>>>>>>>>>>> 7:86240691 C rs13228078 - - - - - - - - - - - - -
>>>>>>>>>>>>>> intergenic_variant - - - - - - - - - - - - - - - - - 0.980262
>>>>>>>>>>>>>> 9.002 0.1 - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>>>>>>>>> CADD_RAW=0.980262;CADD_PHRED=9.002;GWAVA=0.1  rs140931361
>>>>>>>>>>>>>> 7:86214933-86214937 - rs140931361 - - - - - - - - - - - - -
>>>>>>>>>>>>>> intergenic_variant - - - - - - - - - - - - - - - - - -0.42024
>>>>>>>>>>>>>> 2.04 - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>>>>>>>>> CADD_RAW=-0.420243;CADD_PHRED=2.040
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> G.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7 May 2014 16:13, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Correct, the plugin was intended to work with
>>>>>>>>>>>>>>> the whole_genome_SNVs.tsv file, which only contains data for SNVs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've modified the plugin so that it should be able to cope
>>>>>>>>>>>>>>> with indel data files such as you have; please do let me know if you have
>>>>>>>>>>>>>>> any problems as I've only sparingly tested it on made-up data!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Will McLaren
>>>>>>>>>>>>>>> Ensembl Variation
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7 May 2014 15:37, Genomeo Dev <genomeodev at gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There seems to be a discrepancy between the CADD score
>>>>>>>>>>>>>>>> calculated using VEP with the CADD.pm plugin and the tabix direct output:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For example using this 1000G variant:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> #CHROM POS ID REF ALT QUAL FILTER INFO
>>>>>>>>>>>>>>>> 7 86214932 rs140931361 TTACTC T . PASS .
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> variant_effect_predictor.pl -i input.txt --format vcf
>>>>>>>>>>>>>>>> --plugin CADD,/media/sf_D_DRIVE/Projects/Databases/CADD/v1.0/1000G.tsv.gz
>>>>>>>>>>>>>>>> does not return any CADD score
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> whereas
>>>>>>>>>>>>>>>> $ tabix -p vcf 1000G.tsv.gz 7:86214932-86214932
>>>>>>>>>>>>>>>> 7 86214932 TTACTC T -0.420243 2.040
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This seems to affect indels and not SNVs. I could see in
>>>>>>>>>>>>>>>> the plugin that there is a rule to ignore indels. Any suggestions please
>>>>>>>>>>>>>>>> how to safely change that?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, in the plugin, I assume there is a test to ensure the
>>>>>>>>>>>>>>>> alleles are identical between the input file and the 1000G.tsv.gz file. Is
>>>>>>>>>>>>>>>> this correct?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> G.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> Dev mailing list    Dev at ensembl.org
>>>>>>>>>>>>>>>> Posting guidelines and subscribe/unsubscribe info:
>>>>>>>>>>>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>>>>>>>>>>>> Ensembl Blog: http://www.ensembl.info/