[ensembl-dev] VEP Extra output information
Guillermo Marco Puche
guillermo.marco at sistemasgenomicos.com
Wed Apr 24 17:01:50 BST 2013
Forgot to mention that we believe that output format will be
1_35246849_-/C for that variant it's ok.
But $pos_string should have the original 35246848 position to access the
indexed file.
Best regards,
Guillermo.
On 04/24/13 17:54, Guillermo Marco Puche wrote:
> Hello Will,
>
> We've been testing around that VCF line that it's causing error:
>
> We believe that the insertion format is correct (we already read the
> format page on VEP ensembl before) for the following line:
>
> *chr1 35246848 . G GC 1000 . data2
>
> This is a "C" insertion in pos **35246848.
> *For this line in VCF the VEP script is getting this values for
> $pos_string:
> chr1:35246849-35246848
>
> Since this index is afterwards not recognized by tabix access to the
> indexed file won't happen.
> We may be wrong and don't understand the explanation on the VCF
> formatting section you mentioned before.
>
> On 04/24/13 13:51, Will McLaren wrote:
>> See:
>>
>> http://www.ensembl.org/info/docs/variation/vep/vep_formats.html#vcf
>>
>> Will
>>
>>
>> On 24 April 2013 12:40, Guillermo Marco Puche
>> <guillermo.marco at sistemasgenomicos.com
>> <mailto:guillermo.marco at sistemasgenomicos.com>> wrote:
>>
>> Hello,
>>
>> It seems this has something to annotation format:
>>
>> I've modified the line to look like this:
>>
>> chr1 35246849 . - C 1000 . data2
>>
>> I removed *.gz and index files.
>>
>> Re-indexed with the following commands:
>>
>> bgzip -c tabix_test.vcf > tabix_test.vcf.gz
>>
>> tabix -p vcf tabix_test.vcf.gz
>>
>> And when running VEP I get this error:
>>
>> WARNING: Alleles look like an insertion (-/C) but coordinates are
>> not start = end + 1 (START=35246849, END=35246849) on line 2
>>
>>
>> On 04/24/13 13:24, Guillermo Marco Puche wrote:
>>> Hello,
>>>
>>> I'm confused. I've found a bug into my plugin.
>>>
>>> This is my input vcf file sample:
>>>
>>> #CHROM POS ID REF ALT QUAL FILTER INFO
>>> FORMAT DATA
>>> chr1 6520668 . A C 1000 . data1
>>> *chr1 35246848 . G GC 1000 . data2*
>>> chr1 35247292 . A G 1000 . data1
>>>
>>>
>>> This line "*chr1 35246848 . G GC 1000 .
>>> data2"* is making my plugin fail.
>>> I've noticed that printing
>>> my $pos_string = sprintf("chr%s:%i-%i", $vf->{chr}, $vf->{start}, $vf->{end});
>>> Results in this for that line: chr1:35246849-35246848
>>>
>>> How is this possible? That start position is one position ahead
>>> of end?
>>>
>>> Best regards,
>>> Guillermo.
>>>
>>> -------- Original Message --------
>>> Subject: Re: [ensembl-dev] VEP Extra output information
>>> Date: Tue, 23 Apr 2013 09:48:00 +0200
>>> From: Guillermo Marco Puche
>>> <guillermo.marco at sistemasgenomicos.com>
>>> <mailto:guillermo.marco at sistemasgenomicos.com>
>>> Organization: Sistemas Genómicos
>>> To: dev at ensembl.org <mailto:dev at ensembl.org>
>>>
>>>
>>>
>>> Hello,
>>>
>>> Ok finally fixed it !
>>>
>>> The problem was that my input VCF file had first column in this
>>> format: chr2 and not just the number.
>>> Fixed it changing the following line on the script:
>>> my $pos_string = sprintf("chr%s:%i-%i", $vf->{chr}, $vf->{start}, $vf->{end});
>>>
>>> Thank you so much Will.
>>> Now I'm pretty sure that my next plugins will be easier to develop !
>>>
>>> I still think that plugin documentation should be extended on
>>> Ensembl website, it's not that easy to get started ;)
>>>
>>> Once again thank you very much.
>>>
>>> Best regards,
>>> Guillermo.
>>>
>>> On 04/23/13 09:27, Guillermo Marco Puche wrote:
>>>> Hello,
>>>>
>>>> I edited the code so it match exactly the way you handled input
>>>> file and tabix in dbNSFP plugin.
>>>> You can check code on GIT repo:
>>>> https://github.com/guillermomarco/vcf_input/blob/master/vcf_input.pm
>>>>
>>>> It still won't iterate over TABIX file. Both variables
>>>> $pos_string (2:26739423-26739423) and $file(myfile.vcf.gz) have
>>>> correct values during execution. The only thing that differs
>>>> from your script is that I'm not extracting HEAD for Tabix
>>>> since I don't need it.
>>>>
>>>> Maybe something is wrong with my input file since plugin is
>>>> being executed but output is empty because it doesn't iterate
>>>> file over TABIX file handler.
>>>>
>>>> I've compressed and indexed it with the following commands
>>>> previously:
>>>> bgzip -c test.vcf > test.vcf.gz
>>>> tabix -p vcf test.vcf.gz
>>>>
>>>> Thank you.
>>>>
>>>> Best regards,
>>>> Guillermo.
>>>>
>>>> On 04/22/13 18:09, Guillermo Marco Puche wrote:
>>>>> Hello Will,
>>>>>
>>>>> After applying your fixes there's still no output.
>>>>>
>>>>> Something must be wrong with TABIX file handler. $pos_string
>>>>> seems correct, i've been printing on screen the values and
>>>>> they seem correct :)
>>>>>
>>>>> On the other hand "while TABIX" loop is not being executed. It
>>>>> never iterates through this loop. But there's also no error
>>>>> with file handler. So I don't know what is wrong.
>>>>>
>>>>> Guillermo.
>>>>>
>>>>> On 04/22/13 17:56, Will McLaren wrote:
>>>>>> A couple of other bugs I've spotted - you've got some variables being
>>>>>> declared in the wrong scope which means you won't get anything in your
>>>>>> results. Your run method should look like:
>>>>>>
>>>>>> sub run {
>>>>>> my ($self, $tva) = @_;
>>>>>> my $vf = $tva->variation_feature;
>>>>>> my $pos_string = sprintf("%s:%i-%i", $vf->{chr}, $vf->{start}, $vf->{end});
>>>>>>
>>>>>> #my $fichero = "test.vcf";
>>>>>>
>>>>>> my $info;
>>>>>>
>>>>>> open TABIX, sprintf("tabix %s %s |", $self->{file}, $pos_string);
>>>>>> #open TABIX file handler for current position
>>>>>> while(<TABIX>){
>>>>>> chomp;
>>>>>> if (my $line =~ /^#/){ next; } #skip header line
>>>>>> my @split = split /\t/;
>>>>>> $info = $split[7]; #store 8th column (INFO)
>>>>>> }
>>>>>> close TABIX; #close TABIX file handler
>>>>>>
>>>>>> return {
>>>>>> "SAMPLES" => $info,
>>>>>> };
>>>>>> }
>>>>>>
>>>>>> On 22 April 2013 16:38, Will McLaren<wm2 at ebi.ac.uk> <mailto:wm2 at ebi.ac.uk> wrote:
>>>>>>> You need to store the file name on the object itself. In the new method, replace
>>>>>>>
>>>>>>> my $file = $self->params->[0];
>>>>>>>
>>>>>>> with
>>>>>>>
>>>>>>> $self->{file} = $self->params->[0];
>>>>>>>
>>>>>>> Will
>>>>>>>
>>>>>>> On 22 April 2013 16:33, Guillermo Marco Puche
>>>>>>> <guillermo.marco at sistemasgenomicos.com> <mailto:guillermo.marco at sistemasgenomicos.com> wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Thank you for that information Will. It's so useful.
>>>>>>>>
>>>>>>>> Ok I've managed to almost finish the plugin:
>>>>>>>> https://github.com/guillermomarco/vcf_input/blob/master/vcf_input.pm
>>>>>>>>
>>>>>>>> I'm still getting an error with sprintf when trying to run it with VEP.
>>>>>>>>
>>>>>>>> 2013-04-22 17:18:07 - Warning: plugin
>>>>>>>> vcf_input,/home/likewise-open/SGNET/gmarco/scripts/genomics/global/annotation/ensembl71/VCF_Input/test.vcf.gz
>>>>>>>> version (0.1) does not match the current VEP version (71)
>>>>>>>> 2013-04-22 17:18:07 - You may experience unexpected behaviour with this
>>>>>>>> plugin
>>>>>>>> 2013-04-22 17:18:07 - Loaded plugin: vcf_input
>>>>>>>> 2013-04-22 17:18:07 - Output fields redefined (26 defined)
>>>>>>>> 2013-04-22 17:18:09 - INFO: Database will be accessed when using --hgvs
>>>>>>>> 2013-04-22 17:18:09 - Starting...
>>>>>>>> 2013-04-22 17:18:09 - Read 3 variants into buffer
>>>>>>>> 2013-04-22 17:18:09 - Reading transcript data from cache and/or database
>>>>>>>> [===============================================] [ 100% ]
>>>>>>>> 2013-04-22 17:18:09 - Retrieved 74 transcripts (0 mem, 74 cached, 0 DB, 0
>>>>>>>> duplicates)
>>>>>>>> 2013-04-22 17:18:09 - Reading regulatory data from cache and/or database
>>>>>>>> [===============================================] [ 100% ]
>>>>>>>> 2013-04-22 17:18:09 - Retrieved 539 regulatory features (0 mem, 539 cached,
>>>>>>>> 0 DB, 0 duplicates)
>>>>>>>> 2013-04-22 17:18:09 - Calculating consequences
>>>>>>>> [> ] [ 5% ]
>>>>>>>> ERROR: Forked process failed
>>>>>>>> Use of uninitialized value in sprintf at
>>>>>>>> /home/likewise-open/SGNET/gmarco/.vep/Plugins/vcf_input.pm <http://vcf_input.pm> line 69.
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Guillermo.
>>>>>>>>
>>>>>>>> On 04/22/13 14:34, Will McLaren wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> The plugin is run once for each combination of variant and overlapped
>>>>>>>> feature.
>>>>>>>>
>>>>>>>> Let's say your variant overlaps 4 transcripts (they may be splice
>>>>>>>> variants of the same gene, or 4 different genes, the principal is the
>>>>>>>> same). In this case, the plugin's "run" method will be executed 4
>>>>>>>> times for that variant, once for each transcript. The $tva
>>>>>>>> TranscriptVariationAllele object in each run will have a different
>>>>>>>> transcript "attached" to it. I'd recommend stepping through how a
>>>>>>>> plugin works using Perl's debugger - simply add the line:
>>>>>>>>
>>>>>>>> $DB::single = 1;
>>>>>>>>
>>>>>>>> somewhere at the start of the run method, then run the vep with perl's -d
>>>>>>>> flag:
>>>>>>>>
>>>>>>>> perl -dvariant_effect_predictor.pl <http://variant_effect_predictor.pl> [etc]
>>>>>>>>
>>>>>>>> I'm not sure, but it seems like you're preloading all of the VCF in
>>>>>>>> your plugin at the beginning, and then trying to add the info one at a
>>>>>>>> time. This is not an ideal way to do it, as for large VCF files you
>>>>>>>> may run in to memory usage issues, and especially using an array you
>>>>>>>> may not be able to reliably link the lines from your VCF to the lines
>>>>>>>> of input going into the VEP (a hash keyed on position would be much
>>>>>>>> better).
>>>>>>>>
>>>>>>>> A much better way to do it would be to prepare the VCF file with
>>>>>>>> tabix. The tabix utility can then retrieve just the relevant line from
>>>>>>>> the VCF on demand (this is very quick, and you can cache data on the
>>>>>>>> plugin's hash structure between separate executions of the "run"
>>>>>>>> method).
>>>>>>>>
>>>>>>>> In the dbNSFP.pm plugin on GitHub, I do something very very similar to
>>>>>>>> this. First, I get the variation feature object being passed to the
>>>>>>>> "run" method - this contains the genomic coordinates of the current
>>>>>>>> variant:
>>>>>>>>
>>>>>>>> my $vf = $tva->variation_feature;
>>>>>>>>
>>>>>>>> I then create a string to pass to tabix, which is chr:start-end; this
>>>>>>>> means tabix will retrieve the lines from your VCF in that range:
>>>>>>>>
>>>>>>>> my $pos_string = sprintf("%s:%i-%i", $vf->{chr}, $vf->{start}, $vf->{end});
>>>>>>>>
>>>>>>>> I then run tabix and open the output as a pipe:
>>>>>>>>
>>>>>>>> open TABIX, sprintf("tabix %s %s |", $self->{file}, $pos_string);
>>>>>>>>
>>>>>>>> I can then read lines of VCF from the <TABIX> filehandle, parse them,
>>>>>>>> and finally add the data to the %return hash that gets sent back at
>>>>>>>> the end of the plugin.
>>>>>>>>
>>>>>>>> This return hash must contain key/value pairs that will be printed in
>>>>>>>> the output. For example, lets say I want to add the variant name from
>>>>>>>> the VCF file I've just parsed:
>>>>>>>>
>>>>>>>> return {
>>>>>>>> "VAR_NAME" => $var_name,
>>>>>>>> }
>>>>>>>>
>>>>>>>> where $var_name = 'rs123'. Then in the output you would see (normal,
>>>>>>>> tab-delimited):
>>>>>>>>
>>>>>>>> VAR_NAME=rs123
>>>>>>>>
>>>>>>>> appear in the Extra column of your output file. You could add multiple
>>>>>>>> values for VAR_NAME, but you'd have to write that as a string, for
>>>>>>>> example:
>>>>>>>>
>>>>>>>> return {
>>>>>>>> "VAR_NAME" => join(",", ($var_name1, $var_name2)),
>>>>>>>> }
>>>>>>>>
>>>>>>>> which would give you e.g.
>>>>>>>>
>>>>>>>> VAR_NAME=rs123,rs456
>>>>>>>>
>>>>>>>> I'm afraid also at this juncture I have to point out that I'm nearing
>>>>>>>> the limit of support I'm meant to be giving out to one individual.
>>>>>>>> While we are here to help and will answer any reasonable questions you
>>>>>>>> have, we have to stop short of doing people's jobs for them! Anything
>>>>>>>> more than a base level of help might have to be considered as a
>>>>>>>> collaboration, and this would require communication between our
>>>>>>>> respective supervisors.
>>>>>>>>
>>>>>>>> I hope that the documentation on the website and the example code
>>>>>>>> (which we try to comment as thoroughly as we can) should be enough to
>>>>>>>> keep you going, and of course I don't want to put you off using and
>>>>>>>> helping us improve the VEP.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Will
>>>>>>>>
>>>>>>>> On 22 April 2013 12:59, Guillermo Marco Puche
>>>>>>>> <guillermo.marco at sistemasgenomicos.com> <mailto:guillermo.marco at sistemasgenomicos.com> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm starting to develop a simple plugin to write the INFO column from VCF
>>>>>>>> input into VEP output.
>>>>>>>> As far as I've been seen in Git VEP Plugin repo, VEP script will right the
>>>>>>>> value returned by the plugin in run function.
>>>>>>>>
>>>>>>>> Suppose that I've an array of values with the INFO column. Something like
>>>>>>>> this:
>>>>>>>> my @info_column = ("info_row1","info_row2","info_row3")
>>>>>>>>
>>>>>>>> An array containing the content of INFO column for each line of VCF input.
>>>>>>>> How do I associate each value to the corresponding VEP line output?
>>>>>>>> I guess I cannot simply return the array as the result of my plugin.
>>>>>>>>
>>>>>>>> Is plugin executed for every line by VEP script?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Guillermo.
>>>>>>>>
>>>>>>>> On 04/22/13 12:07, Guillermo Marco Puche wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Me neither. So I've no clue. I hope someone else can help me.
>>>>>>>>
>>>>>>>> I've also been looking the plugin code you mentioned.
>>>>>>>> I don't really see how to extract the columns from input VCF and intersect
>>>>>>>> them with VEP output.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Guillermo.
>>>>>>>>
>>>>>>>> On 04/22/13 12:01, Will McLaren wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I haven't used VCFannotate myself, perhaps I was wrong!
>>>>>>>>
>>>>>>>> I know of other VEP users who have used it though, maybe someone on
>>>>>>>> the list will read this email and can give you some help.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> Will
>>>>>>>>
>>>>>>>> On 22 April 2013 10:58, Guillermo Marco Puche
>>>>>>>> <guillermo.marco at sistemasgenomicos.com> <mailto:guillermo.marco at sistemasgenomicos.com> wrote:
>>>>>>>>
>>>>>>>> Hello Will,
>>>>>>>>
>>>>>>>> It seems VCFannotate is made for "Intersect the records in the VCF file with
>>>>>>>> targets provided in a BED file.".
>>>>>>>> How I'm supposed to intersect the output from vep script (VCF or VEP file)
>>>>>>>> with my input file VCF?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Guillermo.
>>>>>>>>
>>>>>>>> On 04/19/13 11:31, Will McLaren wrote:
>>>>>>>>
>>>>>>>> Hi Guillermo,
>>>>>>>>
>>>>>>>> The --custom system doesn't quite work like that. Currently it is set
>>>>>>>> up to either provide only the ID or the coordinates of any features it
>>>>>>>> finds overlapping your variants in the custom file. It can't pull
>>>>>>>> particular fields from a VCF in the way you describe here.
>>>>>>>>
>>>>>>>> To do so, you'd either have to write a plugin to do this (see the
>>>>>>>> dbNSFP.pm plugin for an example of doing similar), or use VCFannotate,
>>>>>>>> which I believe can do this sort of thing out of the box.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Will
>>>>>>>>
>>>>>>>> On 19 April 2013 07:42, Guillermo Marco Puche
>>>>>>>> <guillermo.marco at sistemasgenomicos.com> <mailto:guillermo.marco at sistemasgenomicos.com> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm trying to get the following fields from the VCF input with the --custom
>>>>>>>> flag.
>>>>>>>> I want to add the following columns to the VEP output file:
>>>>>>>>
>>>>>>>> #CHROM POS ID REF ALT QUA
>>>>>>>>
>>>>>>>> From what I've been reading this is possible to achieve using custom flag
>>>>>>>> and VCF input, since third column is used as identifier (ID, ie: rs6054257)
>>>>>>>>
>>>>>>>>
>>>>>>>> I've been trying with the following command:
>>>>>>>>
>>>>>>>> ./variant_effect_predictor.pl <http://variant_effect_predictor.pl> -i myinput.vcf.gz -format vcf -o myoutput.vep
>>>>>>>> --cache --everything --maf_1kg --force_overwrite --plugin
>>>>>>>> Condel,/home/likewise-open/SGNET/gmarco/.vep/Plugins/config/Condel/config,b
>>>>>>>> --custom myinput.vcf.gz,CHROM,vcf,exact,0 --fields
>>>>>>>> CHROM,Existing_variation,AFR_MAF,AMR_MAF,ASN_MAF,EUR_MAF,GMAF,Feature,Feature_type,HGVSc,HGVSp,Consequence,Domains,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,Condel,SIFT,Polyphen,Cell_Type,Canonical,CCDS,Intron,Exon
>>>>>>>>
>>>>>>>> I got an output like this:
>>>>>>>>
>>>>>>>> #CHROM Existing_variation AFR_MAF AMR_MAF ASN_MAF EUR_MAF
>>>>>>>> GMAF Feature Feature_type HGVSc HGVSp Consequence Domains
>>>>>>>> MOTIF_NAME MOTIF_POS HIGH_INF_POS Condel SIFT Polyphen
>>>>>>>> Cell_Type Canonical CCDS Intron Exon
>>>>>>>>
>>>>>>>> 1:6500735-6500735 - - - - - - NM_031475.2 Transcript
>>>>>>>> NM_031475.2:c.725C>T NP_113663.2:p.Thr242Ile missense_variant -
>>>>>>>> - - - deleterious(0.765) deleterious(0.03) - - - -
>>>>>>>> - -
>>>>>>>> 1:6501044-6501044 rs2311045 0.28 0.12 0.21 0.13 G:0.1822
>>>>>>>> ENSR00000074413 RegulatoryFeature - - regulatory_region_variant
>>>>>>>> - - - - - - - - - - - -
>>>>>>>> 1:6501044-6501044 rs2311045 0.28 0.12 0.21 0.13 G:0.1822
>>>>>>>> CCDS70.1 Transcript CCDS70.1:c.909C>G CCDS70.1:c.909C>G(p.=)
>>>>>>>> synonymous_variant - - - - - - - - - CCDS70.1
>>>>>>>> - -
>>>>>>>>
>>>>>>>> Position being show in CHROM column makes no sense to me if it's the key
>>>>>>>> identifier. If you're using the "exact" configuration in custom flag with no
>>>>>>>> overlapping why it's an interval shown?
>>>>>>>>
>>>>>>>> I would like that POS being shown in a second column called POS like in
>>>>>>>> original VCF and so on with the rest of custom missing fields. Output format
>>>>>>>> would be:
>>>>>>>>
>>>>>>>> #CHROM POS ID REF ALT QUA Existing_variation AFR_MAF AMR_MAF
>>>>>>>> ASN_MAF EUR_MAF GMAF Feature Feature_type HGVSc HGVSp
>>>>>>>> Consequence Domains MOTIF_NAME MOTIF_POS HIGH_INF_POS Condel
>>>>>>>> SIFT Polyphen Cell_Type Canonical &nbs
>>>>>>>> p;
>>>>>>>> CCDS Intron Exon
>>>>>>>> chr1 6501044 rs2311045 0.28 0.12 0.21 0.13 G:0.1822
>>>>>>>> ENSR00000074413 RegulatoryFeature - - regulatory_region_variant
>>>>>>>> - - - - - - - - - - - -
>>>>>>>>
>>>>>>>> I've been experiencing errors if I try with the following custom flag:
>>>>>>>> --custom myinput.vcf.gz,CHROM,POS,ID,REF,ALT,QUA,vcf,exact,0
>>>>>>>> I've no idea how to are more than one custom flag at a time, or not even if
>>>>>>>> this is possible. What would be the correct way to do this?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Guillermo.
>>>>>>>>
>>>>>>>> On 04/18/13 13:55, Guillermo Marco Puche wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> --fields command is working flawlessly ! I love it. It has saved me so much
>>>>>>>> work.
>>>>>>>>
>>>>>>>> ./variant_effect_predictor.pl <http://variant_effect_predictor.pl> -i
>>>>>>>> /home/likewise-open/SGNET/gmarco/VEP_71/in/Oto2_collect_not_annotated.vcf -o
>>>>>>>> /home/likewise-open/SGNET/gmarco/VEP_71/out/output.fields -format vcf
>>>>>>>> --cache --everything --maf_1kg --force_overwrite --fork 2 --plugin
>>>>>>>> Condel,/home/likewise-open/SGNET/gmarco/.vep/Plugins/config/Condel/config,b
>>>>>>>> --fields
>>>>>>>> Existing_variation,AFR_MAF,AMR_MAF,ASN_MAF,EUR_MAF,GMAF,Feature,Feature_type,HGVSc,HGVSp,Consequence,Domains,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,Condel,SIFT,Polyphen,Cell_Type,Canonical,CCDS,Intron,Exon
>>>>>>>>
>>>>>>>>
>>>>>>>> Now I need to figure out how to create a final output file which is the
>>>>>>>> relation of VCF input (Chromosome, Position, Ref_Allele, Var_Allele) with
>>>>>>>> the VEP output. To display all variants info for each chromosome.
>>>>>>>>
>>>>>>>> Guillermo.
>>>>>>>>
>>>>>>>> On 04/18/13 10:40, Will McLaren wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> The only way to do this would be to specify each Extra column as a
>>>>>>>> separate column using --fields.
>>>>>>>>
>>>>>>>> Will
>>>>>>>>
>>>>>>>> On 18 April 2013 08:29, Guillermo Marco Puche
>>>>>>>> <guillermo.marco at sistemasgenomicos.com> <mailto:guillermo.marco at sistemasgenomicos.com> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Finally I'm not going to use VCF format as output.
>>>>>>>> From original input VFC I need to print into my output Chromosome, Position,
>>>>>>>> Ref_Allele and Var_Allele columns.
>>>>>>>>
>>>>>>>> I prefer standard VEP column tabbed file for output, since it's much easier
>>>>>>>> to parse "Extra" column because all extra parameters are delimited by ";".
>>>>>>>> Is there any way to force VEP to print empty extra parameters?
>>>>>>>>
>>>>>>>> ie:
>>>>>>>>
>>>>>>>> 1_6508122_G/C 1:6508122 C ENSESTG00000022320 ENSESTT00000056337
>>>>>>>> Transcript downstream_gene_variant - - - - - rs11808508
>>>>>>>> AFR_MAF=;DISTANCE=2305;GMAF=;ASN_MAF=;EUR_MAF=;ENSP=ENSESTP00000056337;CANONICAL=YES;AMR_MAF=
>>>>>>>>
>>>>>>>> Or simply fill print empty extra empty fields with =EMPTY.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Guillermo.
>>>>>>>>
>>>>>>>> On 04/17/13 16:53, Guillermo Marco Puche wrote:
>>>>>>>>
>>>>>>>> Again, thank you so much !
>>>>>>>>
>>>>>>>> I'm looking further VCFTools, maybe it should be the easiest and standard
>>>>>>>> way to parse VCF output from VEP.
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Guillermo.
>>>>>>>>
>>>>>>>> On 04/17/13 16:50, Will McLaren wrote:
>>>>>>>>
>>>>>>>> Yes, you can customise the fields used and the order they appear in
>>>>>>>> with --fields; this applies to both VCF and the normal tab-delimited
>>>>>>>> output.
>>>>>>>>
>>>>>>>> The delimiter is hardcoded I'm afraid, but I'm not sure what you'd
>>>>>>>> pick if you did decide to change it. ";" and "," are already used by
>>>>>>>> the VCF spec, and ":" appears in HGVS notations and other fields.
>>>>>>>>
>>>>>>>> If you did want to change it, you'd just need to edit lines 1272 and
>>>>>>>> 1275 of ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/VEP.pm.
>>>>>>>>
>>>>>>>> Will
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 17 April 2013 15:32, Guillermo Marco Puche
>>>>>>>> <guillermo.marco at sistemasgenomicos.com> <mailto:guillermo.marco at sistemasgenomicos.com> wrote:
>>>>>>>>
>>>>>>>> Hello Will,
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/17/13 14:46, Will McLaren wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> It's difficult (well, in fact impossible) to provide an example where
>>>>>>>> every field is populated, since some field types are mutually
>>>>>>>> exclusive dependent on the feature type overlapped (for example, you
>>>>>>>> will never see the CELL_TYPE field populated for a variant/transcript
>>>>>>>> combination).
>>>>>>>>
>>>>>>>> If you are interested in this for the purposes of how it looks for a
>>>>>>>> parser, you really want to be looking at the header line added to the
>>>>>>>> VCF by the VEP:
>>>>>>>>
>>>>>>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
>>>>>>>> predicted by VEP. Format:
>>>>>>>> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|HGNC|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|CLIN_SIG|CANONICAL|SIFT|PolyPhen|GMAF|ENSP|DOMAINS|CCDS|HGVSc|HGVSp|CELL_TYPE|BLOSUM62|CAROL|Conservation|LinkedVariants|INTERPRO|TSSDistance">
>>>>>>>>
>>>>>>>> This lists the fields that are added in order. Using this you should
>>>>>>>> be able to parse what appears in the body of the file.
>>>>>>>>
>>>>>>>> Here's an example using a bunch of plugins and with the "--everything"
>>>>>>>> flag switched on:
>>>>>>>>
>>>>>>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
>>>>>>>> predicted by VEP. Format:
>>>>>>>> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|HGNC|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|CLIN_SIG|CANONICAL|SIFT|PolyPhen|GMAF|ENSP|DOMAINS|CCDS|HGVSc|HGVSp|CELL_TYPE|BLOSUM62|CAROL|Conservation|LinkedVariants|INTERPRO|TSSDistance">
>>>>>>>> #CHROM POS ID REF ALT QUAL FILTER INFO
>>>>>>>> 21 26960070 rs116645811 G A . .
>>>>>>>>
>>>>>>>> CSQ=|||||||||||||||||||||||||||||||||||,A|ENSG00000154719|ENST00000352957|Transcript|intron_variant||||||rs116645811||9/9|MRPL39||||||||||A:0.0005|ENSP00000284967||CCDS13573.1|ENST00000352957.4:c.969+1077C>T|||||0.840||ENSP00000284967|,A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11||MRPL39|||||||YES|tolerated(0.06)|benign(0.001)|A:0.0005|ENSP00000305682|Low_complexity_(Seg):Seg|CCDS33522.1|ENST00000307301.7:c.1001C>T|ENSP00000305682.7:p.Thr334Met||-1|Neutral(0.940)|0.840||ENSP00000305682|
>>>>>>>>
>>>>>>>> I like this. It won't be so hard to parse it.
>>>>>>>>
>>>>>>>> I've I'm not wrong I can even choose the field order with "--fields" flag.
>>>>>>>> Is this only working for regular VEP column tabbed output file? Does it work
>>>>>>>> with VCF output also?
>>>>>>>>
>>>>>>>> The only thing I don't like is that delimiter being "|" character is also
>>>>>>>> used to fill empty fields. It would be great to change delimiter to another
>>>>>>>> special character so parsing is much easier.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Guillermo.
>>>>>>>>
>>>>>>>> This is from input:
>>>>>>>>
>>>>>>>> #CHROM POS ID REF ALT QUAL FILTER INFO
>>>>>>>> 21 26960070 rs116645811 G A . . .
>>>>>>>>
>>>>>>>> using the command line:
>>>>>>>>
>>>>>>>> perlvariant_effect_predictor.pl <http://variant_effect_predictor.pl> -i test.txt -force -database
>>>>>>>> -everything -vcf -plugin Blosum62 -plugin Carol -plugin Conservation
>>>>>>>> -plugin LD -plugin ProteinDomains -plugin TSSDistance
>>>>>>>>
>>>>>>>> Hope this is a bit clearer!
>>>>>>>>
>>>>>>>> Will
>>>>>>>>
>>>>>>>> On 17 April 2013 11:25, Guillermo Marco Puche
>>>>>>>> <guillermo.marco at sistemasgenomicos.com> <mailto:guillermo.marco at sistemasgenomicos.com> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm looking for an example *.vcf output with ALL the "Extra" parameters.
>>>>>>>> I've generated some with VEP script but i'm missing some extras never being
>>>>>>>> generated like HGNC.
>>>>>>>>
>>>>>>>> A few lines VCF with all values would be enough, since i'm planning to parse
>>>>>>>> "Extra" column.
>>>>>>>>
>>>>>>>> It also would be great if it includes most of the plugins outputs also :)
>>>>>>>>
>>>>>>>> Thank you :)
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Guillermo.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/16/13 18:00, Guillermo Marco Puche wrote:
>>>>>>>>
>>>>>>>> On 04/16/13 14:49, Will McLaren wrote:
>>>>>>>>
>>>>>>>> Hi Guillermo,
>>>>>>>>
>>>>>>>> There's two distinct ways you can add additional data to the output
>>>>>>>> from the VEP.
>>>>>>>>
>>>>>>>> 1) Custom annotations - here you simply provide the VEP with a
>>>>>>>> tabix-indexed position-based data file, and the VEP does the work of
>>>>>>>> finding overlaps with your variant input and the data from the file.
>>>>>>>>
>>>>>>>> 2) Plugins - you write the code to add to or manipulate the internal
>>>>>>>> data structures used by the VEP. In its simplest form, a plugin can be
>>>>>>>> simply looking up an attribute of some object and adding it to the
>>>>>>>> output.
>>>>>>>>
>>>>>>>> Writing a plugin requires a basic understanding of the Ensembl API,
>>>>>>>> but getting a basic plugin working requires only a very small amount
>>>>>>>> of code.
>>>>>>>>
>>>>>>>> Since additional data is being obtained from multiple sources, APIs, files,
>>>>>>>> etc.. I guess plugins are the only way to go for me.
>>>>>>>>
>>>>>>>> The documentation
>>>>>>>> (http://www.ensembl.org/info/docs/variation/vep/vep_script.html#plugins)
>>>>>>>> explains all of this, but the best way to see how plugins work is to
>>>>>>>> look at the existing plugins at
>>>>>>>> https://github.com/ensembl-variation/VEP_plugins. I'd suggest looking
>>>>>>>> at Conservation.pm and ProteinSeqs.pm as some relatively simple
>>>>>>>> examples of retrieving additional data from the API.
>>>>>>>>
>>>>>>>> Where are packages like package Conservation; comming from?
>>>>>>>>
>>>>>>>> You should note that using VCF output you will see repeated elements
>>>>>>>> in the INFO field added, since the plugin gets run once for every
>>>>>>>> variant/transcript overlap; all data appear under the CSQ field in the
>>>>>>>> INFO column. Currently there is no way for the VEP via plugins to add
>>>>>>>> separate INFO fields, however this is something we are looking into,
>>>>>>>> and in fact would be relatively easy to "hack" in for someone
>>>>>>>> determined enough (see subroutine vf_list_to_cons in
>>>>>>>> Bio::EnsEMBL::Variation::Utils::VEP).
>>>>>>>>
>>>>>>>> I'll look further into this tomorrow since I've to go now.
>>>>>>>>
>>>>>>>> A workaround could be simply generating a temp file with extra columns and
>>>>>>>> in the end merge original VCF from VEP script with the output from plugins
>>>>>>>> for additional columns.
>>>>>>>>
>>>>>>>> Maybe I missunderstood you. Correct me if i'm wrong please.
>>>>>>>>
>>>>>>>> Hope this helps, and feel free to ask further questions!
>>>>>>>>
>>>>>>>> Will McLaren
>>>>>>>> Ensembl Variation
>>>>>>>>
>>>>>>>> Thank you so much.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Guillermo.
>>>>>>>>
>>>>>>>> On 16 April 2013 12:58, Guillermo Marco Puche
>>>>>>>> <guillermo.marco at sistemasgenomicos.com> <mailto:guillermo.marco at sistemasgenomicos.com> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm in need to develop some extra features for VEP.
>>>>>>>>
>>>>>>>> My input files are in VCF format and also my output.
>>>>>>>>
>>>>>>>> But I want to add several additional columns for extra data at the VCF out.
>>>>>>>>
>>>>>>>> For example,AA conservation score, Biobase description, Biobase link, MAF
>>>>>>>> populations, Flanking sequence, Gene description, InterPro_ID and more..
>>>>>>>>
>>>>>>>> I've been reading the documents and I'm a bit confused about "Custom
>>>>>>>> annotations".
>>>>>>>> I think since the data I want is extra on the output and not in the input,
>>>>>>>> what I should do is develop several Plugins to obtain all the values I need.
>>>>>>>>
>>>>>>>> I think most of them can be obtained through the Ensembl API even if I'm new
>>>>>>>> to this. Other will require more hard coding.
>>>>>>>>
>>>>>>>> I hope someone can clarify me a bit on this matter.
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Guillermo.
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Dev mailing listDev at ensembl.org <mailto:Dev at ensembl.org>
>>>>>>>> Posting guidelines and subscribe/unsubscribe info:
>>>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>>>> Ensembl Blog:http://www.ensembl.info/
>>
>
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20130424/b1ac5d50/attachment.html>
More information about the Dev
mailing list