[ensembl-dev] VEP Extra output information

Mon Apr 22 16:33:38 BST 2013

Hello,

Thank you for that information Will. It's so useful.

Ok I've managed to almost finish the plugin:
https://github.com/guillermomarco/vcf_input/blob/master/vcf_input.pm

I'm still getting an error with sprintf when trying to run it with VEP.

2013-04-22 17:18:07 - Warning: plugin vcf_input,/home/likewise-open/SGNET/gmarco/scripts/genomics/global/annotation/ensembl71/VCF_Input/test.vcf.gz version (0.1) does not match the current VEP version (71)
2013-04-22 17:18:07 - You may experience unexpected behaviour with this plugin
2013-04-22 17:18:07 - Loaded plugin: vcf_input
2013-04-22 17:18:07 - Output fields redefined (26 defined)
2013-04-22 17:18:09 - INFO: Database will be accessed when using --hgvs
2013-04-22 17:18:09 - Starting...
2013-04-22 17:18:09 - Read 3 variants into buffer
2013-04-22 17:18:09 - Reading transcript data from cache and/or database
[===============================================]  [ 100% ]
2013-04-22 17:18:09 - Retrieved 74 transcripts (0 mem, 74 cached, 0 DB, 0 duplicates)
2013-04-22 17:18:09 - Reading regulatory data from cache and/or database
[===============================================]  [ 100% ]
2013-04-22 17:18:09 - Retrieved 539 regulatory features (0 mem, 539 cached, 0 DB, 0 duplicates)
2013-04-22 17:18:09 - Calculating consequences
[>                                              ]    [ 5% ]
ERROR: Forked process failed
Use of uninitialized value in sprintf at /home/likewise-open/SGNET/gmarco/.vep/Plugins/vcf_input.pm line 69.

Thank you.

Best regards,
Guillermo.

On 04/22/13 14:34, Will McLaren wrote:

> Hello,
>
> The plugin is run once for each combination of variant and overlapped feature.
>
> Let's say your variant overlaps 4 transcripts (they may be splice
> variants of the same gene, or 4 different genes, the principal is the
> same). In this case, the plugin's "run" method will be executed 4
> times for that variant, once for each transcript. The $tva
> TranscriptVariationAllele object in each run will have a different
> transcript "attached" to it. I'd recommend stepping through how a
> plugin works using Perl's debugger - simply add the line:
>
> $DB::single = 1;
>
> somewhere at the start of the run method, then run the vep with perl's -d flag:
>
> perl -d variant_effect_predictor.pl [etc]
>
> I'm not sure, but it seems like you're preloading all of the VCF in
> your plugin at the beginning, and then trying to add the info one at a
> time. This is not an ideal way to do it, as for large VCF files you
> may run in to memory usage issues, and especially using an array you
> may not be able to reliably link the lines from your VCF to the lines
> of input going into the VEP (a hash keyed on position would be much
> better).
>
> A much better way to do it would be to prepare the VCF file with
> tabix. The tabix utility can then retrieve just the relevant line from
> the VCF on demand (this is very quick, and you can cache data on the
> plugin's hash structure between separate executions of the "run"
> method).
>
> In the dbNSFP.pm plugin on GitHub, I do something very very similar to
> this. First, I get the variation feature object being passed to the
> "run" method - this contains the genomic coordinates of the current
> variant:
>
> my $vf = $tva->variation_feature;
>
> I then create a string to pass to tabix, which is chr:start-end; this
> means tabix will retrieve the lines from your VCF in that range:
>
> my $pos_string = sprintf("%s:%i-%i", $vf->{chr}, $vf->{start}, $vf->{end});
>
> I then run tabix and open the output as a pipe:
>
> open TABIX, sprintf("tabix %s %s |", $self->{file}, $pos_string);
>
> I can then read lines of VCF from the <TABIX> filehandle, parse them,
> and finally add the data to the %return hash that gets sent back at
> the end of the plugin.
>
> This return hash must contain key/value pairs that will be printed in
> the output. For example, lets say I want to add the variant name from
> the VCF file I've just parsed:
>
> return {
>    "VAR_NAME" => $var_name,
> }
>
> where $var_name = 'rs123'. Then in the output you would see (normal,
> tab-delimited):
>
> VAR_NAME=rs123
>
> appear in the Extra column of your output file. You could add multiple
> values for VAR_NAME, but you'd have to write that as a string, for
> example:
>
> return {
>    "VAR_NAME" => join(",", ($var_name1, $var_name2)),
> }
>
> which would give you e.g.
>
> VAR_NAME=rs123,rs456
>
> I'm afraid also at this juncture I have to point out that I'm nearing
> the limit of support I'm meant to be giving out to one individual.
> While we are here to help and will answer any reasonable questions you
> have, we have to stop short of doing people's jobs for them! Anything
> more than a base level of help might have to be considered as a
> collaboration, and this would require communication between our
> respective supervisors.
>
> I hope that the documentation on the website and the example code
> (which we try to comment as thoroughly as we can) should be enough to
> keep you going, and of course I don't want to put you off using and
> helping us improve the VEP.
>
> Regards
>
> Will
>
> On 22 April 2013 12:59, Guillermo Marco Puche
> <guillermo.marco at sistemasgenomicos.com> wrote:
>> Hello,
>>
>> I'm starting to develop a simple plugin to write the INFO column from VCF
>> input into VEP output.
>> As far as I've been seen in Git VEP Plugin repo, VEP script will right the
>> value returned by the plugin in run function.
>>
>> Suppose that I've an array of values with the INFO column. Something like
>> this:
>> my @info_column = ("info_row1","info_row2","info_row3")
>>
>> An array containing the content of INFO column for each line of VCF input.
>> How do I associate each value to the corresponding VEP line output?
>> I guess I cannot simply return the array as the result of my plugin.
>>
>> Is plugin executed for every line by VEP script?
>>
>>
>> Thank you.
>>
>> Best regards,
>> Guillermo.
>>
>> On 04/22/13 12:07, Guillermo Marco Puche wrote:
>>
>> Hello,
>>
>> Me neither. So I've no clue. I hope someone else can help me.
>>
>> I've also been looking the plugin code you mentioned.
>> I don't really see how to extract the columns from input VCF and intersect
>> them with VEP output.
>>
>> Regards,
>> Guillermo.
>>
>> On 04/22/13 12:01, Will McLaren wrote:
>>
>> Hello,
>>
>> I haven't used VCFannotate myself, perhaps I was wrong!
>>
>> I know of other VEP users who have used it though, maybe someone on
>> the list will read this email and can give you some help.
>>
>> Cheers
>>
>> Will
>>
>> On 22 April 2013 10:58, Guillermo Marco Puche
>> <guillermo.marco at sistemasgenomicos.com> wrote:
>>
>> Hello Will,
>>
>> It seems VCFannotate is made for "Intersect the records in the VCF file with
>> targets provided in a BED file.".
>> How I'm supposed to intersect the output from vep script (VCF or VEP file)
>> with my input file VCF?
>>
>>
>> Thank you.
>>
>> Best regards,
>> Guillermo.
>>
>> On 04/19/13 11:31, Will McLaren wrote:
>>
>> Hi Guillermo,
>>
>> The --custom system doesn't quite work like that. Currently it is set
>> up to either provide only the ID or the coordinates of any features it
>> finds overlapping your variants in the custom file. It can't pull
>> particular fields from a VCF in the way you describe here.
>>
>> To do so, you'd either have to write a plugin to do this (see the
>> dbNSFP.pm plugin for an example of doing similar), or use VCFannotate,
>> which I believe can do this sort of thing out of the box.
>>
>> Regards
>>
>> Will
>>
>> On 19 April 2013 07:42, Guillermo Marco Puche
>> <guillermo.marco at sistemasgenomicos.com> wrote:
>>
>> Hello,
>>
>> I'm trying to get the following fields from the VCF input with the --custom
>> flag.
>> I want to add the following columns to the VEP output file:
>>
>> #CHROM	POS	ID	REF	ALT	QUA
>>
>>  From what I've been reading this is possible to achieve using custom flag
>> and VCF input, since third column is used as identifier (ID, ie: rs6054257)
>>
>>
>> I've been trying with the following command:
>>
>> ./variant_effect_predictor.pl -i myinput.vcf.gz -format vcf -o myoutput.vep
>> --cache --everything --maf_1kg --force_overwrite --plugin
>> Condel,/home/likewise-open/SGNET/gmarco/.vep/Plugins/config/Condel/config,b
>> --custom myinput.vcf.gz,CHROM,vcf,exact,0 --fields
>> CHROM,Existing_variation,AFR_MAF,AMR_MAF,ASN_MAF,EUR_MAF,GMAF,Feature,Feature_type,HGVSc,HGVSp,Consequence,Domains,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,Condel,SIFT,Polyphen,Cell_Type,Canonical,CCDS,Intron,Exon
>>
>> I got an output like this:
>>
>> #CHROM    Existing_variation    AFR_MAF    AMR_MAF    ASN_MAF    EUR_MAF
>> GMAF    Feature    Feature_type    HGVSc    HGVSp    Consequence    Domains
>> MOTIF_NAME    MOTIF_POS    HIGH_INF_POS    Condel    SIFT    Polyphen
>> Cell_Type    Canonical    CCDS    Intron    Exon
>>
>> 1:6500735-6500735    -    -    -    -    -    -    NM_031475.2    Transcript
>> NM_031475.2:c.725C>T    NP_113663.2:p.Thr242Ile    missense_variant    -
>> -    -    -    deleterious(0.765)    deleterious(0.03)    -    -    -    -
>> -    -
>> 1:6501044-6501044    rs2311045    0.28    0.12    0.21    0.13    G:0.1822
>> ENSR00000074413    RegulatoryFeature    -    -    regulatory_region_variant
>> -    -    -    -    -    -    -    -    -    -    -    -
>> 1:6501044-6501044    rs2311045    0.28    0.12    0.21    0.13    G:0.1822
>> CCDS70.1    Transcript    CCDS70.1:c.909C>G    CCDS70.1:c.909C>G(p.=)
>> synonymous_variant    -    -    -    -    -    -    -    -    -    CCDS70.1
>> -    -
>>
>> Position being show in CHROM column makes no sense to me if it's the key
>> identifier. If you're using the "exact" configuration in custom flag with no
>> overlapping why it's an interval shown?
>>
>> I would like that POS being shown in a second column called POS like in
>> original VCF and so on with the rest of custom missing fields. Output format
>> would be:
>>
>> #CHROM	POS	ID	REF	ALT	QUA	Existing_variation    AFR_MAF    AMR_MAF
>> ASN_MAF    EUR_MAF    GMAF    Feature    Feature_type    HGVSc    HGVSp
>> Consequence    Domains    MOTIF_NAME    MOTIF_POS    HIGH_INF_POS    Condel
>> SIFT    Polyphen    Cell_Type    Canonical &nbs
>>   p;
>> CCDS    Intron    Exon
>> chr1	6501044	rs2311045 0.28    0.12    0.21    0.13    G:0.1822
>> ENSR00000074413    RegulatoryFeature    -    -    regulatory_region_variant
>> -    -    -    -    -    -    -    -    -    -    -    -
>>
>> I've been experiencing errors if I try with the following custom flag:
>> --custom myinput.vcf.gz,CHROM,POS,ID,REF,ALT,QUA,vcf,exact,0
>> I've no idea how to are more than one custom flag at a time, or not even if
>> this is possible. What would be the correct way to do this?
>>
>>
>> Thank you.
>>
>> Best regards,
>> Guillermo.
>>
>> On 04/18/13 13:55, Guillermo Marco Puche wrote:
>>
>> Hello,
>>
>> --fields command is working flawlessly ! I love it. It has saved me so much
>> work.
>>
>> ./variant_effect_predictor.pl -i
>> /home/likewise-open/SGNET/gmarco/VEP_71/in/Oto2_collect_not_annotated.vcf -o
>> /home/likewise-open/SGNET/gmarco/VEP_71/out/output.fields -format vcf
>> --cache --everything --maf_1kg --force_overwrite --fork 2 --plugin
>> Condel,/home/likewise-open/SGNET/gmarco/.vep/Plugins/config/Condel/config,b
>> --fields
>> Existing_variation,AFR_MAF,AMR_MAF,ASN_MAF,EUR_MAF,GMAF,Feature,Feature_type,HGVSc,HGVSp,Consequence,Domains,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,Condel,SIFT,Polyphen,Cell_Type,Canonical,CCDS,Intron,Exon
>>
>>
>> Now I need to figure out how to create a final output file which is the
>> relation of VCF input (Chromosome, Position, Ref_Allele, Var_Allele) with
>> the VEP output. To display all variants info for each chromosome.
>>
>> Guillermo.
>>
>> On 04/18/13 10:40, Will McLaren wrote:
>>
>> Hello,
>>
>> The only way to do this would be to specify each Extra column as a
>> separate column using --fields.
>>
>> Will
>>
>> On 18 April 2013 08:29, Guillermo Marco Puche
>> <guillermo.marco at sistemasgenomicos.com> wrote:
>>
>> Hello,
>>
>> Finally I'm not going to use VCF format as output.
>>  From original input VFC I need to print into my output Chromosome, Position,
>> Ref_Allele and Var_Allele columns.
>>
>> I prefer standard VEP column tabbed file for output, since it's much easier
>> to parse "Extra" column because all extra parameters are delimited by ";".
>> Is there any way to force VEP to print empty extra parameters?
>>
>> ie:
>>
>> 1_6508122_G/C    1:6508122    C    ENSESTG00000022320    ENSESTT00000056337
>> Transcript    downstream_gene_variant    -    -    -    -    -    rs11808508
>> AFR_MAF=;DISTANCE=2305;GMAF=;ASN_MAF=;EUR_MAF=;ENSP=ENSESTP00000056337;CANONICAL=YES;AMR_MAF=
>>
>> Or simply fill print empty extra empty fields with =EMPTY.
>>
>>
>> Thank you.
>>
>> Best regards,
>> Guillermo.
>>
>> On 04/17/13 16:53, Guillermo Marco Puche wrote:
>>
>> Again, thank you so much !
>>
>> I'm looking further VCFTools, maybe it should be the easiest and standard
>> way to parse VCF output from VEP.
>>
>> Thank you.
>>
>> Best regards,
>> Guillermo.
>>
>> On 04/17/13 16:50, Will McLaren wrote:
>>
>> Yes, you can customise the fields used and the order they appear in
>> with --fields; this applies to both VCF and the normal tab-delimited
>> output.
>>
>> The delimiter is hardcoded I'm afraid, but I'm not sure what you'd
>> pick if you did decide to change it. ";" and "," are already used by
>> the VCF spec, and ":" appears in HGVS notations and other fields.
>>
>> If you did want to change it, you'd just need to edit lines 1272 and
>> 1275 of ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/VEP.pm.
>>
>> Will
>>
>>
>>
>> On 17 April 2013 15:32, Guillermo Marco Puche
>> <guillermo.marco at sistemasgenomicos.com> wrote:
>>
>> Hello Will,
>>
>>
>> On 04/17/13 14:46, Will McLaren wrote:
>>
>> Hello,
>>
>> It's difficult (well, in fact impossible) to provide an example where
>> every field is populated, since some field types are mutually
>> exclusive dependent on the feature type overlapped (for example, you
>> will never see the CELL_TYPE field populated for a variant/transcript
>> combination).
>>
>> If you are interested in this for the purposes of how it looks for a
>> parser, you really want to be looking at the header line added to the
>> VCF by the VEP:
>>
>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
>> predicted by VEP. Format:
>> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|HGNC|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|CLIN_SIG|CANONICAL|SIFT|PolyPhen|GMAF|ENSP|DOMAINS|CCDS|HGVSc|HGVSp|CELL_TYPE|BLOSUM62|CAROL|Conservation|LinkedVariants|INTERPRO|TSSDistance">
>>
>> This lists the fields that are added in order. Using this you should
>> be able to parse what appears in the body of the file.
>>
>> Here's an example using a bunch of plugins and with the "--everything"
>> flag switched on:
>>
>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
>> predicted by VEP. Format:
>> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|HGNC|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|CLIN_SIG|CANONICAL|SIFT|PolyPhen|GMAF|ENSP|DOMAINS|CCDS|HGVSc|HGVSp|CELL_TYPE|BLOSUM62|CAROL|Conservation|LinkedVariants|INTERPRO|TSSDistance">
>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>> 21      26960070        rs116645811     G       A       .       .
>>
>> CSQ=|||||||||||||||||||||||||||||||||||,A|ENSG00000154719|ENST00000352957|Transcript|intron_variant||||||rs116645811||9/9|MRPL39||||||||||A:0.0005|ENSP00000284967||CCDS13573.1|ENST00000352957.4:c.969+1077C>T|||||0.840||ENSP00000284967|,A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11||MRPL39|||||||YES|tolerated(0.06)|benign(0.001)|A:0.0005|ENSP00000305682|Low_complexity_(Seg):Seg|CCDS33522.1|ENST00000307301.7:c.1001C>T|ENSP00000305682.7:p.Thr334Met||-1|Neutral(0.940)|0.840||ENSP00000305682|
>>
>> I like this. It won't be so hard to parse it.
>>
>> I've I'm not wrong I can even choose the field order with "--fields" flag.
>> Is this only working for regular VEP column tabbed output file? Does it work
>> with VCF output also?
>>
>> The only thing I don't like is that delimiter being "|" character is also
>> used to fill empty fields. It would be great to change delimiter to another
>> special character so parsing is much easier.
>>
>>
>> Thank you.
>>
>> Best regards,
>> Guillermo.
>>
>> This is from input:
>>
>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>> 21      26960070        rs116645811     G       A       .       .       .
>>
>> using the command line:
>>
>> perl variant_effect_predictor.pl -i test.txt -force -database
>> -everything -vcf -plugin Blosum62 -plugin Carol -plugin Conservation
>> -plugin LD -plugin ProteinDomains -plugin TSSDistance
>>
>> Hope this is a bit clearer!
>>
>> Will
>>
>> On 17 April 2013 11:25, Guillermo Marco Puche
>> <guillermo.marco at sistemasgenomicos.com> wrote:
>>
>> Hello,
>>
>> I'm looking for an example *.vcf output with ALL the "Extra" parameters.
>> I've generated some with VEP script but i'm missing some extras never being
>> generated like HGNC.
>>
>> A few lines VCF with all values would be enough, since i'm planning to parse
>> "Extra" column.
>>
>> It also would be great if it includes most of the plugins outputs also :)
>>
>> Thank you :)
>>
>> Best regards,
>> Guillermo.
>>
>>
>> On 04/16/13 18:00, Guillermo Marco Puche wrote:
>>
>> On 04/16/13 14:49, Will McLaren wrote:
>>
>> Hi Guillermo,
>>
>> There's two distinct ways you can add additional data to the output
>> from the VEP.
>>
>> 1) Custom annotations - here you simply provide the VEP with a
>> tabix-indexed position-based data file, and the VEP does the work of
>> finding overlaps with your variant input and the data from the file.
>>
>> 2) Plugins - you write the code to add to or manipulate the internal
>> data structures used by the VEP. In its simplest form, a plugin can be
>> simply looking up an attribute of some object and adding it to the
>> output.
>>
>> Writing a plugin requires a basic understanding of the Ensembl API,
>> but getting a basic plugin working requires only a very small amount
>> of code.
>>
>> Since additional data is being obtained from multiple sources, APIs, files,
>> etc.. I guess plugins are the only way to go for me.
>>
>> The documentation
>> (http://www.ensembl.org/info/docs/variation/vep/vep_script.html#plugins)
>> explains all of this, but the best way to see how plugins work is to
>> look at the existing plugins at
>> https://github.com/ensembl-variation/VEP_plugins. I'd suggest looking
>> at Conservation.pm and ProteinSeqs.pm as some relatively simple
>> examples of retrieving additional data from the API.
>>
>> Where are packages like package Conservation; comming from?
>>
>> You should note that using VCF output you will see repeated elements
>> in the INFO field added, since the plugin gets run once for every
>> variant/transcript overlap; all data appear under the CSQ field in the
>> INFO column. Currently there is no way for the VEP via plugins to add
>> separate INFO fields, however this is something we are looking into,
>> and in fact would be relatively easy to "hack" in for someone
>> determined enough (see subroutine vf_list_to_cons in
>> Bio::EnsEMBL::Variation::Utils::VEP).
>>
>> I'll look further into this tomorrow since I've to go now.
>>
>> A workaround could be simply generating a temp file with extra columns and
>> in the end merge original VCF from VEP script with the output from plugins
>> for additional columns.
>>
>> Maybe I missunderstood you. Correct me if i'm wrong please.
>>
>> Hope this helps, and feel free to ask further questions!
>>
>> Will McLaren
>> Ensembl Variation
>>
>> Thank you so much.
>>
>> Best regards,
>> Guillermo.
>>
>> On 16 April 2013 12:58, Guillermo Marco Puche
>> <guillermo.marco at sistemasgenomicos.com> wrote:
>>
>> Hello,
>>
>> I'm in need to develop some extra features for VEP.
>>
>> My input files are in VCF format and also my output.
>>
>> But I want to add several additional columns for extra data at the VCF out.
>>
>> For example,AA conservation score, Biobase description, Biobase link, MAF
>> populations, Flanking sequence, Gene description, InterPro_ID and more..
>>
>> I've been reading the documents and I'm a bit confused about "Custom
>> annotations".
>> I think since the data I want is extra on the output and not in the input,
>> what I should do is develop several Plugins to obtain all the values I need.
>>
>> I think most of them can be obtained through the Ensembl API even if I'm new
>> to this. Other will require more hard coding.
>>
>> I hope someone can clarify me a bit on this matter.
>>
>> Thank you.
>>
>> Best regards,
>> Guillermo.
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20130422/5ffb24b5/attachment.html>