[ensembl-dev] VEP Extra output information

Guillermo Marco Puche guillermo.marco at sistemasgenomicos.com
Mon Apr 22 10:58:51 BST 2013


Hello Will,

It seems VCFannotate is made to "Intersect the records in the VCF file
with targets provided in a BED file."
How am I supposed to intersect the output from the VEP script (VCF or VEP
file) with my input VCF file?
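
For reference, here is a minimal Python sketch of the kind of join I have in
mind, done outside VEP. It assumes the default tab-delimited VEP output with
the Location column (chr:pos or chr:start-end) in second position, that
chromosome names match between the two files, and that the variants are SNVs
(for indels the VEP location can differ from the VCF POS). The function names
index_vcf and join_vep_to_vcf are hypothetical, not part of VEP or VCFtools.

```python
def index_vcf(lines):
    """Map (chrom, pos) -> the first six VCF columns of each record."""
    index = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        index[(cols[0], cols[1])] = cols[:6]
    return index


def join_vep_to_vcf(vep_lines, vcf_index):
    """Yield the VCF columns followed by the VEP columns for each VEP row.

    Assumption: column 2 of the VEP row is Location, e.g. "1:6508122"
    or "1:6501044-6501044". Rows with no matching VCF record are skipped.
    """
    for line in vep_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        chrom, _, span = cols[1].partition(":")
        start = span.split("-")[0]
        vcf_cols = vcf_index.get((chrom, start))
        if vcf_cols is not None:
            yield vcf_cols + cols


# Tiny in-memory example; real use would pass open file handles instead.
vcf = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\n", "1\t6508122\t.\tG\tC\t50\n"]
vep = ["#Uploaded_variation\tLocation\tAllele\n", "1_6508122_G/C\t1:6508122\tC\n"]
rows = list(join_vep_to_vcf(vep, index_vcf(vcf)))
```

Each joined row starts with CHROM, POS, ID, REF, ALT and QUAL from the input
VCF, followed by the untouched VEP columns.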

Thank you.

Best regards,
Guillermo.

On 04/19/13 11:31, Will McLaren wrote:
> Hi Guillermo,
>
> The --custom system doesn't quite work like that. Currently it is set
> up to either provide only the ID or the coordinates of any features it
> finds overlapping your variants in the custom file. It can't pull
> particular fields from a VCF in the way you describe here.
>
> To do so, you'd either have to write a plugin to do this (see the
> dbNSFP.pm plugin for an example of doing similar), or use VCFannotate,
> which I believe can do this sort of thing out of the box.
>
> Regards
>
> Will
>
> On 19 April 2013 07:42, Guillermo Marco Puche
> <guillermo.marco at sistemasgenomicos.com> wrote:
>> Hello,
>>
>> I'm trying to get the following fields from the VCF input with the --custom
>> flag.
>> I want to add the following columns to the VEP output file:
>>
>> #CHROM	POS	ID	REF	ALT	QUA
>>
>> From what I've been reading, this should be possible using the --custom flag
>> with VCF input, since the third column is used as the identifier (ID, e.g. rs6054257).
>>
>>
>> I've been trying with the following command:
>>
>> ./variant_effect_predictor.pl -i myinput.vcf.gz -format vcf -o myoutput.vep
>> --cache --everything --maf_1kg --force_overwrite --plugin
>> Condel,/home/likewise-open/SGNET/gmarco/.vep/Plugins/config/Condel/config,b
>> --custom myinput.vcf.gz,CHROM,vcf,exact,0 --fields
>> CHROM,Existing_variation,AFR_MAF,AMR_MAF,ASN_MAF,EUR_MAF,GMAF,Feature,Feature_type,HGVSc,HGVSp,Consequence,Domains,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,Condel,SIFT,Polyphen,Cell_Type,Canonical,CCDS,Intron,Exon
>>
>> I got an output like this:
>>
>> #CHROM    Existing_variation    AFR_MAF    AMR_MAF    ASN_MAF    EUR_MAF
>> GMAF    Feature    Feature_type    HGVSc    HGVSp    Consequence    Domains
>> MOTIF_NAME    MOTIF_POS    HIGH_INF_POS    Condel    SIFT    Polyphen
>> Cell_Type    Canonical    CCDS    Intron    Exon
>>
>> 1:6500735-6500735    -    -    -    -    -    -    NM_031475.2    Transcript
>> NM_031475.2:c.725C>T    NP_113663.2:p.Thr242Ile    missense_variant    -
>> -    -    -    deleterious(0.765)    deleterious(0.03)    -    -    -    -
>> -    -
>> 1:6501044-6501044    rs2311045    0.28    0.12    0.21    0.13    G:0.1822
>> ENSR00000074413    RegulatoryFeature    -    -    regulatory_region_variant
>> -    -    -    -    -    -    -    -    -    -    -    -
>> 1:6501044-6501044    rs2311045    0.28    0.12    0.21    0.13    G:0.1822
>> CCDS70.1    Transcript    CCDS70.1:c.909C>G    CCDS70.1:c.909C>G(p.=)
>> synonymous_variant    -    -    -    -    -    -    -    -    -    CCDS70.1
>> -    -
>>
>> The position shown in the CHROM column makes no sense to me if it's the key
>> identifier. If I'm using the "exact" setting in the custom flag, with no
>> overlapping, why is an interval shown?
>>
>> I would like POS to be shown in a second column called POS, as in the
>> original VCF, and likewise for the rest of the missing custom fields. The
>> output format would be:
>>
>> #CHROM	POS	ID	REF	ALT	QUA	Existing_variation    AFR_MAF    AMR_MAF
>> ASN_MAF    EUR_MAF    GMAF    Feature    Feature_type    HGVSc    HGVSp
>> Consequence    Domains    MOTIF_NAME    MOTIF_POS    HIGH_INF_POS    Condel
>> SIFT    Polyphen    Cell_Type    Canonical    CCDS    Intron    Exon
>> chr1	6501044	rs2311045 0.28    0.12    0.21    0.13    G:0.1822
>> ENSR00000074413    RegulatoryFeature    -    -    regulatory_region_variant
>> -    -    -    -    -    -    -    -    -    -    -    -
>>
>> I've been experiencing errors if I try with the following custom flag:
>> --custom myinput.vcf.gz,CHROM,POS,ID,REF,ALT,QUA,vcf,exact,0
>> I've no idea how to use more than one custom flag at a time, or even whether
>> this is possible. What would be the correct way to do this?
>>
>>
>> Thank you.
>>
>> Best regards,
>> Guillermo.
>>
>> On 04/18/13 13:55, Guillermo Marco Puche wrote:
>>
>> Hello,
>>
>> The --fields option is working flawlessly! I love it. It has saved me so much
>> work.
>>
>> ./variant_effect_predictor.pl -i
>> /home/likewise-open/SGNET/gmarco/VEP_71/in/Oto2_collect_not_annotated.vcf -o
>> /home/likewise-open/SGNET/gmarco/VEP_71/out/output.fields -format vcf
>> --cache --everything --maf_1kg --force_overwrite --fork 2 --plugin
>> Condel,/home/likewise-open/SGNET/gmarco/.vep/Plugins/config/Condel/config,b
>> --fields
>> Existing_variation,AFR_MAF,AMR_MAF,ASN_MAF,EUR_MAF,GMAF,Feature,Feature_type,HGVSc,HGVSp,Consequence,Domains,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,Condel,SIFT,Polyphen,Cell_Type,Canonical,CCDS,Intron,Exon
>>
>>
>> Now I need to figure out how to create a final output file that relates
>> the VCF input (Chromosome, Position, Ref_Allele, Var_Allele) to the VEP
>> output, so that all variant information is displayed for each chromosome.
>>
>> Guillermo.
>>
>> On 04/18/13 10:40, Will McLaren wrote:
>>
>> Hello,
>>
>> The only way to do this would be to specify each Extra column as a
>> separate column using --fields.
>>
>> Will
>>
>> On 18 April 2013 08:29, Guillermo Marco Puche
>> <guillermo.marco at sistemasgenomicos.com> wrote:
>>
>> Hello,
>>
>> In the end, I'm not going to use VCF as the output format.
>> From the original input VCF I need to print the Chromosome, Position,
>> Ref_Allele and Var_Allele columns into my output.
>>
>> I prefer standard VEP column tabbed file for output, since it's much easier
>> to parse "Extra" column because all extra parameters are delimited by ";".
>> Is there any way to force VEP to print empty extra parameters?
>>
>> ie:
>>
>> 1_6508122_G/C    1:6508122    C    ENSESTG00000022320    ENSESTT00000056337
>> Transcript    downstream_gene_variant    -    -    -    -    -    rs11808508
>> AFR_MAF=;DISTANCE=2305;GMAF=;ASN_MAF=;EUR_MAF=;ENSP=ENSESTP00000056337;CANONICAL=YES;AMR_MAF=
>>
>> Or simply fill empty extra fields with =EMPTY.
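
If VEP itself cannot print the empty parameters, the same effect could be
obtained by post-processing the Extra column. This is only a sketch under
stated assumptions: the Extra column is a ";"-separated list of key=value
pairs (or "-" when empty), and the list of expected keys is supplied by the
caller. The helper name expand_extra and the "EMPTY" placeholder are
hypothetical, not VEP features.

```python
def expand_extra(extra, keys, placeholder="EMPTY"):
    """Rewrite a VEP Extra value so every key in `keys` appears, in order.

    Keys missing from `extra` (or present with an empty value) are
    written as key=<placeholder>.
    """
    present = {}
    if extra != "-":  # "-" is what VEP prints for an empty column
        for item in extra.split(";"):
            key, _, value = item.partition("=")
            present[key] = value
    return ";".join(f"{k}={present.get(k) or placeholder}" for k in keys)


wanted = ["AFR_MAF", "DISTANCE", "GMAF", "CANONICAL"]
filled = expand_extra("DISTANCE=2305;CANONICAL=YES", wanted)
```

With the sample above, filled is
"AFR_MAF=EMPTY;DISTANCE=2305;GMAF=EMPTY;CANONICAL=YES". Keys in the Extra
column that are not listed in wanted are dropped, so the key list needs to be
complete.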
>>
>>
>> Thank you.
>>
>> Best regards,
>> Guillermo.
>>
>> On 04/17/13 16:53, Guillermo Marco Puche wrote:
>>
>> Again, thank you so much !
>>
>> I'm looking further into VCFtools; it may be the easiest and most standard
>> way to parse the VCF output from VEP.
>>
>> Thank you.
>>
>> Best regards,
>> Guillermo.
>>
>> On 04/17/13 16:50, Will McLaren wrote:
>>
>> Yes, you can customise the fields used and the order they appear in
>> with --fields; this applies to both VCF and the normal tab-delimited
>> output.
>>
>> The delimiter is hardcoded I'm afraid, but I'm not sure what you'd
>> pick if you did decide to change it. ";" and "," are already used by
>> the VCF spec, and ":" appears in HGVS notations and other fields.
>>
>> If you did want to change it, you'd just need to edit lines 1272 and
>> 1275 of ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/VEP.pm.
>>
>> Will
>>
>>
>>
>> On 17 April 2013 15:32, Guillermo Marco Puche
>> <guillermo.marco at sistemasgenomicos.com> wrote:
>>
>> Hello Will,
>>
>>
>> On 04/17/13 14:46, Will McLaren wrote:
>>
>> Hello,
>>
>> It's difficult (well, in fact impossible) to provide an example where
>> every field is populated, since some field types are mutually
>> exclusive dependent on the feature type overlapped (for example, you
>> will never see the CELL_TYPE field populated for a variant/transcript
>> combination).
>>
>> If you are interested in this for the purposes of how it looks for a
>> parser, you really want to be looking at the header line added to the
>> VCF by the VEP:
>>
>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
>> predicted by VEP. Format:
>> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|HGNC|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|CLIN_SIG|CANONICAL|SIFT|PolyPhen|GMAF|ENSP|DOMAINS|CCDS|HGVSc|HGVSp|CELL_TYPE|BLOSUM62|CAROL|Conservation|LinkedVariants|INTERPRO|TSSDistance">
>>
>> This lists the fields that are added in order. Using this you should
>> be able to parse what appears in the body of the file.
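
The header-driven parsing described above can be sketched in Python roughly as
follows. It assumes the ##INFO header is a single line in the real file (it is
only wrapped in this email), that consequence blocks are separated by "," and
fields by "|", and that field values themselves contain neither character.
The function names csq_fields and parse_csq are hypothetical.

```python
import re


def csq_fields(header_line):
    """Extract the ordered field names from the CSQ ##INFO header line."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    return match.group(1).strip().split("|") if match else []


def parse_csq(info, fields):
    """Return one dict per consequence block found in a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            return [dict(zip(fields, block.split("|")))
                    for block in entry[len("CSQ="):].split(",")]
    return []  # no CSQ entry in this INFO value


# Shortened header with three fields, for illustration only.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type '
          'as predicted by VEP. Format: Allele|Gene|Feature">')
fields = csq_fields(header)
records = parse_csq("DP=10;CSQ=A|ENSG1|ENST1,A|ENSG1|", fields)
```

Empty fields come back as empty strings, so a parser keyed on the header
order does not need a special placeholder for missing values.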
>>
>> Here's an example using a bunch of plugins and with the "--everything"
>> flag switched on:
>>
>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
>> predicted by VEP. Format:
>> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|HGNC|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|CLIN_SIG|CANONICAL|SIFT|PolyPhen|GMAF|ENSP|DOMAINS|CCDS|HGVSc|HGVSp|CELL_TYPE|BLOSUM62|CAROL|Conservation|LinkedVariants|INTERPRO|TSSDistance">
>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>> 21      26960070        rs116645811     G       A       .       .
>>
>> CSQ=|||||||||||||||||||||||||||||||||||,A|ENSG00000154719|ENST00000352957|Transcript|intron_variant||||||rs116645811||9/9|MRPL39||||||||||A:0.0005|ENSP00000284967||CCDS13573.1|ENST00000352957.4:c.969+1077C>T|||||0.840||ENSP00000284967|,A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11||MRPL39|||||||YES|tolerated(0.06)|benign(0.001)|A:0.0005|ENSP00000305682|Low_complexity_(Seg):Seg|CCDS33522.1|ENST00000307301.7:c.1001C>T|ENSP00000305682.7:p.Thr334Met||-1|Neutral(0.940)|0.840||ENSP00000305682|
>>
>> I like this. It won't be so hard to parse it.
>>
>> If I'm not wrong, I can even choose the field order with the "--fields" flag.
>> Does this only work for the regular tab-delimited VEP output file, or does it
>> work with VCF output as well?
>>
>> The only thing I don't like is that the delimiter character "|" also ends up
>> marking empty fields. It would be great to be able to change the delimiter to
>> another special character so that parsing is easier.
>>
>>
>> Thank you.
>>
>> Best regards,
>> Guillermo.
>>
>> This is from input:
>>
>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>> 21      26960070        rs116645811     G       A       .       .       .
>>
>> using the command line:
>>
>> perl variant_effect_predictor.pl -i test.txt -force -database
>> -everything -vcf -plugin Blosum62 -plugin Carol -plugin Conservation
>> -plugin LD -plugin ProteinDomains -plugin TSSDistance
>>
>> Hope this is a bit clearer!
>>
>> Will
>>
>> On 17 April 2013 11:25, Guillermo Marco Puche
>> <guillermo.marco at sistemasgenomicos.com> wrote:
>>
>> Hello,
>>
>> I'm looking for an example *.vcf output with ALL the "Extra" parameters.
>> I've generated some with the VEP script, but some extras, such as HGNC, are
>> never generated.
>>
>> A few VCF lines with all values would be enough, since I'm planning to parse
>> the "Extra" column.
>>
>> It also would be great if it includes most of the plugins outputs also :)
>>
>> Thank you :)
>>
>> Best regards,
>> Guillermo.
>>
>>
>> On 04/16/13 18:00, Guillermo Marco Puche wrote:
>>
>> On 04/16/13 14:49, Will McLaren wrote:
>>
>> Hi Guillermo,
>>
>> There are two distinct ways you can add additional data to the output
>> from the VEP.
>>
>> 1) Custom annotations - here you simply provide the VEP with a
>> tabix-indexed position-based data file, and the VEP does the work of
>> finding overlaps with your variant input and the data from the file.
>>
>> 2) Plugins - you write the code to add to or manipulate the internal
>> data structures used by the VEP. In its simplest form, a plugin can be
>> simply looking up an attribute of some object and adding it to the
>> output.
>>
>> Writing a plugin requires a basic understanding of the Ensembl API,
>> but getting a basic plugin working requires only a very small amount
>> of code.
>>
>> Since the additional data is being obtained from multiple sources (APIs,
>> files, etc.), I guess plugins are the only way to go for me.
>>
>> The documentation
>> (http://www.ensembl.org/info/docs/variation/vep/vep_script.html#plugins)
>> explains all of this, but the best way to see how plugins work is to
>> look at the existing plugins at
>> https://github.com/ensembl-variation/VEP_plugins. I'd suggest looking
>> at Conservation.pm and ProteinSeqs.pm as some relatively simple
>> examples of retrieving additional data from the API.
>>
>> Where do packages like "package Conservation;" come from?
>>
>> You should note that using VCF output you will see repeated elements
>> in the INFO field added, since the plugin gets run once for every
>> variant/transcript overlap; all data appear under the CSQ field in the
>> INFO column. Currently there is no way for the VEP via plugins to add
>> separate INFO fields, however this is something we are looking into,
>> and in fact would be relatively easy to "hack" in for someone
>> determined enough (see subroutine vf_list_to_cons in
>> Bio::EnsEMBL::Variation::Utils::VEP).
>>
>> I'll look further into this tomorrow since I've to go now.
>>
>> A workaround could be simply generating a temp file with extra columns and
>> in the end merge original VCF from VEP script with the output from plugins
>> for additional columns.
>>
>> Maybe I misunderstood you. Please correct me if I'm wrong.
>>
>> Hope this helps, and feel free to ask further questions!
>>
>> Will McLaren
>> Ensembl Variation
>>
>> Thank you so much.
>>
>> Best regards,
>> Guillermo.
>>
>> On 16 April 2013 12:58, Guillermo Marco Puche
>> <guillermo.marco at sistemasgenomicos.com> wrote:
>>
>> Hello,
>>
>> I need to develop some extra features for VEP.
>>
>> My input files are in VCF format and also my output.
>>
>> But I want to add several additional columns for extra data to the VCF output.
>>
>> For example: AA conservation score, Biobase description, Biobase link, MAF
>> populations, flanking sequence, gene description, InterPro_ID and more.
>>
>> I've been reading the documents and I'm a bit confused about "Custom
>> annotations".
>> I think that, since the data I want is extra in the output and not in the
>> input, what I should do is develop several plugins to obtain all the values
>> I need.
>>
>> I think most of them can be obtained through the Ensembl API, even though I'm
>> new to it. Others will require more hard-coding.
>>
>> I hope someone can clarify this matter for me.
>>
>> Thank you.
>>
>> Best regards,
>> Guillermo.
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/