[ensembl-dev] VEP Extra output information

Guillermo Marco Puche guillermo.marco at sistemasgenomicos.com
Thu Apr 18 08:29:33 BST 2013


Hello,

In the end, I'm not going to use VCF format as output.
From the original input VCF I need to print the Chromosome, Position,
Ref_Allele and Var_Allele columns into my output.

I prefer the standard VEP tab-delimited file for output, since its
"Extra" column is much easier to parse: all extra parameters are
delimited by ";".
Is there any way to force VEP to print empty extra parameters?

For example:

1_6508122_G/C    1:6508122    C    ENSESTG00000022320    ENSESTT00000056337    Transcript    downstream_gene_variant    -    -    -    -    -    rs11808508    AFR_MAF=;DISTANCE=2305;GMAF=;ASN_MAF=;EUR_MAF=;ENSP=ENSESTP00000056337;CANONICAL=YES;AMR_MAF=

Or simply fill the empty extra fields with =EMPTY.
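
Failing that, the same effect could be had on the parser side. Below is a
minimal sketch in Python, not anything VEP itself provides: the helper
name parse_extra, the EXPECTED_KEYS list and the "EMPTY" placeholder are
all hypothetical, and it assumes the Extra column holds ";"-separated
KEY=VALUE pairs, with "-" when there are none.

```python
# Hypothetical post-processing helper; not part of VEP.
# Assumption: the keys we want to see in every row are known up front.
EXPECTED_KEYS = ["AFR_MAF", "AMR_MAF", "ASN_MAF", "EUR_MAF",
                 "GMAF", "DISTANCE", "ENSP", "CANONICAL"]


def parse_extra(extra, expected_keys=EXPECTED_KEYS, placeholder="EMPTY"):
    """Split an Extra column into a dict, filling absent keys with a placeholder."""
    fields = {}
    # Assumption: "-" (or an empty string) means the column has no entries.
    if extra and extra != "-":
        for pair in extra.split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            # A key written as "KEY=" (no value) also gets the placeholder.
            fields[key] = value or placeholder
    for key in expected_keys:
        fields.setdefault(key, placeholder)
    return fields


extra = "DISTANCE=2305;ENSP=ENSESTP00000056337;CANONICAL=YES"
row = parse_extra(extra)
print(";".join(f"{key}={row[key]}" for key in EXPECTED_KEYS))
```

Every row then carries the same set of keys in the same order, so the
output can be treated as fixed columns.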

Thank you.

Best regards,
Guillermo.

On 04/17/13 16:53, Guillermo Marco Puche wrote:
> Again, thank you so much !
>
> I'm looking further into VCFtools; it may be the easiest and most
> standard way to parse VCF output from VEP.
>
> Thank you.
>
> Best regards,
> Guillermo.
>
> On 04/17/13 16:50, Will McLaren wrote:
>> Yes, you can customise the fields used and the order they appear in
>> with --fields; this applies to both VCF and the normal tab-delimited
>> output.
>>
>> The delimiter is hardcoded I'm afraid, but I'm not sure what you'd
>> pick if you did decide to change it. ";" and "," are already used by
>> the VCF spec, and ":" appears in HGVS notations and other fields.
>>
>> If you did want to change it, you'd just need to edit lines 1272 and
>> 1275 of ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/VEP.pm.
>>
>> Will
>>
>>
>>
>> On 17 April 2013 15:32, Guillermo Marco Puche
>> <guillermo.marco at sistemasgenomicos.com>  wrote:
>>> Hello Will,
>>>
>>>
>>> On 04/17/13 14:46, Will McLaren wrote:
>>>
>>> Hello,
>>>
>>> It's difficult (well, in fact impossible) to provide an example where
>>> every field is populated, since some field types are mutually
>>> exclusive dependent on the feature type overlapped (for example, you
>>> will never see the CELL_TYPE field populated for a variant/transcript
>>> combination).
>>>
>>> If you are interested in this for the purposes of how it looks for a
>>> parser, you really want to be looking at the header line added to the
>>> VCF by the VEP:
>>>
>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
>>> predicted by VEP. Format:
>>> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|HGNC|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|CLIN_SIG|CANONICAL|SIFT|PolyPhen|GMAF|ENSP|DOMAINS|CCDS|HGVSc|HGVSp|CELL_TYPE|BLOSUM62|CAROL|Conservation|LinkedVariants|INTERPRO|TSSDistance">
>>>
>>> This lists the fields that are added in order. Using this you should
>>> be able to parse what appears in the body of the file.
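
A rough sketch of that header-driven approach in Python, under stated
assumptions: the function names csq_field_names and parse_csq are
hypothetical, the header is read as a single unwrapped line, and the
four-field Format string below is a shortened, made-up subset used only
to keep the example small.

```python
import re


def csq_field_names(header_line):
    """Extract the ordered field names from the CSQ ##INFO header line."""
    match = re.search(r'Format:\s*([^">]+)', header_line)
    if match is None:
        raise ValueError("no 'Format:' section found in header line")
    return match.group(1).strip().split("|")


def parse_csq(info, names):
    """Return one dict per CSQ entry found in a VCF INFO column."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            entries = item[len("CSQ="):].split(",")
            # split("|") keeps empty strings, so empty fields stay in position.
            return [dict(zip(names, entry.split("|"))) for entry in entries]
    return []


# Shortened, illustrative header; a real one lists many more fields.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type '
          'as predicted by VEP. Format: Allele|Gene|Feature|Consequence">')
info = ("CSQ=A|ENSG00000154719|ENST00000352957|intron_variant,"
        "A||ENST00000307301|missense_variant")

names = csq_field_names(header)
records = parse_csq(info, names)
for record in records:
    print(record["Feature"], record["Consequence"])
```

Because the field order is taken from the header rather than hardcoded,
the same code should keep working when plugins or --fields change which
fields appear.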
>>>
>>> Here's an example using a bunch of plugins and with the "--everything"
>>> flag switched on:
>>>
>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
>>> predicted by VEP. Format:
>>> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|HGNC|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DISTANCE|CLIN_SIG|CANONICAL|SIFT|PolyPhen|GMAF|ENSP|DOMAINS|CCDS|HGVSc|HGVSp|CELL_TYPE|BLOSUM62|CAROL|Conservation|LinkedVariants|INTERPRO|TSSDistance">
>>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>>> 21      26960070        rs116645811     G       A       .       .
>>>
>>> CSQ=|||||||||||||||||||||||||||||||||||,A|ENSG00000154719|ENST00000352957|Transcript|intron_variant||||||rs116645811||9/9|MRPL39||||||||||A:0.0005|ENSP00000284967||CCDS13573.1|ENST00000352957.4:c.969+1077C>T|||||0.840||ENSP00000284967|,A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11||MRPL39|||||||YES|tolerated(0.06)|benign(0.001)|A:0.0005|ENSP00000305682|Low_complexity_(Seg):Seg|CCDS33522.1|ENST00000307301.7:c.1001C>T|ENSP00000305682.7:p.Thr334Met||-1|Neutral(0.940)|0.840||ENSP00000305682|
>>>
>>> I like this. It won't be so hard to parse.
>>>
>>> If I'm not wrong, I can even choose the field order with the "--fields" flag.
>>> Does this only work for the regular VEP tab-delimited output file, or does
>>> it also work with VCF output?
>>>
>>> The only thing I don't like is that the "|" delimiter character is also what
>>> fills the empty fields. It would be great to be able to change the delimiter
>>> to another special character so that parsing is easier.
>>>
>>>
>>> Thank you.
>>>
>>> Best regards,
>>> Guillermo.
>>>
>>> This is from input:
>>>
>>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>>> 21      26960070        rs116645811     G       A       .       .       .
>>>
>>> using the command line:
>>>
>>> perl variant_effect_predictor.pl -i test.txt -force -database
>>> -everything -vcf -plugin Blosum62 -plugin Carol -plugin Conservation
>>> -plugin LD -plugin ProteinDomains -plugin TSSDistance
>>>
>>> Hope this is a bit clearer!
>>>
>>> Will
>>>
>>> On 17 April 2013 11:25, Guillermo Marco Puche
>>> <guillermo.marco at sistemasgenomicos.com>  wrote:
>>>
>>> Hello,
>>>
>>> I'm looking for an example *.vcf output with ALL the "Extra" parameters.
>>> I've generated some with the VEP script, but some extras, such as HGNC,
>>> never get generated.
>>>
>>> A few lines of VCF with all values would be enough, since I'm planning to
>>> parse the "Extra" column.
>>>
>>> It would also be great if it included most of the plugin outputs :)
>>>
>>> Thank you :)
>>>
>>> Best regards,
>>> Guillermo.
>>>
>>>
>>> On 04/16/13 18:00, Guillermo Marco Puche wrote:
>>>
>>> On 04/16/13 14:49, Will McLaren wrote:
>>>
>>> Hi Guillermo,
>>>
>>> There are two distinct ways you can add additional data to the output
>>> from the VEP.
>>>
>>> 1) Custom annotations - here you simply provide the VEP with a
>>> tabix-indexed position-based data file, and the VEP does the work of
>>> finding overlaps with your variant input and the data from the file.
>>>
>>> 2) Plugins - you write the code to add to or manipulate the internal
>>> data structures used by the VEP. In its simplest form, a plugin can be
>>> simply looking up an attribute of some object and adding it to the
>>> output.
>>>
>>> Writing a plugin requires a basic understanding of the Ensembl API,
>>> but getting a basic plugin working requires only a very small amount
>>> of code.
>>>
>>> Since the additional data is being obtained from multiple sources (APIs,
>>> files, etc.), I guess plugins are the only way to go for me.
>>>
>>> The documentation
>>> (http://www.ensembl.org/info/docs/variation/vep/vep_script.html#plugins)
>>> explains all of this, but the best way to see how plugins work is to
>>> look at the existing plugins at
>>> https://github.com/ensembl-variation/VEP_plugins. I'd suggest looking
>>> at Conservation.pm and ProteinSeqs.pm as some relatively simple
>>> examples of retrieving additional data from the API.
>>>
>>> Where do packages like "package Conservation;" come from?
>>>
>>> You should note that using VCF output you will see repeated elements
>>> in the INFO field added, since the plugin gets run once for every
>>> variant/transcript overlap; all data appear under the CSQ field in the
>>> INFO column. Currently there is no way for the VEP via plugins to add
>>> separate INFO fields, however this is something we are looking into,
>>> and in fact would be relatively easy to "hack" in for someone
>>> determined enough (see subroutine vf_list_to_cons in
>>> Bio::EnsEMBL::Variation::Utils::VEP).
>>>
>>> I'll look further into this tomorrow, since I have to go now.
>>>
>>> A workaround could be to simply generate a temp file with the extra columns
>>> and, at the end, merge the original VCF from the VEP script with the output
>>> from the plugins to get the additional columns.
>>>
>>> Maybe I misunderstood you. Please correct me if I'm wrong.
>>>
>>> Hope this helps, and feel free to ask further questions!
>>>
>>> Will McLaren
>>> Ensembl Variation
>>>
>>> Thank you so much.
>>>
>>> Best regards,
>>> Guillermo.
>>>
>>> On 16 April 2013 12:58, Guillermo Marco Puche
>>> <guillermo.marco at sistemasgenomicos.com>  wrote:
>>>
>>> Hello,
>>>
>>> I need to develop some extra features for VEP.
>>>
>>> My input files are in VCF format, and so is my output.
>>>
>>> But I want to add several additional columns for extra data to the VCF output.
>>>
>>> For example, AA conservation score, Biobase description, Biobase link, MAF
>>> populations, Flanking sequence, Gene description, InterPro_ID and more.
>>>
>>> I've been reading the documentation and I'm a bit confused about "Custom
>>> annotations".
>>> I think that, since the data I want is extra output and is not in the input,
>>> what I should do is develop several plugins to obtain all the values I need.
>>>
>>> I think most of them can be obtained through the Ensembl API, even though
>>> I'm new to it. Others will require more hard-coding.
>>>
>>> I hope someone can clarify this matter for me a bit.
>>>
>>> Thank you.
>>>
>>> Best regards,
>>> Guillermo.
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/