[ensembl-dev] VCFtools parsing error in Ensembl Homo sapiens VCF files.
Tjaart de Beer
tjaart at ebi.ac.uk
Thu Oct 10 14:36:49 BST 2013
Hi Anja,
Thanks for the quick response. What I need is the mapping between each
rsid and its transcript, together with the consequence. I've tried using
VEP, but because of the large number of rsids (550k), an alternative
using the consequence VCF files was suggested (see the previous mail to
the mailing list).
I've also tried using the Homo_sapiens.vcf you mentioned, but I get the
following error:
vcftools --vcf Homo_sapiens.vcf --snps test.dat
VCFtools - v0.1.11
(C) Adam Auton 2009
Parameters as interpreted:
--vcf Homo_sapiens.vcf
--snps test.dat
Reading Index file.
Building new index file.
Error:No header or meta information. Invalid file: Homo_sapiens.vcf
My header of Homo_sapiens.vcf looks like this:
##fileformat=VCFv4.1
##fileDate=20130830
##source=ensembl;version=73;url=http://e73.ensembl.org/homo_sapiens
##reference=ftp://ftp.ensembl.org/pub/release-73/fasta/homo_sapiens/dna/
##INFO=<ID=TSA,Number=0,Type=String,Description="Type of sequence alteration. Child of term sequence_alteration as defined by the sequence ontology project.">
##INFO=<ID=E_MO,Number=0,Type=Flag,Description="Multiple_observations.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_ESP,Number=0,Type=Flag,Description="Exome_Sequencing_Project.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_1000G,Number=0,Type=Flag,Description="1000Genomes.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_HM,Number=0,Type=Flag,Description="HapMap.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_Freq,Number=0,Type=Flag,Description="Frequency.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_C,Number=0,Type=Flag,Description="Cited.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=CS,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP.">
##INFO=<ID=CS_DR,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. drug-response.">
##INFO=<ID=CS_PP,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. probable-pathogenic.">
##INFO=<ID=CS_P,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. pathogenic.">
##INFO=<ID=CS_NP,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. non-pathogenic.">
##INFO=<ID=CS_O,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. other.">
##INFO=<ID=CS_U,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. untested.">
##INFO=<ID=CS_H,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. histocompatibility.">
##INFO=<ID=CS_PNP,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. probable-non-pathogenic.">
##INFO=<ID=MA,Number=1,Type=String,Description="Minor Alelele">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Minor Alelele Frequency">
##INFO=<ID=MAC,Number=1,Type=Integer,Description="Minor Alelele Count">
##INFO=<ID=COSMIC_65,Number=0,Type=Flag,Description="Somatic mutations found in human cancers from the COSMIC project">
##INFO=<ID=dbSNP_137,Number=0,Type=Flag,Description="Variants (including SNPs and indels) imported from dbSNP">
##INFO=<ID=HGMD-PUBLIC_20132,Number=0,Type=Flag,Description="Variants from HGMD-PUBLIC dataset June 2013">
##INFO=<ID=PhenCode_20121114,Number=0,Type=Flag,Description="PhenCode is a collaborative project to better understand the relationship between genotype and phenotype in humans">
##INFO=<ID=ESP_6500,Number=0,Type=Flag,Description="NHLBI Exome Sequencing Project">
##INFO=<ID=dbSNP_ClinVar,Number=0,Type=Flag,Description="Variants of clinical significance imported from dbSNP/ClinVar">
1 10144 rs144773400 TA T . . dbSNP_137;TSA=deletion
So something seems to be wrong with this file as well. I compared the
header format of Homo_sapiens.vcf and Homo_sapiens_incl_consequences.vcf,
and all the required information appears to be present, yet vcftools
still reports an invalid file.
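One thing worth checking: the header as pasted above goes straight from the
`##` meta-information lines to the first record, with no `#CHROM` column
header line in between. VCF 4.1 requires that line, so if it is really
absent from the file (and not just left out of this mail), that could
explain the "No header or meta information" error. The sketch below is only
an illustration of such a check, not what vcftools does internally;
`check_vcf_header` is a hypothetical helper name, and the list of valid INFO
types is the one quoted from the vcftools page further down in this thread.

```python
import re

# INFO types listed as valid in this thread (Integer, Float, Flag, Character, String).
VALID_INFO_TYPES = {"Integer", "Float", "Flag", "Character", "String"}
# The eight fixed columns that VCF 4.1 requires in the column header line.
REQUIRED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_vcf_header(lines):
    """Return a list of problems found in the header of a VCF given as lines.

    Hypothetical helper: only checks INFO types and the presence of the
    '#CHROM' column header line; it is not a full VCF validator.
    """
    problems = []
    seen_column_header = False
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##"):
            if line.startswith("##INFO="):
                match = re.search(r"Type=([^,>]+)", line)
                if match is None:
                    problems.append(f"line {number}: INFO line without a Type")
                elif match.group(1) not in VALID_INFO_TYPES:
                    problems.append(
                        f"line {number}: unknown INFO Type {match.group(1)!r}"
                    )
            continue
        if line.startswith("#CHROM"):
            seen_column_header = True
            if line.split("\t")[:8] != REQUIRED_COLUMNS:
                problems.append(f"line {number}: unexpected column header")
        # Stop at the first line that is not '##' meta-information.
        break
    if not seen_column_header:
        problems.append("no '#CHROM' column header line before the first record")
    return problems
```

It could be run on the start of the file, for example
`check_vcf_header(open("Homo_sapiens.vcf"))`; the loop stops at the first
non-`##` line, so the whole file is not read.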
When I run vcf-sort on the file and use the newly generated file, I still
get the same error. Any idea what is going on?
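As a possible stopgap while the vcftools error is unresolved, the rsid
lookup itself does not need a VCF parser: the ID is the third tab-separated
column of each record. The sketch below is a minimal illustration under
some assumptions: `test.dat` holds one rsid per line, the VCF is
tab-separated, and the function names (`load_ids`, `open_text`,
`filter_by_id`, `main`) are made up for this example. It only selects
records; it does not interpret the INFO column.

```python
import gzip
import sys


def load_ids(path):
    """Read one ID per line into a set (assumed layout of test.dat)."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def filter_by_id(records, wanted):
    """Yield header lines plus the records whose ID column is in `wanted`."""
    for line in records:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t", 3)
        if len(fields) < 3:
            continue  # not a well-formed record; skip it
        # The ID column may hold several IDs separated by ';'.
        if any(identifier in wanted for identifier in fields[2].split(";")):
            yield line


def main(vcf_path, ids_path, out=sys.stdout):
    """Stream the VCF once and write the matching lines to `out`."""
    wanted = load_ids(ids_path)
    with open_text(vcf_path) as records:
        for line in filter_by_id(records, wanted):
            out.write(line)
```

Calling `main("Homo_sapiens.vcf", "test.dat")` would print the header and
the matching records. Holding 550k rsids in a set keeps each lookup cheap,
and the VCF is read line by line, so the file never has to fit in memory.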
Thanks!
Tjaart
> Hi Tjaart,
> The VCF specification doesn't provide a way of representing variant
> consequences in a VCF file. Until the specification contains a way of
> storing consequence information, we decided to store the data as a list
> of strings. We store our consequence data in our VCF files similarly to
> how we store it in our GVF files. There is a specification for how to
> store consequence data in GVF format
> (http://www.sequenceontology.org/resources/gvf.html).
> We are aware that this can cause problems with VCF parsers, and we could
> include changes in the next releases, perhaps by storing only the most
> severe consequence for a variant, which should be easier to model with
> the current VCF specification.
> In the meantime you could use the file Homo_sapiens.vcf instead, which
> includes the same data as the file Homo_sapiens_incl_consequences.vcf
> except for the consequence information.
> Best regards,
> Anja
> On 10 Oct 2013, at 13:20, Tjaart de Beer wrote:
>> Hi,
>> I am trying to look for specific rsids in the latest release of the
>> human VCF files from
>> ftp://ftp.ensembl.org/pub/release-73/variation/vcf/homo_sapiens/
>> I am using this file:
>> Homo_sapiens_incl_consequences.vcf.gz
>> I installed the latest vcftools (0.1.11), and when I run the following
>> command
>> vcftools --vcf Homo_sapiens_incl_consequences.vcf --snps test.dat
>> I get this error:
>> VCFtools - v0.1.11
>> (C) Adam Auton 2009
>> Parameters as interpreted:
>> --vcf Homo_sapiens_incl_consequences.vcf
>> --snps test.dat
>> Reading Index file.
>> Building new index file.
>> Error:Unknown Type in INFO meta-information:
>> ##INFO=<ID=VE,Number=.,Type=ListOfString,Description="Effect that a sequence alteration has on a sequence feature that overlaps it.Format=SV|IDX|FT|FID">
>> According to the vcftools page, the only valid options for Type are
>> Integer, Float, Flag, Character, and String, not ListOfString. This
>> thread from the vcftools mailing list seems to confirm that
>> ListOfString is an invalid option:
>> http://sourceforge.net/mailarchive/message.php?msg_id=31150267
>> Could this perhaps be a bug in the way the Ensembl VCF files are
>> generated? Or am I missing something?
>> --
>> Dr. Tjaart de Beer
>> Thornton group
>> European Bioinformatics Institute (EMBL-EBI)
>> European Molecular Biology Laboratory
>> Wellcome Trust Genome Campus
>> Hinxton
>> Cambridge CB10 1SD
>> United Kingdom