[ensembl-dev] VCFtools parsing error in Ensembl Homo sapiens VCF files.

Anja Thormann anja at ebi.ac.uk
Fri Oct 11 11:03:11 BST 2013


Hi Tjaart,

Thank you for reporting this. We have updated the VCF files on our FTP site (ftp://ftp.ensembl.org/pub/release-73/variation/vcf/).
I changed the INFO type ListOfString to String and added the missing header line.
VCFtools is now able to parse Homo_sapiens.vcf.gz and Homo_sapiens_incl_consequences.vcf.gz.
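
For anyone who wants to check a downloaded copy before running VCFtools, the two problems discussed in this thread can be detected with a short script. This is only a sketch: the helper names `open_vcf` and `check_header` are made up for illustration, and it assumes that the "missing header line" is the `#CHROM` column line, which is absent from the header quoted below. The set of valid INFO types is the one listed in the VCF 4.1 specification.

```python
import gzip
import re

# INFO Type values allowed by the VCF 4.1 specification.
VALID_INFO_TYPES = {"Integer", "Float", "Flag", "Character", "String"}


def open_vcf(path):
    """Open a plain or gzip-compressed VCF file in text mode (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def check_header(lines):
    """Return a list of problems found in the header of a VCF.

    `lines` is any iterable of text lines; reading stops at the first data record.
    """
    problems = []
    saw_column_line = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO="):
            # The Type key comes before Description, so the first match is the one we want.
            match = re.search(r"Type=([^,>]+)", line)
            if match and match.group(1) not in VALID_INFO_TYPES:
                problems.append("unsupported INFO type: " + match.group(1))
        elif line.startswith("#CHROM"):
            saw_column_line = True
        elif not line.startswith("#"):
            break  # first data record: the header is over
    if not saw_column_line:
        problems.append("missing #CHROM column header line")
    return problems
```

It could be used as `with open_vcf("Homo_sapiens.vcf.gz") as vcf: print(check_header(vcf))`; an empty list means neither problem was found.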

Best regards,
Anja
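
For the original goal in the quoted message below (mapping rsIDs to transcripts and consequences), the VE field can also be read directly without VCFtools. The following is a rough sketch under stated assumptions, not a description of how the files are produced: the function names are hypothetical, multiple VE values are assumed to be comma-separated, and SV, IDX, FT and FID are read as sequence variant, allele index, feature type and feature ID by analogy with the GVF Variant_effect attribute.

```python
def load_ids(path):
    """Read one rsID per line, as in the file passed to `vcftools --snps`."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def parse_ve(info):
    """Extract (consequence, feature_type, feature_id) tuples from an INFO string.

    Assumption: each VE value has the form SV|IDX|FT|FID and
    multiple values are separated by commas.
    """
    effects = []
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key != "VE":
            continue
        for item in value.split(","):
            fields = item.split("|")
            if len(fields) == 4:
                sv, _idx, ft, fid = fields
                effects.append((sv, ft, fid))
    return effects


def consequences_for(lines, wanted):
    """Yield (rsid, consequence, feature_type, feature_id) for records whose ID is in `wanted`."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[2] not in wanted:
            continue
        for sv, ft, fid in parse_ve(cols[7]):
            yield cols[2], sv, ft, fid
```

`consequences_for` accepts any iterable of lines, for example the handle returned by `gzip.open("Homo_sapiens_incl_consequences.vcf.gz", "rt")`, together with the set returned by `load_ids("test.dat")`. Because it streams the file once, the number of rsIDs only affects the size of the lookup set.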

On 10 Oct 2013, at 14:36, Tjaart de Beer wrote:

> Hi Anja,
> 
> Thanks for the quick response. The thing I need is the mapping between the
> rsid and the transcript as well as the consequence. I've tried using VEP
> but because of the large number of rsids (550k), an alternative using the
> consequence vcf files was suggested (see previous mail to mailing list).
> 
> I've also tried using the Homo_sapiens.vcf you mentioned but I get the
> following error:
> 
> vcftools --vcf Homo_sapiens.vcf --snps test.dat
> 
> VCFtools - v0.1.11
> (C) Adam Auton 2009
> 
> Parameters as interpreted:
>        --vcf Homo_sapiens.vcf
>        --snps test.dat
> 
> Reading Index file.
> Building new index file.
> Error:No header or meta information. Invalid file: Homo_sapiens.vcf
> 
> My header of Homo_sapiens.vcf looks like this:
> 
> ##fileformat=VCFv4.1
> ##fileDate=20130830
> ##source=ensembl;version=73;url=http://e73.ensembl.org/homo_sapiens
> ##reference=ftp://ftp.ensembl.org/pub/release-73/fasta/homo_sapiens/dna/
> ##INFO=<ID=TSA,Number=0,Type=String,Description="Type of sequence alteration. Child of term sequence_alteration as defined by the sequence ontology project.">
> ##INFO=<ID=E_MO,Number=0,Type=Flag,Description="Multiple_observations.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
> ##INFO=<ID=E_ESP,Number=0,Type=Flag,Description="Exome_Sequencing_Project.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
> ##INFO=<ID=E_1000G,Number=0,Type=Flag,Description="1000Genomes.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
> ##INFO=<ID=E_HM,Number=0,Type=Flag,Description="HapMap.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
> ##INFO=<ID=E_Freq,Number=0,Type=Flag,Description="Frequency.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
> ##INFO=<ID=E_C,Number=0,Type=Flag,Description="Cited.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
> ##INFO=<ID=CS,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP.">
> ##INFO=<ID=CS_DR,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. drug-response.">
> ##INFO=<ID=CS_PP,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. probable-pathogenic.">
> ##INFO=<ID=CS_P,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. pathogenic.">
> ##INFO=<ID=CS_NP,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. non-pathogenic.">
> ##INFO=<ID=CS_O,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. other.">
> ##INFO=<ID=CS_U,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. untested.">
> ##INFO=<ID=CS_H,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. histocompatibility.">
> ##INFO=<ID=CS_PNP,Number=0,Type=Flag,Description="The clinical significance of a variant as reported by dbSNP. probable-non-pathogenic.">
> ##INFO=<ID=MA,Number=1,Type=String,Description="Minor Alelele">
> ##INFO=<ID=MAF,Number=1,Type=Float,Description="Minor Alelele Frequency">
> ##INFO=<ID=MAC,Number=1,Type=Integer,Description="Minor Alelele Count">
> ##INFO=<ID=COSMIC_65,Number=0,Type=Flag,Description="Somatic mutations found in human cancers from the COSMIC project">
> ##INFO=<ID=dbSNP_137,Number=0,Type=Flag,Description="Variants (including SNPs and indels) imported from dbSNP">
> ##INFO=<ID=HGMD-PUBLIC_20132,Number=0,Type=Flag,Description="Variants from HGMD-PUBLIC dataset June 2013">
> ##INFO=<ID=PhenCode_20121114,Number=0,Type=Flag,Description="PhenCode is a collaborative project to better understand the relationship between genotype and phenotype in humans">
> ##INFO=<ID=ESP_6500,Number=0,Type=Flag,Description="NHLBI Exome Sequencing Project">
> ##INFO=<ID=dbSNP_ClinVar,Number=0,Type=Flag,Description="Variants of clinical significance imported from dbSNP/ClinVar">
> 1       10144   rs144773400     TA      T       .       .       dbSNP_137;TSA=deletion
> 
> 
> So there seems to be something wrong there as well. I compared the header
> info/format between Homo_sapiens.vcf and
> Homo_sapiens_incl_consequences.vcf and it appears that all the needed info
> is there yet vcftools still thinks it is an invalid file.
> 
> When I run vcf-sort on the file and use the newly generated file, I still
> get the same error. Any idea what is going on?
> 
> Thanks!
> Tjaart
> 
>> Hi Tjaart,
>>
>> The VCF specification doesn't provide a way of representing variant
>> consequences in a VCF file. Until the specification contains a way of
>> storing consequence information, we decided to store the data as a list
>> of strings. We store consequence data in our VCF files similarly to how
>> we store it in our GVF files; there is a specification for storing
>> consequence data in GVF format
>> (http://www.sequenceontology.org/resources/gvf.html).
>>
>> We are aware that this can cause problems with VCF parsers. For upcoming
>> releases we could change this, perhaps by storing only the most severe
>> consequence for a variant, which should be easier to model with the
>> current VCF specification.
>>
>> In the meantime you could use the file Homo_sapiens.vcf instead, which
>> includes the same data as Homo_sapiens_incl_consequences.vcf except for
>> the consequence information.
>>
>> Best regards,
>> Anja
>>
>> On 10 Oct 2013, at 13:20, Tjaart de Beer wrote:
>>
>>> Hi,
>>>
>>> I am trying to look for specific rsids in the latest release of human VCF
>>> files from
>>> ftp://ftp.ensembl.org/pub/release-73/variation/vcf/homo_sapiens/
>>> I am using this file: Homo_sapiens_incl_consequences.vcf.gz
>>>
>>> I installed the latest vcftools (0.1.11) and when I run the following
>>> command
>>>
>>> vcftools --vcf Homo_sapiens_incl_consequences.vcf --snps test.dat
>>>
>>> I get this error:
>>>
>>> VCFtools - v0.1.11
>>> (C) Adam Auton 2009
>>>
>>> Parameters as interpreted:
>>>       --vcf Homo_sapiens_incl_consequences.vcf
>>>       --snps test.dat
>>>
>>> Reading Index file.
>>> Building new index file.
>>> Error:Unknown Type in INFO meta-information:
>>> ##INFO=<ID=VE,Number=.,Type=ListOfString,Description="Effect that a sequence alteration has on a sequence feature that overlaps it.Format=SV|IDX|FT|FID">
>>>
>>> According to the vcftools page, the only valid options for Type are
>>> Integer, Float, Flag, Character, and String, not ListOfString. This
>>> thread from the vcftools mailing list seems to confirm that
>>> ListOfString is an invalid option:
>>> http://sourceforge.net/mailarchive/message.php?msg_id=31150267
>>>
>>> Could this perhaps be a bug in the way the Ensembl VCF files are
>>> generated? Or am I missing something?
>>>
>>> --
>>> Dr. Tjaart de Beer
>>> Thornton group
>>> European Bioinformatics Institute (EMBL-EBI)
>>> European Molecular Biology Laboratory
>>> Wellcome Trust Genome Campus
>>> Hinxton
>>> Cambridge CB10 1SD
>>> United Kingdom
> 
> 
> --
> Dr. Tjaart de Beer
> Thornton group
> European Bioinformatics Institute (EMBL-EBI)
> European Molecular Biology Laboratory
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> United Kingdom
> 
> 
> 
> 




