[ensembl-dev] dump of attrib_type table in Ensembl 75 breaks standard parsing tools
Hervé Pagès
hpages at fhcrc.org
Thu Mar 13 17:33:35 GMT 2014
Hi,
I'd like to report that the dump of the attrib_type table in Ensembl 75
contains some records that appear to have embedded EOL characters. These
break standard line-oriented parsing tools such as the Unix cut command
or the read.table() function in R.
Trying to use cut:
hpages:~$ wget ftp://ftp.ensembl.org/pub/release-75/mysql/homo_sapiens_core_75_37/attrib_type.txt.gz
hpages:~$ gunzip attrib_type.txt.gz
hpages:~$ tail attrib_type.txt | cut -f 1,2
413 lnoncoding_rcnt
414 pseudogene_rcnt
415 pseudogene_racnt
416 gencode_level
\evel 2 (manually annotated loci),
\evel 3 (automatically annotated loci)
417 gencode_basic
418 struct_var
Trying to use read.table() in R:
> df <- read.table("attrib_type.txt", sep="\t", quote="", comment.char="")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 278 did not have 4 elements
Note that those lines are at the bottom of the attrib_type.txt.gz
file and are new in Ensembl 75. The same file in Ensembl 74 and
earlier versions didn't have this problem and could easily be
parsed with standard tools.
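For illustration, the affected lines can be located with a quick field
count. This is only a sketch: check_fields is a hypothetical helper name,
and the expected count of 4 tab-separated fields is taken from the
read.table() error above.

```shell
# Report every line whose tab-separated field count differs from the
# expected one (default 4, as in the read.table() error message).
# check_fields is a hypothetical helper name, not an existing tool.
check_fields() {
    awk -F'\t' -v n="${2:-4}" \
        'NF != n { printf "line %d: %d field(s)\n", NR, NF }' "$1"
}
```

Running check_fields attrib_type.txt should list only the continuation
lines, since the first line of a broken record still carries all 4 fields.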
I have a workaround for this but I was hoping that maybe something
could be done on your side to fix the file.
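One possible workaround can be sketched as follows (not necessarily the
one referred to above). It assumes that each embedded EOL is written in the
dump as a backslash immediately followed by a newline, which is how MySQL
tab-delimited dumps escape newlines by default; I have not verified that
this holds for every record. The function name join_escaped_newlines is
made up for this example.

```shell
# Join physical lines that end in a backslash with the line that follows,
# replacing the trailing backslash with a space, so that each record ends
# up on a single line again.
# ASSUMPTION: an embedded newline appears as backslash + newline.
# Caveat: a field that legitimately ends in an escaped backslash at the
# end of a line would be joined by mistake.
join_escaped_newlines() {
    awk '{
        # Keep appending while the current record ends in a backslash
        # and another input line is available.
        while (/\\$/ && (getline next_line) > 0) {
            sub(/\\$/, " ")
            $0 = $0 next_line
        }
        print
    }' "$1"
}
```

For example, join_escaped_newlines attrib_type.txt > attrib_type.fixed.txt
would produce a file that can then be passed to cut or read.table().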
Thanks in advance,
H.
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319