[ensembl-dev] dump of attrib_type table in Ensembl 75 breaks standard parsing tools

Hervé Pagès hpages at fhcrc.org
Thu Mar 13 17:33:35 GMT 2014


Hi,

I'd like to report that the dump of the attrib_type table in Ensembl 75
contains some records with what appear to be embedded EOL characters.
These embedded EOL characters break standard line-oriented parsing tools
like the Unix cut command or the read.table() function in R.

Trying to use cut:

   hpages:~$ wget ftp://ftp.ensembl.org/pub/release-74/mysql/homo_sapiens_core_74_37/attrib_type.txt.gz
   hpages:~$ gunzip attrib_type.txt.gz
   hpages:~$ tail attrib_type.txt | cut -f 1,2
   413	lnoncoding_rcnt
   414	pseudogene_rcnt
   415	pseudogene_racnt
   416	gencode_level
   \evel 2 (manually annotated loci),
   \evel 3 (automatically annotated loci)

   417	gencode_basic

   418	struct_var

Trying to use read.table() in R:

   > df <- read.table("attrib_type.txt", sep="\t", quote="", comment.char="")
   Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
     line 278 did not have 4 elements

Note that those lines are at the bottom of the attrib_type.txt.gz
file and are new in Ensembl 75. The same file in Ensembl 74 and
earlier versions didn't have this problem and could easily be
parsed with standard tools.

I have a workaround for this but I was hoping that maybe something
could be done on your side to fix the file.
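
For illustration only, here is a minimal sketch of one possible
workaround; it is not necessarily the one alluded to above. It assumes the
dump follows the MySQL tab-delimited export convention, in which a newline
inside a field is written as a backslash followed by a real newline
(possibly preceded by a carriage return). The function name
join_escaped_eol is made up for this example.

```shell
# Join physical lines that end in a backslash (assumed to be an escaped
# newline) with the line that follows, so each record sits on one line.
# The embedded newline is replaced by a single space.
join_escaped_eol() {
    awk '
        {
            continued = sub(/\\$/, "")   # 1 if the line ended in a backslash
            sub(/\r$/, "")               # drop a stray carriage return, if any
            if (continued) { buf = buf $0 " "; next }
            print buf $0
            buf = ""
        }
        END { if (buf != "") print buf }
    ' "$@"
}
```

Under those assumptions, something like
"join_escaped_eol attrib_type.txt | tail | cut -f 1,2" should print one
line per record again. The sketch does not handle a field that ends in a
literal (escaped) backslash, nor escaped tabs, so it is only a starting
point.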

Thanks in advance,
H.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



