[ensembl-dev] dump of attrib_type table in Ensembl 75 breaks standard parsing tools
Hervé Pagès
hpages at fhcrc.org
Thu Mar 13 19:58:57 GMT 2014
On 03/13/2014 10:33 AM, Hervé Pagès wrote:
> Hi,
>
> I'd like to report that the dump of the attrib_type table in Ensembl 75
> contains some lines that seem to contain embedded EOL characters. These
> embedded EOL characters tend to break standard parsing tools like the
> cut Unix command or the read.table() function in R.
>
> Trying to use cut:
>
> hpages:~$ wget ftp://ftp.ensembl.org/pub/release-74/mysql/homo_sapiens_core_74_37/attrib_type.txt.gz
Oops, looks like I gave the wrong URL, sorry. The above URL is for
Ensembl 74, and there is no parsing issue with that file. The parsing
issue I'm reporting below is with:
ftp://ftp.ensembl.org/pub/release-75/mysql/homo_sapiens_core_75_37/attrib_type.txt.gz
H.
>
> hpages:~$ gunzip attrib_type.txt.gz
> hpages:~$ tail attrib_type.txt | cut -f 1,2
> 413 lnoncoding_rcnt
> 414 pseudogene_rcnt
> 415 pseudogene_racnt
> 416 gencode_level
> \evel 2 (manually annotated loci),
> \evel 3 (automatically annotated loci)
>
> 417 gencode_basic
>
> 418 struct_var
>
> Trying to use read.table() in R:
>
> > df <- read.table("attrib_type.txt", sep="\t", quote="",
> comment.char="")
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
> na.strings, :
> line 278 did not have 4 elements
>
> Note that those lines are at the bottom of the attrib_type.txt.gz
> file and are new in Ensembl 75. The same file in Ensembl 74 and
> earlier versions didn't have this problem and could easily be
> parsed with standard tools.
>
> I have a workaround for this but I was hoping that maybe something
> could be done on your side to fix the file.
>
> Thanks in advance,
> H.
>
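For anyone hitting the same problem, here is a rough sketch of one possible
way to read the file. It is not the workaround mentioned above, just an
illustration. It assumes the dump follows the MySQL tab-delimited convention,
where a newline inside a field is written as a backslash followed by a real
newline (the "\evel 2" output from cut is consistent with a carriage return
plus a backslash at the end of those lines, but I have not verified this
against the dump settings). The function name read_mysql_tab_dump is made up
for this example.

```python
import io


def read_mysql_tab_dump(lines):
    """Yield one list of fields per logical record.

    Assumption: an embedded newline is stored as a backslash followed by a
    real newline, so a physical line ending in an odd number of backslashes
    continues on the next physical line.
    """
    pending = ""
    for raw in lines:
        line = raw.rstrip("\n")
        # An even number of trailing backslashes is just escaped backslashes;
        # an odd number means the final one escapes the newline.
        n_trailing = len(line) - len(line.rstrip("\\"))
        if n_trailing % 2 == 1:
            # Drop the escaping backslash and any stray carriage return,
            # then join to the next physical line with a single space.
            pending += line[:-1].rstrip("\r") + " "
            continue
        yield (pending + line).split("\t")
        pending = ""
    if pending:
        # File ended in the middle of a continued record.
        yield pending.rstrip(" ").split("\t")


# Small in-memory sample shaped like the problematic records (made-up text).
sample = (
    "416\tgencode_level\tGENCODE level\tLevel 1,\r\\\n"
    "Level 2,\r\\\n"
    "Level 3\n"
    "417\tgencode_basic\tBasic\t\n"
)
for fields in read_mysql_tab_dump(io.StringIO(sample)):
    print(len(fields), fields[:2])
```

To run this on the real file, open it with open("attrib_type.txt",
newline="\n") so that Python does not treat a lone carriage return as a line
break. The sketch does not handle other escapes (escaped tabs, \N for NULL),
and replacing the embedded newline with a space is an arbitrary choice.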
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319