[ensembl-dev] dump of attrib_type table in Ensembl 75 breaks standard parsing tools

Hervé Pagès hpages at fhcrc.org
Thu Mar 13 19:58:57 GMT 2014


On 03/13/2014 10:33 AM, Hervé Pagès wrote:
> Hi,
>
> I'd like to report that the dump of the attrib_type table in Ensembl 75
> contains some lines that seem to contain embedded EOL characters. These
> embedded EOL characters tend to break standard parsing tools like the
> cut Unix command or the read.table() function in R.
>
> Trying to use cut:
>
>    hpages:~$ wget ftp://ftp.ensembl.org/pub/release-74/mysql/homo_sapiens_core_74_37/attrib_type.txt.gz

Oops, looks like I gave the wrong URL, sorry. The above URL is for
Ensembl 74 and there is no parsing issue with this file. The parsing
issue I'm reporting below is with:

 
ftp://ftp.ensembl.org/pub/release-75/mysql/homo_sapiens_core_75_37/attrib_type.txt.gz 


H.

>
>    hpages:~$ gunzip attrib_type.txt.gz
>    hpages:~$ tail attrib_type.txt | cut -f 1,2
>    413    lnoncoding_rcnt
>    414    pseudogene_rcnt
>    415    pseudogene_racnt
>    416    gencode_level
>    \evel 2 (manually annotated loci),
>    \evel 3 (automatically annotated loci)
>
>    417    gencode_basic
>
>    418    struct_var
>
> Trying to use read.table() in R:
>
>    > df <- read.table("attrib_type.txt", sep="\t", quote="", comment.char="")
>    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>      line 278 did not have 4 elements
>
> Note that those lines are at the bottom of the attrib_type.txt.gz
> file and are new in Ensembl 75. The same file in Ensembl 74 and
> earlier versions didn't have this problem and could easily be
> parsed with standard tools.
>
> I have a workaround for this but I was hoping that maybe something
> could be done on your side to fix the file.
>
> Thanks in advance,
> H.
>
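
For anyone hitting the same problem, here is one possible workaround
(a rough sketch only, and not necessarily the workaround mentioned above).
It assumes the dump uses the usual MySQL tab-delimited convention where
an embedded newline is written as a backslash followed by a real newline,
which is consistent with the leading "\" in the cut output above. The
function name join_escaped_newlines is made up for illustration, and the
stripping of carriage returns is also an assumption about this file.

```shell
# Hypothetical helper: rejoin records that were split by escaped newlines.
# Assumption: a physical line ending in an odd number of backslashes
# continues on the next line (an even number would be escaped literal
# backslashes, not an escaped newline).
join_escaped_newlines() {
    awk '
    {
        gsub(/\r/, "")          # assumption: stray carriage returns are unwanted
        line = $0
        n = 0                   # count trailing backslashes
        while (n < length(line) && substr(line, length(line) - n, 1) == "\\") n++
        if (n % 2 == 1) {
            # escaped newline: drop the backslash, replace the break by a space
            buf = buf substr(line, 1, length(line) - 1) " "
        } else {
            print buf line      # record is complete
            buf = ""
        }
    }
    END { if (buf != "") print buf }   # flush a dangling continuation
    ' "$@"
}
```

With that defined, something like
"join_escaped_newlines attrib_type.txt | cut -f 1,2" should give one
line per record. Replacing the embedded line break with a space changes
the description text slightly; keep a different placeholder if the exact
content matters.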

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



