[ensembl-dev] dump of attrib_type table in Ensembl 75 breaks standard parsing tools

Hervé Pagès hpages at fhcrc.org
Fri Mar 14 21:25:45 GMT 2014


Hi Steve,

On 03/14/2014 12:19 PM, Steve Trevanion wrote:
> Hi Herve,
>
> Sorry for the delay in replying to you. This is not a problem we have
> encountered before (with these being MySQL files, control characters can
> appear in them quite legitimately) so after looking into how best to
> deal with it, we've decided to regenerate the affected files and upload
> them to the FTP site. We will let you know when they're in place, but we
> won't now be able to do this until next week. Apologies in advance if it
> causes you any more issues.

No worries. I'm glad that you were able to regenerate these files.
Thanks a lot.
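
For anyone who needs to parse the release-75 file before the regenerated
dump is in place, here is a minimal sketch of one possible approach. It
assumes the embedded line breaks are written the way MySQL tab-delimited
dumps usually write them: a backslash immediately before the newline,
sometimes with a stray carriage return in front of it. This has not been
checked against the actual file. The function name join_escaped_lines is
made up for this example, and this is not necessarily the workaround
mentioned in the original report.

```shell
# join_escaped_lines (hypothetical helper): read a tab-delimited dump on
# stdin and write one logical record per line on stdout.
# Assumption: a physical line ending in an odd number of backslashes is
# continued on the next physical line (backslash-escaped newline).
join_escaped_lines() {
  awk '
    {
      line = $0
      gsub(/\r/, "", line)            # drop carriage returns
      n = 0                           # count trailing backslashes
      while (n < length(line) && substr(line, length(line) - n, 1) == "\\")
        n++
      if (n % 2 == 1) {
        # escaped newline: strip the backslash, join with a single space
        buf = buf substr(line, 1, length(line) - 1) " "
      } else {
        # complete record (an even count means literal backslashes only)
        print buf line
        buf = ""
      }
    }
    END { if (buf != "") print buf }  # flush an unterminated last record
  '
}
```

With that function defined, the earlier check could be rerun as
`gunzip -c attrib_type.txt.gz | join_escaped_lines | cut -f 1,2`.
Replacing each embedded line break with a space keeps every record on a
single line, at the cost of losing the original line structure inside the
description field.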

H.

>
> Regards,
>
> Steve
>
> On 13/03/14 19:58, Hervé Pagès wrote:
>> On 03/13/2014 10:33 AM, Hervé Pagès wrote:
>>> Hi,
>>>
>>> I'd like to report that the dump of the attrib_type table in Ensembl 75
>>> contains some lines that seem to contain embedded EOL characters. These
>>> embedded EOL characters tend to break standard parsing tools like the
>>> cut Unix command or the read.table() function in R.
>>>
>>> Trying to use cut:
>>>
>>>    hpages:~$ wget ftp://ftp.ensembl.org/pub/release-74/mysql/homo_sapiens_core_74_37/attrib_type.txt.gz
>>>
>>
>> Ooops, looks like I gave the wrong URL, sorry. The above URL is for
>> Ensembl 74 and there is no parsing issue with this file. The parsing
>> issue I'm reporting below is with:
>>
>>
>> ftp://ftp.ensembl.org/pub/release-75/mysql/homo_sapiens_core_75_37/attrib_type.txt.gz
>>
>>
>> H.
>>
>>>
>>>    hpages:~$ gunzip attrib_type.txt.gz
>>>    hpages:~$ tail attrib_type.txt | cut -f 1,2
>>>    413    lnoncoding_rcnt
>>>    414    pseudogene_rcnt
>>>    415    pseudogene_racnt
>>>    416    gencode_level
>>>    \evel 2 (manually annotated loci),
>>>    \evel 3 (automatically annotated loci)
>>>
>>>    417    gencode_basic
>>>
>>>    418    struct_var
>>>
>>> Trying to use read.table() in R:
>>>
>>>    > df <- read.table("attrib_type.txt", sep="\t", quote="", comment.char="")
>>>    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>>>      line 278 did not have 4 elements
>>>
>>> Note that those lines are at the bottom of the attrib_type.txt.gz
>>> file and are new in Ensembl 75. The same file in Ensembl 74 and
>>> earlier versions didn't have this problem and could easily be
>>> parsed with standard tools.
>>>
>>> I have a workaround for this but I was hoping that maybe something
>>> could be done on your side to fix the file.
>>>
>>> Thanks in advance,
>>> H.
>>>
>>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



