[ensembl-dev] dump of attrib_type table in Ensembl 75 breaks standard parsing tools
Hervé Pagès
hpages at fhcrc.org
Fri Mar 14 21:25:45 GMT 2014
Hi Steve,
On 03/14/2014 12:19 PM, Steve Trevanion wrote:
> Hi Herve,
>
> Sorry for the delay in replying to you. This is not a problem we have
> encountered before (with these being MySQL files, control characters can
> appear in them quite legitimately) so after looking into how best to
> deal with it, we've decided to regenerate the affected files and upload
> them to the FTP site. We will let you know when they're in place, but we
> won't now be able to do this until next week. Apologies in advance if it
> causes you any more issues.
No worries. I'm glad that you were able to regenerate these files.
Thanks a lot.
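For anyone else who needs to parse the current file before the regenerated one is in place, one option is to join the continuation lines before handing the file to cut or read.table(). Below is a rough sketch of that idea, not a description of the workaround mentioned in the original report. It assumes the dump follows the usual MySQL tab-delimited conventions: an embedded newline is written as a backslash followed by a real newline, and a literal backslash is written as two backslashes. The function name join_continued is made up for this illustration.

```shell
# Hypothetical helper: join physical lines that end in an odd number of
# backslashes (assumed to be an escaped newline in a MySQL tab-delimited
# dump) with the line that follows, replacing the break with a space.
join_continued() {
  awk '
    {
      line = $0
      n = 0
      # Count the backslashes at the end of this physical line.
      while (n < length(line) && substr(line, length(line) - n, 1) == "\\")
        n++
      if (n % 2 == 1) {
        # Odd count: the last backslash escapes the newline.
        # Drop it and keep accumulating the record.
        buf = buf substr(line, 1, length(line) - 1) " "
      } else {
        # Even count (including zero): the record ends here.
        print buf line
        buf = ""
      }
    }
    # Flush a record left unterminated at end of input.
    END { if (buf != "") print buf }
  ' "$@"
}
```

Something like `join_continued attrib_type.txt | cut -f 1,2` should then give one record per line. Escaped tabs inside a field (a backslash followed by a real tab), if the file has any, are not handled by this sketch.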
H.
>
> Regards,
>
> Steve
>
> On 13/03/14 19:58, Hervé Pagès wrote:
>> On 03/13/2014 10:33 AM, Hervé Pagès wrote:
>>> Hi,
>>>
>>> I'd like to report that the dump of the attrib_type table in Ensembl 75
>>> contains some lines that seem to contain embedded EOL characters. These
>>> embedded EOL characters tend to break standard parsing tools like the
>>> cut Unix command or the read.table() function in R.
>>>
>>> Trying to use cut:
>>>
>>> hpages:~$ wget ftp://ftp.ensembl.org/pub/release-74/mysql/homo_sapiens_core_74_37/attrib_type.txt.gz
>>>
>>
>> Ooops, looks like I gave the wrong URL, sorry. The above URL is for
>> Ensembl 74 and there is no parsing issue with this file. The parsing
>> issue I'm reporting below is with:
>>
>>
>> ftp://ftp.ensembl.org/pub/release-75/mysql/homo_sapiens_core_75_37/attrib_type.txt.gz
>>
>>
>> H.
>>
>>>
>>> hpages:~$ gunzip attrib_type.txt.gz
>>> hpages:~$ tail attrib_type.txt | cut -f 1,2
>>> 413 lnoncoding_rcnt
>>> 414 pseudogene_rcnt
>>> 415 pseudogene_racnt
>>> 416 gencode_level
>>> \evel 2 (manually annotated loci),
>>> \evel 3 (automatically annotated loci)
>>>
>>> 417 gencode_basic
>>>
>>> 418 struct_var
>>>
>>> Trying to use read.table() in R:
>>>
>>> > df <- read.table("attrib_type.txt", sep="\t", quote="",
>>> comment.char="")
>>> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>>> na.strings, :
>>> line 278 did not have 4 elements
>>>
>>> Note that those lines are at the bottom of the attrib_type.txt.gz
>>> file and are new in Ensembl 75. The same file in Ensembl 74 and
>>> earlier versions didn't have this problem and could easily be
>>> parsed with standard tools.
>>>
>>> I have a workaround for this but I was hoping that maybe something
>>> could be done on your side to fix the file.
>>>
>>> Thanks in advance,
>>> H.
>>>
>>
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319