[ensembl-dev] dump of attrib_type table in Ensembl 75 breaks standard parsing tools

Hervé Pagès hpages at fhcrc.org
Wed Mar 19 18:58:46 GMT 2014


Hi Steve,

On 03/19/2014 05:43 AM, Steve Trevanion wrote:
> Hi Herve,
>
> The files have been regenerated and uploaded to the FTP site. We've also
> put internal checks in place so that it shouldn't slip through again.

Great. Thanks a lot!

Cheers,
H.

>
> Regards,
>
> Steve
>
> On 14/03/14 21:25, Hervé Pagès wrote:
>> Hi Steve,
>>
>> On 03/14/2014 12:19 PM, Steve Trevanion wrote:
>>> Hi Herve,
>>>
>>> Sorry for the delay in replying to you. This is not a problem we have
>>> encountered before (with these being MySQL files, control characters can
>>> appear in them quite legitimately) so after looking into how best to
>>> deal with it, we've decided to regenerate the affected files and upload
>>> them to the FTP site. We will let you know when they're in place, but we
>>> won't now be able to do this until next week. Apologies in advance if it
>>> causes you any more issues.
>>
>> No worries. I'm glad that you were able to regenerate these files.
>> Thanks a lot.
>>
>> H.
>>
>>>
>>> Regards,
>>>
>>> Steve
>>>
>>> On 13/03/14 19:58, Hervé Pagès wrote:
>>>> On 03/13/2014 10:33 AM, Hervé Pagès wrote:
>>>>> Hi,
>>>>>
>>>>> I'd like to report that the dump of the attrib_type table in
>>>>> Ensembl 75 contains some lines that seem to contain embedded EOL
>>>>> characters. These embedded EOL characters tend to break standard
>>>>> parsing tools like the cut Unix command or the read.table()
>>>>> function in R.
>>>>>
>>>>> Trying to use cut:
>>>>>
>>>>>    hpages:~$ wget ftp://ftp.ensembl.org/pub/release-74/mysql/homo_sapiens_core_74_37/attrib_type.txt.gz
>>>>>
>>>>>
>>>>
>>>> Ooops, looks like I gave the wrong URL, sorry. The above URL is for
>>>> Ensembl 74 and there is no parsing issue with this file. The parsing
>>>> issue I'm reporting below is with:
>>>>
>>>>
>>>> ftp://ftp.ensembl.org/pub/release-75/mysql/homo_sapiens_core_75_37/attrib_type.txt.gz
>>>>
>>>>
>>>>
>>>> H.
>>>>
>>>>>
>>>>>    hpages:~$ gunzip attrib_type.txt.gz
>>>>>    hpages:~$ tail attrib_type.txt | cut -f 1,2
>>>>>    413    lnoncoding_rcnt
>>>>>    414    pseudogene_rcnt
>>>>>    415    pseudogene_racnt
>>>>>    416    gencode_level
>>>>>    \evel 2 (manually annotated loci),
>>>>>    \evel 3 (automatically annotated loci)
>>>>>
>>>>>    417    gencode_basic
>>>>>
>>>>>    418    struct_var
>>>>>
>>>>> Trying to use read.table() in R:
>>>>>
>>>>>    > df <- read.table("attrib_type.txt", sep="\t", quote="", comment.char="")
>>>>>    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>>>>>      line 278 did not have 4 elements
>>>>>
>>>>> Note that those lines are at the bottom of the attrib_type.txt.gz
>>>>> file and are new in Ensembl 75. The same file in Ensembl 74 and
>>>>> earlier versions didn't have this problem and could easily be
>>>>> parsed with standard tools.
>>>>>
>>>>> I have a workaround for this but I was hoping that maybe something
>>>>> could be done on your side to fix the file.
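For readers who need to parse such a file as-is, one possible approach can be sketched in Python. This is only an illustration and not necessarily the workaround mentioned above. It assumes the dump follows MySQL's default tab-delimited export escaping, where a newline or tab inside a field is written as a backslash followed by the literal character, and NULL is written as \N. The function names and the sample rows are hypothetical and are not taken from the real attrib_type.txt.

```python
# Sketch of a reader for MySQL tab-delimited dumps with backslash escaping.
# Assumption: default export settings (fields split on tab, rows on newline,
# escape character is backslash). Names and sample data are hypothetical.

ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "0": "\0", "b": "\b", "Z": "\x1a"}


def parse_mysql_dump(text):
    """Return a list of rows; each row is a list of str (or None for NULL)."""
    rows, fields, buf = [], [], []
    is_null = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "N" and not buf:
                is_null = True  # \N at the start of a field marks NULL
            else:
                # A backslash before a literal newline or tab keeps that
                # character inside the field instead of ending it.
                buf.append(ESCAPES.get(nxt, nxt))
            i += 2
            continue
        if ch == "\t" or ch == "\n":
            fields.append(None if is_null and not buf else "".join(buf))
            buf, is_null = [], False
            if ch == "\n":
                rows.append(fields)
                fields = []
        else:
            buf.append(ch)
        i += 1
    if buf or fields or is_null:  # last row without a trailing newline
        fields.append(None if is_null and not buf else "".join(buf))
        rows.append(fields)
    return rows


def read_mysql_dump(path, encoding="utf-8"):
    # newline="" stops Python from translating carriage returns;
    # the encoding is an assumption and may need to be changed.
    with open(path, encoding=encoding, newline="") as handle:
        return parse_mysql_dump(handle.read())


# Made-up sample: the first row has an escaped CR+LF inside its last field.
SAMPLE = "1\tcode_a\tName A\tfirst line,\r\\\nsecond line\n2\tcode_b\tName B\t\\N\n"

for row in parse_mysql_dump(SAMPLE):
    print(len(row), row)
```

With a parser along these lines, a row whose description spans several physical lines still comes back as a single four-field record, which is what cut and read.table() cannot do on the raw file.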
>>>>>
>>>>> Thanks in advance,
>>>>> H.
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



