[ensembl-dev] dump of attrib_type table in Ensembl 75 breaks standard parsing tools

Steve Trevanion steve at ebi.ac.uk
Wed Mar 19 12:43:25 GMT 2014


Hi Herve,

The files have been regenerated and uploaded to the FTP site. We've also
put internal checks in place so that this problem shouldn't slip through
again.

Regards,

Steve

On 14/03/14 21:25, Hervé Pagès wrote:
> Hi Steve,
>
> On 03/14/2014 12:19 PM, Steve Trevanion wrote:
>> Hi Herve,
>>
>> Sorry for the delay in replying to you. This is not a problem we have
>> encountered before (with these being MySQL files, control characters can
>> appear in them quite legitimately) so after looking into how best to
>> deal with it, we've decided to regenerate the affected files and upload
>> them to the FTP site. We will let you know when they're in place, but we
>> won't now be able to do this until next week. Apologies in advance if it
>> causes you any more issues.
>
> No worries. I'm glad that you were able to regenerate these files.
> Thanks a lot.
>
> H.
>
>>
>> Regards,
>>
>> Steve
>>
>> On 13/03/14 19:58, Hervé Pagès wrote:
>>> On 03/13/2014 10:33 AM, Hervé Pagès wrote:
>>>> Hi,
>>>>
>>>> I'd like to report that the dump of the attrib_type table in Ensembl 75
>>>> contains some lines that seem to contain embedded EOL characters. These
>>>> embedded EOL characters tend to break standard parsing tools like the
>>>> cut Unix command or the read.table() function in R.
>>>>
>>>> Trying to use cut:
>>>>
>>>>    hpages:~$ wget ftp://ftp.ensembl.org/pub/release-74/mysql/homo_sapiens_core_74_37/attrib_type.txt.gz
>>>>
>>>
>>> Oops, looks like I gave the wrong URL, sorry. The above URL is for
>>> Ensembl 74 and there is no parsing issue with this file. The parsing
>>> issue I'm reporting below is with:
>>>
>>> ftp://ftp.ensembl.org/pub/release-75/mysql/homo_sapiens_core_75_37/attrib_type.txt.gz
>>>
>>> H.
>>>
>>>>
>>>>    hpages:~$ gunzip attrib_type.txt.gz
>>>>    hpages:~$ tail attrib_type.txt | cut -f 1,2
>>>>    413    lnoncoding_rcnt
>>>>    414    pseudogene_rcnt
>>>>    415    pseudogene_racnt
>>>>    416    gencode_level
>>>>    \evel 2 (manually annotated loci),
>>>>    \evel 3 (automatically annotated loci)
>>>>
>>>>    417    gencode_basic
>>>>
>>>>    418    struct_var
>>>>
>>>> Trying to use read.table() in R:
>>>>
>>>>    > df <- read.table("attrib_type.txt", sep="\t", quote="", comment.char="")
>>>>    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>>>>      line 278 did not have 4 elements
>>>>
>>>> Note that those lines are at the bottom of the attrib_type.txt.gz
>>>> file and are new in Ensembl 75. The same file in Ensembl 74 and
>>>> earlier versions didn't have this problem and could easily be
>>>> parsed with standard tools.
>>>>
>>>> I have a workaround for this but I was hoping that maybe something
>>>> could be done on your side to fix the file.
>>>>
>>>> Thanks in advance,
>>>> H.
>>>>
>>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>
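
The malformed lines reported in this thread can be located by counting
tab-separated fields on each line of the dump. What follows is only a
minimal sketch of such a check, not the internal check mentioned above
(the thread does not describe it). The function name check_field_count is
hypothetical, and the expected count of 4 is taken from the read.table()
error quoted in the original report.

```shell
# Hypothetical helper: list the lines of a tab-delimited dump whose field
# count differs from the expected one.
# Prints one "<line number><TAB><field count>" record per offending line.
check_field_count() {
    file=$1       # path to the uncompressed dump, e.g. attrib_type.txt
    expected=$2   # expected number of tab-separated fields per line
    awk -F '\t' -v n="$expected" 'NF != n { printf "%d\t%d\n", NR, NF }' "$file"
}
```

Running `check_field_count attrib_type.txt 4` should print nothing for a
file in which every row sits on a single line. Assuming an embedded newline
is written as a backslash followed by a real line break, the row that
contains it still has 4 fields up to the break, so the lines reported are
the continuation lines (and any blank lines) that follow it.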




