[ensembl-dev] dump of attrib_type table in Ensembl 75 breaks standard parsing tools

Steve Trevanion steve at ebi.ac.uk
Fri Mar 14 19:19:46 GMT 2014


Hi Herve,

Sorry for the delay in replying to you. This is not a problem we have 
encountered before (with these being MySQL files, control characters can 
appear in them quite legitimately) so after looking into how best to 
deal with it, we've decided to regenerate the affected files and upload 
them to the FTP site. We will let you know when they're in place, but we 
won't now be able to do this until next week. Apologies in advance if it 
causes you any more issues.

Regards,

Steve

On 13/03/14 19:58, Hervé Pagès wrote:
> On 03/13/2014 10:33 AM, Hervé Pagès wrote:
>> Hi,
>>
>> I'd like to report that the dump of the attrib_type table in Ensembl 75
>> contains some lines that seem to contain embedded EOL characters. These
>> embedded EOL characters tend to break standard parsing tools like the
>> cut Unix command or the read.table() function in R.
>>
>> Trying to use cut:
>>
>>    hpages:~$ wget ftp://ftp.ensembl.org/pub/release-74/mysql/homo_sapiens_core_74_37/attrib_type.txt.gz
>>
>
> Oops, looks like I gave the wrong URL, sorry. The above URL is for
> Ensembl 74 and there is no parsing issue with that file. The parsing
> issue I'm reporting below is with:
>
>
> ftp://ftp.ensembl.org/pub/release-75/mysql/homo_sapiens_core_75_37/attrib_type.txt.gz 
>
>
> H.
>
>>
>>    hpages:~$ gunzip attrib_type.txt.gz
>>    hpages:~$ tail attrib_type.txt | cut -f 1,2
>>    413    lnoncoding_rcnt
>>    414    pseudogene_rcnt
>>    415    pseudogene_racnt
>>    416    gencode_level
>>    \evel 2 (manually annotated loci),
>>    \evel 3 (automatically annotated loci)
>>
>>    417    gencode_basic
>>
>>    418    struct_var
>>
>> Trying to use read.table() in R:
>>
>>    > df <- read.table("attrib_type.txt", sep="\t", quote="", comment.char="")
>>    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>>      line 278 did not have 4 elements
>>
>> Note that those lines are at the bottom of the attrib_type.txt.gz
>> file and are new in Ensembl 75. The same file in Ensembl 74 and
>> earlier versions didn't have this problem and could easily be
>> parsed with standard tools.
>>
>> I have a workaround for this but I was hoping that maybe something
>> could be done on your side to fix the file.
>>
>> Thanks in advance,
>> H.
>>
>
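
For anyone hitting the same file before the regenerated dump is in place, here is a minimal sketch of one possible workaround. It is not necessarily the workaround Hervé mentions above. It assumes the dump follows the usual MySQL tab-delimited convention, where a newline embedded in a field is written as a backslash followed by a real newline, sometimes with a carriage return just before the backslash. The function name join_escaped_newlines is made up for illustration.

```shell
# Join physical lines that end in an unescaped backslash (assumed to mark a
# newline embedded in a field) with the line that follows, so that each
# record occupies exactly one line again.
join_escaped_newlines() {
  awk '
    {
      line = $0
      # Count trailing backslashes; an odd count means the newline is escaped.
      n = 0
      while (n < length(line) && substr(line, length(line) - n, 1) == "\\") n++
      if (n % 2 == 1) {
        # Drop the escaping backslash and any carriage return before it,
        # then continue the record on the next physical line.
        line = substr(line, 1, length(line) - 1)
        sub(/\r$/, "", line)
        buf = buf line " "
      } else {
        print buf line
        buf = ""
      }
    }
    END { if (buf != "") print buf }
  '
}
```

With that function defined, something like `join_escaped_newlines < attrib_type.txt | cut -f 1,2` should give one row per record. Each embedded newline is replaced by a single space, which is a choice made here for convenience and not something the dump format prescribes.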



