[ensembl-dev] database .txt files have strange newline characters (help)

Andy Yates ayates at ebi.ac.uk
Fri Oct 28 16:35:46 BST 2011


Hi Mark,

These are tables which contain blob fields, and the appearance of control characters in them is something that happens by chance. When loading any Ensembl table you should use the command:

LOAD DATA LOCAL INFILE 'compressed_genotype_single_bp.txt' INTO TABLE `compressed_genotype_single_bp` FIELDS ESCAPED BY '\\';

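MySQL reads the '\\' in that statement as a single backslash and uses \ as the escape character, so a "bad character" such as a newline inside a field is treated as data rather than as the end of the row. mysqlimport has an equivalent command line argument, --fields-escaped-by.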
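
If you want to check a dump file before loading it, here is a minimal Python sketch that counts logical rows instead of physical lines. It assumes the file uses tab-separated fields, \n as the row terminator and \ as the escape character, so that a newline preceded by an odd number of backslashes belongs to the field data. The function name and the sample bytes are made up for illustration; they are not taken from the Ensembl dumps.

```python
import io


def count_logical_rows(stream):
    """Count rows in a dump where a backslash-escaped newline continues the row.

    `stream` must be a binary file-like object. Assumes '\\n' terminates rows
    and a single backslash is the escape character (an assumption about the
    dump format, not something checked here).
    """
    rows = 0
    pending = False  # True while we are inside a row that has not ended yet
    for line in stream:
        if not line.endswith(b"\n"):
            # Last physical line with no terminator: still part of a row.
            pending = True
            continue
        body = line[:-1]
        # Count the backslashes immediately before the newline.
        trailing = len(body) - len(body.rstrip(b"\\"))
        if trailing % 2 == 1:
            # Odd count: the newline itself is escaped, so the row continues.
            pending = True
        else:
            # Even count (including zero): this newline really ends the row.
            rows += 1
            pending = False
    if pending:
        rows += 1
    return rows


# Hypothetical sample: three physical lines, but the first newline is escaped.
sample = b"1\tAA\\\nBB\n2\tCC\n"
print(count_logical_rows(io.BytesIO(sample)))
```

On a real file you would open it in binary mode, e.g. open('compressed_genotype_single_bp.txt', 'rb'), and compare the result with count(*) on the loaded table rather than with wc -l.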

Hope this helps,

Andy

On 28 Oct 2011, at 16:18, Mark Aquino wrote:

> Hi all,
> 
> So I'm trying to create the latest version of the Ensembl DB (v64) locally and noticed that there are ^H newline characters in the .txt files from the FTP site:
> e.g.
> Ensembl variation schema:
> 
> Table: compressed_genotype_single_bp.txt  
> 
> 41719   27511   1115503 1120431 1       TT\0,GG\
> ¦CC^H5TT\05GG
> [Every row contains a newline character at that point, so wc reports 150M lines for the .txt file while count(*) on the DB table returns 75M rows.]
> On import this creates the corresponding Entry:
> 4179 2511 1115503 1120431 1 TT, GG
> 
> The following TT,GG are omitted from the entry as they are not on the same line.
> 
> Two questions: Is this even a problem? And if it is (which I'm guessing it is, otherwise why would the extra info be in the file to begin with), how can I fix it?
> 
> Best,
> Mark Aquino

---
Andrew Yates                   Ensembl Core Software Project Leader
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensembl.org/




