[ensembl-dev] database .txt files have strange newline characters (help)
Andy Yates
ayates at ebi.ac.uk
Fri Oct 28 16:35:46 BST 2011
Hi Mark,
These tables contain BLOB fields, so control characters turn up in the data purely by chance. When loading any Ensembl table you should use a command of this form:
LOAD DATA LOCAL INFILE 'compressed_genotype_single_bp.txt' INTO TABLE `compressed_genotype_single_bp` FIELDS ESCAPED BY '\\';
MySQL reads the '\\' as a single \ and uses it as the escape character, so a "bad character" such as a newline inside a field is treated as data rather than as the end of the row. mysqlimport has an equivalent command line option, --fields-escaped-by.
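To see why wc can report roughly twice as many lines as count(*) reports rows, here is a minimal Python sketch. It assumes the dump is in MySQL's tab-separated format with \ as the escape character and \n as the line terminator. The function name count_logical_rows and the sample data are made up for illustration and are not part of any Ensembl or MySQL tool. The idea is that a physical line ending in an odd number of backslashes has its newline escaped, so the record carries on into the next line.

```python
def count_logical_rows(lines, escape=b"\\"):
    """Count records in a backslash-escaped dump (hypothetical helper).

    A line ending in an odd number of escape characters has an escaped
    newline, so it is only the first part of a longer record.
    """
    rows = 0
    for line in lines:
        body = line.rstrip(b"\n")
        # Number of escape characters immediately before the newline.
        trailing = len(body) - len(body.rstrip(escape))
        if trailing % 2 == 1:
            continue  # newline is data; the record continues on the next line
        rows += 1
    return rows


# Made-up sample: the first record spans two physical lines.
sample = [
    b"1\tTT\\0,GG\\\n",
    b"CC\x085TT\\05GG\n",
    b"2\tAA\n",
]
physical_lines = len(sample)
logical_rows = count_logical_rows(sample)
```

With a real file you could pass open(path, 'rb') instead of the sample list, since a file opened in binary mode iterates over physical lines as bytes.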
Hope this helps,
Andy
On 28 Oct 2011, at 16:18, Mark Aquino wrote:
> Hi all,
>
> So I'm trying to create the latest version of the Ensembl DB (v64) locally and noticed that there are ^H newline characters in the .txt files from the FTP site:
> e.g.
> Ensembl variation schema:
>
> Table: compressed_genotype_single_bp.txt
>
> 41719 27511 1115503 1120431 1 TT\0,GG\
> ¦CC^H5TT\05GG
> [every line contains a newline character there, so the txt file has a wc of 150M lines while the DB count(*) gives 75M rows.]
> On import this creates the corresponding Entry:
> 4179 2511 1115503 1120431 1 TT, GG
>
> The following TT,GG are omitted from the entry as they are not on the same line.
>
> 2 Questions: Is this even a problem? And if it is (which I'm guessing it is otherwise why would the extra info be in the file to begin with) how can I fix it?
>
> Best,
> Mark Aquino
---
Andrew Yates Ensembl Core Software Project Leader
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensembl.org/