[ensembl-dev] database .txt files have strange newline characters (help)

Mark Aquino aquinom85 at me.com
Fri Oct 28 16:18:24 BST 2011


Hi all,

I'm trying to build a local copy of the latest version of the Ensembl DB (v64) and noticed that there are ^H newline characters in the .txt files from the FTP site:
e.g.
Ensembl variation schema:

Table: compressed_genotype_single_bp.txt  

41719   27511   1115503 1120431 1       TT\0,GG\
¦CC^H5TT\05GG
[Every row contains a newline character at that point, so wc reports 150M lines for the .txt file while count(*) on the DB table returns 75M rows.]
On import this creates the corresponding Entry:
 4179 2511 1115503 1120431 1 TT, GG

The TT,GG values that follow are omitted from the entry because they are not on the same line.
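
One way to check whether the doubled line count comes from escaped newlines inside a field is sketched below. It assumes the .txt files are in MySQL's default tab-delimited dump format, where a newline inside a field is written as a backslash followed by a real newline; I have not confirmed that this is how the Ensembl files were produced. The function name count_rows and the sample bytes are made up for illustration.

```python
import io


def count_rows(stream):
    """Return (physical_lines, logical_rows) for a binary stream.

    Assumption: a newline preceded by an odd number of backslashes is an
    escaped newline inside a field, so the row continues on the next line.
    """
    physical = 0
    logical = 0
    for line in stream:
        physical += 1
        body = line.rstrip(b"\n")
        # Count the backslashes immediately before the newline.
        trailing = len(body) - len(body.rstrip(b"\\"))
        # An even count means the newline itself is not escaped,
        # so this physical line ends a logical row.
        if trailing % 2 == 0:
            logical += 1
    return physical, logical


# Illustrative sample modelled on the row above: one row split over two lines.
sample = b"41719\t27511\t1115503\t1120431\t1\tTT\\0,GG\\\nCC\x085TT\\05GG\n"
print(count_rows(io.BytesIO(sample)))
```

For a real file, open it in binary mode, e.g. count_rows(open(path, "rb")), and compare the second number against count(*) for the table. If the two match, the extra lines are escaped newlines inside the genotype column rather than separate rows.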

Two questions: Is this even a problem? And if it is (which I'm guessing it is, otherwise why would the extra info be in the file to begin with), how can I fix it?

Best,
Mark Aquino




