[ensembl-dev] database .txt files have strange newline characters (help)
aquinom85 at me.com
Fri Oct 28 16:18:24 BST 2011
I'm trying to build the latest version of the Ensembl DB (v64) locally and noticed that the .txt files from the FTP site contain ^H newline characters in the middle of rows. For example:
Ensembl variation schema:
41719 27511 1115503 1120431 1 TT\0,GG\
[Every row contains a newline character at that point, so wc reports 150M lines for the .txt file while count(*) on the DB table returns 75M rows.]
On import this creates the corresponding Entry:
4179 2511 1115503 1120431 1 TT, GG
The TT,GG that follows is omitted from the entry because it is not on the same line.
Two questions: Is this even a problem? And if it is (which I'm guessing it is; otherwise why would the extra info be in the file to begin with?), how can I fix it?
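One way to check whether the doubled line count comes purely from newlines embedded in fields is to count logical rows instead of physical lines. The Python sketch below assumes the files follow MySQL's default tab-delimited dump convention, in which a newline inside a field is written as a backslash followed by a real newline. That is an assumption, not something confirmed for these files. The function name count_rows and the sample row are made up for illustration.

```python
def count_rows(lines):
    """Return (physical_lines, logical_rows) for an iterable of text lines.

    Assumption: a line ending in an odd number of backslashes has an
    escaped newline, so the row continues on the next physical line.
    """
    physical = 0
    logical = 0
    pending = False  # True while we are inside a row that has not ended yet
    for line in lines:
        physical += 1
        body = line.rstrip("\n")
        # Count the backslashes at the very end of the line.
        trailing = len(body) - len(body.rstrip("\\"))
        if trailing % 2 == 1:
            # Odd count: the last backslash escapes the newline itself.
            pending = True
        else:
            # Even count (including zero): the row really ends here.
            logical += 1
            pending = False
    if pending:
        # The file ended in the middle of a row; count it anyway.
        logical += 1
    return physical, logical


# Hypothetical two-line sample modelled on the row quoted above.
sample = [
    "41719\t27511\t1115503\t1120431\t1\tTT\\0,GG\\\n",
    "TT,GG\n",
]
print(count_rows(sample))
```

Passing an open file handle instead of the sample list would give both counts for a whole dump file. If the logical count matches count(*) in the database, the extra lines are most likely escaped newlines inside a single field rather than lost rows.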