[ensembl-dev] Release-97 FASTA header

Black, Andrew N andrew at cgrb.oregonstate.edu
Thu Aug 8 23:46:53 BST 2019


The header of the following file:

ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.X.fa.gz
lists the number of nucleotides as:


>X dna:chromosome chromosome:GRCh38:X:1:156040895:1 REF
This states that there are 156,040,895 nucleotides in this sequence.

However, the number of nucleotides doesn’t match the number of characters:


grep -v ">" GRCh38X.fa | grep '[ATCGN]' | wc -c

158,641,577
This suggests that there are 158,641,577 nucleotides in this sequence.

It appears that the headers might be recycled from previous releases?

If I am correct in my conclusion, I just wanted to make sure that people at Ensembl were aware of this for future / past releases…
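
One way to cross-check that conclusion is to count only the sequence characters, since `wc -c` counts every byte, including the newline at the end of each line. A minimal sketch, using a hypothetical toy FASTA file written to a temporary path rather than the real chromosome X file:

```shell
# Hypothetical toy file: a 7-base sequence wrapped at 4 bases per line.
fasta=$(mktemp)
printf '>toy\nACGT\nNNA\n' > "$fasta"

# Counts sequence bytes plus one newline per sequence line: 7 + 2 = 9.
# (tr -d ' ' strips the padding some wc implementations add.)
with_newlines=$(grep -v '>' "$fasta" | wc -c | tr -d ' ')

# Removes the newlines first, so only the 7 sequence characters are counted.
bases_only=$(grep -v '>' "$fasta" | tr -d '\n' | wc -c | tr -d ' ')

echo "with newlines: $with_newlines, bases only: $bases_only"
rm -f "$fasta"
```

Applied to the numbers above, and assuming the file is wrapped at 60 bases per line (an assumption, not something checked here), 156,040,895 bases would occupy 2,600,682 lines, and 156,040,895 + 2,600,682 = 158,641,577.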

Andrew