[ensembl-dev] Release-97 FASTA header

Thomas Danhorn danhornt at njhealth.org
Fri Aug 9 01:51:45 BST 2019


Hi Andrew,

Your "wc -c" command is counting newline characters, so it will not give 
the correct number of nucleotides if there are line breaks (and in FASTA 
files with millions of nucleotides I would expect quite a few of those).
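
As a rough check, the newlines alone could account for the difference. This is only a sketch: it assumes the sequence is wrapped at 60 characters per line, which I have not verified on your file.

```shell
# Assumption: sequence lines are wrapped at 60 characters (not checked against the actual file).
nt=156040895                             # sequence length stated in the FASTA header
width=60                                 # assumed line width
lines=$(( (nt + width - 1) / width ))    # number of sequence lines, rounded up
echo "$lines"                            # 2600682 newline characters
echo $(( nt + lines ))                   # 158641577, the figure reported by "wc -c"
```

Under that assumption, the header length plus one newline per line gives exactly the 158,641,577 you reported.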

Try this:

awk 'BEGIN{sum=0} $0 !~ /^>/ {sum += length($0);} END{print sum}' GRCh38X.fa

or this (if you prefer grep):

grep -v '^>' GRCh38X.fa | grep -o '[ACGTN]' | wc -l

(I don't have the GRCh38X.fa file, so I can't tell you if the header 
matches what you will get.)
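
To see the effect on a small case, here is a sketch using a made-up file (toy.fa and its contents are hypothetical, not taken from the Ensembl release). It also shows a third option, stripping newlines with tr before counting bytes.

```shell
# Build a tiny, made-up FASTA file: 2 records, 3 sequence lines, 10 nucleotides.
printf '>seq1 toy\nACGT\nNN\n>seq2 toy\nACGT\n' > toy.fa

# Counting bytes includes one newline per sequence line: 10 + 3 = 13.
with_newlines=$(grep -v '^>' toy.fa | wc -c)

# Deleting newlines first leaves only the sequence characters: 10.
without_newlines=$(grep -v '^>' toy.fa | tr -d '\n' | wc -c)

# The awk approach sums the length of each non-header line: 10.
awk_count=$(awk '$0 !~ /^>/ {sum += length($0)} END {print sum}' toy.fa)

echo "wc -c: $with_newlines  tr + wc -c: $without_newlines  awk: $awk_count"
```

The first count is too high by exactly the number of sequence lines, and the other two agree with each other.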

Hope this helps,

Thomas


On Thu, 8 Aug 2019, Black, Andrew N wrote:

> Looking at the following file:
>
> ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.X.fa.gz
> Lists the number of nucleotides as:
>
>
> >X dna:chromosome chromosome:GRCh38:X:1:156040895:1 REF
> Stating that there are 156,040,895 nucleotides in this sequence.
>
> However, the number of nucleotides doesn’t match the number of characters:
>
>
> grep -v ">" GRCh38X.fa | grep [A,T,C,G,N] | wc -c
>
> 158,641,577
> Stating that there are 158,641,577 nucleotides in this sequence
>
> It appears that the headers might be recycled from previous releases?
>
> If I am correct in my conclusion, I just wanted to make sure that people at Ensembl were aware of this for future / past releases…
>
> Andrew
>

