[ensembl-dev] consistency in the fasta title line

A. J. anemone.c at gmail.com
Mon Jan 18 17:20:31 GMT 2021

Hello dev team,

I'd like to make a suggestion regarding the title line of the Y
chromosome's sequence .fa file. In recent releases (GRCh38.p13, releases
100, 101 and 102), and probably in most if not all earlier ones, the title
line gives only the start and end positions of the Y chromosome's second
unique sequence:
>Y dna:chromosome chromosome:GRCh38:Y:2781480:56887902:1 REF

The title does not cover the two pseudoautosomal regions (PARs), which are
masked with 'N', or the other two unique sequences. Every other
chromosome's title shows the start and end positions of its sequence, so
chromosome Y is an exception and is inconsistent with the rest. This is
especially misleading if we read the title line to get an idea of the
overall length of the sequence.

This is probably a minor issue, as most programs simply ignore the title
line when reading the sequence, and a single exception can always be
handled by hand. However, it would be helpful if the fasta file itself
carried information that subsequent applications can use, rather than just
a record of its origin. The Y chromosome's title line is probably a
by-product of how the API assembles chromosome Y from the stored sequence;
even so, it is an inconsistency within the fasta file itself. It would
therefore be nice if you could consider settling on a consistent and
informative format for the fasta titles. In my humble opinion, the simplest
solution is for the title to describe the sequence that directly follows
it:
>Y dna:chromosome chromosome:GRCh38:Y:1:57227415:1 REF
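
To illustrate why this matters for downstream use, here is a minimal Python
sketch that reads the coordinates from a title line and derives the length
they imply. The function name span_from_title and the regular expression
are my own assumptions, with the field layout inferred only from the two
title lines quoted in this message; this is not an Ensembl API.

```python
import re

# Assumed layout, inferred from the title lines quoted in this message:
# >NAME TYPE COORD_SYSTEM:ASSEMBLY:REGION:START:END:STRAND [extra]
TITLE_RE = re.compile(
    r"^>(?P<name>\S+)\s+(?P<kind>\S+)\s+"
    r"(?P<coord_system>[^:\s]+):(?P<assembly>[^:\s]+):(?P<region>[^:\s]+):"
    r"(?P<start>\d+):(?P<end>\d+):(?P<strand>-?\d+)"
)


def span_from_title(title):
    """Return (start, end, implied_length) from a fasta title line.

    Hypothetical helper: coordinates are treated as 1-based and inclusive,
    so the implied length is end - start + 1.
    """
    match = TITLE_RE.match(title)
    if match is None:
        raise ValueError(f"unrecognised title line: {title!r}")
    start = int(match["start"])
    end = int(match["end"])
    return start, end, end - start + 1


current = ">Y dna:chromosome chromosome:GRCh38:Y:2781480:56887902:1 REF"
proposed = ">Y dna:chromosome chromosome:GRCh38:Y:1:57227415:1 REF"

print(span_from_title(current))   # (2781480, 56887902, 54106423)
print(span_from_title(proposed))  # (1, 57227415, 57227415)
```

With the current title the implied length is 54106423, while the proposed
title gives 57227415, so a program that trusts the title would
underestimate the Y chromosome by 3120992 positions.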

Thank you.

Best wishes,

A. J.
