[ensembl-dev] discrepancies in number of alignmentblocks between website and files, 25 eutherian mammals EPO
Matthieu Muffato
muffato at ebi.ac.uk
Tue Feb 13 15:25:45 GMT 2018
Hi Christian
Our EPO alignment pipeline also reconstructs ancestral sequences. At the
database level, those are stored in separate alignment blocks. Out of
the 329,294 blocks, half are made of ancestral sequences _only_ and the
other half are made of extant sequences _only_ [1].
When using the API / preparing the alignment files, both kinds of block
are automatically combined so that users do not have to deal with that
technicality. In fact, the number of blocks advertised on the stats page
should be halved as well, and we'll fix that for the next release, due in
April.
However, there is still a slight difference in the alignment files. From
the database I would expect 164,646 blocks; you count 165,214 blocks in
the files, and I count 165,287 in them [2].
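To make the sizes of these gaps explicit, here is a small sketch that only
redoes the arithmetic on the figures quoted in this thread. The variable
names are made up for illustration, and nothing here queries the database
or reads the alignment files.

```shell
#!/usr/bin/env bash
# Figures quoted in this thread (variable names are illustrative only).
website_total=329294    # blocks advertised on the stats page
db_expected=164646      # combined blocks expected from the database
count_awk=165214        # Christian's awk count over the MAF files
count_zgrep=165287      # zgrep count over the MAF files [2]

echo "half of the website total:   $(( website_total / 2 ))"
echo "awk count minus expected:    $(( count_awk - db_expected ))"
echo "zgrep count minus expected:  $(( count_zgrep - db_expected ))"
echo "zgrep count minus awk count: $(( count_zgrep - count_awk ))"
```

So half of the advertised total is 164,647, the two file counts exceed the
expected 164,646 by 568 and 641 blocks respectively, and they differ from
each other by 73.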
Sorry for this half-satisfying answer, but I need to have a deeper look
at the files :)
Regards,
Matthieu
[1] The numbers don't add up because, strangely, there is 1 block with
both ancestral and extant sequences. I also think that there is 1
spurious ancestral block, a leftover of the production run, that should
be removed.
[2] zgrep "^a" *.maf.gz | wc -l
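The two counts in this thread use slightly different criteria: the awk
command requires the first field to be exactly "a", whereas the zgrep
pattern matches any line that merely starts with the letter "a". The sketch
below puts both criteria side by side so they can be run on the same
gzipped files. The function names are hypothetical helpers, not part of any
Ensembl tooling, and this is not a claim that the difference in criteria
explains the difference between the two totals.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing the two counting criteria used in this thread.

# Count lines whose first whitespace-separated field is exactly "a"
# (the criterion of the awk command), across all gzipped files given.
count_maf_blocks() {
    gzip -cd -- "$@" | awk '$1 == "a" { n++ } END { print n + 0 }'
}

# Count every line that starts with the letter "a"
# (the criterion of the zgrep pattern), across the same files.
count_a_prefixed_lines() {
    gzip -cd -- "$@" | grep -c '^a'
}
```

For example, `count_maf_blocks *.maf.gz` and `count_a_prefixed_lines *.maf.gz`
can be run in the download directory; if the two numbers differ, some lines
start with "a" without being block headers.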
On 12/02/18 14:55, Christian Groß - EWI wrote:
> Dear Dev Team,
>
> I am writing you because of a discrepancy between the number of
> alignment blocks stated on your website and the number of alignment
> blocks which I can find in the associated files.
>
> On your website you list the details for the 25 eutherian mammals EPO
> alignment
> (https://www.ensembl.org/info/genome/compara/mlss.html?mlss=1102) and
> state that there is a total of 329,294 blocks.
>
> After downloading the entire alignment (838 files) in .maf format from
> (ftp://ftp.ensembl.org/pub/release-91/maf/ensembl-compara/multiple_alignments/epo_25_eutherian/)
> I unzipped them and utilized awk to count the number of alignment blocks
> by counting the number of rows starting with an “a”:
>
> for file in 25_eutherian_mammals_EPO.* ; do
>   awk '$1=="a"{count++}END{print count+0}' "$file" >> alignment_block_counts.txt ;
> done
>
> awk '{sum+=$1}END{print sum}' alignment_block_counts.txt ;
>
> The total number of alignment blocks sums up to 165,214, which is around
> half of what is mentioned on the website. I am therefore a bit confused
> about what the number on the website consists of.
>
> Best regards,
>
> Christian Gross
>
>
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
--
Matthieu Muffato, Ph.D.
Ensembl Compara and TreeFam Project Leader
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus, Hinxton
Cambridge, CB10 1SD, United Kingdom
Room A3-145
Phone + 44 (0) 1223 49 4631
Fax + 44 (0) 1223 49 4468