[ensembl-dev] discrepancies in number of alignmentblocks between website and files, 25 eutherian mammals EPO

Matthieu Muffato muffato at ebi.ac.uk
Tue Feb 13 15:25:45 GMT 2018


Hi Christian

Our EPO alignment pipeline also reconstructs ancestral sequences. At the 
database level, those are stored in separate alignment blocks. Out of 
the 329,294 blocks, half are made of ancestral sequences _only_ and the 
other half are made of extant sequences _only_ [1].

When you use the API, or when we prepare the alignment files, both kinds 
of blocks are automatically combined so that users do not have to bother 
with that technicality. In fact, the number of blocks advertised on the 
stats page should be half as well, and we'll fix that for the next 
release, due in April.

However, there is still a slight difference in the alignment files. From 
the database I would expect 164,646 blocks, whereas you count 165,214 
blocks in the files and I count 165,287 in them [2].
Sorry for this half-satisfying answer, but I need to have a deeper look 
at the files :)
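
To make the size of the gaps explicit, here is a small arithmetic sketch in Python. It only uses the four figures quoted in this thread; the variable names and the `excess` helper are illustrative and not part of any Ensembl tool.

```python
# Figures quoted in this thread (nothing here is read from the database).
ADVERTISED = 329_294         # blocks listed on the stats page
EXPECTED = 164_646           # blocks expected from the database
COUNTED_AWK = 165_214        # Christian's count in the MAF files
COUNTED_ZGREP = 165_287      # Matthieu's count in the MAF files


def excess(count, expected=EXPECTED):
    """Return how many more blocks were counted than expected."""
    return count - expected


if __name__ == "__main__":
    print("half of advertised:", ADVERTISED // 2)
    for label, count in (("awk", COUNTED_AWK), ("zgrep", COUNTED_ZGREP)):
        diff = excess(count)
        print(f"{label}: {diff:+d} blocks ({diff / EXPECTED:.2%} of expected)")
    print("difference between the two file counts:",
          COUNTED_ZGREP - COUNTED_AWK)
```

Both file counts exceed the expected number by a few hundred blocks, and they also differ from each other, which is why the files need a closer look.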

Regards,
Matthieu


[1] The numbers don't add up because, strangely, there is 1 block with 
both ancestral and extant sequences. I also think that there is 1 
spurious ancestral block, a leftover of the production run, that should 
be removed.
[2] zgrep "^a" *.maf.gz | wc -l
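
The two counting commands in this thread use slightly different criteria: zgrep "^a" matches any line that starts with the letter "a", while awk '$1=="a"' requires the first whitespace-separated field to be exactly "a". The sketch below counts both ways over gzip-compressed MAF files so the two numbers can be compared on the same input. The function names and the "*.maf.gz" pattern are assumptions made for illustration, and whether this difference explains the gap between the two counts has not been checked.

```python
import gzip
from pathlib import Path


def count_maf_blocks(path):
    """Return (prefix_count, field_count) for one gzip-compressed MAF file.

    prefix_count: lines starting with "a" (same criterion as zgrep "^a").
    field_count: lines whose first field is exactly "a" (as in awk '$1=="a"').
    """
    prefix_count = 0
    field_count = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith("a"):
                prefix_count += 1
            fields = line.split()
            if fields and fields[0] == "a":
                field_count += 1
    return prefix_count, field_count


def count_directory(directory, pattern="*.maf.gz"):
    """Sum both counts over every file in `directory` matching `pattern`."""
    total_prefix = 0
    total_field = 0
    for path in sorted(Path(directory).glob(pattern)):
        prefix_count, field_count = count_maf_blocks(path)
        total_prefix += prefix_count
        total_field += field_count
    return total_prefix, total_field


if __name__ == "__main__":
    # Assumes the compressed MAF files sit in the current directory.
    prefix_total, field_total = count_directory(".")
    print("lines starting with 'a':", prefix_total)
    print("lines with first field 'a':", field_total)
```

If the two totals agree, the counting criterion is not the source of the difference, and the set of files being counted would be the next thing to compare.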

On 12/02/18 14:55, Christian Groß - EWI wrote:
> Dear Dev Team,
> 
> I am writing to you because of a discrepancy between the number of 
> alignment blocks stated on your website and the number of alignment 
> blocks which I can find in the associated files.
> 
> On your website you list the details for the 25 eutherian mammals EPO 
> alignment 
> (https://www.ensembl.org/info/genome/compara/mlss.html?mlss=1102) and 
> state that there is a total of 329,294 blocks.
> 
> After downloading the entire alignment (838 files) in .maf format from 
> (ftp://ftp.ensembl.org/pub/release-91/maf/ensembl-compara/multiple_alignments/epo_25_eutherian/) 
> I unzipped them and utilized awk to count the number of alignment blocks 
> by counting the number of rows starting with an “a”
> 
> for file in 25_eutherian_mammals_EPO.* ; do awk '$1=="a"{count++}END{print count+0}' "$file" >> alignment_block_counts.txt ; done
> 
> awk '{sum+=$1}END{print sum}' alignment_block_counts.txt ;
> 
> The total number of alignment blocks sums up to 165,214, which is around 
> half of what is mentioned on the website. I am therefore a bit confused 
> about what the number on the website consists of.
> 
> Best regards,
> 
> Christian Gross
> 
> 
> 
> 

-- 
Matthieu Muffato, Ph.D.
Ensembl Compara and TreeFam Project Leader
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus, Hinxton
Cambridge, CB10 1SD, United Kingdom
Room  A3-145
Phone + 44 (0) 1223 49 4631
Fax   + 44 (0) 1223 49 4468


