[ensembl-dev] discrepancies in number of alignment blocks between website and files, 25 eutherian mammals EPO

Mon Mar 12 12:14:42 GMT 2018

Hi Christian,

While reviewing the code that generates the EMF/MAF files, I've found a 
very simple reason why the number of blocks is different (apart from the 
factor of 2). It is explained in the README:
ftp://ftp.ensembl.org/pub/release-91/emf/ensembl-compara/multiple_alignments/epo_25_eutherian/README.25_eutherian_mammals_EPO

-----------
Alignments are grouped by human chromosome, and then by coordinate 
system. Alignments containing duplications in human are dumped once per 
duplicated segment.
-----------

I've still found that some alignments have an extra, incomplete, block 
(only the ancestral sequences but not the extant sequences) but I think 
those are spurious rows left from the production process that should 
have been deleted. However, the API should discard them because they're 
incomplete, so there is nothing to worry about.
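
To check a local copy of the MAF files for such ancestral-only blocks, something along these lines could serve as a rough sketch. It assumes that ancestral rows are "s" lines whose source name starts with "ancestral_sequences." (this naming is an assumption and should be checked against a few lines of the actual files); count_maf_blocks is only an illustrative helper name, not part of any Ensembl tool.

```shell
#!/bin/sh
# Sketch: count MAF blocks and flag blocks that contain only ancestral rows.
# Assumption: ancestral rows are "s" lines whose source name starts with
# "ancestral_sequences." (hypothetical prefix; verify it in your files).

count_maf_blocks() {
    # Reads MAF text on stdin; prints "<total blocks> <ancestral-only blocks>".
    awk '
        function close_block() {
            # Tally the block that just ended, if there was one.
            if (in_block) {
                total++
                if (n_seq > 0 && n_anc == n_seq) anc_only++
            }
        }
        # An "a" line starts a new block and ends the previous one.
        $1 == "a" { close_block(); in_block = 1; n_seq = 0; n_anc = 0; next }
        # An "s" line is one aligned sequence inside the current block.
        $1 == "s" && in_block {
            n_seq++
            if ($2 ~ /^ancestral_sequences\./) n_anc++
        }
        END { close_block(); print total + 0, anc_only + 0 }
    '
}

# Usage on the downloaded files (decompressing on the fly):
#   gzip -dc *.maf.gz | count_maf_blocks
```

The first number is the same kind of count as the "a"-line counts discussed in this thread; the second is only meaningful if the naming assumption above holds.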

Matthieu

On 13/02/18 15:25, Matthieu Muffato wrote:
> Hi Christian
> 
> Our EPO alignment pipeline also reconstructs ancestral sequences. At the 
> database level, those are stored in separate alignment blocks. Out of 
> the 329,294 blocks, half are made of ancestral sequences _only_ and the 
> other half are made of extant sequences _only_ [1].
> 
> When using the API or preparing the alignment files, both are 
> automatically combined so that users do not have to bother with that 
> technicality. In fact, the number of blocks advertised on the stats page 
> should be half as well, and we'll fix that for the next release, due in 
> April.
> 
> However, there is still a slight difference in the alignment files. From 
> the database I would expect 164,646 blocks; you count 165,214 blocks in 
> the files, and I count 165,287 [2].
> Sorry for this half-satisfying answer, but I need to have a deeper look 
> at the files :)
> 
> Regards,
> Matthieu
> 
> 
> [1] The numbers don't add up because, strangely, there is 1 block with 
> both ancestral and extant sequences. I also think that there is 1 
> spurious ancestral block, a leftover of the production run, that should 
> be removed.
> [2] zgrep "^a" *.maf.gz | wc -l
> 
> On 12/02/18 14:55, Christian Groß - EWI wrote:
>> Dear Dev Team,
>>
>> I am writing to you because of a discrepancy between the number of 
>> alignment blocks stated on your website and the number of alignment 
>> blocks which I can find in the associated files.
>>
>> On your website you list the details for the 25 eutherian mammals EPO 
>> alignment 
>> (https://www.ensembl.org/info/genome/compara/mlss.html?mlss=1102) and 
>> state that there is a total of 329,294 blocks.
>>
>> After downloading the entire alignment (838 files) in .maf format from 
>> (ftp://ftp.ensembl.org/pub/release-91/maf/ensembl-compara/multiple_alignments/epo_25_eutherian/) 
>> I unzipped them and utilized awk to count the number of alignment 
>> blocks by counting the number of rows starting with an “a”
>>
>> for file in 25_eutherian_mammals_EPO.* ; do
>>   awk '$1=="a"{count++}END{print count}' "$file" >> alignment_block_counts.txt
>> done
>>
>> awk '{sum+=$1}END{print sum}' alignment_block_counts.txt ;
>>
>> The total number of alignment blocks sums up to 165,214, which is 
>> around half of what is mentioned on the website, so I am a bit 
>> confused about what this number consists of.
>>
>> Best regards,
>>
>> Christian Gross
>>
>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: 
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
> 

-- 
Matthieu Muffato, Ph.D.
Ensembl Compara and TreeFam Project Leader
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus, Hinxton
Cambridge, CB10 1SD, United Kingdom
Room  A3-145
Phone + 44 (0) 1223 49 4631
Fax   + 44 (0) 1223 49 4468