[ensembl-dev] [ Compara ] proposal: prioritise mcoffee_himem over mcoffee and mcoffee_short

Thu Feb 15 11:30:03 GMT 2018

Hi Wojciech,

Yes, I am aware that the scheduling of the mcoffee / mafft jobs is not
great at the moment.

Two steps before mcoffee / mafft there is "cluster_factory", which creates
jobs in decreasing order of family size. The intent was to get the
biggest families started first (MySQL and eHive run jobs by increasing
job_id). However, that property was partially lost when we introduced the
"alignment_entry_point", which chooses the best method to align each
cluster with. All the flavours of mcoffee and mafft end up being treated
equally. Within each analysis the biggest families will still start
first, but because the analyses are all coupled via the hive_capacity,
only a fraction of the compute power is effectively dedicated to
mcoffee_himem.

eHive has a solution for that: analysis priority. By default the value 
is 0, but it can be either decreased or increased. eHive will try to 
fully allocate the analyses with the highest priority first, so a 
solution would be to set a higher priority on the mcoffee_himem analysis 
so that the jobs are scheduled first.
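
As a rough illustration of "fully allocate the highest priority first",
here is a toy model in Python. It is not eHive's scheduler code: the
function name allocate_workers, the priority values, the job counts and
the pool of 100 slots are all made up for the example.

```python
def allocate_workers(analyses, total_slots):
    """Toy model: hand out worker slots in decreasing priority order.

    `analyses` maps a logic_name to (priority, ready_jobs). This is only
    an illustration of the idea, not how eHive is implemented.
    """
    allocation = {}
    remaining = total_slots
    # Highest priority first; analyses default to priority 0.
    for name, (priority, ready_jobs) in sorted(
        analyses.items(), key=lambda item: item[1][0], reverse=True
    ):
        granted = min(ready_jobs, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation


# Hypothetical priorities and ready-job counts, for illustration only.
analyses = {
    "mcoffee_himem": (20, 30),
    "mafft_himem": (15, 10),
    "mafft": (10, 500),
    "mcoffee": (5, 2000),
    "mcoffee_short": (0, 5000),
}
print(allocate_workers(analyses, total_slots=100))
```

In this sketch the 30 mcoffee_himem jobs are all served before any
lower-priority analysis gets a slot, and the analyses at the bottom of
the list get nothing until slots free up.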

I would suggest setting the highest priority on mcoffee_himem because the
alignments take a lot of time to generate, and then setting lower and lower
priorities on mafft_himem, mafft, mcoffee and mcoffee_short, so that
overall the biggest families can go through first.
But what is important to change at the same time is the hive_capacity. 
Currently all aligners have the same capacity but in reality they each 
take a different amount of resources on the server. mcoffee_himem spends 
a lot of time computing the alignment offline so we could run a lot more 
of these than mcoffee_short, which only deals with small families and 
very frequently comes to the database.

At some point in the past we set priorities without changing the
capacity, which left the pipeline "stuck" running only long
mcoffee alignments whilst the short families were still waiting and the
database server was not visibly being used.
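
To make the interaction between priority and capacity concrete, here is
a toy model in Python. It assumes that each worker of an analysis adds
1/hive_capacity to a shared load that may not exceed 1, which is a
simplification of eHive and not its actual code. The function name
allocate_by_load and all the priorities, capacities and job counts are
invented for the example.

```python
from fractions import Fraction


def allocate_by_load(analyses):
    """Toy model of priority-ordered allocation under a shared load budget.

    `analyses` maps a logic_name to (priority, hive_capacity, ready_jobs).
    Assumption: one worker costs 1/hive_capacity of a total load of 1.
    """
    budget = Fraction(1)
    allocation = {}
    for name, (priority, capacity, ready_jobs) in sorted(
        analyses.items(), key=lambda item: item[1][0], reverse=True
    ):
        affordable = int(budget * capacity)  # workers the remaining load allows
        granted = min(ready_jobs, affordable)
        allocation[name] = granted
        budget -= Fraction(granted, capacity)
    return allocation


# Priorities set, but every aligner keeps the same (hypothetical) capacity.
same_capacity = {
    "mcoffee_himem": (20, 200, 300),
    "mcoffee": (5, 200, 120),
    "mcoffee_short": (0, 200, 5000),
}
# Same priorities, with a larger capacity for the jobs that rarely touch
# the database and a smaller one for those that touch it often.
tuned_capacity = {
    "mcoffee_himem": (20, 1000, 300),
    "mcoffee": (5, 400, 120),
    "mcoffee_short": (0, 100, 5000),
}
print(allocate_by_load(same_capacity))
print(allocate_by_load(tuned_capacity))
```

With identical capacities the high-priority analysis uses up the whole
load budget and the short families wait. With tuned capacities the 300
mcoffee_himem workers only cost 30% of the budget, so the other
analyses can run alongside them.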

Matthieu

On 15/02/18 10:58, Wojciech Bazant wrote:
> 
> 
> Hi,
> 
> I am currently running Compara for a release of Wormbase ParaSite, 
> corresponding to Ensembl 89.
> 
> The pipeline is currently finishing mcoffee_himem, and some of
> the runs take very long - my pipeline is currently stuck on finishing two
> remaining jobs, and otherwise it isn't doing very much - everything it
> could have done is done.
> 
> I am thinking that if the jobs that sometimes take very long (here,
> mcoffee_himem) were given capacity first, the situation I am in now
> wouldn't have happened: the remaining long-running jobs would be taking
> their sweet time but meanwhile the capacity would go to shorter jobs.
> 
> I originally thought something's wrong with the long running 
> mcoffee_himem jobs but I think it's just the nature of the problem 
> they're given - they're keeping CPU 100% busy as they should. Also I'm 
> not sure how to express in Hive the concept of "give resources to this 
> job first if possible but let the other job run in parallel" but I think 
> it's just a matter of reordering the jobs within a fan or something.
> 
> Do you think it's a good idea? Do you think it's a worthwhile enough 
> idea for me to mess around with job orderings in future runs of WBPS 
> Compara :) ? Would you want it as a default setting for Compara 92?
> 
> This is how long the jobs usually take (cpu_sec, rounded to the nearest power of ten):
> select count(*), pow(10, round(log(10, cpu_sec),0)) as bucket  from 
> worker inner join worker_resource_usage using (worker_id) where 
> (worker.resource_class_id=34) group by bucket order by bucket;
> +----------+--------+
> | count(*) | bucket |
> +----------+--------+
> |        3 |    0.1 |
> |      644 |      1 |
> |      148 |     10 |
> |       17 |    100 |
> |      255 |   1000 |
> |     2274 |  10000 |
> |       11 | 100000 |
> +----------+--------+
> 
> These are the longest runs:
> select cpu_sec from worker inner join worker_resource_usage using 
> (worker_id) where (worker.resource_class_id=34) order by -cpu_sec limit 15;
> +---------+
> | cpu_sec |
> +---------+
> | 46893.5 |
> | 45129.3 |
> |   43254 |
> | 41565.2 |
> | 41472.6 |
> | 38532.7 |
> | 37886.8 |
> | 37643.3 |
> | 36617.7 |
> | 36336.8 |
> | 32764.8 |
> |   31295 |
> |   28305 |
> | 28287.9 |
> | 27565.6 |
> +---------+
> 
> Wojtek

-- 
Matthieu Muffato, Ph.D.
Ensembl Compara and TreeFam Project Leader
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus, Hinxton
Cambridge, CB10 1SD, United Kingdom
Room  A3-145
Phone + 44 (0) 1223 49 4631
Fax   + 44 (0) 1223 49 4468