Friday, May 8, 2009

Build Farm Part 2: The core mission

The core mission of the Build Farm is:
  • Accurately assign the right work to the right machines,
  • While providing access to historical results for analysis.
You can make a perfectly reasonable argument that both these sub-missions are equally important, but it's the first part of the mission that I want to talk about here: as jobs queue up on the Build Farm, and as machines become available to do work, which job should be assigned to which machine?

Efficiently executing work is an often-discussed topic; you can find it covered under topics such as "Workload Management", "Resource Management", etc. It tends to be the province of software such as DBMSs, Application Servers, and Operating Systems. In other words, doing this well is a really hard problem.

We did not want to write an operating system.

So, we made several simplifying assumptions:
  • We can afford to over-provision our Build Farm. Machine resources are cheap enough nowadays, and our overall problem is small enough, that we can "throw hardware at the problem".
  • We don't have to be perfect. We have to do a decent job of scheduling work, but we can afford to make the occasional sub-par decision, so long as in general all the necessary work is getting done in a timely fashion.
We decided upon a fairly simple design, based on two basic concepts: Capabilities, and Priorities.

Priorities are just what you think they are: each job in the Build Farm is assigned a priority, which is just an integer, and jobs with a higher priority are scheduled in preference to jobs with a lower priority.

Capabilities are an abstract concept that enables the Build Farm to perform match-making between waiting jobs and available machines. Jobs require certain capabilities, and machines provide certain capabilities; for a job to be scheduled on a machine, the capabilities the job requires must be a subset of the capabilities the machine provides.

So, for example, a particular job might specify that it requires the capabilities:
  • Windows
  • JDK 1.5
  • JBoss 4.3
  • DB2 9.5
Trying to run this job on a Linux machine, or on a machine which only has Oracle installed, would be pointless. So each machine specifies the software which it has available as a similar set of capabilities, and the Build Farm can use this information to decide if this machine is capable of handling this job.
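The capability check described above boils down to a set comparison. Here is a minimal Python sketch of the idea, not the Build Farm's actual code: the function name `can_run`, the machine names, and the extra capability strings ("JDK 1.6", "Linux", "Oracle 10g") are all made up for illustration.

```python
def can_run(required, provided):
    """Return True if a machine providing `provided` can run a job needing `required`."""
    # Every capability the job requires must appear among those the machine provides.
    return set(required) <= set(provided)


# The job from the example above.
job = {"Windows", "JDK 1.5", "JBoss 4.3", "DB2 9.5"}

# Two hypothetical machines (names and contents are illustrative assumptions).
windows_box = {"Windows", "JDK 1.5", "JDK 1.6", "JBoss 4.3", "DB2 9.5"}
linux_box = {"Linux", "JDK 1.5", "JBoss 4.3", "Oracle 10g"}

print(can_run(job, windows_box))  # True
print(can_run(job, linux_box))    # False
```

The sketch uses `<=`, so a machine that provides exactly the capabilities a job requires, and nothing more, still qualifies. That reading of the matching rule is my assumption about the intent.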

The bottom line is that we re-defined the Build Farm's core job-scheduling requirement to be:
  • When a machine is available to do work, ask it to do the highest-priority job in the queue which this machine is capable of executing.
When a programming problem is too hard, re-define it so that you have a problem you can actually solve.
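The re-defined scheduling rule is small enough to sketch in a few lines of Python. This is only an illustration under stated assumptions, not the Build Farm's real implementation: the `Job` tuple, the function `next_job_for`, the job names, and the priority numbers are all hypothetical, and breaking ties between equal priorities by queue order is my own choice.

```python
from collections import namedtuple

# Hypothetical job record: a name, an integer priority, and the set of required capabilities.
Job = namedtuple("Job", ["name", "priority", "requires"])


def next_job_for(machine_capabilities, queue):
    """Pick the highest-priority queued job this machine can execute, or None."""
    runnable = [job for job in queue if job.requires <= machine_capabilities]
    # max() returns the first maximal element, so among equal priorities
    # the job that appears earliest in the queue wins (an assumption).
    return max(runnable, key=lambda job: job.priority, default=None)


queue = [
    Job("nightly-db2", 5, frozenset({"Windows", "DB2 9.5"})),
    Job("smoke-test", 3, frozenset({"Linux"})),
    Job("release-build", 9, frozenset({"Windows", "JDK 1.5"})),
]

linux_box = frozenset({"Linux", "JDK 1.5"})
chosen = next_job_for(linux_box, queue)
print(chosen.name)  # smoke-test
```

Note that the Linux machine takes the lowest-priority job in the queue, because it is the only one it is capable of executing. The two higher-priority jobs simply wait until a Windows machine becomes available.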
