
Hubble: Evaluating Scheduling Strategies for the Hydra/Nix Build Tools

Warning: This document is work in progress.

Nix is a package build tool that is typically used to build large sets of packages, such as the constituents of the NixOS GNU/Linux distribution. The input to Nix is a large directed acyclic graph (DAG) of package build tasks (currently ~3,000 nodes and ~18,000 edges), which Nix can choose to distribute over several machines in a cluster. Hydra is a continuous integration system, which uses Nix to compile sets of packages directly from their version control system.

Efficiently using computing and networking resources can help reduce build times for a DAG of tasks. Hubble is a simulator aiming to evaluate several on-line scheduling algorithms for Nix build tasks, on several platforms. Its input is a DAG of build tasks and the description of a grid or cluster to realize the build. Hubble simulates the build tasks and associated data transfers on SimGrid.

1 Introduction

The Nix build tool reads a high-level description of the DAG of build tasks, written in the Nix language. Each task has an associated build script, which assumes the availability of all the declared dependencies (or "build inputs") of the task.

Build results are stored in the "Nix store", a special directory in the file system. There is a direct mapping between a build task (builder script and set of build inputs) and the name of its result in the Nix store. Thus, the Nix store functions as a build cache: build results can be reused whenever they are needed, without having to perform the build again (this is a form of memoization).
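
As an aside, this memoization principle can be sketched in a few lines of Scheme (hypothetical names; real Nix derives the store path from a cryptographic hash over the derivation, and also performs locking, substitution, etc.):

;; Sketch of the Nix store's memoization principle (illustration only).
(define (store-path builder inputs)
  ;; Hypothetical mapping from a build task to a store path; real Nix
  ;; uses a cryptographic hash of the derivation.
  (string-append "/nix/store/"
                 (number->string (hash (list builder inputs)
                                       most-positive-fixnum)
                                 16)))

(define (build-memoized builder inputs run-builder!)
  (let ((path (store-path builder inputs)))
    (if (file-exists? path)
        path                                  ; cache hit: reuse the result
        (begin
          (run-builder! builder inputs path)  ; cache miss: perform the build
          path))))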

Nix and Hydra currently use naïve greedy scheduling strategies. Within Nix, libstore topologically sorts the build DAG and then executes the tasks in that order. When a build task is ready, it is submitted to the "build hook", which, by default, submits it to one of the participating machines or declines it. In the latter case, the task is scheduled locally. In the former case, the build hook copies all the necessary build inputs to the target machine, so that the build can actually take place.

Nix and Hydra are currently deployed on a small, homogeneous cluster at the Technical University of Delft, The Netherlands, consisting of 3 octo-core x86_64 machines connected via a Gigabit Ethernet LAN.

2 Implementation

This section describes the implementation of the simulation of build tasks as well as the implementation of task scheduling in Hubble.

2.1 Overview

Hubble is implemented in Scheme, using GNU Guile version 2. Details of the simulation, such as keeping track of processor occupation and network usage, are taken care of by SimGrid, a toolkit for the simulation of distributed applications in heterogeneous distributed environments.

The input to Hubble is an XML description of the DAG of build tasks. For each task, a build duration and the size in bytes of the build output are specified. For our evaluation purposes, we collected this data on a production system, the hydra.nixos.org build farm hosted at the Technical University of Delft. The DAG itself is the snapshot of the Nix Package Collection (Nixpkgs) corresponding to this data. Hubble has its own in-memory representation of the DAG in the form of a purely functional data structure.

The Nixpkgs DAG contains fixed-output nodes, i.e., nodes whose output is known in advance and does not require any computation. These nodes are typically downloads of source code from external web sites. The raw data collected on hydra.nixos.org specifies a non-zero duration for these nodes, which represents the time it took to perform the download. This duration is irrelevant in our context, since these nodes require no computation, so Hubble treats them as instantaneous.
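
For illustration, here is what this zeroing step could look like, using a stand-in node record rather than the actual (hubble dag) representation:

(use-modules (srfi srfi-9))

;; Stand-in for Hubble's node objects, for illustration only.
(define-record-type <node>
  (make-node name duration fixed-output?)
  node?
  (name          node-name)
  (duration      node-duration)
  (fixed-output? node-fixed-output?))

;; Return a new list of nodes where fixed-output nodes (downloads) have a
;; zero duration, mirroring the "zeroing fixed-output derivations" step
;; visible in the REPL session below.
(define (zero-fixed-output-durations nodes)
  (map (lambda (n)
         (if (node-fixed-output? n)
             (make-node (node-name n) 0 #t)   ; purely functional update
             n))
       nodes))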

Hubble contains several stand-alone tools with a command-line interface:

  • hubble simulates the execution of the given DAG on the given platform, using the specified scheduling algorithms. It produces a file in the Pajé format, which contains a record of all the scheduling events. This file can then be opened with ViTE, which displays a Gantt diagram of the simulation (see below for screenshots).
  • plot simulates the execution of the given DAG on a series of generated platforms (clusters) and with different scheduling algorithms. The output can be fed to Gnuplot or GNU Plotutils, or saved in a "raw" format for further processing.
  • critical-path displays the critical path of the given DAG.
  • make-platform generates an XML platform description file for the given parameters.

In addition, the following programs can be used to collect and interpret actual build times:

  • nix-build-and-log is a wrapper around the nix-store command, which collects build events (beginning & end of build with timestamp, etc.) and stores them.
  • build-events-to-speedup analyzes series of build events collected by nix-build-and-log for sequential and parallel builds and outputs a plot that synthesizes speedup of individual package builds on multi-core machines (see "Parallel Tasks" below).

Each of these programs understands the --help command-line option, which provides additional details.

The Scheme API of Hubble consists of several Guile modules, which together provide fine-grained access to all these simulation tools. In fact, it can be used interactively as a "shell"1:

$ guile -L /path/to/hubble/modules
GNU Guile 1.9.13.89-1c20c-dirty
Copyright (C) 1995-2010 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> (use-modules (hubble dag) (srfi srfi-1))
scheme@(guile-user)> (define my-dag (load-dag "data/dags/nixpkgs/r24680/nixpkgs.dag"))
[  17] reading `data/dags/nixpkgs/r24680/nixpkgs.dag'...  done
[  23] building DAG...  done
[  32] zeroing fixed-output derivations...  done
scheme@(guile-user)> (display-dag-stats my-dag (current-output-port))
3141 nodes, 2922 [93.0%] with size info, 2696 [85.8%] with duration info
1310 nodes [41.7%] with run-time dependency information
2563 nodes [81.6%] with size > 8 KiB
1402 nodes [44.6%] have fixed output (e.g., external downloads)
783 nodes [24.9%] have speedup info
346 sources [11.0%], 515 sinks [16.4%]
$1 = #t
scheme@(guile-user)> (define coreutils
                       (find (lambda (n)
                               (string-suffix? "coreutils-8.5" (node-output-path n)))
                             (dag-nodes my-dag)))
scheme@(guile-user)> coreutils
$2 = #<node 23f41e0 /nix/store/2iy8g13imlizav7zksd62l31nsidblwn-coreutils-8.5 /nix/store/79alf7r9cgglyw8isp1f09n6iwd8cg7a-coreutils-8.5.drv>
scheme@(guile-user)> (node-size coreutils)
$3 = 14815232
scheme@(guile-user)> (length (node-prerequisites coreutils))
$4 = 24

This allows for rapid, incremental experimentation.

2.2 Dynamic Scheduling Engine

Hubble aims to evaluate dynamic scheduling strategies, whereby scheduling decisions are taken on-line, as computing resources become available. We chose to focus on on-line scheduling algorithms because they are easier to implement and retrofit into existing tools like Nix and Hydra, and because such algorithms can gracefully handle the failure of a computational node.

On-line scheduling algorithms have drawbacks compared to their static counterparts because they make decisions "locally" in time. For instance, an on-line scheduler may assign a long-running task to a very slow node even though a much faster node would have become available seconds later. Similarly, on a very slow network, distributing work may actually be slower than doing it sequentially, which a dynamic scheduler could overlook. However, our measurements suggest that on-line scheduling algorithms can achieve results close to optimal (see "Results"), which makes static scheduling algorithms unattractive.

Hubble uses the SimDAG API of SimGrid, which is specifically designed to make it easy to simulate the execution of a DAG of tasks. However, the tight control Hubble requires over scheduling, as illustrated below, led us to use very few of SimDAG's facilities. For instance, Hubble doesn't use SimDAG's ability to describe inter-task dependencies.

Hubble's dynamic scheduling engine maintains a list of tasks to execute, a list of ready tasks (the subset of the tasks to execute whose dependencies have all completed), and a list of idle workstations. Whenever a computational node (a CPU core) becomes available, the engine assigns it a ready task; that task depends on communication tasks, one for each dependency whose result must be transferred to the computational node. See "Data Transfers" below for details.

The scheduling engine is parametrized by two functions: one that selects the next task to execute among the set of ready tasks, and one that selects the computational node to run it on, as well as the source of each data transfer it entails. This has allowed us to implement several scheduling strategies independently of the scheduling engine, and to assess the impact of the task-selection and node-selection strategies separately, as sketched below.
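
The following sketch illustrates this parametrization (hypothetical names; the actual engine additionally creates SimGrid computation and communication tasks, and its workstation-selection procedure also picks the source of each data transfer):

(use-modules (srfi srfi-1))

;; Two trivial strategy procedures, in the spirit of the `simple'
;; algorithm described below: pick the first ready task and the first
;; idle node.
(define (first-task ready-tasks)        (first ready-tasks))
(define (first-workstation task nodes)  (first nodes))

;; Pair ready tasks with idle computational nodes until one of the two
;; lists is exhausted, and return the list of assignments.
(define (assign-ready-tasks ready-tasks idle-nodes select-task select-node)
  (let loop ((ready ready-tasks)
             (idle idle-nodes)
             (assignments '()))
    (if (or (null? ready) (null? idle))
        (reverse assignments)
        (let* ((task (select-task ready))
               (node (select-node task idle)))
          (loop (delete task ready)
                (delete node idle)
                (cons (cons task node) assignments))))))

;; Example: two ready tasks, three idle cores.
;; (assign-ready-tasks '(gcc coreutils) '(core0 core1 core2)
;;                     first-task first-workstation)
;; => ((gcc . core0) (coreutils . core1))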

2.3 Data Transfers

A task's result needs to be transferred when (1) the task was executed on a different machine, and (2) its result wasn't previously downloaded. This way, Hubble implements caching semantics equivalent to those of the Nix store. Note that cores and CPUs within a single machine are identified as such; thus, all cores and CPUs of a machine share the same cache.
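
As a minimal sketch (hypothetical helper, not the engine's actual code), the decision of whether a dependency must be transferred boils down to:

;; Return #t when DEPENDENCY's result must be transferred to MACHINE.
;; MACHINE-CACHE lists the results already present on MACHINE, i.e.,
;; built there or downloaded earlier; all its cores share this cache.
(define (needs-transfer? dependency producer-machine machine machine-cache)
  (and (not (eq? producer-machine machine))      ; (1) built elsewhere
       (not (member dependency machine-cache)))) ; (2) not yet downloaded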

In the first iteration of Hubble, only the direct compile-time dependencies of a task were considered for transfer. This was an underestimate of the actual amount of data to be transferred, since in reality each build output (known as an output path in Nix parlance) has an associated set of run-time dependencies. Run-time dependencies are build outputs required to actually use a given build output; they are a subset of the compile-time dependencies.

For instance, every compiled C program or library has the system C library as one of its run-time dependencies, because it is linked against it. Conversely, a C program may depend on Bison at compile-time, while not depending on it at run-time.

Hubble now integrates run-time dependency information as part of build DAGs. When a build task is scheduled, not only its direct compile-time dependencies are transferred, but also their recursive run-time dependencies. The impact on the amount of data transferred can be estimated by computing the ratio of the size of the direct compile-time dependencies plus their recursive run-time dependencies, to the size of the direct compile-time dependencies alone. Here is the value of this ratio for a few packages2:

Package                   c-t. deps (bytes)  c-t. + rec. r-t. deps (bytes)      ratio
GCC 4.5.1                         240295936                      327979008  1.3648962
OpenOffice.org 3.2.1              837931008                     1193955328  1.4248850
GNU Coreutils 8.5                 156311552                      306679808  1.9619779
GIMP 2.6.11                       225492992                      600109056  2.6613202
GNU Emacs 23.2                    110747648                      555241472  5.0135735
GNU Scientific Lib. 1.14            7192576                      352632832  49.027335
FFTW 3.2.1                          4063232                      353009664  86.879032

As can be seen, the ratios differ vastly from one package to another. In the case of FFTW and GSL, the reason is that their direct compile-time dependencies are just the standard C compilation environment (GCC, GNU Make, etc.) and the source tarball. However, these direct dependencies have in turn many run-time dependencies: GCC, for instance, depends directly on 5 libraries at run-time, which significantly increases the amount of data needed to build FFTW/GSL.
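
For reference, the ratio shown above is simply the total size of the direct compile-time dependencies plus their recursive run-time dependencies, divided by the total size of the direct compile-time dependencies alone. The sketch below is a conceptual version of the run-time/compile-time-dependency-size-ratio procedure mentioned in footnote 2, whose actual interface (it operates on DAG nodes) differs:

;; Conceptual definition of the dependency size ratio (illustration only).
;; NODE-SIZE is a procedure returning a node's size in bytes, such as the
;; node-size accessor used in the REPL session above.
(define (dependency-size-ratio compile-time-deps deps-with-run-time-closure
                               node-size)
  (let ((total (lambda (nodes) (apply + (map node-size nodes)))))
    (exact->inexact (/ (total deps-with-run-time-closure)
                       (total compile-time-deps)))))

;; For GCC 4.5.1, using the sizes from the table above:
;; 327979008 / 240295936 ≈ 1.3649.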

The average ratio over all nodes of the Nixpkgs DAG is 39.9, with a standard deviation of 30.8. However, we expect the increase in data transfers to be largely mitigated by the caching effect of the Nix store. For instance, the standard C compilation environment is likely to be present very early in each machine's cache.

Simulation on a cluster of six dual-core machines shows that, for the same scheduling policy and DAG, the amount of data transferred rises from 33,232 MiB to 44,878 MiB (i.e., a 35% increase) while the time spent transferring data goes from 1,921 to 3,203 time units (i.e., a 67% increase), showing an increase in network contention.

2.4 Parallel Tasks

Most package builds can be parallelized, e.g., using GNU Make's -j flag. To study scheduling algorithms that exploit parallelism at this level, we collected sequential and parallel build times for a subset of Nixpkgs. This allows us to know how builds actually scale when using -j. This speedup information is used as an input to Hubble when simulating parallel tasks, making the simulation realistic.
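
As an illustration of how this data can enter the simulation (an assumption about the model, not Hubble's exact formula), the duration of a parallel task can be derived from its sequential duration and the measured speedup for the number of cores allocated:

;; SPEEDUPS is an association list mapping core counts to measured
;; speedups, e.g., '((2 . 1.8) (4 . 3.1) (8 . 4.9)); when no data is
;; available, fall back to the sequential duration.
(define (parallel-duration sequential-duration speedups cores)
  (/ sequential-duration
     (or (assv-ref speedups cores) 1)))

;; Example: (parallel-duration 600 '((2 . 1.8) (4 . 3.1)) 4) => ~193.5.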

2.5 Scheduling Algorithms

Three pairs of task-selection and workstation-selection algorithms are implemented, which we refer to as simple, random, and heft.

The simple algorithm is, as the name suggests, the easiest one to implement. It chooses the first task among the set of ready tasks, the first node among the set of idle computational nodes, and the first workstation storing a given result as the download source of a dependency.

The random algorithm chooses the task to execute, the computational node to run it on, and the sources of data transfers at random.

Finally, the heft algorithm implements an on-line variant of the heterogeneous earliest finish time (HEFT) algorithm. It chooses the ready task with the highest upward rank as the next task to execute. The difference compared to random is immediately visible on a Gantt diagram:

Gantt diagram for `random'

Gantt diagram for the random task-selection and workstation-selection algorithms on a cluster of 3 dual-core machines.

Gantt diagram for `heft'

Gantt diagram for the heft task-selection and workstation-selection algorithms on a cluster of 3 dual-core machines.

The heft algorithm then chooses the computational node with the earliest finish time (EFT) to execute the selected task. The finish time is computed as the time needed to transfer all of the task's dependencies plus its computation time.
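
For reference, the standard HEFT definitions of the upward rank and of the EFT are as follows (Hubble's on-line variant estimates the communication terms dynamically rather than from averages over the platform):

  \mathrm{rank}_u(t) = \overline{w}(t) + \max_{s \in \mathrm{succ}(t)} \bigl(\overline{c}(t,s) + \mathrm{rank}_u(s)\bigr)

  \mathrm{EFT}(t,n) = w(t,n) + \max\Bigl(\mathit{avail}(n),\; \max_{d \in \mathrm{pred}(t)} \bigl(\mathrm{AFT}(d) + c(d,n)\bigr)\Bigr)

where w(t, n) is the computation time of task t on node n, c(d, n) the time to transfer the result of dependency d to n, avail(n) the time at which n becomes idle, AFT(d) the actual finish time of d, and the barred quantities denote averages over nodes and links.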

Minimizing the EFT amounts to minimizing computation time plus data transfer time. Data transfer time is a function of several things:

  • the dependencies already in cache on the computational node;
  • the bandwidth and latency of the network links used to transfer the missing dependencies;
  • contention that may occur on the network links, e.g., because source nodes already have outgoing transfers, or because the destination node will carry out several simultaneous incoming transfers.

Thus, Hubble's implementation of HEFT estimates transfer time this way:

FIXME: do it
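
Pending that description, here is a purely illustrative sketch that accounts for the first two factors above (cache contents, link latency and bandwidth) while ignoring contention; it is not Hubble's actual estimate:

(use-modules (srfi srfi-1))

;; Estimate the time needed to bring DEPENDENCIES to a machine whose
;; cache already holds the results listed in CACHE.  DEPENDENCY-SIZE
;; returns a dependency's size in bytes; LATENCY and BANDWIDTH describe
;; the link towards the machine.
(define (estimated-transfer-time dependencies cache
                                 dependency-size latency bandwidth)
  (let ((missing (remove (lambda (d) (member d cache)) dependencies)))
    (apply + (map (lambda (d)
                    (+ latency (/ (dependency-size d) bandwidth)))
                  missing))))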

[Often, computational nodes have similar computational power, especially with current CPU technology, where the power of individual CPU cores typically varies within a factor of 1 to 2. In such a situation, minimizing the EFT mostly amounts to minimizing data transfer time.]

Finally, the many-cores computational node selection strategy always assigns a task to as many idle cores of the same machine as are available, assuming the task has internal parallelism. This is true of most build tasks: those using makefiles, for instance, can be performed either sequentially or in parallel using GNU Make's -j flag, which leverages the parallelism available within the DAG of makefile targets.

Likewise, the m-heft algorithm (for "mixed-parallel HEFT") is a variant of HEFT that uses parallel tasks when several cores of the same machine are available and when doing so reduces the EFT.

See "Parallel Tasks" below for details.

3 Results

3.1 DAG Characteristics

The Nixpkgs DAG (revision 24680) of build tasks has the following characteristics:

3141 nodes, 2922 [93.0%] with size info, 2696 [85.8%] with duration info
1310 nodes [41.7%] with run-time dependency information
2563 nodes [81.6%] with size > 8 KiB
1402 nodes [44.6%] have fixed output (e.g., external downloads)
783 nodes [24.9%] have speedup info
346 sources [11.0%], 515 sinks [16.4%]

The length of its critical path on the hydra.nixos.org cluster, when all the tasks are sequential, is 30,758 seconds, i.e., approximately 8 hours and 33 minutes. This means that it is impossible to execute the DAG in less than 8h33m on this cluster when using sequential tasks.

The ratio of the sequential execution time to the length of the critical path indicates an upper bound on the speedup of around 10.5. In other words, this DAG is "tall and skinny": its critical path is very long, and execution time cannot be reduced by adding CPU cores beyond about 11. One possibility to overcome this limitation is to use parallelism within tasks, which is explored later (see "Parallel Tasks").
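
To spell out the reasoning: any schedule using sequential tasks has a makespan T of at least the critical-path length T_cp = 30,758 s, so with T_seq the sequential execution time (roughly 10.5 × 30,758 ≈ 3.2 × 10^5 seconds, i.e., on the order of 90 hours):

  \mathrm{speedup} = \frac{T_{\mathrm{seq}}}{T} \leq \frac{T_{\mathrm{seq}}}{T_{\mathrm{cp}}} \approx 10.5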

Another important piece of information is the "shape" of the DAG. The Nixpkgs DAG starts with a mostly sequential chain of build tasks, corresponding to the tasks that bootstrap the standard build environment (known as stdenv), which contains the C library and compiler, along with build tools such as GNU Make and utilities like GNU Coreutils and Perl. The stdenv build is followed by a fork, reflecting the fact that stdenv is a prerequisite for all the remaining build tasks. The stdenv build is obviously on the critical path.

A similar fork/join pattern occurs near the other end of the DAG, with the Qt library and KDE support libraries on which many applications depend.

3.2 The hydra.nixos.org Cluster

The table below shows the simulation results obtained for several combinations of the task-selection and workstation-selection algorithms on the hydra.nixos.org platform, a cluster of 3 8-core x86_64 machines with 2.66 GHz Intel Xeon E5430 CPUs.

The following correlated metrics are provided:

  • The speedup column shows the ratio of the sequential execution time to the measured execution time (higher is better).
  • The SLR column shows the schedule length ratio, i.e., the measured execution time divided by the length of the critical path (lower is better).
  • The makespan column shows the total execution time of the DAG in seconds (lower is better).
  • The improvement column shows the execution time gain, as a percentage, compared to the naïve implementation, modeled here by the first-task and first-workstation combination. (These metrics are formalized below.)
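
In symbols, with T the measured makespan, T_seq the sequential execution time, T_cp the critical-path length (30,758 s), and T_ref the makespan of the first-task + first-workstation combination:

  \mathrm{speedup} = \frac{T_{\mathrm{seq}}}{T}, \qquad
  \mathrm{SLR} = \frac{T}{T_{\mathrm{cp}}}, \qquad
  \mathrm{improvement} = 100 \times \frac{T_{\mathrm{ref}} - T}{T_{\mathrm{ref}}}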

The first three rows rely on sequential build tasks, while the next three rows use parallel build tasks (i.e., make -j).

task selection  workstation selection  speedup    SLR  makespan  improvement (%)
`first-task'    `first-workstation'       6.77  1.557     47902         0
`heft'          `first-workstation'       7.33  1.440     44275         7.5717089
`heft'          `heft'                    8.99  1.174     36109        24.619014
`first-task'    `many-cores'              6.60  1.600     49144        -2.5927936
`heft'          `many-cores'              4.76  2.214     68102       -42.169429
`heft'          `m-heft'                 11.59  0.910     27986        41.576552

Looking at the first three rows shows that the dynamic HEFT implementation provides an appreciable performance improvement. Choosing tasks according to their HEFT upward ranks yields a 7.5% improvement, while choosing the destination core for a task according to its estimated earliest finish time (EFT) yields an additional 17%.

Since all CPU cores have the same power, the only factor that influences the EFT is the time taken to transfer all the prerequisites of each build task. Thus, the good results of the heft CPU core selection algorithm can be explained by an improvement in data locality. Comparing the Gantt diagrams for heft + first-workstation vs. heft + heft confirms this intuition.

Gantt diagram for `heft' + `first-workstation'

Gantt diagram for the heft + first-workstation combination.

Gantt diagram for `heft' + `heft'

Gantt diagram for the heft + heft combination.

The Gantt diagrams above show the DAG execution between 0 and around 15,000 seconds, which corresponds to the stdenv build. In the first-workstation case, building the compiler of stdenv completes at 11,922 seconds, whereas in the heft case it completes after only 8,665 seconds. The Gantt diagrams clearly illustrate the improved data locality in the latter case, where almost all of stdenv is built on a single machine, kenny, whereas first-workstation distributes it over all 3 machines, thereby adding data transfers on the critical path.

The results with the many-cores policy suggest that systematically allocating all cores to each task is counter-productive on this platform. Conversely, m-heft achieves a 41% improvement by allocating several cores to a task only when they are available and when doing so reduces the task's EFT.

The table below gives networking statistics for the same simulations:

  • The data transferred column shows the total amount of data transferred among the 3 machines of the cluster, in MiB.
  • The transfer time column shows the accumulated transfer time, i.e., the total amount of time spent transferring data during the simulation.
  • The transfer over computation column shows the ratio of the accumulated transfer time to the accumulated computation time. Note that transfers occur in parallel with computations. Nevertheless, this figure gives an idea of the relative importance of data transfers.
  • The stretch column is a measure of network congestion, equal to the measured transfer time divided by the time it would have taken to transfer the same amount of data on this network without any congestion, averaged over all data transfers (see the formula below); the higher the stretch, the more congested the network.
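
In symbols, with X the set of data transfers performed during the simulation, t_measured(x) the simulated duration of transfer x, and t_free(x) the duration it would have had on a congestion-free network:

  \mathrm{stretch} = \frac{1}{|X|} \sum_{x \in X} \frac{t_{\mathrm{measured}}(x)}{t_{\mathrm{free}}(x)}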

Again the first three rows correspond to sequential build tasks:

task selection  workstation selection  data transferred (MiB)  transfer time  transfer/computation  stretch
`first-task'    `first-workstation'                  27150.48         282939                 0.878   177.83
`heft'          `first-workstation'                  23575.31         247437                 0.763   171.44
`heft'          `heft'                               23616.39         247872                 0.764   165.72
`first-task'    `many-cores'                         14775.03         155129                 0.304   158.08
`heft'          `many-cores'                         14508.05         152277                 0.208   158.69
`heft'          `m-heft'                             20048.27         210418                 0.593   170.50

On this platform, all the sequential variants achieve similar networking performance, with only a slight advantage for heft. This confirms that the performance gains of the heft workstation-selection algorithm are mainly due to improved data locality for tasks on the critical path, as discussed above, and not to reduced network traffic as one might expect.

The many-cores policy leads to reduced network traffic, which is expected given that it reduces fragmentation: tasks always get to use all the cores of a CPU. However, as seen before, the reduced traffic does not offset the computational inefficiency of this policy.

3.3 Homogeneous Platforms

3.3.1 Scalability

  • number of cores vs. speedup
    • bandwidth = 1e6 (cluster)
    • bandwidth = 1e4

3.3.2 Impact of the Bandwidth/CPU Power Ratio

3.3.3 Data Transfers

  • amount of data transferred
  • time spent transferring data
  • impact of `select-workstation/heft' vs. `select-random-workstation'

3.4 Heterogeneous Platforms

3.4.1 Heterogeneous Network Bandwidth

  • speedup vs. stddev(bandwidth) -> stable

3.5 Parallel Tasks

4 Summary & Conclusion

  • parallel tasks

5 References

Footnotes:

1 This is known as the read-eval-print loop or "REPL" in the Scheme/Lisp world.

2 The run-time/compile-time-dependency-size-ratio of the (hubble dag) module can be used to compute this ratio.

Author: Ludovic Courtès

Date: 2011-03-02 10:55:56 CET