This wiki page describes a proposed build system for the Elftoolchain project.
This build system is intended to be "P2P Transparent" (the precise definition of this term is given further down).
The major goals for the system are:
Local Repeatability
We consider a build to be locally repeatable when two attempts at building the same source tree on a given machine result in the 'same' effective bits being generated.
The word 'same' is used here to mean "logically indistinguishable when used by a subsequent build step".
Many compilation processes do not guarantee local repeatability out of the box. Some causes of local non-repeatability include:
Use of the C preprocessor macros __DATE__ and __TIME__, and the like.
Prior work in making builds locally repeatable includes:
Some Elftoolchain tools produce locally repeatable output (this mode is usually specified using the -D option).
A locally repeatable build permits the build process to be treated as a pure function mapping a set of sources to a set of built artifacts.
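The "pure function" view suggests a simple check: run the same build twice and compare the results. The sketch below is illustrative only. The names artifact_digest, is_locally_repeatable, pure_build and stamped_build are hypothetical, a build is modelled as a Python function from source contents to artifact contents, and byte-for-byte equality is used as a stricter stand-in for the 'logically indistinguishable' criterion defined above.

```python
import hashlib
import itertools


def artifact_digest(artifacts):
    """Return one SHA-256 digest covering every artifact name and its bytes."""
    h = hashlib.sha256()
    for name in sorted(artifacts):  # sorted, so dict ordering cannot leak in
        data = artifacts[name]
        encoded_name = name.encode("utf-8")
        # Length prefixes keep (name, data) boundaries unambiguous.
        h.update(len(encoded_name).to_bytes(8, "big"))
        h.update(encoded_name)
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def is_locally_repeatable(build, sources):
    """Run `build` twice on the same sources and compare the output digests."""
    return artifact_digest(build(sources)) == artifact_digest(build(sources))


def pure_build(sources):
    """Toy build step whose output depends only on its inputs."""
    return {name + ".o": data.upper() for name, data in sources.items()}


# A counter stands in for the wall clock (assumption made for this sketch),
# so that the example behaves the same way every time it is run.
_fake_clock = itertools.count()


def stamped_build(sources):
    """Toy build step that embeds a 'timestamp', much as __DATE__ would."""
    stamp = str(next(_fake_clock)).encode("ascii")
    return {name + ".o": data + stamp for name, data in sources.items()}
```

With these definitions, is_locally_repeatable(pure_build, ...) holds, while stamped_build fails the check because each run embeds a different stamp.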
Local repeatability is the first step towards sharing build cycles across multiple developers.
Cross-Machine Repeatability a.k.a. Global Reproducibility
In a globally reproducible build, builds on different machines generate the 'same' (i.e., logically indistinguishable) output. Two different machines are permitted to have different build execution graphs, as long as the final outputs are the 'same'. For example, the build on one machine could have additional instrumentation interleaved between its build steps, or could use tools that differ from the other, provided the end result is the 'same'.
Some sources of "non-local" non-determinism include:
and so on.
Global reproducibility is needed before we can share build cycles across machines.
In a hermetic build all inputs to the build process are fully specified. These include:
A hermetic build process guarantees that the build does not depend upon inputs that are "external" in some fashion (and are thus uncontrolled).
If the source code to the compilation tools is part of the source tree being built (e.g. as is true for NetBSD) then the build can be made hermetic through time. With a hermetic-through-time build we would be able to roll back a source tree to a prior revision and be able to reconstruct binaries using the exact toolchain that had been originally used at that revision.
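One way to picture "fully specified inputs" is as a key computed over everything a build step is declared to depend on. The following is a minimal sketch, not a proposed design: the function name action_key and the choice of fields (command line, input digests, tool digests, environment) are assumptions made for illustration.

```python
import hashlib
import json


def action_key(command, input_digests, tool_digests, env):
    """Hash every declared input of a build step into a single key.

    command       -- the argument vector for the step
    input_digests -- mapping of source path to content digest
    tool_digests  -- mapping of tool name to the digest of its binary
    env           -- the (explicitly listed) environment variables
    """
    record = {
        "command": list(command),
        "inputs": dict(input_digests),
        "tools": dict(tool_digests),
        "env": dict(env),
    }
    # sort_keys gives a canonical encoding, so equal inputs yield equal keys.
    encoded = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
```

Under this scheme a change to the compiler binary changes the key even when the sources are untouched, which is the property a hermetic-through-time build relies on.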
Peer To Peer Builds
Once builds are hermetic and globally reproducible, we would be able to share builds across build peers.
Sharing builds in this fashion would help speed up builds of large source trees (e.g., the equivalent of a make world or a run of build.sh). Our build utility could choose between either performing a local fully deterministic build from sources, or retrieving the already built artifact for that build step from its peers (or occasionally doing both, for verification).
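The choice between building locally, fetching from peers, or doing both could look roughly like the sketch below. Everything here is hypothetical: PeerCache is an in-memory stand-in for a peer network, and run_step is an illustrative name rather than part of any existing API.

```python
class PeerCache:
    """Hypothetical in-memory stand-in for a network of build peers."""

    def __init__(self):
        self._artifacts = {}

    def fetch(self, key):
        """Return the shared artifact for `key`, or None if no peer has it."""
        return self._artifacts.get(key)

    def publish(self, key, artifact):
        """Make a locally built artifact available to peers."""
        self._artifacts[key] = artifact


def run_step(key, build_locally, peers, verify=False):
    """Obtain the artifact for one build step.

    If no peer has the artifact, build it locally and publish it.
    If a peer has it, reuse it; with verify=True also rebuild locally
    and fail when the two results differ.
    """
    shared = peers.fetch(key)
    if shared is None:
        artifact = build_locally()
        peers.publish(key, artifact)
        return artifact
    if verify and build_locally() != shared:
        raise RuntimeError(
            "peer artifact for step %r differs from the local build" % key)
    return shared
```

The verify path only makes sense once builds are globally reproducible; otherwise honest peers would disagree with each other.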
The peer-to-peer build system could be made independent of the build front end, possibly implementing an existing API such as the Bazel remote execution API. This would both ease development and permit the peer-to-peer build network to be used by multiple build front ends (e.g., Bazel or Buck, in addition to the proposed build tool).
Related work: Bazel supports caching of build results using a (centralized) remote cache.
Herd Immunity and Decentralized Verification
Peer-to-peer builds could also be used to implement a form of herd immunity. Once we have multiple builders building the same source tree and sharing metadata about their builds, it would become possible to detect output artifacts that had been tampered with.
This tamper detection protocol could be used to verify, in a decentralized fashion, that the final artifact generated by a build (say the ISO image containing an OS installer) had not been tampered with locally.
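This page does not specify the tamper detection protocol. As a rough illustration only, the sketch below assumes the simplest possible rule: each peer reports the digest it obtained for an artifact, and any peer that disagrees with a strict majority is flagged. The name find_suspect_peers and the majority rule itself are assumptions.

```python
from collections import Counter


def find_suspect_peers(reports):
    """Compare the digests reported by peers for one artifact.

    reports -- mapping of peer name to the digest that peer reported.

    Returns (consensus_digest, suspects). When no digest is reported by a
    strict majority of peers, consensus_digest is None and every peer is
    listed, since nobody can be trusted over anybody else.
    """
    if not reports:
        return None, []
    digest, votes = Counter(reports.values()).most_common(1)[0]
    if votes * 2 <= len(reports):  # no strict majority
        return None, sorted(reports)
    suspects = sorted(peer for peer, d in reports.items() if d != digest)
    return digest, suspects
```

A real protocol would also need to authenticate the reports and resist colluding peers; neither is attempted here.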
Auditable Builds
This kind of build generates a (cryptographically signed) trail of the inputs, outputs and tools used for each build step.
This trail could be used to verify the build 'end to end'.
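A signed trail of this kind can be sketched as a chain of records, each one covering the signature of its predecessor. The code below is a sketch under stated assumptions: it uses an HMAC with a shared secret from the Python standard library as a stand-in for a real public-key signature, and the record layout and the names append_entry and verify_trail are hypothetical.

```python
import hashlib
import hmac
import json


def _encode(body):
    """Canonical byte encoding of one trail record (without its signature)."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def append_entry(trail, key, step, inputs, outputs, tools):
    """Append a signed record for one build step to `trail` (a list)."""
    body = {
        "step": step,
        "inputs": inputs,    # e.g. mapping of path to digest
        "outputs": outputs,  # e.g. mapping of path to digest
        "tools": tools,      # e.g. mapping of tool name to digest
        # Chaining: each record covers the signature of the previous one.
        "prev": trail[-1]["signature"] if trail else "",
    }
    signature = hmac.new(key, _encode(body), hashlib.sha256).hexdigest()
    trail.append(dict(body, signature=signature))


def verify_trail(trail, key):
    """Check every signature and the chaining between records."""
    prev = ""
    for entry in trail:
        body = {k: v for k, v in entry.items() if k != "signature"}
        if body.get("prev") != prev:
            return False
        expected = hmac.new(key, _encode(body), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, entry["signature"]):
            return False
        prev = entry["signature"]
    return True
```

Because of the chaining, altering or dropping any earlier record invalidates every record after it, which is what allows the trail to be checked 'end to end'.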
P2P Auditable Builds
In this document this term means an extension of auditable builds that share outputs and build trails in a P2P fashion.
Vetted Source
Source code that is linked to an 'is-reviewed' bit issued by a human.
Vetted Builds
These are builds whose source code inputs can be proven to have been vetted by one or more humans.
Transparent Builds
This is a build that is both auditable and vetted.
P2P Transparent Build
This is a transparent build which uses a P2P build-sharing backend.
Cross-Organisational Verifiability
Sometimes a software build would take for its input one or more binary blobs built by external organisations, for which source code is not available.
Ordinarily the presence of such binary blobs would preclude the build process from offering a meaningful guarantee of the integrity of the final product. Nevertheless, if the binary blob can be published along with a certificate that confirms that the blob had been verifiably built (or peer-verifiably built) from its source code, then that certificate could possibly be made part of the audit trail for the current build, improving the guarantee offered for its final output.
For the purposes of this document, a build process which has the ability to incorporate such an external certificate into its audit trail is said to offer cross-organisational verifiability.
TODO(jkoshy): Look for existing standards and protocols implementing cross-organisational verifiability, if these exist.
BSD operating systems are built from source using BSD make, either directly (e.g., FreeBSD's make world build process) or through an additional tool that drives make (e.g., NetBSD's build.sh).
Unfortunately, make's specification language is not expressive enough to implement a fully-deterministic build system of the type desired.
This section lists some of the traits desired in a build specification language.
At this point in time I am inclined to use a notation that is close to S-expressions, for the reasons outlined in some of the sections below.
The build configuration language should be tooling-friendly. Large projects almost always need such tooling support. This means that the build configuration language should be easy to parse and modify programmatically.
Due to historical reasons Bazel and Buck use a syntax reminiscent of Python for their build rules. Other aspects, such as the naming of build dependencies, use a different (embedded) syntax. Extensions to the build system's rules are written using a Python-like extension environment; please see Starlark for Bazel, and Buck Macros for Buck.
    # A Bazel/Buck example.
    #
    # The following rule invocation specifies the build process for library "libfoo".
    cc_library(
        name = "foo",
        srcs = ["foo.c", "bar.c"],
        hdrs = ["foo.h"],
        deps = ["//some/package"],  # dependencies use a different sub-language
    )
Although we could adopt this syntax (for continuity), I have a slight preference for a more uniform notation, say one that is based on S-expressions.
For example, using an SRFI-88/89-esque syntax:
    ; This stanza might define the build process for library "libfoo".
    ;
    ; This is a proposal only - this may not be the final syntax.
    (c_library
      name: "foo"
      srcs: ("foo.c" "bar.c")
      hdrs: ("foo.h")
      deps: (...))
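To illustrate why a uniform notation is tooling-friendly, here is a small reader for stanzas of the shape shown above, written in Python. It is a sketch, not a proposed implementation: it handles only comments, parentheses, double-quoted strings without escape sequences, and bare symbols, and the names Symbol, tokenize, parse and stanza_to_rule are hypothetical.

```python
import re


class Symbol(str):
    """A bare (unquoted) token such as c_library or name: ."""


# One token per match: a comment, a parenthesis, a string, or a symbol.
_TOKEN = re.compile(r'\s*(?:(;[^\n]*)|([()])|"([^"]*)"|([^\s()";]+))')


def tokenize(text):
    """Split `text` into (kind, value) pairs, dropping comments."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        pos = match.end()
        comment, paren, string, symbol = match.groups()
        if comment is not None:
            continue
        if paren is not None:
            tokens.append(("open" if paren == "(" else "close", paren))
        elif string is not None:
            tokens.append(("string", string))
        else:
            tokens.append(("symbol", symbol))
    return tokens


def parse(text):
    """Parse `text` into a list of top-level forms (nested Python lists)."""
    stack = [[]]
    for kind, value in tokenize(text):
        if kind == "open":
            stack.append([])
        elif kind == "close":
            if len(stack) == 1:
                raise SyntaxError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        elif kind == "string":
            stack[-1].append(value)
        else:
            stack[-1].append(Symbol(value))
    if len(stack) != 1:
        raise SyntaxError("missing ')'")
    return stack[0]


def stanza_to_rule(stanza):
    """Turn (kind key: value key: value ...) into (kind, {key: value})."""
    kind, rest = stanza[0], stanza[1:]
    if len(rest) % 2:
        raise ValueError("keyword without a value")
    attrs = {}
    for key, value in zip(rest[0::2], rest[1::2]):
        if not (isinstance(key, Symbol) and key.endswith(":")):
            raise ValueError("expected a keyword, got %r" % (key,))
        attrs[key[:-1]] = value
    return str(kind), attrs
```

Reading a stanza, changing one attribute and writing it back then amounts to ordinary list manipulation, which is the kind of programmatic editing that is harder with a Python-like surface syntax.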
The build system should support the structuring of project code into logical modules.
Build configuration should also be modular: users should be able to write their build files without needing to worry about conflicts between the symbols used in their configuration files and any symbols defined elsewhere.
Make, in contrast, uses a single global namespace, with its attendant lack of modularity. Large projects then end up using recursive invocations of make, with all the drawbacks documented in Recursive Make Considered Harmful (Peter Miller, 2008).
We would like the configuration used to specify build & test scenarios (i.e., different compilation modes: -c optimize vs -c debug, or running tests under valgrind vs running them normally) to be separate from the configuration that specifies the dependencies between input and output artifacts. These two aspects of build configuration are intertwined in make.
We want users to be able to easily define their own scenarios for building and testing code.
The BSD /usr/share/mk/*.mk framework currently works best when a single sub-directory in a source tree builds one program or one library. This restriction can be fairly limiting.
We would like our build system to allow test suites to be co-located with code that they test. The build system should allow multiple logically related libraries and binaries to be built in a single source directory.
Builds for large projects would need a language for writing "build macros" and the like.
make's language is quite limited. BSD make allows certain rules to be designated as "macros", by using the .USE pseudo target as a dependency. BSD make also offers iteration using .for/.endfor and conditional execution using .if/.endif. However, these and other extensions to the base "make" syntax do not fit together well, which means that specifying large systems with make remains awkward.
Bazel and Buck have their own "macro" definition languages; these are distinct from the build configuration language itself.
If we go the S-expression route, then a declarative subset of RnRS Scheme may be a natural choice for the extension language.
The build system should be able to work in 'offline' mode, without needing external network access to function (although builds would be entirely local in this case).
This feature would be useful for developers in areas of poor network connectivity.
The build system should be able to work effectively in a resource-constrained host environment (such as a laptop computer).
Bazel in particular seems to need a lot of memory when building projects.
The build system should support building code on a variety of (non-native) machine architectures and operating systems.
Cross building an operating system often involves compiling the cross compiler itself. The build tool should be able to handle these kinds of scenarios efficiently.
The build system should be able to work efficiently on a source tree with a small number of local changes, sharing the build of the unmodified parts of the source tree with its peers.
New target architectures should be easy to support, either through cross compilation, or by invoking the necessary build steps "remotely" using a small, portable helper that runs on the target hardware.
The build tool should be usable when bootstrapping an operating system from source.
This means that a substantial part of the build system needs to be written in portable C, so that it can be built and can run in a cross-hosted OS bootstrap environment.
The following general purpose build systems support deterministic builds:
Bazel and Buck implement Applicative builds.
Looking over some of the steps involved in building BSD operating systems from source, it appears that we would need the full generality of Monadic builds; monadic build systems allow the structure of the build graph to be determined by the output of a previous build step. Please see "Applicative vs Monadic build systems" by Neil Mitchell for a discussion of the differences between the two kinds of build systems.
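The difference can be shown with two toy evaluators. This is a sketch for illustration only: build_applicative and build_monadic are hypothetical names, artifacts are plain Python values, and neither function detects dependency cycles. In the applicative version the dependency table is fixed before anything runs; in the monadic version a rule asks for its dependencies while it runs, so the names it asks for may come from the output of an earlier step.

```python
def build_applicative(target, deps, actions, done=None):
    """Build `target` from a dependency table that is known up front.

    deps    -- mapping of target name to the list of its dependencies
    actions -- mapping of target name to a function of the built inputs
    """
    done = {} if done is None else done
    if target not in done:
        inputs = [build_applicative(d, deps, actions, done)
                  for d in deps.get(target, [])]
        done[target] = actions[target](inputs)
    return done[target]


def build_monadic(target, rules):
    """Build `target` where each rule discovers its dependencies as it runs.

    rules -- mapping of target name to a function taking `need`, a callback
             that builds (at most once) and returns another target.
    """
    done = {}

    def need(name):
        if name not in done:
            done[name] = rules[name](need)
        return done[name]

    return need(target)


# Applicative: the edges prog -> a.o, b.o are declared statically.
static_deps = {"prog": ["a.o", "b.o"]}
static_actions = {
    "a.o": lambda inputs: "A",
    "b.o": lambda inputs: "B",
    "prog": lambda inputs: "+".join(inputs),
}

# Monadic: prog learns what to link only after "manifest" has been built.
dynamic_rules = {
    "manifest": lambda need: ["a.o", "b.o"],
    "a.o": lambda need: "A",
    "b.o": lambda need: "B",
    "prog": lambda need: "+".join(need(n) for n in need("manifest")),
}
```

Both evaluators produce the same artifact for prog here, but only the monadic one could cope with a manifest whose contents are themselves generated during the build.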
Open-source build systems that are monadic in nature include:
Bazel and Buck use a Python-esque syntax for specifying build targets and their dependencies. Bazel and Buck also offer their own mini-languages for writing extensions: Starlark for Bazel, and Skylark for Buck.
Android's build system Soong uses a configuration language that is closely related to that used by Bazel.
Shake's build configuration language is Haskell, in its full generality. This makes it hard to write tooling to modify Shake configuration files.
Dune, OCaml's build system, uses an S-expression based configuration language.
Bitten, Buildbot and Hudson are popular tools used for continuous integration.
Our project has constraints that come in the way of using these tools to automate our builds:
crosstool-NG is a utility for building toolchains, offering a user-friendly way to configure and build a GNU cross toolchain, using a menuconfig-style configuration interface.
However, it does not address the issue that we face: that of invoking and managing builds on virtual machines running non-native operating systems.