Cocojunk

🚀 Dive deep with CocoJunk – your destination for detailed, well-researched articles across science, technology, culture, and more. Explore knowledge that matters, explained in plain English.

Navigation: Home

Bootstrapping (compilers)

Published: Sat May 03 2025 19:14:06 GMT+0000 (Coordinated Universal Time) Last Updated: 5/3/2025, 7:14:06 PM

Read the original article here.

Compiler Bootstrapping: Building the Tools to Build Your Tools

In the grand project of building a computer system from the ground up, you eventually reach a point where you need to move beyond writing raw machine code or simple assembly. To create complex software, you need a programming language – and to use a programming language, you need a compiler (or an assembler) to translate your human-readable code into instructions the machine understands.

But here lies a fundamental challenge, often called the "chicken-or-egg" problem in computer science: If you want to write a compiler for a new programming language (let's call it Language X), and you want to write that compiler in Language X itself (which offers many advantages, as we'll see), how do you compile the first version of the compiler when you don't have a working compiler for Language X yet?

The answer to this dilemma is bootstrapping.

What is Bootstrapping?

Bootstrapping (Compilers): A fundamental technique in computer science for creating a self-compiling compiler. This is a compiler (or assembler) written in the very programming language that it is designed to compile. The process starts with a minimal, initial version of the compiler (the bootstrap compiler), often created using existing tools or simpler methods. This minimal compiler is then used to compile subsequent, more complete versions of itself, allowing the compiler's development to proceed using the target language itself.

Essentially, bootstrapping is the process of lifting yourself up by your own bootstraps – starting with limited means and using them to build the tools that allow you to build more sophisticated versions of those same tools.

This technique is not merely theoretical; it's a standard practice in the development of most modern programming languages and their compilers. Many widely used languages, including C, Python, Java, Rust, Go, Pascal, and countless others, have compilers that were developed using bootstrapping methods.

The Bootstrapping Process: A Multi-Stage Journey

Bootstrapping a compiler typically involves a series of stages, evolving from a rudimentary starting point to a fully functional, self-contained development environment.

Stage 0: Establishing the Foundation

This initial stage is about preparing the ground and getting anything that can translate some code into machine instructions on your target system.

The Starting Point: This stage heavily depends on your context.
- Bare Machine: If you are truly starting "from scratch" on a new or custom hardware platform with no existing software, your only option might be writing code directly in binary machine code or a very simple assembly language. You might need to use basic hardware tools or manual entry to get the first bytes into memory.
- Existing Environment: More commonly, you might have some existing tools available on the target machine or another machine. This could be a simple assembler, a basic compiler for a different language (like C or Fortran), or even just a hexadecimal editor and a loader.
Choosing the Bootstrap Compiler's Language: You select the language in which the very first compiler will be written. This is often assembly language, a simple subset of your target language (Language X), or another language (Language Y) for which a compiler or interpreter already exists on the target or a connected machine.
Defining Input and Output: You decide what the bootstrap compiler will accept (its source language – often a very simple version of Language X, or Language Y, or assembly) and what it will produce (its output language – this could be assembly, object files, or direct machine code for your target architecture).
Cross-Compilation: If you are developing for a new or embedded system, Stage 0 often involves cross-compilation. You write the bootstrap compiler (often in C or C++) on a powerful host machine (like your desktop PC) and use a cross-compiler on the host to generate executable code for the target machine architecture.

Cross-Compilation: The process of compiling a program on one computer architecture (the host) so that the resulting executable code can run on a different computer architecture (the target). This is essential when the target machine is resource-constrained, doesn't have a suitable compiler yet, or when developing for embedded systems.

Stage 1: Building the Bootstrap Compiler

Using the environment established in Stage 0, you create the first, minimal version of your compiler (the bootstrap compiler).

Minimal Functionality: This compiler isn't full-featured. It only needs to understand enough of its input language (whether that's assembly, a subset of X, or Language Y) to be able to compile the source code for the next stage of the compiler.
Target Output: This compiler produces executable code (assembly, object files, or machine code) for your target architecture.
The First Big Step: Once you successfully compile and test this Stage 1 bootstrap compiler using your Stage 0 methods (assembly, existing Language Y compiler, cross-compiler, or even hand-compilation), you have a working executable that can translate some code. This is your key tool for the next stage.

Stage 2: Compiling the First Self-Hosted Compiler

Now, the magic starts. You use the output of Stage 1 (the bootstrap compiler executable) to compile a more complete version of your compiler.

Compiler Source Code: The source code for this version of the compiler is typically written in the target language itself (Language X), perhaps utilizing a broader set of features than the bootstrap compiler itself was capable of understanding.
Using the Bootstrap Compiler: You feed the source code of this more complete compiler (written in Language X) into the executable produced in Stage 1 (the bootstrap compiler).
Result: The output is an executable program that is a more powerful compiler for Language X. This compiler is now capable of compiling a wider range of Language X features, potentially including all the features needed to compile its own source code and the source code of subsequent compiler versions. This is often referred to as the first self-hosted version.

Stage 3: Verification and Refinement (Compiling Twice)

This stage is crucial for ensuring the correctness and reliability of the compiler you've built.

The Process: You take the source code of the Stage 2 compiler and compile it again, but this time using the executable produced by Stage 2 (the self-hosted compiler).
Verification: You compare the output of this second compilation (using the Stage 2 executable) with the output from the first compilation (using the Stage 1 bootstrap compiler).
Why Build Twice? If the two executable outputs are identical, it provides strong evidence that the Stage 2 compiler is correctly translating its own source code. If they differ, it indicates a bug exists in either the Stage 1 bootstrap compiler or the Stage 2 source code (or the Stage 2 compiler's ability to compile itself). This verification step catches errors that might only appear when a compiler tries to compile more complex code, including its own source.
Moving Forward: The verified executable from Stage 3 becomes the new trusted compiler. If you want to add more features to Language X or the compiler itself, you write the next version of the compiler source code (in Language X), using the newly available features, and return to the Stage 2 process, using the Stage 3 compiler as the tool.

This iterative process allows the compiler and the language features it supports to evolve together, with each new version being compiled by the previous, proven version.

Advantages of Bootstrapping

Beyond solving the fundamental "chicken-or-egg" problem, bootstrapping provides several significant benefits:

Dogfooding: The compiler developers are forced to use the language they are building extensively. This process, known as "eating your own dog food," provides invaluable feedback on the language design, ease of use, performance, and identifies bugs rapidly.
Developer Familiarity: Once the system is bootstrapped, compiler developers only need to be proficient in the language being compiled (Language X). They don't need deep expertise in the lower-level languages or the initial Language Y used in Stage 0/1. This simplifies development and expands the pool of potential contributors.
Development Efficiency: Writing complex software like a compiler is significantly easier, faster, and less error-prone in a higher-level language (Language X) compared to assembly or machine code.
Improved Backend Benefits the Compiler: Enhancements made to the compiler's code generation, optimization passes, or runtime support directly improve the performance and efficiency of the compiler itself, as it is also a program written in Language X.
Comprehensive Consistency Check: The Stage 3 process of compiling the compiler with itself is a powerful test. If the output is identical to the previous build, it demonstrates the compiler's internal consistency and its ability to correctly process its own source code.

Getting Started: Practical Methods for the First Step (Stage 0/1)

The biggest hurdle in bootstrapping is creating that very first working compiler (Stage 1) from the initial foundation (Stage 0). Here are common methods used in practice:

Implementing an Interpreter or Compiler in Another Language (Language Y): This is a frequent starting point if you have a working language and tools (like C, Java, Fortran, etc.) already available on your target or host system. You write a basic interpreter or a simple compiler for a subset of your new language (Language X) using Language Y. This initial tool then allows you to run or compile the source code for your Stage 1 compiler, which is written in Language X.
- Example: The first Pascal compiler was reportedly written in Fortran. Early LISP compilers were built using existing LISP interpreters.
Using an Existing Compiler for X (on a Different Platform or in a Variant): Sometimes, a compiler for your target language (Language X) or a closely related dialect already exists, but perhaps on a different architecture or with a different set of features.
- Cross-Compilation: As mentioned earlier, you use an existing compiler on a host machine to produce code for your target machine. This is very common for porting languages like C to new hardware platforms.
- Subset/Superset Compilers: A compiler might exist that handles a base version of Language X. You write your initial compiler in this base subset, compile it with the existing tool, and then use the result to compile the full language features. Compilers for supersets like C++ often rely on compiling a core C++ subset first using an existing C++ compiler (like Clang being able to compile a subset of its code using g++ or MSVC).
Compiling a Subset of X: Design a simple subset of your target language (Language X) that is simple enough that you can write a compiler for it using very basic tools (e.g., assembly, or a simple Language Y interpreter/compiler). You write the first version of your full compiler's source code using only the features from this subset, and compile it using your initial Stage 0/1 toolchain. This results in a compiler that understands more of Language X, which you can then use to compile a more complete version written with richer Language X features.
Hand-Compilation (Manual Bootstrapping): For extreme "from scratch" scenarios, or when no other tools are readily available, it is theoretically possible to manually translate the source code of a very simple compiler (written in its own language) into machine code. This is incredibly tedious and error-prone, typically only feasible for extremely small, core components. However, it highlights the absolute minimum required.
- Example: Donald Knuth reportedly hand-compiled a small part of his WEB system's PASCAL code to get the initial compiler running.
Portable Bytecode: The compiler's source code can be distributed along with a version compiled into a simple, portable bytecode format. A minimal interpreter for this bytecode is easier to write or port to a new platform than a full compiler. Once the bytecode interpreter is running, it can execute the compiler's bytecode, which in turn compiles the compiler's source code into native machine code.

Visualizing these complex multi-stage processes can be challenging. Tools like the T-diagram are often used in compiler theory to represent the languages involved (source, target, and implementation) and how translators (compilers, interpreters, assemblers) are used to achieve the desired transformations.

T-Diagram: A formal notation used to illustrate the process of translating programs between different programming languages and architectures. A T-diagram visually represents a translator tool by showing its source language (top left), target language (top right), and the language it is implemented in (bottom). Combining T-diagrams shows how multiple translation steps or different translators are used together, particularly in complex bootstrapping scenarios.

Historical Context

The concept of bootstrapping tools is as old as computing itself.

Assemblers First: The earliest language tools to bootstrap themselves were assemblers. An initial, very simple assembler might be hand-coded in binary or a very primitive notation. This simple assembler could then be used to assemble the source code of a slightly more sophisticated assembler, and so on.
High-Level Languages: The first high-level language compiler to successfully bootstrap itself was for NELIAC in 1958.
Early Widely Used Examples: This was followed by Burroughs B5000 Algol in 1961 and LISP in 1962.
The LISP Example (Hart & Levin, 1962): A famous early example involved writing a LISP compiler in LISP and testing it within an existing LISP interpreter. Once the compiler was good enough to compile its own source code, they used the interpreter-executed compiler to produce a compiled version of the compiler. This compiled version was much faster and became the standard. This technique directly leverages the existence of an interpreter for the target language, using it as the initial "Stage 0/1" tool.

The idea of a program acting on itself (the compiler compiling itself) also has parallels in theoretical computer science proofs, such as those related to the undecidability of the halting problem.

Modern Challenges: Trust and Reproducibility

While bootstrapping allows us to build complex systems from simpler parts, it introduces a potential vulnerability highlighted by the Trusting Trust Attack (also known as a Thompson attack).

Trusting Trust Attack: A sophisticated security vulnerability where a compiler executable is maliciously modified to subtly introduce backdoors into any software it compiles. The attack is self-replicating: the malicious compiler is also modified so that when it compiles its own source code (or the source code of a different compiler), it inserts the same malicious code into the new compiler executable, perpetuating the attack chain even if the original source code is clean. This makes it very difficult to detect simply by reviewing source code.

Because we rely on a chain of trust (Compiler A compiled Compiler B, which compiled Compiler C...), a compromise at any stage, particularly in a widely distributed compiler executable, can be propagated uncontrollably.

This concern has spurred modern efforts like the Bootstrappable Builds project and the Reproducible Builds project. These initiatives aim to:

Reduce the number of binary steps needed to bootstrap a compiler, ideally starting from very simple, verifiable components.
Ensure that compiling the exact same source code twice, potentially on different machines or at different times, produces bit-identical output executables. This allows users to verify that a distributed binary was genuinely built from the claimed source code using a trusted, verifiable process.

These efforts highlight that bootstrapping isn't just a historical technique for getting started; it's a crucial part of the software supply chain with ongoing implications for security and trust.

Related Concepts

Understanding compiler bootstrapping is also helped by looking at related ideas:

Self-Hosting: A system (like an operating system or a compiler) that is developed and runs on itself. A bootstrapped compiler is typically self-hosted once Stage 2/3 is achieved.
Self-Interpreter: An interpreter written in the language that it interprets. A self-interpreter for Language X could potentially be used as a Stage 0/1 tool to begin the bootstrapping process for a Language X compiler.
Indirect Self-Modification: While bootstrapping isn't direct self-modifying code, the compiler processing its own source and producing a new version of itself involves a form of indirect self-modification of the system's tools.
Tombstone Diagram: Another name for a T-diagram, used to visually represent compiler translations and bootstrapping.
Metacompiler: A compiler that compiles compilers, or tools that help in compiler construction. Bootstrapping involves using one compiler instance to build a new compiler instance.
Falsework: A temporary structure used in construction (like scaffolding or forms for concrete) that is removed once the permanent structure can support itself. The initial bootstrap compiler or tools can be seen as "falsework" removed once the self-hosted compiler is functional.

In summary, bootstrapping is an ingenious and essential technique that solves the recursive problem of creating the tools needed to build the tools themselves. It's a foundational concept for anyone interested in how programming languages and their compilers are brought into existence, particularly when starting from the ground up.