Design

John Källén

Decompiler design

The Decompiler is built around a central .NET assembly, Decompiler.dll, which contains the core logic. Leaving aside the user interface for a moment, the Decompiler can be viewed as a pipeline. The first stage of the pipeline loads the executable we wish to decompile. Later stages perform different kinds of analyses, extracting information from the machine code where they can and aggregating it into structured information (such as [Procedure]s and data types). The final stage is the output stage, where source code is emitted into files.
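
The pipeline idea can be illustrated with a minimal Python sketch. It is not the Decompiler's actual code (which is .NET); the names `Program`, `scan`, `emit`, and `run_pipeline` are hypothetical and the analysis is a placeholder. It only shows how each stage takes the accumulated information, adds to it, and hands it on.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Program:
    """Hypothetical stand-in for the information gathered so far."""
    image: bytes
    procedures: List[str] = field(default_factory=list)
    output: List[str] = field(default_factory=list)


Stage = Callable[[Program], Program]


def scan(program: Program) -> Program:
    # Placeholder analysis: pretend every image has one procedure at its start.
    program.procedures.append("fn_0000")
    return program


def emit(program: Program) -> Program:
    # Output stage: turn the structured information into source text.
    program.output = [f"void {name}() {{ }}" for name in program.procedures]
    return program


def run_pipeline(image: bytes, stages: List[Stage]) -> Program:
    program = Program(image=image)  # loading stage
    for stage in stages:            # analysis and output stages, in order
        program = stage(program)
    return program


result = run_pipeline(b"\x90\xc3", [scan, emit])
```

Because every stage has the same shape, new analyses can in principle be slotted in between loading and output without changing the stages around them.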

A central tenet is that the Decompiler is extensible: wherever possible, we strive to avoid hard-coding knowledge about specific platforms, processors, or file formats in the core decompiler. Instead, such special knowledge is farmed out to separate assemblies. Examples:
- Decompiler.Arch.X86.dll - provides support for disassembling Intel X86 binaries.
- Decompiler.ImageLoaders.MzExe.dll - understands how to load MS-DOS executable files and all related formats.
- Decompiler.ImageLoaders.Elf.dll - understands the ELF executable file format.

Overview of operation

When loading an executable image for decompilation, the [Loader] front end is invoked. The Loader looks for clues in the executable file that indicate what kind of executable format the file has. It reads a table of 'magic numbers' and their associated [ImageLoader]s from the Decompiler.config file, then peeks inside the executable file to locate those magic numbers, which determine which ImageLoader is capable of loading the executable image.
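
As a rough illustration of magic-number dispatch, the Python sketch below matches the first bytes of a file against a small table. The table layout and the loader names `MzExeLoader` and `ElfLoader` are assumptions made for this example; the real table lives in Decompiler.config and its format is not shown here. The byte signatures themselves ("MZ" for MS-DOS executables, 0x7F "ELF" for ELF files) are the standard ones.

```python
from typing import Optional

# Hypothetical table: (offset into the file, magic bytes, loader name).
MAGIC_TABLE = [
    (0, b"MZ", "MzExeLoader"),
    (0, b"\x7fELF", "ElfLoader"),
]


def select_loader(image: bytes) -> Optional[str]:
    """Return the name of the first loader whose magic number matches."""
    for offset, magic, loader in MAGIC_TABLE:
        if image[offset:offset + len(magic)] == magic:
            return loader
    return None  # no known format recognized
```

For example, `select_loader(b"\x7fELF\x02\x01")` selects the ELF loader, while a file with no recognized signature yields `None`.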

Once an ImageLoader is selected, it proceeds to read the executable image. The ImageLoader decides what [Processor Architecture] the executable is expecting, performs image relocation if necessary, detects any external dependencies the executable might have, and returns its findings in a [Program] object. This central data structure maintains all global data about a decompiled executable file.

The Program has a reference to a [Processor Architecture]. All processor-specific knowledge, such as how to create a disassembler, or whether to read words in little- or big-endian fashion, is abstracted by the Processor Architecture.
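
One way to picture this abstraction is the Python sketch below, in which the core code asks an architecture object to read a 16-bit word and never needs to know the byte order itself. The class and method names are hypothetical and chosen for the example only.

```python
from abc import ABC, abstractmethod


class ProcessorArchitecture(ABC):
    """Hypothetical interface hiding processor-specific details."""

    @abstractmethod
    def read_word16(self, image: bytes, offset: int) -> int:
        """Read a 16-bit word in this processor's byte order."""


class LittleEndianArch(ProcessorArchitecture):
    def read_word16(self, image: bytes, offset: int) -> int:
        return int.from_bytes(image[offset:offset + 2], "little")


class BigEndianArch(ProcessorArchitecture):
    def read_word16(self, image: bytes, offset: int) -> int:
        return int.from_bytes(image[offset:offset + 2], "big")
```

The same two bytes `b"\x34\x12"` read as 0x1234 on the little-endian architecture and as 0x3412 on the big-endian one, while the calling code stays identical.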

The Program has a reference to a [Platform], which is the operating environment or operating system the program expects to execute in. The Decompiler can ask the Program's Platform what character encoding is used to encode text, for instance.

The Program has a reference to a [LoadedImage], which is the byte representation of the program as it is loaded into memory for execution. An [ImageMap] is used to subdivide the LoadedImage into segments, which may vary depending on the executable file image type and the Platform in question.
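
The role of an image map can be sketched in Python as a sorted list of segments with an address lookup. The `Segment` fields, the `find` method, and the example segment names and addresses are assumptions for illustration, not the Decompiler's actual data structure.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    name: str
    start: int  # address of the first byte
    size: int   # length in bytes


class ImageMap:
    """Hypothetical map from addresses to the segment containing them."""

    def __init__(self, segments: List[Segment]) -> None:
        self._segments = sorted(segments, key=lambda s: s.start)
        self._starts = [s.start for s in self._segments]

    def find(self, address: int) -> Optional[Segment]:
        # Locate the last segment starting at or before the address.
        i = bisect.bisect_right(self._starts, address) - 1
        if i >= 0:
            seg = self._segments[i]
            if address < seg.start + seg.size:
                return seg
        return None  # address falls in a gap or before the first segment


image_map = ImageMap([
    Segment(".text", 0x1000, 0x200),
    Segment(".data", 0x2000, 0x100),
])
```

Keeping the segment starts sorted lets the lookup use binary search, which matters once an image contains many segments.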

When the ImageLoader has finished loading the executable file, it passes the resulting Program to the [Scanner]. The Scanner in turn uses the program entry point(s), which should have been determined by the ImageLoader, as starting points for traversing the LoadedImage. The Scanner uses the Program's Processor Architecture to create a [Rewriter]. The Rewriter visits successive LoadedImage locations and decomposes the machine code instructions it encounters into [Register Transfer Language] (RTL) instructions, which model the sometimes complex machine instructions as simple, side-effect-free operations.
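
The traversal can be sketched in Python with a worklist over an invented three-instruction machine. Everything here is hypothetical: the opcodes (0x01 inc, 0x02 jmp, 0x03 ret), the textual RTL, and the functions `rewrite` and `scan` are made up for the example and do not reflect any real processor or the Decompiler's classes. The point is that one machine instruction with a hidden flag update becomes several simple statements, and that only locations reachable from an entry point are visited.

```python
from typing import Dict, List, Tuple


def rewrite(image: bytes, addr: int) -> Tuple[List[str], List[int]]:
    """Return (RTL statements, successor addresses) for one instruction."""
    op = image[addr]
    if op == 0x01:  # inc rN: increments a register and updates the zero flag
        reg = f"r{image[addr + 1]}"
        return [f"{reg} = {reg} + 1", f"Z = ({reg} == 0)"], [addr + 2]
    if op == 0x02:  # jmp target: control continues only at the target
        target = image[addr + 1]
        return [f"goto {target:#06x}"], [target]
    if op == 0x03:  # ret: no successors
        return ["return"], []
    raise ValueError(f"unknown opcode {op:#04x} at {addr:#06x}")


def scan(image: bytes, entry_points: List[int]) -> Dict[int, List[str]]:
    """Follow control flow from the entry points, rewriting as we go."""
    rtl: Dict[int, List[str]] = {}
    worklist = list(entry_points)
    while worklist:
        addr = worklist.pop()
        if addr in rtl:  # already visited
            continue
        statements, successors = rewrite(image, addr)
        rtl[addr] = statements
        worklist.extend(successors)
    return rtl


# inc r0; jmp 5; one byte that is never reached; ret
image = bytes([0x01, 0x00, 0x02, 0x05, 0xFF, 0x03])
listing = scan(image, [0])
```

In this toy run the byte at address 4 is never decoded, because no control flow path from the entry point leads to it.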

