Design
Reko consists of a central .NET assembly, Reko.Decompiler.dll, which contains the core logic. Leaving aside the user interface for a moment, Reko can at a glance be considered a pipeline. The first stage of the pipeline loads the executable we wish to decompile. Later stages perform different kinds of analyses, extracting information from the machine language where they can and aggregating it into structured information (such as Procedures and data types). The final stage is the output stage, where source code is emitted into files.
A central tenet is that Reko is extensible: wherever possible, we strive to avoid hard-coding knowledge about specific platforms, processors, or file formats in the core decompiler. Instead, such special knowledge is farmed out to separate assemblies. Examples:
- Reko.Arch.X86.dll - provides support for disassembling Intel x86 binaries.
- Reko.ImageLoaders.MzExe.dll - understands how to load MS-DOS executable files and all related formats.
- Reko.ImageLoaders.Elf.dll - understands the ELF executable file format.
When loading an executable image for decompilation, the Loader front end is invoked. The Loader looks for clues in the executable file that indicate what kind of executable format the file has. The reko.config file contains a table that associates 'magic numbers' with ImageLoaders. The Loader peeks inside the executable file to locate magic numbers, which determine which ImageLoader is capable of loading the executable image.
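The magic-number lookup can be illustrated with a minimal Python sketch. It is not Reko's implementation (Reko is written in C# and reads its table from reko.config); the table layout, the loader names, and the `select_loader` function are assumptions made for illustration. Only the two signatures themselves are standard: MS-DOS executables begin with the bytes `MZ`, and ELF files begin with `0x7F 'E' 'L' 'F'`.

```python
from typing import List, Optional, Tuple

# Hypothetical table of (file offset, magic bytes, loader name).
# In Reko the real entries come from reko.config; these are illustrative.
MAGIC_TABLE: List[Tuple[int, bytes, str]] = [
    (0, b"MZ", "MzExeLoader"),       # MS-DOS executables and related formats
    (0, b"\x7fELF", "ElfLoader"),    # ELF executables
]


def select_loader(image: bytes) -> Optional[str]:
    """Return the name of the first loader whose magic number matches."""
    for offset, magic, loader in MAGIC_TABLE:
        # Slicing past the end of the file yields a short result,
        # which simply fails the comparison.
        if image[offset:offset + len(magic)] == magic:
            return loader
    return None  # no known format recognized
```

Keeping the table as data rather than code mirrors the extensibility goal described above: supporting a new file format means adding a table entry and a loader, not changing the lookup.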
Once an ImageLoader is selected, it proceeds to read the executable image. The ImageLoader decides what Processor Architecture the executable is expecting, performs image relocation if necessary, detects any external dependencies the executable might have, and returns its findings in a Program object. This central data structure maintains all global data about a decompiled executable file.
The Program has a reference to a Processor Architecture. All processor-specific knowledge, such as how to create a disassembler, or whether to read words in little- or big-endian fashion, is abstracted by the processor architecture.
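As a rough illustration of how byte order can be hidden behind the architecture abstraction, the following Python sketch lets callers read a 32-bit word without knowing the endianness. The `Architecture` class and its `read_word32` method are hypothetical names for this example and do not correspond to Reko's actual C# interfaces.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Architecture:
    """Hypothetical stand-in for a processor architecture object."""
    name: str
    little_endian: bool

    def read_word32(self, image: bytes, offset: int) -> int:
        """Read an unsigned 32-bit word using this architecture's byte order."""
        fmt = "<I" if self.little_endian else ">I"
        return struct.unpack_from(fmt, image, offset)[0]


raw = b"\x78\x56\x34\x12"
le_word = Architecture("little-endian example", True).read_word32(raw, 0)
be_word = Architecture("big-endian example", False).read_word32(raw, 0)
```

The same four bytes decode to `0x12345678` on the little-endian architecture and `0x78563412` on the big-endian one; code that calls `read_word32` never needs to branch on the processor type.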
The Program also has a reference to a Platform, which is the operating environment or operating system the program expects to execute in. Reko can ask the Program's Platform what character encoding is used to encode text, for instance.
The Program has a reference to a LoadedImage, which is the byte representation of the program as it is loaded into memory for execution. An ImageMap is used to subdivide the LoadedImage into segments, which may vary depending on the executable file image type and the Platform in question.
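One way to picture the relationship between an image and its segments is a sorted list of address ranges that can be searched by address. The sketch below is an assumption-laden illustration in Python: the `Segment` and `ImageMap` classes, their fields, and the segment names and addresses are all hypothetical, and Reko's real ImageMap is a C# class with a different interface.

```python
import bisect
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Segment:
    name: str
    start: int   # address of the first byte in the segment
    size: int    # number of bytes in the segment


class ImageMap:
    """Hypothetical map from addresses to the segments that contain them."""

    def __init__(self, segments: Iterable[Segment]) -> None:
        self._segments = sorted(segments, key=lambda s: s.start)
        self._starts = [s.start for s in self._segments]

    def find_segment(self, address: int) -> Optional[Segment]:
        # Find the last segment starting at or before the address,
        # then check that the address falls inside it.
        i = bisect.bisect_right(self._starts, address) - 1
        if i >= 0:
            seg = self._segments[i]
            if address < seg.start + seg.size:
                return seg
        return None


# Illustrative layout only; real segment names and addresses depend on
# the file format and platform.
image_map = ImageMap([
    Segment(".data", 0x2000, 0x100),
    Segment(".text", 0x1000, 0x200),
])
```

Here `image_map.find_segment(0x1010)` returns the `.text` segment, while an address in the gap between the two segments returns `None`.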
When the ImageLoader has finished loading the executable file, it passes the resulting Program to the Scanner. The Scanner in turn uses the program entry point(s), which should have been determined by the ImageLoader, as starting points for traversing the LoadedImage. The Scanner uses the Program's ProcessorArchitecture to create a Rewriter. The Rewriter visits successive LoadedImage locations and decomposes the machine code instructions it encounters into Register Transfer Language (RTL) instructions, which model the sometimes complex machine instructions with simple, side-effect-free operations.
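The traversal described above can be sketched as a worklist algorithm over a toy instruction set. Everything in this Python example is an assumption made for illustration: the one-unit-long instructions, the mnemonics, the textual RTL, and the `rewrite` and `scan` functions are invented here and are not Reko's Scanner or Rewriter. The sketch shows two ideas from the text: one machine instruction can expand into several simple RTL statements (the `add` also sets a zero flag), and only locations reachable from an entry point are treated as code.

```python
from typing import Dict, Iterable, List, Tuple

# Toy image: address -> (mnemonic, operands...). Every instruction
# occupies exactly one address unit in this made-up instruction set.
TOY_IMAGE: Dict[int, Tuple] = {
    0: ("mov", "r1", 5),
    1: ("add", "r1", "r2"),
    2: ("jz", 4),
    3: ("jmp", 5),
    4: ("ret",),
    5: ("ret",),
    6: ("mov", "r9", 0),   # never reached from the entry point
}


def rewrite(addr: int, instr: Tuple) -> Tuple[List[str], List[int]]:
    """Translate one instruction to RTL text and list its successor addresses."""
    op = instr[0]
    fallthrough = addr + 1
    if op == "mov":
        return [f"{instr[1]} = {instr[2]}"], [fallthrough]
    if op == "add":
        dst, src = instr[1], instr[2]
        # The flag update hidden inside 'add' becomes an explicit statement.
        return [f"{dst} = {dst} + {src}", f"Z = ({dst} == 0)"], [fallthrough]
    if op == "jz":
        return [f"if (Z) goto {instr[1]:#x}"], [instr[1], fallthrough]
    if op == "jmp":
        return [f"goto {instr[1]:#x}"], [instr[1]]
    if op == "ret":
        return ["return"], []
    raise ValueError(f"unknown mnemonic: {op}")


def scan(image: Dict[int, Tuple], entry_points: Iterable[int]) -> Dict[int, List[str]]:
    """Follow control flow from the entry points, rewriting each reachable instruction once."""
    rtl: Dict[int, List[str]] = {}
    worklist = list(entry_points)
    while worklist:
        addr = worklist.pop()
        if addr in rtl or addr not in image:
            continue  # already visited, or outside the image
        statements, successors = rewrite(addr, image[addr])
        rtl[addr] = statements
        worklist.extend(successors)
    return rtl


rtl = scan(TOY_IMAGE, [0])
```

After the scan, `rtl` holds entries for addresses 0 through 5 only; address 6 is never visited because no control flow path leads to it.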
Once Reko has located as much executable code as it can, it starts the data flow analysis phase. This consists of determining what regions of code can be grouped as procedures; whether or not the procedures appear to return; what registers the procedures destroy when called; and what registers or memory are being used to pass data in and out of the procedures. With that information, the procedures' signatures can be deduced and the procedures can then be analyzed separately. Expressions are simplified and coalesced, induction variables are detected, and local variables are identified and given names.
Reko's type analysis phase tries to recover type information by looking at how expressions are used and deducing their types from the operations involved. Once the data types of all expressions are discovered, they are used to rewrite the RTL into a slightly higher-level representation, with raw memory accesses replaced by field accesses or array accesses as appropriate.
The final phase attempts to locate as many high-level structures as possible in the code. The control flow graph is converted to if-then-elses, while-loops, and switches wherever possible.