HowTo
So you're interested in contributing, but are unsure how to get started? Here are a few primers on how to extend Reko's functionality. First, though, you should make yourself familiar with the Design of the solution.
To implement a processor architecture, be it a real physical processor or a virtual machine, you need to implement the IProcessorArchitecture and related interfaces.
Naturally, a prerequisite is that you are familiar with the processor and have access to manuals from the manufacturer that describe its architecture, its instruction repertoire, and how its machine code is encoded. In the following discussion, we assume you are implementing support for the (fictitious) processor MicroFoo. As you follow these instructions, it will be helpful to consult the source code of some of the other implemented processor architectures as a guideline.
The first thing to do is to create a new project in the Reko solution, under the ~/src/Arch directory. Your source code should use a namespace like Reko.Arch.MicroFoo, and your assembly should be named Reko.Arch.MicroFoo.
The first class in your project is MicroFooArchitecture, which must implement the IProcessorArchitecture interface. Your implementation must describe how many registers the processor has, how large pointers are, what endianness is used to interpret words stored in memory, and other such machine-specific details. Pay special attention to the GetRegister() methods, which return a RegisterStorage instance representing a specific processor register.
Consult the processor manual and obtain a list of all the opcodes. Create an enum called Opcode and put the opcodes inside. It is a good idea to make the first Opcode be invalid:
    enum Opcode {
        invalid = 0,    // return this when disassembling bytes that aren't valid machine code.
        load,
        store,
        addi,           // etc.
    }
Most processor instructions will have one or more operands. The Reko.Core.Machine namespace defines the abstract class MachineOperand and two concrete subclasses, which implement the common cases of a RegisterOperand and an ImmediateOperand, containing a RegisterStorage and a Constant, respectively. MicroFoo may have strange memory addressing modes (see the M68k processor architecture for a particularly complex set of addressing modes), so you will likely need to derive your own subclasses from MachineOperand to support each of MicroFoo's addressing modes.
To represent disassembled machine instructions you need to create a class MicroFooInstruction, deriving from Reko.Core.Machine.MachineInstruction, that models the instructions of the MicroFoo processor. You will want to keep track of an opcode and allocate enough member variables to store all possible operands for any machine instruction the processor can execute. You should be familiar enough with your architecture to know the maximum number of operands an instruction can have, and allocate space accordingly. Suppose MicroFoo instructions can have at most 3 operands. You have the choice of either implementing this as three discrete fields or properties:
    class MicroFooInstruction : MachineInstruction {
        public Opcode opcode;
        public MachineOperand op1, op2, op3;
    }
or as an array, which may vary in length depending on the instruction:
    class MicroFooInstruction : MachineInstruction {
        public Opcode opcode;
        public MachineOperand[] ops;
    }
Importantly, every implementor of a MachineInstruction should override the Render() method. This method is used by the user interface to render machine instructions into text, possibly colorized.
You can use the different methods of the supplied MachineInstructionWriter reference to write opcodes and addresses. The user interface components will then render the opcodes and addresses in a different color, and make addresses hyperlinks to their destinations.
Once you've implemented the representation of MicroFoo's instructions, you are ready to create the MicroFooDisassembler. Disassemblers can be viewed as filters that take an input stream of bytes provided by an ImageReader and return a stream of disassembled machine instructions. To do this, MicroFooDisassembler implements the IEnumerable<MicroFooInstruction> interface. The constructor of the disassembler will need to accept at least one argument, the ImageReader. The implementation of IEnumerable<MicroFooInstruction>.GetEnumerator() returns an enumerator whose MoveNext method is responsible for reading one or more bytes using the ImageReader, interpreting the machine code represented by those bytes, and returning a MicroFooInstruction.
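The filter structure described above can be sketched without any Reko dependencies. Everything in this sketch is a hypothetical stand-in: the disassembler reads from a plain byte array rather than Reko's ImageReader, MicroFooInstruction does not derive from MachineInstruction, and the encoding (low nibble 2 means movi, high nibble is the register number, followed by a 16-bit big-endian immediate) is invented purely for illustration.

```csharp
using System.Collections;
using System.Collections.Generic;

// Hypothetical stand-ins; Reko's real MachineInstruction and ImageReader differ.
enum Opcode { invalid = 0, movi }

class MicroFooInstruction
{
    public Opcode opcode;
    public int reg;      // register number of the destination
    public int imm;      // immediate value
    public int Length;   // number of bytes consumed by this instruction

    public override string ToString() =>
        opcode == Opcode.invalid ? "invalid" : $"{opcode}\tr{reg},0x{imm:X2}";
}

class MicroFooDisassembler : IEnumerable<MicroFooInstruction>
{
    private readonly byte[] bytes;

    public MicroFooDisassembler(byte[] bytes) { this.bytes = bytes; }

    // Each step of the enumerator decodes one instruction and advances
    // past the bytes it consumed.
    public IEnumerator<MicroFooInstruction> GetEnumerator()
    {
        int pos = 0;
        while (pos < bytes.Length)
        {
            MicroFooInstruction instr = Decode(pos);
            pos += instr.Length;
            yield return instr;
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    // Invented encoding: low nibble 0x2 = movi, high nibble = register,
    // next two bytes = big-endian 16-bit immediate.
    private MicroFooInstruction Decode(int pos)
    {
        byte b = bytes[pos];
        if ((b & 0x0F) == 0x2 && pos + 2 < bytes.Length)
        {
            return new MicroFooInstruction
            {
                opcode = Opcode.movi,
                reg = b >> 4,
                imm = (bytes[pos + 1] << 8) | bytes[pos + 2],
                Length = 3,
            };
        }
        // Anything else is reported as invalid and skipped one byte at a time.
        return new MicroFooInstruction { opcode = Opcode.invalid, Length = 1 };
    }
}
```

Iterating over new MicroFooDisassembler(bytes) with foreach then yields one instruction per step. Returning an invalid instruction rather than throwing lets the caller decide how to treat bytes that are not machine code.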
Implementing a disassembler is a large task, especially for processors with large numbers of instructions, addressing modes, or both. The work can be made easier by exploiting regularities in the machine code encodings. We strongly recommend implementing the various instructions by creating a unit test for each instruction like this:
    [Test]
    public void MicroFoo_dasm_movi()
    {
        AssertCode("movi\tr1,0x42", 0x12, 0x00, 0x42);
    }
The test should be read as "when the byte sequence 0x12 0x00 0x42 is encountered, the disassembler should return the machine instruction movi r1,0x42". Consult the source code for PowerPCDisassembler for a good example of how to implement this.
Finally, you will need to implement an instruction rewriter, MicroFooRewriter. Rewriters can be viewed as filters that take an input stream of machine instructions and return a stream of low-level register transfer level instructions (RTL) that model possibly very complex machine code with very simple operations. Rewriting a typical CISC instruction can result in many RTL instructions; for instance rewriting the M68k instruction:
add.l -(a3),d0
results in the following RTL instructions:
a3 = a3 - 4
tmp1 = Mem[a3:word32]
d0 = d0 + tmp1
CVZNX = cond(d0)
which model the predecrement operator and the setting of condition codes. Later passes of the decompiler will strive to reduce this to a more compact high-level language representation.
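To make the one-to-many expansion concrete, here is a deliberately simplified sketch that produces the four RTL lines shown above as plain strings. The class and method names are hypothetical, and the string representation is an assumption made for brevity; a real rewriter would be expected to emit structured RTL objects rather than text.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of Reko: expands "add.l -(aN),dM"
// into the sequence of simple RTL operations described in the text.
static class M68kRewriteSketch
{
    public static List<string> RewriteAddLongPredec(string addrReg, string dataReg)
    {
        return new List<string>
        {
            // Predecrement: the address register moves down by the operand size (4 bytes).
            $"{addrReg} = {addrReg} - 4",
            // Fetch the 32-bit memory operand into a temporary.
            $"tmp1 = Mem[{addrReg}:word32]",
            // Perform the addition itself.
            $"{dataReg} = {dataReg} + tmp1",
            // Model the condition codes set by the instruction.
            $"CVZNX = cond({dataReg})",
        };
    }
}
```

Calling M68kRewriteSketch.RewriteAddLongPredec("a3", "d0") returns the four lines listed above, in order.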
When you've created a disassembler and a rewriter, you can implement the CreateDisassembler() and CreateRewriter() methods on MicroFooArchitecture. Now you're ready to make the Decompiler aware of your processor architecture by adding a reference to it in the configuration file for the Decompiler. In the app.config file, look for the <Architectures> section, and add the following element:
    <Architecture
        Name="uFoo"
        Description="MicroFoo Architecture"
        Type="Reko.Arch.MicroFoo.MicroFooArchitecture,Reko.Arch.MicroFoo" />
To test that your new architecture is working, you need to make the build process copy your architecture into the directory where Decompiler is built. You'll need to modify the WindowsDecompiler.csproj file manually and add an entry in the <Architectures> element.
Make sure you have access to a specification of the image file format. In the following discussion, we assume you are implementing support for the (fictitious) image file format FooExe.
First, create a new project in the Reko solution, under the ~/src/ImageLoaders directory. Your code should use a namespace like Reko.ImageLoaders.FooExe, and your assembly should be named Reko.ImageLoaders.FooExe.
The central class of your project will be responsible for loading an image from an array of bytes that the Reko framework will have read from a file. In our example, it would be named FooExeImageLoader and it must have Reko.Core.ImageLoader as its base class.
The constructor of the image loader must take exactly three parameters:
- an IServiceProvider reference, which you can use to access services provided by the core decompiler. For instance, if your image loader needs to show a user interface, like a dialog box, while loading, you can use the IServiceProvider reference to access the IDecompilerUIService service.
- the name of the file that contained the image. Note that you don't need to open this file; the file name is provided in case the image loader loads differently depending on, say, the file extension.
- an array of raw bytes loaded from the file.
Your image loader needs to implement the abstract Load and Relocate methods. The Load method is responsible for ensuring the image is a valid one, and creating a LoadedImage, which is what the executable looks like once it has been loaded into memory; in general, the byte array in the LoadedImage will be different from the raw bytes in the binary file. Your Load method must also determine what processor architecture and what operating environment the program was intended for. Finally, if the image format supports segments, you must add them to an ImageMap object.
The Load method finishes by returning an instance of the Program class, which will contain:
- a processor architecture for the image
- a LoadedImage containing the post-load layout of the program
- an ImageMap that describes any segments described in the image format
- an instance of a class derived from Platform that describes the operating environment the program was written for.
The Relocate() method applies any relocations that may be necessary. Some executables don't have any relocations, while others do. Consult the specification and model your implementation after the ones for, say, PE executables.
Once the loader is completed, you need to make Reko aware of it, and give rules for deciding whether a given file is in fact a FooExe file. Very often, an executable file can be identified by the presence of a magic number, often (but not always!) located at the start of the file. For instance, a large class of Microsoft executable files start with the bytes 4D 5A (interpreted as ASCII, these are the initials of Mark Zbikowski, who designed the image file format for MS-DOS 2.0), while ELF images will start with the bytes 7F 45 4C 46.
The appropriate section in the configuration file is called <Loaders>. For our sample loader, we would add the following sub-element:
    <Loader
        MagicNumber="464F4F0A"
        Offset="0"
        Type="Reko.ImageLoaders.FooExe.FooExeImageLoader,Reko.ImageLoaders.FooExe" />
which specifies that if the four bytes at offset 0 of the image file match the magic number specified, use a Reko.ImageLoaders.FooExe.FooExeImageLoader to load the image.
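The matching rule expressed by the MagicNumber and Offset attributes can be illustrated with a small, self-contained sketch. The class and method names here are hypothetical and are not Reko's actual loader-selection code; the sketch only shows the comparison of a hex-encoded magic number against the bytes at a given offset.

```csharp
using System;

// Hypothetical illustration of magic number matching; not Reko's implementation.
static class MagicNumberSketch
{
    // Converts a string such as "464F4F0A" into the bytes 46 4F 4F 0A.
    public static byte[] ParseHex(string hex)
    {
        var result = new byte[hex.Length / 2];
        for (int i = 0; i < result.Length; ++i)
            result[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        return result;
    }

    // Returns true if the image contains the magic number at the given offset.
    public static bool Matches(byte[] image, string magicNumber, int offset)
    {
        byte[] magic = ParseHex(magicNumber);
        if (offset < 0 || offset + magic.Length > image.Length)
            return false;   // image too short to contain the magic number
        for (int i = 0; i < magic.Length; ++i)
        {
            if (image[offset + i] != magic[i])
                return false;
        }
        return true;
    }
}
```

For example, MagicNumberSketch.Matches(bytes, "464F4F0A", 0) is true for any file whose first four bytes are the ASCII characters "FOO" followed by a line feed.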
Assume that you have an OllyScript file called FooUnpacker.osc and you wish to have it executed automatically when an image file, packed by the (fictitious) FooPacker version 1.0, is loaded. You must first identify a signature, that is, a sequence of byte values that uniquely identifies the packer in question. Given a signature, you need to add an XML element in the file ~/src/Decompiler/Loading/Signatures/IMAGE_FILE_MACHINE_I386.xml like this:
<ENTRY>
<NAME>FooPacker v1.0</NAME>
<COMMENTS />
<ENTRYPOINT>45A4EA??????D3</ENTRYPOINT>
<ENTIREPE />
</ENTRY>
The <ENTRYPOINT> sub-element specifies that the given pattern of bytes must be present at the entry point of the program for this to be considered a match for an image packed by "FooPacker v1.0".
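A sketch of how such a pattern might be tested against an image follows. It assumes that each ?? pair in the pattern is a wildcard matching any single byte, which the text above does not state explicitly; the class and method names are hypothetical and do not come from Reko.

```csharp
using System;

// Hypothetical illustration of entry point signature matching.
// Assumption: "??" in the pattern matches any one byte.
static class SignatureSketch
{
    public static bool MatchesAt(byte[] image, int entryPoint, string pattern)
    {
        int count = pattern.Length / 2;   // two hex characters per byte
        if (entryPoint < 0 || entryPoint + count > image.Length)
            return false;
        for (int i = 0; i < count; ++i)
        {
            string pair = pattern.Substring(i * 2, 2);
            if (pair == "??")
                continue;                 // wildcard: accept any byte here
            if (image[entryPoint + i] != Convert.ToByte(pair, 16))
                return false;
        }
        return true;
    }
}
```

With the pattern 45A4EA??????D3, the three bytes between EA and D3 are ignored, so the signature still matches if the packer stores a varying value there.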
After specifying the signature in the signature file, you need to tell Reko what script to use to unpack it. This is done by adding an element like the following to the <Loaders> section of the Reko configuration file:
<Loader
Label="FooPacker v1.0"
Argument="FooUnpacker.osc"
Type="Decompiler.ImageLoaders.OdbgScript.OdbgScriptLoader,Decompiler.ImageLoaders.OdbgScript" />
Here we're stating that if we have detected a "FooPacker v1.0" signature, then we will use the OdbgScriptLoader to load the unpacker file FooUnpacker.osc.
The development team makes an effort to provide disassembler and rewriter support for all instructions handled by each processor, but resource constraints sometimes cause us to fall short of this goal. If you have discovered that a particular processor's disassembler is not able to disassemble what you know is a valid sequence of machine code bytes, you can add support for it yourself.
Start by following the Test Driven Development methodology and creating a unit test for the missing instruction. Locate the disassembler unit tests for the processor architecture in question. They typically look like this:
    [Test]
    public void X86_xor()
    {
        AssertCode("xor\teax,eax", 0x33, 0xC0);
    }
Here the test is asserting that when the disassembler encounters the bytes 33 C0, the disassembler should emit an instruction which, when converted to a string, reads xor eax,eax.
Run the unit tests. If the byte sequence you provided is not yet supported, the unit test will fail. Now you need to implement the disassembly of the instruction. Use the other, already implemented instructions as a guideline. Disassemblers vary widely in their implementation, but the majority perform lookups in arrays and/or dictionaries to map byte values to disassembled instructions.
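The table-driven approach can be sketched in a few lines. This is an illustration only, with hypothetical class and method names, and it covers just three single-byte x86 opcodes; a real x86 disassembler must also handle prefixes, ModRM bytes and operands.

```csharp
using System.Collections.Generic;

// Hypothetical illustration of a lookup-table decoder; not Reko's x86 disassembler.
static class OneByteDecoderSketch
{
    // Maps an opcode byte to its mnemonic.
    private static readonly Dictionary<byte, string> table = new Dictionary<byte, string>
    {
        { 0x90, "nop" },
        { 0xC3, "ret" },
        { 0xCC, "int3" },
    };

    // Adding support for a missing instruction amounts to adding a table entry
    // (plus operand decoding, which this sketch omits).
    public static string Decode(byte opcode)
    {
        return table.TryGetValue(opcode, out string mnemonic) ? mnemonic : "invalid";
    }
}
```

In a design like this, a failing unit test for an unsupported byte value typically shows up as the "invalid" result, and the fix is localized to the table.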
Once the disassembler unit test is passing, it's time to change the corresponding RTL rewriter. Locate the rewriter unit tests; they will likely look something like this:
    [Test]
    public void X86_Rewrite_Xor()
    {
        AssertCode(0x33, 0xC0,
            "0|00100000(2): 3 instructions",
            "1|L--|eax = eax ^ eax",
            "2|L--|SZ = cond(eax)",
            "3|L--|C = false");
    }
The first line states that an instruction starting at address 00100000 and being 2 bytes long was rewritten into 3 RTL instructions. The remaining lines are those RTL instructions. The field after the line number states that this instruction is classified L (for 'linear'). A jump or call statement might have been classified as T (for 'transfer').
Note how the RTL rewriter must be careful to model the effects of the machine code exactly. Many x86 programs depend on the carry flag being clear after certain logic operations. The translation
    [Test]
    public void X86_Rewrite_Xor()
    {
        AssertCode(0x33, 0xC0,
            "0|00100000(2): 2 instructions",
            "1|L--|eax = eax ^ eax",
            "2|L--|SZC = cond(eax)");
    }
while in a strict sense as correct as the previous translation, is not as good, since we lose the opportunity to leverage, in later stages of the decompiler, the fact that the C (carry) flag is clear.