Apparently the quickest way to make software moar faster is to turn it into a tiny compiler, and lots of things are doing it. As I get time between iterations of TenFourFox and smoke-testing Firefox builds on ppc64le, slow work on the Firefox JIT continues, but that doesn't mean we can't be JITting all the other things in the meantime.
One of my favourite games is LucasArts' Dark Forces, an FPS set in the Star Wars universe (but now apparently non-canon after the events of Rogue One). Although projects like XLEngine can run it, that, too, requires a code generator to be written (because of AngelScript). I decided that if I had to write a backend after all, a better approach would be to add a backend to DOSBox, the famous DOS emulator. That would happily run the copy of PC Dark Forces I already have and any of my other old DOS games like Extreme Pinball and Pinball Illusions and Death Rally and all those other great titles in my office besides. (The classic Mac version of Dark Forces is better, by the way, not least because of its beautiful high-resolution graphics that are double those of the PC release.)
Fortunately, a 32-bit big-endian PowerPC version of DOSBox already existed as unofficial patches (play it on your old Power Mac), which took only a few days for me to convert to 64-bit little-endian. While DOSBox in strictly interpreted mode on the Talos II is no slouch, this JIT, which is for DOSBox's dynamic recompiling "dynrec" core, increases performance roughly by a factor of four. This makes even the most demanding games playable and makes most other games run like butter (in fact, it's so fast it even destabilizes some timing loops, like the credits scroller in Descent). If I could shoot and take screenshots at the same time, you'd see me do better at blowing away Imperial officers, too.
You can build this yourself. Download the patch and backend, which we will call ppc64le_dosbox.diff and risc_ppc64le.h. This patch is intended to apply against SVN trunk.
svn checkout svn://svn.code.sf.net/p/dosbox/code-0/dosbox/trunk dosbox-code-0
cd dosbox-code-0
patch -p0 < ../ppc64le_dosbox.diff
cp ../risc_ppc64le.h src/cpu/core_dynrec
./autogen.sh
./configure CFLAGS="-O3 -mcpu=power9" CXXFLAGS="-O3 -mcpu=power9"
make -j24 (or as you like)
Copy over your favourite games or install media, and run src/dosbox to start your fast-powered DOS machine. Or try this benchmark pack and see for yourself. (I like the PC Player one.) See the DOSBox documentation and DOSBox wiki for more.
But let's say you'd like to work on a JIT backend of your very own for some other ppc64le port. This is hardly the place for a tutorial on writing ppc64 assembly language -- you can read IBM's -- but I will talk about how to get generated code into memory and how to execute it, and how this might differ from x86_64 or ARM. The following examples should run on any 64-bit Power CPU from the 970/G5 to the POWER9 regardless of pagesize or endianness under recent Linux kernel versions, but they're tested on my ppc64le Talos II in Fedora 31, of course.
Being able to emit generated code to memory and run it is actually a rather big security hole unless it is done carefully and correctly. Indeed, SELinux will halt you right in your tracks (as it should) unless you do the correct mating dance. The basic steps are:
1. Allocate writeable memory.
2. Emit machine code to that memory.
3. Flush the caches.
4. Make that tract of memory executable. (This is the dangerous bit. We'll talk about how to mitigate it.)
5. Run the now executable code.
6. Profit!
For step 1 (and, indirectly, step 4), you need to know what the OS believes the memory page size is. (On Fedora 31, this Talos II has a pagesize of 64K. Some Linuces like Adelie use 4K pages.) For step 3, you need to know how large a cache line on your CPU is (for all current 64-bit Power ISA processors, this number is 128 bytes). We will handwave these away a bit here by keeping the example code at or less than a cache line's length, and reading the OS's page size ourselves.
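As a rough sketch of the sizing arithmetic just described, the helpers below ask the OS for its page size with sysconf() and round a code size up to whole pages and whole cache lines. The function names and the CACHE_LINE constant are made up for illustration, and the line count assumes the code starts on a cache line boundary (which a page-aligned mapping does).

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define CACHE_LINE 128u  /* assumption: line size quoted in the text */

/* Round n up to the next multiple of unit. */
static size_t round_up(size_t n, size_t unit)
{
    assert(unit != 0);
    return (n + unit - 1) / unit * unit;
}

/* Page size as the OS reports it (64K on some distros, 4K on others). */
static size_t os_page_size(void)
{
    long n = sysconf(_SC_PAGESIZE);
    return n > 0 ? (size_t)n : 0;   /* 0 signals failure */
}

/* Bytes to request from mmap() for code_bytes of generated code. */
static size_t mapping_size(size_t code_bytes)
{
    return round_up(code_bytes, os_page_size());
}

/* Cache lines to flush for code that starts on a line boundary. */
static size_t cache_lines_for(size_t code_bytes)
{
    return round_up(code_bytes, CACHE_LINE) / CACHE_LINE;
}
```

For the two-instruction example that follows, mapping_size(8) is one page and cache_lines_for(8) is one line, which is why the toy examples can get away with a single flush.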
Let's look at the first example. You'll notice a couple blocks of code commented out. If you are not using SELinux for whatever reason, you may be able to get away with posix_memalign() and mprotect() to allocate your memory. However, if you use this on an SELinux system (Red Hat, recent Debians, etc.), you will have to modify your policy or temporarily disable some protection features to run that code.
The better way is to create the tract of memory as an in-RAM file and use mmap() to make it writeable. This only works in integral page numbers, hence the need to know your page size (we ask the kernel via syscall). You may be able to use the second commented block to call memfd_create() directly, but the syscall approach I've used here doesn't require setting _GNU_SOURCE. Once we have mmap()ed the memory, we can write to it. We do so as simply an array of 32-bit integers (even 64-bit Power still has 32-bit opcodes).
I've ripped off the assembler macros from DOSBox's backend because they match up nicely with the numbers in the Power ISA book and pretty much any Power assembly language reference. Our first example runs a very simple program:
li 3,2020
blr
This code loads the integer 2020 into r3, the first argument register and the standard return register in the ELFv2 ABI (and indeed for any PowerPC using either PowerOpen ABI like AIX and Mac OS X, or SysV ABIs like Linux or the BSDs). It then branches back to the return address in the link register, terminating the program. We emit this code to the mmap()ed in-memory file.
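To make the encoding concrete, here is a minimal sketch of how those two instructions turn into 32-bit words. These are not the DOSBox macros. The helper names are hypothetical, and the bit layout is the standard D-form from the Power ISA book (li is really addi with RA=0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* D-form: 6-bit opcode, RT, RA, then a 16-bit signed immediate. */
static uint32_t ppc_dform(unsigned op, unsigned rt, unsigned ra, int imm)
{
    assert(imm >= -32768 && imm <= 32767);
    return ((uint32_t)op << 26) | ((uint32_t)rt << 21) |
           ((uint32_t)ra << 16) | ((uint32_t)imm & 0xFFFFu);
}

/* li rt,imm is an alias for addi rt,0,imm (primary opcode 14). */
static uint32_t ppc_li(unsigned rt, int imm)
{
    return ppc_dform(14, rt, 0, imm);
}

/* blr: branch to the address in the link register. */
static uint32_t ppc_blr(void)
{
    return 0x4E800020u;
}

/* Emit "li 3,value ; blr" and return the number of words written. */
static size_t emit_return_constant(uint32_t *buf, int value)
{
    buf[0] = ppc_li(3, value);
    buf[1] = ppc_blr();
    return 2;
}
```

Because each opcode is stored as a native uint32_t, the same emitter works on big- and little-endian hosts, as long as the code runs on the machine that generated it.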
The Linux kernel flushes the data and instruction caches of the processor separately, so we do the same. For every cache line we touched, we use the dcbst instruction to write the data cache line in question back to memory, and once they are all written back, we use the sync instruction so that each CPU's view of memory is consistent. Then we invalidate the instruction cache in the same way, using the icbi instruction and finally the isync instruction to ensure consistency of the I-cache this time. Because this all fits into a single cache line we just do each instruction once.
As our last step, we do another mmap() to make the tract of memory executable. Since we are using a named file to store our code rather than just memory we managed to grab, SELinux does not block it the same way it would ordinarily block an anonymous allocation. The second mmap() adds further security, because we are not making the memory executable until the program is fully assembled in RAM, and while the two mmap()s are linked and reference the same memory area as far as our program is concerned, assuming randomization is in effect an attacker would now have to derive two completely random memory addresses to do any funny business. We then execute the code as if it were a C function (more on this in a moment).
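The two-mapping approach can be sketched roughly as follows. This is an illustration under assumptions rather than the actual example source: the jit_region struct and function names are invented here, error handling is minimal, and the cache flush that belongs between writing and sealing is omitted so the sketch compiles on any Linux host. It does not jump into the code.

```c
#define _DEFAULT_SOURCE   /* for syscall() and ftruncate() under strict -std */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct jit_region {        /* hypothetical bookkeeping struct */
    int       fd;          /* in-RAM file backing both views */
    size_t    size;        /* whole pages */
    uint32_t *rw;          /* writable view: emit opcodes here */
    void     *rx;          /* executable view: call through this */
};

/* Create the in-RAM file and map it read/write. Returns 0 on success. */
static int jit_region_open(struct jit_region *r, size_t pages)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    assert(pages > 0);
    if (pagesize <= 0)
        return -1;
    r->size = pages * (size_t)pagesize;
    r->rx = NULL;
    r->fd = (int)syscall(SYS_memfd_create, "jitcode", 0u);
    if (r->fd < 0)
        return -1;
    if (ftruncate(r->fd, (off_t)r->size) != 0) {
        close(r->fd);
        return -1;
    }
    r->rw = mmap(NULL, r->size, PROT_READ | PROT_WRITE, MAP_SHARED, r->fd, 0);
    if (r->rw == MAP_FAILED) {
        close(r->fd);
        return -1;
    }
    return 0;
}

/* Second mmap(): same file, separate address, read/execute only. */
static int jit_region_seal(struct jit_region *r)
{
    r->rx = mmap(NULL, r->size, PROT_READ | PROT_EXEC, MAP_SHARED, r->fd, 0);
    return r->rx == MAP_FAILED ? -1 : 0;
}
```

After jit_region_seal() succeeds, r.rx and r.rw are different addresses showing the same bytes. On a Power host you would flush the caches and then cast r.rx to a function pointer.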
Run the code like this:
% gcc -o jitlab1 jitlab1.c
% ./jitlab1
jitcode at 0x7fffac650000
result = 2020
Excellent! Note that the actual address of the jitcode may vary from run to run.
However, an example this trivial can only be this trivial because it didn't need to pull and manage a stack frame or maintain all but the most token adherence to the C ABI. We need a stack frame if we modify, or might modify, any non-volatile register or the link register (i.e., make any calls to other subroutines). Frankly, a JIT isn't much good if it can't call into its host somehow, either to run higher level operations or exchange data, so for all practical purposes you'll probably want to pull a stack frame in your code and be ABI compliant.
With that in mind, let's turn to our second contrived example. This slightly more complex demonstration will receive an integer value which it will perform basic math on (add 2020 to it), then call a function to display that result, double the result, and return that. As we are doing very little actual work in this JIT code (essentially messing with r3 as both the in and out argument, which is volatile, so it need not be saved in the stack frame we created), that makes control flow simpler. The function prologue, which is broken up a bit for performance reasons, can be as simple as stdu 1,-size(1):mflr 0:std 0,size+16(1), where size is the desired size of the stack frame (we use 256 here just for laziness).
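For illustration, the prologue above and a matching epilogue could be emitted along these lines. The encoders are hypothetical helpers following the DS-form layout in the Power ISA book, and the epilogue order (reload LR, then pop the frame) is one reasonable choice, not necessarily what the example source does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* DS-form (std, stdu, ld): displacement must be a multiple of 4. */
static uint32_t ppc_dsform(unsigned op, unsigned rs, unsigned ra,
                           int disp, unsigned xo)
{
    assert((disp & 3) == 0 && disp >= -32768 && disp <= 32764);
    return ((uint32_t)op << 26) | ((uint32_t)rs << 21) |
           ((uint32_t)ra << 16) | ((uint32_t)disp & 0xFFFCu) | xo;
}

/* stdu 1,-frame(1) ; mflr 0 ; std 0,frame+16(1) */
static size_t emit_prologue(uint32_t *buf, int frame)
{
    assert(frame > 0 && frame % 16 == 0);     /* ABI: 16-byte aligned */
    buf[0] = ppc_dsform(62, 1, 1, -frame, 1); /* stdu: push the frame */
    buf[1] = 0x7C0802A6u;                     /* mflr 0 */
    buf[2] = ppc_dsform(62, 0, 1, frame + 16, 0); /* std: LR save slot */
    return 3;
}

/* ld 0,frame+16(1) ; mtlr 0 ; addi 1,1,frame ; blr */
static size_t emit_epilogue(uint32_t *buf, int frame)
{
    assert(frame > 0 && frame % 16 == 0);
    buf[0] = ppc_dsform(58, 0, 1, frame + 16, 0); /* ld: reload old LR */
    buf[1] = 0x7C0803A6u;                         /* mtlr 0 */
    buf[2] = 0x38210000u | ((uint32_t)frame & 0xFFFFu); /* addi 1,1,frame */
    buf[3] = 0x4E800020u;                         /* blr */
    return 4;
}
```

The saved link register lands at frame+16 relative to the new stack pointer because ELFv2 keeps the LR save slot at offset 16 in the caller's frame.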
Having pulled the frame and computed the first value, we now want to call the display routine to show our work. A minor complication is that there is no PowerPC/Power ISA instruction to directly branch to an address in a regular general purpose register. Instead, the ISA only supports indirect branching to either the address in the link register "LR" (blr) or the counter register "CTR" (bctr), both of which are special-purpose registers. As a practical measure, for branching other than returns from a routine, we prefer the counter register which does not need to be saved.
A more significant complication is setting up the actual function call itself. Without getting too deep into the weeds, 64-bit ELFv2 compliant functions can have two entry points, called the "global entry point" and the "local entry point." The global entry point is called when r2, the register customarily used for global symbol access through the Table of Contents, must be computed. The TOC, a holdover from PowerOpen systems, contains a conventional ELF global offset table "GOT" and optionally a small data section. The global entry point invariably consists of two instructions that compute r2 from the address of the routine itself, which is conventionally maintained in r12, and the local entry point follows at eight bytes after the global entry point. This scheme facilitates position-independent code generation as the compiler can emit code referencing globals as relative indexes on the TOC base stored in r2.
At link time the linker looks at branches and determines whether the caller function and callee function will share the same TOC. If they don't, then the linker points the branch at the global entry point either directly or through a procedure linkage table "PLT" stub. If they do, however, then this call is considered "local" (from the perspective of the global context), and the linker calls the local entry point instead. However, if it turns out the callee actually does no global access, the compiler generates only a single entry point because there is no need to compute r2, and the linker calls that.
The linker can juggle this because it has time to burn and the breadth of the codebase to scan. However, in our example here, we are the linker, and we have much less visibility into the codebase we're trying to call. Think of it as working in the basement with a flashlight trying not to walk into the walls and only a limited time to finish our job.
A hard branch (using b or bl) will always be faster, even if imperceptibly, and is less susceptible to Spectre-style attacks. However, if we actually turn out to be calling a function's global entry point and r12 is not set correctly, then r2 will also not be set correctly, and global access will either fault or just be plain wrong. If we end up doing all this computation, we might as well just branch to r12 via CTR, which is slower by a minor degree but will always work. Plus, if we're able, whatever speed hit is incurred can be mitigated by hoisting the mtctr up a few instructions ahead of the bctr or bctrl that uses it. Either way, whether we hit the global entry, the local entry or a non-global function, everything will be ABI compliant.
That does not mean you can never branch directly. Directly branching to a function with b or bl will work if you are calling a pure local function that accesses no globals, or you are calling the local entry point and nothing has modified r2 (though this is an awfully big gamble in complex codebases), or it's a function you generated (i.e., another JIT function) that you can guarantee to be purely local because it never touches r2. Or you can just set r12 and directly branch, I suppose.
The last consideration you need to keep in mind with JIT function calls is whether they need to be patchable. If they need to be patched or redirected as code addresses change, then you need to have a fixed-size branch stanza that can be changed on the fly. The minimum size of a branch stanza in 64-bit Power ISA is seven instructions (28 bytes) because we may call a 64-bit address and we need four immediate loads to compose it plus a rotation step. For example, to call a routine at 0x1234567876543210, the branch stanza looks like this:
lis 12,0x1234 (of course actually addis 12,0,0x1234)
ori 12,12,0x5678
rldicr 12,12,32,31
oris 12,12,0x7654
ori 12,12,0x3210
mtctr 12
bctrl
The first five instructions yield a full 64-bit quantity, 0x1234567876543210, in r12. The mtctr 12 and bctrl then call the routine and come back to the generated code. Repatching this stanza is "merely" a matter of changing the bottom 16 bits of the four immediate loads and flushing the caches. You can have a direct branch in a stanza if the location of the routine is within the available displacement for that instruction, but then everything else should be nops, e.g.,
bl 0x1234567876543210
nop (ori 0,0,0)
nop
nop
nop
nop
nop
so that if the new address no longer fits into the branch instruction's displacement, you can rewrite it as a full stanza. If you choose to set r12 and end in a hard-coded branch at the same time, remember that you'll need to repatch both things if the address changes, which might make your code generator a bit more hairy. I consider the aggressive promotion of branch stanzas to hard branches to be a form of premature optimization and you shouldn't be doing this until you are sure the rest of your code generator is working correctly.
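Both forms of the stanza can be sketched as a pair of emitters over a fixed seven-word slot. Everything here is illustrative: the function names are invented, the opcodes are hard-wired to r12 as in the listing above, and the caller is assumed to flush the caches after any (re)patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STANZA_WORDS 7
#define PPC_NOP 0x60000000u   /* ori 0,0,0 */

/* Full stanza: build the 64-bit target in r12, then mtctr/bctrl. */
static void emit_call_stanza(uint32_t *buf, uint64_t target)
{
    assert(buf != NULL);
    buf[0] = 0x3D800000u | (uint32_t)((target >> 48) & 0xFFFF); /* lis 12 */
    buf[1] = 0x618C0000u | (uint32_t)((target >> 32) & 0xFFFF); /* ori */
    buf[2] = 0x798C07C6u;                     /* rldicr 12,12,32,31 */
    buf[3] = 0x658C0000u | (uint32_t)((target >> 16) & 0xFFFF); /* oris */
    buf[4] = 0x618C0000u | (uint32_t)(target & 0xFFFF);         /* ori */
    buf[5] = 0x7D8903A6u;                     /* mtctr 12 */
    buf[6] = 0x4E800421u;                     /* bctrl */
}

/* Read the target back out of a full stanza (handy when repatching). */
static uint64_t stanza_target(const uint32_t *buf)
{
    return ((uint64_t)(buf[0] & 0xFFFF) << 48) |
           ((uint64_t)(buf[1] & 0xFFFF) << 32) |
           ((uint64_t)(buf[3] & 0xFFFF) << 16) |
            (uint64_t)(buf[4] & 0xFFFF);
}

/* Direct form: bl plus six nops. 'at' is the address the bl will
 * execute from. Returns 0 (buf untouched) if the target is out of
 * reach of the 24-bit word displacement, i.e. roughly +/-32MB. */
static int emit_direct_call(uint32_t *buf, uint64_t at, uint64_t target)
{
    int64_t disp = (int64_t)(target - at);
    size_t i;
    if ((disp & 3) != 0 || disp < -0x2000000 || disp > 0x1FFFFFC)
        return 0;
    buf[0] = 0x48000001u | ((uint32_t)disp & 0x03FFFFFCu);  /* bl */
    for (i = 1; i < STANZA_WORDS; i++)
        buf[i] = PPC_NOP;
    return 1;
}
```

A code generator could try emit_direct_call() first and fall back to emit_call_stanza() when it returns 0. With the two-mapping scheme, 'at' has to be the address in the executable view, not the writable one.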
This gets even more complex, by the way, if your JIT code must itself access globals; in that case you may need to save r2 yourself and/or do additional linkage work. I should note that in the JITs I've personally written this was never necessary. Let the C host code handle it and debugging your generated code will then be much less of a hassle.
Returning to our example, when the function call returns to the JIT code it finishes its "work," dismantles the stack frame, recovers the previous link register value and branches back to it to exit. The result remains in r3. We use decimal 1111 as our passed value in this example, so the resulting values should be 3131 (1111+2020) and 6262 (3131+3131). Since our intermediate print function helpfully returns the value it was called with, in our example here we don't need to worry about stashing it anywhere. You may not be so lucky calling someone else's work.
% gcc -o jitlab2 jitlab2.c
% ./jitlab2
jitcode at 0x7fff99dc0000
called with 3131
result = 6262
These were obviously toy examples generating small blebs of code. For code blocks greater than one cache line in size, which you will almost certainly generate, you need to dcbst (or similar) and icbi each cache line in whatever setup function gets called to flush the cache. The Z constraint we use here can make this very easy by having gcc do the work of setting up the register for each address for you. See the cache_block_closing() function in the DOSBox backend for a real-world example. In a like fashion, if your code spans multiple memory pages, with the method we have used here you will need to make each memory page writeable, and then ultimately executable, in turn.
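A multi-line flush might look something like the sketch below. It is not the DOSBox cache_block_closing() routine: the function name is made up, it passes addresses with a plain "r" constraint instead of the Z constraint mentioned above, it hard-codes the 128-byte line size, and on non-Power hosts it falls back to the compiler's __builtin___clear_cache so it still compiles elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 128u  /* assumption: line size quoted earlier */

/* Make freshly written code visible to instruction fetch.
 * Returns the number of cache lines the range touches. */
static size_t flush_code_range(void *start, size_t len)
{
    uintptr_t first = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)start + len;
    uintptr_t p;
    size_t lines = 0;

    assert(len > 0);
#if defined(__powerpc64__)
    /* Push each modified data cache line out to memory... */
    for (p = first; p < end; p += CACHE_LINE)
        __asm__ volatile("dcbst 0,%0" : : "r"(p) : "memory");
    __asm__ volatile("sync" : : : "memory");
    /* ...then invalidate the matching instruction cache lines. */
    for (p = first; p < end; p += CACHE_LINE)
        __asm__ volatile("icbi 0,%0" : : "r"(p) : "memory");
    __asm__ volatile("isync" : : : "memory");
#else
    /* Not Power: let the compiler's builtin do whatever is needed. */
    __builtin___clear_cache((char *)start, (char *)start + len);
#endif
    for (p = first; p < end; p += CACHE_LINE)
        lines++;
    return lines;
}
```

Note that the range is widened down to the start of its first cache line, so a block that begins mid-line still has that line flushed.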
You should also get used to tossing trap (tw 31,0,0) instructions in code blocks you're not sure about. This lets you use the debugger to verify that you actually assembled the opcodes in memory correctly and allows you to trap at the point you believe might be problematic and single-step from there. Oddly, in Fedora 31, when a trap instruction is hit gdb does pause but doesn't register a SIGTRAP (whereas lldb does seem to trap correctly, but I'm much more used to gdb personally). In that case, once the trap is sprung in gdb, you need to press CTRL-C to induce an interrupt and actually get into the debugger. Debugging JITs can be a real nightmare especially when the bug is very subtle, so only write the minimum you need to get the JIT off the ground, use traps liberally during building and never optimize prematurely. In particular ensure that your actual assembler steps that write instructions to memory are doing so accurately, as this can yield some rather humiliating issues later if you hit edge cases.
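As one way to act on that advice, the hypothetical helpers below copy a block of opcodes while dropping a trap in front of a chosen instruction, and dump the emitted words as hex so they can be compared by eye against a disassembler's output. Inserting a word shifts everything after it, so this sketch only suits blocks without relative branches across the trap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PPC_TRAP 0x7FE00008u   /* tw 31,0,0: unconditional trap */

/* Copy n opcodes to dst, inserting a trap before index trap_before
 * (trap_before == n appends it). dst needs room for n + 1 words.
 * Returns the number of words written. */
static size_t copy_with_trap(uint32_t *dst, const uint32_t *code,
                             size_t n, size_t trap_before)
{
    size_t i, out = 0;
    assert(trap_before <= n);
    for (i = 0; i < n; i++) {
        if (i == trap_before)
            dst[out++] = PPC_TRAP;
        dst[out++] = code[i];
    }
    if (trap_before == n)
        dst[out++] = PPC_TRAP;
    return out;
}

/* Print byte offset and opcode word, one instruction per line. */
static void dump_words(FILE *f, const uint32_t *buf, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        fprintf(f, "%04zx: %08x\n", i * 4, (unsigned)buf[i]);
}
```

Calling dump_words(stderr, buf, n) right after emission gives a cheap check that the assembler wrote what you meant, before the debugger ever gets involved.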
Let's make all the ppc64le things faster!