Juicing QEMU for fun, ??? and profit!


The number of packages and applications natively available for OpenPOWER continue to grow in just about every distro's package manager, and even if a prebuilt package doesn't exist even more will build from source. But emulation is still going to be a fact of life for Windows-only/x86/x86_64-only (maybe even aarch64-only) binaries we can't rebuild, and KVM only helps us with other Power ISA systems (in fact, it looks like KVM-PR broke and can't boot Mac OS X again, so I guess I'll be diving back into the source), so we need to wring as much speed out of QEMU's emulation engine as possible.

We are fortunate with QEMU in that there is ppc64le support in TCG, the Tiny Code Generator which implements a basic JIT, and the Power ISA TCG backend even emits those tasty newer POWER9 instructions to take better advantage of the processor. Without TCG, QEMU would be dreadfully slow when emulating a foreign architecture. However, unless IBM or some other OpenPOWER hardware developer implements instructions (a la Apple M1) in a future chip that specifically improve emulation of other CPUs (like, I dunno, x86_64), there's very little that can be done to improve the code the Power TCG backend generates and CPU emulation spends most of its time in TCG-generated code.

However, the software MMU that QEMU's CPU emulation uses has pre-compiled portions, and all the devices and components QEMU emulates (like the system bus, video, mass storage, USB, etc.) are also pre-compiled. This gives us an opportunity: with a little extra elbow grease, you can make a link-time-optimized and profile-guided-optimized (LTO-PGO) build of QEMU specific to the particular workload which can run the CPU anywhere from 3-8% faster and video and other devices up to 15% faster depending on the set of devices. While number crunching isn't substantially faster, and the modest CPU improvements don't improve user-mode emulation a great deal, full system emulation's general responsiveness improves and makes using more applications more feasible.

This process is not automated. For Firefox, we make LTO-PGO builds using the internal machinery and our patches for gcc compatibility, which is currently our preferred compiler on OpenPOWER systems. The Firefox build system generates a profiling build first, then automatically collects profiling data with it off a model workload and builds the optimized browser from that profile. QEMU doesn't have that infrastructure right now, but you can do it manually: you configure and compile a profiling build, run your workload with it to create a profile, and then configure and compile an optimized build with the profile thus generated.

I'll give instructions here for both QEMU 5.0 and 5.2, since 5.0 seems to be a bit more performant than 5.2 and has fewer build prerequisites, but 5.2 is more straightforward and we'll do it first. In these examples, I'm optimizing ppc-softmmu so that I can run Mac OS 9, which has never worked properly with KVM-PR; substitute with your desired target, such as x86_64-softmmu. Only do one target at a time, and you will want to do individual builds for each system image — even if you normally use the same executable binary for multiple OSes — because different code paths may be exercised with different workloads and/or configurations.

Let's start with making a profiling build. To do this, we'll add -fprofile-generate to the compiler flags (as well as -flto for LTO). For consistency we'll pass the same set of options to the C compiler, the C++ compiler and the linker (each will ignore options they don't need). In the QEMU source tree,

  • mkdir build
  • cd build
  • ../configure --extra-cflags="-O3 -mcpu=power9 -flto -fprofile-generate" \
    --extra-cxxflags="-O3 -mcpu=power9 -flto -fprofile-generate" \
    --extra-ldflags="-flto -fprofile-generate" --target-list=ppc-softmmu
  • make -j24 (or as appropriate: this is a dual-8 Talos II)

Wait for QEMU to build. When it finishes, back up your drive image because you may not be able to shut it down normally and it would suck to damage it inadvertently. With a backup copy saved, run the new QEMU as you ordinarily would on your target workload. For example, my classic script is (assuming you're still in the build directory)

./qemu-system-ppc -M mac99,accel=tcg,via=pmu -m 1536 -boot c \
-drive id=root,file=classic.img,format=qcow2,l2-cache-size=4M \
-usb -netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no \
-device rtl8139,netdev=mynet0 -rtc base=localtime

You should use as close to your normal configuration as possible so that the device drivers you run are factored into the profile.

The first thing you'll notice is that QEMU is now really, really, really slow. Crust-of-the-earth-cooling slow. This is because it's storing all that profile data every time any block of compiled code is executed. As a result you will probably not be able to type or interact with the guest in any meaningful fashion, so let the system boot, grab a cup of a fortifying beverage and and wait for it to get as far as it can. For Mac OS 9, it took several minutes to get to the desktop; for OS X 10.4, it took about a quarter of an hour (with a lot of timeouts in a verbose boot) to even start the login window. At some point you will not be able to usefully proceed any further with the guest, but fortunately you backed up your drive image already, so you can simply close the window.

Go back to the build directory. This time we will tell gcc to build with the generated profile (-fprofile-use), though we will allow it to account for certain changes (-fprofile-correction) and allow compilation to occur even if a profile doesn't exist for a particular target (-Wno-missing-profile) so that it can get through configure cleanly:

  • make clean (this doesn't remove the profile .gcda files)
  • ../configure \ --extra-cflags="-O3 -mcpu=power9 -flto -fprofile-correction -fprofile-use -Wno-missing-profile" \
    --extra-cxxflags="-O3 -mcpu=power9 -flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --extra-ldflags="-flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --target-list=ppc-softmmu
  • make -j24

Enjoy the new hotness. You should be able to see measurable improvements in the CPU emulation, but more importantly, boot times and responsiveness of the full system emulation should also be improved.

For 5.0.0, the process is a bit more complicated, but it's a bit quicker, so I found it worth it (and it's what I currently use for Mac OS 9). In the QEMU source tree, configure the build:

  • ./configure --extra-cflags="-O3 -mcpu=power9 -flto -fprofile-generate" \
    --extra-cxxflags="-O3 -mcpu=power9 -flto -fprofile-generate" \
    --extra-ldflags="-flto -fprofile-generate" --target-list=ppc-softmmu
  • make -j24

Run your profile as before. However, you need to preserve the profile before the rebuild because make clean will clobber it.

  • tar cvf instrumented.tar `find . -name '*.gcda' -print`
  • make clean
  • tar xf instrumented.tar
  • ../configure \ --extra-cflags="-O3 -mcpu=power9 -flto -fprofile-correction -fprofile-use -Wno-missing-profile" \
    --extra-cxxflags="-O3 -mcpu=power9 -flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --extra-ldflags="-flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --target-list=ppc-softmmu
  • make -j24

Life's golden, and just a little bit zippier. It's not always possible to PGO all the things, but here's one where it makes a noticeable difference.

Comments

  1. I gave up on QEMU entirely by now, it is simply underperformant, and not just on processor speed, but also input lag, screen delays and other issues that plague any virtualized system as opposed to a natively-booted one.

    So instead of working on QEMU, I put a 7448 @2.0+GHz on my MDDs to boot Mac OS 9 with. Those PowerPC processors are still manufactured by NXP even today. Lots of supply available (if you know where to look). MDDs also rain on eBay, and are extremely maintainable, with various parts replaceable, repairable and upgradeable.

    The Talos II family is still interesting in terms of libre hardware and libre software, though, and true end-user ownership. Just not as a "what if Apple didn't switch" successor to PowerMacs -- it isn't a successor until booting is native. Virtualization is inherently unable to deliver. So MDDs for OS 9, G5s for OS X, and Talos for a future where ownership and freedom still exist. QEMU for me only serves to boot systems I am not passionate about to care enough for proper performance (which is not equal to "speed").

    ReplyDelete
    Replies
    1. I've got a Sonnet Encore/MDX in my G4 MDD, so I understand the sentiment, but I get enough performance out of QEMU and with file sharing enough convenience that I still use it a fair bit for OS 9.

      Delete

Post a Comment

Comments are subject to moderation. Be nice.