It's Talos all the way down


Still can't bear the sticker shock of your very own Talos II, or even an itty bitty Blackbird? Why not do what we all do for the machines we can't own and emulate one instead? (And then decide you like it a lot, and save your pennies?)

QEMU 5.0.0 offers a machine model for the bare-metal PowerNV profile, to which the Raptor systems and other OpenPOWER POWER8 and POWER9 designs intended for Linux (i.e., not PowerVM machines) belong. Using the Talos II firmware image (mostly: one snag to be mentioned), you can boot the machine in QEMU and from there bring up an operating system in emulation. In this article we'll prove it works by bringing up Void Linux for Power (hi, Daniel!) in a variety of configurations. A set-up like this might be enough to test that your software or open-source package builds and runs on OpenPOWER, even if you don't own one yet. In a future article we'll talk about how you can boot your own code on the metal so you can port your favourite OS or build a unikernel.

(For the purposes of this article I'll assume an audience that isn't as familiar with OpenPOWER terminology as our usual readership. Kindly humour me.)

The emulation is imperfect, whether you're emulating it on a real Raptor-family system or on an icky PC. While QEMU can emulate an AST2500 (i.e., the ARM-based Baseboard Management Controller, which acts as the service processor and provides the video framebuffer), and QEMU can also emulate a PowerNV system, it doesn't do both at the same time. That means the very lowest levels are actually being simulated here -- you can't watch Raptor's pretty Hostboot display, for example, and only the barest functions of the BMC are simulated enough to allow bring-up, not including the framebuffer. In fact, the hardware profiles we will use here do not in general match a real Raptor system either: we're just virtually plugging in PCI devices that give us necessary functionality, though of course none of the peripheral devices in a Raptor system is Raptor-proprietary. Finally, although I have tagged this entry with KVM, KVM currently doesn't work right with the QEMU PowerNV machine model, even though I'm pretty sure it should be technically possible. Sadly, I tried in vain to make it work: I could never get KVM-HV to be happy, and ended up kernel panicking the machine with KVM-PR. See if you can triumph where I have failed. In the meantime, naturally you can do everything here on a T2 or Blackbird as well, because that's how I did it writing this article, but there is no special acceleration for those systems right now.

The first order of business is the first order of business with any emulator: get the ROMs. Fortunately, no one is going to bust you for pirating a set of these because we're an open platform, remember?

The two pieces required are Skiboot and Petitboot, both of which live in the system's PNOR flash. Skiboot contains OPAL, the OpenPOWER Abstraction Layer. It comes in after the BMC has turned on main power and started the Power CPUs' self-boot engines, which then IPL ("initial program load") Hostboot for the second-stage power-on sequence. When Hostboot completes, it chains into Skiboot, which initializes the PCIe host bus controllers (PHBs) and provides all the basic hardware calls needed by a guest kernel to support the platform. You can think of it as something like an overgrown BIOS. This is the lowest firmware level of an OpenPOWER system that QEMU currently supports emulating.

Skiboot lives only to service a kernel, so it immediately starts one. This initial payload is the bootloader for Petitboot, which is also stored in firmware. Petitboot has a small Linux root (Skiroot) and acts as a boot menu, finding bootable volumes on attached devices or over the network. Once it finds one (or you select one), it chains into it to start the main OS, and from then on Skiboot will provide platform services via OPAL for this final guest until the system is shut down or restarted. Because it's in firmware, Petitboot is always available, which can come in really handy when you're trying to do system recovery.

The first, best and most dedicated way is to build Skiboot and Petitboot yourself. They are open-source and the process is relatively well documented and automated, and you should know how to do this if you own an OpenPOWER machine anyhow. If you aren't doing this on a real OpenPOWER machine you'll need a cross-compiler, but most Linux distros offer such a package nowadays. Do keep in mind that if it looks like you're building a tiny Linux distro, well, that's because that's exactly what you're doing. The advantage here is you can fool around with the firmware at your leisure, but it requires a bit of an investment in disk space and time.

The second way assumes you have a more casual interest and would prefer to go with something prefab. If you (or, you know, your "friend") have a Raptor-family system, it's possible to extract the necessary components right from the BMC prompt. Log into the BMC over SSH (or via direct serial connection) and type pflash -i. You'll see a list of all the partitions stored in the PNOR flash. The ones we want are PAYLOAD (which contains Skiboot) and BOOTKERNEL (which contains Skiroot and Petitboot). The exact addresses may vary from system to system and firmware to firmware.

root@bmc:~# pflash -P PAYLOAD -r /tmp/pnor.PAYLOAD --skip=4096
Reading to "/tmp/pnor.PAYLOAD" from 0x021a1000..0x022a1000 !
[==================================================] 100%
root@bmc:~# pflash -P BOOTKERNEL -r /tmp/pnor.BOOTKERNEL --skip=4096
Reading to "/tmp/pnor.BOOTKERNEL" from 0x022a1000..0x03821000 !
[==================================================] 100%

We skip the first 4K page to avoid the wrapping around each partition. pnor.PAYLOAD is actually compressed and needs to be uncompressed prior to use, so:

root@bmc:~# cd /tmp
root@bmc:/tmp# xz -d < pnor.PAYLOAD > skiboot.lid

Finally, scp both skiboot.lid and pnor.BOOTKERNEL to your desired system from the BMC.

Admittedly we just talked at length about the two ways most of you won't get the firmware, so let's talk about the third method and the way most of you will, i.e., you'll just download it. Currently there is an irregularity about Raptor's present Skiboot build for this purpose: it only boots if you are emulating a single POWER8. That's not a typo. If you use it to boot an emulated POWER9, the guest will simply panic, and the guest will go into a bootloop if you are emulating multiple POWER8 CPUs (necessary if you need a larger number of PCIe devices). This is undoubtedly a QEMU deficiency which will be corrected in future releases. In the meantime, if you just care about playing around using a single POWER8 on a terminal, then Raptor's builds (either from BMC flash or downloaded) will suffice. However, if you intend to emulate a POWER9 or SMP POWER8 system, download QEMU's own pre-built skiboot.lid and use that instead.

For Petitboot, we will extract that directly from Raptor's PNOR images. Assuming you didn't get it using the process above, download the current Talos II PNOR image and decompress it. In the shell_upgrade directory you will see the bzip2-compressed PNOR image. Uncompress that, leaving you with a filename like talos-ii-v2.00.pnor. Download my pnorex extractor tool (it's in Perl, because I'm one of those people) and run it on the PNOR image:

% pnorex talos-ii-v2.00.pnor
Version 1 PNOR archive with 33 entries.
Extracting PAYLOAD at offset 8601.
This is a xz format image.
Wrote 1020K successfully.
Extracting BOOTKERNEL at offset 8857.
This is an ELF executable image.
Wrote 22012K successfully.
Extracted 2 partitions successfully.

If you will be using Raptor's Skiboot, then uncompress pnor.PAYLOAD to skiboot.lid as above: xz -d < pnor.PAYLOAD > skiboot.lid
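Since pnorex reports one partition as xz and the other as ELF, a quick look at the magic bytes tells you whether you remembered to decompress the payload before handing it to QEMU. This is only a sketch: the image_kind helper is my own invention, not part of pnorex or pflash, and it assumes a host od that accepts -An -tx1 -N.

```shell
# Hypothetical helper: report a firmware blob's container type from its magic bytes.
image_kind() {
    # Read the first six bytes as one contiguous lowercase hex string.
    magic=$(od -An -tx1 -N6 "$1" | tr -d ' \n')
    case "$magic" in
        fd377a585a00) echo xz ;;       # xz stream header: FD '7' 'z' 'X' 'Z' 00
        7f454c46*)    echo elf ;;      # ELF header: 7F 'E' 'L' 'F'
        *)            echo unknown ;;
    esac
}
```

Run against the files from this article, image_kind pnor.PAYLOAD should say xz and image_kind pnor.BOOTKERNEL should say elf. If image_kind skiboot.lid still says xz, the decompression step was skipped.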

Now, with skiboot.lid (for this first example, either Raptor's or QEMU's) and pnor.BOOTKERNEL in the same folder, grab an ISO you want to boot. I used the prefab one Daniel offers on the Void Linux for Power site since I know it boots fine on OpenPOWER hardware. For our first example, let's keep it simple and boot Void from a CD image on a POWER8 using the serial port. Our QEMU command line:

qemu-system-ppc64 -M powernv8 -m 4G -cpu power8 \
-nographic \
-bios ./skiboot.lid \
-kernel ./pnor.BOOTKERNEL \
-device ich9-ahci,id=ahci0 \
-drive id=cd0,media=cdrom,file=void-live-ppc64le-musl-20200411.iso,if=none \
-device ide-cd,bus=ahci0.0,drive=cd0

This configures a single-processor POWER8 system with 4GB of RAM, no graphics, and an Intel AHCI host controller with a single CD-ROM drive attached. The serial output should go to your terminal. It goes a little like this:

Here we are with Skiboot chaining into Petitboot. You can ignore the errors; there will be a lot of them since the platform is still incomplete. It will take a little bit of time to decompress the kernel (much slower than it would be on a regular system). You will notice a single device attached to the three available PCIe host bridges on the single POWER8 CPU, i.e., the host controller itself. Don't you just love that the vendor code for Intel is 8086?

This is Petitboot. When the bootable choices appear, cursor up to the starred option and press E before it autoboots, because we need to tell Void its console is the on-board serial port (otherwise it uses a VGA console: not sure whose bug that is).

Add console=hvc0 at the end, cursor down to OK and hit RETURN/ENTER a couple times to boot.

A successful login on your emulated baby POWER8. Ta-daa! To rudely pull the plug on the QEMU session, press Ctrl-A, and then X (QEMU: Terminated).

Let's now load out the POWER8. We would like to add a video card, an Ethernet card and a USB controller to our existing system, but POWER8 Turismo chips only offer enough PHBs for three PCI endpoints. How do we solve this problem? Easy: we'll add another processor!

At this point you will require the QEMU Skiboot and should use that where skiboot.lid appears in the remainder of this article. I use tun/tap networking in this example, which assumes you already have tap0 configured and up; change the -netdev setting if you want to use a different means of bridging the NIC. This example keeps the AHCI host controller and still displays debug output on the terminal, but uses the QEMU emulated VGA as a console instead and adds a good old Realtek 8139 NIC with a USB mouse and keyboard attached to a QEMU XHCI USB 3.0 controller.

qemu-system-ppc64 -M powernv8 -cpu power8 -m 4G -smp 2 \
-serial mon:stdio \
-device VGA \
-device ich9-ahci,id=ahci0,bus=pcie.0 \
-netdev tap,id=nic0,ifname=tap0,script=no,downscript=no \
-device rtl8139,netdev=nic0,bus=pcie.1 \
-device qemu-xhci,id=usb0,bus=pcie.2 \
-device usb-mouse \
-device usb-kbd \
-bios ./skiboot.lid \
-kernel ./pnor.BOOTKERNEL \
-drive id=cd0,media=cdrom,file=void-live-ppc64le-musl-20200411.iso,if=none \
-device ide-cd,bus=ahci0.0,drive=cd0
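The command above takes tap0 for granted. If you don't have one yet, something like the following iproute2 sketch creates it. The run and setup_tap helpers and the DRY_RUN variable are names I made up for illustration, and this only creates the interface and brings it up; attaching it to a bridge or otherwise routing it depends on your network and isn't covered here.

```shell
# Print each command when DRY_RUN is set; otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a tap interface owned by the given user and bring it up.
# Outside dry-run mode this needs root (or CAP_NET_ADMIN).
setup_tap() {
    tap=${1:-tap0}
    owner=${2:-$(id -un)}
    run ip tuntap add dev "$tap" mode tap user "$owner"
    run ip link set "$tap" up
}
```

Try DRY_RUN=1 setup_tap tap0 first to see what it would do, then run it for real as root.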

Let's spin this sucker like Superman's cape in a dryer:

I keep the serial output because the extra CPU adds around an extra minute on this T2 to get to Petitboot. Here, you will notice we now have six PHBs available, three per CPU, so now we have enough virtual PCI slots for the peripherals we require.
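The socket count is just a round-up division, assuming (as in these examples) that each PCI device gets a PHB to itself. A minimal sketch, with sockets_needed being a name of my own choosing:

```shell
# Round up: how many chips are needed so every PCI device has its own PHB.
sockets_needed() {
    devices=$1
    phbs_per_chip=$2
    echo $(( (devices + phbs_per_chip - 1) / phbs_per_chip ))
}

# VGA + AHCI + NIC + XHCI on a POWER8 with three PHBs per chip
sockets_needed 4 3
```

That prints 2, which is why the command line above says -smp 2. The second argument lets you redo the sum for a chip with a different number of PHBs.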

Petitboot shows up on both the 2D framebuffer and the serial terminal, and both work. You'll also see it probing the bridged Ethernet tap to see if it can boot that way, proving our Ethernet device is up and working. Whichever you use is where boot messages will go, so we'll use the framebuffer as console and start Void by cursoring up and selecting the starred option (thus also proving our USB devices work).

Having booted Void, we can now demonstrate the PCI cards in the system, the attached peripherals and the number of CPUs. For the record, the DD2.3 POWER9 I'm typing this on shows its Spectre v2 status as "mitigated" with hardware acceleration.

Starting the Installer, which won't install anything because we haven't configured any storage to install to in our QEMU options. I'll leave that as an exercise to the reader.
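As a head start on that exercise, here is roughly what I'd expect the extra options to look like: a second drive hung off the next port of the same AHCI controller. Treat it as an untested sketch rather than something verified on the PowerNV machine model. The disk_args helper and the void.qcow2 file name are hypothetical, and the image has to exist before QEMU starts.

```shell
# Emit the extra QEMU options for a hard disk on a given AHCI port.
# The image is expected to exist already, e.g. created with:
#   qemu-img create -f qcow2 void.qcow2 16G
disk_args() {
    image=$1
    port=${2:-1}   # port 0 already holds the CD-ROM in the examples
    printf '%s\n' \
        "-drive id=disk0,file=$image,format=qcow2,if=none" \
        "-device ide-hd,bus=ahci0.$port,drive=disk0"
}
```

Append the two lines printed by disk_args void.qcow2 to any of the command lines in this article and the installer should have somewhere to install to.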

If we switch to an emulated POWER9 system, Sforza CPUs support six PCI endpoints, so we get six PHBs. This means a single CPU is more than enough for our basic configuration without adding additional startup time. The QEMU command line to do so merely returns to single processor and changes the machine to powernv9 and the CPU to power9, i.e.,

qemu-system-ppc64 -M powernv9 -cpu power9 -m 4G \
-serial mon:stdio \
-device VGA \
-device ich9-ahci,id=ahci0,bus=pcie.0 \
-netdev tap,id=nic0,ifname=tap0,script=no,downscript=no \
-device rtl8139,netdev=nic0,bus=pcie.1 \
-device qemu-xhci,id=usb0,bus=pcie.2 \
-device usb-mouse \
-device usb-kbd \
-bios ./skiboot.lid \
-kernel ./pnor.BOOTKERNEL \
-drive id=cd0,media=cdrom,file=void-live-ppc64le-musl-20200411.iso,if=none \
-device ide-cd,bus=ahci0.0,drive=cd0

and it runs in the same way, but faster, because the emulation overhead is less. So let's totally do something stupid as our last parlour trick and run a POWER9 configuration with as many sockets as QEMU will let us hold (which right now is four). Note that these are all single-threaded cores, so this is still much less powerful than even a 4-core basic Blackbird.

qemu-system-ppc64 -M powernv9 -cpu power9 -m 4G -smp 4 \
-serial mon:stdio \
-device VGA \
-device ich9-ahci,id=ahci0,bus=pcie.0 \
-netdev tap,id=nic0,ifname=tap0,script=no,downscript=no \
-device rtl8139,netdev=nic0,bus=pcie.1 \
-device qemu-xhci,id=usb0,bus=pcie.2 \
-device usb-mouse \
-device usb-kbd \
-bios ./skiboot.lid \
-kernel ./pnor.BOOTKERNEL \
-drive id=cd0,media=cdrom,file=void-live-ppc64le-musl-20200411.iso,if=none \
-device ide-cd,bus=ahci0.0,drive=cd0

With four emulated CPUs it took over seven minutes to reach Petitboot on this dual-8 Talos II, so have patience if you're on a lesser workstation, but it does work:

You can see the watchdog complaining about the length of time OPAL calls are taking now (call 128 resets the XIVE VM interrupt controller on POWER9 chips). But we do have our four cores, and it's not impossibly slow on a beefy enough system (like another POWER9).

Incidentally, while the Power ISA emulation in QEMU allows SMT, it's very basic and not enough to get through the boot-up sequence, or at least not before the heat death of the universe. If you like listening to your cooling fans, see what happens when you try to emulate the biggest baddest dual-22 Talos II by adding -accel tcg,thread=multi -smp 176,threads=4,cores=22,sockets=2 to your QEMU command line. It's not pretty. That's why you should buy an OpenPOWER machine of your own instead of emulating one.
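For the curious, the 176 in that -smp option is simply sockets times cores times threads. A tiny sketch that does the multiplication for you (smp_arg is my own name for it, not a QEMU tool):

```shell
# Build a -smp option whose CPU total matches the topology given.
smp_arg() {
    sockets=$1 cores=$2 threads=$3
    echo "-smp $(( sockets * cores * threads )),threads=$threads,cores=$cores,sockets=$sockets"
}

# Dual-socket, 22 cores each, SMT4
smp_arg 2 22 4
```

This prints -smp 176,threads=4,cores=22,sockets=2, matching the option above.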

Comments

  1. The vital info in sixth paragraph starting with "The two pieces required are" can only be learned via connecting dots between 6-10 other documents incl. IBM's, Raptor's and Barreleye's. Thank you very much for providing it in a nutshell!
