Where did the 64K page size come from?
And there are lots of things that glitch, or have glitched, when userspace makes assumptions about this. Besides Hangover, Firefox used to barf on 64K pages (on aarch64 too), and had an issue where binaries built on one page size wouldn't work on systems with a different one. (This also bit numpy.) Go and its runtime used to throw fatal errors as well. The famous nouveau driver for Nvidia GPUs assumes a 4K page size, and the compute-only binary driver that does exist (at least for POWER8) cheats by making 64K pages out of 16 copies of the "actual" 4K page. btrfs uses a filesystem block size that mirrors the page size of the host it was created on, which means a btrfs filesystem made on a 4K page size system won't be readable on a 64K page system and vice versa (this is being fixed, but hasn't been yet).
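Incidentally, if you're wondering what your own system uses, there's no need to guess: the page size is a one-liner to query at runtime, which is also how portable software should obtain it rather than hardcoding 4096. A quick sketch using glibc's getconf:

```sh
# Report the kernel's page size in bytes: 4096 on a 4K kernel, 65536 on a
# 64K kernel. Either spelling of the variable works with glibc's getconf.
getconf PAGESIZE
getconf PAGE_SIZE
```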
With all these problems, why have a 64K page size at all, let alone default to it? There must be some reason to use it, because ppc64(le) isn't even unique in this regard; many of those bugs were related to aarch64, which also has a 64K page option. As you might guess, it's all about performance. When a virtual memory page has to be attached to a process or mapped into its address space, a page fault is triggered and has to be handled by the operating system. Sometimes this is minor (the page is already in memory and just has to be added to the process), sometimes it is major (the page is on disk, or swapped out), but either way a page fault has a cost. 64-bit systems naturally came about because of the need for larger memory address spaces, which benefits big applications like databases and high-performance computing generally, and these were the tasks that early 64-bit systems were largely used for. As memory sizes increase, subdividing memory into proportionally larger pieces thus becomes more efficient: the less an application faults, the more time it spends in its own code and the less in the operating system's.
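If you want to see how often your own workloads fault, the counters are easy to inspect with stock Linux tooling; a rough sketch (the build command here is just a stand-in for whatever you actually run):

```sh
# Count minor and major page faults for a command via perf's software events:
perf stat -e page-faults,minor-faults,major-faults -- make -j8
# Or check the running totals for an existing process (here, the current shell):
ps -o min_flt,maj_flt -p $$
```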
A second performance improvement afforded by larger pages is higher efficiency from the translation lookaside buffer, or TLB. The TLB is essentially a mapping cache that allows a CPU to quickly get the physical memory page for a given virtual memory address. When the virtual memory address cannot be found in the TLB, the processor has to walk the entire page table to find the address (and fill it into the TLB for later), assuming it exists at all. This can be a relatively expensive process if there are many entries to go through, and even worse if the page tables are nested in a virtualized setup. A larger page size not only allows more memory to be handled with a smaller page table, making table walks quicker, but also yields more hits for a TLB of the same size. It is fair to point out that differences in MMU performance between processor architectures can magnify the need for this: performance, after all, was the reason why POWER9 moved to a radix-based MMU instead of the less cache-friendly hashed page table scheme of earlier Power generations, and x86_64 has a radix tree per process while Power ISA's page table is global. (As an aside, some systems optionally or even exclusively have software-managed TLBs, where the operating system manages the TLB for the CPU and walks the page tables itself. Power ISA isn't one of them, but those architectures in particular would obviously benefit from a smaller page table.)
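To put rough numbers on that last point: the "reach" of a TLB is simply its entry count times the page size, so for a hypothetical 1,024-entry TLB (real sizes vary by core and cache level) the difference is stark:

```sh
# TLB reach = entries × page size, assuming a hypothetical 1024-entry TLB:
echo "4K pages:  $(( 1024 * 4096  / 1024 / 1024 )) MiB covered without a miss"   # 4 MiB
echo "64K pages: $(( 1024 * 65536 / 1024 / 1024 )) MiB covered without a miss"   # 64 MiB
```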
64K page sizes, compatibility issues notwithstanding, naturally have a downside. The most important objection relates to memory fragmentation: many memory allocators have page alignment constraints for convenience, which could waste up to the remaining 60K if the memory actually in use would fit entirely within a 4K page instead. On bigger systems with large amounts of memory running tasks that allocate large memory blocks, this excess might be relatively low, but it can add up on a workstation-class system with less RAM running a mix of client applications making smaller allocations. In a somewhat infamous rebuttal, Linus Torvalds commented, "These absolute -idiots- talk about how they win 5% on some (important, for them) benchmark by doing large pages, but then ignore the fact that on other real-world loads they lose by sevaral HUNDRED percent because of the memory fragmentation costs [sic]." Putting Linus' opinion into more anodyne terms: if the architecture bears a relatively modest page fault penalty, then the performance improvements of a larger page size may not be worth the memory it can waste. This is probably why AIX, presently specific to ppc64, offers both 4K and 64K pages (and even larger 16MB and 16GB pages) and decides which to offer to a given process.
The 4K vs. 64K gulf is not unlike the endian debate. I like big endian and I cannot lie, but going little endian gave Power ISA a larger working software library by aligning with what those packages already assumed; going 4K is a similar situation. But while the performance difference between endiannesses has arguably never been significant, there really are performance reasons for a 64K page size and those reasons get more important as RAM and application size both increase. On my 16GB 4-core Blackbird, the same memory size as my 2005 Power Mac Quad G5, a 4K page size makes a lot more sense than a 64K one because I'm not running anything massive. In that sense the only reason I'm still running Fedora on it is to serve as an early warning indicator. But on my 64GB dual-8 Talos II, where I do run larger applications, build kernels and Firefoxen and run VMs, the performance implications of the larger page size under Fedora may well become relevant for those workloads.
For servers and HPCers big pages can have big benefits, but for those of us using these machines as workstations, I think we need to consider whether the performance improvement outweighs the inconvenience. And while Fedora has generally served me well, lacking a 4K page option on ppc64le certainly hurts the value proposition for Fedora Workstation on OpenPOWER, since there are likely to be other useful applications that make these assumptions. More to the point, I don't see Red Hat-IBM doubling their maintenance burden to issue a 4K page version, and maintaining a downstream distro is typically an incredibly thankless task. While I've picked on Fedora a bit here, you can throw Debian and others into that mix as well for some of the same reasons. Until other operating systems adopt a hybrid approach like AIX's, the quibble over page size is probably the next major schism we'll have to deal with, because in my humble opinion OpenPOWER should not be limited to the server room, where big pages are king.
For what it's worth, maintaining your own kernel on Fedora isn't super hard. I don't run Fedora on POWER, but I still use it on an x86 Dell laptop; I build my own kernel there because when I got the laptop a couple years ago, I went on a big adventure patching out the disablement of hibernation (an important feature for laptops) from the "lockdown mode" that's triggered when Secure Boot is on, and then had to figure out how to generate keys to sign my own builds with. Along the way I've thrown in a number of other changes, but those are all much more complicated than what's needed here.
In short (a rough command sketch follows the list):
- Use `fedpkg` to clone Fedora's RPM source repo for the kernel
- Switch to the branch for your Fedora release
- Add the config changes to the "kernel-local" file
- Run `fedpkg local` to build the kernel
- `dnf install` the resulting packages (or ideally use `createrepo` and add a local repo)
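Hedging on the details, since the branch, config contents, and output paths depend on your release and architecture, the whole thing looks roughly like this (the f33 branch is just an example):

```sh
# Illustrative sketch only; substitute your own Fedora branch and arch.
fedpkg clone -a kernel && cd kernel
fedpkg switch-branch f33
# Local config overrides go into the kernel-local file:
cat >> kernel-local << 'EOF'
CONFIG_PPC_4K_PAGES=y
# CONFIG_PPC_64K_PAGES is not set
EOF
fedpkg local                              # build the RPMs locally
sudo dnf install ./ppc64le/kernel-*.rpm   # or publish them with createrepo
```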
You'll want to either rename the package in the spec or prevent it from being clobbered by the upstream version (which is a bit fiddly in dnf unfortunately), but that's the gist of it. I can write up a more detailed procedure if you or others are interested.
It's not so much that it's hard as that it's inconvenient, and as you mention, dnf isn't too friendly about stopping it from getting clobbered. But I'm sure people will appreciate if you want to put up a step-by-step of the entire process.
Realistically it would be nice if this were a command-line option rather than a compile-time one.
Yeah for sure, it's a pretty awkward situation to have to recompile for a page size change. Legacy decisions catch up with us...
I can try doing a writeup in a little while, and spin up a Fedora VM to double-check my work.
Copr can be used to host other kernel variants for Fedora. I have used it for building Talos/ppc64le kernels with additional patches in the early days.
VoidLinux uses 4k, IIRC.
Yup! https://docs.voidlinux-ppc.org/faq/index.html#why-use-4-kib-page-kernels-instead-of-64-kib-like-other-distros
DeletePowerPC ISA uses also software table walks. The Motorola/Freescale/NXP e200 cores do so for example.
AArch64 allows combining small pages into larger ones for TLB-optimized usage.
There's always an exception. However, I don't consider the Book E chips particularly representative of what we would be doing.
Wow, thank you for this detailed explanation; it perfectly answers my question about page sizes in the 'Hangover' blog post!
Based on the information provided in this blog post, I decided to switch to a 4K page size kernel on my Debian Buster system (T2).
Recompiling the kernel on Debian was pretty easy; I used this guide: https://wiki.debian.org/BuildADebianKernelPackage
Even though the guide itself says the information is outdated, it still works fine! Make sure that you empty the CONFIG_SYSTEM_TRUSTED_KEYS string, otherwise the build will fail (and the error will be buried in output from other threads if, like me, you compile with 24 threads). And of course don't forget to uncheck CONFIG_PPC_64K_PAGES and enable CONFIG_PPC_4K_PAGES.
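For anyone who prefers to script it, the same config changes can be made non-interactively from the kernel source tree before building as the guide describes; a rough sketch (the thread count is just what I happened to use):

```sh
# Start from the running kernel's config, apply the tweaks described above,
# then build installable .deb packages:
cp /boot/config-"$(uname -r)" .config
scripts/config --set-str SYSTEM_TRUSTED_KEYS ""
scripts/config --disable PPC_64K_PAGES
scripts/config --enable PPC_4K_PAGES
make olddefconfig            # let kconfig resolve any dependent options
make -j24 bindeb-pkg         # produces the kernel .deb packages
```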
Very informative blog post, thank you. I've been working with Gentoo on a 2006 Power Mac G5 Quad, and it's interesting to see which ppc64 advice does and doesn't apply.
I thought I might require 64K pages to run KVM-PR guests with 64K pages, which doesn't seem to be true for a PPC970 guest, and I can't emulate a newer POWER CPU than what I have, which only has 4K native pages. I haven't seen any compatibility problems with 64K pages so far, at least with the Radeon driver.
What I have seen is a slight speed improvement: system CPU time was reduced from 8% to 5% or less for my compile jobs. I blame the overhead of updating those hashed page tables. I think there's some efficiency gained by paging in and out 64K at a time vs. 4K, even though the native hardware pages are still 4K.
As a side benefit, 64K pages use 210 MiB less RAM for memmap pages on a 16GB system, if I did the math right from the dmesg output. So there are hidden benefits to 64K pages even on such an old PowerPC that has to emulate them.
Nearly two years later, is this still an issue?