It took a little while but I'm now typing on my second Raptor Talos II workstation, effectively upgrading two years in from a 32GB RAM dual quad-core POWER9 DD2.2 to a 64GB RAM dual octo-core DD2.3. It's rather notable to think about
how far we've come with the platform. A number of you have asked about how this changed things in practise, so let's do yet another semi-review.
Again, I say "semi-review" because if I were going to do this right, I'd have set up both the dual-4 and the dual-8 identically, had them do the same tasks and gone back if the results were weird. However, when you're buying a $7000+ workstation you economize where you can, which means I didn't buy any new NVMe cards, bought additional rather than spare RAM, and didn't buy another GPU; the plan was always to consolidate those into the new machine and keep the old chassis, board and CPUs/HSFs as spares. Plus, I moved over the case stickers and those totally change the entire performance characteristics of the system, you dig? We'll let the Phoronix guy(s) do that kind of exacting head-to-head because I pay for this out of pocket and we've all gotta tighten our belts in these days of plague. Here's the new beast sitting beside me under my work area:
(By the way, I'm still taking candidates for #ShowUsYourTalos. If you have pictures uploaded somewhere, I'll rebroadcast them here with your permission and your system's specs. Blackbirds, POWER8s and of course any other OpenPOWER systems welcome. Post in the comments.)
This new "consolidated" system has 64GB of RAM, Raptor's BTO option Radeon WX7100 workstation GPU and two NVMe main drives on the current Talos II 1.01 board, running Fedora 31 as before. In normal usage the dual-8 runs a bit hotter than the dual-4, but this is absolutely par for the course when you've just doubled the number of processors onboard. In my low-Earth-orbit Southern California office the dual-4's fastest fan rarely got above 2100rpm while the dual-8 occasionally spins up to 2300 or 2600rpm. Similarly, give or take system load, the infrared thermometer pegged the dual-4's "exhaust" at around 95 degrees Fahrenheit; the dual-8 puts out about 110 F. However, idle power usage is only about 20W more when sitting around in Firefox and gnome-terminal (130W vs 110W), and the idle fan speeds are about the same such that overall the dual-8 isn't appreciably louder than the very quiet dual-4 was with the most current firmware (with the standard Supermicro fan assemblies, though I replaced the dual-4's PSUs with "super-quiets" a while back and those are in the dual-8 now too).
Naturally the CPUs are the most notable change. Recall that the Sforza "scale out" POWER9 CPUs in Raptor family workstations are SMT-4, i.e., each core offers four hardware threads, which appear as discrete CPUs to the operating system. My dual-4 appeared to be a 32 CPU system to Fedora; this dual-8 appears to have 64. These threads come from "slices," and SMT-4 cores have four which are paired into two "super-slices." They look like this:
Each slice has a vector-scalar unit and an address generator feeding a load-store unit. The VSU has 64-bit integer, floating point and vector ALU components; two slices are needed to get the full 128-bit width of a VMX vector, hence their pairing as super-slices. The super-slices do not have L1 cache of their own, nor do they handle branch instructions or branch prediction; all of that is per-core, which also does instruction fetch and dispatch to the slices. (This has some similar strengths and pitfalls to AMD Ryzen, for example, which also has a single branch unit and caches per core, but the Ryzen execution units are not organized in the same fashion.) The upshot of all this is that certain parallel jobs, especially those that may be competing for scarcer per-core resources like L1 cache or the branch unit, may benefit more from full cores than from threads and this is true of pretty much any SMT implementation. In a like fashion, since each POWER9 slice is not a full vector unit (only the super-slices are), heavy use of VMX would soak up to twice the execution resources though amortized over the greater efficiency vector code would offer over scalar.
The biggest task I routinely do on my T2 are frequent smoke-test builds of Firefox to make sure OpenPOWER-specific bugs are found before they get to release. This was, in fact, where I hoped I would see the most improvement. Indeed, a fair bit of it can be run parallel, so if any of my typical workloads would show benefit, I felt it would likely be this one. Before I tore down the dual-4 I put all 64GB of RAM in it for a final time run to eliminate memory pressure as a variable (and the same sticks are in the dual-8, so it's exactly the same RAM in exactly the same slot layout). These snapshots were done building the Firefox 75 source code from mozilla-release (current as of this writing) with my standard optimized .mozconfig, varying only in the number of jobs specified. I'm only reporting wall time here because frankly that's the only thing I personally cared about. All build runs were done at the text console before booting X and GNOME to further eliminate variability, and I did three runs of each configuration back to back (./mach clobber && ./mach build) to account for any caching that might have occurred. Power was measured at the UPS. Default Spectre and Meltdown mitigations were in effect.
Dual-4 (-j24)
32:22.65
31:19.17
30:49.66
average draw 170W
Dual-8 (-j48)
19:16.28
19:09.18
19:08.32
average draw 230W
Dual-8 (-j24)
19:16.46
19:13.78
19:10.14
average draw 230W
The dual-8 is approximately 40% faster than the dual-4 on this task (or, said another way, the dual-4 was about 1.6x slower), but doubling the number of make processes from my prior configuration didn't seem to yield any improvement despite being well within the 64 threads available. This surprised me, so given that the dual-8 has 16 cores, I tried 16 processes directly:
Dual-8 (-j16)
21:49.72
21:40.18
21:41.33
average draw 215W
This proves, at least for this workload, that SMT does make some difference, just not as much as I would have thought. It also argues that the sweet spot for the dual-4 might have been around -j12, but I'm not willing to tear this box back down to try it. Still, cutting down my build times by over 10 minutes is nothing to sneeze at.
For other kinds of uses, though, I didn't see a lot different in terms of performance between DD2.2 and DD2.3 and to be honest you wouldn't expect to. DD2.3 does have improved Spectre mitigations and this would help the kind of branch-heavy code that would benefit least from additional slices, but the change is relatively minor and the difference in practise indeed seemed to be minimal. On my JIT-accelerated DOSBox build the benchmarks came in nearly exactly the same, as did QEMU running Mac OS 9. Booted into GNOME as I am right now, the extra CPU resources certainly do smooth out doing more things at once, but again, that's of course more a factor of the number of cores and slices than the processor stepping.
Overall I'm pretty pleased with the upgrade, and it's a nice, welcome boost that improves my efficiency further. Here are my present observations if you're thinking about a CPU upgrade too (or are a first time buyer considering how much you should get):
- Upgrading is pretty easy with these machines: if you bought too little today, you can always drop in a beefier CPU or two tomorrow (assuming you have the dough and the sockets), and the Self-Boot Engine code is generic such that any Sforza POWER9 chip will work on any Raptor system that can accommodate it. I have been repeatedly assured no FPGA update is needed to use DD2.3 chips, even for "O.G." T2 systems. However, if you're running a Blackbird, you should think about case and cooling as well because the 8-core will run noticeably hotter than a 4-core. A lot of the more attractive slimline mATX cases are very thermally constrained, and the 8-core CPU really should be paired with the (3U) high speed fan assembly than the (2U) passive heatsink. This is a big reason why my personal HTPC Blackbird is still a little 4-core system.
- The 4 and 8-core chips are familiar to most OpenPOWER denizens but the 18 and 22-core monsters have a complicated value proposition. People have certainly run them in Blackbirds; my personal thought is this is courting an early system demise, though I am impressed by how heroic some of those efforts are. I wouldn't myself, however: between the thermal and power demands you're gonna need a bigger boat for these sharks.
The T2 is designed for them and will work fine, but after my experience here one wonders how loud and hot they would get in warmer environments. Plus, you really need to fill both of those sockets or you'll lose three slots (those serviced by the second CPU's PCIe lanes), which would make them even louder and hotter. The dual-8 setup here gets you 16 cores and all of the slots turned on, so I think it's the better workstation configuration even though it costs a little more than a single-18 and isn't nearly as performant. The dual-18 and dual-22 configurations are really meant for big servers and crazy people.
With the T2 Lite, though, these CPUs make absolute sense and it would be almost a waste to run one with anything less. The T2 Lite is just a cut-down single-socket T2 board in the same form factor, so it will also easily accommodate any CPU but more cheaply. If you need the massive thread resources of a single-18 (72 thread) or single-22 (88 thread) workstation, and you can make do with an x16 and an x8 slot, it's really the overall best option for those configurations and it's not that much more than a Blackbird board. Plus, being a single CPU configuration it's probably a lot more liveable under one's desk.
- Simply buying a DD2.3 processor to replace a DD2.2 processor of the same core count probably doesn't pay off for most typical tasks. Unless you need the features (and there are some: besides the Spectre mitigations, it also has Ultravisor support and proper hardware watchpoints), you'll just end up spending additional money for little or no observable benefit. However, if you're going to buy more cores at the same time, then you might as well get a DD2.3 chip and have those extra features just in case you'll need them later. The price difference is almost certainly worth a little futureproofing.