The first production RISC-V workstation?
Precious few details are available, such as loadout, options, availability and most of all cost, but when has that stopped us from idly speculating before, eh? It is virtually certain the machine will be composed largely of off-the-shelf components other than the CPU, which is the real mystery of interest. The FU740 appears to be an evolution of the FU540, which is a 64-bit 1.5GHz+ part with four U54 "little" cores combined with one S1-series "big" core and 2MB of L2 cache on a 28nm process. Plainly, neither of these cores is even remotely in the ballpark with OpenPOWER: SiFive quotes CoreMark/MHz scores of 3.01 for both the U54 and S54, whereas the POWER9 easily achieves over 160. While the FU740 will almost certainly be faster due to its probable basis on the U74, it is difficult to imagine that the performance gulf will be narrowed significantly (the U74 edges up to around 5). You should not buy one and expect it to compare favourably with x86 or a Raptor system.
On the other hand, there's a good chance this will be another truly open system based on the fact that the Freedom E300 and U500 series are open source under the Apache license. While some parts of SiFive are proprietary, this line is not, and we presume that the U700 series will be likewise. RISC-V still lacks firm specs for vector and bit manipulation instructions, and this certainly hurts them for desktop and mobile applications, but this is a known deficiency and is being worked on. Assuming no shenanigans with the firmware, there's encouraging potential even in this early form.
I'm unambiguously on Team Power because of my long history with the architecture, but this blog is certainly interested in all kinds of free vendor-unencumbered computing, and this machine may well represent another such system. And it's newsworthy as the first RISC-V system that's at least workstation form factor even if its likely performance doesn't currently make it a credible daily driver. But maybe that's not the point: the point is to get developers on the architecture in a way that's bigger than an evaluation board (cf. Linus Torvalds and ARM), meaning it doesn't have to be their only daily driver; it just has to "be there" so people think about it. More on cost and specs and "how open is it" when we actually see it in October.
SiFive's CoreMark/MHz is probably reporting per-core performance. The linked POWER9 result is from a 128-thread parallel execution. I suspect that converting to per-core results is not as simple as 166.47/128. Additionally, CoreMark isn't a good evaluation of "uncore" and memory system performance.
I wasn't able to find a single-threaded result, so I ran CoreMark (https://github.com/eembc/coremark @ 41537ea30b0104438b4ff993e7d349af26900acf) on my Talos II with SMT disabled and got: 14279.59 / 3800 = 3.76 CoreMark/MHz (this assumes turbo remained active for the duration of the run)
Here's the full output:
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 14006
Total time (secs): 14.006000
Iterations/Sec : 14279.594460
Iterations : 200000
Compiler version : GCC9.2.0
Compiler flags : -O2 -DPERFORMANCE_RUN=1 -lrt
Memory location : Please put data memory location here
(e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x4983
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 14279.594460 / GCC9.2.0 -O2 -DPERFORMANCE_RUN=1 -lrt / Heap
For comparison I repeated the same test on my G5 (2.3GHz 970MP) and P5020DS (2.0GHz E5500). Both systems are in 64-bit mode. Note that the P5020DS is basically identical to the Amiga X5000.
970MP @ 2.3GHz (GCC4.1.2):
4033.342296 / 2300 = 1.75 CoreMark/MHz
E5500 @ 2.0GHz (GCC4.9.2):
5370.569280 / 2000 = 2.68 CoreMark/MHz
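The per-MHz figures in this thread are plain division of the CoreMark score (iterations per second) by the clock in MHz. A minimal sketch of that arithmetic, using only the scores and clocks quoted in the comments above; the `RESULTS` table and the `coremark_per_mhz` helper are names made up for illustration, and the POWER9 entry assumes the core held 3800 MHz for the whole run:

```python
# CoreMark/MHz = CoreMark score (iterations per second) / clock in MHz.
# Scores and clocks are copied from the comments above; the names used
# here (RESULTS, coremark_per_mhz) are illustrative, not part of CoreMark.
RESULTS = {
    "POWER9 (Talos II, SMT off)": (14279.594460, 3800),  # assumes turbo held
    "970MP (G5)": (4033.342296, 2300),
    "E5500 (P5020DS)": (5370.569280, 2000),
}


def coremark_per_mhz(score, mhz):
    """Return the per-MHz figure for one single-threaded CoreMark run."""
    return score / mhz


for name, (score, mhz) in RESULTS.items():
    print(f"{name}: {coremark_per_mhz(score, mhz):.2f} CoreMark/MHz")
```

Rounded to two places this gives 3.76, 1.75 and 2.69; the 2.68 quoted for the E5500 is the same quotient (2.685...) truncated rather than rounded.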
Thanks for running those on your system as comparisons. The only reason I'm using CoreMark here is because that's the only benchmark SiFive seems to be consistently reporting.
CoreMark is a workload that, while more complex than Dhrystone, is not more complex by much; it is still mostly a test for MCUs and simple cores.
Using it to compare against an adult high-performance core like POWER9, the out-of-order Cortexes or modern x86 probably makes almost no sense.
You will get some "duh, it's faster" result, but the ratios will be wildly inaccurate.
And yes, some interpret "they give just CoreMark" as a sign that RISC-V vendors are not ready to compete on standard desktop/notebook/Android platform code. Perhaps when they get SIMD they will start giving better benchmarks? Or will it take more upgrades to the microarchitectures for that? No idea.
Joel S., thanks a lot for the E5500 and 970MP comparison. I hadn't seen E5500 numbers anywhere and was curious how the chip performs. Not bad at all IMHO! For your information, here are mine too: P8 @ 3.8GHz: 12478 and W-2265 @ >4.5-4.6GHz: 32865.
Since you were interested in E5500 results, I got out my T4240 system, which has 12 1.8GHz E6500 cores and which will be used in the upcoming PowerPC notebook (https://www.powerpc-notebook.org/en/).
E6500 @ 1.8GHz (GCC4.9.2):
4931.777084 / 1800 = 2.74 CoreMark/MHz
I find the E6500 quite fascinating since it is the last powerpc core Freescale (now NXP) will ever release. The E6500 is almost fully compatible with POWER8. Both the E5500 and E6500 can operate in true little endian mode with the limitation that E6500 Altivec is always in big endian mode.
The lineage of the E6500 is roughly as follows:
MPC601 -> MPC603 -> MPC750 (G3) -> MPC74xx (G4) -> E500 -> E500MC -> E5500 -> E6500.
In contrast to the G5, the E5500 and E6500 have massively improved memory systems (1, 2 or 3 channels of DDR3 with on-die memory controllers). The G5 had terrible memory latency compared to its contemporaries, which dragged its fancy ALUs down.
Wow! Even the e6500! Man, you made my day here. Indeed, the e6500 is kind of the swan song of PowerPC at NXP. What a pity; we can only hope that they will recover from their ARM dreams and really jump into RISC-V. Imagine an e6500-capable chip with a RISC-V front-end. :-)
Anyway, thanks a lot for the numbers. It looks like NXP really just worked on getting more core/thread throughput out of the e6500 in comparison with the e5500. Your comparison with the 970 is also very interesting. In its time it was a very capable chip. Indeed, if it is hurt by memory latency, that's another matter... The E5500 looks very good; I'm still playing with the idea of buying one...
An oldie but goodie here - Peg II G4 (Debian 7)
root@pegasos2:/tmp/coremark# lscpu
Architecture: ppc
Byte Order: Big Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
Model: Pegasos2
BogoMIPS: 66.66
L1d cache: 32K
L1i cache: 32K
root@pegasos2:/tmp/coremark# cat run1.log
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 12870
Total time (secs): 12.870000
Iterations/Sec : 3108.003108
Iterations : 40000
Compiler version : GCC4.6.3
Compiler flags : -O2 -DPERFORMANCE_RUN=1 -lrt
Memory location : Please put data memory location here
(e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x25b5
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 3108.003108 / GCC4.6.3 -O2 -DPERFORMANCE_RUN=1 -lrt / Heap
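For anyone collecting these logs, the headline score can be cross-checked from the other fields: in both logs posted in this thread, Iterations/Sec equals Iterations divided by Total time. A rough sketch that pulls the fields out of a log excerpt and recomputes the score; the `parse_fields` helper is hypothetical and the excerpt is copied from the Pegasos II run above:

```python
import re

# Excerpt of the Pegasos II log quoted above.
LOG = """\
Total ticks : 12870
Total time (secs): 12.870000
Iterations/Sec : 3108.003108
Iterations : 40000
"""


def parse_fields(text):
    """Split 'name : value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(.*?)\s*:\s*(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


fields = parse_fields(LOG)
iterations = int(fields["Iterations"])
seconds = float(fields["Total time (secs)"])
reported = float(fields["Iterations/Sec"])

# The score is just iterations divided by wall-clock seconds.
recomputed = iterations / seconds
print(f"reported {reported:.6f}, recomputed {recomputed:.6f}")
```

No per-MHz figure is derived here because the comment does not state the G4's clock speed.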
Has anyone tried nbench on POWER9 or e6500? I'd be curious to see those results as there is a long history on various platforms for comparison:
en.wikipedia.org/wiki/NBench
www.math.utah.edu/~mayer/linux/bmark.html
#POWER9@3800 - Talos II - DDR4@2666 - SMT=off
CFLAGS="-s -static -Wall -O3 -mcpu=power9 -mtune=power9 -mabi=altivec -maltivec -mvsx"
BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)
TEST : Iterations/sec. : Old Index : New Index
: : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT : 1440.3 : 36.94 : 12.13
STRING SORT : 527.85 : 235.86 : 36.51
BITFIELD : 2.6417e+08 : 45.31 : 9.47
FP EMULATION : 416.42 : 199.82 : 46.11
FOURIER : 1.0232e+05 : 116.36 : 65.36
ASSIGNMENT : 52.115 : 198.31 : 51.44
IDEA : 8705 : 133.14 : 39.53
HUFFMAN : 3567.4 : 98.92 : 31.59
NEURAL NET : 106.14 : 170.51 : 71.72
LU DECOMPOSITION : 2178.9 : 112.88 : 81.51
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 110.878
FLOATING-POINT INDEX: 130.830
Baseline (MSDOS*) : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU : 36 CPU
L2 Cache :
OS : Linux 5.4.25-gentoo-ppc64
C compiler : gcc version 9.2.0 (Gentoo 9.2.0-r2 p3)
libc : libc-2.29.so
MEMORY INDEX : 26.097
INTEGER INDEX : 28.909
FLOATING-POINT INDEX: 72.564
Baseline (LINUX) : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
#T4240 - E6500@1800MHz - DDR3@1866
CFLAGS="-s -Wall -O3 -mcpu=e6500 -mtune=e6500 -mabi=altivec -maltivec"
BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)
TEST : Iterations/sec. : Old Index : New Index
: : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT : 820.27 : 21.04 : 6.91
STRING SORT : 161.92 : 72.35 : 11.20
BITFIELD : 1.5984e+08 : 27.42 : 5.73
FP EMULATION : 252 : 120.92 : 27.90
FOURIER : 8531.2 : 9.70 : 5.45
ASSIGNMENT : 15.363 : 58.46 : 15.16
IDEA : 3497.3 : 53.49 : 15.88
HUFFMAN : 1674.2 : 46.43 : 14.82
NEURAL NET : 16.104 : 25.87 : 10.88
LU DECOMPOSITION : 453.38 : 23.49 : 16.96
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 49.542
FLOATING-POINT INDEX: 18.065
Baseline (MSDOS*) : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU : 24 CPU
L2 Cache :
OS : Linux 4.1.8-rt8+gbd51baf
C compiler : gcc version 4.9.2 (GCC)
libc :
MEMORY INDEX : 9.907
INTEGER INDEX : 14.596
FLOATING-POINT INDEX: 10.019
Baseline (LINUX) : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
#P5020 - E5500@2000MHz - DDR3@1333 - defaults
CFLAGS="-s -Wall -O3 -mcpu=e5500 -mtune=e5500"
BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)
TEST : Iterations/sec. : Old Index : New Index
: : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT : 929.76 : 23.84 : 7.83
STRING SORT : 137.98 : 61.65 : 9.54
BITFIELD : 2.2342e+08 : 38.32 : 8.01
FP EMULATION : 269.07 : 129.11 : 29.79
FOURIER : 9595.8 : 10.91 : 6.13
ASSIGNMENT : 18.374 : 69.92 : 18.13
IDEA : 3336.6 : 51.03 : 15.15
HUFFMAN : 1748.4 : 48.48 : 15.48
NEURAL NET : 15.756 : 25.31 : 10.65
LU DECOMPOSITION : 526.25 : 27.26 : 19.69
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 53.523
FLOATING-POINT INDEX: 19.600
Baseline (MSDOS*) : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU : Dual
L2 Cache :
OS : Linux 4.1.8-rt8+gbd51baf
C compiler : gcc version 4.9.2 (GCC)
libc :
MEMORY INDEX : 11.148
INTEGER INDEX : 15.295
FLOATING-POINT INDEX: 10.871
Baseline (LINUX) : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
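To put the three nbench runs side by side, one option is to take the Linux-baseline indexes (memory, integer, floating-point) and express the POWER9 result as a multiple of each NXP part. A minimal sketch using only the index values posted above; the `LINUX_INDEX` table and the `ratio_to` helper are illustrative names, and the ratios should be read loosely since the runs differ in compiler version, flags, clock and memory:

```python
# Linux-baseline nbench indexes copied from the three runs above.
# Order: (memory index, integer index, floating-point index).
LINUX_INDEX = {
    "POWER9 @ 3800MHz": (26.097, 28.909, 72.564),
    "E6500 @ 1800MHz": (9.907, 14.596, 10.019),
    "E5500 @ 2000MHz": (11.148, 15.295, 10.871),
}


def ratio_to(reference, other):
    """Return reference/other for each of the three indexes (illustrative)."""
    return tuple(r / o for r, o in zip(LINUX_INDEX[reference], LINUX_INDEX[other]))


for other in ("E6500 @ 1800MHz", "E5500 @ 2000MHz"):
    mem, integer, fp = ratio_to("POWER9 @ 3800MHz", other)
    print(f"POWER9 vs {other}: memory x{mem:.2f}, integer x{integer:.2f}, FP x{fp:.2f}")
```

These are whole-system ratios at each machine's own clock, not per-MHz figures, so they say nothing by themselves about per-clock efficiency.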