AMD's new ThreadRipper CPUs actually contain two CPU modules, which means two separate memory buses.
As a result, there is the possibility of Non-Uniform Memory Access (NUMA), which means that some of the time CPU 0 might use its own RAM, and some of the time CPU 0 might need to use the RAM connected to CPU 1. In theory, using the RAM directly connected to your own CPU should be faster, and using the more remote RAM should be slower.
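To make the local/remote distinction concrete, here is a minimal sketch (ours, not part of PerformanceTest) that pins the current thread to the processors of NUMA node 0 and then allocates one buffer preferring node 0 and another preferring node 1, using the Win32 VirtualAllocExNuma call. On a two-node ThreadRipper the first buffer is "local" to the pinned thread and the second is "remote". Buffer size and error handling are simplified.

```cpp
// Minimal sketch: pin a thread to NUMA node 0, then allocate one buffer
// preferring node 0 (local) and one preferring node 1 (remote).
// Assumes a two-node system; error handling is kept to a minimum.
#include <windows.h>
#include <cstdio>
#include <cstring>

int main()
{
    // Restrict the current thread to the processors of NUMA node 0.
    GROUP_AFFINITY affinity = {};
    if (!GetNumaNodeProcessorMaskEx(0, &affinity))
        return 1;
    SetThreadGroupAffinity(GetCurrentThread(), &affinity, nullptr);

    const SIZE_T size = 256u * 1024 * 1024;   // 256MB test buffer

    // Preferred node 0 = local to the thread pinned above.
    void* localBuf  = VirtualAllocExNuma(GetCurrentProcess(), nullptr, size,
                                         MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, 0);
    // Preferred node 1 = remote, behind the other die's memory controller.
    void* remoteBuf = VirtualAllocExNuma(GetCurrentProcess(), nullptr, size,
                                         MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, 1);
    if (!localBuf || !remoteBuf)
        return 1;

    // Touch the pages so they are faulted in and backed by physical RAM
    // on the requested nodes; the node number above is only a preference
    // that is applied at fault time.
    memset(localBuf, 1, size);
    memset(remoteBuf, 1, size);

    printf("local buffer at %p, remote buffer at %p\n", localBuf, remoteBuf);

    VirtualFree(localBuf, 0, MEM_RELEASE);
    VirtualFree(remoteBuf, 0, MEM_RELEASE);
    return 0;
}
```

Timing the same read loop over each buffer from the pinned thread is then enough to see the local versus remote difference discussed below.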
We set out to test this today using PerformanceTest's advanced memory test on a ThreadRipper 1950X. The memory in use was Corsair cmk32gx4m2a2666c16 (DDR4, 2666MHz, 16-18-18-35, 4 x 16GB).
Note that you need PerformanceTest version 9.0 build 1022 (20/Dec/2017) or higher to do this. In previous releases the advanced memory test was not NUMA aware (or had NUMA-related bugs).
Graphs are below, but the summary is:
1) For sequential reading of memory, there is no significant performance difference for small blocks of RAM, but for larger blocks that fall outside the CPU's cache, the performance difference can be up to 60%. (This corresponds to our "memory speed per block size" test in PerformanceTest.) So having NUMA-aware applications is important for a system like this.
2) For non-sequential reads, where we skip forward by some step factor before reading the next value (see the strided-read sketch after this list), there is a significant performance hit. (This corresponds to our "memory speed per step size" test.) We suspect this is due to the cache being a lot less effective. The performance degradation was around 20%.
3) Latency is higher for remote nodes compared to accessing memory on the local node; memory accesses are around 60% slower. This again shows why NUMA-aware applications (and operating systems) are important. What we did notice, however, is that if we didn't explicitly select the NUMA node, most of the time the system itself seemed to select the best node anyway (using malloc() on Win10); a sketch of how to check this follows the list. We don't know if this was by design or just good luck.
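For reference, the "step size" pattern mentioned in point 2 is essentially a strided read: rather than walking the buffer element by element, the loop jumps ahead by a fixed number of elements, so each access tends to land on a new cache line and the hardware prefetcher helps much less. A rough sketch of the idea (ours, not the actual PerformanceTest code):

```cpp
// Rough sketch of a strided ("step size") read; not the actual
// PerformanceTest implementation.
#include <cstddef>
#include <cstdint>

// Reads one 64-bit value every `step` elements, wrapping around the
// buffer, and returns a checksum so the compiler can't remove the loop.
uint64_t strided_read(const uint64_t* buf, size_t count, size_t step, size_t reads)
{
    uint64_t sum = 0;
    size_t idx = 0;
    for (size_t i = 0; i < reads; ++i) {
        sum += buf[idx];
        idx += step;
        if (idx >= count)   // wrap rather than run off the end of the buffer
            idx -= count;
    }
    return sum;
}
```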
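On the malloc() observation in point 3: one way to check where Windows actually placed the pages is QueryWorkingSetEx, whose PSAPI_WORKING_SET_EX_BLOCK reports the NUMA node backing each resident page. The sketch below is our own check (not something PerformanceTest exposes) and assumes the buffer has been touched so its pages are resident.

```cpp
// Sketch: ask Windows which NUMA node is backing the first page of a
// plain malloc() buffer. Pages must be touched first to be resident.
// Link against psapi.lib.
#include <windows.h>
#include <psapi.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
    const size_t size = 64u * 1024 * 1024;
    char* buf = static_cast<char*>(malloc(size));
    if (!buf)
        return 1;
    memset(buf, 1, size);                       // fault the pages in

    PSAPI_WORKING_SET_EX_INFORMATION info = {};
    info.VirtualAddress = buf;                  // query the first page only
    if (QueryWorkingSetEx(GetCurrentProcess(), &info, sizeof(info)) &&
        info.VirtualAttributes.Valid)
    {
        printf("first page of the malloc() buffer is on NUMA node %u\n",
               (unsigned)info.VirtualAttributes.Node);
    }

    free(buf);
    return 0;
}
```

Running this from threads pinned to each node is a quick way to see whether the allocator really is handing back node-local memory.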
Note: AMD's EPYC CPUs should behave the same, but we only have a ThreadRipper system to play with.
Latency results - Same NUMA node
Latency results - Remote NUMA node
It is also worth noting that the default BIOS setup didn't enable NUMA (ASUS motherboard); it had to be enabled manually.
This was done in advanced mode, under "DF common options", by selecting the "memory interleaving" setting and changing it from "Auto" to "Channel".