
NUMA Memory benchmarking on AMD ThreadRipper


  • NUMA Memory benchmarking on AMD ThreadRipper

    AMD's new ThreadRipper CPUs actually contain two CPU modules, which means two separate memory buses.

    As a result, there is the possibility of Non-Uniform Memory Access (NUMA): some of the time CPU 0 might use its own RAM, and some of the time CPU 0 might need to use the RAM connected to CPU 1. In theory, using the RAM directly connected to your own CPU should be faster, and using the more remote RAM should be slower.

    We set out to test this today using PerformanceTest's advanced memory test on a ThreadRipper 1950X. Memory in use was Corsair CMK32GX4M2A2666C16 (DDR4, 2666 MHz, 16-18-18-35, 4 x 16GB).

    Note that you need PerformanceTest Version 9.0 build 1022 (20/Dec/2017) or higher to do this. In previous releases the advanced memory test was not NUMA aware (or had NUMA-related bugs).

    Graphs are below, but the summary is:

    1) For sequential reading of memory, there is no significant performance difference for small blocks of RAM, but for larger blocks that fall outside the CPU's cache, the performance difference can be up to 60% (this test corresponds to our "memory speed per block size" test in PerformanceTest). So having NUMA-aware applications is important for a system like this.

    2) For non-sequential reads, where we skip forward by some step factor before reading the next value, there is a significant performance hit (this corresponds to our "memory speed per step size" test). We suspect this is because the cache is much less effective. Performance degradation was around 20%.

    3) Latency is higher for distant nodes than for memory on the local node; memory accesses are around 60% slower. Again this shows why NUMA-aware applications (and operating systems) are important. What we did notice, however, is that if we didn't explicitly select the NUMA node, most of the time the system itself seemed to select the best node anyway (using malloc() on Win10). We don't know if this was by design or just good luck.

    Note: AMD's EPYC CPUs should behave the same, but we only have a ThreadRipper system to play with.

    NUMA Memory Step Size Benchmark

    NUMA Graph AMD ThreadRipper Block Size

    Latency results - Same NUMA node

    NUMA Memory Latency Same Node

    Latency results - Remote NUMA node

    NUMA Memory Latency Remote Node

    It is also worth noting that the default BIOS setup didn't enable NUMA (ASUS motherboard). It had to be manually enabled.

    This was done in advanced mode, under "DF common options", by selecting the "memory interleaving" setting and changing it from "Auto" to "Channel".

    ASUS NUMA memory setting in UEFI BIOS

  • #2
    And just for comparison, here is a graph from an Intel i7-4770 with 16GB of G.Skill DDR3 (F3-17000CL11, 4 x 4GB).

    Intel-RAM-Benchmark.png

    Linear: 5.4ns
    Random: 78ns
    Random range: 26ms


    • #3
      RE: NUMA slowdown of CPU performance

      Thanks for the update on this! I am seeing a similar impact from NUMA while benchmarking:

      VIDEOSTAR workstation- CPU: Threadripper 1950x GPU: GTX 1080Ti, MBd: ASRock Tai Chi
      RAM: 64GB RAM@3200 SSD NVMe Samsung 960 Pro 1TB, PSU: DarkQuiet Pro 1200W
      Case: Fractal XL R2 Cooler: Arctic 360(6-fan) + 4x 140mm fans DISP: BENQ PV3200PT OS: Win10Pro,

      If I understand correctly, for my Tai Chi board NUMA is 'on' when memory interleaving is set to 'Channel' and 'off' when it is set to 'Auto'. With that definition, here are scores for the exact same configuration with NUMA on or off, CPU @ 4.125 GHz:

      Passmark 9    NUMA and HPET OFF    NUMA and HPET ON

      System        7004                 7216
      CPU           28004                25125
      2D            994                  981
      3D            15982                15853
      Mem           2238                 2622
      Disk          20349                19938

      So NUMA increases the system score by about 3%, but reduces the CPU score by nearly 11%. The 17% increase in the memory score is what drives the higher system score. NUMA slightly reduces the 2D, 3D, and disk scores.

      The pattern and the magnitude are similar at other CPU speeds (3.4, 3.7, 3.975, and 4.075 GHz), and also when I was using a slower SSD, RAM at 2133, and stock settings for Win10 Pro.

      Is this a permanent problem associated with the design, or something AMD's memory controller and BIOS design could improve?


      • #4
        Hi David, I've been running the CPU Benchmark on an EPYC system (single socket 7551, Gigabyte MZ31-AR0, 128GB ECC RAM @2666, 500GB Samsung 750evo SSD) and I'm getting a typical ~19000 score consistent with what's listed on the big results page.

        Looking at the individual component scores, it looks like the overall CPU score gets really sandbagged by the Prime Number and Physics tests in particular, both of which have some memory-access dependency per the test descriptions. I suspect this is a NUMA-related issue, based also on the scores (overall and component) I get from a Ryzen system (same CPU core, non-NUMA) and an Intel Xeon Gold system (same number of cores, different NUMA architecture) that I also have.

        As such, is there a way to run the CPU Benchmark test in a NUMA-aware mode, or are we entirely reliant on the OS's memory allocation for these tests? Setting the memory interleaving to Channel doesn't seem to have any impact on these CPU Benchmark components, by the way.


        • #5
          or are we entirely reliant on the OS's memory allocation for these tests?
          Yes, the CPU tests are not NUMA aware. We hope the O/S will do a good job.

          From a programming point of view it is a bit painful to make every memory allocation NUMA aware.
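          For readers wondering what "NUMA aware" allocation actually involves, the usual pattern is a thin wrapper that prefers a node-specific allocator when one exists and falls back to plain malloc() otherwise, letting the OS choose the node (which, as noted in the first post, often works out fine on Win10). Here is a minimal sketch of that pattern, assuming Linux libnuma for the node-specific path; on Windows the equivalent call would be VirtualAllocExNuma. This is an illustration of the idea, not PassMark's code.

```c
#include <stdlib.h>

#ifdef USE_LIBNUMA
#include <numa.h>   /* Linux libnuma; build with -DUSE_LIBNUMA -lnuma */
#endif

/* Allocate size bytes, preferring memory on the given NUMA node.
   Falls back to malloc(), which lets the OS pick the node. */
void *alloc_on_node(size_t size, int node) {
#ifdef USE_LIBNUMA
    if (numa_available() >= 0)
        return numa_alloc_onnode(size, node);
#endif
    (void)node;              /* node hint is ignored in the fallback path */
    return malloc(size);
}

/* Matching free: libnuma allocations must go back through numa_free(). */
void free_on_node(void *p, size_t size) {
#ifdef USE_LIBNUMA
    if (numa_available() >= 0) { numa_free(p, size); return; }
#endif
    (void)size;
    free(p);
}
```

          The pain David mentions is that every allocation site in the codebase has to be routed through a wrapper like this, and the caller has to know which node it wants, which is why most applications simply leave placement to the OS.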


          • #6
            Understood, thanks for confirming.

            More directly related to the thread topic, I have been running the Advanced Memory Tests on my EPYC system, and I've encountered an interesting quirk/anomaly. I don't know if this is something Passmark's doing, Windows is doing, some AMD driver is doing, or what, but I get unexpected latency and memory read speed scores for NUMA nodes 2 and 3, with respect to running local node or distant node memory in the test. NUMA 0 and NUMA 1 behave as expected, as follows:

            NUMA 0 Processor 0
            Average read speed (MB/s per step size) for local node 0: 6451 MB/s
            Average read speed for distant node 1/2/3: ~5150 MB/s
            Random Range Latency for local node 0: 68 ns
            Random Range Latency for remote node 1/2/3: 110 ns

            NUMA 1 Processor 16 shows similar numbers, except that of course the local node is 1 and the distant nodes are 0/2/3. This is also as expected.

            For NUMA 2 Processor 32 however, I see the following:
            Average read speed for "local" node 2: 5207 MB/s
            Average read speed for "distant" node 0: 6469 MB/s
            Random Range Latency for "local" node 2: 110 ns
            Random Range Latency for "remote" node 0: 68 ns

            So it's as if allocation node 0's memory is really the local memory for NUMA 2! Similarly, the results I get for NUMA 3 Processor 48 indicate that allocation node 1 is really the local memory for that processor, and not node 3.

            Any idea why I'm seeing this? Shouldn't the local node performance always be better? The GUI does identify the expected local allocation node prior to the test e.g. NUMA allocation node 2 is labeled as the local node in the drop-down when Processor 32 is selected.


            • #7
              Could you please try this updated build of PerformanceTest and let us know if you still see the same behaviour.

              Update: V9 Build 1025:
              • Fixed an issue with NUMA settings when selecting a processor using a different node to the one the PerformanceTest EXE was running on.
                The debug build above should no longer be required.


              • #8
                Tried this with a dual-socket Intel Xeon E5 (22 cores/44 logical processors per socket).

                In the dropdown box for "Processor #", I have 44 options, 0..43, corresponding to all the logical processors of socket 1, and all in NUMA node 0. The other half of the processors are missing.


                • #9
                  Could you please launch the latest public release of PerformanceTest in debug mode, try to run a test and then send us a copy of the debug logs as well as a screenshot of the Processor # and NUMA node drop down contents.


                  • #10
                    Originally posted by David (PassMark):

                    Yes, the CPU tests are not NUMA aware. We hope the O/S will do a good job.

                    From a programming point of view it is a bit painful to make every memory allocation NUMA aware.

                    ...yeah, kinda like it's "a bit painful" to optimize an application, compiler, compiler settings, libraries and all that for every bit of x86 kit on the market.

                    Let me ask a possibly-dumb question (I have a leg up on a lot of the people who will read this):

                    Is it better to optimize new hardware to run old code, or to optimize new code to run on new hardware, even if that means worse performance on old hardware?

                    Say you choose some of the first and some of the second. What balance do you choose?

                    Clearly that depends on the set of pluses and minuses of your new hardware and software versus those of your competitors.
                    Backwards-compatibility and future-headroom can be opposing forces.

                    In the desktop/low-end workstation space, where NUMA is not a factor, you don't worry about it.

                    In the server (and possibly high-end workstation) space, where NUMA could be a significant factor, you do worry about it, and you find out just how significant it could be. And if it is still not significant enough, you don't worry about it.
                    So if 99% of application developers don't worry about it, should a benchmark be written to make it A Big Deal?
                    Or should hardware developers make it transparent to software?

                    It becomes a question of how much you want to take on in terms of an upgrade.
                    If code is optimized for a fraction of the market, and that fraction is cutting-edge from a 2nd-tier supplier,
                    what are the odds that it will be well shaken-out at the time of deployment?

                    I see a rash of firmware and driver upgrades in your future, young Jedi.

                    The funny thing about high-performance computing is that you find out that the most important things are stability and consistency.
                    You need more performance, you throw more hardware onto the grid. But you don't throw hardware onto the grid that isn't both reliable and fully compatible.
                    Because nothing bums a client out more than having their system lock-up when they begin to use it intensely.
                    The full feature-set of cutting-edge hardware will never be used right away. Eventually that feature-set will be used, when the hardware is no longer cutting edge and some group of owners have taken one for the team. Even then, some of that feature-set will have proven to be hopelessly buggy and un-fixable, and some will have proven to be not worth the time and effort to optimize for.

                    I guarantee that there will be a bunch of cheap boards out there that will never properly support the full feature set of AMD chips (or even Intel chips), just as there will be OEMs that find that the full feature set isn't properly supported by the hardware vendors. And so they will disable it, and not care who whines about it. And if a developer builds their code to it, assuming that it is there, and it is not there, and their code crashes, then tough luck. Developers who want their code to run well on a wide variety of systems will turn off, or at least tone down, their use of advanced features until it is stable enough for them to ship.

                    So if you do somehow tweak your dev env so that it builds code that runs like a scalded cat on a Threadripper system?
                    Odds are that it will be because that Threadripper system runs Pentium II code very well.