advanced memory test shows huge performance drop for the 2nd CPU of MP workstations

  • advanced memory test shows huge performance drop for the 2nd CPU of MP workstations

    Hello. Thank you for your interest.

    I have tested my multiprocessor workstation with PerformanceTest and got some weird results.
    There is a huge difference in latency and bandwidth between the two NUMA nodes: the CPU1 socket side is always much slower than the CPU0 socket side.
    Yes, MP systems have extra remote memory access delays (over QPI), so latency had already gone up from 45 ns (single CPU) to 55 ns (dual CPU).
    However, the CPU1 socket side shows a huge additional performance drop on top of that.
    The latency difference between the CPU sockets (CPU0 vs CPU1) is 55 ns vs 81 ns and the bandwidth difference is 3800 MB/s vs 2100 MB/s, a gap of almost 50%.
    Latency (random range)    Tested on NUMA node 0   Tested on NUMA node 1
    NUMA allocation node 0          55.58 ns                55.58 ns
    NUMA allocation node 1          81.58 ns                80.88 ns

    Block write speed         Tested on NUMA node 0   Tested on NUMA node 1
    NUMA allocation node 0         3785 MB/s               3813 MB/s
    NUMA allocation node 1         2119 MB/s               2069 MB/s
    I wonder whether this huge latency and bandwidth difference is common to all multiprocessor systems, or whether it is an issue with my system only.
    If you own a multiprocessor workstation, please help by sharing your PerformanceTest memory results.
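
    For anyone who wants to cross-check this outside of PerformanceTest: below is a minimal sketch of the same measurement using the Windows NUMA APIs (pin the thread to one node with SetThreadGroupAffinity, allocate the buffer on a chosen node with VirtualAllocExNuma, and time a dependent pointer chase). This is my own rough code, not how PerformanceTest measures anything; the buffer size and iteration count are arbitrary choices.
    Code:

// Rough local-vs-remote latency check (assumes Windows x64 with 2 NUMA nodes).
// Pin the thread to the cores of cpuNode, allocate the buffer on allocNode,
// then time a dependent pointer chase so every load waits for the previous one.
#include <windows.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#include <random>
#include <vector>

static double chase_ns(int cpuNode, int allocNode, size_t bytes, size_t iters)
{
    // Restrict the thread to the cores of the requested CPU node.
    GROUP_AFFINITY ga = {};
    GetNumaNodeProcessorMaskEx((USHORT)cpuNode, &ga);
    SetThreadGroupAffinity(GetCurrentThread(), &ga, nullptr);

    // Commit the buffer with allocNode as the preferred node
    // (Windows falls back to another node only if allocNode is out of memory).
    size_t count = bytes / sizeof(void*);
    void** buf = (void**)VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                                            MEM_RESERVE | MEM_COMMIT,
                                            PAGE_READWRITE, (DWORD)allocNode);
    if (!buf) { std::fprintf(stderr, "VirtualAllocExNuma failed\n"); std::exit(1); }

    // Build one random cycle over all elements: buf[i] points at the next element.
    std::vector<size_t> order(count);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin() + 1, order.end(), std::mt19937_64(1));
    for (size_t i = 0; i + 1 < count; ++i) buf[order[i]] = &buf[order[i + 1]];
    buf[order[count - 1]] = &buf[order[0]];

    void** p = buf;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < iters; ++i) p = (void**)*p;   // dependent loads
    auto t1 = std::chrono::steady_clock::now();
    volatile void* sink = p; (void)sink;                 // keep the chase from being optimized away

    VirtualFree(buf, 0, MEM_RELEASE);
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / (double)iters;
}

int main()
{
    const size_t bytes = 256u << 20;     // 256 MB, far larger than the L3 cache
    const size_t iters = 20000000;
    for (int cpu = 0; cpu <= 1; ++cpu)
        for (int alloc = 0; alloc <= 1; ++alloc)
            std::printf("Processor on node %d, allocation on node %d: %.1f ns\n",
                        cpu, alloc, chase_ns(cpu, alloc, bytes, iters));
}

    On a healthy board the fast/slow split of the four numbers should follow locality (processor node == allocation node), not the allocation node alone.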

    Below is the way I tested.
    ===============================================================================================

    1. I tested with two E5-2682v4 and two E5-2680v4 CPUs on a Huananzhi F8D Plus dual-socket mainboard,
    eight (full bank) Hynix HMA42GR7MFR4N-TF DIMMs,
    Windows 10 Pro, and PerformanceTest 11.0.

    4. In the top menu bar of PerformanceTest, select "Advanced" --> "Memory...".
    [Screenshot: perftest0.png]


    5. (latency test) Select "Latency Test"

    6. Choose "Processor#"

    Be careful: this selects a logical core, not a NUMA node. To test NUMA node 1 you have to select a core whose number is above 30 or 40, depending on the core/thread count of your CPUs. (The sketch after these steps shows how to list which logical processors belong to each node.)

    7. Choose "NUMA Allocation node" 0 or 1.

    8. Press the "Go" button. You can see the result at the bottom of the window.
    Test all combinations of “NUMA node” (0/1) and “NUMA Allocation node” (0/1).
    I could see that latency is always higher for “NUMA Allocation node 1” than for “NUMA Allocation node 0”, no matter whether "Processor #" is on “NUMA node” 0 or 1.

    [Screenshot: latency test results]

    9. (write test) Select "Block Read/Write" and "Write"

    10. Choose "Processor#"

    11. Choose "NUMA Allocation node" 0 or 1.

    12. Press the "Go" button. You can see the result in the new window.
    Test all combinations of “NUMA node” (0/1) and “NUMA Allocation node” (0/1).
    I could see that bandwidth is always lower for “NUMA Allocation node 1” than for “NUMA Allocation node 0”, no matter whether "Processor #" is on “NUMA node” 0 or 1.

    [Screenshot: block write test results]

    13. (You don't need to do this.) Turn off the computer, swap the DIMMs between the CPU0-side and CPU1-side slots, then test again.
    14. (You don't need to do this.) Turn off the computer, swap the CPUs between the CPU0 and CPU1 sockets, then test again.
    ===============================================================================================
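
    Regarding step 6 (and the BIOS/OS question below): the mapping from logical processors to NUMA nodes, and how much free memory the OS sees on each node, can be listed directly with the Windows API. This is my own minimal sketch, not part of PerformanceTest; the printed CPU indices assume 64 logical processors per processor group, and other tools may number cores differently.
    Code:

// Print each NUMA node's free memory and the logical processors that belong to it.
#include <windows.h>
#include <cstdio>

int main()
{
    ULONG highest = 0;
    GetNumaHighestNodeNumber(&highest);            // should be 1 on a dual-socket board
    std::printf("Highest NUMA node number: %lu\n", highest);

    for (USHORT node = 0; node <= (USHORT)highest; ++node) {
        GROUP_AFFINITY mask = {};
        GetNumaNodeProcessorMaskEx(node, &mask);   // logical CPUs belonging to this node

        ULONGLONG freeBytes = 0;
        GetNumaAvailableMemoryNodeEx(node, &freeBytes);

        std::printf("Node %u (group %u): %llu MB free, logical CPUs:",
                    node, mask.Group, freeBytes >> 20);
        for (int cpu = 0; cpu < 64; ++cpu)
            if (mask.Mask & (1ULL << cpu))
                std::printf(" %d", cpu + mask.Group * 64);   // assumes 64 CPUs per group
        std::printf("\n");
    }
}

    If both nodes show up with roughly half of the installed RAM each, the BIOS is at least presenting the topology to Windows correctly.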

    In my case, exchanging the CPUs and the DIMMs did not change the results at all.
    Therefore, I suspect this is a mainboard issue.
    Or did I make a mistake in some BIOS or OS setting?




  • #2
    There are some results from years ago here
    https://forums.passmark.com/performa...d-threadripper

    The whole idea of multiple sockets is going out of fashion: why have two sockets with 14 cores each when you can have a single-socket CPU with up to 96 cores in one package? Very little software is optimised for NUMA.

    It would definitely be interesting to see results from a few other machines, as I agree the results look a bit strange (as if the NUMA RAM allocation nodes were relative to the NUMA CPU nodes). If that were the case, allocation node 0 would always end up local and allocation node 1 always remote, which would produce exactly the pattern in your tables.

    As indicated in the linked post, some motherboards have BIOS settings to play around with.
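
    One way to narrow this down from the OS side: Windows can report which physical NUMA node actually backs each virtual page via QueryWorkingSetEx. The rough sketch below (my own, nothing to do with PerformanceTest internals) requests memory on a specific node and then checks where the pages really landed; if pages requested on node 1 come back reported as node 0, the problem is in the OS/BIOS allocation rather than in the benchmark.
    Code:

// Allocate on a requested NUMA node, touch the pages, then ask the OS which
// physical node backs them (PSAPI_WORKING_SET_EX_INFORMATION, VirtualAttributes.Node).
#include <windows.h>
#include <psapi.h>
#include <cstdio>
#include <cstring>

#pragma comment(lib, "psapi.lib")   // may be unnecessary with newer SDKs

int main()
{
    const DWORD requestedNode = 1;       // the node we ask for
    const SIZE_T bytes = 64 << 20;       // 64 MB

    BYTE* p = (BYTE*)VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                                        MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                        requestedNode);
    if (!p) { std::printf("VirtualAllocExNuma failed (%lu)\n", GetLastError()); return 1; }
    std::memset(p, 0xAB, bytes);         // fault the pages in

    SYSTEM_INFO si;
    GetSystemInfo(&si);
    SIZE_T pages = bytes / si.dwPageSize;

    // Sample every 256th page and count which physical node each one landed on.
    size_t counts[8] = {};
    for (SIZE_T i = 0; i < pages; i += 256) {
        PSAPI_WORKING_SET_EX_INFORMATION info = {};
        info.VirtualAddress = p + i * si.dwPageSize;
        if (QueryWorkingSetEx(GetCurrentProcess(), &info, (DWORD)sizeof(info)) &&
            info.VirtualAttributes.Valid)
            counts[info.VirtualAttributes.Node & 7]++;
    }
    for (int n = 0; n < 8; ++n)
        if (counts[n])
            std::printf("node %d: %zu sampled pages\n", n, counts[n]);

    VirtualFree(p, 0, MEM_RELEASE);
}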



    • #3
      This time I tested with Intel Memory Latency Checker and got the expected result: symmetric remote memory access delays between the NUMA nodes.

      Is there any possibility that the PerformanceTest advanced memory test malfunctioned?

      Here is the Intel Memory Latency Checker test result.

      ====================================================================================================

      Intel(R) Memory Latency Checker - v3.11
      Measuring idle latencies for random access (in ns)...
                          Numa node 0   Numa node 1
      Numa node 0             91.8         125.6
      Numa node 1            128.6          90.4
      Measuring Peak Injection Memory Bandwidths for the system
      Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
      Using all the threads from each core if Hyper-threading is enabled
      Using traffic with the following read-write ratios
      ALL Reads : 126783.9
      3:1 Reads-Writes : 122172.0
      2:1 Reads-Writes : 121868.1
      1:1 Reads-Writes : 114215.6
      Stream-triad like: 107334.5

      Measuring Memory Bandwidths between nodes within system
      Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
      Using all the threads from each core if Hyper-threading is enabled
      Using Read-only traffic type
                          Numa node 0   Numa node 1
      Numa node 0           64761.6       16684.5
      Numa node 1           16725.6       64444.8
      Measuring Loaded Latencies for the system
      Using all the threads from each core if Hyper-threading is enabled
      Using Read-only traffic type
      Inject     Latency    Bandwidth
      Delay       (ns)       MB/sec
      ==========================
      00000 210.29 128318.6
      00002 210.81 128501.7
      00008 211.57 128222.4
      00015 211.70 128083.5
      00050 199.87 127241.8
      00100 183.51 125596.4
      00200 121.22 92607.0
      00300 110.09 63446.0
      00400 104.28 48124.2
      00500 100.70 38986.3
      00700 97.13 28266.5
      01000 97.98 19979.7
      01300 93.68 15644.0
      01700 92.60 12166.5
      02500 91.72 8513.3
      03500 91.03 6294.2
      05000 91.36 4614.1
      09000 91.05 2881.8
      20000 90.88 1685.7

      Measuring cache-to-cache transfer latency (in ns)...
      Using small pages for allocating buffers
      Local Socket L2->L2 HIT latency 39.7
      Local Socket L2->L2 HITM latency 43.4
      Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                              Reader Numa Node 0   Reader Numa Node 1
      Writer Numa Node 0              -                   97.9
      Writer Numa Node 1             98.5                   -
      Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                              Reader Numa Node 0   Reader Numa Node 1
      Writer Numa Node 0              -                   98.2
      Writer Numa Node 1             97.6                   -
      ====================================================================================================
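
      For the block-write side, here is a similarly crude cross-check (again my own sketch, not how PerformanceTest or MLC measure): from a thread pinned to node 0, memset a large buffer allocated first on node 0 and then on node 1, and compare MB/s. The absolute numbers will not match either tool, but on a healthy board the two cases should differ only by roughly the expected QPI penalty, and which case is slower should depend on locality, not on the allocation node index by itself.
      Code:

// Crude single-thread block-write bandwidth: thread pinned to one NUMA node,
// buffer allocated on node 0 or node 1 via VirtualAllocExNuma.
#include <windows.h>
#include <chrono>
#include <cstdio>
#include <cstring>

static double write_mb_per_s(int cpuNode, int allocNode, size_t bytes, int passes)
{
    // Keep the writing thread on the cores of cpuNode.
    GROUP_AFFINITY ga = {};
    GetNumaNodeProcessorMaskEx((USHORT)cpuNode, &ga);
    SetThreadGroupAffinity(GetCurrentThread(), &ga, nullptr);

    void* buf = VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                                   MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                   (DWORD)allocNode);
    if (!buf) return 0.0;
    std::memset(buf, 1, bytes);                  // fault the pages in before timing

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < passes; ++i)
        std::memset(buf, i, bytes);              // the timed block writes
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    VirtualFree(buf, 0, MEM_RELEASE);
    return (double)bytes * passes / (1024.0 * 1024.0) / secs;
}

int main()
{
    const size_t bytes = 512u << 20;             // 512 MB
    for (int alloc = 0; alloc <= 1; ++alloc)
        std::printf("Processor on node 0 writing to allocation node %d: %.0f MB/s\n",
                    alloc, write_mb_per_s(0, alloc, bytes, 8));
}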

