Hello. Thank you for your interest.
I have tested my multiprocessor workstation using PerformanceTest and got some weird results.
There is a huge difference in latency and bandwidth between the two NUMA nodes: the CPU1 socket side is always much slower than the CPU0 socket side.
I know that MP systems pay an extra memory access cost for remote accesses over QPI, so latency had already gone up from about 45 ns (single CPU) --> 55 ns (dual CPU).
However, the CPU1 socket side suffers a huge additional performance drop on top of that.
The latency difference between the CPU0 side and the CPU1 side is 55 ns vs. 81 ns, and the bandwidth difference is 3800 MB/s vs. 2100 MB/s, i.e. the CPU1 side is roughly 45-50% worse on both counts.
latency (random range)  | processor on NUMA node 0 | processor on NUMA node 1 |
NUMA allocation node 0  | 55.58 ns                 | 55.58 ns                 |
NUMA allocation node 1  | 81.58 ns                 | 80.88 ns                 |

block write speed       | processor on NUMA node 0 | processor on NUMA node 1 |
NUMA allocation node 0  | 3785 MB/s                | 3813 MB/s                |
NUMA allocation node 1  | 2119 MB/s                | 2069 MB/s                |
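For anyone who wants to cross-check numbers like these outside PerformanceTest, here is a rough sketch (my own experiment, not PassMark's implementation) that does the same kind of test with the documented Windows NUMA APIs: it pins the current thread to one logical processor, allocates a buffer preferentially on a chosen node with VirtualAllocExNuma, and measures random-access latency with a pointer chase. It assumes a 64-bit build and at most 64 logical processors (so a plain affinity mask is enough); the 256 MB buffer size and hop count are arbitrary choices.

// numa_latency.cpp : rough random-access latency probe for one processor / one NUMA node.
// Assumptions: 64-bit Windows, at most 64 logical processors (plain affinity mask).
// Build: cl /O2 /EHsc numa_latency.cpp
#include <windows.h>
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#include <random>
#include <vector>

int main(int argc, char** argv)
{
    int   cpuIndex  = (argc > 1) ? atoi(argv[1]) : 0;   // like "Processor #"
    DWORD allocNode = (argc > 2) ? atoi(argv[2]) : 0;   // like "NUMA Allocation node"

    // Pin the measuring thread to one logical processor.
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << cpuIndex);

    // Commit a buffer much larger than the L3 cache, preferentially backed on allocNode.
    const size_t count = (256u * 1024u * 1024u) / sizeof(void*);        // 256 MB of pointers
    void** buf = (void**)VirtualAllocExNuma(GetCurrentProcess(), nullptr, count * sizeof(void*),
                                            MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, allocNode);
    if (!buf) { printf("VirtualAllocExNuma failed (%lu)\n", GetLastError()); return 1; }

    // Link the buffer into one random cycle so every load depends on the previous one.
    std::vector<size_t> order(count);
    std::iota(order.begin(), order.end(), size_t(0));
    std::shuffle(order.begin(), order.end(), std::mt19937_64(12345));
    for (size_t i = 0; i + 1 < count; ++i) buf[order[i]] = &buf[order[i + 1]];
    buf[order[count - 1]] = &buf[order[0]];

    // Chase the pointers and report the average time per dependent load.
    const size_t hops = 50u * 1000u * 1000u;
    void** p = (void**)buf[order[0]];
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (size_t i = 0; i < hops; ++i) p = (void**)*p;
    QueryPerformanceCounter(&t1);

    double ns = 1e9 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart) / double(hops);
    printf("processor %d, allocation node %lu: %.1f ns per load (%p)\n",
           cpuIndex, allocNode, ns, (void*)p);
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}

Running it as, say, "numa_latency 0 1" (processor 0, allocation node 1) for all four processor/node combinations should show the same local vs. remote pattern as the table above, if the hardware is behaving.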
I wonder whether this large latency/bandwidth gap is normal for all multiprocessor systems, or whether it is an issue specific to my machine.
If you own a multiprocessor workstation, please share your PerformanceTest memory results so I can compare.
Below is how I tested.
===============================================================================
1. I tested on a Huananzhi F8D Plus dual-socket mainboard, with two E5-2682v4 CPUs and also with two E5-2680v4 CPUs.
2. Eight Hynix HMA42GR7MFR4N-TF DIMMs (all banks populated).
3. Windows 10 Pro and PerformanceTest 11.0.
4. In the PerformanceTest menu bar, open "Advanced" --> "Memory...".
5. (Latency test) Select "Latency Test".
6. Choose "Processor #".
Be careful: this selects a logical processor (a core/thread), not a NUMA node. To land on NUMA node 1 you have to pick a processor number above roughly 30 or 40, depending on how many cores/threads your CPUs have (the small mapping program after this list prints the exact assignment for your machine).
7. Choose "NUMA Allocation node" 0 or 1.
8. Press the "Go" button; the result appears at the bottom of the window.
Test all combinations of the processor's NUMA node (0/1) and "NUMA Allocation node" (0/1).
In my case the latency is always higher for "NUMA Allocation node 1" than for "NUMA Allocation node 0", no matter whether "Processor #" is on NUMA node 0 or 1.
9. (Write test) Select "Block Read/Write" and "Write".
10. Choose "Processor #".
11. Choose "NUMA Allocation node" 0 or 1.
12. Press the "Go" button; the result appears in a new window.
Test all combinations of the processor's NUMA node (0/1) and "NUMA Allocation node" (0/1).
In my case the bandwidth is always lower for "NUMA Allocation node 1" than for "NUMA Allocation node 0", no matter whether "Processor #" is on NUMA node 0 or 1.
13. (You don't need to do this.) Power the computer off, swap the DIMMs between the CPU0-side and CPU1-side slots, then test again.
14. (You don't need to do this.) Power the computer off, swap the CPUs between the CPU0 and CPU1 sockets, then test again.
===============================================================================
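About the caveat in step 6: if you are not sure which "Processor #" values belong to which NUMA node on your machine, this small program (plain Windows API, nothing PerformanceTest-specific; assumes Windows 7 or later) prints the node of every logical processor.

// numa_map.cpp : print the NUMA node of every logical processor, so you can see
// which "Processor #" values in PerformanceTest land on node 0 and which on node 1.
// Assumes Windows 7 or later. Build: cl /O2 numa_map.cpp
#include <windows.h>
#include <cstdio>

int main()
{
    WORD groups = GetActiveProcessorGroupCount();
    for (WORD g = 0; g < groups; ++g) {
        DWORD inGroup = GetActiveProcessorCount(g);
        for (DWORD i = 0; i < inGroup; ++i) {
            PROCESSOR_NUMBER pn = {};
            pn.Group  = g;
            pn.Number = (BYTE)i;
            USHORT node = 0;
            if (GetNumaProcessorNodeEx(&pn, &node))
                printf("group %u, processor %3lu -> NUMA node %u\n", g, i, node);
        }
    }
    printf("%lu logical processors in total\n", GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
    return 0;
}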
In my case, swapping the CPUs and swapping the DIMMs did not change the results at all.
Therefore I suspect a mainboard issue.
Or did I get some BIOS or OS option wrong?
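For the BIOS/OS question, one sanity check is to ask Windows directly how many NUMA nodes it sees and how much memory and which processors belong to each; the short sketch below does that with the documented NUMA APIs. If node 1 reported little or no memory, I would look first at BIOS settings such as node/memory interleaving. (This is only a sanity check, not a definitive diagnostic.)

// numa_mem.cpp : list the NUMA nodes Windows sees, plus free memory and processors per node.
// Build: cl /O2 numa_mem.cpp
#include <windows.h>
#include <cstdio>

int main()
{
    ULONG highest = 0;
    if (!GetNumaHighestNodeNumber(&highest)) {
        printf("GetNumaHighestNodeNumber failed (%lu)\n", GetLastError());
        return 1;
    }
    printf("Windows reports NUMA nodes 0..%lu\n", highest);

    for (USHORT node = 0; node <= (USHORT)highest; ++node) {
        ULONGLONG freeBytes = 0;
        GROUP_AFFINITY mask = {};
        GetNumaAvailableMemoryNodeEx(node, &freeBytes);   // free (not total) memory on this node
        GetNumaNodeProcessorMaskEx(node, &mask);          // logical processors that belong to it
        printf("node %u: %llu MB free, processor mask 0x%llx (group %u)\n",
               node, freeBytes / (1024ull * 1024ull),
               (unsigned long long)mask.Mask, mask.Group);
    }
    return 0;
}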