More questions: Reg-ECC Recognition, NUMA and Warnings


    Hi again, some more questions here.
    Not directly related to problems or crashes; I'm just looking for some help in understanding things.
    So no urgent problem-solving needed.
    But I'd be happy if someone found the time to share a bit of knowledge.


    System is:

    Asus Z9PE-D8 WS: dual-socket, quad-channel, Intel C602 chipset mainboard
    4×/8× Samsung M386B4G70DM0-CMA4: DDR3, 1866MHz, Reg-ECC, 32GB DIMMs
    Intel Xeon E5-2690v2: 10c/20t 3.00GHz CPUs


    First thing is the recognition of the module specs.
    I don't know how the SPDs or SMBIOS entries of these DIMMs are programmed, but I wonder why so much information and so many features show as n/a in Memtest.

    RAM Info is as follows:

    Memory summary:
    Number of RAM slots: 8
    Number of RAM modules: 8
    Number of RAM SPDs detected: 8
    Total Physical Memory: 262114M

    SPD Details:
    --------------

    SPD #: 1
    ==============
    RAM Type: DDR3
    Maximum Clock Speed (MHz): 933 (JEDEC)
    Maximum Transfer Speed (MHz): DDR3-1867
    Maximum Bandwidth (MB/s): PC3-14900
    Memory Capacity (MB): 32768
    Jedec Manufacture Name: Samsung
    SPD Revision: 1.2
    Registered: No
    ECC: Yes
    DIMM Slot #: 1
    Manufactured: Week 40 of Year 2014
    Module Part #: M386B4G70DM0-CMA4
    Module Revision: 0x0000
    Module Serial #: 0x394A4D3B
    Module Manufacturing Location: 0x02
    # of Row Addressing Bits: 16
    # of Column Addressing Bits: 11
    # of Banks: 8
    # of Ranks: 4
    Device Width in Bits: 4
    Bus Width in Bits: 64
    Module Voltage: 1.5V
    CAS Latencies Supported: 6 7 8 9 10 11 13
    Timings @ Max Frequency (JEDEC): 13-13-13-32
    Maximum Clock Speed (MHz): 933
    Maximum Transfer Speed (MHz): DDR3-1867
    Maximum Bandwidth (MB/s): PC3-14900
    Minimum Clock Cycle Time, tCK (ns): 1.071
    Minimum CAS Latency Time, tAA (ns): 13.125
    Minimum RAS to CAS Delay, tRCD (ns): 13.125
    Minimum Row Precharge Time, tRP (ns): 13.125
    Minimum Active to Precharge Time, tRAS (ns): 34.000
    Minimum Row Active to Row Active Delay, tRRD (ns): 5.000
    Minimum Auto-Refresh to Active/Auto-Refresh Time, tRC (ns): 47.125
    Minimum Auto-Refresh to Active/Auto-Refresh Command Period, tRFC (ns): 260.000
    DDR3 Specific SPD Attributes
    Write Recover Time, tWR (ns): 15.000
    Internal Write to Read Command Delay, tWTR (ns): 7.500
    Internal Read to Precharge Command Delay, tRTP (ns): 7.500
    Minimum Four Activate Window Delay (ns): 27.000
    RZQ / 6 Supported: Yes
    RZQ / 7 Supported: Yes
    DLL-Off Mode Supported: Yes
    Maximum Operating Temperature Range (C): 0-95C
    Refresh Rate at Extended Operating Temperature Range: 2X
    Auto-Self Refresh Supported: No
    On-die Thermal Sensor Readout Supported: No
    Partial Array Self Refresh Supported: No
    Thermal Sensor Present: Yes
    Non-standard SDRAM Type: 00
    Module Type: Reserved
    Module Height (mm): -1 - 0
    Module Thickness (mm): front -1-0 , back -1-0
    Module Width (mm):
    Reference Raw Card Used:
    DRAM Manufacture: Samsung
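The headline figures in this SPD dump follow from the stored minimum times. The sketch below, assuming the usual convention of dividing each minimum time by tCK and rounding up to whole clock cycles, reproduces the 933 MHz / DDR3-1867 / PC3-14900 / 13-13-13-32 values from the displayed numbers. It only illustrates the arithmetic; it is not MemTest86's decoder, and the function names are hypothetical.

```python
import math

# Minimum times from the SPD dump above, in nanoseconds (as displayed, i.e. rounded).
T_CK = 1.071    # Minimum Clock Cycle Time, tCK
T_AA = 13.125   # Minimum CAS Latency Time, tAA
T_RCD = 13.125  # Minimum RAS to CAS Delay, tRCD
T_RP = 13.125   # Minimum Row Precharge Time, tRP
T_RAS = 34.000  # Minimum Active to Precharge Time, tRAS


def clock_mhz(t_ck_ns: float) -> int:
    """DRAM clock in MHz, truncated to an integer."""
    return int(1000.0 / t_ck_ns)


def transfer_rate(t_ck_ns: float) -> int:
    """DDR transfer rate in MT/s: two transfers per clock, truncated."""
    return int(2 * 1000.0 / t_ck_ns)


def pc3_rating(rate_mts: int) -> int:
    """Peak bandwidth of a 64-bit (8-byte) data bus in MB/s, rounded down to 100 as in 'PC3-xxxxx'."""
    return (rate_mts * 8) // 100 * 100


def cycles(t_ns: float, t_ck_ns: float) -> int:
    """Convert a minimum time to whole clock cycles, rounding up."""
    return math.ceil(t_ns / t_ck_ns)


if __name__ == "__main__":
    rate = transfer_rate(T_CK)
    timings = [cycles(t, T_CK) for t in (T_AA, T_RCD, T_RP, T_RAS)]
    print(f"{clock_mhz(T_CK)} MHz, DDR3-{rate}, PC3-{pc3_rating(rate)}, "
          f"timings {'-'.join(map(str, timings))}")
```

Because tCK is shown rounded (the exact DDR3-1866 cycle is 15/14 ns, about 1.0714 ns), a borderline value could round differently; for these particular numbers the result is unambiguous, and CL 13 matches the highest CAS latency in the supported list.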


    SMBIOS Details:
    --------------

    DIMM #: 1
    ==============
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 32768 MB
    Form Factor: DIMM
    Device Set: 0
    Device Locator: DIMM_A1
    Bank Locator: DIMM_A1
    Memory Type: DDR3
    Type Detail: Synchronous
    Speed: 1866 MT/s
    Manufacturer: Samsung
    Serial Number: 394A4D3B
    Asset Tag: DIMM_A1_AssetTag
    Part Number: M386B4G70DM0-CM
    Attributes: 00000004
    Configured Memory Speed: 1866 MT/s
    Minimum Voltage: N/A
    Maximum Voltage: N/A
    Configured Voltage: N/A
    Memory Technology: Unknown
    Memory Operating Mode Capability: Unknown
    Firmware Version:
    Module Manufacturer ID: N/A
    Module Product ID: N/A
    Memory Subsystem Controller Manufacturer ID: N/A
    Memory Subsystem Controller Product ID: N/A
    Non Volatile Size: N/A
    Volatile Size: N/A
    Cache Size: N/A
    Logical Size: N/A


    First subtopic of interest is this entry under SPD Details:
    Registered: No

    I am pretty sure these are Registered DIMMs, as a capacity of 32GB per DDR3 DIMM could only be manufactured as a Registered DIMM... right?
    How come Memtest does not recognize this?
    Does this have any effect on testing?
    I have already run dozens of complete test suites, but I don't know how the registered or unbuffered state would be reflected in the logs.

    Second is the information in the SMBIOS Details section:
    A lot of information is not available.
    Is this common for Reg-ECC DIMMs?
    Does it have any effect that this information can't be read, or isn't programmed at all?
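One way to reason about the N/A entries: SMBIOS Type 17 (Memory Device) records grew with each spec revision, and a field only exists if the record the BIOS provides is long enough to contain it. The sketch below uses offsets and sizes from the DMTF SMBIOS specification as I understand them (worth double-checking against the spec); the helper name is hypothetical. With a 0x22-byte record, the length defined by SMBIOS 2.7, only Attributes and Configured Memory Speed survive among the fields listed, and everything from Minimum Voltage onward is missing, which matches the dump above. That this BIOS actually provides a 2.7-style record is an inference, not something the log states.

```python
# Type 17 (Memory Device) fields from the SMBIOS Details dump that were added in later
# spec revisions: (name, byte offset, size in bytes, first spec version).
# Offsets/sizes follow the DMTF SMBIOS spec as understood here; verify before relying on them.
TYPE17_LATE_FIELDS = [
    ("Attributes", 0x1B, 1, "2.6"),
    ("Configured Memory Speed", 0x20, 2, "2.7"),
    ("Minimum Voltage", 0x22, 2, "2.8"),
    ("Maximum Voltage", 0x24, 2, "2.8"),
    ("Configured Voltage", 0x26, 2, "2.8"),
    ("Memory Technology", 0x28, 1, "3.2"),
    ("Memory Operating Mode Capability", 0x29, 2, "3.2"),
    ("Firmware Version", 0x2B, 1, "3.2"),
    ("Module Manufacturer ID", 0x2C, 2, "3.2"),
    ("Module Product ID", 0x2E, 2, "3.2"),
    ("Memory Subsystem Controller Manufacturer ID", 0x30, 2, "3.2"),
    ("Memory Subsystem Controller Product ID", 0x32, 2, "3.2"),
    ("Non Volatile Size", 0x34, 8, "3.2"),
    ("Volatile Size", 0x3C, 8, "3.2"),
    ("Cache Size", 0x44, 8, "3.2"),
    ("Logical Size", 0x4C, 8, "3.2"),
]


def missing_fields(record_length: int) -> list[str]:
    """Names of fields that do not fit inside a Type 17 record of the given length."""
    return [name for name, offset, size, _ in TYPE17_LATE_FIELDS
            if offset + size > record_length]


if __name__ == "__main__":
    for length in (0x22, 0x28, 0x54):
        print(f"record length 0x{length:02X}: {len(missing_fields(length))} fields would be N/A")
```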


    Another topic is the performance of the tests in relation to the NUMA setting in the board's BIOS setup.
    While scrolling through the logs of some passed tests, I noticed repeated entries like this:
    2021-12-26 07:38:03 - Running test #7 (Test 7 [Moving inversions, 32-bit pattern])
    2021-12-26 07:38:03 - MtSupportRunAllTests - Setting random seed to 0x3BC48C3E
    2021-12-26 07:38:03 - MtSupportRunAllTests - Start time: 81241323 ms
    2021-12-26 07:38:03 - MtSupportRunAllTests - Enabling memory cache for test
    2021-12-26 07:38:03 - MtSupportRunAllTests - Enabling memory cache complete
    2021-12-26 07:38:03 - Start memory range test (0x0 - 0x20C0000000)
    2021-12-26 07:38:35 - GetIA32ArchitecturalTemp - MSR(0x19C) = 88370000 (Vendor ID: GenuineIntel 6 3E 4)
    2021-12-26 07:38:35 - MapTempIntel - MSR(0x1A2) = 640E00

    [...]
    2021-12-26 08:34:37 - WARNING - waited for 10s for CPU #32 to finish (BSP test time = 22814ms)
    2021-12-26 08:34:37 - WARNING - waited for 10s for CPU #34 to finish (BSP test time = 22814ms)
    2021-12-26 08:34:37 - WARNING - waited for 10s for CPU #36 to finish (BSP test time = 22814ms)
    2021-12-26 08:34:37 - WARNING - waited for 10s for CPU #38 to finish (BSP test time = 22814ms)
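The end address in the "Start memory range test" line can be converted to GiB to see how much address space the test walks. A minimal sketch; the constants are copied from the log line, it assumes the range is half-open, and the interpretation that follows is mine rather than something MemTest86 reports.

```python
GIB = 1 << 30  # bytes per GiB

# From "Start memory range test (0x0 - 0x20C0000000)" in the log above.
RANGE_START = 0x0
RANGE_END = 0x20C0000000


def span_gib(start: int, end: int) -> float:
    """Size of the address range [start, end) in GiB."""
    return (end - start) / GIB


if __name__ == "__main__":
    print(f"{span_gib(RANGE_START, RANGE_END):.1f} GiB of physical address space")
```

The result, 131 GiB, is about 3 GiB more than the 128 GB of a 4×32GB configuration. Assuming this log came from a 4-DIMM run, that suggests the tested range also covers address space that is not RAM (for example a region reserved for devices).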

    Sometimes these warnings add up to 20 to 30 per test.
    The warnings had no effect on the number of reported errors.
    The test runs were always marked as "passed".
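To count these warnings per test instead of scrolling, the log can be scanned line by line. A minimal sketch, assuming the line formats quoted above (other MemTest86 versions may word them differently); the sample input is built from those quoted lines, so which test the warnings belong to is illustrative only.

```python
import re

# Patterns based on the log lines quoted above.
TEST_START = re.compile(r"Running test #(\d+)")
CPU_WAIT_WARNING = re.compile(r"WARNING - waited for \d+s for CPU #\d+ to finish")


def warnings_per_test(lines):
    """Return [(test_number, warning_count), ...], one entry per test run in log order."""
    runs = []
    for line in lines:
        start = TEST_START.search(line)
        if start:
            runs.append([int(start.group(1)), 0])
        elif runs and CPU_WAIT_WARNING.search(line):
            runs[-1][1] += 1
    return [tuple(run) for run in runs]


SAMPLE = """\
2021-12-26 07:38:03 - Running test #7 (Test 7 [Moving inversions, 32-bit pattern])
2021-12-26 08:34:37 - WARNING - waited for 10s for CPU #32 to finish (BSP test time = 22814ms)
2021-12-26 08:34:37 - WARNING - waited for 10s for CPU #34 to finish (BSP test time = 22814ms)
2021-12-26 08:34:37 - WARNING - waited for 10s for CPU #36 to finish (BSP test time = 22814ms)
2021-12-26 08:34:37 - WARNING - waited for 10s for CPU #38 to finish (BSP test time = 22814ms)
""".splitlines()

if __name__ == "__main__":
    # For a real log, pass the file's lines instead, e.g. open(path, errors="replace").
    for test, count in warnings_per_test(SAMPLE):
        print(f"test #{test}: {count} CPU wait warnings")
```

Keeping one entry per test run (rather than summing by test number) keeps the four passes of a 4×13 suite separate.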

    After some experimenting, I found that these warnings don't appear when I set the NUMA option in the BIOS setup to "disabled".
    The complete test suites (4 runs of 13 tests) also finish notably faster.

    With NUMA enabled, a complete 4×13 test run took 47:53:47 (h:mm:ss) when using 4 DIMMs for 2 CPUs (dual-channel config).
    Without NUMA, the 4×13 test run finished in 38:00:48.
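To put the two durations side by side, a small sketch that converts the h:mm:ss totals to seconds. It assumes the first figure was measured with NUMA enabled, as the comparison implies, and that both runs used the same 4-DIMM configuration.

```python
def to_seconds(hms: str) -> int:
    """Parse an 'h:mm:ss' duration (hours may exceed 24) into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total: int) -> str:
    """Format seconds back as 'h:mm:ss'."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


NUMA_ON = to_seconds("47:53:47")   # 4x13 run, NUMA enabled
NUMA_OFF = to_seconds("38:00:48")  # 4x13 run, NUMA disabled

if __name__ == "__main__":
    saved = NUMA_ON - NUMA_OFF
    print(f"NUMA off saves {to_hms(saved)} ({saved / NUMA_ON:.1%} of the NUMA-on time); "
          f"NUMA on takes {NUMA_ON / NUMA_OFF:.2f}x as long")
```

That works out to roughly 9 hours 53 minutes saved, about a fifth of the NUMA-on time.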

    There are notably more warnings with a single-channel setup (1 DIMM per CPU) than with a dual-channel setup.

    I have a rough idea of how NUMA memory access works, but no detailed information about how it's organized in Ivy Bridge-EP systems.
    Does non-uniform memory access apply to the internal access paths within one CPU, or to cross-CPU access, e.g. from the cores of CPU1 to memory managed by CPU2?
    Why do these delays occur when NUMA is enabled?
    In which cases is it useful to enable NUMA mode, and what kinds of tasks profit from it?
    When is it useful to disable it?
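On a dual-socket Xeon E5 board each CPU has its own integrated memory controllers, so "non-uniform" mainly refers to the cross-socket case: a core reaching memory attached to the other CPU goes over the QPI link. The toy model below contrasts two address layouts. With NUMA enabled, each socket's memory is assumed to form one contiguous block; with NUMA disabled, the BIOS is assumed to interleave consecutive chunks across both sockets. The 64 GiB-per-socket split and the 4 KiB interleave granularity are hypothetical, and this is not how MemTest86 assigns address ranges to CPUs.

```python
GIB = 1 << 30
SOCKETS = 2
PER_SOCKET = 64 * GIB   # hypothetical: 2 x 32 GB DIMMs behind each CPU
INTERLEAVE = 4 * 1024   # hypothetical interleave granularity with NUMA disabled


def home_socket_numa(addr: int) -> int:
    """NUMA enabled (assumed): the first PER_SOCKET bytes live on socket 0, the next on socket 1."""
    return addr // PER_SOCKET


def home_socket_interleaved(addr: int) -> int:
    """NUMA disabled (assumed): consecutive INTERLEAVE-sized chunks alternate between sockets."""
    return (addr // INTERLEAVE) % SOCKETS


def remote_fraction(cpu_socket: int, start: int, length: int, home_socket) -> float:
    """Share of INTERLEAVE-sized chunks in [start, start + length) homed on another socket."""
    chunks = range(start, start + length, INTERLEAVE)
    remote = sum(1 for addr in chunks if home_socket(addr) != cpu_socket)
    return remote / len(chunks)


if __name__ == "__main__":
    block = 64 * 1024 * 1024  # a 64 MiB block at the start of socket 1's half
    for name, mapper in (("NUMA enabled", home_socket_numa),
                         ("NUMA disabled", home_socket_interleaved)):
        share = remote_fraction(0, PER_SOCKET, block, mapper)
        print(f"{name}: core on socket 0 testing that block -> {share:.0%} remote accesses")
```

In the contiguous layout a core can end up working entirely on the other socket's memory, while interleaving spreads every core's traffic evenly over both sockets. This is only a plausible contributor to the slower runs, not a confirmed explanation.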

    Thanks for reading ^ ^
    Last edited by Magnus; Jan-26-2022, 03:09 AM.

  • #2
    but I wonder why so much information and so many features show as n/a
    Data in SMBIOS is often wrong or incomplete. Blame the BIOS firmware.

    First subtopic of interest is this entry under SPD Details:
    Registered: No
    I looked up the Samsung spec sheet for that module.
    They say it isn't registered RAM. So I guess it isn't.
    (a Load-Reduced DIMM isn't the same as a Registered DIMM)
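For DDR3, the module type is stored in the low nibble of SPD byte 3. The sketch below decodes it using a subset of the JEDEC DDR3 SPD table as I understand it (0x1 RDIMM, 0x2 UDIMM, 0x3 SO-DIMM, 0xB LRDIMM; please verify against the JEDEC spec). The hypothetical `naive_is_registered` check shows how a decoder that treats only RDIMMs as "registered" would report an LRDIMM as not registered. This is not MemTest86's code, and it is only one possible reason for the "Registered: No" and "Module Type: Reserved" lines.

```python
# Subset of DDR3 SPD byte 3 (module type, bits 3..0); values assumed from the JEDEC DDR3 SPD spec.
DDR3_MODULE_TYPES = {
    0x1: "RDIMM",
    0x2: "UDIMM",
    0x3: "SO-DIMM",
    0xB: "LRDIMM",
}


def module_type(spd_byte3: int) -> str:
    """Decode the module type from the low nibble; unknown codes fall back to a generic label."""
    return DDR3_MODULE_TYPES.get(spd_byte3 & 0x0F, "unknown/other")


def naive_is_registered(spd_byte3: int) -> bool:
    """Hypothetical rule: only a plain RDIMM counts as 'registered'."""
    return module_type(spd_byte3) == "RDIMM"


if __name__ == "__main__":
    for code in (0x01, 0x02, 0x0B):
        print(f"byte3=0x{code:02X}: {module_type(code)}, registered={naive_is_registered(code)}")
```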

    I don't know how the registered or unbuffered state would be reflected in the logs
    This should make no difference to the tests, as this low-level hardware feature is hidden from software.

    We have done no real testing of NUMA modes with MemTest86.
    But some years ago we did look at NUMA performance in Windows, and different CPU cores and different BIOS settings can definitely impact performance.
    It probably makes sense to use the same mode for testing as you would use when the server is in production.

    WARNING - waited for 10s for CPU #32 to finish (BSP test time = 22814ms)
    This warning means the CPU core was slow to signal the end of a memory test it was running.
    It probably means some cores were running faster than others, so by itself it is nothing to worry about.
    "BSP test time" was the time taken by the main thread to finish the test. So some cores were 22sec behind the main thread in the above example.
