MemTest to find potential corruption of 64 bytes


  • MemTest to find potential corruption of 64 bytes

    Hi,

    We have a strange problem with an x86 server running a PostgreSQL database:
    • 192 cores (384 vCPUs), Intel Xeon Platinum 8260 CPU @ 2.40GHz
    • 8 processors in total, 24 cores each
    • NUMA is on, 8 NUMA nodes
    • 4 TB of ECC RAM
    • Enterprise storage connected by FC
    • During business hours CPU utilization is around 50-70%, with around 200 concurrent SQL queries.
    Over the last few months we found 2 data corruptions in WAL files (DB journals), thanks to the CRC of the WAL records. We managed to identify the original bytes and the exact place of the corruption. The pattern of corruption is the following:
    • Aligned 64 bytes
    • The values of the corrupted bytes look like data from another region of memory
    MCE logs are clean, and nothing is suspicious in the other OS/database log files.
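
    The pattern described above (damage confined to one aligned 64-byte block) can be checked mechanically once the original and the corrupted bytes are both known. The following is only a minimal sketch: the function name is hypothetical, and it assumes the two buffers start at a cache-line-aligned address and that the cache line is 64 bytes.

```python
CACHE_LINE = 64  # bytes; assumed cache-line size, matching the post


def damaged_cache_lines(expected: bytes, actual: bytes, line: int = CACHE_LINE):
    """Return the start offsets of every aligned `line`-byte block that differs.

    Assumes offset 0 of both buffers is cache-line aligned.
    """
    if len(expected) != len(actual):
        raise ValueError("buffers must have the same length")
    return [
        off
        for off in range(0, len(expected), line)
        if expected[off:off + line] != actual[off:off + line]
    ]


# Illustration with made-up data: an 8192-byte page with one line overwritten.
page = bytes(range(256)) * 32
bad = bytearray(page)
bad[1280:1344] = b"\xAA" * 64          # overwrite exactly one aligned line
print(damaged_cache_lines(page, bytes(bad)))  # [1280]
```

    A single offset in the result means the damage fits in one cache line. Two adjacent offsets would mean a 64-byte run that is not aligned, which would point away from a cache-line-sized transfer.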

    The source code looks fine. Simplified, it is:
    Code:
    buf = get_region_from_shmem(SIZE)
    memcpy(buf,page,SIZE)
    write(fd, buf, SIZE)
    This code has stood the test of time, since it is the core of the PostgreSQL database. That is why we suppose the data was corrupted by hardware somewhere in the memcpy or write code path. The size of the corruption equals the size of a CPU cache line, so the corruption may be correlated with CPU-memory transactions.
    It is worth mentioning that "page" and "buf" are each around 8 KB and live in shared memory (512 GB). The shared memory uses only 4 KB pages (no huge pages).

    Now we want to find the root cause and fix it. We scanned the memory with MemTest Free edition, but all tests completed successfully. During testing we observed that not all CPU cores were used, only 256. This is worrying because the error may occur only under particular conditions: a particular CPU core/thread, a particular memory bank, possibly an ECC-corrected error, or a CPU cache miss.
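
    One way to exercise every logical CPU from within the running OS is to pin a copy-and-compare loop to each CPU in turn. The sketch below is an illustration only: the function name and parameters are hypothetical, it relies on `os.sched_setaffinity` (available on Linux only), and a copy driven from Python is far lighter than a real memory stress test, so a clean result proves little.

```python
import os


def copy_check_on_cpu(cpu: int, size: int = 1 << 20, rounds: int = 4, line: int = 64):
    """Pin this process to one logical CPU, copy random buffers, and return
    the offsets of any 64-byte blocks that differ after the copy."""
    previous = os.sched_getaffinity(0)      # remember the current CPU mask
    os.sched_setaffinity(0, {cpu})          # run only on the requested CPU
    try:
        bad_lines = []
        for _ in range(rounds):
            src = os.urandom(size)
            dst = bytearray(size)
            dst[:] = src                    # bulk copy, similar in spirit to memcpy
            if dst != src:
                bad_lines += [
                    off
                    for off in range(0, size, line)
                    if src[off:off + line] != dst[off:off + line]
                ]
        return bad_lines
    finally:
        os.sched_setaffinity(0, previous)   # restore the original CPU mask
```

    Calling it for each CPU in `sorted(os.sched_getaffinity(0))` covers all logical CPUs the process is allowed to use, including the hyperthreads.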

    On the other hand, MemTest Pro edition provides more tests:
    • ECC error injection
    • New 64-bit/SIMD tests
    Can you please tell us:
    • Does the "ECC error injection" test work on the Intel Xeon Platinum 8260 CPU?
    • Can MemTest Pro edition utilize all 192 cores/384 vCPUs during a test? If not, is there any plan to support this in the future?
    • Can you recommend any test to identify the cause, or give us any idea about the cause? I'm sure you have great experience and have probably faced a similar case (64-byte corruption).
    Thanks!

  • #2
    Does "ECC error injection" test work on Intel Xeon Platinum 8260 CPU?
    The CPUs that are supported for ECC injection are listed at the bottom of this page:
    https://www.memtest86.com/compare.html
    So no, the Xeon Platinum 8260 isn't supported for ECC injection. But in the medium term we are working on a new hardware-based solution that will work with all DDR4 RAM (around April 2021).

    Can MemTest Pro edition utilize all 192 cores/384 vCPUs during a test?
    No. We stopped using hyperthreading cores because it hurt performance and increased testing time while finding no additional errors. The problem was that doubling the number of cores (by using hyperthreading) doesn't increase the memory bandwidth.
    However, there is a setting in the MemTest configuration file to force hyperthreading on (ENABLEHT).

    I think there is still an overall limit of 256 cores, however. MemTest86 was mainly focused on testing the memory and not the CPU cores.

    and probably faced similar case
    Actually no. We don't have the budget to buy a Platinum 8260 system with 4 TB of ECC RAM.

    The values of the corrupted bytes look like data from another region of memory
    If that is true, it almost sounds like a software bug.
    It would be strange to have 64 bytes of corrupt memory without the ECC function noticing any of the corruption, unless the ECC reporting function wasn't working. Or, if the corrupt journals were on disk, maybe it was a disk failure.
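
    If the foreign 64 bytes really come from another region of memory, locating where they originally lived could help tell a software bug from a hardware fault. A minimal sketch follows, assuming a saved copy of candidate data is available as bytes (for example other WAL segments or a snapshot of the shared buffers); `find_chunk` and the sample data are hypothetical.

```python
def find_chunk(haystack: bytes, chunk: bytes, line: int = 64):
    """Return (offset, is_line_aligned) for every occurrence of `chunk`."""
    hits = []
    pos = haystack.find(chunk)
    while pos != -1:
        hits.append((pos, pos % line == 0))
        pos = haystack.find(chunk, pos + 1)   # continue after this match
    return hits


# Illustration with made-up data: the foreign bytes sit at offset 512.
foreign = bytes(range(64, 128))
dump = bytes(512) + foreign + bytes(448)
print(find_chunk(dump, foreign))  # [(512, True)]
```

    A match that is itself 64-byte aligned fits a cache line ending up at the wrong address. A match at an arbitrary offset would look more like a stray software copy.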


    • #3
      Thank you for the quick answer!

      We will continue digging...
