Hi,
We have a strange problem with one x86 server running a PostgreSQL database:
- 192 cores (384 vCPU), Intel Xeon Platinum 8260 CPU @ 2.40GHz
- 1 processor = 24 cores, 8 processors in total
- NUMA is on, 8 NUMA nodes
- 4 TB of ECC RAM
- Enterprise storage connected by FC
- During business hours CPU utilization is around 50-70%, with around 200 concurrent SQL queries
The data corruption we see has these properties:
- 64 bytes in size
- Aligned to 64 bytes
- Values of the corrupted bytes look like data from another region of memory
The source code looks fine and is roughly:
Code:
buf = get_region_from_shmem(SIZE)
memcpy(buf, page, SIZE)
write(fd, buf, SIZE)
This code has stood the test of time, because it is part of the core of PostgreSQL. That is why we suspect that the data is being corrupted by hardware somewhere in the memcpy or write code path. The size of the corruption is the same as the size of a CPU cache line, so the corruption may be related to CPU-memory transactions.
It is worth mentioning that "page" and "buf" are each about 8 KB in size and live in shared memory (512 GB in total). The shared memory uses only 4 KB pages (no huge pages).
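To make the "64 bytes, aligned to 64 bytes" observation checkable, the copy of "page" into "buf" could be compared one cache line at a time. The following is only a rough sketch: `count_bad_lines` and `CACHE_LINE` are hypothetical names that are not part of PostgreSQL, and the 64-byte line size is an assumption about this CPU.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/* Hypothetical helper: compare two buffers in CACHE_LINE-sized chunks.
 * Prints the offset of every chunk that differs and returns how many
 * chunks differ. Trailing bytes beyond the last full chunk are ignored. */
size_t count_bad_lines(const unsigned char *expected,
                       const unsigned char *actual, size_t size)
{
    size_t bad = 0;

    for (size_t off = 0; off + CACHE_LINE <= size; off += CACHE_LINE) {
        if (memcmp(expected + off, actual + off, CACHE_LINE) != 0) {
            fprintf(stderr, "mismatch in 64-byte line at offset %zu\n", off);
            bad++;
        }
    }
    return bad;
}
```

Calling something like `count_bad_lines(page, buf, SIZE)` right after the memcpy, and again on data read back from the file, might help narrow down whether the damage appears during the copy or later in the write path.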
What we want now is to find the root cause and fix it. We scanned the memory with MemTest Free edition, and all tests completed successfully. During testing we noticed that not all CPU cores were used, only 256. This is worrying, because the error may occur only under particular conditions: a particular CPU core/thread, a particular memory bank, possibly an ECC-corrected error, or a CPU cache miss.
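Since the fault may depend on a particular core, one idea is to repeat a copy-and-compare check while pinned to each CPU in turn. The sketch below assumes Linux with glibc; `copy_check_on_cpu` and `LINE` are hypothetical names, and this is in no way a replacement for a real memory tester (a compiler may also optimize such a simple check, so a serious test would need more care).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* needed for sched_setaffinity() and CPU_SET() on glibc */
#endif
#include <sched.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LINE 64  /* assumed cache-line size in bytes */

/* Hypothetical helper: pin the calling thread to `cpu`, copy a patterned
 * buffer of `size` bytes, and count the 64-byte lines that differ after
 * the copy. Returns the number of bad lines, or -1 if pinning or
 * allocation fails. The thread stays pinned to `cpu` afterwards. */
long copy_check_on_cpu(int cpu, size_t size)
{
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;

    uint64_t *src = malloc(size);
    uint64_t *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1;
    }

    /* Position-dependent pattern, so misplaced data is distinguishable. */
    size_t words = size / sizeof(uint64_t);
    size_t bytes = words * sizeof(uint64_t);
    for (size_t i = 0; i < words; i++)
        src[i] = 0x9E3779B97F4A7C15ULL * (i + 1);

    memcpy(dst, src, bytes);

    long bad = 0;
    for (size_t off = 0; off + LINE <= bytes; off += LINE) {
        if (memcmp((const char *)src + off, (const char *)dst + off, LINE) != 0)
            bad++;
    }

    free(src);
    free(dst);
    return bad;
}
```

A driver could call this for every CPU number from 0 up to the online CPU count and log any CPU for which the result is greater than zero.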
On the other hand, MemTest Pro edition provides more tests:
- ECC error injection
- New 64-bit/SIMD tests
Questions:
- Does the "ECC error injection" test work on the Intel Xeon Platinum 8260 CPU?
- Can MemTest Pro edition utilize all 192 cores/384 vCPU during a test? If not, is there any plan to support this in the future?
- Can you recommend any test to identify the cause, or share any idea about what the cause might be? I'm sure you have a lot of experience and have probably faced a similar case (64-byte corruption).