Need your support to understand test results -> unstable system reboots erratically

  • #1

    Hi all

    my system (i7-3770K 4C/8T, 4x8 GB RAM) recently started to fail and reboot erratically. I have a hypervisor running on the system with a couple of VMs. It _seems_ the system is stable as long as only the hypervisor runs, but when I start VMs the system runs into a condition where it crashes after some time (sometimes a couple of hours, sometimes a day). I therefore concluded it _could_ be related to growing memory use (normally the system in total, with all the VMs, uses a bit less than 16 GB). So I was thinking it could be a bad memory stick, where the faulty address range is hit while a VM is dynamically increasing the memory it needs.

    As context: the CPU was slightly overclocked (with water cooling) and the RAM was run at the XMP 2.0 configuration.

    So I ran memtest86 8.3, which took about half a day to complete and showed a couple of thousand errors.

    Since this is the first time I'm using memtest, I'm not sure how to interpret the test results, mainly because they do not seem to show random errors but rather a kind of systematic error.

    Code:
    2020-02-26 20:11:00 - *** TEST SESSION - 2020-02-26 20:11:00 ***
    Code:
    2020-02-26 20:11:02 - [MEM ERROR - Data] Test: 1, CPU: 0, Address: 1DE6DDB8, Expected: 000000001DE6DDB8, Actual: 000000[B]8[/B]01DE6DDB8
    2020-02-26 20:11:02 - [MEM ERROR - Data] Test: 1, CPU: 0, Address: 243A1DB8, Expected: 00000000243A1DB8, Actual: 000000[B]8[/B]0243A1DB8
    2020-02-26 20:11:02 - [MEM ERROR - Data] Test: 1, CPU: 0, Address: 2466DDB8, Expected: 000000002466DDB8, Actual: 000000[B]8[/B]02466DDB8
    2020-02-26 20:11:02 - [MEM ERROR - Data] Test: 1, CPU: 0, Address: 2466E9B8, Expected: 000000002466E9B8, Actual: 000000[B]8[/B]02466E9B8
    2020-02-26 20:11:02 - [MEM ERROR - Data] Test: 1, CPU: 0, Address: 25DC2938, Expected: 0000000025DC2938, Actual: 000000[B]8[/B]025DC2938
    Code:
    2020-02-26 20:11:08 - Running test #2 (Test 2 [Address test, own address])
    2020-02-26 20:11:08 - MtSupportRunAllTests - Setting random seed to 0x50415353
    2020-02-26 20:11:08 - MtSupportRunAllTests - Start time: 8071 ms
    2020-02-26 20:11:08 - ReadMemoryRanges - Available Pages = 8302249
    2020-02-26 20:11:08 - MtSupportRunAllTests - Enabling memory cache for test
    2020-02-26 20:11:08 - MtSupportRunAllTests - Enabling memory cache complete
    2020-02-26 20:11:09 - Start memory range test (0x0 - 0x81F600000)
    2020-02-26 20:11:09 - Pre-allocating memory ranges >=16MB first...
    2020-02-26 20:11:09 - All memory ranges successfully locked
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 14D46DB8, Expected: 0000000014D46DB8, Actual: 000000[B]8[/B]014D46DB8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 6, Address: 189825B8, Expected: 00000000189825B8, Actual: 000000[B]8[/B]0189825B8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 6, Address: 18F65DB8, Expected: 0000000018F65DB8, Actual: 000000[B]8[/B]018F65DB8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 0, Address: 1D08A5B8, Expected: 000000001D08A5B8, Actual: 000000[B]8[/B]01D08A5B8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 6, Address: 195C0778, Expected: 00000000195C0778, Actual: 000000[B]8[/B]0195C0778
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 16F65DB8, Expected: 0000000016F65DB8, Actual: 000000[B]8[/B]016F65DB8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 16FE1538, Expected: 0000000016FE1538, Actual: 000000[B]8[/B]016FE1538
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 1700F5B8, Expected: 000000001700F5B8, Actual: 000000[B]8[/B]01700F5B8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 0, Address: 1D44DF78, Expected: 000000001D44DF78, Actual: 000000[B]8[/B]01D44DF78
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 6, Address: 1ABA1DB8, Expected: 000000001ABA1DB8, Actual: 000000[B]8[/B]01ABA1DB8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 6, Address: 1BD45DB8, Expected: 000000001BD45DB8, Actual: 000000[B]8[/B]01BD45DB8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 242AA9B8, Expected: 00000000242AA9B8, Actual: 000000[B]8[/B]0242AA9B8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 2, Address: 202ABFF8, Expected: 00000000202ABFF8, Actual: 000000[B]8[/B]0202ABFF8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 24326938, Expected: 0000000024326938, Actual: 000000[B]8[/B]024326938
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 2, Address: 20325D38, Expected: 0000000020325D38, Actual: 000000[B]8[/B]020325D38
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 24B26D38, Expected: 0000000024B26D38, Actual: 000000[B]8[/B]024B26D38
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 25325D38, Expected: 0000000025325D38, Actual: 000000[B]8[/B]025325D38
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 2, Address: 21B26938, Expected: 0000000021B26938, Actual: 000000[B]8[/B]021B26938
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 25BA3FF8, Expected: 0000000025BA3FF8, Actual: 000000[B]8[/B]025BA3FF8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 2, Address: 21E6EDB8, Expected: 0000000021E6EDB8, Actual: 000000[B]8[/B]021E6EDB8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 25E6DDB8, Expected: 0000000025E6DDB8, Actual: 000000[B]8[/B]025E6DDB8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 2, Address: 22326D38, Expected: 0000000022326D38, Actual: 000000[B]8[/B]022326D38
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 262A9F78, Expected: 00000000262A9F78, Actual: 000000[B]8[/B]0262A9F78
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 2, Address: 223A2DB8, Expected: 00000000223A2DB8, Actual: 000000[B]8[/B]0223A2DB8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 27105D38, Expected: 0000000027105D38, Actual: 000000[B]8[/B]027105D38
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 2766FFF8, Expected: 000000002766FFF8, Actual: 000000[B]8[/B]02766FFF8
    2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 2, Address: 23D447F8, Expected: 0000000023D447F8, Actual: 000000[B]8[/B]023D447F8
    This goes on like this a couple of thousand times in the following tests.
    What I can see is that the error seems to be systematic: in this case the hex digit at position 10 (from the right) always seems to be 8 instead of the expected 0, regardless of the CPU used (so for all cores 0, 2, 4, 6).
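    A quick way to check this over a whole log is to XOR the expected and actual values and tally which bits differ. The following is only a minimal Python sketch: the regular expression is modelled on the error lines quoted above, the function name `flipped_bit_counts` is my own and not part of MemTest86, and the `[B]...[/B]` markers are stripped because they are just forum highlighting.

```python
import re
from collections import Counter

# Pattern modelled on the error lines shown in the log excerpts above.
ERROR_RE = re.compile(
    r"Test: (\d+), CPU: (\d+), Address: ([0-9A-F]+), "
    r"Expected: ([0-9A-F]+), Actual: ([0-9A-F]+)"
)

def flipped_bit_counts(log_lines):
    """Count how often each set of flipped bit positions occurs (bit 0 = LSB)."""
    counts = Counter()
    for line in log_lines:
        # Remove the [B]...[/B] forum markup used to highlight the wrong digit.
        match = ERROR_RE.search(line.replace("[B]", "").replace("[/B]", ""))
        if match is None:
            continue  # not an error line
        expected = int(match.group(4), 16)
        actual = int(match.group(5), 16)
        diff = expected ^ actual  # 1-bits mark the positions that differ
        bits = tuple(b for b in range(diff.bit_length()) if (diff >> b) & 1)
        counts[bits] += 1
    return counts

# Three lines copied from the excerpts above (two errors, one info line).
sample = [
    "2020-02-26 20:11:02 - [MEM ERROR - Data] Test: 1, CPU: 0, Address: 1DE6DDB8, "
    "Expected: 000000001DE6DDB8, Actual: 000000[B]8[/B]01DE6DDB8",
    "2020-02-26 20:11:09 - [MEM ERROR - Data] Test: 2, CPU: 4, Address: 14D46DB8, "
    "Expected: 0000000014D46DB8, Actual: 000000[B]8[/B]014D46DB8",
    "2020-02-26 20:11:09 - All memory ranges successfully locked",
]
print(flipped_bit_counts(sample))  # Counter({(39,): 2})
```

    For these two sample lines the only differing bit is bit 39 of the 64-bit value, which is the "8" in hex position 10 from the right.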

    Later on, with the next test, it looks similar:
    Code:
    2020-02-26 20:11:17 - Running test #3 (Test 3 [Moving inversions, ones & zeroes])
    2020-02-26 20:11:17 - MtSupportRunAllTests - Setting random seed to 0x50415353
    2020-02-26 20:11:17 - MtSupportRunAllTests - Start time: 16184 ms
    2020-02-26 20:11:17 - ReadMemoryRanges - Available Pages = 8302249
    2020-02-26 20:11:17 - MtSupportRunAllTests - Enabling memory cache for test
    2020-02-26 20:11:17 - MtSupportRunAllTests - Enabling memory cache complete
    2020-02-26 20:11:17 - Start memory range test (0x0 - 0x81F600000)
    2020-02-26 20:11:17 - Pre-allocating memory ranges >=16MB first...
    2020-02-26 20:11:17 - All memory ranges successfully locked
    2020-02-26 20:11:17 - [MEM ERROR - Data] Test: 3, CPU: 2, Address: 1807FC, Expected: FFFFFFFF, Actual: FFFFFF[B]7[/B]F
    2020-02-26 20:11:17 - [MEM ERROR - Data] Test: 3, CPU: 4, Address: 5C153C, Expected: FFFFFFFF, Actual: FFFFFF[B]7[/B]F
    2020-02-26 20:11:17 - [MEM ERROR - Data] Test: 3, CPU: 2, Address: 1821BC, Expected: FFFFFFFF, Actual: FFFFFF[B]7[/B]F
    2020-02-26 20:11:17 - [MEM ERROR - Data] Test: 3, CPU: 2, Address: 104F7C, Expected: FFFFFFFF, Actual: FFFFFF[B]7[/B]F
    2020-02-26 20:11:17 - [MEM ERROR - Data] Test: 3, CPU: 6, Address: AABFFC, Expected: FFFFFFFF, Actual: FFFFFF[B]7[/B]F
    2020-02-26 20:11:17 - [MEM ERROR - Data] Test: 3, CPU: 4, Address: 767FFC, Expected: FFFFFFFF, Actual: FFFFFF[B]7[/B]F
    2020-02-26 20:11:17 - [MEM ERROR - Data] Test: 3, CPU: 2, Address: 44CFFC, Expected: FFFFFFFF, Actual: FFFFFF[B]7[/B]F
    2020-02-26 20:11:17 - [MEM ERROR - Data] Test: 3, CPU: 4, Address: 5C293C, Expected: FFFFFFFF, Actual: FFFFFF[B]7[/B]F
    2020-02-26 20:11:17 - [MEM ERROR - Data] Test: 3, CPU: 4, Address: 66DDBC, Expected: FFFFFFFF, Actual: FFFFFF[B]7[/B]F
    Here, it is always the second-to-last digit that changes from F to 7.
    The interesting thing is that the difference is 8 in both examples (from F to 7 in hex here, and from 0 to 8 in the first example). This makes me even more suspicious whether this really is an error in the RAM, as it looks too systematic for a random error.
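    The two patterns can be compared more directly. The sketch below is only my own reasoning under stated assumptions: a little-endian x86 system, where a 32-bit word at an address ending in 4 or C is the upper half of its 8-byte-aligned 64-bit word, and a 64-byte block size (the usual x86 cache-line size). The function names are hypothetical helpers, and the addresses are copied from the excerpts above.

```python
def qword_bit(address: int, bit_in_dword: int) -> int:
    """Map a bit of a 32-bit word at `address` to its position in the
    enclosing 8-byte-aligned 64-bit word (little-endian assumption)."""
    byte_offset = address & 0x7  # 0 for the lower half, 4 for the upper half
    return byte_offset * 8 + bit_in_dword

def offset_in_block(address: int, block_size: int = 64) -> int:
    """Offset of the enclosing 64-bit word inside a block_size-aligned block."""
    return (address & ~0x7) & (block_size - 1)

# Tests 1 and 2: 64-bit values, one example pair from the log.
bit_tests_1_2 = (0x000000001DE6DDB8 ^ 0x000000801DE6DDB8).bit_length() - 1

# Test 3: 32-bit values, FFFFFFFF expected, FFFFFF7F read back.
bit_in_dword = (0xFFFFFFFF ^ 0xFFFFFF7F).bit_length() - 1

test3_addresses = [0x1807FC, 0x5C153C, 0x1821BC, 0x104F7C, 0xAABFFC,
                   0x767FFC, 0x44CFFC, 0x5C293C, 0x66DDBC]
test12_addresses = [0x1DE6DDB8, 0x14D46DB8, 0x195C0778, 0x202ABFF8, 0x16FE1538]

print("tests 1/2 flipped bit:", bit_tests_1_2)
for addr in test3_addresses:
    print(hex(addr), "qword bit", qword_bit(addr, bit_in_dword),
          "block offset", hex(offset_in_block(addr)))
for addr in test12_addresses:
    print(hex(addr), "block offset", hex(offset_in_block(addr)))
```

    Under these assumptions, bit 7 of the 32-bit words in test 3 corresponds to bit 39 of the enclosing 64-bit word, the same bit that differs in tests 1 and 2, and all the addresses used here fall at offset 0x38 within their 64-byte block. That would fit the impression that the error is systematic rather than random, although it does not by itself say which component is at fault.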

    It would be nice if you would share your expertise: how does this look to you?
    Thanks a lot for your support!

    PS: If needed, I can provide the full test log.

  • #2
    You don't need to run MemTest86 for hours if you are getting errors.
    For a server even 1 error is bad. Getting a few is very bad. Getting 1000s is still very bad. So you can pretty much stop testing after the 1st error, to speed things up.

    Yes, initially assume it is bad RAM. See:
    https://www.memtest86.com/troubleshooting.htm

    Maybe you should be running a system with ECC RAM if you have lots of VMs.


