
Understanding seemingly random # of errors


  • Understanding seemingly random # of errors

    Whilst my laptop was doing Windows Updates on Thursday night I experienced a BSOD related to "MEMORY_MANAGEMENT" (I also had one that was "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED" soon afterwards). Upon looking into the potential causes of the first, it was suggested that there may be a hardware RAM fault. I also had the same MEMORY_MANAGEMENT error on Saturday afternoon.

    I downloaded Memtest86 v8.1 Free and let it do 2 passes (out of the 4 it defaulted to), and each pass reported a widely different number of errors on the tests.

    The first pass, running on 4 CPUs (Intel i7 6700k), had a total of 766 cumulative errors, most of which were on Test 6 followed by Test 7. After the second pass the number of errors was up to 3220, almost half of which were on Test 6 (1573).

    I decided to run Test 6 alone for 1 pass (on 4 CPUs) and it produced a different number of errors. I then ran tests 5-7 again with 4 CPUs for 1 pass and got 361 errors, and ran Test 6 only on 1 CPU for 1 pass and got 300 errors. Each "error" is only one bit being flipped, so for example a 2 at the end of the data becomes an F when read back.
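    As a rough sketch of how to check how many bits differ between the expected and actual values that Memtest86 reports, the two values can be XORed and the set bits listed. The values below are hypothetical, not taken from the real test log. Note that if only the last hex digit changed from 2 to F, that would be three differing bits (0010 versus 1111), so it is worth comparing the full values.

```python
def flipped_bits(expected: int, actual: int) -> list[int]:
    """Return the bit positions (0 = least significant) where the two values differ."""
    diff = expected ^ actual  # XOR leaves a 1 wherever the bits disagree
    return [pos for pos in range(diff.bit_length()) if (diff >> pos) & 1]


# Hypothetical expected/actual pair that differs in a single bit.
print(flipped_bits(0xFFFFFFF2, 0xFFFFFFF6))  # [2]

# A last hex digit going from 2 to F differs in three bits.
print(flipped_bits(0x2, 0xF))  # [0, 2, 3]
```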

    After running Memtest86 last night I also had "BAD_SYSTEM_CONFIG_INFO", which suggested that the SYSTEM Registry Hive was either corrupt (unlikely according to MS) or the memory image of it was corrupt, further suggesting a RAM issue.

    I recently upgraded from 32GB of Samsung 2133MHz M471A2K43BB1-CPB (two 16GB modules). I wanted 64GB of RAM but was unable to find anywhere to purchase the same RAM without it costing a ridiculous amount, so I decided to purchase 64GB of Kingston HyperX 2400MHz HX424S14IB/16 for about the same cost as 32GB of the Samsung RAM imported from another country. Note that when I purchased the Kingston RAM it was cheaper to purchase 4 individual modules rather than a specific "64GB Kit"; all 4 modules were manufactured in Week 3 of 2019.

    My laptop is an XMG U716, which is based on the Clevo P775DM1(-G). It has an Intel i7-6700k (using the 100 series chipset I believe) and a GTX980 8GB, with a Samsung 951 NVMe drive for Windows 10. The CPU is slightly overclocked to 4.3GHz on one core and 4.2GHz on 2+ cores. Temperatures don't get ridiculous, although I will admit they're warmer than I'd like (this is a desktop CPU in a laptop); Memtest86 reported a maximum CPU temp of 71C. The BIOS doesn't give me an option to change the speed without modifying all the timings manually (I could change the ratio down from 9 to 8 but ideally I'd like to have the proper timings for that rather than leaving them at 14-14-14-35 etc). The default DIMM profile is apparently 2400MHz and XMP1 is exactly the same.
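    For the timings question, a minimal sketch of the arithmetic may help. It assumes that 14-14-14-35 reads as tCL-tRCD-tRP-tRAS (the usual order) and that "2400MHz" and "2133MHz" are data rates in MT/s, so one clock cycle lasts 2000 / data rate nanoseconds. The results are only the arithmetic minimum cycle counts that preserve the same delay in nanoseconds. They are not the values stored in the module's SPD, which may be looser.

```python
def cycles_to_ns(cycles: int, data_rate_mts: int) -> float:
    """Convert a timing in clock cycles to nanoseconds at a given DDR data rate."""
    # DDR transfers twice per clock, so one clock cycle lasts 2000 / data_rate ns.
    return cycles * 2000 / data_rate_mts


def min_cycles_at(cycles: int, old_rate_mts: int, new_rate_mts: int) -> int:
    """Smallest cycle count at new_rate that is at least as long (in ns) as cycles at old_rate."""
    # Integer ceiling of cycles * new_rate / old_rate, avoiding float rounding.
    return -(-cycles * new_rate_mts // old_rate_mts)


# Assumed reading of the 14-14-14-35 timings mentioned above.
timings_2400 = {"tCL": 14, "tRCD": 14, "tRP": 14, "tRAS": 35}

for name, cycles in timings_2400.items():
    ns = cycles_to_ns(cycles, 2400)
    print(f"{name}: {cycles} cycles at 2400 = {ns:.2f} ns, "
          f"at least {min_cycles_at(cycles, 2400, 2133)} cycles at 2133")
```

    Leaving the timings at 14-14-14-35 while lowering the speed keeps every delay longer in nanoseconds than it was at 2400, so under these assumptions it is the looser (safer) choice rather than a tighter one.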

    Can anyone explain why I seem to be getting different numbers of errors when running the same tests on the same RAM? Surely if this was a RAM issue then the results would always be the same? But if it was a Memory controller issue then more than one bit would be flipped and it would affect all RAM, not just two sticks (judging from the lowest & highest memory addresses being above 32768MB and below 65536MB)?

    Also will the Pro version let me save a list of ALL the errors that occur rather than just the last 10? That way I could compare the tests to see if the errors appear in the same addresses.
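    If a full list of error addresses can be exported, comparing two runs could be sketched as below. The addresses are hypothetical placeholders and the export format is not assumed; the lists would have to be filled in from whatever the report actually contains.

```python
def compare_runs(run_a, run_b):
    """Split the error addresses of two runs into shared and run-specific groups."""
    a, b = set(run_a), set(run_b)
    return {
        "both": sorted(a & b),    # addresses that failed in both runs
        "only_a": sorted(a - b),  # failed only in the first run
        "only_b": sorted(b - a),  # failed only in the second run
    }


# Hypothetical error addresses (in bytes) from two separate runs.
run_1 = [0x881234560, 0x9A0000F40, 0xE12345678]
run_2 = [0x881234560, 0xB00000100]

result = compare_runs(run_1, run_2)
for group, addrs in result.items():
    print(group, [hex(addr) for addr in addrs])
```

    A large "both" group would point towards specific weak cells, while mostly disjoint groups would fit a marginal, timing-dependent fault better.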

  • #2
    RAM errors typically mean the RAM is bad.

    Pass 1 is a quicker pass than Pass 2, so that might account for some of the differences. Also, the RAM is likely marginal (running at its limit), so tiny differences in EMI, temperature, existing state and timings can result in different behaviour.

    You can't look at memory addresses and know which stick is which address. The mapping is much more complex.
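    To illustrate why, here is a toy sketch contrasting the naive "each stick owns one contiguous 16GB range" assumption with a simple interleaved layout. The interleaving scheme (consecutive 64-byte lines rotating across four modules) is purely hypothetical and is not the real mapping used by the memory controller, which is more complex.

```python
MB = 1024 * 1024
STICK_SIZE = 16 * 1024 * MB  # four 16GB modules, as in this thread
NUM_STICKS = 4


def naive_stick(addr: int) -> int:
    """Naive assumption: each module owns one contiguous 16GB address range."""
    return addr // STICK_SIZE


def interleaved_stick(addr: int, line_bytes: int = 64) -> int:
    """Hypothetical interleave: consecutive 64-byte lines rotate across the modules."""
    return (addr // line_bytes) % NUM_STICKS


addr = 34848 * MB  # roughly the lowest failing address mentioned in this thread
for offset in (0, 64, 128, 192):
    a = addr + offset
    print(hex(a), "naive:", naive_stick(a), "interleaved:", interleaved_stick(a))
```

    Under the naive model all four neighbouring addresses belong to the same module, while under the toy interleaved model each one lands on a different module, so an address range alone says little about which stick is involved.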

    See also
    https://www.memtest86.com/troubleshooting.htm

    In the Pro release you can edit the configuration file to adjust the REPORTNUMERRS setting to display up to the last 5000 errors. See,
    https://www.memtest86.com/technical.htm#config




    • #3
      Could it be another point of failure? For example overheating CPU/Cache/Memory Controller, or a memory controller not able to handle RAM at 2400MHz as it was designed for 2133MHz?

      It's relatively easy to pull out one or two sticks of RAM as half are under the bottom cover, but the other two are under the keyboard. However, I'm somewhat concerned that it could be a dual-channel-only issue, or an issue present only when all 4 sticks of RAM are installed.

      Given that the "lowest memory address" and "highest memory address" for faults are shown in MB and those are above 32768MB and below 65536MB (reported somewhere around 34848MB and 61xxxMB IIRC), does this mean that two sticks are faulty? I understand that mapping Addresses on Intel platforms to physical sticks is technically impossible without Intel revealing their proprietary mapping, so it's trial and error to find out which sticks are faulty I guess, unless it's an issue only present with dual channel/4 sticks.

      What other SPD Information does the Pro version display/save to file? Unfortunately my Laptop only lets me set all the timings manually (including tRAS etc etc) rather than giving me a "speed" to select, although I noticed "ratio" so I could turn that down, but it's somewhat pointless having 2400MHz RAM that can't run that fast because it throws errors imo.



      • #4
        does this mean that two sticks are faulty
        No. You can't look at memory addresses and know which stick is which address.


        Could it be another point of failure?
        It is possible. But RAM errors typically mean the RAM is bad.

        IMHO if you want a stable system, give up on overclocking the CPU & RAM unless you have a huge amount of time on your hands and stability is a secondary priority.



        • #5
          I was more referring to the fact that Memtest86 says the lowest address was ~34848MB (but definitely higher than 32768MB) and the highest was about 61xxxMB, which, given that each stick is 16GB in size, would surely indicate that two sticks were faulty? Unless it's entirely possible that the addresses for 34848MB and 61XXXMB are on the same physical stick? If so, that makes finding the culprit slightly more difficult, but not impossible.

          The RAM itself isn't overclocked: the default speed of the RAM is 2400MHz, but it can (according to Kingston) run at 2133MHz. The i7 6700k is rated for up to 2133MHz RAM but I've seen people running it with RAM at 3000MHz (on desktops obviously, not laptops). The only reason I bought 2400MHz RAM was because it was significantly cheaper than the 2133MHz HyperX RAM (like half the price). I've had the CPU overclocked for several months with the original 32GB Samsung 2133MHz RAM the laptop came with installed and had no stability issues. I've only had these BSODs since Thursday night, and I've only had the new Kingston RAM for about a month and a half now.

          What you said originally about changes in Temperature is something my Dad had thought of also, as the temperature has been quite mild recently and I haven't actually used it so much recently to notice any issues.

          I'll test removing one module at a time tonight and do some quick tests to see if there's a stage where it shows no errors and I'll know I've found the culprit, then I guess if I still get errors then I'll work up to two sticks and see if there's no reported errors and go from there.



          • #6
            After some thorough checking of the RAM modules last night, individually and even in different slots (e.g. one module gave me 6 errors on test 6 in RAM_4, so I swapped it into RAM_1 and it threw a few hundred across tests 5-7), it turns out that it just doesn't like running at 2400MHz, and Memtest86 shows no errors whatsoever when the timings are set for 2133MHz. I'm not sure why the number of errors varies so much, but that's something to figure out another time.

            When I get more time I will remove the partial overclock on the CPU and try the RAM again at 2400MHz and see if Memtest86 shows any errors.



            • #7
              If you are running RAM at 2400MHz on a CPU rated for 2133MHz, it should not be all that surprising that it isn't stable.

              If the 6700K was always stable at this speed for all the CPUs that were made, then I am sure they would have sold it with higher specs. It's the silicon lottery.

