ECC Errors only on Cold Boot


  • ECC Errors only on Cold Boot

    Hello,

    I have a problem running MemTest86 on my new motherboard, an Asus WS C246 PRO with 4 sticks of ECC memory.
    When I cold boot and load MemTest86 directly from USB, I get a lot of "ECC Error Detected" messages but no actual memory errors. When I warm reboot into MemTest86, it runs fine for hours with no errors.
    I've noticed a difference in the memory mapping, as you can see in the attached screenshots.

    Do the descriptions "Loader code" and "Loader data" refer to MemTest86 code or to something else?

    Is this a BIOS bug in the mapping, or a bug in the program that fails to map the zone correctly?
    The 3 affected blocks are the same size in the good and bad cases, but they should not be.

    thanks.

    Mirko


  • #2
    I'd like to add:
    I've also tried restricting the starting memory address to exclude the affected zone, down to a very small interval at the very end of the free memory space. Same problem.
    So the ECC errors are triggered not by the memory test itself but by something else, and are reported by MemTest86 when it polls the ECC interface.
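
The reasoning in this post can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyController`, `run_pass`, the address `0x1000`, and the "background" reader are all hypothetical and do not describe how MemTest86 or the Intel memory controller actually work. It shows why a tester that polls a shared error log would report the same error no matter which address range it was told to test.

```python
from dataclasses import dataclass, field


@dataclass
class ToyController:
    """Toy memory controller: logs an ECC event on any read of a 'bad' address."""
    bad_addresses: set
    error_log: list = field(default_factory=list)

    def read(self, address, source):
        # The log does not care who issued the read.
        if address in self.bad_addresses:
            self.error_log.append((address, source))

    def poll(self):
        """Return and clear logged events, like a tester polling error registers."""
        events, self.error_log = self.error_log, []
        return events


def run_pass(controller, start, end, step=0x1000):
    """One test pass over [start, end), plus one unrelated background read."""
    for address in range(start, end, step):
        controller.read(address, "test")
    # Hypothetical firmware or device activity outside the tested range.
    controller.read(0x1000, "background")
    return controller.poll()


controller = ToyController(bad_addresses={0x1000})
wide = run_pass(controller, 0x100000, 0x200000)
narrow = run_pass(controller, 0x1F0000, 0x200000)
print(wide)
print(narrow)
```

In this model both passes report the same event at the same address, even though neither tested range contains it.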



    • #3
      This isn't the first time we have seen a difference between cold boot and warm boot behaviour. See these posts:
      https://www.passmark.com/forum/memte...t-on-cold-boot
      https://www.passmark.com/forum/memte...-memory-errors

      But those posts don't exactly match your situation.

      We are a bit busy today, but if we get time we'll come back and have a deeper look. It could well be a BIOS bug, however.



      • #4
        Hi,

        I know ECC details are secret, and I don't know whether you get help with the ECC code in MemTest86 under an NDA, so I don't know if you can share at least some information about how ECC polling works in MemTest86.
        In my case the ECC errors do not appear to be triggered by MemTest86: the program reports no errors itself, and the ECC errors are reported with the same coordinates whatever memory address range the program is set to check.

        Is it possible for I/O-mapped memory to be written bypassing the ECC logic?
        MemTest86 simply checks the ECC polling interface (is it in the BIOS or in the CPU itself?) and reports whatever the interface returns, doesn't it? So the errors could be triggered by some other read or write.
        On a cold boot the entire memory is initialized by the ECC code, but on a warm reboot it is not? Or is it never initialized, and programs are not supposed to read from a memory address that has not been written before?

        I see in the release notes for version 8.4 that you changed the memory management logic of MemTest86. Is it possible to download version 8.3 or 8.2 of the program to try?
        I'm using a Coffee Lake Refresh CPU (i3-9100), so at least version 8.0 is needed to support the ECC logic.

        thanks in advance for any support.

        regards



        • #5
          Looks like my case doesn't deserve any consideration.
          OK. Bye.



          • #6
            Unfortunately the UEFI memory map and ECC handling are complex, and their functionality is largely kept secret by Intel.
            Most of the bad behaviour is caused by BIOS bugs or undocumented changes by Intel in a new CPU family.

            So it is easy to spend hours or days investigating problems with individual PCs (often with little hope of a resolution as the problems are largely out of our control). And sometimes we just don't have the time to do this for free. Did you try contacting the people who sold you the hardware?



            • #7
              The curious thing about this problem is that there are 2 ECC errors every time the program completes a full scan of the memory addresses, whatever the restricted range. Tests that scan memory in a single full pass, like Tests 0 and 1, report only 2 errors (or maybe that is the maximum the polling routine can report in that time). Tests that make multiple passes report multiple pairs of errors. The fade test reports a lot of errors during its sleep time, so maybe the polling routine keeps running and continually picks up errors. The addresses should be very low (rank and bank are often 0,0), but with scrambling active, who knows which addresses they actually are?
              If some UEFI routine is actually accessing memory concurrently with MemTest86, why should that produce ECC errors? (That routine certainly doesn't overwrite the memory MemTest86 uses, otherwise the MemTest86 tests would report errors.)
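
The "maximum the polling routine can report" idea can be sketched as follows. Everything here is an assumption made for illustration: `LatchedErrorLog`, the 2-slot capacity, and polling once per pass are hypothetical and are not taken from MemTest86 or Intel documentation. The sketch only shows that a fixed-size log polled once per pass would give a count that scales with the number of passes rather than with the number of underlying events.

```python
class LatchedErrorLog:
    """Hypothetical error log with a fixed number of slots (an assumption)."""

    def __init__(self, slots):
        self.slots = slots
        self.entries = []
        self.dropped = 0

    def record(self, address):
        if len(self.entries) < self.slots:
            self.entries.append(address)
        else:
            self.dropped += 1  # overflow: the poller never sees this event

    def poll(self):
        """Return the latched entries and clear the log."""
        entries, self.entries = self.entries, []
        return entries


def errors_reported(passes, events_per_pass, slots=2):
    """Total errors a tester would report when it polls once per pass."""
    log = LatchedErrorLog(slots)
    total = 0
    for _ in range(passes):
        for i in range(events_per_pass):
            log.record(0x1000 + i)
        total += len(log.poll())
    return total


print(errors_reported(passes=1, events_per_pass=50))
print(errors_reported(passes=4, events_per_pass=50))
```

With these assumed numbers, a single-pass test reports 2 errors and a four-pass test reports 8, however many events occur in between.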

              The only possible explanation is that the memory was written before the ECC logic was activated, so the ECC code for that location is invalid and generates spurious errors every time it is accessed. Elapsed time is irrelevant: even 5 hours after boot, the behaviour is the same.
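
The hypothesis above can be shown with a toy model. This is a sketch only: `ToyEccMemory` and `check_bits` are hypothetical, and the check code is a simple XOR of bytes standing in for the SECDED code that real ECC memory uses. It shows a word written while ECC is off keeping stale check bits, so every later read flags an error although the data itself is intact, until the word is rewritten with ECC on.

```python
def check_bits(word):
    """Toy 8-bit check code: XOR of the eight bytes of a 64-bit word.

    Real ECC memory uses a SECDED code; this stand-in only detects mismatches.
    """
    code = 0
    for shift in range(0, 64, 8):
        code ^= (word >> shift) & 0xFF
    return code


class ToyEccMemory:
    def __init__(self, size):
        self.data = [0] * size
        self.check = [0] * size
        self.ecc_enabled = False

    def write(self, index, word):
        self.data[index] = word
        if self.ecc_enabled:
            self.check[index] = check_bits(word)  # check bits follow the data

    def read(self, index):
        """Return (word, ecc_error_flag)."""
        word = self.data[index]
        error = self.ecc_enabled and self.check[index] != check_bits(word)
        return word, error


mem = ToyEccMemory(4)
mem.write(0, 0xDEADBEEF)          # written before ECC is switched on
mem.ecc_enabled = True
mem.write(1, 0xDEADBEEF)          # written after ECC is switched on

stale_reads = [mem.read(0)[1] for _ in range(3)]
clean_read = mem.read(1)[1]
mem.write(0, mem.read(0)[0])      # rewriting the word regenerates its check bits
after_rewrite = mem.read(0)[1]
print(stale_reads, clean_read, after_rewrite)
```

In this model the stale location keeps raising errors on every access regardless of elapsed time, which matches the behaviour described, and a full write pass with ECC enabled clears it.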

