Test (test 0 and test 1) ECC error only after cold boot (not after repeat/reset)


  • Test (test 0 and test 1) ECC error only after cold boot (not after repeat/reset)

    Hi,

    my configuration:
    kontron d3644-b,
    Kingston 32GB 3200MT/s DDR4 ECC CL22 DIMM 2Rx8 Hynix C - KSM32ED8/32HC
    i3-9100

    MemTest86 shows a few ECC errors, only in test 0 and test 1 (compare the report below), and only after a cold boot. If I repeat the test without restarting the program, no errors appear. After a reset and a program restart from the USB stick, no errors appear either. However, if the computer is switched off and started again, the ECC errors reappear.

    The behavior does not depend on whether I insert 1 or 2 RAM modules, or on the slot in which the modules are inserted.

    It is noticeable that the ECC errors are only shown in tests 0 and 1. According to the explanation of Memtest (https://www.memtest86.com/tech_indiv...est-descr.html), only one CPU core is used in these tests, which differs from the other tests. Test 2 seems to do the same as test 1 but the testing is done using multiple cores.

    My conclusion: The ram modules are ok. But there seems to be a bios bug that causes the strange behavior. I can not completely rule out that memtest86 is the cause.

    It seems that there have been repeated complaints about problems with memtest after cold boot:
    https://forums.passmark.com/memtest8...ld-boot​


    Side note / possible bug report: There seems to be a small inconsistency between the generated report and the on-screen display / manual. On the screen and in the manual the tests are numbered 0-13; in the generated report the numbering seems to start from 1. So test 0 on the screen would correspond to "Test: 1" in the report. Is that correct?

    Best regards,
    Torsten


    Summary
    Report Date 2023-08-18 09:09:39
    Generated by MemTest86 V10.6 Free (64-bit)
    Visit MemTest86.com to Upgrade to Pro
    Result INCOMPLETE PASS
    System Information
    EFI Specifications 2.70
    System
    Manufacturer FUJITSU
    Product Name
    Version
    Serial Number
    BIOS
    Vendor FUJITSU // American Megatrends Inc.
    Version V5.0.0.13 R1.28.0 for D3644-B1x
    Release Date 12/01/2022
    Baseboard
    Manufacturer FUJITSU
    Product Name D3644-B1
    CPU Type Intel Core i3-9100 @ 3.60GHz
    CPU Clock 3692 MHz [Turbo: 4102.6 MHz]
    # Logical Processors 4
    L1 Cache 4 x 64K (212227 MB/s)
    L2 Cache 4 x 256K (90814 MB/s)
    L3 Cache 6144K (53806 MB/s)
    Memory 32516M (14151 MB/s)
    RAM Configuration DDR4 ECC 2400MT/s / 1.200V
    Number of RAM SPDs detected 1
    SPD #0 32GB DDR4 2Rx8 ECC PC4-25600
    Vendor Part Info Kingston / 9965745-039.A00G /
    JEDEC Profile 3200MT/s 22-22-22-52 1.2V
    Number of RAM slots 4
    Number of RAM modules 1
    DIMM A1 32GB DDR4 2Rx8 ECC PC4-25600
    Vendor Part Info Kingston / 9965745-039.A00G /
    SMBIOS Profile 3200MT/s 1.2V
    DIMM A2 Empty slot
    DIMM B1 Empty slot
    DIMM B2 Empty slot
    Result summary
    Test Start Time 2023-08-18 09:07:15
    Elapsed Time 0:02:17
    Memory Range Tested 0x0 - 86C800000 (34504MB)
    CPU Selection Mode Parallel (All CPUs)
    CPU Temperature Min/Max/Ave 49C/55C/52C
    ECC Polling Enabled
    # Tests Completed 4/48 (8%)
    # Tests Passed 4/4 (100%)
    ECC Correctable Errors 7
    ECC Uncorrectable Errors 0
    Test # Tests Passed Errors
    Test 0 [Address test, walking ones, 1 CPU] 1/1 (100%) 0
    Test 1 [Address test, own address, 1 CPU] 1/1 (100%) 0
    Test 2 [Address test, own address] 1/1 (100%) 0
    Test 3 [Moving inversions, ones & zeroes] 1/1 (100%) 0
    Test 4 [Moving inversions, 8-bit pattern] 0/0 (0%) 0
    Test 5 [Moving inversions, random pattern] 0/0 (0%) 0
    Test 6 [Block move, 64-byte blocks] 0/0 (0%) 0
    Test 7 [Moving inversions, 32-bit pattern] 0/0 (0%) 0
    Test 8 [Random number sequence] 0/0 (0%) 0
    Test 9 [Modulo 20, ones & zeros] 0/0 (0%) 0
    Test 10 [Bit fade test, 2 patterns, 1 CPU] 0/0 (0%) 0
    Test 13 [Hammer test] 0/0 (0%) 0
    Last 10 Errors
    2023-08-18 09:07:31 - [ECC Errors] Test: 1, (Channel,Slot,Rank,Bank,Row,Col): (0,0,0,0,1D400,, ECC Corrected: Yes, Syndrome: 0000, Channel-Slot: 0-0
    2023-08-18 09:07:28 - [ECC Errors] Test: 1, (Channel,Slot,Rank,Bank,Row,Col): (0,0,0,0,19300,, ECC Corrected: Yes, Syndrome: 0000, Channel-Slot: 0-0
    2023-08-18 09:07:26 - [ECC Errors] Test: 1, (Channel,Slot,Rank,Bank,Row,Col): (0,0,0,0,15200,, ECC Corrected: Yes, Syndrome: 0000, Channel-Slot: 0-0
    2023-08-18 09:07:24 - [ECC Errors] Test: 1, (Channel,Slot,Rank,Bank,Row,Col): (0,0,0,0,11100,10), ECC Corrected: Yes, Syndrome: 0000, Channel-Slot: 0-0
    2023-08-18 09:07:22 - [ECC Errors] Test: 1, (Channel,Slot,Rank,Bank,Row,Col): (0,0,0,0,10000,, ECC Corrected: Yes, Syndrome: 0000, Channel-Slot: 0-0
    2023-08-18 09:07:19 - [ECC Errors] Test: 1, (Channel,Slot,Rank,Bank,Row,Col): (0,0,0,0,1D300,0), ECC Corrected: Yes, Syndrome: 0000, Channel-Slot: 0-0
    2023-08-18 09:07:17 - [ECC Errors] Test: 0, (Channel,Slot,Rank,Bank,Row,Col): (0,0,0,0,10000,0), ECC Corrected: Yes, Syndrome: 0000, Channel-Slot: 0-0
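To compare runs such as the cold-boot and reset cases, the "[ECC Errors]" lines of a report can be tallied per test. The following is only a minimal sketch: the function names are made up for illustration, the regular expression is derived from the lines pasted above rather than from any documented report format, and treating the row and column values as hexadecimal is an assumption. The column field is optional because some pasted lines have lost it.

```python
import re
from collections import Counter

# Pattern derived from the pasted report lines (not an official format).
# The column group may be empty because some lines above are truncated.
ECC_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - \[ECC Errors\] "
    r"Test: (?P<test>\d+), "
    r"\(Channel,Slot,Rank,Bank,Row,Col\): "
    r"\((?P<chan>\d+),(?P<slot>\d+),(?P<rank>\d+),(?P<bank>\d+),"
    r"(?P<row>[0-9A-Fa-f]+),(?P<col>[0-9A-Fa-f]*)"
)


def parse_ecc_errors(report_text):
    """Return one dict per ECC error line found in the report text."""
    errors = []
    for line in report_text.splitlines():
        m = ECC_LINE.search(line)
        if m is None:
            continue  # not an ECC error line
        errors.append({
            "time": m["ts"],
            "test": int(m["test"]),
            "row": int(m["row"], 16),  # assumption: row is hexadecimal
            "col": int(m["col"], 16) if m["col"] else None,
        })
    return errors


def errors_per_test(errors):
    """Count ECC errors per test number."""
    return Counter(e["test"] for e in errors)


# Three lines copied from the report above (the last one has no column value).
SAMPLE = """\
2023-08-18 09:07:17 - [ECC Errors] Test: 0, (Channel,Slot,Rank,Bank,Row,Col): (0,0,0,0,10000,0), ECC Corrected: Yes, Syndrome: 0000, Channel-Slot: 0-0
2023-08-18 09:07:19 - [ECC Errors] Test: 1, (Channel,Slot,Rank,Bank,Row,Col): (0,0,0,0,1D300,0), ECC Corrected: Yes, Syndrome: 0000, Channel-Slot: 0-0
2023-08-18 09:07:31 - [ECC Errors] Test: 1, (Channel,Slot,Rank,Bank,Row,Col): (0,0,0,0,1D400,, ECC Corrected: Yes, Syndrome: 0000, Channel-Slot: 0-0
"""

if __name__ == "__main__":
    for test, count in sorted(errors_per_test(parse_ecc_errors(SAMPLE)).items()):
        print(f"Test {test}: {count} corrected ECC error(s)")
```

Running the same tally on a cold-boot report and on a report taken after a reset makes the difference between the two cases easy to see at a glance.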

  • #2
    For the ECC errors on cold boot:
    There is likely a bug in your EDAC/BIOS. Your ECC RAM is OK, but it was not initialized properly by the BIOS on boot. To initialize ECC, memory has to be written before it can be used. Usually the BIOS does this, but on some motherboards the step is skipped if "Quick Boot" is enabled. Possible solution: if your system allows it, try disabling Quick Boot in the BIOS; some of the error messages should then disappear. The boot process may take 30-60 seconds longer, but the EDAC error messages disappear because the BIOS performs a RAM check when booting.
    BIOS already known to experience this issue: KONTRON AMI BIOS
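The statement that memory "has to be written before it can be used" can be illustrated with a toy model. This is a sketch under strong assumptions, not a description of a real memory controller: a single parity bit stands in for the real ECC code, the class and method names are invented, and partial-word writes are ignored. At power-on the data and check bits hold unrelated values, so a read before the first write can report an error; after a write they agree, and a warm reset keeps that state.

```python
import random


def parity(value):
    """Parity of an 8-bit value (stand-in for a real ECC check code)."""
    return bin(value).count("1") & 1


class ToyEccMemory:
    """Toy model: data bytes plus one independent check bit per byte."""

    def __init__(self, size, seed=0):
        self._rng = random.Random(seed)
        self.size = size
        self.errors = 0
        self.cold_boot()

    def cold_boot(self):
        # Power-on: data and check bits start with unrelated random values,
        # as if the BIOS had skipped the initializing write.
        self.data = [self._rng.getrandbits(8) for _ in range(self.size)]
        self.check = [self._rng.getrandbits(1) for _ in range(self.size)]
        self.errors = 0

    def warm_reset(self):
        # Reset button: memory contents survive, only the counter is cleared.
        self.errors = 0

    def write(self, addr, value):
        # A write stores data and a matching check bit together.
        self.data[addr] = value & 0xFF
        self.check[addr] = parity(self.data[addr])

    def read(self, addr):
        # A read reports an error if data and check bit disagree.
        if parity(self.data[addr]) != self.check[addr]:
            self.errors += 1
        return self.data[addr]


if __name__ == "__main__":
    mem = ToyEccMemory(1024)
    for addr in range(mem.size):
        mem.read(addr)
    print("errors when reading before any write:", mem.errors)

    for addr in range(mem.size):
        mem.write(addr, 0)
    mem.warm_reset()
    for addr in range(mem.size):
        mem.read(addr)
    print("errors after writing everything and a warm reset:", mem.errors)
```

In this model only `cold_boot` brings the mismatched state back, which mirrors the observation that the errors return after power-off but not after a reset.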


    So test 0 on the screen corresponds to test: 1 in the report.
    No, I don't think this is the case. There is a Test 0 result in the results you pasted above.



    • #3
      Hi,

      thank you for your reply. I agree that the RAM is OK.

      And it seems that the BIOS has a bug.
      Unfortunately, the Kontron AMI BIOS has no option to disable Quick Boot, so I cannot check whether that helps.

      To narrow it down, I ran the following tests:

      1. Running memtest86 after cold boot
      [Screenshot: mem21.png]
      Shows correctable ECC errors in test 0 and test 1 (but not in tests 2-13).

      2. Running memtest86 after initial run (full 4 passes) and reset by reset button

      [Screenshot: mem22.png]
      Shows no errors.

      3. Running memtest86 after cold start and reset during POST (A2 shown on the boot screen)
      Same as 1 above

      4. Running memtest86 after cold start and reset during start page of memtest86
      Same as 1 above

      5. Running memtest86 after cold start and reset during test 0 (and showing first ecc error)
      Same as 1 above

      6. Running memtest86 after cold start and reset during test 1 (and showing first ecc error in test 1)
      Same as 1 above
      ​​
      7. Running memtest86 after cold start and reset during test 2
      Same as 2 above


      Conclusion: It seems that test 2 somehow initializes the ECC RAM properly, so that a rerun of memtest86 after a reset shows no ECC errors (the board somehow seems to remember that ECC has been initialized until power-off).


      However, there is something that might contradict this thesis:
      If memtest86 runs several passes, it shows the identical ECC errors in each pass (in test 0 and test 1). That should not happen if the first run of test 2 initializes the RAM properly; test 0 and test 1 of passes 2-4 should then not show the errors again. Unless some function after test 2 de-initializes the ECC RAM back to its original state, so that tests 0 and 1 find the ECC RAM in the same state as after a cold boot.

      Does this make sense?

      Best regards

      (Sorry for the confusion regarding my report of a non-existent inconsistency between screen and report in the original post. I double-checked it and you are right. I cannot reproduce what led me to the wrong assumption.)



      ​​​



      • #4
        Further Tests:

        8. Cold boot, then configure memtest86 (only tests 2 and 3 active, all other tests deactivated)
        [Screenshot: mem23.png]

        This time memtest86 finds 7 ECC RAM errors in test 2 and none in test 3 (in every pass).


        9. Cold boot, then configure memtest86 (only tests 3 and 4 active, all other tests deactivated)
        [Screenshot: mem243.png]

        This time memtest86 finds 23 ECC RAM errors in test 3 and none in test 4 (in every pass).



        Conclusion:

        It seems that whichever test runs first initializes the ECC RAM, and the next test then finds no errors [with one exception: 1 error in test 0 and 6 errors in test 1]. Now I understand your sentence ("In order to initialize ECC, memory has to be written before it can be used.") better.

        From that I would conclude that even if Kontron does not update its BIOS or add an option to deactivate Quick Boot, I do not have to worry. It seems that ECC RAM is initialized the first time it is written to, which can happen at boot time or later. An ECC error may occur on that first write; on the second write, no error occurs.

        Therefore the operating system might log a few corrected ECC errors after a cold boot, and that's it. Right?

        Best regards



        • #5
          Therefore the operating system might log a few corrected ECC errors after a cold boot, and that's it. Right?
          Yes, that is our understanding of the issue. But it depends on what comes first in the boot process: writing to the RAM, or reading errors from the memory controller.

          Ultimately it is a BIOS bug however and the vendor (Kontron / Fujitsu) should fix it.
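Whether the operating system actually logs corrected errors after a cold boot can be checked from the OS side. The following is a minimal sketch assuming a Linux system with an EDAC driver loaded for the memory controller, which the thread does not state; the function name is made up. It reads the per-controller `ce_count` and `ue_count` files that EDAC exposes under `/sys/devices/system/edac/mc`, and returns an empty result if that directory is absent.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location (assumes an EDAC driver is loaded).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"corrected": n, "uncorrected": n}}.

    Returns an empty dict if the EDAC directory does not exist.
    """
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"corrected": ce, "uncorrected": ue}
    return counts


# Example use after a cold boot:
#     for name, c in read_edac_counts().items():
#         print(name, c["corrected"], c["uncorrected"])
```

Comparing the counters right after a cold boot with those after a warm reboot would show whether the corrected errors seen in MemTest86 also reach the operating system.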



          • #6
            Originally posted by David (PassMark):

            That is our understanding of the issue yes. But it depends on what comes first in the boot process. Writing to the RAM, or reading errors from the memory controller.

            Ultimately it is a BIOS bug however and the vendor (Kontron / Fujitsu) should fix it.
            Hi,

            thank you for your assessment.

            I repeated the test with an i3-8100 (user datacollector wrote over at https://forums.unraid.net/topic/1362...fehlung-d3644/ that he had no issues with this CPU), but with the same results. So the BIOS bug does not seem to depend on the CPU.

            I pointed the bug out to Kontron, but they did not seem interested (even though the board has an extended lifetime) and referred me to the retailer, as if the retailer were able to do a BIOS update.

            Best regards




            • #7
              I pointed the bug out to Kontron. But they seemed to be not interested
              We are not surprised.
              It is nearly impossible to get vendors to fix their motherboard bugs.
