SPD not detected - Samsung M393A2K43BB1-CRC 2400 MT/s 16GB (ECC)

  • SPD not detected - Samsung M393A2K43BB1-CRC 2400 MT/s 16GB (ECC)

    I have 4 x 16GB Samsung ECC memory modules (M393A2K43BB1-CRC) installed in an ASRockRack C3758D4I-4L (https://www.asrockrack.com/general/p...Specifications).

    After upgrading the motherboard BIOS I am now getting "SPD not detected" on 3 of the 4 modules.

    MemTest also fails when using multiple CPU cores to test in parallel (I added my motherboard to the post here: https://forums.passmark.com/memtest8...election-modes). Also, when testing all four modules together in single-CPU mode (all modules inserted and tested in one pass), I get a lot of ECC errors, but when testing a single module with the other three removed from the board, all MemTest tests pass with zero bit errors.

    The BIOS upgrade was suggested by ASRockRack to fix these errors and to fix the parallel CPU MemTest issue. No joy.

    Anyone got any ideas? I am running out...of ideas and time...

    Before upgrade:
    [Image: image.png]
    After upgrade:
    [Image: image.png]

  • #2
    I shut down and removed 3 of the 4 RAM sticks. As expected, no "SPD not detected" error:
    [Image: image.png]
    Can anyone explain why this might be happening?

    This PassMark article on SPD issues states:
    there may be a BIOS option to disable hardware TSOD polling such as "Memory Thermal Throttling" or "Closed Loop Thermal Throttling"
    Should I try this?

    • #3
      For basic memory testing it doesn't really matter if the SPD data is found or not.

      Even before the BIOS upgrade, it looks like it failed to read the SPD from one of the modules. So there might be some random aspect to it, e.g. contention on the SMBus. The SMBus was never designed for multiple tasks accessing it at the same time. So yes, if there is any other task reading the same bus (like memory thermal throttling or temperature readings), turn it off.

      If you are getting large numbers of ECC errors and you are pretty sure they aren't real, that might also be a BIOS bug. We found a small number of motherboards where the ECC RAM was not initialized properly by the BIOS on boot. In order to initialize ECC, memory has to be written before it can be used. Usually this is done by the BIOS, but with some motherboards this step is skipped if "Quick Boot" is enabled. So turn off Quick Boot if doing ECC testing.
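
To illustrate why ECC memory has to be written before it is read, here is a toy Python model. It is only a sketch under loud assumptions: the 8-bit XOR check code stands in for the real SECDED code used on DIMMs, and `ToyEccMemory`, `check_bits` and `scrub` are made-up names, not anything from MemTest86 or a BIOS. At power-on the data and check bits are unrelated random values, so almost every read looks like an ECC error until each word has been written once.

```python
import random


def check_bits(data: int) -> int:
    """Toy 8-bit check code for a 64-bit word: XOR of its eight bytes.

    Assumption: real ECC DIMMs use a SECDED code; this is a stand-in.
    """
    code = 0
    for i in range(8):
        code ^= (data >> (8 * i)) & 0xFF
    return code


class ToyEccMemory:
    """Hypothetical model of ECC RAM with separate data and check storage."""

    def __init__(self, words: int, seed: int = 0):
        rng = random.Random(seed)
        # Power-on state: data and check bits are unrelated random values.
        self.data = [rng.getrandbits(64) for _ in range(words)]
        self.check = [rng.getrandbits(8) for _ in range(words)]

    def write(self, addr: int, value: int) -> None:
        # A write stores the data and a freshly computed check code together.
        self.data[addr] = value
        self.check[addr] = check_bits(value)

    def read_ok(self, addr: int) -> bool:
        # A read recomputes the code and compares it with the stored one.
        return check_bits(self.data[addr]) == self.check[addr]

    def scrub(self) -> None:
        """Write every word once, as firmware is expected to do on a full boot."""
        for addr in range(len(self.data)):
            self.write(addr, 0)


def count_mismatches(mem: ToyEccMemory) -> int:
    return sum(not mem.read_ok(addr) for addr in range(len(mem.data)))


mem = ToyEccMemory(words=1024)
before = count_mismatches(mem)
mem.scrub()
after = count_mismatches(mem)
print(f"mismatches before init: {before}, after init: {after}")
```

In this model, skipping `scrub()` is the analogue of a boot path that skips memory initialization: the errors reported are not caused by faulty cells, only by check bits that were never set.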

      • #4
        Originally posted by David (PassMark):
        For basic memory testing it doesn't really matter if the SPD data is found or not.
        I'll have to take your word for it...but you are a PassMark Forum Administrator, so...probably the best word I'll find on here.

        Originally posted by David (PassMark):
        ...contention on the SMBus. The SMBus was never designed for multiple tasks accessing it at the same time. So yes, if there is any other task reading the same bus (like memory thermal throttling or temperature readings) turn it off.
        I looked into my BIOS - I can't find a setting for thermally throttling memory, only CPU. That option's out.

        Originally posted by David (PassMark):
        If you are getting large numbers of ECC errors and you are pretty sure they aren't real, that might also be a BIOS bug...
        Interesting...After trying MemTest with all CPU cores and all memory modules inserted, I ran a full test on each memory module on its own, using only a single CPU core.
        All 4 modules passed with flying colours (green...). Not a single bit error reported for any of them. I also ran each test in a different slot to test the slots too. No issues.

        Would this be enough to conclude the "ECC errors...aren't real"?

        Originally posted by David (PassMark):
        We found a small number of motherboards where the ECC RAM was not initialized properly by the BIOS on boot. In order to initialize ECC, memory has to be written before it can be used. Usually this is done by BIOS, but with some motherboards this step is skipped if "Quick Boot" is enabled. So turn off quick boot if doing ECC testing.
        Also very interesting...this might be a cause. This looks like what you mean by "Quick Boot":

        DRAM Configuration settings:
        [Image: image.png]
        [Image: image.png]
        I will reverse these and see if I still get ECC errors:
        [Image: image.png]

        • #5
          Dang...doesn't seem to have made a blind bit of difference:
          [Image: image.png]

          • #6
            Also, I don't quite understand why it says
            ...Pass 1 / 4 Errors: 0
            when I see lots of red.

            Does it only count an error that cannot be corrected using ECC?
            And do the red errors above represent errors that were found and then successfully corrected using ECC?


            On the upside, SPD was detected on all four modules...

            • #7
              I think this answers my own question: ECC correctable errors are shown in the final report (this was not a full test, but there are plenty of ECC errors there).
              [Image: image.png]

              • #8
                Does it only count an error that cannot be corrected using ECC?
                Correct. If there was a corrected error, it isn't counted as an error. There was no data corruption and the hardware should be reliable in use (but it still isn't a good sign for the future).

                The ECC errors might be real, however. It would be interesting to see if they also get recorded as errors in the Linux or Windows event log (you would need to place the machine under memory load in the OS to see this).

                There are also good reasons why some errors only occur with multiple RAM sticks.
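
On Linux, one way to check whether the OS is also recording ECC events is to read the EDAC counters in sysfs. The sketch below assumes an EDAC driver for the platform is loaded so that `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` exist; `read_edac_counts` is a made-up helper name. Running it before and after a memory load and comparing the numbers would show whether corrected errors are accumulating.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the Linux EDAC sysfs tree.

    Returns an empty dict if the tree is missing (no EDAC driver, or not Linux).
    """
    base = Path(root)
    counts = {}
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())  # corrected errors
            ue = int((mc / "ue_count").read_text().strip())  # uncorrected errors
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found.")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```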

                • #9
                  Originally posted by David (PassMark):
                  ...isn't a good sign for the future...
                  ...yeah... 8-/
                  Originally posted by David (PassMark):
                  ECC errors might be real however. Would be interesting to see if they also get recorded in Linux or Windows event log as errors (you would need to place the machine under memory load in the O/S to see this).
                  You mean like this?
                  [Image: image.png]
                  Originally posted by David (PassMark):
                  There are also good reasons why some errors only occur with multiple RAM sticks.
                  Some very interesting points in that section.

                  Am I being too paranoid?

                  One thing I did notice: each time I launched a full 12-test, 4-pass cycle (48 tests) with all 4 sticks inserted, the errors started appearing within seconds of the start of the test cycle. They were always ECC errors and they were always corrected. Although there were thousands of them, they were always detected at only 1 or 2 (maybe 3 or 4) memory addresses during each test, and I think the same addresses kept erroring throughout the whole test cycle (though the addresses sometimes changed between tests within a test cycle).
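
The "same few addresses" impression can be checked rather than eyeballed. The sketch below is a minimal example that assumes the errors have been transcribed by hand into `(test_number, address)` pairs; that input format, the helper names and the sample values are all made up for illustration and are not MemTest86's log format. It counts distinct erroring addresses per test and lists addresses that recur across tests.

```python
from collections import Counter, defaultdict


def summarize(errors):
    """Group error counts by test: {test_number: Counter({address: hits})}."""
    per_test = defaultdict(Counter)
    for test, addr in errors:
        per_test[test][addr] += 1
    return per_test


def recurring_addresses(per_test):
    """Return the sorted addresses that show up in more than one test."""
    seen = Counter()
    for counter in per_test.values():
        seen.update(counter.keys())  # count each address once per test
    return sorted(addr for addr, tests in seen.items() if tests > 1)


# Hypothetical sample data, not taken from the screenshots in this thread.
sample = [
    (3, 0x1A2B3C40),
    (3, 0x1A2B3C40),
    (3, 0x2F000080),
    (4, 0x1A2B3C40),
    (5, 0x3C000100),
]

per_test = summarize(sample)
for test in sorted(per_test):
    hits = per_test[test]
    print(f"test {test}: {len(hits)} distinct address(es), {sum(hits.values())} error(s)")

recurring = recurring_addresses(per_test)
print("addresses seen in more than one test:", [hex(a) for a in recurring])
```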
