ECC Error mapping

  • ECC Error mapping

    Hello,
    Some ECC errors are accompanied by a serial number (S/N), like this:
    2022-08-05 21:23:07 - [ECC Error] Test: 8, (Rank,Bank,Row,Col): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 1/1 (S/N: XXXXXX)
    while others are missing it:
    2022-08-05 21:22:36 - [ECC Error] Test: 8, (Rank,Bank,Row,Col): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 4/1
    Is there any way to get additional information to help map an error to a DIMM slot? What are the channel and slot supposed to mean in this context? Is there any way to get the associated CPU?

    thank you,
    Peter

  • #2
    The software has no idea about the physical position of the channels / slots for an arbitrary motherboard. The motherboard manufacturer might have put them in any old random position on the PCB.

    One would hope that the motherboard manual would have the slots numbered / labeled. But it would not surprise me greatly if the numbering scheme in the manual didn't match the numbering scheme used in the CPU's memory controller.

    If you had a bad memory stick that reliably produced errors, then you could move it around in the system and map out the locations of each slot.
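    As a rough aid for that kind of slot-by-slot experiment, the ECC error lines in a log can be tallied per reported channel/slot after each run. The sketch below is only an illustration: the regular expression is derived from the two sample lines quoted in the first post, the helper name `tally_ecc_errors` is made up, and the real MemTest86 log may contain variants it does not cover.

```python
import re
from collections import Counter

# Assumption: this pattern is based only on the sample lines quoted in this
# thread, not on any official description of the MemTest86 log format.
ECC_LINE = re.compile(
    r"\[ECC Error\].*?Channel/Slot:\s*(?P<channel>\d+)/(?P<slot>\d+)"
    r"(?:\s*\(S/N:\s*(?P<serial>[^)]+)\))?"
)


def tally_ecc_errors(lines):
    """Count ECC error lines per (channel, slot, serial or None)."""
    counts = Counter()
    for line in lines:
        match = ECC_LINE.search(line)
        if match is None:
            continue  # not an ECC error line
        serial = match["serial"]
        key = (
            int(match["channel"]),
            int(match["slot"]),
            serial.strip() if serial else None,
        )
        counts[key] += 1
    return counts


# The two example lines from the first post.
sample = [
    "2022-08-05 21:23:07 - [ECC Error] Test: 8, (Rank,Bank,Row,Col): "
    "(N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, "
    "Channel/Slot: 1/1 (S/N: XXXXXX)",
    "2022-08-05 21:22:36 - [ECC Error] Test: 8, (Rank,Bank,Row,Col): "
    "(N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, "
    "Channel/Slot: 4/1",
]

for (channel, slot, serial), count in tally_ecc_errors(sample).items():
    print(f"channel {channel} slot {slot} S/N {serial}: {count} error(s)")
```

    Running this against the log from each run, with the bad stick in a known physical slot, gives one reported channel/slot per physical position to write down.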

    We are working on doing something better for a few specific "reference" motherboards in the future. But that won't help you for the moment.

    It might be possible to output the CPU in use for each error. Let me check on that point.

    • #3
      Thanks, I would appreciate any additional information. I understand the software doesn't have any context about the physical location, but does it know which DIMM slot number, as reported by MemTest, the error occurred on?

      • #4
        2022-08-07 01:34:42 - [ECC Error] Test: 0, (Rank,Bank,Row,Col): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 3/0
        SPD #23 32GB DDR4 ECC PC4-19200
        Kingston / KG27000110-001 / 7281FCBB / Channel: 1 Slot: 2
        17-17-17-39 / 2400 MHz / 1.2V
        The ECC error's channel/slot doesn't seem to relate to the SPD channel/slot, and the DIMM slot entry doesn't have much identifying info:
        DIMM Slot #23 32GB DDR4 ECC PC4-19200
        UNKNOWN / NOT AVAILABLE
        2400 MHz

        • #5
          Attached is the CPU memory channel/slot layout for this particular server. I'm going to try moving a known DIMM with ECC errors across the board to see if I can reliably correlate the reported channel/slot to something on the board. So far I have only seen [0-7]/[0,1] reported for channel/slot. The errors with an S/N have only been in CPU2 Ch3 and Ch4, and they appear in the MemTest ECC error log as channel 0 or 1 and slot 0 or 1.

          [Attached image: screen_shot_2022-07-15_at_11.25.18_am.png]

          • #6
            Can you upload or send a copy of the MemTest86.log under EFI\BOOT\ on the USB flash drive?

            • #7
              Here you go.
              Attached Files

              • #8
                Thanks for the logs.

                It may be possible to map the reported channel/slots to the physical slots on this particular board, but having multiple sticks with ECC errors makes it more challenging.

                But if you are able to isolate the errors to a single module/slot (and grab the logs for these runs) we might be able to do a proper mapping in the code.

                • #9
                  This is a different run with a single channel/slot reporting errors. It is on a different host, but it’s the same model.
                  Attached Files

                  • #10
                    Adding something interesting: I have been able to map the channels MemTest reports to the channels on the board. My issue now is that while slot 0 always maps to the first physical DIMM slot on the board, physical slots 1 and 2 are both reported as slot 1 in MemTest.
                    I attached the outputs of 2 runs where I moved the DIMM with serial "72720779" from slot 2 to slot 1. The SPD info is updated correctly, but the ECC error log reports both as slot 1.
                    I have uploaded both log files. Please note that the file is called slot.log but is actually a gzipped tarball, as the log itself was too big to upload in text format. If I need to get it to you another way, please let me know.
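                    To make this kind of ambiguity easy to spot across many single-stick runs, the observations can be collected as (physical slot, reported channel/slot) pairs and checked for collisions. This is only a sketch: the function name `find_collisions`, the physical labels, and the channel number 3 are hypothetical placeholders, not values taken from the attached logs.

```python
from collections import defaultdict


def find_collisions(observations):
    """Return reported (channel, slot) positions seen from more than one physical slot.

    observations: iterable of (physical_label, (channel, slot)) pairs,
    one per run with a single known-bad DIMM in a known physical slot.
    """
    seen = defaultdict(set)
    for physical, reported in observations:
        seen[reported].add(physical)
    return {
        reported: sorted(physicals)
        for reported, physicals in seen.items()
        if len(physicals) > 1
    }


# Hypothetical observations mirroring the behaviour described above:
# physical slots 1 and 2 both show up as reported slot 1.
# The channel number (3) is invented for illustration.
runs = [
    ("slot0", (3, 0)),
    ("slot1", (3, 1)),
    ("slot2", (3, 1)),
]

print(find_collisions(runs))
```

                    An empty result would mean every physical slot tested so far produced a distinct reported channel/slot.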
                    Attached Files

                    • #11
                      Thanks for the logs. It might be easier if you send the logs directly by e-mail and we can correspond there.

                      Would it be possible to run with just the "72720779" stick by itself, switching between slots 0, 1 and 2?

                      Here is a new build with additional logging that will help us figure out the mapping:
                      https://www.passmark.com/temp/memtes...b-9.5.0028.zip

                      • #12
                        I wrote the supplied image out with balenaEtcher on my Mac, which I used previously for the standard release. Unfortunately this is what I see after the initial MemTest initialization.
                        Attached Files

                        • #13
                          Any update on a new image or what I may have done wrong?

                          • #14
                            We've never heard of balenaEtcher, so we don't know what it does or how it works.
                            There are instructions here for making a bootable MemTest86 USB drive on a Mac.

                            • #15
                              https://www.balena.io/etcher/ is just an app to flash OS images to USB.
                              I’ll try with dd, but this software is what I’ve used to create my USB stick for the stable Pro MemTest version.
                              Could you provide a SHA/MD5 checksum for the debug image so I know I’m burning a valid image?
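                              Once a checksum is provided, the local side of the comparison could look like the sketch below. It uses Python's standard `hashlib`; the helper name `file_digest` and the file name in the usage comment are placeholders, and no checksum has been published in this thread to compare against.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to keep memory use low."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Usage (hypothetical file name; substitute the downloaded image):
#   print(file_digest("memtest86-debug.img"))          # SHA-256
#   print(file_digest("memtest86-debug.img", "md5"))   # MD5
```

                              The printed digest should match the vendor-supplied value exactly; any difference points to a corrupted or incomplete download.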
