ECC support for v4 versions


  • ECC support for v4 versions

    Hi,

    Does v4 have any ECC detection at all? Or would it be able to detect ECC errors as normal errors in any way?

    I am trying to use MemTest86 on an HP SL210t Gen8, but despite the hardware being pretty new (Xeon E5-2600 v2 platform), it is not UEFI based.

    That means I can't use the latest MemTest86 versions, so I would like to know whether it is worth running v4.

    I am asking because I tried version 4 on a server with 256 GB of RAM and a known faulty DIMM which raises alerts in the BMC of the server.
    MemTest86 has been running for 114 hours now but has not reported any errors, although the BMC is reporting that ECC errors are being corrected.

    Attached images: NoError.PNG, BMCErrors.jpg

    Below is a link to the details of the server:
    http://www8.hp.com/emea_middle_east/en/products/proliant-servers/product-detail.html?oid=5407513#!tab=features

  • #2
    No, V4 doesn't support ECC RAM. You can still use the MemTest86 software on such a machine, but it won't report any ECC corrections.

    You need either V5 or V6. And both V5 and V6 need UEFI.

    HP have been using UEFI since 2002/2003 on some machines. I haven't seen any machine released in the last 3 years without it.


    • #3
      That's what I thought.

      There might be HP workstations or laptops that have used UEFI since that time, but HP servers only started to support UEFI with the DL580 Gen8 released in 2014. They said that the Gen9 would support UEFI.

      Source : http://h17007.www1.hp.com/docs/enter...FI_altair.html


      • #4
        Doesn't the BMC tell you which module is defective?
        I am used to this from Dell and Supermicro machines.

        Often this requires a configured IPMI web interface, as the BMC log mostly just contains hex values.

        I have also had quite bad experiences trying to detect memory errors in such machines.
        My guess is that, without multiprocessor support, it's simply not possible to quickly identify 'nearly' stable memory modules.
        They fail way more frequently when running our own software (physics simulation) than under Memtest4.
        Therefore I am not using Memtest4 for our larger systems anymore; it has always been a waste of my time.
        (Example: our software causes the machine to crash within 4 hours; Memtestv4 doesn't report anything (or cause a log event in the BMC) when running over the weekend.)

        I hope that I will be able to detect errors way more reliably with newer Memtest versions on servers.
        It's really a shame how this needs to be handled right now in enterprise environments, but I am impressed by the progress Passmark has made in the last months on this.

        If you want, you can drop me a PM and share some experiences.
        I wonder why I can find so little information on this topic; it's been a big issue for me.
        Sometimes I tend to swap ALL modules if uptime is crucial because of this.
        Last edited by orioon; Feb-03-2015, 01:51 PM.


        • #5
          Actually, yes, we currently rely on the BMC to tell us if there are any errors.

          But on HP hardware the detection is rubbish: it only logs something if an unknown threshold is crossed or if an uncorrectable error is encountered.

          It is even worse on the IPMI interface, as most of the information is either blank or wrong.

          In my case I have been able to trigger ECC errors running Memtest 4, but only the BMC has detected them (see the second screenshot in my first post).

          In a modern setup with multiple CPUs and a lot of DIMMs it is really tricky to isolate the fault, because a memory error doesn't tell you whether it is caused by a bad DIMM, CPU or motherboard.

          Without a proper stress test we often end up replacing all three in one go.


          • #6
            "Memtestv4 doesn't report (or cause a log event in the BMC) when running over the weekend"
            It would be interesting to try V5/V6 if you get the chance, and the machine supports UEFI. It would also be interesting to know if it runs in a single thread or multiple threads.


            • #7
              "But on HP hardware the detection is rubbish: it only logs something if an unknown threshold is crossed or if an uncorrectable error is encountered."
              Same on Dell hardware. Supermicro is much worse in this area.
              Dell shows something like 'Multibit error rate exceeded at DIMM B2 on CPU2'.
              As long as the BMC made a detection, that's actually quite good in my opinion.
              If you have a monitoring system like Nagios you should be informed of this immediately.
              I kind of prefer this, but this happens very rarely.
              We are not doing this for Supermicro systems yet, because it requires dumping the BMC log with a command-line tool and analyzing it, something on my long to-do list...
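
For the "dump the BMC log and analyze it" step, a minimal sketch in Python could look like the following. It assumes the log has already been dumped to text in a pipe-separated layout similar to what `ipmitool sel list` prints. The sample lines, sensor names and the function name `count_ecc_events` are made up for illustration, and real vendor output will differ, so treat this as a starting point rather than a working monitoring check.

```python
from collections import Counter

# Hypothetical sample lines in a pipe-separated layout similar to what
# `ipmitool sel list` prints; real field contents vary by vendor.
SAMPLE_LOG = """\
1 | 02/03/2015 | 10:15:02 | Memory #0x53 | Correctable ECC | Asserted
2 | 02/03/2015 | 10:15:40 | Memory #0x53 | Correctable ECC | Asserted
3 | 02/03/2015 | 11:02:11 | Power Supply #0x61 | Presence detected | Asserted
4 | 02/03/2015 | 12:30:00 | Memory #0x54 | Uncorrectable ECC | Asserted
"""


def count_ecc_events(log_text):
    """Count ECC-related events per (sensor, event) pair in a dumped log."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        sensor, event = fields[3], fields[4]
        if "ECC" in event:
            counts[(sensor, event)] += 1
    return counts


if __name__ == "__main__":
    # Print the noisiest sensors first.
    for (sensor, event), n in count_ecc_events(SAMPLE_LOG).most_common():
        print(f"{sensor}: {event} x{n}")
```

Mapping a sensor ID back to a physical DIMM slot is vendor specific and is not attempted here.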

              "It would be interesting to try V5/V6 if you get the chance, and the machine supports UEFI. It would also be interesting to know if it runs in a single thread or multiple threads."
              I've been testing in this area during the last months, as I am confident that this can be resolved with Memtest V6 in multithreaded mode.
              The older Dell systems had a memory test which was definitely superior to MemtestV4, but they decided to replace it with a kind of 'useless' test on newer systems.
              That's practically the main reason right now why I am trying to help improve the new Memtest builds.
              I also sent you quite a lot of bug reports, which have been taken care of in practically no time. I want to thank the Passmark team, and especially Keith, for that again.

              'Unfortunately' none of the systems which support this have been showing memory issues.
              I suspected memory issues on some systems, but it either turned out to be a CPU error or did not reoccur.
              In my experience, multithreaded mode in V4 is not reliable on MP systems when you give it enough time, but this should have changed with the newer versions.

              If I remember correctly, you got a testing module from Hynix to produce ECC errors.
              Is there any way to get hold of such modules or build them ourselves?
              I really need to keep a pile of modules which are known to cause memory errors, but I usually have to send them back when I receive the replacement module.
              Last edited by orioon; Feb-03-2015, 09:59 PM.


              • #8
                "If I remember correctly you got a testing module from hynix to produce ECC errors.
                Is there any way to get a hand on such modules or build them ourselves?"
                We have this custom RAM module for testing from Team Group. Press the button and you get a flood of 1-bit ECC errors. It was a one-off custom hack job as far as I am aware.



                There are more details in this old post about ECC RAM.

                The problem for us, from a testing point of view, is that we don't have a huge selection of motherboards / CPUs that can support ECC.

                The main changes in V5/V6, compared to V4, which might improve the ability of MemTest86 to provoke an error are:

                • Switching to native 64-bit code and a 64-bit address space. V4 was 32-bit with a kludge called PAE. This matches real-life software better.
                • Multithreading on systems that might not have previously supported it (but there are still some old early UEFI implementations that are single-threaded only).
                • Speed optimisations, which result in faster testing and thus higher load.
                • Addition of tests for row hammer and SIMD instructions.
