Not sure what component has gone bad


  • Not sure what component has gone bad

    A month ago, one of my systems (Precision 3620) started reporting uncorrectable memory errors. I tried different DIMMs, and Dell Diagnostics locked the system up during memory tests. When I switched to non-ECC RAM, the errors went away.

    About a week ago, I ordered new Kingston ECC RAM, installed it, powered the system up, and went into Dell Diagnostics. The system locked up, and on the next boot it told me a correctable memory error had occurred. At this point, I thought the problem was this particular system.

    I brought over a similar system (PowerEdge T30), which didn't have the same memory issues (at least that's what I thought). I swapped the CPUs and tested all the RAM. The RAM that was in this system was the same make and model as the RAM that was originally failing; they are used sticks from eBay, so I tested those 2 sticks and the Kingston RAM. I thought the system was checking out okay until I booted the OS. Ubuntu Server then kernel panicked, complaining about an uncorrectable multi-bit error. I believe at this point it happened with both sticks of the used eBay RAM.

    I want to say the error went away, so I tried the new Kingston RAM again, and I had kernel panics again. So was it me reseating the RAM so many times in testing? Did the 3620 motherboard damage the old and new RAM? I wasn't sure. As a side note, the T30 and 3620 have the same motherboard; they just have a different BIOS, different branding, a different case front panel, etc. During testing I think I learned that the T30 doesn't have Dell Reliable Memory Technology settings in the BIOS, but the 3620 does. Despite toggling it off on the 3620, the 3620 still reports ECC errors on POST.

    The reason I bring that up is that, by my observation, the 3620 reports ECC errors on POST and also locks up on memory errors during testing. On the T30, I don't believe I've ever seen an ECC error reported on POST, and Dell Diagnostics doesn't lock up either. Also, I'm not sure the RAM was really working fine at first in the T30, because I was never booting Ubuntu at the time, only running Dell Diagnostics and waiting for POST to tell me. It was only when I was ready to conclude it was the 3620 motherboard that I let Ubuntu boot and the kernel panics happened.

    So I purchased MemTest86 Pro. On the T30 it reports ECC errors that are correctable and continues testing. On the 3620, MemTest86 locks up, and on the next boot POST reports an uncorrectable memory error. What's weird after all of this is that the 2 used eBay RAM sticks are performing without errors now; it's the new Kingston RAM that is having issues. Yes, I had a correctable memory error out of the box, but the only reason I bought it was that I was having uncorrectable memory errors with the 1 RAM stick in the 3620.

    Both Dell Diagnostics and MemTest86 lock up on the 3620 when errors occur, while the same tools on the T30 behave differently and continue testing. The error with the Kingston RAM is correctable on the T30 but uncorrectable on the 3620.

    Hopefully this all makes sense. The issue seems to be intermittent, and I can't nail down what is really at fault or what is actually broken.

  • #2
    I guess it isn't impossible, but it would be rather unusual for a motherboard to be damaging RAM sticks.
    Having two bad systems (or two bad sets of RAM) would also be unusual, however.
    Have you always been testing in the same slots? Maybe the behaviour changes depending on the slots in use?
    Are there other external factors that could be the cause (e.g. high ambient temperatures, an area of high EMI, or a dirty mains power supply)?
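    If it helps with tracking whether the errors follow a stick or a slot, here is a minimal sketch for reading the Linux EDAC error counters from sysfs while Ubuntu is running. It assumes an EDAC driver for the platform is loaded, so that `ce_count` and `ue_count` files exist under `/sys/devices/system/edac/mc/mcN/`; whether that is the case on the T30 or 3620 is not something I can confirm. The function name `read_edac_counts` is just a placeholder for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} read from EDAC sysfs.

    Assumption: an EDAC driver is loaded and exposes mcN/ce_count and
    mcN/ue_count. If the directory is missing, an empty dict is returned.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # EDAC not available on this system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers with unreadable or odd counters
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: correctable={c['ce']} uncorrectable={c['ue']}")
```

    Noting these counters after each run, together with which stick was in which slot, would give a simple record to compare across the two machines.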


