Thursday 10 March 2011

Memory Error Isn't the RAM

Recently, I installed a new Dell PowerEdge server in the data center. All seemed well, but after a couple of weeks I happened to look in the Server Administrator application and saw that one of the memory chips (B3) had logged a memory error - "single-bit failure error rate exceeded". Since the server wasn't in production yet, I was able to run all the available updates. After a reboot, the error vanished - so, problem solved, right?
Wrong! A week later, the B3 memory chip was showing the same error. With no more updates to run, I tried a reboot. That cleared the error again, but I was very uncomfortable putting a server into production with a suspected hardware problem. Dell suggested swapping memory chips to verify whether the error would follow the chip; the technician suspected either a bad chip or a bad slot on the motherboard.
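(As an aside: if you'd rather not keep opening Server Administrator to see whether the error has come back, the same events also land in the system event log. Below is a minimal sketch, assuming a Linux host with ipmitool installed and a reachable BMC; the keyword filter is my own simplification, not Dell's exact log wording.)

import subprocess

# Minimal sketch: pull the IPMI system event log and keep only the entries
# that look memory-related. Assumes ipmitool is installed and can reach the
# local BMC; "ipmitool sel list" is a standard ipmitool subcommand. The
# keywords below are a rough filter, not Dell's exact log text.
def memory_sel_entries():
    out = subprocess.run(["ipmitool", "sel", "list"],
                         capture_output=True, text=True, check=True).stdout
    keywords = ("memory", "ecc", "correctable")
    return [line for line in out.splitlines()
            if any(k in line.lower() for k in keywords)]

if __name__ == "__main__":
    for entry in memory_sel_entries():
        print(entry)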
After a trek to the data center and swapping the suspect B3 chip with the nearby B5 chip, the problem resurfaced - on the B5 chip. It looked like the problem was a faulty chip. Not a problem - Dell shipped all new RAM. After I swapped out all the chips, however, the same problem showed up a day later, this time on chip B4. The earlier pattern of the error following the chip now looked like nothing more than a coincidence.
This was getting odd. The technician asked me which memory chips I had swapped and which slot was now showing the error. Since the issue stayed within the same bank of slots rather than following any particular chip, this technician believed the CPU was at fault and asked me to swap the CPUs. After doing this, sure enough the error happened again - on the same bank of memory slots, but in a different slot this time, B2. Dell replaced the motherboard, suspecting the DIMM slots were bad, and the CPUs for good measure. No go. The problem happened yet again a few days later.
My server's problem was escalated. This was getting very odd. The next technician, looking at the DSET report I had sent earlier, asked me to change a BIOS setting involving the CPU power-saving features. The idea was that since the server was not under a production load, one of the CPUs was being dropped into a low-power state when the machine sat idle overnight. On this server the memory controller is built into the CPU, so if a network request or anything else came in while the CPU was in that state, the request would be held at the memory controller until the CPU could 'wake up'. If the CPU couldn't wake up and handle the request in time, the memory controller would log an error.
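If you're curious whether your own server is dropping cores into deep power-saving states while idle, the Linux cpuidle sysfs interface exposes per-core C-state residency. Here's a minimal sketch, assuming a Linux host with cpuidle enabled (state names such as C1, C1E, or C6 vary by platform); it only reports the behavior - the actual fix in my case was the BIOS change.

import glob
import os

# Minimal sketch: report how long a core has spent in each idle (C) state,
# using the standard Linux cpuidle sysfs files. Large residency in deep
# states on a lightly loaded server is the kind of idling described above.
def cstate_residency(cpu="cpu0"):
    states = {}
    for state_dir in sorted(glob.glob(f"/sys/devices/system/cpu/{cpu}/cpuidle/state*")):
        with open(os.path.join(state_dir, "name")) as f:
            name = f.read().strip()
        with open(os.path.join(state_dir, "time")) as f:
            usec = int(f.read().strip())  # time spent in this state, in microseconds
        states[name] = usec
    return states

if __name__ == "__main__":
    for name, usec in cstate_residency().items():
        print(f"{name:>8}: {usec / 1_000_000:.1f} s")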
That was it! After making that one change, the error never resurfaced. I went on to install the server in the production cluster. Keep this in mind the next time you're chasing a memory issue. The problem may not be the hardware after all.

2 comments:

  1. This comment has been removed by the author.

  2. What exactly was the setting you had to change?
