Nate just linked me to this post interviewing an inside source at Microsoft about the causes of the RROD (the Xbox 360’s “Red Ring of Death”). Now that I’m involved in hardware manufacturing of consumer devices, it’s a fascinating case study of what not to do, so I’m paying attention and taking notes.
A while back I posted that I was looking for an RROD Xbox 360; I actually sent one off to MEFAS to get digested for solder joint inspection on the GPU through a process called “dye and pry”. In this process, the motherboard is flooded with red ink, and then the GPU is mechanically pried off the board. The red ink flows into any tiny cracks in the solder balls, and at least in theory, when you pry the GPU off, the cracked regions will shear first, so you will be left with visible red spots at the points of failure.
The findings were interesting. Below is what a normal ball looks like after the test:
And here is one of several balls on the GPU that exhibited signs of partial failure:
There was also some “voiding” seen in the balls, i.e. trapped gas bubbles inside the solder balls that might serve as starting points for mechanical failure. Some voiding is expected, and I can’t find much data correlating failure with voiding, but I could imagine that in a stressful mechanical environment these things don’t help.
I was a bit puzzled by these results because I didn’t see any “catastrophic” failure (pools of red ink over a connection interface), just partial cracking. Partial cracking isn’t terribly uncommon, and many products work quite well despite such artifacts. However, after reading the article linked above, it makes more sense: if Microsoft shorted safety margins around many of the design parameters to get the product out on time, the summation of many partial failures could lead to a total system failure, with symptoms that vaguely cluster together but are difficult to trace to any single root cause. Heisenbugs. Yuck.
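To put rough numbers on how individually tolerable defects can add up, here is a back-of-the-envelope sketch. It assumes a system that only works if every one of its joints holds, and that joints fail independently of one another. Both are simplifications (joints under the same thermal stress are surely correlated), and the joint count and per-joint probabilities are made-up illustrative values, not measurements from the Xbox 360.

```python
def system_survival(per_joint_survival: float, joint_count: int) -> float:
    """Probability that every joint holds, assuming independent failures.

    For a system that needs all of its joints to work, the survival
    probabilities multiply: P(system) = p ** n.
    """
    if not 0.0 <= per_joint_survival <= 1.0:
        raise ValueError("per_joint_survival must be between 0 and 1")
    if joint_count < 0:
        raise ValueError("joint_count must be non-negative")
    return per_joint_survival ** joint_count


if __name__ == "__main__":
    # Hypothetical numbers, purely for illustration.
    joint_count = 1000
    for p in (0.99999, 0.9999, 0.999):
        print(f"per-joint {p:.5f} -> system {system_survival(p, joint_count):.3f}")
```

The point is only the shape of the curve: shaving what looks like a negligible amount off each joint’s margin compounds quickly once you multiply it across everything that has to work at the same time.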
Complex systems are a bitch to get right — and reliable. I think about that every time I step onto an airplane, or when I read about the space program. Respect to the engineers at Boeing and NASA!