[From: firstname.lastname@example.org (Gary Newman)]
Memory errors are categorized as either "HARD" failures or "SOFT" failures. Either form of failure can cause anything from an unexplained system crash to a nice warning message saying:
"soft error corrected at address 0x00343487 pattern 0x0004000"
The methods that have been developed to deal with these failures are outlined here.
HARD ERRORS occur when one or more bits in a memory consistently read back different data than was written to them. These failures have a myriad of causes, including failed memory cells, memory chips, solder connections, SIMM socket connections, and circuit traces. Hard errors are signs of truly broken hardware and require physical repair to correct. If you are lucky, simply removing and reinserting a SIMM in its socket is sufficient to make a better connection. Usually, though, a hard error means you have a bad memory chip or motherboard.
SOFT ERRORS occur when one or more bits in a memory read back different data than was written to them, BUT after rewriting the same data the memory reads it back correctly. In other words: the error is transient and not reproducible. Soft errors are usually intermittent, with anywhere from hours to years between occurrences. There are two design causes for soft errors: motherboard noise, and internal DRAM noise due to alpha particles or marginal circuits. On a well-designed motherboard, noise does not cause measurable soft errors unless the board is defective.
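The distinction between the two kinds of failure can be sketched as a small write/read-back test. This is only an illustration against a simulated memory: the SimulatedMemory class, its fault-injection fields, and the classify function are hypothetical names invented for this sketch, not part of any real diagnostic tool.

```python
class SimulatedMemory:
    """Hypothetical word-addressable memory with injectable faults."""

    def __init__(self, size):
        self.cells = [0] * size
        self.stuck_bits = {}     # address -> mask of bits stuck at 1 (hard fault)
        self.pending_flips = {}  # address -> mask flipped once on the next read (soft fault)

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        # A transient upset corrupts the stored data exactly once.
        flip = self.pending_flips.pop(addr, 0)
        self.cells[addr] ^= flip  # the bad value stays until the cell is rewritten
        # A stuck bit corrupts every read, no matter what was written.
        return self.cells[addr] | self.stuck_bits.get(addr, 0)


def classify(mem, addr, pattern):
    """Return 'ok', 'soft', or 'hard' for one address and one test pattern."""
    mem.write(addr, pattern)
    if mem.read(addr) == pattern:
        return "ok"
    # Mismatch: rewrite the same data and read it back again.
    mem.write(addr, pattern)
    if mem.read(addr) == pattern:
        return "soft"  # the error did not repeat after rewriting
    return "hard"      # the error repeats consistently
```

A real memory test would run several patterns per address, since a bit stuck at 1 goes unnoticed whenever the pattern already has a 1 in that position.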
Both soft errors and hard errors can be caused by static electricity damage or otherwise defective parts. Unfortunately, these problem parts don't always cause instant hard errors. Failures can appear weeks or months after the initial damage, either as soft errors (due to degraded performance) or as hard errors. "Burn in" (which is heavy exercise of hardware for its first few days) is a method used by manufacturers to weed out these failures at the factory.
Users of computers can also "change the design" of their computer without understanding the ramifications of what they are doing. Adding "SIMM converters" to fit 30-pin SIMMs into a 72-pin socket, decreasing the DRAM refresh rate, overclocking, and changing the DRAM access timing can all push a design beyond allowable specifications. The problems frequently show up as parity errors or, on a system without parity, just as system flakiness.
INTERNAL DRAM NOISE is caused by two different sources. Marginal circuits on the DRAM are one source that quality manufacturers nearly always find at the factory through testing of the parts. HOWEVER, SOME MARGINAL DRAM MAKES IT TO MARKET! The result is a part that produces a soft error more often than normal (see below). A system of mine had such a part that produced a single bit error (always in the same DRAM chip of a SIMM) once a month.
ALL DRAM PRODUCES SOFT ERRORS DUE TO ALPHA PARTICLES. The plastic packaging of the DRAM contains trace amounts of radioactive material that emit alpha particles. These are energetic, fast-moving helium nuclei (helium atoms which are missing their electrons). When an alpha particle emitted by the packaging hits a sense line in the DRAM during a read cycle, the noise it produces causes the sense amplifier to misread the data. Then, as with all DRAM, the memory cell is refreshed after reading and the bad data becomes permanent.
Memory Error Likelihood
In 1990, alpha-particle-induced soft errors occurred in 16 MB computer systems at a mean rate of roughly one error every 3 months. Improved DRAM designs have greatly reduced that error rate, so that today the mean error rate in a 16 MB system is roughly one bit error every 16 years. Note that since the errors only occur when memory is being read, faster access rates to memory make for shorter times between errors. When a computer is idle, the only DRAM access is due to infrequent memory refresh cycles. When a program is constantly reading from memory at the maximum memory bandwidth, bit errors occur more frequently.
With computers DESIGNED to produce memory errors at a rate of roughly one bit error per system per 16 years, manufacturers have been cutting costs by not including "parity" memory with the systems they sell. THIS ERROR RATE PRODUCES A SINGLE BIT ERROR DURING A TYPICAL THREE MONTH WARRANTY IN 1.6 PERCENT OF ALL THE COMPUTERS SOLD! There are two main risks of using a system without parity memory. One is that the computer user will have no warning when a memory error (soft or hard) has occurred, and the other is that side effects of the error may be hard to isolate. A single bit error can produce side effects such as a wrong result in a spreadsheet, erroneous data in a database, or a bug in the instructions of an application program or operating system that causes mysterious system crashes.
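To show what parity memory buys you, here is a minimal sketch of a parity check on a single byte. It assumes one parity bit per 8-bit byte and uses even parity for simplicity. Both choices, and the function names, are assumptions made for illustration; real hardware does this in logic on every memory access, not in software.

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if the number of 1 bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte):
    """Model a 9-bit parity memory location as (data byte, parity bit)."""
    return (byte & 0xFF, parity_bit(byte))


def check(stored):
    """Return True if the stored data still matches its parity bit."""
    byte, parity = stored
    return parity_bit(byte) == parity


word = store(0x5A)
one_bit_flipped = (word[0] ^ 0x08, word[1])   # single bit error
two_bits_flipped = (word[0] ^ 0x18, word[1])  # double bit error

print(check(word))              # clean data passes
print(check(one_bit_flipped))   # a single bit error is detected
print(check(two_bits_flipped))  # two flipped bits cancel out and slip through
```

Note that parity only reports that a byte is bad. It cannot say which bit flipped, so it warns the user but cannot correct the error.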
With 100 million computers in use today, we should expect roughly 6 million single bit errors per year. Computer hardware and software companies must receive thousands of "side effect" bug reports and support calls due to memory errors alone. The costs of NOT including parity memory must be huge!
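The 1.6 percent and 6 million figures follow directly from the one-error-per-16-years rate. The short calculation below reproduces them. The last step, which treats errors as independent random events (a Poisson process), is an added assumption of this sketch and not something stated above; at rates this low it gives nearly the same answer as the simple ratio.

```python
import math

MEAN_YEARS_BETWEEN_ERRORS = 16.0  # per system, from the text
WARRANTY_YEARS = 3 / 12           # typical three month warranty
COMPUTERS_IN_USE = 100_000_000    # from the text

# Expected number of errors per system during the warranty period.
errors_in_warranty = WARRANTY_YEARS / MEAN_YEARS_BETWEEN_ERRORS

# Expected number of errors per year across all computers in use.
errors_per_year = COMPUTERS_IN_USE / MEAN_YEARS_BETWEEN_ERRORS

# Assumption: errors arrive as a Poisson process, so the chance of at
# least one error during the warranty is 1 - exp(-expected count).
p_at_least_one = 1 - math.exp(-errors_in_warranty)

print(f"expected errors per system in warranty: {errors_in_warranty:.1%}")
print(f"expected errors per year, all systems:  {errors_per_year:,.0f}")
print(f"chance of at least one error (Poisson): {p_at_least_one:.1%}")
```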