reply to post by ArMaP
In error correction, you add redundant bits to the data, encoded in some way (not just a duplicate copy). Generally it's some sort of parity
information. The more bits of redundant information you add, the larger the syndrome set you can derive.
In this case, "syndrome" means, and IIRC this is a textbook definition, something like all possible errors that are output from the parity bit check
stage. By examining the syndrome, you can tell which bit(s) are in error and replace them with the correct value. If you've got the minimum in
redundant info, say Hamming(7,4), then you can detect and correct any single bit error, and detect any two bit error. Beyond that, it's possible
you'll detect an error but you might not. To try to clarify it, say I have a magic box. If I put in the data when I'm writing it, the box will provide
me with extra bits, which I also save. When I'm reading it, I input the data and the extra bits I got earlier. The output of the magic box is called a
'syndrome'. If it's zero, no error (or a lot of them, beyond the capacity to detect). If I get a pattern instead, and if the error is within the
syndrome space, I can do a fast lookup of that pattern and get the corrected data, because that non-zero syndrome contains the information I need to
know what was messed up. The magic box can also tell me something like "Too messed up to make a real syndrome", and this is where you get into two bit
detection but not correction. For Hamming (7,4), the syndrome can give you the info to fix any single bit error, but not two bit errors. For a two bit
error, it'll at least know the syndrome is unusable. For three or above, maybe, maybe not.
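If it helps, here's the magic box as a little Python sketch for plain Hamming(7,4). The bit layout and function names are just mine for illustration, but the idea is the textbook one: the three check bits get recomputed on read, and a non-zero syndrome is literally the position of the flipped bit.

# Toy Hamming(7,4) "magic box". Bit positions are 1..7; positions 1, 2, 4 hold parity.

def encode(d):                      # d = [d1, d2, d3, d4]
    c = [0] * 8                     # index 0 unused, positions 1..7
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]       # checks positions whose index has bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]       # checks positions whose index has bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]       # checks positions whose index has bit 2 set
    return c[1:]

def decode(r):                      # r = the 7 bits read back
    c = [0] + list(r)
    s  = (c[1] ^ c[3] ^ c[5] ^ c[7])        # recompute each check; a mismatch
    s |= (c[2] ^ c[3] ^ c[6] ^ c[7]) << 1   # sets the matching bit of the
    s |= (c[4] ^ c[5] ^ c[6] ^ c[7]) << 2   # syndrome
    if s:                                   # non-zero syndrome = position of the bad bit
        c[s] ^= 1
    return [c[3], c[5], c[6], c[7]], s

word = [1, 0, 1, 1]
stored = encode(word)
stored[5] ^= 1                      # cosmic ray flips bit position 6
data, syndrome = decode(stored)
print(data, syndrome)               # [1, 0, 1, 1] 6 -- fixed, and the syndrome says where

Bolt an overall parity bit onto that and you get the SEC-DED behavior I described above.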
If you've got more check bits than the minimum, you can derive a larger syndrome set that will allow you to detect and correct any two bit error, and
detect any three bit error.
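The counting behind that is simple enough to sketch: r check bits give you 2^r possible syndromes, and you need a distinct one for "no error" plus every error pattern you intend to correct. Quick back-of-the-envelope in Python (this is only the lower bound; a real two-bit-correcting code like BCH typically spends a check bit or so more than this):

from math import comb                 # Python 3.8+

def check_bits_needed(n, t):
    # error patterns of weight 0..t, each of which needs its own syndrome
    patterns = sum(comb(n, k) for k in range(t + 1))
    r = 0
    while 2 ** r < patterns:
        r += 1
    return r

print(check_bits_needed(7, 1))    # 3 -- Hamming(7,4): fix any single bit error
print(check_bits_needed(15, 2))   # 7 -- lower bound for fixing any two bit error in 15 bits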
In a more robust system, you generally design flash storage in such a way that you scatter bits across devices, instead of doing it bytewise in one
part. For instance, if the internal word size of your FFS is eight bits (common), then you have eight flash devices, each of which stores a bit of
your data word. You end up with bigger sectors that way, but it helps this sort of thing a lot. If you take a cosmic ray hit in one of the eight
devices, you end up with a bunch of single bit errors, if you're doing your correction external to the devices. These are always correctable, even
with Hamming(7,4). If you put the whole byte in one device, so the entire word sits in one physical spot in the memory, then a cosmic ray hit will
often give you multibit errors, and then you get this sort of problem. Note that you are much, much more likely to get an SEU (single event upset)
across a group of physically adjacent memory cells than you are to, say, get simultaneous hits on two separate parts that corrupt the same data word.
By separating the bits out into physically separate flash parts, you raise the odds against an uncorrectable error by several orders of magnitude.
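A crude picture of the scattering, ignoring the ECC itself (the device count and the names are made up purely for illustration):

# Eight "devices", each holding one bit of every stored word. A burst of flipped
# cells in one device then shows up as at most one bad bit per data word.

WORDS = 16
devices = [[0] * WORDS for _ in range(8)]     # devices[b][i] holds bit b of word i

def write_word(i, value):
    for b in range(8):
        devices[b][i] = (value >> b) & 1

def read_word(i):
    return sum(devices[b][i] << b for b in range(8))

for i in range(WORDS):
    write_word(i, (i * 7) & 0xFF)

# A cosmic ray wipes out a run of adjacent cells -- but they're all in one device:
for i in range(4, 9):
    devices[3][i] ^= 1

bad = [i for i in range(WORDS) if read_word(i) != ((i * 7) & 0xFF)]
print(bad)    # [4, 5, 6, 7, 8] -- five damaged words, but each has exactly one bad
              # bit, which is the easy case for any per-word Hamming-style ECC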
If you're doing it on the cheap, though, you might use stock FFS and stock NAND flash parts, which will always want to pile the bits together in one
heap. Not good for this sort of thing, I'm afraid. Even DVDs scatter bits across the disk surface so that a scratch tends to cause a lot of single bit
errors. Civilian FFS does not, but then they're not trying for rad hardening.
If the system had been designed with a triple redundant FFS, or designed to scatter bits and do the error correction and FFS management in an external
controller rather than relying on the quick, cheap, dirty on-chip correction, you wouldn't have had this issue. But it's easier to just buy it off
the shelf. Then, of course, you get this.
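For the triple redundant flavor, the read path is about as simple as it gets, a bitwise majority vote across the three copies. Sketch only; a real design would typically vote over whole pages and rewrite the losing copy:

def tmr_read(a, b, c):
    # bitwise majority: a bit survives if at least two of the three copies agree on it
    return (a & b) | (a & c) | (b & c)

copies = [0b10110010] * 3
copies[1] ^= 0b00011000              # one copy takes a two-bit hit
print(bin(tmr_read(*copies)))        # 0b10110010 -- the damaged copy gets outvoted

It costs you three times the flash, though, which is exactly why the off-the-shelf route looks attractive right up until you take a hit.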