reply to post by ArMaP
In error correction, you add redundant bits to the data, encoded in some way (not just a duplicate copy). Generally it's some sort of parity
information. The more bits of redundant information you add, the larger the syndrome set you can derive.
In this case, "syndrome" means, and IIRC this is a textbook definition, something like all possible errors that are output from the parity bit check
stage. By examining the syndrome, you can tell which bit(s) are in error and replace them with the correct value. If you've got the minimum in
redundant info, say Hamming(7,4), then you can detect and correct any single bit error, and detect any two bit error. Beyond that, it's possible
you'll detect an error but you might not. To try to clarify it, say I have a magic box. If I put in the data when I'm writing it, the box will provide
me with extra bits, which I also save. When I'm reading it, I input the data and the extra bits I got earlier. The output of the magic box is called a
'syndrome'. If it's zero, no error (or a lot of them, beyond the capacity to detect). If I get a pattern instead, and if the error is within the
syndrome space, I can do a fast lookup of that pattern and get the corrected data, because that non-zero syndrome contains the information I need to
know what was messed up. The magic box can also tell me something like "Too messed up to make a real syndrome", and this is where you get into two bit
detection but not correction. For Hamming (7,4), the syndrome can give you the info to fix any single bit error, but not two bit errors. For a two bit
error, it'll at least know the syndrome is unusable. For three or above, maybe, maybe not.
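If it helps, here's the magic box as a little Python sketch for plain Hamming(7,4). The bit layout and function names are just mine for illustration, but the idea is the textbook one: the three check bits get recomputed on read, and a non-zero syndrome is literally the position of the flipped bit.

# Toy Hamming(7,4) "magic box". Bit positions are 1..7; positions 1, 2, 4 hold parity.

def encode(d):                      # d = [d1, d2, d3, d4]
    c = [0] * 8                     # index 0 unused, positions 1..7
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]       # checks positions whose index has bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]       # checks positions whose index has bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]       # checks positions whose index has bit 2 set
    return c[1:]

def decode(r):                      # r = the 7 bits read back
    c = [0] + list(r)
    s  = (c[1] ^ c[3] ^ c[5] ^ c[7])        # recompute each check; a mismatch
    s |= (c[2] ^ c[3] ^ c[6] ^ c[7]) << 1   # sets the matching bit of the
    s |= (c[4] ^ c[5] ^ c[6] ^ c[7]) << 2   # syndrome
    if s:                                   # non-zero syndrome = position of the bad bit
        c[s] ^= 1
    return [c[3], c[5], c[6], c[7]], s

word = [1, 0, 1, 1]
stored = encode(word)
stored[5] ^= 1                      # cosmic ray flips bit position 6
data, syndrome = decode(stored)
print(data, syndrome)               # [1, 0, 1, 1] 6 -- fixed, and the syndrome says where

Bolt an overall parity bit onto that and you get the SEC-DED behavior I described above.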
If you've got more check bits than the minimum, you can derive a larger syndrome set that will allow you to detect and correct any two bit error, and
detect any three bit error.
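The counting behind that is simple enough to sketch: r check bits give you 2^r possible syndromes, and you need a distinct one for "no error" plus every error pattern you intend to correct. Quick back-of-the-envelope in Python (this is only the lower bound; a real two-bit-correcting code like BCH typically spends a check bit or so more than this):

from math import comb                 # Python 3.8+

def check_bits_needed(n, t):
    # error patterns of weight 0..t, each of which needs its own syndrome
    patterns = sum(comb(n, k) for k in range(t + 1))
    r = 0
    while 2 ** r < patterns:
        r += 1
    return r

print(check_bits_needed(7, 1))    # 3 -- Hamming(7,4): fix any single bit error
print(check_bits_needed(15, 2))   # 7 -- lower bound for fixing any two bit error in 15 bits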
In a more robust system, you generally design flash storage in such a way that you scatter bits across devices, instead of doing it bytewise in one
part. For instance, if the internal word size of your FFS is eight bits (common), then you have eight flash devices, each of which stores a bit of
your data word. You end up with bigger sectors that way, but it helps this sort of thing a lot. If you take a cosmic ray hit in one of the eight
devices, you end up with a bunch of single bit errors, if you're doing your correction external to the devices. These are always correctable, even
with Hamming(7,4). If you put the whole byte in one device, so the entire word sits in one physical spot in the memory, then a cosmic ray hit will
often give you multibit errors, and then you get this sort of problem. Note that you are much, much more likely to get an SEU (single event upset)
across a group of physically adjacent memory cells than you are to, say, get simultaneous hits on two separate parts that corrupt the same data word.
By separating the bits out into physically separate flash parts, you raise the odds against an uncorrectable error by several orders of magnitude.
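A crude picture of the scattering, ignoring the ECC itself (the device count and the names are made up purely for illustration):

# Eight "devices", each holding one bit of every stored word. A burst of flipped
# cells in one device then shows up as at most one bad bit per data word.

WORDS = 16
devices = [[0] * WORDS for _ in range(8)]     # devices[b][i] holds bit b of word i

def write_word(i, value):
    for b in range(8):
        devices[b][i] = (value >> b) & 1

def read_word(i):
    return sum(devices[b][i] << b for b in range(8))

for i in range(WORDS):
    write_word(i, (i * 7) & 0xFF)

# A cosmic ray wipes out a run of adjacent cells -- but they're all in one device:
for i in range(4, 9):
    devices[3][i] ^= 1

bad = [i for i in range(WORDS) if read_word(i) != ((i * 7) & 0xFF)]
print(bad)    # [4, 5, 6, 7, 8] -- five damaged words, but each has exactly one bad
              # bit, which is the easy case for any per-word Hamming-style ECC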
If you're doing it on the cheap, though, you might use stock FFS and stock NAND flash parts, which will always want to pile the bits together in one
heap. Not good for this sort of thing, I'm afraid. Even DVDs scatter bits across the disk surface so that a scratch tends to cause a lot of single bit
errors. Civilian FFS does not, but then they're not trying for rad hardening.
If the system had been designed with a triple redundant FFS, or designed to scatter bits and do the error correction and FFS management in an external
controller rather than relying on the quick, cheap, dirty on-chip correction, you wouldn't have had this issue. But it's easier to just buy it off
the shelf. Then, of course, you get this.
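For the triple redundant flavor, the read path is about as simple as it gets, a bitwise majority vote across the three copies. Sketch only; a real design would typically vote over whole pages and rewrite the losing copy:

def tmr_read(a, b, c):
    # bitwise majority: a bit survives if at least two of the three copies agree on it
    return (a & b) | (a & c) | (b & c)

copies = [0b10110010] * 3
copies[1] ^= 0b00011000              # one copy takes a two-bit hit
print(bin(tmr_read(*copies)))        # 0b10110010 -- the damaged copy gets outvoted

It costs you three times the flash, though, which is exactly why the off-the-shelf route looks attractive right up until you take a hit.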