Author Topic: redundancy design (Read 3889 times)

medvedm · « **on:** April 28, 2011, 10:46:28 AM »

Hi-

I made a post a while back about how one might add some redundancy and corruption checking to a tinycore system. See http://forum.tinycorelinux.net/index.php?topic=7142.0 if interested... but I'm thinking about this again and am looking for ideas/designs. The basic problem is laid out in the previous thread, but succinctly it is this:

1. In space, your data might get corrupted by radiation hits.
2. We don't have $ or space for extra hardware, and in the worst case, the HDD can be removed and replaced with on-orbit spares.
3. I don't want foolproof coverage, simply whatever I can do easily to protect/verify a majority of the files stored on the disk.
4. Space (Megabytes) is not an issue. Redundant files are totally fine.

What I decided to do is store three copies of each of bzImage and tinycore.gz and use grub to cycle through the files at each boot up. This way, if one is corrupted, when the system is rebooted, it will use a different version. Software can do MD5 checksums to verify the files are good. I realize that grub is vulnerable, as is the default file it uses to make this scheme work. So be it.
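A minimal sketch of the MD5 check described above, assuming the known-good digest is available to the software (for example, uplinked from the ground). The function names `md5_of` and `check_copies` and the file names in the usage note are illustrative, not part of Tiny Core.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_copies(paths, expected_md5):
    """Map each path to True if its MD5 matches, False otherwise.

    A copy that cannot be read at all is reported as False.
    """
    results = {}
    for path in paths:
        try:
            results[str(path)] = md5_of(path) == expected_md5
        except OSError:
            results[str(path)] = False
    return results
```

Calling `check_copies(["bzImage1", "bzImage2", "bzImage3"], known_good)` (hypothetical file names) would report which copies still match, which could then be sent down in telemetry.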

Now I'm thinking about what to do with my applications and config files. I'm thinking that if I let these be backed up in mydata.tgz, I can get some form of error checking and redundancy by relying on tar to spit out an error if it fails upon extraction. So maybe what happens is that on every shutdown, I verify that the tgz file is good (by extracting it w/o errors?) and then make two copies of it. At bootup, if the first shot at extracting mydata.tgz gets an error, it tries mydata2.tgz.
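The fallback idea above could be sketched roughly as follows. This assumes the backups are gzip-compressed tar files and checks them by reading them rather than extracting to disk. `archive_is_readable` and `first_good_archive` are hypothetical helper names; Tiny Core's own backup and restore scripts are not shown here.

```python
import gzip
import tarfile
import zlib


def archive_is_readable(path):
    """Return True if the .tgz decompresses fully and its tar headers parse."""
    try:
        # Decompress the whole stream; gzip checks its CRC-32 trailer at the end.
        with gzip.open(path, "rb") as stream:
            while stream.read(1 << 20):
                pass
        # Walk the tar headers to confirm the member list parses.
        with tarfile.open(path, "r:gz") as archive:
            archive.getmembers()
        return True
    except (OSError, EOFError, zlib.error, tarfile.TarError):
        return False


def first_good_archive(candidates):
    """Return the first candidate path that reads cleanly, or None."""
    for path in candidates:
        if archive_is_readable(path):
            return path
    return None
```

At boot, something like `first_good_archive(["mydata.tgz", "mydata2.tgz"])` would pick the archive to restore from. The same check could run at shutdown before the copies are made.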

I want my apps to go in the mydata.tgz because then I could just uplink my application if (really more like when) it needs to be updated on orbit, as opposed to having it remastered into the tinycore.gz.

Any better ideas? I'd love for a "why don't you just...." that makes this way easier.

Guy · « **Reply #1 on:** April 28, 2011, 12:17:27 PM »

I don't see a reason for doing this.

In what situation do you want to use this?

What is your reason for doing this?

I suggest just backing up personal files regularly.

tinypoodle · « **Reply #2 on:** April 28, 2011, 12:44:43 PM »

See the thread at the URL posted in the OP of this thread to understand.

Rich · « **Reply #3 on:** April 28, 2011, 01:02:34 PM »

Hi medvedm
Redundancy can be a double edged sword as it adds more potential points for a failure. In your case
adding n number of redundant copies increases the size of the target for a radiation hit by a factor of n.
I know you don't want to add hardware, but you still might want to consider shielding the two large
surfaces of the drive parallel to the platter thus minimizing the size of the target. Maybe do some
research to see how thin a sheet of lead would be required to get the radiation susceptibility of the
drive down to a level you deem acceptable.
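
A rough way to put numbers on this trade-off, under the simplifying assumption (not from the thread) that each copy is corrupted independently with the same small probability $p$ over some period. With $n$ copies, the chance that at least one copy is hit grows roughly in proportion to $n$, while the chance that every copy is hit shrinks as $p^n$:

```latex
\begin{align}
P(\text{at least one of } n \text{ copies corrupted}) &= 1 - (1 - p)^n \approx n p \quad (p \ll 1) \\
P(\text{all } n \text{ copies corrupted}) &= p^n \\
\text{Example: } p = 0.01,\ n = 3: \quad 1 - 0.99^3 &= 0.029701, \qquad 0.01^3 = 10^{-6}
\end{align}
```

So the larger target does mean more single-copy faults to detect, but losing all copies at once becomes far less likely. The value of $p$ here is purely illustrative.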

gerald_clark · « **Reply #4 on:** April 28, 2011, 01:25:56 PM »

Also be aware that hard drives have a maximum operating altitude.
The heads actually fly above the platter.
They will not operate in a vacuum.

Guy · « **Reply #5 on:** April 28, 2011, 03:02:30 PM »

Quote

What I decided to do is store three copies of each of bzImage and tinycore.gz and use grub to cycle through the files at each boot up. This way, if one is corrupted, when the system is rebooted, it will use a different version.

Why not have three (or more) complete installations?

Not knowing much about your situation, you could get small usb hard drives, and have three (or more) complete installations on different hard drives. I don't know how long it needs to last. If it is not long, you could use regular usb drives. Hard drives may last longer.

medvedm · « **Reply #6 on:** April 29, 2011, 05:56:57 AM »

Quote from: Rich on April 28, 2011, 01:02:34 PM

Maybe do some research to see how thin a sheet of lead would be required to get the radiation susceptibility of the
drive down to a level you deem acceptable.

Thanks for the suggestion - trying to fly anything leaded like that is a non-starter... hazardous materials problems out the ying yang.

medvedm · « **Reply #7 on:** April 29, 2011, 05:57:54 AM »

Quote from: gerald_clark on April 28, 2011, 01:25:56 PM

Also be aware that hard drives have a maximum operating altitude.
The heads actually fly above the platter.
They will not operate in a vacuum.

Thanks, we fly lots of COTS hard disks that don't have a problem operating in microgravity. It is on ISS, so they're not in a vacuum.

medvedm · « **Reply #8 on:** April 29, 2011, 06:02:11 AM »

Quote from: Guy on April 28, 2011, 03:02:30 PM

Why not have three (or more) complete installations?

Not knowing much about your situation, you could get small usb hard drives, and have three (or more) complete installations on different hard drives. I don't know how long it needs to last. If it is not long, you could use regular usb drives. Hard drives may last longer.

Seriously, more hardware is not an option. Period. I'm looking for a software solution that would simply increase the system's robustness and at least tell me if I've got a problem.

Now the three complete installations idea has some merit. I suppose I could partition the drive into 5 partitions - one for each full install, one for all the science data, and a small swap partition. I guess I'm still trying to see the pros/cons versus simply doing what I said before with the three bzImage/tinycore.gz files. I guess separate installs would protect you from the "grub takes a hit" scenario (if and only if you use additional hardware like USB drives).

Is there like a software RAID or something I can do with multiple partitions?

danielibarnes · « **Reply #9 on:** April 29, 2011, 09:04:45 AM »

Quote

Is there like a software RAID or something I can do with multiple partitions?

I'm sure you're familiar with an FMEA (Failure Mode and Effects Analysis). Have you done one for this project? It might help with these decisions. Software RAID adds complexity, but I don't believe it can address the boot issues you are asking about. I think multiple complete systems (3-5) would be better, especially if they detect and repair each other. Sounds like a fun project.
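One way the "detect and repair each other" idea might look for plain files is a majority vote over the copies: hash each one and overwrite any copy that disagrees with the majority. This is only a sketch under that assumption. `repair_by_majority` is a hypothetical helper, and unreadable copies are not handled (an `OSError` would propagate).

```python
import hashlib
import shutil
from collections import Counter


def file_md5(path):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repair_by_majority(paths):
    """Overwrite minority copies with a majority copy.

    Returns the list of repaired paths, or None when no digest is
    shared by more than half of the copies (no safe winner).
    """
    digests = {path: file_md5(path) for path in paths}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(paths):
        return None
    source = next(p for p, d in digests.items() if d == winner)
    repaired = [p for p, d in digests.items() if d != winner]
    for path in repaired:
        shutil.copyfile(source, path)
    return repaired
```

With three copies this tolerates one bad copy. If all three disagree, nothing is overwritten and the decision is left to the ground.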

Rich · « **Reply #10 on:** April 29, 2011, 09:09:12 AM »

Hi medvedm
Fair enough. Lead is not the only material that will provide shielding, just one of the more effective ones.

There are software raid modules available for TC. It might be possible to break up the drive into
multiple partitions and have RAID use them as an array of drives. You'll have to wait and see if someone
more knowledgeable and familiar with RAID weighs in with an answer. Grub would probably still be
at risk, and possibly the raid module itself.

medvedm · « **Reply #11 on:** April 29, 2011, 09:31:10 AM »

Quote from: danielibarnes on April 29, 2011, 09:04:45 AM

I'm sure you're familiar with an FMEA (Failure Mode and Effects Analysis). Have you done one for this project? It might help with these decisions. Software RAID adds complexity, but I don't believe it can address the boot issues you are asking about. I think multiple complete systems (3-5) would be better, especially if they detect and repair each other. Sounds like a fun project.

Yeah - it is pretty cool! The project will do an overall FMEA later in the lifecycle. For the type of stuff we do, they usually focus on what hardware can break, and how do we spare it appropriately. I guess the bottom line is that we've got systems flying right now without the sort of protections I'm talking about and we don't have many problems. It is also important to note that we are not mission/safety critical, so if "bad stuff" happens, it sucks but it isn't the end of the world. Given this status, there isn't money or time for a fully redundant, multiple-backup solution.

That is why I don't care if grub is at risk... it is acceptable. If I can cover 95% of my butt, that is definitely good enough.

I'm simply looking for "easy" solutions that would give higher confidence than what we've already done. Multiple bzImage and tinycore.gz files, for example, is trivial to do and yet gives me a nice warm fuzzy - if one gets corrupted, I can reboot and get back to a good one. I can md5sum them, compare the result with a known good value on the ground, and know whether something bad happened.

The simplicity of TC is what attracted me to it in the first place, so I'm looking for as simple a solution(s) as possible.

medvedm · « **Reply #12 on:** April 29, 2011, 09:34:00 AM »

Quote from: Rich on April 29, 2011, 09:09:12 AM

There are software raid modules available for TC. It might be possible to break up the drive into
multiple partitions and have RAID use them as an array of drives. You'll have to wait and see if someone
more knowledgeable and familiar with RAID weighs in with an answer.

Thanks for the reply, Rich. I've been reading some of the TC forum posts about RAID and it certainly seems complicated. Maybe that is just in the eyes of a n00b.

medvedm · « **Reply #13 on:** April 29, 2011, 09:36:59 AM »

FYI, if you're interested, the system I'm working on is the control box for the ACME experiment:

http://issresearchproject.grc.nasa.gov/Investigations/ACME/

M

danielibarnes · « **Reply #14 on:** April 29, 2011, 10:02:23 AM »

Quote from: medvedm on April 29, 2011, 09:31:10 AM

I guess the bottom line is that we've got systems flying right now without the sort of protections I'm talking about and we don't have many problems.
...
FYI, if you're interested, the system I'm working on is the control box for the ACME experiment:
http://issresearchproject.grc.nasa.gov/Investigations/ACME/

Hey, curaga, here's one for the "See where we're used" page.

Quote

if one gets corrupted, I can reboot and get back to a good one.

It sounds like you have connectivity. That certainly makes it easier than having to worry about creating something that can detect and repair faults on its own.

Quote

The simplicity of TC is what attracted me to it in the first place, so I'm looking for as simple a solution(s) as possible.

It sounds like multiple redundant installations are your best bet. Can't get much simpler than that. I'd also suggest taking the approach that maro did in your last thread. Use virtualization to create various scenarios and develop procedures. You can test a damaged MBR, grub, kernel, and initrd very easily this way.
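
For the virtualized test scenarios, a single-bit flip in a copy of a file is one simple stand-in for a radiation hit. The sketch below is an assumption about how such a test might be set up, not something from maro's thread. `flip_random_bit` and `make_damaged_copy` are hypothetical helpers, and they should only ever be pointed at test copies.

```python
import random
import shutil


def flip_random_bit(path, seed=None):
    """Flip one randomly chosen bit in place; return (byte_offset, bit_index)."""
    rng = random.Random(seed)
    with open(path, "r+b") as handle:
        handle.seek(0, 2)  # move to the end to learn the file size
        size = handle.tell()
        if size == 0:
            raise ValueError("cannot flip a bit in an empty file")
        offset = rng.randrange(size)
        bit = rng.randrange(8)
        handle.seek(offset)
        original = handle.read(1)[0]
        handle.seek(offset)
        handle.write(bytes([original ^ (1 << bit)]))
    return offset, bit


def make_damaged_copy(source, destination, seed=None):
    """Copy source to destination, then flip one bit in the copy."""
    shutil.copyfile(source, destination)
    return flip_random_bit(destination, seed)
```

Feeding the damaged copy to whatever checksum or extraction check is in place shows whether the fault is actually detected before anything flies.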
