Monday, February 21, 2011

Exceptions and Errors in embedded systems

These past few posts have been ramblings to myself at the cusp of starting a new CFT (Copious Free Time) project.  I am weighing an "elegant" path (Haskell) vs an "Old Unix hacker" path (Shell scripts).

While the Haskell approach is alluring, there is a lot of learning to do there and I am an "Old Unix hacker".  I am very familiar with the benefits of functional programming and have found the past 3 months doing Haskell (some on my day job) a lot of fun.

But, I know I can get more accomplished sooner if I take a "Unix hacker" approach.

Now, for the meat of this post (and a frequent argument against using shell scripts in critical environments): Safety.

Or, more specifically, what about all of the points of unchecked failure in a shell script?
Doesn't this betray the notion of an embedded system?

Well, there is the dangerous situation of uncaught typos, but let's say we are really careful. How do we handle problems like:
1. A process in the pipeline dies unexpectedly.
2. The filesystem becomes 100% full.

Interestingly, while something like "dd if=$1 | transform | gzip >$2" looks like it can be full of the above problems, I could argue that you have this problem using any programming language/approach.
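
To make those failure points concrete, here is a rough POSIX sh sketch of that pipeline in which each stage records its own exit status. Plain sh only reports the status of the last stage, so a dying dd or transform would otherwise go unnoticed (bash has `set -o pipefail` and `PIPESTATUS` for the same purpose). The function name run_pipeline and the status-file layout are made up for illustration, and the transform is whatever command you pass in.

```shell
#!/bin/sh
# run_pipeline SRC DST TRANSFORM [ARGS...]
# Hypothetical helper: runs "dd | TRANSFORM | gzip" and returns non-zero
# if any of the three stages failed.
run_pipeline() {
    src=$1
    dst=$2
    shift 2
    status_dir=$(mktemp -d) || return 1

    # Each stage writes its own exit status to a file, because sh itself
    # only remembers the status of the last command in the pipeline.
    { dd if="$src" 2>/dev/null; echo $? >"$status_dir/dd"; } |
    { "$@"; echo $? >"$status_dir/transform"; } |
    { gzip >"$dst"; echo $? >"$status_dir/gzip"; }

    rc=0
    for stage in dd transform gzip; do
        # A missing status file is treated as a failure.
        s=$(cat "$status_dir/$stage" 2>/dev/null || echo 1)
        if [ "$s" -ne 0 ]; then
            echo "stage $stage failed with status $s" >&2
            rc=1
        fi
    done
    rm -rf "$status_dir"
    return "$rc"
}
```

Called as `run_pipeline "$1" "$2" transform`, a dead transform shows up as a non-zero transform status, and a full filesystem should show up as a non-zero status from gzip. Note that this only detects the failure; deciding what to do about it is a separate question.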

However, because it is so difficult to catch "exceptional" errors in the shell, it starts to make me wonder how I would handle this in a language that supports "exceptions".

This is where things start to unravel (for me).  What do you do in that exception? How do you recover?
Let's look at some approaches:

1. Unix approach:   Wrap the "dd" line in a script and have a monitor start it, capture and log stderr, and restart it if necessary (but not too aggressively -- maybe at some point give up and shut down the system).
2. Erlang approach:  Interestingly similar to above.
3. Language w/ exceptions: Catch the error, close the files and.... um, restart?

In the Unix approach, the cleanup is mostly done for you. Good fault tolerance practice (as suggested by Erlang) is pretty much handled by variants of init (I believe that daemontools' supervise has been doing this well for years).
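
The monitor in the Unix approach can be sketched in a few lines of POSIX sh. This is only an illustration of the idea, not how init or daemontools actually do it; the name supervise_job and its arguments (restart limit, delay, log file) are assumptions of mine.

```shell
#!/bin/sh
# supervise_job MAX_RESTARTS DELAY_SECONDS LOGFILE COMMAND [ARGS...]
# Hypothetical monitor: runs COMMAND with stderr appended to LOGFILE,
# restarts it after a failure, and gives up after MAX_RESTARTS restarts.
supervise_job() {
    max_restarts=$1
    delay=$2
    logfile=$3
    shift 3
    restarts=0
    while :; do
        if "$@" 2>>"$logfile"; then
            return 0    # clean exit: nothing to restart
        else
            status=$?
        fi
        echo "$(date) '$1' exited with status $status" >>"$logfile"
        if [ "$restarts" -ge "$max_restarts" ]; then
            echo "$(date) giving up after $restarts restarts" >>"$logfile"
            return 1    # the caller decides whether to shut the system down
        fi
        restarts=$((restarts + 1))
        sleep "$delay"  # do not restart too aggressively
    done
}
```

For example, `supervise_job 5 10 /var/log/job.log ./job.sh` (with made-up paths) would restart the job up to five times, ten seconds apart. When the child dies, the kernel closes its files and pipes, which is the "cleanup done for you" part.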

I am sure there are holes in my argument, but for my CFT, I am persisting all important data on disk (an event queue is central to my system). Every change (addition, execution, removal) of an event is an atomic disk transaction. If any process dies, it can be relaunched and pick up where it left off.
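
One way such an on-disk event queue could look in POSIX sh is sketched below. The directory layout (tmp, pending, running) and the function names are invented for this example. It leans on mv within a single filesystem being an atomic rename, and it does not force data to disk, so surviving a power loss would need more work.

```shell
#!/bin/sh
# Hypothetical layout: QUEUE/tmp, QUEUE/pending, QUEUE/running.
queue_init() {
    mkdir -p "$1/tmp" "$1/pending" "$1/running"
}

# queue_add QUEUE NAME BODY: write the event completely, then publish it
# with a single rename so readers never see a half-written event.
queue_add() {
    printf '%s\n' "$3" >"$1/tmp/$2" || return 1
    mv "$1/tmp/$2" "$1/pending/$2"
}

# queue_claim QUEUE: move the first pending event (in name order) to
# running/ and print its name. Returns 1 if the queue is empty.
queue_claim() {
    for f in "$1"/pending/*; do
        [ -e "$f" ] || return 1    # the glob matched nothing
        name=${f##*/}
        mv "$f" "$1/running/$name" || return 1
        echo "$name"
        return 0
    done
    return 1
}

# queue_done QUEUE NAME: the event has been executed; remove it.
queue_done() {
    rm "$1/running/$2"
}

# queue_recover QUEUE: after a crash, put half-done events back in pending/.
queue_recover() {
    for f in "$1"/running/*; do
        [ -e "$f" ] || continue
        mv "$f" "$1/pending/${f##*/}"
    done
}
```

A relaunched worker would call queue_recover first and then go back to claiming events, which is the "pick up where it left off" behavior. An event interrupted mid-execution gets run again, so event handlers would need to tolerate that.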

For fault-tolerant (embedded) systems I am not sure what I would do in an "exception" handler... outside of clean up and die.

/todd
