Over the past week plenty of commentary
has been made, including my own, in relation to the 40th
anniversary of the historic flight of Apollo 11. One comment on
Twitter by alexr
however caught my specific attention, "SW
engineers should take this moment to consider if they'd trust their
code to have gotten the LM onto the lunar surface safely."
Indeed a sobering thought at first glance. In this day and age it is quite common to run into a program written by one or more software engineers that seems unstable or error-prone. What if all those engineers had to deal with the rigors of getting their code "flight ready" where billions of dollars and at least 2 men's lives at risk?
But on second thought I think this comment does more harm than good. It seems to imply that either; the challenges of writing critically important, life-and-death software only occurs once in a blue moon1 or that the good-old-days of elite superman programmers who wrote error-free programs are long since gone, replaced by thousands of mediocre programmers writing millions of bug-infested computer code.
Instances of life-or-death situations being directed by computers (hardware and software) might not be an everyday occurrence for a programmer, or even a once in a career occurrence. But it does still occur. I recall the Professor who taught my Assembly Language class in college mentioning his work on a project for Motorola on a car fuel-injection system. The engine had habit of shutting down when entering the presence of electrical interference.
Just imagine driving down a highway at 55 mph only to have your car shutdown while passing by some high-tension power lines..... Now consider the added complexity of today's hybrid engines.
And while not every programming challenge is "life-and-death", plenty of software code in today's world is "mission critical" with millions, if not billions, of dollars at stake.2
In any case the coding of the Lunar Module's software was hardly error-free. In fact in regards to the Apollo 11 moon landing two specific instances occurred with the Eagle's Guidance Computer during the critical decent to the moon's surface.
At 102:38:30 Neil Armstrong calls out a program alarm, "1202". Ten seconds later, Armstrong is asking for feedback from Houston on the error code.3 Houston gives the astronauts a "go" to continue their decent. But less than 5 minutes later, with 2000 feet separating the LM from the surface the ship's computer issues a "1201".
102:42:13 Armstrong: (on-board): Okay. 3000 at 70.
102:42:17 Aldrin: Roger. Understand. Go for landing. 3000 feet.
102:42:19 Duke: Copy.
102:42:19 Aldrin: Program Alarm. (Pause) 1201
102:42:24 Armstrong: 1201. (Pause) (On-board) Okay, 2000 at 50.
102:42:25 Duke: Roger. 1201 alarm. (Pause) We're Go. Same type. We're Go.
What was a 1201 and 1202 type error?
Only that the Apollo Guidance Computer was indicating that it was
overloaded with data inputs, couldn't keep up and was resetting
Yeap, that's right, the guidance computer for the LM rebooted, at least twice, during one of the most critical phases of the mission because it ran out of memory.
The problem? An error in one of the crew's check lists had them turn on the rendezvous radar during the landing phase. Of course the LM crew was hardly trying to rendezvous with the Command Module during their decent, but the repeated calls to the computer to process imaginary rendezvous radar data filled up the limited writable computer memory4 the on-board system had, causing the system to repeatedly restart.
Now I suppose somebody will argue that the computer was hardly to blame. It was a user-generated error turning on the rendezvous radar (or a documentation error) not a computer programmer error. Moreover the program was designed to reset itself if it got overloaded on purpose.5
But, that's just it. No programmer, not matter how good, can take into account every possible error or misuse, whether created by the programmer or the user.6 Would you have considered at first that an over-head power line might scramble your car's fuel injection system?
This is where the programming concept of fault-tolerant programming comes into play. The idea is pretty basic; enable the system to continue operating properly in the event of an error. Just as the Apollo spacecraft (and Saturn V launcher) had mechanical backups to keep the physical system running in case of failure the guidance program (and properly designed programs of today) manage the error and keep it from causing catastrophic results.
Thus the statement should not be, engineers consider if you'd trust your code to get the LM to and from the moon safely. Instead it is, do you consider your software fault-tolerant enough to get one to the moon and back safely?
Interesting side note there is a community programming effort that has created a software emulator of the Apollo hardware and software, Virtual AGC and AGS.
1 Pardon the pun.
2 And by indirect implication the lives of the employees, customers, stockholders, et al.
3 Sounds eerily familiar for any modern day computer user; the computer reports some cryptic error code and the next step is to go searching for additional information on what's gone wrong.
4 Now a days we classify memory as Read-Only Memory (ROM) or Random Access Memory (RAM) and talk about Gigabytes (109) of RAM for a laptop (or even a smartphone). The Apollo Guidance Computer? About 64 Kilobytes (103) of ROM and only 2 Kilobytes of writable RAM.
5 The idea was to clear the fault and reestablish import tasks, i.e. clear out the waiting calls for calculating unnecessary rendezvous telemetry and reestablish jobs for processing landing telemetry.