Monday, June 24, 2013

Perfection is not enough

From Chapter 1:

Excerpt
If one character, one pause, of the incantation is not strictly in proper form, the magic doesn't work. (page 8)

"Perfection" is another of the inherent woes of the tar pit. Brooks has taken some literary license to make the point that computers do not tolerate programming errors. The burden of correctness falls on the programmer.

Programming tools have significantly improved since the time the IBM 360 was developed. For example:
  • Compilers are interactive.  Programmers get as many tries as they need to get a program to compile.
  • Text editors are more helpful.  They detect spelling and syntax mistakes.
  • Debuggers can step through a program.  The programmer can watch program flow.
  • Version control and configuration management tools help keep track of changes.  A programmer can backtrack to see where a bug crept in.
  • New test tools help identify a wide range of problems.  Programmers can watch memory allocation or check for illegal paths of program operation.

While these improvements have made programming easier, the problems persist. A little more than a decade ago commercial software vendors advertised 'bug-free software.' Today, no responsible software engineer would make that claim (unless perhaps the program was trivially simple). Why? There are too many unknowns. Consider a run-time error (i.e., an error that occurs when the program is running, as opposed to one that appears when the program is being created). Here are just a few possible causes:
  • The error might be in the hardware.
  • The error might be in the operating system.
  • The error might be in the tools that build the program.
  • The error might be the result of interactions with another running program.
  • The error may be caused by an untested interaction between parts of the new program.
  • The error might be caused by a configuration of some or all of the piece parts listed above.

Nowadays, the conventional wisdom is that a program will have bugs. In fact, good programs are built with the assumption that there will be defects. A competent programmer includes assertions about the data being processed and, if an error occurs, handles the problem gracefully. For example, it used to be that if your word processor crashed you lost your work. Nowadays there's a decent chance the data will be preserved. If that sounds difficult to program, it is.
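To make that concrete, here is a minimal Python sketch of the two habits just described: asserting what must be true of the data, and saving the user's work when something goes wrong. The function names, the document layout and the recovery-file path are all made up for illustration; real word processors are far more elaborate.

```python
import json


def average_word_length(words):
    # Assertions about the data being processed: fail early and loudly.
    assert words, "expected at least one word"
    assert all(isinstance(w, str) and w for w in words), "words must be non-empty strings"
    return sum(len(w) for w in words) / len(words)


def process_with_recovery(document, recovery_path):
    """Run the processing step; if it fails, save the document before giving up."""
    try:
        return average_word_length(document["words"])
    except Exception as exc:
        # Graceful handling (hypothetical policy): preserve the user's data first,
        # then report the problem instead of crashing.
        with open(recovery_path, "w", encoding="utf-8") as f:
            json.dump(document, f)
        print(f"processing failed ({exc!r}); document saved to {recovery_path}")
        return None
```

Even in this toy, the recovery path is itself code that can fail (a full disk, a document that can't be serialized), which is part of why this kind of programming is harder than it looks.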

The problem is tougher when you leave the comfy confines of the workstation or the desktop. A space system must operate in the physical world; the challenge is much more complex. For one thing, the system must be designed so that any program bugs can be handled remotely; if not, a mission may come to an abrupt end (e.g., Mars Pathfinder). New tools have come along that help considerably. However, these have limited value. Similar problems were not detected by these new tools in the recently launched Mars Science Laboratory mission, fortunately without mission-ending consequences.

But these kinds of programming challenges are just the tip of the iceberg. The greater challenge stems from the formidable fact that a space system must reason about the physical world. It must have knowledge of its position, its stored power, its temperature, its memory, etc. And the physical world is unforgiving. A part failure, an operational error, or an extreme environmental condition (like extreme temperature or uncontrollable tumble) can end a mission. Bottom line: a spacecraft has much more rigorous error detection and response requirements than a word processor.

Systems that operate in the physical environment are often referred to as control systems. A simple control system will obtain data from a sensor, like a temperature sensor, and take actions based on the sensor data. A complex control system will have a panoply of sensors. The system reasons about the data from the sensors to create knowledge about itself and the environment. It then takes action in response to that knowledge. The selection of an action will depend on operational intent, which may change over time. And while it's doing all that, it's checking for errors that may threaten the mission.
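A toy Python sketch of one pass through such a loop might look like the following. Everything in it is an assumption made for illustration: the two sensors, the thresholds, the action names and the "science" intent are invented, and a real flight system would be vastly more involved.

```python
from dataclasses import dataclass


@dataclass
class Readings:
    temperature_c: float
    battery_fraction: float  # 0.0 (empty) to 1.0 (full)


def check_faults(r):
    """Return a list of conditions that could threaten the mission."""
    faults = []
    # Hypothetical limits; a NaN reading also fails this range check.
    if not -40.0 <= r.temperature_c <= 60.0:
        faults.append("temperature out of range")
    if r.battery_fraction < 0.1:
        faults.append("battery critically low")
    return faults


def choose_action(r, intent):
    """Pick one action from sensor data and the current operational intent."""
    if check_faults(r):
        return "enter_safe_mode"  # error response takes priority over intent
    if r.temperature_c < 0.0:
        return "heater_on"
    if intent == "science" and r.battery_fraction > 0.5:
        return "collect_data"
    return "idle"
```

The same readings can lead to different actions as the intent changes, and the fault check runs on every pass before anything else is decided.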

In the space age, program "perfection" means something quite different from what Brooks suggested. It requires an overall system knowledge that extends to all aspects of system design. This is a bewilderingly complex problem that conventional engineering practice does not adequately address. If we are to make substantial progress, our engineering methods will have to evolve to better manage system complexity.

In the current Agency culture, changes in engineering practice are typically superficial, underfunded and greeted with a deserved skepticism.
