Wednesday, April 23, 2014

Sasquatch, El Dorado and Bug-free Software

Digression from the previous post: A fly in the ointment

From They Write the Right Stuff 1

Excerpt
What['s]...remarkable is how well the software works. This software never crashes. It never needs to be re-booted. This software is bug-free. It is perfect, as perfect as human beings have achieved....
Once upon a time...
there was perfect software
...That's the culture: the on-board shuttle group produces grown-up software, and the way they do it is by being grown-ups. It may not be sexy, it may not be a coding ego-trip -- but it is the future of software.— Charles Fishman, Fast Company, 1996

From Software and the Challenge of Flight Control 2

Excerpt
A mythology has arisen about the Shuttle software with claims being made about it being “perfect software” and “bug-free” or having “zero-defects,” all of which are untrue. -- Nancy Leveson, 2013
After reading the last posting, a friend from my NASA days sent me a link to a 1996 Fast Company article by Charles Fishman about the on-board Shuttle team at Johnson Space Center. The article, They Write the Right Stuff, describes how the team managed their work. For the remainder of this posting, I'll refer to the article as simply the "...Write...Stuff."

My friend, who is very familiar with how Shuttle software was developed and maintained, was pointing out that the "...Write...Stuff" described the shuttle software team as having a 10-to-1 ratio of programmers to testers, roughly the same ratio proposed for Brooks' Surgical team.3, 3a That's an exception. The NASA software teams I saw were typically staffed with a 1-to-1 ratio of programmers to a combination of system engineers and testers. However, if you included management and process oversight staff (i.e., those who do not contribute directly to developing the product), the ratio was more like 4 to 1.

I found the "...Write...Stuff" disturbingly wrongheaded. It portrayed the Shuttle software development methodology as the wave of the future: the very culmination of software development as it should be, and eventually will be, for all time. Fishman is unapologetic about this claim.
As the rest of the world struggles with the basics, the on-board shuttle group edges ever closer to perfect software. — Charles Fishman

Sure, the "...Write...Stuff" was written over 15 years ago. It's a feel-good piece. It was written by a journalist (i.e., not a software engineer) for a for-profit publication whose sales depend on stirring the heart of your wistful science romantic who adores tales of shiny things that are technical.

So why take the piece seriously? Because, like Sasquatch or El Dorado, the best tall tales could be true. After all, who knows what's really beyond the campfire? A monster? A city of gold waiting for someone with the will to find it? When understanding is limited, the unlikely seems credible and impossible to disprove. Who's to say a tall tale of software perfection is a fiction? Well-meaning people who haven't lived in the woods or traveled the West can be taken in, especially when they want to believe.

Consider the plight of your typical techno-political bureaucratic manager. His efforts to rein in those software people fail repeatedly. He has endured scores of broken commitments. He now believes but a fraction of what he hears. He is flummoxed. He is living the misery of the 'software problem'4. He hears of a cure-all. A new tool; a better process; a silver bullet; a perfect team building the perfect software. He is captivated. He has a course of action; a fix is in reach. Shuffle a few budgets; levy a new approval process; demand a new document, an additional review, a few more process steps.5 Voilà! No more problems. What could be easier?

In other words, articles like the "...Write...Stuff" foster a management expectation that there is a quick fix. When these fixes are imposed on development teams pressing to meet a delivery, they can do real damage. Improvement is hard. It's expensive. It's disruptive. It's risky. Perfection is the stuff of stories.

Space Shuttle Main Engine Hoisted into Test Stand - GPN-2000-000546
Space Shuttle Main Engine hoisted into the A-2 Test Stand, Stennis Space Center  (1979)

An abridged history of shuttle software6
The on-board shuttle group worked on the code that controlled the Shuttle rocket engines. The "...Write...Stuff" portrays them as a highly regulated, strictly managed group who worked regular hours. These constraints were necessary because the software had to work; a failure would put human lives at risk.

While that is true, the interesting thing about on-board software is how it came to be. No account of the team's success is complete without accounting for the quarter century of prior effort. To understand why that matters, the history of shuttle software might be instructive.
SSME controller
Space Shuttle Main Engine Controller (2005)
The controller was attached directly to the engine body
In July 1971, NASA contracted Rocketdyne to build the Space Shuttle Main Engine (SSME). Each shuttle main engine had a pair of dedicated, redundant controllers that were attached directly to the engine. These controllers managed the low-level functions of the main engines, such as servo control, command data conversion, sensor data transmission and fault response. In the initial implementation, these controllers were Honeywell HDC-601s and the software was written in assembler. By the early 1980s it was clear an update was needed. Rocketdyne updated the SSME controllers to the Motorola 68000 and rewrote the software in C.

In March 1973, NASA contracted with IBM to build the Primary Avionics Software System (PASS). PASS comprised two parts: the Flight Computer Operating System (FCOS) and the application software. FCOS handled engine sequencing, steering and redundancy management by providing the controlling function for the software built by Rocketdyne. The application software included guidance, navigation and systems management. PASS is the software that is discussed in the "...Write...Stuff."

PASS ran on five 'general purpose computers.' IBM selected the AP-101/S processor, which was based on the same architecture as the IBM 360. Earlier versions of the AP-101/S had been flown on the B-52 and the B-1B. The processor used a variety of word sizes depending on the function. For example, instructions could be either 16 or 32 bits. Floating point values could be 32, 40 or 64 bits. The average speed for math operations was about half a MIPS. Program state for all processes was preserved in a 64-bit status word that was updated with every instruction cycle. As if that wasn't complicated enough, the processor was capable of handling 61 interrupts at 20 priority levels while preserving real-time constraints.

The five AP-101/S computers were divided into two separate systems: a quad-redundant, fault-tolerant 'primary' system and a Backup Flight Control System (BFS). PASS controlled both. All the computers processed the same data. This meant all the primary computers had to stay in sync. Reliability and safety in the primary were based on a voting scheme that checked all the computers to ensure they were producing the same result. If one computer had a different result, it was isolated.
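
For readers who have not seen a voting scheme before, here is a minimal sketch of the idea in Python. It is an illustration only, not the PASS implementation: the function name `vote`, the computer labels and the simple strict-majority rule are all my assumptions, and the real system had to do this under hard real-time constraints.

```python
from collections import Counter


def vote(outputs):
    """Compare the outputs of redundant computers (hypothetical helper).

    outputs maps a computer label to the value it produced this cycle.
    Returns the majority value and a sorted list of the computers that
    disagreed with it, i.e. the ones a voting scheme would isolate.
    """
    counts = Counter(outputs.values())
    value, votes = counts.most_common(1)[0]
    # Assumption: a strict majority is required. A 2-2 split among four
    # computers cannot be resolved by voting alone.
    if votes <= len(outputs) / 2:
        raise ValueError("no majority among redundant computers")
    isolated = sorted(cid for cid, out in outputs.items() if out != value)
    return value, isolated


# Three computers agree; the fourth is voted out.
result = vote({"GPC1": 7, "GPC2": 7, "GPC3": 9, "GPC4": 7})
```

The comparison is only meaningful if every computer is looking at the same data at the same moment, which is why keeping the machines synchronized mattered so much.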

If the primary system failed, BFS kicked in. That meant the BFS also had to be synchronized with the primary. Synchronizing the primary and backup systems would prove to be very difficult and led to a major defect that delayed the initial shuttle launch. The synchronization problem is a fascinating story captured in John Garman's article, "The Bug Heard 'Round the World." 7

Shuttle software loads.
(from Tomayko, Developing software for the space shuttle )
Each processor had access to about 100K words (36-bit words) of shared memory. Memory was addressed using a 16-bit word plus an extension, taken from the status word, that supplied the last 4 bits. Since the memory was limited, software loads had to be swapped out at different mission phases. The largest loads were about 100K words for Ascent and Entry. Swapping software loads while maintaining system state must have been a bit tricky.
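
To make the addressing arithmetic concrete, here is a rough Python sketch of combining a 16-bit address with a 4-bit extension. The bit layout is my assumption (I treat the extension as the high-order bits), and the names `extended_address` and `ADDRESSABLE_WORDS` are hypothetical; the sources above describe the scheme only in outline.

```python
ADDRESS_BITS = 16    # width of the base address word
EXTENSION_BITS = 4   # extra bits supplied by the status word

# 16 + 4 bits gives 2**20 distinct word addresses.
ADDRESSABLE_WORDS = 2 ** (ADDRESS_BITS + EXTENSION_BITS)


def extended_address(extension, base):
    """Combine a 4-bit extension and a 16-bit base into one 20-bit address.

    Assumption: the extension occupies the high-order bits. The actual
    AP-101 layout may differ.
    """
    if not 0 <= extension < 2 ** EXTENSION_BITS:
        raise ValueError("extension must fit in 4 bits")
    if not 0 <= base < 2 ** ADDRESS_BITS:
        raise ValueError("base must fit in 16 bits")
    return (extension << ADDRESS_BITS) | base


addr = extended_address(0x3, 0x1234)
```

A 16-bit address alone reaches only 65,536 words, which is less than the roughly 100K words mentioned above, so some form of extension was needed to reach all of memory.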

HAL/S code snippet
from http://history.nasa.gov/computers/Appendix-II.html 
PASS was going to be (and still would be) a difficult programming job. After coding the Apollo software in assembler, the engineering team knew PASS would be too complicated to write in assembler. After much debate, they decided they needed a high-level programming language. In 1969, NASA contracted Intermetrics8 to develop a new programming language, High-order Assembly Language/Shuttle or HAL/S.9

The software was expensive; the cost was vastly underestimated. Originally, NASA thought the cost for development of the Shuttle software would be around $20M. In the end, NASA spent $200M for the original development (and that's 1970 dollars). Not surprisingly, over 50% of the PASS modules changed during the first 12 flights in response to requested enhancements. By 1991, the Agency had spent a total of $324 million. In 1992 NASA estimated it was spending $100M a year for maintenance of the onboard software.10

It's easy to understand why the maintenance costs were astronomical. The system was very complex and difficult to understand. Each release had to be free of any mission-ending errors. Both factors drove testing budgets, not only because large testing teams running extensive test protocols were necessary, but also because the hardware and software required for development and test were not commercially available. Hardware had to be stockpiled, salvaged or reengineered from scratch at a significant cost. Similarly, all the software development tools had to be custom built and all programmers had to be trained at agency expense. (No university was churning out cadres of HAL/S programmers.)

Despite all that expense, the software was not bug free. According to Leveson, during the first ten years, 16 'severity 1' errors were found. Of those, eight remained in the flight code; operators simply avoided commands that might trigger these errors. In addition, there were 12 errors of lower severity that occurred during flight.

Not all errors occurred in the early days. In April 2009, a serious software communications error occurred in flight a few minutes after Endeavour reached orbit. It so happens that the bug was introduced in 1989 when a warning about code misalignment was inserted in the code. In other words, despite Fishman's panegyric, the shuttle software was never bug free.

The "...Write...Stuff" is not entirely false. It has elements of truth which, to my mind, only serve to lend credibility to the otherwise misleading and potentially pernicious claims about "perfect software." It's no mere detail that the "bug free" software had been under maintenance for 25 years. The "...Write...Stuff" team was not doing development. It did not even have to deal with the major maintenance bugaboos of migrating to new hardware, operating systems, or compilers.

In my experience, managers who haven't programmed are unable to grasp the fundamental differences between development and maintenance tasks. They tend to be content with the idea that one size fits all. (See The Maintenance Mindset.) In other words, since the approach used by the on-board team produced "perfect software," there was and is a tendency to believe that the on-board shuttle development methods should be applied to all NASA software tasks.

I experienced fallout from this mindset while working on the Constellation Program (Cx). In the beginning of the project, we were full of hope. We were developing plans to use modern software techniques such as software architecture and product-line concepts, an advanced computing system in the avionics package and modern fault analysis techniques. These had all been used successfully in DoD or European projects. But our efforts were in vain. The project leaders had adopted a rigid philosophy like the one described in the "...Write...Stuff." All the new approaches were vetoed. It would be business as usual. The only software engineering team on the project was defunded.

I often wonder how things might have been different if the Cx Project Manager had absorbed the basic truth about software development that was captured so handily by John Garman in his article about the development of the shuttle software.
"If there are lessons to be learned from "the bug", they must be in how we view ourselves and our task. Building software in a large system against fixed schedules is not conducive to "bugfree" products. We can minimize the errors, and we can minimize the flight criticality of the ones that remain...but we can't treat it like a problem with a methodological solution." — Jack Garman, "The Bug Heard 'Round the World" (1981)
Despite all the tumult that the cancellation of the Cx Program created in the Agency, from a software perspective, it's a good thing it was stopped. The management did not have a realistic understanding of the software challenge they were facing. By contemporary standards, the Shuttle's PASS software would not be considered a 'big' package. Estimates at the time sized the Cx software in the hundreds of millions of lines of code. Using the methods described in the "...Write...Stuff," those software development efforts would not have converged, even with an infinite amount of money. An unimaginable failure was in the making.

If you are working on the development of a large software system, you can only hope that your manager doesn't read the "...Write...Stuff" or, if she does, that she knows it's irrelevant. Better she should read the account of the team that built the HAL/S compiler. It's much better advice:
Software is a very unusual industry. You can run an assembly line with a whip, although American managers are belatedly realizing that there are better ways...Formal devices and management tricks can aid or impair it, but the impetus must come from within. The collective mind and will of the technical staff is the essence...If management strays too far from the will of the workers, tries to shape it into something that it is not, the only possible outcomes are chaos or outright rebellion. — Tony Flanders, A History of Intermetrics
But then who knows, maybe, one day, NASA's senior management will grasp the Agency's own lessons learned, so that, once again, NASA can do the right stuff.


1. Fishman, C. "They Write the Right Stuff." Fast Company. 1996
2. Leveson, N. "Software and the Challenge of Flight Control" (Chapter 7), Space Shuttle Legacy: How We Did It/What We Learned. AIAA. Edited by Roger Launius, James Craig, and John Krige. 2013.
3. He tells me that out of a staff of 400 only 50 were allowed to write code.
3a. Remember that Brooks's proposal mostly included non-technical staff. See "Fly in the ointment."
4. First described in the 1968 publication "Software Engineering, Report on a conference sponsored by the NATO SCIENCE COMMITTEE."
5. The additional effort is typically mandated with the empty reassurance that "you have to do the work anyway." Actually we didn't.
6. The following is based on material found in four excellent sources:

a) Tomayko's Computers in Spaceflight: The NASA Experience, Chapter 4
b) Leveson's Software and the Challenge of Flight Control
c) Lethbridge's Spaceline.org. http://www.spaceline.org/rocketsum/shuttle-program.html
d) Mattox and White's Space Shuttle Main Engine Controller

7. Garman was Deputy Chief of the Spacecraft Software Division at JSC and played a key leadership role in shuttle software development. See Garman, J. "The Bug Heard 'Round the World," Software Engineering Notes. Vol. 6. No. 5. October 1981.
8. Intermetrics was formed by a group from Draper Labs who had worked on the Apollo software. The story of Intermetrics is fascinating. One of the principals, Tony Flanders, has posted a very interesting history. See http://www.whysheep.com/i2/daf-history2.html
9. According to Tomayko, 2001: A Space Odyssey was playing in the theaters at the time the new language was contracted. Perhaps Kubrick's film influenced the name.
10. Id. Leveson.
