Friday, January 31, 2014

The Maintenance Mindset

From Ratus rattus: A digression from the previous post

Excerpt

Figure: A map of North America in 3174, from "A Canticle for Leibowitz."
"Software development and software maintenance are fundamentally the same activity and can be funded and managed the same."

From A Canticle for Leibowitz1

Excerpt
For Man was a culture-bearer as well as a soul-bearer, but his cultures were not immortal and they could die with a race or an age
It was a busy and satisfying holiday season. Blogging took a back seat to out-of-town guests and holiday events.

Not all holiday surprises are good. The warp and woof of home repair marches on no matter what the season. 2013 departed with a few unanticipated, and costly, home maintenance levies. So maintenance, in particular maintenance of software systems, has been on my mind.

In the Ratus rattus posting, I listed a few assumptions about the "immutable" facts of life that adversely impact the lives of the NASA software development community—assumptions that must be challenged if we are to build the next generation space system. The list included the claim that NASA management approaches development and maintenance activities as if they are fundamentally the same.

To those who develop software for a living, the distinction between development and maintenance activities may seem obvious. Both entail requirement development, design, coding and test. However, in my experience as a development manager, the differences are fuzzy and easily misunderstood, especially by NASA senior managers and executives who have no experience as programmers. The misunderstanding leads to unrealistic schedules and budgets and, ultimately, to the miseries described 40 years ago in The Mythical Man-Month.

The skeptics among you will have doubts about that claim. In this post, I'll try to explain why this is a mountain and not a molehill. First I'll describe how software development differs from software maintenance. Then I'll discuss how the 'maintenance mindset' has become woven into the fabric of the agency. At the risk of being sketchy, I'll try to be brief.

First, a quick comparison.
Availability
A system under development has never been deployed and is used only by programmers and testers.
A system under maintenance has been deployed and is being used by mission customers.

Code Change
During development, a code change can be introduced without affecting a mission customer.
In maintenance, a change may affect a mission customer, and the impact of that change must be studied before the change is implemented and the system is redeployed.

Requirements
During development, requirements for the product change as the team balances known customer needs and discovered customer needs against the realities of schedule and budget.
During maintenance, the requirements (or more likely a subset of the requirements) have been implemented, and requirement changes are limited to correcting errors and adding new functions that address customer needs.

Design
During development, the system design quickly evolves as the team discovers how to address changing requirements.
During maintenance, the system design exists and design changes are confined to correcting defects or fitting in new structures that come with new functions.

Architecture
During development, the system architecture is evolving.
During maintenance, the system architecture is fixed.

Interfaces
During development, interfaces morph as the team works through the intricacies of getting the new code to work with existing code.
During maintenance, interfaces become brittle with time and changes may have catastrophic consequences.2

Test
During development, the test regimen, like the code it tests, is in flux as new tests are created and older tests break.
During maintenance, the test regimen is established and serves as a standard of system readiness.

Programming to Test Ratio
During development, the programming budget is typically 100-400 percent of the integration-and-test budget.3
During maintenance, the programming budget is typically less than 10 percent of the integration-and-test budget. After all, the product is 'done.'
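
To make the last comparison concrete, here is a minimal sketch that converts a programming-to-test ratio into each activity's share of the combined budget. The function name and the constants are my own illustration; only the ratios (100-400 percent in development, under 10 percent in maintenance) come from the comparison above, and the sketch assumes the combined budget is just programming plus integration-and-test.

```python
def programming_share(ratio: float) -> float:
    """Return programming's fraction of the combined budget.

    `ratio` is the programming budget divided by the
    integration-and-test budget (1.0 means 100 percent).
    Assumes the combined budget is programming plus test only.
    """
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    return ratio / (1.0 + ratio)


# Ratios quoted in the comparison; the names are hypothetical.
DEVELOPMENT_LOW = 1.0    # 100 percent
DEVELOPMENT_HIGH = 4.0   # 400 percent
MAINTENANCE_MAX = 0.10   # "less than 10 percent"

if __name__ == "__main__":
    cases = [
        ("development (low)", DEVELOPMENT_LOW),
        ("development (high)", DEVELOPMENT_HIGH),
        ("maintenance (upper bound)", MAINTENANCE_MAX),
    ]
    for label, ratio in cases:
        share = programming_share(ratio)
        print(f"{label}: programming {share:.0%}, test {1 - share:.0%}")
```

Read this way, programming takes roughly half to four fifths of the combined budget during development, while in maintenance more than nine tenths goes to integration and test.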

To sum up, changing a system under maintenance is expensive; changing a system under development is cheap, or at least relatively cheap. A poorly planned change to a system under maintenance will likely produce unintended and expensive consequences. So, it is quite sensible that, despite the considerable added expense, changes to a system under maintenance should be subject to elevated standards of governance. It is common practice in NASA that a change to a deployed (i.e. maintained) system requires mountains of paperwork and the blessing of one or more change boards.

Conversely, a relaxed standard of governance is appropriate for a system under development; otherwise development costs inflate significantly. Nothing is inherently wrong with the added cost, so long as budgets and schedules match. When they don't, the stage is set for failures that even the best software engineers cannot overcome. However, despite the shrinking budgets, it has become common practice in NASA to apply the same governance approach to both maintenance and development tasks. The result: it has become nearly impossible to make fundamental improvements to our systems.

Why has this happened? It's the natural outcome of a historic trend. Think back.

The great innovations in NASA took place when Americans and Soviets were locked in a competition for technical and ideological ascendancy. It was the era of Apollo and Voyager. The Nation's prestige was on the line. The Agency was well-funded. Engineers solved problems for the first time. Real success was required. Mistakes were expected. Reasonable risks were embraced. 'New' was necessary.

A decade rolls by. Astronauts walk on the moon. Jupiter and Saturn have their close-ups. But, with the national prestige assured, public interest wanes. Space is no longer prime time; coverage of the Moon landings disappears below the fold. The budgets shrink. And there are technical problems in paradise. NASA is no longer perfect. The Shuttle Program is delayed two years while engineers burn the midnight oil trying to figure out how to safely attach the heat shield tiles.4 Congress nearly cancels the Shuttle Program. Meanwhile, flagship science missions like Galileo are dogged by technical and budget problems. Things are not going smoothly. 'New' is becoming risky.

Another decade passes. There are more problems. The big science missions are overrunning schedule and budget. The space station project is stuck in a mire of international politics and technical indecision. Worst of all, there is a major accident. Meanwhile, the Agency's budget is shrinking. The priorities shift to saving cost and risk reduction. NASA's engineering culture morphs. In the Manned programs, the engineering talent is outsourced. NASA engineers no longer build new systems; rather, they oversee the work of the contractors who build systems that are optimized for cost at the expense of innovation. At the same time, project managers shun new design concepts for science spacecraft because cloning the previous mission is perceived as less risky and more cost-effective.

It's now 2014 and it's been decades since NASA developed a novel system.5 In particular, the engineering techniques, the avionics, the operational concepts and, above all, the design of the integrated software are fundamentally unchanged. In essence, NASA's contemporary approach to space system development amounts to building a minor variant of the last mission.6 That would seem to be the cheapest, least risky thing. But consider this...

Each new variant conveys cruft from the previous missions. As the cruft accumulates the software grows brittle and breaks in unexpected ways. The more brittle the software, the more it costs to maintain.

How exactly does software grow brittle? Consider the software needed for a robotic planetary explorer.

Figure: N2 chart key features. As the number of interdependencies grows, so does the chance that a change will break the system.
An entire system includes a LOT of software. For example: There is software needed for mission design. There is software for operations. There is software for capturing, displaying and storing telemetry. There is software for commanding. There's software for communications. There is software for turning the data into pictures. That's not to mention the onboard flight code, dozens of other domain-specific tools, thousands of tests and thousands of user scripts. All told, 20-30 million lines of code are needed to build and fly a robotic planetary explorer. That's a lot. And it all has to work together.

Here's the rub: no one really knows exactly how all those pieces fit together. No one knows the dependencies. No one knows what software will break when a change is made. The longer a system has been around, the less is known and the more brittle it becomes. For example, a recent operating system upgrade to the Cassini ground system took over a year and millions of dollars to complete. That was just the OS! As a practical matter, managers responsible for older systems will minimize code changes and invest in as much testing as schedule and budget permit.
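
If the dependencies were known, working out what a change might break would be a simple graph walk. The sketch below is only an illustration of that idea: the component names and the dependency list are hypothetical, loosely named after the kinds of software listed above, and nothing here describes a real NASA system. It also counts the off-diagonal cells of an N2 chart, which is the number of directed interfaces that could exist among n components.

```python
from collections import defaultdict, deque

# Hypothetical dependency edges: (component, component it depends on).
DEPENDS_ON = [
    ("telemetry_display", "telemetry_store"),
    ("telemetry_store", "telemetry_capture"),
    ("image_processing", "telemetry_store"),
    ("telemetry_capture", "communications"),
    ("commanding", "communications"),
    ("user_script_a", "telemetry_display"),
]


def build_dependents(edges):
    """Map each component to the set of components that depend on it directly."""
    dependents = defaultdict(set)
    for component, dependency in edges:
        dependents[dependency].add(component)
    return dependents


def impacted_by(changed, edges):
    """Return every component that directly or transitively depends on `changed`."""
    dependents = build_dependents(edges)
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for component in dependents.get(current, ()):
            if component not in seen:
                seen.add(component)
                queue.append(component)
    return seen


def possible_interfaces(n):
    """Off-diagonal cells in an N2 chart with n components (directed interfaces)."""
    return n * (n - 1)


if __name__ == "__main__":
    print(sorted(impacted_by("communications", DEPENDS_ON)))
    print(possible_interfaces(6))
```

The catch is the input. The walk only finds what is recorded in the dependency list, and the unofficial interfaces and user scripts that the development team never sees are exactly the edges that are missing from it.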

With each iteration of the old systems, the problem is multiplied. There's a dilemma. Do you fund maintenance for each version of each application for each mission, or do you fund the maintenance of a single version that's used on several missions? The latter is the rule because superficially it is cheaper to maintain a single version. However, that means that every change now impacts all users and maintenance costs skyrocket. That's why 'New' is perceived as forbiddingly expensive and risky.

After three decades of rebuilding the same systems, the Agency has lost the instincts for new development. There is a single standard for governing software tasks. Developers of new products are expected to produce project plans, requirements and design documents long before the system is understood well enough to produce those artifacts. And they must do that on unrealistically low budgets that were formulated to compete with specious cost and risk savings assumed for reuse from the last mission. In other words, the obligations are going up while the budgets are going down. The cards are stacked against the developer before the project begins.

Perhaps the most worrisome consequence of this maintenance mindset is the lost ability to distinguish the essential from the nonessential. The original reasons for many interfaces, applications, and requirements have been forgotten. Perhaps a file indexing tool, or an identifier in a data structure, or a data conversion format with its suite of conversion tools was once needed. In time these artifacts become 'necessary' and are sustained with dedicated budgets, managers and staff. Reduced cost and risk aside, any new approach that dispenses with one of these 'necessities' will confront a fierce political and technical struggle from competent people whose jobs are on the line. The fight typically consumes enough of the development resources to renew the skeptics' belief that improvements are bound to fail.

This is admittedly a grim picture, but a natural one for a mature bureaucracy like NASA. Bureaucracies will develop organization structures, institutional policies and a management cadre that resists change. It is as natural as Fall following Spring and Summer. There is vitality in the private sector; corporations that do not evolve perish. Not so with government bureaucracies. They are especially resistant to change and tend to persist so long as there's an influential constituency on the receiving end of some benefit. Unless there's a crisis of dynamic proportions, no one can expect a government bureaucracy to change.

On the bright side, there are examples of organizations that have found ways around the stagnation. The Lockheed Skunk Works is the premier example. The secret: provide a talented group with a high degree of autonomy. The talent exists in NASA. Now if only there were an imaginative and gutsy executive who could live with the autonomy.

Does anyone see an asteroid headed in our direction?



1. Miller, W. M., "A Canticle for Leibowitz." J.B. Lippincott & Co. 1959.
2. Mission users create hundreds, if not thousands, of special purpose applications using both official and unofficial interfaces. Few of these are visible to the development team. Any interface change to a deployed system may break significant parts of a working system in unpredictable ways. This 'brittleness' gets worse with time.
3. Depending on how many times the development team has built similar systems.
4. Williamson, R.A. "Developing the Space Shuttle." from Exploring the Unknown, Chapter 2. P.176. http://history.nasa.gov/SP-4407/vol4/cover.pdf
5. The Constellation Program was no exception. During my stint on the Program, I heard upper management sincerely proclaim the intention to build something new. Initially, everyone I knew working on the project was full of hope that we could at last build a system that embodied what we knew needed to be done. However, in the end, the constraints of budget and schedule, the necessity of an affordable bid from a contractor, and the bureaucratic drive for consensus led to a software system that was a mere rehash of all that was built before.
6. Any improvement is cause for great concern. During my last assignment, the decision to use a commercial database for processing and storing telemetry data caused great handwringing. After a year of discussion, the technical leadership remained undecided. Meanwhile, a new mission was seriously considering using a telemetry processor that was 15 years out of date.