Saturday, December 21, 2013

Turning the system inside out

From Rattus rattus: A digression from the previous post

Excerpt
"Sponsors and upper management should not be exposed to development details even when those details drive cost and schedule." 
Niccolo Machiavelli (1469-1527)

From The Prince1

Excerpt
There are three different kinds of brains, the one understands things unassisted, the other understands things when shown by others, the third understands neither alone nor with the explanations of others. The first kind is most excellent, the second also excellent, but the third useless. (Chapter 22, page 104)


In the Rattus rattus posting, I listed a few "frustrated utterances of immutable facts" that adversely impact the lives of the NASA software development community.

The single greatest challenge I faced as a NASA software development manager was finding ways to communicate key decisions to management without delving into technical details. That's right. It was inevitably a mistake to get technical with our management. If we did, we would likely be derailed by tangential questions or hostile interlocutors.

This was a fact of life I never fully accepted. After all, I was working for NASA, the home of advanced technologies, the best and the brightest, the nation's stake in the future.

It was 1998. I was fresh blood, full of ideas; the "ancestral pieties"2 didn't apply. What I saw was very smart people working with decade-old technologies. Time to put the past behind. I'd been hired into a group with forward thinkers in leadership roles. We wanted to use the new tools: object-oriented languages, real-time operating systems with protected memory and compilers that supported generic programming. I was about to take part in building the next generation space system using state-of-the-art software.

Our goals were at odds with a lunch-time scuttlebutt that was punctuated with aphorisms like "software is an evil necessity," or "there's never time to do it right; there's always time to do it over." As yet, I had no appreciation of the machinery that preserved the established order. I would get my first exposure soon enough.

The implementation phase of our project was about to start. The first major review was around the corner.3 The team had decided to adopt C++ as our programming language. We needed funds for new tools, infrastructure and training. We needed management buy-in. Our manager asked me to present a rationale for using a new language and not selecting Ada or C, the programming languages used on the last missions.4 I prepared a balanced, 20-slide deck with code examples that illustrated the benefits and the pitfalls of C++.5

Five slides into the pre-review walk-through, my manager sent me to the showers. I had too much detail. I had highlighted potential difficulties. By discussing issues that would interest a responsible software developer, I had unwittingly painted a picture of a disaster in the making. My boss warned me that my material would freak the project management and we would surely get 'help' we did not want. If you happen to be a software engineer working in a hardware-centric, government-sponsored bureaucracy, the last thing you need is 'help.'

I was learning. Slowly. There were bigger surprises ahead.

Since I was the new guy, I often reached out to the team's most-experienced programmers for advice. We were about to start programming, but I did not yet have adequate requirements. One day, over coffee, I started grousing to one of the senior guys about the lack of requirements. He smiled knowingly. "Our requirements were useless," he said. By requirements he meant "shall" statements. Given what I'd seen, I had to agree. My personal favorite awful requirement was, "The software shall not harm the hardware." "The Systems Engineers haven't a clue," he said, "we just have to figure it out ourselves."

"What about testing?" I asked. You need requirements to know what to test. "We test it," he said. He meant the programming team. "We show the testers what we did and they just repeat it. No value added." Suffice it to say it is NOT considered a best practice for programmers to test their own code. Then there was this clincher. "All that counts is the code. If it's in the code, it's on the mission."

It so happened that this particular programmer was an expert user of the system he was building. He knew what the system should do and could build a usable system without requirements. Still, I was skeptical that his code would pass the delivery review. Surely there would be a reckoning. There wasn't. That review went something like this: "Code delivered on time and tested." His delivery was a rip-roaring success. Management was satisfied--there was no apparent cause for worry.

During the past decade, our development practices became more rigorous. The agency adopted a set of required processes for software.6

In spirit, these process mandates are reasonable. In practice, they levy a significant, typically unfunded, burden that produces a mountain of paper that describes the code and how it was built. So much paper that only a small portion of the documents are carefully read. An even smaller portion are treated to thoughtful analysis by an engineer with sufficient expertise to render a useful opinion. Nevertheless, these documents become an official record of engineering thoroughness--a certification of the quality of the code. When reviews roll around, management can conveniently meet their obligations by ticking down a list of required documents to see if any of their number is missing.

It's a very practical arrangement that has become settled convention. Developers are free to do their work without exposing coding details or any risks that might be associated with design choices. Managers are assured by the process machinery that all is in order without taking the trouble to understand the software. For much of the schedule, the project purrs along with happy sounding Earned Value metrics until the predictable budget overruns and schedule slips (which the required process did nothing to alleviate) light up the FEVER charts with red and yellow like a Christmas tree.

For years I failed to abide by this unwillingness to understand the details of the software. However, as I assumed greater management responsibility, I came to appreciate, even accept, why software engineering was the red-headed stepchild on NASA projects. In the conventional view, a spacecraft is fundamentally a very complicated piece of hardware; it just happens to have some software inside. Managing a $400-500M enterprise is hard enough without getting bogged down in the minutiae of software piece parts. The spacecraft-as-hardware is a cultural mindset with roots that reach back to the Apollo era. The vestiges from that time live on in the project WBS and major milestone reviews. For example: a typical WBS places the flight software under the avionics subsystem, a bureaucracy away from the ground system software. Similarly, a typical three-day, 36-hour gateway review allows but a couple of hours for the discussion of the flight and ground software efforts.

And yet, in project after project, Brooks' 40-year-old admonitions reign supreme--software remains a persistently vexing management problem. The code is late and over budget. It doesn't do what it is supposed to do. There are bugs. The maintenance costs a fortune. There is always a plague of technical gotchas that resist a simple fix. The tools that worked for the last system no longer work. The explanations from the software people are arcane and bewildering. Is it any wonder that project management is distrustful of software when such a small portion of the budget repeatedly causes so much trouble?

I have been on the receiving end of this management skepticism. There is no good answer to questions like: "Why are you reinventing the wheel?" Or, "Are those changes really needed?" Technical reasons, no matter how good, sound defensive or seem to obfuscate. I've heard it on reliable authority that in their corner offices senior managers confide to each other that the software problem stems from a lack of discipline and a lackadaisical attitude about commitment. So when a crisis of budget or schedule beckons and a management decision is required, it's usually rendered with the preamble, "I don't know anything about software, but..." True enough. It's a decision made on the basis of mistrust without knowledge of the details.

If you've read elsewhere in this blog, you'll know I believe that the next-generation space system must be a software-intensive system and not a spacecraft-as-hardware system. In other words, the design and implementation of the software would become an overarching project concern that links power, propulsion, mass, attitude control, navigation, operational concepts and fault protection. This means turning the system concept inside out so that project leadership makes a priority of understanding the software and how it connects across the system. Until that happens, the development of a smart, reliable, affordable system capable of complex operations, human or robotic, will remain beyond the reach of space system engineering.

Still, the underlying problem of managing a large, complicated development effort remains. The leadership must be able to understand the software and still orchestrate the work of the many engineering efforts by the collaborating disciplines. No one can master all.

Management will need to have an intuition about the software to grasp which details matter. Intuition that only comes from the experience of writing code under deadline pressure for an unknown user. Code that is designed for change, reuse and longevity. The kind of code Brooks called a "programming systems product." Only then will a manager have the gut-wrenching experiences needed to understand why software development is not like a music box. It is a discovery process that varies with the maturity of the team, the tools and the product. Failing that experience, it's very unlikely a manager will have a reliable intuition.

To the best of my knowledge, there is not, nor has there been, a single senior manager in NASA who has worked as a professional programmer. After all, NASA is a mature, hardware-centric, government bureaucracy with an entrenched culture. Cultural adjustments are disruptive. Of all the challenges that face the Agency, introducing a software-centric focus may be the most difficult.

Whitehead famously writes about the advance of civilization through the effect of certain ideas. "An idea is a prophecy which procures its own fulfillment."7 A reworking of NASA to prepare for the development of a next generation's space system is not beyond the realm of possibility. The transition could be realized in a single administration by an enlightened, determined leadership. It has happened in the past when the agency was born. It could happen again.


1. Machiavelli, N., The Prince. Translated by Ricci, L. Revised by Vincent, E.R.P., Oxford University Press, World Classics. Reprinted 1968.
2. Nifty phrase lifted from A.N. Whitehead. "Adventures of Ideas". Free Press Paperback. 1933.
3. A Preliminary Design Review (PDR) that occurred in the spring of 1998.
4. The Cassini flight software was written in Ada. The flight software for Pathfinder was written in C.
5. C++ is a very powerful but difficult programming language because it is easy to make subtle errors that lead to bugs and performance issues. I personally had a dozen well-read books that provided programming guidance.
6. For a sample of the required NASA processes see the NASA Process web site.
7. A.N. Whitehead. "Adventures of Ideas". Free Press Paperback. 1933. Page 42.

Tuesday, November 19, 2013

A textbook case: next chapter

Update: A textbook case

In a previous posting I made the claim that the problems with the rollout of Healthcare.gov were a textbook example of problems Brooks describes in The Mythical Man-Month. If misery loves company, and if you are a NASA software developer, the continuing reports about Healthcare.gov should provide reassuring proof you are not alone.

The Washington Post reported today that, last spring, the administration was warned by 'independent' consultants that the website deployment might be delayed. A number of risks were listed, but the following was especially telling:
...the policy and requirements of a program are best defined at the outset, leaving sufficient time for testing and revision. By contrast...the federal marketplace’s design was marked by “evolving requirements” that shifted throughout the design phase, leaving scant time to test the system before its launch.
(from Private consultants warned of risks before HealthCare.gov’s Oct. 1 launch, November 19, 2013 Edition of the Washington Post)
For many of us in the software development business, the report of requirement creep will have a creepily familiar ring. Requirement creep is inevitable. It's always been with us; it always will be. (A topic for a later posting.) Here is a particularly trenchant example of requirement creep and the Humpty-Dumpty Effect1 that occurred four decades ago during development of the Shuttle.
Even though NASA engineers estimated the size of the flight software to be smaller than that on Apollo, the ubiquitous functions of the Shuttle computers meant that no one group of engineers and no one company could do the software on its own. This increased the size of the task because of the communication necessary between the working groups. It also increased the complexity of a spacecraft already made complex by flight requirements and redundancy. Besides these realities, no one could foresee the final form that the software for this pioneering vehicle would take, even after years of development work had elapsed, since there continued to be both minor and major changes. NASA and its contractors made over 2,000 requirements changes between 1975 and the first flight in 1981. As a result, about $200 million was spent on software, as opposed to an initial estimate of $20 million. 2
The Post article hints at another area that mirrors the NASA development culture. According to the article3 "programs of this scale are ideally pursued in a more orderly process..." That phrase was lifted by the Post from a slide deck that was provided by the House Energy and Commerce Committee. The consultants deserve kudos for showing a particularly deft touch at reporting what appears to be a poorly implemented development process.4

Soft-pedaling concerns has become a way of life for reviewers--especially reviewers whose livelihood depends on the next contract. This sort of conflict of interest is commonplace because there is a small community of experts with the requisite skills to offer a meaningful opinion.

For those in positions of high authority, this is an old problem. After all, can you believe someone who has a vested interest in telling you what you want to hear? Machiavelli weighed in on this point. He had the following advice for Lorenzo de’Medici.
It is an infallible rule that a prince who is not wise himself cannot be well advised...The counselors will all think of their own interests, and he will be unable either to correct or to understand them. And it cannot be otherwise, for men will always be false to you unless they are compelled by necessity to be true. Therefore it must be concluded that wise counsels, from whoever they come, must necessarily be due to the prudence of the prince...5 [my emphasis]
In other words, Machiavelli suggests that for effective governance, the management must be wise about the subject at hand and receive true counsel from her advisors. Ironically, successful management of a government technology project depends heavily on one's ability to craft the acceptable message. So much the better if the message is true.

1. See The Humpty-Dumpty Effect for a discussion of what happens when work is divided between engineering teams.
2. Tomayko, J., Computers in Spaceflight: The NASA Experience, Chapter Four-Computers in the Space Shuttle Avionics System-Developing software for the space shuttle. NASA contractor report, 1988. Page 114.
3. ibid
4. Given the reported lack of coordination between development groups, this is anything but surprising.
5. Machiavelli, N., The Prince. Translated by Ricci, L. Revised by Vincent, E.R.P., Oxford University Press, World Classics. Reprinted 1968. p.108.

Friday, November 15, 2013

INEDIA, the new, improved FBC

From Rattus rattus: A digression from the previous post

Excerpt
"Projects overrun because the NASA business model encourages under bidding and the promise of unrealistic expectations."
World Peace, World Hunger and a World Series win for the Astros were overlooked
In the last posting, I listed a few "frustrated utterances of immutable facts of life at NASA" that adversely impact the lives of the software development community. Starting with this posting, I'll launch into a perfectly foolish effort to dispute a few of these "immutable" facts with the hope of describing an Agency culture where it's rational to develop the kind of software needed for the next-generation space system.

I'll start with frustrated utterance #1: NASA's business model encourages underbidding and overoptimism. Why is that? Bear with me for a few mental switchbacks.

First stop, the money. For the last decade NASA's budget has hovered around $17B per year.1 While this is a lot of money, it's small potatoes in the world of government programs. Here are a few comparisons:
  • Homeland Security has had a budget of roughly $55B since 2011. That includes $8B for TSA in 2012. That's nearly twice the size of NASA's 'Science' program.2 7
  • The Farm subsidies budget has varied from $15-24B since 2000. The subsidies are controversial since much of the money goes to large corporate farms.3
  • The 'Prisons and Detention' budget has been around $8B since 2011. A needed expenditure, but an uncomfortable comparison with the NASA science budget.4
No criticism of these programs is intended. I only mean to provide some reference for understanding that, in the grand scheme of things, NASA's budget is modest. On the other hand, NASA's stated aspirations are anything but modest.

What about the NASA budget itself? It's not what it used to be. The degree of decline might surprise you. Consider the following:
  • In the years from 2000 to 2013, NASA's budget remained flat at around $15-17B. 5
  • In 2004, the President announced a major space initiative to return to the Moon and Mars. There was no matching budget increase to accommodate this major new program.6
  • During the Apollo program, NASA's budget was over 4% of the Federal Budget
  • From 1975-2000, the NASA budget represented about 1% of the Federal Budget
  • The 2013 NASA budget is about 0.5% of the Federal Budget.
That $17B has to go a long way. It supports all the programmatic goals called out in the Mission Statement. About 30%, or $5B, is allocated to 'Science'. About 60% is allocated to 'Exploration'7 (for developing and operating the shuttle replacement systems). The rest goes to odds and ends.8 Once you throw in all the piece parts of the NASA mission statement, it's easy to conclude that the Agency menu calls for steak on a hamburger budget. How can Headquarters get by with that? Cleverness counts.

Two decades ago, the Administrator introduced a new approach for NASA missions, "Faster, Better, Cheaper (FBC)". In theory, the work could be streamlined; the funding gap would be made up by efficiency. At first, the policy looked like a winner. The first FBC mission, Mars Pathfinder, was a huge success. FBC was anointed as the new NASA business model.

However, the Pathfinder success wasn't what it seemed. According to a manager who played a key role on that project, Pathfinder met its budget because it received significant help from institutional investments and technology contributions from other missions. In other words, the actual cost of Pathfinder was significantly higher than the FBC budget. It just appeared to be less because the project managers were good horse traders.

Subsequent missions would not enjoy the same advantage. The result: two mission failures, a slew of cost overruns, an overworked and dispirited engineering staff and a halt in the infusion of new mission technology. Almost everyone I know considers FBC a sham. The standard retort around the Agency was, "Faster, Better, Cheaper--pick two."

FBC endures as a set of cost-capped science programs.9 Unfortunately, due to the mission failures, the intended efficiencies did not survive and the Agency is now rife with oversight policies and processes that drive up cost. As if that wasn't enough, labor costs have increased sharply in the last decade. In other words, costs are rising, budgets are declining and expectations remain unchanged.

Perhaps the policy that underlies the NASA business model should really be called, "It's Not Enough, Do It Anyway (INEDIA)."

The realities of INEDIA require that something must give. But what? In order to understand how the sacrifices are chosen, it is necessary to understand NASA's real role in the government. In other words, who's the real NASA customer and what do they expect in return for the $17B?

Many would say that the science community is NASA's customer. In my unorthodox view, the real customer is sitting in the White House. NASA provides each President with a lasting legacy of concern for the future and ensures our national prestige as the leader among nations. The scientists are really just a focus group. In other words, science is not primary; it's a means to an end. The value that NASA provides any administration is as a public relations generator--$17B worth of public relations. If science generates a public-relations benefit, then the customer is satisfied.

If you accept this assertion, here are a few conclusions that follow:
  • Myths are good PR. NASA is in the business of science myth-making on a fixed budget. Rover tweets are cheap. Cool webpages with galactic photos and bios of glamorous young engineers galvanize the nerdy imagination.
  • Science proposals must have a gee-whiz quotient with a PR payoff.
  • Science discoveries have a short life in the news cycle. Space dramas like the MSL landing, the MER Landing or the final Hubble Repair Mission focus national attention. The important scientific discovery that there was once a lake on Mars had a half-life equal to last week's Monday Night Football game.
  • The PR beast must be constantly fed--doesn't matter if it's only a celebration of the trivial.
  • Science programs receive a modest proportion of the budget because the manned program tends to provide better space drama. This is captured in the oft repeated aphorism "No Buck Rogers, no bucks."
  • NASA can poke along with its programs so long as there are no disasters. Mission failures are bad PR. The President is too busy with real crises and doesn't need one from NASA.
  • The lack of space drama has created a crisis of direction in the Agency.
In other words, INEDIA, but don't foul up.

So what's a NASA manager to do? So many areas in the mission statement, so few funds. Hard decisions like the closure of a NASA center or the abandonment of a program are fraught with political peril. But, worry not. A system has evolved that preserves all the hobby horses. Here's how it works:

At this level of planning, it doesn't matter if the 'e' comes before the 'a' in software

Most of the sound and fury illustrated in the diagram is dedicated to the politics of consensus. Consensus is essential. No one wants a nattering nabob disrupting the harmony.

The engineering, the underlying reality, shows up in the upper right around the "proposal development" box. That's where the politics and engineering get mixed in a recipe that looks something like this: Stir in the need for gee-whiz. Add a reduction of development managers with an extract of dubious commitment justified by facile assumptions of reuse and inheritance. Whip up a fluff of directives from upper management who fret about keeping an institution open. Voilà! There you have it. An INEDIA-faithful mission proposal infused with elevated expectations and a budget/schedule time bomb. As the wags back at work used to say, "The Director never met a proposal he didn't like."

If you happen to be a software manager who added ingredients into the brew, you would now be responsible for flight products based on inherited software. You would be under budget pressure because your budget is too large. No one on your team believes that the as-built, inherited system will be deployed because it never is. The work starts; the schedule gets chewed up. In time the actual requirements appear. Before long it becomes obvious that schedule and budget relief are needed. But a software manager has no authority. Your project management dismisses your concerns and assumes these are merely the usual scare tactics of extortion. The proposal commitments are sacrosanct. A project manager's reputation depends on devotion to the proposal canon. Once again, it's time to apply the screws to those undisciplined programmers. But, as the launch date approaches, the truth of testing cannot be repressed. Project management finally orders up reserves from the keep. By then it's late in the schedule and the team has no choice but to cut features, apply ad hoc patches, and take design shortcuts. And so, another task slips budget and schedule.

Perhaps the worst of this tired tale is that the resulting spaghetti-like architecture is destined to become the basis of the next mission proposal and the cycle repeats. It's a Brooksian Tarpit.

How might things change? Is an INEDIA business model for NASA simply the intractable result of the natural forces alive in a government bureaucracy? Could it be different?

I firmly believe so. If for no other reason than once the Chinese or Indian Space programs land systems on the Moon and Mars, there will likely be mounting political pressure to reassert the Nation's leadership in space technology. And, it's not just the Chinese and Indians; Brazil, Argentina and Iran are emerging nations that have their own space programs. If that doesn't turn the tide, an asteroid scare could change the equation. Eventually, an administration will be called on to address the loss of national prestige. When that happens, the Agency goals will be refactored so that budgets match expectations.

It need not come to that. Here are a few reckless suggestions for how the Agency might escape the shackles of the INEDIA cycle.
  • Scale back the NASA Mission statement to match the budget
  • Assume that the most exciting missions will be enabled by engineering improvements. In particular, improvements to Systems and Software Engineering that address the complexities of building a software-intensive space system.
  • Add a focus group whose objective is to define projects and programs that generate space drama and national prestige.
  • Alter the proposal process to address the full cost of ownership and prevent low balling the development costs
  • Reward management that can accurately sniff out the false cost assumptions claimed from reuse and inheritance.
  • Establish a 5-year suspension for project proposers who overrun budgets and slip schedules because of low-balling development and operations.
I know it's crazy, but the current way of doing business is not immutable. It could change. A new business model might just revitalize the Agency... even for a paltry $17B.


1. See slide 7 of NASA 2013 Budget Estimates
2. Homeland Security Budget p.25
3. Farm Subsidy Table on Environmental Working Group (EWG) web site
4. U.S. Department of Justice Overview page 6.
5. NASA Annual Budgets
6. Constellation Program
7. Exploration program has officially replaced 'manned' program because of the gender bias implied in the name.
8. By comparison, the 2013 European Space Agency (ESA) budget is about $4B. ESA allocates only 9% to the human space flight program and the rest to developing and flying pure science missions. In other words, the NASA and ESA science budgets are roughly equivalent.
9. For example, the cost cap for a recent Explorer Program Announcement of Opportunity (AO) was $30-60M. The Discovery Program missions have a cap of $425M. The last New Frontiers AO (2009) had a cost cap of $650M. This may seem like a lot of money, but compared with $2.5B for MSL, $9B for the James Webb Space Telescope, and $450M per Shuttle Launch (not including the large recurring infrastructure costs to keep JSC and KSC operational), these budgets are not generous.

Tuesday, October 29, 2013

Rattus rattus

From Chapter 2: The Mythical Man-Month

Excerpt
What are the alternatives facing the manager?...The...only alternatives are to trim [the task] formally and carefully, to reschedule, or to watch the task get silently trimmed by hasty design and incomplete testing. (page 24)
Scenographia Systematis Copernicani
(The Copernican System, 1660 engraving)

From A Distant Mirror1

Excerpt
Doctors struggling with the evidence could not break away from the terms of astrology, to which they believed all human physiology was subject. (page 107)



Brooks is recounting the unenviable remedies available to a development manager after a late milestone is missed. Nearly every experienced software development manager has had to negotiate these options.

A couple of things worth noting: Brooks' remedies go from bad to worse and they are not mutually exclusive, but they are exhaustive. So, as a rule, the most damaging effect, "hasty design and incomplete testing," is typically part of the package.

The recent news abounds with a shining existence proof of Brooks' Law at work: Healthcare.gov. As of this writing, the Administration is pulling out all the stops and bringing in the 'best and brightest' to help resolve the problems. But, according to Brooks' Law, there will be months of delays and handwringing, and, in the end, the remedies will be those plainly called out 40 years ago by the good Dr. Brooks.

Bear this in mind: the challenge of building Healthcare.gov pales in comparison to building a high reliability, autonomous, affordable, flight-ground space system. After all, the architectures for web-based data systems, like Healthcare.gov, are well understood. I don't mean to trivialize the intellectual challenge of architecting a large system, but we have several decades of experience building variants of the client/server architecture. By comparison, we do not, as yet, have a viable, much less proven, approach for building an affordable, high reliability, autonomous, flight-ground space system.

If I learned anything during my years managing software efforts, it's just this: when you miss a milestone late in the schedule, it's too late. The fix must happen much earlier in the lifecycle. But what exactly should be fixed? What are the root causes of missing a late milestone?

"Ah..." you say, simple enough? Not at all. Determining a true root cause is not so simple. No matter how urgent the need for a deep understanding, the bias of accepted fact will cloud our observations. Not even when the stakes are high. Not even when the survival of the human race hangs in the balance.

In October of 1347, a Genoese trading vessel pulled into the harbor of the Sicilian port of Messina. The ship was carrying a cargo from the Black Sea. The crew was dying with large black swellings around the armpits and groin; they were dying of the plague. This was the start of the European pandemic called the Black Death.

Victims of the disease suffered terribly. Most died within three to five days of showing the first symptoms. In some cases the sick went to bed healthy and died in their sleep--they were the most fortunate. While the suffering lasted, there was little to ameliorate the agony. The treatments included bloodletting and exotic medicines like powders made from stag horns, gold, pearls and emeralds. None helped.

Prevention was everywhere an urgent priority. Based on millennia of medical know-how going back to Hippocrates (460-377 BCE) and Galen (130-201 CE), medical experts understood that the disease was spread by corrupt air, or miasma. As the pestilence spread, physicians prescribed burning incense, smoking tobacco or carrying posies as ways to purify the air and stave off the disease. But the scourge spread unabated.

As the situation grew more critical, Philip VI,2 King of France, sent an urgent request to the medical faculty at the University of Paris for a report. The University of Paris was the leading academic institution of the day; these men were the best and the brightest. The subsequent report confirmed that the disease was spread by a miasma and, according to the medical theories of the day, identified an astral alignment as the event that triggered the miasma. They were very specific. The miasma was caused by the "conjunction of Saturn, Jupiter and Mars in the 40th degree of Aquarius said to have occurred on March 20, 1345."3 The report from the University of Paris was copied and circulated. It became the accepted scientific explanation across the Christian and Muslim worlds.

But the ordinary person, being a devout Christian, was skeptical of the scientists. Most felt they were suffering from the wrath of God caused by the indulgences of the church and the sins of society. Popular movements sprang up to appease their maker. At first there were penitent processions. When those proved insufficient, there was a scramble to obtain sacred relics, by stealing if needed. Soon, marauding mobs of flagellants traveled from town to town fomenting a hysterical and desperate religious fervor. All this was accompanied by vicious pogroms against Jews, Muslims and any group who might be responsible for the Divine wrath. Still the pestilence spread.

The Black Death would ravage Europe for nearly 50 years. By the end of the 14th century, it is estimated that 40-50 percent of Europe's population died from plague. That wasn't the end of it.

In the winter of 1664 a comet appeared in the heavens portending another disaster. An epidemic broke out in London in the fall of 1664. By the following September, the London death rate was over 7,000 per week. By that time, there had been some scientific advances and physicians had come to believe the disease was transmitted by animals as well as miasma. During the height of the epidemic, the Mayor of London ordered that great bonfires be lit to cleanse the air and that all cats, dogs and pigeons be killed. Killing the cats would prove to be an error that worsened the epidemic.


Plague pandemics continued for another 500 years. In time it was learned that the disease was spread by fleas and rats--Rattus rattus, the common black rat. The microbe was initially spread by flea or rat bite in the form of the bubonic plague. Once a person was infected, the disease would spread as a highly contagious respiratory infection called pneumonic plague.

The actual plague bacterium was finally isolated in 1894.4 The first plague vaccine was tested in 1897. The identification of the bacillus and development of the vaccine were possible only after a series of hard-won, 19th century scientific breakthroughs. Development of the vaccine required identification of the bacillus. Identification of the bacillus was possible because, in 1870, contagious diseases were linked to bacteria.5 Bacteria became a respectable candidate for contagion only after the theory of spontaneous generation was dismissed and Germ Theory became widely accepted as the most likely means of contagion.6 Each breakthrough was resisted by the establishment of the day.

The history of plague prevention seems like a particularly compelling example. Here we have a dire need for a means of prevention. The survival of all that's near and dear is at stake and yet there is a glaring inability to put aside assumptions and examine the observable evidence that might have led to the discovery of the actual means of contagion.7

Shaking off assumptions is fundamentally hard. History is rife with tragic examples like those made by the physicians of the 14th century. It's our nature. We are reassured in the knowledge that we know how to do the right thing. We are inclined to interpret our observations so they fit our accepted views. The result is an undeserved and complacent confidence that fills the mind with a false sense of reality.

So it is with large-scale software projects. I've often heard (and sometimes asserted) the claim that "...if only [thus-and-such] had been done, the project would have succeeded." Frankly, these claims make me uneasy. They ring of naiveté or arrogance. They fail to acknowledge our history of failures. We still see the same botched commitments that Brooks described 40 years ago. In the simplest terms, there's a failure to recognize that the development approach for large software systems is not a settled matter.

The lessons of history suggest that, if we want a different result, we should be aggressive in challenging our assumptions. In other words, if we hope to build large-scale, software-intensive space systems, we need to question our approaches to budgeting, scheduling, requirement collection, design, and testing. No doubt this is a project for a generation or two.

But what the heck? Why not start now?

Here's a list of claims I frequently overheard during my years at NASA. For the most part, they were frustrated utterances of assumed immutable facts of life as a NASA developer. See if they don't start the gears turning about why software projects have faltered for the same reasons they did when Brooks managed the IBM 360 project in the mid 60's.
  • Projects overrun because the NASA business model encourages underbidding and the promise of unrealistic expectations.
  • Sponsors and upper management should not be exposed to development details even when those details drive cost and schedule.
  • Development methods that work on small projects scale up to meet the needs of large projects.
  • Software development and software maintenance are fundamentally the same activity and can be funded and managed the same.
  • Additional process obligations may be levied on teams without impacting cost and schedule because "they have to do that anyway."
  • Processes can be successfully deployed without tool support or field testing.
  • Review success depends on signaling a message of smooth sailing.
  • The ORG chart (and not the system function) drives system architecture.
Each assertion stirs a debate in my mind on the whys and wherefores. I'm quite sure there's a rat in each, but I'm not sure just where. I plan to kick around a few suggestions in subsequent posts.

Closing note:
It was never proved that astral events weren't a cause of contagion. As a measure of modest caution, I checked the internet to see if there was an upcoming conjunction of Jupiter, Saturn and Mars in Aquarius anytime soon. I'm glad to report there are only two conjunctions of Jupiter and Saturn in the next sixty years and neither is in Aquarius. I suppose that's one less thing to worry about.

1. Tuchman, B.W., "A Distant Mirror, The Calamitous 14th Century." Alfred A. Knopf. 1978.
2. Philip VI was known as the Fortunate. First of the House of Valois. Reigned 1328-1350.
3. Tuchman, p. 107
4. This discovery occurred during an outbreak in Hong Kong that killed 50-100 thousand people. The plague bacillus, Yersinia pestis, was identified separately by two scientists, Alexandre Yersin and Kitasato Shibasaburō. The discoveries happened within days of each other. There has been a long-running controversy because Kitasato was slow to receive the credit he deserved.
5. Robert Koch was the first to link a microbe (anthrax) with a contagious disease and proved the germ theory.
6. Spontaneous generation postulated that some life, like fleas, could arise from inanimate matter. The theory was eventually disproved by Louis Pasteur with his "swan neck flasks" experiment.
7. On occasion, the actual cause was reported, but mostly ignored. For example, in 1498, a renowned physician reported the disease was "communicated by means of air breathed out and in." (Tuchman, p.106)

Friday, October 18, 2013

The pinch of Trimble's boots

A digression triggered by "Risk Avoidance and other Eggistential Consequences"


If you read the "About me" page, you know that I'm retired. I may not be all that astute, but I couldn't help but notice I'd become older than most of my colleagues. Not just that, but the baton of authority was getting passed to some of those I used to manage. That was a really big hint. I knew that if I wanted to try something different, the time had come to be alert for the next junction on the trail.

Perhaps that is why I'm drawn to the episode between Isaac Trimble and Richard Ewell at Gettysburg.
Isaac R. Trimble
Major General Isaac Trimble (1802-1888)
One of the oldest Rebel officers, he was known for his lack of tact...
Isaac Trimble was a relatively minor figure in the Civil War. He was 61 when the armies met at Gettysburg. There were older generals in the Confederacy1, but he was older than any of the corps commanders or, for that matter, General Lee. He was 15 years older than Richard Ewell when they had their famous chat north of town near the Adams County Alms House.
Trimble graduated from West Point in 1822. He was 17th of 42. Only one other of his classmates joined the Confederacy. He had served in the US Army for 10 years before retiring in 1832. He eventually moved to Maryland and was the construction engineer for several railroad lines. By the time the War started, he was Superintendent of several east coast railroads. No doubt he was accustomed to exercising his will over his younger subordinates.

Trimble joined the Confederate Army in May of 1861. Later that summer, he was promoted to Brigadier General. The following summer, he campaigned with Stonewall Jackson as part of Ewell's Division. During this period, Trimble and Ewell must have become well acquainted. Both men served with distinction.2 In August, during the 2nd battle of Bull Run, both Trimble and Ewell were seriously wounded. Ewell lost a leg. Trimble would be incapacitated for nine months.

Richard S Ewell
Lieutenant General
Richard Ewell (1817-1872)
Richard Ewell graduated from West Point in 1840 (13th of 41).3 By most accounts, Ewell was odd-looking, nervous and eccentric. He believed he was afflicted by a mysterious disease and lived off a steady diet of frumenty. He had a high-pitched voice, a quick wit and a reputation for exceptionally profane language.

He joined the Confederate Army in 1861 a few weeks after the attack on Fort Sumter. Nine months later Stonewall Jackson appointed Ewell division commander and promoted him to Major General. He performed well under Jackson until he was wounded at 2nd Bull Run.

Ewell's recuperation took nine months. He returned to duty under Jackson in May '63. When Jackson died of wounds received at Chancellorsville, Lee promoted Ewell to Lieutenant General with command of the 2nd Corps.4 Until this point, Ewell had proved his ability to carry out orders, but he'd never exercised independent field command.

Meanwhile, Trimble had been promoted to Major General in January 1863. However, he was also slow to recover and his division was assigned to another Major General.5 In June, Trimble felt he was fit for duty and, despite the lack of orders, rode to join Lee as he marched towards Gettysburg.

Within days of arriving, Trimble's famous lack of tact caused a disruption in Lee's headquarters. Lee had crossed the Potomac; distractions could not be tolerated. Lee promptly ordered Trimble to join Ewell, his old commander, as "supernumerary," i.e. an officer without a command or any official role. He arrived in Ewell's camp on June 30th. The Battle of Gettysburg would start 2 days later. On the first day of the battle, he roamed the battlefield free to form his own ideas of how the battle should be fought. When he saw the heights south of town were unoccupied, Trimble galloped to Ewell to urge the Lieutenant General to issue the order to advance immediately (see the "Ewell at Gettysburg" section of my previous post). But Trimble's advice was ignored and Ewell missed his historical moment.

Surely Trimble must have fixed on that moment. It must have been clear as the day is long that Lee might have won the battle if only Ewell had pursued the obvious course. Why did the leadership select an unproven man like Ewell for independent command? Why not officers with more experience who would listen and lead?

Trimble would have known the difference between effective and ineffective leadership. He was 15 years Ewell's senior. He had managed the building of railroads. He was accustomed to exercising independent judgment. He was a leader in his own right. How frustrating it must have been to see history turn on a decision Trimble knew, with certainty, to be wrong.

In the end, it was better for the country that Ewell's judgment failed him, but when I read about the Trimble/Ewell incident, I am reminded of shockingly poor decisions made by younger colleagues who had obtained positions of authority. In the heat of schedule and deadline pressure, there is little more frustrating than a decision based on inexperience, faulty reasoning and flawed judgment. Even now I wince when I think of the wasted development funds. That's why avoidable, costly errors like the one made by Ewell are evocative.

Of course, we learn by mistakes. I certainly made decisions that caused dismay among my elders. Each new generation must make its own mistakes, and each passing generation must feel the pinch of avoidable misfortune due to lost wisdom. Or do they?

The NASA workforce is aging; a generational change is underway. The generation that developed Voyager, Mariner, Galileo and Cassini are retiring. But before heading out the door, they established a library of mandatory rules, practices, processes and guidelines that were intended to avoid the mistakes of the past. There are now hundreds of volumes and thousands of pages of requirements for spacecraft developers.

While the development of these documents was well intentioned, the result is a textbook case of unintended consequences. Here are a few that come to mind:
  • The volumes of rules mandate a lot of unnecessary work. Developers are now overwhelmed with process minutiae. An effective manager needs to be a "process lawyer" in order to protect a team from distractions.
  • A compliance industry has evolved to ensure process compliance.
  • Technical authority was passed from those who do the technical work to those who are expert in the rules and oversee compliance.
  • Rule compliance now consumes a significant part of every development budget.
  • The next generation is being trained in a culture of compliance that has encoded current practice and stymies innovation.
In effect, engineering judgment is reduced to simple checklist compliance.

This yoke of compliance does more harm than good. Rules cannot replace the complex analysis and decision process that works best when experience informs intuition. Imagine if the army had established rules for Ewell that would have prevented his lapse of judgment by replacing judgment with written procedure. Does anyone think that would be a better way to run a command? Is there a more likely recipe for disaster?

When taken to the logical extreme, we're faced with the horns of a dilemma. On one hand we are destined to repeat the errors of the previous generation. On the other, we're destined to sacrifice creativity and innovation to rigid compliance.

Clearly prudence is needed. Currently the scale is tipped toward rigidity. That seems natural given the Agency's aging workforce. What's alarming is that many of the next generation have embraced the culture of compliance. I say, "let the next generation make mistakes, even costly ones," so long as those are not the mistakes of compliance. That is the path for genuine progress.

Denouement
Two days after the exchange between Trimble and Ewell, Trimble commanded a division in Pickett's charge. Trimble was seriously wounded. Since he was too weak to travel back across the Potomac with Lee, Trimble was left in Gettysburg and tended to by a Union family. When he recovered, Trimble was transferred to a Union prison for the rest of the war.

Ewell continued to command the 2nd Corps until the following May at the Battle of Spotsylvania, where Ewell had a famous encounter with Lee. The battle was going badly for the Confederates. Lee happened upon Ewell when the corps commander had lost control of his troops and was shouting profanities to restore order. Lee rode up to Ewell and famously said, "How can you expect to control these men when you have lost control of yourself?"

Shortly after, Ewell fell from his horse and became too incapacitated to command the 2nd Corps. He was then relieved of his command and ordered to arrange the defenses around Richmond. That is where he served out the war.

1. There were several older Confederate generals, including David Emanuel Twiggs (71), William Smith (67) and Samuel Cooper (66). These men were a hardy lot.
2. However, Stonewall was to disparage Trimble: "I do not regard him as a good disciplinarian." It should be noted that Jackson was fanatical about both discipline and religion.
3. His classmates included William T. Sherman and George Thomas; both played significant roles in the Union victories in the western campaigns. Eight of Ewell's other classmates served in the Confederate army, but none were prominent.
4. The hierarchy of high command derives from Cromwell's "New Model Army," which was organized during the English Civil Wars. A 'Captain-General' held the top spot. The second rank was called 'Lieutenant-General.' The third rank was called 'Sergeant-Major General.' Over the years, 'Sergeant' was dropped and the rank was simply called 'Major General.' Hence a Lieutenant General outranks a Major General.
5. Major General Allegheny Johnson.

Thursday, October 17, 2013

A textbook case

Perhaps you've seen the news about problems with the roll out of Healthcare.gov, the government's Affordable Care Act (ACA) website.

Developing software in a politically charged environment often leads to a pathological gap between actual progress and plan. This must be especially troubling when critics are conflating software development problems that have plagued the profession for 40 years with problems with the ACA itself.

The work is being led by a Canadian company with experience implementing a healthcare website for the Canadian system.1 The article includes hints at the lethal combo of serious requirement creep and inflexible delivery date. What's a development manager to do with a company's reputation riding on the line?

Today the following showed up in an article published by Reuters. (see As Obamacare tech woes mounted, contractor payments soared)
How and why the system failed, and how long it will take to fix, remains unclear. But evidence of a last-minute surge in spending suggests the needs of the project were growing well beyond the initial expectations of the contractor and the U.S. Department of Health and Human Services. (emphasis added)

It's a textbook case. I can't help but wonder if anyone on the team was waving "The Mythical Man-Month" in the air in a plea for sanity.

1. Apparently the company missed deadlines on the Canadian project and the contract was cancelled. (see Meet CGI Federal, the company behind the botched launch of HealthCare.gov)

Wednesday, October 9, 2013

Risk Avoidance and other Eggistential Consequences

From Chapter 2: The Mythical Man-Month 

Omelet served at Sunrise High Sierra Camp in the summer of 2007

Excerpt
An omelet, promised in two minutes, may appear to be progressing nicely. But when it has not set in two minutes, the [Antoine's] customer has two choices—
wait or eat it raw. (Page 21)



From "The Last Lion, Winston Spencer Churchill, Visions of Glory, 1874-1932"1

Excerpt
"We simply couldn't have failed...and because we didn't try, another million lives were thrown away and the war went on for another three years." (Commodore Roger Keyes, British Royal Navy. Page 542)

From "Lee's Lieutenants, A Study in Command, Volume III"2

Excerpt
Our Corps Commander was simply waiting for orders when every moment of time could not have been balanced with gold.
(Captain James Power Smith, CSA, reporting on a meeting of Confederate Generals. About 4:00p, Day 1. Battle of Gettysburg, Page 94)
Brooks' observation that an omelet takes time to set implies a challenge question: Why would the omelet still be raw when the chef promised it would be done? Brooks might explain with a quote from Yogi Berra, "It's tough to make predictions, especially about the future."

Brooks suggests a remedy: provide managers with better schedule estimation tools. The hope is that a calculus of schedule estimation would produce accurate schedules just like an egg timer produces 'just right' eggs. With better estimation tools, managers could "stiffen their backbones," defend realistic schedules and dispense with those "patron-appeasing," "gutless estimations" that lead to disasters.

I'm skeptical.3 By way of a brief refresher, NASA is a very large, hardware-centric, government-sponsored bureaucracy responsible for high-profile demonstrations of an administration's commitment to tomorrow. The more space drama, the better. That changes the rules. So long as the government forks over funds, there is no real schedule disaster. The rules for projects living off government funding are quite different from those for projects that depend on profit.

With that in mind, here are a few alternative suggestions for why the omelet would be raw at eating time:
  • The customer expected Antoine's to provide the same service as McDonald's because he considers the omelets equivalent.
  • The chef spent the majority of the allotted cooking time asking all his colleagues if they approve of his cooking oil.
  • Obtaining collegial consensus about cooking oil always takes longer than anticipated.
  • The chef is willing to serve a raw egg because he knows that if the omelet is sent back, he'll have plenty of time to set things straight later.
  • The chef knows the customer simply craves the moral superiority that comes from eating at Antoine's. The omelet could be slimy as a Louisiana lagoon or as chewy as a tire. Doesn't matter; why worry?
In each instance, the outcome depends on how the chef spends his time. So it is with software projects.

While I worked at NASA, we were always on a tight schedule. Nevertheless, we spent huge blocks of budget and schedule on content-starved reviews that necessitated plenty of handwringing but seldom resulted in a decision of consequence. I can't recall an instance where a bold decision was made to grab an opportunity. So while we sat around conference tables listening to monotonous PowerPoint talks, grand displays of reviewer intellect and solemn rehearsals of managerial discretion, opportunities sped past us like stripes on the interstate. It was apostasy to suggest we were wasting our resources and that we could have done better.

I have a (somewhat perverse) interest in lost opportunities that changed the course of history. The catalog of military history is rife with these stories. A couple of mostly-forgotten examples come to mind: the British catastrophe at Gallipoli and a lost opportunity at Gettysburg.

Making of Gallipoli
The failure of British leadership at Gallipoli unfolded with the inevitability of a Shakespearean tragedy.4

On the Ides of March, 1915, Rear Admiral John De Robeck assumed command of the Allied Fleet in the Mediterranean. His assignment was to force the Dardanelles, cross the Sea of Marmara and take Istanbul. If successful, the Turkish regime would collapse and expose the Kaiser's armies to a devastating attack that would end the war.

On March 18, De Robeck ordered the attack. The fleet sailed unopposed into the Dardanelles and proceeded to decimate Turkish defenses. Just as the ships were entering the Sea of Marmara, the French battleship Bouvet hit a mine. Moments later three more battleships hit mines. Fearing that the Turks were floating mines towards his fleet, De Robeck ordered a general retreat.

The Turks had done no such thing. De Robeck's ships had merely strayed too close to the south shore and stumbled into a haphazardly prepared minefield. In fact, the Turks were beaten. The government was preparing to abandon Istanbul; the country's gold and art treasures had already been removed.

But the British were cautious. The top brass decided that the naval effort was inadequate and needed to be supplemented by an infantry assault on the Gallipoli peninsula. What followed was an ill-fated invasion marked by a failure of British leadership that Liddell Hart described as "a chain of errors in execution almost unrivalled in British history."

General Ian Hamilton was given command of an Allied expeditionary force ordered to capture Gallipoli. He sailed ahead of the troops to begin planning with De Robeck. Shortly after arriving, Hamilton conducted a cursory survey of the Gallipolian terrain and concluded that three weeks would be needed to prepare for the assault. One delay led to another. The British would not land until April 25th, a 5-week delay. In that interval, the Turks, with help from the Germans, prepared a strong defense. The subsequent British-led invasion lasted nearly 8 months. Allied casualties amounted to nearly 200,000 with over 50,000 killed in action. Turk casualties were comparable.

If only De Robeck had ordered a follow-up attack on March 19th, there would have been no debacle at Gallipoli and a German surrender might have come in short order. Instead, the war would continue for another 3 years.
Ewell at Gettysburg
The Battle of Gettysburg was the turning point in the Civil War. The Confederates might have won Gettysburg if Lieutenant General Richard Ewell5, the Commander of the 2nd Corps, had taken the initiative.
Gettysburg Battle Map Day1
On July 1, 1863, the Union and Confederate Armies collided at Gettysburg. About 4:00 pm, Ewell's Corps drove the Federals south through the town onto Cemetery Ridge. The Yankees were vulnerable as they scrambled to consolidate their hold on those heights. A moment of victory was at hand.

A brigade commander, Brigadier General John Gordon, saw there was a magnificent opportunity to rout the Yankees. Gordon approached Ewell to obtain the order to charge the enemy. No orders were forthcoming. Ewell was frozen. "Inwardly, something had happened to the will of Richard Ewell."6

A short time later, Major General Isaac Trimble rode up. Trimble recognized that the Federals had not yet occupied the heights at either Culp's Hill or Cemetery Hill. Trimble was one of the oldest Rebel officers and was known for his lack of tact. According to tradition, Trimble said to Ewell, "We've had a grand success; are you not going to follow it up and push our advantage?"7 Ewell replied that he needed further orders from General Lee. Trimble persisted, "...give me a Brigade and I will engage to take that hill." Ewell responded with silence. In an act of insubordination Trimble then said, "give me a good regiment and I will do it." When Ewell still refused to answer, Trimble gave up.

By the next day the Federals had dug in on both Cemetery Hill and Culp's Hill. The Confederates would stage a day-long series of intense attacks, but they never secured those heights. Ewell's failure to act had allowed the Yankees to secure Cemetery Ridge where, two days later, they would beat back Pickett's charge and defeat Lee's Army.
A senior manager once told me, "We are a risk averse institution. We do not break the rules." Another senior manager repeatedly scolded me that "Things happen slowly here. We change by evolution, not revolution." From a bureaucratic perspective, this advice was impeccable--after all "Failure is not an option."

Being a reader of history, I was painfully aware of the implications of indecisive or overly cautious leadership. While the excessive caution that characterizes NASA management does not cost tens of thousands of lives, it does result in the waste of tens, if not hundreds, of millions of dollars. If "commercial" contractors like SpaceX or Orbital have an advantage over NASA, it's simply that they are willing to take reasonable risks.

Excessive caution was the bane of most of my forward-thinking colleagues who saw the few available opportunities awarded to tasks that were the innovative equivalent of putting new paint on an old car. But that's not the worst of it. I believe the true concern was best captured by a frustrated system engineer with the following epigram: "The trouble with evolutionary change is that it relies on extinction."

1. Manchester, W., "The Last Lion, Winston Spencer Churchill, Visions of Glory, 1874-1932." Laurel Trade Paperback. 1983.
2. Freeman, D.S., "Lee's Lieutenants, A Study in Command, Volume III". Scribner and Sons. 1944.
3. See my previous post, In the beginning... there is the estimate.
4. Churchill's role in the Dardanelles has stirred plenty of controversy among historians. Manchester lays the disaster at the feet of De Robeck and Hamilton. On the other hand, Correlli Barnett in Engage the Enemy More Closely: The Royal Navy in the Second World War (Norton & Co. 1991), lays most of the responsibility on Churchill and points to a similar 1940 disaster in Norway. Both accounts are extremely well researched and convincing.
5. Ewell had replaced Stonewall Jackson who died from wounds received at the Battle of Chancellorsville in May.
6. Freeman. p.92.
7. Freeman. p.94.

Monday, September 30, 2013

In the service of ambition

From Chapter 2: The Mythical Man-Month

Excerpt
...delay at this point has unusually severe financial, as well as psychological, repercussions. The project is fully staffed, and cost-per-day is maximum...Indeed, these secondary costs may far outweigh all others. (page 20)

From The Campaigns of Napoleon.1

Excerpt
[Napoleon's] inhuman demands on his own followers during desert marches...reveal his lack of concern for his men. To Bonaparte armies were merely instruments for his use.(p. 248)

Brooks is burnishing his argument on the necessity of ensuring sufficient schedule for system test. The repercussions--in terms of human cost--cannot be overstated. There can be devastating consequences on a project team.

A mission is a campaign. A team can be unwittingly led to tremendous self-sacrifice necessitated by a management whose aspirations exceed its resources. And as in war, the loyal are often sacrificed. I'll strain that comparison for the sake of illustration.

History is awash with examples of military campaigns where the armies suffered terribly in the service of an ambitious leader: the Athenian army under Nicias in Syracuse, the Roman army in Persia under Crassus, the 4th Crusade, the British army in the Virginia wilderness under Braddock, the French army under LeClerc in Saint-Domingue (Haiti), and both the German and Napoleonic adventures into Russia, to name just a few. However, Napoleon's Syrian campaign is an especially poignant example of suffering in the service of ambition--perhaps because it's a tragedy that is relatively obscure.

In July of 1798, a Napoleon-led French expedition of nearly 40,000 men landed in Egypt. Napoleon had promoted the expedition. "We must go to the Orient; all great glory has been acquired there."2 This allusion to glory is intended to invoke Alexander the Great. Interestingly, the Egyptian adventure included some 500 artists and scientists who were to open the field of Egyptology.3

From earliest stages of the expedition, the French army was dispirited. The campaign opened with a 72-hour, waterless march through the desert that was called a 'living hell.' The army was near mutiny and one of Napoleon's generals committed suicide. Napoleon exercised his relentless will and, within a month, Egypt was in French hands. Success seemed assured.

However, by October the tide had turned. A British naval blockade had isolated the expedition. Back home, the Second Coalition was clobbering French armies on the continent and the Directory had lost interest in the Egyptian campaign. Again, the army was demoralized. Senior officers started to resign and return to France. Most pressing of all, the Turkish Sultan had issued a firman declaring holy war on the French.

Napoleon was never one to sit around and wait to be overtaken by events. He would defeat the Sultan and establish French dominance in the region. Preparations began for an invasion of Syria. The stage was set for one of the crueler adventures of Napoleon's career.

On February 6th, 1799, the vanguard of a 13,000-man French army headed off into the Arabian desert. Napoleon always pushed his armies hard; speed was his best tactical tool. The plan was for the army to cover 15 miles per day and reach Gaza in 8 days. Due to unexpected Turkish resistance and harassment from the British Navy, the provision-strapped French army suffered through a month-long trudge through the desert before encamping outside of Jaffa. After a short siege, the Turkish garrison in Jaffa surrendered on a French promise of clemency. Once the Turks were in custody, Napoleon ignored the promise of clemency and had the entire Turkish garrison of 3,000, along with 1,400 additional prisoners, executed. In his memoir, Napoleon explained that the executions were necessary because of scarce provisions.

As if by cosmic justice, a major outbreak of the plague afflicted the French. Once again French morale tanked. To rally the troops, Napoleon made a risky (and later self-celebrated) visit to the afflicted in the Jaffa Pestiferies. He even helped carry a diseased corpse at tremendous personal risk--a gesture that is said to have inspired the army to march on.

Détail Bonaparte visitant les pestiférés

The army next advanced to Acre. Acre was considered the "key to Palestine"--its capture was central to Napoleon's strategy. On the Ides of March, the army opened siege. This time, with the help of a timely arrival by the British navy, the Turks foiled the French efforts.

By early May, Napoleon realized the chance of success was rapidly slipping away. He ordered a series of desperate but failed attacks. When the Turks began to receive reinforcements, Napoleon knew he was beaten. A retreat was ordered. The worst miseries of the campaign were to follow.

Meanwhile, Napoleon issued a series of proclamations for the Egyptian and French audiences that described his great victories in Syria. The retreat to Egypt, he explained, was due to the upcoming summer season.

Napoleon's retreat from Acre was saddled with daunting problems. It was a disaster that foreshadowed his retreat from Moscow in 1812, but that was 13 years away. One seemingly insoluble problem was the need to transport 2,300 sick and wounded back to Egypt. Napoleon proposed poisoning the hopeless cases. He was initially dissuaded, but after the Turks mounted an aggressive pursuit, he ordered the mercy killing of those French who were too incapacitated to walk back.

Napoleon y sus Generales en Egipto

The expedition ended on June 3rd, when the demoralized remnants of the French army straggled into Katia, Egypt. When all was said and done, the year of campaigning had cost the French army 1/3 of its men.

Nonetheless, Napoleon staged a triumphal march in Cairo on June 14. Two months later, Napoleon returned to France as the self-declared "Savior of France" and became the 1st Consul of the French Government. Thus began the greatest phase of Napoleon's career. Meanwhile, a dispirited French army was forced to remain in Egypt until the Treaty of Amiens was signed in 1802.

In summary, the Egyptian campaign was not quite up to Caesar's "Veni, vidi, vici." More like: I saw glory, I overcommitted, I declared success.4

Overcommitment among software development teams is so common that it has a name: Death March. Every experienced programmer has lived through one. Most pledge, unsuccessfully, never to do 'that' again, but few have the will to resist the drive to succeed. I saw the regrettable repercussions on the mission teams: divorce, heart attack, even suicide. It's a serious business. In recent years, the strains on NASA budgets have made the death march the norm and not the exception.

There are times when self-sacrifice is warranted. Our great country has thrived because many have answered the call to protect and serve. We owe a profound debt to those men and women. But the romantic calling to build a new vehicle for a return to the Moon or a robotic spacecraft to crawl about Mars is not war, and personal sacrifice seems an unwarranted antidote to an ill-conceived NASA project.

Great achievements require great leadership and a great leader's requisite ambition. Napoleon was a great genius, but he equated his personal ambitions with the good of all. That is the road to malevolence. Few leaders have both the requisite wisdom to maintain that separation and the luck to survive. Perhaps it's too tall an order, but we can at least hope that the ambitions of the leaders in the American Space program would be tempered by a concern for the people in the trenches.


1. Chandler, D. "The Campaigns of Napoleon. The Mind and Method of History's Greatest Soldier." Scribner. 1966.
2. Chandler, p.209.
3. Among other things, the expedition returned with the Rosetta Stone.
4. According to Google translator, the Latin would be: Vidi gloriam meam super commisit ego, successu. Suggested improvements are welcome.



Tuesday, September 24, 2013

Must it be?

Continued from Riding the bow wave.
Shamelessly altered excerpt
The test effort usually exposes important gaps... Why? ... 'interface management.'
From the score of Beethoven's String Quartet No. 16 in F major (Op. 135)1
Marking for the first chords of the last movement of Beethoven's String Quartet #16 in F major (Op 135)
Beethoven opens the last movement of his last significant compositional effort with the question: Must it be? He concludes: It must be, but should we?

Thanks to a timely mishap on the latest International Space Station (ISS) resupply mission, I'm in an existential frame of mind.

In recent postings I've been ruminating about the ad hoc nature of testing. Thanks to an ex-colleague, who is apparently omniscient about all things NASA, I learned about this timely existence proof of test fallibility. He sent me packing to the Orbital Sciences Corporation web page for reports on the status of the Cygnus rendezvous with the Space Station. Here's the link: Antares/Cygnus Updates.

Cygnus is a cargo carrier; Antares is the launch vehicle. The Cygnus/Antares combo is the latest "commercial" solution for ISS resupply to come down the pipe--the SpaceX offerings, Dragon and Falcon, were first. The Orbital system is comparable to SpaceX, but Orbital lacks a gifted self-promoter and is less well known.

Orbital and SpaceX are the two surviving contenders in NASA's Commercial Orbital Transportation Services (COTS) Program. Both Orbital and SpaceX have received substantial funds from the Agency. Prior to 2011, Orbital received $260M and SpaceX received $376M from NASA. The combined COTS budget for FY12/13 was around a $billion (FY12=$400M, FY13=$525M), which I'm sure was judiciously split between Orbital and SpaceX.2 If you add it up, that's in the vicinity of $1B apiece. (But not quite even-steven.) If this is 'commercial', what would you call 'government'? I digress.

Orbital's initial launch of Antares/Cygnus was on September 19. Cygnus was scheduled to rendezvous with the space station on September 22. Here are a few telling excerpts from the Orbital status page.
September 21, 2013
As of mid-day today, Cygnus continues to perform...remaining on track for its rendezvous...tomorrow morning.
7:45 a.m. Sunday, September 22, 2013
Following the discovery of a data format discrepancy...today's rendezvous with the station was postponed...A software update has been developed and will be tested on a ground-based simulator during the day on Sunday.
September 22, 2013
This morning, at around 1:30 a.m. EDT, Cygnus...found that some of the data received had values that it did not expect, causing Cygnus to reject the data. This mandated an interruption of the approach sequence.
Monday, September 23, 2013
...Orbital and NASA together decided to postpone the approach...Over the past 24 hours, the Orbital team developed and tested a software fix for the data format mismatch...This new schedule will allow the Orbital operations team to carefully plan and be well-rested before restarting the critical final approach.
I have no direct knowledge of the circumstances that led to the defects in Cygnus data processing; however, I do know a bit and can surmise the following:
  • The discovery of the data problem was well past the 11th hour.
  • The Cygnus/Antares systems underwent thousands of hours of testing prior to launch. It's likely the wrong test data was used.
  • The ISS ICD has been in place for decades and successfully used by both American and Russian vehicles.
  • The ISS telemetry data format does not conform to the most commonly used international standard.4
  • The Orbital engineering team is most surely competent and staffed with talented engineers.
  • The energy expended in handwringing at Orbital Ops must have been enough to power a village in Virginia. Everyone is exhausted.
  • The Orbital management has reported the results accurately but with a tone that conveys there has been no departure from the routine.
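
The second point, about wrong test data, can be illustrated with a toy sketch. Everything in it is hypothetical: the counter field, its range and the sample values are invented for illustration and say nothing about the actual Cygnus or ISS data formats. It only shows how a receiver can pass every ground test and still reject live data when the test data was built from the receiver's own assumptions.

```python
# Hypothetical receiver-side check; the field and its range are invented.
COUNTER_MAX = 1023  # assumption: the receiver expects a counter that wraps at 1024


def accept(counter: int) -> bool:
    """Accept a reading only if it falls in the range the receiver expects."""
    return 0 <= counter <= COUNTER_MAX


# Test data written on the receiver's side shares the receiver's assumption.
ground_test_values = [0, 512, 1023]
# Data from the other side of the interface follows a different convention.
live_values = [2000, 2001]

print(all(accept(v) for v in ground_test_values))  # ground tests all pass
print([accept(v) for v in live_values])            # live readings are rejected
```

Thousands of hours of testing against `ground_test_values` would never expose the mismatch; only data produced by the other party to the interface would.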

Why do these things happen?

Last night, I had dinner with another buddy from my NASA days. We discussed the Cygnus data problems and he pointed out that "they had problems building the pyramids. Why should things be different?" He is right, of course; nothing unusual here. No engineering effort will be free of problems.

But, should we simply accept the status quo of test practice as the inevitable? Is testing finally and ultimately doomed to be an ad hoc, incomplete, finger-crossing process? Should we simply shrug and accept it must be this way?

I hope not.

How can we progress to the next phase of building vastly more complex systems if there is no progress in the way we approach our objectives? Especially testing! What if we were still confined to the building techniques used by the Egyptians? There would be no arch, no concrete, no flying buttresses, no use of glass, iron and concrete. Without those innovations there would have been no Roman aqueducts, Colosseum or Pantheon. No Hagia Sophia. No Notre Dame. No Crystal Palace. No Chrysler Building. No Astrodome.

It's not that we should expect the end of problems, but that we should rue complacency. In many respects the Cygnus problem is a routine matter. But, it is the very fact that it is considered routine that causes dismay. Must it be? Let's hope it's not.


1. If you want to hear the chords, jump to 16:45 in the recording. But, think twice! This is one of Beethoven's masterpieces.
2. 2012 funding and pre-2012 funding come from the NASA FY13 budget request. The 2013 budget was reported in the Space Politics Blog.
3. These links tend to be broken.
4. “The good thing about standards is that there are so many to choose from.” Andrew Tanenbaum

Saturday, September 21, 2013

Riding the bow wave

From Chapter 2: The Mythical Man-Month

Excerpt
Failure to allow enough time for system test, in particular, is peculiarly disastrous. Since the delay comes at the end of the schedule, no one is aware of schedule trouble until almost the delivery date. (page 20)

From "Acquisition Archetypes: The Bow Wave Effect"1 

Excerpt
Making Wakes - geograph.org.uk - 1485246
"We don't compromise on schedule delivery date... we just kept dropping functionality... A growing mass of work had to be done at the end."

Brooks will repeatedly admonish us to allocate 50% of the schedule to test.2 I'm assuming he has the waterfall model in mind. If so, there is more to the test phase than just testing; the waterfall test phase includes the integration of separately-developed, 'finished' pieces. Typically these are big pieces. For a space system this integration will include the ground system command development infrastructure with the on-board command infrastructure, the command system with the telemetry system, the downlinked telemetry with prediction analysis tools, the onboard fault protection software with the power, attitude control, thermal and propulsion software...and that's just a sample. In the waterfall model, if all the previous steps were done correctly, everything should fit together like so many Lego blocks. That's the theory--a theory that seldom, if EVER, works. Only the most unwitting manager would insist that software integration is simply a matter of testing properly implemented pieces. (Unlike sasquatch, they exist and manage NASA budgets!)

In practice the test effort usually exposes important gaps in development. Why? The two principal culprits: 'the bow wave' and 'interface management.'

The bow wave
A development bow wave is the unfinished work that has been pushed to the end and bites you from behind at the end of a development cycle.

Bow waves build because development is the process of discovery. There are always unanticipated problems: A broken library, a compiler bug, a design mismatch, an incorrect requirement, an error condition that emerges from the blue... The possibilities are endless.

These discoveries happen in the context of programmatic commitments made in the form of budget and schedule. The definition of a good team (i.e. one that functions without disruptive 'help' from management) is one that meets its commitments. Once you add in procrustean schedule notions like EVM, you have a bow-wave-generating recipe for delaying or modifying features for the sake of sunny-day progress reports. Meanwhile, the bow wave builds as the unanticipated collides with the commitment. It resembles a Ponzi scheme predicated on schedule slips or subsequent releases. Just like the Ponzi scheme, the piper must be paid and payment inevitably comes during 'Test.' Consequently, what was 90% complete prior to test suddenly becomes 50% complete, and that hefty 'test' schedule becomes synonymous with good planning, albeit at an exorbitant price.
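
The arithmetic behind that drop from 90% to 50% can be sketched in a few lines. This is a toy model, not a description of EVM or of any real project: the function name and the unit counts are made up for illustration. It assumes progress is reported against the originally planned work, while deferred and newly discovered work stays off the books until test.

```python
def percent_complete(done_units: float, total_units: float) -> float:
    """Return completed work as a percentage of the total work in scope."""
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    return 100.0 * done_units / total_units


# Hypothetical numbers, chosen only to reproduce the 90% -> 50% example.
planned = 100.0   # units of work in the original plan
done = 90.0       # units reported finished going into test
bow_wave = 80.0   # deferred features plus work discovered along the way

reported = percent_complete(done, planned)             # what the status report says
actual = percent_complete(done, planned + bow_wave)    # what test reveals

print(f"reported: {reported:.0f}%  actual: {actual:.0f}%")
```

Nothing changed in the amount of finished work; only the denominator was finally counted honestly.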

Interface management
A large system, like a space system, is built by a dozen-or-two teams each working on separate functional pieces that must be integrated to work as a whole.3 Just how these pieces will fit together is typically captured in a document called an Interface Control Document (ICD) that describes the inputs and outputs of each piece. The ICD may also describe the correct steps for functional pieces to interact (often called a protocol).

As a rule, ICDs were seldom accurate and interfaces were tough to manage. Here are a few of the reasons why:
  • Interface change is fundamentally a serial activity. While an intended change may be documented, the actual implementation may be different. Consequently, the documentation may not reflect the actual artifact.
  • Implementers are busy. In the rush to a deadline, a developer will be hard pressed to record each change.
  • Auto documentation by document generators is no panacea. Distribution needs to be timely. If developers from other teams implement immature designs, the code may be whipsawed and the stability of the build will be jeopardized.
  • Over time, a software system will develop undocumented interfaces that make the system brittle. An interface change will break the system in unanticipated ways that only show up during test.
  • Interface and protocol documentation methods are squishy, if not misleading, and easily misinterpreted. Rigorous definitions, like those provided through formal methods, require a specialist with an advanced degree, narrowly focused skills and plenty of time. Except for the you're-lucky-to-get, all-star programmer, something better is needed for people doing the real work.
  • Bureaucratic organizations guard their interfaces--with good reason. To lose ownership is to lose funding; an interface change may cause a functional piece to slip from the organizational grasp. Consequently, inter-organizational interface changes typically require painful committee work among distrustful managers and a grievously painful negotiation. Meanwhile, the necessities of project requirements, budget and schedule will drive developers to build solutions that then plant the seeds of system ossification.
  • Interface improvements will be vetoed if they weaken an organization's role.
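
The gap between a documented interface and the actual artifact is the one item on this list that lends itself to a mechanical check. The sketch below is a hypothetical illustration, not a real ICD format or an established NASA practice: the field names, the types and the `find_drift` helper are all invented. It assumes the ICD is kept in a machine-readable form that can be compared against a message as the implementation actually produces it.

```python
from typing import Any

# Hypothetical ICD entry: field name -> expected Python type.
ICD_TELEMETRY = {"timestamp": int, "bus_voltage": float, "mode": str}


def find_drift(icd: dict[str, type], message: dict[str, Any]) -> list[str]:
    """List the ways a message departs from its documented interface."""
    problems = []
    for name, expected in icd.items():
        if name not in message:
            problems.append(f"missing field: {name}")
        elif not isinstance(message[name], expected):
            actual = type(message[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    for name in message:
        if name not in icd:
            problems.append(f"undocumented field: {name}")
    return problems


# A message as the implementation actually sends it (also made up).
sent = {"timestamp": 1379826000, "bus_voltage": "28.1", "heater_on": True}
for problem in find_drift(ICD_TELEMETRY, sent):
    print(problem)
```

A check like this catches only format drift. It does nothing for protocol misunderstandings or for the organizational items at the end of the list.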

The result: problems with interface management usually swell the bow wave during the so-called test phase. Voilà, developers will be found busily 'fixing' problems late in the cycle by fixing interfaces and building new features. Test is hardly just testing.

Experienced software engineers know how to cope with interface change and the bow wave. Most who build software for a living figure this is just how things are done and ride the wave. There are, however, techniques like early integration and iterative development that are in common practice and go a long way towards mitigating schedule and budget risk. Call me a curmudgeon, but I don't believe these techniques scale to large system developments that cross organizational boundaries. The problem is not just software, it's software in the context of large organizations with parochial interests. I don't believe we know how to manage that.


1. "Acquisition Archetypes: The Bow Wave Effect". SEI white paper, 2007. http://www.sei.cmu.edu/library/assets/bowwave.pdf

2. For a bit of skepticism on my part, see Orders of Complexity.

3. I discussed some of the challenges that stem from functional decomposition in Humpty-Dumpty Effect