Friday, June 21, 2013

Failure is not an option

Continued from previous post "Control of circumstances"

From Chapter 1:

Excerpt
It seems that in all fields, however, the jobs where things get done never have formal authority commensurate with responsibility. (page 8)

As described in the previous post, Brooks alerts software developers to a few "inherent woes." One of the woes he calls out is a programmer's absence of authority. In a subsequent passage, Brooks will distinguish between 'actual' and 'formal' authority. That's a topic for another time.

For the moment, I want to describe a few examples of a management mechanism used to avoid the consequences of a decision gone bad. The mechanism turns on the ease of redefining success in the absence of an objective measure like profit. I'll call it the "failure is not an option" mechanism.

When I first joined JPL, I was hired as an ancillary contributor to a team that was preparing for a rover field test. I was new and relishing a job that was easily the most fun I'd ever had earning the rent. The team was staffed with really sharp guys and the project was fun.

The rover had been outfitted with several new instruments and operational capabilities. The test was intended to show the efficacy of an instrument-laden rover with fancy new technology. Success was important. The team was under considerable pressure--the sponsor from NASA headquarters was due to attend. A failure could mean a budget cut.

As the day of the field test approached, the system was acting flaky. The team was wrestling with hardware, software and wireless network issues. (Interesting side note: the rover was passing NETBIOS traffic and, to almost everyone's dismay, the rover processor was getting bogged down with lab-wide printer traffic.) Suffice it to say the system was very difficult to debug.

My role in the project ended prior to the actual field test. Before moving on to the next thing, I dropped in on the tech lead to thank him for the opportunity. We chatted. At one point, I asked what would happen if the rover failed to work. "It's an outcome," he said.

Seven years later, I was present at a flight software review (the Preliminary Design Review or 'PDR') for the next major Mars mission. The review board was picking apart the flight software development plan. It was clear to everyone, including the manager presenting the materials, that the plan was unrealistic. However, it fit the budget and schedule needs demanded by the project management. A different, more realistic, plan would have been dubbed unresponsive.

Except for a few odd action items, the flight software team passed the PDR with flying colors. In subsequent hallway discussions with respected colleagues, I frequently heard concerns about the project's status, which were shrugged off with customary dismay. This is how things are done; anything short of success was a disaster.

Ultimately the Mars mission was delayed 2 years. The Office of the Inspector General (OIG) issued a report on the sources of the delay. The OIG report does not refer to the planning problems that were discussed in a review three years before the launch delay.

A few years later, I was working on NASA's Constellation Program (often abbreviated as Cx). Cx was an enormous undertaking--as big as anything NASA has ever attempted. It required close coordination of different project teams working on separate large-scale systems, including an oversized manned capsule, a family of launch vehicles, a revitalized launch control system, an upgraded mission-operations system (i.e., 'Mission Control Houston'), and a new distributed simulation system for testing. Each project was very large. Coordination between the projects was required for success. The closest similar effort NASA had undertaken was the space station, which took nearly 30 years from the start of Space Station Freedom in 1984 to the completion of the International Space Station in 2011. Plans for Cx dwarfed the space station program goals.

Despite the complexity, Cx was underfunded, especially for execution under NASA's highly risk-averse approach to system development. Worse, the funds for the program depended on the retirement and replacement of the Shuttle--if the Shuttle continued to fly, there would not be funding for Cx. Consequently, the project had a very aggressive and unforgiving schedule.

In NASA, major schedule milestones are represented as reviews of significant engineering accomplishments. In Cx, the first round of reviews was focused on requirement documents. Requirement documents describe what the system shall do (e.g., "The system shall provide astronauts with a life support system."). For program success, these requirement documents needed to be consistent and coordinated across all the various projects. In order to ensure the requirements were correct, the program mandated a set of System Requirement Reviews, or 'SRRs'.

There was tremendous schedule pressure on the engineering teams. Everyone I knew was trying to keep pace with a relentless schedule. Suffice it to say, the work could not be accomplished in the time provided by the program's schedule. Nonetheless, the SRRs were all successful. They had to be: had any of them failed, all forward work on the project would have stopped, because management would have been forced to reckon with the conclusion that the project was not ready for the next phase. That would then have caused subsequent delays and ultimately required the extension of the Shuttle program and a serious budget conundrum.

Meanwhile, the unfinished engineering was bookkept as what's called "future work," i.e., the work that is typically required to pass the review in the first place. The unfinished work then created a bow wave that in turn led to subsequent schedule and budget problems. But the bow wave never slowed progress. When the Program was cancelled in February 2010 because the budget was found to be inadequate, Cx was considered to be more or less on schedule.

To wrap up...
I think it's fair to say that all but the biggest mission-ending failures can be redefined. The exceptions are failures like the shuttle accidents, the Mars Observer mishap, and the Mars '98 mishaps. On the other hand, the crash landing of Genesis, the ballooning budgets and schedules of the James Webb Space Telescope or the Mars Science Laboratory, and dozens of similar circumstances that are an ordinary part of a difficult engineering task are easily redefined and celebrated as success. And that is good news: it keeps the dollars flowing.

Perhaps the point of true interest here is that the Agency culture has evolved to ensure that there are very few wrong decisions. If that's the case, there may ultimately be a remedy. But what?
