Wednesday, June 26, 2013

More on the last bug

Continued from previous post "The last bug"

From Chapter 1:

Excerpt
...finding nitty little bugs is just work... So testing drags on and on, the last difficult bugs taking more time to find than the first. (page 9)

This seems like a good spot to include a passing note about Scrum-style Agile development. I've heard colleagues claim that Agile solves schedule problems because you plan and test the product as you go. (Admittedly a gross over-simplification, but adequate for this purpose.)

I'm not a big fan--partly because it's often touted as a cure-all that can cure cancer, solve world hunger and lead to world peace. (Brooks will discuss the general phenomenon of programming cure-alls in the silver-bullet chapter.) On less contrarian grounds, I'm not a fan because I've seen it improperly applied with very poor results. But, to be fair, I've also seen it applied with very good results.

From my perspective, the good thing about Agile is its incorporation of the iterative/incremental style of development that has been in use since the late '80s, first as Objectory and later as the Rational Unified Process. The unfortunate thing about Agile is that it de-emphasizes upfront planning. The result is a highly tactical decision process that makes it impossible to predict what will be in the final package.

Probably the worst thing about Agile is that it invokes a 'true-believer' mindset that is intolerant of deviation from the creed. This sort of orthodoxy gives rise to all sorts of pathological behavior. For example, I know of a very talented team of programmers who subverted an Agile process in order to ensure the task had a long-range plan that would produce the product promised to the customer. Call it a reverse Faustian strategy.

On the plus side, Agile seems very constructive for tasks where the architecture is very well understood and the design decisions happen on the edges of the core system--an e-business application, for example. However, for the development of new infrastructure or embedded (i.e. control system) applications, the lack of a clear vision will likely lead to considerable rework, schedule slips and budget overruns, since progress toward the end will be a random walk.

Tuesday, June 25, 2013

The last bug

From Chapter 1:

Excerpt
...finding nitty little bugs is just work... So testing drags on and on, the last difficult bugs taking more time to find than the first. (page 9)

Brooks continues his discussion of developers' "inherent woes" with a nod to the drudgery of testing. In the "Mythical Man-Month" chapter he'll assert that testing is the most underestimated part of a project and recommend that testing consume half the schedule.

As a software development manager, I was never able to allocate close to that amount of schedule for testing. For tasks that developed new software, I usually laid out a schedule with the following guidelines (a rough sketch of the layout follows the list):
  • A 6-month (i.e. 26-week) delivery boundary. A shorter schedule precluded the introduction of significant new features. A longer schedule was vulnerable to funding uncertainty.
  • A coding period of 16 weeks. Coding time drives other aspects of the schedule. The more coding, the more planning and testing. Longer schedules are too difficult to plan; too many details are yet to be discovered. Shorter schedules are prone to schedule slip from the inevitable surprises.
  • A planning period of 4 weeks, with additional preliminary planning occurring during the test period for the previous delivery. In practice, this is almost always too little, but a longer period often leads to a development stall. (A topic for another posting.)
  • An informal test period of 4 weeks. During this period there is no new feature development. The only code changes are bug fixes. The goal is to prepare the system for "formal" testing.
  • A formal test period of 4 weeks concluding with a "run for the record" or "acceptance test". In theory, but never in practice, there is minimal re-coding.
  • If a product is developed incrementally (i.e. in stages), the schedule is adjusted so that earlier builds have more planning and later builds have more testing.
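
For what it's worth, here's a minimal sketch of how those guidelines lay out against the 26-week boundary. The two-week overlap between the planning period and the previous delivery's test period is my own assumption, chosen only so the arithmetic closes; the phase names and durations come straight from the list above.

# A rough sketch (not any official planning tool) of the schedule outline above.
# The 2-week overlap of planning with the previous delivery's test period is an
# assumption, chosen so the phases fit inside a 26-week delivery boundary.

DELIVERY_WEEKS = 26

# (phase name, duration in weeks)
PHASES = [
    ("planning", 4),       # preliminary planning overlaps the prior build's test
    ("coding", 16),
    ("informal test", 4),  # bug fixes only, no new features
    ("formal test", 4),    # ends with the "run for the record"
]

PLANNING_OVERLAP = 2  # assumed weeks of planning done during the prior test period


def layout(phases, overlap):
    """Return (phase, start_week, end_week) tuples relative to the delivery start."""
    schedule = []
    week = -overlap  # negative weeks fall inside the previous delivery
    for name, duration in phases:
        schedule.append((name, week, week + duration))
        week += duration
    return schedule


if __name__ == "__main__":
    plan = layout(PHASES, PLANNING_OVERLAP)
    for name, start, end in plan:
        print(f"{name:14s} weeks {start:3d} to {end:3d}")
    print(f"weeks inside the delivery boundary: {plan[-1][2]} (target {DELIVERY_WEEKS})")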

In practice, this outline often equated to a forced march. There was no respite between releases. My teams were often pushed to the limits of their endurance. Burn-out was always a serious concern, especially for teams who worked on infrastructure systems like ground systems. Fundamentally, this was a consequence of constrained budgets and oversold expectations. Sadly, this is the norm for teams working on products for missions. I came to accept as fact that realistic budgets and schedules do not sell. If faced with the choice of being overworked or getting laid off, which would you choose?

For the sake of discussion, let's say that NASA software development managers were routinely able to follow Brooks' advice and allocate half of the schedule to test. Would that be enough? It depends. An experienced manager will consider the maturity of the product, the complexity of the system, the skill of the team and the risk inherent in the application. Whatever the case, it's simply impossible to fully test any software product. There are two principal reasons:
  • Testing for all the possible conditions of the system is a practical impossibility. An experienced test lead will have an instinct for the tests that matter.
  • A space system cannot be tested in the actual operational environment. Here's a point of major departure from the systems Brooks managed. A space system can only be tested in simulations or analogous environments.
    For example, the MSL landing code was tested by running millions of simulations using high-fidelity models of the spacecraft and Mars. Consider that challenge! Both the simulation code and the control system code have to be right. The simulation must be an accurate depiction. The control system must accurately analyze the simulated sensor data. And complementary errors must not compensate for one another. (A toy sketch of this setup appears just below.)
It is simply not enough for the space system to be bug-free--it must accurately capture the physics.
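
To make the shape of that problem concrete, here is a toy sketch of testing control logic against a separate simulation model--nothing like the real MSL test setup, and every number and function name is invented for illustration. The point is only that the physics model and the control code are independent pieces of software, and a test of one is only as trustworthy as the other.

# A toy illustration of simulation-based testing of control code -- nothing like
# the real MSL setup, just the shape of the problem. The "truth" model lives in
# simulate_descent(); the logic under test lives in simple_controller(). All
# numbers and names are invented for the example.

# --- simulation ("truth") side ---------------------------------------------
SIM_GRAVITY = 3.71        # Mars surface gravity used by the simulation, m/s^2
SIM_THRUST_ACCEL = 9.0    # thruster acceleration used by the simulation, m/s^2
DT = 0.1                  # integration time step, s


def simulate_descent(altitude, velocity, controller, max_time=600.0):
    """Propagate a 1-D lander with the simulation's physics model."""
    t = 0.0
    while altitude > 0.0 and t < max_time:
        thrust_on = controller(altitude, velocity)   # control code under test
        accel = -SIM_GRAVITY + (SIM_THRUST_ACCEL if thrust_on else 0.0)
        velocity += accel * DT
        altitude += velocity * DT
        t += DT
    return abs(velocity)  # speed at touchdown (or at timeout)


# --- control side (the code that would actually fly) ------------------------
CTRL_GRAVITY = 3.71       # the controller's own model of gravity
CTRL_THRUST_ACCEL = 9.0   # the controller's own model of its thruster


def simple_controller(altitude, velocity):
    """Fire the thruster when the remaining altitude nears the stopping distance."""
    net_decel = CTRL_THRUST_ACCEL - CTRL_GRAVITY
    stopping_distance = (velocity ** 2) / (2.0 * net_decel)
    return velocity < 0.0 and altitude <= stopping_distance * 1.1  # 10% margin


def test_soft_landing():
    touchdown_speed = simulate_descent(1000.0, -60.0, simple_controller)
    assert touchdown_speed < 2.0, f"hard landing: {touchdown_speed:.1f} m/s"


if __name__ == "__main__":
    test_soft_landing()
    print("toy landing test passed")

Notice the trap described above: if SIM_GRAVITY and CTRL_GRAVITY were both wrong in the same way, the test could pass while both the model and the controller were broken.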

Over the years, I came to believe teams have an intuitively reliable knowledge of product readiness. You might say the same of an author and her novel or a visual artist and his painting. It's somewhat miraculous. Somehow you just know.

Regrettably, this sort of intuition is insufficient for a management cadre steeped in the engineering catechism of rigor. 'Engineering' requires objective evidence, data and proof of rigor. So, as a practical matter, an elaborate charade of test completeness is concocted on paper to demonstrate to management's satisfaction that everyone is acting responsibly. After all, we must not fail.

Monday, June 24, 2013

Perfection is not enough

From Chapter 1:

Excerpt
If one character, one pause, of the incantation is not strictly in proper form, the magic doesn't work. (page 8)

"Perfection" is another of the inherent woes of the tar pit.  Brooks has taken a some literary license to make the point that computers do not tolerate programing errors.  The burden of correctness falls on the programmer.

Programming tools have significantly improved since the time the IBM 360 was developed. For example:
  • Compilers are interactive.  Programmers get as many tries as they need to get a program to compile.
  • Text editors are more helpful.  They detect spelling and syntax mistakes.
  • Debuggers can step through a program.  The programmer can watch program flow.
  • Version control and configuration management tools help keep track of changes.  A programmer can backtrack to see where a bug crept in.
  • New test tools help identify a wide range of problems. Programmers can watch memory allocation or check for illegal paths of program operation.

While these improvements have made programming easier, the problems persist. A little more than a decade ago, commercial software vendors advertised 'bug-free software.' Today, no responsible software engineer would make that claim (unless perhaps the program was trivially simple). Why? There are too many unknowns. Consider a run-time error--an error that occurs while the program is running, as opposed to an error that appears while the program is being created. Here are just a few possible causes:
  • The error might be in the hardware.
  • The error might be in the operating system.
  • The error might be in the tools that build the program.
  • The error might be the result of interactions with another running program.
  • The error may be caused by an untested interaction between parts of the new program.
  • The error might be caused by a configuration of some or all of the piece parts listed above.

Nowadays, the conventional wisdom is that a program will have bugs. In fact, good programs are built with the assumption that there will be defects. A competent programmer includes assertions about the data being processed and, if an error occurs, handles the problem gracefully. For example, it used to be that if your word processor crashed you lost your work. Now there's a decent chance the data will be preserved. If that sounds difficult to program, it is.
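
Here's a minimal sketch of that defensive posture: assert the assumptions the code depends on, and checkpoint the user's work so a crash doesn't lose it. The autosave path, record format and function names are all invented for the example; a real word processor is vastly more elaborate.

# A minimal sketch of "assume there will be defects": validate inputs with
# assertions and checkpoint work so a crash doesn't lose everything. The
# autosave path and record format are invented for the example.
import json
import tempfile
from pathlib import Path

AUTOSAVE = Path(tempfile.gettempdir()) / "draft.autosave.json"


def append_paragraph(document, paragraph):
    """Add a paragraph after checking the assumptions the code depends on."""
    assert isinstance(document, list), "document must be a list of paragraphs"
    assert isinstance(paragraph, str) and paragraph, "paragraph must be non-empty text"
    document.append(paragraph)


def autosave(document):
    """Write the current state somewhere recoverable."""
    AUTOSAVE.write_text(json.dumps(document))


def edit_session(document, incoming_paragraphs):
    """Apply edits, checkpointing as we go; preserve work even on failure."""
    try:
        for text in incoming_paragraphs:
            append_paragraph(document, text)
            autosave(document)                    # checkpoint after each change
    except Exception:
        autosave(document)                        # last-ditch save before giving up
        raise                                     # still report the defect


if __name__ == "__main__":
    doc = ["First paragraph."]
    edit_session(doc, ["Second paragraph.", "Third paragraph."])
    print("recovered copy:", json.loads(AUTOSAVE.read_text()))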

The problem is tougher when you leave the comfy confines of the workstation or the desktop. A space system must operate in the physical world; the challenge is much more complex. For one thing, the system must be designed so that any program bugs can be handled remotely; if not, a mission may come to an abrupt end (e.g. Mars Pathfinder). New tools have come along that help considerably, but they have limited value. Similar problems went undetected by these tools on the recently launched Mars Science Laboratory mission--fortunately without mission-ending consequences.

But these kinds of programming challenges are just the tip of the iceberg. The greater challenge stems from the formidable fact that a space system must reason about the physical world. It must have knowledge of its position, its stored power, its temperature, its memory, etc. And, the physical world is unforgiving. A part failure, an operational error or an extreme environmental condition (like extreme temperature or uncontrollable tumble) can end a mission. Bottom line: a spacecraft has much more rigorous error detection and response requirements than a word processor.

These systems that operate in the physical environment are often referred to as control systems. A simple control system will obtain data from a sensor, like a temperature sensor, and take actions based on the sensor data. A complex control system will have a panoply of sensors. The system reasons about the data from the sensors to create knowledge about itself and the environment. It then takes action in response to that knowledge. The selection of an action will depend on operational intent which may change over time. And while it's doing all that, it's checking for errors that may threaten the mission.
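
A bare-bones sketch of the simple case might look something like this. The sensor, limits and actions are all invented; a real system would layer fault responses, operating modes and changing operational intent on top of a loop like this one.

# A bare-bones sketch of the simple control system described above: read a
# sensor, check for conditions that threaten the system, then act on the
# reading. Sensor names, limits and actions are all invented.
import random


def read_temperature_sensor():
    """Stand-in for a hardware read; returns degrees Celsius."""
    return random.uniform(-30.0, 50.0)


def control_step(heater_on, setpoint=20.0, deadband=2.0,
                 survival_low=-25.0, survival_high=45.0):
    """One pass through the loop: sense, check for faults, decide, act."""
    temp = read_temperature_sensor()

    # Fault check first: out-of-range readings threaten the mission.
    if temp < survival_low or temp > survival_high:
        return heater_on, "ENTER_SAFE_MODE"

    # Simple bang-bang control with a deadband around the setpoint.
    if temp < setpoint - deadband:
        heater_on = True
    elif temp > setpoint + deadband:
        heater_on = False
    return heater_on, "NOMINAL"


if __name__ == "__main__":
    heater = False
    for _ in range(5):
        heater, status = control_step(heater)
        print(f"heater={'on' if heater else 'off'} status={status}")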

In the space age, program "perfection" means something quite different from what Brooks suggested. It requires an overall system knowledge that extends to all aspects of system design. This is a bewilderingly complex problem that conventional engineering practice does not adequately address. If we are to make substantial progress, our engineering methods will have to evolve to better manage system complexity.

In the current Agency culture, changes in engineering practice are typically superficial, underfunded and greeted with a deserved skepticism.

Friday, June 21, 2013

Failure is not an option

Continued from previous post "Control of circumstances"

From Chapter 1:

Excerpt
It seems that in all fields, however, the jobs where things get done never have formal authority commensurate with responsibility. (page 8)

As described in the previous post, Brooks alerts software developers to a few "inherent woes." One of the woes he calls out is a programmer's absence of authority. In a subsequent passage, Brooks will distinguish between 'actual' and 'formal' authority. That's a topic for another time.

For the moment, I want to describe a few examples of a management mechanism used to avoid the consequences of a-decision-gone-bad. The mechanism turns on the ease of redefining success in the absence of an objective measure like profit. I'll call it the 'failure is not an option' mechanism.

When I first joined JPL, I was hired as an ancillary contributor to a team that was preparing for a rover field test. I was new and relishing a job that was easily the most fun I'd ever had earning the rent. The team was staffed with really sharp guys and the work was a pleasure.

The rover had been outfitted with several new instruments and operational capabilities. The test was intended to show the efficacy of an instrument-laden rover with fancy new technology. Success was important. The team was under considerable pressure--the sponsor from NASA headquarters was due to attend. A failure could mean a budget cut.

As the day of the field test approached, the system was acting flaky. The team was wrestling with hardware, software and wireless network issues. (Interesting side note: the rover was passing NETBIOS traffic and, to almost everyone's dismay, the rover processor was getting bogged down with lab-wide printer traffic.) Suffice to say, the system was very difficult to debug.

My role in the project ended prior to the actual field test. Before moving on to the next thing, I dropped in on the tech lead to thank him for the opportunity. We chatted. At one point, I asked what would happen if the rover failed to work. "It's an outcome," he said.

Seven years later, I was present at a flight software review (the Preliminary Design Review or 'PDR') for the next major Mars mission. The review board was picking apart the flight software development plan. It was clear to everyone, including the manager presenting the materials, that the plan was unrealistic. However, it fit the budget and schedule needs demanded by the project management. A different, more realistic, plan would have been dubbed unresponsive.

Except for a few odd action items, the flight software team passed the PDR with flying colors. In subsequent hallway discussions with respected colleagues, I frequently heard concerns about the project's status, which were shrugged off with customary dismay. This is how things are done; anything short of success would have been a disaster.

Ultimately the Mars mission was delayed 2 years. The Office of the Inspector General (OIG) issued a report on the sources of the delay. The OIG report does not refer to the planning problems that were discussed in a review three years before the launch delay.

A few years later, I was working on NASA's Constellation Program (often abbreviated as Cx). Cx was an enormous undertaking--as big as anything NASA had ever attempted. It required close coordination of different project teams working on separate large-scale systems, including an oversized manned capsule, a family of launch vehicles, a revitalized launch control system, an upgraded mission-operations system (i.e. 'Mission Control Houston') and a new distributed simulation system for testing. Each project was very large, and coordination between the projects was required for success. The closest similar effort NASA had undertaken was the space station, which took nearly 30 years from the start of Space Station Freedom in 1984 to the completion of the International Space Station in 2011. Plans for Cx dwarfed the space station program goals.

Despite the complexity, Cx was underfunded, especially for execution under NASA's highly risk-averse approach to system development. Worse, the funds for the program depended on the retirement and replacement of the Shuttle--if the Shuttle continued to fly, there would not be funding for Cx. Consequently, the program had a very aggressive and unforgiving schedule.

In NASA, major schedule milestones are represented as reviews of significant engineering accomplishments. In Cx, the first round of reviews focused on requirement documents. Requirement documents describe what the system shall do (e.g. "The system shall provide astronauts with a life support system."). For program success, these requirement documents needed to be consistent and coordinated across all the various projects. In order to ensure the requirements were correct, the program mandated a set of System Requirements Reviews, or SRRs.

There was tremendous schedule pressure on the engineering teams. Everyone I knew was trying to keep pace with a relentless schedule. Suffice to say, the work could not be accomplished in the time provided by the program's schedule. Nonetheless, the SRRs were all successful. They had to be; otherwise all forward work on the program would have stopped, because management would have been forced to reckon with the conclusion that the program was not ready for the next phase. That, in turn, would have caused further delays, required an extension of the Shuttle program and created a serious budget conundrum.

Meanwhile the unfinished engineering was bookkept as what's called "future work"--i.e. work that would typically be required to pass the review in the first place. The unfinished work created a bow wave that in turn led to subsequent schedule and budget problems. But the bow wave never slowed progress. When the Program was cancelled in February 2010 because the budget was found to be inadequate, Cx was considered to be more or less on schedule.

To wrap up...
I think it's fair to say that all but the biggest mission-ending failures can be redefined. The shuttle accidents, the Mars Observer mishap and the Mars '98 mishaps could not be explained away. On the other hand, the crash landing of Genesis, the ballooning budgets and schedules of the James Webb Telescope and the Mars Science Laboratory, and dozens of similar circumstances that are an ordinary part of a difficult engineering task are easily redefined and celebrated as successes. And that is good news: it keeps the dollars flowing.

Perhaps the point of true interest here is that the Agency culture has evolved to ensure that there are very few wrong decisions. If that's the case, there may ultimately be a remedy. But what?

Thursday, June 20, 2013

Control of circumstances

From Chapter 1:

Excerpt
It seems that in all fields, however, the jobs where things get done never have formal authority commensurate with responsibility. (page 8)

In the 'Tar Pit' chapter, Brooks alerts software developers to a few "inherent woes." He points out that a developer does not control the circumstances of his or her work. In practice, that means that management decides what funding, schedule, tools and infrastructure are available to accomplish the goal. The individual contributor must work with what's provided, and these management choices may leave much to be desired. Brooks describes the aggravation of working with poor tools, but it would be fair to add inadequate funding and busted infrastructure to the list. Bottom line: a developer will not have the authority to change these circumstances and must live with them.

In my experience, obtaining and maintaining authority is an overarching concern across the Agency. It affects decision making at all levels. In this respect, NASA is anything but unique; authority is a ubiquitous concern in any bureaucracy. However, the methods for exercising authority in NASA play a significant role in shaping how software is funded, scheduled, developed, tested and used.

The topic of authority is certain to come up many times in these postings, but I'll start by describing the use of role statements at JPL.

A role statement describes your job. A good role statement will define your responsibilities, the authority you have to meet those responsibilities and the obligation you have to show others that you are getting the job done. A well-structured job description will balance responsibility and authority; otherwise, an engineer may be held responsible for outcomes without the ability to meaningfully influence events. Many large organizations capture the allocation of responsibility and authority in a Responsible, Accountable, Consulted and Informed (RACI) matrix, sometimes called a linear responsibility chart. My own preference was for the shorter Responsibility, Authority, Accountability (RAA) chart, which was a one-slide, bulleted description.

I seldom saw role statements used at JPL. In practice, the best job description was often the one used in the hiring requisition. (How many of you actually ended up doing the job described in the want ad?) However, in the rare instances where a manager did prepare an RAA, approval was highly political and time consuming. Inevitably, the sticking points were rooted in the authority bullets. Why? Because authority equates to delegation, and delegation enables an engineer or middle manager to make decisions. Why the reluctance? In a bureaucracy, a-decision-gone-bad has consequences, and those consequences propagate up the management chain. Someone will end up holding the bag.

There are two standard remedies for the risks associated with making a decision: 1) decision by committee and 2) process loading, or the piggy-backing of mountains of paperwork on top of each engineering step. If accountability should come knocking, both remedies provide substantial benefit--there will be a room full of people and a warehouse full of paper ready for the defense. Without this protection, the risks to the upwardly mobile manager or aspiring careerist are grave. And since prevention is better than remediation, delegation is done with the utmost caution and awarded only to the most conventional and predictable members of the community. This is how the select are chosen.

Committees and processes have become an end in themselves; delivery of the product has become an ancillary goal. Of course, committees and processes drive up costs. However, if a software package is delivered without the imprimatur of a committee and a stack of process documents, it will not be trusted and will never be used. More than likely, that code would simply grow stale in some nameless repository. This cultural convention plays a large role in the lack of progress in the Agency.

Obstacles to progress will be a recurring theme in later postings.

Wednesday, June 19, 2013

"The Grip of Tar"

From Chapter 1:

Excerpt
Large-system programming has over the past decade been such a tar pit, and many great and powerful beasts have thrashed violently in it. (page 4)

Brooks begins The Mythical Man-Month with a summary of the difficulties he encountered as manager of the IBM 360 project. He notes that most programming efforts have failed to meet planned schedules or budgets. He specifically calls out a few sources of trouble: intolerance of defects (software must be 'perfect'), the disconnect between responsibility and authority, the travails of testing and debugging, and the certainty of obsolescence. He also admonishes us to consider that developing a software "product" is an order of magnitude more expensive than just writing a program--an all too familiar experience for those of us who have had to maintain our own products. Brooks does soften this bleak message with a tribute to the joys of programming, but the main point is that, back in the '60s, software development was a tough business.

It's still a tough business. During the first decade of the 21st century, the NASA software development culture fit the tar-pit depiction. Brooks' observations remain relevant, but the problems confronting NASA software developers are more diverse. In subsequent postings, I plan to reflect on what software developers encounter in the space biz and why.

To first order, the development challenges originate from the technical domain and from the culture. The technical challenges stem from the diversity inherent in the system (flight/ground), the complexity of operating in a physical environment and the need to autonomously respond to faults. The cultural challenges stem from the engineering biases inherent in a dominant hardware culture, the dependence on government funding (which is inherently political at all levels) and the time-tested conservatism characteristic of large bureaucracies. Added together, you might say that the budget, schedule and technical challenges are the result of risk-averse decision making, in response to highly constrained budgets, by a management that prioritizes hardware concerns.

During my years at JPL, I was often dismayed by decisions that struck me as perniciously wrong-headed. However, with the wisdom of hindsight that comes with retirement, those decisions (which still seem unfortunate) were fundamentally rational in the context of the NASA culture. To put it another way, the agency is staffed with dedicated, highly talented managers and engineers who would be acting against their own best interests if things were done differently.

Obviously NASA has enjoyed tremendous success in both manned and planetary exploration--tremendous credit belongs to those who have developed and operated those systems. But, so far, we've addressed only the easy software-engineering challenges. The tough challenges, those that must be met if we are to build affordable, reliable space systems, remain. We've barely scratched the surface.