Monday, September 30, 2013

In the service of ambition

From Chapter 2: The Mythical Man-Month

Excerpt
...delay at this point has unusually severe financial, as well as psychological, repercussions. The project is fully staffed, and cost-per-day is maximum...Indeed, these secondary costs may far outweigh all others. (page 20)

From The Campaigns of Napoleon.1

Excerpt
[Napoleon's] inhuman demands on his own followers during desert marches...reveal his lack of concern for his men. To Bonaparte armies were merely instruments for his use.(p. 248)
[Painting by Jean-Léon Gérôme]

Brooks is burnishing his argument on the necessity of ensuring a sufficient schedule for system test. The repercussions of failing to do so--in terms of human cost--cannot be overstated. The consequences for a project team can be devastating.

A mission is a campaign. A team can be unwittingly led to tremendous self-sacrifice by a management whose aspirations exceed its resources. And as in war, the loyal are often sacrificed. I'll strain that comparison for the sake of illustration.

History is awash with examples of military campaigns in which armies suffered terribly in the service of an ambitious leader: the Athenian army under Nicias at Syracuse, the Roman army in Persia under Crassus, the Fourth Crusade, the British army in the Virginia wilderness under Braddock, the French army under Leclerc in Saint-Domingue (Haiti), and both the German and Napoleonic adventures into Russia, to name just a few. However, Napoleon's Syrian campaign is an especially poignant example of suffering in the service of ambition--perhaps because it's a tragedy that is relatively obscure.

In July of 1798, a Napoleon-led French expedition of nearly 40,000 men landed in Egypt. Napoleon had promoted the expedition: "We must go to the Orient; all great glory has been acquired there."2 This allusion to glory was intended to evoke Alexander the Great. Interestingly, the Egyptian adventure included some 500 artists and scientists who were to open the field of Egyptology.3

From the earliest stages of the expedition, the French army was dispirited. The campaign opened with a 72-hour, waterless march through the desert that was called a 'living hell.' The army was near mutiny, and one of Napoleon's generals committed suicide. Napoleon exercised his relentless will and, within a month, Egypt was in French hands. Success seemed assured.

However, by October the tide had turned. A British naval blockade had isolated the expedition. Back home, the Second Coalition was clobbering French armies on the continent, and the Directory had lost interest in the Egyptian campaign. Again, the army was demoralized. Senior officers started to resign and return to France. But most pressing, the Turkish Sultan had issued a firman declaring holy war on the French.

Napoleon was never one to sit around and wait to be overtaken by events. He would defeat the Sultan and establish French dominance in the region. Preparations began for an invasion of Syria. The stage was set for one of the crueler adventures of Napoleon's career.

On February 6th, 1799, the vanguard of a 13,000-man French army headed off into the desert. Napoleon always pushed his armies hard; speed was his best tactical tool. The plan was for the army to cover 15 miles per day and reach Gaza in 8 days. Due to unexpected Turkish resistance and harassment from the British Navy, the provision-strapped French army suffered through a month-long trudge through the desert before encamping outside of Jaffa. After a short siege, the Turkish garrison in Jaffa surrendered on a French promise of clemency. Once the Turks were in custody, Napoleon ignored the promise of clemency and had the entire Turkish garrison of 3,000, along with 1,400 additional prisoners, executed. In his memoir, Napoleon explained that the executions were necessary because of scarce provisions.

As if by cosmic justice, a major outbreak of plague afflicted the French. Once again French morale tanked. To rally the troops, Napoleon made a risky (and later self-celebrated) visit to the plague-stricken in the Jaffa pesthouse. He even helped carry a diseased corpse at tremendous personal risk--a gesture that is said to have inspired the army to march on.

[Detail from Antoine-Jean Gros, Bonaparte visitant les pestiférés de Jaffa]

The army next advanced to Acre. Acre was considered the "key to Palestine"--its capture was central to Napoleon's strategy. On the Ides of March, the army opened the siege. This time, with the help of a timely arrival by the British navy, the Turks foiled the French efforts.

By early May, Napoleon realized the chance of success was rapidly slipping away. He ordered a series of desperate but failed attacks. When the Turks began to receive reinforcements, Napoleon knew he was beaten. A retreat was ordered. The worst miseries of the campaign were to follow.

Meanwhile, Napoleon issued a series of proclamations for the Egyptian and French audiences that described his great victories in Syria. The retreat to Egypt, he explained, was due to the upcoming summer season.

Napoleon's retreat from Acre was saddled with daunting problems. It was a disaster that foreshadowed his retreat from Moscow in 1812, but that was 13 years away. One seemingly insoluble problem was the need to transport 2,300 sick and wounded back to Egypt. Napoleon proposed poisoning the hopeless cases. He was initially dissuaded, but after the Turks mounted an aggressive pursuit, he ordered the mercy killing of those French who were too incapacitated to walk back.

[Napoleon and his Generals in Egypt]

The expedition ended on June 3rd, when the demoralized remnants of the French army straggled into Katia, Egypt. When all was said and done, the year of campaigning had cost the French army 1/3 of its men.

Nonetheless, Napoleon staged a triumphal march in Cairo on June 14. Two months later, Napoleon returned to France as the self-declared "Savior of France" and became First Consul of the French government. Thus began the greatest phase of Napoleon's career. Meanwhile, a dispirited French army was forced to remain in Egypt until the Treaty of Amiens was signed in 1802.

In summary, the Egyptian campaign was not quite up to Caesar's "Veni, vidi, vici." More like: I saw glory, I overcommitted, I declared success.4

Overcommitment among software development teams is so common that it has a name: the Death March. Every experienced programmer has lived through one. Most pledge, unsuccessfully, never to do 'that' again, but few have the will to resist the drive to succeed. I saw the regrettable repercussions on mission teams: divorce, heart attack, even suicide. It's a serious business. In recent years, the strains on NASA budgets have made the death march the norm and not the exception.

There are times when self-sacrifice is warranted. Our great country has thrived because many have answered the call to protect and serve. We owe a profound debt to those men and women. But the romantic calling to build a new vehicle for a return to the Moon or a robotic spacecraft to crawl about Mars is not war, and personal sacrifice seems an unwarranted antidote to an ill-conceived NASA project.

Great achievements require great leadership and a great leader's requisite ambition. Napoleon was a great genius, but he equated his personal ambitions with the good of all. That is the road to malevolence. Few leaders have both the requisite wisdom to maintain that separation and the luck to survive. Perhaps it's too tall an order, but we can at least hope that the ambitions of the leaders of the American space program would be tempered by a concern for the people in the trenches.


1. Chandler, D. "The Campaigns of Napoleon. The Mind and Method of History's Greatest Soldier." Scribner. 1966.
2. Chandler, p.209.
3. Among other things, the expedition returned with the Rosetta Stone.
4. According to Google Translate, the Latin would be: Vidi gloriam meam super commisit ego, successu. Suggested improvements are welcome.



Tuesday, September 24, 2013

Must it be?

Continued from Riding the bow wave.
Shamelessly altered excerpt
The test effort usually exposes important gaps... Why? ... 'interface management.'
From the score of Beethoven's String Quartet No. 16 in F major (Op. 135)1
Beethoven opens the last movement of his last significant compositional effort with the question: Must it be? He concludes: It must be. But should we?

Thanks to a timely mishap on the latest International Space Station (ISS) resupply mission, I'm in an existential frame of mind.

In recent postings I've been ruminating about the ad hoc nature of testing. Thanks to an ex-colleague, who is apparently omniscient about all things NASA, I learned about this timely existence proof of test fallibility. He sent me packing to the Orbital Sciences Corporation web page for reports on the status of the Cygnus rendezvous with the Space Station. Here's the link: Antares/Cygnus Updates.3

Cygnus is the cargo carrier; Antares is the launch vehicle. The Cygnus/Antares combo is the latest "commercial" solution for ISS resupply to come down the pipe--the SpaceX offerings, Dragon and Falcon, were first. The Orbital system is comparable to SpaceX's, but Orbital lacks a gifted self-promoter and is less well known.

Orbital and SpaceX are the two surviving contenders in NASA's Commercial Orbital Transportation Services (COTS) Program. Both have received substantial funds from the Agency. Prior to 2011, Orbital received $260M and SpaceX received $376M from NASA. The combined COTS budget for FY12/13 was around a billion dollars (FY12=$400M, FY13=$525M), which I'm sure was judiciously split between Orbital and SpaceX.2 If you add it up, that's in the vicinity of $1B apiece. (But not quite even-steven.) If this is 'commercial', what would you call 'government'? I digress.

Orbital's initial launch of Antares/Cygnus was on September 19. Cygnus was scheduled to rendezvous with the space station on September 22. Here are a few telling excerpts from the Orbital status page.
September 21, 2013
As of mid-day today, Cygnus continues to perform...remaining on track for its rendezvous...tomorrow morning.
7:45 a.m. Sunday, September 22, 2013
Following the discovery of a data format discrepancy...today's rendezvous with the station was postponed...A software update has been developed and will be tested on a ground-based simulator during the day on Sunday.
September 22, 2013
This morning, at around 1:30 a.m. EDT, Cygnus...found that some of the data received had values that it did not expect, causing Cygnus to reject the data. This mandated an interruption of the approach sequence.
Monday, September 23, 2013
...Orbital and NASA together decided to postpone the approach...Over the past 24 hours, the Orbital team developed and tested a software fix for the data format mismatch...This new schedule will allow the Orbital operations team to carefully plan and be well-rested before restarting the critical final approach.
I have no direct knowledge of the circumstances that led to the defects in Cygnus data processing; however, I do know a bit and can surmise the following:
  • The discovery of the data problem was well past the 11th hour.
  • The Cygnus/Antares systems underwent thousands of hours of testing prior to launch. It's likely the wrong test data was used.
  • The ISS ICD has been in place for decades and successfully used by both American and Russian vehicles.
  • The ISS telemetry data format does not conform to the most commonly used international standard.4
  • The Orbital engineering team is most surely competent and staffed with talented engineers.
  • The energy expended in handwringing at Orbital Ops must have been enough to power a village in Virginia. Everyone is exhausted.
  • The Orbital management has reported the results accurately, but with a tone that conveys there has been no departure from the routine.
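Orbital's description of the failure--data arriving with values the receiver "did not expect," causing it to reject the data--maps onto a familiar defensive pattern: validate every incoming field against the interface spec and refuse the frame on any discrepancy. Here's a toy sketch; the field names, types, and ranges are all invented and have nothing to do with the real ISS interface.

```python
# Toy illustration of an ICD-style data check: a receiver validates each
# incoming field against its documented type and range, and rejects the
# frame if anything is unexpected. All field names and ranges are invented.

ICD_SPEC = {
    # field name: (expected type, (min, max))
    "time_tag": (int,   (0, 2**32 - 1)),
    "range_m":  (float, (0.0, 5.0e5)),
    "mode":     (int,   (0, 7)),
}

def validate_frame(frame: dict) -> list[str]:
    """Return a list of discrepancies; an empty list means the frame is accepted."""
    errors = []
    for name, (ftype, (lo, hi)) in ICD_SPEC.items():
        if name not in frame:
            errors.append(f"missing field: {name}")
        elif not isinstance(frame[name], ftype):
            errors.append(f"{name}: format mismatch ({type(frame[name]).__name__})")
        elif not (lo <= frame[name] <= hi):
            errors.append(f"{name}: value {frame[name]} out of range")
    return errors

good = {"time_tag": 1000, "range_m": 1234.5, "mode": 2}
bad  = {"time_tag": 1000, "range_m": "1234.5", "mode": 9}  # wrong type, out of range

assert validate_frame(good) == []
assert len(validate_frame(bad)) == 2
```

Note that a check like this is only as good as the spec it encodes--which is exactly why testing against the wrong test data sails through cleanly.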

Why do these things happen?

Last night, I had dinner with another buddy from my NASA days. We discussed the Cygnus data problems and he pointed out that "they had problems building the pyramids. Why should things be different?" He is right, of course; nothing unusual here. No engineering effort will be free of problems.

But should we simply accept the status quo of test practice as inevitable? Is testing finally and ultimately doomed to be an ad hoc, incomplete, finger-crossing process? Should we simply shrug and accept that it must be this way?

I hope not.

How can we progress to the next phase of building vastly more complex systems if there is no progress in the way we approach our objectives? Especially testing! What if we were still confined to the building techniques used by the Egyptians? There would be no arch, no concrete, no flying buttresses, no structural use of glass and iron. Without those innovations there would have been no Roman aqueducts, no Colosseum or Pantheon. No Hagia Sophia. No Notre Dame. No Crystal Palace. No Chrysler Building. No Astrodome.

It's not that we should expect the end of problems, but that we should rue complacency. In many respects the Cygnus problem is a routine matter. But, it is the very fact that it is considered routine that causes dismay. Must it be? Let's hope it's not.


1. If you want to hear the chords, jump to 16:45 in the recording. But, think twice! This is one of Beethoven's masterpieces.
2. FY12 funding and pre-FY12 funding come from the NASA FY13 budget request. The FY13 budget was reported in the Space Politics Blog.
3. These links tend to be broken.
4. “The good thing about standards is that there are so many to choose from.” Andrew Tanenbaum

Saturday, September 21, 2013

Riding the bow wave

From Chapter 2: The Mythical Man-Month

Excerpt
Failure to allow enough time for system test, in particular, is peculiarly disastrous. Since the delay comes at the end of the schedule, no one is aware of schedule trouble until almost the delivery date. (page 20)

From "Acquisition Archetypes: The Bow Wave Effect"1 

Excerpt
"We don't compromise on schedule delivery date... we just kept dropping functionality... A growing mass of work had to be done at the end."

Brooks will repeatedly admonish us to allocate 50% of the schedule to test.2 I'm assuming he has the waterfall model in mind. If so, there is more to the test phase than just testing; the waterfall test phase includes the integration of separately developed, 'finished' pieces. Typically these are big pieces. For a space system, this integration will include the ground system command infrastructure with the on-board command infrastructure, the command system with the telemetry system, the downlinked telemetry with prediction analysis tools, the onboard fault protection software with the power, attitude control, thermal and propulsion software...and that's just a sample. In the waterfall model, if all the previous steps were done correctly, everything should fit together like so many Lego blocks. That's the theory--a theory that seldom, if ever, works. Only the most unwitting manager would insist that software integration is simply a matter of testing properly implemented pieces. (Unlike sasquatch, such managers exist and manage NASA budgets!)

In practice the test effort usually exposes important gaps in development. Why? The two principal culprits: 'the bow wave' and 'interface management.'

The bow wave
A development bow wave is the unfinished work that gets pushed toward the end of a development cycle and bites you from behind when you arrive there.

Bow waves build because development is the process of discovery. There are always unanticipated problems: A broken library, a compiler bug, a design mismatch, an incorrect requirement, an error condition that emerges from the blue... The possibilities are endless.

These discoveries happen in the context of programmatic commitments made in the form of budget and schedule. The definition of a good team (i.e. one that functions without disruptive 'help' from management) is one that meets its commitments. Once you add in procrustean schedule notions like EVM, you have a bow-wave-generating recipe for delaying or modifying features for the sake of sunny-day progress reports. Meanwhile, the bow wave builds as the unanticipated collides with the commitment. It resembles a Ponzi scheme predicated on schedule slips or subsequent releases. Just like the Ponzi scheme, the piper must be paid, and payment inevitably comes during 'Test.' Consequently, what was 90% complete prior to test suddenly becomes 50% complete, and that hefty 'test' schedule becomes synonymous with good planning, albeit at an exorbitant price.
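The dynamic is easy to caricature numerically. In this sketch (every rate is invented), each build cycle plans a fixed amount of work, discovery adds unplanned work, and most of the unplanned work is deferred to keep the progress report sunny; the deferred pile is the bow wave that comes due in 'Test.'

```python
# Caricature of bow-wave growth: each build cycle, discovery adds unplanned
# work, and a fraction of it is deferred to protect the schedule report.
# All rates below are invented for illustration.

def bow_wave(cycles=6, planned_per_cycle=100, discovery_rate=0.3, defer_rate=0.8):
    deferred = 0.0
    for _ in range(cycles):
        unplanned = planned_per_cycle * discovery_rate  # surprises this cycle
        deferred += unplanned * defer_rate              # pushed past the milestone
    total_planned = cycles * planned_per_cycle
    reported_done = total_planned                       # every milestone "met"
    actual_done = total_planned - deferred              # the truth arrives in Test
    return reported_done, actual_done, deferred

reported, actual, wave = bow_wave()
print(f"reported: {reported}, actual: {actual:.0f}, bow wave: {wave:.0f}")
# With these made-up rates: 600 units reported done, ~456 actually done,
# and ~144 units of deferred work waiting to surface during 'Test'.
```

The point of the caricature: nothing in the reported numbers ever looks wrong until the deferred column is finally counted.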

Interface management
A large system, like a space system, is built by a dozen-or-two teams, each working on separate functional pieces that must be integrated to work as a whole.3 Just how these pieces will fit together is typically captured in a document, called an Interface Control Document (ICD), that describes the inputs and outputs of each piece. The ICD may also describe the correct steps for functional pieces to interact (often called a protocol).
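To make the idea concrete, here is one entirely hypothetical ICD entry rendered as code: a fixed binary message layout plus the ordered handshake that a 'protocol' section would specify. The message name, fields, sizes, and handshake steps are all invented for illustration.

```python
# Hypothetical ICD entry rendered as code: a fixed binary layout for one
# message, plus a documented handshake order. Field names, sizes, and the
# protocol steps are all invented for illustration.
import struct

# "GNC_STATUS" message: little-endian, 32-bit sequence count, 64-bit time
# tag, then three 32-bit floats (a position vector, say).
GNC_STATUS_FMT = "<IQ3f"

def pack_gnc_status(seq, time_tag, x, y, z):
    return struct.pack(GNC_STATUS_FMT, seq, time_tag, x, y, z)

def unpack_gnc_status(buf):
    return struct.unpack(GNC_STATUS_FMT, buf)

# The ICD's protocol section, as an ordered handshake:
HANDSHAKE = ["SYNC", "AUTH", "GNC_STATUS", "ACK"]

msg = pack_gnc_status(7, 123456789, 1.0, 2.0, 3.0)
assert len(msg) == struct.calcsize(GNC_STATUS_FMT)  # 24 bytes with this layout
assert unpack_gnc_status(msg)[0] == 7
```

Notice how much the code pins down that prose tends to leave loose: byte order, field widths, and the exact step sequence. The trouble described below starts when the code and the document drift apart.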

As a rule, ICDs were seldom accurate and interfaces tough to manage. Here are a few of the reasons why:
  • Interface change is fundamentally a serial activity. While an intended change may be documented, the actual implementation may differ. Consequently, the documentation may not reflect the actual artifact.
  • Implementers are busy. In the rush to a deadline, a developer will be hard pressed to record each change.
  • Auto-documentation by document generators is no panacea. Distribution needs to be timely. If developers from other teams implement against immature designs, the code may be whipsawed and the stability of the build will be jeopardized.
  • Over time, a software system will develop undocumented interfaces that make the system brittle. An interface change will break the system in unanticipated ways that only show up during test.
  • Interface and protocol documentation methods are squishy, if not misleading, and easily misinterpreted. Rigorous definitions, like those provided through formal methods, require a specialist with an advanced degree, narrowly focused skills and plenty of time. Except for the you're-lucky-to-get, all-star programmer, something better is needed for people doing the real work.
  • Bureaucratic organizations guard their interfaces--with good reason. To lose ownership is to lose funding; an interface change may cause a functional piece to slip from the organizational grasp. Consequently, inter-organizational interface changes typically require painful committee work among distrustful managers and grievously painful negotiation. Meanwhile, the necessities of project requirements, budget and schedule will drive developers to build solutions that plant the seeds of system ossification.
  • Interface improvements will be vetoed if they weaken an organization's role.

The result: problems with interface management usually swell the bow wave during the so-called test phase. Voilà, developers will be found busily 'fixing' problems late in the cycle by reworking interfaces and building new features. Test is hardly just testing.

Experienced software engineers know how to cope with interface change and the bow wave. Most who build software for a living figure this is just how things are done and ride the wave. There are, however, techniques in common practice, like early integration and iterative development, that go a long way towards mitigating schedule and budget risk. Call me a curmudgeon, but I don't believe these techniques scale to large system developments that cross organizational boundaries. The problem is not just software; it's software in the context of large organizations with parochial interests. I don't believe we know how to manage that.


1. "Acquisition Archetypes: The Bow Wave Effect". SEI white paper, 2007. http://www.sei.cmu.edu/library/assets/bowwave.pdf

2. For a bit of skepticism on my part, see Orders of Complexity.

3. I discussed some of the challenges that stem from functional decomposition in Humpty-Dumpty Effect

Thursday, September 12, 2013

Orders of Complexity

From Chapter 2: The Mythical Man-Month

Excerpt
...testing is usually the most mis-scheduled part of programming...Failure to allow enough time for system test, in particular, is peculiarly disastrous. Since the delay comes at the end of the schedule, no one is aware of schedule trouble until almost the delivery date. Bad news, late and without warning, is unsettling
to customers and to managers. (page 20)

From A Spiral Model of Software Development and Enhancement1

"Stop the life cycle--I want to get off!"
"Life-cycle Concept Considered Harmful."
"The waterfall model is dead."
"No, it isn't, but it should be."
On to Brooks's third, and last, 'fallacious thought mode': unwarranted optimism leads to inadequate test schedules.

As a remedy he suggests the development schedule be allocated as follows:
  • Planning: 30%
  • Programming: 20%
  • Testing: 50%
This may be one area where Brooks's commentary is dated, or at least peculiar to the development of a new computing platform. Three objections come to mind: 1) the implied use of the waterfall development model, 2) the suggested length of the schedule, and 3) underestimation of the testing challenge. I'll try to tackle each in turn.

Waterfall development

The idea behind the waterfall model is simple. First you do analysis (i.e. requirements), then design, then implementation, then test. After that comes deployment, and you keep the system alive and healthy with maintenance.

On the surface, the waterfall model is very desirable. It's easy to comprehend. Budgeting and scheduling are straightforward. Progress is obvious. Best of all, the approach lends itself to the currently-fashionable, CMMI-fueled metrics craze--especially EVM. (See Is it done yet?)

The only problem with the waterfall model is that it doesn't work. Software development simply does not function that way. There's too much to discover in the doing. It was the waterfall model that gave rise to that famously captious platitude, "The last 10% of the schedule takes 90% of the time."2

In 1986, Barry Boehm published a highly influential article1 that suggested a spiral model for software development.3 In a nutshell, you might think of the spiral as a bunch of mini waterfalls wrapped in a waterfall. The point is that analysis, design, implementation and test are integrated in a repeating cycle. This allows the inevitable discoveries to be incorporated into the development process. Boehm's article changed things forever in the software community--I do not personally know a skilled software developer who currently prefers the waterfall model to one of the spiral models.

From a budgeting and scheduling perspective, spiral models do not isolate test as a distinct phase. A good team will incorporate testing into every part of the design and implementation effort, as well as a dedicated test phase prior to delivery. Testing is just part of the development mélange. There will not be a testing milestone that culls out 50% of the schedule. Developers who work for managers that insist on milestones representing a 50% test schedule will need to keep up on those handwaving exercises.

Schedule length
Starting and completing a development cycle is expensive. A good programming team will develop momentum, so it is advisable to provide a programming phase that gives the team a chance to work up a head of steam and make substantial progress. However, the programming phase (and ultimate product release) should not be so long that customer requirements change or sponsor interest waffles. For a major release of an infrastructure or embedded application, I found a 16-week programming phase to be the sweet spot in a 6-month release cycle.4

For the sake of argument, assume a 16-week programming phase. Using Brooks's recommendations, the breakdown would look like this: 24 weeks of planning, 16 weeks of programming and 40 weeks of testing. That's an overall schedule of 80 weeks! Heck, the Empire State Building was built in less time than that. If my experience is indicative, a development effort that goes a year and a half without a major release is doomed. Why? Platforms change, security requirements change, customer requirements change and, most importantly, the sponsor's priorities change. In other words, a schedule based on the '50% test rule' is unrealistic.
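The arithmetic behind that 80-week figure is worth spelling out. With Brooks's 30/20/50 split and programming pinned at 16 weeks, the rest of the schedule scales from that one slice:

```python
# Brooks's split: planning 30%, programming 20%, testing 50%.
# Fix programming at 16 weeks (the 20% slice) and scale the rest.
programming = 16
total = programming * 100 // 20   # 80 weeks overall
planning = total * 30 // 100      # 24 weeks
testing = total * 50 // 100       # 40 weeks

assert (planning, programming, testing, total) == (24, 16, 40, 80)
```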

Testing Challenge
To first order, there are two types of testing: tests that check if a product meets requirements and tests that check if the product performs correctly. The former is known as validation; the latter is known as verification (V&V in the common parlance). I do not know of a single case where a non-trivial software product was completely validated or verified.

Consider what's entailed.

Validation requires that testers develop tests that prove a requirement is met. For example, a typical requirement may say that 'the system shall authenticate a user.' Simple enough, but imagine all the unstated possibilities. What institutional infrastructures should be supported? Is it authentication using LDAP? Active Directory? A custom adaptation layer? In theory, functional requirements and design specifications will spell out all the ambiguities; in practice they never will. As a result, the process of test design is interpretive and raises two key questions: Is the interpretation correct? What is omitted?
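The interpretive gap becomes visible the moment you try to write the test. In this sketch the authenticate function and every policy choice are hypothetical stand-ins; the point is how many decisions the one-line requirement leaves to the tester.

```python
# A hypothetical validation test for "the system shall authenticate a user."
# The authenticate() stub and every policy choice below are invented; the
# point is how much the tester must decide that the requirement never states.

def authenticate(username, password):
    # Stand-in implementation. A real system might consult LDAP, Active
    # Directory, or a custom adaptation layer -- the requirement doesn't say.
    return username == "alice" and password == "s3cret"

def test_requirement_authenticate_user():
    assert authenticate("alice", "s3cret")        # the happy path
    assert not authenticate("alice", "wrong")     # bad password
    assert not authenticate("", "")               # empty credentials?
    # Unstated: lockout after N failures? case sensitivity? Unicode names?
    # token expiry? Each unanswered question is an interpretation the
    # tester must make -- and might make differently than the designer did.

test_requirement_authenticate_user()
```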

There's a greater test challenge. The tester is asked to determine if the program can enter an illegal state. From a purely software perspective, the test might check to see if the program runs out of resources, corrupts memory, or violates a process deadline.5 Checking all the possible system states is simply impossible with current practice.
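A back-of-envelope count shows why. Even a toy system (the component counts below are invented) has a joint state space that no test campaign can enumerate:

```python
# Back-of-envelope state-space count for an invented system: n independent
# components, each with k reachable states, gives k**n joint states.
components = 30   # invented: tasks, devices, buffers...
states_each = 4   # invented: e.g. off / init / nominal / fault

joint_states = states_each ** components
assert joint_states == 2 ** 60   # ~1.15e18: over a quintillion joint states

# Even at a generous million states checked per second, exhaustive
# coverage would take tens of thousands of years:
years = joint_states / 1_000_000 / (3600 * 24 * 365)
assert years > 36_000
```

And that count generously assumes the components are independent; interactions and timing multiply the space further.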

For space systems, there is a much greater challenge: the tester would need to check that the software will never put the entire system into an illegal physical state, i.e. a state that causes mission failure. For example, the tests should check that the software will not allow a power failure, uncontrollable tumbling of the spacecraft, or the ramming of an instrument platform into a solar array. There are a very large number of permutations in the antecedent conditions that might lead to an illegal system state. If that weren't a big enough challenge, the space system must be tested in simulated conditions, and the simulations themselves are subject to the same test challenges.

In other words, no matter what the schedule allocation, completely testing a space system is impossible. As a practical matter, testers focus on a few well-defined operational scenarios and ensure the system works in just those limited conditions. Given the highly constrained NASA software budgets, testing is really just a best effort.

For the current generation of space systems, our development methods, scheduling and test practices are good enough. But these practices severely limit what can be accomplished. Autonomy is primitive. Operations are expensive. Reliability is suspect. However, if we are going to build smart systems with capabilities that can accomplish complex operational goals, we will need to develop methods for handling additional orders of complexity.


1. Boehm B., A Spiral Model of Software Development and Enhancement. ACM SIGSOFT Software Engineering Notes. Volume 11 Issue 4, August 1986. Pages 14-24

2. Sometimes known as the ninety-ninety rule.

3. 'Spiral' was just the start. Since then there has been a string of refinements: Objectory, Rational Unified Process, Team Software Process, Extreme Programming and Agile. Each with a different emphasis; each offering a panacea. Over the years, I never met an experienced developer who preferred the waterfall to these later developments.

4. Shorter release cycles are very desirable for minor upgrades on well-understood systems. However, if system failure can cause mission failure, more rigor is required. A topic for another day.

5. For a real-time system, meeting a process deadline is necessary for program correctness.

Thursday, September 5, 2013

Where does the time go?

From Chapter 2: The Mythical Man-Month

Excerpt
Since software construction is inherently a systems effort—an exercise in complex interrelationships—communication effort is great, and it quickly dominates the decrease in individual task time brought about by partitioning. (page 19)

Gantt Chart of Typical 16-week schedule
Brooks is wrapping up his discussion of the second 'fallacious thought mode' with a zinger. Think of it; communication "dominates the decrease in individual task time." How does that happen? Who knows where the time goes?1

Only a wee bit of wistful contemplation is needed to find a culprit: meetings. Just to put things in perspective, I genned-up a little table of the meetings that made up the routine part of our lives as NASA software developers. Perhaps it's of interest.


| Meeting Type | Purpose | Frequency | Time | Prep Needed |
| --- | --- | --- | --- | --- |
| Development-related Meetings | | | | |
| Development Team Tagups | Development status, plans and technical issues | Daily/Weekly | 15-30m | No |
| Project Team Meetings | Project-level coordination across development teams | Weekly | 1h | Occasional |
| Technical Interchange Meetings | Two or more development teams meet to coordinate efforts | Ad hoc | 1h-3d | Yes |
| Task Technical Reviews | | | | |
| Requirement review | Discuss and approve documented requirements. Focus: process compliance, selected requirements | Build lifecycle | 2-4h | Yes |
| Architecture review | Discuss and approve documented architecture. Focus: process compliance, selected topics | Build lifecycle | 2-4h | Yes |
| Design review | Discuss and approve documented design. Focus: process compliance, selected design considerations | Build lifecycle | 2-4h | Yes |
| Development kickoff reviews | Review the viability of the plan for a build cycle. Focus: process compliance, development readiness | Build lifecycle | 2-4h | Yes |
| Test readiness review | Discuss and approve readiness of product for testing. Focus: process compliance | Build lifecycle | 1-2h | Yes |
| Delivery and deployment review | Discuss and approve product delivery. Focus: process compliance | Build lifecycle | 1-2h | Yes |
| Programmatic-related Meetings | | | | |
| Monthly Management Reviews | Status report | Monthly | 3h | Yes |
| Budget Meetings | Discuss and defend development and maintenance budgets | Seasonal | 1h | Yes |
| Change Board (CCB) | Approve requirement and/or design changes | Weekly | 1h | Yes |
| Risk Board | Discuss and disposition risks | Monthly | 1h | Yes |
| Programmatic Staff Meetings | Update on institutional mandates, budgets, and status | Weekly | 1h | No |
| Quiet hours | Individual meeting with management | Monthly | | Occasional |
| All Hands Meetings | Senior management addresses entire staff | Ad hoc | 1h | No |
| Software-related Project Lifecycle Reviews | | | | |
| Preliminary Design Reviews | Discussion of requirements and initial design. Approval to proceed to full design | Project lifecycle | 1d | Yes |
| Critical Design Review | Discussion of design. Approval to start development | Project lifecycle | 1d | Yes |
| Test readiness review | Discuss and approve readiness of product for testing. Focus: process compliance | Build lifecycle | 4h | Yes |
| Delivery review | Discuss and approve product delivery | Build lifecycle | 2h | Yes |
| Institutionally-related Meetings | | | | |
| Group Meetings | Line organization meetings (think homeroom) | Monthly | 1h | No |
| Section Staff Meetings | Carefully crafted comments | Weekly | 2h | No |
| Quiet hours | Individual meeting with management | Monthly | 1h | No |
| All Hands Meetings | Senior management addresses entire staff | Ad hoc | 1h | No |

Bear in mind that a single team may support multiple tasks and multiple projects. Team members often present the same material in different meetings. Also, note all the required preparation; it's significant.

By the way, the developers who support these meetings are also supposed to program. Daunting, eh?! There are no ivory towers on the NASA grounds, and precious little time to think deep thoughts. Bear in mind that not everyone goes to every meeting. However, the more senior the engineer, the more meetings must be attended and the less time there is for careful thought.2
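Footnote 2's estimate can be sanity-checked against the table. Using the table's durations, plus my own guesses about which meetings a senior engineer attends and how the monthly and lifecycle events amortize, a rough weekly tally looks like this:

```python
# Rough weekly meeting load for one senior engineer, using durations from
# the table above. Which meetings are attended, and the hours amortized
# from monthly and lifecycle events, are my own guesses.
weekly_hours = {
    "team tagups (daily, ~0.4h each)":    5 * 0.4,
    "project team meeting":               1.0,
    "change board":                       1.0,
    "section staff meeting":              2.0,
    "monthly mgmt review (3h / 4 wks)":   3.0 / 4,
    "lifecycle/tech reviews (amortized)": 2.0,
    "prep time (rough)":                  3.0,
}
total = sum(weekly_hours.values())
print(f"{total:.1f} hours/week")
# About 11.8 hours/week: already more than a quarter of a 40-hour week,
# before ad hoc TIMs, all-hands, budget season, and multi-project duty.
```

Stack a second project's meetings and a review-heavy month on top of that baseline and footnote 2's 50-75% stops looking like hyperbole.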

Are all those meetings necessary? I believe so. Here are a few reasons why:
  • Meetings can be an efficient way to exchange information and coordinate activities.
  • In a matrix-managed organization, every developer has several managers. At a minimum there's a line manager and a program manager.3 More than likely, there are multiple line managers and program managers with skin in the game.
  • Every manager must meet with the team; if only for appearances, to do otherwise would be considered negligent.
  • Management perception of status is capricious. Misinformation is readily spread. If it crops up, it must be guarded against.
  • Failure is not an option. Delegation is risk. The culture of consensus is a meeting culture.
Even if no one is added to meet the 'unmeetable' milestone, communication in a large organization will quickly soak up development time. The built-in communication aids and high level of integration in the new breed of software process tools (e.g. IBM Rational Jazz, or the plethora of agile support tools) should help, but only for a while. If the past is any indicator, the added automation will become a liability as additional artifacts become the stuff of meetings. Efficiency is elusive; cultural forces are resistant.
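Brooks quantifies that soaking-up in Chapter 2: if each pair of workers must intercommunicate, the number of communication channels grows as n(n-1)/2 -- quadratically -- while added capacity grows only linearly. A minimal sketch of the arithmetic:

```python
# Brooks's intercommunication count: with n workers who must each
# coordinate pairwise, channels grow quadratically in n.

def channels(n: int) -> int:
    """Number of pairwise communication channels among n workers."""
    return n * (n - 1) // 2

for n in (5, 10, 20, 40):
    print(f"{n:3d} workers -> {channels(n):4d} channels")
# Doubling the staff roughly quadruples the coordination burden.
```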

Is there hope? Yes, but only with leadership that has a stomach for risk. Regrettably, survival usually means a stomach-ectomy.

1. Who Knows Where the Time Goes is a '60s anthem written by Sandy Denny of Fairport Convention. Here's a link to a beautifully performed cover by Judy Collins.

2. For a top-notch engineer, I estimate that as much as 50% to 75% of the work week can be taken up with meetings.

3. Line refers to the institutional organization, e.g. business unit, division, section, and group. Line managers are responsible for hiring the staff projects need and for ensuring that staff complies with institutional rules. Program management refers to the management of funds. Program managers are responsible for delivering products on time and on budget.

Tuesday, September 3, 2013

Chocolaty Little Finger Prints

From Chapter 2: The Mythical Man-Month

Excerpt
When a task cannot be partitioned because of sequential constraints, the application of more effort has no effect on the schedule. The bearing of a child takes nine months, no matter how many women are assigned. Many software tasks have this characteristic because of the sequential nature of debugging.
(page 16)
 
This quote should probably be in Bartlett's. Brooks is referring to the not-so-special case where partitioning work is impossible. He has a lot more to say about debugging in subsequent chapters.

Brooks's text is based on his experience building an operating system for the then-new IBM 360. All new hardware; all new software. There's a strong analogy to the job of developing hardware and software for a new spacecraft. Debugging can be nasty. Is it a bug in the software? The COTS operating system? The latest-greatest version of the VHDL or Verilog that's been loaded in that fancy FPGA? Some weird combo that no one ever thought about and only shows up once in a blue moon?

Seems like there's a nasty, intractable bug in every project. The only hope of getting to the bottom of the problem is to stop the presses and start the analysis on a non-moving target. Hence, the credible claim: debugging is sequential.

The 360 team had its work cut out. They had to produce a new OS for the newly minted processors and instruction sets; a far tougher problem than that faced by the current generation of spacecraft developers, who use established instruction sets and COTS operating systems.1 What's more, the 360 team was writing tests on keypunch cards and running them in batch mode.2

To say the least, contemporary debugging practices are vastly improved. We have interactive debugging, sophisticated make utilities, powerful configuration management tools, test harness tools, and shelves full of run-time verification tools. This affords some parallelization of debugging, but when final integration rolls around, the work is serial.

And it's not just debugging. There are antecedents in the sequence that simply happen; they cannot be skipped, even if they are not in the published schedule. Here are a few examples:
  • Antecedent: No planned requirement development or approval phase. Subsequent: Programmers will invent the requirements they need for code development.
  • Antecedent: No planned architecture or design phase. Subsequent: The architecture will emerge as each programmer independently designs the functional pieces.
  • Antecedent: No established development process for maintaining separate code branches. Subsequent: Programmers will merge their code branches on the trunk, leading to entanglements, broken builds, and integration parties.3
  • Antecedent: No strictly enforced code freeze. Subsequent: Programmers change code that is under test, compromising the test process.
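For the last two antecedents, the missing discipline is mostly mechanical. Here is a minimal sketch of branch-per-change development plus a tagged code freeze, assuming git; the file, branch, and tag names are made up for illustration, and any configuration management tool with branches and labels supports the same idea:

```shell
set -eu
# Set up a throwaway repository so the sketch is self-contained.
tmp=$(mktemp -d); cd "$tmp"
git init -q -b main
git config user.email dev@example.com
git config user.name Dev
echo 'telemetry v1' > handler.c
git add handler.c
git commit -q -m "Baseline"

# Each programmer works on a branch, never directly on the trunk.
git checkout -q -b fix/timeout-bug
echo 'telemetry v2 (bounded retry)' > handler.c
git commit -q -am "Bound the retry loop in the telemetry handler"

# Merge to the trunk only through a deliberate integration step.
git checkout -q main
git merge -q --no-ff -m "Integrate timeout fix" fix/timeout-bug

# Code freeze: tag the build under test. I&T tests only the tag;
# any new fix starts from a fresh branch, not the frozen code.
git tag -a build-7-under-test -m "Frozen for I&T"
git describe --tags   # -> build-7-under-test
```

The tag is as much social as technical: it declares exactly what I&T is testing, so any change to that code is visibly a change to the thing under test.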
The last antecedent is particularly challenging. Programmers will program. Here's a typical scenario. The code is 'finished.' The Integration and Test (I&T) team starts testing. A bug is discovered. A fix is needed. The programmer opens the code for repair and, while the code is open, adds a few extras. No one knows what impact, if any, the extras will have. In other words, a thoughtful I&T lead would ask, "Is the previous testing still valid?" A question that jeopardizes both the cost and the schedule. (This guy had better be on good terms with management.)

The best I&T leads will strive to ensure that the code under test is not changed. A talented colleague once put it this way, "I don't want to see any of the programmer's chocolaty little finger prints on the code we're testing."4

In my experience, releasing untested or under-tested code is commonplace. As a practical matter, unchanged code under test is a lofty goal--the necessities of the moment almost always carry the day. I have no direct experience developing code where human lives might be at risk (e.g. airplanes, cars, or nuclear plants), but testing was commonly shortchanged in NASA's under-funded software development culture. Even our most high-minded reviews conveniently skipped lightly over these profoundly inconvenient technical details. Our rigorous institutional process had, for the most part, merely formalized a system of looking in the obvious places and recording those findings on paper. A colleague once compared the review process to looking for a lost key under a street light because that's where you can see. I would only add that the results would also be captured in PowerPoint and displayed to 30 colleagues.

Today's conventions for managing the serial character of software development are sufficient for today's systems. We get by. Same for our approach to testing. But current practice is wholly inadequate for building the large, complex software-intensive systems imagined in movies and books.

Interestingly, I don't believe these kinds of topics are the focus of serious research. Progress will be slow.


1. The current generation of deep-space missions uses the RAD750, a radiation-hardened version of the PowerPC 750, the processor that was introduced in 1997 and powered the multi-colored iMacs. The most commonly used COTS OS is VxWorks.
2. IBM System/360 Operating System: Programmer's Guide to Debugging
3. An entanglement occurs when source code files are mutually dependent and inconsistent. Consistency can only be restored by repairing all files at the same time. A broken build occurs when the source code will not compile and link. An integration party occurs when the whole team must stop development and work on restoring the build.
4. "...chocolatey little finger prints" is a phrase borrowed from Stephen Harrington, a witty and respected colleague from my Cx days.

Monday, September 2, 2013

Six good things about retirement

1. No deadlines*
 
2. No document review
 
3. No export review
 
4. Dog Lake


5. Lyell Creek (headed for Donohue Pass)

6. Bristlecone Pine (4,000 years old)

* At least none that can be missed.