Using a Waterfall/Agile hybrid model to reduce risk on a massive system go-live

This post has been a long time coming. I’ll start by painting the context to explain the challenge:

  1. Big corporate IT system replacement program
  2. Involves a new architectural paradigm and re-engineered business processes
  3. >200 full time resources on a multi-year development and roll-out cycle
  4. SDLC aspects use scaled up Scrum (> 20 Scrum teams)
  5. Here’s the kicker: due to particular constraints around the architecture and revised business processes, we needed to deploy a critical mass of new architecture and business processes together in order for the solution be internally cohesive i.e. as much as we hated to go this route from a risk point of view, the first push had to be a big-bang. We did take smaller atomic subsets live earlier but there was no getting around the big-bang.

Coming into the start of the year, I volunteered to handle the commissioning phase planning and execution coordination. This included the planning for the build-out of the Production host environment that would host the solution as well as the system deployment coordination leading up to the go-live.

Challenge: Our Agile tooling was inadequate for the commissioning phase

In a purer agile model we would be taking small increments of solution delivery live relatively frequently, something like the picture below. Scrum demands that you deploy all the way to operational usage at the end of each sprint. That’s relatively easy to do at a small tech start-up and more complicated at a big corporate were there are various degrees of role separation between the developers and the deployers. Yes, it is an anti-pattern and all going well, will eventually be cured by a fully automated deployment pipeline – a subject for another post. For now I’ve drawn it as a three sprint batch that is buffered as a release.

Small functional increments to Production

Small functional increments to Production

In truth even with the three Sprint buffer, we would have been happy if this were actually the case. I did however mention in the introduction that we were forced to take a large chunk live just for the system to be functional. The real situation looked like this (ouch!):

One big batch needing to go live

One big batch needing to go live

The Scrum teams in the SDLC cycle had been continuously deploying to the integration testing environment (“Dev-Test”). The solution then languished here for an extended period of time because it would not be functionally cohesive until partnered with additional business processes to create an atomic whole in the new business paradigm.

With such a big piece of tech needing to be commissioned and a limited deployment window (we still have over a hundred thousand customers in the pilot country so the allowed downtime was a window of a few hours), Sprint planning and post-it notes were not going to suffice as the mechanism to pull off this execution. We would not be able to transparently see the constraints: the team was too big and the tech too complicated for people to hold the dependencies in their heads.

Step 1: Create an execution model using predictive planning techniques (enter: Gantt)

Unlike the SDLC process which is creative and iterative, we needed the deployments to be utterly predictable: the business risk was high and the short deployment window was going to be unforgiving to execution glitches.

We needed to appropriately model the data migration and solution deployment sequence with full transparency of  all dependencies. In order to optimize the execution time, we also needed transparency of the critical path (spending time tuning anything other than the critical path would be a waste!).

Personal disclaimer: I don’t have any default preference between project management methodologies. Like anything in engineering: choose the right tool for the job and if you understand many techniques then you can leverage different approaches to solve the problem at hand 🙂   

…So after a long break I blew the dust off MS-Project and got stuck in, working closely with all the propeller-heads to create an execution model that satisfied the various constraints that became apparent.

Deployment Gantt: Models dependencies and provides critical path visibility

Once we could see the critical path we were able to check that the allowed deployment window was indeed feasible and find ways of tuning the execution from a time point of view. The final sequence including once-off data migration activities involved over a hundred atomic tasks that had to be executed across 30 or so people in a few hours (and yes, that’s after automation…).

Step 2: Create a view for the Team that abstracts away the planning complexity (enter: Trello)

In manufacturing operations, shop-floor workers are interested in three things:

  1. What to make
  2. When to make it
  3. How to make it

i.e. the people on the floor don’t need to be bothered with full supply chain visibility or plant flow constraints (that’s why you buy finite scheduling software). Taking this philosophy into the challenge at hand: I had the answers to the three questions above but I needed to present it to the team in a way that was:

Easy to access, showed (1),(2) and (3) and provided real-time transparency of where we were in the sequence (we had teams in separate geographic locations). I’d been using Trello for some teams that we were coaching and was impressed with it’s slick online Kanban features.

Porting from MS-Project to Trello

There was an initial challenge of porting from MS-Project to Trello but that turned out to be a breeze using Excel as a bridging mechanism. I would typically copy the “Task Name”, “Duration”, “Start” and “Finish” from MS-Project (they are all conveniently next to each other on the default Gantt view

Copying tasks from MS-Project

Copying tasks from MS-Project

…and paste all tasks into a spreadsheet. Several things happen in the spreadsheet:

  1. Delete parent tasks
  2. Concatenate the start time, description and task duration to come up with a better Trello task description (simple Excel formula). I also deleted leading spaces created by MS-Project’s indents.
  3.  Reorder the task sequence so that it was based on the start time (oldest to newest)

An excerpt is shown below. The “Trello name” is the generated name in the far right hand column.

Excel bridging between MS-Project and Trello

Excel bridging between MS-Project and Trello

Once the previous steps were done (they only take a few minutes), the entire “Trello name” column can be copied and pasted into Trello (see screen grab below).

Trello importTrello allows you to import up to a hundred tasks at a time. As we had over a hundred I had to do it twice. The entire process going from MS-Project to Trello took about 10 minutes in total with checking!

The one part I wasn’t able to automate in Trello was the color coordination of tasks (for better readability) and the addition of the resources to the tasks. Although it’s possible to write a parser and use the available Trello API, we didn’t have the time to go that route this time ’round 🙂

Step 3: Empowering the Team with a ‘Live view’ of the go-live execution

The end result is a prepped deployment board that allowed everyone to see what they needed to do and when they were expected to do it without needing to understand the intricacies of the planning process. They can also see which tasks are in motion and which things have been completed (people move their tasks as they do on a traditional Scrum board). This allowed people to be ready when it was their time to deliver on something. We aimed for a near-zero communication loss between parties and as Trello was a live view on the deployment, everyone was able to get themselves setup to act in advance of their ‘cue’!

The picture below shows the board in the latter stage of the execution (no tasks left in ‘Planned’). Unfortunately I didn’t have a screen capture from earlier in the day when the originally imported plan was intact.

Excerpt from the DEVOPS Trello board

Excerpt from the DEVOPS Trello board

Step 4: Weekly ‘dress rehearsals’ for practice and fine tuning the plan based on empirical performance data

At this point in time, I had a comprehensive execution model on the Gantt and an elegant way of presenting those tasks for consumption by the deployment team resources (using Trello). As the Production environment was new (no consumers yet) we were in a position where we could use it as a staging environment until the actual go-live. What we then decided to do, is to run through the full deployment execution as a dress rehearsal once a week. This would:

  1. Give everybody practice on the execution process
  2. Allow us to refactor the Gantt model based on empirical performance data (everyone captured the start and end times of their tasks during the dress rehearsals). This made the plan more accurate.
  3. Expose gaps in the Production environment as part of each dress rehearsal (the smoke tests had to go green each time and if they didn’t, the techies would troubleshoot)
Task duration capture

Task duration capture

In effect we were running a “Plan Do Check Act” Deming Cycle:

  1. Plan the execution on the Gantt
  2. Present the plan for execution using Trello as an online Kanban board
  3. Gather the execution statistics and note Gaps found
  4. Refactor the task duration on the Gantt using the empirical performance data

…and repeat every week starting two months before Go-Live

Deming Cycle: Plan->Do-> Check->Act

Execution Tip: Keeping everyone informed

We created a Trello task called “Deployment Commentary” which I used to keep all stakeholders informed as to where we were with the execution. As the coordinator of the deployment I was fully aware of the actual state (because I was monitoring the Trello board and orchestrating the actions) but in the background I was reconciling the critical path tasks against their planned start and end times on the Gantt chart. This meant that at any point in time I knew whether we were running ahead or behind of the planned execution and I knew the number of minutes. As each Critical path task was done, I checked the Gantt, worked out the delay and posted on the commentary so that EVERYONE knew what was going on from a single source. Most people had configured alerting on that task so they would get the update pushed directly to their smartphones or e-mail. It also limited the number of people coming into the Command Center to ask “where we’re at?”.

Excerpt of deployment commentary

Excerpt of deployment commentary

So how did it all pan out?

As the famous (South African) golfer, Gary Player was credited with saying: “The more I practice, the luckier I get”. The crew completed the solution deployment and data migration an hour ahead of time and handed over to the testing team for deployment confirmation. The system went live smoothly and was fully operational the next day and all was well with the world 🙂

This entry was posted in Project management and tagged , , , , , , , . Bookmark the permalink.

2 Responses to Using a Waterfall/Agile hybrid model to reduce risk on a massive system go-live

  1. Liz J. says:

    Hello – and thank you for sharing this post. It was incredibly helpful.

    After copying and pasting the task names, I’m sure some changes in trello took place (e.g. Task name, start / finish dates). How did you manage “in-trello” changes within your MS Project to ensure everything was in synch?

  2. crossbolt says:

    @Liz: That’s a fantastic question. Each time we do a dress rehearsal, the project plan is refactored with additional tasks, revised times, revised dependencies etc. These changes are done on the ‘master model’ in MS project. It used to take me around 3 hours to generate a new Trello list with resources, due dates, checklists and labels but around two weeks ago we completed a fully automated ‘Board Generator’ that uses my Excel export from MS-Project and automatically generates a Trello list (tasks, resources, labels, due dates, checklists…) – using Trello’s API! I can generate a new sequence in 10 mins (some minor cleanup still required). Two nights ago, I generated a revised 100 task board pretty much on-the-fly after we hit a major impediment in one of our rehearsals 🙂

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s