Uncovering Weaknesses, The Hard Way

Imagine sitting at your desk one ordinary afternoon when, suddenly, with a single keystroke, you find yourself in the middle of a hailstorm of notifications because your database just vanished, and now nothing works.

The CrowdStrike incident and its root cause analysis reminded me of how hard the lessons from incidents can be.

Through them, you can find the weakest links in the complex chain that is your socio-technical system. And when an operator is involved, they are rarely the root cause. Instead, the focus should be on the circumstances that allowed that situation to occur.

This is the story of how one such incident unfolded and the weaknesses we uncovered in our architecture and ways of working.

The Incident

It was a normal afternoon in the office. I was sitting at my desk, in between meetings and, as usual, receiving a flurry of email and Slack notifications. Another notification popped up in the top right, just catching my eye before disappearing. But something about it seemed off.

Following up, I saw it was a New Relic notification about an exception. That happened from time to time, nothing too dramatic. But upon further inspection…

Uh-oh.

Account::NotFoundError

Something’s terribly wrong.

I immediately get up from my desk.

I look towards the area where the tech lead for the system in question sits. He’s not there, probably in a meeting. I grab my laptop in one hand, open the New Relic dashboards with the other, and start walking towards the meeting rooms to find him. After ten steps I find him walking in my direction, laptop in hand too. We stop in front of each other and exchange a look that lasts a split second but feels like an eternity. Like some Vulcan mind meld.

We both ended up saying the same thing: “We need to find SRE.”

They were huddled together in a nook of the office for a standup. As we reached them, one of them said straight away: “We’re looking into it. There seems to be a long-running operation executing a db.collection.drop() in MongoDB.”

We forced the operation to stop and then monitored to see if it would appear again. It didn’t.
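For the curious, the mechanics of stopping such an operation look roughly like the sketch below. It’s an illustrative PyMongo example rather than what our SREs actually ran, and the connection string is a placeholder: the $currentOp aggregation stage lists in-progress operations, and the killOp command terminates one by its opid.

```python
from pymongo import MongoClient

# Illustrative only: the connection string is a placeholder.
client = MongoClient("mongodb://localhost:27017")

# The $currentOp aggregation stage (run against the admin database)
# lists in-progress operations; filter down to drop commands.
ops = client.admin.aggregate([
    {"$currentOp": {"allUsers": True}},
    {"$match": {"command.drop": {"$exists": True}}},
])

for op in ops:
    print("Killing operation", op["opid"])
    # killOp asks the server to terminate the operation with that opid.
    client.admin.command({"killOp": 1, "op": op["opid"]})
```

Killing the operation only prevents further damage; whatever the drop had already removed was gone, which is why the rest of the afternoon was about restoring data.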

Okay then. No more data loss, but our application was down.

Other collections in the DB may have been dropped, but the accounts one was one of the most critical. Without loading account information, users couldn’t use our system at all.

We moved into a meeting room to start recovering the data. We had backups. All we needed to do was a restore, right?

Well… We found two major issues when starting the restoration:

The prospect of only restoring functionality to users after 8 hours was terrifying. It being late afternoon, users in Europe were finishing their day while those in the U.S. were just starting theirs.

At this point the meeting room was filled with a whole bunch of folks from different teams. We needed an alternative.

Our luck changed for the better.

One of the teams had recently run a database migration for one of their systems. Before applying the migration, they had backed up the documents that were going to be affected, and the accounts collection was one of them. Of course, it was in a custom format that couldn’t be directly imported into MongoDB, but we had something.
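The mechanics of the recovery, heavily simplified: assuming the pre-migration backup could be converted into one JSON document per line (the actual custom format, file name, and database name below are made up), re-importing it comes down to a loop of upserts into the accounts collection, something like this sketch.

```python
import json
from pymongo import MongoClient

# Hypothetical names throughout: the real backup format, file name,
# and database name were different.
client = MongoClient("mongodb://localhost:27017")
accounts = client["app"]["accounts"]

# Assume the custom backup has been converted to one JSON document per line.
with open("accounts_pre_migration_backup.jsonl") as fh:
    docs = [json.loads(line) for line in fh if line.strip()]

for doc in docs:
    # Upsert on _id so the import is idempotent and can be run in batches,
    # e.g. currently active accounts first, everything else afterwards.
    accounts.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```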

It took us around 2 hours from the start of the incident to recover the accounts that were in business hours and trying to use our product. Within 3 hours we were able to restore the whole accounts collection and get systems back to relative normality. Full restoration still took 8 hours, but this wasn’t noticeable to users.

Digging for the Root Cause

As the dust settled and systems returned to normal, we still had work ahead of us. We needed to understand how this happened and how to prevent future occurrences.

Working on the post-mortem required objectivity, patience, and an open mind to face some uncomfortable truths about our systems and processes.

We determined that a developer, by mistake, triggered the database deletion.

One of their tasks had required them to load the production ENV variables into their shell. They then moved on to a dev task, and at some point they ran the system’s entire test suite.

The test suite included E2E tests that integrated with a local MongoDB instance. All collections were dropped from the DB before each run to ensure every test started with a clean slate.
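The reset step was conceptually something like the sketch below. This is a Python stand-in with made-up names, not our actual test code.

```python
def reset_database(db):
    """Drop every collection so each E2E run starts from an empty database."""
    for name in db.list_collection_names():
        db.drop_collection(name)
```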

The E2E test code pointed to the local instance by default; however, it used the same DB initialization code as production, which meant ENV variables could override the connection string.
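Taken together, the shared initialization behaved roughly like this sketch, where MONGO_URL and the database name are assumptions of mine: the default points at localhost, but whatever the environment provides wins.

```python
import os
from pymongo import MongoClient

def get_database():
    # Same code path as production: if an ENV variable is present,
    # it overrides the local default.
    uri = os.getenv("MONGO_URL", "mongodb://localhost:27017")
    return MongoClient(uri)["app"]
```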

So when the tests ran in a shell with the production ENV loaded… Yeah.

This was a hard realization, and tough to digest.

But had we found the root cause?

No. The operator just happened to be the unlucky soul who conjured up the right conditions for this to happen. There should have been guardrails to prevent it.
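One example of such a guardrail, sketched under the same assumed names as the earlier snippets: a destructive-test helper that refuses to touch anything that doesn’t look like a local instance, regardless of what the shell environment says.

```python
import os
from urllib.parse import urlparse

from pymongo import MongoClient

LOCAL_HOSTS = {"localhost", "127.0.0.1", None}

def get_test_database():
    uri = os.getenv("MONGO_URL", "mongodb://localhost:27017")
    if urlparse(uri).hostname not in LOCAL_HOSTS:
        # Fail loudly instead of silently connecting to a remote cluster.
        raise RuntimeError(f"Refusing to run destructive E2E tests against {uri}")
    return MongoClient(uri)["app"]
```

That’s just one option; separate credentials for local testing, or a production database user that simply can’t drop collections, would have closed the same gap.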

The question became: what were the circumstances that allowed an operator to be put in a position where such an incident could happen?

This incident revealed how, given just the right conditions, the weakest links in our socio-technical system could allow such a scenario to materialize.

On the technical side:

On the social side:

It was an unfortunate set of conditions, with all of these factors coming into play in the worst possible combination, that led to such a disaster.

There were many lessons learned for everyone involved. Some I took away were:

Picking Up the Pieces

It was a sobering moment for us all, and a milestone in our organizational maturity. We weren’t in startup mode any longer; we now had serious customers with serious businesses depending on us.

This incident revealed several weaknesses in our systems and procedures.

It allowed a team member to be put into a situation where they unintentionally deleted production data.

The restoration of services took much longer than it should have, and relied on luck.

It’s hard to face such failures. But they will happen, and they can happen to anyone. What matters is how you pick yourself up afterwards, and how much learning you extract from the experience.

By digging deeper for the root cause and looking for the weak links, we were able to avoid this type of incident happening again, and to be better prepared for any other kind of data loss incident.

In the end, we came out better for it.

So if you ever find yourself participating in a post-mortem, remember to also ask: what were the circumstances that allowed this to happen?