Uncovering Weaknesses, The Hard Way
Imagine sitting at your desk one ordinary afternoon when, suddenly, with a single keystroke, you find yourself in the middle of a hailstorm of notifications because your database just vanished and now nothing works.
The CrowdStrike incident and its Root Cause Analysis reminded me of how hard lessons from incidents can be.
Through them, you can find the weakest links in the complex chain that is your socio-technical system. And when an operator is involved, they are rarely the root cause. Instead, the focus should be on the circumstances that allowed that situation to occur.
This is the story of how one such incident unfolded and the weaknesses we uncovered in our architecture and ways of working.
The Incident
It was a normal afternoon in the office. I was sitting at my desk, in between meetings, and, as usual, receiving a flurry of email and Slack notifications. Another notification popped up in the top right, just catching my eye before disappearing. However, there was something about it that seemed off.
Following up, I saw it was a New Relic notification about an exception. That did happen at times, nothing too dramatic. But upon further inspection…
Uh-oh.
Account::NotFoundError
Something’s terribly wrong.
I immediately get up from my desk.
I look towards the area where the tech lead for the system in question sits. He’s not there, probably in a meeting. I grab my laptop in one hand, the other opening up the New Relic dashboards, and start walking towards the meeting rooms to find him. After ten steps I find him walking in my direction, laptop in hand too. We stop in front of each other and exchange a look that lasts a split second but feels like an eternity. Like some Vulcan mind meld.
We both ended up saying the same thing: “We need to find SRE.”
They were huddled together in a nook of the office for a standup.
As we reached them, one of them said straight away:
“We’re looking into it. There seems to be a long-running operation executing a db.collection.drop() in MongoDB.”
We forced the operation to stop and then monitored to see if it would appear again. It didn’t.
Okay then. No more data loss, but our application was down.
Other collections in the DB may have been dropped, but the accounts collection was one of the most critical.
Without loading account information, users couldn’t use our system at all.
We moved into a meeting room to start recovering the data. We had backups. All we needed to do was a restore, right?
Well… We found two major issues when starting the restoration:
- The full restoration would take 8 hours to complete.
- We couldn’t control the order of restoration; accounts could be restored first or last.
The prospect of only restoring functionality to users after 8 hours was terrifying. It was late afternoon: users in Europe were finishing their day, while those in the U.S. were just starting theirs.
At this point the meeting room was filled with a whole bunch of folks from different teams. We needed an alternative.
Our luck changed for the better.
One of the teams had recently run a database migration for one of their systems.
Before applying the migration, they backed up the documents that were going to be affected.
The accounts collection was one of them.
Of course it was in a custom format that couldn’t be directly imported to MongoDB, but we had something.
Within around 2 hours of the incident starting, we had recovered the accounts of users who were in business hours and trying to use our product.
Within 3 hours we were able to restore the whole accounts collection and get systems back to relative normality.
Full restoration still took 8 hours, but this wasn’t noticeable to users.
Digging for the Root Cause
As the dust settled and systems returned to normal, we still had work ahead of us. We needed to understand how this happened and how to prevent future occurrences.
Working on the post-mortem required objectivity, patience, and an open mind to look at some uncomfortable truths about our systems and processes.
We determined that a developer, by mistake, triggered the database deletion.
One of their tasks had required them to load the production ENV variables in their shell. They then moved on to a dev task and, at some point, ran the system’s full test suite.
The test suite included E2E tests that integrated with a local MongoDB instance.
All collections are dropped from the DB before each run to ensure each test starts with a clean slate.
The E2E test code pointed to the local instance by default; however, it used the same DB initialization code as production, which meant ENV variables could override the connection string.
So when the tests ran in the shell with the Production ENV loaded… Yeah.
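To make the failure mode concrete, here is a minimal sketch of the pattern in Python with pymongo. It is illustrative only, not our actual code: the MONGODB_URI name, the local default, and the drop-everything helper are assumptions standing in for the shared initialization path.

```python
import os
from pymongo import MongoClient

# The same initialization path is shared by the app and the E2E tests.
# The connection string defaults to a local instance, but any MONGODB_URI
# already exported in the shell (e.g. production credentials loaded for an
# ops task) silently wins.
MONGO_URI = os.environ.get("MONGODB_URI", "mongodb://localhost:27017/app_test")

def get_db():
    client = MongoClient(MONGO_URI)
    return client.get_default_database()

def reset_collections_before_each_test():
    # The E2E "clean slate": drop every collection before a test run.
    # Catastrophic if the URI happens to point at production.
    db = get_db()
    for name in db.list_collection_names():
        db.drop_collection(name)
```

With production credentials already exported in the shell, os.environ.get picks them up first, and the “clean slate” step runs against the live database.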
This was a hard realization, and tough to digest.
But had we found the root cause?
No. The operator just happened to be the unlucky soul who conjured up the right conditions for this to happen. There should have been guardrails to prevent this.
The question became: what were the circumstances that allowed an operator to be put into such a position that such an incident could happen?
The Weakest Links
This incident revealed how, given just the right conditions, the weakest links of our socio-technical system could cause such a scenario to materialize.
On the technical side:
- Our systems ran on Heroku for Development and Production
- As a 12-factor app, configuration was loaded from ENV variables, including database credentials
- At the time there was no way to limit access to database credentials in the app
- Some E2E tests re-used the same database initialization code as in production, which allowed the credentials to be overridden when the ENV vars were present
- Our database hosted many collections from multiple systems, making it quite large and thus giving it a lengthy restoration time
On the social side:
- Teams were responsible for the code and operation of their systems, so they needed to access and manage their ENV variables
- It wasn’t uncommon for devs to load Production ENV variables in their shell for debugging or operational purposes
- We had backups, but we didn’t have a habit of regularly testing restoration procedures
It was an unfortunate set of conditions, with all these factors coming into play in the worst combination, that led to such a disaster.
There were many lessons learned for everyone involved. Some I took away were:
- Database Credentials
  - Never re-use system credentials. If an operator needs access to the database, they must request temporary access.
  - Not all operations need to be permitted. They should match the system’s use case. Is there a valid need to drop entire collections at once? Probably not.
- Testing Practices
  - Hard-code dependencies in tests to something safe.
  - If configurability is really needed, then use different names from those used in production (see the sketch after this list).
- Backups
  - Regularly test backup restore procedures. Without that, they’re Schrödinger’s backups: they might work, they might not. Don’t wait to find out when you need them.
  - Have multiple backup options. This depends on the system, but we were able to create an additional backup specifically for operationally critical collections. We also added another MongoDB replica with a 6-hour delay.
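To make a couple of these lessons concrete, here is a minimal sketch in Python with pymongo. It is illustrative only: the role name, database name, privilege list, and the TEST_MONGODB_URI variable are assumptions, not our actual setup.

```python
import os
from pymongo import MongoClient

# --- Credentials: an application role without destructive privileges. ---
# Hypothetical role: it can read and write documents in "appdb", but has no
# dropCollection/dropDatabase actions, so a stray drop() is rejected.
def create_restricted_app_role(admin_client: MongoClient) -> None:
    admin_client["appdb"].command(
        "createRole",
        "appReadWrite",
        privileges=[{
            "resource": {"db": "appdb", "collection": ""},
            "actions": ["find", "insert", "update", "remove"],
        }],
        roles=[],
    )

# --- Testing: a hard-coded local default and a deliberately distinct name. ---
# TEST_MONGODB_URI shares no name with production configuration, so loading
# production ENV variables into a shell cannot redirect the test suite.
TEST_URI = os.environ.get("TEST_MONGODB_URI", "mongodb://localhost:27017/app_test")

def get_test_db():
    # Refuse to run destructive test setup against anything that does not
    # look like a local, throwaway database.
    if "localhost" not in TEST_URI and "127.0.0.1" not in TEST_URI:
        raise RuntimeError("Refusing to run E2E tests against a non-local database")
    return MongoClient(TEST_URI).get_default_database()
```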
Picking Up the Pieces
It was a sobering moment for us all. A milestone in our organizational maturity. We weren’t in startup mode any longer; we now had serious customers with serious businesses depending on us.
This incident revealed the various weaknesses that had crept into our systems and procedures.
They allowed a team member to be put into a situation where they unintentionally deleted data.
The restoration of services took much longer than it should have, and relied on luck.
It’s hard to face such failures. But they will happen, and they can happen to anyone. What matters is how you pick yourself up afterwards, and how much learning you extract from the experience.
By digging deeper for the root cause and looking for the weak links, we were able to prevent this type of incident from happening again and to be better prepared for any other kind of data-loss incident.
In the end, we came out better for it.
So if you find yourself participating in a post-mortem process, remember to also ask: what were the circumstances that allowed it to happen?
Published by Raoul Felix, putsdebug.com (c) 2024