CrowdStrike Update Crash: What Happened
Imagine a security guard dog (CrowdStrike's Falcon Sensor) accidentally knocking over the entire house (a Windows computer) while trying to protect it.
A recent Microsoft Azure and CrowdStrike outage caused widespread disruption due to a faulty CrowdStrike update (the full technical details have not been released yet). The issue has been resolved, but some residual impacts may linger. CrowdStrike is a well-known cybersecurity firm, and its Falcon Sensor software is designed to protect systems from cyberattacks (the irony, which is why the guard-dog line above makes sense).
This morning, CrowdStrike released an update that went wrong. Falcon Sensor, their security program, caused Windows computers to crash and would not let them restart. It reminded me of past incidents I was involved in, like the 2021 Log4j (Log4Shell) vulnerability and the 2019 WebLogic zero-day. We identified Log4j during the Christmas holidays, and I recall the stunned looks on our faces; we did not even know how to identify it across the thousands of servers the company hosted. The mitigation we did was manual and took a long time across those thousands of servers. At Oracle, I recall the WebLogic zero-day that caused login issues; we rushed to revert the patch or restore from backup as soon as multiple customers reported it, and some customers failed over to DR and were able to work through it. The CrowdStrike issue did not impact my company or me, so I was simply watching half the world grind to a halt, with chatter on Reddit and WhatsApp about what to do next. Thankfully, there was an immediate guideline to fix affected PCs. The servers, however, were another story.
Why is this a big deal?
- Recovery can be tricky. Thankfully, Windows' Safe Mode lets machines boot without loading the faulty driver so the bad update file can be removed (see the sketch after this list), but it's not perfect.
- Virtual machines (cloud servers) can be especially difficult to fix. They might need to be moved to another server, which adds complexity.
- Encryption can create a Catch-22. Some companies store their BitLocker recovery keys on servers that were themselves affected, making it hard to unlock machines for repairs.
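For context, the widely circulated guidance for stuck PCs boiled down to booting into Safe Mode (or the Windows Recovery Environment) and deleting the faulty Falcon channel file. The snippet below is a minimal illustrative sketch of that manual step, assuming the machine can boot into Safe Mode, the operator has admin rights, and any BitLocker volume has already been unlocked; the directory and the C-00000291*.sys pattern come from CrowdStrike's public advisory, but this script is not an official tool.

```python
# Illustrative sketch of the manual remediation shared for affected Windows hosts:
# boot into Safe Mode (or WinRE), delete the faulty channel file
# "C-00000291*.sys" under the CrowdStrike driver directory, then reboot.
# In practice admins did this by hand or via recovery media; BitLocker-protected
# disks needed a recovery key before the file system was even reachable.
import glob
import os

CROWDSTRIKE_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
FAULTY_PATTERN = "C-00000291*.sys"  # channel file pattern from the public guidance

def remove_faulty_channel_files(directory: str = CROWDSTRIKE_DIR) -> list[str]:
    """Delete any matching channel files and return the paths removed."""
    removed = []
    for path in glob.glob(os.path.join(directory, FAULTY_PATTERN)):
        os.remove(path)  # requires admin rights; intended to run from Safe Mode
        removed.append(path)
    return removed

if __name__ == "__main__":
    deleted = remove_faulty_channel_files()
    print(f"Removed {len(deleted)} file(s): {deleted}" if deleted
          else "No faulty channel files found.")
```

On servers, especially encrypted or virtualized ones, the same step often meant attaching the disk to another machine first, which is why the list above calls recovery tricky.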
The happy news?
- CrowdStrike and Microsoft found a fix quickly.
- Companies with good disaster recovery (DR) plans will be better off.
The not-so-good news?
- Fixing thousands of computers is a hassle.
- IT departments are likely frustrated, especially with a Friday release.
- Companies with the most security (like banks and airlines) were hit the hardest due to encrypted servers.
This situation highlights the importance of disaster recovery plans. Companies that planned for these types of issues will recover faster than those that didn't. I suspect chaos engineering and the like are of little use in a case like this; only a truly tested DR plan saved the day for many companies. Most banks and financial institutions run an annual DR drill to test their resiliency, and I recall working weekends with customers to test DR back in the on-premises days, when it was a day-long process of restoration, shutdown, and network changes. With cloud, DR tests and restoration are just a click away, yet many companies still run DR tests only for regulations' sake. Time for everyone to gear up and work on their DR plans.