Global outage due to Friday’s release of CrowdStrike


Anonymous

Ever heard the unspoken rule: “Never release on Friday”? We have, but CrowdStrike hasn’t. They released a tiny driver on an ordinary Friday morning, which became the cause of a huge outage all over the world.

An incorrect update for CrowdStrike’s EDR (Endpoint Detection and Response) solution has affected Windows devices around the world — giving corporate users the Blue Screen of Death (BSOD). The failure has affected, for example, airport information systems in the US, Spain, Germany, the Netherlands and other countries.

Who else was affected by CrowdStrike’s Friday release and how to roll back bricked computers — all in this post…

What happened

It all started early Friday morning with corporate users around the world reporting problems with Windows. At first, a glitch in Microsoft Azure was blamed, but CrowdStrike later confirmed that the root cause lay in an update to its CrowdStrike EDR: a faulty C-00000291*.sys file that crashes the product's csagent.sys driver. And it was this update that caused an abundance of silly office photos showing off the (dreaded) blue screens.

Blue screen of death on all computers = a day off for airport linemen


If we wanted to list everyone affected by this outage, such a list sure wouldn’t fit into this post – or dozens of them. So instead we’ll briefly cover the main victims of CrowdStrike’s negligence. Airline companies, airports, and people who want to either go home or go off on a long-awaited vacation were the most affected:

  • London’s Heathrow Airport, like many others, announced flight delays due to a technology glitch;
  • Scandinavian Airlines posted a notice on its website saying, “Some customers may experience difficulties with their bookings due to an IT issue affecting several countries. SAS is fully operational but delays are expected”;
  • In New Zealand, banking, communications and transportation systems are experiencing problems.

Various medical centers, chain stores, the New York subway, the largest bank in South Africa and many other organizations that make lives more comfortable and convenient on a daily basis were affected. The fullest list of those affected by the outage we can find is here — and it’s growing by the minute.

How to fix it

At this stage, it's hard to estimate how long it will take to fully restore the affected computers around the world. Things are complicated by the fact that each affected machine has to be booted into Safe Mode manually. In large corporations, users usually can't do this on their own without the help of a system administrator.

Nevertheless, here are the instructions for how to get rid of the blue screen of death caused by the CrowdStrike driver update:

  1. Boot your computer in Safe Mode;
  2. Go to C:\Windows\System32\drivers\CrowdStrike;
  3. Locate and delete the file matching C-00000291*.sys (this is the faulty update file; csagent.sys is the EDR driver itself);
  4. Restart your computer in normal mode.
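For administrators who have to repeat steps 2 and 3 on many machines, the file lookup can be sketched in Python. This is only an illustration under loud assumptions: it presumes a Python interpreter is available on the machine, which in Safe Mode or a recovery environment may well not be the case, and the function names `find_channel_files` and `remove_channel_files` are made up for this example. It targets only the C-00000291*.sys pattern and defaults to a dry run that deletes nothing.

```python
from pathlib import Path

# Directory and file pattern are taken from the manual steps above.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
PATTERN = "C-00000291*.sys"


def find_channel_files(directory: Path = DEFAULT_DIR, pattern: str = PATTERN) -> list[Path]:
    """Return regular files in `directory` whose names match `pattern`, sorted by name."""
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.glob(pattern) if p.is_file())


def remove_channel_files(
    directory: Path = DEFAULT_DIR,
    pattern: str = PATTERN,
    dry_run: bool = True,
) -> list[Path]:
    """Delete matching files, or only report them when `dry_run` is True.

    Returns the list of files that matched the pattern.
    """
    matches = find_channel_files(directory, pattern)
    for path in matches:
        if dry_run:
            print(f"would delete {path}")
        else:
            path.unlink()
            print(f"deleted {path}")
    return matches


if __name__ == "__main__":
    # Dry run by default: lists what would be removed without touching anything.
    remove_channel_files(dry_run=True)
```

Calling `remove_channel_files(dry_run=False)` performs the actual deletion. Running the dry run first makes it easy to confirm that only the intended file is matched before anything is removed.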

And while your sysadmins are doing this, you could use a hack that’s come out of India today: employees of one of the country’s airports have started filling out boarding passes… manually.


India isn’t too worried about the global disruption. Source

How the failure could have been avoided

Avoiding this situation should have been straightforward. First, the update shouldn’t have been released on a Friday. The rule has been known across the industry for as long as anyone can remember: if an error occurs, there’s too little time to fix it before the weekend, so system administrators at every affected company end up working through the weekend to fix things.

It’s important to be as responsible as possible about the quality of the updates you release. We at Kaspersky launched a program back in 2009 to prevent mass failures such as this one among our customers, and we have passed a SOC 2 audit, which confirms the security of our internal processes. For 15 years now, every update has been subjected to multi-level performance testing on various configurations and operating system versions. This allows us to identify potential problems in advance and resolve them on the spot.

The principle of granular releases should be followed. Updates should be distributed gradually, not all at once to all customers. This approach allows us to react instantly and stop an update if necessary. If our users have a problem, we register it, and its solution becomes a priority at all levels of the company.
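To make the idea of a gradual release concrete, here is a minimal Python sketch of a staged rollout that stops itself when too many machines fail. It is an illustration only, not a description of any vendor's actual pipeline: the names `staged_rollout`, `RolloutResult`, `stage_percents` and `max_failure_rate`, the stage sizes, and the `deploy` callback (which returns True when a host is healthy after the update) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class RolloutResult:
    completed: bool            # True if every host received the update
    updated: list[str]         # hosts the update was pushed to, in order
    halted_at: Optional[int]   # index of the stage that tripped the limit, if any


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_percents: Sequence[int] = (1, 10, 50, 100),  # cumulative share of the fleet
    max_failure_rate: float = 0.05,
) -> RolloutResult:
    """Push an update stage by stage, halting if a stage fails too often."""
    updated: list[str] = []
    start = 0
    for index, percent in enumerate(stage_percents):
        # Cumulative cut-off for this stage, computed with integer arithmetic.
        end = max(start, len(hosts) * percent // 100)
        batch = hosts[start:end]
        failures = sum(1 for host in batch if not deploy(host))
        updated.extend(batch)
        if batch and failures / len(batch) > max_failure_rate:
            # Stop here: the remaining hosts never receive the update.
            return RolloutResult(False, updated, index)
        start = end
    return RolloutResult(start == len(hosts), updated, None)
```

With a fleet of 100 hosts and the default stages, a `deploy` callback that fails on every host halts the rollout after the first stage, so only a single machine is touched instead of all of them.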

As with cybersecurity incidents, in addition to fixing the visible damage, you need to find the root cause to prevent these types of problems from recurring. Software updates should be checked for operability and errors on test infrastructure before they are rolled out to the company’s production infrastructure, and changes should be implemented gradually, with continual monitoring for possible failures.

Incident handling should be based on an integrated approach to building protection from a trusted supplier with the strictest internal requirements for the security, quality and availability of its services. The basis for this work can be the Kaspersky Next line of solutions. This will help your company not only stay afloat — but also increase the efficiency of your information security system. This can be done either gradually — increasing protection step by step — or all in one go. Protect your infrastructure today with us so that the next global outage doesn’t affect your customers.

And we, for our part, can help you make this decision: switch to Kaspersky and unlock two years of Kaspersky Next EDR Optimum for the price of one. Experience the pinnacle of robust, reliable cybersecurity protection!