How to minimize software update risks

Anonymous

According to Microsoft, the blue-screen incident caused by an update to the CrowdStrike Falcon security solution affected more than 8.5 million computers worldwide. The incident has cost many companies dearly, and has also sparked much debate about how to prevent similar situations from happening again.

First of all, no one is immune to errors; it’s simply impossible to guarantee the complete absence of bugs in complex software systems. However, a properly structured process for developing, testing, and delivering products and their updates to users’ devices can significantly minimize the risk of a serious failure. And we’ve had such a process firmly in place for years.

We, too, have had incidents directly related to updates for our products, but the last time we had a notable problem of this kind was way back in 2013. After that unpleasant episode, we conducted a thorough analysis of the root causes, which led to a complete overhaul of our approach to creating and testing updates in products for both business and home users. And the system we built has proven very reliable: in 11 years we’ve not had a single failure of a similar magnitude.

We make no secret of the update release mechanisms we’ve built, and we’re ready to share them with the industry. After all, without a free exchange of the best practices and solutions developed by different companies, progress in the cybersecurity industry would be greatly hindered. The main mechanisms that safeguard update releases are multi-level testing, gradual rollout of updates, and automatic monitoring of anomalies. Let’s look at them in detail.

Multi-level testing

There are two types of updates for our products: some add new detection logic, while others change the functionality of a given product. Adding new functionality potentially carries more risk, but sometimes logic updates can cause problems as well. Therefore, we carefully test both types of updates at different stages.

Checking for false positives

When creating and releasing detection rules (both those generated automatically and those written by analysts), we test them on an extensive database of legitimate (or “clean”) objects: files, web pages, behavior patterns, and so on. This way, false positives are identified and filtered out. We have an extensive and constantly updated collection of legitimate objects (both software and clean web resources) on which all created rules are tested.
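To make this concrete, here’s a minimal sketch of what such false-positive screening could look like in code. It’s purely illustrative: the types and function names are invented for this example, not taken from our actual pipeline.

```python
# Illustrative sketch of false-positive screening: a candidate detection rule
# is evaluated against a collection of known-clean objects before release.
# All names here (CleanObject, screen_rule, etc.) are hypothetical.

from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class CleanObject:
    object_id: str   # e.g., a file hash or URL of a known-legitimate resource
    payload: bytes   # the content the detection rule is evaluated against

def screen_rule(matches: Callable[[bytes], bool],
                clean_collection: Iterable[CleanObject]) -> List[str]:
    """Return IDs of clean objects the rule wrongly flags (false positives)."""
    return [obj.object_id for obj in clean_collection if matches(obj.payload)]

def rule_passes(matches: Callable[[bytes], bool],
                clean_collection: Iterable[CleanObject]) -> bool:
    """A rule moves on only if it triggers on zero clean objects."""
    return not screen_rule(matches, clean_collection)
```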

One of the ways this collection is replenished is through our Allowlist Program, which allows software developers (both customers that develop and use their own solutions and independent vendors) to provide us with their software. This reduces the number of potential false positives and the risk of incorrect software classification.

Other methods for obtaining files and metadata include exchanging information with technological partners, using our Threat Intelligence Portal, and so on. In total, our database of legitimate objects contains information on around 7.2 billion objects.

Testing on virtual machines

But update testing isn’t limited to checks against file collections. If no problems are detected at the first stage, all updated components then undergo multi-stage automatic testing on virtual machines with various configurations of security products, software, and operating systems. We run various scenarios covering the operation of our products and their security mechanisms, as well as the imitation of typical user actions.

As for product-specific scenarios, these include a thorough file-system scan, installation of the product update, rebooting after the update, and so on. This allows us to make sure that the product functions normally after the update, and neither crashes nor affects system stability. Each update goes through this check.

User scenarios simulate typical human behavior on a computer: opening a browser, visiting a web page, downloading a file, launching a program. This check allows us to make sure the product doesn’t negatively affect the computer’s performance, responsiveness, or stability.
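As a rough illustration of the approach (not our actual test harness), exercising an update across a matrix of configurations and scenarios could be driven by something like the following, where run_vm_scenario stands in for the real provisioning and testing machinery:

```python
# Illustrative only: exercising an update across a matrix of VM configurations
# and scenarios. OS images, configs, and scenario names are invented examples.

import itertools

OPERATING_SYSTEMS = ["win10-x64", "win11-x64", "winserver2022"]
PRODUCT_CONFIGS = ["default", "all-components-on", "minimal"]
SCENARIOS = ["full_filesystem_scan", "install_update_and_reboot",
             "browse_and_download", "launch_typical_programs"]

def run_vm_scenario(os_image: str, config: str, scenario: str) -> bool:
    # Stub standing in for the real harness: provision a VM with the given OS
    # and product configuration, apply the update, run the scenario, and
    # report whether the product and the system remained stable.
    return True

def failures_for_update() -> list:
    """Run every (OS, config, scenario) combination; return the failing ones."""
    return [(os_image, config, scenario)
            for os_image, config, scenario in itertools.product(
                OPERATING_SYSTEMS, PRODUCT_CONFIGS, SCENARIOS)
            if not run_vm_scenario(os_image, config, scenario)]

# An empty result means the update may move on to the next stage.
```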

Separately, updates are automatically tested for compatibility with industrial software (for example, SCADA systems). Any negative impact on solutions in this sphere could lead to an unacceptable halt in production processes and potential financial damage.

Quality control

In addition to the above-mentioned checks, we also have a separate quality control team. Not a single product update release is delivered to our clients without this team’s experts confirming its readiness. The team also adjusts and continuously improves the verification processes where necessary, and monitors for possible emerging operational risks.

Phased release of protective technology updates

Of course, we’re realists, and admit that even this entire multi-level system of checks may still not be enough. For example, some third-party software might be updated at the same time as ours, causing an unforeseen conflict. And in general, it’s impossible to predict every combination of configurations of different programs and systems. Therefore, after an update affecting the functionality of security solutions is ready and approved, it isn’t sent to all our users’ computers at once. Instead, updates are released in phases.

An update first undergoes preliminary testing on machines in our own network before being published on public update servers. If no problems are detected, the update is then received by a very small number of randomly selected users. If no problems or failures appear, the number of computers receiving the update is gradually increased at certain intervals, and so on until the update is available to all users.
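A simplified sketch of such a phased rollout might look as follows. The phase sizes, hold times, and helper functions (deliver_to, telemetry_healthy, halt_distribution) are all invented for illustration:

```python
# Illustrative phased rollout: each phase widens the audience only if
# telemetry from the previous phase stays healthy. All numbers and helpers
# below are invented for this sketch.

import random
import time

ROLLOUT_PHASES = [0.001, 0.01, 0.05, 0.25, 1.0]  # fractions of the user base
HOLD_SECONDS = 6 * 3600                          # observation window per phase

def deliver_to(update_id: str, clients: list) -> None:
    pass  # stub: push the update to these clients

def telemetry_healthy(update_id: str) -> bool:
    return True  # stub: check monitoring for crashes or other anomalies

def halt_distribution(update_id: str) -> None:
    pass  # stub: stop serving the update from the update servers

def phased_rollout(update_id: str, all_clients: list) -> bool:
    shuffled = random.sample(all_clients, len(all_clients))  # random selection
    delivered = 0
    for fraction in ROLLOUT_PHASES:
        target = int(len(shuffled) * fraction)
        deliver_to(update_id, shuffled[delivered:target])
        delivered = target
        time.sleep(HOLD_SECONDS)            # let telemetry accumulate
        if not telemetry_healthy(update_id):
            halt_distribution(update_id)    # stop before the next, larger phase
            return False
    return True
```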

Automatic anomaly monitoring

So what happens if an update does cause problems? We monitor the behavior of updated solutions using anonymized data transmitted voluntarily through our Kaspersky Security Network (KSN), and promptly halt update distribution if something goes wrong.

But most importantly, thanks to the combination of automatic anomaly monitoring and phased release of updates, an error would affect only a very small number of computers — hundreds, not millions or even thousands of them.
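To illustrate the kind of check involved (our real monitoring is, of course, far richer), an anomaly signal can be as simple as comparing the crash rate among updated clients against the pre-update baseline. The function and threshold below are invented for this sketch:

```python
# Minimal sketch of one possible anomaly signal; names and thresholds invented.

def anomaly_detected(updated_crashes: int, updated_clients: int,
                     baseline_crash_rate: float, tolerance: float = 3.0) -> bool:
    """Flag an anomaly if clients on the new update crash noticeably more
    often than the baseline observed before the update."""
    if updated_clients == 0:
        return False  # nothing to judge yet
    observed_rate = updated_crashes / updated_clients
    return observed_rate > baseline_crash_rate * tolerance
```

Because the rollout is phased, a signal like this fires while the update is still confined to the smallest phase, which is exactly what keeps the impact down to hundreds of machines.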

Testing updates on the client side

Our company also provides the ability to check received updates once again, this time on the client side, through the Kaspersky Security Center management console.

The client’s system administrators can set up an isolated test group of computers (or virtual machines) with the configuration and set of software most common for the organization’s network, and then create an update-verification task specifying this test group as the target. In this case, all incoming updates are first installed only on the test machines and tested in action; only after this test are they distributed across the entire company network. A conceptual sketch of this workflow is shown below; more information on how to set up such a check can be found on our technical support website.
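To be clear, the following is not the Kaspersky Security Center API; it’s just a generic illustration of the test-group-first pattern described above, with every helper stubbed out:

```python
# Generic illustration of a client-side update check; every helper is a stub.

def install(update, machines):
    pass  # stub: deploy the update to the given machines

def acceptance_checks_pass(test_group) -> bool:
    return True  # stub: verify products run and systems stay stable

def report_problem(update):
    pass  # stub: hold the update back and alert administrators

def client_side_update_check(update, test_group, production_machines):
    """Install on the isolated test group first; promote only if checks pass."""
    install(update, test_group)
    if acceptance_checks_pass(test_group):
        install(update, production_machines)
    else:
        report_problem(update)
```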

We thoroughly analyze each and every problem related to software updates that may arise (including those identified in preliminary tests), determine the root causes, and then take measures to ensure they don’t happen again. In addition, we’ve implemented a practice of proactively identifying and assessing the risks of possible problems, and we address them systematically. By doing this throughout the entire lifetime of our company, we’ve established a multi-level system that significantly reduces the risk of new problems emerging.

Of course, in just one blog post it’s impossible to cover all the nuances of our multi-level system for checking product updates. However, if this topic generates interest in the industry, we’re ready to continue sharing details. Only open cooperation among all players in the information security sphere can create an effective barrier to the actions of cybercriminals.