Botched firmware patches: Six tips to avoid the pain
Earlier this year, Intel admitted that its Meltdown and Spectre CPU firmware patches were faulty and instructed partners to stop distributing and deploying them. Unfortunately, many organisations had already applied the patch, which resulted in endless reboots and unpredictable system behaviour. The effect was felt across the industry, with end users, OEMs, cloud service providers and system manufacturers all falling victim.
 
Even more worryingly, Intel is not alone, as US-based company AMD has joined the debacle and is facing class action lawsuits over how it responded to the Meltdown and Spectre CPU flaws.
 
It was discovered that Intel was more than likely aware of the firmware issues weeks, or possibly even months, before a leak forced it to announce them publicly. Unfortunately, OEMs don't have the luxury of passing failures like this on to customers, as it damages both reputation and revenue. Some OEMs were better prepared than others, with dedicated labs and processes to test patches before they are deployed; others, not so much. However, there are a few steps organisations can take to help protect both their business and their customers.
 
1. Don't skimp on CPU 
It is important to have enough CPU resources in place, from both the vendor engineering and end-user perspective, to handle workloads in all failure scenarios. Prior to Meltdown and Spectre, these failure scenarios were typically confined to hardware; compute clusters were sized to tolerate the complete failure of one or more nodes. In light of these new flaws, sizing must also account for software mitigations, which can significantly affect performance.
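As a rough illustration of that sizing exercise, here is a minimal Python sketch that estimates how many nodes a cluster needs when it must both tolerate node failures and absorb a mitigation performance penalty. The 15 percent overhead figure is purely an assumption for the example; measure the real impact on your own hardware and workloads.

```python
import math

def required_nodes(total_workload, node_capacity,
                   tolerated_node_failures=1,
                   mitigation_overhead=0.15):
    """Estimate cluster size so surviving nodes can still carry the workload
    after patching.

    mitigation_overhead is a hypothetical fraction of per-node capacity lost
    to software mitigations (0.15 = an assumed 15% hit); replace it with a
    measured figure for your own platform and workloads.
    """
    effective_capacity = node_capacity * (1.0 - mitigation_overhead)
    # Nodes needed to run the workload on post-patch capacity alone...
    working_nodes = math.ceil(total_workload / effective_capacity)
    # ...plus spares so the cluster also survives node failures.
    return working_nodes + tolerated_node_failures

# Example: 400 units of work, 100-unit nodes, tolerate one node failure,
# assume a 15% mitigation penalty -> 5 working nodes + 1 spare = 6 nodes.
print(required_nodes(400, 100))
```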
 
2. Be ready to take action
One of the biggest frustrations with this incident is the apparent lack of processes in place to address issues like Meltdown and Spectre. In this case, Intel was slow to release the microcode to fix the flaws, and even then it let the update out the door with bugs.
 
Learning from this incident is important: OEMs can prepare by building an internal process for fixing critical issues and releasing those fixes as quickly as possible. There needs to be a path through each organisation's quality assurance department that ensures a software build or patch is ready to go at any time to mitigate security vulnerabilities and downtime for customers.
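As a small illustration of what "ready to go at any time" can mean in practice, the sketch below (Python, Linux/x86 only) compares the microcode revision each logical CPU reports in /proc/cpuinfo against a baseline, so hosts that need a patch build can be identified quickly. The baseline value is hypothetical; the real figure would come from your CPU vendor's guidance for your exact processor model and stepping.

```python
"""Minimal sketch: flag hosts whose reported CPU microcode revision is older
than a known-good baseline, so a patch build can be pushed to them first."""

KNOWN_GOOD_REVISION = 0xDE  # hypothetical baseline, not a real advisory value

def microcode_revisions(cpuinfo_path="/proc/cpuinfo"):
    """Yield the microcode revision reported for each logical CPU (Linux/x86)."""
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("microcode"):
                yield int(line.split(":")[1].strip(), 16)

if __name__ == "__main__":
    stale = [rev for rev in microcode_revisions() if rev < KNOWN_GOOD_REVISION]
    if stale:
        print(f"{len(stale)} logical CPU(s) below baseline {KNOWN_GOOD_REVISION:#x}")
    else:
        print("All reported microcode revisions meet the baseline")
```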
 
3. Adapt and be flexible
Even if you don't have processes in place to address fixes quickly, at least have the flexibility and adaptability in your organisation to drop other things and shift gears quickly to get the job done.
 
This can be as simple as adjusting priorities and developing a path within your organisation to get a stable fix out the door. Establish the resources to test patched systems and confirm they are running with a level of stability that meets your comfort level.
 
If you have customers who are already pushing the envelope on performance for their workloads, then have a team ready to support them through the upgrade process.
 
4. Always communicate 
When working through a patch fix issue like Meltdown and Spectre, it's important to communicate internally and externally. In our case, when the patch flaw was revealed, we worked to create internal communications with employees to help them understand the severity of the issue. We wanted to educate them on how it impacted system vulnerability and what we were doing to address it.
 
As a result, our internal teams were ready with answers when our customers called. Because we were prepared with immediate answers, we were able to help alleviate customer fears to the point that they were confident we were on top of the situation.
 
5. Automate testing
For many OEMs, testing is automated so that, at the push of a button, they can have an answer for everything. Test automation speeds up the process of applying patches as they come down from vendors in the microcode and operating system. Unfortunately, not everyone has this advantage.
 
When testing, make sure you can replicate as many customer scenarios as possible to provide an accurate assessment. Know also whether customers are running untrusted code, which is critical in judging how exposed their infrastructure really is.
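One check that is easy to automate on Linux (kernel 4.15 and later) is reading the kernel's own view of mitigation status from /sys/devices/system/cpu/vulnerabilities. The sketch below is a minimal post-patch gate along those lines; it only covers kernel-reported status, and a real harness would also replay representative customer workloads against the patched build.

```python
"""Minimal sketch of one automated post-patch check: read the Linux kernel's
vulnerability status files and fail loudly if anything still reports as
'Vulnerable'."""

from pathlib import Path

VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")

def mitigation_report():
    """Return {vulnerability_name: kernel status string} on Linux 4.15+."""
    return {p.name: p.read_text().strip() for p in sorted(VULN_DIR.iterdir())}

if __name__ == "__main__":
    report = mitigation_report()
    for name, status in report.items():
        print(f"{name:20s} {status}")
    unmitigated = [n for n, s in report.items() if s.startswith("Vulnerable")]
    if unmitigated:
        raise SystemExit(f"Still vulnerable: {', '.join(unmitigated)}")
```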
 
6. Trust each other – even competitors
The last point to make here is that if processor and software manufacturers can't be open and honest, then users may have to look after each other. Open source communities spring to mind: if someone spots something odd during testing, they could alert everyone else rather than waiting for an official statement or patch release from the manufacturer. For this type of community to work, organisations have to trust each other, even competitors. A collective priority on mutual security would require an environment where data can be shared for testing, along with boundaries or a code of conduct that commits everyone to honour each other's proprietary secrets. Unfortunately, this is an unrealistic scenario at this stage.
 
Following these guidelines will help organisations prevent a similar failure. Spectre and Meltdown demonstrated not only a technology flaw but a breakdown in communication.

To ensure organisations are all delivering quality products and services with timely fixes when required, the IT community may need to rely on processes, automation and strong communication a little more.

Contributed by Phil White, chief technology officer at Scale Computing

*Note: The views expressed in this blog are those of the author and do not necessarily reflect the views of SC Media UK or Haymarket Media.