Combating the Rise of Simulation-Resistant Superbugs

Moore’s Law has stumbled, and the semiconductor industry will never be the same.¹ While the number of transistors continues to double every two years, clock speed and power have flatlined. Designers have had to pivot to novel functional approaches to improve throughput, response time, and power efficiency. Their heroic efforts have led to innovations in parallelism and concurrency to meet the ever-increasing market demand for high-performance designs.

But the use of these clever techniques to keep performance marching forward has led to the rise of simulation-resistant superbugs—functional bugs that are resistant to detection during simulation and even emulation. They arise because extreme corner scenarios are required to activate and detect them. This task has proven too much for traditional functional verification methods to tackle, so superbugs are often initially discovered in silicon—and sometimes by the end customer.

The Superbug Epidemic and Verification

Verification has never been harder. CPU developers can no longer rely solely on making individual processors go faster; instead, they pack more cores onto a chip to meet performance and efficiency goals. This surge in functional complexity and the increasing adoption of parallelism and concurrency have contributed to an increase in insidious bugs that are nearly impossible to find pre-silicon.

Other factors compound the problem: the root cause of a superbug can be elusive during debug, and one superbug may hide sister superbugs that are revealed only once the first is identified.² As a result, it’s difficult to predict when all of these superbugs will be fixed and the chip will be blessed for tape-out.

Now imagine data centers and networking systems utilizing thousands of these multicore chips and their non-deterministic bugs, and you start to understand the scale of the superbug epidemic and the seemingly insurmountable hurdle they present in verification.

Simulation-resistant superbugs occur across an array of design types and are typically application-specific. For example, CPU functionality related to cache coherence, speculative issues, data prefetching, and memory subsystems may foster superbugs. In networking applications, the use of resource managers for multiple ports, linked-list controllers, and crossbars (XBARs) creates superbug exposure.

In the wireless domain, multi-user MIMOs, Rx and Tx channel interference, and aggressive 802.11 standard compliance are prime illustrations of functionality that contributes to superbugs. The chart below depicts common areas or functionalities that are susceptible to superbugs by domain. Applying application-specific verification methodologies is the key to combating these superbugs.

[Figure 1: Common areas of functionality susceptible to superbugs, by domain]

Employing the Right Verification Strategy

With mask costs reaching extraordinary levels, it’s never been more important to ensure that all bugs are identified and fixed before tape-out. However, the set of complex issues that parallelism and concurrency introduce, including non-determinism, race conditions, deadlock, and performance and scalability challenges, has forced a change in verification strategies.

Simulation has historically been the verification tool of choice because it provides high controllability and observability, but its execution speed limits how much can reasonably be tested. Simulation is strong in verifying designs that are more sequential in execution, even if they implement complex functionality. However, simulation runs into a wall when trying to account for all of the different scenarios brought on by parallelism and concurrency.
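To give a feel for that wall, here is a back-of-the-envelope sketch in Python. It is not taken from any tool or real design; the function name `interleavings` and the agent and step counts are assumptions chosen purely for illustration. It counts the distinct orderings in which N independent agents, each performing k ordered steps, can interleave, which is (N·k)! / (k!)^N.

```python
from math import factorial

def interleavings(agents: int, steps: int) -> int:
    """Number of distinct interleavings of `agents` independent sequences,
    each `steps` long: (agents*steps)! / (steps!)**agents."""
    return factorial(agents * steps) // factorial(steps) ** agents

# Illustrative sizes only; real designs have far more agents and steps.
for agents, steps in [(2, 2), (2, 8), (4, 4), (4, 8)]:
    print(f"{agents} agents x {steps} steps: "
          f"{interleavings(agents, steps):,} orderings")
```

Even four agents with four steps each already allow 63,063,000 orderings. Real hardware constrains many of these, so treat the figure as intuition rather than a measurement, but it suggests why a simulator that runs one ordering at a time struggles to reach the one ordering that triggers a superbug.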

Emulation has been growing in popularity because its speed allows for the execution of low-level software that can stimulate the design in ways that aren’t possible with simulation. It’s also powerful for exploring system-level performance and validating software.

However, relying on emulation to discover corner-case superbugs can be detrimental to meeting schedule demands because emulation isn’t typically focused on verifying corner cases and is deployed late in the design flow. Finding serious bugs at this late stage is better than a bug escape to silicon, but it can still derail a design schedule, delaying market release and sacrificing profits.

Is Deep Application-Specific Formal the Answer?

Formal sign-off methodologies have the power to prove the absence of bugs, including superbugs brought on by parallelism and concurrency. Formal techniques require a thorough understanding of the low-level details of the design.

When bugs are found, formal methods identify the conditions under which the bug occurs—even if those conditions are bizarre corner cases that no one would ever think of. Deep application-specific formal methodologies are ideally suited to cover all corner-case scenarios, since all scenarios are exhaustively verified no matter how unlikely they are in the presence of parallelism and concurrency.
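The idea of exhaustive exploration with a reported failing scenario can be sketched in a few lines of Python. This is a toy explicit-state search, not how commercial formal tools are implemented, and everything in it is a hypothetical stand-in: the two-agent "design," the state layout, and the names `successors`, `find_counterexample`, and `lost_update` are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy model, not a real design: two agents each perform a
# non-atomic increment of a shared counter (read it, then write back +1).
# State layout: (shared, pc0, tmp0, pc1, tmp1); pc: 0=idle, 1=has read, 2=done.
INIT = (0, 0, 0, 0, 0)

def successors(state):
    """Yield (label, next_state) for every step any agent can take."""
    shared = state[0]
    for i in (0, 1):
        pc_idx, tmp_idx = 1 + 2 * i, 2 + 2 * i
        nxt = list(state)
        if state[pc_idx] == 0:      # read the shared counter into a local copy
            nxt[pc_idx], nxt[tmp_idx] = 1, shared
            yield f"agent{i}.read", tuple(nxt)
        elif state[pc_idx] == 1:    # write back local copy + 1 (may be stale)
            nxt[0], nxt[pc_idx] = state[tmp_idx] + 1, 2
            yield f"agent{i}.write", tuple(nxt)

def find_counterexample(init, is_bad):
    """Breadth-first search of every reachable state.
    Returns the shortest step sequence reaching a bad state, or None."""
    parent = {init: None}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if is_bad(state):
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return trace[::-1]
        for label, nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = (state, label)
                queue.append(nxt)
    return None  # every reachable state was visited and none was bad

def lost_update(state):
    """Bad state: both agents finished but the counter is not 2."""
    shared, pc0, _, pc1, _ = state
    return pc0 == 2 and pc1 == 2 and shared != 2

trace = find_counterexample(INIT, lost_update)
print(trace)
```

The search returns a four-step trace in which both agents read before either writes, which is exactly the ordering that loses an update. When the function returns `None` instead, every reachable state has been checked, which is the toy-scale analogue of a proof that the bug is absent.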

[Figure 2]

Formal verification can zero in on specific application behavior, making it the most robust and efficient way to knock down superbugs. Many examples exist of mission-critical functionality that has been formally signed off. All of these formal methods were developed in response to post-silicon bugs being overly common in these block types.

Formal sign-off can happen in parallel with other verification methods, but the testbenches may take a significant amount of time to think through and create. Formal specialists have been through this drill numerous times, accumulating patterns and best practices that can save months of verification time. Application-specific abstraction techniques are often key to overcoming the exponential proof complexity that can result from naïve use of formal tools.
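As a rough illustration of why abstraction matters, consider only the storage of a FIFO. The sketch below is an assumption-laden example, not a description of any particular methodology: the depth of 16, the width of 32 bits, the function name `fifo_storage_states`, and the choice of abstraction (replacing each stored word with a single tag bit that marks whether it is the word being tracked, in the spirit of data-independence reductions) are all hypothetical.

```python
def fifo_storage_states(depth: int, width: int) -> int:
    """Count storage states of a FIFO: for each occupancy k in 0..depth,
    the k stored words can take 2**(width*k) distinct values."""
    return sum(2 ** (width * k) for k in range(depth + 1))

# Assumed sizes for illustration only.
concrete = fifo_storage_states(depth=16, width=32)  # full 32-bit data words
abstract = fifo_storage_states(depth=16, width=1)   # one tag bit per entry

print(f"concrete: about 2^{concrete.bit_length() - 1} states")
print(f"abstract: {abstract:,} states")
```

Under these assumptions the concrete storage has on the order of 2^512 states, while the abstracted version has 131,071. Whether such a reduction is sound depends on the property being proved and on the design, which is where application-specific experience comes in.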

Such experts take load off your verification team so that simulation engineers can focus on simulation-friendly blocks, and emulation engineers can focus on system-level integration and software bugs. This approach leverages the strengths of each verification domain to deliver chip-level functional sign-off within a project’s schedule.

While superbugs might seem like an implacable foe, raising the specter of deficient systems that fail after deployment, they need not be. Formal sign-off with the assistance of an experienced team such as Oski Technology can eliminate superbugs without delaying time-to-market.

Craig Shirley is President and CEO of Oski Technology.

References:

1. https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/

2. https://newsroom.intel.com/editorials/addressing-new-research-for-side-channel-analysis/