Formal Analysis: A Valuable Tool for Post-Silicon Debug

Test-lab dilemma

Memory controller

Protocol violation

End-to-end property

Input constraints

The verification of today’s bleeding-edge chips requires the best methodology and tools, including the application of high-capacity formal verification technologies throughout the design flow, from architectural exploration to post-silicon debug. We see this last area, post-silicon debug, as an important value delivered by formal technology for design and verification teams who have not employed formal analysis earlier in the process to get the design right the first time. As case studies demonstrate, using formal analysis to find bugs, fix them, and verify the fixes adds tremendous value in the post-silicon lab.

To understand the stakes involved with post-silicon-bugs, consider these widely acknowledged consequences: Finding bugs in model testing is the least expensive option, as the cost of a bug goes up 10 times when it’s detected in component test, 10 times more in system test, and 10 times more if it gets into the field, which is, of course, the worst possible outcome for everyone involved.

Post-silicon debug can be very stressful. What can you do when the chip doesn’t work and time is running out? Missing the deadline can put one’s job—and one’s company—in jeopardy. Management expects these post-silicon-bugs to get isolated quickly and is regularly calling the design and debug team for updates.

Effort is spent trying to reproduce bugs seen in the lab using directed-random simulation and emulation. But these traditional approaches often run out of steam and cannot find the root cause of the bug fast enough. Formal verification has proved to be a lifesaver in these situations, for it uncovers the root causes of bugs and exhaustively validates fixes when other approaches have failed.

BRING ON THE BUGS

When the post-silicon debug team detects a problem in the chip under test, it is initially very abstract and not well understood. The chip hangs up, is non-responsive, drops packets, and/or sends out wrong output, among other functional problems. The first step is to know what is happening inside the chip.

Many chips today have some kind of on-chip trace extraction capability, which consists of means of freezing the chip when certain events are identified; on-chip logic analyzers that allow a selected group of signals to be multiplexed to external pins; the ability to save the value of certain signals some number of cycles before the freeze event; or scan chains to scan out all the flops. Given this, the post-silicon debug team can extract a failure trace, capturing a limited number of signals for a number of cycles before (and possibly after) a problem is detected.

The next step is to isolate and root-cause the post-silicon bug. At this point it is known that the chip is exhibiting illegal behavior, as it can be seen in the trace. But the trace represents the last N cycles of the run, and how this state was reached is not known. Typically, there will be a limited number of signals in the trace, and it is difficult to choose the right set of signals to show the problem.

This scenario is a classic test-lab dilemma (Fig. 1). The last few cycles of the failing scenario can be observed, but how can the root cause of the problem be found? How can the designer know in which block the bug is located?

USING DIRECTED-RANDOM SIMULATION

Traditionally, the directed-random simulation team is called in at this point to isolate the bug. Can it discover how this state was reached using simulation? Where the bug is happening isn’t clear, but it is known that it is causing block D to function incorrectly. The bug happens after a three- to four-hour run in the lab when a certain kind of traffic is injected—for instance, read transactions only on bus X.

In reality, finding the root cause with directed-random simulation will be, in most cases, like “Mission Impossible.” If it took four hours of real time with random traffic to hit the bug, how long will it take to reproduce it when simulation time is 1000 times slower? Four thousand hours of simulation? And can that be done in a week?

FORMAL TO THE RESCUE

Naturally, the right formal verification tool, with advanced capabilities, is required. These technologies include:

High capacity for end-to-end proofs, making formal analysis practical for very large designs of hundreds of thousands of blocks
The ability to prove the integrity of data transfers across a design, as offered in Jasper Formal Scoreboard and Proof Accelerators; multiple input and output ports make it possible to verify many different types of data transfers, such as transfers relying on byte enables and bus-width conversion, or serial-to-parallel conversion; Proof Accelerators for many common blocks increase performance and proof convergence
Visualization for easy debug; the ability to easily refine the scenarios to match the lab
Asynchronous clock handling

One of the key advantages of using proven formal verification technologies in post-silicon debug is their ability to find bugs fast. Usually, finding counter examples (CEX) with formal analysis is much faster than reaching full proofs on the same property, and this technique lends itself very well to bug hunting. A capable formal verification tool lets the user freeze a specific state in a specific cycle and then continues the analysis from that point, so it is unnecessary to analyze the entire design at once.

Finding the bug with formal analysis follows almost the same process as in a pre-silicon formal verification flow. There are, however, some significant differences:

Search is for one specific bug and one specific scenario
The process is not looking for full proof or coverage completeness
The process just needs to find the scenario that leads to the illegal behavior
Over-constraints are allowed to simplify the process—for example, disallow write transactions because the bug only happens with read transactions

In a typical application of formal verification, the team would start with what is known. A trace already shows the illegal scenario, and the problem is known to happen when a read transaction is followed by another read transaction. All that needs be done then is define a property stating - not (illegal_scenario).

In the process, it is unnecessary to constrain the design as in a normal formal run, thereby simplifying the process and allowing for additional shortcuts. For example, it is acceptable to over-constrain the design and add the constraint “no_write_transaction_allowed” since it is known the bug happened in “read” and not in “write” transactions.

Then it’s simply a matter of asking the formal engines to prove this property, and they will compute the failure trace backward from the illegal scenario. Of course, after finding the bug, the designer can modify the RTL and run the formal proof on the same property to confirm that the illegal scenario is eliminated. The ability to visualize this process is a feature in Jasper’s tools, which automatically generates and manipulates waveforms without a testbench to answer “what-if” questions and provide visual confirmation of functionality.

To illustrate the power of formal analysis in post-silicon debug, let’s examine a trio of examples from Jasper Design Automation customers who overcame big challenges using these techniques.

THE HAZARDS OF IP BUGS

The first example describes the debug of a memory controller (Fig. 2) violating a bus protocol. This large system-on-a-chip (SoC) with a processor and multiple peripherals was behaving badly in the field, hanging up under certain conditions, and the manufacturer recalled it.

The members of the post-silicon debug team used directed-random simulation. They started with the little information available from the lab: The chip hung and the problem was isolated as coming from the memory controller when it performs a read transaction.

The team worked more than three months until it was able to find the root cause of the bug. The cause was identified in the memory controller, which was sending its data to the wrapper block in a very specific pattern that activated a bug in the wrapper and caused a violation of the bus protocol (Fig. 3). This very specific timing alignment of different events in the system was very hard to hit with random simulation. As a result, it escaped as a silicon bug and was extremely difficult to detect in post-silicon simulations.

Obviously, three months of simulations to find the root cause of the bug was a major problem for the chip manufacturer, its customers, and the intellectual property (IP) supplier, especially since products containing the chip had already reached the end customers, resulting in a recall.

Formal analysis was called upon to help after the fact to see if it could root-cause the bug faster than simulation. The formal engineer got exactly the same information that the simulation team started with. In this case, the bug was found after just two and a half weeks, and much of that time was spent ramping up on the design and protocols involved. Actually, once the setup was finished, and properties written, the runtime to find the CEX was less than a minute compared to weeks of simulations trying to hit the bug randomly.

These are realistic and expected results since formal analysis works mathematically backward from the failing trace to generate its root immediately. With simulation, engineers build a checker to detect the bug when it happens, use direct random to do more reads, and hope to hit the bug eventually.

Afterward, formal analysis was rerun on the fixed RTL code. It uncovered two more new bugs that the post-silicon simulations missed, saving the chip manufacturer another respin.

POST-SILICON PARTNERSHIP

A second case describes how the formal team worked in tandem with engineers in the post-silicon lab. Once the bug was discovered in the lab, formal verification was used to help.

The chip included a series of four blocks in which the bug occurred. Each block processed the data on its inputs and sent it to the next block. The chip incorporated a built-in logic analyzer that can route out a group of signals to be observed on the chip output. However, for the lab team, selecting which signals to probe was always a challenge. It is difficult to know which signals will display the problem. Since the failure trace was detected on the output of the chip, or the outputs of block D, the lab team focused on block D. Thus, the logic analyzer was probing this block.

The formal team initially wrote an end-to-end property (Fig. 4) from the inputs of block B to the outputs of block D. In this case, block A was not relevant. The property was written based on the illegal trace captured in the post-silicon lab, and only inputs that caused the bug were allowed while others were constrained out.

Due to the size of the three blocks, the property did not converge and had to be broken into smaller properties. So, a property was written from the inputs of block D to its output. For this property, it was necessary to add input constraints (Fig. 5) to define legal behavior on the inputs of D. This property proved very quickly. This means that if the inputs of D behaved according to the constraints, block D would never produce the illegal behavior.

Block D was cleared and determined not to have the bug. Next, the post silicon-debug team moved its focus to block C, setting the logic analyzer to capture signals from that block. Meanwhile, the formal team worked on verifying block C. The assumes on block D’s inputs were converted to asserts on block C’s output. In addition, new constraints were written to allow only legal inputs on Block C.

Within minutes, these properties were proven. Now that the formal team had exhaustively exercised and cleared block C, the post-silicon-debug lab changed its focus to block B. The logic analyzer was set to capture signals from block B. In short order, the post-silicon-debug team was able to find the bug in the trace extracted from block B.

This case presents an excellent example of how the two teams can work hand in hand, each using its strengths and sharing information with the other team to quickly isolate a bug.

FORMAL FINDS MISSED BUGS

A final real-world scenario demonstrates the value of this approach for a design team that had not included formal analysis as part of its initial verification strategy. The original verification strategy was simulation and emulation only, with no formal analysis. But even after this exhaustive testing, a bug was found in the post-silicon lab. Inherent limitations with simulation and emulation caused the bug to be missed. It was a corner case and was cycle-dependent.

After executing an engineering change order (ECO), a couple of weeks of simulation and emulation confirmed that the fix was correct. To be certain, formal analysis was applied to double-check the result. Within just a few hours, it was determined the fix was broken. Even in the absence of a test bench, the formal tool found another scenario that would cause the bug. The fix did not cover all cases.

A second ECO, when reviewed with formal analysis, led to the discovery that the same bug could occur in a different part of the logic. Finally, a third ECO passed the formal test with no problem.

In this case, formal review saved the company from multiple respins because its simulation environment would not have uncovered the bugs after rerunning the fixes. The company now designs with formal analysis and has done away with block-level test benches.

SUMMARY

Formal verification tools deliver significant return on investment (ROI) in the post-silicon lab, potentially saving enormous amounts of time and money while protecting a company’s reputation. Addressing post-silicon debug solely with sophisticated, expensive hardware to inject and capture signals provides an incomplete solution. Formal analysis is delivering impressive results for users who cannot afford delay or to ship flawed parts even once in today’s competitive marketplace.

Ambiguity in the post-silicon debug process can have devastating effects. Verification ambiguity includes incomplete test bench setup, improper property specification, software bugs, and more. Each of these can be a potential bug source, so eliminating them early is critical to isolating the root cause of the bug. The benefit of deploying formal verification in post-silicon debug is to remove this ambiguity. Because of high-capacity formal’s exhaustive behavior analysis, conclusive answers can be generated as to whether or not a particular verification component is the source of the undesired behavior, enabling quick isolation of the root cause of the problem.

The real solution is to integrate formal analysis as part of the design flow to root out bugs early. But it can also be used in concert with traditional post-silicon methodologies to ensure production silicon is bug-free. Pairing formal verification with the capabilities of the post-silicon lab produces highly desirable outcomes, as it eliminates working in the dark. The exhaustiveness of formal analysis can rule out the existence of the bug in a given block and prevent multiple respins, as evidenced by the three case studies cited.