Debugging Low Test-Coverage Situations

Scan is a structured test approach in which the overall function of an integrated circuit (IC) is broken into smaller structures that are tested individually. Every state element (D flip-flop or latch) is replaced with a scan cell that operates as an equivalent state element and, in scan mode, is concatenated with other scan cells into long shift registers called “scan chains.” All the internal state elements thereby become controllable and observable. This greatly reduces the complexity of testing an IC, because the tool tests small combinational logic segments between scan cells. Automatic test pattern generation (ATPG) tools take advantage of scan to produce high-quality scan patterns.

The combination of scan and ATPG tools has been shown to successfully detect the vast majority of manufacturing defects. When you use an ATPG tool, your goal should be to achieve the highest defect coverage possible. Because high test coverage directly correlates to the quality of the parts shipped, many companies demand that the coverage be at least 99% for single stuck-at faults and at least 90% for transition delay faults.

When the coverage report falls short of these goals, your task is to figure out why the coverage is not high enough and take corrective action where possible. Debugging low defect coverage has historically required a significant amount of manual effort and intimate knowledge of the ATPG tool, as well as design experience, especially as device complexity increases.

Automating more of the debug process during ATPG greatly simplifies this effort. I have seen some cases in which automation saved hours, even days, of manual debugging effort and other cases in which the tool provided answers when no feasible, manual technique was possible. Before exploring why you might be getting low coverage and why further automation is needed, I’ll explain how ATPG tools in general categorize and report different categories of faults.

INTERPRETING THE MYSTERIES OF ATPG STATISTICS
The ATPG tool generates a “statistics report” that tells you what the tool has done and provides the fault category information that you have to interpret to debug coverage problems. If you’re an expert at using an ATPG tool, you’ll probably have little problem understanding the fault categories listed in the statistics report. But if you’re not a design-for-test expert, this data may as well be written in hieroglyphics (Fig. 1). Although the statistics report contains a lot of information, it can be difficult to interpret and rarely gives enough useful information to determine the reasons for low coverage, even for an ATPG expert.

When debugging low coverage, you’ll need to understand some of the basic fault categories that are listed in most typical ATPG statistics reports. The first and broadest category is what is sometimes referred to as the “fault universe.” This is the total number of faults in a design. For example, when dealing with single stuck-at faults, you have two faults for each instance/pin, stuck_at logic 1 and stuck_at logic 0, where the instance is the full hierarchical path name to a library cell instantiated in the design netlist.
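As a rough illustration of how the stuck-at fault universe is counted, the sketch below enumerates two faults per instance/pin. The netlist dictionary, the instance names, and the function name are all hypothetical; a real ATPG tool builds this list from the design netlist and library models.

```python
# Hypothetical netlist: full hierarchical instance path -> pin names.
# Every name here is invented for illustration only.
netlist = {
    "top/u_core/u_and1": ["A", "B", "Z"],
    "top/u_core/u_inv1": ["A", "Z"],
    "top/u_bscan/u_ff1": ["D", "CK", "Q"],
}


def stuck_at_fault_universe(netlist):
    """Return uncollapsed stuck-at faults as (instance/pin, stuck value) pairs."""
    faults = []
    for instance, pins in netlist.items():
        for pin in pins:
            # Each pin contributes a stuck-at-0 and a stuck-at-1 fault.
            for stuck_value in (0, 1):
                faults.append((f"{instance}/{pin}", stuck_value))
    return faults


faults = stuck_at_fault_universe(netlist)
print(len(faults))  # 8 pins x 2 faults per pin = 16
```

This corresponds to the uncollapsed, library-cell-level view: no equivalent faults are merged, and no fault sites inside the cell models are added.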

The total number of faults is really only important when comparing different ATPG tools against each other. The total can vary depending on whether “internal” faulting is turned on and whether “collapsed” faults are used. Internal faulting extends the fault site down to the ATPG-model level, rather than limiting it to the library-cell level. ATPG tools, for efficiency purposes, are designed to collapse equivalent faults whenever possible. Typically, you’ll want to have the internal faults setting turned off and the uncollapsed faults setting turned on. These settings most closely match the faults represented in the design netlist.

SHOULD YOU CARE ABOUT UNTESTABLE/UNDETECTABLE FAULTS?
Faults that cannot possibly be tested are reported as untestable or undetectable. This includes faults that are typically referred to as unused, tied, blocked, and redundant. For example, a tied fault is one in which the designer has purposely tied a pin to logic high or logic low. If a stuck-at-1 defect were to occur on a pin that is tied high, you could not test for it because that would require the tool to be able to toggle the pin to logic low. This cannot be done because of the design restriction, so the fault is categorized as “untestable.”

Untestable/undetectable faults are significant for two reasons. First, they distinguish “fault coverage” from “test coverage,” both of which are reported by ATPG tools. When most tools calculate coverage, fault coverage includes all the faults in the design.

Test coverage subtracts the untestable/undetectable faults from the total number of faults when calculating coverage. For this reason, the reported number for test coverage is typically higher than fault coverage.

The second reason that untestable/undetectable faults are important is that nothing can be done to improve the coverage of these faults; therefore, you should direct your debugging efforts elsewhere.

One last thing to be aware of regarding untestable/undetectable faults is that ATPG-tool vendors vary in how they categorize these faults. These differences can result in coverage discrepancies when comparing the results of each tool.

WHAT IS MORE IMPORTANT—TEST COVERAGE OR FAULT COVERAGE?
This raises the question of which is the more critical figure: test coverage or fault coverage? Most engineers, but not all, rely on the higher test coverage number. The justification for ignoring untestable/undetectable faults is that any defect that occurs at one of those fault locations will not cause the device to functionally fail. For example, if a stuck-at-1 defect occurred on a pin that is tied high by design, the part will not fail in functional operation. Others would argue that fault coverage is more important because any defect, even an untestable defect, is significant because it represents a problem in the manufacturing of the device. That debate won’t be explored here, though.

Some faults are testable, meaning that a defect at these fault sites would result in a functional failure. Unfortunately, ATPG tools cannot produce patterns to detect all of the testable faults. These testable but undetected faults are called “ATPG_untestable” (AU).

Of all the fault categories listed in an ATPG statistics report, AU is the most significant category that negatively affects test coverage and fault coverage. Determining the reasons why ATPG is unable to produce a pattern to detect these faults and coming up with a strategy to improve the coverage is the biggest challenge to debugging low-coverage problems.

Here are some of the most common reasons why faults may be ATPG_untestable:

Pin constraints: At least one input signal (usually more than one) is required to be constrained to a constant value to enable test mode. While this constraint makes testing possible, it also results in blocking the propagation of some faults because the logic is held in a constant state. Unless you have special knowledge to the contrary, these pin constraints must be adhered to, which means you cannot recover this coverage loss.

Determining the effect on coverage loss is not as simple as counting the number of constrained faults on the net. The effect on defect coverage also extends to all the logic gates that have an input tied and whatever upstream faults are blocked by that constraint. Faults downstream from the tied logic have limited control, which further affects coverage.

Black-box models: When an ATPG model is not available for a module, a library cell, or more commonly a memory, ATPG tools treat it as a “black box” that propagates a fixed value (often an “X” or unknown value). Faults in the “shadow” of a black box (i.e., faults whose control and observation are affected by their proximity to the black box) will not be detected. This includes faults in the logic cone driving each black-box input as well as the logic cones driven by the outputs. Obtaining an exact number of undetected faults is complicated by the fact that some of those faults may also lie in other, overlapping cones that are detected. The solution is to ensure that everything is modeled in the design.

Random access memory: In the absence of either bypass logic or the ability to write/read through RAMs, faults in the shadow of the RAM may be undetected. As with black-box faults, it is difficult to determine exactly which faults are not detected because of potentially overlapping cones of logic.

If you are able to make design changes, adding bypass logic may address this problem. Some ATPG tools are capable of special “RAM-sequential” patterns that can propagate faults through memories so long as the applicable design rule checks (DRCs) are satisfied. This may be an option to get around having to modify the design to improve coverage.

Cell constraints: Sometimes you need to constrain scan cells with regard to what values they are capable of loading and capturing (usually for timing-related reasons). These constraints imposed on the ATPG tool will prevent some faults from being detected. If the cell constraint is one that limits capturing, then to determine the effect, you’ll need to look at the cone of logic that drives the scan cell and sift out faults that are detected by overlapping cones.

If found early enough in the design cycle, the underlying timing issue can possibly be corrected, which makes cell constraints unnecessary. However, this type of timing problem is often found too late in the design cycle to be changed. Using cell constraints is a bandage approach to getting patterns to pass, and the resulting test coverage loss is the price to be paid.

ATPG constraints: You may impose additional constraints on the ATPG tool to ensure that certain areas of the design are held in a desired state. For example, let’s say you need to hold an internal bus driving in one direction. As with all types of constraints, parts of the design will be prevented from toggling, which limits test coverage. Similar to pin constraints, if the assumption is that these are necessary for the test to work, the coverage loss cannot be addressed.

False/multicycle paths: Some limitations to test coverage are specific to at-speed testing. False paths cannot be tested at functional frequencies; therefore, ATPG must be prevented from doing so to avoid failures on the automatic test equipment. Because transition-delay fault (TDF) patterns use only one at-speed cycle to propagate faults, multicycle paths (which require more than one cycle) must also be masked out. Determining which faults are not detected in false paths is complicated by the manner in which false paths are defined.

Delay-constraint files usually specify a path by designating “-from”, “-to” and possibly “-through” to describe a start and end point of the path. In between those points, there can be a significant amount of logic to trace and potentially multiple paths if you don’t use “-through” to specify the exact path.
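To see why omitting “-through” makes the affected faults harder to pin down, the sketch below enumerates every path between a start and end point in a tiny, hypothetical fanout graph. The graph, the node names, and the helper function are invented for illustration and do not reflect any tool's internal representation.

```python
def all_paths(graph, start, end, path=None):
    """Enumerate every simple path from start to end by depth-first search."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    paths = []
    for nxt in graph.get(start, ()):
        if nxt not in path:  # guard against loops
            paths.extend(all_paths(graph, nxt, end, path))
    return paths


# Hypothetical logic between two flops: two reconvergent branches.
fanout = {
    "ff_a": ["u1", "u2"],
    "u1": ["u3"],
    "u2": ["u3"],
    "u3": ["ff_b"],
}

# "-from ff_a -to ff_b" alone matches both branches.
from_to = all_paths(fanout, "ff_a", "ff_b")
print(len(from_to))  # 2

# Adding "-through u1" narrows the definition to a single path.
through_u1 = [p for p in from_to if "u1" in p]
print(through_u1)  # [['ff_a', 'u1', 'u3', 'ff_b']]
```

In a real design the number of matching paths can grow very quickly with logic depth, which is one reason the set of masked faults is hard to determine by hand.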

STEPS TO IDENTIFY AND QUANTIFY COVERAGE ISSUES
There are three aspects of the debug challenge:

How you identify which coverage issues (as described above) exist,
How you determine the effect each issue has on the coverage, and
What, if anything, you can do to improve the coverage.

Typically, we have had to rely on a significant amount of design experience as well as ATPG tool proficiency to manually determine and quantify the effects of design characteristics or ATPG settings that limit coverage. The usual steps that are required to manually debug fault coverage are:

Identify a common thread in the AU faults.
Investigate a single representative fault.
Rely on your experience to recognize trends.
Determine the effect of the issue on test coverage.

Let’s look at these steps in turn. When it comes to identifying a common thread in the AU faults, it is extremely difficult, if not impossible, to identify a single problem by looking at a list of AU faults. You have to recognize trends in either the text listing of faults or graphical view of faults relative to the design hierarchy. For example, a long list of faults that are obviously contained in the design hierarchy of the boundary scan logic may be caused by a single problem.
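One simple way to look for such a trend in a text listing is to bucket the AU faults by their leading hierarchy levels. The following is a minimal sketch that assumes fault locations are available as slash-separated hierarchical paths; the paths, the `depth` parameter, and the function name are hypothetical.

```python
from collections import Counter


def group_by_hierarchy(fault_paths, depth=2):
    """Count faults per hierarchy prefix, most populated prefix first."""
    counts = Counter()
    for path in fault_paths:
        prefix = "/".join(path.split("/")[:depth])
        counts[prefix] += 1
    return counts.most_common()


# Hypothetical AU fault locations (instance/pin paths).
au_faults = [
    "top/u_bscan/cell3/D",
    "top/u_bscan/cell3/Q",
    "top/u_bscan/cell7/D",
    "top/u_core/u_alu/u1/A",
    "top/u_mbist/ctl/u2/Z",
]

for prefix, count in group_by_hierarchy(au_faults):
    print(prefix, count)
# top/u_bscan 3
# top/u_core 1
# top/u_mbist 1
```

A block that dominates the count, such as the boundary-scan hierarchy here, is a reasonable place to pick a representative fault for closer study.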

At some point, you’ll need to focus your analysis efforts on one fault at a time, so pick one you think might represent a larger group of faults. You might zero in on design elements like registers or memories, but this is usually based more on intuition than anything else. ATPG tools have different reporting capabilities that can be used to report on the inherent controllability and observability of a fault location, which can help but often provide limited information. Interpreting the reports at this level requires an in-depth knowledge of the ATPG tool’s capabilities and a fair amount of instinct regarding where to focus efforts.

As is often the case, your success with debugging relies on having been through the process before and recognizing similar situations. For example, if a significant number of boundary scan faults are listed as AU, this may be an indication that the boundary-scan logic has been initialized to a certain desired state and must be held in that state to operate properly. Making connections like this between the trends you identify in the list of AU faults and what you know about designs and design practices in general requires a fair amount of experience.

Once an issue is identified, how you determine its significance will be different depending on the issue. As previously described, you often need to keep track of backward and forward cones of logic fanning out from a single constrained point to determine the potential group of affected faults. From there, you also need to evaluate each of those potential faults to assess if it is possibly observed in another overlapping cone of logic.

Some other possible techniques can approximate the effect of some issues. For pin constraints, it may be possible to have the tool temporarily treat them like a tied-untestable fault so that coverage can be recalculated and compared to the original coverage number. Whole design modules can be no-faulted (for example, memory built-in self-test [MBIST] logic) to see the difference in coverage.

All of these approaches require a combination of special scripts to trace logic paths backward and/or forward, multiple runs of the ATPG tool with different settings, and a high level of tool expertise. Even then, the actual effect is usually still based on an approximation.
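The kind of tracing script described above might look something like the following sketch. It works on a hypothetical gate-level fanout graph, collects the fan-in cone of a blocked point (here, a black-box input), and then sets aside any gate that can still reach another observation point through an overlapping cone. All names are invented, and the model is deliberately crude: it works at gate rather than pin level and ignores whether a path can actually be sensitized, so the result is only an approximation.

```python
from collections import deque


def reachable(graph, start, blocked=frozenset()):
    """Nodes reachable from start (start excluded), never entering blocked nodes."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Turn a fanout graph (driver -> loads) into a fan-in graph (load -> drivers)."""
    rev = {}
    for src, dsts in graph.items():
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev


# Hypothetical fanout graph: gate -> gates or endpoints it drives.
fanout = {
    "g1": ["g3"],
    "g2": ["g3", "g4"],
    "g3": ["bbox_in"],
    "g4": ["scan_ff2"],
}
observe_points = {"scan_ff2"}  # places where a fault effect can be captured
blocked_point = "bbox_in"      # observation is lost here

# Backward trace: everything that drives the blocked point.
fan_in_cone = reachable(reverse(fanout), blocked_point)

# Forward trace from each candidate, avoiding the blocked point.
still_observable = {
    gate for gate in fan_in_cone
    if reachable(fanout, gate, blocked={blocked_point}) & observe_points
}

likely_lost = fan_in_cone - still_observable
print(sorted(likely_lost))  # ['g1', 'g3']
```

Gate `g2` drops out of the list because it also feeds `scan_ff2`, which is the overlapping-cone effect that makes exact counts so hard to obtain manually.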

AUTOMATED DEBUG ANALYSIS
Recently, ATPG tools have been improved to automatically identify issues that affect test coverage and quantify just how much each issue affects the coverage. The most common method to display this information is through a modified version of the traditional statistics report that you can access in the command line mode of ATPG tools. Mentor Graphics’ ATPG tools FastScan and TestKompress are used as an example here to demonstrate what’s available for automated analysis of low test coverage.

Without any additional ATPG tool runs or any of the manual debug steps, the new statistics report automatically provides details about coverage issues (Fig. 2). Note the list of the total number of uncollapsed faults in the design, which is then broken down into various ATPG categories (Fig. 2, arrow #1). The percentage listed within the parentheses is based on the total number of faults.

The next important area of the report is the test coverage achieved by the patterns generated (Fig. 2, arrow #2). In this case, the coverage is 83.67%, which may not be acceptable. If that test coverage is unacceptable, the next place to look is the line in the statistics report that indicates the number of atpg_untestable or AU (Fig. 2, arrow #3). This line points out that 57,563 faults (or 14.56% of the total number of faults) are AU.

Up to this point, the information is very typical of what you would find in a traditional report. Moving down to the “Untested Faults” section (Fig. 2, arrow #4), you can now get a detailed breakdown of which AU categories have a significant effect on test-coverage loss. The first significant category of test coverage loss is TC or tied cells (Fig. 2, arrow #5). This category of AU faults accounts for 4.46% of the total number of faults. In this case, “tied cells” refers to registers that are tied to a particular state as a result of the ATPG tool having performed DRCs and simulated an initialization or “test_setup” procedure.

The report also lists the most significant individual tied cells (as well as the state to which they are tied), so that you may evaluate the severity of effect on test coverage at a fine level of detail. A quick review of the instance path names of these tied cells suggests that it’s all test-related logic (boundary scan and MBIST). Although you must still perform additional manual analysis to determine if this category of AU faults can be reduced, this report gives a clear indication of where to look in the design. If it is determined that nothing can be corrected because the test mode requires this logic to be tied, then at least you will be able to explain why 4.46% of the faults will remain untestable.

The next significant category of AU faults is FP or “false_path” faults (Fig. 2, arrow #6). This transition-fault pattern set includes a definition of false paths so the coverage will be lower. From this report, you can see that 5.37% of the faults cannot be tested because of the false path definitions. Many test engineers believe that test coverage should not be penalized as a result of false paths because they are functionally false paths that, by definition, cannot be tested at-speed.

A relatively significant number of multicycle-path faults (1.01%) hurt the test coverage (Fig. 2, arrow #7). Given this information, you may choose to address these faults by targeting them with another pattern set using a clock cycle that will exercise them at a lower frequency. There is no guarantee that all of these faults will be detected at a different frequency because other issues may prevent detection. What the report tells you is that these faults definitely cannot be tested for the reason listed. This is true for all the categories.

The SEQ (sequential_depth) category (Fig. 2, arrow #8) refers to faults that cannot be detected because the sequential depth of the ATPG tool has not been set high enough. This implies that there may be some non-scan logic or memories that require an increased sequential depth to propagate and detect faults. You can affect this number by changing some of the settings during pattern generation.

Right after the SEQ category is another category called “Unclassified.” This is a group of faults that does not fall into any of the pre-defined categories that the ATPG tool can determine. They are faults that traditional statistics reports would normally indicate as AU—there’s just no additional detailed analysis available to determine why they are AU. These faults will require manual analysis.

I previously mentioned that many test engineers do not believe false path faults should be included in the calculation of test coverage while others do. To satisfy these differing requirements, a new column of information called “total relevant” has been added to the statistics report (Fig. 2, arrow #9).

Faults that were not considered relevant were deleted, which resulted in the lower number of total faults (374,238) as compared to the total number of faults in the neighboring column (395,480). How can you tell which faults were deleted from the relevant coverage calculation? If you trace down the “Total Relevant” column of information, you will eventually see the word “deleted” corresponding to the false-path category. This means that the 21,242 false-path faults were deleted from the total relevant faults, and the coverage was recalculated. The relevant coverage was 88.46% as compared to 83.67% (Fig. 2, arrow #10). You can see both coverage numbers side by side and determine which one should be used.
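The counts quoted from the report are easy to cross-check. The short sketch below uses only the fault totals given in the text; the variable names are my own, and the detected and untestable counts are not reproduced here because the text does not state them.

```python
# Fault counts quoted from the example statistics report.
total_faults = 395_480
false_path_faults = 21_242
au_faults = 57_563

# Deleting the false-path faults gives the "total relevant" fault count.
total_relevant = total_faults - false_path_faults
print(total_relevant)  # 374238

# Percentages in the report are expressed against the total fault count.
print(round(100 * false_path_faults / total_faults, 2))  # 5.37
print(round(100 * au_faults / total_faults, 2))          # 14.56
```

Removing faults that could never be detected shrinks the denominator of the coverage calculation without changing the number of detected faults, which is why the relevant coverage comes out higher than the standard test coverage.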

Another way to slice the coverage information is to view it with respect to the clock domains (Fig. 2, arrow #11). The next column to the right indicates what percentage of the total number of faults is covered by that clock domain (e.g., 58.71% of the faults in the design are in the clk1 clock domain).

The next column indicates the test coverage of that clock domain’s fault population. In this case, 94.88% of the clk1 faults were detected. The point in listing both the percentage of total faults and percentage coverage of each clock domain is so that you can investigate low coverage for clock domains that represent a significant percentage of the design. Additional reporting capability is available so that a detailed analysis of the AU faults can be shown for the fault universe of each individual clock domain.
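One way to act on these two columns is to rank clock domains by how much each one is likely to pull down the overall number. The sketch below multiplies each domain's share of the faults by its coverage shortfall. Only the clk1 figures come from the report; the clk2 and clk3 entries are hypothetical, and the product is a rough prioritization aid rather than an exact accounting of coverage loss.

```python
# (share of total faults in %, test coverage of that domain in %)
domains = {
    "clk1": (58.71, 94.88),  # figures quoted from the example report
    "clk2": (30.00, 70.00),  # hypothetical
    "clk3": (11.29, 85.00),  # hypothetical
}


def rank_by_shortfall(domains):
    """Sort domains by approximate percentage points of coverage they cost."""
    scored = [
        (name, share * (100.0 - coverage) / 100.0)
        for name, (share, coverage) in domains.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for name, points in rank_by_shortfall(domains):
    print(f"{name}: ~{points:.2f} points")
# clk2: ~9.00 points
# clk1: ~3.01 points
# clk3: ~1.69 points
```

In this made-up case, clk2 deserves attention first even though clk1 contains far more faults, because its lower coverage outweighs its smaller share.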

Some tools provide more graphical means of viewing this information relative to design hierarchies as well as the design’s clock domains. In addition to the traditional statistics report viewed on the tool’s command line, you can look at the coverage analysis graphically. An example shows how the AU analysis categories can be displayed relative to the design hierarchy (Fig. 3, top left panel). The bottom panel displays the same statistics report as shown on the command line, but design instances are hyperlinked so that you can bring up the schematic view of that instance (Fig. 3, top right panel). You can also overlay the fault category information on the schematic view. The example shown here is the same one discussed earlier in which boundary-scan logic is tied because of the initialization procedure, which resulted in a loss of 0.24% test coverage.

The additional information provided in detailed statistics reports like this provides valuable insight into how to identify and address potential test coverage issues. Debug automation in an ATPG tool means that the most significant test-coverage issues are quickly highlighted along with the effect on coverage. In many cases (such as pin constraints and tied cells), you will be able to immediately determine that nothing can be done to fix the issue and you can easily determine what the test-coverage ceiling will be.

Further automation within the ATPG tool eliminates significant manual effort and debug time required to sift through an otherwise nonsensical listing of untestable faults. As a result, you are freed to focus on the task of resolving the identified problems.