Benefits and Costs of Overstress Testing

Nov. 1, 2003


Overstress testing, which applies stresses beyond the design limits of the product, is successful at uncovering faults in both the product design and the manufacturing process, and it helps ensure the overall robustness of the product. The benefits of overstress testing include the following:

Early detection of latent faults.
Reduced total engineering time and cost.
Reduced production and warranty costs.
Earlier introduction of mature products with stabilized yields.
Higher reliability.
Greatly reduced manufacturing screening costs.

Overstress tests traditionally have been divided into the highly accelerated life test (HALT)¹,² and the highly accelerated stress screen (HASS). In this article, the test results and the conclusions drawn from them are based on my experience at Tandem Computers, Compaq, and Cisco Systems. Products from these companies can be characterized as high-complexity, high-bandwidth electronic products primarily for enterprise customers.

HALT

HALT is an engineering step-stress-to-fail destruct test in which environmental and functional stresses are gradually increased to find design weaknesses and determine the operational limits of the unit under test (UUT). HALT is not a compliance test and is not limited by component or product specifications. In the case of systems, HALT will find defects in the weakest subsystem.

HASS

HASS is a nondestructive manufacturing screen or process. Its purpose is to fail defective products that would most likely fail early in the field while passing good products. Failing units are subjected to root-cause analysis and corrective action.

HALT is traditionally performed first to expose persistent design problems and establish operating margins. Once this has been done, why is further overstress testing needed? The answer lies in the statistics of the problem.

The HALT column in Table 1 (see below) shows actual product failures in Tandem fault-tolerant computers in the late 1990s.³ After first customer shipment (FCS), and after the testing and corrective actions taken to assure reliability, we knew that the original prototypes had contained faults at the rates listed in the HALT column.

Table 1. Product Fault Summary

Fault               HALT*   HASS**   HASS***
Bad SRAM            43%     89%      100%
Scache              15%     48%      100%
Cracked CPU Chip    15%     48%      100%
Microcode           2%      8%       87%

* = One Unit
** = Proof of Screen, Four Units, 100 Cycles Each
*** = Proto Screen, 100 Units

For starters, one product had persistent SRAM problems that appeared in 43% of the pilot production. Another had Scache faults in 15% of the pilot production. The probabilities of detecting these faults with the two variations of HASS are listed in Table 1. With a single HALT unit, the microcode fault had only a 2% chance of appearing, compared with 87% for the 100-unit prototype screen; a program that includes HASS therefore will produce higher-quality products. See Reference 3 for details.
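The HASS columns of Table 1 appear consistent with a simple binomial model: if a fault is present in a fraction p of units and units are screened independently, the chance that at least one of n screened units shows the fault is 1 − (1 − p)^n. With n = 4 (proof of screen) and n = 100 (prototype screen), this reproduces the tabulated percentages after rounding. The sketch below is an illustration under two stated assumptions (independent units, and every faulty unit fails the screen); the function name is hypothetical, and Reference 3 remains the source for the actual analysis.

```python
# Probability that a fault present in a fraction p of units shows up in at
# least one of n screened units. Assumes units are independent and that
# every faulty unit is caught by the screen (a simplifying assumption).
def detection_probability(p: float, n: int) -> float:
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p) ** n


# Per-unit fault rates taken from the HALT column of Table 1.
FAULT_RATES = {
    "Bad SRAM": 0.43,
    "Scache": 0.15,
    "Cracked CPU Chip": 0.15,
    "Microcode": 0.02,
}

if __name__ == "__main__":
    for fault, p in FAULT_RATES.items():
        four_units = detection_probability(p, 4)       # proof of screen
        hundred_units = detection_probability(p, 100)  # prototype screen
        print(f"{fault:<18} {p:>4.0%} {four_units:>6.0%} {hundred_units:>6.0%}")
```

Running it prints one row per fault whose rounded values match Table 1. The model also shows why a single HALT unit rarely exposes a low-rate fault: for p = 0.02, it takes roughly 100 units before detection becomes likely.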

HASS Overview

Several stresses, applied alone or in combination, can identify product weaknesses. When available, each of these stresses should be used in the HASS test design.

Temperature

Operation under prolonged elevated temperatures can uncover marginal design, bad components, or process problems inherent in the product. Temperature cycling detects weak solder joints, lack of IC package integrity, thermal coefficient of expansion (TCE) mismatch, and PCB processing and mounting problems, all of which will show up over time once a product is in the field.
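The TCE-mismatch mechanism can be made concrete with a first-order estimate that is a textbook approximation rather than anything from this article: when a component and its board expand at different rates, a joint a distance L from the component's neutral point is displaced by roughly Δα·ΔT·L on every temperature swing, and a weak joint accumulates damage with each repetition. The material values in the example are assumed, order-of-magnitude figures chosen for illustration, and the function name is hypothetical.

```python
# First-order estimate of the relative displacement a temperature swing
# imposes on a joint between a component and the board it is mounted on.
# Assumptions: linear thermal expansion, no board warpage, and the joint
# sits length_mm away from the component's neutral (zero-displacement) point.
def mismatch_displacement_um(board_cte_ppm: float, part_cte_ppm: float,
                             delta_t_c: float, length_mm: float) -> float:
    delta_alpha = abs(board_cte_ppm - part_cte_ppm) * 1e-6  # per degC
    displacement_mm = delta_alpha * delta_t_c * length_mm
    return displacement_mm * 1000.0  # mm -> micrometres


if __name__ == "__main__":
    # Illustrative values only: a board at about 16 ppm/degC against a
    # ceramic package at about 6.5 ppm/degC, a 100 degC swing, joint 10 mm out.
    print(f"{mismatch_displacement_um(16.0, 6.5, 100.0, 10.0):.1f} um per cycle")
```

Because the displacement scales with both ΔT and L, wider temperature swings and larger packages stress the outermost joints hardest, which is one reason temperature cycling is effective at exposing marginal solder.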

Vibration

Vibration is useful for detecting poor solder conditions and for testing a product’s robustness during shipping. Cold or insufficient solder connections can be stressed to failure with vibration levels that will not harm good connections.

Voltage Margining

Voltage margining can be useful in identifying marginal components and marginal design, especially when used in conjunction with temperature.

Frequency Margining

Frequency margining is not always an option. But if the circuit under test allows for it, it can be useful in detecting marginal components.

Functional Stresses

The most abstract and sophisticated aspect of HASS design is the functional test of the product. The goal is to have the product endure a combination of environmental and electrical stresses while operating at peak processor utilization. The functional test should simulate these worst-case real-world conditions as accurately as possible to ensure that no product functionality goes untested.

HASS Profile

These five stresses must then be combined into a HASS test profile. No industry-standard HASS profile exists; instead, each profile must be tailored to the needs of the product. The requirements for a successful HASS profile are as follows:

Positive Return on Investment (ROI)

The benefits of HASS must exceed the costs of the testing procedure.

Benefits

The direct benefits include improved reliability; a reduced rate of return material authorizations (RMAs), or field returns of failed products; a screen that fails bad boards in HASS while passing good ones; and reduced RMA costs. The indirect benefits are a company reputation for quality and increased orders.

Costs

The price of HASS includes capital expenditures for test equipment, operating expenses, and repair of failed products. All of the technical decisions are secondary and must be driven by the overriding consideration of the ROI of the HASS project.
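The ROI requirement amounts to netting itemized benefits against itemized costs. The sketch below assumes ROI is defined as (benefits − costs) / costs over a common period; the cost keys mirror the items named above, while every dollar figure and the function name are hypothetical. Indirect benefits such as reputation are left out because they are hard to quantify.

```python
def hass_roi(benefits: dict[str, float], costs: dict[str, float]) -> float:
    """Net return per unit of HASS spending: (benefits - costs) / costs."""
    total_cost = sum(costs.values())
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (sum(benefits.values()) - total_cost) / total_cost


if __name__ == "__main__":
    # Hypothetical annual figures; only directly measurable items are listed.
    benefits = {"reduced RMA costs": 500_000.0}
    costs = {
        "capital equipment (amortized)": 150_000.0,
        "operating expenses": 100_000.0,
        "repair of failed products": 50_000.0,
    }
    roi = hass_roi(benefits, costs)
    print(f"ROI = {roi:.0%} ({'positive' if roi > 0 else 'not positive'})")
```

Keeping benefits and costs as named items makes it easy to add or challenge individual entries, which matters when supporters and detractors of a HASS program weigh different items.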

At Tandem, a typical HASS profile was determined from HALT results, such as the profile shown in Figure 1. Here, the temperature dwells were determined by the time required for components to stabilize. Vibration was gated on and off in an attempt to maximize the efficiency of stress combinations during the temperature ramps. Such a HASS profile can be quite aggressive when compared to the product’s design specifications.

At Cisco, a single HASS profile was used for essentially all products. Temperature dwells were determined by the time required for diagnostics to complete, temperature magnitudes were set at 10°C beyond product specifications, and vibration was not used due to the difficulty of getting good transmission through the product cabinet to the target boards.
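The Cisco rules above can be captured in a small profile builder. This is a hypothetical sketch: the function, its parameters, and the ramp time are illustrative assumptions, and dwell_min simply stands in for however long the diagnostics take. The optional vibrate_on_ramps flag only loosely mimics Tandem’s practice of gating vibration during temperature ramps; it does not reproduce the Tandem profile in Figure 1.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str         # "ramp" or "dwell"
    target_c: float   # chamber setpoint at the end of the segment
    minutes: float
    vibration: bool


def build_profile(spec_low_c: float, spec_high_c: float, dwell_min: float,
                  ramp_min: float, cycles: int, margin_c: float = 10.0,
                  vibrate_on_ramps: bool = False) -> list[Segment]:
    """Hypothetical HASS profile builder.

    Cisco-style defaults: temperature limits sit margin_c beyond the product
    spec, dwell_min is the diagnostics run time, and no vibration is applied.
    vibrate_on_ramps=True turns vibration on during ramps only.
    """
    cold, hot = spec_low_c - margin_c, spec_high_c + margin_c
    profile: list[Segment] = []
    for _ in range(cycles):
        for target in (hot, cold):
            profile.append(Segment("ramp", target, ramp_min, vibrate_on_ramps))
            profile.append(Segment("dwell", target, dwell_min, False))
    return profile


if __name__ == "__main__":
    steps = build_profile(0.0, 40.0, dwell_min=20.0, ramp_min=15.0, cycles=2)
    for s in steps:
        print(s)
    print("total minutes:", sum(s.minutes for s in steps))
```

For a product specified for 0°C to 40°C, this example yields limits of −10°C and 50°C and 140 minutes of chamber time across two cycles.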

HASS Program Considerations

At the IEEE workshop on overstress testing, I conducted a survey of the perceived reasons for doing or not doing HASS. The results, summarized in Table 2 and Table 3 (see below), are somewhat reminiscent of the tastes-great/less-filling advertisement: supporters concentrate on the benefits, and detractors focus on the costs.⁴ As pointed out in Reference 4, a framework is needed that takes into account all of the issues listed in Tables 2 and 3, assigns a cost or benefit to each, and nets out the ROI.

Table 2. Reasons for HASS

Key  Stated Reason                           Comments
a    Increased reliability/quality           Hard to measure, hard to quantify benefits; compare to n
b    Sales advantage/customer satisfaction   Same as a, but more difficult to quantify
c    Reduce field service costs              Equivalent to a
d    Reduce DOA/early-life fails             Equivalent to a
e    Identify failure modes in-house         Benefits seen only by redesigning to avoid failure modes
f    Better product                          Equivalent to a
g    Reduced field returns                   Equivalent to a

All of the variables affecting the success of the HASS program were ranked in order of importance. By far, the dominant factor was the improvement in reliability obtained as a result of HASS testing.

Table 3. Reasons Against HASS

Key  Stated Reason                                Comments
k    Additional cost                              Virtually all opposition is cost based; easier to measure than benefits
l    Outside of component specs, design limits    Equivalent to n
m    Additional WIP time                          Additional step, assuming all else equal; part of k
n    Decreases manufacturing yields               Easy to measure, easy to quantify; compare to a
o    Afraid of damaging good product              See comments on a; effect on reliability is uncertain
p    Seen as critical of known good process       Known good implies either improved reliability is of no benefit or the screen is no good

HASS can be performed at many points in the product life cycle. Figure 2 (see below) shows typical fail rates for HASS when performed on prototypes, production verification runs, production, and field returns. Companies normally emphasize production HASS, yet the other three possibilities represent greater opportunities for a large ROI. HASS tests on prototype units, process verification units, and field returns provide an excellent entry point for companies not yet doing HASS.

Figure 2. Typical HASS Failure Rates at Various Times in the Product Life Cycle

Determinants of HASS Success

For a HASS test to be successful, the closed-loop corrective-action process must be followed. This may be either the short-loop process of failure analysis and repair of the failed unit (typically hours to days) or a long-loop process involving a design change (typically months). Results are then fed back either to design, to make a circuit change or select a different component supplier, or to manufacturing, for a process change.

Specific corporate requirements for HASS success are the following:

Management: Management must allocate sufficient resources, time, and funds for HASS testing to take place. Management also must provide support during the failure analysis phase so that closed-loop corrective action happens in a timely fashion.
Development/Engineering: Design engineering must be immediately available to troubleshoot failures that are sometimes beyond the abilities of the HASS test engineers.
Suppliers: Suppliers must be willing and able to provide component failure analysis to determine root cause.

What can be expected from a HASS program in terms of improved product reliability? It depends on the product. The most useful information would be actual or predicted numbers for field reliability without HASS and the yield at final test.⁵

As an example, for high-end electronics such as those forming the basis for this article, field return rates typically are 1% to 5% per year, and final-test fail rates fall in the same range. If the candidate for HASS is at the upper limit of these numbers, HASS can produce improvements as large as 50%. If the candidate product has a field return rate and a final-test fail rate of 1%, then cost-effective HASS improvements are unlikely.
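The arithmetic behind this example is straightforward: a 50% improvement on a 5% annual return rate leaves 2.5%. The sketch below turns rates into return counts for a hypothetical volume of 10,000 units per year and, purely for comparison, applies the same 50% improvement to the 1% case. It assumes the cost of running the screen does not shrink with the return rate, so a five-fold drop in returns avoided makes that cost five times harder to recover. The function name and the volume are illustrative assumptions.

```python
def returns_after_hass(units_per_year: int, return_rate: float,
                       improvement: float) -> tuple[float, float]:
    """Return (field returns per year after HASS, returns avoided per year).

    improvement is the fractional reduction in the field-return rate that
    HASS is assumed to deliver (0.5 means the rate is halved).
    """
    if not (0.0 <= return_rate <= 1.0 and 0.0 <= improvement <= 1.0):
        raise ValueError("rates must lie between 0 and 1")
    baseline = units_per_year * return_rate
    avoided = baseline * improvement
    return baseline - avoided, avoided


if __name__ == "__main__":
    units = 10_000  # hypothetical annual volume
    for rate in (0.05, 0.01):
        remaining, avoided = returns_after_hass(units, rate, 0.5)
        print(f"return rate {rate:.0%}: {remaining:.0f} returns/yr remain, "
              f"{avoided:.0f} avoided")
```

Multiplying the returns avoided by a cost per field return gives the "reduced RMA costs" benefit that enters the ROI calculation.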

Conclusions

A well-designed HASS program offers several benefits. The HASS process is adept at finding process and component faults, normally after HALT has determined operating margins, and at improving reliability. The costs of HASS can be justified in terms of HASS program benefits, and HASS decisions must be made on the basis of ROI.

Acknowledgments

The test data serving as the basis for this paper was obtained at Tandem Computers and the Cisco HALT Laboratories in Cupertino and San Jose, CA. For this data, I am indebted to the efforts of Nahum Meadowsong, Todd Broome, Hung La, Sheryl Spake, and Ken Miu at Cisco and Cheryl Ascarrunz, Mark Roettgering, Paul McQuiddy, Phung Truong, and Lee Kulhanek at Tandem. Field data was obtained with the help of Henry Bertram at Cisco and Michael Hirshhoff at Tandem. QualMark has been supportive and helpful throughout the four-year history of overstress testing at Cisco.

References

1. Meadowsong, N. and Kyser, E.L., “HALT—Benefits and Costs of Overstress Testing,” EE-Evaluation Engineering, October 2003, pp. 62-67.
2. Hnatek, E.R. and Kyser, E.L., “Straight Talk About Accelerated Stress Testing—Lessons Learned,” Proceedings of Future Circuits International, 2000.
3. Ascarrunz, C. and Kyser, E.L., “HALT and ESS for Quick-to-Market Scenarios,” NEPCON, 1998.
4. Kyser, E.L., Hnatek, E.R., and Roettgering, M.H., “The Politics of ESS,” IEST 2000 Proceedings, 2000.
5. Kyser, E.L., “The Final Test,” IEEE CPMT Workshop on Accelerated Stress Test, October 2000.

About the Author

Dr. Edmond L. Kyser is a consultant for electronic hardware reliability issues. He has directed HALT and HASS programs at Cisco Systems, Compaq, and Tandem Computers. Dr. Kyser, the author of several articles on accelerated stress test, holds seven U.S. patents. His Ph.D. is from U.C. Berkeley in applied mechanics. 650-960-0138


Published by EE-Evaluation Engineering
All contents © 2003 Nelson Publishing Inc.
No reprint, distribution, or reuse in any medium is permitted
without the express written consent of the publisher.