The partnership of design and manufacturing is central to the process of bringing a product to market. The impact of problems in either of these areas can increase exponentially if they go unnoticed until after the product reaches the customer.
Overstress testing, which applies stresses beyond the design limits of the product, is effective at uncovering such faults in both the product design and the manufacturing process, and it helps ensure the overall robustness of the product. The benefits of overstress testing include the following:
• Rapid design and process maturation.
• Less total engineering time and cost.
• Reduced production and warranty costs.
• Earlier introduction of a more mature product (stabilized yields).
• Higher mean time between failures (MTBF).
• Reduced manufacturing screening costs.
• Faster corrective action for design and process problems.
• Satisfied customers.
Highly accelerated life test (HALT) is a step-stress-to-fail destruct test that gradually increases the environmental stresses to determine the operational limits and find any design faults. The process is one of test, fail, and corrective action to prevent possible field failures.
HALT is not a compliance test and is not limited by component or product specifications. All products are candidates for HALT.
In the case of systems, HALT finds defects in the weakest subsystem. Production should be delayed until HALT results are satisfactory.
The following stresses, both alone and combined, can identify product weaknesses. When available, each of these stresses should be used in HALT design.
Operation under prolonged elevated temperatures can uncover marginal design, bad components, or process problems inherent in the product. Temperature cycling detects weak solder joints, IC package integrity, a temperature coefficient of expansion (TCE) mismatch, and PCB processing and mounting problems, all of which will show up over time once a product is in the field.
Vibration is useful for exposing poor solder connections and for testing a product’s robustness during shipping. Cold and insufficient solder joints can be stressed to failure with vibration levels that will not harm good connections.
Voltage margining can be useful in identifying marginal components and marginal design, especially when used in conjunction with temperature.
Frequency margining is not always an option, but if the circuit under test allows for it, it can be useful in identifying marginal components.
The most abstract and sophisticated aspect of the HALT design is functional test. The product must endure the combination of environmental and electrical stress while operating at peak processor utilization and bandwidth. Functional test should simulate this worst-case real world as accurately as possible to ensure that no product functionality goes untested.
There is no industry-standard HALT profile; it should be tailored to the needs of the program. Our experience with testing telecommunications products at Cisco has led to the design of the following three-phase profile.
In the hot step, begin with two hours at each temperature step, one hour with a high-voltage margin and one hour with a low-voltage margin (Figure 1a). Power cycle after each temperature step. The first step is 60°C, and the last meaningful step is 90°C. The temperature may continue to be stepped beyond that point, but failures become less and less meaningful; the point of diminishing returns is around 90°C.
The cold step is similar to the hot step (Figure 1b). The first step is at -10°C, and the last meaningful step is -40°C.
Vibration is stepped 5 grms/h until the maximum capabilities of the vibration table are reached (Figure 1c). For the QualMark OVS4 Chambers, the value is 60 grms. Accelerometers should be used to determine the amount of vibration incident on the UUT. Generally, and especially when the UUT is part of a larger system, only a fraction of the table vibration is transmitted to the test subject.
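The three-phase profile described above can be written down as a simple step schedule. The following is a minimal sketch, not the profile used in the Cisco lab: the 10°C temperature increment and the 5 grms starting level are assumptions, since the text gives only the first and last meaningful steps and the step rate, and the names `Step`, `thermal_steps`, `vibration_steps`, and `halt_profile` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    phase: str    # "hot", "cold", or "vibration"
    level: float  # degrees C for thermal phases, grms for vibration
    hours: float  # dwell time at this level


def thermal_steps(phase, first_c, last_c, step_c=10.0, hours=2.0):
    """Temperature steps from first_c to last_c inclusive.

    step_c is an ASSUMED increment; the article gives only the first and
    last meaningful temperatures. Each 2 h dwell is split into 1 h at the
    high-voltage margin and 1 h at the low-voltage margin, followed by a
    power cycle.
    """
    sign = 1 if last_c >= first_c else -1
    count = int(abs(last_c - first_c) // step_c) + 1
    return [Step(phase, first_c + sign * i * step_c, hours) for i in range(count)]


def vibration_steps(max_grms=60.0, step_grms=5.0):
    """Vibration raised by step_grms each hour up to the table maximum.

    Starting at step_grms is an ASSUMPTION; 60 grms is the limit the
    article quotes for the QualMark OVS4 chamber.
    """
    count = int(max_grms // step_grms)
    return [Step("vibration", step_grms * (i + 1), 1.0) for i in range(count)]


def halt_profile():
    """Hot step, then cold step, then vibration, in the order described."""
    return (
        thermal_steps("hot", 60.0, 90.0)
        + thermal_steps("cold", -10.0, -40.0)
        + vibration_steps()
    )


if __name__ == "__main__":
    for step in halt_profile():
        unit = "grms" if step.phase == "vibration" else "C"
        print(f"{step.phase:9s} {step.level:6.1f} {unit:4s} {step.hours:.0f} h")
```

Keeping the schedule as data makes it easy to tailor the increments and limits to a particular program, which is what the absence of an industry-standard profile calls for.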
Determinants of HALT Success
For HALT to be successful, the closed-loop corrective-action process must be followed. If failure analysis is not carried through to root cause, the benefits of HALT are lost. To be effective, the results must be:
• Fed back to design to make a circuit change, select a different supplier, or improve the existing supplier’s process.
• Fed back to manufacturing for a process change.
• Used to determine the production test profile.
Another key for success is the intimate involvement by members of several departments within the company.
Management must allocate sufficient resources, time, and funds for HALT to take place. Support must be provided during the failure-analysis phase to get closed-loop corrective action in a timely fashion.
Design engineering must be immediately available to troubleshoot failures that sometimes can be beyond the scope of the HALT test engineers.
Suppliers must be willing and able to provide component failure analysis to obtain root cause.
Other key factors for success include the placement of HALT in the product development timeline, sample size, the perceived relevance of failures, and failure reporting. HALT should begin as soon as the hardware and software are available and stable.
Test as many units as possible because the probability of uncovering a defect increases with sample size. Failures that occur during testing should be treated as relevant and pursued to root cause. Finally, failure reporting must be visible enough so the failures and solutions gleaned from HALT are not overlooked.
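The effect of sample size can be illustrated with a simple probability model. This sketch assumes that each unit independently exhibits a given defect under HALT with the same probability; that is a simplification introduced here for illustration, not a model taken from the test data, and the function name is hypothetical.

```python
def detection_probability(defect_rate, sample_size):
    """Chance that at least one of sample_size units exposes a defect.

    ASSUMPTION: each unit shows the defect independently with probability
    defect_rate, so the chance that all units miss it is
    (1 - defect_rate) ** sample_size.
    """
    if not 0.0 <= defect_rate <= 1.0:
        raise ValueError("defect_rate must be between 0 and 1")
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    return 1.0 - (1.0 - defect_rate) ** sample_size


if __name__ == "__main__":
    # Hypothetical defect present in 20% of units.
    for units in (1, 2, 5, 10):
        print(units, round(detection_probability(0.2, units), 3))
```

Under this model the gain from each additional unit shrinks as the sample grows, which has to be weighed against the cost of destroying more prototypes.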
MTBR Prediction Using HALT Margin
Can HALT be used to predict a product’s mean time between returns (MTBR)? If so, then we are in a good position to estimate the benefits and cost-effectiveness of HALT.
The data that follows was collected from multiple HALT procedures of similar Cisco telecommunications products. MTBR and return material authorization (RMA) require that this data be tracked for a year past the completion of HALT.
Figure 2 shows the correlation between MTBR, which measures the actual field performance of the product, and the HALT margin. In this data, the HALT margin is the smallest margin in °C between the operating specifications and any HALT failure. Vibration failures are not considered.
The correlation is strong and intuitive. MTBR can be predicted based on the following least-squares fit:
MTBR = [(0.0131)(HM) + (0.0876)] (NF)
where: HM = HALT margin for the product
NF = normalization factor used for Figure 2
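The least-squares fit translates directly into a small helper. This is a sketch only: the value of the normalization factor NF is not given in the text, so it defaults to 1.0 here and the result is then a normalized MTBR, and the function name is hypothetical.

```python
def predict_mtbr(halt_margin_c, normalization_factor=1.0):
    """MTBR from the fit MTBR = [(0.0131)(HM) + (0.0876)](NF).

    halt_margin_c is the HALT margin HM in degrees C. The default
    normalization_factor of 1.0 is a placeholder (the article does not
    state NF), so the default result is a normalized value.
    """
    return (0.0131 * halt_margin_c + 0.0876) * normalization_factor


if __name__ == "__main__":
    for margin in (10, 20, 30):
        print(margin, round(predict_mtbr(margin), 4))
```

Because the fit is linear, each additional degree of margin adds the same 0.0131 to the normalized MTBR. The coefficients come from similar telecommunications products, so the fit should not be assumed to carry over to other product families.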
Having obtained the relationship between the HALT margin and MTBR, we can assess the cost-effectiveness of HALT. For HALT to be cost-effective, the cost must be less than the anticipated benefits.
HALT requires destruction of at least one prototype at a critical stage, and prototype build is a leading cost item in product development. Also, there are costs such as the manpower required to conduct HALT, the depreciation of test equipment, consumable costs, and corrective-action costs.
Cost Justification: Improve Reliability
As can be seen in Figure 3, if the operating margin is increased by n°C, the normalized RMA rate is reduced by 0.0192n. Also, the cost of each RMA approximates the cost of producing the board, which is termed the whole product cost (WPC) in dollars.
The benefit of a HALT test can be calculated as:
Benefit = (# of RMAs prevented)(cost of an RMA)
= n(0.0192)(RMA intercept)(Pvol)(WPC)$
where: Pvol = the annual production volume
The cost of a HALT test is
Cost = (WPC + Esalary + DEP + CON + CA)$
where: Esalary = fully burdened weekly salary
DEP = weekly depreciation of equipment
CON = consumable costs
CA = corrective action costs
The break-even point is where costs equal benefits. For HALT to be cost-effective, it must, on average, increase the operating margin n°C. Setting costs equal to benefits, we can solve for the margin n necessary to justify the cost of HALT:
n = (WPC + Esalary + DEP + CON + CA) / [(0.0192)(RMA intercept)(Pvol)(WPC)]
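The break-even calculation can be sketched by setting the cost expression equal to the benefit expression and solving for n. All dollar figures, the RMA intercept, and the production volume in the usage example are hypothetical placeholders, not values from the Cisco data, and the function names are introduced here for illustration.

```python
RMA_SLOPE = 0.0192  # reduction in normalized RMA rate per degree C (Figure 3)


def halt_cost(wpc, e_salary, dep, con, ca):
    """Cost = WPC + Esalary + DEP + CON + CA, in dollars."""
    return wpc + e_salary + dep + con + ca


def halt_benefit(n_deg_c, rma_intercept, p_vol, wpc):
    """Benefit = n(0.0192)(RMA intercept)(Pvol)(WPC), in dollars."""
    return n_deg_c * RMA_SLOPE * rma_intercept * p_vol * wpc


def break_even_margin(wpc, e_salary, dep, con, ca, rma_intercept, p_vol):
    """Margin improvement n (degrees C) at which benefit equals cost."""
    denominator = RMA_SLOPE * rma_intercept * p_vol * wpc
    if denominator <= 0:
        raise ValueError("RMA intercept, volume, and WPC must be positive")
    return halt_cost(wpc, e_salary, dep, con, ca) / denominator


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the formula.
    n = break_even_margin(
        wpc=5000.0, e_salary=3000.0, dep=1000.0, con=500.0, ca=10000.0,
        rma_intercept=0.05, p_vol=10000,
    )
    print(f"break-even margin improvement: {n:.2f} C")
```

Because the annual production volume appears in the denominator, the margin improvement needed to justify HALT falls as volume rises.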
Additional Potential HALT Benefits
Despite its name, HALT is not truly a highly accelerated life test. In fact, it is not a life test at all. There is no appropriate acceleration algorithm or acceleration factor. The deliverables of HALT are operating margin and failure modes. However, the operating margin is an indicator of field performance. Low margins indicate poor performance (short life), and high margins indicate good performance (long life).
A HALT program with a few tests under its belt does not have the data necessary to correlate HALT performance to field performance. That is not to say that a correlation does not exist. Quite the contrary.
A seasoned HALT program that has conducted multiple tests on similar products could possibly predict MTBR more accurately than the predictors currently in use, such as those based on component count and individual component field data. Current predictors do not factor any aspects of the actual product design into their calculations, whereas the HALT margin is specific to the product under test.
Consider the comparison of a reliability determination test (RDT) using traditional methods and RMA as predicted by the HALT margin in Figure 4. It shows the required test time for an 80% confidence level in an MTBF prediction greater than 75,000 hours. The blue line indicates that RDT requires 40 boards tested for 10 weeks, assuming one failure and Arrhenius acceleration due to a 50°C test temperature. RDT is a very time-consuming test and needs equipment and labor that otherwise would be used for production.
The same prediction using the HALT margin requires as little as one board for one week. In Figure 3, which plots the RMA rate against the HALT margin, the dashed line shows that an operating margin of 30°C indicates, with 80% confidence, a normalized RMA rate below 0.55. The traditional MTBF predictor may be improved by factoring the HALT margin into the MTBF calculation.
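The scale of the RDT effort can be sanity-checked with the standard time-terminated demonstration-test relation, in which the required device-hours equal the target MTBF multiplied by a factor that depends on the confidence level and the number of failures allowed. The sketch below assumes a constant failure rate (exponential model); this is a textbook method brought in for illustration, not necessarily the calculation behind Figure 4, and the function names are hypothetical.

```python
import math


def poisson_cdf(k, mean):
    """P(N <= k) for a Poisson count with the given mean."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))


def demonstration_factor(failures, confidence):
    """Multiple of the target MTBF needed in device-hours.

    Finds m such that P(N <= failures | mean m) = 1 - confidence, which
    equals half the chi-square quantile with 2 * (failures + 1) degrees
    of freedom. Solved by bisection; the Poisson CDF decreases with m.
    """
    lo, hi = 0.0, 100.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if poisson_cdf(failures, mid) > 1.0 - confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def required_device_hours(mtbf_hours, failures, confidence):
    """Accumulated (accelerated) device-hours needed for the demonstration."""
    return mtbf_hours * demonstration_factor(failures, confidence)


if __name__ == "__main__":
    needed = required_device_hours(75_000, failures=1, confidence=0.80)
    on_test = 40 * 10 * 7 * 24  # 40 boards for 10 weeks, in clock hours
    print(f"device-hours needed:  {needed:,.0f}")
    print(f"clock device-hours:   {on_test:,}")
    print(f"implied acceleration: {needed / on_test:.2f}")
```

Under these assumptions, 40 boards for 10 weeks accumulate 67,200 clock device-hours, so the 50°C test temperature would have to supply an acceleration factor of roughly 3.3 to demonstrate 75,000 hours at 80% confidence with one failure.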
Several benefits can be obtained from a well-designed HALT program:
• The HALT process is adept at finding and correcting design faults and determining design margins.
• The costs of HALT can be justified in terms of improved margin.
• The HALT margin also may be used to estimate MTBR, obviate traditional RDT, and improve traditional MTBF prediction.
1. Kyser, E. L. and Meadowsong, N., “Economic Justification of HALT Tests: The Relationship Between Operating Margin, Test Costs, and the Costs of Field Returns,” Proceedings of IEEE/CPMT Workshop on Accelerated Stress Test, 2002.
2. Hnatek, E. R. and Kyser, E. L., “Straight Talk About Accelerated Stress Testing (HALT and ESS)—Lessons Learned,” Proceedings of Future Circuits International, 2000.
3. Silverman, M., “Why HALT Cannot Produce a Meaningful MTBF Number and Why This Should Not Be a Concern,” QualMark, Santa Clara ARTC Division, http://qualmark.com/content/OurLibrary_CaseStudies.html
The test data serving as the basis for this paper was obtained at the Cisco HALT Laboratory, San Jose, CA. For this data, we are indebted to the efforts of Todd Broome, Hung La, Sheryl Spake, and Ken Miu. Field data was obtained with the help of Henry Bertram of the Cisco Quality organization. QualMark has been supportive and helpful throughout our four-year history of HALT at Cisco.
About the Authors
Nahum Meadowsong is a hardware reliability engineer at the Cisco Systems HALT Lab. He has worked for Cisco since 2001 and earned a B.S. in electronics engineering from California State University, Chico. Cisco Systems, HALT Team, 170 W. Tasman Dr., San Jose, CA 95134-1706, 408-853-9461, e-mail: [email protected]
Dr. Edmond L. Kyser is a consultant for electronic hardware reliability issues. He has directed overstress reliability and test programs (HALT and HASS) at Cisco Systems, Compaq, and Tandem Computers. Dr. Kyser, the author of several articles on accelerated stress test, holds seven U.S. patents. His Ph.D. is from U.C. Berkeley in applied mechanics. 650-960-0138, e-mail: [email protected]
Published by EE-Evaluation Engineering
All contents © 2003 Nelson Publishing Inc.
No reprint, distribution, or reuse in any medium is permitted
without the express written consent of the publisher.