Impact of Burn-In on Power Supply Reliability

Sept. 1, 2005
Through proper testing and analysis of the accumulated data, power supply manufacturers can select the appropriate level of burn-in to drive high field reliability while achieving cost goals.


Users of power supply products demand increasingly higher levels of reliability and performance. Although the suppliers of individual components can confidently provide impressive life and reliability data, the compound effect on overall reliability can be significant when a large number of individual components are combined in a module such as a power supply. Perhaps more important in terms of product reliability is the quality and repeatability of the assembly process. Solder joints, connectors and mechanical fixings are all potential origins for product failure. In use, operating temperature and other environmental factors also affect the longevity and reliability of a power supply.

Burn-in and various other forms of life and stress testing help provide the data to enable power supply manufacturers to continually improve the reliability of their products. Indeed, when analyzed correctly and fed back into the design and assembly process, the accumulated data can be used to optimize the test and burn-in process.

The Burn-In Process

The purpose of the burn-in process for power supplies is to weed out “infant mortalities,” as seen in the first portion of the well-known “bathtub curve” of failure rate versus operational time (Fig. 1). These latent, early life failures may be due to intrinsic gross faults within the bought-in components, assembly errors or faults induced in components by inappropriate handling, e.g. ESD damage. It should be noted that there are no absolutes in the world of reliability testing, only probability and confidence levels for large populations. Hence, there is never a guarantee that all infant mortalities are caught by the burn-in process.

For many years, the conventional approach to power supply burn-in has involved running the supplies at an elevated temperature, often the maximum-rated operating temperature listed in the product data specifications, to accelerate the appearance of latent defects. The supplies are run under full load with power cycling, and the input is held at either the maximum or minimum rated voltage to provide either maximum voltage stress or maximum current stress, depending on the design topology.

Care in the choice of conditions is necessary because some components in some topologies can see more stress at light loads, such as snubber networks in variable-frequency converters. Some ingenuity can also be applied. For example, if a product is intended to operate normally with forced air, it could be run in still air at light load and still achieve comparable temperature stress levels of the hottest components. However, without the “temperature spreading” effect of the forced air, other components might see little stress under these conditions.

A technique sometimes used by C&D Technologies, depending on the product topology, is to burn in products with their outputs cycled between short and open circuit. This applies an appropriate current-stress level while exercising the built-in protection circuitry on short circuit, and imposes a high-voltage stress on many components on open circuit. A major benefit is that the power dissipated in the short- or open-circuit load is theoretically zero, although in practice the short might be a MOSFET, turned on, dissipating a few watts.

This method alleviates the real problem of energy waste in burn-in loads. However, some types of component stresses are not applied with this method because the overall power supplied by the unit is low, and therefore self-heating may be low. An elevated ambient temperature will compensate for this in part, perhaps using the waste heat from the burn-in loads. As mentioned, some product topologies are not suitable for this burn-in method, such as those that have a poorly defined or strongly re-entrant short-circuit current characteristic. That is, if on a “hard” short circuit the output current reduces to much less than the rated maximum output current, the level of burn-in stress may be too low to be effective. The decision on burn-in configuration is made jointly between the design and reliability/quality engineers to ensure optimized screening.

Data logging and analysis of the units under test is important for determining whether a failure has occurred, and if so, when. If all failures occur in the first few minutes of a 48-hr burn-in sequence, there would be good reason to shorten the time and increase throughput while saving energy. C&D Technologies tests products comprehensively before and after burn-in to ensure that any changes in performance are identified. This also can show whether there are any intermittent problems. Understanding and using burn-in data to modify product design and manufacturing processes can result in improved reliability and yield. C&D Technologies uses its burn-in data to drive the continuous improvement quality process.
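The before-and-after comparison described above lends itself to a simple parametric-drift check. A minimal sketch, with hypothetical parameter names and drift limits (a real test record would hold many more measurements):

```python
def drift_report(pre, post, limits):
    """Compare pre- and post-burn-in measurements; flag any parameter
    whose shift exceeds its allowed drift limit."""
    flagged = {}
    for name, limit in limits.items():
        shift = abs(post[name] - pre[name])
        if shift > limit:
            flagged[name] = shift
    return flagged

# Hypothetical 12-V output: a small Vout shift is acceptable,
# but the efficiency drop exceeds its limit and is flagged.
pre = {"vout": 12.02, "efficiency": 0.91}
post = {"vout": 12.05, "efficiency": 0.86}
limits = {"vout": 0.10, "efficiency": 0.02}
flagged = drift_report(pre, post, limits)  # flags only the efficiency shift
```

Flagged units can then be routed to failure analysis even though they still pass their absolute test limits, which is how intermittent or incipient problems are caught.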

Experience in burn-in testing has shown that thermal cycling precipitates more infant mortalities than a constant elevated ambient, although the sets of failures don't completely overlap. Thermal cycling with a dwell time at each thermal extreme is therefore the preferred process. Increasing the thermal rate of change precipitates more failures in fewer cycles as illustrated in Fig. 2.

Note that with increased thermal rate, different populations of failures can appear that are more or less affected by this type of stress and the occurrence of some residual failure types is unaffected. Even though there is equipment available to achieve thermal rates of change of up to 60°C per minute, some manufacturers don't exceed 40°C per minute to prevent excessive thermal stress that may cause cracking of multilayer ceramic capacitors (MLCCs).

In the absence of thermal cycling chambers, power cycling at an elevated ambient with judiciously selected cycle times approaches the effectiveness of the thermal cycling/dwell process. Care must be taken to ensure that the products are not stressed outside of their ratings in the often-atypical environment of burn-in. If overstressed, some useful life of a good product could be used up, and at worst, hard or latent failures could actually be induced in otherwise good product.

At C&D Technologies, the burn-in process normally starts with a duration of 48 hr, with a decision process to reduce the time of burn-in when no failures occur after a set number of hours. Depending on the product's complexity and topology, a decision is made to reduce the future burn-in hours by half after 200 to 500 units have gone through the process with no failures occurring in a quarter of the current burn-in time. This process is continued until the burn-in time is reduced to 2 hr, where it is held for the remainder of production.
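The halving schedule described above can be expressed as a small decision rule. A minimal illustrative sketch, not C&D's actual procedure; the 500-unit threshold is one point in the 200-to-500 range the text gives, and the function names are hypothetical:

```python
def next_burn_in_hours(current_hours, failure_free_units,
                       threshold_units=500, floor_hours=2):
    """Halve the burn-in duration once enough consecutive units have
    passed with no failures, holding at the floor (2 hr) thereafter."""
    if failure_free_units >= threshold_units and current_hours > floor_hours:
        return max(current_hours // 2, floor_hours)
    return current_hours

# Starting at 48 hr, repeated failure-free batches step the duration
# down 48 -> 24 -> 12 -> 6 -> 3 -> 2, where it holds.
hours = 48
for _ in range(10):
    hours = next_burn_in_hours(hours, failure_free_units=500)
```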

Some contend that burn-in can be eliminated when no failures occur after multiple production builds. However, it could be argued that this removes the insurance against a group of defective components being used and/or a process anomaly occurring. In volume production of parts that are known to have a significant infant mortality rate — perhaps because of the degree of manual assembly — a regime of variable burn-in can be used whereby failures are expected. However, when a precalculated period of failure-free operation of a batch has elapsed, burn-in is terminated. This period is found from statistical tables, given the expected percentage of infant mortalities, their known failure rate and distribution type, batch size and percentage confidence level required that only a given number of latent failures remain.

For example, consider a batch of 10,000 units that historically has had 10 infant mortalities per batch of a type found to have a mean time to failure (MTTF) of 10 hr at the burn-in temperature. In this case, tables in the book Electronic Component Reliability: Fundamentals, Modelling, Evaluation, and Assurance[1] by Finn Jensen show that a failure-free period of 13 hr must pass to give a 90% confidence level that only one latent product failure remains. The period extends to 24 hr to have the same confidence level that no latent infant mortality-type failures remain.

Some manufacturers have taken the burn-in process further after finding that the types of burn-in described do not eliminate, within a reasonable time, all of the failures seen to occur in the early life of a power supply. Also, conventional burn-in does not provoke early failures that could result from the shock and vibration of shipping and handling. To combat this, a more aggressive highly accelerated stress screen (HASS) can be used that applies mechanical, thermal and electrical stress typically beyond product ratings but within design margins. Acceleration factors of more than 40 over conventional burn-in have been claimed for this method, giving correspondingly shorter test times. A problem, however, is that the stress levels are so extreme that there is a risk of damaging good product, inducing hard or latent failures.

In answer to this, the highly accelerated life test (HALT) process was designed to identify the real damage limits in a product by stressing the product to failure with temperature extremes, thermal cycling, progressively higher levels of vibration, and then a combination of thermal cycling and vibration. During this testing, the destruction limits of the power supply are identified. These operating limits are then used to set the less-severe HASS test levels.

HALT also is used extensively during product development to identify potential weaknesses in the design. The test equipment required for HALT must typically ramp temperature between -55°C and 125°C while applying six-axis linear and rotational random vibration. This requires a major capital investment, so HALT is often subcontracted to specialist test houses, although some vendors such as C&D Technologies have internal HALT facilities.

The No Burn-In Model

As described earlier in the article, once burn-in failures have been reduced to a certain level, some manufacturers feel that the process can be dropped completely. This can be considered only if the manufacturing process is entirely predictable and the quality of bought-in material is such that it has no gross latent intrinsic defects. In other words, the bought-in components themselves don't exhibit significant infant mortalities and have only their intrinsic low-level latent defect rate.

Although commodity components approach this quality level and modern manufacturing quality control can minimize process variations, there is still a real risk that a customer may see some early life failures. The cost of this in terms of goodwill has to be weighed against the cost of burn-in. Remember that customers will still see the intrinsic failure rate of the product in its service life. A small extra number of failures attributable to infant mortalities may not be significant. For example, one product from C&D Technologies that uses quality components is built using a stable, mature process without burn-in and has an observed field mean time between failure (MTBF) of more than 25 million hr. This figure is derived from 130 failures in the total sales of 4.37 million parts shipped evenly over six years. In this case, it is assumed that the parts are powered for 25% of any given period and that only 10% of failures are actually reported.

While extended burn-in tests may be employed on small numbers of units to gauge whether all infant mortality failures have been identified, at C&D Technologies, ongoing life tests are run for up to six months on 25 to 50 units at a moderately elevated temperature. These tests are normally used only when large quantities of units are built on a continuing basis, and they can give an estimate of the intrinsic reliability of a product in service, that is, its MTBF.

The accuracy of this figure depends on the relatively mild failure-rate acceleration during the test having a known relationship to the real-life failure rate. The Arrhenius equation can give a value for the acceleration factor, given a constant failure rate after infant mortalities. Because the equation has its origins in chemistry, in theory it requires knowledge of the effective “activation energies” of all failure modes; in practice, the rule of thumb is to double the acceleration factor for each 10°C rise above the real-life operating temperature.

As an example, 50 units running for six months at 70°C with no failures gives 219,000 operational hr. From statistical tables, this represents a failure rate of 4110 failures in 10⁹ hr of operation (FITs) with a 60% confidence level, or 10,502 FITs with 90% confidence. At a lower temperature of, say, 40°C, our rule of thumb gives an acceleration factor to 70°C of eight, so the figures reduce to 514 FITs and 1313 FITs.

FIT is λ × 10⁹, and MTBF is 1/λ, so these figures represent 1.95 million hr and 760,000 hr MTBF at 60% and 90% confidence levels, respectively. It may seem odd that a test with no failures gives a finite failure rate. This is because it is assumed that the first failure is just about to happen. It should be emphasized that real field failure rate is the most accurate measure of the reliability of a product.
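The zero-failure calculation above can be reproduced with the standard one-sided bound for an exponential (constant failure rate) model, λ ≤ -ln(1 - C)/T, which is the chi-square limit with two degrees of freedom. A minimal sketch; the results differ slightly from the article's figures because the published values come from rounded statistical tables:

```python
import math

def zero_failure_fits(device_hours, confidence):
    """Upper-bound failure rate in FITs (failures per 1e9 hr) for a life
    test that ends with zero failures, assuming a constant failure rate:
    lambda <= -ln(1 - C) / T."""
    return -math.log(1.0 - confidence) / device_hours * 1e9

def acceleration_factor(t_test_c, t_use_c):
    """Rule-of-thumb Arrhenius acceleration: the failure rate roughly
    doubles for each 10 deg C rise in operating temperature."""
    return 2.0 ** ((t_test_c - t_use_c) / 10.0)

device_hours = 50 * 4380                         # 50 units, six months = 219,000 hr
fits_60 = zero_failure_fits(device_hours, 0.60)  # ~4184 FITs at 70 deg C
fits_90 = zero_failure_fits(device_hours, 0.90)  # ~10,514 FITs at 70 deg C
af = acceleration_factor(70, 40)                 # factor of 8 down to 40 deg C
mtbf_60 = 1e9 / (fits_60 / af)                   # ~1.9 million hr MTBF, 60% confidence
```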

A calculated MTBF can be compared with the demonstrated figure obtained through life testing to check for consistency. However, the calculations can be misleading depending on the base failure rates used for components and the method of calculation. A recent survey by C&D Technologies found a variation of a factor of more than 100 between MTBF figures for the same circuit calculated by several different power supply manufacturers. Different standards such as MIL-HDBK-217F and Telcordia SR332 will give different answers.

The MIL standard gives two different methods. One is the parts count method, which gives a quick but conservative measure; the other is the part stress method, which requires detailed knowledge of the electrical operating conditions and is more realistic. As an example of a part stress calculation according to MIL-HDBK-217F, a general-purpose diode has a failure rate per million hours given by:

λP = λB × πT × πS × πC × πQ × πE

where λB is a base failure rate for different types of diodes, and the π factors account for temperature, electrical stress, internal construction, manufacturing quality and environment of use, respectively. For a Schottky power diode operating at a junction temperature of 80°C, with a voltage stress of 75% of its rating, metallurgically bonded construction, plastic commercial packaging and operated in a “ground benign” environment, substituting values from the tables in the standard gives:

λP = 0.003 × 5 × 0.58 × 1 × 8 × 1 = 0.0696 failures per million hours, or 69.6 FITs.
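The parts-stress arithmetic is simply a product of tabulated factors. A minimal sketch using the factor values quoted in the example above (not a lookup of the full 217F tables, which depend on the exact part and conditions):

```python
def diode_failure_rate_fits(lambda_b, pi_t, pi_s, pi_c, pi_q, pi_e):
    """MIL-HDBK-217F parts-stress model for a diode:
    lambda_p = lambda_b * pi_T * pi_S * pi_C * pi_Q * pi_E,
    in failures per 1e6 hr; scaled by 1e3 to express the result in FITs."""
    lambda_p = lambda_b * pi_t * pi_s * pi_c * pi_q * pi_e
    return lambda_p * 1e3

# Schottky power diode example from the text:
# Tj = 80 deg C, 75% voltage stress, metallurgically bonded,
# plastic commercial quality, ground benign environment.
fits = diode_failure_rate_fits(0.003, 5, 0.58, 1, 8, 1)  # ~69.6 FITs
```

Summing such per-part figures across a bill of materials is what the parts count and part stress MTBF predictions ultimately do, which is why the choice of base failure rates dominates the result.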

Optimizing Process Control

The important point to note is that quality and reliability cannot be “tested in” or “inspected in.” Burn-in testing is ultimately another inspection process, but serves as a mechanism for process control and feedback. Failures in burn-in along with field failures prompt failure analysis and corrective action to ensure that the product design and process have been centered and optimized to provide the best product possible to the field. Studies have shown that higher factory yields give higher product reliability, happier customers and lower warranty-return costs.


  1. Jensen, Finn. Electronic Component Reliability: Fundamentals, Modelling, Evaluation, and Assurance. John Wiley & Sons, 1995. The tables in this reference are credited to Marcus and Blumenthal (1974) by permission of the American Statistical Association.

