Electronic Design

Improve Your Card Power System's Reliability

In addition to choosing the proper dc-dc converters, pay careful attention to the power-management design, and make sure you thoroughly qualify your system.

Over the last few years, the voltages in typical equipment cards have dropped dramatically (in many cases to 1 V or lower) while the total card power continues to soar. The growing number of distinct rail voltages has also added complexity to the power system in the form of sequencing and tracking between rails. Meanwhile, expectations for reliability and availability are rising due to the ongoing drive to reduce equipment downtime.

There are several ways to meet the added power-system design requirements without compromising reliability. High-reliability power converters make up a key part of these solutions, but they need support from a well-chosen overall equipment architecture. Also, attention must be paid to details in the power-system integration.

Current products no longer rely on a simple 5-V power-distribution system. Nowadays, it's not uncommon to have six or more voltages on a single card. Some high-end systems may have up to 20 or more separate power rails, with most below 2 V. These very low voltages must be delivered efficiently at high current, and they must meet increasingly tight regulation, ripple, and transient specifications. Consequently, distributed power systems are now the norm, with multiple dc-dc converters on each card to generate the low voltages very close to the load.

In addition to the need for very low-voltage rails, many ICs impose requirements for sequencing and tracking between power rails during startup and shutdown. The power rails must be controlled so that the difference between them doesn't exceed the specified voltage and/or time limits, even under short-term transient conditions. Combine these requirements with the need to monitor all rails for overvoltage (OV) and undervoltage (UV) protection, and it's plain to see that card power systems have moved out of the "simple-to-build" realm.

The figure illustrates an example of a card power system. In this case, a typical product, such as a communications system or a high-end compute server, is powered from 48 V dc. The dc-dc converters supply the voltage rails needed for the card and maintain the required isolation between the 48-V input and the logic outputs. In this example, a single isolated dc-dc converter (usually called a brick) generates an intermediate bus voltage of 5 V that feeds a number of non-isolated point-of-load (POL) power converters.

A wide range of manufacturers offer bricks and POL converters as standard products in many output voltage and current combinations. They can conveniently function as building blocks in a card power system. A high degree of commonality exists among manufacturers, both for the electrical performance and the physical details (e.g., dimensions and pinouts). Though the figure shows a single brick, two or more bricks are often used to generate the rails that require the highest power, with POLs for the lower-power rails. Many combinations of power converters will meet the specific needs of any particular card.

To coordinate the operation of the dc-dc converters, the card power system requires an overall management function. Some degree of management also is necessary on both the primary and secondary sides of the isolation, as shown. While details vary, power-management functions typically include some or all of the following:

  • Startup and shutdown of the power system at a specified input voltage
  • Controlled startup and shutdown of all outputs in the required sequence
  • Monitoring of all outputs for OV and UV faults
  • Controlled shutdown if a fault occurs
  • Adjustment (trim) of output voltages if required
  • Margining of rail voltages during system testing
  • Reporting power status to the system controller

    Thus, a high-reliability power system requires careful attention to the power-management design, which is equally important as the choice of dc-dc converters.

    Power reliability can be seen in two very different ways:

  • Component level, using a bottom-up approach based on component failure rate. This aspect of reliability is typically expressed as predicted mean time between failures (MTBF), or failures in time (FITs). Since 1 FIT = 1 failure in 10^9 device hours, 1000 FITs = 1 million hours MTBF. The two most commonly used prediction methods are MIL-HDBK-217 and Telcordia TR-332. This type of prediction only considers component failures, and it doesn't take into account such aspects as design errors or inadequate specifications.
  • System level, using a top-down approach based on ability to perform the required functions. This can be addressed by worst-case design, simulation, and testing of the complete system. The testing must be sufficient to ensure that the design meets all required functions under all operating conditions, a process called qualification. As always, good design practices must be followed. Testing alone can't guarantee proper performance under all conditions.

    It's important to consider both of these aspects in your design. A good predicted MTBF is necessary, but not sufficient. MTBF itself offers little value to the customer if the power system shuts down for every thunderstorm in the area.
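
    The FIT-to-MTBF relationship quoted above is simple arithmetic, and a small helper makes it easy to check datasheet figures. The following is a minimal sketch; the function names are illustrative and not part of any prediction standard.

```python
FIT_DEVICE_HOURS = 1e9  # 1 FIT = 1 failure per 10^9 device hours


def fits_to_mtbf_hours(fits: float) -> float:
    """Convert a failure rate in FITs to MTBF in hours."""
    return FIT_DEVICE_HOURS / fits


def mtbf_hours_to_fits(mtbf_hours: float) -> float:
    """Convert an MTBF in hours to a failure rate in FITs."""
    return FIT_DEVICE_HOURS / mtbf_hours


# 1000 FITs corresponds to 1 million hours MTBF, as stated in the text.
print(fits_to_mtbf_hours(1000))       # 1000000.0
print(mtbf_hours_to_fits(1_000_000))  # 1000.0
```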

    Most anecdotal power reliability problems customers see can be traced back to weaknesses in system-level reliability—the component application and system qualification—rather than the fundamental MTBF of the components themselves. For example:

  • The production version of the card draws more peak current than expected, causing the voltage to drop under extreme conditions.
  • The power system shuts down unexpectedly in the field (nuisance trips).
  • The card fails at the customer site, but when it is returned for repair, no faults are found (NFF).
  • Sequencing between rails depends on component tolerances and doesn't always meet the needs of the ICs.
  • Sequencing during shutdown wasn't considered during the design.
  • The power system cannot deliver full load at extremes of input voltage and temperature.
  • Power modules overheat due to restricted airflow when the card is installed in the equipment.

    Although these types of problems can occasionally occur even in a well-designed system, the likelihood can be reduced significantly through careful design and thorough qualification testing. The table takes a closer look at these specific problems and offers tips on how they can be avoided.

    Good power-system design is a complex, multifaceted subject that touches on the entire product and its environment. Don't underestimate the task's complexity. Furthermore, although the initial focus is on efficient power conversion, remember that the power-management functions are equally important in achieving good power-system performance.

    Three fundamental methods can improve the MTBF of any system: use fewer components, make the components more reliable, and make the system function even if components fail. Each can play a part in improving power-system reliability, together with comprehensive qualification testing.

    Often, component count can be reduced in the power-management system. A dedicated power-management IC can replace a large number of discrete components used for monitoring and control, such as comparators, op amps, optocouplers, and RC time delays. At the same time, a power-management IC can offer much better performance than a discrete solution, improving system reliability by accurately reporting marginal performance while avoiding nuisance trips.

    For example, the Potentia PS-2610 measures each output rail voltage every 40 µs using an 8-bit analog-to-digital converter. The PS-2610 employs digital filtering to allow for fast response to a real OV condition while preventing false OV or UV shutdown due to voltage spikes.
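
    The article does not describe the filtering algorithm inside the PS-2610, so the following is only a generic sketch of one common approach: a consecutive-sample counter that declares a fault only when a rail stays out of range for a set delay. The 40-µs sample period comes from the text, and the 1-ms OV delay is the value suggested in the article's table. The class name, function name, and the 3.63-V limit on a 3.3-V rail are hypothetical.

```python
SAMPLE_PERIOD_US = 40  # sampling interval quoted in the text


class FaultFilter:
    """Report a fault only after the condition persists for delay_us."""

    def __init__(self, delay_us: int):
        # Consecutive out-of-range samples required (rounded up, at least 1).
        self.required = max(1, -(-delay_us // SAMPLE_PERIOD_US))
        self.count = 0

    def update(self, out_of_range: bool) -> bool:
        # Any in-range sample resets the counter, so short spikes are ignored.
        self.count = self.count + 1 if out_of_range else 0
        return self.count >= self.required


def first_trip_index(samples, limit, fault_filter, over=True):
    """Return the index of the sample at which the filter trips, or None."""
    for i, volts in enumerate(samples):
        bad = volts > limit if over else volts < limit
        if fault_filter.update(bad):
            return i
    return None


OV_LIMIT = 3.63  # hypothetical limit: 3.3 V nominal + 10%

# A 3-sample (120 µs) spike is ignored; a sustained overvoltage trips.
spike = [3.3] * 10 + [3.9] * 3 + [3.3] * 50
sustained = [3.3] * 10 + [3.9] * 40

print(first_trip_index(spike, OV_LIMIT, FaultFilter(1000)))      # None
print(first_trip_index(sustained, OV_LIMIT, FaultFilter(1000)))  # 34
```

    With a 1-ms delay the filter needs 25 consecutive samples, so the sustained fault that begins at sample 10 trips at sample 34. A 50-ms UV delay would require 1250 consecutive samples under the same scheme.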

    A typical POL contains fewer internal components than an isolated brick, and the failure rate can be significantly lower. The manufacturer's quoted failure rate for a typical POL is about 200 FITs (equivalent to an MTBF of 5 million hours), whereas a typical brick is about 500 FITs (which is an MTBF of 2 million hours). On the other hand, a POL usually has lower output power than a brick, so you may need more of them to meet your total power requirement. Of course, reliability is only one of many factors when choosing power converters. But by considering reliability early in the design, you can make the best tradeoff for your application.
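
    To see how converter count affects the tradeoff, the quoted failure rates can be combined in a simple parts-count estimate. This sketch assumes constant failure rates and a series model in which the card fails if any one converter fails, so the FIT values simply add. The mix of one brick and five POLs is a hypothetical example, not a configuration from the article.

```python
FIT_DEVICE_HOURS = 1e9  # 1 FIT = 1 failure per 10^9 device hours

BRICK_FITS = 500  # typical brick, as quoted in the text
POL_FITS = 200    # typical POL, as quoted in the text


def combined_fits(parts):
    """Sum failure rates for a series system; parts is [(fits, quantity), ...]."""
    return sum(fits * quantity for fits, quantity in parts)


# Hypothetical card: one intermediate-bus brick feeding five POLs.
total_fits = combined_fits([(BRICK_FITS, 1), (POL_FITS, 5)])
mtbf_hours = FIT_DEVICE_HOURS / total_fits

print(total_fits)         # 1500
print(round(mtbf_hours))  # 666667
```

    Even though each POL is more reliable than a brick, the converters on this hypothetical card together have a predicted MTBF well below that of any single unit, which is why converter count belongs in the early tradeoff.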

    Component reliability is influenced primarily by the qualification and quality-control processes used in manufacturing, as well as by the stresses applied in the application. Power-conversion reliability can be improved with a modular approach, using standard off-the-shelf dc-dc converters as components in your design. These units, which are built in high volume using an automated process with full quality control, offer excellent performance and reliability. You will avoid the need to calculate component stresses within the power converter, because the design is optimized during the manufacturer's in-house qualification.

    Similarly, plan your power-management design around a dedicated power-management IC rather than a general-purpose device, like a gate array or microcontroller (MCU). A power-management design using an MCU or gate array requires extensive testing under both normal operation and fault conditions. This is to ensure that logical errors in programming don't cause incorrect behavior. Conversely, the dedicated power-management device's behavior is already fully tested and qualified by the manufacturer. Only the operating parameters (voltage levels, time delays) require programming.

    To dramatically improve system reliability, design the system to be fault-tolerant. In the ideal case, an available backup instantly takes over for any component failure, leaving system performance unaffected. The term availability expresses the proportion of time for which the system performs as expected. The provision of backup components is called redundancy. In a practical system, there are limits to the degree of redundancy that can be achieved, and availability can never reach 100%. Through careful design, redundancy can provide almost complete protection against any single fault, and it can achieve 99.999% (five nines) availability or better.
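
    An availability figure is easier to judge when converted to expected downtime. The sketch below does that conversion and also shows the standard steady-state relationship between availability, MTBF, and mean time to repair (MTTR). MTTR is not discussed in the article, so the 4-hour repair time in the example is purely an assumption.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability (0 to 1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


for label, a in [("three nines", 0.999), ("five nines", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(a):.2f} min/year")

# Assumed example: 1 million hours MTBF with a 4-hour repair time.
print(steady_state_availability(1_000_000, 4) > 0.99999)  # True
```

    Five nines corresponds to a little over five minutes of downtime per year, which shows why prompt fault reporting and fast card replacement matter as much as the failure rate itself.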

    Most redundant systems achieve redundancy by duplicating entire cards. For example, two identical control-processor cards can be used in a shelf, either of which can take control if the other fails. The 48-V distribution system also is duplicated, with dual 48-V feeds to each card from independent circuit breakers. If any individual circuit breaker trips, the cards still receive uninterrupted power through the second feed. In most cases, it's not considered beneficial to duplicate the on-card power system itself, since any card failure (power or otherwise) means simply replacing the card.

    For effective redundancy, it's vital to report all component failures immediately to the operator for maintenance before the backup fails. In the power system, this implies not only comprehensive monitoring of all output-voltage rails, but also monitoring of fuses and power feeds to detect any loss of redundancy. Additional monitoring such as input-current measurement and thermal sensing can provide advance warning of overload conditions and further improve reliability.

    While today's power systems are more complex, high reliability is achievable. Minimizing component count can improve the failure rate and yield a high calculated MTBF. Also, with effective power management, you can implement features that improve overall equipment reliability. Remember that reliability is much more than just MTBF. Carry out thorough qualification testing of your power system to ensure it meets equipment requirements under all conditions.

    Table: Power problems and suggested solutions

    Card draws more current than expected
  • Maintain detailed system power estimates and update frequently during development. Include the effect of software updates.
  • Build in enough margin to allow for power increases during development.

    Nuisance trips
  • Do not use unnecessarily short time delays for fault detection. In typical systems, about 1 ms is suitable for OV and 50 ms for UV.
  • Use adequate decoupling in fault-detection circuits.
  • Carry out thorough system transient testing on the product, including ESD, EFT, and lightning tests as applicable.

    Cards are returned NFF
  • Consider including a fault log as part of the power-management system to improve diagnosis.

    Sequencing depends on component tolerances
  • Do not use time-based sequencing. Instead, voltage interlocking between rails helps to guarantee correct behavior.

    Incorrect shutdown sequencing
  • Review IC specs to determine whether shutdown sequencing is required. If so, include it as part of the power-management system.

    Cannot deliver full power at extremes
  • Design for worst-case combination of voltage and temperature.
  • Remember that input current is highest at minimum input voltage, particularly for battery systems.
  • Test at extremes, and include margin testing of all rails.

    Overheating in system
  • Carefully characterize the airflow in your system, including variation at extreme conditions.
  • Design for worst-case load, with adequate derating.
  • Follow the power-module supplier's guidelines.
  • Provide alarms for overtemperature and fan failure.
  • Make sure system testing represents the real environment and covers extremes of temperature, load, and airflow.
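
    The table's recommendation to use voltage interlocking rather than time-based sequencing can be illustrated with a short simulation. In this sketch, each rail is enabled only after the previous rail has been measured above a readiness threshold, so the startup order does not depend on ramp-time tolerances. The rail names, the 90% threshold, the ramp times, and the FakeRails test double are all hypothetical. A real design would read the rails through the power-management device.

```python
def sequence_up(rails, enable, read_voltage, ready_fraction=0.9, max_polls=1000):
    """Enable rails in order, waiting for each to reach its threshold.

    rails is [(name, nominal_volts), ...] in startup order.
    Returns (True, None) on success or (False, failed_rail_name).
    """
    for name, nominal_v in rails:
        enable(name)
        for _ in range(max_polls):
            if read_voltage(name) >= ready_fraction * nominal_v:
                break  # rail is up; move on to the next one
        else:
            return False, name  # rail never came up; later rails stay off
    return True, None


class FakeRails:
    """Simulated hardware: each rail ramps linearly over ramp_polls reads."""

    def __init__(self, rails, ramp_polls, stuck=()):
        self.nominal = dict(rails)
        self.ramp_polls = ramp_polls
        self.stuck = set(stuck)  # rails that never rise (simulated fault)
        self.polls = {}
        self.enabled_order = []

    def enable(self, name):
        self.enabled_order.append(name)
        self.polls[name] = 0

    def read_voltage(self, name):
        if name in self.stuck:
            return 0.0
        self.polls[name] += 1
        fraction = min(1.0, self.polls[name] / self.ramp_polls[name])
        return self.nominal[name] * fraction


RAILS = [("3V3", 3.3), ("1V8", 1.8), ("1V0", 1.0)]
RAMPS = {"3V3": 5, "1V8": 50, "1V0": 2}  # deliberately mismatched ramp times

good = FakeRails(RAILS, RAMPS)
print(sequence_up(RAILS, good.enable, good.read_voltage), good.enabled_order)

bad = FakeRails(RAILS, RAMPS, stuck={"1V8"})
print(sequence_up(RAILS, bad.enable, bad.read_voltage), bad.enabled_order)
```

    In the second run the 1.8-V rail never rises, so the 1.0-V rail is never enabled. A fixed time delay could not guarantee that behavior.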