Advanced Thermal Management Augments System Reliability

Sept. 24, 2010

Fig. 1. CPU board heatsink

Fig. 2. Fanless system

Fig. 3. Heat dissipation

Thermal management remains a crucial design consideration for virtually all embedded systems applications. Keeping a system within its specified operating temperature range and protecting valuable computing assets yields higher reliability and longer deployment lifespan, ultimately lowering cost.

Failure to manage the heat within a system’s specified limits can detrimentally affect system performance, create system errors, or potentially damage components or the system. In addition, exceeding maximum operating temperatures may cause irreversible changes in the operating characteristics of a given component.

Therefore, it’s important to determine early in the design cycle how heat is generated within the system by looking at power dissipation and component locations, airflow paths, and general thermal performance. Thorough thermal analysis and testing to select the right thermal-management solution are major factors in the design’s overall success. Miscalculations and layout errors can result in serious consequences for system reliability, field failures, increased operational costs, and overall performance.

But as system complexity increases with higher-speed, higher-density components, smaller form-factor boards, and reduced system footprints, coupled with requirements to operate in more rugged environments, designing for effective thermal management has become more challenging. New thermal-management options continue to evolve, and next-generation applications may be required to provide better cooling solutions to match new standards specifications (e.g., MicroTCA, CompactPCI, Pico-ITX, and other board form factors), depending on the type of system and specific operating environment.

Furthermore, managing complex cooling issues associated with unique or extreme embedded computing environments requires extensive thermal knowledge, along with an evaluation of overall system cost and options for commercial-off-the-shelf (COTS) standards versus custom design. Designers today face a greater array of thermal design options.

For demanding, highly specialized embedded markets such as medical, transportation, and military, a full COTS cooling solution may not meet thermal requirements. In these instances, a semi- or full-custom solution may be the best choice in terms of resource use, access to thermal expertise, and overall system cost.

A basic premise of thermodynamics is that heat always flows from an area of higher temperature to an area of lower temperature, acting to equalize temperature differences. But how can a system designer determine the optimal thermal-management solution for a particular application?

The first step is to understand the major cooling methods used in embedded systems: active cooling with onboard fans and system fans, fanless passive convection cooling, and active or passive conduction cooling. Industry trends drive the thermal-management method in some applications, while others are best served with a semi- or full-custom cooling solution to achieve a cost-effective resolution of thermal-management issues without adding size or weight or adversely affecting the computing environment.

The objective of thermal management is to ensure that all system components operate within specified functional temperature limits for optimal performance. The thermal-management method used depends entirely upon the application and the individual devices designed into the system. By applying knowledge of the application devices along with a thorough analysis of the thermal makeup of a particular application, designers can more easily match the optimal cooling option.

Active Cooling: Onboard And System Fans
The most widely used thermal-management method is active convection cooling with onboard or system fans to force air across the system boards, power supply, and other system components. It’s accomplished in one of two ways: a convection cooling method that exhausts the heat out of the chassis, or a cooling method that forces the heat through hollow side walls and then exhausts to the ambient environment.

Fans also represent the most cost-efficient option for system cooling. Designers have found that one of the most effective fan-based options for heat dissipation is to place a fan directly onto a CPU board heatsink (Fig. 1). Success is based on choosing the right fan for the application and having a sound chassis design for the airflow.

How internal components affect the air pathways and how effectively air moves through a unit are key considerations in system design. It’s also important that no “dead spots,” where air more or less remains stagnant or circulates in eddies, exist within a chassis. Once those considerations are satisfied, focus on selecting the best fan for the project by examining the fan’s design and construction.

Recent blade and bearing design advances in fans provide significant airflow as well as reduce noise and vibration for quieter system operation. Every onboard and system fan has different ambient air temperature thermal limits, so it’s advised to check the specifications of each from the supplier.

Though fans are popular due to their efficient cooling, designers should be aware of some reliability issues in certain applications. Onboard system fans are primarily small, high-RPM units, which can cause more failures in embedded applications operating round-the-clock. Also beware when the application incorporates more than one board that requires cooling. Such a design requires multiple fans, creating more vibration that further reduces overall system reliability. Onboard fans may be the only viable cooling option for space-constrained embedded systems that dissipate a lot of power and can’t accommodate a larger fan.

For mission-critical systems using high-performance processors that demand high reliability and strict mean time between failures (MTBF), it’s best to incorporate one or more larger fans that blow air across all boards in the system. Of course, heatsinks will still be used for each board to dissipate heat from key components.
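
The reliability cost of adding fans can be estimated with a simple series model. The sketch below assumes each fan has a constant, independent failure rate and that the loss of any one fan counts as a cooling failure. The MTBF figures are hypothetical and are not taken from any vendor data.

```python
def series_mtbf(mtbf_hours):
    """Combined MTBF of components that must all work (series model).

    Assumes constant, independent failure rates, so the rates simply add.
    """
    if not mtbf_hours:
        raise ValueError("need at least one component")
    total_rate = sum(1.0 / m for m in mtbf_hours)  # failures per hour
    return 1.0 / total_rate


# Hypothetical figures: three small onboard fans vs. one larger system fan.
three_small_fans = series_mtbf([50_000, 50_000, 50_000])
one_large_fan = series_mtbf([70_000])
print(f"three small fans: {three_small_fans:,.0f} h")
print(f"one large fan:    {one_large_fan:,.0f} h")
```

Under these assumptions, three identical fans in series have one third of the MTBF of a single fan, which is one reason a single larger fan can be attractive for high-reliability systems.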

The effectiveness of active cooling depends on how much air flows over the board. The amount of heat dissipated is directly related to getting as many air molecules as possible into contact with the heatsink. This is particularly important during board layout and system configuration, which should account for the amount of airflow a given board will receive in an actively cooled system using a central fan. Another challenge is making sure that the airflow is balanced throughout the system, giving each slot a similar amount of airflow.
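
As a rough first-pass check, the bulk airflow needed to carry away a given power can be estimated from an energy balance on the air stream, flow = P / (ρ · cp · ΔT). The sketch below assumes sea-level air properties and that all of the heat ends up in the air; the function name and constants are illustrative choices, not part of any standard.

```python
AIR_DENSITY = 1.2         # kg/m^3, assumed: air near sea level at about 20 degC
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K), assumed constant
M3S_TO_CFM = 2118.88      # cubic feet per minute in one m^3/s


def required_airflow_cfm(power_w, air_temp_rise_c, density=AIR_DENSITY):
    """Airflow needed to remove power_w with a given inlet-to-outlet air temperature rise."""
    if air_temp_rise_c <= 0:
        raise ValueError("air temperature rise must be positive")
    flow_m3s = power_w / (density * AIR_SPECIFIC_HEAT * air_temp_rise_c)
    return flow_m3s * M3S_TO_CFM


# Example: 100 W of dissipation with a 10 degC allowed air temperature rise.
print(f"{required_airflow_cfm(100, 10):.1f} CFM")
```

For 100 W and a 10°C rise this works out to roughly 17.6 CFM. It is a lower bound on fan capacity, because only part of the air moved by a fan actually passes over the heatsinks.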

To further assist with the thermal analysis process, board suppliers may provide graphs that indicate the amount of airflow required to sufficiently cool the board at a given air temperature. In fact, the specifications for certain platforms, such as MicroTCA, explicitly require that the documentation for compliant boards include the temperature versus airflow curves. Using the temperature and airflow information, designers can determine the thermal limits of the system, assuming the amount of airflow provided to each board can be measured or calculated.
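
One way to use such a curve is to interpolate between the published points. The sketch below assumes a hypothetical board whose supplier gives maximum inlet air temperature at four airflow values; the data points are invented for illustration and don’t describe any real product.

```python
from bisect import bisect_left

# Hypothetical curve: (airflow in CFM, maximum inlet air temperature in degC).
CURVE = [(5.0, 35.0), (10.0, 45.0), (15.0, 52.0), (20.0, 55.0)]


def max_inlet_temp_c(airflow_cfm, curve=CURVE):
    """Linearly interpolate the allowed inlet temperature at a given airflow."""
    flows = [flow for flow, _ in curve]
    if airflow_cfm < flows[0]:
        raise ValueError("airflow below the characterized range")
    if airflow_cfm >= flows[-1]:
        return curve[-1][1]  # clamp: take no credit beyond the last data point
    i = bisect_left(flows, airflow_cfm)
    if flows[i] == airflow_cfm:
        return curve[i][1]
    (f0, t0), (f1, t1) = curve[i - 1], curve[i]
    return t0 + (t1 - t0) * (airflow_cfm - f0) / (f1 - f0)


print(max_inlet_temp_c(12.5))
```

Refusing to extrapolate below the lowest characterized airflow is a deliberate choice here, since a board’s behavior outside its documented range is unknown.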

As demand increases for smaller, more rugged systems, cooling fans become an issue because of limited power budgets and high MTBF requirements. Despite all of the recent advances in fan design and construction, this thermal-management mainstay methodology has been one of the main causes of mechanical failure in any system. Workarounds, such as providing easy-to-replace, hot-swap fan options in systems and monitoring fan speeds in software, help mitigate failure risks.
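
Software fan monitoring can be as simple as comparing each measured speed against a fraction of its nominal value. The sketch below assumes the RPM readings have already been obtained from the platform’s hardware monitor (that step is platform-specific and not shown); the fan names and the 80% threshold are hypothetical.

```python
def failing_fans(rpm_readings, nominal_rpm, min_fraction=0.8):
    """Return the names of fans spinning below min_fraction of their nominal speed."""
    flagged = []
    for name, rpm in rpm_readings.items():
        if rpm < min_fraction * nominal_rpm[name]:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical readings: fan2 has slowed well below its nominal speed.
readings = {"fan1": 4000, "fan2": 2500}
nominal = {"fan1": 4200, "fan2": 4200}
print(failing_fans(readings, nominal))
```

Flagging a slowing fan before it stops gives maintenance staff time to swap a hot-swap unit without a thermal shutdown.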

As mechanical devices, fans are subject to mechanical wear and contribute to system vibration. Over the life of a system, fans can slowly degrade or fail completely, significantly impacting the system’s thermal-management effectiveness. In addition to mechanical failures, fan usage increases power consumption and, if poorly implemented, may add much more noise to the system.

Moreover, space-constrained systems, such as box PCs and systems that must be fully sealed (e.g., transportation applications), can’t use fans for thermal management. For this reason, most onboard and active cooling fans find homes in rack-mounted industrial PCs, server systems, and some conduction-cooled systems.

Passive Convection Cooling: A Fanless Solution
Fanless convection cooling is an alternative thermal-management solution for systems that can’t use an active cooling method and that require medium-range computing performance. The names “fanless convection” and “natural convection” are somewhat of a misnomer, though. Convection cooling in a fanless application relies on natural airflow, with hot air rising, but it also employs radiation for additional thermal dissipation.

Similar to passive conduction cooling, this cooling mechanism allows for a fanless system (Fig. 2). “Natural convection” and radiation are far less efficient cooling mechanisms, though, displaying much-reduced power dissipation. Designers considering convection cooling will typically have a practical limit of about 15 W of heat dissipation for a 6U form factor (e.g., VME), or about 12 W for a 3U form factor (e.g., CompactPCI). When compared to conduction-cooled boards that can consume 70 W or more, the heat dissipation requirements must be considerably lower.
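
The combined effect of natural convection and radiation can be approximated with Newton’s law of cooling plus the Stefan-Boltzmann law. In the sketch below, the heat-transfer coefficient, emissivity, and surface area are assumed values chosen only for illustration; real values depend on geometry, orientation, and surface finish.

```python
STEFAN_BOLTZMANN = 5.670374419e-8  # W/(m^2*K^4)


def passive_dissipation_w(area_m2, surface_c, ambient_c, h=7.0, emissivity=0.85):
    """Estimate heat shed by a surface through natural convection plus radiation.

    h (W/(m^2*K)) and emissivity are assumptions, not measured values.
    """
    ts = surface_c + 273.15  # radiation needs absolute temperatures
    ta = ambient_c + 273.15
    convection = h * area_m2 * (ts - ta)
    radiation = emissivity * STEFAN_BOLTZMANN * area_m2 * (ts**4 - ta**4)
    return convection + radiation


# Assumed example: 0.05 m^2 of surface at 75 degC in 60 degC ambient air.
print(f"{passive_dissipation_w(0.05, 75, 60):.1f} W")
```

With these assumed inputs, convection and radiation contribute similar amounts, which illustrates why surface finish matters in fanless designs.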

Software applications are available to model all cooling methods (discussed later in the article). These may be particularly useful for a natural convection-cooled system to help determine a more exact power dissipation threshold. The rule of thumb for applications with less than 60°C ambient air temperature and embedded CPU junction temperatures of approximately 105°C is typically in the 12- to 15-W range.
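
The rule of thumb can be restated as a required junction-to-ambient thermal resistance, θJA = (TJ − TA) / P. The sketch below treats the board as a single lumped heat source, which is a simplification, and the function names are illustrative.

```python
def required_theta_ja(tj_max_c, ambient_c, power_w):
    """Largest junction-to-ambient thermal resistance (degC/W) that keeps TJ within limits."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return (tj_max_c - ambient_c) / power_w


def junction_temp_c(ambient_c, power_w, theta_ja):
    """Steady-state junction temperature for a given thermal resistance."""
    return ambient_c + power_w * theta_ja


# Figures from the rule of thumb: 105 degC junction, 60 degC ambient, 12 to 15 W.
for power in (12, 15):
    print(power, "W ->", required_theta_ja(105, 60, power), "degC/W")
```

With a 45°C budget between ambient and junction, 15 W requires about 3°C/W and 12 W about 3.75°C/W, which gives a feel for what a passive heat path must achieve.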

In the past, the 12- to 15-W threshold severely limited the applicability of natural convection to microcontroller applications and some low-power PowerPC applications. But with the advent of Intel’s family of Atom CPUs, the capability now exists for a full single-board computer (SBC) with acceptable performance suitable for general-purpose computing applications. This architecture allows all high-speed peripherals to run at full bandwidth, including Gigabit Ethernet, SATA, and PCI Express.

System-Level Conduction Cooling
Conduction cooling is primarily used for rugged environments that contend with moisture, corrosive atmospheres, sand and dust, and other degrading factors. System boards, power supplies, and other system components are sealed in an airtight enclosure, with the edges of the system components mechanically clamped to the sides of the enclosure. Heat from system components is conducted through the chassis structure and dissipated through one of four methods:
• Forced air through hollow side walls
• Forced liquid through channels in the side walls
• Passive convection via external fins
• Passive convection via a cold plate that’s typically mounted on the bottom of the unit; the cold plate itself can be either passive or actively cooled.

Numerous rugged systems now using computer-on-modules or stackable PC/104 single-board computers (SBCs) require different cooling mechanisms instead of traditional finned heatsinks mounted directly on the CPU. Many computer-on-module or PC/104 systems employ thermal gap pads and heat pipes to connect components directly to the chassis for heat dissipation (Fig. 3).
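
A conduction path like this can be modeled as thermal resistances in series, with each stage adding a temperature rise of power times resistance. The stage names and resistance values below are hypothetical; actual values would come from datasheets and measurement.

```python
def node_temperatures(ambient_c, power_w, path):
    """Temperature at the hot side of each stage, listed from junction to ambient.

    path is ordered from the junction outward as (stage name, degC/W) pairs.
    """
    temps = []
    temp = ambient_c
    for name, r_c_per_w in reversed(path):  # walk inward from ambient
        temp += power_w * r_c_per_w
        temps.append((name, temp))
    return list(reversed(temps))


# Hypothetical heat path for a 20-W module in a sealed chassis at 55 degC ambient.
PATH = [
    ("junction-to-case", 0.5),
    ("gap pad", 0.4),
    ("heat pipe", 0.3),
    ("chassis-to-ambient", 1.2),
]
for stage, temp_c in node_temperatures(55.0, 20.0, PATH):
    print(f"{stage:20s} {temp_c:6.1f} degC")
```

Listing the temperature at every stage shows where most of the budget is spent; in this made-up example the chassis-to-ambient stage dominates.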

A limiting factor for conduction cooling is cost: its special chassis design makes it more expensive than an equivalent convection-cooled system. On top of that, despite their attractiveness in achieving higher power densities, liquid cooling solutions also tend to be heavy, large, and complex.

Semi- And Full-Custom Solutions
Board form factors with available cooling solutions are often critical to the system design. They minimize design risk with faster time-to-market and offer availability, long lifecycle, upgradability, and interoperability. However, the arena of embedded design often requires customized performance.

Applications of this ilk include wearable computers used by soldiers on the battlefield or emergency personnel and rescue teams, systems integrated into sea- or air-based applications, sealed or drip-proof medical imaging systems, or industrial systems that must contend with heat and airborne contaminants such as dust. The added requirements further constrict airflow in and out of the system.

Designers familiar or experienced with the complex thermal issues unleashed by these harsh environments may be able to address the modeling, calculations, design layout concepts, and cooling options demanded by the application at hand. Those who are not may prefer to turn to a vendor that can minimize design risk and time-to-market. An off-the-shelf embedded PC comes with the knowledge that an experienced system supplier addressed the thermal layers as well as specified and validated performance thresholds.

Enabling Technologies
Understanding thermal-design issues can help a system designer choose among board-level, module, and chassis-level products early in the design process. For instance, in-depth thermal modeling of a proposed system can be done before extensive work is spent on a mechanical design, which prevents the need for costly, time-consuming redesigns. Active and passive cooling solutions, such as heatsinks or heat spreaders, can take up a lot of space in the system design. By utilizing COTS solutions that have some level of flexibility, a number of options may need little or no modification in terms of thermal requirements.

Starting with SpeedStep, Intel has propagated power-conservation strategies throughout its processor lines. One example is Kontron’s 20- to 40-W computer-on-module, which integrates Intel’s advances with the new 32-nm architecture. The computer-on-module pairs Intel’s Core i7/i5 mobile processors, which run at up to 2.53 GHz and integrate the memory and graphics controllers, with the new QM57 platform controller hub. Active and passive heatsink options for the new ETXexpress-AI module help system designers select the heat-dissipation option needed to meet system requirements.

The 45-nm Intel Atom has enabled a number of new devices and performance levels that were previously unattainable because of power-consumption requirements. Power dissipation of the compact processor (13 mm by 14 mm), together with the single-chip Intel system controller hub US15W (22 mm by 22 mm), is less than 5 W. Other processors, such as the Intel Atom Z520PT used in Kontron’s microETXexpress-XL sub-8-W module, are designed for –40°C to 85°C industrial temperature operation, suiting them for extreme environment applications.

Design Considerations For Thermal Management
Many design requirements come into play when determining an appropriate, cost-effective thermal-management method. For instance, in an actively cooled system, acoustic noise constraints may limit the size and types of fans, vents, and ducts that can be used in a particular design. The evaluation typically starts at the individual component level, where designers use software and simulations to accurately model an application’s system-level thermal characteristics.

Industry testing standards like MIL-STD-810, which are used to define environmental methods and test protocol, also help in providing designers the details of thermal management for a particular computing platform. While some applications depend on industry standards that address cooling concerns in their basic specification, others benefit from a variety of component building blocks to achieve cost-effective resolution of thermal-management issues.

To determine component temperature in a system environment, designers must consider not only the component, but also the board and system thermal characteristics. Modeling a system’s thermal characteristics consists of the following elements:
• Ambient temperature around components
• Any component-level thermal cooling solution
• Solar loading (for unprotected field applications)
• Altitude (for military/aeronautics applications)
• Airflow over and surrounding the component and board
• Board-level size constraints that may limit the size of a thermal method
• Overall power dissipation
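
Altitude matters because thinner air carries less heat per unit volume. The sketch below uses the International Standard Atmosphere model for the troposphere to estimate air density relative to sea level, and from that the extra volumetric airflow needed to keep the same mass flow. It assumes standard-day conditions, and the function names are illustrative.

```python
G = 9.80665          # m/s^2, standard gravity
M_AIR = 0.0289644    # kg/mol, molar mass of dry air
R_GAS = 8.31446      # J/(mol*K), universal gas constant
LAPSE_RATE = 0.0065  # K/m, standard tropospheric lapse rate
T0 = 288.15          # K, standard sea-level temperature


def air_density_ratio(altitude_m):
    """Air density relative to sea level (standard atmosphere, 0 to 11 km)."""
    if not 0 <= altitude_m <= 11_000:
        raise ValueError("model is only valid from 0 to 11,000 m")
    exponent = G * M_AIR / (R_GAS * LAPSE_RATE) - 1.0
    return (1.0 - LAPSE_RATE * altitude_m / T0) ** exponent


def airflow_scale_for_altitude(altitude_m):
    """Factor by which volumetric airflow must grow to keep the same air mass flow."""
    return 1.0 / air_density_ratio(altitude_m)


print(f"density ratio at 3,000 m: {air_density_ratio(3000):.2f}")
print(f"airflow scale at 3,000 m: {airflow_scale_for_altitude(3000):.2f}")
```

At 3,000 m the air is roughly a quarter less dense than at sea level, so a fan that was adequate on the bench may need about a third more volumetric flow in the field.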

While a great deal of technical science goes into developing an effective thermal-management solution, an element of conjecture and past experience also makes up part of that process. To analyze and properly test heat management, designers must understand and carefully model how heat is generated in a given design. Ideally this step is taken early in the design process, allowing for sound decisions about airflow paths and the location of power-dissipating components. It’s crucial to the overall success of the design, and it avoids errors that impact system reliability, performance, and cost, or that cause failures in critical field applications.

For example, a surgical theater where drip-proof systems keep components safe from contaminants requires that environmental conditions be accurately specified as part of the design process, including definitions of heat, shock, vibration, airborne particles, and moisture. Designers can then evaluate thermal performance using simulation tools and modeled scenarios demonstrating thermal impact on preliminary hardware options.

Thermal-Modeling Tools And Tests
Thermal-modeling tools like FloTHERM from Mentor Graphics use advanced computational-fluid-dynamic (CFD) evaluation techniques to accurately predict airflow, temperature distribution, and heat transfer in components, boards, and even complete systems. This type of advanced mathematical modeling enables engineers to create virtual models of their design, analyze thermal performance, and create and test modifications easily, before any equipment prototypes go into production.

Viewing and understanding airflow and temperature distribution through a three-dimensional model helps avoid costly design issues by validating temperature thresholds and guarding against overdesign. For example, solving heat conduction, convection, and radiation in three dimensions shows engineers whether or not heat moves downward toward the printed circuit board, verifying how much is alleviated by a heatsink placed on top of a specific component.

After the computer simulations are performed and the designers feel confident that the design is stable and meets the specification’s goals, one or more prototypes are built using materials, parts, and assembly processes that are very close to those of the final product. Once these prototypes are completed, they’re subjected to a series of real-world tests, including thermal cycling, shock and vibration, electromagnetic interference (EMI), and electrostatic discharge (ESD), corresponding to customer requirements. Also, devices are usually tested during operation against previously referenced standards such as MIL-STD-810 or SAE to ensure that they will not fail under stress.

Cooling Trends
Passive-cooled designs are growing in popularity, simply due to the system components being so low in power that sufficient cooling occurs through natural heat exchange between the enclosure and surrounding air. Heat is dissipated by transferring it from various heat-producing board elements to the external walls of the system.

Processor and chipset power consumption is trending toward more reasonable values. For example, typical thermal design power for a desktop CPU is around 60 to 70 W. So, designers have greater access to alternative cooling methods, such as passive convection and conduction techniques. In many cases, Intel embedded mobile components, similar to those used in laptops or netbooks but with long lifespans for embedded platforms, are suitable for passive cooling solutions because they dissipate 30 W or less.

Transportation Application Design
COTS-based customization lies at the heart of transportation design, with modern rail, road, air, and sea industries requiring many diverse applications. Thermal failure isn’t an option when it involves passenger safety or critical systems that keep the world moving. Fans may be useful in theory, but they’re impractical in many of these applications due to insufficient airflow to guarantee performance and risk of failure.

In a fleet management system, for example, the layout included an extended-temperature Atom component in a cast aluminum housing. In this case, the Atom processor and chipset are placed on top of the primary board and then thermally interfaced to the top of the aluminum housing for cooling.

Instead of relying on natural air convection or airflow around the system, the transfer of heat to the housing provides enough cooling to keep the components within their rated temperature ranges. The enclosure itself radiates heat, but it’s minimal and within accepted performance levels for this application. Thermal modeling of this design validated the extended temperature components and the survival of the system under extreme temperatures.

Medical Systems Application Design
Here, integrated components conduct heat to the enclosure, and natural air convection across heatsink fins removes heat from the system. For example, an integrated CPU and display system designed for sterile surgical environments demonstrates a number of mechanical and thermal challenges for convection designs. Many of these systems are deployed in areas where additional operating noise from cooling fans is undesirable, even when overall system reliability is critical.

Thus, passive cooling becomes a viable option. However, designers must balance thermal and mechanical requirements: the design must be robust enough to allow proper cooling without becoming unwieldy from added weight or compromising the system’s aesthetics. This system requires a passive cooling solution that’s not too heavy and is IPx1-compliant, ensuring safety from vertically dripping water and allowing the device to be sterilized with hospital-grade disinfectants.

Modeling this surgical environment involved a number of FloTHERM variations. Worst-case thermal and power-consumption boundaries were assumed to create the most robust design. Thermal modeling should always consider the absolute worst-case operating conditions, including variables such as temperature and altitude. To ensure thermal models are sufficient, it’s good to perform a sanity check to stress components to their maximum usage.

In this modeling example, the maximum ambient operating temperature was 40°C, with the various integrated components drawing a consistent 63 W. Processor cores were set to 100% workload in mock configurations, yielding power consumption comfortably lower than the 63 W maximum.

By incorporating an outer plastic shell containing the inner steel system case, designers could create a thermal buffer zone, including internal vents for effective convection airflow and cool air intake designed in accordance with IPx1. After internal components were laid out and characteristics assigned to them, thermal monitoring points were set within the model, tracking temperatures throughout the modeling process.

Thousands of iterations later, critical measurements (including the CPU and hard drive’s individual ambient conditions) were validated within maximum operation conditions. The CPU showed a 15°C margin from its 105°C TJUNCTION rating. This is more or less the component’s maximum rated operating temperature, or how hot the component can get. Key components like a CPU may be throttled once they reach their maximum temperature to prevent damage, which is better than the component exceeding its rated maximum temperature and potentially causing system damage.
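
The reported margin can be turned into a rough ambient headroom figure. The sketch below uses the numbers quoted above (105°C rating, 15°C margin, 40°C modeled ambient) and assumes the junction-to-ambient temperature rise stays the same as ambient changes, which is only an approximation; the function name is illustrative.

```python
def max_ambient_c(tj_rated_c, tj_predicted_c, ambient_modeled_c):
    """Highest ambient that keeps the junction at or below its rating.

    Assumes the junction-to-ambient rise does not change with ambient.
    """
    rise_c = tj_predicted_c - ambient_modeled_c
    return tj_rated_c - rise_c


tj_rated = 105.0
margin = 15.0
tj_predicted = tj_rated - margin  # junction temperature implied by the margin
print(max_ambient_c(tj_rated, tj_predicted, 40.0))
```

Under that assumption the 15°C junction margin corresponds to the same 15°C of ambient headroom above the 40°C modeled condition, before any throttling would be expected.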

Military Application Design
Kontron was able to expand its ability to handle complex military systems with the acquisition of AP Labs. When it comes to unmanned vehicle programs, a number of key factors are critical to success for their electronic systems. Survivability must be the core design objective for any mobile electronics system used in military applications. If the system can’t operate continuously and reliably within the target environment, no amount of sophisticated features is of any practical benefit for the mission.

The most effective design approach has proven to be implementing all of the required system functionality in a chassis already certified for ruggedized operation, not one that’s simply listed as “designed to meet.” For example, selecting a chassis manufactured to meet the requirements of MIL-E-5400 Class 1 thermal performance, MIL-S-901D shock, MIL-STD-167-1 vibration, etc., assures the designer that it can withstand specified extremes of temperature, vibration, shock, salt spray, sand, and chemical exposure. Furthermore, it maintains a sealed and temperature-controlled environment for the computing elements and electronics inside.

Depending on the mission requirements, the chassis may need to be cooled and mounted in a number of different ways. In addition to mounting systems within standard racks or into ARINC-style equipment trays, it’s useful to have options for custom hard mounting or shock mounting within the mobile platform.

From a cooling standpoint, some applications can use forced air, sometimes called forced convection (using internal or external fans). Yet because of space, weight, and environmental constraints, many unmanned aerial vehicle (UAV) applications need to use conduction-cooling methodologies (with or without fan assist). Therefore, it’s helpful for designers to be able to choose among various cooling and mounting options at the outset and then build the optimized system upon a proven foundation.