Mind Your Thermal Management To Improve Reliability
Historically, considerable attention has been given to the physical design of processor-based systems, driven by the inverse relationship between absolute temperature and processor performance. This has propelled the development of many novel cooling concepts and has led to the high reliability now associated with computing, networking, and telecom equipment. For many other products, reliability rather than performance is the key issue, yet physical design is all too often an afterthought.
It is also worth highlighting the role of experts in physical design. Traditional design-build-test-fix product development practices inevitably require “experts” to get involved between the test and fix stages. This can get your product to market, though this methodology will often cause your project schedule to slip. And, such late-stage fixes can be an “Achilles heel” from a reliability perspective.
Having physical design experts in-house is a real advantage during your product creation process. However, this benefit can only be exploited to maximum effect when it’s utilized from the outset of the project. If this expertise is only employed at the point of test and fix, then optimum reliability will not be achieved, and any solution often will come with a cost penalty.
Cooler Is More Reliable
This is generally true, though not mainly because chip-level failure mechanisms are accelerated by absolute temperature. Within the manufacturer’s specified operating temperature range, most of the reported failure mechanisms aren’t due to high steady-state temperatures. Rather, they depend on temperature gradients within the electronics assembly, the temperature cycle magnitude, and how fast the equipment heats up and cools down.
Generally, reducing the operating temperature also reduces these accelerants, so in a general sense it is true to state that cooler is more reliable. This is good news, and it means that many of the design practices developed in the telecom, networking, and computing markets over the years can be more widely applied to enhance your product reliability.
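The dependence on temperature cycle magnitude can be illustrated with a Coffin-Manson-style power law, a model often used for solder-joint fatigue in which cycles to failure scale as the temperature swing raised to a negative exponent. The sketch below is an illustration only: the function name, the 60°C baseline swing, and the exponent of 2.0 are assumptions, and real exponents depend on the solder alloy and joint geometry.

```python
def relative_cycle_life(baseline_swing_c, new_swing_c, exponent=2.0):
    """Return cycles-to-failure at new_swing_c divided by cycles-to-failure
    at baseline_swing_c, assuming N_f is proportional to swing ** -exponent.

    The exponent is an assumed, illustrative value, not a measured one.
    """
    if baseline_swing_c <= 0 or new_swing_c <= 0:
        raise ValueError("temperature swings must be positive")
    return (baseline_swing_c / new_swing_c) ** exponent


if __name__ == "__main__":
    baseline = 60.0  # assumed power-on/power-off temperature swing, deg C
    for swing in (60.0, 50.0, 40.0, 30.0):
        ratio = relative_cycle_life(baseline, swing)
        print(f"swing {swing:4.0f} C -> relative cycle life {ratio:.2f}x")
```

Under these assumptions, halving the swing from 60°C to 30°C quadruples the estimated cycle life, which is why reducing the temperature excursion at power-on and power-off can matter more than the steady-state temperature itself.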
Component Failures
When components fail, the cause is seldom a mechanical failure intrinsic to the component. Instead, it’s often due to some form of physical overstress. Because determining the root cause of the failure is both difficult and costly, replacement of faulty parts (e.g., printed-circuit boards, or PCBs) in the field or when products are returned is often the norm. This can be an expensive solution to a problem that could have been avoided during the design phase of the product in question.
When talking about reliability, it’s important to realize that product reliability is a function of the assembled product, not the sum of the reliability of its individual components. Current chip packaging technology has developed to the point that chip packages are intrinsically very reliable, so very often the component itself isn’t what fails.
When Components Don’t Fail
Perhaps not surprisingly then, many “field failures” aren’t related to components. Rather, they’re assembly-related. How the various components are assembled to produce the final product is what’s of concern. For modern electronics products, the dominant reliability problems are related to interconnects, especially solder joints.
This is generally due to mechanical issues associated with powering the equipment on and off. Both the steady state operation of the equipment and how temperatures change over time affect the reliability of your product. With this in mind, it’s worth reviewing some factors that influence the field reliability of a product.
Poor Board Layout
Poor distribution of dissipated power can lead to components overheating. The main cause of this is a lack of awareness of the physical challenges by those designing the PCB. At best this is not their primary job function, and at worst it is seen as somebody else’s problem. Board designers generally don’t have a mechanical engineering background, so they require tool support to consider thermal issues.
In digital design it is common practice to group high-speed digital circuitry together on the board, which generally has the side effect of grouping the highest heat sources together. This is somewhat inevitable, but considering the physical design upfront can significantly improve the layout without impacting the electronic performance.
As well as power distribution on the board, several factors influence reliability: the choice of package used for a given chip, the orientation of the components relative to the airflow, and blockage effects caused by components, connectors, and heatsinks when they’re included in the upfront design.
If these issues aren’t considered from the outset, very often something will need to be done to fix the resulting problems later on in the design cycle. One solution is to add more power and ground layers to the board, particularly if a board re-spin is required. Board re-spins are costly and time consuming, and they are becoming an ever-more frequent symptom when upfront physical design is lacking.
However, adding layers makes the board heavier and stiffer. As a result, the board won’t flex as much when the equipment is powered on and off, which increases the strain on the connections between packages and the board. Hence, utilizing the board as a heatsink in this way can cause reliability problems as well as add cost to the board and weight to the final product.
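To get a feel for the stiffening effect, recall that the bending stiffness of a uniform plate scales with the cube of its thickness. The sketch below assumes that the extra layers make the board thicker and that the laminate's effective elastic properties are unchanged. Both are simplifications, and the thickness values are hypothetical.

```python
def stiffness_ratio(old_thickness_mm, new_thickness_mm):
    """Ratio of plate bending stiffness (new / old) for the same material.

    Uses the thickness-cubed scaling of plate flexural rigidity; material
    properties are assumed identical, so they cancel out of the ratio.
    """
    if old_thickness_mm <= 0 or new_thickness_mm <= 0:
        raise ValueError("thicknesses must be positive")
    return (new_thickness_mm / old_thickness_mm) ** 3


if __name__ == "__main__":
    old, new = 1.6, 2.0  # hypothetical board thicknesses in mm
    print(f"{old} mm -> {new} mm: about {stiffness_ratio(old, new):.2f}x stiffer")
```

With these example values, a 25% increase in thickness roughly doubles the bending stiffness, so a seemingly small stack-up change can noticeably reduce how much the board is able to flex.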
Heatsinks
Along with adding weight and cost to a product, attaching a heatsink directly to a component additionally strains the connections between the package and the board, where the majority of failures occur. This is particularly true when the heatsink is added late in the design, as fixing the heatsink to the component is often the only mounting option. Mechanical attachment to the board is preferable, but requires space to be allocated during board design.
The heatsink attachment itself is a reliability concern. In the worst case, heatsinks can become detached. For example, attachment clips can come off if the equipment is dropped during shipment or operation, or the heatsink may simply fall off during operation if it’s attached with double-sided adhesive tape or glue. A mechanical attachment should always be used.
The thermal performance may also degrade over time. Thermal paste used to improve the thermal contact between the package and the heatsink can creep and dry out, and dust can accumulate, blocking the flow channels between the fins of the heatsink, particularly if these are very narrow.
If a heatsink is added late in the design, it is often difficult to ground it adequately once the board is tracked. In processor-based systems, ungrounded heatsinks broadcast clock harmonic electromagnetic radiation, causing problems during final compliance testing.
If a heatsink is required, space constraints will often limit the available solution. The best heatsink solution will provide adequate cooling yet have low weight and present the minimum blockage to flow. Even so, problems may arise depending on what components are in the wake behind the heatsink where cooling is impaired due to the reduced air flow.
Fans
When designing a system, it is important to ensure that any fans will operate within the recommended range on the fan curve specified by the manufacturer. Adding heatsinks late in the design increases the system pressure drop, reducing the flow through the system and increasing the load on the fans.
Assuming that the axial fans were originally specified to operate within their recommended flow rate range, adding heatsinks can move the fans’ operating point away from optimum, causing increased fan noise and reduced operating lifetime.
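The shift in operating point can be sketched by intersecting a fan curve with a system resistance curve. The example below makes two simplifying assumptions that are not taken from any manufacturer's data: the fan curve is a straight line between maximum pressure and free delivery, and the system pressure drop grows with the square of the flow rate. The function name and all numbers are hypothetical.

```python
import math


def operating_point(max_pressure_pa, max_flow_m3s, system_k):
    """Intersect a linearized fan curve with a quadratic system curve.

    Fan curve (assumed linear):  dp = max_pressure_pa * (1 - q / max_flow_m3s)
    System curve (assumed):      dp = system_k * q ** 2
    Returns (flow in m^3/s, pressure drop in Pa) at the intersection.
    """
    if max_pressure_pa <= 0 or max_flow_m3s <= 0 or system_k < 0:
        raise ValueError("fan limits must be positive and system_k non-negative")
    if system_k == 0:
        return max_flow_m3s, 0.0  # no resistance: fan runs at free delivery
    slope = max_pressure_pa / max_flow_m3s
    # Positive root of system_k*q^2 + slope*q - max_pressure_pa = 0
    disc = slope ** 2 + 4 * system_k * max_pressure_pa
    q = (-slope + math.sqrt(disc)) / (2 * system_k)
    return q, system_k * q ** 2


if __name__ == "__main__":
    fan = dict(max_pressure_pa=60.0, max_flow_m3s=0.05)  # hypothetical fan
    cases = (("original design", 8000.0), ("heatsinks added", 16000.0))
    for label, k in cases:
        q, dp = operating_point(system_k=k, **fan)
        print(f"{label}: flow {q * 1000:.1f} L/s at {dp:.1f} Pa")
```

With these made-up numbers, doubling the system resistance lowers the flow and raises the pressure the fan must deliver, which moves the fan along its curve and potentially out of its recommended range. A real check would use the manufacturer's published fan curve rather than a straight line.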
If heatsinks are added late in the design, the fan selection should also be reconsidered. The options, all of which are limited by constraints on the enclosure, include increasing the number of fans, using larger fans, or using higher-performance fans. In general, all of these solutions will increase cost, noise, and system power consumption.
Increasing the number of fans will probably provide the best solution from a noise perspective. However, the system reliability will be impaired due to the increased likelihood of fan failure. Depending on the application, redundancy may need to be built in to improve the reliability. If so, measures will be needed to prevent cooling air from leaking back through a failed fan, so that the equipment can continue to operate after a fan failure.
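The trade-off between fan count and redundancy can be sketched with a simple k-of-n calculation. The example assumes identical fans that fail independently, and it assumes the system genuinely tolerates a failed fan (for instance, because backflow through it is blocked). The function name and the 0.95 survival probability are hypothetical.

```python
from math import comb


def prob_enough_fans(fans_installed, fans_required, fan_survival_prob):
    """Probability that at least fans_required of fans_installed are still
    running, assuming identical fans that fail independently."""
    if not 0.0 <= fan_survival_prob <= 1.0:
        raise ValueError("fan_survival_prob must be between 0 and 1")
    if not 0 <= fans_required <= fans_installed:
        raise ValueError("fans_required must be between 0 and fans_installed")
    p = fan_survival_prob
    n = fans_installed
    # Binomial sum over every count of surviving fans that is sufficient
    return sum(
        comb(n, i) * p ** i * (1 - p) ** (n - i)
        for i in range(fans_required, n + 1)
    )


if __name__ == "__main__":
    p = 0.95  # assumed chance that one fan survives the service interval
    print(f"2 fans, all needed:  {prob_enough_fans(2, 2, p):.4f}")
    print(f"4 fans, all needed:  {prob_enough_fans(4, 4, p):.4f}")
    print(f"5 fans, 4 needed:    {prob_enough_fans(5, 4, p):.4f}")
```

Under these assumptions, going from two fans to four lowers the chance that every fan survives, while adding a fifth redundant fan more than recovers the loss. The independence assumption is optimistic if the fans share a common cause of failure such as dust or elevated inlet temperature.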
To overcome higher system pressure drop, fans need to be added in series. In theory, connecting two fans together in series doubles the delivery pressure, but in practice this doesn’t work well, as the swirl introduced by the upstream fan reduces the suction effectiveness of the downstream fan and increases aeroacoustic noise. A better approach is to design a push-pull system, with fans at both the inlet and outlet.
Larger fans may increase the system flow rate back up to an acceptable level and bring the fans back within their recommended operating range, but the noise is likely to be more than that envisaged with the original design. Sticking with the same form factor and using higher-performance fans, such as diagonal or “semi-axial” fans, may also provide a solution for a given enclosure.
Beyond these approaches, it may be necessary to make more radical changes to the cooling system design to accommodate a different type of fan, such as a centrifugal fan with better flow rate versus pressure drop characteristics. To accommodate these, the enclosure design will need to be changed, as a centrifugal fan is not a drop-in replacement for a lower-cost axial fan.
All of these solutions will increase your system’s power consumption. Typically, around 30% of the power consumed in fan-cooled systems is used in cooling.
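One way to see why recovering lost airflow costs power is the set of fan affinity laws, which relate flow, pressure, and power to rotational speed for a given fan at constant air density. The sketch below applies those idealized scalings. The function name and the speed ratios are hypothetical, and real fans deviate from the ideal, particularly far from their design point.

```python
def affinity_scaling(speed_ratio):
    """Scale factors for one fan when its speed changes by speed_ratio.

    Idealized fan affinity laws at constant air density:
    flow ~ speed, pressure ~ speed^2, power ~ speed^3.
    """
    if speed_ratio <= 0:
        raise ValueError("speed_ratio must be positive")
    return {
        "flow": speed_ratio,
        "pressure": speed_ratio ** 2,
        "power": speed_ratio ** 3,
    }


if __name__ == "__main__":
    for ratio in (1.0, 1.1, 1.2, 1.5):  # hypothetical speed increases
        s = affinity_scaling(ratio)
        print(
            f"speed x{ratio:.1f}: flow x{s['flow']:.2f}, "
            f"pressure x{s['pressure']:.2f}, power x{s['power']:.2f}"
        )
```

In this idealized picture, running a fan 20% faster to win back 20% more flow costs roughly 73% more fan power, which is why late cooling fixes tend to show up in the power budget.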
If more active cooling is required than first thought, there will be an increased load on the power supply. If it has been quite tightly specified, then a larger power supply may be needed to cope with this additional duty. The key to optimizing product reliability is to gain physical insights into the way the design will perform before your product is prototyped.