Electronic Design

We Have Seen The Enemy, And The Enemy Is Heat

Today’s complex SoCs are prone to thermal issues that can cause field failures. Here’s how thermal analysis can help you ferret out those hotspots.

As the semiconductor industry moves through the deep-submicron process nodes, each plateau along the way carries its own signature bugaboo arising from physical effects. At 180 nm, timing-closure issues got everyone's attention. At 130 nm, signal integrity was the topic of the day. At 90 and 65 nm, though, power integrity and leakage are weighing on designers' minds. We now pack so many active elements onto such a small slab of silicon that power density has reached near-critical mass. For example, according to Srikanth Jadcherla, founder and CTO of ArchPro Design Automation, a die measuring 1 by 1 cm with power consumption of 1 W dissipates the equivalent of 10 GW per square kilometer, or 25 GW per square mile.
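
The power-density comparison is plain unit conversion, and it's easy to check. The short Python sketch below scales 1 W over 1 cm² up to a square kilometer and a square mile. The function names are illustrative only. The result works out to 10 GW per square kilometer and about 25.9 GW per square mile, which the figure quoted above rounds down.

```python
# Scale a chip's areal power density up to the large-area units used above.
CM2_PER_KM2 = 1e10                 # (1e5 cm per km) squared
KM2_PER_SQ_MILE = 1.609344 ** 2    # one statute mile is 1.609344 km


def power_density_w_per_cm2(power_w, width_cm, height_cm):
    """Average power density of a rectangular die, in W/cm^2."""
    return power_w / (width_cm * height_cm)


def to_gw_per_km2(density_w_per_cm2):
    """Convert W/cm^2 to GW/km^2."""
    return density_w_per_cm2 * CM2_PER_KM2 / 1e9


def to_gw_per_sq_mile(density_w_per_cm2):
    """Convert W/cm^2 to GW per square mile."""
    return to_gw_per_km2(density_w_per_cm2) * KM2_PER_SQ_MILE


density = power_density_w_per_cm2(1.0, 1.0, 1.0)   # 1 W on a 1-by-1-cm die
print(f"{to_gw_per_km2(density):.1f} GW per square kilometer")
print(f"{to_gw_per_sq_mile(density):.1f} GW per square mile")
```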

Along with enormous increases in power density comes the physics of the submicron realm. With narrower feature sizes come thinner gate insulators, and that translates into leakage power. Leakage across gates is a condition in which the gate never shuts entirely off. Rather, it continues to consume power even though it's in a nominally passive state. At the 65-nm node, leakage can constitute more than 40% of the overall power consumption of a system-on-a-chip (SoC) or ASIC (Fig. 1).

Unfortunately, leakage has a symbiotic, and positively reinforcing, relationship with temperature. Leakage begets heat, which begets more leakage, which begets even more heat. And, in worst-case scenarios, thermal runaway can ensue, leading to potential fires and/or explosions in end-user systems.
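
To see how that loop can either settle or run away, consider a deliberately simplified lumped model, sketched here in Python. Every number in it is an assumption chosen for illustration, not data from any real device: leakage is taken to double for every 20°C rise, the die dissipates 0.6 W of dynamic power plus 0.4 W of leakage at 25°C, and the whole thermal path is collapsed into a single junction-to-ambient resistance (theta_ja). The function simply recomputes temperature from power and power from temperature until the two stop changing or the temperature passes a limit.

```python
def leakage_w(temp_c, leak_ref_w=0.4, ref_temp_c=25.0, doubling_c=20.0):
    """Assumed leakage model: power doubles every `doubling_c` degrees C."""
    return leak_ref_w * 2.0 ** ((temp_c - ref_temp_c) / doubling_c)


def settle_temperature(dynamic_w, theta_ja_c_per_w, ambient_c=25.0,
                       limit_c=150.0, tol_c=0.01, max_iters=1000):
    """Iterate T -> ambient + theta_ja * (P_dynamic + P_leak(T)).

    Returns (temperature_c, converged). converged is False when the
    temperature climbs past limit_c, which this sketch treats as runaway.
    """
    temp_c = ambient_c
    for _ in range(max_iters):
        total_w = dynamic_w + leakage_w(temp_c)
        next_c = ambient_c + theta_ja_c_per_w * total_w
        if next_c > limit_c:
            return next_c, False          # heat and leakage keep feeding each other
        if abs(next_c - temp_c) < tol_c:
            return next_c, True           # reached a stable operating point
        temp_c = next_c
    return temp_c, False


# Hypothetical thermal paths: a good one (10 C/W) and a poor one (20 C/W).
for theta_ja in (10.0, 20.0):
    temp, ok = settle_temperature(dynamic_w=0.6, theta_ja_c_per_w=theta_ja)
    state = "settles" if ok else "runs away"
    print(f"theta_ja = {theta_ja:4.1f} C/W: {state} (last temperature {temp:.1f} C)")
```

With these assumed values, the same die finds a stable operating point with the better thermal path and never settles with the poorer one. The chip's power hasn't changed, only its ability to shed heat.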

Thus, heat is indeed an enemy that must be faced head-on. Fortunately, designers can turn to a number of tools and methodologies for prediction and management of thermal effects. In this article, we'll explore some of the thermal-analysis methods that help unearth problem areas. We'll also discuss some best practices in the thermal-management arena.

In addition to its exponential relationship with temperature, leakage is at the root of more subtle, yet no less pernicious, effects. Chief among these are problems brought on by electromigration, which are exacerbated by the higher current densities.

Then there's the broader issue of thermal variation across a given die's planar dimensions—even in the Z dimension between metal layers. Not only do disparities exist in temperature at a great many points on and within the die, but those variations are far from constant. As major functional blocks turn on and off, switching activity will have an ongoing effect on the die's thermal characteristics.

There is, in fact, an interconnected maze of effects brought about by temperature variation that involves timing, signal integrity, and reliability (Fig. 2). As mentioned, temperature has a positive feedback loop with power and leakage. But it also affects timing by weakening the driving capability of devices. Higher temperatures mean an increase in the passive resistance of interconnects, which in turn increases delays.

Temperature affects IR drop and electromigration primarily through Joule heating, or self-heating of the interconnects. This is another result of the increased resistance of the wires due to elevated temperatures. The circuit's electromigration lifetime degrades exponentially with rising temperatures. In IR-drop terms, that increased resistance on the power and ground grids leads to larger IR drops, meaning more power consumption.
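
Both effects can be put into rough numbers. The Python sketch below uses two textbook relationships: a linear temperature coefficient for wire resistance, and the temperature term of Black's equation for electromigration lifetime at a fixed current density. The coefficient (0.0039 per °C, the bulk-copper value) and the 0.9-eV activation energy are assumptions for illustration. Real interconnect values depend on the process and should come from the foundry.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5


def wire_resistance(r_ref_ohm, temp_c, ref_temp_c=25.0, alpha_per_c=0.0039):
    """Linear model R(T) = R_ref * (1 + alpha * (T - T_ref)).

    alpha_per_c is an assumed value (bulk copper); thin-film wires differ.
    """
    return r_ref_ohm * (1.0 + alpha_per_c * (temp_c - ref_temp_c))


def ir_drop_v(current_a, r_ref_ohm, temp_c):
    """Voltage drop across a supply wire at the given temperature."""
    return current_a * wire_resistance(r_ref_ohm, temp_c)


def em_lifetime_ratio(temp_c, ref_temp_c, activation_ev=0.9):
    """MTTF(T) / MTTF(T_ref) from Black's equation at equal current density.

    Black's equation: MTTF = A * J**(-n) * exp(Ea / (k * T)), with T in kelvin.
    activation_ev is an assumed activation energy.
    """
    temp_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    exponent = activation_ev / BOLTZMANN_EV_PER_K * (1.0 / temp_k - 1.0 / ref_k)
    return math.exp(exponent)


# Hypothetical power-grid segment: 50 milliohms at 25 C carrying 2 A.
print(f"IR drop at 25 C:  {ir_drop_v(2.0, 0.05, 25.0) * 1000:.1f} mV")
print(f"IR drop at 105 C: {ir_drop_v(2.0, 0.05, 105.0) * 1000:.1f} mV")
print(f"EM lifetime at 105 C relative to 85 C: {em_lifetime_ratio(105.0, 85.0):.2f}")
```

Under these assumptions, a 20°C rise in wire temperature costs most of the wire's expected electromigration lifetime, which is why a hotspot sitting on a heavily loaded power strap deserves attention.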

Thermal effects, like most physical effects that plague deep-submicron processes, can be dealt with through guardbanding. But as savvy designers know, excessive guardbanding, i.e., designing for worst-case process, voltage, and temperature corners, will leave performance on the table. It also usually means decoupling the voltage, temperature, process variation, and timing analyses from each other.

But the room for margin continues to shrink anyway. According to Li-Pen Yuan, group R&D director for extraction and power-integrity tools at Synopsys, there are two major problems with the margin approach: "Even if we anticipate elevated temperatures and define a larger range to which we must design, there's no guarantee that the actual chip temperatures will be within that new margin. So we run the risk of violating it."

And pushing margins too aggressively is what will cost you performance. "Thermal analysis should be used to understand the realistic distribution of temperatures on the chip in its various modes of operation. That way, we can potentially reduce the margin and not suffer from excessive guardbanding," says Yuan.

A typical thermal-analysis flow must examine three key elements. One involves the sources of heat and how to model them. Another is how that heat is distributed and/or dissipated. The third is determining when thermal equilibrium is reached.
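
Those three elements show up even in the crudest possible model. The Python sketch below is an illustration only, not how any commercial tool works. It treats the die and the package as two lumped thermal masses: a constant power figure is the heat source, two thermal resistances (die-to-package and package-to-ambient) stand in for the distribution network, and equilibrium is declared once both temperatures change by less than a chosen rate. All component values are hypothetical.

```python
def simulate_to_equilibrium(power_w, r_die_pkg, r_pkg_amb, c_die, c_pkg,
                            ambient_c=25.0, dt_s=0.01,
                            rate_tol_c_per_s=1e-3, max_time_s=3600.0):
    """Explicit-Euler simulation of a two-node thermal RC network.

    Resistances are in C/W and heat capacities in J/C.
    Returns (die_temp_c, package_temp_c, elapsed_s) once both nodes are
    changing more slowly than rate_tol_c_per_s.
    """
    t_die = t_pkg = ambient_c
    elapsed = 0.0
    while elapsed < max_time_s:
        q_die_to_pkg = (t_die - t_pkg) / r_die_pkg        # heat flow, W
        q_pkg_to_amb = (t_pkg - ambient_c) / r_pkg_amb    # heat flow, W
        rate_die = (power_w - q_die_to_pkg) / c_die       # C per second
        rate_pkg = (q_die_to_pkg - q_pkg_to_amb) / c_pkg  # C per second
        t_die += rate_die * dt_s
        t_pkg += rate_pkg * dt_s
        elapsed += dt_s
        if max(abs(rate_die), abs(rate_pkg)) < rate_tol_c_per_s:
            return t_die, t_pkg, elapsed
    raise RuntimeError("no thermal equilibrium within max_time_s")


# Hypothetical values: a 2-W die in a package with modest cooling.
die_c, pkg_c, settle_s = simulate_to_equilibrium(
    power_w=2.0, r_die_pkg=5.0, r_pkg_amb=20.0, c_die=0.05, c_pkg=2.0)
print(f"die {die_c:.1f} C, package {pkg_c:.1f} C after {settle_s:.0f} s")
```

The die responds in a fraction of a second while the package takes minutes to settle, so short bursts of activity and sustained workloads can produce quite different temperature pictures.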

Thermal models for the chip are created using the source model and the distribution network. The models draw upon technology parameters supplied by the silicon foundry. The distribution network is modeled using extraction in thermal tools such as Apache Design Solutions' Sahara-PTE thermal-analysis engine.

Those thermal models are supplemented by the boundary conditions describing their larger environment. According to Dian Yang, VP of product management and GM at Apache Design Solutions, the boundary conditions consist of models for the chip's package as well as for the board the package is attached to. Apache works with Ansys, whose IcePak and IceBoard tools generate those models and also interface with Cadence's Allegro Package Designer as well as Sigrity's Unified Package Designer.

Ansys' package-analysis tools are grid-based, finite-difference solvers that bring in geometric package shapes from either the electrical or mechanical CAD tools. They account for all of the physical elements, including the die, substrate, lead frame, die attach, and encapsulant. The tools automatically build a meshed, finite-element model that breaks the package into tiny fragments. This facilitates predictions of temperatures at anywhere from 30,000 to 75,000 distinct locations within the package.

Once models for the chip, package, and board are in hand, thermal-analysis tools such as Apache's Sahara-PTE, Gradient Design Automation's FireBolt, or ArchPro Design Automation's MVSIM can take on the task of identifying thermal hotspots.

Designers can use these tools to determine whether they're meeting their maximum junction-temperature (TJMAX) specifications. Other tasks performed by thermal-analysis tools include verification of thermal gradients for various modes of IC operation and identification of the best locations for thermal sensors.

Interestingly, ArchPro considers thermal issues within the larger context of power management. Whether you use ArchPro's tools or not, it's worthwhile to ponder how hardware and software can conspire to mishandle systemic reactions to local thermal events.

For example, if a thermal diode in a cell phone trips and issues an interrupt, the CPU's local thermal throttling mechanisms may gate down the clock and phase-locked loop. However, the thermal interrupt goes unserviced. Consequently, the system continues to see an increase in leakage power and temperature. The resulting vicious cycle of leakage and heat can end in thermal runaway.

Traditional logic models often don't account for what each subsystem's logic is doing in the event of thermal problems. "When a thermal interrupt kicks in, many actions are set in motion," says Srikanth Jadcherla, ArchPro's founder and CTO. "Different subsystems are going to standby or shutting down, often in an uncoordinated manner." In traditional simulation, it may look as though the power-control system is responding to the interrupt, but, in fact, that may not be the case at all.

There are those who advocate an implementation flow that in some way, shape, or form considers power management and thermal concerns concurrently. All of the major RTL-to-GDSII flows on the market address this tack in various ways (see "Power And Thermal Analysis Are Best Done Together").

For large systems houses, thermal analysis and management are taken extremely seriously. In the case of Texas Instruments, it's the subject of a company-wide initiative.

"We have what's called the Thermal Council here at TI," explains Darvin Edwards, manager of advanced package modeling and characterization and TI Fellow. "One of the intents of the Council is to educate each of the various business groups within TI as to the nature of thermal issues they'll face in their products." The Council meets to share lessons learned from various design projects.

Within TI, thermal analysis is a standard part of the design flow. Design teams run through analyses to determine whether there will be problems. "We have some rules of thumb," says Edwards. "For example, we check to see if there's going to be more than a 2X differential in temperature gradients across the die." If there are concerns, the product engineers are made aware of the hotspot issues and a power map may be generated for the die.

In the event of such issues, TI follows some best practices in efforts to ameliorate them. For one thing, the engineers will consider reducing the impact of hotspots by attaching the die directly to a high thermal-conductivity heat spreader, such as a copper plate. Then, if a die with hotspots happens to be a thinner die (say, 50 µm in thickness versus 400 µm), that would imply the need for chip/package co-design. A special case concerns packages with stacked die, in which hotspots on one die within the package can create hotspots on another.

Engineers at TI try not to cluster hotspots, if at all possible. Spreading them apart keeps each hotspot away from the "thermal footprint" of neighboring hotspots, keeping each of them cooler. This practice applies to pc-board design as well as to IC design.

If a given die has only one hotspot, the best place for that hotspot is in the center of the die. Conversely, the worst place is in a corner. Silicon itself is a good thermal conductor, so centering a hotspot in the die gives it the best possible position for heat spreading.

But when multiple hotspots exist, it's poor practice to cluster them in the center, which effectively creates one large hotspot. In such cases, it's best to distribute them relatively evenly over the die while still avoiding the corners and/or edges. That way, each hotspot has a chance to dissipate its heat evenly through the medium of the substrate.
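
These placement guidelines can be reproduced with a toy model. The Python sketch below is a rough illustration under stated assumptions, not TI's method or any vendor's solver. It divides the die into a 15-by-15 grid of cells and gives each cell a lateral thermal conductance to its neighbors (heat spreading through the silicon) and a vertical conductance to ambient (heat leaving through the package). It then relaxes the grid to steady state with Gauss-Seidel iteration. The conductance values, grid size, and 1-W hotspots are all hypothetical.

```python
def solve_die_temperatures(power_map, g_lateral=1.0, g_vertical=0.1,
                           ambient_c=25.0, tol_c=1e-6, max_sweeps=10000):
    """Steady-state cell temperatures for a grid of power values (W).

    Conductances are in W/C. Die edges are treated as adiabatic: no heat
    leaves sideways, so edge and corner cells have fewer spreading paths.
    """
    rows, cols = len(power_map), len(power_map[0])
    temps = [[ambient_c] * cols for _ in range(rows)]
    for _ in range(max_sweeps):
        max_change = 0.0
        for r in range(rows):
            for c in range(cols):
                neighbors = [temps[nr][nc]
                             for nr, nc in ((r - 1, c), (r + 1, c),
                                            (r, c - 1), (r, c + 1))
                             if 0 <= nr < rows and 0 <= nc < cols]
                # Heat balance: inflow from neighbors and the source equals
                # outflow to ambient.
                new_t = ((g_lateral * sum(neighbors)
                          + g_vertical * ambient_c + power_map[r][c])
                         / (g_lateral * len(neighbors) + g_vertical))
                max_change = max(max_change, abs(new_t - temps[r][c]))
                temps[r][c] = new_t
        if max_change < tol_c:
            return temps
    raise RuntimeError("temperature grid did not converge")


def peak_temperature(hotspots, size=15, power_w=1.0):
    """Peak die temperature with a power_w source at each (row, col)."""
    power_map = [[0.0] * size for _ in range(size)]
    for r, c in hotspots:
        power_map[r][c] += power_w
    temps = solve_die_temperatures(power_map)
    return max(max(row) for row in temps)


center = peak_temperature([(7, 7)])
corner = peak_temperature([(0, 0)])
clustered = peak_temperature([(7, 7), (7, 8)])
spread = peak_temperature([(7, 3), (7, 11)])
print(f"one hotspot:  center {center:.2f} C, corner {corner:.2f} C")
print(f"two hotspots: clustered {clustered:.2f} C, spread {spread:.2f} C")
```

In this simplified model the corner hotspot runs hotter than the centered one because it has fewer neighboring cells to spread into, and the clustered pair runs hotter than the separated pair because each source sits inside the other's thermal footprint.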

Apache Design Solutions
ArchPro Design Automation
Gradient Design Automation
Magma Design Automation
Texas Instruments