Beyond the Data Sheet: Demystifying Thermal Runaway

Nov. 1, 2007

Mathematical modeling of thermal runaway and a proper thermal analysis of all interacting heat sources can clarify how a device can be designed within the specified operating margins outlined on the manufacturer's data sheet.

Roger Stout, Senior Research Scientist, ON Semiconductor, Technology Development, Advanced Packaging, Phoenix

Thermal runaway in a semiconductor device is a mysterious and scary phenomenon. A device can cross a magic threshold and burn up in an instant — something you'll never see coming. What makes thermal runaway so interesting and important to understand is that it can happen at a much lower temperature than the normal maximum junction temperature rating of a device.

In other words, if you inadvertently designed your system too close to the runaway temperature, even though it may seem to be operating safely away from the maximum-rated junction temperature, there is no safe operating margin. Once runaway begins, no temperature is out of reach.

By exploring a specific mathematical model of a semiconductor device and applying it, we can make some quantitative statements about thermal runaway and attain a better understanding of this condition. This will allow designers to more accurately predict the types of operating margins in their designs with respect to thermal runaway, while using data supplied from manufacturers' data sheets.

Briefly, thermal runaway is what can happen in an electronics system when a particular device starts to generate too much heat. Over time, the excess heat increases, raising the device temperature, which, in a circular fashion, increases the device power dissipation, and so on, until destruction of the device (if not more) results. Two conditions must converge to cause thermal runaway, or at least to make its possibility a design issue. The first condition is that the thermal system surrounding the device be unable to dissipate as much heat as the device produces. The second is that the device power versus temperature characteristic be significantly nonlinear.

The first condition is fairly obvious: If the device produces 10 W and the system can only dissipate 9 W, bad things should be expected. Clearly, your initial design goal is a thermal system capable of removing at least as much power as you expect the device to dissipate. If you can't match device and system characteristics at this point, you don't have a viable design.

So, given that nominal system cooling capacity equals nominal device power dissipation, a somewhat more accurate statement of the thermal runaway problem is that a small perturbation in the power output of the device is more than can be dissipated by the system. For example, maybe the device and system are balanced at 9 W, but suddenly the device surges to 10 W. This will cause the device to heat up, or in semiconductor jargon, the internal junction temperature (T_J) of the device increases. The question is, can the system handle it?

A Stable Operating Point

To figure out whether a stable operating point is possible, we must discuss the relative change in the device versus the system. In Fig. 1, the straight, green device line represents the power output of a hypothetical device as a function of its temperature. (We'll talk about real device lines later, but this simple example helps illustrate the basic concept.) Fig. 1 also illustrates two possible linear system lines that describe how much power the system might be able to dissipate as a function of junction temperature.

As depicted in Fig. 1, the reality is that most systems are able to dissipate more heat as the driving temperature increases (and a straight-line model is often quite reasonable), but the difference between the two systems shown is in their theta (θ) values. (Broadly speaking, theta is a measure of the cooling capacity of the system, usually expressed in degrees Celsius per watt. This means that, with the chosen axes, theta is actually the inverse of the slope.) The red system line is a high-theta system with a relatively large temperature rise per watt; the blue one is a low-theta system with a relatively small temperature rise per watt.

If we define the operating point of the system as the place where device power dissipation equals system cooling capacity, both system lines intersect the device line at the same point. The critical distinction is that, as you move away from the operating point, one system illustrates stable behavior and the other unstable behavior.

Consider the separation between the device line and a given system line: It is evident that when the system line is steeper than the device line (for example, the blue low-theta), a perturbation in device power (or temperature) to the right results in a temporary situation in which the system can successfully remove the excess power, cooling the device back down to the original operating point.

In contrast, moving to the right with the red high-theta system yields trouble. The system can't dissipate as much additional power as the junction produces, so the system will heat up some more, exacerbating the problem, and thermal runaway will be the result. Indeed, a simple-minded description of thermal runaway says that if the system slope is less than the slope of the device line, thermal runaway will occur.
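As a minimal sketch of this slope rule, the following Python compares the slope of a straight system line (the inverse of theta) with the slope of a straight device line. The function name and the numeric values are hypothetical, chosen only to mirror the low-theta and high-theta lines of Fig. 1.

```python
def is_stable_operating_point(device_slope_w_per_c: float,
                              theta_c_per_w: float) -> bool:
    """Return True when the system line is steeper than the device line.

    device_slope_w_per_c: extra device power per degree of junction rise (W/°C).
    theta_c_per_w: system theta (°C/W); the system-line slope is 1/theta.
    """
    if theta_c_per_w <= 0:
        raise ValueError("theta must be positive")
    system_slope_w_per_c = 1.0 / theta_c_per_w
    return system_slope_w_per_c > device_slope_w_per_c


# Hypothetical device that gains 0.05 W for every 1°C of junction rise.
print(is_stable_operating_point(0.05, theta_c_per_w=10.0))  # low theta, slope 0.1 W/°C
print(is_stable_operating_point(0.05, theta_c_per_w=40.0))  # high theta, slope 0.025 W/°C
```

With these numbers the low-theta system reports a stable operating point and the high-theta system does not.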

If managing thermal runaway were always this simple, life would be easy. What could complicate matters is a device whose power/temperature behavior isn't a straight line. As shown in Fig. 2, you can have a perfectly stable operating point (where you're going to try to operate), yet at a temperature somewhat higher than your intended operation, the curves cross again, and there the slopes will necessarily have the opposite, unfavorable relationship. That is, if a perturbation is large enough, the system can move from the lower, stable point up to the higher, unstable point; once there, it keeps moving to the right and the system experiences complete thermal runaway.

Note that in Fig. 2, the device line has an increasing slope as the temperature goes up. Curves in the mathematical class known as power-law functions have this property, which turns out to describe many semiconductor devices over a useful range of current (hence, power) and temperature.

In this more interesting situation, is thermal runaway as simple as comparing the slopes? Not always. If the nonlinearity is modest, such as for the power MOSFET shown in Fig. 3, individual devices' lines are effectively straight over any realistic operating temperature range. You certainly have to pick a combination of system slope and T-intercept to give you an acceptable operating point. But all such lines will have a slope steeper than the device lines, so thermal runaway isn't going to happen.

On the other hand, the power Schottky device in Fig. 4 has strong temperature nonlinearity, and it is quite possible to build a cooling system that crosses any particular device line in two places, thus opening the door for thermal runaway.

When thermal runaway is a possibility, what you want to quantify during the design phase is the margin by which runaway can be avoided. Fig. 2 illustrates this concept: If the system line shifts to the right, there comes a point at which the system and device lines are exactly tangent (where their slopes are equal, in fact). Note that if the system line stays properly located, even a local device temperature perturbation this far to the right will not actually experience runaway, because there is still considerable cooling margin at that point. But, if the system line actually shifts that far to the right, thermal runaway will indeed occur.
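To make the two-crossing picture of Fig. 2 concrete, here is a rough time-stepping sketch. It assumes a hypothetical exponential device curve and a straight system line, and it adds a single lumped heat capacity (not part of the discussion above) purely so the temperature has something to evolve against. All parameter values are invented for illustration; with them, the lines cross near 25.5°C (stable) and near 99°C (unstable).

```python
import math

# Hypothetical parameters, chosen only for illustration.
P0_W = 0.01                  # device power at 0°C
CURVATURE_C = 15.0           # degrees of rise that multiply device power by e
THETA_C_PER_W = 10.0         # system theta
T_INTERCEPT_C = 25.0         # zero-power junction temperature
HEAT_CAPACITY_J_PER_C = 1.0  # assumed lumped thermal mass (not from the article)


def device_power(t_j_c: float) -> float:
    """Power the device produces at junction temperature t_j_c."""
    return P0_W * math.exp(t_j_c / CURVATURE_C)


def system_power(t_j_c: float) -> float:
    """Power the cooling system removes at junction temperature t_j_c."""
    return (t_j_c - T_INTERCEPT_C) / THETA_C_PER_W


def simulate(t_start_c: float, duration_s: float = 200.0,
             dt_s: float = 0.01, t_limit_c: float = 200.0):
    """Forward-Euler integration; returns (final temperature, ran_away flag)."""
    t_j = t_start_c
    for _ in range(int(duration_s / dt_s)):
        net_power_w = device_power(t_j) - system_power(t_j)
        t_j += dt_s * net_power_w / HEAT_CAPACITY_J_PER_C
        if t_j > t_limit_c:
            return t_j, True   # temperature is climbing without bound
    return t_j, False


print(simulate(90.0))    # perturbation below the upper crossing: cools back down
print(simulate(105.0))   # perturbation above the upper crossing: runs away
```

A perturbation to 90°C settles back to the lower operating point, whereas one to 105°C lands beyond the upper crossing and keeps heating.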

The Power-Law Device Model

Before proceeding, we should head off some unfortunate possible confusion between the power a device dissipates (that is, the terminal voltage multiplied by current passing through the device), and the power in the mathematical term power law, which refers to a quantity being raised to a power. If the base quantity happens to be the base of the natural logarithms (e), the power law becomes more specifically the exponential law.

A classic example of a power-law device is a reverse-biased diode, for which a rule of thumb says the leakage current doubles for every 10°C increase in temperature. The following equation expresses this directly, with T in degrees Celsius:

$$I(T) = I_O \cdot 2^{T/10} \qquad \text{(Eq. 1)}$$

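With a little algebraic sleight of hand, any power law can be turned into an exponential law. Thus, the next equations say the same thing as Eq. 1:

$$I(T) = I_O \, e^{(\ln 2 / 10)\,T} = I_O \, e^{T/\lambda}, \qquad \lambda = \frac{10}{\ln 2} \approx 14.4\,^{\circ}\mathrm{C}$$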

where I_O is the device current at a temperature of 0°C and λ is the power-law strength, which can be derived from the leakage current at any two temperatures at which it happens to be known. (Although you probably won't find I_O on a data sheet, you can deduce it from any other current and temperature, once λ is known.)
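As a small sketch of how λ and I_O might be pulled from two data-sheet points, assume the exponential form I = I_O·e^(T/λ), so that λ = (T2 − T1)/ln(I2/I1). The function names and the leakage values below are hypothetical; the two sample points simply follow the doubling-per-10°C rule of thumb.

```python
import math


def power_law_strength(t1_c: float, i1_a: float,
                       t2_c: float, i2_a: float) -> float:
    """Lambda (°C) from two (temperature, current) points on I = I_O*exp(T/lambda)."""
    if t1_c == t2_c:
        raise ValueError("the two temperatures must differ")
    if i1_a <= 0 or i2_a <= 0 or i1_a == i2_a:
        raise ValueError("currents must be positive and distinct")
    return (t2_c - t1_c) / math.log(i2_a / i1_a)


def current_at_zero_c(t_c: float, i_a: float, lam_c: float) -> float:
    """Back out I_O (the current at 0°C) from one known point and lambda."""
    return i_a * math.exp(-t_c / lam_c)


def leakage_current(t_c: float, i0_a: float, lam_c: float) -> float:
    """Predicted current at temperature t_c."""
    return i0_a * math.exp(t_c / lam_c)


# Hypothetical data: 1 µA at 25°C and 2 µA at 35°C (doubling per 10°C).
lam = power_law_strength(25.0, 1e-6, 35.0, 2e-6)
i0 = current_at_zero_c(25.0, 1e-6, lam)
print(f"lambda = {lam:.2f} °C")                                   # about 14.43
print(f"I_O = {i0:.3e} A")
print(f"I(45°C) = {leakage_current(45.0, i0, lam):.3e} A")        # about 4 µA
```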

Then, as if the terms weren't confusing enough, note that if the reverse voltage (V_R) on the device is constant with temperature, the device power as a function of temperature (being the product of the power-law current and a constant voltage) also follows a power law, with the same power-law strength as the current.

Finally, even when the device terminal voltage is not constant with temperature (for instance, in power MOSFETs where on-resistance is a function of temperature — so at constant current, the voltage will change significantly with temperature), it is fairly likely that, at least over some reasonable range of temperatures and operating conditions, device power in a real application could be approximated by some sort of power law.

Therefore, from here on we'll refer only to device power in the power law, and λ will be the power-law strength of the device power, as in:

$$P(T) = P_O \, e^{T/\lambda}$$

where P_O, like I_O, is the value at 0°C.

Indeed, to obtain Fig. 3 and Fig. 4, the various power-law strengths and functions as described here were computed from data obtained on the device data sheets. (λ is the temperature rise needed to multiply the power by e, so a larger value means a flatter curve.) In the case of Fig. 3, a MOSFET, the power-law strength came out to about 200°C, which is why the device lines were effectively straight over the plotted 125°C range. By contrast, for the power rectifier of Fig. 4, the power-law strength came out to only about 15°C, explaining the rapid curvature of the device lines over the same temperature range.

If it turns out there is strong nonlinearity (say, if λ is 30°C or lower), then you may find the following additional relationships useful in quantifying your runaway margin. Given that θ_JX is the theta of your cooling system as experienced by the device of interest, the runaway junction temperature T_R (the point of tangency shown in Fig. 2, where the device slope equals 1/θ_JX) is given by:

$$T_R = \lambda \ln\!\left(\frac{\lambda}{\theta_{JX} \, P_O}\right)$$

where P_O is the device power at 0°C.

And thanks to the mathematical properties of the exponential function, the T-intercept that goes with it (T_Y) is a simple λ offset from the runaway temperature:

$$T_Y = T_R - \lambda \qquad \text{(Eq. 9)}$$
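As a rough numerical sketch of these two relationships, assume device power follows P = P_O·e^(T/λ) and the system line has slope 1/θ_JX. Equating the slopes gives a runaway temperature of λ·ln(λ/(θ_JX·P_O)), and the tangent line's T-intercept sits λ below it. The function names and the sample values (P_O = 0.01 W, λ = 15°C, θ_JX = 10°C/W, design intercept 25°C) are hypothetical.

```python
import math


def runaway_temperature(p0_w: float, lam_c: float, theta_c_per_w: float) -> float:
    """Junction temperature at which the device slope equals the system slope."""
    if p0_w <= 0 or lam_c <= 0 or theta_c_per_w <= 0:
        raise ValueError("p0, lambda and theta must all be positive")
    return lam_c * math.log(lam_c / (theta_c_per_w * p0_w))


def runaway_intercept(p0_w: float, lam_c: float, theta_c_per_w: float) -> float:
    """T-intercept of the system line that is exactly tangent to the device curve."""
    return runaway_temperature(p0_w, lam_c, theta_c_per_w) - lam_c


def runaway_margin(t_intercept_design_c: float, p0_w: float,
                   lam_c: float, theta_c_per_w: float) -> float:
    """How far the design T-intercept can shift right before tangency."""
    return runaway_intercept(p0_w, lam_c, theta_c_per_w) - t_intercept_design_c


# Hypothetical rectifier-like device and cooling system.
t_r = runaway_temperature(0.01, 15.0, 10.0)
t_y = runaway_intercept(0.01, 15.0, 10.0)
margin = runaway_margin(25.0, 0.01, 15.0, 10.0)
print(f"runaway temperature  = {t_r:.1f} °C")    # approximately 75.2
print(f"tangent T-intercept  = {t_y:.1f} °C")    # approximately 60.2
print(f"margin on intercept  = {margin:.1f} °C") # approximately 35.2
```

Note that the runaway temperature in this made-up case is only about 75°C, well below a typical maximum junction rating.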

Obviously, the runaway temperature margin is the difference between your designed T-intercept (T_X) and the intercept given by Eq. 9 (T_Y). (One of your jobs as the designer is to decide how small a margin you're comfortable with.) In addition, the nondimensional quantity,

$$k = \frac{\lambda}{\theta_{JX} \, P_O \, e^{T_X/\lambda}}$$

and the nondimensional temperature,

$$z = \frac{T_J - T_X}{\lambda}$$

are useful for computing the two intersections between the device line and the nominal system line (with P_O the 0°C device power), which satisfy the nondimensionalized equation: kz = e^z.

If k > e, you'll have both the stable design point of the original system as well as the theoretical (but unstable) upper intersection. If k = e, you're already at perfect runaway. If k < e, there are no solutions, meaning your system design was bad at the outset and so you need a lower theta or a lower ambient just to get started.
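The three cases can be explored numerically. The sketch below assumes k = λ/(θ_JX·P_O·e^(T_X/λ)) and z = (T_J − T_X)/λ, which is what setting the system line (T_J − T_X)/θ_JX equal to the device power P_O·e^(T_J/λ) yields, and it finds both roots of kz = e^z by plain bisection. The helper names and parameter values are hypothetical.

```python
import math


def nondimensional_k(p0_w: float, lam_c: float,
                     theta_c_per_w: float, t_x_c: float) -> float:
    """k for a device P = p0*exp(T/lam) and a system line (T - t_x)/theta."""
    return lam_c / (theta_c_per_w * p0_w * math.exp(t_x_c / lam_c))


def _bisect(f, lo: float, hi: float, tol: float = 1e-12) -> float:
    """Root of f in [lo, hi], assuming f changes sign across the bracket."""
    f_lo = f(lo)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def intersections(k: float):
    """Return (z_stable, z_unstable) solving k*z = e**z, or None when k <= e."""
    if k <= math.e:
        return None  # tangent (k == e) or no crossing at all (k < e)

    def f(z: float) -> float:
        return math.exp(z) - k * z

    z_stable = _bisect(f, 0.0, 1.0)   # f(0) > 0 and f(1) = e - k < 0
    hi = 2.0
    while f(hi) <= 0:                 # widen until the exponential wins
        hi *= 2.0
    z_unstable = _bisect(f, 1.0, hi)
    return z_stable, z_unstable


# Hypothetical system: P_O = 0.01 W, lambda = 15°C, theta = 10°C/W, T_X = 25°C.
lam, t_x = 15.0, 25.0
k = nondimensional_k(0.01, lam, 10.0, t_x)
z_lo, z_hi = intersections(k)
print(f"k = {k:.2f}")
print(f"stable point   T_J = {t_x + lam * z_lo:.2f} °C")
print(f"unstable point T_J = {t_x + lam * z_hi:.2f} °C")
```

Bisection was chosen here only because it needs nothing beyond the standard library; a Lambert W routine would give the same two roots.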

The True Meaning of Theta and Ambient

By definition, the T-intercept of our system line is the zero-power device junction temperature. If our device of interest were the only heat source in the system, then and only then would this zero-power temperature be ambient. However, in most systems of interest, there are many interacting heat sources, each contributing to each other's background temperature. In other words, the T-intercept of our device of interest is neither more nor less than the temperature it reaches when it is turned off and the rest of the system is otherwise normally powered.

Similarly, as shown in Fig. 5, for the system line to mean anything, its slope must correspond to incremental changes in junction temperature for incremental changes in power dissipation. If the system is thermally linear (hence, the principle of linear superposition is applicable), then the slope of a device's system line will not change just because its background temperature (T-intercept) shifts right or left. What one must not do is compute θ_JA as the difference between actual operating temperature and the ambient temperature, divided by device power, while additional heat sources are active.
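A short sketch shows how far off the naive calculation can be. The numbers are hypothetical: a device whose true incremental theta is 10°C/W sits in a 25°C ambient, and neighboring heat sources raise its zero-power temperature to 42°C.

```python
def incremental_theta(t1_c: float, p1_w: float,
                      t2_c: float, p2_w: float) -> float:
    """Slope-based theta: change in junction temperature per change in power."""
    if p1_w == p2_w:
        raise ValueError("the two power levels must differ")
    return (t2_c - t1_c) / (p2_w - p1_w)


def naive_theta_ja(t_operating_c: float, t_ambient_c: float, p_w: float) -> float:
    """Theta computed from ambient, ignoring heating by other sources."""
    if p_w <= 0:
        raise ValueError("power must be positive")
    return (t_operating_c - t_ambient_c) / p_w


# Hypothetical readings with the other heat sources running:
# 62°C at 2 W and 72°C at 3 W (zero-power temperature 42°C, ambient 25°C).
print(incremental_theta(62.0, 2.0, 72.0, 3.0))   # 10.0 °C/W, the true slope
print(naive_theta_ja(62.0, 25.0, 2.0))           # 18.5 °C/W, badly overstated
```

The naive figure lumps the 17°C of background heating into the device's own theta, so it describes neither the slope nor the intercept of the system line.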

As other heat sources are turned on, the background temperature of each device rises. Each increase in this background temperature effectively moves the T-intercept of that device's system line to the right. If you have a device subject to thermal runaway and have computed the temperature margin relative to the ambient temperature, then each background temperature increase from every other heat source eats into this margin. In other words, a proper thermal runaway analysis also must comprehend all the thermal interactions between all your heat sources. Then your runaway margin is a real margin.
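Under the linear-superposition assumption, this bookkeeping can be sketched in a few lines. The coupling coefficients below (degrees of rise at the device of interest per watt dissipated in each neighbor) and the tangent intercept of 60°C are hypothetical values chosen for illustration.

```python
def background_t_intercept(t_ambient_c: float,
                           coupling_c_per_w: list,
                           other_powers_w: list) -> float:
    """Zero-power junction temperature with the other heat sources running."""
    if len(coupling_c_per_w) != len(other_powers_w):
        raise ValueError("need one coupling coefficient per heat source")
    rise_c = sum(psi * p for psi, p in zip(coupling_c_per_w, other_powers_w))
    return t_ambient_c + rise_c


def real_runaway_margin(t_tangent_intercept_c: float,
                        t_intercept_c: float) -> float:
    """Margin between the tangent (runaway) intercept and the actual intercept."""
    return t_tangent_intercept_c - t_intercept_c


# Hypothetical system: 25°C ambient, two neighbors dissipating 3 W and 2 W,
# coupled to the device of interest at 4.0 and 2.5 °C/W respectively.
t_x = background_t_intercept(25.0, [4.0, 2.5], [3.0, 2.0])
print(t_x)                                   # 42.0
print(real_runaway_margin(60.0, 25.0))       # 35.0, margin measured from ambient
print(real_runaway_margin(60.0, t_x))        # 18.0, margin actually available
```

In this example nearly half of the margin computed against ambient is consumed by the neighboring heat sources.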