There have been numerous incidents of explosions in mobile systems recently. Batteries seem to be the focus whenever one of these unfortunate events happens. The world seems happy and satisfied that the offending battery has been recalled. That may work for CNN, but it's not good enough for the engineering community.
Mobile systems have complex power-management schemes that must satisfy two conflicting requirements. One is saving battery life by quickly switching into "standby" states in which most of the system is inoperative; the other is applying escalating degrees of heat control as temperature rises, ranging from clock gating to fan activation to full shutdown. Herein lies the trouble: the hardware (and software) must remain functional enough to activate these heat-control measures, irrespective of what state the system is in. Unfortunately, the system is often dysfunctional enough in these states for deadlock or runaway conditions to occur, a hazard the sketch below makes concrete.
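Here is a minimal sketch, in C, of the invariant a sane thermal-protection scheme must preserve: the sensing-and-response path stays alive in every power state, standby included. The register-level details are stubbed out, and the helper names are hypothetical rather than any vendor's actual API.

```c
/* Minimal sketch: thermal protection must survive every power state.
 * All hardware access is stubbed; names are hypothetical.            */
#include <stdbool.h>
#include <stdio.h>

enum power_state    { STATE_RUN, STATE_IDLE, STATE_STANDBY };
enum thermal_action { ACT_NONE, ACT_CLOCK_GATE, ACT_FAN_ON, ACT_SHUTDOWN };

/* Stand-ins for PMU/sensor registers. */
static bool sensor_alive = false;
static void tsens_keep_alive(bool on)         { sensor_alive = on; }
static void pmu_set_state(enum power_state s) { (void)s; }

/* Escalating responses as the die heats up (temps in deci-degrees C). */
static enum thermal_action thermal_policy(int temp_dC)
{
    if (temp_dC >= 1100) return ACT_SHUTDOWN;   /* 110 C: last resort    */
    if (temp_dC >=  950) return ACT_FAN_ON;     /*  95 C: active cooling */
    if (temp_dC >=  850) return ACT_CLOCK_GATE; /*  85 C: throttle       */
    return ACT_NONE;
}

/* The invariant: every state transition keeps the thermal path alive.
 * If standby entry also killed the sensor, nothing would be left
 * running to react as leakage heats the die -- the runaway case.      */
static void enter_state(enum power_state s)
{
    tsens_keep_alive(true); /* always-on domain, in every state */
    pmu_set_state(s);
}

int main(void)
{
    enter_state(STATE_STANDBY);
    printf("sensor alive in standby: %d, action at 96.0 C: %d\n",
           sensor_alive, thermal_policy(960));
    return 0;
}
```

The failure mode described above is precisely the one this invariant forbids: a standby entry that also powers down the sensor leaves nothing running to react when leakage heats the die.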
The truth is that mobile systems have always been volatile: a mass of combustible material (the battery) sits quite close to significant heat sources (the very ICs the battery powers). Consider this: A typical 1-W chip on a 1-cm x 1-cm die represents roughly 25 GW per square mile. That is a lot of nuclear reactors to carry around with you!
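The arithmetic behind that figure checks out. A square mile is about $2.59 \times 10^{10}$ cm$^2$, so a power density of 1 W/cm$^2$ scales to

$$\frac{1\ \mathrm{W}}{\mathrm{cm}^2} \times 2.59 \times 10^{10}\ \frac{\mathrm{cm}^2}{\mathrm{mi}^2} \approx 26\ \mathrm{GW/mi}^2,$$

on the order of 25 large (roughly 1-GW) nuclear plants' worth of power per square mile.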
Coupled with such high power density is a tremendous increase in leakage currents as we move from 90-nm to 65-nm process geometries. In many cases, even if the chip is idle and clock-gated, the leakage current is sufficient to cause a steady rise in temperature. Rising temperature, in turn, increases leakage. Thus, the two phenomena feed each other's frenzy. In fact, leakage has forced many designers to re-architect their thermal-protection schemes. In days past, it was sufficient to gate the clock for some period if the chip was overheating (thermal throttling). That technique is no longer effective, because clock gating does nothing about leakage. The only way to control leakage is to exercise some kind of voltage control on the chip, especially when the chip is idle.
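A toy numerical model makes the feedback loop, and the futility of clock gating alone, easy to see. The constants below are illustrative only: leakage is assumed to double every 10 degrees C (a common rule of thumb), the package is a single thermal RC, and the clock is fully gated so dynamic power is zero.

```c
/* Toy model of the leakage/temperature feedback loop on a clock-gated
 * (zero-dynamic-power) chip. All constants are illustrative only.    */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double t_amb   = 45.0; /* ambient inside the case, C       */
    const double r_th    = 8.0;  /* junction-to-ambient, C per watt  */
    const double c_th    = 2.0;  /* thermal mass, joules per C       */
    const double p_leak0 = 0.8;  /* leakage at 45 C, watts           */
    const double dt      = 0.1;  /* timestep, seconds                */

    double temp = t_amb;
    for (int i = 0; i <= 3000; i++) {
        /* Rule of thumb: leakage doubles every 10 C of junction temp. */
        double p_leak = p_leak0 * pow(2.0, (temp - t_amb) / 10.0);
        double p_out  = (temp - t_amb) / r_th; /* heat the package sheds */
        temp += dt * (p_leak - p_out) / c_th;  /* clock gated: no P_dyn  */

        if (i % 600 == 0)
            printf("t=%5.1f s  T=%6.1f C  leakage=%5.2f W\n",
                   i * dt, temp, p_leak);
        if (temp > 150.0) {
            printf("t=%5.1f s  T > 150 C: runaway despite a gated clock\n",
                   i * dt);
            break;
        }
    }
    return 0;
}
```

With these numbers there is no equilibrium: the exponential leakage term always outruns the linear cooling term, and the gated, "idle" chip still cooks itself. The only knob left is voltage.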
With CMOS technology being, in essence, a voltage-controlled current source, the most effective handle we have on current consumption is in the voltage domain. There are various voltage-control techniques for CMOS, including multi-VDD, power gating, and back biasing, among others. In general, voltage control can be used to make sure the chip operates at the most power-efficient point possible, in the operational state as well as in standby/sleep states.
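Power gating, to take one of those techniques, illustrates why the control problem is delicate: the rail can be cut only after the island's outputs are clamped and its state parked, and wake-up must unwind those steps in exactly the reverse order. Below is a sketch of the canonical sequence as firmware might drive it; the pmu_* helpers are hypothetical stand-ins for memory-mapped PMU registers, not a real API.

```c
/* Canonical power-gating sequence as firmware might drive it. The
 * pmu_* helpers are hypothetical stand-ins for PMU registers.       */
#include <stdbool.h>

struct power_domain { int id; bool powered; };

static void pmu_isolate(struct power_domain *d, bool on) { (void)d; (void)on; }
static void pmu_retain(struct power_domain *d, bool on)  { (void)d; (void)on; }
static void pmu_rail(struct power_domain *d, bool on)    { d->powered = on; }

static void domain_sleep(struct power_domain *d)
{
    pmu_isolate(d, true);  /* clamp outputs so floating nodes cannot
                              corrupt neighboring, still-on logic    */
    pmu_retain(d, true);   /* park state in retention registers     */
    pmu_rail(d, false);    /* only now cut VDD; the island's leakage
                              drops to nearly zero                   */
}

static void domain_wake(struct power_domain *d)
{
    pmu_rail(d, true);     /* restore power, let the rail settle */
    pmu_retain(d, false);  /* recover state                      */
    pmu_isolate(d, false); /* finally reconnect outputs          */
}

int main(void)
{
    struct power_domain dsp = { 1, true };
    domain_sleep(&dsp);
    domain_wake(&dsp);
    return dsp.powered ? 0 : 1;
}
```

Skip or reorder a step, say, dropping the rail before isolation, and the powered-down island's floating outputs can corrupt its still-powered neighbors. This is exactly the class of bug the next paragraph worries about.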
In many cases, it is a functional failure of the power-management system that leads to runaway situations. This is a complex hardware and software problem that needs to be well thought through and debugged before systems are shipped. Voltage-control techniques are very difficult to specify and verify. Often, the biggest battle is simply getting them right! Nevertheless, to keep their chips viable and competitive in the market, designers are incorporating multiple voltage islands into their SoCs, involving both hardware- and software-based power-management techniques.
Why is it so difficult to verify voltage control? For some 20 years, the EDA industry has been missing the fundamental components of power management: a system-level perspective, modeling of the system electronics, and multi-voltage power management of the ICs. The RTL/ESL methodologies used to describe the functionality of logic live in a primitive world where voltages neither differ between blocks nor change over time. This has been a tremendous obstacle to any system-verification effort; hence, there have been no verification systems capable of flagging dangerous situations like the ones we are facing now. As we march toward ESL adoption, we, as an engineering community, must think hard. Do the emerging languages describe the system hardware in its entirety and still have the ability to work with software? Can they capture the complex, yet fundamental, link between the functional, electrical, thermal, and mechanical aspects of power management? Perhaps we need a real hardware description language! In fact, it may be more of a "hardware description system."
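For a feel of what such a "hardware description system" would add, consider a sketch of a voltage-aware functional model: each block carries its supply state, and inter-block traffic is checked against it. This is, in miniature, the discipline that power-intent formats such as UPF and CPF address. Every name below is invented for illustration.

```c
/* Sketch of a voltage-aware functional model: every block carries a
 * supply state, and inter-block traffic is checked against it. All
 * names are invented for illustration.                              */
#include <assert.h>

struct block {
    const char *name;
    double      vdd;     /* current supply voltage; 0.0 = gated off */
    double      vdd_min; /* minimum voltage for correct operation   */
};

/* Plain RTL would perform this read unconditionally; here it is
 * legal only if both ends are adequately powered.                  */
static int bus_read(const struct block *src, const struct block *dst)
{
    assert(src->vdd >= src->vdd_min && "read from an unpowered block");
    assert(dst->vdd >= dst->vdd_min && "read into an unpowered block");
    return 0; /* payload elided */
}

int main(void)
{
    struct block cpu = { "cpu", 1.0, 0.9 };
    struct block dsp = { "dsp", 0.0, 0.9 }; /* power-gated island   */
    bus_read(&dsp, &cpu); /* assertion fires: RTL would miss this   */
    return 0;
}
```

Run it and the assertion fires: the CPU is reading from a power-gated DSP, a bug that a voltage-blind RTL simulation would happily wave through.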
Design managers also must factor in software before they tape out. More than ever, power management is being implemented in software, with hooks provided to the hardware. This makes software development an inherent part of the tape-out process, at least for power management. The issue is even more compelling when a chip's main competitive advantage lies in its power-management scheme, which is increasingly the case for system designers in the mobile, consumer, and enterprise segments. Power is the differentiator, and software plays a key part in that differentiation.
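Those hooks usually amount to a thin table at the hardware/software boundary. The sketch below invents a minimal version of one; the interface names are hypothetical, but the shape is the point: the policy code written against this table is part of what must be verified before tape-out.

```c
#include <stdio.h>

/* The hook table: what the silicon exposes to power-management
 * software. The interface is invented here for illustration.      */
struct pm_hw_ops {
    int (*set_voltage_mv)(int domain, int mv);  /* multi-VDD / DVFS */
    int (*gate_power)(int domain, int on);      /* power gating     */
    int (*read_temp_dC)(int sensor);            /* deci-degrees C   */
};

/* Policy lives in software: back off the core voltage at 95 C.    */
static int pm_policy_tick(const struct pm_hw_ops *hw)
{
    if (hw->read_temp_dC(0) >= 950)
        return hw->set_voltage_mv(0, 900);
    return 0;
}

/* Stubs standing in for the real silicon. */
static int stub_set_v(int d, int mv) { printf("VDD[%d] -> %d mV\n", d, mv); return 0; }
static int stub_gate(int d, int on)  { (void)d; (void)on; return 0; }
static int stub_temp(int s)          { (void)s; return 973; /* 97.3 C */ }

int main(void)
{
    struct pm_hw_ops hw = { stub_set_v, stub_gate, stub_temp };
    return pm_policy_tick(&hw);
}
```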
To make matters worse, verifying power-management software effectively often requires a model of the entire system. SoC design teams are constrained in their resources as it is, but this could be one costly corner to cut. Unfortunately, this is one area in which emulation with FPGAs or other emulators does not help: they simply cannot model thermal or electrical effects realistically.
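What does help is closing the loop in pure simulation: running the power-management policy against an electrothermal plant model, precisely the physics an FPGA prototype lacks. A minimal co-simulation sketch, with illustrative constants throughout:

```c
/* Minimal co-simulation: power-management policy in lockstep with an
 * electrothermal plant model. All constants are illustrative only.  */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double t_amb = 45.0, r_th = 8.0, c_th = 2.0, dt = 0.1;
    double temp = t_amb, vdd = 1.0;

    for (int i = 0; i <= 6000; i++) {
        /* Software side: the policy under test. Simple bang-bang
         * voltage control on die temperature.                       */
        if (temp > 95.0)      vdd = 0.7;
        else if (temp < 70.0) vdd = 1.0;

        /* Plant side: the electrothermal behavior an FPGA prototype
         * cannot reproduce. Coefficients chosen for shape only.     */
        double p_dyn  = 3.0 * vdd * vdd;            /* ~C*V^2*f      */
        double p_leak = vdd * vdd * vdd
                        * pow(2.0, (temp - t_amb) / 20.0);
        temp += dt * (p_dyn + p_leak - (temp - t_amb) / r_th) / c_th;

        if (i % 1000 == 0)
            printf("t=%5.0f s  T=%5.1f C  VDD=%.2f V\n", i * dt, temp, vdd);
    }
    return 0;
}
```

Here the bang-bang policy holds the die between roughly 70 and 95 C. Pin VDD at 1.0 V in the same loop and the temperature runs away; catching that second behavior before silicon ships is the whole point.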
Engineers will have to do a lot more system validation before they ship parts, especially if they are designing complex SoCs. This validation needs to be as electrically savvy and thermally aware as it is functionally accurate; that is an enormous challenge indeed. But for starters, we must stop blaming bad batteries for every overheating problem in handheld devices.