The Power of Memory

Power consumption in a typical server system. Source: Intel

Server energy efficiency vs. utilization. Source: IEEE Computer

With the global trend to go green and the relentless quest to reduce total cost of ownership, it’s not just mobile platforms that are focused on reducing power. Often consuming as much electricity as a not-so-small country, data centers are prime candidates for reducing energy consumption. Where that power can be saved may come as a surprise. The biggest power hog in a typical computing system is the microprocessor. Fueled by Moore’s Law, microprocessor transistor counts have risen exponentially over the past two decades. Now numbering in the billions, high-performance transistors have increased processor computational performance by nearly four orders of magnitude since the late ’80s (Fig. 1).

But that dramatic rise in performance hasn’t come free. Today’s microprocessors are the biggest consumers of power in a typical server, but what’s No. 2? As processor performance increases, so too do the demands on memory. As a result, the memory subsystem finds itself a close second in the server power struggle. As the chart above illustrates, DRAM accounts for about 20 percent of a server’s power budget. In itself, this isn’t that shocking, since fast processors need to access large amounts of data from memory. What’s surprising is how much power memory uses when it isn’t even being accessed.

In general terms, DRAMs have two principal power modes: active read/write and standby idle. As the name implies, in active mode, read and write operations are transacted between the processor and the DRAM, and power consumption is highest. In standby mode, when a DRAM is sitting idle, power consumption decreases from its peak but does not go to zero. Why is this important? In today’s multi-rank, high-capacity servers and workstations, standby power really adds up. In fact, even under peak workloads, the majority of memory resources are in standby, burning up to half of the total memory power.

Please Stand By

In a study published in IEEE Computer, researchers Luiz Barroso and Urs Hölzle found that servers typically are utilized between 15 percent and 55 percent of the time. Memory accesses are generally proportional to server utilization. The rest of the time (i.e., the other 45 percent to 85 percent) the DRAM is in standby waiting for the next request. Figure 2 illustrates the typical operating region as the shaded area on the left.

Despite a server’s typical utilization residing predominantly in the sub-50-percent range, computing systems are normally optimized for high performance under peak workloads. Great emphasis is placed on the active power profile and increased server energy efficiency as utilization increases. For systems in the typical operating region, however, there is a substantial fall-off in energy efficiency. Lower utilization decreases the percentage that active power contributes to total memory power, leaving standby power as the dominant factor. The problem is magnified in high-capacity systems. For example, every memory channel on a typical server can connect to two DRAM modules, each with four ranks for a total of eight ranks of memory. During a memory access, only one rank within a single module is active at any given time. This leaves the remaining modules and ranks in standby. In active mode, a single rank of memory dissipates 7.2W of power, while the cumulative power dissipated by the remaining ranks in standby is 13.9W. Timing registers and DRAM core refresh add another 1.4-2.3W per module. Consequently, standby power is 58 percent of the total memory power and a significant contributor to overall power consumption at all levels of utilization.
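The 58 percent figure can be checked with a few lines of arithmetic. The sketch below uses only the wattages quoted above; the assumption (ours, not stated in the text) is that the register and refresh overhead is counted in the total but not in the standby figure, and that the low end of the 1.4-2.3W range applies to both modules on the channel.

```python
# Figures quoted in the text, in watts, for one memory channel
# (two modules, four ranks each, one rank active at a time).
ACTIVE_RANK_W = 7.2      # the single rank being accessed
STANDBY_RANKS_W = 13.9   # cumulative draw of the remaining ranks in standby
MODULES_PER_CHANNEL = 2


def standby_share(overhead_per_module_w: float) -> float:
    """Fraction of channel memory power drawn by ranks in standby.

    Assumption: register and refresh overhead is added to the total
    but is not counted as part of the standby figure.
    """
    overhead_w = MODULES_PER_CHANNEL * overhead_per_module_w
    total_w = ACTIVE_RANK_W + STANDBY_RANKS_W + overhead_w
    return STANDBY_RANKS_W / total_w


low_overhead_share = standby_share(1.4)
high_overhead_share = standby_share(2.3)
print(f"1.4 W/module overhead: standby share = {low_overhead_share:.0%}")
print(f"2.3 W/module overhead: standby share = {high_overhead_share:.0%}")
```

Under these assumptions the low end of the overhead range reproduces the 58 percent figure, and the high end still leaves standby at more than half of the total.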

It’s Not Easy Being Green

Looking again at the server energy efficiency vs. utilization chart (Fig. 2), we see that while power consumption scales linearly with usage, power efficiency drops off exponentially as utilization decreases. For the memory subsystem, the more time the DRAM spends in standby, the less efficient it is. Combine this with typical utilization results from Barroso and Hölzle, and the resulting outlook for future data centers appears to be anything but green. There are several factors that contribute to standby power consumption and the resulting inefficiency. The fundamental design of DRAM requires that the core be periodically refreshed in order to maintain data. A DRAM memory cell stores data as a charge on a capacitor, and unless it is refreshed periodically, that charge and the associated data are lost. DRAM core refresh cycles consume power, but it is small relative to total power consumption and is consumed only a fraction of the time a DRAM is in standby.
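The relationship between utilization and efficiency described at the start of this section can be sketched with a simple model. The idle-power fraction below is an illustrative assumption, not a figure from the chart, and the linear power curve is a simplification; the point is only that a fixed power floor drags efficiency down as utilization falls.

```python
IDLE_FRACTION = 0.5  # assumed: idle power as a fraction of peak (illustrative only)


def relative_power(utilization: float) -> float:
    """Power as a fraction of peak, rising linearly from the idle floor."""
    return IDLE_FRACTION + (1.0 - IDLE_FRACTION) * utilization


def relative_efficiency(utilization: float) -> float:
    """Work delivered per unit of power, normalized to 1.0 at full load."""
    return utilization / relative_power(utilization)


# Sample the typical operating region (15 to 55 percent) and full load.
for u in (0.15, 0.35, 0.55, 1.0):
    print(f"utilization {u:4.0%}: power {relative_power(u):.2f}, "
          f"efficiency {relative_efficiency(u):.2f}")
```

With any nonzero idle floor, efficiency at the low end of the typical operating region sits well below the full-load value, which is why standby power matters so much.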

Another component of standby power is that used for the on-chip delay-locked loop (DLL) and clock buffers. The principal purpose of the DLL is to maintain precise timing between the input clock and output signals at higher data rates. This is achieved by optimizing the clock access and output hold times to compensate for the timing shifts from process, voltage and temperature variations. In addition, the delay caused by components such as on-chip clock buffers, input receivers and output drivers is monitored through a feedback mechanism and periodically adjusted to ensure stable memory signaling.

DLLs and clock buffers on both sides of the interface must remain powered on between transactions in order to keep the DRAM and the processor bus synchronized. Though core refresh power often gets a bad rap, it is the DLL clocking and clock drivers that make up the majority of memory system standby power. This is because they are always on, even in standby, drawing power 100 percent of the time. In contrast, depending on device density, the DRAM core spends less than 10 percent of the time in refresh.

For DDR3-based systems, the lowest power alternative is to put the entire DRAM in PowerDown mode, or IDD2PD, shutting off the DLL and drawing only 4 percent of the power consumed in active mode. This is the most power-efficient mode, but it requires 512 clock cycles to achieve DLL lock and exit PowerDown, a completely unacceptable latency between server transactions.

Active standby mode, or IDD3N, reduces the latency between back-to-back accesses to different ranks. In this mode, the DRAM is not being accessed, but the DLL and clock buffers remain on. While power is reduced, the memory still draws 22 percent of the power consumed in active mode. Clock enable can also be toggled in active standby to marginally reduce power, but only at the cost of increased latency, which solves neither the power nor the latency problem. An alternative method is needed in order to significantly reduce memory standby power while still achieving the performance and latency requirements of next-generation servers and workstations.
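To put the two DDR3 modes side by side, the sketch below compares their power fractions and converts the 512-cycle PowerDown exit into time. The percentages and cycle count come from the text; the 800MHz clock is an assumption chosen for illustration, and the real penalty depends on the memory clock actually in use.

```python
# Fractions of active-mode power, as quoted in the text.
POWER_DOWN_FRACTION = 0.04
ACTIVE_STANDBY_FRACTION = 0.22

# Clock cycles needed to regain DLL lock and exit PowerDown, from the text.
POWER_DOWN_EXIT_CYCLES = 512

# Assumption: an 800 MHz memory clock, used only to illustrate the conversion.
ASSUMED_CLOCK_MHZ = 800.0


def exit_latency_ns(cycles: int, clock_mhz: float) -> float:
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1000.0 / clock_mhz


power_ratio = ACTIVE_STANDBY_FRACTION / POWER_DOWN_FRACTION
wake_ns = exit_latency_ns(POWER_DOWN_EXIT_CYCLES, ASSUMED_CLOCK_MHZ)

print(f"Active standby draws {power_ratio:.1f}x the power of PowerDown")
print(f"PowerDown exit at {ASSUMED_CLOCK_MHZ:.0f} MHz: {wake_ns:.0f} ns")
```

At the assumed clock, leaving PowerDown costs hundreds of nanoseconds, while staying in active standby costs several times the power. Neither end of that tradeoff suits a server.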

The Challenge to Lower Clocking Power

As described above, the DLL and clock buffers account for a large portion of the standby power and are prime candidates for power reduction. There are two potential strategies for reducing the power: redesign the DLL, or eliminate it. However, since the role of the DLL is critical to the operation of the memory system, power reduction through the redesign or elimination of these circuits must not come at the cost of their functionality.

A variety of DLL or phase-locked loop (PLL) architectures can be implemented in memory subsystems to manage standby and active power, each with its own tradeoffs. The most straightforward method for implementing a clocking architecture is with a digitally controlled DLL. This method is commonly used in industry memory systems today but is not optimized for power consumption and is difficult to scale with increasing data rates.

A digital DLL clocking system is implemented using a control register to select different taps, or predetermined phase offsets, of a delay line to compensate for the circuit delay variation from a clock input buffer. The timing through such a delay line typically spans one to two clock cycles and is prone to drift from temperature and voltage variation. This tendency to drift requires frequent updates and timing adjustments to maintain lock. This makes it challenging to reduce standby power by pausing the clock. Additionally, the large granularity between taps limits a digital DLL’s ability to operate at higher frequencies.
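The tap-selection idea can be illustrated with a toy model. This is a simplified sketch, not a description of any real DLL: the function name, the 40ps tap spacing, the 32-tap line, the 1250ps clock period and the buffer delays are all hypothetical values picked for the example.

```python
def select_tap(buffer_delay_ps: int, clock_period_ps: int,
               tap_step_ps: int, num_taps: int) -> tuple[int, int]:
    """Pick the delay-line tap that best cancels the clock buffer delay.

    Returns (tap_index, residual_error_ps), where the residual is how far
    buffer delay plus tap delay lands from a whole number of clock periods.
    """
    best_tap, best_err = 0, clock_period_ps
    for tap in range(num_taps):
        total_ps = buffer_delay_ps + tap * tap_step_ps
        phase_ps = total_ps % clock_period_ps
        # Distance to the nearest clock edge, early or late.
        err_ps = min(phase_ps, clock_period_ps - phase_ps)
        if err_ps < best_err:
            best_tap, best_err = tap, err_ps
    return best_tap, best_err


# Hypothetical numbers: 32 taps of 40 ps span roughly one 1250 ps clock cycle.
locked = select_tap(buffer_delay_ps=345, clock_period_ps=1250,
                    tap_step_ps=40, num_taps=32)
# The same circuit after the buffer delay drifts by 20 ps.
drifted = select_tap(buffer_delay_ps=365, clock_period_ps=1250,
                     tap_step_ps=40, num_taps=32)
print("initial lock (tap, residual ps):", locked)
print("after drift  (tap, residual ps):", drifted)
```

In this model a 20ps drift is enough to move the best tap, which is why the control register has to be updated regularly. The residual error can also be as large as half a tap step, and that fixed error takes up a growing share of the clock period as the frequency rises.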

For higher data rates, memory systems can use a hybrid analog-digital DLL or PLL with a voltage-controlled delay line or oscillator. The hybrid approach can reduce both active power and high-frequency timing jitter since fewer transistor stages are needed to lock timing. Hybrid DLLs are less sensitive to temperature and voltage variation, and thus require fewer updates to maintain delay lock. However, the voltage-controlled circuits must be continuously active for optimal timing. Like the fully digital implementation, the hybrid approach is limited in its ability to pause or power down the DLL and clock buffers, which means little improvement to standby power.

A third alternative is to use a fully analog DLL or PLL architecture. This method supports both high-frequency operation and reduced memory system standby and active power. By implementing the DLL or PLL with analog circuits, delay and phase compensation can be achieved with finer granularity, fewer transistor stages, and lower corresponding current draw and power consumption. Furthermore, power can also be reduced by operating a portion of the circuits at lower voltages.

However, analog DLLs or PLLs have significant drawbacks. Custom analog circuits are difficult to tune and have higher sensitivity to semiconductor process variation versus the other approaches. This translates into poor yield for high volume manufacturing. The fully analog approach also requires long power-up times and is challenging to rapidly restart from a standby state.

Innovative Approaches

Rather than redesign the DLL or PLL and clock buffers, alternative approaches have been developed which eliminate these circuits on the DRAM while maintaining their functionality. One such innovation is Rambus’ FlexClocking architecture, which enables precise data alignment between the DRAM and processor without the use of a DLL on the DRAM device. The clock forwarding architecture combines with other circuit design techniques to achieve low standby power and fast turn-on times. The clocking circuits and complex phase-alignment functionality are implemented in the memory controller on the processor chip, taking advantage of more advanced and lower-power logic process technology. By removing the DLL, the constant power draw by the DRAM clocking circuitry is eliminated, significantly reducing standby power. To quantify the savings, the technology can cut about 50mA per chip when applied to a typical main memory DRAM such as DDR3. Given that a server memory module can contain up to 72 DRAM devices, and mid-range servers are often equipped with six such modules, that’s a savings of more than 20A per server, multiplied across the thousands or tens of thousands of servers in a data center.
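The per-server figure follows directly from the numbers above, as this short calculation shows. The data-center server count at the end is a hypothetical value for illustration only.

```python
# Figures quoted in the text.
SAVINGS_PER_DRAM_MA = 50   # current saved per DRAM device, in milliamps
DRAMS_PER_MODULE = 72      # upper bound for a server memory module
MODULES_PER_SERVER = 6     # typical mid-range server


def server_savings_amps(per_dram_ma: int = SAVINGS_PER_DRAM_MA,
                        drams_per_module: int = DRAMS_PER_MODULE,
                        modules: int = MODULES_PER_SERVER) -> float:
    """Total current saved per server, in amps."""
    return per_dram_ma * drams_per_module * modules / 1000


per_server_a = server_savings_amps()
print(f"Savings per server: {per_server_a:.1f} A")

# Hypothetical data center size, for illustration only.
ASSUMED_SERVERS = 10_000
print(f"Across {ASSUMED_SERVERS:,} servers: {per_server_a * ASSUMED_SERVERS:,.0f} A")
```

Because the 72-device count is an upper bound, the result is best read as a ceiling for fully populated modules.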

DRAMs are simple in concept but exceedingly complex to power-optimize. It takes clever engineering for an elegant solution.

Rambus