Take The LRDIMM Challenge To Boost Server Speed, Capacity

A rapidly rising star within the world of large-capacity servers and high-performance computing platforms is the load-reduced dual-inline memory module (LRDIMM). As its name indicates, an LRDIMM built with DDR3 SDRAM relieves much of the electrical loading burden imposed by earlier DIMM solutions. LRDIMM supports higher system-memory capacities when enabled in the system BIOS. It’s also fully pin-compatible with existing JEDEC-standard DDR3 DIMM sockets.

LRDIMM’s unprecedented capacity at higher bandwidth gives system designers a viable option for creating up to 768-Gbyte capacities on volume server systems. The technology delivers substantially higher operating data rates in the highest-capacity configurations; improved testability, usability, and memory capacity; and minimal power required in high-end systems with multiple DIMMs per channel.

A Step Forward

To get the most out of LRDIMM in today’s high-performance computing, it’s wise to first compare the module’s design with its predecessor, the registered DIMM (RDIMM). After that, designers can explore how it boosts a design’s memory capacity and speed.

RDIMM memory technology lacks LRDIMM’s buffering capabilities. Thus, designers must make tradeoffs between memory capacity and operating speed, which kills any chance of meeting ever-increasing user demands for both higher capacity and speed. LRDIMM overcomes this performance-zapping obstacle, enabling high-capacity memory systems to run at optimal operating speeds.

A memory buffer lies at the heart of the LRDIMM technology (Fig. 1). The figure shows an LRDIMM with one memory buffer on the front side of the module and multiple ranks of DRAM mounted on both the front and back.

1. This LRDIMM module incorporates one memory buffer on the front side and DRAM on both sides.

In this example, the memory buffer redrives all data, command, address, and clock signals from the host memory controller to the multiple ranks of DRAM. The buffer isolates the DRAM from the host, reducing the electrical load on all interfaces. Consequently, a system can operate at a higher speed for a given memory capacity or support a higher memory capacity at a given speed.

RDIMM technology, in contrast, connects the data bus directly to the multiple ranks of DRAM (Fig. 2). This increases the electrical load and limits system speed when the targeted capacity is boosted.

2. RDIMM connects the data bus directly to DRAM, while LRDIMM buffers this interface to reduce loading.

To maximize module and system-level memory capacity while overcoming limitations to the number of chip selects per channel, LRDIMM supports a feature called “rank multiplication.” In this scenario, multiple physical DRAM ranks appear to the host controller as a single logical rank of a larger size.

For this to happen, additional row address bits are used during the Activate command as sub-rank select bits. Because the memory buffer retains that selection after the Activate command, the Read and Write commands that follow don’t require the additional sub-rank select bits. Rank multiplication can be either disabled or set for 2:1 or 4:1 multiplication, for up to eight physical ranks per LRDIMM.
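
The rank-multiplication idea can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the buffer’s actual logic: the class name, the per-bank lookup table, and the mapping from logical rank and sub-rank to a physical rank number are all hypothetical, chosen simply to show how extra row address bits at Activate can select one of several physical ranks.

```python
class RankMultiplier:
    """Toy model of rank multiplication (disabled, 2:1, or 4:1).

    Hypothetical illustration only; the real buffer logic is not
    described in this article.
    """

    def __init__(self, multiplication: int, row_bits: int) -> None:
        if multiplication not in (1, 2, 4):
            raise ValueError("multiplication must be 1 (disabled), 2, or 4")
        self.multiplication = multiplication
        self.row_bits = row_bits  # row address width of one physical DRAM rank
        self._open = {}           # (logical rank, bank) -> physical rank

    def activate(self, logical_rank: int, bank: int, row_address: int) -> tuple:
        # Assumption: bits above the DRAM's own row width act as sub-rank select.
        sub_rank = row_address >> self.row_bits
        if sub_rank >= self.multiplication:
            raise ValueError("row address carries more sub-rank bits than configured")
        dram_row = row_address & ((1 << self.row_bits) - 1)
        # Assumption: physical ranks are numbered consecutively per logical rank.
        physical_rank = logical_rank * self.multiplication + sub_rank
        self._open[(logical_rank, bank)] = physical_rank
        return physical_rank, dram_row

    def read_or_write(self, logical_rank: int, bank: int) -> int:
        # No sub-rank bits needed: reuse the selection made at Activate.
        return self._open[(logical_rank, bank)]


rm = RankMultiplier(multiplication=4, row_bits=16)
phys, row = rm.activate(logical_rank=1, bank=3, row_address=(2 << 16) | 0x1234)
print(phys, hex(row), rm.read_or_write(logical_rank=1, bank=3))
```

With 4:1 multiplication, two logical ranks cover the eight physical ranks that the article gives as the per-module maximum.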

Testing Becomes Easier

The LRDIMM architecture also enables designers to build in an array of additional features. For example, LRDIMM memory buffers can test DRAM and LRDIMM in transparent mode. In this mode, the buffer acts as a simple signal redrive buffer and passes commands and data directly through to the DRAM devices.

In addition, the buffer’s memory built-in self-test (MemBIST) functionality allows individual bit-level testing for all 72 DQ bits on the data bus. Testing can be performed at full operational speed, using either in-band—command/address bus—or out-of-band serial management bus (SMBus) access.

Voltage-reference (VREF) margining enables the LRDIMM design to use externally supplied voltage references for data (VREFDQ) and command/address bits (VREFCA). Alternatively, these reference voltages can be supplied internally from the memory buffer.

Such margining capability is provided on both the host and DRAM sides of the buffer, allowing for system test and initialization. When the buffer provides VREF, the host can control the voltage level through the buffer’s configuration registers.

The programmable voltage references enable an LRDIMM memory system to use independent voltage references for the host-to-DRAM interfaces and host-controller-to-memory-buffer interfaces, so VREF margining can be performed independently on each. With this capability, module and system suppliers can separately characterize the robustness of the LRDIMM memory system’s signaling interfaces and verify signal integrity.

Furthermore, an LRDIMM buffer makes parity checking possible, which detects corrupted commands on the command/address bus: it checks the incoming commands and asserts an ERROUT_n signal if a parity error is detected. The buffer’s SMBus interface provides out-of-band access to read and write configuration and status registers. In addition, an integrated temperature sensor is updated eight times per second and is accessible at any time through the SMBus. The buffer’s EVENT_n pin can also be configured as an interrupt back to the host to indicate high-temperature events.
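
A minimal sketch of the parity check follows. It assumes even parity over the command/address bits and an active-low error output (suggested by the _n suffix); the article does not specify the parity scheme or which signals are covered, and the function names here are hypothetical.

```python
def parity_error(ca_bits: int, par_in: int) -> bool:
    """Return True if the command/address word fails an even-parity check.

    Assumption: the host drives par_in so that the total number of ones
    across ca_bits plus par_in is even.
    """
    ones = bin(ca_bits).count("1") + (par_in & 1)
    return ones % 2 != 0


def errout_n(ca_bits: int, par_in: int) -> int:
    """Model an active-low error pin: 0 means asserted, 1 means idle."""
    return 0 if parity_error(ca_bits, par_in) else 1


# 0b1011 has three ones, so the matching even-parity bit is 1.
print(errout_n(0b1011, 1))  # idle
print(errout_n(0b1011, 0))  # asserted: parity mismatch
```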

Scale LRDIMM To Higher Speeds

Designers often face the challenge of determining when to migrate from established RDIMM technology to the more powerful and enabling LRDIMM solution. RDIMM does provide some system memory capacity advantages over unbuffered DIMMs. However, since an RDIMM’s register component only buffers the command and address buses, the unbuffered data bus remains a weakness for an RDIMM-based memory system.

For example, a quad-rank DDR3 RDIMM presents four electrical loads on the data bus per RDIMM. These RDIMMs, then, can only operate at a maximum data rate of 1066 Mtransfers/s in a configuration with one DIMM per channel (DPC) and 800 Mtransfers/s in a two-DPC arrangement. In contrast, LRDIMM, which buffers the data bus along with the command and address buses, can operate at higher data rates and in higher memory density configurations.

A simulated eye diagram of the data bus with two quad-rank RDIMMs in a two-DPC configuration helps illustrate this scenario (Fig. 3). It demonstrates that the presence of eight electrical loads on the data bus severely degrades the memory channel’s signal integrity and limits the memory system’s signaling rate.

3. This data bus with two quad-rank RDIMMs at 1333 Mtransfers/s demonstrates severely degraded signaling rate.

Specifically, this diagram shows that with eight electrical loads at 1333 Mtransfers/s, the maximum data eye width on the data bus reduces to 212 ps at an idealized VREF point and less than 115 mV at the maximum voltage opening. The reduced data eye means that the two quad-rank RDIMMs in the two-DPC configuration aren’t suitable for operation at 1333 Mtransfers/s. This highlights the difficult tradeoffs that must be made between higher capacity and higher-data-rate operation in RDIMM memory systems.

To compare the difference with LRDIMM, a simulated eye diagram of the data bus was created using two quad-rank LRDIMM modules in a similar two-DPC configuration (Fig. 4). By replacing the electrical loads of the eight physical ranks of DRAM devices with two electrical loads of the memory buffer on the data bus, the data bus’ signal integrity improves dramatically.

4. Here, the data bus with two quad-rank LRDIMMs at 1333 Mtransfers/s improves the data eye width from 212 ps to 520 ps.

More specifically, under the same simulation conditions as those used for the two quad-rank RDIMMs in Figure 3, maximum data eye width on the data bus jumps from 212 ps to 520 ps, and maximum data eye height improves from 115 mV to 327 mV at the maximum voltage opening. The enhanced signal integrity means that LRDIMM can operate at 1333 Mtransfers/s and above, even when multiple LRDIMM modules populate the same channel.
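
To put the simulated eye numbers in proportion, the short sketch below counts data-bus loads for each case and compares the eye widths against the bit time at 1333 Mtransfers/s (about 750 ps). Relating eye width to the bit time is an added illustration rather than a metric used in the simulations; the eye values themselves are the ones quoted for Figures 3 and 4.

```python
DATA_RATE_MT_S = 1333
UNIT_INTERVAL_PS = 1e6 / DATA_RATE_MT_S  # one bit time in picoseconds


def data_bus_loads(loads_per_dimm: int, dimms_per_channel: int) -> int:
    """Electrical loads seen on the data bus of one channel."""
    return loads_per_dimm * dimms_per_channel


# A quad-rank RDIMM presents four loads; an LRDIMM presents one (its buffer).
rdimm_loads = data_bus_loads(loads_per_dimm=4, dimms_per_channel=2)   # 8
lrdimm_loads = data_bus_loads(loads_per_dimm=1, dimms_per_channel=2)  # 2

# Simulated eye openings quoted in the article for two-DPC operation.
RDIMM_EYE = {"width_ps": 212, "height_mv": 115}
LRDIMM_EYE = {"width_ps": 520, "height_mv": 327}

width_gain = LRDIMM_EYE["width_ps"] / RDIMM_EYE["width_ps"]
height_gain = LRDIMM_EYE["height_mv"] / RDIMM_EYE["height_mv"]
rdimm_fraction = RDIMM_EYE["width_ps"] / UNIT_INTERVAL_PS
lrdimm_fraction = LRDIMM_EYE["width_ps"] / UNIT_INTERVAL_PS

print(f"loads: RDIMM {rdimm_loads}, LRDIMM {lrdimm_loads}")
print(f"bit time: {UNIT_INTERVAL_PS:.1f} ps")
print(f"eye width gain: {width_gain:.2f}x, eye height gain: {height_gain:.2f}x")
print(f"eye width / bit time: RDIMM {rdimm_fraction:.0%}, LRDIMM {lrdimm_fraction:.0%}")
```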

Capacity Implications

As already stated, one key advantage of LRDIMM is its ability to dramatically increase total system memory capacity without sacrificing speed. Devices such as Inphi’s iMB memory buffer electrically isolate the DRAM from the data bus, allowing additional ranks of DRAM to be added to each DIMM.

At the same time, LRDIMM maintains signal integrity with multiple DIMMs installed on each system memory channel. Consequently, up to 32-Gbyte LRDIMM capacities are possible (with 4Rx4 modules using a 4-Gbit dual-die package DRAM). Since each LRDIMM presents just a single electrical load to the host, more DIMMs can be installed per channel as well.

Therefore, a high-capacity server with two processors, three DIMM slots per channel, and four channels per processor can hold two to three times more system memory with LRDIMM than with RDIMM. The differences become stark when contrasting typical RDIMM and LRDIMM capacity limits for each operating speed and voltage (see the table).

For 1.5-V DDR3 operation at 800 Mtransfers/s, a system that employs three 16-Gbyte 2Rx4 RDIMMs per channel could reach 384 Gbytes. With support for 32-Gbyte 4Rx4 modules, system memory capacity using LRDIMM can double that limit, reaching 768 Gbytes. The LRDIMM’s rank multiplication feature can overcome system chip-select limits (typically eight total DRAM ranks per channel). In this case, it allows for the required 12 physical ranks per channel.
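
The capacity and rank arithmetic can be checked with a few lines. This sketch assumes the two-processor, four-channel, three-DPC server described above; the function names are hypothetical, and the eight-rank chip-select limit is the typical figure quoted in the text.

```python
def system_capacity_gbytes(processors: int, channels_per_processor: int,
                           dimms_per_channel: int, gbytes_per_dimm: int) -> int:
    """Total system memory for a fully populated server."""
    return processors * channels_per_processor * dimms_per_channel * gbytes_per_dimm


def ranks_per_channel(dimms_per_channel: int, ranks_per_dimm: int,
                      rank_multiplication: int = 1) -> tuple:
    """Return (physical ranks, logical ranks seen by the host) per channel."""
    physical = dimms_per_channel * ranks_per_dimm
    return physical, physical // rank_multiplication


CHIP_SELECT_LIMIT = 8  # typical ranks per channel the host can address

rdimm_total = system_capacity_gbytes(2, 4, 3, 16)   # three 16-Gbyte 2Rx4 RDIMMs
lrdimm_total = system_capacity_gbytes(2, 4, 3, 32)  # three 32-Gbyte 4Rx4 LRDIMMs

physical, logical = ranks_per_channel(3, 4, rank_multiplication=2)

print(rdimm_total, lrdimm_total)                        # 384 768
print(physical, logical, logical <= CHIP_SELECT_LIMIT)  # 12 6 True
```

Without rank multiplication, the 12 physical ranks would exceed the eight-rank limit; at 2:1 the host sees six logical ranks.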

LRDIMM Minimizes Power Impact

LRDIMM fosters higher-memory-capacity designs, but not at the expense of high power consumption. In a one-DPC configuration, the memory buffer on a single LRDIMM module draws more power than the registering clock driver on a single RDIMM. However, the difference drops substantially for higher-density two- and three-DPC schemes.

Taking a closer look, normalized power per RDIMM is compared against LRDIMM for one- and two-DPC configurations at various speeds (Fig. 5a). Since actual power consumption depends primarily on the memory density and DRAM technology, the comparison shows relative power for LRDIMMs and RDIMMs using the same-generation DRAM on equivalent 32-Gbyte 4Rx4 modules. RDIMM module power at 800 Mtransfers/s was normalized to 1.00, and all other results are scaled to that reading. Memory was exercised using a standard benchmarking tool set to generate maximum bandwidth with 50% reads and 50% writes.

5. The one-DPC RDIMM is more power-efficient for split read/write test (a) and read-only (b), as indicated by the normalized power per DIMM for the same speed compared to one-DPC LRDIMM. However, the two-DPC LRDIMM is more efficient than both of them.

At 800 Mtransfers/s in a one-DPC configuration, LRDIMM power was 17% higher than that of RDIMM. However, at 800 Mtransfers/s in a two-DPC arrangement, LRDIMM power measured 3% lower than RDIMM power. At 1066 Mtransfers/s in a one-DPC configuration, LRDIMM power registered 15% higher than RDIMM power. Again, though, with a two-DPC arrangement, it dropped 15% below that of one-DPC RDIMM power. At 1333 Mtransfers/s, power per LRDIMM was 28% lower in a two-DPC configuration.

Similar results appeared for maximum bandwidth memory accesses with 100% reads (Fig. 5b). The power differential at one DIMM per channel runs slightly higher than in the 50/50 read/write case, but again drops off significantly in two-DPC configurations. Power per module in the two-DPC configuration is likely more compelling to designers, because LRDIMM is primarily intended for high-density memory applications. These results affirm that higher-density, two-DPC configurations deliver optimal speed and capacity benefits with no power penalty.

Multiple DIMMs Per Channel Hikes Efficiency

A hypothetical comparison of the incremental dc power required by the second RDIMM or LRDIMM in a two-DPC configuration shows how much power can be saved. If an LRDIMM module’s non-target buffer DQ termination is 60 Ω, the termination voltage is 750 mV (V_DD/2), and the DQ signal level is 1250 mV (high) or 250 mV (low), then 500 mV appears across the termination in either state, and LRDIMM power per bit is calculated as:

$$P_{LRDIMM} = \frac{(1250\ \text{mV} - 750\ \text{mV})^2}{60\ \Omega} = \frac{(0.5\ \text{V})^2}{60\ \Omega} \approx 4.17\ \text{mW}$$

For the RDIMM, the two non-target DRAM terminations may need to be set to 40 Ω each, for a parallel combination of 20 Ω, which gives:

$$P_{RDIMM} = \frac{(0.5\ \text{V})^2}{20\ \Omega} = 12.5\ \text{mW}$$

This would save 8.33 mW per bit for the non-target DIMM, or approximately 900 mW for the 108 bits of the data bus (64 DQ data bits, 8 ECC DQ bits, and 36 DQS strobe bits).
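
The dc termination arithmetic can be reproduced directly from the stated values. The sketch below assumes the simple model implied by the text, in which power per bit is the square of the voltage across the termination divided by the termination resistance; the function and variable names are hypothetical.

```python
def parallel(*resistances_ohm: float) -> float:
    """Equivalent resistance of resistors in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances_ohm)


def termination_power_mw(v_signal: float, v_term: float, r_ohm: float) -> float:
    """DC power per bit dissipated in the termination, in milliwatts."""
    return (v_signal - v_term) ** 2 / r_ohm * 1000.0


V_TERM = 0.750  # V_DD / 2
V_HIGH = 1.250  # DQ high level; the low level (0.250 V) gives the same 0.5-V drop
DATA_BUS_BITS = 64 + 8 + 36  # DQ data, ECC DQ, and DQS strobe bits

lrdimm_mw = termination_power_mw(V_HIGH, V_TERM, 60.0)
rdimm_mw = termination_power_mw(V_HIGH, V_TERM, parallel(40.0, 40.0))
saving_per_bit_mw = rdimm_mw - lrdimm_mw
saving_total_mw = saving_per_bit_mw * DATA_BUS_BITS

print(f"LRDIMM: {lrdimm_mw:.2f} mW/bit, RDIMM: {rdimm_mw:.2f} mW/bit")
print(f"saving: {saving_per_bit_mw:.2f} mW/bit, {saving_total_mw:.0f} mW total")
```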

In addition, dynamic power savings are possible because fewer capacitive loads charge or discharge in the two-DPC LRDIMM than in the two-DPC RDIMM. For each data bit, the DQ driver sees four additional loads going from one DPC to two in the RDIMM case, and only one additional load in the LRDIMM case, a difference of three loads.

Assuming a 2-pF input capacitance for each of these three additional loads toggling at 1333 MHz (50% of the inputs charging or discharging for a random data pattern with a 1.0-V high-to-low voltage swing), the two-DPC LRDIMM could save:

or:

The 108 bits of the data bus would achieve a total dynamic power savings of approximately 1350 mW. Combined with the dc power savings calculated above, the two-DPC LRDIMM configuration could save over 2 W—a substantial improvement in power efficiency over RDIMM technology in high-density configurations.