Optimize Memory Subsystem For Top Performance

A Better Understanding Of Memory Accesses Allows DSP Memory Subsystems To Be Better Matched To The DSP Chips.

May 25, 1998

Designers are increasingly using multiple DSP chips in applications that contain huge data sets--tens to hundreds of megabytes. Such applications can no longer be economically implemented with static RAMs, which typically have maximum capacities of 512 kbytes. Consequently, many system designers must consider the use of dynamic RAMs (DRAMs) to provide the larger memory space. Most DRAMs, however, are designed for PC workstations. To optimize DRAM use in DSP applications, designers must select the correct DRAM technology based on a different set of goals.

In addition, most DSP chips are optimized for I/O handling, and that typically means an interface optimized for use with SRAMs. As a result, overall memory subsystem performance in a DSP application depends on both the memory technology and the DSP chip's external interface.

Designers can pick from several DRAM architectures, each of which brings a number of pros and cons for various DSP system implementations. Thus, a better understanding of DRAM architectures and the DSP memory interface will allow designers to better optimize the memory subsystem for multiprocessor DSP applications.

On PCs, short read bursts for instruction cache-line fills have dominated accesses to main memory. But the increasing use of object-oriented languages and multitasking operating systems on PCs has led to a significant number of accesses that are dispersed throughout main memory. This, in turn, has led to an increasing emphasis on random-access latency instead of solely on burst-access time for subsequent reads to an open DRAM page.

Due to the emphasis on random-access latency, many PC manufacturers were slow to replace EDO (extended data out) DRAMs with synchronous DRAM (SDRAM) technology, which emphasizes burst accesses. In a typical 66-MHz memory implementation, SDRAM adds a cycle of latency on the initial access in exchange for one less cycle on each of the subsequent accesses. For a four-clock burst the net result is a two-cycle savings, but that is only relevant if more than just the first fetch was needed.
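The trade-off can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 5-2-2-2 EDO timing is a hypothetical figure, and the SDRAM timing is derived from it using the relationship stated above (one extra cycle on the first access, one less on each subsequent access).

```python
def burst_cycles(first_access: int, subsequent: int, burst_length: int) -> int:
    """Total clock cycles needed to transfer `burst_length` data beats."""
    if burst_length < 1:
        raise ValueError("burst_length must be at least 1")
    return first_access + subsequent * (burst_length - 1)


# Hypothetical 66-MHz EDO timing (5-2-2-2); not taken from any data sheet.
EDO_FIRST, EDO_NEXT = 5, 2
# SDRAM per the text: one more cycle up front, one less per later beat.
SDRAM_FIRST, SDRAM_NEXT = EDO_FIRST + 1, EDO_NEXT - 1


def sdram_savings(burst_length: int) -> int:
    """Cycles saved by SDRAM over EDO (negative means SDRAM is slower)."""
    edo = burst_cycles(EDO_FIRST, EDO_NEXT, burst_length)
    sdram = burst_cycles(SDRAM_FIRST, SDRAM_NEXT, burst_length)
    return edo - sdram


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"burst of {n}: SDRAM saves {sdram_savings(n)} cycle(s)")
```

Under these assumptions a four-beat burst saves two cycles, a two-beat burst breaks even, and a single fetch costs one cycle more, which is why the gain matters only when the rest of the burst is actually used.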

Differing Emphasis
In a DSP system, the speed of instruction loads is generally not the main concern. Signals are typically processed as vectors, which are many times the length of the data cache line. The code for the tight inner loops of signal processing is typically loaded once for a long vector of data. The emphasis, therefore, is on both the speed of subsequent reads to the same cache line and immediate access to sequential memory locations.

The workhorse dynamic memories like standard fast-page mode (FPM) DRAMs, EDO DRAMs, and burst-EDO DRAMs are basically the same, save for some differences in the interface for reading data out at the time of the column access strobe (CAS) signal. With FPM DRAMs, the CAS signal causes data to be read directly from the sense amplifiers. EDO DRAMs add a latch to the output of those sense amplifiers, which allows the data-output buffers to stay on even after the rising edge of CAS. The result is a faster cycle time from column address to column address--up to a third faster than standard FPM DRAMs.

Burst-EDO DRAMs replace the output latch on the EDO DRAM with a register. That adds an internal pipeline stage, which allows data within a burst to come out quicker after the CAS signal for the second and subsequent accesses in the burst. The trade-off is an extra pipeline stage for the CAS signal on the first access, but this does not lower performance because the first data access is limited by the row access strobe (RAS) time, not the CAS time.

SDRAMs present more of an architectural change from FPM DRAMs than do the EDO DRAM variations. From the DSP system designer's standpoint, the important differences are that SDRAMs are synchronous and use a clock input. Internally, an SDRAM divides the memory into multiple banks, each with its own row decoder and sense amps. Current high-performance SDRAMs use four internal memory banks, although earlier versions typically used two banks (Fig. 1).

The multibank architecture eliminates gaps between data accesses because data can be accessed from one bank while the others are precharging. The SDRAMs buffer both inputs and outputs, and that does affect the latency for the first access in a burst. The increased pipelining, though, enables both quicker access to a full burst and operation at higher frequencies, compared to EDO DRAMs.

As a result, one of the key performance issues becomes how the system can deal with pipelined memory operations. The highest memory-to-processor throughput is achieved by using the multiple accesses inherent in the bursts of a cache line load. If that approach isn't used, the access rate is limited by the speed of the address bus, whose duty cycle is usually only a fraction of the data bus's. To reach the full potential of pipelined memory systems, the pipeline should be kept full as long as possible. Like a pump that needs priming, a pipelined memory system incurs startup latency every time the pipeline stalls. Accessing the long vectors typically used in signal processing data arrays helps keep the pipeline full.
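How much the startup latency costs depends on how many data beats follow it. The sketch below uses a deliberately simple model, not one taken from any particular device: a single uninterrupted stream pays a fixed startup cost once, then delivers one beat per cycle. The six-cycle startup figure is a hypothetical value.

```python
def pipeline_efficiency(vector_beats: int, startup_cycles: int) -> float:
    """Fraction of cycles that deliver data for one uninterrupted stream.

    The stream pays `startup_cycles` once to prime the pipeline, then
    delivers one data beat per cycle for `vector_beats` cycles.
    """
    if vector_beats < 1 or startup_cycles < 0:
        raise ValueError("need at least one beat and a non-negative startup")
    return vector_beats / (vector_beats + startup_cycles)


if __name__ == "__main__":
    STARTUP = 6  # hypothetical startup latency, in bus cycles
    for beats in (4, 64, 1024):
        eff = pipeline_efficiency(beats, STARTUP)
        print(f"{beats:5d} beats: {eff:.1%} of cycles carry data")
```

With these numbers a lone four-beat burst keeps the data bus busy only 40% of the time, while a 1024-beat vector pushes that above 99%.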

Match Latency To Pipeline
When evaluating the various memory technologies for use in DSP systems, the designer should match each technology to the processor's capabilities. That is, the latency of the memory subsystem should be matched to the pipeline capabilities of the processor. The more pipelining in the processor, the higher the latency it can tolerate in the memory and the memory controller without affecting throughput.
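A rough way to quantify this relationship is to ask how many requests the processor must keep in flight to hide a given latency. The model below is a simplification assumed for illustration: every request waits a fixed number of cycles before its first data beat, then occupies the data bus for one burst, and the processor always keeps a fixed number of requests outstanding. The function names and the example figures are hypothetical.

```python
def bus_utilization(outstanding: int, latency_cycles: int, burst_beats: int) -> float:
    """Steady-state data-bus utilization, between 0.0 and 1.0.

    Each request waits `latency_cycles` before its first beat, then
    transfers `burst_beats` beats at one beat per cycle. The processor
    keeps `outstanding` requests in flight at all times.
    """
    if outstanding < 1 or burst_beats < 1 or latency_cycles < 0:
        raise ValueError("invalid pipeline parameters")
    cycles_per_request = latency_cycles + burst_beats
    return min(1.0, outstanding * burst_beats / cycles_per_request)


def max_tolerable_latency(outstanding: int, burst_beats: int) -> int:
    """Largest latency (in cycles) that still keeps the data bus fully busy."""
    return (outstanding - 1) * burst_beats


if __name__ == "__main__":
    BURST = 4    # beats per cache-line burst (assumed)
    LATENCY = 8  # cycles before the first beat (assumed)
    for depth in (1, 2, 3):
        print(depth, f"{bus_utilization(depth, LATENCY, BURST):.0%}")
```

In this model a processor with no pipelining reaches only a third of the bus bandwidth at an eight-cycle latency, while one that keeps three four-beat bursts in flight hides the latency completely.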

However, many DSP chips do not support pipelined memory accesses, while other architectures support pipelining by using a separate DMA engine. In some applications, RISC processors can be used to execute DSP algorithms. Some RISC CPUs support pipelining, but even those are optimized to deal with alternating instruction and data accesses. Such a simplified design can work well on general-purpose applications that have an unstructured mix of instruction and data transfers, but it is not of much benefit for DSP applications. Unrestricted data pipelining delivers the most benefit in DSP applications, and this capability is available in some high-end RISC chips.

Additional considerations affect memory-access latency in RISC processors. For example, some high-end processors support a feature called "late cancellation," which allows a memory access to be canceled the cycle after the acknowledgment is sent by the memory controller. Late cancellation can be useful to support either an error-correcting (ECC) memory implementation or a cache coherency protocol. To implement this feature, however, the processor must have an additional one-cycle internal latency. If ECC and cache coherency are not needed, overall latency can be reduced by turning this feature off.

In addition to the processor's external memory interface and the DRAM itself, the memory controller is the third component that greatly affects the latency and throughput in the memory subsystem. The latency through a memory controller is mostly affected by the technology of the components. For absolute speed, an ASIC is the best choice; however, FPGAs are good alternatives to meet demands of flexibility and shortened time to market. FPGAs also are amenable to a pipelined implementation due to the abundance of flip-flops in their architecture.

FPGAs provide the flexibility to design the exact controller features and behavior desired, such as the page handling algorithm, and they are relatively fast, with the latest crop claiming to support 90-MHz and faster pipelines. Such pipeline speeds should be sufficient to keep pace with most of the highest-speed RISC processors, which operate with bus interface speeds of 83.3 MHz. Once the design has been debugged, it is relatively easy to turn an FPGA into an ASIC. And, although converting the FPGA to an ASIC will not remove any of the stages of the latency without a redesign, it will permit the circuits to operate at higher clock frequencies while reducing system cost.

When they use SDRAMs, designers must decide how the memory controller will handle the open pages of the multiple internal banks. One technique is to treat the open page in each of the four banks as an independent page. This architecture can present four separate memory buffers to the application. The results from operating on two vectors in two different buffers could be placed in a third vector in the third buffer.

One disadvantage to this approach is that it places a greater burden on the software. Each vector must be placed in a different buffer to get maximum throughput. The manipulation required to do this for a long chain of vector operations is not an easy task and, in fact, may be impossible.

An alternative option is to present one big open page that covers all the internal banks. Accesses, though, will still be interleaved at the cache-line level. Not only will this arrangement be simpler to program, it also will provide better performance for strided accesses. Strided accesses would stay within a larger page for a larger number of accesses. The final benefit is a simplified design for the memory controller compared to managing four open pages, and that would have a direct impact on improving the time to market.
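The difference between the two page-handling schemes shows up in how the controller decodes an address. The sketch below is a hypothetical mapping, not the one used by any particular controller: the line size, page size, and bank size are all assumed values, and the two decode functions simply illustrate "one page per bank" versus "cache lines interleaved across banks."

```python
LINE_BYTES = 32          # cache-line size (assumed)
PAGE_BYTES = 2048        # bytes in one open row of one bank (assumed)
NUM_BANKS = 4            # internal SDRAM banks
BANK_BYTES = 4 * 2**20   # capacity of one bank (assumed)


def decode_independent(addr: int) -> tuple[int, int, int]:
    """(bank, row, column) when each bank holds its own independent page."""
    bank = addr // BANK_BYTES
    offset = addr % BANK_BYTES
    return bank, offset // PAGE_BYTES, offset % PAGE_BYTES


def decode_interleaved(addr: int) -> tuple[int, int, int]:
    """(bank, row, column) when cache lines are interleaved across banks."""
    line = addr // LINE_BYTES
    bank = line % NUM_BANKS
    line_in_bank = line // NUM_BANKS
    lines_per_page = PAGE_BYTES // LINE_BYTES
    row = line_in_bank // lines_per_page
    column = (line_in_bank % lines_per_page) * LINE_BYTES + addr % LINE_BYTES
    return bank, row, column


def accesses_before_row_change(decode, stride: int) -> int:
    """Count strided accesses from address 0 that stay in the starting row."""
    if stride < 1:
        raise ValueError("stride must be positive")
    start_row = decode(0)[1]
    count, addr = 0, 0
    while decode(addr)[1] == start_row:
        count += 1
        addr += stride
    return count


if __name__ == "__main__":
    STRIDE = 256  # bytes between consecutive accesses (assumed)
    print(accesses_before_row_change(decode_independent, STRIDE))
    print(accesses_before_row_change(decode_interleaved, STRIDE))
```

With these assumed sizes, a 256-byte stride gets 8 accesses out of a single bank's page but 32 out of the combined page, since the combined page spans four times as many contiguous bytes.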

Finally, most DSP designs are size limited. In addition to reserving space for the processor, DRAMs, and memory controller, designers must allow for line terminations as well. For example, a multiprocessor board with four processors, each with a 64-bit external interface, has 256 data lines that would need to be terminated on each end. For many implementations, the space required to place 256 resistors on a board is prohibitive.

The best option could be to choose bus interface devices that include output resistors. If such parts are not available, transceivers are another option. In many memory bus implementations, however, the physical placement constraint would cause the drivers to be spread too far apart, and ringing on the signal lines would result. Ringing, though, can be controlled without termination by minimizing line lengths and choosing drivers that match the particular line characteristics. Such a design is possible, even at speeds over 80 MHz, but only after performing a thorough signal-integrity analysis.

Engineers can create multiprocessor DSP systems using optimized building blocks containing the processor, memory, and memory controller. Designing such a system becomes an exercise in optimizing data transfers between processors and memory, and between two memories. Ideally, the interface to remote memory should have the same features and performance as for local memory.

Available interconnect technology is typically a factor of two to four times slower than current memory buses, so the design must be optimized for the difference in performance. One popular approach is to use multiple levels of interconnection that vary by distance. This approach provides very high bandwidth where possible for nearby processors and memory, while using standard interconnects for more distant connections.

For neighboring processors and memories on the same printed-circuit board, the full local memory bandwidth should be achievable. For example, a 64-bit, 83.3-MHz connection can be used to connect a set of four processors and memories (Fig. 2). At those frequencies, using a bus to make the connection can be problematic. A better approach would be to make point-to-point connections using a multi-ported switch. One port on the switch should be connected to the interconnect used between boards.

The interconnect should minimize the funneling effect of going from the neighborhood connections to the inter-board connections. This implies choosing a high bandwidth per connection, and using very high-speed FIFOs to decouple the two bus speeds. A DMA engine at the interface to the interconnect will prevent a fast processor from waiting for the interconnect transfer to complete.

Real-World Example
How does all this come together in the real world? Let's examine the high-performance memory architecture for DSP applications employed in the Excalibur PowerPC daughtercard from SKY Computers. On the daughtercard, each of four PowerPC processors is connected to its own local SDRAM by an 83.3-MHz interconnect (Fig. 2, again). The memory controllers also are connected to each other at the same 83.3-MHz rate, so any processor can access any memory on the daughtercard at the same raw bandwidth.

PowerPC processors were chosen primarily for the sustained throughput of their external interface, including the raw bandwidth and the pipelining capabilities. Different PowerPC processors have different pipelining capabilities, as well as being available with different maximum CPU core frequencies. The Excalibur card can be implemented with different processors to take advantage of those various combinations as they change over time.

The current top performer is the 333-MHz PowerPC 604e. It allows three memory accesses to be pipelined, enabling single-cycle throughput for subsequent loads from the same cache line. The 604e also has a feature called "streaming," which allows subsequent cache line loads to occur without a gap. The result: a sustained memory access pattern for data reads of "... 1111 1111 1111 ..." Therefore, the sustained performance in this case is the same as the theoretical peak performance.
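An access pattern written this way can be turned into a sustained-to-peak ratio directly. The sketch below assumes the usual reading of such notation, in which each digit is the number of bus clocks taken by one data beat and spaces separate cache lines. The "3111" pattern is a hypothetical non-streaming case included only for comparison.

```python
def sustained_fraction(pattern: str) -> float:
    """Sustained bandwidth as a fraction of peak for an access pattern.

    Each digit is the number of bus clocks one data beat takes; any other
    characters (spaces, dots) are ignored. Peak is one beat per clock.
    """
    cycles = [int(ch) for ch in pattern if ch.isdigit()]
    if not cycles or min(cycles) < 1:
        raise ValueError("pattern needs at least one beat of one or more clocks")
    return len(cycles) / sum(cycles)


if __name__ == "__main__":
    for pattern in ("... 1111 1111 1111 ...", "3111 3111 3111"):
        print(f"{pattern!r}: {sustained_fraction(pattern):.0%} of peak")
```

The streaming pattern evaluates to 100% of peak, while the hypothetical pattern with a two-cycle gap before each cache line drops to about 67%.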

The memory controllers on Excalibur also can access the interface to the SKYchannel interconnect. The ANSI-standard SKYchannel Packet Bus (ANSI/VITA 10-1995) provides a 320-Mbyte/s connection to all the other daughtercards in the system. Multichassis systems as large as 4096 units are possible using SKYchannel (Fig. 3). Achieving the highest performance for multiprocessor DSP applications thus requires a global optimization of processors, memory technology, controller design, and multiprocessor interconnect.

When SDRAMs are combined with processors that support pipelining, they can provide sustained memory accesses at 667 Mbytes/s, while providing hundreds of megabytes of storage capacity. This combination of speed and capacity is critical for DSP applications with large data sets.
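The 667-Mbyte/s figure follows from the bus width and clock rate given earlier. The short calculation below assumes that the quoted 83.3 MHz is a rounded 250/3 MHz; that exact value is an assumption, not something stated in the text.

```python
def peak_mbytes_per_s(bus_bits: int, clock_mhz: float) -> float:
    """Peak transfer rate of a bus moving one word per clock, in Mbytes/s."""
    return bus_bits / 8 * clock_mhz


BUS_CLOCK_MHZ = 250 / 3        # assumed exact value behind "83.3 MHz"
SKYCHANNEL_MBYTES_PER_S = 320  # inter-board rate quoted in the article

local = peak_mbytes_per_s(64, BUS_CLOCK_MHZ)

if __name__ == "__main__":
    print(f"local memory bus: {local:.0f} Mbytes/s")
    print(f"ratio to SKYchannel: {local / SKYCHANNEL_MBYTES_PER_S:.2f}x")
```

The result rounds to 667 Mbytes/s, roughly twice the 320-Mbyte/s inter-board rate, which sits at the low end of the two-to-four-times gap between memory buses and interconnects noted earlier.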