DESIGN VIEW is the summary of the complete DESIGN SOLUTION contributed article, which follows below.
With wireless transmission standards evolving from 2G to 2.5G to 3G and beyond, each wireless infrastructure network subsystem is under pressure to handle increasing performance and bandwidth requirements. At the same time, subsystem chip vendors trying to supply the necessary functionality and performance are facing technological limitations. As a result, system architectures have to be redesigned with nontraditional components. Multiport memories, also known as specialty memories, fall into this component category.
As wireless networks transform themselves to carry multimedia traffic at 3G rates, the complexity of processing requirements inside a baseband card has increased tremendously. A large number of DSPs, FPGAs, and ASICs are used to partition the tasks, process the data in parallel, and share it in real time. Employing multiport memories with large buffering capacity enables such interprocessor communication.
Multiport memories improve the system's overall performance by increasing overall throughput, adding design flexibility, and enabling faster time-to-market. Board design also improves, since these memories shorten the distance that signals must travel between DSPs and FPGAs. Moreover, point-to-point connections reduce the load on interfacing DSPs, FPGAs, and ASICs.
This article focuses on one of the main components of a 3G basestation—the baseband-processing card. This is where the most intensive computation and signal processing takes place. In particular, the article concentrates on the card's receive flow. See the figure below, which highlights the card's receive section.
Also discussed are techniques for achieving efficient chip-rate and symbol-rate processing.
Baseband Processing Cards: In 3G basestations, most of the heavy-duty computation and signal processing occur on this card. The receive section of the card is complex because it simultaneously receives multiple signals from users, along with the interference.

Efficient Chip-Rate Processing: Chip-rate processing on the uplink (from the end user to the basestation) can be optimized via a combination of FPGAs and DSPs.

Rake Receiver: A Rake receiver solves the problem of multipath signals. ("Rake" comes from the image of varying-length finger paths used in the receiver, which looks like a common garden rake.) The Rake takes copies of the received signal, sends them down separate finger paths, and sums together the outputs of each finger at the end.

Efficient Symbol-Rate Processing: Data rates involved with symbol-rate processing are much lower than those associated with chip-rate processing. Most symbol-rate processing is carried out using just DSPs. Tasks performed here include CRC encoding and decoding as well as convolutional encoding.
The full article follows.
With wireless transmission standards evolving from 2G to 2.5G to 3G and beyond, each wireless infrastructure network subsystem is under pressure to handle increasing performance and bandwidth requirements. At the same time, subsystem chip vendors trying to supply the necessary functionality and performance are facing technological limitations. To meet these mounting demands, system architectures have to be redesigned with non-traditional components. Multiport memories, also known as specialty memories, represent one such component enabling today’s network-equipment subsystems.
As wireless networks transform to carry multimedia traffic (voice, data, and video) at 3G rates, the complexity of processing requirements inside a baseband card has increased tremendously. A large number of DSPs, FPGAs, and ASICs are used to partition the tasks, process the data in parallel, and share it in real time. Employing multiport memories with large buffering capacity enables such interprocessor communication.
Compared to the existing GSM networks, 3G networks require at least an order of magnitude greater processing. On top of that, the demand for signal processing in these networks is outpacing the process technologies’ ability to deliver the required processor speeds. A number of techniques that use multiport memory in the basestation architecture can fill that gap.
Multiport memories help wireless baseband systems in a variety of ways. They improve the system’s overall performance by increasing overall throughput, adding design flexibility, and enabling quick time-to-market. In addition, board design is improved as multiport memories shorten the distance that a signal must travel between DSPs and FPGAs. Moreover, point-to-point connections reduce the load on interfacing DSPs, FPGAs, and ASICs.
Figure 1 shows the main components of a 3G basestation (also known as a node B). The main components can be broken down into the following constituent parts: antenna, amplifiers, filters, baseband processing card, power, control and clock distribution, plus the network interface. The focus here is on the baseband-processing card because that’s where the most intensive computation and signal processing takes place.
Baseband Processing Card
The receive section of the card is much more complex than the transmit section of the baseband flow, in accordance with the old idiom "it’s much easier to talk than to listen." This is due to the fact that the basestation can simultaneously receive multiple users’ signals, multiple copies of the same user’s signals, and interference from the many noise sources that exist in the environment between the mobile user and the basestation. Separating these multiple signal sources demands great computational resources.
On the transmit side, the basestation is only concerned with converting raw data from users into the format used by the 3G air interface protocol, and transmitting the data. The design suggestions and techniques discussed here focus mainly on the receive flow within the baseband card. The TX and RX baseband processing sections can be implemented on two separate boards to allow for upgrades and replacements in a basestation chassis.
Figure 1 highlights the receive section, which consists of a chip-rate processing block and a symbol-rate processing block. Chip-rate processing generates the strongest signal from multiple signals received from a user’s mobile device. Symbol-rate processing decodes user data out from this signal.
Efficient Chip-Rate Processing
Chip-rate processing on the uplink (from the end user to the basestation) can be optimized via a combination of FPGAs and DSPs. An FPGA is primarily used to implement a Rake receiver because the rate of incoming data is high, and it takes a large amount of parallel processing to deal with multiple users. On the other hand, DSPs are better suited to implement computationally intensive functions, such as path estimation, channel estimation, and maximum ratio combining (MRC). This implementation requires the passing of large amounts of data between the FPGA and the DSP.
A Rake receiver solves the problem of multipath signals, where the transmitted signal from an end user’s mobile device travels over several different paths, including reflections from buildings and other obstacles. (The term "Rake" comes from the multiple finger paths used in the receiver. These varying length fingers give a mental image of the common garden rake.) A Rake receiver takes copies of the received signal, sends them down separate finger paths, and sums together the outputs of each finger at the end. The process of path estimation is used to calculate the timing of these different paths, and set up appropriate delays in each of the Rake receiver’s fingers.
Each call from a user may require a different number of fingers (usually between three and six) in a Rake receiver to recover the best signal. This is determined by the channel-estimation block. Channel estimation and MRC are also used to figure out the relative weight that should be given—based on estimations of the noise and corruption in each signal—to each Rake finger when combining the multipaths.
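The delay-weight-and-sum structure of the Rake fingers described above can be sketched in a few lines. This is an illustrative model only, assuming the finger delays and MRC weights have already been produced by the path- and channel-estimation blocks; it is not a chip-rate implementation.

```python
# Illustrative Rake combining sketch (maximum ratio combining).
# Finger delays and weights are assumed to come from the path-estimation
# and channel-estimation blocks; the values below are hypothetical.

def rake_combine(samples, fingers):
    """Sum weighted, delayed copies of the received signal, one per finger.

    samples -- received signal samples
    fingers -- (delay_in_samples, weight) pairs from channel estimation
    """
    out = [0.0] * len(samples)
    for delay, weight in fingers:
        for i in range(delay, len(samples)):
            out[i] += weight * samples[i - delay]
    return out

# Three fingers with hypothetical delays and weights for three multipath copies
combined = rake_combine([1.0, 0.5, 0.25, 0.0], [(0, 0.6), (1, 0.3), (2, 0.1)])
```

In a real receiver the strongest-path delays change as the user moves, so the finger allocation is updated continually from the tracking data.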
The FPGA will pass the tracking data (comprising many large correlations) to the DSP for finger allocation and weighting, and could also pass the actual despread data through the DSP for symbol-rate processing. The amount of tracking data to be passed between the FPGA and DSP depends on the number of channels being processed on the baseband card, the code-sampling rate, and the number of antennas being polled. Reference 1 gives an example of 13 Mbits of tracking data being passed from the FPGA to the DSP every radio frame (10 ms) in a WCDMA system for processing 32 channels:
FPGA to DSP = 400 kbits (tracking data per channel per frame) × 32 (number of channels) × 100 (frames per second, at 10 ms per frame) ≈ 1.3 Gbits/s.
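The arithmetic behind this figure can be checked directly. The 400-kbit and 32-channel numbers come from the example in Reference 1; the 100 frames/s follows from the 10-ms radio frame:

```python
# Sanity check of the FPGA-to-DSP tracking-data bandwidth example above.
tracking_bits_per_channel_per_frame = 400_000   # 400 kbits (Reference 1)
channels = 32
frames_per_second = 100                         # one 10-ms radio frame = 1/100 s

bits_per_frame = tracking_bits_per_channel_per_frame * channels
throughput_bps = bits_per_frame * frames_per_second

print(bits_per_frame / 1e6)    # 12.8 Mbits per frame (the ~13 Mbits cited)
print(throughput_bps / 1e9)    # 1.28 Gbits/s (the ~1.3 Gbits/s cited)
```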
Low-latency, high-speed DSP serial ports can be used to transfer the small amount of coefficient update and finger allocation data back to the chip-rate FPGA or ASIC. If the chip-rate processing system contains an FPGA connected to one DSP, placing a multiport memory, such as a dual-port memory, in the path between them allows for easy passing and buffering of the tracking data.
Studies have demonstrated that splitting the computational tasks between DSPs can accelerate the algorithms by up to five times and help reduce system bottlenecks (Reference 2). For example, separate DSPs can be employed for channel/path estimation and multi-user detection. In an implementation such as this, both DSPs may be accessing the same data at different times or different clock rates. Dual-port memories are ideally suited for such an application (Fig. 2).
Dual-port memories offer very high-density buffering capacity with random access to the buffered data at very high throughput. The data can be accessed simultaneously from both the interfacing devices—an FPGA and a DSP in this case—operating in two independent clock domains. The bidirectional nature of each port allows true data sharing between the FPGA and DSP. Recent products in the market include densities of up to 18 Mbits with 72-bit-wide ports, which can be cascaded to create even denser and wider memories. Throughput (bandwidth) in a dual-port memory is calculated by:
fMAX × 2 ports × width of each port. Recent products offer bandwidth in excess of 19 Gbits/s.
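As a worked example of this throughput formula, consider a dual-port with two 72-bit ports. The 133-MHz clock below is an assumed figure for illustration, not a quoted device spec:

```python
# Dual-port memory throughput: fMAX x 2 ports x port width.
# The 133-MHz clock is an assumption; the 72-bit port width matches
# the densest parts mentioned in the text.
f_max_hz = 133e6
ports = 2
port_width_bits = 72

bandwidth_bps = f_max_hz * ports * port_width_bits
print(bandwidth_bps / 1e9)   # about 19.15 Gbits/s, consistent with ">19 Gbits/s"
```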
Dual-port memories can be connected seamlessly to the external memory interface (EMIF) of DSPs. Using the direct-memory-access (DMA) engines within the DSP permits access of data in the dual port with minimal intervention from the CPU.
By using the chip enables, this connection scheme allows the same data to be easily shared between a bank of DSPs. Thus, multiple processing can occur on the same data, or data can be buffered for different users in separate dual ports. This scheme makes it possible to buffer data while the DSP acts on previous data—rather than having a direct connection between the FPGA and the DSP. The advantage of this scheme over regular single-port SRAMs or DRAMs is that there’s no bus-turnaround latency associated with the FPGA writing data into the memory buffer and the DSP reading out the data. In turn, the system’s bandwidth and efficiency are effectively doubled.
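The discipline described above, where the FPGA fills one buffer while the DSP drains the previously filled one, is essentially ping-pong (double) buffering. A minimal software sketch of the idea, with all names illustrative:

```python
# Ping-pong buffering sketch. In hardware this is two dual-port regions
# (or chip-enable-selected devices); here plain lists stand in for them.
buffers = [[], []]
write_idx = 0                     # buffer the "FPGA" is currently filling

def fpga_write(frame_data):
    """Producer side: fill the current write buffer with a frame of data."""
    buffers[write_idx].clear()
    buffers[write_idx].extend(frame_data)

def swap_and_process(process):
    """Swap roles, then let the 'DSP' process the buffer just filled."""
    global write_idx
    read_idx = write_idx
    write_idx = 1 - write_idx     # producer now fills the other buffer
    return process(buffers[read_idx])

fpga_write([1, 2, 3])
result = swap_and_process(sum)    # DSP works on previous data while new data lands
```

With true dual-port memory the producer and consumer need no such software handshake on a shared bus; each simply uses its own port, which is the bandwidth-doubling point made above.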
Furthermore, multiport memories allow processors running at different clock speeds, or within separate clock domains, to be easily connected. That wouldn’t be possible with single-port memories. Multiport devices used in this manner also offer a point-to-point connection between processor and memory, whereas single-port devices would require a shared bus. The point-to-point connection simplifies signal integrity and allows for higher clock speeds than are possible on shared-bus schemes.
Efficient Symbol-Rate Processing
The data rates involved with symbol-rate processing are much less than those associated with chip-rate processing. Most of the symbol-rate processing in baseband cards is carried out via just DSPs. Several tasks are performed in the symbol-rate processing section: CRC encoding and decoding adds a final stage of error checking to the received data. Additionally, convolutional encoding serves as a forward-error-correction (FEC) technique that improves the integrity of data transmission by encoding each bit into a 3-bit symbol.
Using corresponding decoding techniques in the receiver aids in recovering data that may have been corrupted by noise during transmission. The decoder can recover the original data bit even if some of the bits that make up the symbol were corrupted in transit. The symbol rate is three times the original data rate.
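A toy rate-1/3 convolutional encoder shows how each input bit becomes a 3-bit symbol, tripling the rate. The constraint length (3) and generator polynomials below are illustrative only; 3GPP specifies a constraint length of 9 with its own generator polynomials.

```python
# Toy rate-1/3 convolutional encoder. Generators and constraint length are
# illustrative, NOT the 3GPP-specified ones (which use constraint length 9).

def conv_encode_r13(bits, generators=(0b111, 0b101, 0b011), k=3):
    state = 0
    out = []
    for b in bits:
        state = ((state << 1) | b) & ((1 << k) - 1)  # shift new bit into register
        for g in generators:
            out.append(bin(state & g).count("1") % 2)  # parity of tapped bits
    return out

encoded = conv_encode_r13([1, 0, 1, 1])
# Four input bits produce twelve output bits: three per input bit.
```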
Two main types of encoding/decoding are employed in 3G systems. One, Viterbi, is used primarily for voice channels, and provides backward-compatibility to 2G systems. The second, turbo decoding, is more efficient for encoding and decoding data transmissions, yet it takes more computational power than Viterbi.
Interleaving involves writing the data row by row into a matrix X columns wide and N rows deep, then reading the data out in columns. The deinterleaver in the receiver writes the data into a similar matrix in columns and reads it out by rows to restore the original transmission. This process spreads the symbols out during transmission, which protects against corruption from short noise spikes in the transmission environment.
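A minimal interleaver/deinterleaver pair illustrates the row-write/column-read scheme; the matrix dimensions here are arbitrary:

```python
# Block interleaver sketch: write row by row, read column by column.
# Deinterleaving reverses the process. Matrix sizes are arbitrary examples.

def interleave(data, columns):
    rows = len(data) // columns
    # Read column by column from a row-major layout
    return [data[r * columns + c] for c in range(columns) for r in range(rows)]

def deinterleave(data, columns):
    rows = len(data) // columns
    # Read row by row from a column-major layout
    return [data[c * rows + r] for r in range(rows) for c in range(columns)]

symbols = list(range(12))
scrambled = interleave(symbols, columns=4)     # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
restored = deinterleave(scrambled, columns=4)  # recovers the original order
```

Note how symbols that were adjacent at the input end up four positions apart on the air, so a short noise burst damages at most one bit of any given symbol group.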
The introduction of DSPs like the C64x series from Texas Instruments with on-chip turbo and Viterbi coprocessors helped to increase the performance of symbol-rate processing. Once again, multiported memories help optimize the processing in this section of the baseband card.
Figure 3 shows a four-ported memory device from Cypress. The QuadPort memory is a four-port switching element that allows simultaneous access to an integrated memory array from each of its completely independent ports. Each port can operate in a different clock domain. In this implementation, one port is connected to the chip-rate FPGA, and the other three ports are connected to three different DSPs—making it possible to access the same data at the same time.
The despread data from the chip-rate processing FPGA is buffered in the QuadPort memory, which is subsequently read by the deinterleaving/demultiplexing DSP. Then the data is written back into the memory to be accessed by either a DSP tasked with Viterbi decoding (for voice channels, or data from a 2G legacy device), or a DSP that performs turbo decoding (for 3G data channels). Again, using the EMIFs on the DSPs controlled by the DMA engine lets the CPU continue computing while the data is being transferred from the external multiport memory to the internal memory cache for processing. The memory space in the QuadPort memory can be partitioned to hold the original interleaved data in one space of the array, and the deinterleaved (processed) data in a separate space to be accessed by the decoding DSPs.
In addition, multiple DSPs can be used to boost the performance of the turbo-decoding process. This allows for parallel processing of the data, which can result in more reliable data decoding. The scheme shown in Figure 4 facilitates this process. A dual-port memory buffers the despread data coming from the chip-rate processing section, to be accessed by the DSP performing the deinterleaving/demultiplexing tasks. The other port of the dual port drives a bus that lets either the Viterbi or turbo-decoding DSPs access the data. Another dual-port device can then be used to share with an additional DSP to perform parallel turbo decoding.
The authors would like to acknowledge the assistance of Staci Plopan, whose work helped in preparing this article.
- Wale, Karl, "Rolling out 3G Networks: How can Demonstrator Systems Evolve for Commercial Deployment?" Motorola application article.
- Rajagopal, S., Jones, B. A., and Cavallaro, J. R., "Task Partitioning Wireless Base-station Receiver Algorithms on Multiple DSPs and FPGAs," International Conference on Signal Processing and Technology (ICSPAT), Oct. 2000.