The path to real-time wideband channelization is obscured by a variety of competing techniques. Among them are the pipelined FFT, polyphase DFTs, multiple digital downconverters (DDCs), the Pipelined Frequency Transform (PFT), and its derivative, the tunable PFT (TPFT). When selecting a technique, remember that the main objective is to find the best solution for the application at hand. High-speed analog-to-digital converters (ADCs) are currently available off-the-shelf with conversion rates of up to 1.5 GSamples/s (e.g., the Maxim MAX108), and their dynamic range is constantly improving. The real problems start in the signal processing that sits immediately after the ADC.

Typically, processing at this stage involves frequency conversion and channelization. Standard digital-downconverter technologies do exist for the selection of narrowband channels from a medium-bandwidth spectrum (e.g., Conexant, which was formerly Globespan Virata; TI-Graychip; and Analog Devices). But they're limited to only a few simultaneous channels for an economical amount of silicon.

Alternative technologies do exist that require the spectrum to be channelized into equally spaced, equal-bandwidth channels. Among them is the FFT, where channel filter performance isn't too critical. Another example, polyphase DFT, provides higher-performance filters.

By using pipelining architectures, real-time multichannel performance can be achieved in a practical amount of silicon. Yet real-world situations often require channels of non-equal width and spacing along with time-varying channel plans. This statement applies to multistandard mobile base stations, software-defined radios (SDRs), and satellite communications. Additional demands are spawned by monitoring, instrumentation, and surveillance activities. Often, such activities create a need to observe signals in different resolution bandwidths—sometimes simultaneously.

This article compares the main competing techniques for real-time, wideband channelization. It focuses on the basic techniques that provide multiple channels from a broad band for further processing, such as demodulation or signal detection. The architectures discussed here are generally biased toward being implemented in hardware, such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). In terms of multiply/accumulate operations (MACs), the required processing power is very high. In most cases, it's very much in excess of the peak MAC performance of today's programmable DSPs. The most difficult aspect to overcome is the memory bandwidth requirements of a wideband, real-time system. For high-end specifications, it's not clear how this bandwidth could be achieved without using a totally impractical number of DSPs.

**Digital downconverters (DDCs)**

Digital downconverters are well established as a technique. Using custom or standard cores, this approach is relatively straightforward to implement in FPGAs. In cases that require only a few channels (typically 4 to 8) to be selected from the broad band, such a solution is quite efficient. It also proves to be very flexible. Each channel can be independently configured for center frequency, bandwidth, and filter response. For larger numbers of channels, however, the logic and more particularly the memory requirements become excessive.

**Fast Fourier Transform (FFT)**

The Fast Fourier Transform and its real-time pipelined implementation also are well-known techniques. The FFT provides a very economical solution to the channelization problem. It's especially well suited to scenarios in which a large number of channels are required, but the channel filter performance isn't too critical. Generally, the FFT is restricted to cases that require channels with even frequency spacing and equal filtering.

**WOLA and polyphase-DFT filter banks**

To achieve an improvement in filtering performance, don't just use the "windowing" of time data. Utilize polyphase filter banks ahead of the FFT. This technique, which is generally called Weight Overlap and Add (WOLA), has a subset: the polyphase DFT. This approach is gaining recognition. It's very efficient when large, high-quality filter banks are required. Like the FFT, however, it's generally restricted to cases that require evenly spaced channels with equal filtering.

**Pipelined Frequency Transform (PFT)**

The PFT processing form takes a different approach. Based on a "tree" structure, it successively splits and filters the frequency band to achieve a finer and finer resolution of the broad band. The time interleaving of common processes can lead to a very efficient structure. One of its advantages is that it makes simultaneous outputs available from successive stages. These stages are at different frequency resolutions. PFT also offers the ability to independently tailor the filters for different frequency bins. If certain frequency bins or blocks of spectrum aren't required, it's easy to exclude them from the processing. The result is greater efficiency.

**Tunable PFT (TPFT)**

In its simplest form, the PFT mentioned above still produces equally spaced frequency bins. To overcome this limitation, a derived form may be used. Known as the TPFT, it allows the independent tuning of the center frequency of all bins. It also permits independent filters for each bin. Because of the availability of different stage outputs with varied frequency resolutions, the end result is like having the flexibility of the DDC approach. At the same time, the designer gains the efficiency of the PFT. This efficiency is vital for a larger number of channels.

To really understand the implications of choosing between these approaches, it's necessary to examine each technique in more detail. For example, look at the general architecture of a typical DDC (FIG. 1). Although there are many variants of this design, the principle is broadly the same in each case. The input can be complex or real. For this discussion, assume a complex input. The broad-band input signal is frequency shifted up or down to center the required narrowband channel on zero frequency. This step is achieved using a complex local oscillator and a complex mixer. The oscillator is some form of numerically controlled oscillator (NCO). The mixer, in its basic form, comprises four real multipliers and two adders. The complexity of the NCO will depend on the final frequency-setting accuracy that's required as well as the system's spurious-free dynamic range (SFDR).
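The mixing step can be sketched in a few lines of Python. This is only an illustration: the function names `nco` and `complex_mix` are hypothetical, and the oscillator is an ideal floating-point exponential rather than the phase accumulator and lookup table that a hardware NCO would normally use. The mixer is written out as four real multiplies and two adds per sample.

```python
import cmath
import math

def nco(freq_hz, sample_rate_hz, num_samples):
    """Ideal complex local oscillator exp(-j*2*pi*f*n/Fs); shifts +f down to 0 Hz.
    Assumption: a hardware NCO would use a phase accumulator and a sine/cosine table."""
    step = -2.0 * math.pi * freq_hz / sample_rate_hz
    return [cmath.exp(1j * step * n) for n in range(num_samples)]

def complex_mix(x, lo):
    """Complex mixer: four real multiplies and two adds/subtracts per sample."""
    out = []
    for s, l in zip(x, lo):
        i = s.real * l.real - s.imag * l.imag   # in-phase product
        q = s.real * l.imag + s.imag * l.real   # quadrature product
        out.append(complex(i, q))
    return out

# A complex tone at 125 Hz, sampled at 1 kHz, is moved to zero frequency.
fs = 1000.0
tone = [cmath.exp(2j * math.pi * 125.0 * n / fs) for n in range(64)]
baseband = complex_mix(tone, nco(125.0, fs, 64))
```

After mixing, every sample of `baseband` is the constant 1 + 0j, which shows that the wanted channel now sits at zero frequency and is ready for low-pass filtering.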

Next, low-pass filtering extracts the required channel. The filtering may consist of any combination of Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) filters. Typical examples of these filters include the cascaded integrator comb (CIC), half-band, and decimating FIR. For the simple analysis presented here, assume a multi-stage CIC followed by a decimating FIR.

It's normal to perform some of the decimation in the FIR filters. This is partly because very high decimation factors in the CIC require significant bit growth in the CIC components. It's also partially done to avoid having alias sidelobes spoil the stop-band response. To achieve adequate passband flatness in the DDC, the FIR filter also may need to compensate for rolloff in the CIC passband. The typical performance of a single DDC is shown in Figure 2.

CIC filters permit high-rate signal decimation or interpolation. And by eliminating the need for multipliers, they use a very compact architecture.^{1} The two basic building blocks of a CIC filter are an integrator and a comb. The integrator or accumulator is simply a single-pole IIR filter with unity feedback coefficient. A typical comb filter behaves as a high-pass filter with a 20-dB-per-decade gain (FIG. 3).

Building a CIC filter involves cascading N integrators with N combs. They are followed by a decimate-by-R block. Although such a scheme works, it can be greatly simplified by placing the comb after the decimator. In general, a CIC filter would have N integrator/comb pairs with N typically ranging from 3 to 6. A three-stage CIC is schematically illustrated in Figure 3.
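A minimal sketch of the simplified structure, with the combs moved to the low-rate side of the decimator, might look as follows. The function name `cic_decimate` and its parameter names are illustrative. Python integers never overflow, so the sketch sidesteps the register-width question that a hardware design has to address.

```python
def cic_decimate(samples, stages, rate, comb_delay=1):
    """N-stage CIC decimator on integer samples.
    Integrators run at the input rate, then decimate by `rate`,
    then combs (delay M = comb_delay) run at the low output rate."""
    data = list(samples)
    for _ in range(stages):              # N integrators (accumulators)
        acc = 0
        integrated = []
        for v in data:
            acc += v
            integrated.append(acc)
        data = integrated
    data = data[rate - 1::rate]          # keep every R-th sample
    for _ in range(stages):              # N combs: y[n] = x[n] - x[n - M]
        delay = [0] * comb_delay
        combed = []
        for v in data:
            combed.append(v - delay[0])
            delay = delay[1:] + [v]
        data = combed
    return data

# A DC input of ones settles to the DC gain (R*M)**N = 8**3 = 512.
dc_out = cic_decimate([1] * 64, stages=3, rate=8)
```

No multipliers appear anywhere in the loop bodies; only additions, subtractions, and delays are needed.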

The magnitude response of a CIC filter can be shown, for large R, to approximate a (Sinc)^{N} function. Spectral nulls appear at multiples of Fs/(MR) Hz, where Fs is the input sample rate and M is the integer number of delays in the comb section (typically 1). The stopband relative attenuation is a function of the number of stages that are used and equals approximately N × 13 dB at the first sidelobe. In contrast, the DC gain is a function of the decimation rate R and equals (RM)^{N}. When designing a CIC filter, it's critical to account for bit growth. With insufficient bit width, the integrator overflow no longer cancels in the combs and the output is corrupted.
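These figures can be checked numerically. The sketch below evaluates the standard CIC magnitude response, referred to the input rate, and the full-precision register width given by Hogenauer.^{1} The function names are hypothetical, and the example values (N = 3, R = 64, a 14-b input at 102.4 MSamples/s) are chosen only for illustration.

```python
import math

def cic_gain(f_hz, fs_hz, stages, rate, comb_delay=1):
    """|H(f)| = |sin(pi*f*R*M/Fs) / sin(pi*f/Fs)|**N, with limit (R*M)**N at f = 0."""
    den = math.sin(math.pi * f_hz / fs_hz)
    if abs(den) < 1e-15:
        return float((rate * comb_delay) ** stages)
    num = math.sin(math.pi * f_hz * rate * comb_delay / fs_hz)
    return abs(num / den) ** stages

def register_bits(input_bits, stages, rate, comb_delay=1):
    """Full-precision register width: B_in + ceil(N * log2(R*M))."""
    return input_bits + math.ceil(stages * math.log2(rate * comb_delay))

fs, n_stages, r = 102.4e6, 3, 64
first_null = fs / r                                  # Fs/(M*R) with M = 1
dc_gain = cic_gain(0.0, fs, n_stages, r)             # (R*M)**N
sidelobe_db = 20.0 * math.log10(cic_gain(1.5 * first_null, fs, n_stages, r) / dc_gain)
bits = register_bits(14, n_stages, r)
```

With these values the first null falls at 1.6 MHz, the first sidelobe sits roughly 40 dB below the DC gain (close to N × 13 dB), and the registers need 32 bits.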

To provide multiple channels, it's possible to have a number of DDCs in a parallel stack. The number of output channels is equal to the number of DDCs that are employed in such an architecture. Clearly, a linear relationship exists between the amount of silicon and the number of channels required.

It's possible, however, to optimize the DDC stack architecture by taking advantage of the changes in sample rate across the system. From each CIC decimator onward, the comb half of the CICs and the FIRs are clocked at a fraction of the input rate. As a result, there's potential for recycling all of the combs and FIRs into a pipelined version of each.^{2}

The next focus is the vast area of FFT techniques. These methods boast a multitude of algorithms for programmable DSP implementations and a number of COTS ASIC implementations that are readily available. Here, coverage is restricted to wideband, pipelined hardware solutions, particularly those that are suitable for FPGA realization.^{3}

Figure 4 shows an implementation of the Pipelined FFT (PFFT).^{4} This implementation is based on n successive stages, where 2^{n} is the size of the FFT. Each stage has switched delay elements and butterflies. The switches and delays re-order the data for processing at the next butterfly.

There are n butterflies that implement the complex arithmetic. Each one performs a two-point DFT and complex phase rotations (twiddles). The input to the first butterfly has a FIFO n/2 buffer stage, which ensures efficient utilization of the butterfly arithmetic. The normal output of the final stage is bit-reversed complex (I/Q) data. To achieve normally ordered frequency data, it requires a bit reverser.
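The bit-reversed output order is easy to see in software. The sketch below is a plain radix-2 decimation-in-frequency FFT, not a model of the pipelined hardware in Figure 4: it has no switched delays or FIFO, and the function names are illustrative. It does share the two features described above: one butterfly pass per stage (n passes for a 2^{n}-point transform) and a final bit reverser.

```python
import cmath

def fft_dif_bit_reversed(x):
    """Radix-2 decimation-in-frequency FFT; results are left in bit-reversed order."""
    data = list(x)
    size = len(data)                      # assumed to be a power of two
    span = size // 2
    while span >= 1:                      # one pass per stage
        for start in range(0, size, 2 * span):
            for k in range(span):
                twiddle = cmath.exp(-2j * cmath.pi * k / (2 * span))
                a = data[start + k]
                b = data[start + k + span]
                data[start + k] = a + b                       # two-point DFT, sum
                data[start + k + span] = (a - b) * twiddle    # difference, then rotation
        span //= 2
    return data

def bit_reverse(data):
    """Re-order bit-reversed FFT output into normal frequency order."""
    size = len(data)
    bits = size.bit_length() - 1
    out = [0j] * size
    for i, v in enumerate(data):
        out[int(format(i, "0{}b".format(bits))[::-1], 2)] = v
    return out

samples = [1, 2, 3, 4, 0, 0, 0, 0]
spectrum = bit_reverse(fft_dif_bit_reversed(samples))
```

Without the call to `bit_reverse`, bin 1 of the result would appear at index 4, bin 3 at index 6, and so on.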

The filter bank's performance also must be considered when weighing the benefits of the FFT. In Figure 5, the effective frequency response is shown for the unweighted FFT. The figure displays the sin(x)/x nature of the sidelobe structure. Compare it to the filter response of a typical DDC filter (−85 dBc stop band and low passband ripple). The DDC has a clear advantage in filter frequency response.

The standard approach to improving stop-band performance is to weight or "window" the time-domain data. For example, Figure 5 also shows the effect of Kaiser weighting. Here, the stop-band level is improved, but at a significant price: the passband widens considerably. Figure 6 demonstrates this tradeoff a little more clearly. It shows the equivalent set of overlapping filters for a 32-point complex FFT with a Kaiser window. A signal that occupies a narrow frequency band (e.g., CW) will actually appear in a number of adjacent bins at decreasing, but still significant, levels.
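The tradeoff can be reproduced with a short calculation. The sketch below compares the response of one bin of a 32-point FFT with and without a Kaiser window. The Kaiser parameter beta = 8 is an arbitrary choice for illustration and is not taken from the figures; the helper names are hypothetical.

```python
import cmath
import math

def bessel_i0(x):
    """Zeroth-order modified Bessel function, summed from its power series."""
    term, total, k = 1.0, 1.0, 1
    while term > 1e-12 * total:
        term *= (x / (2.0 * k)) ** 2
        total += term
        k += 1
    return total

def kaiser(length, beta):
    """Kaiser window of the given length and shape parameter beta."""
    win = []
    for n in range(length):
        r = 2.0 * n / (length - 1) - 1.0
        win.append(bessel_i0(beta * math.sqrt(max(0.0, 1.0 - r * r))) / bessel_i0(beta))
    return win

def bin_response(window, pad=32):
    """Magnitude response of one FFT bin on a grid of 1/pad bin, normalized to its peak."""
    size = len(window) * pad
    mags = [abs(sum(w * cmath.exp(-2j * math.pi * k * n / size)
                    for n, w in enumerate(window))) for k in range(size // 2)]
    return [m / mags[0] for m in mags]

def mainlobe_and_sidelobe(resp, pad=32):
    """Return (offset of first null in bins, peak sidelobe level in dB)."""
    k = 1
    while k + 1 < len(resp) and resp[k + 1] < resp[k]:
        k += 1                            # walk down the main lobe to the first null
    return k / pad, 20.0 * math.log10(max(resp[k:]))

rect_null, rect_sl = mainlobe_and_sidelobe(bin_response([1.0] * 32))
kaiser_null, kaiser_sl = mainlobe_and_sidelobe(bin_response(kaiser(32, 8.0)))
```

The unweighted bin reaches its first null one bin away from center, with a peak sidelobe near -13 dB. The Kaiser-weighted bin pushes the sidelobes far lower, but its main lobe extends well past two bins, so a CW tone shows up in several neighboring bins.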

To achieve performance approaching that of a typical DDC, a window is needed in the time domain that matches the overall DDC filter impulse response. For instance, a 1024-bin filter bank might require a window some 4000 to 5000 samples long. Such a window could be achieved by using an FFT of this length and decimating the output by 4 or 5. But it would be very inefficient, particularly in a real-time system, which would need several large FFTs processing in parallel.

Fortunately, a much more elegant and efficient solution exists.^{5} In its most general form, the Weight Overlap and Add (WOLA) method is shown in Figure 7. The required filter shape is determined by the weighting function, which is L samples long. To match the DFT length, the weighted data is divided into blocks of K samples. The blocks are then added together before processing by the DFT.

Next, the input data is shifted along by M samples and the process is repeated. In the simple case where M = K, a fresh result is attained every K samples. The system is then known as critically sampled (i.e., the sample rate just satisfies the Nyquist criterion). This method may be adequate for some processes, such as spectral analysis or analysis/synthesis filter-bank pairs. But the alias problems, which are caused by the critical sampling of a filter bank with finite cutoff rates, often require some degree of oversampling. This oversampling is achieved by making M < K so that the oversampling factor, I = K/M, is greater than unity. An advantage of the WOLA is that I need not be an integer. For those cases in which it can be an integer, a different structure, usually known as the polyphase DFT, may be used (FIG. 8).
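The weight, fold, add, and transform sequence can be sketched as follows, using the symbols L, K, and M from the text. The function names are illustrative, a direct DFT stands in for the FFT, and the raised-cosine window is only a placeholder for a properly designed prototype filter. The phase of each output is referenced to the start of its own frame; any per-frame rotation needed for a fixed time reference is omitted here.

```python
import cmath
import math

def dft(block):
    """Direct K-point DFT (a stand-in for the FFT)."""
    size = len(block)
    return [sum(v * cmath.exp(-2j * math.pi * k * n / size) for n, v in enumerate(block))
            for k in range(size)]

def wola_frame(x, start, window, num_bins):
    """Weight L samples, fold them into one block of K samples by addition, then DFT."""
    folded = [0j] * num_bins
    for n in range(len(window)):
        folded[n % num_bins] += window[n] * x[start + n]
    return dft(folded)

def wola(x, window, num_bins, hop):
    """Slide along the input by M = hop samples per output frame."""
    frames = []
    start = 0
    while start + len(window) <= len(x):
        frames.append(wola_frame(x, start, window, num_bins))
        start += hop
    return frames

K, L, M = 8, 32, 4                        # oversampling factor I = K/M = 2
window = [0.5 - 0.5 * math.cos(2.0 * math.pi * (n + 0.5) / L) for n in range(L)]
x = [cmath.exp(2j * math.pi * 3 * n / K) for n in range(64)]   # tone centered on bin 3
frames = wola(x, window, K, M)
```

Folding works because the DFT kernel repeats every K samples, so each bin of a frame equals the full L-sample weighted sum evaluated at that bin's frequency. Only a K-point transform is needed, even though the filter is L samples long.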

The polyphase DFT (PDFT) may sometimes be a more efficient structure, provided that the limitation of integer oversampling is accepted.^{6} It may be shown that the stacked DDC, WOLA, and PDFT give identical results for a given filter response and integer oversampling. The choice between them is mainly based on silicon efficiency. The performance of a PDFT filter bank is illustrated in Figure 9. When compared with the standard weighted FFT response in Figure 6, it's easy to see the improvement that it offers.

Unlike the previously mentioned techniques, the Pipelined Frequency Transform's underlying concept is one of frequency-band splitting. In the simplest radix 2 form, each successive stage of the PFT increases the number of bands by a factor of two (FIG. 10). This increase could be achieved, for example, by a simple tree structure (FIG. 11). The input is complex to preserve positive and negative frequencies. First, it is split into two equal bands using a complex downconverter (CDC) and a complex upconverter (CUC). Because the bandwidth of each one has been halved, it's possible to halve the sample rate for each of the sub-bands.

In practice, a degree of oversampling is required to avoid the image response problems caused by finite filter cutoff rates. At the output of the first stage, 2X oversampling is used. For all successive stages, the output is decimated by two. The overall 2X oversampling is thus preserved throughout the system.
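A single band-splitting stage can be sketched as below. Several details are assumptions made for illustration and are not taken from the figures: the first-stage oscillators are placed at ±Fs/4 (the centers of the two half bands), the low-pass filter is a generic Hamming-windowed sinc with a cutoff of Fs/4, and no decimation is applied, which matches the 2X oversampled output of the first stage. The names `lowpass_taps`, `fir`, and `split_band` are hypothetical.

```python
import cmath
import math

ROT_DOWN = (1, -1j, -1, 1j)    # exp(-j*pi*n/2): shift down by Fs/4
ROT_UP = (1, 1j, -1, -1j)      # exp(+j*pi*n/2): shift up by Fs/4

def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass, cutoff in cycles/sample, unity DC gain."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = 2.0 * cutoff if t == 0 else math.sin(2.0 * math.pi * cutoff * t) / (math.pi * t)
        taps.append(ideal * (0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))))
    scale = sum(taps)
    return [c / scale for c in taps]

def fir(x, taps):
    """Direct-form FIR filter with zero initial state."""
    return [sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(x))]

def split_band(x, taps):
    """CDC moves the upper half band to 0 Hz, CUC moves the lower half band to 0 Hz."""
    down = [v * ROT_DOWN[n % 4] for n, v in enumerate(x)]
    up = [v * ROT_UP[n % 4] for n, v in enumerate(x)]
    return fir(down, taps), fir(up, taps)      # (upper band, lower band)

taps = lowpass_taps(31, 0.25)
x = [cmath.exp(2j * math.pi * 0.3 * n) for n in range(128)]   # tone at +0.3 Fs
upper, lower = split_band(x, taps)
```

The tone at +0.3 Fs emerges at nearly full amplitude in `upper` and is strongly attenuated in `lower`. With the oscillator at a quarter of the sample rate, its values cycle through 1, -j, -1, and j, so this particular mixer needs no multipliers or lookup table.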

Yet this approach also has at least one obvious disadvantage. For large numbers of channels, the tree gets impossibly large. For instance, 1024 channels would require 2046 complex CDC or CUC modules. Each of these modules would take the form of Figure 1, which shows the conventional form of the CDC(A) module. It consists of four multiplies, two adds/subtracts, a sine/cosine lookup table, and a pair of low-pass filters. The CUC(A) module would be very similar. It would differ only in the signs of the adder/subtracter elements. Successive stages also would be quite similar except that the local oscillators would then be at Fx/8 (where Fx is the input sampling rate for the stage). In addition, the output would be decimated by two.
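The module count follows from summing the tree stage by stage: stage s contains 2^{s} converters, so a tree with n stages needs 2 + 4 + ... + 2^{n} = 2^{n+1} - 2 of them. A short check, using a hypothetical helper name:

```python
def tree_converter_count(num_channels):
    """CDC/CUC modules in a radix-2 splitting tree with `num_channels` outputs.
    Stage s (s = 1 .. log2(num_channels)) holds 2**s modules."""
    stages = num_channels.bit_length() - 1
    if num_channels < 2 or 2 ** stages != num_channels:
        raise ValueError("num_channels must be a power of two, at least 2")
    return sum(2 ** s for s in range(1, stages + 1))

modules_1024 = tree_converter_count(1024)   # ten stages
```

For 1024 channels the sum comes to 2046, which agrees with the figure quoted above.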

Fortunately, this architecture can be greatly simplified in several ways. With the tree system, the sampling rate drops by a factor of two at each stage. The result is inefficient use of the hardware, which is capable of running at the full rate, Fs. The most processing-intensive part of each stage lies in the low-pass filters. Because those filters take an identical form within any given stage, interleaving techniques may be used to regain full efficiency. This step involves interleaving the samples for each of the branches within a given stage. It also means modifying the filters (which are normally FIR filters) by adding extra delays between the coefficient multipliers (FIG. 12). Several other simplifications save silicon, including the avoidance of lookup tables and multipliers.^{2}
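The interleaving idea can be demonstrated with a small sketch. Here, `fir_interleaved` is a hypothetical name for an FIR filter whose taps are separated by P delays instead of one, where P is the number of interleaved branches. Fed with the sample-interleaved branch signals, it produces the interleaved outputs of P separate, identical filters.

```python
def fir(x, taps):
    """Ordinary direct-form FIR filter with zero initial state."""
    return [sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(x))]

def interleave(streams):
    """Sample-interleave equal-length streams: a0, b0, a1, b1, ..."""
    return [s[i] for i in range(len(streams[0])) for s in streams]

def fir_interleaved(x, taps, num_streams):
    """Same coefficients, but with num_streams delays between adjacent taps,
    so one full-rate filter serves every interleaved branch."""
    out = []
    for n in range(len(x)):
        acc = 0
        for k, c in enumerate(taps):
            idx = n - k * num_streams
            if idx >= 0:
                acc += c * x[idx]
        out.append(acc)
    return out

taps = [1, 3, 3, 1]
branch_a = [5, -2, 7, 0, 1, 4, -3, 2]
branch_b = [1, 1, -6, 2, 0, 9, 3, -1]
shared = fir_interleaved(interleave([branch_a, branch_b]), taps, 2)
separate = interleave([fir(branch_a, taps), fir(branch_b, taps)])
```

The two results match sample for sample, so a single set of coefficient multipliers running at the full rate can replace one filter per branch.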

The PFT passband performance is very similar to that of the polyphase DFT that was illustrated in Figure 9. But differences exist in the stop band, because the PFT is a cascade of filters. Its form potentially allows higher stop-band attenuation over a large percentage of the broad band—an effect that increases with the number of stages. Simultaneous outputs also are available at each stage of the PFT. Each one gives a different resolution. Finally, IIR filters can be used at any stage of the PFT. Silicon may therefore be saved for applications in which linear phase isn't critical and/or low latency is required.

The tunable PFT actually makes use of the PFT cascaded structure where intermediate outputs are readily available.^{7} By modifying the PFT architecture, it's possible to extract frequency bands of the desired size while ensuring that those bands are centered at any given frequency. This level of tunability is achieved in two stages. First, the signals are coarsely tuned within the PFT stages. Then, they're fine tuned by a complex converter. That converter's local oscillator is a numerically controlled oscillator driven by the routing engine (FIG. 13).

By performing the tuning operation in two steps, the designer gains a reduction of size—for a given frequency resolution—of the LUT used for fine tuning. The tuning range that's required at each successive stage is reduced by a factor of two. In contrast, a DDC would need fine tuning over the whole input bandwidth. Overall, this structure is an ideal replacement for multiple DDCs in applications like multi-standard base stations, satellite communications, and intelligent antenna systems (FIG. 14).
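The arithmetic behind the two-step tuning can be illustrated with a simplified model. It assumes that stage s of the PFT offers 2^{s} equal bands whose centers lie at (k + 0.5) × Fs/2^{s} - Fs/2. The real architecture and its routing engine are more involved than this, and the function name is hypothetical.

```python
def two_step_tuning(target_hz, sample_rate_hz, stage):
    """Split a tuning request into a coarse band choice and a fine residual.
    Assumed band plan: 2**stage equal bands covering -Fs/2 .. +Fs/2."""
    bands = 2 ** stage
    width = sample_rate_hz / bands
    index = int((target_hz + sample_rate_hz / 2.0) // width)
    index = min(bands - 1, max(0, index))            # clamp at the band edges
    center = -sample_rate_hz / 2.0 + (index + 0.5) * width
    return index, center, target_hz - center         # residual for the fine NCO

fs = 102.4e6
index, center, residual = two_step_tuning(20.0e6, fs, stage=3)
fine_range_stage3 = fs / 2 ** 3      # span the fine tuner must cover after stage 3
fine_range_stage4 = fs / 2 ** 4      # halves again after the next stage
```

In this model a 20-MHz request at stage 3 maps to band 5, centered at 19.2 MHz, which leaves a 0.8-MHz residual for the fine tuner. Each extra stage halves the span that the fine-tuning oscillator has to cover for the same frequency resolution.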

A brief comparison of silicon usage for the different filter banks also must be made. Within the limited scope of this article, only a few examples can be considered. They are based on designs that have been placed and routed in Xilinx FPGAs. A comparison is done for filter banks with the following parameters (see table):

- Number of bins = 256, 512, or 1024
- Filter stop band = 100 dB
- Passband ripple = 0.1 dB
- Filter overlap = 75%
- Input bit width = 14
- Sample rate = 102.4 MSamples/s complex, 2x oversampled
- Device: Virtex 2-6000
- LUTs = 67584
- RAM = 18432 bits
- 18-b multipliers = 144

The most obvious conclusion is that the stacked DDC approach is very inefficient compared to the other two techniques. To be fair, however, the particular design that was utilized didn't make use of the dedicated multipliers that are available in Xilinx's Virtex 2 devices. Even so, the use of stacked DDCs for more than about eight bins just isn't economical.

It's not easy to directly compare the polyphase DFT and PFT approaches. The PFT has been configured as a "multiplier-less" design. It doesn't make use of the dedicated multipliers even though it could. Plus, the PFT has outputs available at each stage. Those outputs make it very useful in certain applications. Furthermore, silicon efficiency is much improved if it's only necessary to output bins over selected portions of the broad band. The general conclusion is that for smaller numbers of bins (up to around 256), the silicon requirements are similar. For larger numbers of bins, the polyphase DFT gains rapidly—particularly in terms of memory. It becomes the preferred choice for single, fixed filter banks.

For tunable filter banks, the best comparison is between stacked DDCs and the tunable PFT. Figure 15 compares the logic requirements of the two approaches for up to 256 bins. Above about 16 bins, the TPFT wins rapidly. A similar comparison exists for the memory requirements.

Obviously, the most suitable design technique for any given application cannot be covered within a short paper. At the higher subsystem level, too many factors need to be considered: form factor, power consumption, weight, legacy systems, etc. At the board and chip level, one must factor in the major considerations of speed/sample rate, number of channels, dynamic range/filter performance, target device, etc. Only then can the engineer decide which architecture is the most appropriate to adopt.

At the device level, however, the situation is a bit clearer. If more than approximately eight fixed channels are required, the polyphase DFT approach provides a more efficient solution than digital downconverters. For tunable filter banks, the crossover point between stacked DDCs and the tunable PFT is at around 16 channels.

REFERENCES:

1. Hogenauer, E.B., "An economical class of digital filters for decimation and interpolation," IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-29(2):155-162, 1981.
2. PFT Architecture and Comparisons with FFT/Digital Down-Converter Techniques, www.rfel.com/download/W02001-PFT White Paper.pdf.
3. Rabiner, L.R., and Gold, B., "Theory and Application of Digital Signal Processing," Prentice-Hall, 1975.
4. Pipelined FFT, www.rfel.com/download/W02004-Pipelined FFT White Paper.pdf.
5. Crochiere, R.E., and Rabiner, L.R., Multirate Digital Signal Processing, Prentice-Hall, 1983.
6. Gumas, C.C., "Window-presum FFT achieves high dynamic range, resolution," Personal Engineering & Instrumentation News, July 1997, pg. 58-64, www.chipcenter.com/dsp/DSP000315F1.html.
7. TPFT-Tuneable Pipelined Frequency Transform, www.rfel.com/download/W02003-Tuneable PFT White Paper.pdf.