FPGA Clusters Combine with Mesh Architectures

Embedded signal processors traditionally have been built with clusters of general-purpose floating-point processors such as PowerPCs. FPGAs usually have been used on the edges of the cluster to perform signal conditioning, while the hard processing work was reserved for the general-purpose processors. Current technology makes it possible to do substantial, useful signal processing work in clusters of tightly connected FPGAs instead, and this approach has significant advantages over general-purpose processors.

A general-purpose processor’s biggest limitation is the imbalance between I/O and processing. There is only one path to and from main memory, and the processor can perform complex calculations faster than it can fetch operands or store results. As a result, many algorithms wind up I/O bound, limited by memory access speed rather than calculation time.

Advanced processors with on-chip caches can do better, but only if the next operand is already in the cache. Often it isn’t, so the operation stalls while the operand is fetched. These processors work best on problems that have a high ratio of processing steps to I/O points, such as long fast Fourier transforms (FFTs). General-purpose processors aren’t a good solution for short FFTs or finite impulse response (FIR) filters, since each input point is used in just a few calculations and the processor spends most of its time waiting for the memory interface.

In an FPGA, in contrast, an application can have many simple signal processing streams running in parallel. An FPGA can process many signals in the time a general-purpose processor would do one. A typical FPGA can have hundreds of I/O pins, as well as thousands of logic blocks that can be used to implement FIR filters. The ratio of processing power to input-output is much better balanced for signal processing.

Most general-purpose processors do arithmetic operations fastest on 32-bit floating-point values, even if the application doesn’t require that. FPGAs can be more closely tailored to the application, which usually saves gates and memory cells and ultimately power, weight, and system size.

FPGAs perform best in applications that take large amounts of input data and process it in a chain of operations that are the same for all the input points. They don’t suit “if-then-else” processing where the operations change depending on input values or intermediate results. But most signal processing algorithms don’t change with input values, making them ideal for implementation in an FPGA. FPGAs are very good at functions such as FIR filters or FFTs, which involve large numbers of relatively simple multiply and add operations.

In the past, it was cumbersome to connect FPGAs together in multichip systems arranged as stars, meshes, or rings. Current technologies make this much easier. Several large FPGAs will now fit on a 6U VME board, and new interconnect technologies provide high-speed, low-overhead connections between chips and between boards. A cluster of FPGAs could be used to implement multichannel digital downconverters (DDCs), multichannel demodulators, or one- or two-dimensional correlators. Given the high-speed interconnect technologies now available, it’s possible to build multichip systems to do two-dimensional FFTs (2dFFTs) and synthetic aperture radar (SAR) processing.

Cluster Applications
The simplest application for FPGA processing would be a set of linear operations like FIR filters running in parallel. A FIR filter is simply a sum of products. The current output is the weighted sum of some number of past inputs. It would be implemented with a chain of multipliers and adders, along with some temporary storage for the previous inputs.
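The sum-of-products structure described above can be sketched in a few lines of Python. This is a software illustration only, not an FPGA implementation: the delay line stands in for the temporary storage, and the multiply-and-add loop stands in for the chain of multipliers and adders. The function name `fir_filter` and the 4-tap moving-average coefficients are assumptions chosen for the example.

```python
from collections import deque

def fir_filter(samples, taps):
    """Yield one output per input: the weighted sum of the most recent inputs."""
    # Delay line holds the current and previous inputs, newest first.
    # It starts out filled with zeros (assumed initial condition).
    delay_line = deque([0.0] * len(taps), maxlen=len(taps))
    for x in samples:
        delay_line.appendleft(x)  # newest sample in, oldest sample drops off
        # Sum of products: one multiply per tap, then accumulate.
        yield sum(c * s for c, s in zip(taps, delay_line))

# Hypothetical 4-tap moving-average filter applied to a step input.
taps = [0.25, 0.25, 0.25, 0.25]
step = [1.0] * 6
output = list(fir_filter(step, taps))
print(output)
```

In hardware, the multiplies for all taps would typically happen at the same time rather than in a loop, which is where the parallelism comes from.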

Data would come to the system over a high-speed channel (or several channels) and be distributed across perhaps several FPGA modules. Each FPGA would be divided up to handle as many channels as possible. This would allow a large number of operations to process in parallel, so the overall throughput of the system would be even higher than if one FPGA did all the work.
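As a rough illustration of this distribution step, the sketch below spreads input channels evenly across a set of FPGA modules. The function name `assign_channels`, the round-robin policy, and the per-FPGA capacity figure are all assumptions made for the example; a real system would size these from the resources each filter consumes.

```python
def assign_channels(num_channels, num_fpgas, channels_per_fpga):
    """Map each input channel to an FPGA index, filling modules round-robin."""
    if num_channels > num_fpgas * channels_per_fpga:
        raise ValueError("not enough FPGA capacity for the requested channels")
    assignment = {fpga: [] for fpga in range(num_fpgas)}
    for channel in range(num_channels):
        # Round-robin keeps the load on each module within one channel of the others.
        assignment[channel % num_fpgas].append(channel)
    return assignment

# Hypothetical system: 10 channels, 4 FPGAs, each able to host 3 channels.
plan = assign_channels(num_channels=10, num_fpgas=4, channels_per_fpga=3)
for fpga, channels in plan.items():
    print(f"FPGA {fpga}: channels {channels}")
```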

More challenging applications might involve correlation, digital downconversion, or demodulation cores. In cases where an entire signal processing chain doesn’t fit on one FPGA, it’s a simple matter to build a chain or pipeline comprising several chips in a string. Output data from one chip would be carried over a parallel link to a chip on the same board or over a high-speed serial link to a chip on another board in the chassis. Cores that implement all these functions are available for purchase from the FPGA manufacturers as well as third parties.
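The idea of a processing chain split into stages can be sketched with Python generators, where each generator stands in for one stage (or one chip) and passes its output to the next. The example is a simplified digital downconverter: mix to baseband, low-pass filter, then decimate. The stage names, the moving-average filter, and the test tone are assumptions for illustration, not a description of any vendor's core.

```python
import cmath

def mixer(samples, freq_cycles_per_sample):
    """Stage 1: shift the band of interest to baseband with a complex oscillator."""
    for n, x in enumerate(samples):
        yield x * cmath.exp(-2j * cmath.pi * freq_cycles_per_sample * n)

def lowpass(samples, taps):
    """Stage 2: FIR low-pass filter (sum of products over recent inputs)."""
    history = [0j] * len(taps)
    for x in samples:
        history = [x] + history[:-1]  # shift the newest sample into the delay line
        yield sum(c * s for c, s in zip(taps, history))

def decimate(samples, factor):
    """Stage 3: keep every factor-th sample to reduce the output rate."""
    for n, x in enumerate(samples):
        if n % factor == 0:
            yield x

def downconverter(samples, freq, taps, factor):
    """Chain the three stages; each one consumes the previous stage's output."""
    return decimate(lowpass(mixer(samples, freq), taps), factor)

# Hypothetical input: a complex tone at 0.25 cycles per sample.
tone = [cmath.exp(2j * cmath.pi * 0.25 * n) for n in range(32)]
baseband = list(downconverter(tone, freq=0.25, taps=[0.25] * 4, factor=4))
print(len(baseband), [round(abs(v), 3) for v in baseband])
```

Because each stage only needs the stream from the stage before it, the boundary between any two stages is a natural place to split the chain across chips.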

More Complex Problems
More complex problems might call for two-dimensional FFTs, which consist of one-dimensional FFTs on the rows and then on the columns of the image, with a “corner turn” (a matrix transpose) in between. A classic mesh architecture such as the VXS Processor Mesh fits this problem very well. A 2dFFT is the basis for a large number of useful operations, like image correlators, SAR processors, and target recognition algorithms. FPGA implementations of these functions would be very fast and probably use less power than a solution using general-purpose processors.
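The row-column structure of a two-dimensional transform can be sketched as follows. For brevity the example uses a direct one-dimensional DFT as a stand-in for an FFT core (an FFT produces the same result with fewer operations), and the function names are assumptions for illustration. The corner turn is the transpose between the row pass and the column pass.

```python
import cmath

def dft(row):
    """Direct 1-D DFT; a stand-in for an FFT core, which gives the same result."""
    n = len(row)
    return [sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def corner_turn(matrix):
    """Transpose the matrix so that columns become rows."""
    return [list(col) for col in zip(*matrix)]

def dft_2d(image):
    """2-D transform: row transforms, corner turn, then column transforms."""
    rows_done = [dft(row) for row in image]       # pass 1: every row
    turned = corner_turn(rows_done)               # columns are now rows
    cols_done = [dft(row) for row in turned]      # pass 2: every original column
    return corner_turn(cols_done)                 # restore the original orientation

# Hypothetical 4x4 test image: a single impulse at the origin,
# whose 2-D transform has the same value in every bin.
image = [[1, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
spectrum = dft_2d(image)
print([[round(abs(v), 3) for v in row] for row in spectrum])
```

In a multichip system, the row transforms and column transforms can each run in parallel, while the corner turn requires every processing node to exchange data with the others, which is why the interconnect matters for this problem.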

Another interesting complex application is a real-time adaptive beamformer. Beamforming involves calculating a set of weights from an input data set and then applying these weights, which define the gain and phase for each array element, to the input data. The weights are calculated using a process called QR decomposition, a method for solving a set of simultaneous equations for the unknown weights, which then define the required beam pointing direction. The QR core is fairly resource intensive, so a complex system might require several FPGAs to implement it fully. Again, a network of FPGAs configured as a mesh fits this problem well.
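To make the role of QR decomposition concrete, the sketch below solves a small set of simultaneous equations A w = d for complex weights w. It factors A into an orthonormal Q and an upper-triangular R using modified Gram-Schmidt, then back-substitutes. Gram-Schmidt is chosen here only for brevity and is not necessarily what a hardware QR core uses; the function names and the 2x2 example matrix are assumptions for illustration.

```python
def qr_decompose(a):
    """Modified Gram-Schmidt QR of an m x n matrix given as a list of rows.

    Assumes m >= n and full column rank. Returns (q_cols, r), where q_cols
    is a list of n orthonormal columns and r is n x n upper triangular.
    """
    m, n = len(a), len(a[0])
    cols = [[complex(a[i][j]) for i in range(m)] for j in range(n)]
    q_cols = []
    r = [[0j] * n for _ in range(n)]
    for j in range(n):
        v = cols[j]
        for i, q in enumerate(q_cols):
            # Project the partly reduced column onto each earlier q and remove it.
            r[i][j] = sum(qk.conjugate() * vk for qk, vk in zip(q, v))
            v = [vk - r[i][j] * qk for qk, vk in zip(q, v)]
        norm = abs(sum(vk.conjugate() * vk for vk in v)) ** 0.5
        r[j][j] = complex(norm)
        q_cols.append([vk / norm for vk in v])
    return q_cols, r

def solve_weights(a, d):
    """Solve a w = d (least squares if a is tall) via QR and back-substitution."""
    q_cols, r = qr_decompose(a)
    n = len(r)
    rhs = [sum(qk.conjugate() * dk for qk, dk in zip(q, d)) for q in q_cols]  # Q^H d
    w = [0j] * n
    for i in reversed(range(n)):
        acc = rhs[i] - sum(r[i][k] * w[k] for k in range(i + 1, n))
        w[i] = acc / r[i][i]
    return w

# Hypothetical 2-element example with complex entries (gain and phase).
a = [[1, 1j],
     [1j, 1]]
d = [1, 0]
weights = solve_weights(a, d)
print(weights)
```

The triangular form of R is what makes the final step cheap: each weight is found from one row, using only weights already computed.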