By Mike Jadon and Richard M. Mathews, Micro Memory, LLC
Using FPGAs for digital signal processing is gaining market acceptance and popularity at a rate few would have dared to predict just two or three years ago. This can be attributed primarily to three factors:
· The priority being placed on increasing processing power while reducing total system volume and heat dissipation.
· The pace of advancement in FPGA technology, specifically in size (the number of logic slices) and speed (the FPGA's clock frequency).
· The ease of use provided by FPGA development tools, algorithmic modeling tools, and compilers.
These programmable devices, combined with advancements in reconfigurable processing, are changing the DSP paradigm at the system level for many applications. Today, FPGA processing is generally seen as a front-end solution for fixed-point operations and as part of a larger heterogeneous mix of processing resources that includes conventional DSPs, such as AltiVec-enabled PowerPCs or the TI C6x family. But this ratio is shifting toward a higher and higher percentage of FPGAs. Performing floating-point operations in FPGAs is becoming much more feasible through the use of modeling tools with high-level languages and VHDL compilers, the most common combination being The MathWorks' Simulink with Xilinx's System Generator.
Nonetheless, getting a complicated system based on FPGA processing to a production level requires a significant investment in terms of time and resources. While larger signal processing OEMs and Defense Primes are concerned with increasing processing power and reducing system size and heat dissipation, they are also heavily focused on consolidating on common processing platforms, where a core processing unit is used to address the requirements of several programs. This type of reuse dramatically reduces initial development costs for follow-on systems and recurring sustaining costs for all systems. But because different programs will inevitably have different requirements, it is likely that different interfaces will be needed from one program to another.
As FPGA processing is most often used at the front end of the data flow, these input interfaces are often on the same circuit board as at least some of the FPGA processing resources in order to meet the requirement of reducing overall system size.
Reuse Requires Flexibility
On the front end of a system, A/D converter requirements can vary in the number of channels, channel width, frequency, and decimation rate, and these requirements can easily change from program to program. Even if sensors are digitized separately from the core processing unit, in which case Serial FPDP or some type of high speed fiber links might be utilized, the number of channels and link rate can be different from one system requirement to another.
Also important is the need for true flexibility in terms of accommodating completely different input interconnect technologies. An example is a core processing unit that originally targeted a closed system and utilized Serial FPDP as the input interface. Suppose a requirement then arises to deploy this same core processing unit into a larger net-centric system based on 10 Gb Ethernet. This scenario is equally applicable to technology refresh, where interconnect bandwidth, features, and quality are evolving and improving each year.
Flexibility = Mezzanines
All of this leads to the following points: FPGA processing is a significant investment; consolidating slot count is necessary to reduce system size; and having a common core processing platform is essential to the business objectives of OEMs and Defense Primes. Thus, some type of mezzanine should be utilized for the system input interfaces in conjunction with carrier-based FPGA processing to provide the flexibility that is critical to reuse.
This conclusion is easy to reach but is often gated by practical implementation concerns. Looking deeper at the idea of combining mezzanines and FPGA processing, some important factors become evident:
· Mezzanines are often based on chipsets from leading I/O silicon providers (AMCC, Broadcom, Intel, Mellanox, QLogic, Vitesse, etc.).
· Part of the appeal of utilizing mezzanines is the ability to readily take advantage of these devices and their cutting-edge technology with minimal cost and development time.
· The vendors of these I/O silicon devices often utilize PCI/PCI-X/PCI-Express because they target mass-market servers.
ASIC-based Bus Translation Bridges
These factors often lead to the use of off-the-shelf ASSP (Application-Specific Standard Product) bridges. These ASIC-based bridges are widely used in a variety of embedded platforms and provide essential functionality, including connectivity between devices on different bus segments or even different bus topologies.
If we go a step further and assume the system utilizes a truly distributed fabric such as RapidIO, which has distinct advantages over alternative architectures for distributed multiprocessor systems, then given the previously mentioned factors it is very likely that the system will require a bus-translation bridge.
In this case, the system would require an ASIC-based bus translation bridge with a PCI endpoint and a Serial RapidIO endpoint, with the bridge translating the PCI protocol (PCI, PCI-X, or PCI-Express) to RapidIO. While providing essential functionality, these ASIC-based bridges can have several performance drawbacks. Regardless of how finely their adjustable parameters are tuned, bus-translation bridges will inevitably force the use of flow-control mechanisms such as retries, callbacks, and disconnects. Combined with limited FIFOs and inefficient pre-fetching, this results in latency and throughput penalties that can negatively impact the greater system.
A general-purpose bridge, especially one that must translate between bus protocols, must often guess at the future behavior of devices on each side of the bridge. When reading, pre-fetching data may be efficient, or it may waste bandwidth by transferring data that will never be needed. The bridge has to choose between transmitting writes immediately or posting them in the hope of combining or collapsing them with other writes. A translating bridge must also guess at values for fields not defined by both protocols, such as the RapidIO priority for a request that originates on PCI-X. If the bus speeds do not perfectly match, the bridge must guess about flow control. If data arrives at the bridge faster than the bridge can forward it, the bridge must act to slow down the sender; this can result in nondeterministic behavior that makes it impossible to schedule I/O so that different devices on the bus do not conflict. If data arrives slower than the bridge can forward it, the bridge must guess whether to transmit small bursts with low latency or to accumulate larger bursts for higher bandwidth utilization.
With conventional PCI, a block read request provides no information about the size of the block. As the bridge prefetches the data while throttling the requesting side, considerable bus inefficiency results when the prefetch block size does not match the block size required by the original requester. If the prefetch block is too large, the bridge fetches excess data that is subsequently discarded, wasting bus bandwidth. If the prefetch size is too small, the bridge keeps disconnecting the requester while it fetches another small block. Without perfect tuning, it is difficult to achieve a "flow-through" mode in which prefetched data is fed to the requester in a continuous stream, approaching the theoretical maximum bandwidth. The retries and disconnects drive the bandwidth well below that maximum.
PCI-X, PCI-Express, and RapidIO reduce this problem but do not completely eliminate it. If the optimum strategy involves pre-fetching more data than can be requested on the bus at one time, the bridge still may not know the correct amount of pre-fetching to perform. For example, many RapidIO implementations limit a device to eight outstanding requests. Since each request can be for only 256 bytes, the reader cannot tell the bridge to prefetch more than 2 KB. A bridge may know it needs to prefetch at least 2 KB, but it can only guess whether it needs more.
Optimized Bridging through Multi-Ported Memory Controllers
Alternatively, bridges based around multi-ported memory controllers can overcome these performance penalties. Providing seamless, transparent access between endpoints with different topologies, such as PCI and Serial RapidIO, translation bridges based on multi-ported memory controllers do not suffer the disconnects and retries experienced with the conventional bridges previously mentioned. These solutions require more devices, including external memory, and therefore occupy more board real estate at a higher cost than alternative implementations. But for many applications, particularly those involving real-time streaming data, this cost is easily offset by the overall system benefits.
If the multi-ported memory controller based bridge is also a smart device, it can make intelligent decisions about when to send requests and data. The multi-ported memory controller can carry on one-to-one communications with devices on each bus that are fine tuned to fit the application on each side. I/O can then be precisely scheduled to fit hard-real-time requirements.
FPGA Processing in the Bridge
In addition to providing optimized bridging through a multi-ported memory controller and intelligence embedded in the device, a true system-on-chip can be created that also includes FPGA processing resources for user-programmable logic in the heart of the bridge. This type of smart, multi-ported memory controller combined with application-specific state machines can process the data, so what comes out is not the same bits that went in. High bandwidth utilization can thus be achieved on both buses.
Assuming the application requires FPGA processing and a distributed switch fabric, you can then see how a smart, optimized bridge can provide efficient connectivity to mezzanines, alleviate bottlenecks, and perform pre-processing (FFTs, FIR filters, etc.) at a critical point in the data flow. A circuit board based on such a device is the MM-1400D (see Figure). The MM-1400D is a 6U VME VITA 46 PMC/XMC carrier board based on two of Micro Memory's CoSine Virtex-4 FX140 FPGA SoCs. The two mezzanine sites each support 64-bit/133-MHz PCI-X PMCs or Serial RapidIO x4 XMCs. Four independent Serial RapidIO x4 connections are provided to the VITA 46 MGT LVDS backplane connector in compliance with the VITA standard. Each CoSine SoC FPGA has 4 GB of multi-ported DDR (8 GB total on the MM-1400D) with ECC for rate-buffering streaming sensor signals. Tightly integrated with the CoSine User Programmable Logic block, this provides an effective strategy for performing fixed-point FPGA processing functions (FFTs, FIR filters, etc.) on incoming data streams before DMA'ing results to downstream DSPs.
Mike Jadon is the Director of Product Marketing at Micro Memory, LLC. He can be reached by e-mail at [email protected]
Richard M. Mathews is the Managing Developer of Software Engineering at Micro Memory, LLC. He can be reached by e-mail at [email protected]