Design Great Interconnects By Treating FPGAs Like Software

The potent combination of FPGA processing and switch-fabric interconnect offers systems designers dramatic increases in both processing throughput and scalable bandwidth, within existing volume and power constraints. However, FPGA development is still a resource-intensive process, requiring a considerable investment to achieve efficient processing of application-specific algorithms. Given the high rate of product evolution, particularly with new switched-fabric interconnects coming on the market, software vendors are working to solve this problem through better tools and higher-level methods of algorithm definition. At the same time, FPGA manufacturers continue to create next-generation FPGAs with greater built-in functionality, making logic synthesis simpler and therefore easier to automate.

Until these benefits are realized, the effective use of FPGAs will be limited by how quickly designs can be adapted to new applications and platforms. One approach to this problem is to apply the lessons of modularity and reuse learned by the software community. Doing so leverages FPGA development investments across more than one product generation, helping to meet time-to-market goals and to hedge against the inevitable technology changes to come.

Fabric Winner? None Of The Above
Based on the early returns, the only clear fact about the "fabric wars" is that there will be no clear winner. Most high-performance, switched-fabric solutions being shipped today are based on proprietary protocols. Some applications will continue to need solutions tailored to their specific markets. Others will migrate toward off-the-shelf interconnects such as PCI Express, RapidIO, and 10-Gbit Ethernet as they mature. But none of the available interconnects will meet everyone's needs, so different interconnects will coexist for some time to come, even within the same market segments.

Next-generation board-level standards reflect the need for choices by providing layered architectures with multiple protocols mapped to a common physical interface. In the telecom space, AdvancedTCA offers GigE, PCI Express, StarFabric, and RapidIO fabrics mapped to the same set of pins. In the defense market, VITA 41 and 46 present a similar look and feel in a 6U form factor. New mezzanine standards such as AMC and XMC also share the same goals, providing one core standard with the usual list of fabric suspects as options.

While these standards support developing different cards that implement different fabrics, the inherent flexibility of FPGAs enables a new approach to the problem. One option is to create the choices at the board level, for example by offering separate PCI Express and RapidIO versions of an analog-to-digital converter (ADC) module. Instead, designers can use FPGAs with integrated fabric interfaces and appropriate IP cores, so that a single hardware solution supports multiple fabric choices (Fig. 1). For many applications, the FPGA-based hardware platform becomes a blank canvas upon which designers can paint their solutions.

A work of art should be unique, but today's FPGA-based IP cores need to be reusable. That reuse amortizes the development investment across multiple products, multiple fabrics, and multiple technology refresh cycles. Fully exploiting the potential of today's technology while hedging the fabric choice requires a modular approach to interconnect, not only at the board level but within the FPGA device itself.

Interconnect Strategy
One example of this approach is tekConnect, the interconnect architecture used in our FPGA-based carrier cards and I/O modules (Fig. 2). The architecture is tailored to the needs of high-performance embedded computing. It supplies a very simple protocol for streaming data transfer in three places: between external interfaces and IP cores, from IP core to IP core, and from IP core to fabric. Using a common protocol and signal set allows IP cores to be inserted, rearranged, and removed without significant integration effort. It also encapsulates both bus-based and fabric-based endpoints behind a common interface.

By conforming to a common interconnect architecture, IP core developers can focus on the value added by the core itself, without coupling that value to specific fabrics or hardware platforms. This lets both in-house and third-party developers create application-specific IP functions that are usable on a wide array of board-level products, from modular I/O solutions to high-performance payload and switch cards, across a range of bus and fabric choices.

Encapsulation of internal details, high-level interfaces, and modular design all sound like software. And, in fact, many of the lessons learned by software developers can be applied to FPGA IP cores. In the software domain, library functions are routinely developed and reused without significant effort, insulating the application from changes to the underlying environment. This same approach can be applied to FPGA-based systems, where the underlying environment may be a hardware platform, fabric interconnect, or operating system.
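The software analogy can be made concrete with a short behavioral model in Python. This is not HDL. Every core implements one shared stream-in/stream-out interface, so a "top-level design" is just an ordered list of cores, and reordering or removing a core touches nothing else. `Word`, `StreamCore`, `chain`, and the toy `Scale`/`Offset` cores are hypothetical names chosen for this sketch. They do not describe tekConnect's actual signals.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Word:
    """Hypothetical model of one transfer: a data value plus tag bits."""
    data: int
    tag: int = 0


class StreamCore(ABC):
    """Hypothetical common interface: every core maps a word stream to a word stream."""

    @abstractmethod
    def process(self, words: Iterable[Word]) -> Iterator[Word]:
        ...


class Scale(StreamCore):
    def __init__(self, factor: int) -> None:
        self.factor = factor

    def process(self, words: Iterable[Word]) -> Iterator[Word]:
        for w in words:
            yield Word(w.data * self.factor, w.tag)  # tags pass through untouched


class Offset(StreamCore):
    def __init__(self, delta: int) -> None:
        self.delta = delta

    def process(self, words: Iterable[Word]) -> Iterator[Word]:
        for w in words:
            yield Word(w.data + self.delta, w.tag)


def chain(cores: List[StreamCore], words: Iterable[Word]) -> List[Word]:
    """Connect each core's output to the next core's input, like a top-level design."""
    stream: Iterable[Word] = words
    for core in cores:
        stream = core.process(stream)
    return list(stream)


src = [Word(1), Word(2), Word(3, tag=1)]
print([w.data for w in chain([Scale(10), Offset(1)], src)])  # [11, 21, 31]
print([w.data for w in chain([Offset(1), Scale(10)], src)])  # [20, 30, 40]
```

Swapping the two cores required no change to either core, which is the integration property the common interface is meant to provide.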

For this to work, the interconnect must combine efficiency with flexibility. If the interconnect isn't flexible enough, it won't meet the needs of a broad range of applications. If the interconnect is too flexible, it will require too many resources to implement, and will add too much overhead to the resulting FPGA design.

The approach we've chosen balances these issues and is tailored for streaming high-speed data applications. The specific design choices made were:

Unidirectional: Simple source-to-sink data transfer with flow control. Bidirectional applications can be implemented using one port in each direction.
Point-to-point: Eliminates arbitration logic; minimizes data-bus complexity.
Optimized for data: While route or address information can be embedded in the data stream, the interconnect doesn't require address information.
Extensible: Tag bits associated with each data word identify sync boundaries in the data stream, but may also be used for application-specific control extensions.
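The four design choices can be illustrated with a cycle-level model of a single link. The article says only that the link has "flow control." This sketch assumes a valid/ready handshake, a common convention but not confirmed for tekConnect. The tag-bit assignments `TAG_EOL` and `TAG_EOF` are likewise hypothetical. Because there is exactly one source and one sink, the model needs no address and no arbiter.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical tag-bit assignments (the real encoding is not described in the text).
TAG_EOL = 0b01  # last word of an image line
TAG_EOF = 0b10  # last word of a frame


@dataclass(frozen=True)
class Word:
    data: int
    tag: int = 0


def simulate_link(source: Sequence[Word], ready_pattern: Sequence[bool]) -> List[Tuple[int, Word]]:
    """Cycle-by-cycle model of one unidirectional, point-to-point link.

    Assumed valid/ready handshake: a word moves only on a cycle where the source
    has data (valid) and the sink can accept it (ready). Returns (cycle, word)
    pairs for every completed transfer.
    """
    transfers: List[Tuple[int, Word]] = []
    idx = 0
    for cycle, ready in enumerate(ready_pattern):
        valid = idx < len(source)
        if valid and ready:
            transfers.append((cycle, source[idx]))
            idx += 1
    return transfers


line = [Word(10), Word(11), Word(12, TAG_EOL | TAG_EOF)]
# The sink stalls (ready low) on cycles 1 and 2, so the second word waits.
for cycle, w in simulate_link(line, [True, False, False, True, True]):
    print(cycle, w.data, bool(w.tag & TAG_EOL))
```

The words arrive on cycles 0, 3, and 4. The stall costs time but never drops or reorders data, and the boundary tag travels with its word.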

These tradeoffs are optimized for applications with pipelined data transfer from one or more sources through a chain of processing elements. For these types of applications, the data routing is largely determined during the design phase. While the specifics of each processing element can change dynamically at run time, the overall data flow is static once implemented.
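The split between fixed routing and adjustable processing can be sketched in a few lines of Python. In this sketch the pipeline topology is an immutable tuple, standing in for routing fixed at synthesis. Each stage's parameters live in hypothetical "control registers" that can be rewritten between frames. The `gain` and `clip` stages and the `registers` layout are illustrative inventions, not part of the article.

```python
from typing import Callable, Dict, List, Tuple

# A stage is a name plus a function of (sample, params). Hypothetical illustration.
Stage = Tuple[str, Callable[[int, Dict[str, int]], int]]

# Topology fixed at "design time": an immutable tuple, like synthesized routing.
PIPELINE: Tuple[Stage, ...] = (
    ("gain", lambda x, p: x * p["gain"]),
    ("clip", lambda x, p: min(x, p["limit"])),
)

# Run-time control registers: rewritable without touching the data path.
registers: Dict[str, Dict[str, int]] = {
    "gain": {"gain": 2},
    "clip": {"limit": 100},
}


def run(samples: List[int]) -> List[int]:
    """Push each sample through every stage in the fixed order."""
    out: List[int] = []
    for x in samples:
        for name, fn in PIPELINE:
            x = fn(x, registers[name])
        out.append(x)
    return out


print(run([10, 40, 60]))       # [20, 80, 100]
registers["gain"]["gain"] = 3  # run-time parameter change; the data flow is unchanged
print(run([10, 40, 60]))       # [30, 100, 100]
```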

Image-Processing Design Example
Let's assume that you're designing a system to analyze and process high-resolution images in real time. The images could come from a reconnaissance camera, a digital x-ray, or a wafer-inspection system. However, the semantics of the data are identical in all cases: high-speed streaming data consisting of pixels with line and frame boundaries embedded into the data.
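Here is a minimal sketch of how line and frame boundaries can ride along with the pixels as tag bits, assuming the same hypothetical `TAG_EOL`/`TAG_EOF` encoding used in the other sketches. Note that the receiver rebuilds the frame structure from the tags alone, without any address or pixel count.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

TAG_EOL = 0b01  # hypothetical: marks the last pixel of a line
TAG_EOF = 0b10  # hypothetical: marks the last pixel of a frame


@dataclass(frozen=True)
class Word:
    data: int
    tag: int = 0


def serialize(frame: List[List[int]]) -> Iterator[Word]:
    """Flatten a frame (a list of lines) into a pixel stream with embedded boundaries."""
    for r, line in enumerate(frame):
        for c, pixel in enumerate(line):
            tag = 0
            if c == len(line) - 1:
                tag |= TAG_EOL
                if r == len(frame) - 1:
                    tag |= TAG_EOF
            yield Word(pixel, tag)


def deserialize(words: Iterable[Word]) -> Iterator[List[List[int]]]:
    """Rebuild frames from the stream using only the tag bits."""
    frame: List[List[int]] = []
    line: List[int] = []
    for w in words:
        line.append(w.data)
        if w.tag & TAG_EOL:
            frame.append(line)
            line = []
        if w.tag & TAG_EOF:
            yield frame
            frame = []


img = [[1, 2, 3], [4, 5, 6]]
stream = list(serialize(img)) * 2  # two back-to-back frames
print(list(deserialize(stream)))
```

The same stream format would serve a reconnaissance camera, a digital x-ray, or a wafer inspector; only the frame dimensions differ.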

Let's further assume that you want to take advantage of FPGA-based processing. You've selected an appropriate combination of off-the-shelf cores and a development-tool chain for your development task. The desired processing is divided into five stages: non-uniformity correction, 2D FFT, convolution, inverse 2D FFT, and detection. Each of these stages will be implemented using an IP core with a common input and output interface.
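The placement of the convolution stage between the 2D FFT and the inverse 2D FFT suggests that it operates in the frequency domain; the text does not say so explicitly, so this is an assumption. Under that reading, the three middle stages implement spatial filtering through the convolution theorem. Let x be an M-by-N frame and h a filter kernel zero-padded to the same size, with 2D DFTs X and H:

```latex
X[k,l] = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x[m,n]\, e^{-j 2\pi \left( \frac{km}{M} + \frac{ln}{N} \right)}

Y[k,l] = X[k,l]\, H[k,l]

y[m,n] = \mathcal{F}^{-1}\{Y\}[m,n]
       = \sum_{p=0}^{M-1} \sum_{q=0}^{N-1} x[p,q]\, h\big[(m-p) \bmod M,\ (n-q) \bmod N\big]
```

The pointwise product yields a circular convolution. To avoid wraparound at the image edges with a K-by-K kernel, both frame and kernel must be zero-padded to at least (M+K-1)-by-(N+K-1) before the forward transform. Direct spatial convolution costs K^2 multiply-accumulates per output pixel. The transform route costs one complex multiply per frequency bin plus the two transforms, which is why it pays off for large kernels.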

For any given application, some IP cores will be "generic" (FFTs, digital filters), while others may be domain-specific, or even tailored specifically to the exact needs of the application. In this example, the non-uniformity correction algorithm will need to match the characteristics of the input device. The FFT, convolution, and IFFT cores can be generic, and the detection function is likely to be application-specific.
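Non-uniformity correction is a good example of a device-specific core. A common form is two-point correction, which applies a per-pixel gain and offset calibrated for the particular sensor. The sketch below assumes that form; the article does not specify the algorithm. It models the correction as a streaming core that tracks pixel position purely from the hypothetical `TAG_EOL`/`TAG_EOF` tag bits. The class name and table layout are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

TAG_EOL = 0b01  # hypothetical: last pixel of a line
TAG_EOF = 0b10  # hypothetical: last pixel of a frame


@dataclass(frozen=True)
class Word:
    data: float
    tag: int = 0


class NonUniformityCorrection:
    """Two-point NUC as a streaming core: out = gain[r][c] * (raw - offset[r][c]).

    The gain/offset tables are calibrated for one specific sensor, which is what
    makes this core device-specific. Pixel position comes only from the tags.
    """

    def __init__(self, gain: List[List[float]], offset: List[List[float]]) -> None:
        self.gain = gain
        self.offset = offset

    def process(self, words: Iterable[Word]) -> Iterator[Word]:
        r = c = 0
        for w in words:
            out = self.gain[r][c] * (w.data - self.offset[r][c])
            yield Word(out, w.tag)  # boundary tags pass through unchanged
            c += 1
            if w.tag & TAG_EOL:
                r, c = r + 1, 0
            if w.tag & TAG_EOF:
                r, c = 0, 0


nuc = NonUniformityCorrection(gain=[[1.0, 2.0]], offset=[[10.0, 20.0]])
frame = [Word(12.0), Word(25.0, TAG_EOL | TAG_EOF)]
print([w.data for w in nuc.process(frame * 2)])  # [2.0, 10.0, 2.0, 10.0]
```

Because the core only reads and forwards tags, it can sit in front of a generic FFT core without either core knowing about the other.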

Many generic cores can be purchased already equipped with the common interconnect used on our products. Alternatively, an off-the-shelf third-party core can be "wrapped" with the logic needed to map its native signals onto the common interconnect (Fig. 3). The wrapper logic is typically straightforward to implement. Once the wrapper is complete, the third-party core can be reused in multiple applications.
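A behavioral sketch of such a wrapper follows. It assumes a hypothetical third-party core with a block-based interface: it wants a complete line at once and returns a result of equal length. Many vendor cores (FFTs in particular) work on blocks like this. `vendor_line_core` is a stand-in for the vendor function. `LineWrapper` is a hypothetical adapter that buffers words up to an end-of-line tag, calls the core, and restores the original tags on the output.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List

TAG_EOL = 0b01  # hypothetical: last word of a line
TAG_EOF = 0b10  # hypothetical: last word of a frame


@dataclass(frozen=True)
class Word:
    data: int
    tag: int = 0


def vendor_line_core(block: List[int]) -> List[int]:
    """Stand-in for a third-party core with a block-based interface:
    it takes a complete line and returns a result of equal length."""
    return [x - block[0] for x in block]


class LineWrapper:
    """Adapts a block-based core to the common streaming interface."""

    def __init__(self, core: Callable[[List[int]], List[int]]) -> None:
        self.core = core

    def process(self, words: Iterable[Word]) -> Iterator[Word]:
        data: List[int] = []
        tags: List[int] = []
        for w in words:
            data.append(w.data)
            tags.append(w.tag)
            if w.tag & TAG_EOL:  # a full line is buffered; hand it to the core
                result = self.core(data)
                if len(result) != len(data):
                    raise ValueError("wrapped core must return one output per input word")
                for value, tag in zip(result, tags):
                    yield Word(value, tag)  # re-attach the original boundary tags
                data, tags = [], []


stream = [Word(5), Word(7), Word(9, TAG_EOL), Word(1), Word(4, TAG_EOL | TAG_EOF)]
print([(w.data, w.tag) for w in LineWrapper(vendor_line_core).process(stream)])
```

All the knowledge of the vendor interface is confined to the wrapper, so the wrapped core can drop into any chain that speaks the common interface.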

Application-specific functions can be implemented in any desired tool chain, again using wrappers to encapsulate the application processing inside a standard framework. As with third-party cores, the application-specific functions can easily be reused via this approach.

Once the cores are implemented, the application must be mapped into actual hardware. With today's products, the input may arrive using PMC-based I/O modules and traditional PCI bus implementations (Fig. 4a). Such an implementation employs three FPGAs, one on the I/O module and two on the carrier card, to perform the required functions. The data transport between FPGAs uses PCI (from A to B) and RACE++ (from B to C and C to D) endpoint IP cores with common interconnect.

Future platforms will support higher-speed interconnects, replacing PCI bus with PCI Express, and RACE++ with PCI Express, RapidIO, or other switched-fabric interconnects. A PCI Express implementation with two FPGAs is shown in Figure 4b, and Figure 4c depicts a higher-density solution using a single FPGA and RapidIO. In all cases, the IP cores developed for the application are reused without modifications. Only the top-level design must be changed.
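The claim that only the top-level design changes can be checked mechanically in a small model. Below, each platform is described as a hypothetical partition of the five application cores across FPGAs plus the endpoint type on each FPGA-to-FPGA hop. `stitch` expands that description into the full chain of blocks. The partitions and names are invented for illustration and are not taken from Figure 4. External I/O endpoints are omitted for brevity.

```python
from typing import Dict, List, Tuple

# Application cores: developed once, reused unchanged on every platform.
APP_CORES = ["nuc", "fft2d", "convolve", "ifft2d", "detect"]

# Hypothetical top-level description: cores grouped per FPGA, plus the endpoint
# type on each FPGA-to-FPGA hop. Only this description changes between platforms.
Design = Tuple[List[List[str]], List[str]]

DESIGNS: Dict[str, Design] = {
    "three_fpga": ([["nuc"], ["fft2d", "convolve"], ["ifft2d", "detect"]], ["pci", "race++"]),
    "two_fpga": ([["nuc", "fft2d", "convolve"], ["ifft2d", "detect"]], ["pcie"]),
    "one_fpga": ([APP_CORES], []),
}


def stitch(design: Design) -> List[str]:
    """Expand a design into its ordered chain of blocks, inserting a
    transmit/receive endpoint core pair at every device boundary."""
    devices, links = design
    if len(links) != len(devices) - 1:
        raise ValueError("need exactly one link per device boundary")
    chain: List[str] = []
    for i, cores in enumerate(devices):
        chain.extend(cores)
        if i < len(links):
            chain += [f"{links[i]}_tx", f"{links[i]}_rx"]
    return chain


for name, design in DESIGNS.items():
    blocks = stitch(design)
    print(name, blocks)
    # Identical application cores, in identical order, on every platform.
    assert [b for b in blocks if b in APP_CORES] == APP_CORES
```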

Will On-Chip Standards Emerge?
The interconnect architecture described above is our implementation of a cross-platform, cross-fabric, cross-technology strategy for protecting IP investment. Although designed to be general purpose, it's implemented on one company's products and is therefore not an open standard.

Standards are available on the open market to meet a wide range of requirements for IP-to-IP interconnect. Many of them are designed for processor bus implementations, which makes them too resource-intensive for high-performance embedded applications. Others have been defined by specific FPGA manufacturers, so cores built to them aren't portable across FPGA vendors. To date, an open, portable solution that meets the needs of the marketplace has yet to emerge. It remains to be seen whether any current standard will achieve the critical mass necessary to become a de facto open standard. Unlike board-to-backplane hardware interfaces, FPGAs are inherently adaptable, so IP cores written to different standards can be adapted to work together. This reduces the pressure on vendors to invest in the kind of open, industry-wide standards that board-level products require.
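As an example of what adapting between standards can involve, suppose (hypothetically) that one core marks the first word of each line (start-of-line) while its neighbor expects the last word to be marked (end-of-line). A word cannot be known to end a line until the next word arrives. The bridge therefore has to hold one word in a register, which costs one cycle of latency in hardware. `SOL`, `EOL`, and `sol_to_eol` are illustrative names, not drawn from any real standard.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

# Hypothetical conventions for two cores written to different standards.
SOL = 0b1  # standard A: flag on the FIRST word of each line
EOL = 0b1  # standard B: flag on the LAST word of each line


@dataclass(frozen=True)
class Word:
    data: int
    tag: int = 0


def sol_to_eol(words: Iterable[Word]) -> Iterator[Word]:
    """Bridge from start-of-line marking to end-of-line marking.

    Holds one word: it is emitted only after the next word shows whether a
    new line starts (SOL), i.e., whether the held word ended the previous line.
    """
    held: Optional[Word] = None
    for w in words:
        if held is not None:
            yield Word(held.data, EOL if w.tag & SOL else 0)
        held = w
    if held is not None:
        # In this finite model, end of input closes the final line. A continuous
        # hardware stream would instead wait for the next SOL.
        yield Word(held.data, EOL)


a = [Word(1, SOL), Word(2), Word(3, SOL), Word(4)]
print([(w.data, w.tag) for w in sol_to_eol(a)])
```

Bridges like this are small, which supports the point above: the adaptability of FPGAs lowers the cost of living with multiple on-chip conventions.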

Fortunately, most benefits of modularity and reuse can be achieved whether or not the selected interconnect becomes widely available. By using a modular architecture and building IP cores based on common interconnects, system designers can focus development investment on the value-added parts of the solution, maximize design reuse, and reduce the schedule and development costs of technology changes in the future.