Programming FPGA Systems Doesn't Have To Be Difficult

DESIGN VIEW is the summary of the complete DESIGN SOLUTION contributed article, which begins on Page 2.

Hardware designers have taken to FPGA computing for high-performance DSP solutions because it offers throughput 10 to 100 times higher than PC- or single-board-computer (SBC) systems. Previously, the advantages of powerful FPGA solutions weren't available to software design teams not skilled on the hardware side. Today, however, C-based solutions bring the power of FPGAs to software designers without a steep learning curve. These C-based tools significantly reduce design time versus HDL-based hardware design.

Because of these advantages, FPGA technology has evolved to the point where these chips can do much more than serve as front ends to I/O devices. FPGAs now handle the bulk of the actual processing in high-bandwidth and compute-intensive applications.

In addition, FPGAs are closely coupled with on-board memory, and multiple devices can reside on one board. FPGA boards can also communicate via emerging serial communications standards, such as RapidIO or PCIX. With these advances, it's possible to deliver FPGA-based systems with an order-of-magnitude higher price/performance ratio over existing multi-CPU or DSP systems. As a result, FPGAs are now being used in lieu of CPUs or DSPs for algorithm-intensive, high-bandwidth applications.

This article shows designers how to move their signal-processing application to an FPGA-based system by using software tools that implement the C programming language. It outlines the process in a step-by-step fashion, taking the developer through the process of moving an algorithm-intensive, signal-processing application to a multi-FPGA system. Using C to program an example FPGA computing solution, the software application's execution dropped from about 12 minutes to approximately 2 seconds.

HIGHLIGHTS:
Porting to Hardware Via C	When designing an algorithm-intensive signal-processing application, first determine which algorithms to accelerate from your C-based code. Then create an FPGA design on paper to accelerate the code's functions. Each algorithm must be analyzed.
Determining Resources	The amount of FPGA resources to accommodate the code must be ironed out. You can partition or divide the code to run across multiple FPGAs for greater throughput.
Simulating the Environment	The simulation environment provides full bit-true/cycle-true simulations and faithfully simulates the FPGA implementation. Output from the design can be compared with the original C software version to test for accuracy. The simulation also reports actual operating speed.

Full article begins on Page 2

Hardware designers have taken to FPGA computing for high-performance DSP solutions because it offers throughput 10 to 100 times higher than PC- or single-board-computer (SBC)-based solutions. Previously, the advantages of powerful FPGA solutions weren’t available to software design teams not skilled on the hardware side. Today, however, C-based solutions bring the power of FPGAs to software designers without a huge learning curve. These C-based tools significantly reduce design time versus HDL-based hardware design, allowing for quick design improvements without requiring hardware knowledge.

Because of these advantages, FPGA technology has evolved to the point where these chips can do much more than serve as front ends to I/O devices. FPGAs are now able to handle the bulk of the actual processing in high-bandwidth and compute-intensive applications. Plus, FPGAs are closely coupled with on-board memory, and multiple devices can reside on one board. Better yet, it’s possible for FPGA boards to communicate via emerging serial communications standards, such as RapidIO or PCIX. These recent advances allow embedded developers to deliver FPGA-based systems with an order-of-magnitude higher price/performance ratio over existing multi-CPU or DSP systems.

As a result, FPGAs are now being used in lieu of CPUs or DSPs for applications that are algorithm-intensive and require high bandwidth, such as medical imaging, industrial applications, and sonar and radar for the military. Designers using these new C-based tools to deploy DSP applications (with one or more FPGA processors on a PCI board) will achieve the improved performance described above, along with a faster time to market.

This article shows designers how to move their signal-processing application to an FPGA-based system by using software tools that implement the C programming language. It outlines the process in a step-by-step fashion (with examples), taking the developer through the process of moving an algorithm-intensive, signal-processing application to a multi-FPGA-based system. Using C to program an example FPGA computing solution, the software application’s execution was reduced from approximately 12 minutes to only about 2 seconds.

Porting To Hardware Via C

Let’s assume that you’re designing an algorithm-intensive signal-processing application, like analyzing the cracks on thousands of kilometers of a road surface. Such an application uses the Hough/Inverse-Hough algorithm, which is also used to locate rivers or streets on aerial images, or surface defects on semiconductors.

Suppose that you’re using a Pentium 4 Windows XP-based PC, a PCI board with multiple FPGAs on it (such as our Tsunami board), a C development environment, and Handel-C (by Celoxica). You have little knowledge of HDL hardware programming, but you’re familiar with the basic elements of FPGA-based design. The process starts with code that you will write in C. Then the C code is translated to Handel-C, simulated on a PC, and eventually run on multiple FPGA processors (Fig. 1).

To begin, determine which algorithms in your C code to accelerate. A good profiler, such as the Intel VTune Performance Analyzer, can identify the parts of your code that consume the most clock cycles. In the case of the above-mentioned signal-processing application, the algorithm stream was totally CPU bound, taking 12 minutes to execute. The profiler showed that this time was dominated by loops within loops within loops, clearly highlighting which code to port to the FPGA accelerator. Accelerated code requires streaming input and output over the PC’s PCI bus, so check that the I/O data rates are within the range of the PCI bus, typically from 70 to 200 Mbytes/s.
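The bus check lends itself to a quick back-of-the-envelope calculation. The following is a minimal sketch in C, not part of the original design: the function names are hypothetical, 1 Mbyte is taken as 1,000,000 bytes, and the two limits are simply the ends of the 70-to-200-Mbytes/s range quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Ends of the typical PCI range quoted in the text (Mbytes/s). */
#define PCI_LOW_MBYTES_PER_S   70.0
#define PCI_HIGH_MBYTES_PER_S 200.0

/* Hypothetical helper: streaming rate (Mbytes/s, 1 Mbyte = 1e6 bytes)
 * needed to move the input and output data within target_seconds. */
double required_mbytes_per_s(double bytes_in, double bytes_out,
                             double target_seconds)
{
    assert(target_seconds > 0.0);
    return (bytes_in + bytes_out) / target_seconds / 1e6;
}

/* True if the required rate does not exceed the sustained rate
 * you expect from your particular bus and board. */
bool fits_pci_bus(double required_rate, double sustained_limit)
{
    return required_rate <= sustained_limit;
}
```

For instance, streaming 100 Mbytes in and 40 Mbytes out within 2 seconds requires 70 Mbytes/s, which just fits the conservative end of the range. Checking against PCI_LOW_MBYTES_PER_S first is the safer habit, because the sustained rate depends on the particular system.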

The next challenge is creating an FPGA design on paper to accelerate the functions of the code. An FPGA can execute thousands of instructions simultaneously, accessing hundreds of memory banks so that "pipelining" and "parallel-processing" techniques can be used to accelerate the functions. With pipelining, the instruction paths are sequenced. So while some algorithms are being computed on one part of a data "pipeline," others are computed on a later portion of the same pipeline, much like an automotive assembly line. The routines with longer clock times can also be "paralleled up" to dramatically reduce time (Fig. 2).
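To make the assembly-line idea concrete, here is a small software model of a pipeline written in plain C. It is only an illustrative sketch: the three stages and the arithmetic they perform, ((x + 1) * 2) - 3, are assumptions chosen for the example and have nothing to do with the actual application. Each pass through the loop models one clock, during which every stage works on a different data item.

```c
#include <assert.h>
#include <stddef.h>

#define STAGES 3  /* assumption: three stages, chosen only for illustration */

/* Software model of a three-stage pipeline computing ((x + 1) * 2) - 3.
 * Returns the number of clocks needed to process n inputs. */
size_t run_pipeline(const int *in, int *out, size_t n)
{
    int reg[STAGES] = {0};    /* register at the output of each stage   */
    int valid[STAGES] = {0};  /* 1 when the register holds real data    */
    size_t clocks = 0, fed = 0, done = 0;

    assert(in != NULL && out != NULL);

    while (done < n) {
        /* Work from the last stage back so each stage consumes the value
         * its predecessor held on the previous clock. */
        if (valid[2])
            out[done++] = reg[2];               /* collect a finished result */
        reg[2] = reg[1] - 3;  valid[2] = valid[1];   /* stage 3: subtract */
        reg[1] = reg[0] * 2;  valid[1] = valid[0];   /* stage 2: multiply */
        if (fed < n) {
            reg[0] = in[fed++] + 1;                  /* stage 1: add      */
            valid[0] = 1;
        } else {
            valid[0] = 0;                            /* nothing left to feed */
        }
        clocks++;
    }
    return clocks;
}
```

In this model, four inputs finish in 7 clocks (three clocks of latency, then one result per clock), whereas running the three steps back-to-back for each input would take 12.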

To that end, you must analyze each algorithm. Take each algorithm step and break it down into operations that comprise mathematical functions (add, subtract, multiply, multiply/accumulate, divide), delays, store to memory, and table lookup. No matter how complex an algorithm is, it breaks down into operations from these categories. Operations can occur in parallel as long as they don't depend on each other.

Our example application was accelerated like this: The nine processing cycles were fully pipelined with one result per clock after initial latency. The cycles were then embedded in a three-dimensional loop of X, Y, and Θ. The total number of cycles per processed tile became 9 + (9 × X × Y × Θ), that is, 9 cycles of latency plus (9 cycles × 64 pixels × 64 pixels × 64 steps) (Fig. 3).
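Plugging the numbers into the formula as quoted gives the cycle budget per tile. The short C sketch below does nothing more than that arithmetic; the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycle count per tile, using the formula as quoted in the text:
 * latency + cycles_per_point * X * Y * Theta. */
uint64_t cycles_per_tile(uint64_t latency, uint64_t cycles_per_point,
                         uint64_t x, uint64_t y, uint64_t theta)
{
    assert(x > 0 && y > 0 && theta > 0);
    return latency + cycles_per_point * x * y * theta;
}
```

Calling cycles_per_tile(9, 9, 64, 64, 64) returns 2,359,305 cycles: 9 cycles of latency plus 9 × 262,144 for the 64 × 64 × 64 loop.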

Although FPGAs can implement floating-point units, these units gobble up FPGA resources rapidly and are best used sparingly, if at all. Algorithms that rely heavily on floating-point math must be converted to fixed-point. You can either use a "block floating-point" approach, or design the whole system via a fixed-point method. Then confirm the accuracy of a conversion by comparing the design’s output with that of the original full floating-point software implementation. In the Hough algorithm example, a fixed-point resolution of 14 bits + 7 bits gave exactly the same results as the full floating-point version.
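The conversion and the accuracy check can be prototyped in ordinary C before any hardware code is written. The sketch below is one possible approach under stated assumptions: it reads "14 bits + 7 bits" as 14 integer bits plus 7 fractional bits, which the text does not spell out, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAC_BITS 7                  /* assumption: the "7 bits" are fractional */
#define FIXED_ONE (1 << FRAC_BITS)   /* 1.0 in fixed-point form (128)           */

typedef int32_t fixed_t;             /* 14 + 7 bits fit with headroom to spare  */

/* Convert to fixed point, rounding to the nearest representable value. */
fixed_t to_fixed(double v)
{
    assert(v > -16384.0 && v < 16384.0);   /* must fit in 14 integer bits */
    double scaled = v * FIXED_ONE;
    return (fixed_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

double to_double(fixed_t f)
{
    return (double)f / FIXED_ONE;
}

/* Multiply in a wider type, then rescale (truncating toward zero). */
fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    return (fixed_t)(wide / FIXED_ONE);
}

/* Largest absolute difference between fixed-point and floating-point products. */
double max_product_error(const double *a, const double *b, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double exact = a[i] * b[i];
        double approx = to_double(fixed_mul(to_fixed(a[i]), to_fixed(b[i])));
        double err = approx > exact ? approx - exact : exact - approx;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

Running a representative data set through max_product_error() shows quickly whether the chosen resolution is adequate. If the error is too large, increase FRAC_BITS and repeat the comparison.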

Determining Resources

In the next step of the application design, you count the clock cycles for each part of the process. Generally, two or three operations can occur in each clock cycle. Next, determine the amount of FPGA resources needed to accommodate the code. You can partition or divide the code to run across multiple FPGAs for greater throughput. Solutions scale very easily. Simply plug in as many FPGAs as necessary (up to five per board) and the system will detect them automatically.

In the example project, the design was tile-based. The tiles were dispatched to, and then collected from, each FPGA in order (the logic for this was part of the code). One FPGA achieves an acceleration of 37:1, while 10 FPGAs (five FPGAs on each of two boards) bring a 370:1 acceleration.
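One simple reading of "in order" is a round-robin scheme, sketched below in C. This is an assumption made for illustration and not the project's actual dispatch logic; the function names are hypothetical. The second function models the ideal linear scaling implied by the 37:1 and 370:1 figures.

```c
#include <assert.h>

/* Round-robin dispatch: tile i goes to FPGA (i mod num_fpgas), and
 * results are collected from the FPGAs in the same order. */
int fpga_for_tile(int tile_index, int num_fpgas)
{
    assert(tile_index >= 0 && num_fpgas > 0);
    return tile_index % num_fpgas;
}

/* Estimated run time under ideal linear scaling across FPGAs. */
double estimated_seconds(double software_seconds, double per_fpga_speedup,
                         int num_fpgas)
{
    assert(per_fpga_speedup > 0.0 && num_fpgas > 0);
    return software_seconds / (per_fpga_speedup * num_fpgas);
}
```

With the original 12-minute (720-second) run, a 37:1 acceleration per FPGA, and 10 FPGAs, estimated_seconds(720.0, 37.0, 10) comes to roughly 1.95 seconds, in line with the approximately 2 seconds reported for the example.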

Coding the design is relatively straightforward, because it’s handled primarily in C (except for a few new functions that require specific Handel-C instructions). These new instructions include enhanced bit manipulation, parallel processing, macro procedures and expressions, arbitrary-width variables, FPGA memory interfacing, RAM and ROM types, signals (to represent wires in hardware), and channels (to communicate between parallel branches of code or across clock domains) (Fig. 4). A sample conversion between C and Handel-C is shown in the sidebar "Code Conversion."

Simulating The Environment

The next step sets up and operates the simulation environment to test and optimize the hardware code. The simulation environment provides full bit-true/cycle-true simulations and faithfully simulates the FPGA implementation. Output from the design can be compared with the original C software version to test for accuracy. The simulation also reports the actual operating speed when run on the FPGA processor. Often, it’s best to simulate a design piece by piece to help find problems, because the pieces can be integrated later to confirm the overall operation.
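The comparison against the original software output is easy to automate on the PC side. A minimal sketch in C follows, assuming both versions write their results into arrays of 32-bit integers; the function name and the data layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare simulated output against the reference C implementation.
 * Returns the index of the first differing sample, or n if the two
 * buffers are bit-identical. */
size_t first_mismatch(const int32_t *reference, const int32_t *simulated,
                      size_t n)
{
    assert(reference != NULL && simulated != NULL);
    for (size_t i = 0; i < n; i++) {
        if (reference[i] != simulated[i])
            return i;
    }
    return n;
}
```

Reporting the index of the first mismatch, rather than a simple pass/fail, helps when simulating piece by piece, because it points to the sample (and therefore the stage) where the hardware version first diverges.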

Further tuning can be accomplished during the simulation process. You can pipeline the algorithm to accept one input value and produce one output value per clock cycle. Alternatively, divide the processing into more parallel streams and continue until FPGA resource usage approaches 100%. Furthermore, it’s possible to identify the slowest spots in an algorithm during hardware compiles and then optimize them. For additional speed, you can partition the algorithm across multiple FPGAs, and even across multiple boards.

As with software, further tuning will provide more performance. However, fine-tuning could lead to a point of diminishing returns. It may be more cost-effective to simply add extra FPGAs. It’s not necessary to get the design fully optimal out of the gate, because you can perform rapid simulations and tune the design based on those results at any time. Once the simulation is good, you compile the design for hardware and activate the Data Streaming Manager (DSM) so that the data stream is routed to the FPGA processor board rather than to the simulator.