Compiler Leverages Automation Power Of CPU Core

If you're already on the configurable processor bandwagon, you know all about the beauty of having programmability after silicon implementation, the benefits of fixing bugs in software, and the freedom of implementing changing standards on-the-fly. With the recent introduction of its Xtensa LX CPU core, Tensilica raised the bar for configurable processors by lowering power and adding flexible-length instruction extensions (FLIX) to the architecture mix (see "FLIX Helps Low-Power CPU Flex Its Performance," Electronic Design, May 24, 2004, p. 42).

The Xtensa LX core, with its ability to be rapidly configured in a rainbow of flavors for applications from audio/video processing to packet processing, image processing, encryption, and many others, enables designers to start from an algorithmic description of functionality at a behavioral level. But the challenge has been, and remains, how to get from algorithm to hardware. That hurdle is made especially difficult when one considers the vast numbers of tradeoffs and evaluations that must be made along the way.

A few of the potential decisions: Do I want or need parallel instruction execution in my processor? If I do, should it be two-way, three-way, or four-way? How many pipelines? What will my decisions mean to my gate budget? Even more importantly, how many architectures will I have time to evaluate if I must hand-code Tensilica Instruction Extension (TIE) code from the original C/C++ source code I used for modeling? What will happen if, along the way, I have to modify that C/C++ code?

To answer all of these questions elegantly, Tensilica is dropping the other shoe related to the Xtensa LX core in the form of its XPRES compiler. XPRES gives designers rapid insight into all of the tradeoffs and an ability to go from fully automatic generation of TIE code to fully manual to anywhere in between. Starting from the original C/C++ algorithmic description, the compiler automatically creates a large number of potential sets of configurations, providing analysis along the way. Most critically, it does so without modifying the C/C++ code in any way. Next, the XPRES compiler's output is fed to Tensilica's Xtensa Processor Generator as is typically done in the Xtensa flow. The resulting optimized processor then runs the compiled C/C++ code (see the figure).

Avoiding changes to the C/C++ code has two key benefits. One, it sidesteps the pitfalls of some coprocessor synthesis methodologies in which post-silicon changes to the C code can result in an inability to exploit the hard-wired coprocessor. Two, semiconductor houses may have scores of software developers writing algorithms or creating modifications of the original algorithm. As long as the application is similar in nature, it will continue to gain the benefits of XPRES acceleration.

The compiler can evaluate millions of possible instruction combinations for the function at hand, applying several acceleration techniques. The designer chooses the best available speedup for a given target gate count. In addition, designers can opt to emphasize generic, non-specialized instructions if it's likely that software for the processor will stray somewhat from the original C/C++ source. Or, they can add highly specialized instructions for even greater performance enhancement if they've got some say over the end application code.
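The selection step, picking the best speedup that fits a gate budget, can be pictured with a small C sketch. The candidate record and the pick_best function are hypothetical names invented for illustration; they do not reflect any actual XPRES data format or interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one candidate configuration. The field names
   are illustrative only, not an XPRES output format. */
typedef struct {
    unsigned gates;    /* estimated added gate count */
    double   speedup;  /* estimated speedup over the base processor */
} candidate;

/* Return the index of the candidate with the highest speedup whose gate
   count fits within gate_budget, or -1 if no candidate fits. */
int pick_best(const candidate *c, size_t n, unsigned gate_budget)
{
    assert(c != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (c[i].gates > gate_budget)
            continue;                       /* over budget: skip */
        if (best < 0 || c[i].speedup > c[best].speedup)
            best = (int)i;                  /* best fit so far */
    }
    return best;
}
```

A real flow would weigh more than two numbers per candidate, but the sketch captures the basic tradeoff of speedup against gate count.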

The process of going from algorithm to Xtensa LX hardware begins with compiling the C/C++ source code using Tensilica's Xtensa C compiler with instrumentation for profiling. Tensilica's profiling tools generate a database that tells the XPRES compiler which portions of the code are performance-critical and which aren't, as well as which inner loops will be executed millions of times and which will not. The C code and profiling data are fed to the XPRES compiler.

With guidance from the designer, the XPRES compiler evaluates all possible combinations of custom instructions and presents a range of possible sets of TIE descriptions, given a gate budget. At this point, the designer can decide whether to invoke specialized techniques and manually refine the configuration, adding custom I/O ports or other bells and whistles. From there, the resulting TIE file drops into the normal Tensilica flow.

Users have a great deal of manual control over the optimizations. It's easy to click and explore regions of the C code, and a control panel lists optimization options.

The specialized techniques used by XPRES fall into three main categories: fusion, SIMD/vector, and FLIX. Fusion takes two operations, such as a multiply and an add or an add feeding a shift, and merges them together. XPRES looks for dependencies in operands and will propose fusions where it deems them appropriate. If fused operations prove too slow, they can be rejected or manually scheduled.
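To make fusion concrete, here is a minimal C sketch of the two patterns mentioned above. The functions fused_mul_add and fused_add_shift are hypothetical stand-ins for what a single fused instruction would compute; they are not TIE code or actual XPRES output, and the example assumes values small enough that the 32-bit accumulator does not overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a fused multiply-add instruction:
   the multiply and the dependent add happen in one operation. */
int32_t fused_mul_add(int32_t acc, int16_t x, int16_t y)
{
    return acc + (int32_t)x * y;
}

/* Hypothetical stand-in for an add feeding a shift, fused into one operation. */
uint32_t fused_add_shift(uint32_t x, uint32_t y, unsigned s)
{
    assert(s < 32);
    return (x + y) >> s;
}

/* Unfused form: each iteration issues a multiply, then a dependent add. */
int32_t dot_unfused(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t prod = (int32_t)a[i] * b[i];  /* operation 1: multiply */
        acc = acc + prod;                     /* operation 2: add uses prod */
    }
    return acc;
}

/* Fused form: the same loop with one operation per iteration. */
int32_t dot_fused(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = fused_mul_add(acc, a[i], b[i]);
    return acc;
}
```

The dependency that makes fusion possible is visible in dot_unfused: the add consumes the multiply's result directly, so the pair can be merged without changing the answer.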

An SIMD operation is a vectorized operation. By its nature, an SIMD instruction replicates the execution hardware within the execution pipeline. For example, a 16-bit multiply can be implemented as an eight-way operation to give you a 128-bit register feeding eight parallel 16-bit multipliers. XPRES analyzes all possible permutations (one-wide, two-wide, four-wide, etc.) and reports how much acceleration each level of hardware cost would deliver. It would be prohibitive in terms of time and cost for the designer to explore all of these options manually.
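The eight-way 16-bit multiply can be modeled in plain C as a rough illustration. The vec8x16 type and the functions below are hypothetical names that emulate the behavior in software; on real hardware the eight lane multiplies would execute in parallel rather than in a loop. The sketch assumes unsigned lanes that keep the low 16 bits of each product.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Eight 16-bit lanes model one 128-bit vector register (8 x 16 = 128). */
typedef struct { uint16_t lane[8]; } vec8x16;

/* Models one eight-way multiply: all lanes belong to the same operation,
   and each keeps the low 16 bits of its product. */
vec8x16 vec8_mul(vec8x16 a, vec8x16 b)
{
    vec8x16 r;
    for (int i = 0; i < 8; i++)
        r.lane[i] = (uint16_t)((uint32_t)a.lane[i] * b.lane[i]);
    return r;
}

/* Scalar baseline. Returns the number of multiply operations issued. */
size_t mul_scalar(uint16_t *out, const uint16_t *a, const uint16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint16_t)((uint32_t)a[i] * b[i]);
    return n;
}

/* Eight-way version; n must be a multiple of 8.
   Returns the number of vector multiply operations issued. */
size_t mul_vec8(uint16_t *out, const uint16_t *a, const uint16_t *b, size_t n)
{
    assert(n % 8 == 0);
    size_t issued = 0;
    for (size_t i = 0; i < n; i += 8) {
        vec8x16 va, vb, vr;
        memcpy(va.lane, a + i, sizeof va.lane);   /* vector load */
        memcpy(vb.lane, b + i, sizeof vb.lane);   /* vector load */
        vr = vec8_mul(va, vb);
        memcpy(out + i, vr.lane, sizeof vr.lane); /* vector store */
        issued++;
    }
    return issued;
}
```

For n elements, the scalar loop issues n multiplies while the eight-way version issues n/8, which is the kind of cost-versus-width comparison the compiler reports for each vector width.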

The third option, FLIX, is completely independent of SIMD parallelism. Any functional operation, whether scalar or an SIMD instruction, can be one of the operations in a FLIX instruction packet. For instance, one could have a four-way SIMD vector adder and a four-way SIMD vector multiplier in two separate execution units with two separate sets of register files and issue them both in the same FLIX instruction. The Xtensa LX processor supports both 32- and 64-bit wide FLIX instructions.
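The adder-plus-multiplier example can be sketched in C as a software model of one instruction packet with two slots. The vec4x16 type and flix_add_mul function are hypothetical and purely illustrative; they do not represent the FLIX encoding or any Tensilica interface. The model assumes unsigned 16-bit lanes that keep the low 16 bits of each result.

```c
#include <assert.h>
#include <stdint.h>

/* Four 16-bit lanes model one four-way SIMD register. */
typedef struct { uint16_t lane[4]; } vec4x16;

/* Hypothetical model of one packet with two independent slots:
   slot 0 is a four-way vector add, slot 1 is a four-way vector multiply.
   Both slots read their operands before either result is written back,
   mimicking two operations issued together. */
void flix_add_mul(vec4x16 *sum,  const vec4x16 *a, const vec4x16 *b,
                  vec4x16 *prod, const vec4x16 *c, const vec4x16 *d)
{
    assert(sum && a && b && prod && c && d);
    vec4x16 s, p;
    for (int i = 0; i < 4; i++) {
        s.lane[i] = (uint16_t)(a->lane[i] + b->lane[i]);            /* slot 0 */
        p.lane[i] = (uint16_t)((uint32_t)c->lane[i] * d->lane[i]);  /* slot 1 */
    }
    *sum  = s;  /* write-back for slot 0 */
    *prod = p;  /* write-back for slot 1 */
}
```

Because the add and the multiply use separate operands and separate destinations, neither depends on the other, which is what allows them to share a single instruction.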

It's important to understand what the XPRES compiler doesn't do. It won't replace an experienced system architect and make decisions to automatically add designer-defined ports or queues. If the designer adds ports, the compiler will account for them and create instructions to use them. It won't try to tackle system partitioning, nor will it find all of the algorithmic tricks that a TIE expert might come up with. However, it will quickly perform all of the obvious optimizations, leaving the designer with more time to apply his expertise to the few remaining hot spots.

What kind of acceleration can a designer expect from the XPRES compiler? Depending on the function, designers will see a range of speedups (see the table). With a simple piece of code, such as a DSP-type kernel, which has less code and lots of latent parallelism, it achieves a 10.53x speedup. A larger function, such as an MPEG-4 encoder, is less amenable to acceleration but still gains a respectable threefold boost in speed. Basically, the smaller and "loopier" your C code is, the better XPRES will do in terms of acceleration.

Tensilica disclosed that the impressive scores (171.6 at 300 MHz) achieved by the Xtensa LX processor in "out-of-the-box" benchmarks run by the Embedded Microprocessor Benchmark Consortium came via use of XPRES compiler acceleration. In those benchmarks, Xtensa LX was shown to be 600% faster than the Xtensa V core and almost nine times faster than the ARM1020E processor.

The XPRES compiler is an add-on option for Xtensa LX processor customers and costs $100,000 per year. It will be available in September.

Tensilica Inc., www.tensilica.com, (408) 986-8000