Electronic Design

Boost Performance By Vectorizing Your DSP Software

Whether you do it manually or use a special compiler, vectorization can speed up code by as much as 200%.

The high-performance computing arena has an insatiable appetite for speed. The more processing power semiconductor vendors deliver, the harder developers try to squeeze out every drop of it. This demand is an upward spiral that keeps new products launching into the market at unprecedented rates. It's tempting to give up on that last drop of performance and simply buy the next big thing. But for those willing to try, a long-standing but often misunderstood technique called vectorization can increase DSP application performance by as much as 200%.

Vector libraries have been around since computers were first used for scientific computations. Initially, they were produced as a convenience. Programmers didn't have to rewrite and debug simple functions every time they needed to do a vector-oriented computation. As processors became more powerful and complicated, effort went into optimizing these libraries by writing them in assembly language.

Today, processors aren't just complicated. With multiple execution units, superscalar pipelined architectures, VLIW, vector-processing units, and other tricks, it's nearly impossible for mortals to take full advantage of them. As a result, an increasing number of companies are creating optimized libraries. General-purpose processors are now so fast that the performance of many applications is limited by memory bandwidth, not processor speed. Advanced cache architectures attempt to ease this bottleneck. But proper vectorization can be the key to a cost-effective solution.

What Is Vectorization?
Many common algorithms are optimized by the process of vectorization. It maximizes processor utilization and memory bandwidth. There are two basic ways to vectorize an application. The most commonly used approach is for an engineer to write a vector library. The functions may be written in a high-level language, such as C or Fortran, or written in assembly language for a particular processor. The application programmer then hand vectorizes the application by calling these vector-library routines.

The other approach is for the compiler to perform automatic vectorization. A few compilers are able to recognize loops in the program as vector functions. They can automatically convert these loops to calls to a vector library. Or, they can directly use hardware vector functionality.

A wide range of applications, but especially DSP ones, can be optimized through vectorization to enable additional performance gains. This range includes basically all signal- and image-processing applications. They're characterized by having their data stored in contiguous memory locations as vectors or arrays. Storing it in this form makes it possible to optimize common algorithms, maximizing processor utilization and memory bandwidth.

Several reasons exist to hand vectorize an application. The possible performance gains certainly count. Accuracy and reliability gains also may be achieved. Some algorithms require careful analysis by a mathematician in order to obtain satisfactory results.

Another reason to hand vectorize is that the code becomes more self-documenting. A subroutine call, in which the name of the subroutine describes the function with the arguments clearly specified, is easier to understand than trying to decode what a loop is doing.
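The contrast can be seen in a minimal sketch. The `vmul` routine below is hypothetical (the article doesn't name a specific library), but it shows the idea: the call site documents itself, while the equivalent inline loop must be decoded by the reader.

```c
#include <stddef.h>

/* Hypothetical vector-library routine: a[i] = b[i] * c[i].
 * The name and argument order document the operation; a real
 * library version would be hand-optimized for the target CPU. */
static void vmul(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}
```

The application programmer then writes `vmul(result, x, y, len);` instead of an open-coded loop, and any future optimization of `vmul` benefits every call site.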

Memory-Performance Problem
Modern compilers do a very good job of scalar optimization. Some even do some advanced loop optimizations, like loop unrolling. Unfortunately, though, they don't have any real understanding of the memory architecture and what's involved in optimizing for memory bandwidth.

Take a simple function like a vector multiply as an example. In C, a vector multiply would look like:

for (i = 0; i < length; i++)
{
  a[i] = b[i] * c[i];
}

If the source and destination vectors are in cache, a good C compiler will generate code that's pretty efficient for this function. But if the vectors are in memory, the scalar compiler will actually generate code with the worst possible utilization of memory bandwidth. Understanding this requires knowledge of DRAM behavior.

DRAMs have two types of sequences to access their contents. If the current access is in the same page as the previous one, the memory location can be selected very rapidly. The data can potentially be accessed without wait states. But if a page change is required, the page must be selected before the location within the page is accessed. Changing pages takes several clocks and induces wait states.

Looking at the vector-multiply example again, it's seldom the case that the three vectors are in the same page. As Figure 1 shows, the typical sequence of functions performed is to open the page containing b(0). Next, load the cache line containing b(0) into the L1 cache. Then, open the page containing c(0) and load its cache line into the L1 cache. A multiply is performed. But before the result can be stored, the page where a(0) will be kept must be opened. That cache line must be loaded into the L1 cache. Finally, the location in L1 cache representing a(0) is modified, with the remainder of that cache line being left unmodified. In processor terms, this has all taken a long time because a lot of wait states were required to get the data out of memory.

The processor can now proceed to compute a(1). This should happen very quickly, because b(1) and c(1) were probably fetched into L1 cache as a byproduct of the previous computation. The processor will rapidly finish with these cache lines, however, and go through the process of loading additional cache lines, again with lots of wait states. The time consumed by the wait states can be twice the time used to actually transfer data. This process continues until the entire computation is performed.

If the performance of this calculation is to be improved, the wait states accessing the data from memory must be eliminated. Achieving this demands changing the order in which the operations are performed (Fig. 2). If the processor were to prefetch all of the data in the b vector at once, loading it into the L1 cache, the memory accesses would tend to be in the same page. With a good memory controller and processor, these accesses can then be performed with few, and perhaps even no, wait states. The data will stream into the processor at the maximum bandwidth possible from the memory.
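The reordering can be sketched in C using the GCC/Clang `__builtin_prefetch` hint (an assumption here; the article doesn't tie the technique to a particular compiler). Each source vector is swept first, so its cache-line fills hit the same DRAM page back to back, before any arithmetic begins. The stride of 16 floats assumes a 64-byte cache line.

```c
#include <stddef.h>

/* Vector multiply with the source vectors prefetched up front, as
 * described above. __builtin_prefetch(addr, rw, locality) is a
 * GCC/Clang hint that compiles to nothing on targets without
 * prefetch support, so results are identical either way. */
static void vmul_prefetch(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i += 16)          /* one hint per cache line */
        __builtin_prefetch(&b[i], 0 /* read */, 3 /* keep resident */);
    for (size_t i = 0; i < n; i += 16)
        __builtin_prefetch(&c[i], 0, 3);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}
```

This only pays off when all three vectors fit in L1 at once; the strip-mining discussion below handles the larger case.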

The next step in processing this function is to access all of the c vector, loading it into the L1 cache as well. Room must be made for the result vector. Some processors provide a mechanism that performs this function without using any memory bandwidth at all. The final step is to actually do the calculation.

The overall result is a function that performs faster than the original scalar code, as long as the source and destination vectors all fit into L1 cache simultaneously. If they don't, the cache will start thrashing. For example, if each individual vector is larger than the cache, loading the c vector will discard all of the b vector from cache, and making room for the result vector will then discard the c vector. By the time the computation is being performed, the processor's behavior will revert back to the scalar case. All of the work involved in attempting to prefetch the data ends up being overhead without any benefit. But an additional optimization can deal with this issue.

If the source and destination vectors don't fit into cache, the problem must be broken down into pieces. To achieve maximum performance, the pieces have to fit into cache. This technique, called "strip mining," is illustrated in Figure 3.

The process involves prefetching a portion of the two source vectors and making room for a portion of the destination vector. The computation is performed and the result is flushed to memory. The next portions of the input and output vectors are then set up in cache. Again, the computation takes place and the result gets flushed. This continues until the entire computation has been performed.

Do Some Strip Mining
The vector portions are called "strips," hence the name "strip mining." Ideally, the size of the strips is computed so that they just fit into the L1 cache. Obviously, the bookkeeping must be done carefully. If the strip size is too large, the strips will overflow the cache, inducing memory-bandwidth overhead. If they're too small, additional overhead will come from managing the smaller strips. The pointers into the vectors also must be managed correctly to produce the right result with minimum overhead.
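A strip-mined vector multiply might look like the sketch below. The strip size of 1024 floats is an illustrative assumption; in practice it would be computed from the actual L1 size so that three strips (two sources, one destination) fit at once.

```c
#include <stddef.h>

#define STRIP 1024  /* assumed strip length; tune so 3 strips fit in L1 */

/* Strip-mined vector multiply: the outer loop walks the vectors in
 * cache-sized strips, so the inner loop always runs out of L1
 * rather than DRAM. The final strip may be shorter than STRIP. */
static void vmul_strip(float *a, const float *b, const float *c, size_t n)
{
    for (size_t s = 0; s < n; s += STRIP) {
        size_t len = (n - s < STRIP) ? n - s : STRIP;
        for (size_t i = 0; i < len; i++)
            a[s + i] = b[s + i] * c[s + i];
    }
}
```

Note the bookkeeping the article warns about: the `s + i` indexing and the short-final-strip test are exactly where off-by-one errors creep in.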

Additional optimizations should be considered when performing a sequence or chain of vector computations. If intermediate results can stay in the cache, they don't take any memory bandwidth at all when used in the later computation. If a vector can somehow get marked as a temporary when it won't be used elsewhere in the program, it's also possible to avoid using memory bandwidth to store it back into memory by invalidating those cache lines.
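The simplest form of chaining is to fuse two vector operations into one pass, a sketch of the idea rather than any particular library's mechanism. Here the intermediate product of `b` and `c` never exists as a vector in memory at all, so it costs no memory bandwidth:

```c
#include <stddef.h>

/* Chained multiply-add: conceptually t = b * c followed by
 * a = t + d, fused so the intermediate t stays in a register
 * instead of being written out and re-read. */
static void vmul_add(float *a, const float *b, const float *c,
                     const float *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i] + d[i];
}
```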

It's difficult, if not impossible, for a run-time library to know how to perform strip-mining and function-chaining optimizations in all cases. Some library vendors have provided additional arguments to their library functions to enable the application programmer to specify optimizations. For example, an argument can be provided to indicate whether a given vector should be in memory or can be assumed to be in L1 cache. If a vector is indicated to be in L1 cache, it doesn't need to be prefetched.
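Such a hint argument might look like the following sketch. The `enum vhint` flags and the `vmul_hint` routine are hypothetical illustrations of the kind of API the article describes, not any vendor's actual interface; the prefetch uses the GCC/Clang `__builtin_prefetch` builtin.

```c
#include <stddef.h>

/* Hypothetical residency hints: the caller asserts where each
 * source operand already lives, so the library can skip the
 * prefetch sweep for data known to be in L1. */
enum vhint { V_IN_MEMORY, V_IN_CACHE };

static void vmul_hint(float *a, const float *b, enum vhint b_hint,
                      const float *c, enum vhint c_hint, size_t n)
{
    if (b_hint == V_IN_MEMORY)                 /* prefetch only if needed */
        for (size_t i = 0; i < n; i += 16)
            __builtin_prefetch(&b[i], 0, 3);
    if (c_hint == V_IN_MEMORY)
        for (size_t i = 0; i < n; i += 16)
            __builtin_prefetch(&c[i], 0, 3);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}
```

A caller chaining two operations would pass `V_IN_CACHE` for the intermediate result, since it was just produced.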

Very efficient programs can then be written, but the user must manage the strip mining manually. The application becomes architecture-dependent. If the cache size changes, or an additional cache is introduced (L2 or L3), the optimization techniques change as well. Even upgrading to a later product within the same product family can require modifying the application to get the best performance.

A few companies have produced compilers that can automatically vectorize applications and target their proprietary hardware architectures. Cray (now SGI) was the first. Digital Equipment (now COMPAQ) and SKY Computers also have produced vectorizing compilers. Pacific-Sierra Research has more generic vectorizing technology. These compilers can significantly ease the task. They're able to identify program loops and convert them to appropriate calls to vector libraries.

SKY's compiler goes further in that it's able to perform a number of memory-bandwidth optimizations, including strip mining. It also can perform more global optimizations, even for programs using vector libraries. The compiler can eliminate vector loads to cache when not needed, discard vector flushes to memory for the same reason, and do automatic strip mining. The catch is that the user needs SKY's hardware to take advantage of these capabilities.

Dramatic Improvements Possible
The bottom line is that to gain the best performance for vector-oriented applications, the final executable must be optimized not only for the processor architecture, but also for the memory architecture. The optimizations required are well understood. Strip mining and careful cache manipulation can make dramatic improvements in application performance.

It is possible to write generally useful library routines in such a way that the user is able to take advantage of these optimizations. But an advanced vectorizing compiler will dramatically reduce the amount of effort required to achieve the desired performance level. It also will enable the resulting application to be easily ported to future generations of hardware. With the rapid introduction of hardware products today, minimizing and preserving the software investment is critical to program success.
