Optimizing Code, The SHARC Vs. The Minnow (Part II): The SHARC's View

Switching to a DSP offers many advantages over developing signal processing code for legacy algorithms.

Oct. 16, 2000

20 min read

This article is the second of a two-part series. Part I appeared in the Sept. 18 issue.—ED

Techniques of adding DSP capability to an existing CISC system were discussed in the first part of this article. 68k code for part of a frequency analysis system was developed. A procedure that will generate both instantaneous and average power of a complex-valued array is shown in Listing 1.

It was revealed that with this DSP code developed on the Software Development Systems' 68k environment, the compiler could be persuaded to produce code almost as efficiently as hand optimization. A key statement is "the compiler could be persuaded." The developer had to rewrite the code in the format that's shown in Listing 2. Explicit pointer operations are used to force the compiler to generate the faster auto-incrementing addressing modes available on the 68k processor. Other speed improvements include the evaluation of constant expressions not recognized by the compiler optimizer, and the introduction of a faster "down-counting" form of the loop.

Changing code writing from the explicit indexing into an array (Listing 1) to pointer operations (Listing 2) to gain performance isn't particularly onerous. Depending on the application, however, it may not be sufficient. In this part of the article, we discuss advantages that could arise from using a processor customized for DSP operations. Examples are taken for code generated for the Analog Devices ADSP-21061 SHARC processor using the White Mountain VisualDSP development environment. Techniques for increasing the employment of parallel operations to produce faster performance are discussed too.

A primary consideration in adding new material to 68k legacy code isn't one of speed, but rather if the code will work. Typically, the decision to switch to a DSP processor is associated with speed. But, the developer should also ask whether other advantages will result from switching.

A comparison of the time that it takes to execute a program on any processor can be obtained from the formula:

Execution time
= (instructions/program)
* (average cycles/instruction)
* (time/cycle)

Updating from a 16-bit older CISC processor to a newer 32-bit DSP processor allows a switch to more recent technology. This leads to a decrease in execution time as the time/cycle gets smaller, especially with faster implementation of the multiplication logic.

Switching to a 32-bit DSP offers other advantages. The wider data bus means no penalties are associated with implementing 32-bit data operations. DSP algorithms involve repeated multiplications and additions which can quickly lead to the overflow of a 16-bit number representation and inaccurate results. Many 32-bit processors have pipelined floating-point (FP) operations. The pipelining means that there are no speed penalties when using FP operations instead of integer operations. FP algorithms are easier to design as the automatic renormalization of FP numbers removes the problems associated with number overflow. The designer, however, should realize that a 32-bit FP operation has roughly the same precision as a 24-bit integer operation. See "Are you damaging your data through a lack of bit cushions?" by M. Smith and L.E. Turner, which will be published in the December edition of Circuit Cellar.

A 32-bit DSP offers other advantages. The wider data bus allows wider instructions to be fetched, which improves performance by decreasing cycles/instruction. Plus, the wider data bus could provide enough bits to allow the description of parallel operations to decrease the necessary number of overall executed instructions.

Particularly useful characteristics of the DSP are the presence of a hardware loop, hardware circular buffers, and even zero-overhead bit-reverse addressing that's useful during FFT operations. Other features available with the more recent DSPs include alternate register banks (faster interrupt handling) and large on-board data and instruction caches.

The SHARC processor has a single-cycle instruction where three memory accesses (two data and one instruction), two memory address adjustments, and an FP multiplication can be performed at the same time as parallel addition and subtraction operations. That provides for a 4000% improvement over the 68k processor without even changing the clock speed! The problem then is how to code so potential speed improvement will be true speed improvement.

Customization Variables In Registers As with the 68k processor, the first step in customizing a SHARC routine for speed is to move frequently used variables into registers rather than leaving them in external memory. This is particularly important because the SHARC has a basic LOAD/STORE architecture, which doesn't support the direct memory-to-memory operations available on a CISC processor.

The stages of establishing a stack frame by the VisualDSP 68k compiler are shown in Listing 3. Even this simple task uncovers some of the basic differences architecturally between the 68k CISC and 21k DSP architecture. There's no 21k equivalent to the complex MOVE Multiple operation that describes the storage of many nonvolatile registers to memory in a single instruction. Instead, each register is individually moved to the 68k stack. This difference eats up more program space on the SHARC than the equivalent 68k ROM space. But, there's no speed penalty as the SHARC memory access is more efficient.

Note the two-stage operation necessary to store the SHARC index registers, I0, I1, I2, and the single-stage operation for storing the data registers, R3, R6, etc. This is a consequence of the data address generators (DAG) block on the SHARC. The DAGs are separate ALUs dedicated to address calculations that can occur in parallel with the standard data COMPUTE operations.

Two DAGs are needed because the SHARC can parallel three data operations—one along the "data" data bus and another along the "program" data bus. The third is from the "instruction-cache" data bus controlled by the program sequencer logic unit. The architecture to support the multiple data accesses means that there isn't a direct path for the registers of a DAG to be saved into the memory block controlled by that specific DAG. The lack of data path isn't critical, as such operations aren't frequently required.

The SHARC provides a hardware stack for high-speed subroutine entry and return. This stack is small and incapable of handling the extensive stack usage needed to support C, though. As with the 68k processor, one of the SHARC index registers (I6) is set aside as a frame pointer to allow efficient handling of stack frame operations. A second index register (I7) acts as a C-TOP-OF-STACK pointer to play the software equivalent of the 68k hardware stack pointer.

Special SHARC instructions that aren't listed in the index of the standard USER manual are available to support 68k stack operations. For more information about the logistics of coding 68k subroutines on the SHARC, see the online article, "SHARC in the 68k," at www.chipcenter.com/circuitcellar/april00/c0400ms1.htm.

The first three parameters are passed on the SHARC via registers, rather than through the memory-based 68k stack as commonly occurs with 68k code generation. This method isn't always as efficient as it sounds. The SHARC architecture doesn't support general-purpose registers. So, the pointer parameters of Power(), although passed in data registers (R4, R8, and R12), must then be moved into index registers before use.

The fourth parameter, Npts, was passed 68k fashion on the stack. With today's processors that have on-chip memory, there's really no significant time penalty using either register or memory stack parameter passing conventions. The only exception would be the time lost passing parameters during subroutine calls embedded in a frequently used loop. Then it's a compiler instead of an architectural problem, as such situations are best handled using inline subroutines to avoid subroutine overhead.

If you are feeling in an old-fashioned mood, it's possible to program the SHARC as if it were a 68k-style processor. This approach can be found in Listing 4. Code was developed for the Power() subroutine using the White Mountain VisualDSP compiler with all optimization deactivated. This SHARC code will run much faster than the equivalent 68k code. Because the code doesn't take differences between the two architectures into account, though, there's much hidden inefficiency.

The for-loop code is converted into the while-loop format rather than the more efficient do-while format. On a 68k processor, a small penalty exists for using one format of loop over another. Because of the high degree of pipelining present in the SHARC architecture, every jump instruction has a series of hidden NOPs associated with it as the three-stage instruction pipeline is flushed.

Both the 68k and 21k processors have instructions that support direct indexing through an array. The 68k has an indirect addressing mode where an index register (A0) can be used to point to the start of an array, and a data register (D0) is used to step through the array elements.

MOVE.L (A0, D0), D1

The equivalent 21k instruction is the premodify operation

R1 = dm(M4, I4)

where the SHARC index and modify registers (I4 and M4) play the same role as the 68k address and data index registers.

The 68k D0 register can be efficiently used as both an array index and as a loop counter. The modify register, M4, however, forms part of the SHARC DAG block. This block permits direct mathematical operations on the index registers but not on the modify registers. The loop counter must be transferred into the modify register before each memory access.

Other interesting SHARC architectural features are revealed by the code in Listing 4. The single-cycle SHARC multiplication operation must be specified with the signed-signed-integer (SSI) syntax as the SHARC supports both integer and fractional number representations. The SHARC supports three register operations instead of the 68k two-register instructions. Finally, all short integer operations are 32 bits wide (rather than 16) with no time penalty. Due to the 32-bit-wide data operations, the SHARC DSP algorithm is less likely to overflow the short integer number representations.

The SHARC processor doesn't have the inefficient, multicycle division operation that's present on the 68k processor. In fact the SHARC doesn't have a division instructor at all! Division is handled through a subroutine based around an 8-bit approximation generated from a RECIPR instruction. A peculiarity of the VisualDSP 68k compiler is that it generates specific subroutine calls when asked to perform integer division, but inline codes the equivalent FP operations. This is strange because the subroutine overhead (around 10 cycles) to call and return from the division code is as large as the division code itself.

Many of the coding inefficiencies disappear when the 68k code has been rewritten using pointer arithmetic (Listing 2, again) rather than array indexing (Listing 1, again). Part of the VisualDSP 68k-model is to use a dedicated specific modify register (dm_one in Listing 5) to store constants that are associated with basic array handling.

But with the more efficient code, the inefficiencies associated with the loop control become more important. We can define loop efficiency as

Useful cycles in loop
/total cycles in loop

The loop efficiency drops from 75% for the code in Listing 4 to around 50% for the faster loop in Listing 5 as the overhead for handling loop construct starts to overwhelm the smaller body of useful loop instructions. The key is to turn on the optimizer and activate the SHARC's specialized hardware loops for 100% efficiency.

Shown in Listing 6 are the changes in code efficiency on activating the optimizer when compiling SHARC 68k code involving explicit array indexing (Listing 1, again). First, the loop efficiency has jumped from 75% to 100% with the activation of the SHARC processor's zero-overhead hardware loop capability.

Even more important, the speed of the key operations in the loop has jumped by almost 100% as the number of instructions dropped from 15 to 8. The optimizer built into the VisualDSP compiler has recognized the simple mode of indexing through the arrays and switched to the more efficient post-modify addressing mode associated with incrementing a pointer through an array. Note that the optimizer didn't recognize the common expression

re_power + im_power

in the code, which is why the loop body in Listing 6 is eight cycles compared to seven cycles for the loop body of Listing 5 where the developer had incorporated the common expression optimization directly into the 68k code.

The 68k code of Listing 2 was developed using the incrementing pointer form of array access. Activating the optimizer in-creases the loop efficiency from 50% to 100% with the introduction of hardware loops. The number of instructions in the loop body itself, though, remains unchanged.

Some interesting characteristics of the SHARC ALU architecture are exposed at the beginning of Listing 6. Prior to entering the hardware loop, there's a test on the number of points, Npts, to be processed. There's a 21k COMP instruction equivalent to the 68k CMP instruction to set the ALU flags prior to a conditional jump. In this code, however, the specialized ALU COMPUTE operation PASS is used to set the flags. Unlike the 68k register-to-register MOVE instructions, SHARC operations of the form

I6 = R4

don't change the ALU flag, so the condition set by the PASS operation is unchanged until needed by the (later) conditional jump.

The PASS operation offers several advantages over the COMP instruction. In the first place, it's faster because the SHARC LOAD/STORE architecture only permits comparison between registers, and not between registers and constants.

R10 = PASS R10; // Set Flags

but

R11 = 0;  // Build Constant
COMP(R10, R11); // Set Flags

The second advantage is the use of the PASS instruction when invoking the SHARC parallel instruction capability. An insufficient number of bits are available in the SHARC's 48-bit op code to describe the generalized movement of any two of the many SHARC registers at the same time, as in the instruction

R4 = R5, R6 = R7;
     // ILLEGAL

There are, though, sufficient op-code bits to describe a combination of a more specific COMPUTE instruction operating in with a more general register operation or memory access operation

R4 = PASS R5, R6 = R7

R4 = PASS R5, R6 = dm(I4,M4);

Later in this article, it will be shown how the ability to parallel PASS instructions with other operations is a key feature to obtaining maximum code efficiency.

What about trying to persuade the compiler to take advantage of other SHARC speed features? The SHARC architecture has been built for efficient DSP algorithm production. An attempt to activate these features directly from code written in 68k is shown in Listing 7.

The code employs FP variables rather than integer variables. FP variables automatically renormalize instead of overflowing the number representation. This simplifies code development. No time penalties are associated with this approach because SHARC FP operations are as fast as SHARC integer operations. Still, the developer should remember the point made earlier that the 32-bit FP number representation has the same precision as a 24-bit integer representation.

The SHARC architecture supports many parallel operations involving the multiplier and adder. The loop has been "unrolled" to "encourage" the compiler to use these parallel operations. Optimization could include paralleling the additions in the first part of the loop (green) with multiplications from the second part of the loop (red). This particular optimization will only work if the number of points to be processed, Npts, is even. This isn't a real limitation for most DSP algorithms, but the implications will be discussed later.

Parallel multiple memory accesses along a "program" memory data bus (pm) and a "data" memory data bus (dm) are supported by the SHARC architecture. Clashes between data and instruction fetches along the pm data bus are avoided by storing instructions in a cache. Only instructions that would conflict with pm data fetches are stored—permitting a small instruction cache to be efficiently used.

In order to "encourage" the compiler to activate this feature during code generation, the complex-number array has its real components, real, placed in a data memory data location and its imaginary components, imag, placed in a program memory data location. This is achieved through an extension of the 68k language where parameters of the form

float dm *real;
float pm *imag

are passed. The arrays must be declared as global variables

float dm real\[Npts\];
float pm imag\[Npts\];

because the standard 68k auto variables would only be placed on the 68k stack, which is built entirely in data memory.

The optimized code generated by the VisualDSP 68k compiler from Listing 7 is shown in Listing 8. For clarity, the instructions associated with the first and second parts of the unrolled loop have been color coded.

Using the syntax float dm * and float pm * to describe the subroutine pointers has resulted in code where the FP data values are brought in along both of the SHARC data busses

R13 = dm(I0, dm_one);
R3  = pm(I8, pm_one);

Some interesting architectural features are revealed in these two instructions. Once again, integer values appear to be accessed from memory as the values are stored in integer registers (R3 and R13) rather than the expected FP registers (F3 and F13). On the SHARC, each data register can be used for storage of either FP or integer values. The FP characteristics reside in the ALU instead of in the bit pattern stored in memory or in a register.

The two memory banks can be accessed in parallel only when index and modify registers from DAG1 are used to access data memory and when index registers from DAG2 are used to access program memory. The compiler has prepared for this optimizing by generating code using the appropriate DAG1 registers, I0 and M6 (dm_one = +1) together with DAG2 registers I8 and M14 (pm_one = +1). Hand optimization is required to produce the parallel operations

R13 = dm(I0,dm_one),R3=pm(I8,pm_one);

in order to remove two cycles from the 14-cycle loop execution time. The total loop-cycle count can be further reduced to 10 by paralleling addition and memory store operations implementing the R13 (F13) register

     F10=F10+F13, dm(I1,dm_one)=R13

Further savings can be obtained as shown in Listing 9. The dual memory fetches to dm, and pm memory from the second part of the loop (red) can be overlapped with the initial multiplication in the first part of the loop (green). The paralleling of these instructions, though, introduces data dependencies between the use of F3 and F13 in the first and second parts of the loop. This dependency is broken by changing the compiler code to use registers F1 and F4. It will become apparent soon why these particular registers were chosen rather than F4 and F14.

In order to save the final two cycles, it's necessary to generate the two addition operations from the first part of the loop in parallel with the multiplications in the second part of the loop. These operations could be made to occur in conjunction with memory accesses to form parallel instructions involving half of the SHARC data registers in a single instruction

Fa = Fb * Fc, Fd = Fe + Ff,
Fg = dm(Iu,Mv), Fh = pm(Ix,My);

There wouldn't be enough room in the 48-bit-wide SHARC op code if this instruction were general enough to allow the use of any data registers in any position. Restrictions are placed on certain registers employed to reduce the number of bits necessary to describe the operation.

Multiplication registers
Fb is one of F0, F1, F2 or F3
Fc is one of F4, F5, F6 or F7
Addition registers
Fe is one of F8, F9, F10 or F11
Ff is one of F12, F13, F14 or F15

Allowing these parallel operations to be programmed requires some forethought. If all registers could be used during parallel operations, then the sequence:

R1=dm(I0,dm_one), R4=pm(I8,pm_pm_one);
F9 = F1 * F1, F13 = F8 + F12;
F14 = F4 * F4, F10 = F10+ F13;

could be performed. The op-code bit limitation means that the multiplication operation of the form F1 * F1 or F4 * F4 can't be described in parallel with addition operations. But, the operations F1 * F5 (with F5=F1) and F2 * F4 (with F2=F4) can be described. It takes only a little ingenuity to place equal values in two registers without adding additional cycles to the loop. Equality can be achieved by dual accesses to memory.

R1=dm(I0,dm_zero;)

combined with R5 = dm(I0, dm_one);

or register to register transfers

R1 = dm(I0, dm_one)

combined with F5 = F1; or F5 = PASS F1;

With the on-board SHARC memory, there isn't the penalty in doing a memory-to-register transfer rather than a register-to-register transfer that there would be on the 68k processor. Which one of the three approaches is taken depends on the parallel operations that are available for employment without penalty in other parts of the loop.

We have seen that it's a fairly straightforward procedure to reduce the 14-cycle loop that's produced by the optimizing compiler to seven cycles. But what's the theoretical maximum speed of this loop, and how can this speed be achieved in practice? The resource usage for the instructions required to calculate the instantaneous and average power can be seen in Figure 1. Registers have been assigned to allow parallelization of a number of these calculations.

Maximum speed is achieved when a resource is used to a maximum. There are two cycles, needed for additions, multiplications, and program memory operations. In theory, if the data memory accesses could be ignored, the loop cycle count could be reduced from seven cycles per power calculation to just two. This optimum coding sequence is shown in Figure 2, with a number of power calculations occurring in parallel—unrolling the loop.

The simplest way to avoid the dm data-memory-access conflict is to move two of these accesses to a separate line. This would produce code where it would take the average of three cycles to produce the instantaneous and average power levels of the complex-valued arrays. It's possible, however, to duplicate the register values by using an additional memory access combined with a COMPUTE operation. This technique is demonstrated in Figure 3. There, the final code "rerolled" back into a loop can be viewed.

Note the stages of the optimized algorithm. First, there's a series of instructions used to "prime" the computational pipeline prior to the loop. Then come the operations within the loop itself. Finally, there are instructions used to "empty" the computational pipeline. For this particular set of code, it also proved possible to account for situations when Npts was odd, something that wasn't straightforward to optimize with the original 68k code.

In this second part of a two-part series, we briefly looked at the SHARC—Analog Devices' ADSP-2106x series of DSP processors. We compared the characteristics of the SHARC with a simple CISC processor, the Motorola 68k, whose equivalent can be frequently found in embedded systems. We compared these two processors to show why, even with equal clock speeds, the SHARC had the capability to outperform the CISC processor by around 4000%. This was a theoretical speed improvement rather than an actual improvement, though.

It was shown that with the optimizer activated, the White Mountain Visual-DSP development environment generated 2106x assembly language code with a fair degree of speed. By hand optimization of "unrolled" code, it was fairly straightforward to activate the parallel operations available with the SHARC architecture to improve the code speed by another 200%. A detailed analysis of the code at the "resource" level revealed techniques that allowed a further 300% speed improvement. The final code has a speed that was close to the theoretical maximum processor speed.

Editor's note: The author has been in contact with Analog Devices' David Levine, who is developing an optimizing compiler that generates tight code for the TigerSHARC, which effectively has two 21k CPUs on one chip. The author was offered the chance to try out the beta release of the new compiler. He has also been "told" he needs to go back to school to learn about some alternative optimizing technologies that are not so compiler-specific. So if you found these articles useful, stay tuned for further updates on optimizing compilers.