Optimizing Code, The SHARC Versus The Minnow (Part I): The Minnow's View

This article is the first of a two-part series. Part II is scheduled to appear in the Oct. 16 issue.—ED

Not long after arriving at a new job, most developers begin hoping that their new employers will switch them from the initial bug-fixing mode. Developers anticipate the fun and pleasure involved in being asked to add a significant amount of new functionality to an existing embedded system. Then, the standard dilemma arises. Adding to code in the current system will mean that legacy code and existing hardware layouts can be reused. This reduces development cost, and the revised system will pass through the door in short order. But will the new system have the required performance?

The cost side of embedded-system development means that a processor with classic CISC-processor characteristics powers many systems. The processor is nothing fancy. It's just something with about the capabilities of a 68K microcontroller, which is something that can comfortably handle the current requirements with a minimum of fuss and cost. Soon, an upgrade is needed!

I consulted on an upgrade to a handheld system initially designed to sample pressure and volume characteristics of a reciprocating gas compressor. The customer wanted to add a fast Fourier transform algorithm to provide immediate frequency information about unwanted vibrations within the compressor without waiting to analyze the data back at the base. As a consultant, I was asked if the new functionality could be added to the existing system, or if a co-processor/new system were necessary.

This two-part article examines the design decisions required when answering questions about placing DSP algorithms on an existing CISC processor system. A typical DSP algorithm with its loops, various memory accesses, additions, and multiplications is shown in Listing 1. This algorithm calculates the instantaneous and average power associated with the complex-valued frequency components from some measurement.

This article addresses the kind of performance that can be obtained from a standard CISC processor with limited DSP capability. It also discusses whether or not there's a need to code directly in assembly code for maximum performance, or if an optimizing compiler will speed development. The second article will address what DSP CPU features offer the maximum performance advantage for different algorithms, how to code for speed—hand code or DSP compiler—and if techniques exist to really maximize the speed of today's highly pipelined processors.

The first consideration in adding new material to legacy code shouldn't be about speed, but rather, will the code work? Though not obvious, several design decisions have already been made which could mean that the DSP algorithm given in Listing 1 won't necessarily work. Even the design decision to use 16-bit precision for 8-bit data values may not be sound. One major characteristic of most DSP algorithms is a loop, along with multiple memory accesses and repeated multiplications and additions. It doesn't take a very large loop before operations on 8-bit values might overflow the 16-bit number representation. The two key DSP-related problems are in the words might overflow and the not very obvious fact that "C doesn't care."

Take, for example, a simple DSP algorithm designed to provide the average of a long sequence of 8-bit values. The addition of a very long series of alternating large positive (0x7F) and large negative (0x80) values from a high-frequency input would cause no problem. Consider, however, a situation where the data stream consists of the same values, but with a low-frequency signal so that the input comes in bursts of 512 similar values. The first 256 large, 8-bit negative values (0x80 in 8 bits stored as 0xFF80 in 16 bits) give the result of 0x8000. This result can be stored in 16 bits. Working with 512 negative values, though, requires the addition of the 16-bit values 0x8000 and 0x8000. The result can't be stored in 16 bits, but requires 17. The sum of the 512 input values is stored incorrectly as 0x0000 with an overflow flag set.

Some processors do have specific signed and and unsigned addition operations, although the 68K processor doesn't. The MIPS processor has ADDU (unsigned) and ADDS (signed), and the AMD 29K processors have ADD, ADDU, and ADDS. The ADDS instruction is designed to throw an exception when overflow occurs within a number representation. The exception handler must then correct the problem. For the averaging problem, the algorithm might be repeated with all of the input values scaled down by a factor of 2, and the final average scaled up to compensate.

Do you remember ever writing an arithmetic exception handler when developing "C" code? You probably didn't, as typical "C" compilers make use of the standard, nonexception-throwing ADD instruction. The typical MIPS compiler adds insult to injury by using the unsigned ADDU operation even when signed (short int) values are being handled. This means that the "C"-compiler-generated code simply ignores the overflow problem.

This hidden overflow defect is the reason why many standard "C" algorithms sometimes fail in an unpredictable fashion. It's particularly a problem with DSP algorithms. DSP processors have an architecture where there's a 50- to 80-bit accumulator register for use in conjunction with the multiplier to reduce overflow problems. But typically, there isn't a "C" syntax extension that lets the programmer specify access of this high-precision register.

To simplify this article, we shall assume that the input signals considered won't cause any overflows. Readers interested in a tutorial and techniques to minimize the overflow problem, and the related quantizing errors, while maximizing code speed can refer to "Are you damaging your data through a lack of bit cushions?" by M. Smith and L.E. Turner, to be published on the web at www.circuitcellar.com/.

Usually, DSP algorithms manipulate many variables. The first step for customization is placing the most commonly used values into registers rather than continually accessing the values that are stored in slow memory on a stack.

The establishment of a 68K stack frame for the DSP code in Listing 1 using the Software Development Systems (SDS) C/C++ compiler can be seen in Listing 2. The standard convention is that certain registers are set aside for use with local variables within a procedure. These volatile registers can be used freely within the routine without their contents being saved to external memory. Other registers are assumed to contain critical values necessary after this code returns to the calling routine. The contents of these nonvolatile registers must be saved to the stack before the registers can be used within a subroutine, and they are recovered when the routine exits.

Three address registers are needed as pointers for the arrays. Two pointers are placed in volatile registers A0 and A1, and the third is put in the nonvolatile register A2. Storage of data values requires two volatile registers (D0 and D1) and three nonvolatile registers (D2 to D4). Provided that the DSP loop isn't too small, the stack operations won't be a large percentage of the overall algorithm cycle count. In time-critical applications, however, every cycle count and efficient register usage must be coded in.

The code for the main loop of the DSP Power algorithm may be found in Listing 3. The "for loop" has been converted into a "while loop" as the 68K processor architecture doesn't have the zero overhead hardware loop found with many DSP processors. (But, see comments after Listing 5.)

The memory operation:

  re_power = real\[count\] * real\[count\];

requires double access to external memory. Yet considerable speed can be achieved if the developer recognizes that this double access involves a common expression—access to the same memory location. The value fetched from memory is registered, or temporarily stored, in the variable re_power (D3). More speed can be obtained if the code developer recognizes other common expressions within the algorithm and registers those as temporary variables.

Registering temporary and commonly used variables during a DSP algorithm raises two issues, especially when handled automatically by the compiler. The first issue is whether enough registers exist to handle all of the necessary temporary storage. If there aren't enough, which variables should be placed in the available registers for best code speed with the available resources? The answer to this question is both application- and processor-dependent, which is the reason that compilers have a wide range of available optimizations.

The second question is application-dependent. Should a value fetched from memory be placed in a register for later reuse? It makes sense to register the values obtained from memory for the DSP algorithm shown in Listing 1. Yet this isn't always the case. Consider an algorithm designed to provide an average of 32 values read from a memory-mapped analog-to-digital converter (ADC). An optimizing compiler might recognize that the 32 accesses are to the same memory location. The produced code will only access the ADC memory location once and reuse the value—not what was intended. This problem can be handled by employing the volatile "C" keyword to indicate memory locations that have to be continually accessed rather than registered.

The speed of an instruction depends on both processor and instruction type. For example, the operation setting a register to 0 on a 68K processor can be handled in one of two ways:

  MOVE.L #0, D0
  MOVEQ.L #0, D0

The first instruction requires memory accesses to fetch the instruction and further accesses to get the constant value 0 stored in memory. By contrast, the MOVEQ instruction just needs fetching from memory as the constant 0 forms part of the instruction itself. This sort of time-saving associated with the use of a specialized instruction is particularly important when the instruction is part of a loop.

The developer must be cognizant of the limitations of any specialized instruction to avoid wasting development time. For instance, take note of the following code sequence:

  MOVE.L #127, D0
  MOVE.L #-128, D1
    MOVE.L #128, D2

The first and second instructions can be optimized by implementing a MOVEQ instruction. Time is just wasted, though, if the developer has to wait to be informed by the assembler that:

  MOVEQ.L #128, D2

is invalid code. There just aren't enough bits available in the 68K MOVEQ op code to handle numbers outside of the −128 to +127 range. Some assemblers, such as the SDS 68K assembler, help the developer to develop fast but error-free code. The developer can employ a consistent process involving only the use of the MOVE.L instruction with the assembler automatically converting to the faster MOVEQ.L instructions when possible.

The majority of the DSP CPU memory addressing modes take a similar number of cycles. That isn't the case for the CISC processor. The relative efficiency of the wide range of available addressing modes on the 68K is very application-dependent.

Use of an indirect indexed addressing mode via an instruction of the form:

  MOVE.W (real, count), re_power

is made in Listing 3. This mode allows a direct translation of the array indexing employed in the "C" code of Listing 1. Indexed addressing makes sense when the indexing needed during the DSP algorithm is complex.

Many, if not most, DSP applications on CISC processors are better handled when memory is accessed by indirect autoindexing operations:

  MOVE.W (real)+, re_power

These instructions involve manipulating a pointer into an array instead of manipulating an offset from a fixed pointer to the start of an array. The auto-incrementing modes implement special architectural features for a faster adjustment of the address—a sort of ADDQ type of instruction for address registers. Through a separate data address generator (DAG), which is essentially an address-specific ALU that can function in parallel with the standard data ALU, many DSP processors are able to avoid this complication.

The appropriate choice of instructions also is important during loop control. The "C" for loop has been translated into a while loop with four components in Listing 3—one useful and three just overhead for the DSP application. These include initialization of the loop variable count prior to the start of the loop; checking the range of count each time the beginning of the loop is executed; the loop code itself that performs the DSP operations; and every time the end of the loop is encountered, the loop variable is incremented and the program flow is interrupted by a jump back to the start of the loop.

If the DSP loop code involves many operations, then the loop overhead costs are relatively small, and unimportant if time isn't critical in the application. When time is critical, several speed-improving techniques can be invoked, such as fast relative-branch instructions rather than slower absolute-jump instructions:

  BRA LOOPSTART
  JMP LOOPSTART

Branch instructions involve accessing the memory for the instruction and a 16-bit offset. Jump instructions are slower, as they involve a 32-bit address which must be fetched from memory.

Converting the while loop:

 MOVEQ.L #0, count
LOOP: CMP.W Npts, count
 BGE ENDLOOP

 LOOP BODY

 ADDQ.L #1, count
 JMP LOOP
ENDLOOP:

to the more efficient do-while loop:

 MOVEW.L #0, count
 BRA LOOPTEST
LOOP:

 LOOP BODY

         ADDQ.L #1, count
		 
LOOPTEST: CMP.W Npts, count
 BLT LOOP

The do-while loop is faster because each loop involves fetching only one branch instruction.

Another option is to use a downcounting loop variable rather than up-counting the loop variable. Further time savings are possible when the code can be arranged in a form so that indexing into the array can be handled in the order from N−1 to 0 instead of from 0 to N−1. This allows the decrementing of the counter at the end of the loop followed by a test for zero. This is faster than incrementing a counter and using a specific comparison instruction to set a flag for a conditional branch. The code incorporating the special instructions, indirect auto-incrementing addressing modes, and downcounting do-while loops is demonstrated in Listing 4.

For this power-related example, modifying the loop for downcounting was straightforward as memory indexing could be made independent of the loop counter using the auto-incrementing addressing mode. Still, things aren't necessarily that convenient in all DSP algorithms. Often, the developer must exhibit considerable ingenuity. The optimization techniques, however, are fairly straightforward, suggesting that an "intelligent" compiler could be developed to handle the job more effectively, and with fewer defects inserted than hand coding

Watts Humphrey's The Discipline Of Software Engineering, Second Edition, from Addision Wesley, suggests that all developers should have a Personal Software Process. This is basically a polite way of saying that developers should become aware of the stupid things that they do the most frequently. My personal worst habit is associated with downcounting. Is Listing 4 correct with a BGE performed with the initial loop variable set to Npts, or should the initial value be Npts-1?

My lack of confidence stems from many years of developing downcounting code on the AMD 29K processor, which allows use of a fast jump if false and decrement instruction. The JMPFDEC instruction decrements after testing, and tests for the Boolean Bit stored in the counter register's sign bit position. This instruction requires that the initial value for the loop be set at Npts−2.

The output of the SDS "C" compiler for source Listing 1 is shown in Listing 5. With the optimizer turned on, the code generated has many common features with our first hand-coding shown in Listing 3. There are, however, considerable differences.

Both our hand code and the compiler code use indirect indexed addressing for accessing the memory. But we directly used the loop counter:

  MOVE.W (real, count), re_power

whereas the compiler-generated code was equivalent to:

  MOVE.W count, temp
  EXT.L temp
    ADD.L temp, temp
	  MOVE.L temp, offset
  MOVE.W 0(real,offset.L), re_power

Oops! The compiler remembered that word values are stored two bytes apart in memory, but the developer didn't remember! The compiler also used offset.L rather than offset. Does the extension of the count variable to a long value mean that Listing 3 might only work because a MOVEQ.L was used to set all 32 bits of count to zero and count was only incremented?

We can be a little smug and say that the code should have been faster as:

  MOVE.W count, offset
  EXT.L offset
    ADD.L offset, offse
	  MOVE.W 0(real,offset.L), re_power

but basically this developer appears to be in the embarrassing situation of Compiler 3, Developer 0. It's 3 to 0 and not 2 to 0 because this developer would have used an LSR instruction to convert words to bytes (shift by 1 bit to multiply by 2), which is slower than the ADD instruction. There are some other things in the compiler-generated code that need further exploring. What are the implications of the compiler using the unsigned MULU operation rather than MULS? Another issue is why the compiler didn't automatically generate the more efficient do-while form of the for loop. The hand-generated code optimized the variable Npts into a register. The compiler, though, generated code to continually fetch this variable from the parameter stack, meaning an increased loop overhead.

The reason that the compiler left Npts on the stack is that all of the available eight data registers were in use by the compiler. The SDS optimizer wasn't sufficiently sophisticated to remove redundant code and free up the registers D0 and D5.

The optimizer recognized the common expression associated with multiple access to the same memory location. But it didn't recognize the other common expressions inside of the loop.

To be fair to this developer, the errors introduced into Listing 3 arose by a change in the development process associated with using unfamiliar index operations. The developer's standard process employs auto-incrementing addressing modes where offsets are automatically handled! Nevertheless, the experience still leads to the same object lesson. It makes sense to work with the compiler. Optimizing, or checking against, the output of the compiler offers a number of advantages to the tired, overworked developer.

It was revealed that writing "C" code using explicit array indexing, and turning on the SDS compiler optimizer, doesn't generate assembly language instructions using the faster indirect auto-incrementing addressing mode (Listing 5). The "C" code must instead be rewritten to explicitly use pointer incrementing operations, before the compiler output uses the faster instructions (Listing 6). The automatic generation of the faster do-while form of the for loop, use of the faster auto-incrementing indirect addressing mode activated by writing "C" code using a specific process, and use of registers to common expressions activated by format of "C" code is demonstrated by Listing 7 from the SDS compiler.

Over the last few years, SDS has joined with Diab-Data, which has in the last couple of months begun operating under the banner of White River. Recently, I was given the opportunity to compare the SDS compiler with the Diab-Data 4.3f compiler.

The assembly code generated by the Diab-Data compiler for Listing 1 is very different from that generated by the SDS compiler. With the option, no optimizations activated, the code generated for every memory access has the form:

Move starting address into register
Move loop counter into register
Change loop counter into array offset
Add offset to register to form address
Access memory

The nonoptimized code generated makes use of the simplest instructions, and to say that it's grossly inefficient is being polite. Still, these low-level instructions are far easier to optimize than the more complex instructions that are generated by the SDS compiler.

With the optimizing option activated, the Diab-Data compiler will automatically generate code from Listing 1 to use the fast auto-incrementing instructions without the necessity of "persuasion" by rewriting the "C" code in the form of Listing 6 with explicit indexed addressing. The optimizer also placed the loop variable into a register.

The "C"-code format given to the Diab-Data compiler, though, still affects the speed of the optimized code. The optimizer switched to the faster "downcounting" loop when the "C" code used explicit indexed addressing rather than direct array indexing.

Was It Worth The Effort? You, the reader, have probably spent 30 minutes reading this article, and perhaps it would take another couple of hours to become familiar enough to comfortably apply the hand-code optimizing techniques. If the new code is only a small part of the overall program code, is it worth spending the 200 minutes of your valuable development time? A comparison of the time it takes to execute a program on any processor can be obtained from the formula:

Execution time
= (instructions/program)

(average cycles/instruction)

(time/cycle)

Developing the optimized code involves minimizing the product of each of these ratios, not just minimizing one. The time/cycle ratio is a fixed fact for all listings executed on one processor. The total cycle time for a number of commonly used 68K instructions using four-cycle external memory is illustrated by Table 1. Faster access times are available when internal memory or caches are taken into account. But the overall DSP algorithm time may become undeterministic when cached values have to be written back to external memory. Choosing instructions involving fewer clock cycles per instruction might require additional instructions to be added, negating the speed gain.

A comparison of the execution times for the loops in the listing is given in Table 2. In conclusion, if you modify the process of writing your "C" code using pointers to access memory, then on a basic 8-MHz CISC processor with characteristics of a 68K, the run time of the main body of the code is essentially the same whether the code is hand- or compiler-generated. The optimizing would become more worthwhile if a higher-speed hardware multiplier were available. In that case, the time to handle the loop and memory operations becomes more critical.

Further hand-optimization of the SDS compiler-generated code of Listing 7 is possible with regard to loop overhead. The downcounting loop using registered variables scores over the compiler-generated code. If the loop body were larger, however, then the effect of the loop overhead would become insignificant. The difference between hand- and compiler-generated loops effectively disappears with the Diab-Data compiler.

If your original embedded system needs the occasional nontime-critical DSP-code section, just go with the optimized compiler output, written with an improved format to activate the time- saving instructions. On the other hand, if your DSP algorithm must process the last block of data before the new block overwrites it, then each cycle counts, and knowing every last optimization technique is important.

One technique to fine-tune your "68K optimization skills" is to take a look at the output from various optimizing compilers. For example, the Diab-Data compiler has three options: Standard Optimization, Aggressive Optimization, and Feedback Optimization.

One of the optimizing options will perform multiple passes through the code to recognize further speed improvements made possible by earlier optimizations. This makes the tradeoff of a little increased compile time against future run-time savings.

The Feedback Optimization option looks like it's worth determining whether it's just "hype" for your application, or if it produces something really useful. Paraphrasing the literature, the integrated run-time analysis tools make use of code profiling to "suggest" to the optimizer which of the various optimizing techniques for speeding the code is the most appropriate for "your" embedded application.

In the next article, "The SHARC's Byte," we will look at this same DSP algorithm implemented on Analog Devices' ADSP-21061 SHARC processor. The change from the CISC 68K to the 21K "super" Harvard architecture, with what are effectively superscalar RISC-like instructions, raises a whole new series of optimization issues.