Debugging Optimized Code
If you have ever tried to debug optimized code, you probably realized that it can be a frustrating experience. Without optimizations, your debugger is a reliable assistant, precisely following every command and accurately displaying every value you ask of it. However, the moment you turn on compiler optimizations, this assistant turns into a little prankster; most things appear to work, but on occasion, instructions are executed out of order, breakpoints are not hit consistently, and variables' values are unavailable or incorrect.
What is going on? After all, you just want to get the best performance and debug accuracy out of your tools. Unfortunately, many optimizations produce assembly that is impossible to map directly to your source code, making these two goals mutually exclusive. Instead, you must focus on finding a balance between them. This paper explains how optimizations impede debugging accuracy, describes strategies for setting up build configurations, and outlines a procedure to help you find a balance between debug accuracy, size, and speed that works for your project.
Impossible Problem
To illustrate the difficulty of the problem, let's assume that you are concerned about program size, and you want the compiler to make it as small as possible. So, you turn on size optimizations in the compiler.
Consider the debugger session in Fig. 1. The debugger is stopped in a conditional block where variable a should be positive, but the value shown in the smaller window is negative. Is the debugger wrong?
- The program is stopped at line 14 of compute_nonzero() in a conditional block.
- The condition is a > 0. Therefore, at the start of the else if block, a should be positive.
- a is not modified anywhere inside of the conditional block, and its address is never taken. By line 14, it should still be positive.
- The Data Explorer window shows that the value of a is -3.
How did this happen? Surely, the debugger must be wrong. Let's look closely at the code the compiler generated:
The original C statements are shown flush left (bold in the original layout); the assembly generated for each follows, indented:

if (a < 0) {
            cmp     reg3, 0             # compare a to 0
            bge     LABEL1              # if a >= 0, jump to LABEL1
    compute_neg(a, &result);
            add     reg4, sp, off       # place &result into reg4
            call    compute_neg         # call 'compute_neg'
    cached = result;
            b       LABEL_tail          # instead of setting 'cached',
                                        # jump to tail area
} else if (a > 0) {
    compute_pos(a, &result);
LABEL1:     cmp     reg3, 0             # compare a to 0
            ble     LABEL2              # if a <= 0, jump to LABEL2
            add     reg4, sp, off       # place &result into reg4
            call    compute_pos         # call 'compute_pos'
    cached = result;
LABEL_tail:
            load    rtmp, [off]         # load 'result' into rtmp
            store   rtmp, cached        # store rtmp into 'cached'
            b       LABEL_end           # jump out
} else {
    /* . . . */

Now the mystery unravels. To save space, the compiler noticed that both branches end with the identical statement cached = result and merged those tails into a single copy at LABEL_tail. The debug information maps LABEL_tail to line 14, the cached = result inside the else if block, but LABEL_tail is also reached from the a < 0 branch. The debugger stopped there on an iteration that came through compute_neg, where a really is -3. The debugger is not wrong; the size optimization simply made two different source paths share the same instructions.
Lost In Translation

Consider a simple C loop:

1:      int i = 0;
2:      while (i < len) {
3:          int tmp = brr[i];
4:          arr[i] = tmp;
5:          i++;
6:      }
It's simple enough; a loop copies one array to another. The compiled, slightly optimized code in POWER-based "pseudo assembly" would be fairly easy to follow, because there is a direct mapping between the human and machine paradigms:
        load    reg1, 0             # load value 0 into index register
        branch  end_loop            # branch to end of loop
start_loop:
        mul     reg2, reg1, 4       # multiply reg1 by 4 to get offset
        load    reg3, brr+reg2      # load value from 'brr+reg2'
        store   reg3, arr+reg2      # store value into 'arr+reg2'
        add     reg1, reg1, 1       # increment index register
end_loop:
        cmp     reg1, len           # compare index to 'len'
        blt     start_loop          # branch if less than to 'start_loop'
The following sections explain how various optimizations translate this simple C code into the machine paradigm without maintaining this mapping.
Scheduling Optimizations
The code above is far from optimal. Take a closer look at the load and store instructions. A processor core can execute instructions many times faster than it can access memory. So, when it tries to load a value from memory into a register, the value will not be available until the memory access catches up. Even if the value is cached, most modern high-speed processors have several levels of caches, which cause delays.
This is where the compiler optimizer steps in. It notices that the loop has the same effect if you swap the store and add instructions. The end result is that the memory access now has one extra instruction to catch up, and hopefully will not stall the processor:
        . . .
        load    reg3, brr+reg2      # load value from 'brr+reg2'
        add     reg1, reg1, 1       # first increment index register
        store   reg3, arr+reg2      # then store value into 'arr+reg2'
        . . .
The optimizer scheduled the instructions better. If the loop used to take seven cycles for each iteration (two cycles for store and one for all other instructions), this change reduces it to six (with store taking up one cycle): a 14% speed improvement!
But notice that in the original C code, the value of i is incremented only after the assignment to arr[i]. In the scheduled code, the increment executes first: step through the loop and the debugger appears to visit line 5 before line 4, and by the time you stop on line 4's store, i has already changed. The scheduler changed the order of execution, and there is no longer a direct mapping between the human and machine paradigms.
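To see why stepping appears scrambled, it can help to write the scheduled instruction stream back as C. The fragment below is a hypothetical back-translation for illustration only (the compiler reorders instructions, not source lines); old_i is an invented name for the index value the scheduler keeps alive in a register:

    int i = 0;
    while (i < len) {
        int tmp = brr[i];       /* line 3: the load */
        int old_i = i;          /* the old index stays live in a register */
        i++;                    /* line 5: the add now runs here */
        arr[old_i] = tmp;       /* line 4: the store runs last */
    }

Single-stepping such code visits lines 3, 5, 4 in that order, even though the program's behavior is unchanged.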
Loop Unrolling Optimizations
An even more aggressive loop optimization is known as loop unrolling. In a typical pipelined processor, the execution of a conditional branch could cause the processor pipeline to get flushed. The goal of the loop unroller is to reduce the number of conditional branches by duplicating the body of the loop several times (typically four or eight):
        loop prologue               # set index, jump to a 'start_loop' label
start_loop1:
        loop body                   # same body as before
start_loop2:
        loop body                   # same body as before
start_loop3:
        loop body                   # same body as before
start_loop4:
        loop body                   # same body as before
end_loop:
        cmp     reg1, len           # compare index to 'len'
        blt     start_loop1         # branch if less than to 'start_loop1'
In this version of the code, the "compare and branch" instructions execute only a quarter as often as in the original loop (one branch per four copies instead of one per copy), and straight-line execution is friendlier to your processor pipeline, providing a huge speed boost.
On the other hand, we have just lost the direct mapping from the human paradigm to the machine paradigm. Notice that a single source line is now tied to four distinct machine instructions. If you set a breakpoint on that source line, one could argue that the debugger should set breakpoints on all four instructions. However, in most cases loop unrolling is a prelude to additional optimizations, like the instruction scheduler. The newly formed large block of straight-line code becomes fertile ground for more speed gains, at the cost of blurring the lines between the original loop bodies. The generated assembly can get so convoluted that debuggers typically don't allow you to set breakpoints on the source lines inside the loop. If you wanted to step through it, you'd have to debug in assembly.

In addition to affecting your ability to debug, this optimization also increases your program size. By turning it on, you are prioritizing optimization over ease of debugging, and speed over size.
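At the source level, the effect is roughly what you would get by unrolling the copy loop by hand. The sketch below only illustrates the idea, using arr, brr, and len from the earlier example; a real unroller works on the compiled form and handles many more edge cases:

    int i = 0;
    /* main unrolled loop: one compare-and-branch per four copies */
    while (i + 4 <= len) {
        arr[i]     = brr[i];
        arr[i + 1] = brr[i + 1];
        arr[i + 2] = brr[i + 2];
        arr[i + 3] = brr[i + 3];
        i += 4;
    }
    /* remainder loop: handles the last zero to three elements */
    while (i < len) {
        arr[i] = brr[i];
        i++;
    }

Each original source line now has four separate copies in the main loop, which is why a single breakpoint no longer has one obvious home.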
Common Sub-expression Elimination
Common sub-expression elimination (CSE) is an optimization that increases speed and reduces size by finding expressions (or parts thereof) that are evaluated multiple times and, where possible, generating temporary variables to cache the results. For example:
int func(int a, int b) {
    int i;
    int j;

    i = x[a+b];          /* 'a+b' used as index */
    /* code not writing to 'a' or 'b' */
    j = i + y[a+b];      /* 'a+b' again used as index */
    /* code not accessing 'i' */
    return j;
}
Might be optimized into:
int func(int a, int b) {
    int i;
    int j;
    int _index;               /* new temp variable created */

    i = x[(_index=a+b)];      /* cache 'a+b' into '_index' */
    /* code not writing to 'a' or 'b' */
    i += y[_index];           /* use same location for 'i' and 'j' */
    /* code not accessing 'i' */
    return i;                 /* return 'i' instead of 'j' */
}
The code above is faster and smaller because:
- a+b only needs to be evaluated one time.
- It is not necessary to allocate or copy any values into j.
But again, we have lost the direct mapping from the human paradigm to the machine paradigm. Because i and j share the same location, you can only view one of them at a time. The debugger would display the other as "unavailable." Also, some debuggers let you change the values of variables "on the fly." If you stopped on j = i + y[a+b] and changed the value of a in the debugger, it would have no effect, because the program is using the cached result in _index. This would be a problem if you suspected an off-by-one error and wanted to increment the offset in the debugger without having to recompile your program.
This optimization affects the accuracy of presented data and takes away the ability to reliably change the state of the program while debugging.
The Light at the End of the Tunnel
These examples have shown that some optimizations alter information when translating code from the human paradigm to the machine paradigm, making it impossible for the debugger to translate it back. Everything up to this point has treated optimizations as black or white: an optimization is either turned on or off. But imagine if you could turn on an optimization only partially, tuned to be as aggressive as possible while still maintaining debugging accuracy. Sure, the generated code would not be the best available. But it would be better than unoptimized code, and still debuggable.
Such an optimization setting is not merely a point on the debug accuracy vs. optimization axis. It is the "pivot point" before which you know that debugging will be reliable, and past which you know it won't be. Having such a setting is crucial in determining how to build and optimize your program.
Appendix A lists the full scope of optimization strategies available with Green Hills compilers, including the "pivot point" settings.
Appendix B lists the benchmark results for the size and speed of various optimizer settings with Green Hills compilers.
Appendix A
The Green Hills compiler provides the following optimization strategies to help you "find the balance". Along with each option, the equivalent GNU compiler option is listed:
- -Omaxdebug
This setting has the highest debug ease and accuracy. It does not enable any optimizations, and it goes out of its way to make your program more debuggable. It does not inline any routines for any reason. Compiler processing time is moderately fast.
This level is recommended for the debug configuration of the Dual Build Model.
Closest GNU option(s): -O0 -fno-inline
- -Omoredebug [default]
Start with this option when searching for the lightest optimization setting that meets your size and speed requirements. Debugging is guaranteed to be accurate: you can set a breakpoint on any executable statement, local variables remain live throughout their scopes, and some optimizations are enabled. Compiler processing time is moderately fast.
This level is appropriate for the debug configuration of the Dual Build Model.
Closest GNU option(s): no equivalent option. -O0 likely has faster build time, but does no optimizations; -O has optimizations, but could disrupt debugging
- -Odebug
Similar to -Omoredebug, but size and speed optimizations are both improved. You are not guaranteed breakpoints on all executable statements, and local variables may not be live throughout their scopes. The debugging is otherwise accurate. Compiler processing time is fast. This is the "pivot point" - more aggressive optimizations no longer guarantee that debugging will be accurate.
This level is appropriate for the Single Build Model and the debug configuration of the Dual Build Model.
Closest GNU option(s): no equivalent option. -O0 likely has faster build time, but does no optimizations; -O has optimizations, but could disrupt debugging
- -Ogeneral

The highest optimization setting before branching into size-specific and speed-specific optimizations. Debugging is possible but no longer accurate, and you won't reliably be able to change the state of the program in the debugger on the fly. This is a good setting if speed and size are equally important. Compiler processing time is moderate.

This level is appropriate for the Single Build Model (if you can live with less-than-perfect debugging) and the release configuration of the Dual Build Model.
Closest GNU option(s): -O
- -Osize

This level is similar to -Ogeneral, but it turns on aggressive size optimizations, sometimes at the expense of speed. Compiler processing time is moderate.
This level is appropriate for the release configuration of the Dual Build Model if you are prioritizing size, or for the Single Build Model if the debugging quality is acceptable. For the best results, enable it globally - size is global.
Closest GNU option(s): -Os
- -Ospeed

This level is similar to -Ogeneral, but it turns on aggressive speed optimizations, sometimes at the expense of size. Compiler processing time is moderate.
This level is appropriate for the release configuration of the Dual Build Model if you are prioritizing speed. For the best results, enable it locally - speed is local. Only enable it globally as a temporary measure to see if it's possible to meet your speed requirements using compiler optimizations. Used locally in combination with a global -Odebug level, this level may also be appropriate for the Single Build Model, because most code remains easy to debug.
Closest GNU option(s): -O2
- -Onone

Use this option if your main concern is build time. No optimizations are enabled. Compiler processing time is fastest.
Closest GNU option(s): -O0
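To make the build models concrete, here is a minimal sketch of a Dual Build Model using the GNU equivalents listed above. The file names and the choice to single out hot_paths.c are illustrative assumptions, not a prescription; the Green Hills options slot into the same structure:

    # Debug configuration: no optimizations, maximum debug accuracy
    gcc -g -O0 -fno-inline -c main.c util.c hot_paths.c
    gcc -g main.o util.o hot_paths.o -o app_debug

    # Release configuration: general optimization globally, aggressive
    # speed optimization only for the hot file ("speed is local")
    gcc -g -O  -c main.c util.c
    gcc -g -O2 -c hot_paths.c
    gcc -g main.o util.o hot_paths.o -o app_release

Keeping -g in both configurations costs nothing at run time; it only adds debug information to the build.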
Additional Optimization Options
With the Green Hills compilers, beyond optimization strategies, there are aggressive optimizations such as -Olink, -OI, and -Owholeprogram that you can enable to get the most out of your compiler (for instance, turn on linker optimizations, or intermodular inlining). You might consider enabling them for the release configuration in a Dual Build Model. There are also fine-tuning options that selectively turn on and off specific optimizations to help you prioritize between size and speed. For more information, see the Green Hills toolchain documentation on www.ghs.com.
With the GNU compilers, there are also additional optimization strategies. For instance, -O3 turns on the inliner for even faster performance. This type of optimization is best reserved for the release configuration of a Dual Build Model, since debugging would otherwise suffer. GNU also offers many fine-tuning options that enable or disable individual optimizations when the main levels need to be tweaked. For more information, refer to the GNU compiler documentation.
Additional Debugging Options
In the debug configuration of the Dual Build Model, you can turn on additional checks to catch bugs that would otherwise cause intermittent failures. These options make your code larger and slower, but that's not a problem because you do not ship builds from the debug configuration. Examples of options are:
- Run-time checking: includes checks for NULL dereferences, use of recently freed memory, freeing memory twice, array bounds violations, missing switch cases, etc. (see the sketch below)
- Profiling: instruments code to show executed code paths and how functions call each other.
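As an illustration, the hypothetical fragment below contains two of the bug classes listed above. Without run-time checking, both corrupt memory silently and tend to fail intermittently; a debug build with run-time checking enabled can flag each one at the offending line:

    #include <stdlib.h>

    int main(void) {
        int *arr = malloc(4 * sizeof(int));
        if (arr == NULL)
            return 1;

        for (int i = 0; i <= 4; i++)    /* BUG: writes arr[4], one past the end */
            arr[i] = i;

        free(arr);
        free(arr);                      /* BUG: frees the same block twice */
        return 0;
    }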
For more details, consult the Green Hills toolchain documentation.
Appendix B
This appendix provides benchmark results for Green Hills MULTI Compiler optimization strategies. A number of standard benchmarks were included, such as EEMBC, CoreMark, and Dhrystone, along with proprietary ones. All benchmarks were run on a PowerPC 440 based core and an ARM Cortex-A8 based core using Green Hills MULTI Compiler v5.2. Smaller percentage numbers are better.
Size Comparison
Optimization Setting | PowerPC Size vs. No Optimization Size | ARM Size vs. No Optimization Size |
-Omoredebug | 84.98% | 78.64% |
-Odebug | 82.12% | 73.99% |
-Ogeneral | 72.46% | 61.41% |
-Ospeed | 86.19% | 71.85% |
-Osize | 58.47% | 53.41% |
The results improve steadily along the main debug accuracy vs. optimization axis, and then fork. -Ogeneral sits in the middle of the size vs. speed axis; -Ospeed takes a step back in size (as expected), while -Osize produces the best results.
Speed Comparison
Optimization Setting | PowerPC Exec. Time vs. No Optimizations | ARM Exec. Time vs. No Optimizations |
-Omoredebug | 90.13% | 84.31% |
-Odebug | 86.62% | 76.97% |
-Ogeneral | 65.88% | 64.43% |
-Ospeed | 57.77% | 56.60% |
-Osize | 72.63% | 67.43% |
The results get better as we slide down the debug vs. optimization axis until we reach -Ogeneral. At that point, we can get even more impressive speed results with -Ospeed. However, at the opposite end of the size vs. speed axis, -Osize takes a step back by almost 7% on PowerPC and 3% on ARM, which is expected.
References
- "Using the GNU Compiler Collection: For GCC version 4.6.1" - manual for GCC, the GNU Compiler Collection
- "MULTI: Building Applications for Embedded PowerPC" - product documentation for Green Hills Software MULTI
- Green Hills Software