Electronic Design

# Debugging Optimized Code

Getting the best out of your tools

If you have ever tried to debug optimized code, you probably realized that it can be a frustrating experience. Without optimizations, your debugger is a reliable assistant, precisely following every command and accurately displaying every value you ask of it. However, the moment you turn on compiler optimizations, this assistant turns into a little prankster; most things appear to work, but on occasion, instructions are executed out of order, breakpoints are not hit consistently, and variables' values are unavailable or incorrect.

What is going on? After all, you just want to get the best performance and debug accuracy out of your tools. Unfortunately, many optimizations produce assembly that is impossible to map directly to your source code, making these two goals mutually exclusive. Instead, you must focus on finding a balance between them. This paper explains how optimizations impede debugging accuracy, provides strategies for setting up build configurations, and provides a procedure to help you figure out a balance between debug accuracy, size, and speed that works for your project.

## Impossible Problem

To illustrate the difficulty of the problem, let's assume that you are concerned about program size, and you want the compiler to make it as small as possible. So, you turn on size optimizations in the compiler.

Consider the debugger (Fig. 1). The debugger is stopped in a conditional block where variable a should be positive, but the value shown in the smaller window is negative. Is the debugger wrong?

• The program is stopped at line 14 of compute_nonzero() in a conditional block.
• The condition is a > 0. Therefore, at the start of the else if block, a should be positive.
• a is not modified anywhere inside of the conditional block, and its address is never taken. By line 14, it should still be positive.
• The Data Explorer window shows that the value of a is -3.

How did this happen? Surely, the debugger must be wrong. Let's look closely at the code the compiler generated:

The original C source lines are interleaved, unindented, with the generated assembly:

```
if (a < 0) {
        cmp   reg3, 0        # compare a to 0
        bge   LABEL1         # if a >= 0, jump to LABEL1
    compute_neg(a, &result);
        add   reg4, sp, off  # place &result into reg4
        call  compute_neg    # call 'compute_neg'
    cached = result;
        b     LABEL_tail     # instead of setting 'cached',
                             # jump to tail area
} else if (a > 0) {
LABEL1:
        cmp   reg3, 0        # compare a to 0
        ble   LABEL2         # if a <= 0, jump to LABEL2
    compute_pos(a, &result);
        add   reg4, sp, off  # place &result into reg4
        call  compute_pos    # call 'compute_pos'
    cached = result;
LABEL_tail:
        load  rtmp, [off]    # load 'result' into rtmp
        store rtmp, cached   # store rtmp into 'cached'
        b     LABEL_end      # jump out
} else {
    /* . . . */
```

Remember how you enabled size optimizations? In this case, the compiler realized that you had two cached=result source lines that could be collapsed into one, and performed an optimization called tail merge (also known as cross jump). Instead of generating the code twice (6 instructions), the compiler generated the code once, and had the first instance jump to the second (4 instructions).

What does this mean for the debugger? By the time the execution gets to the LABEL_tail section, it cannot discern whether the execution came from the first or second conditional block. You could instruct the compiler not to combine the source lines, but that would increase program size. It is impossible to get the best optimizations along with perfect debug accuracy. How do you strike a balance?

## One Problem, Many Solutions

Most people have grown to accept the fact that it is not possible to get the fastest code with the smallest footprint and be able to debug it with perfect accuracy, but not everybody understands why. One reason for this disconnect is that these features are largely marketed as separate concepts:

• If you need a very fast or real-time system, vendors will show off benchmark results.
• If you are concerned with costs or huge feature-sets, they'll brag about small footprints.
• If you care about product reliability and time-to-market, they'll boast about impressive bug finding and code-analyzing features.

The problem is, most people care about more than one of these things. Sure, you can have one big problem that you lose sleep over, but you should expect decent solutions for the other two.

Consider buying a car: you might look at the speed, fuel efficiency, and comfort it offers. A ten-year-old would expect the best of everything, but you are probably more realistic.

• If you want speed, you could buy a sports car, but it will suck up a lot of fuel.
• If you want space and comfort, you could buy an SUV, but it will be slow.
• If you want a fuel-efficient vehicle, you could buy an electric car, but it will be slow and small.

At any one point in time, you can't have the best of everything. Choosing the right car is about prioritization and tradeoffs. Each parameter is important to a certain extent, and ultimately you need to find the balance that works best for you.

The same concept applies to compilers; just change the parameters to speed, size, and debug accuracy. You must balance your need for debugging accuracy with your need for optimization. If you require a high level of optimization, you must also decide whether speed or size is more important (Fig. 2). The speed at which the compiler processes source files also varies depending on the level of optimizations. Generally, more aggressive optimizations take longer to execute.

The good news is that you don't need to buy three compilers to satisfy your different needs. If you have the right set of tools, you can customize them to meet your requirements for any project. The solution is to follow a process that uses those tools to find the right balance for you. Before we delve into that process, let's first establish some fundamental approaches to configuring your build system.

## Build Models

In the Single Build Model (Fig. 3), you create a single build configuration, and the binaries that you develop and test every day are the same binaries that you release. It is simple to set up, use, and maintain.

While a big advantage of this model is its simplicity, the limitation is that you will have only one configuration that needs to satisfy both your performance and debuggability needs. This configuration lies towards the middle of the debug accuracy vs. optimization axis.

In the Dual Build Model (Fig. 4), you create two build configurations: a debug configuration and a release configuration. Both configurations build from the same source files.

The debug configuration builds binaries that you debug on a daily basis during development. This configuration has few or no optimizations enabled, but includes full debugging support. You may also throw in additional run-time checks and assertions to help you uncover and fix bugs more quickly. This configuration is on the debug accuracy side of the debug accuracy vs. optimization axis.

The release configuration builds the binaries that you release and support. It uses the best possible optimizations, but you don't debug its binaries regularly, because debug accuracy is limited and results are unpredictable. This configuration is on the optimization side of the axis, and if you decide to use it, you should start considering whether speed or size is more important to you.

While this model gives you the most freedom, it comes with some overhead: you must maintain and validate two builds, you need more disk space, and there may be bugs that only show up in the release configuration, which is harder to debug.

## Size Is Global, Speed Is Local

When balancing speed and size optimizations, remember the adage: size is global, speed is local.

If you want to reduce the size of your program through compiler optimizations, you should apply them to your entire source base. Generally speaking, more source code leads to more instructions and data, resulting in a larger program. If applying size optimizations to your entire source base buys you a 40% improvement, applying them to half of your source base buys you 20%. You won't be able to find one critical point where you can apply size optimizations to get the most bang for your buck. As a rule of thumb, apply size optimizations globally - size is global.

If you want to increase the speed of your program, the rule of 80-20 usually applies: 80% of the execution time is spent in 20% of the code. If speed were your only concern, you could turn on speed optimizations globally and get the best speed results. However, you could get almost as good results if you turn on optimizations for only the critical 20% of the code that causes bottlenecks (a profiler should help you find which parts of your code to target). For the remaining 80%, you can focus on debugging accuracy or size optimizations. As a rule of thumb, apply speed optimizations locally - speed is local.

## Finding the Balance

Now that we have established some background concepts, it is time to find the balance between debug accuracy and optimization that works for your project. Start off by using the requirements for your shipping product to determine the level of optimization you need:

• Globally enable the least aggressive optimizations that meet your size requirements. Remember, size is global. Start with the default size optimization setting and increase it gradually until you are satisfied. If build time is a concern, measure the processing time it takes to build your project. This will likely be the fastest build time you will achieve, because you will add more optimizations later in the process.
• If necessary, locally enable the most aggressive speed optimizations on bottlenecks. Remember, speed is local. Run the program with your current settings and measure the time it takes your operation to run. If it falls within your requirement parameters, continue to the next step. If not:
• Run a profiler to identify bottlenecks where most of the execution time is spent. Create a list of these files and functions.
• Enable the most aggressive speed optimizations locally on the files or functions in your list. Go down your list in the order of the most commonly executed code, enabling the speed optimizations until you have reached your speed requirements. If you fall below your size requirements during this process, you may need to increase the global level of size optimizations.
If you have not found optimization settings that match your size and speed requirements at this point, you may need to consider other options, such as refactoring your code.
• Determine which build model to use.
If your compiler has a limited optimization setting that is as good as possible while ensuring debug accuracy, and you did not go beyond that setting in the previous steps, use the Single Build Model. If you went beyond it, use the following questions to determine if your current settings are easy enough to debug:
• Do live variables have values available?
• When the variable values are available, are they always correct?
• Is run control correct (hitting breakpoints, running, stepping in and out of functions)?
• Can you set breakpoints on or near executable lines of your choosing?
• Can you view the call stack correctly whenever you need to?
If the answer to each of these questions is "yes," use the Single Build Model. If the answer to any of these questions is "no," use the Dual Build Model.
• If you are using the Dual Build Model:
For the Release Build, use the settings you configured in the previous steps. For the Debug Build, enable few or no optimizations, and add any run-time checks that you would not want to turn on in your shipping product, such as:
• failed assert() checks that flag and terminate
• NULL dereference checks
• stack overflow detection
• checks for calling free() on illegal memory
You may have to refrain from using some run-time checks if the resulting program does not fit on your target or meet the speed requirements you have for debugging.
• Make sure your product builds at a reasonable speed. By modifying the optimization settings, the compiler processing time has likely changed. Measure the time it takes to build your product again. If it is not acceptable, lower the optimization settings.

You do not need to come up with a single perfect setting for all times. You should repeat the balancing process whenever your shipping product gets sluggish or too big, or whenever debug accuracy gets too poor. For information about "Finding the Balance" using the Green Hills compiler, refer to Appendix A.

## Lost In Translation

While we have discussed how to strike a balance between debug accuracy and optimizations, we haven't gone into depth about what causes optimizations to get in the way of debugging. And why is it that when it comes to aggressive optimizations, you need to make a choice between size and speed?

High-level programming languages were revolutionary because they created a programming model you could think in and reason about in ways that are natural to you. They allow you to work in the human paradigm. This paradigm has been so successful, and is so well entrenched, that most application programmers rarely have to program directly with low-level concepts such as registers, the function stack, machine instructions, or RAM. They rely on the compiler to take their high-level language and translate it into the machine paradigm, in a low-level language that makes sense to a computer.

The debugger's job is to act as a middleman, running a program in the machine paradigm and translating the information it finds back into the human paradigm, so that you can relate it to your high-level code without having to deal with low-level concepts such as registers and instructions. The debugger gets help from special debug information generated by the compiler. This information is stored in one of several formats, both standard and proprietary. Basically, the compiler leaves a trail of breadcrumbs for the debugger to follow; when the debugger needs to find something, it follows the breadcrumbs.

For instance, if the debugger needs to know where to find a variable, it looks in the debug information for the hardware register or the memory location it was allocated to. If you need to set a breakpoint, the debugger queries a table within the debug information that provides the mapping between source line locations and addresses of machine instructions. This all works great, so long as there is a direct (one-to-one) mapping from the human paradigm to the machine paradigm. As it turns out, almost every problem with debugging optimized code can be attributed to a case where no such direct mapping can be established.

## A Looping Example

Consider the following code in C:

```
1: int i = 0;
2: while (i < len) {
3:     int tmp = brr[i];
4:     arr[i] = tmp;
5:     i++;
6: }
```

It's simple enough; a loop copies one array to another. The compiled, slightly optimized code in POWER-based "pseudo assembly" would be fairly easy to follow, because there is a direct mapping between the human and machine paradigms:

```
        load   reg1, 0         # load value 0 into index register
        branch end_loop        # branch to end of loop
start_loop:
        mul    reg2, reg1, 4   # multiply reg1 by 4 to get offset
        load   reg3, brr+reg2  # load value from 'brr+reg2'
        store  reg3, arr+reg2  # store value into 'arr+reg2'
        add    reg1, reg1, 1   # increment index register
end_loop:
        cmp    reg1, len       # compare index to 'len'
        blt    start_loop      # branch if less than to 'start_loop'
```

The following sections explain how various optimizations translate this simple C code into the machine paradigm without maintaining this mapping.

## Scheduling Optimizations

The code above is far from optimal. Take a closer look at the load and store instructions. A processor core can execute instructions many times faster than it can access memory. So, when it tries to load a value from memory into a register, the value will not be available until the memory access catches up. Even if the value is cached, most modern high-speed processors have several levels of caches, which cause delays.

This is where the compiler optimizer steps in. It notices that the loop has the same effect if you swap the store and add instructions. The end result is that the memory access now has one extra instruction to catch up, and hopefully will not stall the processor:

```
        . . .
        load  reg3, brr+reg2   # load value from 'brr+reg2'
        add   reg1, reg1, 1    # first increment index register
        store reg3, arr+reg2   # then store value into 'arr+reg2'
        . . .
```

The optimizer scheduled the instructions better. If the loop used to take seven cycles for each iteration (two cycles for store and one for all other instructions), this change reduces it to six (with store taking up one cycle): a 14% speed improvement!

But, notice that in the original C code, the value of i was incremented after the assignment to arr[i]. If you stop the debugger at line 5, the value would have already changed. The scheduler changed the order of execution, and there is no longer a direct mapping between the human and machine paradigms.

## Loop Unrolling Optimizations

An even more aggressive loop optimization is known as loop unrolling. In a typical pipelined processor, the execution of a conditional branch could cause the processor pipeline to get flushed. The goal of the loop unroller is to reduce the number of conditional branches by duplicating the body of the loop several times (typically four or eight):

```
        loop prologue          # set index, jump to a 'start_loop' label
start_loop1:
        loop body              # same body as before
start_loop2:
        loop body              # same body as before
start_loop3:
        loop body              # same body as before
start_loop4:
        loop body              # same body as before
end_loop:
        cmp   reg1, len        # compare index to 'len'
        blt   start_loop1      # branch if less than to 'start_loop1'
```

In this version of the code, "compare and branch" instructions execute only a quarter as often as in the original loop, and straight-line execution is friendlier to your processor pipeline, providing a huge speed boost.

On the other hand, we have just lost the direct mapping from the human paradigm to the machine paradigm. Notice that a single source line is now tied to four distinct copies of its machine instructions. If you set a breakpoint on that source line, one could argue that the debugger should set breakpoints on all four instructions. However, in most cases loop unrolling is a prelude to additional optimizations, like the instruction scheduler. A newly formed large block of straight-line code becomes fertile ground for more speed gains, at the cost of blurring the line between the original loop bodies. The generated assembly can get so convoluted that debuggers typically don't allow you to set breakpoints on the source lines inside the loop. If you wanted to step through it, you'd have to debug in assembly.

In addition to affecting your ability to debug, this optimization also increases your program size. By turning it on, you are prioritizing optimization over ease of debugging, and speed over size.

## Common Sub-expression Elimination

Common sub-expression elimination (CSE) is a common optimization that increases speed and reduces size by finding expressions (or parts thereof) that are evaluated multiple times, and if possible, generating temporary variables to cache the results. For example:

```
int func(int a, int b)
{
    int i;
    int j;

    i = x[a+b];        /* 'a+b' used as index */
    /* code not writing to 'a' or 'b' */
    j = i + y[a+b];    /* 'a+b' again used as index */
    /* code not accessing 'i' */
    return j;
}
```

Might be optimized into:

```
int func(int a, int b)
{
    int i;
    int j;
    int _index;            /* new temp variable created */

    i = x[(_index=a+b)];   /* cache 'a+b' into '_index' */
    /* code not writing to 'a' or 'b' */
    i += y[_index];        /* use same location for 'i' and 'j' */
    /* code not accessing 'i' */
    return i;              /* return 'i' instead of 'j' */
}
```

The code above is faster and smaller because:

• a+b only needs to be evaluated one time.
• It is not necessary to allocate or copy any values into j.

But again, we have lost the direct mapping from the human paradigm to the machine paradigm. Because i and j share the same location, you can only view one of them at a time. The debugger would display the other as "unavailable." Also, some debuggers let you change the values of variables "on the fly". If you stopped on j = i + y[a+b] and changed the value of a in the debugger, it would have no effect because the program is using the cached result in _index. This would be a problem if you suspected an off-by-one error and wanted to increment the offset in the debugger without having to recompile your program.

This optimization affects the accuracy of presented data and takes away the ability to reliably change the state of the program while debugging.

## The Light at the End of the Tunnel

These examples have shown that some optimizations alter information when translating code from the human paradigm to the machine paradigm, making it impossible for the debugger to translate it back. Everything up to this point has been treating optimizations very much as black or white: an optimization is either turned on or off. But imagine if you could turn on an optimization only partially, specifically tuned to be as aggressive as possible while still maintaining the debugging accuracy. Sure, the generated code would not be the best available. But it would be better than no optimizations, and still debuggable.

Such an optimization setting is not merely a point on the debug accuracy vs. optimization axis. It is the "pivot point" before which you know that debugging will be reliable, and past which you know it won't be. Having such a setting is crucial in determining how to build and optimize your program.

Appendix A lists the full scope of optimization strategies available with Green Hills compilers, including the "pivot point" settings.

Appendix B lists the benchmark results for the size and speed of various optimizer settings with Green Hills compilers.

## Appendix A

The Green Hills compiler provides the following optimization strategies to help you "find the balance". Along with each option, the equivalent GNU compiler option is listed:

• -Omaxdebug

This setting has the highest debug ease and accuracy. It does not enable any optimizations, and it goes out of its way to make your program more debuggable. It does not inline any routines for any reason. Compiler processing time is moderately fast.

This level is recommended for the debug configuration of the Dual Build Model.

Closest GNU option(s): -O0 -fno-inline

• -Omoredebug \\[default\\]

Start with this option when searching for the lightest optimization setting that meets your size and speed requirements. Debugging is guaranteed to be accurate, it is possible to set a breakpoint on any executable statement, local variables remain live throughout their scopes, and some optimizations are enabled. Compiler processing time is moderately fast.

This level is appropriate for the debug configuration of the Dual Build Model.

Closest GNU option(s): no equivalent option. -O0 likely has faster build time, but does no optimizations; -O has optimizations, but could disrupt debugging

• -Odebug

Similar to -Omoredebug, but size and speed optimizations are both improved. You are not guaranteed breakpoints on all executable statements, and local variables may not be live throughout their scopes. The debugging is otherwise accurate. Compiler processing time is fast. This is the "pivot point" - more aggressive optimizations no longer guarantee that debugging will be accurate.

This level is appropriate for the Single Build Model and the debug configuration of the Dual Build Model.

Closest GNU option(s): no equivalent option. -O0 likely has faster build time, but does no optimizations; -O has optimizations, but could disrupt debugging

• -Ogeneral

The highest optimization setting before branching into size-specific and speed-specific optimizations. Debugging is possible but no longer accurate, and you won't reliably be able to change the state of the program in the debugger on the fly. This is a good setting if speed and size are equally important. Compiler processing time is moderate.

This level is appropriate for the Single Build Model (if you can live with less than perfect debugging) and the release configuration of the Dual Build Model.

Closest GNU option(s): -O

• -Osize

This level is similar to -Ogeneral, but it turns on aggressive size optimizations, sometimes at the expense of speed. Compiler processing time is moderate.

This level is appropriate for the release configuration of the Dual Build Model if you are prioritizing size, or for the Single Build Model if the debugging quality is acceptable. For the best results, enable it globally - size is global.

Closest GNU option(s): -Os

• -Ospeed

This level is similar to -Ogeneral, but it turns on aggressive speed optimizations, sometimes at the expense of size. Compiler processing time is moderate.

This level is appropriate for the release configuration of the Dual Build Model if you are prioritizing speed. For the best results, enable it locally - speed is local. Only enable it globally as a temporary measure to see if it's possible to meet your speed requirements using compiler optimizations. Used locally in combination with a global -Odebug level, this level may also be appropriate for the Single Build Model because most code remains easy to debug.

Closest GNU option(s): -O2

• -Onone

Use this option if your main concern is build time. No optimizations are enabled. Compiler processing time is fastest.

Closest GNU option(s): -O0

With the Green Hills compilers, beyond optimization strategies, there are aggressive optimizations such as -Olink, -OI, and -Owholeprogram that you can enable to get the most out of your compiler (for instance, turn on linker optimizations, or intermodular inlining). You might consider enabling them for the release configuration in a Dual Build Model. There are also fine-tuning options that selectively turn on and off specific optimizations to help you prioritize between size and speed. For more information, see the Green Hills toolchain documentation on www.ghs.com.

With the GNU compilers, there are also additional optimization strategies. For instance, -O3 turns on the inliner for even faster performance. This type of optimization would be useful with a Dual Build Model, since debugging would otherwise suffer. GNU also offers many fine-tuning optimization options that can be used if the main ones need to be tweaked. For more information, refer to the GNU compiler documentation.

In the debug configuration of the Dual Build Model, you can turn on additional checks to catch bugs that would otherwise cause intermittent failures. These options make your code larger and slower, but that's not a problem because you do not ship builds from the debug configuration. Examples of options are:

• Run-time checking
Includes checks for NULL-dereferences, use of recently freed memory, freeing memory twice, array bounds, missing switch cases, etc.
• Profiling

Instruments code to show executed code paths and how functions call each other

For more details, consult the Green Hills toolchain documentation.

## Appendix B

This appendix provides benchmark results for Green Hills MULTI Compiler optimization strategies. A number of standard benchmarks were included, such as EEMBC, CoreMark, and Dhrystone, along with proprietary ones. All benchmarks were run on a PowerPC 440 based core and an ARM Cortex-A8 based core using Green Hills MULTI Compiler v5.2. Smaller percentage numbers are better.

## Size Comparison

| Optimization Setting | PowerPC Size vs. No Optimization | ARM Size vs. No Optimization |
|---|---|---|
| -Omoredebug | 84.98% | 78.64% |
| -Odebug | 82.12% | 73.99% |
| -Ogeneral | 72.46% | 61.41% |
| -Ospeed | 86.19% | 71.85% |
| -Osize | 58.47% | 53.41% |

The results get better along the main debug vs. optimization axis until they fork at -Ogeneral, which sits in the middle of the size vs. speed axis; -Ospeed takes a step back (as expected), but -Osize produces the best results.

## Speed Comparison

| Optimization Setting | PowerPC Exec. Time vs. No Optimizations | ARM Exec. Time vs. No Optimizations |
|---|---|---|
| -Omoredebug | 90.13% | 84.31% |
| -Odebug | 86.62% | 76.97% |
| -Ogeneral | 65.88% | 64.43% |
| -Ospeed | 57.77% | 56.60% |
| -Osize | 72.63% | 67.43% |

The results get better as we slide down the debug vs. optimization axis until we get to -Ogeneral. At that point, we can get even more impressive speed results with -Ospeed. However, on the opposite end of the size vs. speed axis, -Osize takes a step back by almost 7% on PowerPC and 3% on ARM, which is expected.

## References

• "Using the GNU Compiler Collection: For GCC version 4.6.1" - manual for GCC, the GNU Compiler Collection
• "MULTI: Building Applications for Embedded PowerPC" - product documentation for Green Hills Software MULTI
• Green Hills Software