Programming Efficiency: Part 2

Image courtesy of Thinkstock.

A number of years ago I did some work on a PABX. The device was electromechanical and far from state of the art, but the owner had been talked into spending a bunch of money expanding it and couldn’t justify scrapping it. I was also young and worked for a relatively low salary, so they could afford the luxury of my design efforts.

The system was given most of the features of a state of the art PABX, thanks to an 8-bit micro and about four rack panels of custom interface cards. The part of the project that is interesting here is that it provided Detail Call Recording output data that the IT mainframe could query and bill to departments.

As a result, I became intimately familiar with phone company tariffs and call zones. It was about the time FAX machines, modems, and cell phones started exploding the number plan, so the area codes and prefixes changed frequently. The mainframe program had to create tables for charging and look each call up by area code and prefix, as well has how it had been placed.

Probably today this would have been done using a database, and the gory details left to the OS, but it was before SQL became a household word. Any readers who are familiar with optimizing searches on databases with tens of thousands of records should instantly recognize this as an optimal fit for hashing.

Hashing is of no value when data needs to be sorted in some order and lookups done on the sorted data, but this project had no such requirement. It simply had to create a new database every month (because of the significant changes that occurred) and then look up each phone call in that database.

Hashing was well documented by Donald Knuth in his The Art of Computer Programming series of books, first published in the late 1960s. I am not aware of any significant new insights that have come to light since.

I was not (and still am not) a mainframe or IT programmer. The program needed to be written in RPG II, a “language” that wasn’t even intended to be a full programming language. In fact, it has often been referred to as “write only” because reading and understanding a program listing in RPG is very difficult.

So I enlisted the help of a very bright medical student who had done summer work for the IT department. I gave him the basics of Knuth’s algorithm and how it worked, along with some recommendations on creating a hash function for this particular application, based on what I knew about the data.

The basic idea behind Knuth’s approach is to create a “pseudo-random” number from the key field of the data (in this case the area code and prefix)—pseudo-random, in that it should result in as even and wide a distribution of records as possible. The next step is to divide that number by a prime number equal to the size allocated for the table, which should be about 20% larger than needed.

The remainder from that division is the index into the table to store (or retrieve) the data record. If storing, and the record is already taken, add the quotient from the first division and take a new remainder (in case it wraps off the end) and try again.

For retrieving, if the initial access doesn’t match the key, do the same thing. The remainder randomizes the accesses and guarantees against falling off the end. Adding the quotient moves to a new location an arbitrary distance down the table. Because the table size is a prime number, this movement will eventually access all records before it ever repeats, so the worst degenerate case is guaranteed to be a complete linear search.

The useful part about it, though, based on Knuth’s study, is that no matter how large the table, if its size is about 20% larger than needed, average access time is under six tries. That beats a binary search of anything larger than 64 items!

The finished program took about an hour on a mainframe to process around 10,000 records. It didn’t seem reasonable to me, and my department was paying for the CPU time, so I looked at the program. The programmer had done his own research from less enlightened sources and decided to forgo the part about adding the quotient; instead, he created an overflow table for cases where the first try was already taken.

The result was that a lot of the records ended up in the overflow table, requiring a linear search. I knew enough RPG to modify his code, so I implemented it as described above. The run time went down to a few minutes!

The moral of the story is that there are a lot of things in programming that can be subtle: a small misunderstanding, a wrong assumption about how something works, a high-level construct used without knowing the code it generates…any of these can make a huge difference in performance. That is why I maintain that any good programmer should be able to describe the assembly language code that the compiler will generate from any construct in his program.

I often look at disassembly listings of key pieces of the code I write. Sometimes I marvel at the tricks optimization can play. Other times I am shocked at what the compiler had to do to implement a seemingly simple concept. And occasionally I discover that if I change the source code slightly, the compiler can generate a significantly better executable.

Anyone who has used SQL knows that, especially in joins, the details and order of operations can sometimes make a huge difference in how long it takes the command to execute and how much memory is required. Experience teaches the programmer to have a feel for what does and does not work, but that experience is acquired by trying several versions and timing them.

In C—at least, on older compilers without good optimization—there is a difference between ++I and i++. I’m referring to cases where there are no side effects, where the pre- or post-increment value isn’t part of another calculation or stored somewhere else. The net effect is the same, but the process may not be. The i++ is widely, almost instinctively used by most C programmers. I have trained myself to use the ++I form instead. The C standard specifies that the result of either of those expressions can be used in a subsequent calculation. In essence it is left in an accumulator.

However, for ++I the result after the increment is used, and for i++ the value before the increment is used. That means for i++ a copy of the value has to be stored somewhere before the real value can be incremented; after that, it no longer has the correct value. So an extra register or memory location is tied up and an extra instruction is required.

Now a good optimizing compiler can recognize when the value is not needed, and optimize out those steps, but with optimization off (or inadequate) the compiler must generate the extra code in case it is needed. Back when I learned this secret, that level of optimization was often omitted in the interest of keeping the compile times manageable. Even today, it is often easier to debug code that hasn’t been optimized, so at least during development, the trick may result in savings.

Here’s another one that tripped me up recently. I’ve long known the advantages of lookup tables. Working mostly with small 8- or 16-bit integer CPUs, a lot of stuff can be sped up by replacing complex calculations with a table lookup (assuming the size of the table is affordable).

I was part of a discussion recently in the Asterisk community about ulaw and alaw transcoding and various ways to accomplish it. I was shocked that given the amount of memory available today and the small size of the required table, anyone would even consider an approach other than indexing into an array.

I was enlightened by others that all this memory is available to us at the price of multilevel caches for performance. While the table size was insignificant for memory utilization, it was not insignificant compared to cache size, and a cache miss was more costly than the algorithm that might replace the table. Looks like life has come full circle.

What optimization tricks have you learned? Or what pitfalls have you encountered by painful experience?