Static analysis tools are in widespread use because they are effective at finding programming defects. They work by analyzing the source code of a program without executing it, so they do not require test cases. Loosely speaking, these tools operate in two phases. First, they parse the source code to create a model of the program; then they apply a variety of techniques that examine the model to find defects. These techniques range in sophistication from simple syntactic analysis to whole-program path-sensitive symbolic execution. They can find simple defects such as violations of spelling conventions, or more serious run-time errors such as null pointer exceptions, resource leaks, and data races. The most valuable analysis techniques are those that find the defects that are most damaging when they show up in the field. Of course, this can vary tremendously depending on the application domain: a program intended for use in a secure environment will be vulnerable in very different ways from a program designed to be used in a trusted setting. This article is intended to be domain-agnostic, so I discuss the kinds of bugs that are likely to be important in many kinds of applications.
Although languages do have many classes of programming error in common, the potential for damage varies enormously between languages. For example, the consequences of a run-time error such as a buffer overrun can be catastrophic in memory-unsafe languages such as C and C++, but languages such as Java are more disciplined, so they carry a much lower risk of serious and unpredictable effects from such errors. This article explores these differences for C, C++, and Java and explains which analysis techniques are most appropriate for each language. Tools need sophisticated path-sensitive analysis techniques to find the most common and risky defects in C and C++ code, but even quite lightweight analysis algorithms can be extremely effective for Java.
Risks of C and C++
C was designed as a language for systems programming at a time when it was important to be able to wring as much speed as possible from systems software. Unfortunately, the same language features that make it possible for compilers to generate very fast code also make C a very risky language to use. Although C++ is better in many respects, it has inherited many of the problematic features of C. (From here on, when I refer to “C”, I mean to include the subset of C++ that shares these issues.) The two chief complaints are weak type safety and unchecked pointer arithmetic. The lack of type safety means it is possible to cast a value of one type into another type with no guarantee that the conversion is legal. The ability to do pointer arithmetic means it is possible to access essentially any location in the address space of the program. These two features can be used to create programs that are extremely fast and efficient, but the consequences of misuse can be severe.
In C it is easy to write programs that stray outside the bounds of defined behavior, with hugely unpredictable results. For example, consider the well-known buffer overrun bug; this bug is one of the most notorious in the history of computing, because it has enabled innumerable security breaches. The ability to do unchecked pointer arithmetic is at the root of why buffer overruns are so dangerous.
It is easy for a programmer to inadvertently introduce a buffer overrun—all it takes is to forget to check whether incoming data can fit in the available space. What makes it so hazardous is that C has no built-in protection against the damage it can wreak. A carefully-crafted input vector can allow attackers to hijack the running program and force it to do as they wish.
In contrast, the same defect in a Java program is relatively benign. The Java specification is explicit in requiring that every buffer access be checked, and an exception thrown if the access is determined to be illegal. In Java, even an unhandled exception has predictable consequences and Java programmers are accustomed to thinking about what exceptions may occur and how to handle them. As a result it is conventional for programs to be written to log exceptions and either recover or restart. This turns a buffer overrun from a potentially catastrophic bug into a relatively minor annoyance that is fairly easy to debug.
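To make the contrast concrete, here is a minimal Java sketch of the same mistake: copying input into a fixed-size buffer without first checking its length. The class and method names (BoundsCheckDemo, copyUnchecked, copyOrRecover) are hypothetical and chosen only for illustration. The runtime's bounds check turns the overrun into an ArrayIndexOutOfBoundsException, which the caller logs and recovers from.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical demo class; not taken from any real code base.
class BoundsCheckDemo {
    private static final Logger LOG = Logger.getLogger("BoundsCheckDemo");

    // Copies input into a fixed-size buffer WITHOUT checking the length first,
    // mirroring the classic mistake that causes a buffer overrun in C.
    static byte[] copyUnchecked(byte[] input, int capacity) {
        byte[] buffer = new byte[capacity];
        for (int i = 0; i < input.length; i++) {
            buffer[i] = input[i]; // every index is checked by the Java runtime
        }
        return buffer;
    }

    // Conventional handling: log the exception and recover with an empty buffer.
    static byte[] copyOrRecover(byte[] input, int capacity) {
        try {
            return copyUnchecked(input, capacity);
        } catch (ArrayIndexOutOfBoundsException e) {
            LOG.log(Level.WARNING,
                    "input of " + input.length + " bytes exceeds capacity " + capacity, e);
            return new byte[capacity];
        }
    }

    public static void main(String[] args) {
        byte[] fits = copyOrRecover(new byte[4], 8);      // no overrun
        byte[] tooLong = copyOrRecover(new byte[16], 8);  // overrun is caught and logged
        System.out.println(fits.length + " " + tooLong.length);
    }
}
```

Logging and returning an empty buffer is only one possible recovery policy; a real application might reject the request or restart the affected component instead.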
There are several other classes of programming defect that have serious ramifications for C programs, but are mostly harmless or non-existent in Java code. For example, one of the more difficult aspects of C/C++ programming is dynamic memory management. The programmer is entirely responsible for allocating and releasing blocks of memory: when this is done incorrectly the program can leak memory, and the heap can even be corrupted if inappropriate memory locations are freed. In Java programs, automatic garbage collection makes memory management mostly a non-issue.
Risks of Java
In contrast to C, the most common bugs in Java programs do not so much cause crashes as introduce subtle semantic errors. Java has a very rich set of libraries, running the gamut from basic data types like maps and regular expressions, through UI toolkits, and on to enterprise-architecture frameworks. As with any API, there are ways in which the libraries can be misused, and this misuse may introduce defects.
For example, all classes in Java are derived from the Object class, for which the methods equals() and hashCode() are defined. A very important invariant in Java is that objects that are considered equal must also have equal hash codes. If this invariant is violated, it is considered a bug because classes such as Hashtable will then not work as expected. This defect will never trigger a run-time error directly; instead it is more likely to cause puzzling symptoms. For example, an object placed in a hash table may become impossible to retrieve. Unfortunately, there are many ways that a programmer can inadvertently write code that violates the invariant; a subclass that overrides equals() but not hashCode(), for instance, is very likely to have this problem.
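The following sketch illustrates the symptom. BrokenKey and FixedKey are hypothetical classes invented for this example. BrokenKey overrides equals() but inherits the identity-based hashCode() from Object, so two equal keys will almost always hash differently and the lookup fails; FixedKey overrides both methods consistently.

```java
import java.util.Hashtable;
import java.util.Map;
import java.util.Objects;

// Hypothetical demo class illustrating the equals()/hashCode() invariant.
class EqualsHashCodeDemo {

    // Violates the invariant: equals() is overridden, hashCode() is not.
    static class BrokenKey {
        final String id;
        BrokenKey(String id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof BrokenKey && ((BrokenKey) o).id.equals(id);
        }
    }

    // Honors the invariant: equal objects produce equal hash codes.
    static class FixedKey {
        final String id;
        FixedKey(String id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof FixedKey && ((FixedKey) o).id.equals(id);
        }

        @Override
        public int hashCode() {
            return Objects.hash(id);
        }
    }

    public static void main(String[] args) {
        Map<Object, String> table = new Hashtable<>();
        table.put(new BrokenKey("a"), "stored under a broken key");
        table.put(new FixedKey("a"), "stored under a fixed key");

        // The two BrokenKey objects are equal, yet the lookup almost always
        // misses because their inherited hash codes differ.
        System.out.println(table.get(new BrokenKey("a")));

        // The FixedKey lookup finds the entry.
        System.out.println(table.get(new FixedKey("a")));
    }
}
```

Note that no exception is raised in either case; the broken lookup simply comes back empty, which is what makes this kind of defect puzzling to diagnose.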
Similarly, many classes in Java implement the Serializable interface, which allows objects to be written to and read from persistent storage. This interface can be incredibly useful, but implementing it can be very tricky so there are conventions that reduce the likelihood of mistakes. For example, it is conventional to explicitly define the serialVersionUID field in all serializable classes to take advantage of the runtime’s built-in compatibility checking. Code that does not do so is perfectly legal, but vulnerable to bugs.
Implications for Static Analyzers
It is fair to say that the most common serious defects in C programs are memory access errors, whereas a large proportion of the serious mistakes in Java programs are caused through misuse of standard APIs. Unsurprisingly, the techniques best suited to finding these classes of defects are quite different.
To find a memory access error in a C or C++ program, an analysis tool must find a program execution path on which the error manifests. Most programs work correctly for most execution paths, so defects will generally occur only on unusual or lightly-exercised paths. These often correspond to corner cases that are difficult to cover with testing. The relevant part of the path can start in one part of the program, go around various loops, and involve procedure call chains that span compilation unit and module boundaries. Thus a whole-program path-sensitive analysis that is capable of performing symbolic execution is one of the most effective techniques for finding those defects statically. Implementing such an analysis is not for the faint-hearted—vendors that offer such tools have put several person-decades of effort into developing sophisticated analysis algorithms. Almost as important as finding the defects is helping users understand them in the context of the entire program, so the tools provide features to help with this. Figure 1 below shows a screenshot from one such tool.
In contrast, the Java issues discussed above require a less sophisticated analysis because they are mostly simple properties of the code’s structure. Consequently a great deal of benefit can be gained from using relatively lightweight tools. For Java, the primary tool in this domain is FindBugs (findbugs.sourceforge.net). This is a free, open-source tool that is very easy to use either standalone or integrated with other tools. All Java developers can benefit from using FindBugs on their code from time to time, and the price means there is no good reason not to do so.
Tools of this type are effective because they have knowledge bases of hundreds of patterns that indicate bad practice, questionable correctness, latent security vulnerabilities, or simply suspicious code. Both of the Java issues discussed earlier can be detected with FindBugs (Fig. 2).
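To give a feel for how simple such structural patterns can be, the sketch below checks for the two Java issues discussed earlier using reflection. This is not how FindBugs is implemented; it is only an illustration under simplifying assumptions. The names StructuralChecks, Account, and SafeAccount are hypothetical, and the check looks only at members declared directly in the class, so it would report a false positive for a class that inherits a suitable hashCode() from a superclass other than Object.

```java
import java.io.Serializable;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified pattern checker; not the FindBugs implementation.
class StructuralChecks {

    // True if the class itself declares a method with this name and signature.
    static boolean declaresMethod(Class<?> cls, String name, Class<?>... params) {
        try {
            cls.getDeclaredMethod(name, params);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    // True if the class itself declares "static final long serialVersionUID".
    static boolean declaresSerialVersionUid(Class<?> cls) {
        try {
            Field f = cls.getDeclaredField("serialVersionUID");
            int m = f.getModifiers();
            return Modifier.isStatic(m) && Modifier.isFinal(m) && f.getType() == long.class;
        } catch (NoSuchFieldException e) {
            return false;
        }
    }

    // Returns one warning per pattern that the class matches.
    static List<String> check(Class<?> cls) {
        List<String> warnings = new ArrayList<>();
        if (declaresMethod(cls, "equals", Object.class) && !declaresMethod(cls, "hashCode")) {
            warnings.add(cls.getSimpleName() + ": overrides equals() but not hashCode()");
        }
        if (Serializable.class.isAssignableFrom(cls) && !cls.isInterface()
                && !declaresSerialVersionUid(cls)) {
            warnings.add(cls.getSimpleName() + ": Serializable without explicit serialVersionUID");
        }
        return warnings;
    }

    // Sample class that matches both patterns.
    static class Account implements Serializable {
        String id = "";
        @Override public boolean equals(Object o) {
            return o instanceof Account && ((Account) o).id.equals(id);
        }
    }

    // Sample class that follows both conventions.
    static class SafeAccount implements Serializable {
        private static final long serialVersionUID = 1L;
        String id = "";
        @Override public boolean equals(Object o) {
            return o instanceof SafeAccount && ((SafeAccount) o).id.equals(id);
        }
        @Override public int hashCode() { return id.hashCode(); }
    }

    public static void main(String[] args) {
        check(Account.class).forEach(System.out::println);
        check(SafeAccount.class).forEach(System.out::println);
    }
}
```

Neither check needs to reason about execution paths; each inspects only what a class declares, which is why lightweight tools can apply hundreds of such patterns quickly.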
As mentioned above, problems like API misuse and memory access errors occur across many application domains. However, if your program is intended to be used in a setting where it is particularly susceptible to certain kinds of defect, then it is worth considering using multiple techniques and tools to find and eliminate those defects. For example, if Java is being used for an Internet-enabled database application, defects may render the application vulnerable to attacks such as SQL injection or cross-site scripting. In that case, it is advisable to use tools that are specialized for identifying those issues.
Concurrency defects such as data races, starvation, and deadlock are in a category of their own because they are very serious in all languages and because they are very difficult to find and diagnose. Data races are particularly problematic because they are usually very sensitive to timing. A program with a data race bug can run perfectly hundreds of times in the same environment with the same inputs, yet still fail with confusing and mysterious symptoms when executed once more.
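A minimal Java sketch of a data race follows, with hypothetical names (DataRaceDemo, unsafeCounter, safeCounter) chosen for illustration. Several threads increment a plain int field without synchronization, alongside an AtomicInteger for comparison. Because the unsynchronized increment is a read-modify-write sequence, updates can be lost, and how many are lost depends on timing.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; the racy counter is intentional.
class DataRaceDemo {
    static int unsafeCounter = 0;                                  // shared, unsynchronized
    static final AtomicInteger safeCounter = new AtomicInteger();  // thread-safe

    // Starts the given number of threads, each incrementing both counters.
    static void runWorkers(int threads, int incrementsPerThread) {
        unsafeCounter = 0;
        safeCounter.set(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    unsafeCounter++;               // data race: updates can be lost
                    safeCounter.incrementAndGet(); // atomic: no updates are lost
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
    }

    public static void main(String[] args) {
        runWorkers(4, 100_000);
        // The atomic counter always reaches 400000. The unsafe counter may
        // fall short, and by a different amount on each run.
        System.out.println("safe:   " + safeCounter.get());
        System.out.println("unsafe: " + unsafeCounter);
    }
}
```

On some runs the unsafe counter may happen to reach the full total, which is exactly the behavior described above: the defect can stay hidden for many executions before it produces a wrong result.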
Statically detecting concurrency defects requires algorithms even more complex than those for finding memory errors. The most sophisticated tools for C and C++ do already have this ability, although their effectiveness varies considerably and scalability can be problematic. For Java, such capabilities are only just emerging in commercial tools and have yet to be introduced in free open-source tools.
Despite the many problems with C and C++, there is still no widely-accepted alternative high-level language for many applications. This is particularly true for embedded systems because of the prevalence of special-purpose processors. C is the standard language because the cost of developing tool chains for C for those processors is relatively low. It is reasonable to expect that C and C++ will be with us for a very long time to come. Unhappily, despite widespread knowledge of the defects that C and C++ programs are particularly susceptible to, programmers continue to make the same mistakes and these bugs remain disappointingly prevalent. Advanced static-analysis tools for C and C++ offer one way to find these defects before they show up in the field.
A more encouraging trend for embedded systems development is that programmers are increasingly using safer languages such as Java when appropriate, such as for the desktop or mobile applications used to communicate with the embedded systems. Of course, such programs are prone to bugs too, but even the worst defects will tend to be less devastating than those found in C and C++ programs. The good news is that low-cost static analysis tools are available, making it possible to find those bugs early in the development cycle.