On the Matter of Volkswagen, and Cheating Benchmarks

You have probably heard about the latest cheating debacle, in which Volkswagen made its diesel cars look good by changing the engine’s run-time profile when emission tests (aka benchmarks) were running. This allowed the engine to pass the test even though it generated more emissions during normal driving. On the road, the cars delivered better fuel mileage and performance, but not at the rated emissions.

In other words, Volkswagen was cheating and they got caught. The results will be expensive and annoying, not just to the company but also to stockholders, car owners, and the general public, which has to live with more pollution. Sadly, the people who should really be hurt are unlikely to spend time in jail or owe money. The CEO of Volkswagen has left with a golden parachute and a new CEO is attempting to pick up the pieces.

Unfortunately, this type of event is far from unusual. What is unusual is the magnitude, and the fact that it was against the law is significant. It does bring up a number of issues that are being discussed on the Internet but that many people are still overlooking, including benchmarks, the DMCA, and reverse engineering, plus what happens to owners of the affected diesel cars.

I bring up benchmarks because I was the first PC Labs Director for PC Magazine (Fig. 1) many decades ago. I helped put together the first benchmarks for PCs, printers, and networks. We distributed them via bulletin-board systems running on banks of modems as well as on 5.25-in floppy disks. You may have to go to the Computer History Museum to see them now.

1. PC Labs was the foremost testing center for PC Magazine for many years.

The benchmarks were simple compared to the ones used these days. Then again, so were the computers, since many of these benchmarks were DOS-based.

Eventually we had graphics benchmarks, which is where a lot of cheating occurred. In some instances, video drivers were specifically written to check whether tests were being run and to adjust the way the driver performed, often by doing nothing. In one sense, this is a valid optimization: skipping a set of operations that does no useful work (that is, nothing that changes what is on the screen) could help improve the performance of the overall system. Of course, adding this check could lower it, too.

This type of optimization is common in compilers, where dead code elimination is a standard feature. It is this type of optimization that benchmark developers and users need to be aware of.

I recall an advertisement from many years ago (I still cannot find a copy) from Perkin-Elmer about their latest minicomputer Fortran compiler. It showed low, one- or two-digit percentage improvements for various tests, but one test stood out with a result that was hundreds of times better. It turned out to be a dead code optimization that essentially eliminated the test, and a marketing person decided this was a good thing to hype.

Sometimes these types of features, like dead code elimination, are something positive and should be tested. On the other hand, a designer is often trying to test another aspect of the system, and these types of optimizations skew the results, making the test invalid.
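One way a test author can at least notice when the timed work has been skipped is to check the output against a known-good reference. The following is a minimal sketch of that idea, not how the PC Labs benchmarks worked. The function names (run_benchmark, honest_invert, shortcut_invert) and the byte-inverting workload are hypothetical and chosen only for illustration.

```python
import hashlib
import time


def checksum(data: bytes) -> str:
    """Digest used to confirm that the timed operation produced real output."""
    return hashlib.sha256(data).hexdigest()


def run_benchmark(operation, payload: bytes, expected_digest: str):
    """Time `operation` and report whether its output matches the reference."""
    start = time.perf_counter()
    result = operation(payload)
    elapsed = time.perf_counter() - start
    valid = isinstance(result, bytes) and checksum(result) == expected_digest
    return elapsed, valid


def honest_invert(payload: bytes) -> bytes:
    """Hypothetical workload: invert every byte of the buffer."""
    return bytes(b ^ 0xFF for b in payload)


def shortcut_invert(payload: bytes) -> bytes:
    """Hypothetical 'optimized' version that skips the work entirely."""
    return payload


payload = bytes(range(256)) * 64
expected = checksum(honest_invert(payload))

honest_time, honest_ok = run_benchmark(honest_invert, payload, expected)
shortcut_time, shortcut_ok = run_benchmark(shortcut_invert, payload, expected)

print(f"honest:   {honest_time:.6f} s, valid={honest_ok}")
print(f"shortcut: {shortcut_time:.6f} s, valid={shortcut_ok}")
```

A fast time paired with an invalid result is a number that should not be published. The check only catches skipped work that changes the output, so it is a partial safeguard at best.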

For example, a simple test for interface throughput or sequential disk read/write speeds might just write a file filled with zeros. If the driver has a special way of handling zero records or records that contain a fixed pattern, then it could improve system performance, making the operation complete sooner. Unfortunately, this would not result in useful or correct information about what the designer wanted to test.
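To get a feel for how much a fixed pattern can change the amount of work involved, the sketch below compares a zero-filled payload with a random one. It uses zlib compression purely as a stand-in for any layer that special-cases patterns; this is an assumption for illustration, not a claim about how any particular driver or controller behaves. The 1-MiB size and the stored_size helper are likewise hypothetical.

```python
import os
import zlib

SIZE = 1 << 20  # 1 MiB test payload (arbitrary size chosen for the example)


def stored_size(payload: bytes) -> int:
    """Bytes a compressing layer would actually have to move or store."""
    return len(zlib.compress(payload))


zero_payload = bytes(SIZE)         # what a naive throughput test might write
random_payload = os.urandom(SIZE)  # data with no pattern to exploit

zero_stored = stored_size(zero_payload)
random_stored = stored_size(random_payload)

print(f"zero-filled: {SIZE} bytes in, {zero_stored} bytes stored")
print(f"random:      {SIZE} bytes in, {random_stored} bytes stored")
```

The zero-filled payload shrinks to a tiny fraction of its size, while the random payload does not shrink at all. A throughput test that writes only zeros would be measuring the shortcut rather than the interface or the disk.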

Before you start emailing me that this type of optimization is ridiculous, consider our migration to flash storage, the performance of flash memory controllers and sparse files. At this point, flash writes have a high cost in terms of speed and power. Likewise, most flash file systems remap sectors. This type of feature could be built using a default set of “special” sectors and these sectors would not actually have to exist.

Did you know that some compilers for RISC systems with a lot of registers would dedicate a register to values like 0 or 1 because it was more efficient to use the register than a literal?

Scandal Implications

So we have good and bad reasons for “cheating” a benchmark. The issue has direct impact on all areas of electronics (and life in general, but that is another story). It is not restricted to performance either. Take EEMBC’s Ultra Low Power Benchmark as an example. It tests the power efficiency of microcontrollers. Designing the benchmark is a challenge because optimizations in compilers, run-time systems, and the hardware need to be taken into account. The results can have a major impact on the fortunes of many companies when one chip is chosen over another. On the plus side, the source code is available and users can run the tests on their hardware.

Unfortunately, the ability to access source code, or even to test a system, has often been taken away by contract or by law. The Digital Millennium Copyright Act (DMCA) prevents reverse engineering of many systems. It is used to protect Volkswagen and other companies. This is a major issue because companies have an incentive to cheat and to hide it using the DMCA. That is not the intention of the act, but that is the result.

Academics wanting to test a system are not the only ones being restricted. Journalists often need to sign non-disclosure agreements when evaluating hardware and software, under which they may not run benchmarks at all, may run only specified benchmarks, or must have the results “approved” before publishing them. You might be surprised how many clickwrap contracts have Terms of Service that bind you to this type of requirement.

Another implication of the Volkswagen debacle is how the cars will be “upgraded” or fixed. This means replacing or reprogramming the cheating modules. It also means adjusting the performance of the cars so they meet emission requirements. I wonder how many owners will ask that the new “feature” not be added to their car, sort of like upgrading to Windows 10 and getting its new Windows Customer Experience Improvement Program (CEIP) features.

Benchmarks remain a critical tool, but even valid tests can be misinterpreted, accidentally or on purpose. Not all problems like this are related to active circumvention of benchmark results. Often the problem is the interpretation and how the results are used. This is what happened with the Ford Fusion hybrid we own. It was sold with a rating of 47 miles per gallon (mpg) that was changed to 42 mpg (44 city/41 highway). This was over a “misinterpretation” of how the test results should be applied. We, along with other Ford Fusion owners, received a check to take care of the additional gas we would need to buy over the typical lifetime of the car. It is still a great car, but it was definitely a disappointment.
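The size of that difference is easy to estimate. The sketch below works out the extra fuel implied by the two combined ratings. The 150,000-mile lifetime is a made-up figure for illustration only; it is not the number Ford used, and the helper name gallons_used is hypothetical.

```python
def gallons_used(miles: float, mpg: float) -> float:
    """Fuel consumed over a distance at a given miles-per-gallon rating."""
    return miles / mpg


LIFETIME_MILES = 150_000  # hypothetical lifetime, chosen only for the example
ADVERTISED_MPG = 47       # original combined rating
REVISED_MPG = 42          # revised combined rating

extra_gallons = (
    gallons_used(LIFETIME_MILES, REVISED_MPG)
    - gallons_used(LIFETIME_MILES, ADVERTISED_MPG)
)
print(f"Extra fuel over {LIFETIME_MILES} miles: about {extra_gallons:.0f} gallons")
```

Under that assumption, the 5-mpg change works out to roughly 380 additional gallons over the life of the car.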

A minor point on the interpretation of benchmark results that I wanted to mention is numeric accuracy versus precision. I always had to argue at PC Labs against providing and comparing numeric results with high precision. There was always, “Why can’t we compare values to six decimal places?” As engineers we know why. This also leaves out the issue of uncertainty. A proper result includes error ranges. In general, uncertainty is often low, as is the precision discussed, but it can easily be manipulated to allow A to look better than B even if the results are essentially identical given the complete details.
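A short sketch shows why the extra decimal places do not help. It reports each result as a mean with a standard error and only calls two systems different when the gap exceeds the combined uncertainty. The timings are made-up numbers for illustration, not measurements, and the two-standard-error threshold is one common rule of thumb rather than the only reasonable choice.

```python
import statistics


def summarize(samples):
    """Return (mean, standard error of the mean) for repeated runs."""
    mean = statistics.mean(samples)
    stderr = statistics.stdev(samples) / len(samples) ** 0.5
    return mean, stderr


def distinguishable(a, b, k=2.0):
    """True if the means differ by more than k combined standard errors."""
    mean_a, err_a = summarize(a)
    mean_b, err_b = summarize(b)
    combined = (err_a ** 2 + err_b ** 2) ** 0.5
    return abs(mean_a - mean_b) > k * combined


# Made-up run times in seconds for two hypothetical systems.
system_a = [10.12, 10.31, 9.98, 10.25, 10.07]
system_b = [10.05, 10.28, 10.19, 9.94, 10.22]

for name, runs in (("A", system_a), ("B", system_b)):
    mean, err = summarize(runs)
    print(f"System {name}: {mean:.6f} s (six decimals), {mean:.2f} ± {err:.2f} s")

print("Meaningfully different:", distinguishable(system_a, system_b))
```

Printed to six decimal places, B appears to beat A by about a hundredth of a second. With the error ranges attached, the two results overlap and no winner can be declared.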

So for now, our Ford Fusion and a Volkswagen diesel will still get you from point A to point B. It may just not be burning the same amount of gas as you expected.