Benchmarking predates computers, and benchmarks remain useful tools. I had a hand in some of the popular PC benchmarks from PC Magazine when I was Lab Director of PC Labs. These days I am looking more at the embedded side of things, which is where the Embedded Microprocessor Benchmark Consortium (EEMBC) comes into play. It offers a range of benchmarks targeting chip and embedded designers, and its CoreMark is very popular.
EEMBC's latest benchmark, FPMark, targets floating-point hardware, especially on SoCs. Markus Levy is founder and president of EEMBC. He is also president of the Multicore Association and chairman of the Multicore Developer's Conference. We recently talked about the new benchmark and its impact on embedded developers.
Wong: EEMBC has been around for more than 16 years. Why the recent need for floating-point benchmarks?
Levy: While floating-point units in processors have been in use for decades, in recent times floating point has become more mainstream, appearing in many embedded applications such as audio, graphics, automotive, and motor control. FP-enabled processors are better able to meet the need for increased precision in these applications. Floating-point representation makes numerical computation much easier. As a matter of fact, many algorithms are first coded in floating-point representation before they are painstakingly converted to integer so that they can run on processors without hardware floating-point support. Furthermore, FP implementations of many algorithms take fewer cycles to execute than fixed-point code (assuming the fixed-point code offers similar precision). To better support FP, more related features are being added to processors. For example, the Cortex-A cores are including FPU back-ends with multi-issue, speculation, and out-of-order execution. There’s even FP capability in low-cost microcontrollers, and devices are being integrated with the ability to perform a single-cycle FP MAC with dual memory access.
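To make the "painstaking" conversion concrete, here is a minimal sketch of what a single multiply looks like once it has been moved from floating point to a fixed-point format. The Q15 format (15 fractional bits in a 16-bit word) and the helper names `to_q15`, `q15_mul`, and `from_q15` are illustrative assumptions; they are not taken from FPMark or from any particular vendor library.

```python
# Illustrative only: Q15 fixed-point multiply versus a plain float multiply.
# The Q15 format and all helper names here are assumptions for this sketch.

Q15_ONE = 1 << 15          # 1.0 in Q15 is 32768 (not itself representable)
Q15_MAX = Q15_ONE - 1      # largest representable value, just under 1.0
Q15_MIN = -Q15_ONE         # smallest representable value, exactly -1.0


def saturate(value: int) -> int:
    """Clamp an integer to the signed 16-bit Q15 range."""
    return max(Q15_MIN, min(Q15_MAX, value))


def to_q15(x: float) -> int:
    """Convert a float in roughly [-1, 1) to Q15, rounding and saturating."""
    return saturate(int(round(x * Q15_ONE)))


def from_q15(value: int) -> float:
    """Convert a Q15 integer back to a float for inspection."""
    return value / Q15_ONE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values: widen, round, shift back, then saturate."""
    product = a * b                      # Q30 intermediate result
    rounded = (product + (1 << 14)) >> 15
    return saturate(rounded)


if __name__ == "__main__":
    a, b = 0.5, 0.25
    print("float:", a * b)
    print("q15  :", from_q15(q15_mul(to_q15(a), to_q15(b))))
```

The float version is a single operator. The fixed-point version has to manage scaling, rounding, and saturation by hand, and every one of those choices affects precision.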
Wong: How does EEMBC’s FPMark compare to existing benchmarks?
Levy: In the same way that the popular EEMBC CoreMark was intended to be a “better Dhrystone”, FPMark provides something better than the “somewhat quirky” Whetstone and Linpack benchmarks. There are also other FP benchmarks already in general use (e.g., Linpack, Nbench, Livermore loops), but each has multiple versions, so one never knows how to compare scores, and none has a standardized way of being run or of reporting results. FPMark is built on the same framework as the EEMBC MultiBench, so porting will be very familiar to those who have previously used MultiBench. Using this framework, a user of FPMark can simultaneously launch one or more contexts of a given workload and thereby study some of the general system-level effects on a multicore device. For example, although FPMark wasn’t intended as a multicore benchmark and is mostly for computationally intensive workloads, launching multiple contexts will increasingly stress memory bandwidth, latency, and scheduling support.
Wong: What are some of the unique features of FPMark?
Levy: For starters, the FPMark benchmark suite contains 10 different kernels (Fig. 1), including FFT, ray tracing, Fourier coefficients, a back-propagation neural net simulator, Black-Scholes, and arctangent. This variety of kernels supports many application areas that utilize floating-point representation. But the thing that makes FPMark really unique is that most of the kernels are implemented as both single-precision and double-precision workloads.
Ultimately, the application itself will determine whether single precision or double precision is needed, so FPMark provides both to allow users to make the appropriate comparisons. In the methodology, FPMark specifies the required degree of accuracy for the result. When a compiler builds floating-point code or certain floating-point libraries are used, a certain amount of inaccuracy is introduced, depending on the optimizations. FPMark requires that the final result be accurate to 30 bits of mantissa (out of 52) for double precision and to 14 bits of mantissa (out of 23) for single precision.
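One rough way to reason about an accuracy requirement stated in mantissa bits is to convert the relative error between a computed result and a reference value into a bit count. The sketch below does that. It is only an approximation under stated assumptions: the function names are hypothetical, and the relative-error estimate is not EEMBC's actual verification code. Only the thresholds (30 of 52 bits and 14 of 23 bits) come from the interview.

```python
import math

# Thresholds quoted in the interview; everything else in this sketch is an
# assumption made for illustration, not the FPMark verification procedure.
REQUIRED_BITS = {"double": (30, 52), "single": (14, 23)}


def matching_mantissa_bits(computed: float, reference: float,
                           mantissa_bits: int) -> int:
    """Estimate how many leading mantissa bits agree, from the relative error."""
    if math.isnan(computed) or math.isinf(computed):
        return 0
    if computed == reference:
        return mantissa_bits
    if reference == 0.0:
        return 0  # relative error is undefined against a zero reference
    relative_error = abs(computed - reference) / abs(reference)
    bits = math.floor(-math.log2(relative_error))
    return max(0, min(mantissa_bits, bits))


def meets_requirement(computed: float, reference: float,
                      precision: str = "double") -> bool:
    """Check a result against the bit threshold for the given precision."""
    required, available = REQUIRED_BITS[precision]
    return matching_mantissa_bits(computed, reference, available) >= required


if __name__ == "__main__":
    reference = 1.0
    slightly_off = 1.0 + 2.0 ** -40   # error far below the 30-bit threshold
    too_far_off = 1.0 + 2.0 ** -20    # error above the 30-bit threshold
    print(meets_requirement(slightly_off, reference, "double"))
    print(meets_requirement(too_far_off, reference, "double"))
    print(meets_requirement(too_far_off, reference, "single"))
```

The same result can pass the single-precision threshold while failing the double-precision one, which is why the two precisions carry separate requirements.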
To allow FPMark to be used with a wide range of devices (from low-end microcontrollers to high-end PC processors), there are three workloads for each benchmark (small, medium, and large). Specifically, the small data workloads are appropriate for microcontrollers. Medium data workloads are suitable for mobile CPUs and mid-range devices such as Cortex-A7 and Analog Devices SHARC processors. Large data workloads fit the PC processors and even some high-end mobile devices. Using the linear algebra benchmark as an example, here’s a chart with a simple memory comparison. The picture is not always this straightforward. For example, some of the benchmarks have smaller data sizes, but the algorithm itself uses the data repeatedly to derive a result, as would be the case for the neural net as it homes in on a node. Still, the chart gives an idea of the difference between small, medium, and large, as well as between single and double precision.
EEMBC Linear algebra Memory Requirements (kbytes)
Wong: With a total of 53 workloads in FPMark, how does one make a simple and quick but meaningful comparison?
Levy: While EEMBC realizes that the true value of this benchmark suite will be seen by closely examining all the detailed scores that are generated, the members also crafted two official FPMark scores for quick comparisons. One score is literally the FPMark, calculated by taking the geometric mean of all the individual scores and multiplying the result by 100 for scaling (our attempt at making sure that no processor has a score of less than one). We also created a MicroFPMark, targeted at microcontrollers that aren’t able to run the double-precision, large-data workloads. The MicroFPMark is a geometric mean of the single-precision, small-data workloads.
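The arithmetic behind a score of this kind can be sketched in a few lines. The example below computes a geometric mean and applies the factor of 100 described above. The function names and the sample per-workload scores are made up for illustration; how EEMBC derives and normalizes the individual workload scores is not covered in the interview and is not modeled here.

```python
import math


def geometric_mean(scores):
    """Return the geometric mean of a sequence of positive scores."""
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    if any(score <= 0 for score in scores):
        raise ValueError("geometric mean needs strictly positive scores")
    # Averaging logarithms avoids overflow when many scores are multiplied.
    return math.exp(sum(math.log(score) for score in scores) / len(scores))


def scaled_geomean_score(scores, scale=100.0):
    """Geometric mean multiplied by a scale factor (100 in the interview)."""
    return scale * geometric_mean(scores)


if __name__ == "__main__":
    # Hypothetical per-workload scores, invented purely for this example.
    workload_scores = [0.02, 0.08, 0.04]
    print(round(scaled_geomean_score(workload_scores), 3))
```

A geometric mean is a natural fit for this kind of roll-up because a single very fast workload cannot dominate the total the way it would in an arithmetic mean.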