We’ve heard the clichés before.
“There’s no such thing as a free lunch.”
“If it sounds too good to be true, it probably is.”
“No good deed goes unpunished.”
We have numerous ways of expressing our skepticism when it looks like we can get something for nothing. There’s always a price to pay.
In engineering, the most fundamental expression of that price lies in energy conservation. You want something fast? You’ll need higher operating power. You want lower power? Then you’re just going to have to wait.
It’s that “power × time = energy” thing. Efficiency aside, you can’t monkey with the energy. Dial down power, time goes up, and vice versa.
You can almost take that to the bank—“almost” because you can find situations where, in fact, both power and performance go in the right direction at the same time, hard as that may be to believe.
If you were looking for opportunities to save time in the world of IC design, nothing would jump out faster than verification. Hardware verification alone chews up far more than its fair share of the design cycle: 70% is frequently cited for the verification chunk. Verifying the software that’s supposed to run on complex multicore systems-on-a-chip (SoCs) just compounds the problem, as that software requires trillions of clock cycles to validate.
This was less of a problem in the past because embedded systems designers could count on standard processors and other components, including FPGAs, to put together a system—even just a prototype system—and run their software. If it turned out that they bungled something in the hardware, well, they just went and fixed it.
This isn’t possible with SoCs. You can’t simply assemble one in a Fry’s shopping cart. You’ve got millions of dollars of mask costs waiting to be spent. And they will stay waiting until everyone is darn sure everything is right so those mask costs get spent only once.
So now there are two problems. First, you’re no longer building the system out of relatively standard parts. You’re building it out of intellectual property (IP)—highly configurable IP, combined in unique ways using customized interconnect schemes. It’s no longer trivial to prototype. Second, if you wait for hardware to run software, you’re going to be late. So now you need to test the software in advance of complex, custom hardware, which gives you two choices: simulation and emulation.
Simulation is traditional. Large companies invest heavily in server farms to support the vast numbers of engineers doing verification. One large company is reputed to have farms of more than 30,000 servers running 24/7. That represents an enormous investment, not only in the hardware, but even more so in infrastructure and operating costs.
And those servers have traditionally been used “merely” for accelerating hardware simulation. It’s becoming more and more important to test out not only the hardware, but also the software that will run on the processor. If there’s a problem that involves the hardware, you want to know about it before the masks are committed.
Even simply booting Linux requires billions of clock cycles. Putting actual embedded applications through their paces can multiply the amount of required simulation capacity many times over. While more servers can get the job done in less time, the power tradeoff is as you might expect: more servers draw more power. There’s no free lunch.
By The Numbers
Let’s say you want to accelerate the simulation of a 100 million-gate design by 200 times. The typical power consumption of a PC is 500 W, so 200 of them would consume 100 kW.
The typical simulated clock speed that you can get from a PC for this size design is about 10 Hz. So, 200 of them running in parallel give you a 2-kHz equivalent. And that’s optimistic. It assumes you really can speed things up 200 times by running parallel simulations.
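The back-of-envelope arithmetic behind those figures can be sketched out directly. The 500-W and 10-Hz numbers are the rough estimates quoted above, not measurements:

```python
# Rough figures for the simulation farm described above (a sketch, using
# the article's estimates of 500 W per PC and 10 Hz simulated clock).
PC_POWER_W = 500        # typical power draw of one PC
PC_SIM_CLOCK_HZ = 10    # simulated clock rate for a 100M-gate design
SERVERS = 200           # target acceleration factor

farm_power_kw = SERVERS * PC_POWER_W / 1000
# Optimistic: assumes the workload parallelizes perfectly across servers.
ideal_parallel_clock_hz = SERVERS * PC_SIM_CLOCK_HZ

print(farm_power_kw)            # 100.0 (kW)
print(ideal_parallel_clock_hz)  # 2000 (Hz, i.e. a 2-kHz equivalent)
```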
That kind of acceleration is unlikely because simulation is a tough thing to parallelize. You can’t take a single run and split it across processors; by definition, it’s sequential. The whole point is to determine the sequence of what happens after what, and you can’t compute that out of order. It’s not that different from a pregnancy: nine months is what it takes, and no number of volunteers can accelerate that natural process.
You may not be able to parallelize a single test run, but you can parallelize a set of tests. If you have a suite of 1000 regression tests, you can replicate the simulation environment in each server and have each one run different tests. If the tests are all the same length, then each server runs five of them and you’re done 200 times faster.
That’s not typical, however. The long tests are always going to limit things. So this is an example where you pay 200 times the power price, but get less than 200 times back in faster turnaround. It may not be a good bargain, but if simulation is the only option, then, well, faster is faster, and it may be acceptable.
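A small sketch shows how a single long test caps the speedup no matter how many servers you pay for. The test durations here are hypothetical, chosen only to illustrate the effect:

```python
# Why the longest test limits a parallel regression run (hypothetical
# durations, in hours): 999 short tests plus one long 12-hour test.
durations_h = [0.5] * 999 + [12.0]

serial_time_h = sum(durations_h)      # one server running everything
# With tests spread across servers, wall-clock time can never beat the
# single longest test, regardless of server count.
parallel_time_h = max(durations_h)
speedup = serial_time_h / parallel_time_h

print(round(serial_time_h, 1))  # 511.5
print(round(speedup, 1))        # 42.6 -- far short of the 200x you paid for
```

So you pay 200 times the power but recover only a fraction of that in turnaround, exactly the shortfall described above.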
The Emulation Option
But simulation isn’t the only option. Emulation is the other approach, and the numbers change pretty dramatically, even though different emulators have very different power consumption levels. For an emulator configured to run 100 million gates, you may consume as much as 50 kW if the emulator uses a custom processor chip, or as little as 2 kW if it’s built from standard FPGAs.
These emulators can run the full design, so there’s no need to sign up parallel units to split the workload. On these large designs, the processor-based units can run an equivalent of 200 kHz—about a hundred times faster than simulation. Even at the higher 50-kW level, that’s still only half the power of the servers needed to complete the equivalent simulation. Both power and performance, then, go in the right direction.
Given that standard FPGA-based emulators consume about one-twenty-fifth the power of other emulators, you might expect them to be slower. But, in fact, they run around five times faster, at just under 1 MHz on a 100 million-gate design. That’s 500 times faster than simulation, with one-fiftieth the power consumption.
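Pulling the figures together makes the comparison concrete. This sketch uses the rough numbers quoted above, with “just under 1 MHz” rounded to 1 MHz:

```python
# Side-by-side of the article's estimates for a 100M-gate design.
# power_kw and clock_hz are the rough figures from the text, not measurements.
options = {
    "simulation farm (200 PCs)": {"power_kw": 100, "clock_hz": 2_000},
    "processor-based emulator":  {"power_kw": 50,  "clock_hz": 200_000},
    "FPGA-based emulator":       {"power_kw": 2,   "clock_hz": 1_000_000},
}

base = options["simulation farm (200 PCs)"]
results = {}
for name, o in options.items():
    results[name] = {
        "speedup_vs_sim": o["clock_hz"] / base["clock_hz"],
        "power_vs_sim":   o["power_kw"] / base["power_kw"],
    }
    print(name, results[name])
# processor-based emulator: 100x the speed at 0.5x the power
# FPGA-based emulator:      500x the speed at 0.02x the power
```

Both dials move the right way at once, which is the whole point of the comparison.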
Since the general laws of nature (and common sense) view something-for-nothing with suspicion, it’s reasonable to ask why this might be. We can divide the answer into two parts: one for the comparison of emulation to simulation, and one for the difference between emulators.
In the first case, it’s a simple matter of efficiency. When you simulate, you are literally employing hundreds of chips to do the work that your designed chip will eventually do by itself. It’s like having 50 guys digging a hole with their hands compared to one guy with a shovel. The guy with the shovel will get done faster, and he will exert less overall effort in the bargain. In the same manner, an emulator is simply more efficient than simulation.
If we look at the emulators themselves, a different dynamic comes into play. First, the processor-based emulator executes the design just like a simulator does: via a data structure. The design is not an actual hardware logic implementation. Second, those custom processor chips are expensive to create, so each chip must be amortized for as long as possible. That means you aren’t going to spin a new chip on every technology node; the latest nodes are simply too expensive to jump onto.
By contrast, FPGAs drive technology. Their huge volumes make it advantageous for FPGA makers to be as aggressive as possible in the technology they use. An emulator built with a leading-edge standard FPGA gets to take advantage of far more advanced technology than a custom processor chip will be able to access. And, when an FPGA is used, the design is implemented as hardware logic, with the obvious benefit of higher speed and accuracy.
So there are actually good reasons why you can truly get better performance with less power consumption by using emulation instead of simulation. And that’s especially true for FPGA-based emulators—they give the most performance for the least power.
It almost tastes like a free lunch.