C-level hand-off (generating the entire RTL design from C-language code written by algorithm developers) has been one of the most elusive goals of the EDA industry. A few design teams, often working on heavily datapath-based designs, claim to routinely achieve it. Many more teams employ C code in behavioral modeling, and some of these use C-to-RTL tools to generate non-programmable RTL for some blocks. Announcements of new “high-level synthesis” tools routinely claim that the entire problem is now solved. But generating RTL for a full chip from C remains in an ambiguous state: neither as impossible as levitation nor as achievable as, say, hang-gliding.
There are powerful reasons for this ambiguity. To understand them, it is useful to examine the SoC design process as it is practiced today, with particular attention to how the original design intent passes from stage to stage of the process.
A Matter of Intent
Let us begin by walking through the stages of a typical SoC design (Fig. 1). The process invariably begins with a written specification in a human language, usually English. This spec is typically the result of consolidating information from engineering, marketing, and customers and then trying to form some coherent image of a chip that can meet the most important needs.
This specification passes to the algorithm designers, who may or may not have been involved when the document was created. It is their job to partition the frequently unclear spec into data sets and subsystems, to define the data, and to find or create algorithms by which the subsystems will do their work. In this task they employ spreadsheets, algebraic modeling tools such as MATLAB, and often C programs to define and model the behavior of individual subsystems at an algorithm or functional level.
The work of the algorithm designers passes on to another set of specialists, the chip architects. Their job is to take in the algorithm designers’ view of the system—data sets and algorithms—and turn it into an SoC architecture: functional blocks, memory instances, ports and busses.
Already we have seen the design intent pass through two manual translations, each with the chance of human error. Algorithm experts may mistake the meaning of the written spec. Chip architects may not see the overall vision of the algorithm experts. But there is a much more serious cause of lost design intent just ahead.
In imagining the SoC, architects are visualizing a set of highly concurrent processes. Not only does each subsystem in the chip run independently, but many subsystems are themselves made up of concurrently executing blocks: multicore CPUs, vector processors, or pipelined operations, for example. Yet the algebraic models architects use imply nothing about how their operations will be executed. And a behavioral description in C provides no explicit information about parallelism, even though architects think in terms of it and C-to-RTL tools demand it.
By modeling in C, architects impose a particular, sequential order on the operations in the system. Despite efforts to model independent subsystems separately and to sometimes use non-ANSI-C extensions to describe concurrency and synchronization, the original concept of a fully concurrent system cannot pass into the C models. So an important piece of the algorithm developers’ and architects’ original vision of the chip is lost from the design data. At this stage also, the architects begin the process of hardware/software partitioning. But they must work without accurate estimates of the impact of their decisions on chip performance, size and power.
At this point the C code has become a de facto (and, as we have seen, significantly flawed) executable spec. It defines the behavior and implies the structure of the design. It then moves on into the hands of another, often separate, team that will begin the definitions of processors, memory instances, and data path widths that will define the microarchitecture. This team will extract some of the C code for optimization and execution on embedded CPUs or DSPs. They may convert some blocks directly from C to RTL. But many blocks will be handed off to RTL coders, who manually translate written specs and behavioral C into synthesizable RTL.
At this point, the design has moved from architecture to implementation. But arguably it has left behind much of what the algorithm experts and architects knew. That knowledge will be painfully duplicated later in the design when timing won’t close, accelerators have to be designed, pipelines have to be extended or replicated, buffers deepened, and partitioning revisited, all at a huge cost in schedule and waste. It would have been far better if the architects could have built their C models, used those models themselves to make accurate estimates of performance and power at the C-level, and done a correct job of partitioning and microarchitecting, eliminating iterations and creating code that actually could be converted into RTL. But that has been impossible.
Or so it was thought. But there are counter-examples, like non-programmable data paths, where C-to-RTL conversion routinely works. Most of the time, only a restricted subset of ANSI C is allowed in these conversions. And there are points where even this restricted high-level synthesis fails, but an experienced designer can see what went wrong. Several years ago, the founders of Algotochip asked themselves two questions. First, is it possible to restrict the domain of C-to-RTL conversion to blocks where it can actually succeed, and deal with the rest of the design in another, still automatic, way? And second, if we include experienced humans in the flow, to guide and complement the tools, can we not merely achieve a working SoC, but actually capture what the algorithm developers and architects knew about the design, and pass that knowledge around the restrictive filter of C modeling, so the architects’ wisdom can improve the results of the synthesis? We believe we have shown that the answers to both questions are “yes.”
The all-important key is a simple and, in retrospect, blindingly obvious realization. Let the algorithm developers write in C. In fact, let them write in full ANSI C, including pointer expressions, malloc, and whatever else they need to express the design that was born in their whiteboard diagrams.
Then from the code, extract data and control graphs, just as if we were going to optimize the code. Use these graphs, in an automated but human-supervised process, to divide the code blocks into two bins: blocks that are good candidates for C-to-RTL conversion and those that aren’t.
Now comes the realization. You can’t create good RTL from those difficult blocks. But if you leave them in C, you can automatically generate small, optimized programmable processor kernels to execute them. Accelerator synthesis has been a viable technology for years, and has been commercialized several times: it is proven. And, most important here, the performance and power of a small processor kernel optimized for a specific, small block of control code can approach that of hand-translated RTL once the block is integrated into the chip. This is true partly because the custom processor can avoid caches and busses, instead connecting like a state machine directly into the blocks it interacts with.
Generating a processor entails a lot of additional work. A processor requires not just good RTL, but simulation models, a C compiler, assembler, and linker. But tools can generate all these items automatically from the processor description.
Now we have a system composed of many blocks of C. Some are good candidates for RTL synthesis. Some are destined to stay in software. We should pause here to admit that while this methodology is intellectually rigorous, the result may be partly impractical. Some design teams, for reasons of preference or preservation of legacy code, will want to use an industry-standard processor. They may choose to consolidate some C blocks onto an ARM or other CPU. This is not a problem.
At this stage, the design is in a perfect state for verifying the architecture. It is a relatively fine-grained model that is still executable, instead of requiring much slower simulation. If only there were an accurate way to estimate performance and power for the small blocks, there would be a good chance of preventing iterations.
And in fact there is a way. The key again is the small blocks and the fine-grained hardware/software partitioning. We have found that for these blocks we can build estimators that have better than 90% accuracy at predicting the final design, based on the PDK for the target process.
Accurate estimators allow our team, working in cooperation with the architects, to optimize the architecture with confidence. From here, our tools convert the C-code blocks and processor descriptions to RTL, and then take the RTL through a conventional synthesis-to-GDSII flow.
We have done several things here: 1) restricted the problem by automatically separating out code that can be implemented in RTL from code that should remain in software; 2) produced a human-supervised C-to-RTL tool optimized for this restricted problem; 3) produced an optimized CPU core generator; and 4) made chip optimization at the C level predictable. The result is a flow that allows algorithm developers to hand off an ANSI C model of a full SoC, with performance and power constraints and test vectors, and expect back GDSII that will produce a working chip, along with all associated SDKs, application software, and firmware to run on this chip. An example might help show how this flow works in practice.
Designing A Communications Processor
As an example, we will use an LTE-like communications processor. The digital functions of the chip are represented by a large C program as shown below (Fig. 2).
2. This represents a C-code call graph for an LTE-type communications system with some of the main functions called.
Much as the chip architects might receive it from the algorithm developers, this C program is the point of hand-off from the customer to the Algotochip design team, as shown in Figure 3.
The chip architects at Algotochip identify the data flows and opportunities for parallelism in the design by applying code analysis tools to the C program. Then, by profiling the code, they determine the activity levels for each block. The result might look like Table 1, which lists functions and the percentage of total compute time the program spends in each function and in the functions it calls. For example, the program spends all of its time in functions called by main, naturally enough. But as highlighted in the table, the program spends 86.38% of its cycles in and below the turbo decoder function. This data suggests that the turbo decoder is an excellent first candidate for implementation in hardware.
Table 1. The table indicates the percentage of time spent in each function when implemented as a software solution.
Accordingly, the architects load the C code for the turbo decoder and its subsidiary functions into the RTL generating tools, which create the RTL and models for the turbo accelerator block. The architects then re-profile the code assuming use of the accelerator. Table 2 shows that the turbo decoder is now accounting for only 4.29% of the cycles. This figure can be reduced to zero by specifying a non-blocking interface for the accelerator.
Table 2. The turbo decoder is implemented in customized hardware, with updated cycle counts.
For modest-speed systems this one acceleration may be enough. But if the specification calls for 150 Mb/s speed, for example, the tools will show that the design still is not fast enough. At this point the architects can go back to the table and observe that line 9, the FFT, is now accounting for 22.22% of the cycles. So they would repeat the process of C-to-RTL generation to create an FFT hardware engine.
This iterative process of profiling, examination, and accelerator generation continues until the design meets its performance specs. We have found human intelligence a necessary part of the flow, to prevent, for example, the tools from creating an accelerator to compensate for a badly-coded block of C.
Once all the necessary hardware acceleration RTL is in place, the architects submit the remaining code to an optimal-CPU generator that will create a custom processor with the best instruction set and I/O structure for that code. The resulting chip architecture is shown in Figure 4. Note that the tool flow has created memory instances and interconnect as required by the application. For example, the FFT block sends data directly to the channel estimator’s local memory, not back through the CPU’s memory. (For convenience, we have shown connections to the CPU’s basic data memory as red arrows.)
4. The final architecture block diagram is shown with the blue color blocks indicating non-programmable HW accelerators. Note that the application software to run on this architecture has also been created.
At this point the design is partitioned into hardware and software blocks using accurate timing estimation. A custom memory and bus architecture has been generated. Correct RTL and C blocks are ready to pass to the synthesis and software design teams, respectively. Elapsed time since the receipt of the original C code has been about four weeks.
Satish Padmanabhan, CTO and founder of Algotochip Corp., has over 18 years of experience taking cutting-edge technology from conceptual ideas to working SoCs. All the products worked in the first pass. Prior to Algotochip, Satish worked in various engineering and management positions at ZSP, LSI Logic, Riverstone, and Lucent. As a co-founding chief architect at ZSP, he created the first superscalar and the fastest DSP in the world. Satish was the chief architect of 10-Gigabit Ethernet and 10-Gigabit SONET products at Riverstone Networks and Lucent Technologies. These chips are used in the Metro Edge network by leading customers around the world. These products were the industry’s first to support complete virtual concatenation in an OC-192 link. Satish received his M.S. in Digital Signal Processing under Professor Allen Gersho from the University of California at Santa Barbara.