Carefully Weigh The Tradeoffs Of Cell-Based Vs. Structured ASICs

With the emergence of 90-nm process technology, ASIC designers get to explore uncharted levels of performance and density. However, it has also unleashed a slew of challenging design-integrity issues, from crosstalk and noise to IR drop and timing closure. Complicating the development process is a growing array of silicon integration options. Today’s designers can implement designs using either a cell-based or structured ASIC methodology.

General Design Goals

Our accelerator IC design project began when engineers on the team discovered that a number of patterns in Internet-based communications are repeated over and over in high-traffic applications. General-purpose, Pentium-class servers (Xeon, Opteron, and others)—usually stacked in blade configurations in a rack—often handle these computations.

But tackling such highly specialized algorithms with a general-purpose CPU was proving highly inefficient. Clearly, an opportunity existed for implementing these key algorithms in hardware, in the form of a specialized accelerator IC.

Accelerators have a long history in personal computing. Math coprocessors were widely used for many years until processor designers, moving to higher-density process technologies, integrated them on-chip. Designers of early analog modems used dedicated digital hardware in the form of gate arrays or cell-based ASICs to accelerate performance. This hardware digitally processed the signals until other technologies came along years later.

Since then, those functions have been absorbed largely into the single-instruction, multiple-data (SIMD) instructions that were added to the Pentium processor. Finally, designers still use accelerators for video graphics. By implementing these functions in a special-purpose accelerator, product developers can offer performance comparable to a bank of general-purpose processors for a fraction of the price.

Similar opportunities lie in Internet communications today. Many security functions using public-key encryption add so much data to the primary processing task that they render the system highly inefficient for all but the transfer of relatively small volumes of data. A special-purpose security device that implemented proprietary encryption schemes in hardware could significantly improve system performance.

Similarly, the rapidly growing use of XML, the markup language designed to simplify the use of richly structured documents over the Web, has presented new challenges for server designers. XML is widely used to translate databases among dissimilar systems, match up fields in dissimilar e-commerce systems, or simply exchange data between Web sites. But repeatedly processing format conversions at high volumes can quickly eat up computing resources. A specialized accelerator IC could relieve the system processor of this overhead and, in the process, dramatically improve system throughput.

Such a design would require a highly complex and dense ASIC. To maximize performance, the accelerator chip must combine multiple parallel implementations of the logical, architectural, and data-movement operations of the algorithm. Processing in the application would need to be both deep and wide.

In the deep direction, pipelined stages perform various comparison and math operations on each data packet. These pipelines feature FIFOs at both the input and output ports. Output is then transferred to another pipelined stage for additional data processing. Throughput reaches one output per clock at clock rates up to 250 MHz.

In the wide direction, the core processor clones parallel copies of the pipelined FIFO sequences to achieve further multiples of performance. With interfaces to external double-data-rate (DDR) DRAM at the inputs and outputs, the chip can process, at very high speed, data volumes well beyond what it could otherwise handle. Figure 1 contains a basic block diagram of the accelerator IC and an illustration of the processing task flow.
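As a rough illustration of the deep-and-wide arithmetic, the sketch below estimates how long one pipeline takes to emit a batch of results and what the parallel copies deliver together in steady state. Only the 250-MHz clock rate comes from the text. The lane count and pipeline depth are hypothetical, since the article doesn't state them, and the model ignores FIFO stalls and DRAM bandwidth limits.

```python
def pipeline_cycles(num_items: int, depth: int) -> int:
    """Clock cycles for one pipeline to emit num_items results.

    The first result appears after `depth` cycles (fill latency);
    each later result follows one cycle behind the previous one.
    """
    if num_items <= 0:
        return 0
    return depth + num_items - 1


def aggregate_throughput(clock_hz: float, lanes: int) -> float:
    """Steady-state results per second across `lanes` parallel pipelines,
    assuming every lane delivers one result per clock."""
    return clock_hz * lanes


CLOCK_HZ = 250e6  # upper clock rate quoted in the article
LANES = 4         # hypothetical number of parallel pipeline copies
DEPTH = 8         # hypothetical number of pipeline stages

if __name__ == "__main__":
    print("cycles for 1000 items:", pipeline_cycles(1000, DEPTH))
    print("results per second:", aggregate_throughput(CLOCK_HZ, LANES))
```

Under these assumptions, the fill latency becomes negligible for long data streams, which is why the text can characterize throughput simply as one output per clock per pipeline.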

To meet performance requirements, the team estimated that the design would require approximately 5 million gates of logic and over 5 Mbits of high-speed SRAM in a state-of-the-art process technology. The chip would feature high-performance I/O and memory controllers for interconnect to high-speed DDR2 memory located off-chip. An embedded PCI-X interface core would supply a high-speed link to the server. Support for diagnostic and test functions comes from additional on-chip buses.

System Partitioning

One early decision faced by designers was how much flexibility they needed in the device’s SRAM configuration. By implementing the device in a cell-based ASIC, the designers could choose the size and number of SRAM blocks they wanted to use in the design.

Memory options in a structured ASIC were more constrained. Each master slice featured SRAM in fixed sizes and numbers. As a result, one of the team’s first decisions was whether it could construct the design efficiently with the SRAM configurations available in a structured ASIC platform. After reviewing the wide range of options available from leading structured ASIC vendors, the team decided this constraint wasn’t a limiting factor.

A second key consideration was the availability of expertise on the design team. Like many design teams across the industry in recent years, this one had seen budget cuts significantly reduce its available design skills. While the team retained deep expertise in some areas of design, such as memory interfaces and IP integration, recent cutbacks diminished the team’s ability to address signal-integrity and testing issues. Moreover, few engineers had experience implementing high-speed serial interconnects. Finally, the team’s limited budget left no room to add resources to address these issues.

From a simple performance perspective, implementing the design in a 90-nm cell-based ASIC offered the most attractive option. Optimizing each functional block and retaining precise control over the entire IC layout let designers achieve approximately a 20% increase in performance over the same design implemented in a structured ASIC in the same 90-nm CMOS process. The 90-nm, cell-based approach also offered the optimal solution in terms of power dissipation. The designers estimated that the same chip designed in a 90-nm structured ASIC would dissipate approximately 70% more power than a comparable cell-based ASIC.
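The two estimates above can be combined into a single performance-per-watt figure. The sketch below is only a back-of-the-envelope calculation: it assumes the 20% and 70% figures apply to the same workload and operating point, which the article doesn't confirm, and the function name is illustrative.

```python
def perf_per_watt_ratio(perf_ratio: float, power_ratio: float) -> float:
    """Cell-based performance-per-watt relative to structured ASIC.

    perf_ratio:  cell-based performance / structured performance
    power_ratio: structured power / cell-based power
    """
    if perf_ratio <= 0 or power_ratio <= 0:
        raise ValueError("ratios must be positive")
    return perf_ratio * power_ratio


PERF_RATIO = 1.20   # article: ~20% higher performance for cell-based
POWER_RATIO = 1.70  # article: structured dissipates ~70% more power

if __name__ == "__main__":
    ratio = perf_per_watt_ratio(PERF_RATIO, POWER_RATIO)
    print(f"cell-based perf/W advantage: {ratio:.2f}x")
```

On these assumptions, the cell-based option would deliver roughly twice the performance per watt, which is the technical advantage the schedule and cost arguments in the following sections have to outweigh.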

Development Cycle Time

The next key issue for the design team? Time-to-market goals. The initial product development schedule called for turning the chip design around in six months. Considering the rapidly changing market conditions and the flood of new competitors, the design team calculated that a longer design cycle would pose a major risk to the product’s success.

The ASIC design team calculated that increased exposure to signal- and power-integrity issues in a standard-cell approach would translate into a significantly longer development cycle, a crucial consideration given the project’s aggressive schedule. Resolving signal-integrity, power, and timing-convergence issues would likely require multiple respins of the design.

To minimize those issues, designers would need to devote more time up front for integrity checking, power-grid design, and clock-tree distribution. Layout turnaround times would likely be extended as the team grappled with multiple trials. Invariably, time-to-handoff would be difficult to predict. Furthermore, once the team reached that goal, it would still face the traditional eight-week or longer fabrication cycle. Overall, the team estimated that it would need 12 to 18 months to develop the chip in a cell-based ASIC.

Addressing those issues also would require designers with a solid background in concepts such as crosstalk, ground bounce, and power-supply noise. That requirement could pose a major challenge for a design team short on signal-integrity expertise and already operating on a limited budget.

In comparison, the team determined that it could reach the same goal using a structured ASIC in much less time and with significantly fewer risks. A structured ASIC shrinks the development cycle by reducing the number of mask steps in the design process. Unlike a standard-cell implementation where designers configure every layer of the device, a structured ASIC is embedded with preconfigured logic, memory, and I/O in its first few layers. Designers customize the device for their specific applications by configuring the final few metal layers.

The clock tree and power grid in the structured ASIC are predefined and precharacterized. Therefore, issues such as signal integrity, power integrity, and timing convergence are much more easily resolved because the physical synthesis tool can actively use the power and clock grids as an input to physical synthesis. In turn, that reduces the likelihood of multiple respins, which are common in standard-cell designs. It also offers a tighter link to back-end processes, which makes a single-pass handoff much more likely.

After extensive research, the designers determined that with a structured ASIC, they could go from netlist to engineering samples with their design in as little as two to three months. That would significantly improve their ability to meet the tight development schedule. Figure 2 illustrates the typical time required by structured ASICs and cell-based ASICs for various stages in an IC design flow.

Development Costs

Development costs also were a major driver in the team’s decision-making process, particularly given the project’s constrained budget. The team estimated that developing a cell-based ASIC using a state-of-the-art process technology would likely require purchasing new tools for design analysis and physical verification. It also would likely require an investment in new system-level-awareness tools early in the design cycle and new signal-integrity analysis and functional-verification tools later on. Early estimates placed new tool expenses for a 90-nm cell-based ASIC design in the $300,000 range.

But the most imposing obstacle to implementing the design in a cell-based ASIC involved its large up-front nonrecurring-engineering (NRE) expense. Clearly, mask costs for a 90-nm, cell-based ASIC would run well into the millions of dollars.

In comparison, the cost structure of the structured ASIC looked extremely attractive. The architecture’s predefined clock structure would help simplify timing closure and minimize clock skew. Its embedded power grid would largely eliminate power-integrity issues and accelerate the place-and-route steps. And its embedded design-for-test (DFT) structures would minimize the need for time-consuming test insertion and functional/timing resimulation. Moreover, because the IC’s initial floorplan would be fixed, IP modeling and integration would require less effort and fewer engineering resources. The team estimated that new tool costs would run well under $100,000 to develop a structured ASIC.

The real savings, however, would come from reduced NRE costs. Instead of spending millions of dollars up front for developing a cell-based ASIC, the team calculated that it could develop the same design in a leading-edge, 90-nm structured ASIC for an NRE expense of less than $250,000 (see the table).

The primary drawback to using a structured ASIC would be higher unit costs. Given the structured ASIC’s lower gate density, designers estimated they would pay approximately twice as much per unit compared to a cell-based ASIC in high volume. However, because initial sales estimates called for shipment of only a few tens of thousands of units, volumes didn’t support the high NRE costs implicit in the standard-cell option at the 90-nm node.
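The volume argument can be made concrete with a simple break-even calculation. The sketch below is illustrative only. The $250,000 structured-ASIC NRE is the upper bound quoted in the article, and the 2x unit-cost ratio is the designers’ estimate. The $2 million cell-based NRE and the $20 cell-based unit cost are hypothetical placeholders, because the article gives no exact figures. Tool costs are left out.

```python
def total_cost(nre: float, unit_cost: float, volume: int) -> float:
    """Total program cost: one-time NRE plus per-unit cost times volume."""
    return nre + unit_cost * volume


def break_even_volume(nre_cell: float, nre_structured: float,
                      unit_cell: float, unit_structured: float) -> float:
    """Volume above which the cell-based ASIC becomes cheaper overall.

    Solves nre_cell + V * unit_cell == nre_structured + V * unit_structured.
    """
    if unit_structured <= unit_cell:
        raise ValueError("structured unit cost must exceed cell-based unit cost")
    return (nre_cell - nre_structured) / (unit_structured - unit_cell)


NRE_STRUCTURED = 250_000    # article: less than $250,000
NRE_CELL = 2_000_000        # hypothetical; article says only "millions"
UNIT_CELL = 20.0            # hypothetical unit cost in dollars
UNIT_STRUCTURED = 2 * UNIT_CELL  # article: roughly twice the cell-based cost

if __name__ == "__main__":
    v = break_even_volume(NRE_CELL, NRE_STRUCTURED, UNIT_CELL, UNIT_STRUCTURED)
    print(f"break-even volume: {v:,.0f} units")
    for volume in (30_000, 200_000):
        cell = total_cost(NRE_CELL, UNIT_CELL, volume)
        structured = total_cost(NRE_STRUCTURED, UNIT_STRUCTURED, volume)
        print(volume, "units -> cell:", cell, "structured:", structured)
```

With these placeholder numbers the break-even point falls at 87,500 units, above the few tens of thousands of units in the initial sales estimate. Different NRE or unit-cost assumptions would shift that point, so the result should be read as the shape of the tradeoff rather than the team’s actual figures.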

Designing the chip in a cell-based ASIC using a 130-nm process presented an interesting option. Though this design option didn’t offer the same density and performance benefits of a leading-edge 90-nm process, it did offer a mature, proven process technology that would lower risk and reduce the team’s break-even point in terms of cost. The proven 130-nm process offered significantly lower costs per gate and per I/O than a structured ASIC. That advantage was partially offset by higher SRAM costs, as SRAM is larger in the more mature 130-nm process technology.

The primary drawbacks to developing the IC in a 130-nm, cell-based ASIC were its high up-front costs and longer development cycle. While the 130-nm process offered lower up-front costs than a 90-nm cell-based ASIC, the half-million-dollar NRE cost remained two to three times higher than that of a structured ASIC.

In addition, if product volumes fell short of expectations, the cell-based ASIC option would become less economical, while the structured ASIC option would remain profitable across a wider range of unit volumes. Finally, the team did note that if product volumes exceeded expectations, the NEC Electronics Instant Silicon Solution Platform (ISSP) structured-ASIC architecture offered a fast and simple migration path to the vendor’s cell-based ASIC technology and lower unit costs.

Tool Considerations

From a design-tool perspective, the standard-cell approach offered the design team maximum flexibility. The design team could use a wide range of tools, from multiple vendors, to develop its ASIC. But if team members opted to do so, they would invariably face some tool-integration issues and would have to accommodate the costs of tool integration and training within their budget. Another consideration was how many designers would have access to the tools. The high cost of cell-based ASIC design tools would limit the number of licenses the team could afford.

In contrast, the use of a structured ASIC would require implementing predefined tools and a predefined flow. While this limited design flexibility, it also offered some attractive efficiencies. Traditional cell-based ASIC design is typically a highly iterative process. Problems created in synthesis are often only discovered downstream in timing analysis or physical design, because designers of cell-based ASICs use physical-synthesis tools that aren’t directly linked to the underlying silicon architecture. The tools don’t integrate knowledge of where power or clock routes are located. To maximize design flexibility, tool users must decide which wire-load model is best to apply for the design. Accordingly, calculations are largely based on estimates.
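To show why wire-load-based calculations are only estimates, the sketch below models a generic fanout-based wire-load lookup: before placement, a net’s length is guessed from its fanout alone, and capacitance is derived from that guess. Every number and name here is hypothetical and chosen for illustration. It is not drawn from any vendor library or from the tools discussed in this article.

```python
# Hypothetical wire-load table: fanout -> estimated net length (arbitrary units).
FANOUT_LENGTH = {1: 10.0, 2: 22.0, 3: 35.0, 4: 50.0}
SLOPE = 15.0           # hypothetical extra length per fanout beyond the table
CAP_PER_UNIT = 0.0002  # hypothetical capacitance (pF) per unit of length


def estimated_length(fanout: int) -> float:
    """Estimate net length from fanout alone, with no placement data."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    max_fanout = max(FANOUT_LENGTH)
    if fanout <= max_fanout:
        return FANOUT_LENGTH[fanout]
    # Beyond the table, extrapolate linearly from the last entry.
    return FANOUT_LENGTH[max_fanout] + SLOPE * (fanout - max_fanout)


def estimated_capacitance(fanout: int) -> float:
    """Estimated wire capacitance (pF) for a net with the given fanout."""
    return estimated_length(fanout) * CAP_PER_UNIT


if __name__ == "__main__":
    for fanout in (1, 4, 6):
        print(fanout, estimated_length(fanout), estimated_capacitance(fanout))
```

Two nets with the same fanout receive the same estimate in this model, even if one ends up routed across the die and the other between neighboring cells. That gap between estimate and final layout is what drives the downstream iterations described here.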

That fact makes physical design and place-and-route operations a highly iterative process in cell-based ASIC design. Timing closure often requires an unpredictable number of loops. With limited physical data, the tool can’t tightly correlate the end design with GDSII. Final place-and-route operations can become a time-consuming process.

Structured-ASIC design can eliminate much of this ambiguity by integrating the fixed aspects of the ASIC architecture into a physical-synthesis tool. As an example, through a partnership with tool vendor Synplicity, ISSP customers can use a jointly developed version of the vendor’s physical-synthesis tool optimized for the ISSP architecture, Amplify ISSP Pro.

By working off a detailed floorplan, the tool embeds knowledge of where the predefined power and clock routes in the ISSP architecture are located. It also integrates knowledge of the ISSP complex-multigate architecture into its calculations. The wire-load model is predefined with the slice selected by the designer. Also, because it already retains knowledge of the architecture’s design rules, such as its fixed ratio of flops, inverters, and multiplexers, a more highly optimized implementation can be created within a fixed device size.

Leveraging these predefined aspects of the structured ASIC architecture, the tool both increases utilization and lets the designer achieve a more predictable result. Typically, a physical-synthesis tool in a traditional cell-based design estimates about 40% of all routes in a design. Using an optimized tool with a partially predefined structured ASIC architecture, designers can come out of physical synthesis with approximately 70% of all routes fully known. That increased accuracy translates directly into faster timing closure, fewer design iterations, and ultimately, shorter turnaround time (TAT). Figure 3 shows the ISSP structured-ASIC design flow, highlighting the position in the flow of Amplify ISSP Pro.

After a thorough investigation, the design team concluded that none of its implementation options offered advantages in all aspects of the design. The 90-nm, cell-based option presented the highest performance and lowest-power implementation. It also offered the lowest unit cost at high volume.

But high NRE costs, an extensive investment in development tools and design expertise, and a longer development cycle offset those advantages. The move to a more mature 130-nm process would reduce some of those high up-front costs. But the designers would need to scale down their performance expectations and still face a long and complex development cycle.

Given the project’s aggressive development schedule and the team’s limited resources, the 90-nm structured ASIC option offered the optimal strategy. By using a state-of-the-art 90-nm process, the designers could meet their performance goals. At the same time, the fixed power and clock nets in the structured ASIC architecture, along with the availability of architecture-optimized tools, would simplify the design flow, minimize design respins, and significantly shorten TAT.

The structured ASIC’s lower up-front investment was a key advantage. Ultimately, however, the technology’s shorter development cycle and the ability it gave the team to bring the product to market faster were the deciding factors. In the end, the team agreed that the right way to deliver an accelerator is on an accelerated schedule.