We need improved standard-cell libraries if we're ever to achieve timing closure and boost standard-cell chip performance. In sub-0.18-µm silicon, short wire loads may have a capacitance of 1 fF, while long (several millimeter) wires may have a capacitance as large as 1000 fF. In typical cell libraries, drive strengths range from 1x to 4x on logic cells, and perhaps 1x to 16x for buffers. Loading ratios may be 1:1000, then, while drive ratios are only 1:4 or 1:8.
Big differences between the load and drive ratios cause timing-closure problems because delay depends on both load and drive. When the two ratios fall so far out of scale with each other, the one with the larger variation has the greater impact. That variation lies largely in wire delays, which depend on routing. Routing is one of the last implementation steps, so accurate delay measurements must wait until the end of implementation. Say goodbye to predictability and hello to iterations.
Differences in the load and drive ratios also affect chip performance. An analogy might be a world in which pc-board designers only had three resistance values: 10, 20, and 40 Ω. These can be combined to create many R values, but at what cost?
With system-level design moving from the board to the chip, resistance and thus load must be minimized because load is synonymous with delay. IC designers can control delays by selecting drives to match load. Designers may have trouble meeting specifications without a range of drive strengths to match loads. This issue becomes obvious when designers realize they are trying to drive both loads of 1 fF and of 1000 fF with nearly equal-sized cells.
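As a back-of-the-envelope illustration of why drive must match load, consider a first-order RC delay model in which a cell's output resistance scales inversely with its drive strength. The resistance value below is an assumption for illustration, not a figure from any real library:

```python
# First-order RC delay sketch. R_1X is an assumed output resistance
# for a hypothetical 1x cell; all numbers are illustrative.
R_1X = 5000.0  # ohms, assumed output resistance of a 1x cell

def delay_ps(drive, load_ff):
    """Delay ~ R * C, with R scaling inversely with drive strength."""
    r = R_1X / drive          # stronger cell -> lower resistance
    c = load_ff * 1e-15       # femtofarads -> farads
    return r * c * 1e12       # seconds -> picoseconds

# A 4x cell handles a 1-fF load easily...
print(delay_ps(4, 1))        # -> 1.25 ps
# ...but is swamped by a 1000-fF long wire:
print(delay_ps(4, 1000))     # -> 1250.0 ps
# A much larger (hypothetical) 64x driver restores balance:
print(delay_ps(64, 1000))    # -> 78.125 ps
```

The thousand-fold spread in delay for the same cell is the point: without large drivers in the library, the big loads dictate the clock period.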
Surprisingly, the optimum buffer size needed for a long wire is much larger than the largest buffer size in most libraries. Cells in today's libraries cannot drive the long wires on a large ASIC at top speed, causing the performance of standard-cell processes to fall further behind full-custom designs of the same feature size.
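A quick logical-effort-style calculation shows how far the optimum driver outstrips a 16x library maximum. Assume, purely for illustration, a 1-fF input capacitance for a 1x inverter, a 1000-fF wire load, and the rule of thumb that each buffer stage should drive roughly four times its own input capacitance:

```python
import math

# Logical-effort buffer-chain sketch; every number here is an
# assumption for illustration, not library data.
C_IN_1X = 1.0        # fF, assumed input cap of a 1x inverter
C_LOAD = 1000.0      # fF, long-wire load from the article
STAGE_EFFORT = 4.0   # near-optimal capacitance ratio per stage

total_effort = C_LOAD / C_IN_1X                        # 1000
n_stages = round(math.log(total_effort, STAGE_EFFORT)) # ~5 stages
# The final stage must drive C_LOAD at the chosen stage effort,
# so its size (in multiples of 1x) is:
final_size = C_LOAD / STAGE_EFFORT                     # 250x
```

Under these assumptions the last buffer in the chain is roughly a 250x cell, an order of magnitude beyond the 16x ceiling of typical libraries.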
Adding buffers to compensate for missing drive strengths makes things worse. This common approach constructs cells from compute and buffer stages. Compute stages implement logic functions using minimum-size transistors. Output buffers are added to drive larger loads. However, logic optimization algorithms add buffers to do the same thing. Optimization can add a buffer if needed, but it can't remove a rogue internal buffer stage. Cells with output buffers do nothing to enrich the library for optimal performance.
To see how today's libraries got this way, look at today's library evaluation techniques. Typically, a test design is synthesized with a tool such as Synopsys Design Compiler. Design Compiler uses wire-load models, which worked fine when wire-delay ratios were smaller relative to drive-strength ratios. But this technique has led to false conclusions that have shaped today's libraries.
With today's submicron processes, wire-load tables in synthesis lead to false conclusions about performance. Lacking any real placement data, synthesis estimates the wire load as a function of fanout. In sub-0.25-µm processes, the correlation between wire capacitance and fanout is quite weak. A single-fanout wire can travel 2 mm and have a very large wire load of 500 fF. Such a load obviously cannot be split, and it is much larger than the predicted load for a single-fanout net.
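The size of the estimation error is easy to see with a small comparison. The wire-load table entries and the per-millimeter capacitance below are hypothetical values chosen only to illustrate the mismatch:

```python
# Fanout-based wire-load estimate vs. actual routed capacitance.
# The table entries and CAP_PER_MM_FF are assumed values.
CAP_PER_MM_FF = 250.0  # fF per mm of routed wire (assumption)

# Hypothetical wire-load table: estimated capacitance (fF) by fanout.
wireload_table = {1: 15.0, 2: 25.0, 4: 45.0, 8: 85.0}

fanout = 1
estimated = wireload_table[fanout]   # table predicts 15 fF
actual = 2.0 * CAP_PER_MM_FF         # a 2-mm route: 500 fF

# The statistical estimate is off by more than 30x for this net:
print(actual / estimated)
```

A synthesis tool sizing drivers against the 15-fF estimate will pick a cell hopelessly undersized for the real 500-fF net, and no amount of re-synthesis against the same table can correct it.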
To fix these problems, libraries need more drive strengths across more logic functions. With EDA tools designed to exploit enriched libraries, run times shrink, design area can shrink, and designs reach closure much more quickly.
With only a small number of drive strengths, the wire is the limiter. Improve the libraries, and standard cells can deliver the best performance physically possible.