Understanding 28-nm SoC Design With ARM-Based Cores

Combining today’s process nodes, which offer smaller feature sizes and faster interconnects, with the implementation of ARM-based systems-on-chips (SoCs) at 28 nm gives designers every opportunity to create higher-speed and more compact devices. However, with these opportunities come challenges that must be considered.

28-nm Technology: Challenges And Solutions

Because smaller process nodes such as 28 nm fit more transistors into the same area, designs tend to be much larger and more power-hungry than those on older nodes. In addition, because wires become longer and thinner, wire delay has a significantly greater effect on a design’s timing characteristics.

On top of that, smaller process features introduce new hurdles to overcome in manufacturing, including lithography effects as well as more complex design-rule-checking (DRC) rules that must be considered by both the designer and tool flow. In other words, to achieve the benefits afforded by 28 nm, design teams need to contend with three key challenges: design size/complexity, timing variability, and silicon manufacturability.

A robust prototyping and hierarchical methodology, as well as the implementation of power-reduction features in the design, handles design size and complexity. Advanced modeling and clock/design optimizations address timing variability. The application of various design-for-manufacturing (DFM) techniques can maximize silicon manufacturability.

Design Size And Complexity

Increased design sizes and complexity impact design flow in several ways. Larger designs mean longer runtimes and less chance for interactive exploration. Furthermore, they generally result in devices that consume more power. So, what are the best ways to surmount these challenges?

First, given the fact that many 28-nm designs have from 10 million to 50 million placeable instances, it’s important that the design methodology employs some form of abstraction to prevent unreasonably long runtimes. This is especially true of the full-chip design prototyping stage, where designers must analyze many “what-if” scenarios to come up with the best floorplan and starting point for the design. But an abstracted netlist tends to lose some accuracy due to compression or omission of information, and therefore runs the risk of not providing convergent results upon analysis.

To create a smaller netlist while still preserving a reasonable amount of accuracy, a designer could take a “flexible abstraction model” approach. It preserves sequential elements of the netlist, while core (non-boundary) combinational logic is compressed to significantly speed up runtime. This enables rapid, interactive prototyping at the floorplan level. With the flexible abstraction model, the physical implementation tool can produce a much smaller netlist, yet still maintain the connectivity of the relevant logic and mimic the behavior of a fully uncompressed netlist.

Data indicates that netlist compression of up to 95% is possible with the flexible abstraction approach, cutting runtime to as little as 5% of that required for the original netlist. It also significantly reduces memory usage. Thus, designers can analyze and plan the design interactively, even for designs that are traditionally too large for such analysis. The end result is an improved floorplan design and a better starting point for the rest of the design flow to converge, which means a faster path to attaining the desired power, performance, and area.
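As a rough illustration of the idea, the sketch below applies the keep/compress rule to a toy netlist: sequential cells and boundary combinational cells are kept, and core combinational cells are dropped. The `Cell` record and its `boundary` flag are hypothetical and not taken from any tool's data model. A real implementation tool would replace the compressed logic with a model that preserves connectivity and timing, whereas this sketch only counts instances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    name: str
    sequential: bool  # flip-flop or latch
    boundary: bool    # on a path to or from a block port (hypothetical flag)


def abstract_netlist(cells):
    """Keep sequential and boundary cells; drop core combinational cells."""
    return [c for c in cells if c.sequential or c.boundary]


def compression_ratio(original, abstracted):
    """Fraction of instances removed by the abstraction."""
    return 1.0 - len(abstracted) / len(original)


# Toy netlist: 5 registers, 5 boundary gates, 190 core combinational gates.
cells = (
    [Cell(f"ff{i}", True, False) for i in range(5)]
    + [Cell(f"bnd{i}", False, True) for i in range(5)]
    + [Cell(f"core{i}", False, False) for i in range(190)]
)
abstracted = abstract_netlist(cells)
ratio = compression_ratio(cells, abstracted)
print(f"{len(cells)} -> {len(abstracted)} instances, {ratio:.0%} compression")
```

The toy numbers are chosen so that the result lands at the 95% figure quoted above; actual ratios depend on how much of a design is core combinational logic.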

In addition, since most 28-nm designs are hierarchical in nature due to increasing design sizes, more efficient hierarchical closure becomes a necessity. The recommendation here is an interface logic module (ILM) approach to top-level hierarchical closure, where the tool can at least “see” block-level boundary logic timing during top-level optimization.

Going a step further, it’s now possible to optimize block-level boundary paths at the top level. Users can customize the visibility and “editability” of different partition blocks at the top level, allowing designers to ensure sufficient optimization of blocks containing many inter-partition critical paths. This approach’s main benefit is that it gives users the flexibility to utilize the appropriate mode of visibility and editability for each partition block. For example, users can opt to employ a “black box” mode for fully closed, less timing-critical blocks, or an “interface editable” mode to allow for concurrent top- and block-level optimization for blocks with more inter-partition critical paths.

From a power-consumption perspective, it’s fairly common knowledge that larger, more complex designs tend to have higher power requirements. Reducing power, then, is almost always a high priority in 28-nm design methodologies. The only exceptions are those cases where it’s not possible to do so, such as designs that have to be on all the time (e.g., certain gigabit networking applications).

Two standard power-saving methods that should be the staple of all 28-nm methodologies are clock gating and multiple threshold voltage swapping. Clock gating can typically reduce dynamic power consumption by 15% to 30%, whereas multi-threshold voltage swapping usually reduces leakage power consumption by 50% to 70%. The effectiveness of both techniques depends heavily on how much design area is occupied by random logic versus hard macros.

In 28-nm designs, leakage power becomes more of an issue compared to previous nodes. This is due to thinner gate oxides, as well as more electron leakage between the drain and source of transistors. Especially in mobile applications, excessive leakage power has the potential to kill a project if not handled well.

Therefore, in addition to employing multi-threshold voltage swapping, many design teams also take a power-domain approach by implementing power shutoff (Fig. 1). Power shutoff is by far the most effective means of reducing leakage power. When a power domain is in power-shutoff mode, the logic connection to the rails is separated by an open transistor, which drops leakage power consumption by approximately 92% to 99%.

1. This power-domain-based low power methodology spans from system to GDSII.

Power shutoff may seem like a cure-all among power-reduction techniques. However, in many cases, not all of the chip can be shut off: control logic, sleep signals, and even some entire blocks of logic may need to remain functional while the non-critical portions are shut down. In addition, if the design’s behavior requires the entire design to be always-on (e.g., gigabit networking devices), power shutoff tends to be less effective. Therefore, power shutoff is most efficient for designs where large blocks of logic or macros can be shut down for extended periods of time. Many of ARM’s processor cores, such as the Cortex-A15 and Cortex-A9, have boundary signals that put the processor core in power-shutoff mode.
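The savings ranges quoted in this section can be turned into a quick back-of-the-envelope estimate. The sketch below is only that: the baseline dynamic and leakage figures are hypothetical, and each technique is applied independently to the baseline rather than stacked on top of the others.

```python
def reduced(power_mw, saving_low, saving_high):
    """Return (best, worst) remaining power for a saving in the given range."""
    return power_mw * (1 - saving_high), power_mw * (1 - saving_low)


# Hypothetical baseline for one block, in milliwatts.
dynamic_mw = 100.0
leakage_mw = 40.0

# Ranges quoted in the text.
CLOCK_GATING = (0.15, 0.30)   # dynamic power saving
MULTI_VT = (0.50, 0.70)       # leakage power saving
POWER_SHUTOFF = (0.92, 0.99)  # leakage saving while the domain is shut off

dyn_best, dyn_worst = reduced(dynamic_mw, *CLOCK_GATING)
leak_best, leak_worst = reduced(leakage_mw, *MULTI_VT)
off_best, off_worst = reduced(leakage_mw, *POWER_SHUTOFF)

print(f"dynamic with clock gating: {dyn_best:.1f} to {dyn_worst:.1f} mW")
print(f"leakage with multi-Vt:     {leak_best:.1f} to {leak_worst:.1f} mW")
print(f"leakage while shut off:    {off_best:.1f} to {off_worst:.1f} mW")
```

The shutoff figure applies only while the domain is actually off, so the average saving over time depends on how long the block can stay shut down.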

Variability In Timing Closure

With 28-nm technology, it’s possible to produce ARM-based SoCs that operate in the gigahertz clock-frequency range. The main driver behind this is the acceleration of interconnects in the 28-nm realm. However, one factor complicating a design’s timing closure involves the variability of both interconnects and cell timing characteristics.

One thing is certain at 28 nm: thinner and longer wires are closer to other metal than ever before. Thus, wire delay becomes more of a problem in a design’s total timing characteristics. System variability may be caused by other factors, though, such as on-chip variation and diverged timing characteristics across different modes and corners.

As such, the impact of wire delay on system timing has prompted many designers to look at it earlier in the design flow. To wit, it has triggered a re-emergence of physical-aware synthesis. The synthesis tool has visibility of the exact location of standard cells and other logic, allowing for more accurate estimation and calculation of wire delay than a wireload model approach. The end result is a more convergent path to timing closure.

If possible, the synthesis script should include writing out a netlist with full legalized placement (where there are no overlapping placements, or placements that violate DRC rules) to physical implementation. This would allow the logic and physical structure planned out by the synthesis engine to continue undisrupted into the physical-implementation stage.

During physical implementation, it’s important to utilize upper metal layers more systematically for longer routes. Foundries may provide a few higher layers with better timing characteristics than the lower metal layers. Therefore, especially for timing-critical nets, it’s important that the methodology allows signals to quickly traverse up to the higher metal layers and then cross large distances using the faster, higher metal-layer interconnect.

Modern design methodologies allow for the setup of multimode, multicorner (MMMC) views. A view includes components such as a cell delay corner and RC delay corner, tied together with a certain mode (specified in the timing constraints). For 28-nm designs, design teams typically end up with five to 10 different MMMC views during implementation. However, the number of views at the signoff stage can escalate to more than 30.

Effective view pruning, which involves selecting only timing-critical views for implementation, helps reduce the flow’s overall runtime. During final static timing analysis (STA), though, one must account for all modes and corners. Consequently, an effective 28-nm methodology would need to include some way to automate final fixes for any remaining critical paths that appear when using all MMMC views.
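To make the view bookkeeping concrete, the sketch below builds views as combinations of a mode, a cell delay corner, and an RC corner, then prunes them for implementation by keeping the views with the most critical worst slack. All mode names, corner names, and slack values are hypothetical, and ranking by worst slack is just one possible pruning criterion.

```python
from itertools import product

modes = ["func", "scan"]          # hypothetical constraint modes
cell_corners = ["slow", "fast"]   # hypothetical cell delay corners
rc_corners = ["cworst", "cbest"]  # hypothetical RC delay corners

# Every combination is a candidate analysis view.
all_views = [
    f"{mode}:{cell}:{rc}"
    for mode, cell, rc in product(modes, cell_corners, rc_corners)
]

# Hypothetical worst slack per view (ns), e.g. from an earlier timing run.
slacks_ns = [-0.12, 0.05, 0.30, 0.41, -0.03, 0.10, 0.22, 0.35]
worst_slack_ns = dict(zip(all_views, slacks_ns))


def prune_views(worst_slack, keep):
    """Keep the `keep` views with the smallest (most critical) worst slack."""
    ranked = sorted(worst_slack, key=worst_slack.get)
    return ranked[:keep]


implementation_views = prune_views(worst_slack_ns, keep=3)
signoff_views = all_views  # final STA still covers every mode and corner

print("implementation:", implementation_views)
print("signoff view count:", len(signoff_views))
```

Because signoff runs on the full list, any path that is critical only in a pruned view will surface there, which is why the final-fix step mentioned above is needed.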

Due to increased on-chip variation (OCV) effects at 28 nm, the use of OCV de-rating factors makes designs considerably harder to close, even when utilizing common-path pessimism removal (CPPR). Obtaining a more realistic view of OCV involves taking the next step to “advanced OCV.”

The advanced OCV approach doesn’t model OCV de-rating factors as a fixed percentage. Instead, 2D tables describe OCV effects in terms of physical distance between cells, and the number of logic stages between cells. Many 28-nm design teams have begun using advanced OCV tables in timing libraries. Also, newer physical-implementation methodologies allow for both timing analysis and optimization to be done, accounting for advanced OCV tables.
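A minimal sketch of such a table lookup follows. The breakpoints and de-rating values are invented for illustration and do not come from any foundry library. The lookup simply picks the bin at or below the requested value, whereas production timers typically interpolate between table entries.

```python
from bisect import bisect_right

DEPTHS = [1, 2, 4, 8, 16]           # logic stages on the path
DISTANCES_UM = [0, 100, 500, 1000]  # distance between cells, in microns

# Hypothetical late de-rating factors.
# Rows follow DISTANCES_UM, columns follow DEPTHS.
LATE_DERATE = [
    [1.10, 1.08, 1.06, 1.05, 1.04],
    [1.11, 1.09, 1.07, 1.06, 1.05],
    [1.13, 1.11, 1.09, 1.07, 1.06],
    [1.15, 1.13, 1.11, 1.09, 1.08],
]

FLAT_DERATE = 1.10  # hypothetical fixed-percentage factor for comparison


def _bin(breakpoints, value):
    """Index of the largest breakpoint <= value, clamped to the table."""
    index = bisect_right(breakpoints, value) - 1
    return max(0, min(index, len(breakpoints) - 1))


def late_derate(distance_um, depth):
    """Look up the de-rating factor for a path by distance and stage count."""
    return LATE_DERATE[_bin(DISTANCES_UM, distance_um)][_bin(DEPTHS, depth)]


short_deep = late_derate(distance_um=50, depth=12)  # nearby cells, many stages
long_shallow = late_derate(distance_um=1200, depth=1)  # distant cells, one stage
print(f"short, deep path:   {short_deep} (flat: {FLAT_DERATE})")
print(f"long, shallow path: {long_shallow} (flat: {FLAT_DERATE})")
```

In this made-up table, a deep path between nearby cells receives less de-rating than the fixed factor, which is the kind of pessimism reduction the table-based approach is meant to provide.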

One particularly critical aspect of 28-nm design is clock network implementation. Typical criteria given to the clock-tree-synthesis (CTS) engine consist of max skew, slew rate, latency, and others. However, they’re essentially supporting criteria that work toward the goal of a design that meets timing, power, and area specifications.

Attaining the tightest skew between registers that aren’t critically timing-connected to each other doesn’t necessarily improve a design. In fact, it might be detrimental to a design’s overall timing, because that power and area could potentially be used to optimize a more critical timing path.

Therefore, the ideal 28-nm methodology would look at CTS as a part of the complete picture—meeting the timing, area, and power specifications of the design, not just meeting skew or latency constraints indiscriminately. A new technology, clock-concurrent optimization, builds the clock network concurrently while optimizing the design to meet power, performance, and area criteria (Fig. 2).

2. Clock-concurrent optimization combines clock tree synthesis with optimization.

Silicon Manufacturability

One notably visible challenge at 28 nm is that yields will initially be lower than those of more mature processes. This is unavoidable, and yields will improve as manufacturing capability matures. However, a few aspects are particularly critical to the manufacturability of 28-nm designs.

The usual techniques of maximizing yield still apply at 28 nm: via minimization, multi-cut vias, wire widening, and half-grid wire spacing. They reduce the possibility of opens and shorts due to manufacturing defects and should be included in any 28-nm design methodology. In addition, 28-nm DRC rules are much more complex than the rule sets of more mature process technologies. Thus, designers should at least have an understanding of the major new rules in the DRC rule deck.

Lithography effects also create challenges at 28 nm. Sometimes the foundry can handle them, but any good 28-nm methodology should offer ways of preventing, analyzing, and fixing unwanted lithography effects. Essentially, 28-nm metal routes and pins look nothing like the straight-line shapes seen on monitors. After fabrication, 28-nm metal layers can be heavily distorted to the point where they may even cause functional failures if left unattended.

The main issue with lithography effects is that they can’t be analyzed and prevented purely by DRC rules. These effects happen based on how the surrounding metal objects are shaped. The most reliable way to detect lithography effects, therefore, is through simulation-based analysis. However, full simulation-based analysis is computationally intensive and runtimes can range from two to 10 hours per mm² of 28-nm design area. Consequently, full-simulation-based lithography analysis should only be used as the final signoff, whereas a more efficient in-design DFM solution is best during implementation.

There are multiple in-design DFM methods. For instance, a pattern-matching-based approach can make lithography analysis more efficient (Fig. 3). Pattern matching, using a foundry-qualified or user-generated yield-detractor pattern library, is run on the post-route database. Any exactly matched patterns, in any orientation, are identified and automatically removed via rerouting. Currently used in production at 28 nm, this method runs at a fraction of full simulation runtime.

3. Layout analysis can be automated using lithography-based pattern matching. (courtesy of GlobalFoundries)

Another technique involves one more qualification step: model-based litho simulation is run on the area surrounding each matched yield-detractor pattern. If the model-based litho check is positive, the match is flagged as a lithography hotspot. This method runs approximately 50 to 100 times faster than full simulation-based lithography analysis. It has become very popular for analyzing lithography effects in the implementation environment, because designers needn’t wait the 40 to 50 hours for full-simulation analysis to complete.
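The runtime figures quoted above translate into simple arithmetic. The sketch below is a rough estimate only: the 5-mm² design area is a hypothetical input, while the hours-per-mm² and speedup ranges are the ones given in the text.

```python
def full_sim_hours(area_mm2, hours_per_mm2=(2.0, 10.0)):
    """(min, max) hours for full simulation-based analysis of a given area."""
    low, high = hours_per_mm2
    return area_mm2 * low, area_mm2 * high


def in_design_hours(full_hours, speedup=(50.0, 100.0)):
    """(min, max) hours for the in-design check, given full-simulation hours."""
    low_hours, high_hours = full_hours
    slowest, fastest = speedup
    return low_hours / fastest, high_hours / slowest


# Hypothetical 5 mm^2 design.
area_estimate = full_sim_hours(5.0)

# The 40-to-50-hour full-simulation run mentioned in the text.
quick_estimate = in_design_hours((40.0, 50.0))

print(f"full simulation, 5 mm^2: {area_estimate[0]:.0f} to {area_estimate[1]:.0f} h")
print(f"in-design check:         {quick_estimate[0]:.1f} to {quick_estimate[1]:.1f} h")
```

On these numbers, a check that would take 40 to 50 hours in full simulation finishes in about an hour or less with the in-design approach.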

Litho-unfriendly patterns (LUPs) are used to detect lithography hotspots (Fig. 3, again). A library of LUPs can be provided to the place-and-route system, which then detects whether any portion of the design matches any pattern specified in the LUP library. If there’s a match, it’s either flagged as a litho hotspot violation or automatically sent for a more detailed review (using full simulation-based lithography analysis).
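The detection step can be sketched as an exact match of a small pattern against a layout in every rotation and mirror orientation. The representation here is an assumption made for brevity: one metal layer as a grid of 0s and 1s and a single made-up L-shaped pattern. Production tools operate on polygon geometry with foundry-supplied pattern libraries.

```python
def rotate(grid):
    """Rotate a grid 90 degrees clockwise."""
    return tuple(zip(*grid[::-1]))


def mirror(grid):
    """Mirror a grid left to right."""
    return tuple(row[::-1] for row in grid)


def orientations(pattern):
    """All distinct rotations and mirrors of a pattern (at most 8)."""
    grid = tuple(tuple(row) for row in pattern)
    seen = set()
    for _ in range(4):
        seen.add(grid)
        seen.add(mirror(grid))
        grid = rotate(grid)
    return seen


def find_hotspots(layout, pattern):
    """Sorted (row, col) origins where any orientation matches exactly."""
    hits = set()
    rows, cols = len(layout), len(layout[0])
    for oriented in orientations(pattern):
        height, width = len(oriented), len(oriented[0])
        for r in range(rows - height + 1):
            for c in range(cols - width + 1):
                if all(
                    tuple(layout[r + i][c:c + width]) == oriented[i]
                    for i in range(height)
                ):
                    hits.add((r, c))
    return sorted(hits)


# Hypothetical litho-unfriendly pattern: a small L shape.
LUP = [[1, 0],
       [1, 1]]

layout = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
]

hotspots = find_hotspots(layout, LUP)
print(hotspots)
```

Each reported origin would then be either flagged directly or handed to the more detailed check described above.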

Reducing runtime while getting signoff-quality lithography analysis can also be accomplished by first kicking off a batch job to perform full simulation lithography analysis on a 95% complete design. If there are no issues, the designer then can continue to close timing, power, and area on the design. At the final stage of the process, when the design meets all timing, power, and area requirements, an incremental lithography analysis can be run just on the parts of the design that have changed.
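One way to picture the incremental step is to divide the layout into tiles, record a signature for each tile when the batch run is launched, and later recheck only the tiles whose signature has changed. The tiling scheme and the signature strings below are hypothetical; the text does not say how a given tool tracks changes.

```python
def tiles_to_recheck(baseline, current):
    """Tiles that are new or whose layout signature differs from the baseline."""
    return sorted(
        tile for tile, signature in current.items()
        if baseline.get(tile) != signature
    )


# Hypothetical per-tile signatures when the full-simulation batch job ran.
baseline = {(0, 0): "a1", (0, 1): "b7", (1, 0): "c3", (1, 1): "d9"}

# Signatures after final timing, power, and area closure.
current = {(0, 0): "a1", (0, 1): "b8", (1, 0): "c3", (1, 1): "d9", (2, 0): "e2"}

recheck = tiles_to_recheck(baseline, current)
print(f"recheck {len(recheck)} of {len(current)} tiles: {recheck}")
```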

As mentioned earlier, the main motivation behind these methods is to complete the lithography analysis in the most efficient manner.