Maximizing Hierarchical Design Throughput for Today’s Large Designs


Fig 1. As an example of today’s ASIC design complexity, IBM’s Cu-32 ASIC product offering delivers 2.9K raw gates per square millimeter. Such advanced process nodes enable SoC design teams to integrate and implement extremely powerful and complex systems, and designers are taking advantage of these available gates. (Source: The McLean Report 2011)

Fig 2. Generally, design teams schedule a number of early netlist hand-offs, or netlist drops, from the logical to the physical designers before the “final” netlist is available. This enables physical design teams to start exploring various implementation strategies.

Fig 3. To compress design schedule time, designers often reuse earlier design blocks and use third-party IP. It is very rare that a new chip featuring billions of transistors is designed completely from scratch. Generally, most of a new design’s transistors are used to form memories or functions derived from similar functions implemented in earlier designs. (Source: Semico Research Corporation, Study Number SC103-10, October 2010)

Fig 4. It is not uncommon for macros to have extreme aspect ratios. For example, some memories may be tall and skinny, imposing a minimum height requirement for a block shape.

Fig 5. Given a top-level floorplan, design teams then start work on block floorplanning and top-level floorplan refinement in parallel. Top-level designers need visibility into the interface logic of blocks to have better accuracy of top-level timing analysis.

Fig 6. Visibility of the placement of macros within blocks enables better decision making regarding changes to block shapes from the top level.

As today’s system-on-chip (SoC) designs continue to grow in size and complexity, schedules for physical design are staying the same or shrinking. Design teams must deliver products quickly to market to capture as much revenue as possible. In response, many design teams are turning to hierarchical design flows to implement large designs, allowing them to divide the task into pieces that can be implemented in parallel, thereby compressing the overall development schedule.

This article will discuss various challenges to design exploration and design planning in a hierarchical flow for large SoCs. It will further delve into efficient techniques that produce fast turnaround times and make possible concurrent physical implementation, enabling predictable design convergence.

As an example of today’s ASIC design complexity, IBM’s Cu-32 ASIC product offering delivers 2.9K raw gates per square millimeter. Such advanced process nodes enable SoC design teams to integrate and implement extremely powerful and complex systems, and designers are taking advantage of these available gates. Experts are citing transistor counts of 2 to 3 billion for the latest processor designs and 800 million to more than 1 billion for today’s ASIC SoC designs.

The pressure on design teams to ship these mammoth designs increases daily. Applications ranging from smartphones, tablet computers, and automotive navigation systems to high-end network switching and computer servers use complex, high-gate-count SoC devices. Consumers’ appetite for better graphics, faster response times, and more functionality seems insatiable – last year’s “leading edge” is nothing compared to what is available today! Because the market window for many products is so short, the companies producing the silicon must keep their development cycles short to survive.

How are they getting it done? Most design teams compress their development schedules by starting physical design in parallel with logical design. They reuse design data from previous designs, employ production-proven intellectual property (IP), and implement hierarchical design methodologies.

As stated earlier, hierarchical methodologies allow design teams to divide the chip into manageable pieces to implement in parallel, thus saving time. However, achieving the best throughput also means exploiting timesaving opportunities throughout the flow, as the design progresses from planning through implementation. Several of these opportunities are discussed later in this article.

Scheduling Logical and Physical Design in Parallel

Physical designers rarely receive a final netlist of a completed logical design before starting physical layout of the chip. In today’s environment, physical designers and logical designers start work at nearly the same time. Generally, design teams schedule a number of early netlist hand-offs, or netlist drops, from the logical to the physical designers before the “final” netlist is available. This enables physical design teams to start exploring various implementation strategies. By the time the final netlist arrives, the physical designers have developed a detailed implementation strategy that enables them to minimize the time to tapeout. Minimizing CPU runtime while working with early netlist drops enables physical designers to explore and assess more floorplan solutions. This is critical to finding the best floorplan to ensure minimum time to tapeout and highest quality of results when the final netlist arrives.

Practically speaking, functional and timing engineering change orders (ECOs) are often given to the physical designers after the final netlist. Ideally, these are relatively small in terms of the number of gates and nets affected; however, ECOs do require time and effort to implement, and they must be included in the physical layout before tapeout.

Design Reuse and Intellectual Property

Another technique that teams use to compress design schedule time is the reuse of earlier design blocks and the use of third-party IP. It is very rare that a new chip featuring billions of transistors is designed completely from scratch. Generally, most of a new design’s transistors are used to form memories or functions derived from similar functions implemented in earlier designs.

This trend, coupled with the scheduling of early netlist drops, means that much of the gate-level content of a new design is available in early netlist drops.

Flat Design Flows

For many years, SoC designs were taped out using flat planning and implementation flows. There are two primary measures of quality for flat floorplans: one is that macros, such as memories, are placed in a way that does not create excessive routing congestion; the other is that the overall placement accommodates functional data flow to ensure maximum operating speed. Flat flows are not efficient for extremely large, complex designs. Trial placement and routing in the early phases of such a design may take days of CPU runtime and demand a great deal of memory. For planning purposes, physical designers need to complete trial runs in less than a day; at most, a trial run should fit in an overnight batch job. This enables designers to analyze results and prepare new batch jobs during the day, maximizing their productivity while they explore potential implementation strategies.

Once the planning phase is complete, designers have well-defined placement of key objects, such as memories, and pre-routing of the primary power ring and mesh structure. Next, the design moves into a refinement phase and then, later, an ECO phase. Once routing is feasible, designers start running optimizations to shorten timing path delays as much as possible to achieve the highest operating frequency. In a flat flow, each optimization pass must process all of the design data, which results in significant runtime and memory-use overhead. When the design moves into the ECO phase, there is a risk that implementing an ECO in one functional part of the design will degrade the timing in another functional part. This could lead to a ping-pong effect, where new problems are introduced each time another problem is solved.

Hierarchical Flows for Size

Partitioning is the process of breaking a design into physical blocks. Block size, in terms of instance count, is a common criterion used for partitioning. Generally, block layouts in hierarchical designs are implemented using flat methodologies. Understanding how large a block design should be while meeting an overnight runtime criterion is an important factor in determining hierarchical design blocks. Block designs of a size that meet this criterion can be given to designers to work on in parallel, enabling them to run batch jobs overnight and then spend the next day assessing results. At the end of the day, they can set up new batch runs. This maximizes designer productivity during the planning phase.
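The size criterion can be illustrated with a small sketch. It assumes the team has already translated the overnight-runtime requirement into a maximum instance count per block; that limit, the Module class, and the pick_blocks function are hypothetical names used only for illustration. The sketch walks the logical hierarchy from the top and descends into any module that is still too large, so each returned candidate fits the limit.

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    own_instances: int                      # cells instantiated directly in this module
    children: list = field(default_factory=list)

    def total_instances(self) -> int:
        return self.own_instances + sum(c.total_instances() for c in self.children)

def pick_blocks(module: Module, max_instances: int) -> list:
    """Descend the hierarchy until every candidate block is within max_instances.

    A leaf module that is still too large is returned as-is and would need
    manual attention. Cells owned directly by a descended module stay at the
    parent level.
    """
    if module.total_instances() <= max_instances or not module.children:
        return [module.name]
    blocks = []
    for child in module.children:
        blocks.extend(pick_blocks(child, max_instances))
    return blocks

# Hypothetical design: the limit of 2 million instances is an assumed figure.
cpu = Module("cpu", 0, [Module("core0", 1_500_000), Module("core1", 1_500_000)])
top = Module("top", 100_000, [cpu, Module("gpu", 1_800_000), Module("periph", 400_000)])
print(pick_blocks(top, max_instances=2_000_000))
```

In practice, connectivity and function also influence where the block boundaries fall; instance count is only the first filter.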

Once a design is partitioned into blocks, the physical designers responsible for the full chip create a floorplan for the chip by placing and shaping the blocks and then assigning pins to the block boundaries. The block shape and pin placements represent the physical constraints that are passed to the block design teams. Time budgeting is a process top-level physical designers use to divide top-level timing constraints to create timing constraints for the blocks.
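As a rough illustration of time budgeting, the sketch below splits one clock period across the segments of a path that crosses two blocks, in proportion to each segment's estimated delay. Proportional allocation is an assumption chosen for simplicity, not a description of any particular tool; the function name, segment names, and delay figures are all hypothetical.

```python
def budget_path(clock_period_ns: float, segment_delays_ns: dict) -> dict:
    """Allocate the clock period to path segments in proportion to estimated delay."""
    total = sum(segment_delays_ns.values())
    if total <= 0:
        raise ValueError("segment delays must sum to a positive value")
    return {
        name: clock_period_ns * delay / total
        for name, delay in segment_delays_ns.items()
    }

# Hypothetical path: leaves block_a, crosses top-level logic, ends in block_b.
estimates = {"block_a": 0.6, "top": 0.3, "block_b": 0.9}
for segment, budget in budget_path(2.0, estimates).items():
    print(f"{segment}: {budget:.3f} ns")
```

The quality of the resulting block constraints depends entirely on the quality of the delay estimates fed in, which is why the accuracy of top-level timing matters so much later in the flow.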

Similar to flat floorplanning, the primary planning objectives are minimizing routing congestion and placing key objects to accommodate data flow, thus ensuring maximum operating speed. Unlike flat floorplanning, top-level physical designers must determine the placement, shape, and pin assignments of blocks. On top of that, they must determine the macro placements, standard-cell placements, I/O pad placements, and the power rings and mesh structures for the top level of the design.

To minimize the CPU runtime and memory size requirements, top-level physical designers often use a black-box planning approach. The gate-level netlist content is eliminated from the blocks, and the blocks are sized to achieve an estimated target of required area. Timing models or assumptions relative to timing path segments within blocks must be developed. Only top-level logic, I/O pad cells, macros, and empty blocks are used to form the top-level floorplan.

This is the only option if the content of the blocks is unknown. Designers estimate block size and timing characteristics based on their experience. However, as previously discussed, only a small percentage of new design content is unknown and designed from scratch. For most designs, the gate-level content of major blocks is already known. Converting blocks with known content into black boxes represents a tradeoff of accuracy for the sake of throughput time. This technique maximizes throughput in terms of CPU runtime and minimizes memory requirements for processing the design data during top-level planning.

For blocks with known content, the tradeoff introduces the risk of producing unrealistic physical and timing constraints for the blocks. Will the block shapes, based on estimated area, accommodate the various macros within the blocks? It is not uncommon for macros to have extreme aspect ratios. For example, some memories may be tall and skinny, imposing a minimum height requirement for a block shape. How accurate will budgeted block timing constraints be, based on estimated timing characteristics of the blocks? Block shape errors and poor timing constraints often require multiple iterations to resolve. For blocks of known content, is this a good tradeoff of accuracy versus throughput time?
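The macro-fit question can be made concrete with a simple check. The sketch below derives a rectangular block shape from an estimated area and an aspect ratio, then lists the macros whose dimensions exceed the block in every allowed orientation. It is only a necessary condition: it tests each macro on its own and says nothing about whether all macros can be packed together. The function names and dimensions are hypothetical.

```python
import math

def block_dims(area_um2: float, aspect_ratio: float) -> tuple:
    """Return (width, height) of a rectangle; aspect_ratio is height / width."""
    width = math.sqrt(area_um2 / aspect_ratio)
    return width, width * aspect_ratio

def macros_that_do_not_fit(block_w, block_h, macros, allow_rotation=True):
    """Names of macros that cannot fit inside the block outline on their own."""
    misfits = []
    for name, (w, h) in macros.items():
        fits = w <= block_w and h <= block_h
        if allow_rotation:
            fits = fits or (h <= block_w and w <= block_h)
        if not fits:
            misfits.append(name)
    return misfits

# Hypothetical block: 40,000 square microns with one tall, skinny memory.
macros = {"sram_tall": (40.0, 260.0), "rom_small": (50.0, 60.0)}
for ratio in (1.0, 2.0):
    w, h = block_dims(40_000.0, ratio)
    print(ratio, macros_that_do_not_fit(w, h, macros))
```

With a square outline the tall memory does not fit, while the same area at a 2:1 aspect ratio accommodates it. A black-box flow that sizes blocks from area alone has no way to see this.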

Hierarchical Flows for Planning Speed and Accuracy

Block size, in terms of instance counts and timing complexity, is not the only criterion for dividing a design into manageable pieces. In many cases, blocks are divided by key functional components such as processor cores. Often, third-party IP vendors provide the functional components. Dividing the design into functional components enables use of pre-designed verification test benches and simulation vectors. This allows blocks to be given to teams experienced in implementing designs of a particular function – again, maximizing designer productivity and minimizing design time.

When a partitioned design has many blocks with known content, designers responsible for the top-level floorplan can use this information for more accuracy. For example, they could place the top level and the gate-level content of the blocks as if the design were flat and use the resulting placement to help determine locations and shapes of the blocks. Viewing content “as if flat” ensures that block shapes will correctly contain their corresponding macros. While this approach has an accuracy advantage, the designer loses the significant CPU runtime benefit of a black-box approach.

Creating a detailed flat placement to help plan an initial top-level floorplan is overkill. Trading off some accuracy for speed can enable a better starting point for top-level floorplanning. A rough placement of the full design tuned to run quickly and drive automated block placement and shaping provides a much better starting point.

To minimize the runtime and memory usage, the placement algorithms must be tuned to focus on producing results that drive shaping and enable accurate top-level timing assessment. Block shapes must contain their macros, and the algorithms must produce a relatively accurate placement of interface logic at block boundaries to enable accurate assessment of top-level timing. This type of exploration placement and shaping enables design teams to assess more potential top-level floorplan solutions in a reasonably short amount of time, while mitigating the risk of iterations required to reshape blocks just to fit their macro content. The accuracy of top-level timing is better, which, in turn, enables a more accurate time-budgeting process to produce block-level timing constraints.

A rough placement requires a global router that is able to reveal top-level routing congestion without the requirement of a full, detailed, legal placement. During design planning, the global router must produce a quick, accurate report of routing congestion to enable designers to explore and refine the floorplan solution.
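To show the kind of report involved, the sketch below builds a coarse congestion map from a rough placement. It is not a global router: as a simplifying assumption, it spreads each net's half-perimeter wirelength evenly over the routing-grid cells inside the net's bounding box and compares the accumulated demand with a uniform per-cell capacity. The function name, grid size, and capacity are hypothetical.

```python
def congestion_map(nets, grid_w, grid_h, capacity):
    """Return a grid of demand/capacity ratios from net bounding boxes.

    nets is a list of nets; each net is a list of (x, y) grid-cell pin locations.
    """
    demand = [[0.0] * grid_w for _ in range(grid_h)]
    for pins in nets:
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
        cells = (x1 - x0 + 1) * (y1 - y0 + 1)
        # Half-perimeter length in grid cells; +1 so a one-cell net still counts.
        length = (x1 - x0) + (y1 - y0) + 1
        share = length / cells
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                demand[y][x] += share
    return [[d / capacity for d in row] for row in demand]

nets = [[(0, 0), (2, 0)], [(0, 0), (2, 0)], [(1, 0), (1, 1)]]
ratios = congestion_map(nets, grid_w=3, grid_h=2, capacity=2)
hotspots = [(x, y) for y, row in enumerate(ratios) for x, r in enumerate(row) if r > 1.0]
print(hotspots)
```

An estimate like this needs only pin locations, so it can run on a rough placement that has not been legalized, which is the property the planning phase depends on.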

When it comes to routing, it is also important to consider the power ring and mesh routing. An efficient flow enables designers to write mesh and ring requirements in terms of key design objects – such as blocks, groups of macros, and voltage areas. In this way, designers can quickly update power routing as needed when making changes to shapes, sizes, or locations of key design objects.
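The benefit of expressing power routing in terms of design objects can be sketched as follows. Each rule names a block and gives a strap pitch and edge offset; the strap coordinates are derived from the block's current outline, so reshaping a block only requires regenerating them. The rule format, function names, and integer coordinate units are assumptions made for illustration.

```python
def vertical_straps(x_min: int, x_max: int, pitch: int, offset: int) -> list:
    """X coordinates of vertical straps, inset by offset from both block edges."""
    return list(range(x_min + offset, x_max - offset + 1, pitch))

def build_mesh(blocks: dict, rules: dict) -> dict:
    """Regenerate straps for every block that has a rule.

    blocks maps a block name to its outline (x_min, y_min, x_max, y_max).
    rules maps a block name to {"pitch": ..., "offset": ...}.
    """
    return {
        name: vertical_straps(bbox[0], bbox[2], **rules[name])
        for name, bbox in blocks.items()
        if name in rules
    }

rules = {"block_a": {"pitch": 20, "offset": 10}}
print(build_mesh({"block_a": (0, 0, 100, 80)}, rules))   # original shape
print(build_mesh({"block_a": (0, 0, 140, 60)}, rules))   # after reshaping
```

Because the rule refers to the block rather than to fixed coordinates, the same rule yields a correct mesh before and after the reshape.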

Given a top-level floorplan, design teams then start work on block floorplanning and top-level floorplan refinement in parallel. For the top-level team, refining a black-box floorplan may be fast, but it is prone to creating situations that may not work for the block designers. Top-level designers need visibility into the interface logic of blocks to have better accuracy of top-level timing analysis.

Visibility of the placement of macros within blocks enables better decision making regarding changes to block shapes from the top level. Top-level designers can see how changes would affect block designers. Tools need to create intelligent abstracts of the blocks, retaining interface logic and macro placement, so that top-level designers can assess and refine the floorplan with minimal CPU runtime and memory requirements and without sacrificing accuracy.
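One way to picture such an abstract is as the subset of a block's netlist that the top level can actually observe. The sketch below keeps the cells between each input port and the first register it reaches, and between each output port and the last register driving it, and drops purely internal logic. The netlist representation (a fanout dictionary plus a set of register names) and the function name are hypothetical simplifications.

```python
from collections import deque

def interface_logic(fanout, registers, in_ports, out_ports):
    """Return the cells visible from the block boundary, including boundary registers."""
    fanin = {}
    for src, dsts in fanout.items():
        for dst in dsts:
            fanin.setdefault(dst, []).append(src)

    def sweep(starts, edges):
        seen, queue = set(), deque(starts)
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, []):
                if nxt in seen:
                    continue
                seen.add(nxt)
                if nxt not in registers:    # stop at the first register reached
                    queue.append(nxt)
        return seen

    keep = sweep(in_ports, fanout) | sweep(out_ports, fanin)
    return keep - set(in_ports) - set(out_ports)

# Hypothetical block: in0 -> u1 -> r1 -> u2 -> r2 -> u3 -> out0
fanout = {"in0": ["u1"], "u1": ["r1"], "r1": ["u2"],
          "u2": ["r2"], "r2": ["u3"], "u3": ["out0"]}
print(sorted(interface_logic(fanout, {"r1", "r2"}, ["in0"], ["out0"])))
```

Here the register-to-register cell u2 is dropped, while everything on the boundary paths is retained, so top-level timing through the block pins can still be analyzed with far less data.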

When the design moves from a planning and refinement state to a refinement and ECO state, intelligent abstracts are replaced with actual block data. Tools should understand that the top level needs only to see interface logic, and does not need to see all logic within the completed blocks. The interfaces should be essentially transparent. Top-level timing can be closed more efficiently if top-level designers can optimize all segments of timing paths that go through hierarchical block pins. Designers can use models, but then they are limited to optimizing only top-level logic. If the timing cannot be closed, they are then forced to provide updated timing constraints to block designers, who, in turn, respin their blocks and provide new block timing models.

Conclusion

The large, complex SoC designs of today require hierarchical design flows to enable design teams to meet ever-shrinking schedule requirements. A basic hierarchical flow for size requires use of black boxes with estimated areas and timing models for top-level floorplanning and refinement. However, this can lead to unexpected iterations, as these estimates do not provide enough information about timing and content.

A more accurate hierarchical flow trades off some speed for accuracy, but is tuned to enable fast exploration of initial floorplan solutions while considering block content. Top-level floorplan refinement is better, as intelligent abstracts enable designers to see complete timing paths from the top and macro content of the blocks. Final timing closure is made more efficient by using tools that render blocks’ interface logic transparent and can also simultaneously optimize both top-level and block-level logic of interface paths.