Up to now, there have been two main methods of clock distribution for large, high-performance designs: conventional clock-tree synthesis (CTS) and clock mesh. Multisource CTS has emerged as a new method that is a hybrid of these two. This article explains the differences between CTS, multisource CTS, and clock-mesh distribution technologies.
Table of Contents
- Key Differences Between CTS, Multisource CTS, And Clock Mesh
- Amount Of Shared Path
- OCV Benefit Of Shared Clock Path
- Mesh Fabric
- Design Complexity
- Timing Analysis
Key Differences Between CTS, Multisource CTS, And Clock Mesh
There are four key differences between conventional CTS, multisource CTS, and clock mesh: shared path, mesh fabric, design complexity, and timing analysis. The sections below discuss the three clock-distribution methods with respect to each of these differences. By the end of the article, you will know how the methods differ and be better equipped to try a new method that may be better suited to your next design start.
Amount Of Shared Path
The most obvious difference is the structural depth of the shared path between the clock root and the sinks. Consider an example of the same set of sinks addressed by each of the three clock distribution methods (Fig. 1).
1. The most obvious difference between CTS, multisource CTS, and clock-mesh structures is the depth of the shared path between the clock root and the sinks. Shown here is the same set of sinks addressed by each of the three clock-distribution methods (from left: conventional CTS, multisource CTS, and clock mesh).
A conventional clock tree, shown at left, is characterized by an organic tree structure from the clock root that branches out to each of the sinks in the design. There is no fixed limit on the depth of either the buffer levels or the clock-gating levels. Most of the sinks in the design share very little of their path back to the clock root. In fact, for any two sinks in the design, the only reliably shared part of the path is the root buffer.
Multisource CTS, shown at center, is distinguished primarily by multiple clock trees of small to moderate depth, typically with between three and nine levels of clock gating and buffers. The multiple clock trees sit at the bottom of the structure, below the mesh grid, and all of the structure above the mesh forms a shared path back to the clock root. As a result, a substantial portion of the overall insertion delay of the clock is in the form of a shared path.
Clock mesh, shown on the right, is characterized by an extremely shallow logic depth below the mesh, usually just a single buffer or clock gate directly driving the sinks. Most of the insertion delay in a clock mesh design is a large, shared path from the root to the mesh.
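To make the idea of shared path concrete, here is a minimal Python sketch that sums the delay of the cells two root-to-sink paths have in common. The cell names and picosecond delays are hypothetical, chosen only for illustration, and real clock paths also include wire delay, which this sketch ignores.

```python
from typing import Dict, Sequence


def shared_path_delay(path_a: Sequence[str], path_b: Sequence[str],
                      cell_delay_ps: Dict[str, float]) -> float:
    """Sum the delay of the leading cells that two root-to-sink paths share."""
    shared = 0.0
    for cell_a, cell_b in zip(path_a, path_b):
        if cell_a != cell_b:
            break  # the paths diverge here; everything below is non-shared
        shared += cell_delay_ps[cell_a]
    return shared


# Hypothetical cell delays in picoseconds (illustrative values only).
cell_delay_ps = {"root": 50.0, "buf_a": 40.0, "buf_b": 40.0,
                 "gate_a": 30.0, "gate_b": 30.0,
                 "mesh_driver": 200.0, "twig_a": 30.0, "twig_b": 30.0}

# Conventional CTS: the two sinks diverge right after the root buffer.
cts = shared_path_delay(["root", "buf_a", "gate_a"],
                        ["root", "buf_b", "gate_b"], cell_delay_ps)

# Clock mesh: the sinks share everything down to the mesh, then one twig each.
mesh = shared_path_delay(["root", "mesh_driver", "twig_a"],
                         ["root", "mesh_driver", "twig_b"], cell_delay_ps)

print(f"shared delay, conventional CTS: {cts} ps")
print(f"shared delay, clock mesh:       {mesh} ps")
```

With these made-up numbers, the conventional tree shares only the root buffer, while the mesh-style paths share the root and the pre-mesh driver, which is most of their insertion delay.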
OCV Benefit Of Shared Clock Path
The respective logic depths (unlimited, moderate, and very shallow) are inversely related to the amount of shared path between the sinks and the clock root. Path sharing reduces the impact of on-chip variation (OCV) on the design: when the launch and capture flip-flops share the same clock path to the root, any process variation in that path affects both flops equally, and all timing assumptions are preserved. In the absence of path sharing, one must increase the clock margin by a derating factor to account for the possibility that the launch flip-flop, the capture flip-flop, or both experience process variation.
We may define the extra margin by multiplying the insertion delay of the non-shared path by a derating scalar, typically between 7% and 10%. Worse yet, the derating applies in both directions, plus or minus the derating factor. We then add the product to the timed skew of the design, which in turn derates the achievable clock frequency.
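The arithmetic can be sketched in a few lines of Python. This is one reading of the plus-or-minus derating, not a signoff formula: it assumes the worst case is one non-shared path running late by the derating factor while the other runs early by the same factor. The function names, the 8% default, and the per-method non-shared delays are all hypothetical.

```python
def ocv_skew_penalty_ps(non_shared_delay_ps: float, derate: float = 0.08) -> float:
    """Extra margin when one non-shared path is derated late and the other early."""
    late = non_shared_delay_ps * (1.0 + derate)
    early = non_shared_delay_ps * (1.0 - derate)
    return late - early  # algebraically 2 * derate * non_shared_delay_ps


def derated_skew_ps(timed_skew_ps: float, non_shared_delay_ps: float,
                    derate: float = 0.08) -> float:
    """Timed skew plus the OCV penalty on the non-shared insertion delay."""
    return timed_skew_ps + ocv_skew_penalty_ps(non_shared_delay_ps, derate)


# Hypothetical non-shared insertion delays for the same design (illustrative only).
non_shared_ps = {"conventional CTS": 450.0,
                 "multisource CTS": 200.0,
                 "clock mesh": 50.0}

for method, delay in non_shared_ps.items():
    print(f"{method}: {ocv_skew_penalty_ps(delay):.1f} ps of added margin")
```

Under these assumptions, the penalty scales linearly with the non-shared delay, so the method with the least non-shared path pays the least margin.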
The current technology nodes encourage large designs with many different functions. As designs grow larger, the impact of OCV derating increases. Of the three clock-distribution methods, conventional CTS is the most adversely affected by OCV derating, and the growing trend is to move away from conventional CTS for high-speed designs.
At the other extreme, the sinks in a clock-mesh design share the overwhelming majority of the total clock path. The result is that the measured clock skew increases very little due to OCV derating, preserving the high performance of the design. This is the main reason that clock mesh has long been the preferred clock-distribution method for performance-oriented processor designs, whether arithmetic logic units (ALUs) or graphics processing units (GPUs).
Multisource CTS falls between conventional CTS and clock mesh with regard to the amount of shared path. The high flexibility of multisource CTS enables the designer to trade off the clock level depth of the multiple trees against the OCV immunity of the design. Multisource CTS has other areas of flexibility that we will discuss later.
Mesh Fabric
The mesh fabric is another obvious difference between conventional CTS and the two other methodologies. Clock mesh and multisource CTS both use a mesh fabric, though there is a large difference in the density of the mesh deployed.
Clock mesh uses a dense mesh fabric as the final component of the shared path of the design. This fabric is typically about as dense as the power/ground meshes and consumes significant routing resources. Most implementations place the fabric at the highest routing layers. This minimizes the impact on data signal routing, which usually takes place at lower levels, while concurrently taking advantage of the low resistance of the wider redistribution layer (RDL) or other higher metal layers for high-speed routes.
The mesh fabric provides uniformity across its entire expanse. The multiply driven fabric smooths out the arrival-time differences of the clock at each driving point. The effective result is that the clock skew at the fabric is zero. The skew component of the design is thus limited to the wire segment attaching the buffer or clock gate to the fabric and the wires and flops thereby driven. This explains the ultra-low skew values achieved with clock mesh.
The multisource CTS mesh fabric is one to two orders of magnitude less dense than the clock-mesh fabric (Fig. 2). One must take greater care with multisource technology to ensure that the skew performance at the fabric meets the design objectives.
2. The multisource CTS mesh fabric (at right) is one to two orders of magnitude less dense than the clock-mesh fabric.
Both clock mesh and multisource CTS tolerate insertion delay better than conventional CTS because the prevalence of shared path minimizes the portion of the insertion delay that is exposed to OCV derating. Both methods also benefit from a highly structured buffer topology that drives the fabric.
We normally configure the pre-mesh drivers as multilevel H-Trees. An H-Tree, as the name implies, is typically a configuration of five drivers that trace the center and endpoints of an “H” pattern (Fig. 3). Notice the clock root buffer at the center of the design. For visibility, the routes on the two H layers are in different colors. In this example, the top-level “H” lies on its side, while the four lower-level “H” structures, shown in purple, are in a normal “H” orientation.
3. In this example of H-Tree routing, note the clock-root buffer at the center of the design. For visibility, the routes on the two H layers are in different colors. Here, the top-level “H” lies on its side, while the four lower-level “H” structures (in purple) are in a classic “H” orientation.
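The recursive structure of a multilevel H-Tree can be sketched as follows. This is a simplified geometric model under stated assumptions: every "H" is upright (the sideways top-level "H" in the figure is not modeled), each level halves the dimensions of the previous one, and there are no obstructions. The function name and dimensions are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def h_tree_leaves(center: Point, half_width: float, half_height: float,
                  levels: int, route_length: float = 0.0) -> List[Tuple[Point, float]]:
    """Return (endpoint, routed length from the root) for every leaf of the H-tree."""
    if levels == 0:
        return [(center, route_length)]
    cx, cy = center
    leaves: List[Tuple[Point, float]] = []
    for dx in (-half_width, half_width):        # left and right legs of the "H"
        for dy in (-half_height, half_height):  # bottom and top of each leg
            # Route: along the crossbar to the leg, then along the leg to its end.
            leaves.extend(h_tree_leaves((cx + dx, cy + dy),
                                        half_width / 2, half_height / 2,
                                        levels - 1,
                                        route_length + half_width + half_height))
    return leaves


# Two-level tree rooted at the center of a hypothetical design.
leaves = h_tree_leaves((0.0, 0.0), 400.0, 400.0, levels=2)
print(len(leaves), "endpoints")
print({length for _, length in leaves}, "distinct routed lengths")
```

Because every endpoint is reached through the same sequence of segment lengths, all routed lengths are identical, which is the property that makes an unobstructed H-Tree naturally balanced.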
Power Tradeoff Differences
The coarse pitch of the multisource CTS mesh fabric has the benefit of using considerably less power than the extremely fine pitch of a clock-mesh fabric. Because there is one to two orders of magnitude less metal in the multisource CTS fabric, the reduction in the power requirement is considerable. Whereas clock mesh consumes between 20% and 40% more power than the same design implemented with conventional CTS, a multisource design will have power numbers much closer to those of conventional CTS than to those of clock mesh.
Furthermore, multisource CTS allows greater clock-gating depth, enabling more complex clock-gating schemes, which contributes to additional power savings. The flexibility of multisource CTS lets the designer trade clock-gating complexity and mesh-fabric density against power. This is not possible in a clock-mesh approach.
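A rough feel for how mesh pitch drives the amount of metal can be had from a simple count of spines. The sketch below assumes a uniform rectangular grid with spines of equal width spanning the whole die; the die size and the two pitches are hypothetical, and wire length is only a proxy for capacitance and therefore for switching power.

```python
def mesh_wire_length(die_width: int, die_height: int, pitch: int) -> int:
    """Total spine length of a uniform mesh, in the same units as the inputs."""
    n_vertical = die_width // pitch + 1     # vertical spines, each die_height long
    n_horizontal = die_height // pitch + 1  # horizontal spines, each die_width long
    return n_vertical * die_height + n_horizontal * die_width


# Hypothetical 2000 x 2000 um block (illustrative numbers only).
fine = mesh_wire_length(2000, 2000, pitch=20)     # clock-mesh style fabric
coarse = mesh_wire_length(2000, 2000, pitch=500)  # multisource CTS style fabric

print(f"fine mesh:   {fine} um of spine")
print(f"coarse mesh: {coarse} um of spine")
print(f"ratio:       {fine / coarse:.1f}x")
```

In this model the spine length grows roughly in inverse proportion to the pitch, so a fabric that is 25 times coarser carries on the order of 20 times less metal.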
Attaching To The Fabric
Another difference between multisource CTS and clock mesh is how the design logic attaches to the mesh fabric. In clock mesh, the dense fabric defines relatively small bins that contain cluster or sub-cluster amounts of logic. These structures of buffers or clock gates and the sinks they drive are often called twigs to keep within the arboreal metaphor of trees and branches.
One may attach clock mesh twigs by either comb routes or fishbone routes to the horizontal and vertical spines of the mesh (Fig. 4). Comb routing minimizes the skew but trades this benefit off against routing resources. For most designs, fishbone routing is the better tradeoff.
4. One may attach clock mesh twigs by either comb routes or fishbone routes to the horizontal and vertical spines of the mesh. Comb routing minimizes the skew but trades this benefit off against routing resources. For most designs, fishbone routing is the better tradeoff.
Whether the routing method is comb or fishbone, each clock mesh twig attaches to the mesh fabric at the nearest point along any spine.
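The nearest-point rule can be sketched for an idealized mesh. The example assumes spines at every multiple of a uniform pitch in both directions and ignores die boundaries, blockages, and the comb-versus-fishbone routing style. The function name and coordinates are hypothetical.

```python
from typing import Tuple


def nearest_spine_attach(x: float, y: float, pitch: float) -> Tuple[float, float]:
    """Return the closest point on a uniform mesh to a twig driver at (x, y)."""
    snap_x = round(x / pitch) * pitch  # x-coordinate of the nearest vertical spine
    snap_y = round(y / pitch) * pitch  # y-coordinate of the nearest horizontal spine
    if abs(x - snap_x) <= abs(y - snap_y):
        return (snap_x, y)  # short horizontal stub onto the vertical spine
    return (x, snap_y)      # short vertical stub onto the horizontal spine


# A twig driver at (23, 48) on a mesh with a 20-unit pitch.
print(nearest_spine_attach(23.0, 48.0, 20.0))
```

The driver is 3 units from the vertical spine at x = 20 and 8 units from the horizontal spine at y = 40, so it attaches to the vertical spine.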
By contrast, multisource CTS offers much larger logic groupings that are themselves small clock trees. Multisource CTS clock trees attach to the coarse mesh fabric at locations called tap points, which are defined to provide low skew and insertion delay. We typically create tap points at the intersection of the spines of the coarse mesh fabric (Fig. 5).
5. Multisource CTS clock trees attach to the coarse mesh fabric at locations called tap points, which are defined to provide low skew and insertion delay. We typically create tap points at the intersection of the spines of the coarse mesh fabric. Shown is a design with seven tap points defined.
Each tap point is the local root of one of the multisource CTS clock trees. The root buffer normally attaches to the intersections with stacked vias directly to the input pin of the buffer.
In clock mesh, there is no concept of assigning sinks to a clock root because the fabric is so dense and each twig is small. But in multisource CTS, tap-point assignment is an important step. In addition to showing the tap points at grid intersections, Figure 5 shows distinctly colored areas, each containing the sinks assigned to the tap point within its boundary. The tap points are the clock-tree roots, and the assigned sinks define the boundaries of each clock tree.
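As a minimal sketch of tap-point assignment, the code below gives each sink to the tap point at the smallest Manhattan distance. This is a deliberately naive heuristic for illustration; it is an assumption, not a description of how any particular tool assigns sinks, and a production flow would also weigh factors such as load balance and clock-gating hierarchy. All names and coordinates are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[int, int]


def assign_sinks_to_taps(sinks: Sequence[Point],
                         taps: Sequence[Point]) -> Dict[Point, List[Point]]:
    """Group sinks under the tap point with the smallest Manhattan distance."""
    assignment: Dict[Point, List[Point]] = defaultdict(list)
    for sink in sinks:
        nearest = min(taps,
                      key=lambda tap: abs(tap[0] - sink[0]) + abs(tap[1] - sink[1]))
        assignment[nearest].append(sink)
    return dict(assignment)


taps = [(0, 0), (100, 0)]
sinks = [(10, 5), (90, 20), (40, 0)]
for tap, group in assign_sinks_to_taps(sinks, taps).items():
    print(tap, "drives", group)
```

Each resulting group corresponds to one local clock tree, with the tap point as its root.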
Mesh-Less Multisource CTS
There are even differences within the multisource CTS category. One may implement multisource CTS without a mesh fabric at all. In this case, the H-Tree endpoints are the tap points to which the sinks in each region are attached.
Mesh-less multisource CTS offers designers yet another set of tradeoffs. In this case, the tradeoff is between OCV tolerance and the ease of the flow. The flow is easier because there is no need to determine the correct pitch of the mesh fabric. It is also easier because timing the design is simpler.
Design Complexity
The third area of difference among these methods is how the complexity of the clock-gating plan and the floorplan influences the effectiveness of the clock-distribution approach. Conventional CTS is the most accommodating approach for dealing with design complexity. It is the baseline against which to judge clock mesh and multisource CTS.
Clock mesh is the most rigid of the three approaches. An ideal clock mesh design has no RAMs, ROMs, or other hard blocks. Indeed, it is a flat sea of gates. This is ideal for clock mesh because there are no obstructions that prevent the placement of pre-mesh H-Tree buffers such that each “H” is ideal. The lack of obstructions also enables the H-Tree routes to be perfectly straight, making it easier to ensure an ideal balanced H-Tree. Clock mesh also benefits from a shallow, uniform design topology below the mesh fabric to comply with the limit of two levels of clock buffers or clock gating.
As in other areas of difference, multisource CTS falls between conventional CTS and clock mesh with respect to its handling of design complexity. The depth of the multisource clock trees tolerates most clock-gating plans well, and the smaller pre-mesh H-Tree means fewer drivers to account for amid RAMs and hard blocks in the floorplan.
Timing Analysis
In conventional CTS, we perform timing analysis with standard timing-analysis tools, both the accepted signoff static timing engines and the similar timing engines embedded within the place-and-route tools. This makes conventional CTS the easiest method to time through every stage of the flow.
In the mesh topologies, circuit simulation is required to time the multiply driven mesh fabrics. This adds a level of complexity to the clock mesh and multisource flows that may at first seem prohibitive. However, the standard practice is for automation within the place-and-route tool to launch the simulation run and then annotate the timing values onto the design for subsequent static timing reports and analyses. While this flattens the circuit-simulation learning curve somewhat, it cannot completely eliminate exposure to the underlying simulator technology.
As new technology nodes enable increasingly large and feature-rich designs, the choice of clock-distribution methodology becomes ever more important. Conventional CTS, which has traditionally been the default choice for all designs, may no longer be the optimal choice when an extremely high clock frequency is required.
Thus, it is a good idea to broaden the clock-distribution skill set to include clock mesh and multisource CTS technologies. Experience with these methodologies enables designers to make the best design choice given the design goals: clock frequency, OCV tolerance, power consumption, flow ease, and time-to-market pressure.
The three technologies explored in this article span the two extremes. Conventional CTS delivers good to very good clock frequency, moderate OCV performance, the best low-power profile, and the best ease of use. At the other extreme, clock mesh exhibits the best clock-frequency performance, the best OCV tolerance, the worst low-power profile, and the lowest ease of use.
Many designers are finding that multisource CTS offers an attractive, flexible solution between the two extremes of conventional CTS and clock mesh. Knowing the differences between these three clock distribution methods will enable the optimal implementation for your next design.