Improve Backplane Performance With Source-Synchronous Designs
System performance goals are in a constant march toward higher levels of throughput and bandwidth. This progression is forcing designers to leave the comfort of traditional synchronous interfaces behind. Traditional designs suffer mainly from purely physical performance limitations because the interface works in "absolute" time: all agents in a synchronous interface take their marching orders from a dedicated clock source distributed over equal-length traces to minimize skew across the system.
Current levels of integration and IC process capability have significantly reduced the timing delays associated with interface ICs, as well as the skew from clock drivers. Still, the transport delay cannot be eliminated or ignored. It simply takes time to move signals from one agent to another. Because the transport delay, or flight time, of a signal limits the system's operating frequency, designers have resorted to wider parallel buses to meet overall system bandwidth requirements. Beyond a certain point, however, the pain and cost associated with ever-wider buses overshadow the performance gained. Ultimately, alternative solutions must be considered.
One possible solution is to overlay a source-synchronous interface on a traditional passive synchronous backplane. Because the parallel architecture is shared, this interface is an ideal upgrade from the traditional synchronous design, improving the throughput of many passive backplanes and on-card buses. All interface architectures are bound to have tradeoffs, but in a source-synchronous design, the advantage of additional system bandwidth far outweighs the cost of implementation.
Specific source-synchronous implementations are found in areas where bus throughput is critical to the overall system performance. Two good examples are double data rate (DDR) and Rambus memory modules. Both employ variations of a source-synchronous architecture, thereby improving the bandwidth of the memory subsystem in computers.
A source-synchronous system is one in which a strobe or clock signal, generated by the address/data signal source, is used to latch or clock the address/data signals at the receiving agent. Implementing a self-timed strobe at the receiver eliminates the flight-time variable from the system timing equations. Eliminating flight time allows the designer to maximize the potential bandwidth of any interface technology by increasing the operating frequency. Because interface signal timing now works in "relative" time, the global skew requirements on the system clock are also relaxed.
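A minimal sketch, using assumed numbers rather than figures from the designs discussed below, shows why the flight-time term drops out of the cycle budget when the strobe travels with the data:

```python
# Why flight time drops out of a source-synchronous budget (illustrative values only).

flight_time       = 6.0   # ns, driver-to-receiver trace delay for the data (assumed)
clk_data_mismatch = 0.3   # ns, routing mismatch between the strobe and data traces (assumed)

# Traditional synchronous bus: data launched on one clock edge must arrive and settle
# before the globally distributed edge at the receiver, so the full flight time is
# charged against every cycle.
synchronous_penalty = flight_time

# Source-synchronous bus: the strobe is launched by the same driver and travels the same
# path, so only the residual clock-to-data mismatch is charged against the cycle.
source_synchronous_penalty = clk_data_mismatch

print(f"Flight-time charge, synchronous bus:        {synchronous_penalty:.1f} ns")
print(f"Flight-time charge, source-synchronous bus: {source_synchronous_penalty:.1f} ns")
```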
The Synchronous Interface
By studying a traditional synchronous design, we can establish a baseline performance level for a given interface (Fig. 1). The study includes clock distribution, signal routing, and a typical solution, with the results presented graphically to show all of the variables and the degree to which they affect overall system timing and performance.

A centrally located clock source uses matched trace delays to generate and distribute multiple clock signals. Ideally, these signals arrive at all synchronous elements or card edges at exactly the same instant. In most systems, an additional level of clock distribution is undertaken at the card level. This second level is often handled by some form of phase-locked loop (PLL) so that all cards, independent of on-card clock requirements, present an equal load to the central clock. Therefore, one source of clock skew is eliminated.
Unlike the clock lines, data, address, and control signals are typically routed in a multidrop or daisy-chain arrangement. This topology allows for varied card-to-card routing delays based on the relative card positions. In an unterminated backplane design, like CompactPCI, the designer needs to account for an additional "settling time" that's equal to the maximum flight time of the backplane interface.
In an effort to improve the signal integrity and performance of multidrop interfaces like VME, an alternative routing structure may be employed. This structure, commonly referred to as a star configuration, routes signals in such a manner as to equalize the delays of any card-to-card transmission. Routing signals this way also serves to improve the switching behavior of traditional I/O drivers because the interface now looks and behaves more like a lumped capacitive load.
This technique, however, isn't without its disadvantages. Chief among them is the need for equal trace lengths between multiple system cards, which results in a very large number of card-to-card interconnects. This significantly adds to backplane routing and manufacturing complexity.
Higher system speeds could be achieved using PECL or other reduced-swing differential signaling technologies. The GTLP family, however, makes for a good case study because both traditional and source-synchronous products are available.
In order to do a complete analysis, typical values for maximum clock skew, minimum and maximum flight time, crosstalk, and multiple output switching (MOS) events have been included too. The numbers shown don't necessarily represent state-of-the-art examples, but they are reasonable placeholders with which to do this type of analysis (Fig. 2).
Robust design techniques, whether traditional or source-synchronous, demand that the designer work all potential system variables to their extreme values. Pushing and stretching these design variables sweeps out a multidimensional space that defines the solution boundaries. For example, the setup margin calculation includes maximum clock skew, maximum interface IC propagation delay, maximum MOS and crosstalk signal effects, maximum card-to-card flight time, and minimum interface IC setup time.
Using the worst-case numbers will define the maximum possible data rate for this particular system. Changing any of the variables changes the system constraints too, and therefore, the maximum data rate. Engineering the system to work at the desired data throughput requires balancing all of the above constraints. Over-engineering costs time-to-market and resources, while under-engineering the system often leads to expensive failures in the field.
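As a minimal illustration of this corner-sweep approach, the setup and hold margins can be checked at every combination of best-case and worst-case component values. The numbers below are assumed placeholders, not the Fig. 2 figures:

```python
# Minimal corner-sweep sketch: check setup and hold margins at every combination
# of best-case and worst-case values. All numbers are illustrative assumptions.
from itertools import product

CLK_PERIOD = 20.0                                  # ns, candidate bus clock period
corners = {                                        # (min, max) per variable, in ns
    "clk_skew":    (0.2, 1.0),
    "clk_to_q":    (2.0, 6.0),
    "mos":         (0.0, 1.5),
    "crosstalk":   (0.0, 1.0),
    "flight_time": (1.0, 6.0),
}
SETUP, HOLD = 2.5, 0.5                             # ns, receiver requirements (assumed)

# Setup margin: every delay term works against us, so sweep all corners and keep the worst.
worst_setup = min(
    CLK_PERIOD - skew - tpd - mos - xtalk - flight - SETUP
    for skew, tpd, mos, xtalk, flight in product(*corners.values())
)
# Hold margin: minimum delays work for us while clock skew works against us.
worst_hold = min(
    tpd + flight - skew - HOLD
    for skew, tpd, mos, xtalk, flight in product(*corners.values())
)
print(f"Worst setup margin: {worst_setup:.2f} ns, worst hold margin: {worst_hold:.2f} ns")
```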
In contrast to the setup calculation, the hold calculation is independent of the data rate. In the setup analysis, all the variables are stretched to data sheet maximums. For the hold analysis, they are compressed, allowing IC transitions and data flight times to occur as soon as physically possible. One particular point to notice with clock skew is that in hold time calculations, the clock skew is subtracted from IC delay and flight time. In setup time calculations, clock skew is added to IC delay and flight time.
Setup margin (ns) = CLK period − CLK skew − CLK to Q TPD (max) − MOS TPD − crosstalk − flight time − setup CLK to input
Hold margin (ns) = CLK to Q TPD (min) + flight time − CLK skew − hold CLK to input
Putting the results into tabular form and reducing the setup margin to zero nets a maximum clock rate of over 50 MHz with the GTLP18T612 Interface IC (Table 1).
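The same zero-margin calculation can be sketched directly. The component values below are illustrative placeholders, not the Fig. 2 or GTLP18T612 data-sheet numbers:

```python
# Minimal sketch: solve for the maximum clock rate by driving the setup margin
# to zero. Worst-case values below are illustrative assumptions.
clk_skew, clk_to_q_max = 1.0, 6.0        # ns (assumed)
mos, crosstalk         = 1.5, 1.0        # ns (assumed)
flight_max, setup_req  = 6.0, 2.5        # ns (assumed)

# Zero setup margin means the clock period exactly equals the sum of the delay terms.
min_period = clk_skew + clk_to_q_max + mos + crosstalk + flight_max + setup_req
print(f"Minimum period {min_period:.1f} ns -> about {1000.0 / min_period:.0f} MHz")
# With these placeholders: 18.0 ns, roughly 56 MHz, comparable to the Table 1 result.
```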
We can now move from the traditional synchronous design to the source-synchronous design by swapping components while leaving the physical interface and constraints intact, which allows a direct comparison of the two architectures. As with the traditional design, the analysis includes signal and clock routing as well as a representative solution, with the results again presented graphically to show all of the variables and the degree to which they affect overall system timing and performance.
The single clock line needs to follow the same path, use the same trace width, and have the same device loading as the datapath bits (Fig. 3). This reduces the clock-to-data skew caused by pc-board effects to the minimum possible value.
The solution, built around a GTLP17T616 interface, sends out the datapath clock signal on the CLKAB to CLKOUT path (Fig. 4). This path has skew specifications designed to guarantee timing relative to all of the CLKAB to B datapath signals (TPDEL). Because multiple devices are used across the parallel interface, device-to-device skew must be accounted for in the timing calculations. The addition of the device-to-device timing variable requires a delay element to be added in the CLKIN to CLKBA path. This delay element works to guarantee the B to CLKBA data-setup-time specification. Without the additional delay, interdevice skew, multiple output switching, and crosstalk delays would introduce enough uncertainty that a valid clock-to-data relationship could no longer be guaranteed.
Now that the interface is a source-synchronous environment, master clock skew has been eliminated from the variable list. In this analysis, the designer must ensure the data or interface clock adheres to the setup and hold requirements of the receiving device for the accompanying data (Fig. 5). The source-synchronous GTLP interface device outputs the data clock 1.5 ns behind its accompanying data; adding worst-case device-to-device skew and potential signal-to-signal interference (crosstalk) to that lag ensures that the correct data is always clocked into the receiver. By slowing the clock relative to the data, one end of the timing window has been established. The second half of the analysis determines how quickly new data may be clocked into the receiver.
The data clock now lags its accompanying data by 1.5 ns. The part-to-part skew assumes the data clock (CLKOUT) is being driven from a "slow" device, adding another 1.5 ns of potential delay. A multiple output switching event and/or signal-to-signal interaction on the pc board must also be accounted for in the worst-case timing analysis.
Now that the data clock has reached the receiving device, the maximum loop delay through the GTLP receiver and the additional delay element determines the best possible data rate, given all of the worst-case propagation delays. In the example shown in Figure 4, the system variables consume 14.5 ns in total, which works out to a clock frequency of about 69 MHz (Table 2). Any attempt to clock data through the interface at a higher frequency cannot guarantee enough time to register valid data before the incoming data starts to change state at the receiver inputs.
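To see how these terms stack up, here is a minimal sketch of the cycle budget. Only the two 1.5-ns skew terms and the 14.5-ns total come from the analysis above; the remaining split is an assumption chosen to match that total:

```python
# Source-synchronous cycle budget. The 1.5-ns clock lag, the 1.5-ns part-to-part
# skew, and the 14.5-ns total are from the text; the other terms are assumed
# placeholders chosen to sum to that total.
clock_lag         = 1.5   # ns, CLKOUT issued behind its data
part_to_part_skew = 1.5   # ns, "slow" device driving the data clock
mos_and_crosstalk = 2.0   # ns, board-level switching and interference penalty (assumed)
loop_delay        = 9.5   # ns, CLKIN -> delay element -> CLKBA registering (assumed)

cycle = clock_lag + part_to_part_skew + mos_and_crosstalk + loop_delay
print(f"Worst-case cycle budget: {cycle:.1f} ns -> about {1000.0 / cycle:.0f} MHz")
# 14.5 ns corresponds to roughly 69 MHz, in line with Table 2.
```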
Sending a single clock with the data has improved the interface clock frequency by 20 MHz, or 40% in terms of actual bandwidth or throughput. While 40% is a terrific gain based on the small amount of work required to reconfigure the interface, there's still room for improvement. One of the variables used in the previous solution is for device-to-device skew. Eliminating the device-to-device skew variable by sending a private clock signal from each GTLP interface device shortens the worst-case loop delay needed to clock or register the incoming data in the GTLP interface device. There's a design tradeoff. For better overall system performance, increase the number of interface clocks and accept the routing overhead of additional clock lines.
By eliminating performance-limiting variables and minimizing the loopback delay (the CLKOUT to CLKIN to CLKBA path) associated with the GTLP17T616, an optimal solution can be achieved. Variable elimination creates a tighter set of constraints surrounding the clock-data relationship, allowing higher bandwidth across the same parallel interface. The key variable eliminated is the device-to-device skew. Removing this skew from the data-rate calculation allows the maximum CLKIN to CLKBA delay to be reduced by 4.25 ns.
Looking back to the original synchronous interface and comparing the results shows a 120% bandwidth enhancement over the traditional design. Considering that we're using the same backplane, similar products, and a new architecture, the results are impressive. Nothing in life is free, though, and this performance has a price. We added eight clock lines, or, looked at another way, we took away eight data lines. The source-synchronous nature of the backplane signals also dictates that the data must now be retimed to either a master system clock or an on-card clock. The basic retiming will probably take place in a system ASIC, generally consisting of at least a register and most likely a synchronous FIFO. Either of these solutions adds to the latency of information across the backplane interface, but doesn't affect the overall bandwidth of the interface.
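As a rough software analogy for that retiming stage (the FIFO depth and burst length here are assumptions, not values from the design), a small synchronous FIFO adds a fixed number of cycles of latency while still passing one word per cycle:

```python
# Conceptual sketch (assumed, not from the article): retiming received data into the
# local clock domain through a small FIFO. The FIFO adds a few cycles of latency but
# still passes one word per cycle, so bandwidth is preserved.
from collections import deque

fifo = deque()
RETIME_DEPTH = 2            # cycles of buffering before the local domain starts reading

received = []
for cycle, word in enumerate(range(16)):      # 16 words arriving, one per bus clock
    fifo.append(word)                          # written with the received data clock
    if cycle >= RETIME_DEPTH:                  # local clock domain starts draining
        received.append(fifo.popleft())

# Drain what is left after the burst ends.
while fifo:
    received.append(fifo.popleft())

assert received == list(range(16))             # no words lost: same throughput
print(f"Added latency: {RETIME_DEPTH} cycles; words transferred: {len(received)}")
```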
In order to put this performance into perspective, you must consider that current PCI specifications at 66 MHz allow four to five slots. In addition, a PCI-X proposal allows only three slots. The waveforms in Figure 6 are on a fully loaded eight-slot CompactPCI backplane. At nearly twice the frequency and twice the number of slots, this source-synchronous design offers a serious performance advantage.
The source-synchronous architecture is an ideal upgrade from traditional synchronous interface design. It improves the throughput of many passive backplanes. All interface architectures are bound to have tradeoffs, but the positives resulting from a source-synchronous design can far outweigh the negatives.
Several points follow from this. As timing budgets tighten, designers in many application areas are moving toward this type of architecture. Master clock skew requirements are relaxed, which may allow some cost savings in the clock-distribution subsystem. Signal flight time is no longer a performance-limiting factor, and hold-time margins are easier to meet. On the other side of the ledger, source-synchronous architectures add a FIFO or resynchronization requirement to system ASICs.
In the future, source-synchronous designs will use state-of-the-art differential signaling techniques to further extend the capabilities of passive parallel backplane interfaces.