High-speed digital buses have evolved dramatically over the past decade. Not only are they faster, but they're also changing how systems clock data. To improve data throughput, emerging synchronous digital buses are sending data multiple times per cycle via an array of clocking schemes. This article presents a framework for understanding how source-synchronous clocking can optimize timing margins for high-speed interfaces.
A timing budget is an accounting of the timing requirements or timing parameters necessary for a system to function properly. For synchronous systems to work, timing requirements must fit within one clock cycle. A timing-budget calculation involves many factors, including hold-time requirements and maximum operating frequency requirements. Calculating a timing budget exposes the limitations of conventional clocking methods.
Let's use Figure 1 as an example for a system with standard clocking. The figure shows a memory controller interfacing with an SRAM. Both the SRAM and memory controller receive clock signals from the same clock source. It's assumed that clock traces are designed to match the trace delays. The relevant timing parameters are:
The maximum-frequency calculation gives the minimum cycle time of the system if the worst-case input setup time, clock to output time, propagation delay, clock skew, and clock jitter are considered. The maximum frequency is given by:
tCO(max, SRAM) + tPD(max) + tSU(max, CTRL) + tSKEW(max, CLK) + tJIT(max, CLK) < tCYC
The hold-time calculation verifies that the system does not output data so fast that it violates the input hold time of the receiving device. In this case, the worst-case condition occurs when the data is driven out at the earliest possible time. The formula is given by:
tCO(min, SRAM) + tPD(min) - tSKEW(min, CLK) - tJIT(min, CLK) > tH(max, CTRL)
Now let's assume the following values for the timing parameters of our SRAM and memory controller. In this case, we will use a high-speed SRAM with a double-data-rate (DDR) interface, where data is driven by the SRAM with every rising and falling edge of the clock.
The minimum hold-time requirement is calculated as:
tDOH + tPD - tSKEW - tJIT > tH
-0.45 ns + tPD - 0.2 ns - 0.2 ns > 0.4 ns
-0.85 ns + tPD > 0.4 ns
tPD > 1.25 ns
Assuming that the delay per inch of an FR4 board trace is 160 ps/in., the trace length from SRAM to memory controller must be at least 7.82 in. Using 1.25 ns for tPD, the maximum operating frequency is calculated below. Because the SRAM has a DDR interface, the timing budget is based on a half cycle:
tCO + tPD + tSU + tSKEW + tJIT < tCYC/2
0.45 ns + 1.25 ns + 0.5 ns + 0.2 ns + 0.2 ns < tCYC/2
2.6 ns < tCYC/2
5.2 ns < tCYC
192 MHz > fCYC
With a 7.82-in. FR4 trace length and typical timing parameters, the timing-budget requirements are met for an operating frequency of up to 192 MHz. In systems that have limited board space, the 7.82-in. minimum trace-length constraint becomes a difficult requirement to satisfy.
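The arithmetic above can be reproduced with a short script. This is a minimal sketch rather than a design tool: the function and constant names are illustrative, the parameter values are the ones substituted into the inequalities above, and all times are in nanoseconds.

```python
# Conventional (common-clock) timing budget, using the values quoted in the text.
# All times are in nanoseconds. Names are illustrative, not from any datasheet.
T_CO = 0.45      # SRAM clock-to-output time (max)
T_DOH = -0.45    # SRAM data-output hold time (min)
T_SU = 0.5       # controller setup time
T_H = 0.4        # controller hold time
T_SKEW = 0.2     # clock skew
T_JIT = 0.2      # clock jitter
FR4_DELAY_NS_PER_IN = 0.160  # FR4 trace delay assumed in the text


def min_trace_delay(t_doh, t_h, t_skew, t_jit):
    """Boundary value of tPD for tDOH + tPD - tSKEW - tJIT > tH."""
    return t_h - t_doh + t_skew + t_jit


def max_frequency_mhz(t_co, t_pd, t_su, t_skew, t_jit, transfers_per_cycle=2):
    """Upper bound on clock frequency; DDR budgets a half cycle per transfer."""
    t_cyc_min = transfers_per_cycle * (t_co + t_pd + t_su + t_skew + t_jit)
    return 1000.0 / t_cyc_min


t_pd_min = min_trace_delay(T_DOH, T_H, T_SKEW, T_JIT)
trace_in = t_pd_min / FR4_DELAY_NS_PER_IN
f_max = max_frequency_mhz(T_CO, t_pd_min, T_SU, T_SKEW, T_JIT)
print(f"tPD(min) = {t_pd_min:.2f} ns, trace = {trace_in:.2f} in., fmax = {f_max:.1f} MHz")
```

The unrounded trace length comes out just above 7.81 in., so the 7.82-in. figure in the text is that value rounded up to stay on the safe side of the hold-time limit.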
If it isn't possible to introduce a trace delay, the memory controller can satisfy the hold-time requirement by using a delay-locked loop/phase-locked loop (DLL/PLL) to phase-shift the clock signal to capture data at an earlier time (Fig. 3). The memory controller will have to resynchronize captured data with the system clock. Using this method will introduce additional PLL/DLL jitter, which decreases the system's maximum operating frequency. With the added delay of the PLL, the minimum hold-time requirement becomes:
tDOH + tPD(trace) + tPLL/DLL_DELAY - tSKEW - tJIT > tH
The maximum-frequency requirement becomes:
tCO + tPD + tSU + tSKEW + tJIT + tJIT_PLL/DLL < tCYC/2
where tJIT_PLL/DLL is jitter introduced by the DLL/PLL.
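To see how the two PLL/DLL inequalities trade off against each other, the sketch below evaluates them for a short trace. The 0.5-ns trace delay and the 0.1-ns PLL/DLL jitter are hypothetical values chosen only for illustration; the remaining parameters are the ones used in the earlier example, and the function names are not from any vendor tool.

```python
# PLL/DLL-assisted capture: hold-time and maximum-frequency checks (times in ns).
T_CO, T_DOH = 0.45, -0.45   # SRAM clock-to-output (max) and data-output hold (min)
T_SU, T_H = 0.5, 0.4        # controller setup and hold times
T_SKEW, T_JIT = 0.2, 0.2    # clock skew and clock jitter
T_PD_TRACE = 0.5            # hypothetical short trace (about 3 in. of FR4)
T_JIT_PLL = 0.1             # hypothetical jitter added by the PLL/DLL


def min_pll_delay(t_doh, t_pd_trace, t_skew, t_jit, t_h):
    """Boundary phase shift for tDOH + tPD + tDELAY - tSKEW - tJIT > tH."""
    return max(0.0, t_h - t_doh - t_pd_trace + t_skew + t_jit)


def max_frequency_mhz(t_co, t_pd, t_su, t_skew, t_jit, t_jit_pll=0.0):
    """DDR half-cycle budget: the sum of all terms must stay below tCYC/2."""
    half_cycle_min = t_co + t_pd + t_su + t_skew + t_jit + t_jit_pll
    return 1000.0 / (2.0 * half_cycle_min)


delay_needed = min_pll_delay(T_DOH, T_PD_TRACE, T_SKEW, T_JIT, T_H)
f_without_pll_jitter = max_frequency_mhz(T_CO, T_PD_TRACE, T_SU, T_SKEW, T_JIT)
f_with_pll_jitter = max_frequency_mhz(T_CO, T_PD_TRACE, T_SU, T_SKEW, T_JIT, T_JIT_PLL)
print(f"required phase shift > {delay_needed:.2f} ns")
print(f"fmax: {f_without_pll_jitter:.1f} MHz -> {f_with_pll_jitter:.1f} MHz with PLL/DLL jitter")
```

Whatever values are plugged in, any nonzero tJIT_PLL/DLL lowers the frequency bound, which is the cost the text attributes to this method.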
Clock skew, clock jitter, and trace propagation delay can significantly limit system performance, even with the fastest SRAMs and ASICs/FPGAs available.
As mentioned earlier, the trace delay is approximately 160 ps/in. if an FR4 board is used. This is a significant number considering how the data-valid window at high frequencies has become 2 ns (e.g., for a 250-MHz, double-data-rate (DDR) device) and lower. Skew between the clock signals can also significantly reduce timing margins. We shall see that source-synchronous clocks can significantly reduce propagation delay, skew, and jitter, making timing closure more attainable.
Advantages Of Source-Synchronous Clocking
In a typical source-synchronous transaction, a rising clock edge associated with each word of data is sent out. (There can be multiple data per clock cycle with DDR.) The receiving device uses the clock edge to latch the data. Then, it resynchronizes the data to the master or common clock. Having the clock and data/control signals synchronized and transmitted by the same device virtually eliminates board-trace propagation delay of the signal with respect to the clock.
But different board-layout considerations arise with source-synchronous clocking. In a system with an independent clock generator, which supplies clocks to multiple devices, the primary concern is to design trace lengths so that all of the clock edges arrive simultaneously at the devices. This may involve lengthening traces to devices near the clock generator.
With a source-synchronous approach, the main concern is maintaining phase alignment between clock and data by matching the trace lengths of the output clock and data signals. Assuming that proper trace matching is done, propagation delay of data with respect to the clock no longer applies.
There are various ways to implement source-synchronous clocking:
Most DDR memory devices, such as the QDR-II/DDR-II SRAM, use this method, and they serve as an example in this discussion. The memory devices transmit both clock and data to the receiver.
The QDR-II produces a pair of output clocks, CQ and /CQ, which are ideally 180° apart from one another. The receiver uses the rising edges of both clocks to latch in data.
In both memory devices, the receiver must delay the clock to satisfy its setup- and hold-time requirements for the data capture. This delay can be implemented in three ways: through an on-chip delay block, through a PLL or DLL at the receiving end, or through an on-board trace delay.
The first two methods are favored in FPGA designs due to their frequency migration capability. To use the same design at a higher frequency, the FPGA code can be modified to change the amount of delay introduced by the PLL/DLL. An ASIC, on the other hand, is typically designed to run at a particular frequency. Board trace delay is often the preferred method for ASICs.
Any one, or all, of these three types of jitter should be considered in a timing-budget calculation, depending on the application. With respect to a synchronous clocking environment, the clock source or the PLL/DLL generating the input clock typically causes the jitter variation. When added to the timing budget, jitter (tJIT) can significantly reduce timing margins, especially at high frequencies.
However, if the clock and data have the same jitter at the receiving device, the jitter component (tJIT) can be eliminated from the timing budget. Such is the case with source-synchronous clocks, where clock and data are driven and tightly aligned by the same transmitting device. This is usually the case when the clock is designed like one of the data outputs. Clock-to-output time variation between the clock and data pins must still be considered, but this parameter is typically around ±100 ps.
In some applications, source-synchronous clocks are delayed by more than one cycle to latch in data. In this instance, long-term jitter is added to the timing budget and will reduce the timing margin. That jitter component is also called N-cycle jitter, where N is the number of cycles by which the source-synchronous clock is delayed with respect to the data. Needless to say, delaying source-synchronous clocks by more than one cycle isn't recommended.
Timing-Budget Calculation
As shown in the following example, eliminating propagation delay from a timing budget can greatly improve system-timing margins. Figure 5 shows an example of an SRAM with source-synchronous clocks.
Let's perform a timing-budget calculation with this setup. By design, the rising edge of the SRAM's output clock is aligned with the start of the data-valid window. We assume that the memory controller will delay the clock on-chip with a DLL/PLL to satisfy its setup- and hold-time requirements. The timing parameters of the SRAM and memory controller remain the same. Assuming the clock and data trace lengths are matched, the timing-budget calculation can ignore the trace propagation delay (tPD). Furthermore, clock-generator jitter and skew (tSKEW, tJIT) no longer apply because skew and jitter are the same for the clock and data (barring pin-to-pin variations of ±100 ps). In this setup, an additional parameter to be considered is:
tJIT,SRAM: jitter in the SRAM's output source-synchronous clocks with respect to data. This can be caused by pin-to-pin clock-to-output variations.
In this example, we assume that the clock and data traces are perfectly matched. For cases of variations in trace lengths caused by layout designs, length variation between clock and data traces must be taken into account. This parameter also does not apply if the clock trace is intentionally lengthened to delay the clock with respect to data (for centering the clock to the data-valid window). Assume the following value for the SRAM's source-synchronous clock jitter:
tJIT,SRAM: ±0.2 ns
We can calculate the minimum cycle time step by step (see the table). To have a positive margin:
tM > 0
tCYC/2 - 2.2 ns > 0
tCYC/2 > 2.2 ns
tCYC > 4.4 ns
fCYC < 227 MHz
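The margin calculation can also be sketched in code. The table itself is not reproduced here, so the breakdown of the 2.2-ns overhead below is an assumption: it takes the half cycle, subtracts the data-valid uncertainty (tCO - tDOH), the full ±0.2-ns swing of tJIT,SRAM, and the controller's setup and hold times. That decomposition happens to reproduce the 2.2-ns total with the parameter values used earlier, but the names and the split are illustrative.

```python
# Source-synchronous timing margin for a DDR interface (times in ns).
# The decomposition of the overhead is an assumption, not the article's table.
T_CO, T_DOH = 0.45, -0.45   # SRAM clock-to-output (max) and data-output hold (min)
T_SU, T_H = 0.5, 0.4        # controller setup and hold times
T_JIT_SRAM = 0.2            # +/- jitter of the output clock relative to data


def overhead_ns():
    """Time per half cycle that is unavailable as margin."""
    data_uncertainty = T_CO - T_DOH      # half cycle minus the data-valid window
    clock_uncertainty = 2 * T_JIT_SRAM   # full +/- swing of the clock edge
    return data_uncertainty + clock_uncertainty + T_SU + T_H


def margin_ns(f_mhz):
    """Timing margin tM = tCYC/2 - overhead at a given clock frequency."""
    half_cycle = 1000.0 / f_mhz / 2.0
    return half_cycle - overhead_ns()


f_max_mhz = 1000.0 / (2.0 * overhead_ns())
print(f"overhead = {overhead_ns():.2f} ns, fmax = {f_max_mhz:.1f} MHz")
print(f"margin at 200 MHz = {margin_ns(200):.2f} ns")
```

Because tPD, tSKEW, and tJIT of the system clock do not appear anywhere in this budget, the result does not depend on trace length, provided the clock and data traces stay matched.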
With source-synchronous clocks, setup-and-hold time requirements are met, and no constraints exist on the trace length of the data signal. The maximum frequency of operation is calculated to be 227 MHz, a 35-MHz increase from the conventional clocking method. Note that the main frequency-limiting factors in this case are the setup-and-hold time of the controller.
Best Practices For Source-Synchronous Clock Usage
To extract the most benefit from source-synchronous clocks, designers should keep the following points in mind: