Chapter 2

Clock Deskew Scheme

The clock deskew scheme developed in this thesis distributes clock signals to various locations and continuously monitors the different distribution channels. The scheme is designed to be applicable to multiphase clock systems; a design example involving a four-phase clock system is given. To be compatible with the Boundary-Scan Test method chosen for testing the F-RISC chips, a phase generator circuit has been designed that can be stopped and restarted without skipping a phase.

There are two versions of the deskew scheme presented in this thesis. In both versions, a clock distribution and skew compensation chip sends clock signals to each of the chips in the system requiring synchronization and, in turn, receives return clock signals from each of those chips. A clock loop is defined to be the feedback connection between the skew compensation chip and a chip needing a deskewed clock. The two schemes differ in the way the reference delay loop is implemented. In the first scheme, the reference delay loop has a fixed delay and is implemented within the deskew chip (Fig. 2.1), whereas in the second scheme, one of the clock loops is chosen as the reference delay loop (Fig. 2.2). Each scheme has its advantages and disadvantages, but this thesis makes the case for the general superiority of the first scheme.

The first scheme has an advantage over the second during the layout stage of the design process: the layout of the return paths of the clock loops is simpler. Since an independent delay is used as the reference loop, its return path, which is distributed to one of the inputs of each PLL, no longer needs to be matched closely with the return paths of the clock loops that connect to the other PLL inputs. This is in sharp contrast with the second deskew scheme, where the reference loop is also the clock loop providing the clock signal to one of the chips. In this case, care must be taken when laying out the return paths of the clock loops. The delays through the portions of the return paths from the VCDEs to the inputs of the PLLs must match each other, just as in the first scheme, but they must also match the delay through the corresponding portion of the reference clock loop. Herein lies the main problem with the second deskew scheme.
The return path of the reference loop is distributed to one of the inputs of every PLL in the deskew chip; consequently, this path covers a wide area within the deskew chip, with an increased likelihood of delay variations in the connections to different PLLs. The n - 1 PLL loads also contribute to the delay variation of this path. To match the delay in this path, the corresponding portions of the return paths of the other clock loops need to be adjusted, most likely lengthened, which usually implies a larger deskew chip area than for the first scheme. However, the second scheme has some qualities that are more attractive than those of the first.

The second scheme requires one fewer PLL than the first. In the first scheme (with the fixed reference delay), the number of PLLs required equals the number of chips requiring the clock signal, n, whereas the second scheme requires n - 1. A more significant advantage is the relaxed performance requirement on the PLL design for the special case of deskewing just two clock signals. As will be shown later in this chapter, in this special case the contribution of the steady state phase error of the PLL to the overall clock skew error is halved for the second scheme, whereas for the first scheme the phase errors contribute with their full weight. This allows a simpler PLL design, since a larger steady state phase error can be tolerated for a given acceptable overall skew error.

Fig. 2.1. Deskew scheme with internal reference delay (first scheme)

Fig. 2.2. Deskew scheme without internal reference delay (second scheme)

Fig. 2.3. Delay diagram for the first scheme

Fig. 2.4. Delay diagram for the second scheme

The output of the reference delay loop, Tref or T0 depending on the scheme, is distributed to one of the two inputs of the PLL in each of the clock loops. The other input of the PLL is the output of the return path of a given clock loop. The PLL compares the two inputs and adjusts the Voltage Controlled Delay Elements (VCDEs) so that the delay difference between the inputs, the steady state phase detector error, is reduced to a minimum. Once all the PLLs have locked, the total delays in the sent paths of the clock loops are equal to each other, as are the delays in the return paths (Fig. 2.3 and Fig. 2.4), provided that the sent and return paths of each clock loop are matched. Since the delays in all the sent paths are equal, the clock signals at the chips on an MCM are effectively deskewed.
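As a rough illustration of the locking principle, the following Python sketch models each clock loop as a fixed sent-path delay, a fixed return-path delay, and two VCDEs driven by one filter output, and computes where an ideal PLL (zero steady state phase error, k = 0) would set the VCDEs. The class `ClockLoop`, the functions `locked_vcde_delay` and `clock_arrival`, and all delay values are hypothetical and are not taken from the thesis design.

```python
from dataclasses import dataclass


@dataclass
class ClockLoop:
    """Hypothetical one-loop model; all delays in ps.

    fixed_sent:   every non-VCDE delay from the master clock to the chip
    fixed_return: every non-VCDE delay from the chip back to the PLL input
    """
    fixed_sent: float
    fixed_return: float


def locked_vcde_delay(loop: ClockLoop, t_ref: float) -> float:
    """VCDE delay an ideal PLL settles at (zero phase error, k = 0).

    One filter output drives both VCDEs, so each adds the same delay c and the
    loop delay is fixed_sent + fixed_return + 2*c, which the PLL drives to t_ref.
    """
    return (t_ref - loop.fixed_sent - loop.fixed_return) / 2.0


def clock_arrival(loop: ClockLoop, t_ref: float) -> float:
    """Master-clock-to-chip delay (sent path) after lock."""
    return loop.fixed_sent + locked_vcde_delay(loop, t_ref)


# Hypothetical loops with very different routing but matched sent/return halves.
loops = [ClockLoop(300.0, 300.0), ClockLoop(550.0, 550.0), ClockLoop(420.0, 420.0)]
T_REF = 1400.0
arrivals = [clock_arrival(loop, T_REF) for loop in loops]
print(arrivals)  # [700.0, 700.0, 700.0]
```

Algebraically, `clock_arrival` reduces to t_ref/2 + (fixed_sent - fixed_return)/2, so the routed length of each loop drops out and only a mismatch between the two halves of a loop survives as skew.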

In the following sections, a more mathematical derivation of the aforementioned principles and the necessary conditions of the deskew schemes are presented. As will be shown, the upper bound on the final skew error is set by the performance of the PLLs and by how accurately the local delays are matched.

2.1. Deskew Schemes

Let the source of the clock signal, labelled Master CLK in Fig. 2.1 and Fig. 2.2, be selected as the reference point for the other points in the block diagram. The term *T0* represents the total delay from the master clock to the input of the PLL in clock loop 0. Likewise, *T1*, *T2*, …, *Tn* represent the delays through the other clock loops. The output of the filter of each PLL is connected to both sets of VCDEs in its clock loop, resulting in identical delays through them, *Tlc0* and *Trc0*. Similarly, *Tlcn* and *Trcn* of clock loop n are equal to each other. Once all the PLLs have locked, *T0*, *T1*, …, *Tn* are equal to each other. Similarly, T0*, T1*, …, Tn* are also equal to each other, resulting in deskewed clock signals (Fig. 2.3 and Fig. 2.4).

A potential source of skew error is the mismatch in the delays of the drivers/receivers between the clock loops. These mismatches can occur in such a way that they compensate each other and thus do not contribute significantly to the final clock skew error: if the driver in the sent path of clock loop 0 has a larger delay than the driver in clock loop n, the receiver in the sent path of clock loop 0 may have a smaller delay than the corresponding receiver in clock loop n by approximately the same amount. On the other hand, the driver/receiver mismatches contribute significantly to the final clock skew error when the delays of the driver and the receiver of a given clock loop deviate in the same direction relative to the corresponding elements of the other clock loops, for example, when both the driver and the receiver on the sent path of clock loop 0 have larger delays than the corresponding ones in clock loop n.

In any case, thanks to the loop topology of the clock loops, only half of this type of delay mismatch contributes to the final skew error. For clock loops 0 and n, the driver/receiver delay mismatch can be expressed as,

In terms of the error terms defined in Eq. 2.5, which are summarized at the end of this chapter,

It will be shown later that only half of this delay mismatch appears across the sent path, while the other half appears across the return path. The delay terms that correspond to devices on the same chip, such as *Tru0/Trun* and *Tlu0/Tlun*, agree closely with each other. The other terms, however, represent delays through devices fabricated on different chips, possibly from different wafers, and thus are of serious concern. This concern is the Achilles' heel of both versions of the deskew scheme. Even though their contributions to the final skew error are halved, the necessity of selecting chips with acceptably matched driver/receiver delays can prove to be very expensive.
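The halving can be checked with a minimal Python sketch of two locked loops, assuming an ideal PLL and a single shared VCDE delay per loop. Here the sent path of loop 0 carries the extra driver/receiver delay while loop n is nominal; all names and numbers (`clock_arrival`, `NOMINAL`, the 12 ps and 8 ps excesses) are hypothetical.

```python
def clock_arrival(fixed_sent: float, fixed_return: float, t_ref: float) -> float:
    """Sent-path delay after an ideal lock (hypothetical model, ps).

    The shared VCDE delay c satisfies fixed_sent + fixed_return + 2*c = t_ref.
    """
    c = (t_ref - fixed_sent - fixed_return) / 2.0
    return fixed_sent + c


T_REF = 2000.0
NOMINAL = 500.0  # hypothetical nominal one-way delay excluding VCDEs, ps

reference = clock_arrival(NOMINAL, NOMINAL, T_REF)  # loop n, nominal parts
# Loop 0: driver 12 ps slow AND receiver 8 ps slow (same direction).
same_direction = clock_arrival(NOMINAL + 12.0 + 8.0, NOMINAL, T_REF)
# Loop 0: driver 12 ps slow, receiver 12 ps fast (compensating).
compensating = clock_arrival(NOMINAL + 12.0 - 12.0, NOMINAL, T_REF)

print(same_direction - reference)  # 10.0, half of the 20 ps excess
print(compensating - reference)    # 0.0
```

The PLL shortens both VCDEs of loop 0 by 10 ps, so the 20 ps excess appears as +10 ps on the sent path and -10 ps on the return path.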

During the layout stage, care should also be taken to ensure that the differences in the lengths of the routed portions of the sent paths of different clock loops are maintained in the return paths as well, for example, *Trmn - Trm0 = Tlmn - Tlm0*. This condition is also satisfied if the delays of the sent and return paths of each clock loop are matched; the latter condition may be easier to satisfy during routing or layout.
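The routing condition can be illustrated with the same kind of idealized lock model. In this hypothetical Python sketch (ideal PLLs, all non-routing components matched, delays in ps), `routing_skew` returns the arrival-time difference between loop n and loop 0 given only their routed sent and return delays, corresponding to the *Trm* and *Tlm* terms.

```python
def routing_skew(sent_0, return_0, sent_n, return_n, t_ref=2000.0):
    """Arrival-time difference (loop n minus loop 0) caused by routed delays alone.

    Hypothetical model: ideal lock, all other components matched, delays in ps.
    """
    def arrival(sent, ret):
        c = (t_ref - sent - ret) / 2.0  # shared VCDE delay chosen by the PLL
        return sent + c                 # = t_ref/2 + (sent - ret)/2
    return arrival(sent_n, return_n) - arrival(sent_0, return_0)


# Loop n is routed 150 ps longer than loop 0 in both paths: condition met.
print(routing_skew(200.0, 260.0, 350.0, 410.0))  # 0.0
# Each loop has matched sent/return routing: condition also met.
print(routing_skew(200.0, 200.0, 350.0, 350.0))  # 0.0
# 150 ps longer on the sent path but only 100 ps longer on the return path.
print(routing_skew(200.0, 260.0, 350.0, 360.0))  # 25.0
```

A 50 ps violation of the condition yields 25 ps of skew, again half of the mismatch.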

2.1.1. Restrictions Placed on the Scheme

A restriction must be placed on the maximum clock loop delay difference for both schemes, such as *Tn - Tref* for the first scheme and *Tn - T0* for the second.

For multiphase clock generation, of all the possible even integer multiples of the clock period by which the clock loop delays could differ, only the zero multiple is permitted. Stated another way, the maximum difference in clock loop delay that can be tolerated (for example, *Tref - Tn*) must be less than one clock period. This restriction comes from the implementation of the clock phase SYNC signal, which is used to make certain that the clock phases across the various chips are synchronized. If the maximum delay difference of the clock loops were allowed to exceed one clock period, there would be no viable means of distributing the SYNC signal unless the SYNC distribution itself were deskewed. Distributing the SYNC signal would then be as complex as distributing the clock signal itself, which would not be cost-effective and hence is not a desirable solution. An alternative solution, in which the SYNC signal is not deskewed but is distributed to all the chips in a daisy chain, is needed and has been studied; it is discussed in the following section.
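A phase detector compares edges, so it cannot tell apart loop delays that differ by whole clock periods; this is why the integer multiple in the loop equations must be forced to zero. The Python sketch below, with hypothetical function names and a hypothetical 500 ps clock period, splits a loop-delay difference into whole periods plus the residual the detector actually sees, and checks the spread restriction. It assumes matched sent and return halves, so a chip sees half of any loop-delay offset.

```python
def lock_ambiguity(loop_delay: float, ref_delay: float, t_clk: float):
    """Split a loop-delay difference into k whole periods plus the residual
    that a phase detector actually sees (|residual| <= t_clk / 2)."""
    difference = loop_delay - ref_delay
    k = round(difference / t_clk)
    return k, difference - k * t_clk


def spread_within_one_period(loop_delays, t_clk: float) -> bool:
    """Restriction of Section 2.1.1: loop-delay spread below one clock period."""
    return max(loop_delays) - min(loop_delays) < t_clk


T_CLK = 500.0  # hypothetical clock period, ps
k, residual = lock_ambiguity(2000.0, 1000.0, T_CLK)
print(k, residual)    # 2 0.0 -> the detector sees a perfect lock
print(k * T_CLK / 2)  # 500.0 -> yet the chip receives the clock one full period late
print(spread_within_one_period([1000.0, 1300.0], T_CLK))  # True
print(spread_within_one_period([1000.0, 2000.0], T_CLK))  # False
```

With an even k, the clock edges at the chip line up with those of the reference, so the error is invisible to the clock itself but shifts the phase sequence, which is exactly what the SYNC mechanism cannot tolerate.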

2.1.2. Synchronization Procedure for a Clock with Multiple Phases

The problem of synchronizing a multiphase clock distributed to various chips is solved by performing the following four steps in sequence:

*Step 1:* Wait until all the PLLs have locked; at this stage, the clock signals are deskewed but the clock phases may be out of synchronization.

*Step 2:* Stop the master clock.

*Step 3:* Send the SYNC signal to all the chips; then, the chips are ready to generate the first clock phase at the next master clock edge.

*Step 4:* Restart the master clock.
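The four steps can be exercised with a small behavioral model in Python. `PhaseGenerator` is a hypothetical stand-in for the per-chip four-phase generator (it is not the circuit designed in this thesis): it holds its state while the clock is stopped, and SYNC only arms it so that the next edge produces the first phase, numbered 0 here. Step 1 is assumed complete, so every chip sees each master clock edge at the same time.

```python
class PhaseGenerator:
    """Hypothetical behavioral model of a per-chip four-phase generator."""
    N_PHASES = 4

    def __init__(self, initial_phase: int):
        self.phase = initial_phase  # phase emitted on the most recent edge
        self.sync_pending = False

    def receive_sync(self):
        # Step 3: arm the generator; its phase does not change while the clock is stopped.
        self.sync_pending = True

    def clock_edge(self) -> int:
        # Each master-clock edge advances one phase; an armed generator restarts at phase 0.
        if self.sync_pending:
            self.phase = 0
            self.sync_pending = False
        else:
            self.phase = (self.phase + 1) % self.N_PHASES
        return self.phase


def run_sync_procedure(chips, edges_after_restart=6):
    # Step 1 assumed done: each edge reaches every chip simultaneously.
    before = [chip.clock_edge() for chip in chips]  # deskewed, but phases differ
    # Step 2: master clock stopped, so no clock_edge() calls happen here.
    for chip in chips:  # Step 3: SYNC travels down the daisy chain
        chip.receive_sync()
    # Step 4: restart; every edge again reaches all chips at the same time.
    after = [[chip.clock_edge() for chip in chips] for _ in range(edges_after_restart)]
    return before, after


chips = [PhaseGenerator(p) for p in (0, 2, 3)]
before, after = run_sync_procedure(chips)
print(before)  # [1, 3, 0]
print(after)   # [[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3], [0, 0, 0], [1, 1, 1]]
```

Before synchronization the chips emit different phases on the same edge; after the restart they step through the four phases in lockstep, and no chip skips a phase.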

After the first step, the delays of the VCDEs in each of the clock loops are set such that all the clock loop delays are equal to each other. The master clock stoppage should be short relative to the response time of the PLL loop filters (the inverse of their bandwidths) but long enough for the SYNC signal to reach all the chips before the master clock restarts. This guarantees that the VCDE delays set during the first step are still maintained through the fourth step. Hence, when the master clock restarts during the fourth step, the first edge of the master clock reaches all the chips simultaneously. Since the SYNC signal sent during the third step readied the chips to generate the first clock phase upon receiving the next master clock edge, and since that edge reaches all the chips simultaneously, the clock phases generated by the chips after the fourth step are all synchronized with each other. This holds because the delays through all the clock loops are still equal to each other during the fourth step.
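The timing window for the stoppage can be written as a simple check. In this hypothetical Python sketch, the lower bound is the total daisy-chain propagation time of SYNC, and the upper bound is a small fraction of the PLL loop-filter time constant; the 1% fraction, the function name, and all numbers are illustrative assumptions, not values from the thesis.

```python
def stop_duration_window(sync_hop_delays, filter_time_constant, max_fraction=0.01):
    """Allowed (shortest, longest) master-clock stop duration, or None.

    shortest: the daisy-chained SYNC signal must reach the last chip before restart.
    longest:  the stop must stay short against the PLL loop-filter time constant so
              the VCDE settings from Step 1 hold; max_fraction is a hypothetical margin.
    All times share one unit (ps here).
    """
    shortest = sum(sync_hop_delays)
    longest = max_fraction * filter_time_constant
    return (shortest, longest) if shortest < longest else None


hops = [800.0, 800.0, 800.0]  # hypothetical per-chip SYNC hop delays, ps
print(stop_duration_window(hops, filter_time_constant=1.0e6))  # window from 2400 ps to about 10000 ps
print(stop_duration_window(hops, filter_time_constant=1.0e5))  # None
```

A slower loop filter widens the window, at the cost of slower initial locking in Step 1.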

2.1.3. The Final Clock Skew Error for the First Deskew Scheme

In this section, the first deskew scheme is treated more mathematically, and the expression for the final clock skew error is derived. Without loss of generality, simple expressions can be derived from the parameters of two clock loops. The term *Tref* represents the fixed delay through the reference delay loop, which is connected to one of the inputs of every PLL in the scheme, as shown in Fig. 2.1. The delay through the reference loop should be chosen such that the following equation holds, under the assumption that clock loop n has the maximum distribution delay (*Trmn* and *Tlmn*) while clock loop 0 has the minimum delay (*Trm0* and *Tlm0*):

The above condition ensures that the VCDEs in each clock loop, whose variable delay range is limited, have enough range to settle at appropriate states for the chosen fixed reference delay.
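The condition can be made concrete with a Python sketch under explicit assumptions: each loop's delay is a fixed part plus two equal VCDE delays, and each VCDE can be tuned over a hypothetical range [c_min, c_max]. A valid *Tref* must then be reachable both by the longest loop with its VCDEs at minimum and by the shortest loop with its VCDEs at maximum. The function name and numbers are hypothetical.

```python
def tref_window(fixed_loop_delays, c_min, c_max):
    """Range of reference delays that every loop can match, or None if none exists.

    Assumed model: loop delay = fixed part + 2*c, with one shared VCDE delay c
    per loop tunable over [c_min, c_max]. Delays in ps.
    """
    low = max(fixed_loop_delays) + 2 * c_min   # longest loop (loop n), VCDEs at minimum
    high = min(fixed_loop_delays) + 2 * c_max  # shortest loop (loop 0), VCDEs at maximum
    return (low, high) if low <= high else None


print(tref_window([900.0, 1100.0, 1250.0], c_min=50.0, c_max=300.0))  # (1350.0, 1500.0)
print(tref_window([900.0, 1500.0], c_min=50.0, c_max=300.0))          # None
```

Under this model a window exists only when the spread of fixed loop delays is at most 2(c_max - c_min); choosing *Tref* near the middle of the window leaves tuning margin on both sides.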

The following delay equations are satisfied by clock loop 0 and clock loop n, respectively:

Eq. 2.1

Eq. 2.2

where *T0* is the delay in clock loop 0, *Tn* is the delay in clock loop n, *Tclk* is the period of the clock, k is an integer, and the remaining term in each equation is the steady state phase detector error of the corresponding PLL. Once the PLLs have locked, this phase error term in Eq. 2.1 and Eq. 2.2 approaches zero. The two equations can be combined into a single equation by subtracting Eq. 2.2 from Eq. 2.1.

Eq. 2.3

A more detailed form of Eq. 2.3, expressed in individual delay terms of the clock loops, is given below:

Eq. 2.4

Since the sent and return routed paths of a clock loop are closely matched to each other, it can be assumed that *Trm0* ≈ *Tlm0* and *Trmn* ≈ *Tlmn*. Similarly, the VCDE delays in both paths of a clock loop are closely matched: *Trc0* ≈ *Tlc0* and *Trcn* ≈ *Tlcn*.

To state this in a more formal way, let,

Eq. 2.5

In terms of these parameters, Eq. 2.4 can be rewritten as,

Eq. 2.6

From this equation the expression relating the skew error between the two clock loops can be derived as follows:

Eq. 2.7

Eq. 2.8

Eq. 2.9

which leads to,

Eq. 2.10

For k0n = 0, as required by the scheme (see Section 2.1.1), Eq. 2.10 simplifies to,

Eq. 2.11

Note the similarity of the above equation to the maximum skew error equation for the second scheme derived in the next section. The relative phase detector error, that is, the difference between the steady state phase errors of the PLLs in clock loops 0 and n, can be as large as twice the maximum steady state phase error of the PLLs: each of the two phase errors can reach this maximum magnitude, and their signs can be opposite. Expressed in terms of the maximum steady state phase error, Eq. 2.11 becomes,

Eq. 2.12

This equation gives the upper bound on the final skew error in the clock signals distributed to the various chips. The phase detector error term contributes its full weight to this upper bound, whereas only half of the values of the remaining error terms in Eq. 2.12 contribute. As described in the next section, in the counterpart error equation for the second deskew scheme, under the special case of just two deskewable clock signals, only half of the phase detector error term contributes to the overall clock skew error.

Indeed, the factor of 1/2 on the phase detector error term in the skew error expression for the second scheme (Eq. 2.18) can be interpreted as a relaxed performance requirement on the PLL relative to the first scheme. When the mismatch error terms are made small, the PLL phase error that can be tolerated for a given system skew margin is up to twice as large as in the first scheme.
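The factor of two can be checked by brute force under a simplified model: matched sent and return halves (so a chip sees half of its loop delay), no component mismatch, and PLL phase errors anywhere in [-e_max, e_max]. Because the skew is linear in the phase errors, checking the extreme values is enough. The functions and the 8 ps value in this Python sketch are hypothetical.

```python
from itertools import product


def arrival(loop_delay):
    """Chip arrival time when the sent and return halves of a loop are matched."""
    return loop_delay / 2.0


def worst_skew_first_scheme(t_ref, e_max):
    """Loops 0 and n both lock to the fixed reference, each with its own phase error."""
    return max(abs(arrival(t_ref + e0) - arrival(t_ref + en))
               for e0, en in product((-e_max, e_max), repeat=2))


def worst_skew_second_scheme_two_loops(t_0, e_max):
    """Loop 0 is the reference itself; only the PLL of loop n has a phase error."""
    return max(abs(arrival(t_0) - arrival(t_0 + en)) for en in (-e_max, e_max))


E_MAX = 8.0  # hypothetical maximum steady state phase error, ps
print(worst_skew_first_scheme(2000.0, E_MAX))             # 8.0
print(worst_skew_second_scheme_two_loops(2000.0, E_MAX))  # 4.0
```

In the first scheme two independent phase errors of opposite sign add, giving a full e_max of skew; in the two-signal second scheme only one phase error exists, so half of e_max reaches the chips.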

Eq. 2.12 and Eq. 2.18 show that the deskew scheme is limited by how well the component delays can be matched and by the performance of the PLLs. The component delay mismatch terms δc0, δcn, εru, and εlu can be made as small as required relatively easily, since the corresponding components are all fabricated on a single chip. However, the terms εrd and εld are likely to be larger, since they represent the delay mismatch of devices fabricated on different chips. If those devices were fabricated on the same wafer, these terms could also be made very small. The design objective should be to minimize these error terms.
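A worst-case budget in the spirit of Eq. 2.12 can be tallied with the triangle inequality, assuming, as stated for Eq. 2.12, that the phase detector error enters with full weight and each mismatch term with weight 1/2. This Python sketch is not Eq. 2.12 itself, since it ignores any sign relationships among the terms; the dictionary keys mirror the δ and ε terms, and the magnitudes are hypothetical.

```python
def skew_budget(e_max, mismatch_terms):
    """Worst-case skew bound (ps): full PLL phase error plus half of every mismatch magnitude."""
    return e_max + 0.5 * sum(abs(value) for value in mismatch_terms.values())


# Hypothetical mismatch magnitudes in ps.
terms = {
    "delta_c0": 1.0, "delta_cn": 1.0,  # VCDE sent/return mismatch, same chip
    "eps_ru": 1.0, "eps_lu": 1.0,      # devices on the same chip
    "eps_rd": 6.0, "eps_ld": 6.0,      # devices on different chips
}
total = skew_budget(8.0, terms)
cross_chip = 0.5 * (terms["eps_rd"] + terms["eps_ld"])
print(total)       # 16.0
print(cross_chip)  # 6.0
```

With these illustrative numbers, the two cross-chip terms account for 6 ps of the 8 ps mismatch contribution, which is why driver/receiver matching across chips tends to dominate the budget.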

2.1.4. The Overall Clock Skew Error for the Second Deskew Scheme

For deskewing three or more clock signals, this scheme has the same limitations on the final clock skew error as the first scheme. For the special case of just two deskewable clock signals, however, the second scheme has a significant advantage: the contribution of the phase detector error to the overall clock skew error is reduced by a factor of two. In other words, a lower-performance, cheaper PLL can be used in the second scheme to obtain the same overall clock skew error as a higher-performance PLL would give in the first scheme. For clarity, the following derivation of the clock skew error expression for the second scheme (Fig. 2.2) repeats some of the steps of the derivation for the first scheme. The error expression for the special case of only two deskewable clock signals is obtained first, followed by the expression for the general case of three or more clock signals.

The following delay equation is satisfied by clock loop n:

Eq. 2.13

where *T0* is the delay in clock loop 0, *Tn* is the delay in clock loop n, *Tclk* is the period of the clock, *kn* is an integer, and the remaining term is the steady state phase detector error of the PLL in clock loop n; this term should become negligibly small once that PLL has locked. This equation can be expressed in terms of the component delays of the clock loops as follows (Fig. 2.2):

Eq. 2.14

If the relationships among the terms in the above equation are defined as follows,

Eq. 2.14 can be rewritten as,

Eq. 2.15

From this equation the expression relating the clock skew error between the two clock loops can be derived as follows:

Eq. 2.16

which leads to,

Eq. 2.17

For *kn* = 0, as required by the scheme, Eq. 2.17 simplifies to,

Eq. 2.18

The above expression for the overall clock skew error is valid for the special case of two deskewable clock signals. The main distinguishing factor between Eq. 2.18 and Eq. 2.12 of the first scheme is that the phase detector error term is reduced by a factor of two in Eq. 2.18 but not in Eq. 2.12. Unfortunately, this expression does not hold for the more general case involving three or more clock signals. The skew error between any of the clock signals and the reference clock signal, T0*, is still governed by Eq. 2.18, but the error among the remaining clock signals, T1*, …, Tn*, is governed by a different equation similar to Eq. 2.12 of the first scheme. Note that for the first scheme, the same skew error expression governs both the special and the general cases.

The delay equations that are satisfied by the clock loop 1 and the clock loop n can be expressed as,

Eq. 2.19

Eq. 2.20

After subtracting Eq. 2.19 from Eq. 2.20,

Eq. 2.21

It can be seen that Eq. 2.21 is equivalent to Eq. 2.3 of the first scheme. Skipping the analogous error parameter definitions and intermediate derivation steps, the general case overall clock skew error for the second deskew scheme, which is equivalent to Eq. 2.12 of the first scheme, is as follows:

Eq. 2.22

This result is expected, since for the general case of the second scheme the delay of clock loop 0 (*T0*) plays the role of the reference delay of the first scheme (*Tref*). Because the skew errors of both schemes are governed by essentially the same equation in the general case, and the layout of the first scheme is much easier to implement, the first scheme is judged to be superior and is recommended for the general case of three or more deskewable clock signals.
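Extending a brute-force check of the phase-error contribution to the general case of the second scheme shows why the two-signal advantage disappears. In this hypothetical Python sketch, loop 0 is the reference, loops 1 to n_locked each lock to its delay with a phase error in [-e_max, e_max], and, as a simplifying assumption, every chip sees half of its loop delay with no component mismatch.

```python
from itertools import combinations, product


def worst_pairwise_skew_second_scheme(t_0, e_max, n_locked):
    """Worst skew over all chip pairs when loops 1..n_locked lock to loop 0's delay t_0."""
    worst = 0.0
    for errors in product((-e_max, e_max), repeat=n_locked):
        # Index 0 is the reference loop; each chip sees half of its loop delay.
        arrivals = [t_0 / 2.0] + [(t_0 + e) / 2.0 for e in errors]
        for a, b in combinations(arrivals, 2):
            worst = max(worst, abs(a - b))
    return worst


print(worst_pairwise_skew_second_scheme(2000.0, 8.0, n_locked=1))  # 4.0: two signals
print(worst_pairwise_skew_second_scheme(2000.0, 8.0, n_locked=2))  # 8.0: three signals
```

With a third signal, two locked loops can carry phase errors of opposite sign, so the worst pairwise skew returns to the full e_max, matching the first scheme.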

2.1.5. Notational Definitions

The following delay mismatch terms are derived from the delay terms defined in Fig. 2.1 and Fig. 2.2:

The delay mismatch between the sent and return paths of the clock loop 0.

The delay mismatch between the sent and return paths of the clock loop n.

The delay mismatch between the VCDEs on the sent and return paths of the clock loop 0.

The delay mismatch between the VCDEs on the sent and return paths of the clock loop n.

The delay mismatch between the upper drivers of the clock loops, 0 and n.

The delay mismatch between the upper receivers of the clock loops, 0 and n.

The delay mismatch between the lower receivers of the clock loops, 0 and n.

The delay mismatch between the lower drivers of the clock loops, 0 and n.