Chapter 4

At-Speed Testing Scheme

Even if the F-RISC / G chip set has been properly designed, we must still rely on Rockwell to fabricate working chips and, most likely, on General Electric to package them. The most obvious testing procedure (packaging the die onto a prototype board, building a system around it, and then trying to run software on that system) is prohibitively time-consuming and expensive. It is desirable to test the die fabricated by Rockwell to ensure that they work both functionally and at speed, and then to test the packaged die to ensure that the MCM traces and die are functional after packaging. Although there are standard methods of performing functional tests of unpackaged die, suitable means of speed testing die in the GHz range do not exist. As a result, two schemes were developed: one for the core CPU chips, developed by Bob Philhower, and another for the RAM chips, which I developed.

The goal of the testing schemes for the cache chips is to provide an exhaustive static testing capability of on-chip circuitry as well as all off-chip drivers and receivers. At-speed testing of the core circuitry and critical I/O paths is also a necessity. Other design goals are:

While exhaustive testing of the cache RAM could be accomplished by supplying data and address patterns externally (assuming sufficient ability to probe the I/O pads on the chip, i.e. the ability to touch down on each external pad with a high-bandwidth probe), this type of static testing does nothing to assure us that the chip will work at the desired speeds; applying these patterns at speed is prohibitively expensive in the speed regime at which F-RISC/G operates, and capturing the outputs for analysis is nearly impossible. Testing the cache controller chip is more difficult still, because on-chip state information must be modified. Mounting the chips on an MCM and wiring them to the CPU core for testing is also prohibitively expensive, as each non-functioning chip must then be removed. Rework costs are most easily minimized by identifying working die early, that is, by generating a supply of "known good die" (KGD).

The extreme speed of the chips also prohibits exercising the on-chip circuitry using external signal generators. In fact, just about the only circuits capable of testing circuits of this speed are special GaAs HBT circuits of similar speed. Further, embedded testing has been shown to be an effective method for identifying KGD [Fris95].

Bob Philhower, in his doctoral thesis [Phil93], details several alternative testing schemes for high-speed circuitry, so these alternatives will merely be summarized here. [Maie94] also contains information on Philhower's scheme and on the scheme implemented for the testing of the cache RAM.

If ease of testing is the primary goal, perhaps the ideal testing scheme is a built-in self-test (BIST) system. In a BIST system, special circuitry is included on-chip to generate test patterns which are presented to the normal on-chip circuitry. The outputs of the on-chip circuitry are then compared to the expected outputs by on-chip comparators. Methods are incorporated into the circuitry to allow verification of proper functioning of the testing system itself.

The advantages of this type of system are that it requires few additional I/O pads (typically just an input signal to start a test and an output signal to report its result), and that its simplicity allows large quantities of die to be tested in a short period of time.

This type of system was implemented on the RPI Register File Testchip, which was fabricated by Rockwell in 1993, and on a modified version (RPI Register File Testchip 1A), which Rockwell fabricated in 1994. On this chip the testing circuitry accounted for 21% of the transistor count. Since the testing requirements of the cache RAM are similar to those of the register file, a similar percentage could be expected for this chip. This would add several thousand transistors, clearly too many given the poor yields expected in immature processes such as the HBT process used in F-RISC / G. The effect on the cache controller would be similar, since circuitry would be required to fully exercise the tag RAM as well as all on-chip combinatorial logic.

A second difficulty with this type of testing is that normal chip operation is slowed when testing circuitry is added to on-chip critical paths. In addition, the added devices increase power density, consuming power that could otherwise be allocated to other circuitry.

Another type of testing scheme, originating within IBM [Dill88], is Level-Sensitive Scan Design (LSSD). The key idea in LSSD designs is that the latches used on-chip for state information are replaced by master-slave latches, which are tied together into one long shift register.

This makes it possible to shift new state information in serially, allow the core circuitry to process this information, and then shift the modified state information out.
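The shift-in, capture, shift-out cycle can be sketched behaviorally as follows. This is a minimal model assuming an ideal chain of latches; `core` is a hypothetical stand-in for one functional clock of the on-chip logic, and none of the names come from an actual LSSD tool.

```python
from typing import Callable, List

Bits = List[int]


def scan_in(chain: Bits, pattern: Bits) -> Bits:
    """Shift `pattern` into the chain serially, one bit per scan clock.

    The first bit shifted in ends up in the last latch of the chain.
    """
    chain = list(chain)
    for bit in pattern:
        chain = [bit] + chain[:-1]  # every latch takes its predecessor's value
    return chain


def scan_out(chain: Bits) -> Bits:
    """Shift the chain out serially; the last latch reaches the pin first."""
    chain = list(chain)
    out = []
    for _ in range(len(chain)):
        out.append(chain[-1])
        chain = [0] + chain[:-1]
    return out


def lssd_test(pattern: Bits, core: Callable[[Bits], Bits]) -> Bits:
    """Load a state, let the core process it once, then unload the result."""
    state = scan_in([0] * len(pattern), pattern)
    next_state = core(state)  # hypothetical: one functional clock of the core
    return scan_out(next_state)


# Example: a "core" that simply inverts every state bit.
print(lssd_test([1, 1, 0, 0], lambda s: [1 - b for b in s]))  # [0, 0, 1, 1]
```

Because the last latch is shifted out first, a pattern loaded and unloaded through an unmodified core comes back in its original order.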

Philhower determined that an LSSD latch would require 50% more power and 167% more transistors than the non-LSSD D-latch in the CML circuitry used in the F-RISC/G prototype. Master-slave latches would be similarly affected.

A third testing scheme, which is gaining popularity commercially, is called boundary scan. The recently adopted ANSI/IEEE standard 1149.1-1990 [Maun86; Maun92; Webe92] allows full static testing of chips with lower transistor penalties than LSSD designs. Philhower discusses this standard in his thesis, and determined that while the IEEE standard was not a reasonable choice for testing low-yield parts, modifications were possible that would allow the use of even fewer transistors and would also allow at-speed testing. An advantage of boundary scan over LSSD is that only the I/O cells are modified, rather than every on-chip latch being replaced with an LSSD device (which, at an extreme, could include modifying any register files or RAM blocks to allow serial loading). Modifications to 1149.1-1990 which allow testing for speed have been proposed [Chan92], but published schemes require too high a device, area, and power budget to be of use in the F-RISC / G system.

  1. Evaluation of the F-RISC/G Boundary Scan Scheme

The scheme proposed and implemented by Philhower has several key features:

There are, however, some problems with utilizing his scheme in the cache RAM chip:

Although the scheme used in the datapath and instruction decoder chips is unsuitable for use in the cache RAM, many of the ideas first implemented in these chips can be modified for use in the RAM chips. Figure 4.1 shows the receiver used in Philhower's boundary scan scheme (the "core boundary scan scheme"). Figure 4.3 shows a schematic for the driver used in the core boundary scan scheme. Figure 4.4 is a schematic of the overall boundary scan control system used in the core CPU chips.

The Philhower scheme contains several key modifications to the ANSI/IEEE standard. Perhaps the most important of these is the capability to perform speed testing of on-chip circuitry. In order to accomplish this, the scheme takes advantage of the four-phase clocking available on-chip in the instruction decoder and datapath chips.

FIGURE 4.1: RECEIVER USED IN F-RISC / G CORE BOUNDARY SCAN SCHEME

The idea behind the at-speed testing in this scheme is that when a given clock phase is asserted the input pattern is presented to the on-chip circuitry. On some other clock phase, the outputs of the core are sampled. The time between clock phases is known, so all that remains is to scan the sampled data off chip for examination. Figure 4.2 illustrates this type of at-speed testing. If the sampled pattern is as expected, it is clear that the on-chip circuitry is capable of operating in the allowed time period.
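The pass/fail logic of this kind of at-speed test can be sketched as follows. The model is a minimal sketch assuming that each core output switches cleanly once, after a fixed path delay, and that the sampling latch needs a fixed setup time; the delays and window values are hypothetical, not F-RISC/G measurements.

```python
from typing import Sequence


def sampled_value(old: int, new: int, path_delay_ps: float,
                  window_ps: float, setup_ps: float = 0.0) -> int:
    """Bit captured by a sampling latch in an at-speed test.

    The input pattern is presented at t = 0 and the core output switches
    from `old` to `new` at t = path_delay_ps. The sampling latch fires at
    t = window_ps and (by assumption) needs the data stable for setup_ps.
    """
    return new if path_delay_ps + setup_ps <= window_ps else old


def passes_at_speed(old: Sequence[int], expected: Sequence[int],
                    delays_ps: Sequence[float], window_ps: float,
                    setup_ps: float = 0.0) -> bool:
    """True if every sampled output matches the expected pattern."""
    captured = [sampled_value(o, e, d, window_ps, setup_ps)
                for o, e, d in zip(old, expected, delays_ps)]
    return captured == list(expected)


# Three outputs that all toggle; the slowest path takes 520 ps.
old, expected, delays = [0, 1, 0], [1, 0, 1], [300.0, 450.0, 520.0]
print(passes_at_speed(old, expected, delays, window_ps=500.0))  # False
print(passes_at_speed(old, expected, delays, window_ps=600.0))  # True
```

An output that does not change between the old and new patterns cannot reveal a slow path, so test vectors should toggle every output being timed.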

FIGURE 4.2: SIMPLIFIED AT-SPEED TESTING TIMING DIAGRAM

FIGURE 4.3: DRIVER USED IN F-RISC / G CORE BOUNDARY SCAN SCHEME

Philhower proposed using a master-slave latch as a sampling element. Figure 4.5 illustrates the sampling behavior of the master-slave latch used in both Philhower's scheme and the scheme presented in this thesis. Each curve represents a different allowed time between the sampling clock signal and the pattern transition (in this case, a high-to-low transition). As can be seen, if only 2 ps are allowed for sampling, the transition may be missed.

FIGURE 4.4: F-RISC/G CORE BOUNDARY SCAN SCHEME

FIGURE 4.5: SAMPLING BEHAVIOR OF MASTER - SLAVE LATCH (MODIFIED FROM [PHIL93])

The scheme further allows any clock phase to be selected for either pattern presentation or pattern sampling, allows the first or second occurrence of a given phase to be used to trigger sampling, and allows a selectable offset from the clock phase to be used to trigger pattern presentation and sampling.

This scheme allows a great deal of flexibility, but requires a large amount of logic to implement. Besides the obvious overhead of supplying a four-phase clock generator, four-input multiplexors are required for both the pattern present and pattern sample circuitry to select a clock phase, and further circuitry is required to select the delay offsets from the selected clock phases. Still more circuitry is required to select between the first and second occurrences of the clock phase used to trigger the pattern sample signal.

Another feature of this design is that the scan latches are located in the pad ring, near the drivers and receivers. As the scan latches which act as a shift register are also used to sample core outputs, the scan clock received by some latches may be delayed by the pattern sampling logic. As all scan latches are clocked in parallel, skew becomes an important issue; a skew of one gate delay between adjacent latches can cause improper scanning operation.

FIGURE 4.6: PARALLEL SCAN CLOCKING

Figure 4.6 shows the parallel clocking distribution required by Philhower's scheme. Scan clock signals are generated in the core of the chip, and broadcast to the scan latches at the perimeter. In addition, configuration scan latches in the core receive their own scan clock signal. The interfaces between parallel clock tree branches are particularly vulnerable to problems caused by skew.

If two master-slave latches receive different clocks, the latch which occurs later in the scan chain may receive an earlier or a later version of the scan clock. If the master of the second latch is clocked at the same time as the slave of the first latch, then when the second latch finally receives the clock to latch its slave, it will have the wrong data in its master. To deal with this skew problem, chips incorporating this boundary scan scheme must use buffer trees on the parallel clock lines to balance the delays in the scan clock to each group of scan latches.

In order to properly balance these buffer chains, extreme care must be taken in standard cell placement and routing, and extensive simulations must be performed. In addition, to calibrate the CAD tools for future design work and to determine the precision of test results, additional circuitry is provided on-chip to enable measurement of the skew between the four sides of the chip for each critical clocking signal. This circuitry, as well as the buffers themselves, represents a serious hardware expense in a yield-limited technology. Furthermore, if signal propagation times are significantly affected by the resistive component of line impedance, this scheme is even more difficult to implement properly: the distance between the corners of the chip and the location where the clocking and control signals enter the pad ring can be up to 4 mm, so there can be a large skew between the latches at the centers of the pad ring sides and those at the corners.

Aside from the scan clock skew, the physical location of the pads around the pad ring can also limit the usefulness of at-speed testing. The "input pattern presentation" and "output pattern sampling" signals should, ideally, arrive at all latches simultaneously in order to allow the greatest testing resolution.

In contrast, the scheme used in the cache RAM uses a serial clocking mechanism in which the clock is generated at a single source and passes through all of the latches in the direction opposite to the flow of data along the scan chain. The advantage here is that latches which occur later on the scan chain always have their masters clocked before the previous latch has its slave latched. As a result, each succeeding latch is guaranteed to be prepared to accept data from the previous latch.
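The race described above, and why reversing the clock direction removes it, can be sketched with a simple timing model. This is a minimal sketch assuming ideal edge-triggered latches with zero setup and hold time, a fixed clock delay per latch, and a fixed clock-to-output delay; the 25 ps and 20 ps values are illustrative assumptions, not F-RISC/G figures.

```python
from typing import List, Sequence


def shift_once(old: Sequence[int], scan_in: int,
               clock_arrival_ps: Sequence[float],
               clk_to_q_ps: float) -> List[int]:
    """Apply one scan-clock edge to a chain of master-slave latches.

    Latch i samples its input when its own copy of the clock arrives.
    That input is latch i-1's output, which changes from old[i-1] to its
    newly captured value at clock_arrival_ps[i-1] + clk_to_q_ps. If the
    clock reaches latch i at or after that moment, latch i captures the
    new value too, and a bit "skips" a latch.
    """
    new: List[int] = []
    for i in range(len(old)):
        if i == 0:
            new.append(scan_in)  # scan-in pin assumed stable during the edge
        elif clock_arrival_ps[i] >= clock_arrival_ps[i - 1] + clk_to_q_ps:
            new.append(new[i - 1])  # race lost: predecessor already updated
        else:
            new.append(old[i - 1])  # correct one-position shift
    return new


def forward_clock(n: int, stage_delay_ps: float) -> List[float]:
    """Clock travels in the same direction as the data (latch 0 first)."""
    return [i * stage_delay_ps for i in range(n)]


def reverse_clock(n: int, stage_delay_ps: float) -> List[float]:
    """Clock enters at the end of the chain and travels back toward latch 0."""
    return [(n - 1 - i) * stage_delay_ps for i in range(n)]


old = [1, 0, 1, 0]
print(shift_once(old, 0, forward_clock(4, 25.0), clk_to_q_ps=20.0))  # [0, 0, 0, 0]
print(shift_once(old, 0, reverse_clock(4, 25.0), clk_to_q_ps=20.0))  # [0, 1, 0, 1]
```

With the clock running against the data, each latch is clocked earlier than its predecessor, so the race condition can never be met for any positive stage delay; with the clock running alongside the data, any per-stage delay at least as large as the clock-to-output delay corrupts the shift.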

As the cache RAM is the most abundant chip on the F-RISC / G MCM, it was decided to minimize device count, and therefore heat dissipation, wherever possible. While the flexibility of Philhower's scheme is appropriate for use on the cache controller, the cache RAM required a new, less hardware-intensive scheme.

  1. Test Scheme Design

The goals of any testing scheme are complete testability with minimum effect on normal chip operation and at minimum cost. In the case of the F-RISC/G cache RAM there were additional requirements as well. A chip must be shown not only to function, but to function at speed. The chip must be testable both on the wafer and on the MCM, and a method of testing MCM traces must be incorporated. Power dissipation must be kept to a minimum, as 16 of these chips are to be mounted on a 10 cm x 10 cm MCM.

The implemented scheme accomplishes these goals with a fraction of the transistor count of the scheme used in the datapath, instruction decoder, and cache controller chips. The scheme should prove generally useful in any low-yield technology to test register-intensive chips with relatively few time-critical I / O paths.

  1. Overall Scheme

The overall idea behind the cache RAM testing scheme is to use a limited number of boundary scan latches and a minimum amount of control logic to allow access time and functional testing of the RAM.

As shown in the block diagram (Figure 3.17), the RAM chip contains a large number of pads. d[0:63] are the bi-directional pads which are used for communication with the secondary (L2) cache. Since L2 cache transactions are, by design, relatively rare, the speed of this 64 bit wide path is not as critical as that of the four bit path, di[0:3] and do[0:3]. The four bit path is used for communication with the core CPU (specifically a datapath chip).

Due to the large quantity of pads required on the RAM chip and current yield limitations it is not feasible to include a full boundary scan receiver or driver in each one. This is likely to be a problem in any exotic technology, as pin-outs tend to remain relatively constant even as RAM dimensions decrease.

FIGURE 4.7: RAM CHIP TESTING SCHEME

The testing scheme (Figure 4.7) makes use of special boundary scan driver and receiver pads for the four bit data path. The 64 bit data path utilizes touchdown pad-sharing driver / receiver pairs to allow testing of this path with no additional transistors.

The boundary scan chain includes both the standard sampling and presentation latches and special purpose circuitry which is designed to aid in testing. Among the special purpose circuits coupled to the scan chain are an eight bit rotator and a five bit counter. These circuits provide a continuous testing mode in which high speed waveforms may be obtained as output signals.

Continuous Testing Mode

Continuous mode operation is one of the two ways in which the RAM chip can be tested. In this mode, on-chip circuitry is used to generate patterns with which the cache memory blocks can be filled. The five bit counter is used to generate consecutive row addresses (each memory block contains 32 rows), and the eight bit rotator is used to produce a four-bit pattern which can be used to fill one of the four nibbles in any of the four register files.

The size of the rotator was chosen to divide evenly into 32, so that if multiple passes are made through the memory block, the same data will be written into each row each time. In addition, 16 different patterns are required if each nibble is to be filled with a unique pattern. Filling all of the nibbles of all of the blocks with unique patterns makes it possible to verify that the multiplexors which select blocks and nibbles are functioning properly. If the patterns can be read out on the scope, then the multiplexors must have worked; a multiplexor stuck-at fault would result in some patterns overwriting others, while other, more subtle faults would prevent the patterns from being written into the register files in the first place.
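The divisibility argument can be checked with a small model of one nibble column being filled. This is a minimal sketch under stated assumptions: the rotator advances once per row, and the four-bit write pattern is taken from its four low-order bits. Both the tap choice and the function names are hypothetical, since the actual rotator wiring is not specified here.

```python
from typing import List

ROWS = 32  # rows per memory block, addressed by the five-bit counter


def rotate_left(value: int, width: int) -> int:
    """Rotate a `width`-bit value left by one position."""
    mask = (1 << width) - 1
    return ((value << 1) | (value >> (width - 1))) & mask


def fill_block(seed: int, passes: int = 1, rotator_width: int = 8) -> List[int]:
    """Contents of one nibble column after `passes` trips through the block.

    Assumptions (hypothetical): the rotator advances once per row and the
    four-bit write pattern is taken from its four low-order bits.
    """
    block = [0] * ROWS
    rotator, row = seed, 0
    for _ in range(passes * ROWS):
        block[row] = rotator & 0xF  # write the current pattern
        rotator = rotate_left(rotator, rotator_width)
        row = (row + 1) % ROWS  # five-bit counter wraps at 32
    return block


# With an 8-bit rotator (8 divides 32), repeated passes rewrite identical data.
print(fill_block(0x01, passes=1) == fill_block(0x01, passes=3))  # True
# A 7-bit rotator would drift by 32 mod 7 = 4 positions on every pass.
print(fill_block(0x01, passes=1, rotator_width=7) ==
      fill_block(0x01, passes=2, rotator_width=7))  # False
```

Any rotator length that divides 32 would give the same property; the sketch only demonstrates why a length that does not divide 32 would make the stored data depend on the number of passes.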

In addition to the counter and rotator, four other scan latches can be loaded and used to provide a block and nibble address.

Once the cache memory blocks are loaded with test patterns, the test circuitry can be switched to read mode. Any of the four bits do[0:3] may be viewed on the scope output pad (the bit to be viewed is externally selectable). This mode is particularly useful for generating scope output which can be used to quickly confirm that large portions of a chip work.

FIGURE 4.8: WRITE TIMING IN CONTINUOUS MODE

As shown in Figure 4.8, an external clock (HSCLK) with a targeted frequency of 1 GHz is used to produce a 500 ps write pulse during write operations. The write pulse may be shifted by around 500 ps (through the use of a voltage-controlled delay line) to ensure that it is preceded by settled address and data signals. The length of this delay is selected by changing the voltage applied to an analog control pad during testing.

During read operations, HSCLK is used to ensure that the core circuitry is allowed only 500 ps to process the address and return the correct result. As shown in Figure 4.9, the counter produces a new address with each falling clock edge. 500 ps later, the clock triggers a master-slave latch to sample the selected do line. If the circuitry is too slow, the wrong data will be sampled and presented at the scope pin, enabling the tester to determine whether the die is functioning at speed.
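The effect of a slow die on the scope output can be sketched as follows. This is a minimal model, assuming a 50% duty cycle HSCLK (so the sampling latch fires half a period after the counter advances) and assuming that a die whose access time exceeds the half period, but not the full period, still presents the previous address's data when sampled. The memory contents and function name are hypothetical.

```python
from typing import List, Sequence


def scope_trace(memory: Sequence[int], access_time_ps: float,
                hsclk_ghz: float = 1.0, bit: int = 0) -> List[int]:
    """Bits seen on the scope pin during one continuous-mode read pass.

    The counter presents address k on a falling HSCLK edge and the sampling
    latch fires half a period later. Simplifying assumption: a slow chip
    (access time longer than the half period but shorter than a full
    period) still shows the previous address's data when sampled.
    """
    window_ps = 1000.0 / hsclk_ghz / 2
    n = len(memory)
    trace = []
    for k in range(n):
        addr = k if access_time_ps <= window_ps else (k - 1) % n
        trace.append((memory[addr] >> bit) & 1)
    return trace


memory = [0b0001, 0b0000, 0b0000, 0b0001]
print(scope_trace(memory, access_time_ps=450.0))                 # [1, 0, 0, 1]
print(scope_trace(memory, access_time_ps=550.0))                 # [1, 1, 0, 0]
print(scope_trace(memory, access_time_ps=550.0, hsclk_ghz=0.8))  # [1, 0, 0, 1]
```

Under these assumptions a die that is too slow shows up on the scope as the expected waveform shifted by one clock cycle, and lowering the HSCLK frequency until the expected waveform reappears gives an estimate of the access time.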

FIGURE 4.9: READ TIMING IN CONTINUOUS MODE

A special feature of this testing mode is that an external pin (scan clock) can be used to switch between write and read modes. If the scan clock signal is low when continuous testing begins, the write signal is automatically asserted with each clock pulse. To switch to reading from the register file, it is necessary only to assert SCAN, raise the scan clock, and de-assert SCAN. (Using the SCAN signal in this way is necessary because the scan clock pad is not synchronized to the high speed clock; otherwise, runt write pulses might occur.) This feature enables rapid filling and testing of the register files without multiple scanning operations.

Single Shot Mode

A single shot test consists of applying one set of inputs to the core circuitry synchronously to a high speed clock and sampling the outputs of the core circuitry synchronously to a second edge on that same clock. Possible operations include reading a single address from the core memory onto either the 4 or 64 bit data paths, writing data to a single address, or changing a control line (HOLD, WIDE, or WRITE).

This type of test is similar to the method used in the F-RISC/G core chips and cache controller. Its chief disadvantage is that it makes exhaustive testing of all memory cells tedious. Since the results are stored in the driver latches and must be shifted out serially, the test does not produce clear, scope-measurable output. This type of test is useful, however, for testing the MCM traces and the L2 drivers and receivers, and for testing worst case access times (in which the row, block, and nibble addresses do not increment consecutively).

FIGURE 4.10: SINGLE SHOT TIMING

While this testing method is substantially similar to that developed by Philhower, there are several important modifications. Rather than using a four-phase clock and digital delay lines, input pattern presentation and output pattern sampling are triggered by consecutive edges of the high speed clock. Unlike Philhower's scheme, there is no facility for skipping an edge between pattern presentation and sampling. These simplifications reduce the transistor count of the testing control logic. The time between pattern presentation and pattern sampling can be varied continuously by changing the frequency of HSCLK, whereas Philhower's scheme accomplishes the same thing by varying the external clock, the phase on which to latch, whether the first or second occurrence of that phase triggers sampling, and a digital delay line-controlled offset from the selected clock phase. Figure 4.10 is a timing diagram showing single shot operation.
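Because presentation and sampling are triggered by consecutive HSCLK edges, the single shot test window follows directly from the clock frequency. The helpers below are a minimal sketch assuming a 50% duty cycle, so that the window is half the HSCLK period (consistent with the 500 ps window at 1 GHz used in continuous mode); the function names and the frequency sweep are hypothetical.

```python
from typing import Iterable, Optional


def presentation_to_sample_ps(hsclk_ghz: float) -> float:
    """Time between consecutive HSCLK edges, assuming a 50% duty cycle."""
    return 1000.0 / (2.0 * hsclk_ghz)


def hsclk_for_window_ghz(window_ps: float) -> float:
    """HSCLK frequency that gives the requested presentation-to-sample time."""
    return 1000.0 / (2.0 * window_ps)


def max_passing_hsclk(path_delay_ps: float,
                      freqs_ghz: Iterable[float]) -> Optional[float]:
    """Highest swept frequency whose window still covers the path delay."""
    passing = [f for f in freqs_ghz
               if path_delay_ps <= presentation_to_sample_ps(f)]
    return max(passing) if passing else None


print(presentation_to_sample_ps(1.0))  # 500.0
print(hsclk_for_window_ghz(400.0))     # 1.25
print(max_passing_hsclk(450.0, [0.8, 0.9, 1.0, 1.1, 1.2]))  # 1.1
```

Sweeping a single external frequency replaces the phase selection, edge skipping, and delay-line offset that Philhower's scheme uses to set the same interval.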

  1. Special Drivers and Receivers

In order to implement boundary scan testing in a chip, it is necessary to replace the normal off-chip drivers and receivers with increased functionality pads which allow pattern sampling, presentation, and scanning. The major differences between this scheme and that presented in Philhower's thesis are the pads used on the L2 path and the modification of the boundary scan receivers.

FIGURE 4.11: L2 PATH DRIVER / RECEIVER

The boundary scan system used on the cache RAM chip requires three distinct types of I / O pads. The I / O pads used for communication with the L2 cache consist of 64 bi-directional pads, each of which consists of a tri-state driver and receiver connected to each other via the pad. This allows the signal from the driver to be used as an input to the receiver if no other signal is present at the pad.

Figure 4.11 is a schematic representation of the L2 Driver / Receiver circuit.

The connection between the driver output and receiver input allows static testing of the L2 drivers and receivers. The idea is that a pattern is loaded into the RAM, output onto the drivers, and, through the connection, read back into the receivers.
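The loop-back check can be sketched as a comparison between the word driven onto the L2 pads and the word read back by the receivers. This is a minimal sketch with a hypothetical stuck-at fault model; driving a pattern and its complement (an assumption, not a procedure stated here) ensures every bit is exercised at both logic levels, so both stuck-at-0 and stuck-at-1 faults are exposed.

```python
from typing import Dict, Iterable, List

WIDTH = 64


def loopback(pattern: int, stuck: Dict[int, int]) -> int:
    """Word read back by the L2 receivers when `pattern` is driven.

    `stuck` is a hypothetical fault model mapping a bit position to the
    level a broken driver, receiver, or pad connection is stuck at.
    """
    value = pattern
    for bit, level in stuck.items():
        value = value | (1 << bit) if level else value & ~(1 << bit)
    return value


def failing_bits(stuck: Dict[int, int],
                 patterns: Iterable[int] = (0x5555555555555555,
                                            0xAAAAAAAAAAAAAAAA)) -> List[int]:
    """Bit positions whose read-back value differs from what was driven."""
    bad = set()
    for p in patterns:
        diff = loopback(p, stuck) ^ p
        bad.update(b for b in range(WIDTH) if (diff >> b) & 1)
    return sorted(bad)


print(failing_bits({}))             # []
print(failing_bits({3: 1, 10: 0}))  # [3, 10]
```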

In addition to these driver / receiver pads, a special boundary scan receiver is used. This receiver is used for the address pads, 4-bit data input pads, and control signals. The purpose of these receivers is to capture inputs from the MCM during MCM testing, and to supply surrogate inputs to the core during die testing.

The receivers consist, logically, of two multiplexors, a master-slave latch, and a transparent latch. Each of the two latches has a multiplexor on its input which can be set to allow either of two inputs to be latched.

The master-slave latch is known as the "scan latch" and the transparent latch is known as the "input latch." The purpose of the input latch is to allow input vectors to be presented to the core circuitry synchronously with the high speed clock. The scan latch is used to capture incoming signals from the MCM traces, or from the pads when the receivers themselves are being tested.

A large resistor is used to connect the secondary address source (the scan latch) to each of the pads. If a signal is asserted on the pad externally, it will overwhelm this weak, secondary signal. This resistor allows testing of the pad path through the multiplexor. The pattern is inverted as it passes through the resistor, allowing us to test whether or not the multiplexor has switched.
INP_SEL: This signal selects the inputs to the multiplexors on the inputs of the transparent latches. For normal (non-testing) operation, and to test the pad signal path through the input latch and multiplexor, the INP_SEL signal must be set to select the pads (low). For die testing, the INP_SEL signal must select the outputs of the scan latches (high). For MCM testing, the INP_SEL signal must select the pads.
PP (Pattern Present): This is the write signal for the transparent latches. This signal must be asserted to present the patterns stored in the scan chain or from the pads to the core.
CSC (Close Scan Chain): This signal is used to "close the scan chain," causing each scan latch to receive its input from the previous latch on the scan chain rather than from the pad latches. This signal should be set to select the pad latch outputs for MCM testing (in order to allow the scan chain to capture the pad inputs).
PS (Pattern Sample): Used to latch in the incoming pattern during an MCM test. The external scan clock line generates this signal if in the TEST state during an MCM test.
SC (Scan Clock): This is an externally provided signal used to trigger the master-slave scan latches.

TABLE 4.1: BOUNDARY SCAN RECEIVER CONTROL SIGNALS

Table 4.1 summarizes the control signals which need to be supplied to the boundary scan receivers.

FIGURE 4.12: BOUNDARY SCAN RECEIVER

Figure 4.12 shows, schematically, the boundary scan receiver. Two receivers are shown connected to each other on the scan chain. As can be seen in this figure, the transparent latch and I / O receiver are laid out in a single pad cell, while the scan latch is a separate cell. This differs from the scheme used in the instruction decoder and datapath chips, which use receiver cells with embedded scan latches.

One reason for moving the scan latches into the core was to allow better control over scan clock routing, in order to eliminate the problems caused by skew between latches adjacent on the scan path. Since there is always delay on the scan clock, the master of the second latch may still be receiving a high clock signal while the slave of the preceding latch is already receiving a low clock signal. If the delay on the scan path is shorter than the delay on the scan clock, the master of the second latch will pass the scan path signal through to its slave, and the slave will latch it when the clock signal finally arrives. The scanned signal thus skips a latch altogether.

A second reason to put the scan latches in the core was to allow more flexibility, by letting non-pad cells be freely interspersed with the receiver scan latches on the scan chain. For example, the latch used to capture a write operation and the rotator are actually on the scan chain, as are several configuration latches. While this type of organization would have been possible in the scheme used on the other chips, it would have been difficult to simulate before routing, as the scan chain order is unknown until then. This problem may be rectified in the future by better design software.

A possible further advantage of this scheme is that the scan latches can be placed closer together, reducing the chance that a yield problem will adversely affect the scan chain. The perimeter of the die used in F-RISC / G is approximately 4 cm, so the scan chain length using Philhower's scheme would also be approximately 4 cm. If the scan chain doesn't function correctly, the rest of the chip is untestable. By placing the scan latches in the core of the RAM chip, the length of the scan chain is cut to approximately 2 cm. All else being equal, the chance of suffering a critical metallization fault on the scan chain is greatly reduced.
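The effect of halving the scan chain length can be made concrete with a simple yield model. This worked example assumes, hypothetically, that critical metallization faults occur independently and uniformly along the scan chain wiring (a Poisson model), with an illustrative density of 0.1 faults per cm; it is not based on measured data for the HBT process.

```python
import math


def scan_chain_survival(length_cm: float, defects_per_cm: float) -> float:
    """Probability that a scan chain of the given length has no critical
    metallization fault, assuming (hypothetically) that such faults occur
    independently with a uniform density along the wiring (Poisson model)."""
    return math.exp(-defects_per_cm * length_cm)


density = 0.1  # hypothetical critical defects per cm of scan-chain wiring
for length in (4.0, 2.0):  # pad-ring chain vs. core-placed chain
    p_ok = scan_chain_survival(length, density)
    print(f"{length:.0f} cm: P(chain works) = {p_ok:.3f}, P(fault) = {1 - p_ok:.3f}")
```

Under this model, the fault probability is roughly proportional to length when it is small, so cutting the chain from 4 cm to 2 cm nearly halves the chance of losing the whole die to a scan chain fault.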

Two special variations of the boundary scan receiver are used in the WRITE receiver and HOLD receiver. In the WRITE pad, the transparent latch is replaced by a two-input AND gate. One input is the output of the input multiplexor, and the other is PP. This limits the width of the internal WRITE pulse to one half of a high speed clock cycle.

In the HOLD pad, the latch is replaced by a two-input OR gate. One input is the output of the input multiplexor, and the other is PP. This arrangement prevents glitches on the internal HOLD signal caused by the latch delay.
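The gating in these two pads can be checked with a purely Boolean sketch. The waveforms below are hypothetical sampled sequences, not simulated F-RISC/G signals, and the model ignores gate and latch delays, so it shows only the logic functions and the pulse-width bound on WRITE, not the glitch suppression on HOLD.

```python
from typing import List, Sequence


def write_internal(mux_out: Sequence[int], pp: Sequence[int]) -> List[int]:
    """WRITE pad: the transparent latch is replaced by AND(mux output, PP)."""
    return [m & p for m, p in zip(mux_out, pp)]


def hold_internal(mux_out: Sequence[int], pp: Sequence[int]) -> List[int]:
    """HOLD pad: the transparent latch is replaced by OR(mux output, PP)."""
    return [m | p for m, p in zip(mux_out, pp)]


def longest_high_run(wave: Sequence[int]) -> int:
    """Length of the longest run of consecutive 1 samples."""
    best = run = 0
    for v in wave:
        run = run + 1 if v else 0
        best = max(best, run)
    return best


# Hypothetical samples: a long external WRITE level and a short PP pulse.
mux = [0, 1, 1, 1, 1, 1, 1, 0]
pp = [0, 0, 1, 1, 0, 0, 0, 0]
print(write_internal(mux, pp))  # [0, 0, 1, 1, 0, 0, 0, 0]
print(hold_internal(mux, pp))   # [0, 1, 1, 1, 1, 1, 1, 0]
print(longest_high_run(write_internal(mux, pp)) <= longest_high_run(pp))  # True
```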

An additional special I / O cell used in this boundary scan scheme is the boundary scan driver. This driver is used only on the four-bit data output path. It is used both to put signals on the bus to the datapath and to sample these signals during testing, which allows the drivers themselves to be tested.

The drivers consist, logically, of a multiplexor and a master-slave latch. The latch has a multiplexor on its input which can be selected to allow either of two inputs to be latched. There is also a transparent latch which can be used to latch the data output from the register files, even if the input address changes. The transparent latch contains pull-ups, allowing it to be used to receive the tri-state signals generated for use on the 4-bit bus.

Like the driver used in the core CPU chips, this driver does not use an extra multiplexor to allow the contents of the scan latch to be output on the MCM traces for MCM trace testing. The reason that this multiplexor is not necessary is that the core can usually be set to a state to produce the desired outputs. In the cache chip, this is clearly always the case.
PS (Pattern Sample): This is the write signal for the scan latches when used to sample core outputs. In this configuration, the outputs from the core feed the output drivers, and feed back into the scan latches.
CSC (Close Scan Chain): This signal is used to "close the scan chain," causing each scan latch to receive its input from the previous latch on the scan chain rather than from the output drivers. When CSC is asserted, scanning operation can take place.
HOLD: This signal is the same signal used to latch the outputs of the register file within the register file macro. It is used here to hold the outputs even if the address changes, thereby changing the 4 bits selected for output on the 4-bit data path.
SC (Scan Clock): This is an externally provided signal used to trigger the master-slave scan latches. When used to trigger the scan latches dedicated to the four-bit output path, this signal must be OR'd with PS.

TABLE 4.2: BOUNDARY SCAN DRIVER CONTROL SIGNALS

Table 4.2 lists the control signals used in the boundary scan drivers. Figure 4.13 shows the logical representation of the drivers. The tri-state receivers, off-chip drivers, and D-latches are laid out in a combined pad cell, while the scan latch is located in the core.

It is important that the feedback path from the pad to the scan latch be as short as possible in order to minimize capacitive loading on the driver. This must be taken into account during routing, and is a possible reason to include scan latches in the pad ring as was done in the Philhower scheme.

FIGURE 4.13: BOUNDARY SCAN DRIVER

The final special I/O cell required for the boundary scan scheme is a clock receiver for the SCAN CLOCK and SCAN signals. These signals arrive single-ended at the pads and have slow rise times, which make them more sensitive to noise; noise-induced transitions could corrupt the state of the chip undergoing test. The clock receiver pad converts the single-ended signal to a fully differential one, and contains a Schmitt trigger to prevent glitches near the high and low transition points from causing multiple logical transitions.
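The benefit of hysteresis on a slow, noisy edge can be sketched numerically. The thresholds and the sampled edge below are hypothetical normalized values chosen for illustration, not the actual switching points of the clock receiver pad.

```python
from typing import List, Sequence


def schmitt(samples: Sequence[float], v_low: float, v_high: float,
            initial: int = 0) -> List[int]:
    """Comparator with hysteresis: switch high only above v_high and low
    only below v_low; in between, hold the previous output."""
    state, out = initial, []
    for v in samples:
        if state == 0 and v >= v_high:
            state = 1
        elif state == 1 and v <= v_low:
            state = 0
        out.append(state)
    return out


def single_threshold(samples: Sequence[float], v_th: float) -> List[int]:
    """Plain comparator with no hysteresis."""
    return [1 if v >= v_th else 0 for v in samples]


def transitions(bits: Sequence[int]) -> int:
    """Number of logical transitions in a sampled digital waveform."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)


# A slow, noisy rising edge (normalized volts) that wanders around midscale.
edge = [0.0, 0.2, 0.45, 0.55, 0.48, 0.52, 0.47, 0.6, 0.8, 1.0]
print(transitions(single_threshold(edge, 0.5)))  # 5
print(transitions(schmitt(edge, 0.3, 0.7)))      # 1
```

A single comparator would deliver five scan clock edges to the chain for this one input transition, while the hysteresis version delivers exactly one.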

  1. Special Tests

Aside from the capability to test the core circuitry, several other critical tests are made possible by this testing scheme, among them the ability to test the MCM traces, the drivers and receivers, and the L2 data path.

MCM Trace Test

Using the capabilities provided by the driver and receiver circuitry, it is possible to test all of the MCM traces which connect to the cache RAM die. In order to test the control or four-bit input traces, it is necessary only to select the pads rather than the scan latches for input during single-shot testing.

The chips which are to provide these signals must first be set to output a test vector on the appropriate drivers, using the testing scheme provided by those chips. The cache RAMs are then set to perform a single shot test, and the test is started. When pattern sampling normally takes place in the drivers, the scan latches in the receivers also latch their inputs. As a result, the signals arriving at the receivers are captured, and can be scanned out for analysis.

FIGURE 4.14: MCM SCAN PATH

In order to test the output latches on the four-bit wide path, signals which are to be output on the drivers must be loaded into the register file using the four-bit data path and the die test mode of operation.

A single shot test is then initiated, and the contents of the selected register file will be presented to the MCM traces.

In order to test the L2 traces it is necessary only to perform a single shot test with the WIDE control bit set.

Several chips may be wired together on the same scan path, as shown in Figure 4.14, saving testing time and MCM wiring. The path should nevertheless be short enough that malfunctioning components can be identified easily. Control signals may be broadcast in parallel to all chips or split into separate control signals.

Testing the MCM Drivers / Receivers

It is important to be able to test the off-chip drivers and receivers without mounting the die on the MCM. In order to do this, the weak coupling between the scan latch and the pad in the drivers and receivers is used.

To test the boundary scan receivers, the pad inputs to the input multiplexors are selected and a single shot test is performed. The contents of the receiver scan latches pass through the inverting resistors to the pads, then through the receivers, multiplexors, and D-latches, and back into the scan latches. As a result, if the input path is functioning properly, the contents of the scan latches in the receivers will be inverted.

To test the boundary scan drivers a single shot test takes place as normal. The outputs of the off-chip drivers are always latched by the driver scan latches if the drivers are functioning normally.
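The pass/fail criteria for these two tests can be stated compactly. The following Python sketch is a minimal illustration, not part of the original test software; the function names are hypothetical, and it assumes that the scanned-out latch contents are available as lists of 0/1 values and that the drivers are non-inverting.

```python
def check_receiver_loopback(before, after):
    """Return indices of receiver scan latches whose contents were NOT inverted.

    before: bits held in the receiver scan latches before the single shot test.
    after:  bits scanned out of the same latches after the test.
    """
    if len(before) != len(after):
        raise ValueError("latch vectors must have the same length")
    # A working input path returns the complement of each latch's old value.
    return [i for i, (b, a) in enumerate(zip(before, after)) if a != 1 - b]


def check_driver_capture(driven, captured):
    """Return indices of drivers whose scan latch did not capture the driven value.

    Assumes non-inverting drivers: a working driver's scan latch holds the
    same value that the core presented to that driver.
    """
    if len(driven) != len(captured):
        raise ValueError("latch vectors must have the same length")
    return [i for i, (d, c) in enumerate(zip(driven, captured)) if c != d]
```

An empty list means every path in the vector behaved as expected; any returned index points at a specific pad to examine.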

Testing the L2 Drivers and Receivers

In order to test the L2 Driver / Receiver pads it is necessary to perform several consecutive single shot tests.

First, the register files must be loaded with the desired 64-bit pattern using the four-bit wide data path, four bits at a time. This will require 16 separate test runs, one for each nibble.

Once this is accomplished, a WIDE READ test must be performed. This causes the test vector to be read from the register files onto the driver/receiver pads, where it is then available for input from the receivers. A second WIDE READ must next take place, this time with the HOLD control signal asserted. This causes the register files to latch the vector, preventing it from changing during the next single shot test, which is a WIDE WRITE, normally to a new row address. This combination of tests reads a 64-bit double-word from a particular row and writes it back into a new row. Single shot or continuous mode testing can then be used to detect whether the test was successful.
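The order of operations in this L2 loopback test can be sketched as a list of single shot steps. In this hypothetical Python sketch, the step labels and the tuple layout are illustrative only, and the nibble order (least significant nibble first) is an assumption; the text does not specify which nibble is loaded first.

```python
def nibbles_of(pattern):
    """Split a 64-bit pattern into 16 nibbles, least significant first (assumed order)."""
    if not 0 <= pattern < 1 << 64:
        raise ValueError("pattern must fit in 64 bits")
    return [(pattern >> (4 * i)) & 0xF for i in range(16)]


def l2_loopback_sequence(pattern, src_row, dst_row):
    """Ordered single shot steps as (operation, row, nibble index, nibble value)."""
    steps = []
    # Load the register files four bits at a time: 16 single shot tests.
    for index, nibble in enumerate(nibbles_of(pattern)):
        steps.append(("NIBBLE WRITE", src_row, index, nibble))
    # Drive the stored double-word onto the L2 driver/receiver pads.
    steps.append(("WIDE READ", src_row, None, None))
    # Repeat the wide read with HOLD asserted so the vector stays latched.
    steps.append(("WIDE READ + HOLD", src_row, None, None))
    # Write the held vector, received from the L2 pads, into a new row.
    steps.append(("WIDE WRITE", dst_row, None, None))
    return steps
```

For any 64-bit pattern this yields 19 single shot steps: 16 nibble loads followed by the three wide operations.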

Implementation and Test Plan

Since this testing scheme is simple, the hardware required to implement it is also simple. The testing circuitry can be divided into several sub-blocks: the state machine, the state machine decoder, the counter, the rotator, and the I/O pads.

External Connections

One of the most difficult aspects of the design of this testing scheme was assigning external control lines. Even without testability, the RAM chip requires 136 differential data pads, 9 differential address pads, and 3 differential control pads, yet monetary and physical considerations limited us to two Cascade 10-pin, 5 GHz bandwidth probes (Figure 4.15) [Casc91]. On each probe, 4 of the 10 pins are reserved for power and ground. While some needle probes and ground-signal-ground probes are available, only four separate probes can be used at once due to physical considerations (the chassis used to hold the probe arms can hold only four probes).

FIGURE 4.15: CASCADE PROBE HEAD

As a result of these limitations, boundary scan control is limited to 12 pins. Table 4.3 lists the pins which are intended to be probed by the two Cascade probes. While the signal which controls whether die or MCM trace testing takes place might have been more conveniently placed on the pad ring, the decision was made to put it on the scan chain instead in order to allow the two channel select pads which select a bit for display on the scope to be placed on the pad ring.
SCAN: When asserted, scanning operation takes place when the scan clock pulses.
BS_IN: Used to provide data serially to be shifted onto the scan chain.
BS_OUT: Used to read data on the output of the scan chain.
SCOPE: Displays any one signal on the four-bit data out bus, as selected by CH0 and CH1.
CH0: The low order selection bit for the multiplexor which selects between the four data out bus signals.
CH1: The high order selection bit for the multiplexor which selects between the four data out bus signals.
HS_CLK: The high speed clock used for at-speed testing. It also clocks the boundary scan state machine.
SCAN_CLK: The scan clock. In continuous die test mode, it also selects between read and write.
WRITE: An analog signal used to set the digital delay offset of the write pulse.
W_DEL: Used to view the digital delay offset of the write pulse.
C_SYNC: Generated by roll-over of the address counter in continuous operation. Used to provide a scope sync signal.
SS: When asserted, single-shot operation. Otherwise, continuous operation.

TABLE 4.3: BOUNDARY SCAN PROBE ASSIGNMENTS

In addition to the pads shown in this table, the NORMAL pad determines whether normal die operation or testing operation takes place. This pad requires a special receiver which is pulled low in the absence of an external signal. Sometimes this signal is referred to as . Relying on this single pad is a shortcoming of the testing scheme: a die can pass all tests and yet fail in normal operation because of a fault in this pad or in the asynchronous logic used to override the internal testing signals.

Testing Control Logic

Controlling the boundary scan test mechanism with a limited number of pads proved to be challenging. Since many control signals are needed on chip, and only twelve pads are available for all testing functions (control, input, and observation), a state machine was designed to control testing from on chip based on only a few external signals. A decoder produces the needed on-chip signals from the current testing state. This scheme is similar to that used on the core CPU chips, although the decoding logic is decidedly simpler. Due to the specialized nature of the RAM chip, it was possible to eliminate some complexity while imposing only minor testing inconveniences.

A state machine is used by the boundary scan testing system to generate control signals for the drivers, receivers, and on-chip logic. This state machine implements four states: SCAN, TEST_RUN, S.S. (single shot), and DONE. It is driven synchronously by the high speed clock (HS_CLK). An additional state, NORMAL, is entered asynchronously when the external TEST signal is brought low.

The state of the controller is determined by the SCAN signal (which is provided by the test probe) and the NORMAL pad. The TEST signal is normally pulled up, but should be pulled down on the MCM after testing is complete.
Normal: This state is intended for the normal operation of the RAM chip within the F-RISC architecture. In this mode, all core inputs come from the receiver pads, and all core outputs go to the driver pads.
Scan: The scan chain is closed, and toggling the scan clock shifts the boundary scan shift register.
Test_Run: A test is run. The exact details of the test depend on the SS pad and the SCAN_CLK pad.
Single Shot: Upon entering this state, the input pattern is presented to the core circuitry.
Done: Upon entering this state, the output pattern is sampled. The pp signal remains high to allow continuous mode operation to work. If a single shot write has taken place, the write latch is cleared to prevent timing problems on succeeding single shot tests.

TABLE 4.4: BOUNDARY SCAN CONTROLLER STATES

Table 4.4 summarizes each of the five states. The NORMAL state is actually not implemented from within the state machine, but is imposed by gating the decoder outputs with the TEST signal. As a result, if this signal goes low at any time, the Normal mode of operation is asynchronously entered.

The SCAN state is the default state during testing mode. Any time the SCAN signal goes high, the state machine enters this state, which is used to conduct scanning operations. At power up, the SCAN signal should be asserted to put the testing control circuitry into a known state.

When SCAN goes low the state machine enters the TEST_RUN state on the first falling edge of the high speed clock (HSCLK). The selection signals to the multiplexors on the inputs of the D-latches in the receivers are set on entering this state.

On the next falling edge the pattern present (pp) signal is asserted, exposing the core to the receiver scan latch (or pad) contents. The pattern sample (ps) signal has already gone high a half cycle earlier in preparation for the next state (Figure 4.16). During normal operation, the pp signal is always asserted.

FIGURE 4.16: BOUNDARY SCAN STATE TRANSITIONS

On the next rising clock edge, the output pattern is sampled by the driver scan latches during single shot operation. Upon entering the DONE state, the write scan latch is cleared to prevent timing problems on subsequent single shot tests. (Otherwise, the write latch would retain its 1 during scanning, and a runt write pulse would occur at the beginning of the next test.) During continuous mode, the machine remains in this state while testing proceeds. During single shot mode, testing halts after the patterns are sampled.

FIGURE 4.17: BOUNDARY SCAN CONTROLLER STATE DIAGRAM

Figure 4.17 is the state diagram which implements the boundary scan scheme on the cache RAM chip. The state machine is implemented with two master/slave latches with two-input AND gates on their inputs, and one additional two-input gate. A separate decoder produces the control signals required by the built-in self test circuitry and by the drivers and receivers.
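A behavioral model can make the state sequence easier to follow. The Python sketch below is an approximation of Figure 4.17 under stated assumptions, not a description of the actual gate-level design: it advances one state per falling edge of HS_CLK, treats NORMAL as a gating condition on the decoder outputs rather than as a state, and models the csc, pp, and ps outputs at state granularity, ignoring the half-cycle offset of ps described above. The class and field names are hypothetical.

```python
from dataclasses import dataclass

# State names follow Table 4.4; NORMAL is not a state of this machine.
SCAN, TEST_RUN, SINGLE_SHOT, DONE = "SCAN", "TEST_RUN", "SINGLE_SHOT", "DONE"

# Successor of each state while the SCAN pin is held low.
_NEXT = {SCAN: TEST_RUN, TEST_RUN: SINGLE_SHOT, SINGLE_SHOT: DONE, DONE: DONE}


@dataclass(frozen=True)
class DecoderOutputs:
    csc: bool  # close scan chain (scan latches form a shift register)
    pp: bool   # pattern present (core sees the input pattern)
    ps: bool   # pattern sample (falls on entering DONE to trigger sampling)


class BoundaryScanController:
    def __init__(self):
        # Asserting SCAN at power up puts the controller in a known state.
        self.state = SCAN

    def falling_edge(self, scan):
        """Advance on one falling edge of HS_CLK, given the SCAN pin level."""
        self.state = SCAN if scan else _NEXT[self.state]
        return self.state

    def outputs(self, test):
        """Decoder outputs, gated asynchronously by the TEST signal."""
        if not test:
            # NORMAL: pp held high so the core sees the pads; scan chain open.
            return DecoderOutputs(csc=False, pp=True, ps=False)
        return DecoderOutputs(
            csc=self.state == SCAN,
            pp=self.state in (SINGLE_SHOT, DONE),
            ps=self.state == SINGLE_SHOT,
        )
```

Starting from SCAN with the SCAN pin held low, four falling edges step through TEST_RUN, SINGLE_SHOT, DONE, and DONE; the model idles in DONE until SCAN is raised again, matching both the single shot and continuous behavior described in the text.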

Timing

A timing diagram showing several of the key elements of the testing scheme is shown in Figure 4.18. This section describes in more detail the important events that occur during each phase of testing operation.

Normal Operation

During normal (non-testing) operation of the RAM chip, the testing circuitry must not interfere with the chip. As can be seen from the timing diagram, normal operation occurs when the TEST signal is low. While this signal is low, the PP signal remains high regardless of the other scan control signals, exposing the core to the signals on the external pads, and the INP_SEL signal selects the pads. To enter testing mode, it is necessary to raise the TEST signal and supply a clock to the HSCLK pad. On the first falling edge of this clock after TEST is asserted, the state machine enters the SCAN state.

FIGURE 4.18: TIMING DIAGRAM

Scanning Operation

Once in the SCAN state it is possible to serially shift data into and out of the scan latches. The SCAN state is entered by raising SCAN at any time during testing mode (when TEST is high). Data to be shifted in is presented at the BS_IN pad, while data is shifted out onto the BS_OUT pad. The SCAN_CLK signal is used to clock the scan shift register, causing the signal on the BS_IN pad to be shifted into the first latch, the data from each latch to be shifted into succeeding latches, and the data from the last latch to be shifted onto the BS_OUT pad.
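The shifting behavior can be modeled as a simple shift register. In this Python sketch, latch 0 is assumed to sit next to BS_IN and the last latch to drive BS_OUT, and the bit returned by each clock is assumed to be the value visible on BS_OUT just before the SCAN_CLK pulse; the class name is hypothetical.

```python
class ScanChain:
    """Boundary scan shift register: latch 0 next to BS_IN, last latch drives BS_OUT."""

    def __init__(self, length):
        self.latches = [0] * length

    @property
    def bs_out(self):
        # Value currently presented on the BS_OUT pad.
        return self.latches[-1]

    def clock(self, bs_in):
        """Apply one SCAN_CLK pulse with SCAN asserted; return the bit shifted out."""
        out = self.latches[-1]
        # Every latch takes the value of its predecessor; latch 0 takes BS_IN.
        self.latches = [bs_in] + self.latches[:-1]
        return out
```

After shifting in 1, 0, 1, 1 on a four-latch chain, the latches hold [1, 1, 0, 1]: the first bit shifted in ends up in the latch nearest BS_OUT, and four more clocks return the bits in the order they were applied.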

The external SCAN signal is known as CSC (Close Scan Chain) on chip. When scanning operation is to take place the scan chain must be "closed," that is, the input multiplexor on each scan latch must select the output of the previous scan latch.

During scanning operations the pattern present (pp) signal goes low, protecting the core from transient signals caused by the shifting operation along the scan chain.

Single Shot Tests

The single shot test is used to present an input pattern to the core at a set time and to sample the outputs at a set time thereafter. By varying the delay between input presentation and output sampling until the core circuitry fails to produce the proper output pattern, it is possible to determine the maximum operating speed of the core circuitry. The single shot test is designed to produce worst case results. This is accomplished by routing the input signals from the scan latches through the input multiplexor to the D-latch and then through the D-latch, and by routing the outputs through to the pad before they are latched.

As shown in the timing diagram, single shot operation begins on the first high speed clock edge following the de-assertion of the SCAN signal. The state machine first switches to the TEST_RUN state, opening the scan chain (de-asserting CSC) so that the scan latches sample data rather than act as a shift register. On the next falling edge, the pp and ps signals both go high. The high pp signal releases the input pattern to the core, and the ps signal goes high in preparation for the next falling clock edge. At this point the state machine is in the SINGLE_SHOT state.

On the next falling clock edge, the state machine enters the DONE state. Upon entering this state, the pattern sample signal is brought low, triggering the scan latches to sample. The state machine then idles in the DONE state until SCAN is brought high again. During single shot reads, the cache RAM blocks are allowed only one pulse width of the high speed clock to produce data. This is because the address is presented on one edge, and the outputs are sampled on the next edge of the high speed clock.

To perform a single-shot write, the high speed clock must be set to the frequency whose pulse width equals the desired write pulse width. After the write takes place, the WRITE latch is asynchronously cleared to prevent problems during subsequent single shot tests. The control circuitry ensures that the width of the write pulse seen by the cache RAM blocks equals the pulse width of the high speed clock.
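Because both the single shot read window and the write pulse are tied to the high speed clock, converting between clock frequency and pulse width is simple arithmetic. A small Python sketch, assuming a 50% duty cycle so that one pulse width equals half of the clock period; the function names are hypothetical.

```python
def half_period_ps(freq_ghz):
    """Width of one high (or low) phase of a 50% duty-cycle clock, in picoseconds."""
    if freq_ghz <= 0:
        raise ValueError("frequency must be positive")
    # One period is 1000 / f ps for f in GHz; a pulse occupies half of it.
    return 1000.0 / (2.0 * freq_ghz)


def clock_for_pulse_ghz(pulse_ps):
    """Clock frequency (GHz) whose half period equals the requested pulse width."""
    if pulse_ps <= 0:
        raise ValueError("pulse width must be positive")
    return 1000.0 / (2.0 * pulse_ps)
```

For example, a 1 GHz clock gives a 500 ps pulse, and a 250 ps write pulse requires a 2 GHz clock.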

MCM Test

The MCM test is used to test both the MCM traces and the path through the drivers and receivers which is taken during normal operation. If the chip is not mounted on the MCM, the contents of the scan latches are inverted and brought to the receiver pads to be used as alternate inputs. Similarly, the driver outputs are captured by the driver scan latches so that the drivers can also be checked without the MCM.

MCM testing will normally use the single-shot functionality of the testing circuitry, because the large resistors which tie the scan latches to the receiver inputs introduce too large an RC delay to produce useful scope output in continuous operation. The MCM test is selected by scanning a 0 into the DIETEST latch. The test proceeds as a normal single shot test, except that the internal testing circuitry sets INPSEL to select the pads rather than the scan latches, as in normal chip operation. One important point to remember in this testing mode is that a 0 in the WRITE scan latch will be inverted to a 1, sending a runt WRITE pulse to the core; and even if a 1 is in the WRITE scan latch, a runt WRITE pulse reaches the core anyway. As a result, the contents of the register files may be modified unpredictably during an MCM test. This necessitates testing the drivers and receivers in separate steps: first loading the register files, then running a driver test, and then running a receiver test. The contents of the register files need not be known to conduct a receiver test.

L2 Driver / Receiver Testing

Testing of the 64 driver / receiver pairs which are used for communication with the L2 cache can be accomplished using a series of single shot tests. The testing method varies depending on whether the die has been mounted to the MCM.

If the die has been mounted to the MCM, then a series of tests similar to those described in MCM Trace Test should be performed. The cache RAMs are set to read data out onto the wide data path, and the L2 chips are set to capture these signals. Assuming that the MCM traces are functional, this allows testing of the L2 drivers. To test the receivers, the process is reversed: the L2 chips are set to read data out onto the bus while the cache RAMs are set to capture these signals.

If the die has not been mounted to the MCM, testing proceeds differently. The cache RAM blocks are pre-loaded using either single shot or continuous tests. A wide read then takes place, which puts the contents of the selected cache row onto the L2 drivers. Next, another single shot wide read takes place, this time with HOLD asserted. HOLD must be set high in order to preserve the data on the L2 drivers for the next step, which is a wide write to a new row. The data proceeds from the L2 drivers to the L2 receivers and is written into the new row. Other tests can then be performed to verify that the cache RAM contains the proper data in the proper locations.

Continuous Mode Testing

The continuous testing mode is selected by lowering the SS pad signal prior to lowering SCAN during testing mode. A continuous mode test consists of two discrete testing stages: continuous write and continuous read.

The SCAN CLOCK signal pin is used to select between writing and reading operations. This signal should be set low before lowering SCAN to start off with a continuous write test.

With each low to high transition of the high speed clock, the on-chip address counter will increment. With each high to low transition, the new address is loaded into the address scan latches, from where it can be presented to the core circuitry. Data is supplied to the cache RAM blocks by the on-chip rotator, which can be pre-loaded with an eight-bit pattern. Four of these eight bits are supplied to the cache RAM block which is selected with the two bit block address scanned into the address scan latches.

A WRITE pulse is automatically supplied during each clock cycle by the internal testing circuitry, regardless of the contents of the WRITE scan latch. The pulse length is one half of the high speed clock period, which allows write access time testing. The pulse may be shifted by modifying the voltage on the external WRITE pin, and the resulting delay may be observed on the W_DEL pin.

After 32 high speed clock periods, the selected nibble of the selected block will be filled. The testing circuitry allows the writing to continue, however, since the pattern will simply repeat itself each time through the cache RAM block.
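The claim that the pattern simply repeats can be checked with a small model of the counter and rotator. This Python sketch is hypothetical: it assumes the rotator rotates left by one bit per write cycle and that its four low-order bits feed the selected block, neither of which is specified above. Under those assumptions, each address receives the same nibble on every pass, because the 32-address counter period is a multiple of the 8-bit rotator period.

```python
def rotate_left8(value, amount=1):
    """Rotate an 8-bit value left by `amount` bits."""
    amount %= 8
    return ((value << amount) | (value >> (8 - amount))) & 0xFF


def continuous_write(rotator_seed, cycles):
    """Return {address: nibble} after `cycles` continuous-mode write cycles."""
    memory = {}
    counter = 0
    rotator = rotator_seed & 0xFF
    for _ in range(cycles):
        memory[counter] = rotator & 0xF   # assumed: low four bits feed the block
        rotator = rotate_left8(rotator)   # assumed: one-bit rotation per cycle
        counter = (counter + 1) % 32      # 5-bit counter; roll-over would pulse C_SYNC
    return memory
```

Running 32 cycles and 96 cycles from the same seed produces identical contents, so letting the writes continue past the first pass does not disturb the data seen on read-back.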

To switch to read access time testing, the SCAN signal must be raised. Next the SCAN CLOCK signal must be raised. Finally, SCAN must again be lowered.

One half clock period after each high speed clock transition, a sampling latch on the SCOPE output is triggered. This ensures that only one half clock period is allowed for each cache read, allowing read access time testing.

The counter provides the C_SYNC signal each time it rolls over. This signal can be used as a trigger by an oscilloscope.

Testing Sequence

The testing circuitry allows many different types of tests to be performed, and allows several types of problems to be detected. The sequence in which testing occurs is important in determining exactly what is wrong when a test fails.

As a large number of die are expected to be tested, it is initially important to determine if some chips have flaws so fatal that further testing would be a waste of time.

Apart from the obvious tests that should be performed for any chip (e.g. does it draw current?), the first test that should be performed for chips utilizing this boundary scan scheme is a scan through test. In this test, a pattern of zeroes and ones is scanned into the chip, and compared to what eventually is scanned out. Obviously, the input pattern and output pattern should be identical.

This allows testing of the scan control circuitry, scan latches, and some of the I/O pads. If the test fails, the mode of failure should be noted. A chip failing this test may still be testable in other ways (e.g. through the scope output), but can probably be eliminated from further testing.
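A scan through test only needs to account for the chain latency: the first N bits observed at BS_OUT are the chain's previous contents, and the input pattern appears afterward. In the Python sketch below, `shift` is a hypothetical callback standing in for the probe hardware, and the simulated chain (with an optional stuck latch) is included only to exercise the check; 26 is the cache RAM chain length given later in this chapter.

```python
from collections import deque


def scan_through_test(shift, chain_length, pattern):
    """Return True if `pattern` re-emerges intact at BS_OUT.

    shift(bit) is a hypothetical callback that applies one SCAN_CLK pulse
    with `bit` on BS_IN and returns the value read on BS_OUT before the pulse.
    """
    bits = list(pattern)
    # Flush with zeros so every pattern bit travels the full chain.
    observed = [shift(b) for b in bits + [0] * chain_length]
    # The first chain_length bits are the chain's previous (unknown) contents.
    return observed[chain_length:] == bits


def make_simulated_chain(length, stuck_at=None):
    """Ideal chain model; stuck_at=(index, value) forces one latch to a fixed value."""
    chain = deque([0] * length)

    def shift(bit):
        out = chain.pop()        # rightmost latch drives BS_OUT
        chain.appendleft(bit)    # BS_IN enters the leftmost latch
        if stuck_at is not None:
            index, value = stuck_at
            chain[index] = value
        return out

    return shift
```

A healthy 26-latch chain returns the pattern unchanged, while a single latch stuck at 0 zeroes every bit that passes through it, so the comparison fails.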

Chips which pass the scan through test can then be quickly tested using continuous mode testing. Different patterns can be loaded into each of the sixteen nibbles and read out on the scope, and each bit of a nibble can be examined by using the channel select pads. If each write-read cycle works, the chip can be further tested by going back and reading each nibble again, to ensure that subsequent writes did not overwrite old patterns, which would indicate a block or nibble addressing problem. If such a problem occurs, the Block Select Latches can provide diagnostic information: the Block Select Latches of the most recently selected blocks (hopefully only one at a time) will contain a 1.
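The two-pass nibble check can be expressed as follows. This Python sketch assumes hypothetical `write_nibble` and `read_nibble` wrappers around one continuous mode write pass and one scope read, and reduces each nibble's contents to a single 4-bit signature; the signatures (5i + 3) mod 16 are distinct for all sixteen nibbles because 5 is odd.

```python
def check_nibble_addressing(write_nibble, read_nibble, count=16):
    """Return (immediate_failures, overwritten) lists of nibble indices.

    write_nibble(i, value) and read_nibble(i) are hypothetical hardware wrappers.
    """
    # Distinct signature per nibble: 5 is odd, so i -> (5i + 3) mod 16 is one-to-one.
    signatures = {i: (5 * i + 3) % 16 for i in range(count)}
    immediate = []
    for i, sig in signatures.items():
        write_nibble(i, sig)
        if read_nibble(i) != sig:   # write-read check right after writing
            immediate.append(i)
    # Second pass: a nibble that read back correctly before but not now
    # was overwritten by a later write, pointing at an addressing fault.
    overwritten = [i for i, sig in signatures.items()
                   if i not in immediate and read_nibble(i) != sig]
    return immediate, overwritten
```

A fault in which nibble address bit 3 is stuck at 0, for instance, lets nibbles 8 to 15 overwrite nibbles 0 to 7; every immediate read succeeds, but the second pass reports nibbles 0 to 7 as overwritten.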

This testing should first be performed with a relatively slow clock, since functional testing should precede speed testing. If all of these tests are passed, the clock frequency can be increased and the tests repeated. This cycle can be repeated until failure, revealing the read and write access times of the chip.
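The slow-to-fast procedure amounts to a frequency sweep. The following Python sketch assumes a hypothetical `run_test(freq_ghz)` callback that performs a full continuous mode write/read pass and returns True on success; it reports the last passing frequency and its half period, which bounds the access time under the half-period read window and write pulse described above.

```python
def sweep_clock(run_test, start_ghz, step_ghz, max_ghz):
    """Raise the HS_CLK frequency step by step until run_test fails.

    Returns (last passing frequency in GHz, its half period in ps),
    or None if the starting frequency already fails.
    """
    if step_ghz <= 0:
        raise ValueError("step must be positive")
    last_pass = None
    n = 0
    while True:
        freq = start_ghz + n * step_ghz  # computed, not accumulated, to limit rounding
        if freq > max_ghz or not run_test(freq):
            break
        last_pass = freq
        n += 1
    if last_pass is None:
        return None
    return last_pass, 1000.0 / (2.0 * last_pass)
```

A binary search would need fewer runs, but the linear sweep mirrors the repeat-until-failure procedure in the text and also exposes intermittent failures at intermediate frequencies.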

If bit errors are found, single shot testing can be used to determine whether the problem was caused by the continuous mode circuitry.

The drivers and receivers can be tested by performing an MCM test, in which the contents of the scan latches should be inverted and re-latched.

If all of these tests are passed, an L2 test can be performed to test the L2 drivers and receivers.

When all of these tests are passed, a chip is a candidate for mounting on the MCM. Several tests should be performed once the chip is mounted on the MCM.

First, the die should again be functionally and speed tested using the procedures outlined above. The drivers and receivers can now only be tested in conjunction with the MCM traces, by causing one chip to send data to another chip, where it is latched.

Among problems that can be detected along the way are bad nibble addressing, problems in the write circuitry (through the WRITECATCH latch and the W_DEL signal pad), bad drivers and receivers, bad MCM traces, bad scan circuitry, slow dies, and cache RAM block bit errors. Problems in the SCOPE circuitry are also detectable through comparison of the SCOPE output to the contents of the driver scan latches.

Individual Chips

Cache Controller

The cache controller scan path is shown in Appendix B. The path is listed starting at the scan out end.

The table indicates the phase on which each signal is typically asserted. Each ">" or "<" represents an offset of 20-50 ps from the indicated clock phase.

The configuration latches which are shown are used to control which phases are used for sampling and pattern presentation (pre-sampling), and how much of an offset from the selected phase is used.

The scan chain as a whole consists of 117 latches. Of these, 12 are used solely for configuration purposes, 65 are drivers, and 40 are receivers. Several signals are located on multiple sides of the die, which can provide useful skew information.

Cache RAM

The cache RAM scan path is significantly shorter than that on the other chips. This is because scan latches were included only on the high speed paths. The scan chain consists of only 26 latches, of which four are in drivers, seven are in receivers, and the remaining fifteen are special purpose latches (eight are part of the rotator, five are part of the address counter, and two others are used to check whether a write occurred and whether a die or MCM test is supposed to occur.)

Table 4.5 lists the scan path in order, starting at the scan out pad.

In addition to conventional boundary scan testing, the cache RAM also includes circuitry which allows waveforms to be displayed on an oscilloscope. The SCOPE pin, in conjunction with the VIEWA and VIEWB multiplexor select lines, allows any one of the four high-speed data lines to be viewed when operating in continuous testing mode.
D3Out Driver
D2Out Driver
D1Out Driver
D0Out Driver
Write_Capture Testing
Rotator0 Testing
Rotator1 Testing
Rotator2 Testing
Rotator3 Testing
Rotator4 Testing
Rotator5 Testing
Rotator6 Testing
Rotator7 Testing
Counter4 Testing
Counter3 Testing
Counter2 Testing
Counter1 Testing
Counter0 Testing
ADR3 Receiver
ADR2 Receiver
ADR1 Receiver
ADR0 Receiver
Dietest Configuration
WIDE Receiver
WRITE Receiver
HOLD Receiver

TABLE 4.5: CACHE RAM SCAN CHAIN
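For a test program, Table 4.5 can be captured as data and used to build scan vectors. This Python sketch is illustrative: the names mirror the table, and it assumes, consistent with the shift behavior described under Scanning Operation, that the first bit applied at BS_IN ends up in the latch nearest the scan-out pad, so bits are shifted in table order. The function name is hypothetical.

```python
# Cache RAM scan chain, listed from the scan-out pad (Table 4.5).
CACHE_RAM_CHAIN = [
    ("D3Out", "Driver"), ("D2Out", "Driver"), ("D1Out", "Driver"), ("D0Out", "Driver"),
    ("Write_Capture", "Testing"),
    *[(f"Rotator{i}", "Testing") for i in range(8)],          # Rotator0 .. Rotator7
    *[(f"Counter{i}", "Testing") for i in range(4, -1, -1)],  # Counter4 .. Counter0
    ("ADR3", "Receiver"), ("ADR2", "Receiver"), ("ADR1", "Receiver"), ("ADR0", "Receiver"),
    ("Dietest", "Configuration"),
    ("WIDE", "Receiver"), ("WRITE", "Receiver"), ("HOLD", "Receiver"),
]


def scan_in_order(values):
    """Bits to apply at BS_IN, first to last, to load `values` (name -> bit).

    Assumes the first bit shifted in ends up in the latch nearest the
    scan-out pad, so the shift order follows the table order.
    Latches not named in `values` are loaded with 0.
    """
    names = [name for name, _ in CACHE_RAM_CHAIN]
    unknown = set(values) - set(names)
    if unknown:
        raise KeyError(f"not in scan chain: {sorted(unknown)}")
    return [values.get(name, 0) for name in names]
```

The data reproduces the counts given in the text: 26 latches, of which four are drivers, seven are receivers, and fifteen are special purpose (fourteen testing latches plus the Dietest configuration latch).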