A comprehensive testing scheme is necessary to increase the yield of a multichip package. Usually, chips are tested before insertion into a package, interconnect layers are inspected and repaired after every layer, and the final assembly is inspected and functionally verified before a fully working package is declared. A number of test schemes are used for these steps [John87], such as automatic and manual optical inspection, capacitive probing, acoustic microscopy, x-ray and laser inspection, voltage-contrast e-beam [Keez94], and electrical in-circuit testing schemes such as built-in self-test (BIST), level-sensitive scan, and boundary scan. Electrical testing schemes are attractive because of their low cost and the ability to incorporate the scheme directly into a design. Most of the F-RISC/G package testing is done using in-circuit electrical testing schemes.
All the F-RISC/G chips use boundary scan testing for functional and at-speed verification. Boundary scan was preferred because it uses only a fraction of the resources required by a BIST scheme [Phil93]. This chip-level scheme was easily adapted to test a collection of chips together at the package level. The testing is done at several points during the package fabrication phase. The electrical testing scheme is used in conjunction with the optical inspection of the interconnect at GE to fully verify the package. This chapter describes the complete test scheme with emphasis on package-level testing.
Two types of tests are devised to verify the chips and the package: functional and at-speed. The main constraints in the design of these tests are
These guidelines shaped the testing scheme described in the rest of this chapter. Figure 7.1 illustrates the overall test process used for this purpose. Tests will be done both at RPI and GE with different types of test equipment at different stages, as shown in Table 7-1. The first stage is to identify Known-Good-Dies (KGD). Once enough dice are accumulated to populate a package, they will be forwarded to GE for packaging. The KGD identification step is repeated after the dice are placed on the substrate. If a die fails due to the pick-n-place operation, it can be replaced at this step. The first layer of insulator and metal is put down next, and all the chip pads are brought up. The dice are again tested individually with boundary scan tests. Any fault here requires debelting of the tape on that sub-module and risks the loss of all the dice on that module. After that, all the layers are put down and the chips are tested for functionality by scan chains and for speed by program execution.
|Test Stage||Test Scheme||Location||Equipment|
|Bare die||Boundary scan||RPI||High-speed probes, probe station|
|After pick-n-place||Boundary scan||RPI||High-speed probes, probe station|
|After first tape||Boundary scan||RPI||High-speed probes, probe station|
|From second to last tape||Optical inspection||GE|| |
|After module completion||Boundary scan, program execution||RPI||Test jig with heat sink|
All the chips employ boundary scan schemes to comprehensively test their functionality. This has been discussed in detail in [Phil93][Maie96] and is overviewed here with additions and corrections. A boundary scan test scheme places a scan cell between the core logic and an I/O pad and connects all the scan cells in a single chain, as shown in Figure 7.2. The input to the first scan cell and the output of the last scan cell are brought out as separate control pads on the chip periphery and are used to move data in and out of the chip serially during testing. These scan cells are controlled by additional signals that set the testing status - bare chip, chip on package, or normal chip operation - and confirm operation of the chip at-speed.
There are twelve signals in total, as shown in Figure 7.3, using two six-channel probes. All the signals are described later in Table 7-2 and Table 7-3. Additional power probes are put down on the remaining two sides of the chips. The G-P-G power probe and the S-S-P-G-S-S-G-P-S-S signal probe can handle maximum currents of 1.0 A and 0.5 A respectively [Casc81]. A set of 50 Ω G-S/S-G probes, mounted on a flexible arm, is used to test any individual driver on the pad periphery.
In the standard boundary scan testing scheme, test vectors are
serially shifted into the scan cells and applied to the core logic.
The data is collected from the scan cells connected to the outputs
of the core logic after a clock cycle. The added at-speed testing
feature of the circuitry allows the presentation of the test vector
and observation of the circuit response with an individually variable
delay from the start of one of the four clock phases. The variable
delays can be adjusted from 0 to 280 ps in steps of 40 ps.
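The shift/apply/capture sequence described above can be modeled in a few lines of Python. This is a minimal sketch, not the actual F-RISC/G netlist: the 4-cell chain and the inverting core logic are purely illustrative.

```python
# Minimal model of a boundary scan chain: a test vector is shifted in
# serially at SCAN_IN, applied to the core logic for one clock, and the
# captured response is shifted back out at SCAN_OUT.

def shift_in(chain, bits):
    """Shift bits into the chain serially; returns the bits that fall
    out of the far end (the previously captured response)."""
    out = []
    for b in bits:
        out.append(chain.pop())   # last cell spills to SCAN_OUT
        chain.insert(0, b)        # SCAN_IN feeds the first cell
    return out

def capture(chain, core_logic):
    """Apply the held vector to the core and latch its response."""
    chain[:] = core_logic(chain)

# Illustrative example: a 4-cell chain around a core that inverts inputs.
chain = [0, 0, 0, 0]
shift_in(chain, [1, 0, 1, 1])                  # load the test vector
capture(chain, lambda v: [1 - b for b in v])   # one functional clock
result = shift_in(chain, [0, 0, 0, 0])         # shift the response out
print(result)                                  # -> [0, 1, 0, 0]
```

Comparing the shifted-out response against the expected pattern is the pass/fail criterion; at-speed testing only changes when the capture clock fires relative to the phase clocks.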
Chip testing on the package is done in three phases. In the first phase, the standard suite of boundary scan tests is executed after the chips are placed on the submodule substrates. At this point all the chips can be tested at-speed. The idea behind the first phase is shown on the left in Figure 7.5. In the second phase, the tests are repeated after gluing down the first insulator layer and depositing the first metal layer; this phase is shown schematically on the right in Figure 7.5. This brings all the pads up to the first metal layer, and the standard boundary scan test scheme is applied again to test the integrity of all the chips. The module top view at this stage is shown in Figure 7.7. If all the chips pass all the tests, the whole module is declared functional and is ready for integration. In the third phase, the chips are tested after all the interconnect is built and the three modules are integrated together.
The interconnects are tested optically at GE, after building each
layer, to achieve maximum possible yield for the whole interconnect
structure. Any defect at this stage can be reworked without debelting
the whole tape. This testing is complemented later by testing
the wires electrically using the scan chains.
Final module verification will proceed in the order
The final package I/O is shown in Figure 7.8 with all the scan chains. All these tests will be conducted on a separate test jig.
One round of interconnect integrity checking will be done at GE by optical inspection. Another way of testing these interconnections is by using the scan chains displayed in Figure 7.8. First, all the output latches on the chips in a scan chain are loaded with test vectors. In the next step these chips are clocked, and the data is transmitted over the wires to the receivers on other chips, as shown in Figure 7.9. The test vectors are latched at these receivers and are scanned out to be compared with the original vectors. Depending on the degree of failure, the interconnect can either be reworked or discarded and built up again. This method doesn't provide delay information for the wires unless the test is conducted by carefully deskewing the clocks to the two chips. A delay indication for the interconnect is instead obtained by using a pair of unused pads on the RAM chips to create a ring oscillator on the package.
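Extracting a wire delay from the ring oscillator is simple arithmetic: the oscillation period is twice the loop delay, and subtracting the known on-chip contribution leaves the interconnect delay. The sketch below illustrates this; the 1.25 GHz frequency and 250 ps on-chip figure are made-up example numbers, not F-RISC/G measurements.

```python
# The ring oscillator formed through a pair of unused RAM-chip pads
# oscillates with a period equal to twice the total loop delay
# (driver + MCM wire + receiver + inverting feedback).  Subtracting the
# known on-chip delay isolates the package wire delay.

def wire_delay_ps(f_osc_hz, on_chip_ps):
    """Interconnect delay implied by a measured oscillation frequency."""
    period_ps = 1e12 / f_osc_hz
    loop_ps = period_ps / 2        # one loop traversal per half period
    return loop_ps - on_chip_ps

# Illustrative numbers: 1.25 GHz measured, 250 ps of on-chip delay.
print(wire_delay_ps(1.25e9, 250.0))   # -> 150.0 ps of wire delay
```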
As mentioned before, the scan chains of a few chips are connected together in a daisy chain on the package to facilitate testing several chips together, with a considerable reduction in testing time. The SCAN_IN and SCAN_OUT signals of a group of chips are daisy chained, and all the control signals are broadcast to that group. The following chains were made on the package:
1. ID, DP0, DP1, DP2, ICC, and DCC.
2. IM0, IM2, IM4, IM5, IM6, and IM7
3. DM0, DM2, DM4, DM5, DM6, DM7
5. IM1, DM3, and DM1
The boundary scan chain itself can be tested by setting it into oscillation with an off-chip (during chip testing) or off-MCM (during MCM testing) inverter. The scan clock input in this case is a slow, free-running clock, and the whole boundary scan line acts as a long multichip shift register.
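This self-test mode admits a simple software check: a shift register with an inverting feedback path is a twisted-ring (Johnson) counter, which steps through exactly 2N distinct states for an N-cell chain, so observing the full sequence at SCAN_OUT exercises every cell. A small sketch (the cell count is illustrative):

```python
# An N-cell scan chain whose SCAN_OUT is fed back to SCAN_IN through an
# inverter behaves as a twisted-ring (Johnson) counter under the
# free-running scan clock: it cycles through exactly 2*N states.

def johnson_states(n_cells):
    """Enumerate the distinct states of an n-cell chain with inverted
    feedback, starting from the all-zero reset state."""
    state = (0,) * n_cells
    seen = []
    while state not in seen:
        seen.append(state)
        fed_back = 1 - state[-1]          # the external inverter
        state = (fed_back,) + state[:-1]  # one scan-clock shift
    return seen

states = johnson_states(6)
print(len(states))   # -> 12: a 6-cell chain steps through 2*6 states
```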
The memory chips use a different boundary scan scheme than the ID, DP, and CC chips, though in principle they are the same. The signals available for both schemes on the respective chips are given in Table 7-2 and Table 7-3. The signals shared among all the chips in a chain are shown lightly shaded in the tables; signals shown in a darker shade must be supplied individually to each chip. Table 7-2 shows that only INP_SEL needs to be supplied individually for testing MCM wires. SCAN is a control signal requiring a sharp rise time. The scan clock (SC) to the boundary scan cells is daisy chained in the reverse order to the data, as shown in Figure 7.10, to remove any possibility of a race condition due to excessive skew.
The TEST signal is internally pulled high, selecting TEST mode. Once the chips are tested and placed on the MCM, it is pulled low externally. The cache RAM chips don't use the deskewed clock and therefore need an external clock to run the internal state machine for testing. The HSCLK signals are daisy chained by providing another driver beside the HSCLK receiver to send the output on to the next chip. This active daisy chaining, as compared to passive daisy chaining, preserves sharp rise times.
Another trick used to provide maximum flexibility for testing was to bring all the configuration signals for the datapath chips out to the package edge. These bits are used for personalizing the datapath chips so that they act as DP0/DP1/DP2/DP3. By changing these bits, any datapath chip can emulate any other datapath chip, improving the ability to debug any logic fault. They can also be used to test the middle module alone for at-speed verification by converting the DP2 chip to DP3.
|Signal||Description|
|VIEWA||Depends on SEL, high-speed (output)|
|SEL||Selects either VIEWA or VIEWB (input)|
|VIEWB||Depends on SEL, high-speed (output)|
|START||Start signal for scanning (input)|
|TEST||Enables testing mode (input)|
|SC||Scan clock (input)|
|SCAN_IN||Scan in data (input)|
|INP_SEL||Select input mode (input)|
|SCAN||Control signal (input)|
|SCAN_OUT||Scanned out data (output)|
|SYNC||Sync signal (input) - Initializes 4-phases|
|CNTRSYN||Counter high bit for scope (output)|
|Signal||Description|
|SCAN_IN||Serial input port for test vectors (input)|
|SS||Selects single-shot/continuous mode (input)|
|SCAN||Daisy chained (input)|
|SCAN_OUT||Serial output port (output)|
|W_DEL||Delayed write line (output)|
|SC||Scan clock (input)|
|ANALOG||Delay of the write line (input)|
|CHSEL1||Select line for SCOPE output (input)|
|CHSEL0||Select line for SCOPE output (input)|
|SCOPE||High-speed data for scope (output)|
|HSCLK||High speed clock (input)|
|TEST||Pulled high internally|
A full system test will be done once all the chips and the submodules are verified as functional at their intended speed. A simple block diagram of the testing setup is given in Figure 7.11. Since the secondary and main memory are missing from the package, a technique was devised for testing the system with only the on-board cache memory. This memory is loaded via the scan chains, and the system is booted up to read from it. Since the L1 cache is very small, small programs are used to demonstrate the at-speed operation of the system. These programs are at the end of this chapter.
An approximate arrangement of the electrical support system is shown in Figure 7.12. Power and control signals are supplied by two-sided custom flex cables from Advanced Circuit Technology, Inc., Nashua, NH. These cables are custom manufactured to supply 10-100 A of current depending on the length and voltage drop. There will be four cables, 4-6 inches long, on all four sides, connected to a surrounding PCB with removable connectors such as cinch buttons. The PCB will be custom designed to carry low-inductance power planes with enough bypass capacitance. Table 7-4 gives a short list of required parts for this custom PCB.
|Part||Quantity||Notes|
|Cu 2-sided custom flex||4||From ACT|
|Bypass capacitors|| ||Board level|
The total power requirements of the setup are given in Table 7-5. Switching power supplies have an efficiency of more than 80%, while linear supplies are about 50% efficient. Linear supplies, on the other hand, show much less ripple at the output. A good power supply also has overvoltage, overcurrent, and overtemperature protection and can operate in series or parallel with similar supplies. These supplies can be controlled from the front panel or via a GPIB controller. Power is brought to the board at a high DC voltage to reduce losses in the power cables and to keep the noisy power supply away from the module under test. This requires dc-to-dc converters in the test box.
|Item||Requirement|
|Thermo-electric coolers||473 W|
|Blower||208 V 60 Hz / 220 V 50 Hz, 1 PH|
|Switching power supply||713 * 1.25 = 891 W|
|Linear power supply||713 * 2 = 1426 W|
|Total (switching power supply)||891 + 180 = 1071 W|
|Total (linear power supply)||1426 + 180 = 1606 W|
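The supply sizing in Table 7-5 is straightforward to reproduce: a supply delivering a given load draws load/efficiency at its input. The sketch below uses the 713 W module load and the 80%/50% efficiency figures quoted above.

```python
# Supply sizing behind Table 7-5: a switching supply at ~80% efficiency
# needs ~1.25x the module load at its input; a linear supply at ~50%
# needs ~2x.  The 713 W load is the module figure from the table.

def input_power(load_w, efficiency):
    """Wall-side power needed to deliver load_w at the given efficiency."""
    return load_w / efficiency

switching = input_power(713, 0.80)   # ~891 W
linear    = input_power(713, 0.50)   # 1426 W
print(round(switching), round(linear))
```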
The clock to the package is supplied via a 50 Ω surface-mount mini-SMA Hirose connector from a Weinschel Engineering Model 432A/438A 0.01-4 GHz benchtop tunable oscillator. This clock drives the input of a clock tree which generates 10 synchronized clocks. One of these clock outputs is available via a coax connector for external monitoring. Another option was to use a surface-mount clock oscillator such as those available from Mini-Circuits. The problem is that these oscillators work over a narrow range, such as 1.5-1.91 GHz, and the package therefore couldn't be run at any other frequency.
The following steps are followed while powering up the system:
The processor in its current form doesn't support at-speed connections to the second-level cache (L2) memory. It can, however, run instructions that are scanned serially into the on-module primary instruction cache to verify the 2 GHz, 4-phase operation claims in the demonstration tests of the processor. The capacity of the primary instruction cache is 2 KB, equivalent to 512 32-bit instruction words. Small specialized programs were developed to fit into this memory and show the processor running at speed. The speed-critical parts of the processor and their specifications are given in Table 7-6.
Thus, showing the processor running at-speed requires instructions exercising all the circuit macros given in Table 7-6. Normally, this can be done by running a few instructions and checking the resulting state of the processor. One good way of defining this state is to look at the resulting contents of the program counter, register file, and cache, i.e., all the storage bits. This, however, requires the placement of many high-speed control signals on the package and still lacks a clear proof of the processor speed. Therefore, a dynamic method of testing the processor speed was devised by recognizing the importance of the carry-out signal from the most significant bit of the 32-bit adder. The connectivity and routing information of this signal is shown in Figure 7.13.
This bit can be made to change every cycle by executing an add instruction coupled with a branching loop, as shown at the end of this section. If the processor is running at a cycle rate of 1 GHz, this bit will toggle at 500 MHz. Since the carry-out bit from DP3 is unused on the processor, it is routed out to a connector for observation without any extra loading on the net that would slow down its rise time. It can also be programmed to show different sequences, ruling out the possibility of a fluke signal.
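The frequency relationship is worth making explicit: the 2 GHz master clock corresponds to a 1 ns machine cycle (a quarter of the master frequency is quoted later for the carry-out pin), and a bit that flips once per cycle completes a full square wave every two cycles.

```python
# Why the DP3 carry-out pin runs at 500 MHz: a bit that flips once per
# machine cycle yields a square wave at half the cycle rate, and the
# 1 GHz cycle rate is half the 2 GHz master clock per the text.

def toggle_freq_hz(cycle_freq_hz):
    """Frequency of a bit that flips once per machine cycle."""
    return cycle_freq_hz / 2

master = 2e9                  # 2 GHz master clock
cycle = master / 2            # 1 ns machine cycle
print(toggle_freq_hz(cycle))  # -> 500 MHz, a quarter of the master clock
```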
Since the chips were designed by multiple designers, a way of combining and simulating the chip netlists - containing the net delays - in a hierarchical manner was devised. The processor was broken up into two clearly divided sections for timing verification: core logic and cache memory. The core logic timing will be described by Steven Carlough in his thesis. The cache timing involves the instruction fetch and data load/store timings. The rest of the sections describe the simulation technique and the simulated cache timings, along with the system boot sequence.
The simulations were done using the qsim tool in the Compass Design Automation tool suite. One strange problem with these simulations was an incompatibility between the cache memory chip netlist and the netlists of the rest of the architecture chips, which kept crashing the simulations. The company was consulted at length, but the exact source of the error remains undetermined.
The top level simulations were done in the following manner:
An additional delay element was added to every
interchip net in the top level schematic as shown in Figure 7.14.
The input or output nets of this element were assigned the delay
value extracted from the package layout. These delay values were
provided in a .sim file and were represented with an equivalent
capacitance value. An example fragment is shown in Figure 7.15
with the delays shown on a section of the data bus.
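The conversion from an extracted delay to an equivalent capacitance depends on the simulator's delay model, which is not spelled out here. As a sketch only, a first-order RC model is assumed below (delay to the 50% point is R·C·ln 2), and the 60 Ω driver resistance is purely illustrative.

```python
# Hedged sketch: expressing a package net delay as the equivalent load
# capacitance on a delay element, assuming a first-order RC model where
# t_50 = R * C * ln(2).  The 60-ohm driver resistance is an assumption
# for illustration; the real mapping depends on qsim's delay model.

import math

def equivalent_cap_ff(delay_ps, r_driver_ohm=60.0):
    """Capacitance (fF) that reproduces delay_ps through r_driver_ohm."""
    c_farads = (delay_ps * 1e-12) / (r_driver_ohm * math.log(2))
    return c_farads * 1e15

# e.g. a 176 ps segment delay, as in the daisy-chain example below:
print(round(equivalent_cap_ff(176.0), 1))
```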
A daisy-chained net was split up in the fashion shown in Figure 7.16, and the delay of each branch was inserted as a separate net. Daisy-chained nets are common in the instruction decoder-datapath broadcast and in cache cycles; an example daisy chain net is shown in Figure 7.17. For example, the delay of the IOCNTRL0 signal from the ID to the DCC chip is the sum of the ICCIOCNTRL0 (176 ps) and DCCIOCNTRL0 (95 ps) delays.
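The per-branch bookkeeping amounts to a running sum of segment delays. A short sketch using the IOCNTRL0 figures from the text (the segment names are descriptive labels, not netlist names):

```python
# Splitting a daisy-chained net into branches: the delay seen at each
# downstream chip is the cumulative sum of the branch segments.
# Values are the IOCNTRL0 example from the text (ps).

segments = [("ID->ICC", 176), ("ICC->DCC", 95)]

def arrival_times(segments):
    """Cumulative delay at the far end of each branch segment."""
    total, arrivals = 0, {}
    for name, ps in segments:
        total += ps
        arrivals[name] = total
    return arrivals

print(arrival_times(segments))   # ID->DCC arrives after 176 + 95 = 271 ps
```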
The instruction fetch cycle is allocated two pipeline stages [Phil93] and is completed within 2 ns. The instruction cache controller (ICC) keeps a copy of the program counter, called remote program counter (RPC), during normal instruction execution and issues a 9-bit address every cycle to the primary instruction cache [Maie96]. The eight IM chips receive this address and put out 4-bits each on the instruction bus after 750 ps.
The critical path in instruction fetch is
exercised when a branch is taken. The DP chip sends out the branch
target address at phase 1 which is latched at the ICC on phase
3. ICC immediately sends out the lower 9 bits of this address
to the IM chips and simultaneously processes the address for a
cache hit or miss. Since all the programs to be run on F-RISC/G
are small enough to fit inside the instruction cache, an instruction
miss is not realized in the demonstration code. The address is
latched at the instruction memory and data becomes available after
750 ps at its output latches. The instruction word is sampled
by the instruction decoder at phase 1 of the following cycle.
The instruction fetch cycle and its delay components are shown
in Figure 7.18 and Table 7-7.
|Path Segment||Min (ps)||Max (ps)|
|DP Driver I/P to DP Driver O/P||136 (ø1)||162 (ø1)|
|DP Driver O/P to ICC Receiver I/P||59||246|
|ICC Receiver I/P to ICC Latch O/P||130||130|
|ICC Latch O/P to ICC Driver O/P||70||70|
|ICC Driver O/P to IM Address I/P||63||303|
|IM Address I/P to IM Data O/P||750||750|
|IM Data O/P to ID Receiver I/P||59||175|
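Summing the worst-case column of Table 7-7 checks the 2 ns instruction fetch budget directly:

```python
# Worst-case (max) delays from Table 7-7, in ps, summed against the
# 2 ns (2000 ps) instruction fetch budget.

fetch_max_ps = {
    "DP drv I/P -> DP drv O/P":  162,
    "DP drv O/P -> ICC rcv I/P": 246,
    "ICC rcv I/P -> ICC latch":  130,
    "ICC latch -> ICC drv O/P":   70,
    "ICC drv O/P -> IM addr":    303,
    "IM addr -> IM data O/P":    750,
    "IM data O/P -> ID rcv I/P": 175,
}

total = sum(fetch_max_ps.values())
margin = 2000 - total
print(total, margin)   # -> 1836 ps used, 164 ps of slack
```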
The datapath address bus output becomes valid 162 ps after phase 1 in the worst case. Immediately, the BRANCH signal is asserted to tell the instruction cache controller to branch. The instruction cache controller loads the address at its inputs into its local program counter and also sends it out to the instruction RAMs. The IM chips receive the address at different times due to their physical separation and output the data 750 ps after receiving the address. The data is latched at the instruction decoder at the start of phase 1. Bits I3..I7 are needed 100 ps before phase 1 and are therefore latched 100 ps early. For these bits the cycle time is 1900 ps, which is still within the 5% margin.
Data Load Cycle
The data load path is 250 ps longer than the instruction fetch path [Maie96]. Therefore, in the case of a load, the data is transferred from the data memory chips to the datapath chips in 2250 ps. This path is shown schematically in Figure 7.19, and the minimum and maximum delay paths are given in Table 7-8.
|DP Driver I/P to O/P|
|DP Driver O/P to DCC I/P|
|DCC I/P to DCC O/P Latch|
|DCC O/P Latch to DCC Driver O/P|
|DCC Driver O/P to DM Address I/P|
|DM I/P to O/P|
|DM O/P to DP I/P|
|DP I/P to DP Latch|
The worst-case margin in the data load case is 23.69%, while the best-case margin is 72.1%.
Data Store Cycle
The data store operation begins with the generation of the data to be stored during the DE stage [Phil93]. This data is held in two pipeline registers until it is transferred to the cache during the D1 stage. The instruction decoder signals the data cache controller by asserting WDC long before the target address is supplied by the datapath in the D1 stage. The data cache controller latches the address from the datapath so that the datapath doesn't have to keep it stable. By the time this address reaches the data memory, the datapath chips have latched the data at their outputs. This data is latched by the data memory on receipt of the DINLTCH signal from the data cache controller. The path is shown schematically in Figure 7.20.
The timings of the store operation closely follow the load timings, as shown in Table 7-9. There is no inherent round trip in this operation, in contrast to the load and fetch cycles, so satisfying the timings is much easier.
|DP Driver I/P to O/P|
|DP Driver O/P to DCC I/P|
|DCC I/P to DCC O/P Latch|
|DCC O/P Latch to DCC Driver O/P|
|DCC Driver O/P to DM Address I/P|
|DP O/P to DM I/P|
System Boot Sequence
The module is initialized by asserting the SYNC and RESET signals from the clock generator. This initializes all the state machines and latches in the instruction decoder, datapath, and cache controller chips. At this point the processor executes the boot routine given in Figure 7.23 to initialize the on-board primary cache. The state diagram in Figure 7.21 depicts the system operation. The boot routine fills up the cache to validate its contents.
When power is applied to the processor and a RESET signal is sent to the instruction decoder, it generates a processor reset interrupt and branches to address 20hex. When this address is requested from the instruction cache controller, it is forced to miss by asserting the INIT signal on the chip from outside. Subsequently, one cache line starting from address 20hex is fetched from memory. The instruction at 20hex is a BRANCH to a LOAD instruction residing at the same cache line address in main memory, so another miss occurs and this line is fetched. From then on, 32 LOADs are issued to validate the data cache. After the data cache is validated, the remaining instruction cache is validated, and then a BRANCH is issued to the place where the real program resides. The instruction decoder needs a pair of LOAD and STORE instructions right after the first BRANCH instruction to settle into normal mode.
There is a fixed relationship between the SYNC signal and the master clock: the SYNC signal must be asserted 100 ps after a clock edge. This synchronizes all the chips by bringing up their clocks in the same phases, as shown in Figure 7.22. By the time RESET is lowered, the IMM bits at the output of the ID go to 0020hex. The condition code remains at 1111. The rest of the signals from the ID stabilize to 0 or 1, except MBYA and MBYB, which remain unknown. ABUS goes to 0020hex within the first cycle. After 2 cycles the CC bits at the ID output go to 1110. When this condition code reaches DP3 after 240 ps, it triggers BRAOUT and BRAOUT2 at its outputs. The branch signal goes from DP3 to the DP0-2, ID, and ICC chips. The ICC receives the branch signal and starts fetching the instruction at the address given by the datapath.
@LABEL LOAD R0 = addr/IOCTRL=3
[481 LOAD instructions]
BRA [OS starting address] /CC=1 /RTN=0
Simple methods of showing the raw speed of the processor have a greater chance of working. Toggling the carry-out bit exercises scanning, loading, fetching, branching, and execution on the processor. The following simple program produces a 10101010... sequence at the DP3-CARRYOUT pin with a frequency of 500 MHz, a quarter of the frequency supplied by the master clock, and is a good indicator of the processor speed. It was assembled and simulated using the asg compiler and the friscsim simulator. The program listing is attached along with the object file. The maximum speed of this bit is 500 MHz, as shown in Figure 7.25.
; Carryout.fs - This program generates a 101010... seq. at DP3-COUT pin suitable to be viewed on the MCM.
; 2/27/97 - Atul
; Line 1,2: Load ffffffff into register R0.
; Line 3 : Branch to the same pc location and execute next 3 instructions (COUT Bit= 0)
; Line 4 : Add 1 to R0 and store the result in R1 (COUT Bit = 1).
; Line 5 : Add 0 to R0 and store the result in R1 (COUT Bit = 0).
; Line 6 : Add 1 to R0 and store the result in R1 (COUT Bit = 1).
; After that the execution goes back to Line 3 (branch instruction).
addi r0 = 0 + 0xffff
@0000 4000FFFF ('h65555555AAAAAAAA) ADDI R0=0 + 0xFFFF /NOAT
addi r0 = r0 + 0xffff /ldh
@0001 4204FFFF ('h65595565AAAAAAAA) ADDI R0=R0 + 0xFFFF /LDH /NOAT
branch _1 pc = pc + 0x0000 /ex=3 /sq=0
@0002 24170000 ('h5965566A55555555) BRANCH _1 PC=PC + 0x0000 /LAT=7
add r1 = r0 + 1
@0003 C2082001 ('hA559559559555556) ADD R1=R0 + 0x01 /NOAT
add r1 = r0 + 0
@0004 C2082000 ('hA559559559555555) ADD R1=R0 + 0x00 /NOAT
add r1 = r0 + 1
@0005 C2082001 ('hA559559559555556) ADD R1=R0 + 0x01 /NOAT
ASSEMBLER SYMBOL TABLE:
|Instruction||Register R0||Register R1|
|ADDI R0=0 + FFFF||0000FFFF||uuuuuuuu|
|ADDI R0=R0 + FFFF||FFFFFFFF||uuuuuuuu|
|ADD R1=R0 + 1||FFFFFFFF||00000000|
|ADD R1=R0 + 0||FFFFFFFF||FFFFFFFF|
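The alternating carry-out sequence in the trace above can be checked with a few lines of Python. Treating the branch slot as producing carry 0 (the branch adds PC + 0x0000) is a modeling assumption made here for illustration.

```python
# Cycle-by-cycle carry-out of the demonstration loop: after R0 holds
# 0xFFFFFFFF, the body repeats BRANCH, ADD+1, ADD+0, ADD+1, so the
# 32-bit carry-out alternates 0,1,0,1,... -- one transition per cycle.

MASK = 0xFFFFFFFF

def carry_out(a, b):
    """Carry from the MSB of a 32-bit add."""
    return 1 if a + b > MASK else 0

r0 = 0xFFFFFFFF
# Operand added each cycle: None marks the branch slot (modeled as
# carry 0, since the branch computes PC + 0x0000), then +1, +0, +1.
loop = [None, 1, 0, 1]
trace = []
for cycle in range(8):
    op = loop[cycle % 4]
    trace.append(0 if op is None else carry_out(r0, op))
print(trace)   # -> [0, 1, 0, 1, 0, 1, 0, 1]
```

At a 1 ns machine cycle this alternation is exactly the 500 MHz square wave observed at the DP3-CARRYOUT pin.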
A comprehensive test scheme was designed to test all the chips and the final package with minimal equipment and manpower. The scheme is based on boundary scan and includes additional at-speed tests. A high-speed 2-GHz clock is supplied either via an on-board deskew chip or via length-controlled clock transmission lines. A boot-up routine and test programs were generated to guide the testing process.