Several challenges are associated with the implementation of the F-RISC/G package across all the design regimes described in chapter 1 - physical, electrical, thermal, thermo-mechanical, and environmental. The low yield of the process and the high power dissipation of the circuit technology, coupled with restrictive layout design rules, limited the monolithic integration level. Wafer-scale integration was not practical because of the overall low yield and the tight timing constraints on the interchip global wires; it was further hampered by the poor heat-spreading and heat-conducting capability of GaAs wafers. Therefore, a multichip package on a fast substrate was the most promising solution.
Another interesting challenge was the logical integration of chips
due to a chronological order in the physical design of the main
architecture chips [Camp97]. Due to different designers working
independently, a number of timing, clocking, and testing issues
were thrown over the wall to be taken care of, later, by the package
designer. All this contributed to make the design of this package
a unique problem. This chapter details the issues encountered
in physical partitioning, interchip timing, global synchronized
clocking, off-chip interconnect, noise containment, power distribution,
heat dissipation, testing, yield management, and external interfacing.
Chapters 5-7 describe the solutions to these problems in the final package design.
The chip partitioning and floorplanning process was done at the start of the project but was repeatedly modified due to several design rule changes and delays in the actual chip design. It nevertheless yielded the following insights, useful for any system:
Reduced design time - One important factor was the small size of the design team. The partitioning had to reduce the total design time and decouple the chips so that multiple chips could be designed in parallel.
As few types of chips as possible - This requirement surfaced from the need to fit all the chips on one reticle owing to the high cost of fabrication. It is cheaper to send the same mask plates to a foundry as many times as needed to obtain more chips in case of a low process yield. This fact provided an upper bound for the total area of all the chips - equal to the maximum available area on a reticle of 19.8 mm x 19.8 mm.
Modularity - The chips should provide functionality for a primary module of the architecture. For example, if a cache controller is included with the cache memory chip it may not be possible later to use the same chip for memory alone in other designs as the cache architecture evolves.
Upgradability - The storage portion of the architecture dissipates a significant amount of power. This was a result of using the same process and circuit technology to design both logic and memory; the memory needed this process to obtain access speeds unavailable in the commercial market. Since commercial memory chips were approaching the speed required for this function, it made sense to keep the memory chips separate so that a lower-power chip could replace them in the future.
Interconnect hierarchy - The partitioning strived to minimize the signals going off-chip as they incur an additional timing penalty of delay through a driver and a receiver pair. An interconnect hierarchy evolved simultaneously with chip-level partitioning, with shorter signals on-chip and longer signals off-chip. On-chip wires used capacitive or distributed RC limited wires while off-chip signals utilized fast transmission lines to reduce the delay between critical components.
These guidelines led to the CPU being partitioned into clearly defined blocks, shown in Figure 4.1. The partitioned system contains one instruction decoder (ID), four 8-bit datapath chips (DP), two cache controller chips (ICC, DCC), sixteen 2-Kb cache chips (IM, DM), and a clock deskew (DSK) chip. These chips are flexible enough for possible future use as building blocks for bigger systems. For example, datapath or cache memory chips can be stacked to provide systems with a wider datapath or a larger-bandwidth cache. A package-level floorplan, based on the chip partitioning, is shown in Figure 4.2, together with the chip placements and the major critical paths.
Figure 4.2: F-RISC/G floorplan, with ID - Instruction Decoder, DP - Datapath, DCC - Data Cache Controller, ICC - Instruction Cache Controller, IM - Instruction Memory, DM - Data Memory, and DSK - Clock Deskew Chip.
Specifying and determining off-chip delays for each and every net was one of the most important tasks, as it required cooperation among the chip and package designers. This was done in an iterative manner as the microarchitecture and the chip placement evolved over the period of the project. After numerous revisions of critical path pad placements, the longest net length was reduced from 10 cm to 5 cm in the floorplan shown in Figure 4.3.
The longest nets are the address and control signals from the ICC and DCC chips to the IM and DM chips, and the control signals from the instruction decoder to the datapath chips. The bounding boxes in Figure 4.3 show the extent of these critical nets. The major speed-critical paths on the MCM are discussed below.
Several other signals crossing the chip boundaries are also critical, but they do not decide the speed of the system by themselves and can be arranged to have different off-chip times. All these paths, along with the specific interchip communication, are discussed in the following sections.
The instruction decoder to datapath coupling is the tightest in the whole system. Any slowing down of the signals in this section will ripple through the whole design. Both the instruction decoder and the datapath chips were designed by a single designer [Phil93], and therefore their relative placement was fixed at an early stage. For example, the instruction decoder and the four datapath chips must be placed in the manner shown in Figure 4.4. They cannot be placed, e.g., in the sequence ID-DP3-DP2-DP1-DP0. This is due to the timing of the carry chain signal, which has to propagate from DP0 to DP3 via DP1 and DP2 in the fastest time possible, and the short time available for CINLSS to go from ID to DP0.
Most of the signals between ID and DP chips traverse the closest edges of the chips, as shown in Figure 4.5, following path number "1". Paths "2" and "3" are about one chip edge longer than path "1" and therefore require more timing margin between the drivers and receivers. Path "4" is the longest and therefore determines the critical path length between the instruction decoder and datapath chips; very few signals use it. The worst-case operation is an addition with operand B equal to $FFFFFFFF and operand A equal to $00000001, which propagates the carry from the LSB to the MSB. In the worst case the ALU operation is followed by a feed-forward of the ALU result to operand B for the next ALU operation. The shifting operation is done by a loopback among the DP chips.
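The worst-case carry behavior described above can be sketched functionally (this is only an illustration of why the DP placement order matters, not the actual F-RISC/G logic; all names are invented):

```python
# Illustrative sketch: a 32-bit add partitioned across four 8-bit
# datapath slices, DP0 (LSB) to DP3 (MSB). The carry out of each slice
# must travel off-chip to the next slice, which is why the linear
# ID-DP0-DP1-DP2-DP3 placement is required.

def slice_add(a8, b8, carry_in):
    """Add two 8-bit operands plus carry; return (sum8, carry_out)."""
    total = a8 + b8 + carry_in
    return total & 0xFF, total >> 8

def dp_chain_add(a32, b32):
    """Ripple the carry through DP0..DP3, mimicking the MCM carry chain."""
    result, carry = 0, 0
    for i in range(4):                      # DP0 holds bits 7..0, DP3 bits 31..24
        a8 = (a32 >> (8 * i)) & 0xFF
        b8 = (b32 >> (8 * i)) & 0xFF
        s8, carry = slice_add(a8, b8, carry)
        result |= s8 << (8 * i)
    return result & 0xFFFFFFFF, carry

# Worst case from the text: B = $FFFFFFFF, A = $00000001 forces the
# carry generated in DP0 to ripple through every slice up to DP3.
print(dp_chain_add(0x00000001, 0xFFFFFFFF))  # (0, 1): sum wraps, carry out set
```

Every one of the three off-chip hops in this loop adds a driver-receiver pair delay, which is what makes path "4" the critical one.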
As shown in Figure 4.6, a lower average distance could be obtained for ID-DP communication with a different pad arrangement. If area array pads had been allowed, this would have been the best way to place and connect these chips. A pie chart showing the major components of the worst-case ID-DP critical path is shown in Figure 4.7.
The communication between the ALU and the memory is the main speed bottleneck for any high-speed computer, and it is worse for a low-integration implementation like F-RISC/G. Figure 4.8 provides a view of the signals between the CPU core and the primary memory.
The distance between the core and the primary cache is large due
to a relatively large number of chips involved. The specifications
called for a cache access cycle time of 2 ns resulting in allocation
of two pipeline stages each for instruction access and data access
[Phil93]. The instruction cache controller keeps a copy of the
remote program counter to issue an instruction every cycle but
needs a new address during a branch and thus the address bus goes
to both instruction and data cache controllers. Each datapath
chip supplies 8 bits of address to the primary data cache. The
cache controllers check to see if the desired address is in the
cache memory and in case of a hit let the primary cache memory
send the data to the core. Each datapath chip also receives 8
bits of data from the data memory. The instruction decoder chip
receives 32 bits of instruction from the instruction memory every
cycle. The SRCB field of the instruction, denoted by bits I3..I7, is received one phase earlier than the other bits.
The cache critical path, shown in Figure 4.9, represents a complete
L1 cache transaction for both instruction and data. In the case
of a regular instruction fetch this path goes from the instruction
cache controller to the instruction cache memory and ends at the
instruction decoder. If it is a branch address then the instruction
fetch path starts from the datapath chips and goes to the instruction
decoder via the instruction cache controller and the instruction
cache memory. In the third case of data load or store this path
goes from the datapath chips to the data cache controller and
ends at the datapath chips themselves via the data cache memory.
The timing information for this path is shown in Table 4-1.
Table 4-1: Available time for cache critical paths

  Transaction Type     Path                 Available Time [ps]
  Incremental Fetch    ICC - IM - ID        1750/2000
  Branch Fetch         DP - ICC - IM - ID   1750/2000
  Load/Store           DP - DCC - DM - DP   2250
The data cache path has 250 ps (one phase) more time available than the instruction cache cycle, owing to the pipelining of load and store addresses.
The timing edges to the instruction decoder chip, datapath chips, and the cache controllers are supplied by the clock deskew chip [Nah94], as shown in Figure 4.10. This chip supplies a separate 2 GHz differential clock to each of the receiver chips and deskews it against variations induced by factors such as changes in dielectric constant due to moisture or changes in clock receiver characteristics due to local temperature, so that the clock edges arrive at the same instant at all the receivers. All the chips receiving the master clock return a copy to the deskew chip. A four-phase generator inside each receiver chip generates the internal four phases: non-overlapping 250 ps pulses with a period of 1000 ps. This scheme puts a number of routing constraints both on-chip and off-chip to minimize the skew and put less strain on the deskew chip. The layout environment between the clock receivers and the four-phase generators on the chips, including the wire length and crossovers, is kept the same to minimize on-chip skew. The off-chip MCM wires are matched in length to minimize off-chip skew.
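The deskew principle can be sketched as a feedback loop: each receiver returns its clock, the controller estimates arrival times, and a per-chip delay is trimmed until all arrivals align. This is only a conceptual model; the actual chip implements the adjustment with analog delay elements [Nah94], and the delay values below are invented:

```python
# Illustrative deskew loop. Each chip's clock arrival is the wire delay
# plus a tunable added delay; the loop raises the added delay of early
# chips until every arrival matches the latest one within one step.

def deskew(one_way_delays_ps, step_ps=5.0, iterations=200):
    """Return per-chip trim delays that align estimated arrival times."""
    added = [0.0] * len(one_way_delays_ps)
    for _ in range(iterations):
        arrivals = [d + a for d, a in zip(one_way_delays_ps, added)]
        target = max(arrivals)               # align everyone to the latest chip
        for i, t in enumerate(arrivals):
            if target - t > step_ps / 2:     # still early: add more delay
                added[i] += step_ps
    return added

wire = [120.0, 180.0, 150.0, 200.0]          # hypothetical one-way delays, ps
trim = deskew(wire)
arrivals = [d + a for d, a in zip(wire, trim)]
print(max(arrivals) - min(arrivals))         # residual skew, below step_ps
```

In the real system the controller can only observe round-trip times, which is why every chip must return a copy of the master clock.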
RESET and SYNC signals are additional signals needed for processor startup and maintaining synchronization between all chips and are generated by the clock deskew chip. The RESET signal goes to the instruction decoder to reset the program counter and flush the pipeline. The SYNC signal is daisy-chained and is distributed to all the receiver chips so that all the four phase generators can come up in the same phase at the same global instant. The relative timing between the master clock and SYNC puts an additional routing constraint in the MCM routing phase.
The design and analysis domains of multi-chip modules - chip/pad design, routing, electrical, and thermal/thermomechanical - can be considered to be linked by the package wiring pitch, or more generally by the interchip distance, as shown in Figure 4.11. The physical design of the package can be done in parallel with chip design if all the pad locations and the interconnections among them are specified. This puts a designer in the routing domain, which deals with the problem of completely routing the MCM. Initially, the interchip distances and wiring pitches are guessed based on available design rules and chip connectivity statistics. For a fixed interchip distance, the chance of completely routing the system decreases as the wire pitch increases: routing becomes harder and routability goes down.
Once a wiring pitch is settled upon, the next step is analysis in the electromagnetic domain: the wiring pitch should not introduce unacceptable noise, and the noise goes down as the wiring pitch increases. Given a sufficient noise margin, the worst-case net is simulated to check the delay specifications in the chip/pad design domain. Driver power may need to be adjusted to obtain satisfactory performance; the required driver voltage depends on the maximum interconnect length, which in turn depends on the wire pitch. If the delays meet the timing specifications, the next step is to analyze the design in the thermal domain; chips may need to be spaced further apart if the junction temperature is excessive for a fixed cooling method. Thus, all the design curves are linked, directly or indirectly, to one parameter - the wire pitch. The diagonally positioned design domains in Figure 4.11 behave in the same manner with respect to the wire pitch: the routing and chip/pad design domains try to push the wire pitch towards the origin, while the electromagnetic and thermal/stress domains try to pull it away. All the design domains must be satisfied simultaneously; if any constraint is not satisfied, another design iteration is attempted.
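The iteration over the four linked domains can be sketched as a simple feasibility sweep over wire pitch. The constraint functions and thresholds below are invented stand-ins for the real routers, field solvers, and thermal models:

```python
# Illustrative sweep of the wire-pitch design iteration: a pitch is
# acceptable only when routing, noise, timing, and thermal checks all
# pass simultaneously. The threshold numbers are assumptions for demo.

def routable(pitch_um):   return pitch_um <= 80    # router can complete
def noise_ok(pitch_um):   return pitch_um >= 40    # coupling within margin
def timing_ok(pitch_um):  return pitch_um <= 100   # worst net meets delay
def thermal_ok(pitch_um): return True              # cooling fixed here

def find_pitch(start_um=10, step_um=5, limit_um=200):
    """Sweep pitch until every design domain is satisfied at once."""
    pitch = start_um
    while pitch <= limit_um:
        if all(f(pitch) for f in (routable, noise_ok, timing_ok, thermal_ok)):
            return pitch
        pitch += step_um
    return None                                    # no feasible pitch exists

print(find_pitch())  # first pitch that satisfies all four domains
```

The opposing inequalities mirror Figure 4.11: routing and timing push the pitch down, noise pushes it up, and the feasible design sits in the overlap.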
Differential signaling between chips was chosen due to its high noise immunity, lower power dissipation than comparable single-ended transmission, and faster propagation due to low swings. A differential signal rejects common-mode noise such as ground bounce and generates very little simultaneous switching noise due to its constant-current circuits. Since the on-chip circuits are also fully differential, the overall noise immunity in the package is very high. The configuration of a point-to-point net with a differential driver connected to a differential receiver via a 50 Ω transmission line is shown in Figure 4.12. The receiver farthest from the driver has 50 Ω resistors connected to ground to serve as pullups for the driver and terminators for the transmission lines. All the package-level nets are either point-to-point or daisy-chained nets with short stubs.
High-density routes with fast rise-time signals require lines wide enough to reduce the skin-depth-limited resistance presented to the high frequencies in such signals. This resistance degrades the rise time of the signal, adding to the propagation delay, and can be reduced by using wider lines. The length of the longest line in the design was estimated to be below 5 cm. Its resistance is shown in Figure 4.13 for varying width and thickness at different frequencies. The line resistance varies with the square root of frequency and is shown again in Figure 4.14 for the 36 µm wide and 4.5 µm thick line used in the final design.
Figure 4.13: Resistance vs. frequency with varying wire width (w) and thickness (T) in a 5 cm long Cu wire.
Figure 4.14: Resistance vs. frequency
for a 5 cm long, 36 µm wide, and 4.5 µm thick Cu wire.
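The square-root-of-frequency behavior can be reproduced with a first-order skin-depth estimate. The one-sided current model below (resistance limited by width times skin depth once the skin depth is thinner than the conductor) is a simplifying assumption, not the model behind Figures 4.13 and 4.14:

```python
# Illustrative skin-effect estimate for the 5 cm, 36 um x 4.5 um Cu line.
import math

RHO_CU = 1.72e-8          # Cu resistivity, ohm*m
MU0 = 4e-7 * math.pi      # vacuum permeability, H/m

def skin_depth(f_hz):
    """Skin depth in metres at frequency f: sqrt(rho / (pi * f * mu0))."""
    return math.sqrt(RHO_CU / (math.pi * f_hz * MU0))

def line_resistance(f_hz, length=0.05, width=36e-6, thickness=4.5e-6):
    """DC resistance at low f; skin-depth-limited once delta < thickness."""
    r_dc = RHO_CU * length / (width * thickness)
    delta = skin_depth(f_hz)
    if delta >= thickness:
        return r_dc
    return RHO_CU * length / (width * delta)   # current crowded into one face

print(round(line_resistance(1e3), 2))    # ~5.31 ohm, essentially DC
print(round(skin_depth(2e9) * 1e6, 2))   # ~1.48 um skin depth at 2 GHz
```

At 2 GHz the skin depth is well below the 4.5 µm plating thickness, so the AC resistance rises roughly as the square root of frequency, consistent with Figure 4.14.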
The wire pitch depends on the level of noise immunity desired in the design, in addition to the usual limits of the manufacturing technology. There are three major types of noise present in a high-speed design - reflection noise, switching noise, and coupling noise. Reflection noise is contained here by controlled-impedance structures with matched terminations. Switching noise is generated by a fast change in current demand caused by the simultaneous switching of output drivers, which produces resistive and inductive voltage drops in the power rails; this noise is an order of magnitude smaller here due to the constant-current characteristics of differential logic. Coupling noise depends on wire size and shape, wire spacing, location of ground, and length of parallel segments. It can be reduced by increasing the trace separation, reducing the distance two traces run in parallel, slowing the rise time of the active signal, or moving the traces closer to a reference plane.
The environmental factors that reduce noise margin in the system are power supply gradients, supply voltage regulation, power supply ripple, and temperature gradients due to unequal power dissipation of different chips, together with gradients in the ambient temperature caused by the warming of the coolant air and by local hot spots. Worst-case noise calculations add the coupled and switching noise together, as they can occur at the same time; reflection noise is excluded because it does not occur at the same time. Coupling noise can be reduced significantly by using a stripline transmission structure.
A proper power distribution scheme is important for several reasons. A reduced supply voltage runs the chips at a slower speed, as shown in Figure 4.15, which plots the oscillation frequency of four different oscillators against supply voltage. The high current requirement of the chips implies a careful distribution of the voltage to all points on the chips to keep the differential voltage noise under control. This noise is caused by a supply voltage gradient between the driver and receiver supply contacts.
Bipolar transistor operation can be seriously affected by temperature [Bell86]. The beta of a transistor decreases with increasing temperature at the same collector current, as shown in Figure 4.16, which slows the circuits down. An increase in junction temperature also decreases the base-emitter forward bias voltage by 2 mV/°C, which increases the voltage swings and the total power. Any circuit is designed with a timing margin to keep it safe from temperature variations and other parameter shifts.
The circuit delays induced by the temperature increase will cut into this safety margin. In a bipolar transistor at thermal equilibrium, the power dissipation is given by

    P = IC · VCE

where IC is the collector current and VCE is the voltage across the collector and emitter. IC is affected by the base-emitter voltage VBE and the collector-base reverse saturation current ICBO. As the junction temperature increases, ICBO increases, which in turn causes IC to increase, thus increasing the power dissipation. This process can result in a shift in the dc operating point of the transistor or, in the worst case, in a thermal runaway condition where IC keeps increasing until the junction overheats and burns out.
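The feedback loop behind thermal runaway can be illustrated numerically. The linear leakage model and every constant below (thermal resistance, temperature coefficient, currents) are assumptions chosen for the demonstration, not device data:

```python
# Illustrative electro-thermal feedback: collector current grows with
# junction temperature, power P = IC * VCE heats the junction through a
# thermal resistance, and the loop either settles or runs away.

def junction_temperature(t_ambient=50.0, vce=2.0, ic0=0.010,
                         k_per_c=0.02, r_th=100.0, t_max=300.0):
    """Iterate Tj = Ta + Rth * P(Tj); return Tj, or None on runaway."""
    tj = t_ambient
    for _ in range(1000):
        ic = ic0 * (1.0 + k_per_c * (tj - 25.0))   # leakage-driven IC growth
        tj_new = t_ambient + r_th * vce * ic       # heating from P = IC * VCE
        if tj_new > t_max:
            return None                            # thermal runaway
        if abs(tj_new - tj) < 1e-6:
            return tj_new                          # stable operating point
        tj = tj_new
    return tj

print(junction_temperature(r_th=100.0) is not None)  # good heatsinking: stable
print(junction_temperature(r_th=3000.0))             # poor heatsinking: None
```

When the loop gain (Rth · VCE · IC0 · k) exceeds one, each pass raises the temperature faster than the previous one - the runaway condition the text describes.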
To model temperature-induced delays, circuits in the critical path were simulated at higher temperatures using temporary device models at different temperatures provided by Rockwell. The circuits simulated were the ALU register file, the L1 cache memory, and the carry chain adder. The speed of circuit operation is shown in Figure 4.17. The maximum delay variation is about 10% from the room-temperature value up to about 70°C. Therefore, a 10% slack on all critical path delays, in conjunction with junction temperatures below 70°C, is considered enough to meet speed.
Chip yield has always been a big issue during the production phase of a process, but in this technology it has been one of the main issues from the very start due to traditionally low yield levels. Comprehensive Known Good Die (KGD) testing is important due to the compounding nature of package yield. The yield of the package is given as

    Ypackage = (Yc)^n · YI · YA · YT
where Yc is the average yield of an individual chip, n is the number of chips on the module, YI is the yield of the module interconnect, YA is the yield of the module assembly, and YT is the yield of the module test. If the interconnect, assembly, and test yields are assumed to be 100% and the yield of an individual chip is assumed to be 90%, the yield of a 23-chip module will be (0.90)^23, or about 8.9%. This is a very low yield, and therefore a lot of emphasis is placed on providing known good die to the foundry.
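The compounding effect is easy to see numerically; the sketch below just evaluates the yield formula above, with symbol names mirroring the text:

```python
# Compound package yield: Ypackage = Yc**n * YI * YA * YT.

def package_yield(yc, n, yi=1.0, ya=1.0, yt=1.0):
    """Package yield from chip yield yc, chip count n, and module yields."""
    return (yc ** n) * yi * ya * yt

# 23 chips at 90% each, perfect interconnect/assembly/test:
print(round(package_yield(0.90, 23) * 100, 1))   # ~8.9 percent
# If KGD screening raises effective chip yield to 99%:
print(round(package_yield(0.99, 23) * 100, 1))   # ~79.4 percent
```

Raising the effective per-chip yield from 90% to 99% (the 99% figure is a hypothetical value for illustration) improves package yield by almost an order of magnitude, which is exactly why KGD testing receives so much emphasis.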
The initial design contained provision for more than a thousand differential signals between the on-board primary cache and an external secondary cache. This restriction is not present in the current version. The package still requires I/O signals for testing and power. The requirements for the test signals are presented in Table 4-2: 280 signals are needed just for testing these chips, apart from the power supply. These signals will be augmented by a few additional signals to confirm the speed of the processor. To reduce the signal I/O count, the testing scheme can be designed to share some of these signals among the various chips. The I/O connectors and power pin-out for this purpose are explained in chapters 5 and 7.
Successful physical testing and verification - both functional and at-speed - is usually the last step before declaring a design successful, but it must be kept in focus from the very start of the design process. Numerous strategies and approaches are suggested in the literature for this purpose, but each new problem requires the designer to think again instead of implementing one of these approaches in isolation. Typically, for a mix-and-match MCM, one usual problem is the non-uniformity of test schemes among the component chips [Abad94]. This problem is mostly absent from this processor, as all the chips were meant to be put on an MCM from the very start and incorporate a uniform test scheme. All four digital chips - instruction decoder, datapath, cache memory, and cache controller - employ boundary scan testing for both on-wafer and on-MCM tests [Phil93].
The physical design of the scheme differs among these chips, though, due to the different designers and different times of implementation. The testing problem in the current context can be defined as the identification of known good die (KGD) for insertion into the package and the definition of subsequent testing methods to ensure a working package. The chip-level scan can identify known good die for insertion on the MCM. When the chips are packaged in an MCM, the same boundary scan logic can be used to test the functionality of the interchip wiring: the outputs of one chip are sent over the MCM wires to the other chips and scanned out. The full testing scheme is described in chapter 8.
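The interconnect test described above can be sketched abstractly: drive a value from one chip's boundary cells, capture it at the receiving chip, and flag nets whose captured value disagrees. The netlist, pad names, and fault below are invented for the demonstration; the real scheme works through serial scan chains per [Phil93]:

```python
# Illustrative boundary-scan interconnect test over MCM wiring.

def interconnect_test(netlist, drive, capture):
    """Report nets whose captured value differs from the driven one.

    netlist: dict mapping net name -> (driver pad, receiver pad)
    drive:   function(pad, bit) simulating scan-in at the driver
    capture: function(pad) simulating scan-out at the receiver
    """
    faults = []
    for bit in (0, 1):                      # walk both logic levels per net
        for net, (src, dst) in netlist.items():
            drive(src, bit)
            if capture(dst) != bit:
                faults.append((net, bit))
    return faults

# Hypothetical three-net example with one broken "carry" net stuck at 0.
wires = {"addr0": ("ICC.p1", "IM.p1"),
         "data0": ("DM.p4", "DP0.p2"),
         "carry": ("DP0.p9", "DP1.p9")}
state = {}
def drive(pad, bit): state[pad] = bit
def capture(pad):
    src = next(s for s, d in wires.values() if d == pad)
    return 0 if pad == "DP1.p9" else state[src]   # fault injected on "carry"

print(interconnect_test(wires, drive, capture))   # [('carry', 1)]
```

Walking both logic levels over every net distinguishes stuck-at faults from opens that happen to read the driven value by chance.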
Many materials and processes are available for fabricating multichip modules, and care is required in selecting materials that are mutually compatible mechanically, thermally, electrically, and chemically [Lica95]. The GE-HDI process was chosen to design and assemble the proposed package. This process is able to place chip edges very close to each other on a substrate and build a copper-polyimide interconnect structure on top of the chips [Daum93][Gdul93][Lica95]. In a typical process, chips are placed inside cavities milled in a substrate and overlaid with alternating insulator and metal layers to form an interconnect structure. The insulator layers are made by applying adhesive on top of the previous layer and laminating a dielectric film. The first insulator layer uses Ultem (a trademark of GE) thermoplastic adhesive and Kapton (a trademark of DuPont) film; Ultem can be softened by heating to 210 °C to remove the overlay for rework in case of a faulty chip. The upper insulator layers are made by applying SPIE (a siloxane-polyimide/epoxy blend) adhesive and Kapton film. The interconnection layers are made by sputtering Ti as a barrier layer followed by sputtered Cu; the Cu is then electroplated to the required thickness and encapsulated by sputtering a further Ti layer.
An advantage of this packaging scheme is the separation of the thermal and electrical paths, which lets a designer work aggressively in both the electrical and thermal domains. The process flow followed in this design is illustrated in Figure 4.18 and Figure 4.19.
There is a lack of efficient MCM design tools in the commercial arena. Most of the available tools are souped-up versions of old PCB design tools and cannot handle the IC-like nature of MCM designs when it comes to dealing with complex schematics and routing. Mentor Graphics Corporation's HybridStation suite of tools was used for the design here. Figure 4.20 shows the design flow.
Librarian is used to create and modify PCB geometry data, catalog files, and mapping files. Package is used to create and modify assignments of logic symbols to physical geometries. Layout is used to place geometries and route traces on PCB designs. Fablink is used to generate manufacturing data, drawings, and reports [Ment93]. AutoTherm, which is 2D only, is used to perform thermal analysis on a specified package. An engineering change order is executed by back-annotating the design in Layout and going back to the schematics.
The F-RISC/G package pushes the state of the art in speed, density, and power dissipation, and it needs careful design in all these domains to achieve a 2-GHz external clock rate and a 1-ns cycle time. One reason for the high power dissipation is the cache memory. If this memory were implemented in a sufficiently fast, lower-power technology, the total package power dissipation could drop to the 60-70 W range. The next chapter describes the final package design.