CHAPTER 4

DESIGN OF F-RISC/G PACKAGE - REQUIREMENTS

Introduction

Several challenges are associated with the implementation of the F-RISC/G package in all the design regimes - physical, electrical, thermal, thermo-mechanical, and environmental - described in chapter 1. The low yield of the process and the high power dissipation of the circuit technology, coupled with restrictive layout design rules, limited the monolithic integration levels. Wafer scale integration was not practical because of the overall low yield and the tight timing constraints on the interchip global wires. It was also hindered by the poor heat spreading and conducting capability of GaAs wafers. Therefore, a multichip package on a fast substrate was the most promising solution.

Another interesting challenge was the logical integration of the chips, which arose from the chronological order in which the main architecture chips were physically designed [Camp97]. Because different designers worked independently, a number of timing, clocking, and testing issues were thrown over the wall to be taken care of later by the package designer. All this made the design of this package a unique problem. This chapter details the issues encountered in physical partitioning, interchip timing, global synchronized clocking, off-chip interconnect, noise containment, power distribution, heat dissipation, testing, yield management, and external interfacing. Chapters 5-7 describe the solutions to these problems in the final package.

Chip Partitioning and Initial Floorplanning

The chip partitioning and floorplanning was done at the start of the project but was repeatedly modified because of several design rule changes and delays in the actual chip design. Nevertheless, it yielded the following insights, which are useful for any system:

Reduced design time - One important factor was the small size of the design team. The partitioning had to reduce the total design time and decouple the chips such that multiple chips can be designed in parallel.

As few types of chips as possible - This requirement arose from the need to fit all the chips on one reticle, owing to the high cost of fabrication. In case of a low process yield, it is cheaper to send the same mask plates to the foundry as many times as needed to obtain more chips. This provided an upper bound for the total area of all the chips, equal to the maximum available area on a reticle: 19.8 mm x 19.8 mm.

Modularity - The chips should provide functionality for a primary module of the architecture. For example, if a cache controller is included with the cache memory chip it may not be possible later to use the same chip for memory alone in other designs as the cache architecture evolves.

Upgradability - The storage portion of the architecture dissipates a significant amount of power. This is a result of using the same process and circuit technology to design both logic and memory. The memory needed the same process to obtain high access speeds that were unavailable in the commercial market. Since commercial memory chips were getting closer to the speed required for this function, it was an obvious choice to keep the memory chips separate so that a low power chip could replace them in the future.

Interconnect hierarchy - The partitioning strove to minimize the signals going off-chip, as they incur an additional timing penalty of the delay through a driver and receiver pair. An interconnect hierarchy evolved simultaneously with chip-level partitioning, with shorter signals on-chip and longer signals off-chip. On-chip signals used capacitive or distributed RC limited wires, while off-chip signals used fast transmission lines to reduce the delay between critical components.

The above guidelines led to the CPU being partitioned into clearly defined blocks, shown in Figure 4.1. The partitioned system contains one instruction decoder (ID), four 8-bit datapath chips (DP), two cache controller chips (ICC, DCC), sixteen 2-Kb cache chips (IM, DM), and a clock deskew (DSK) chip. These chips are flexible enough for possible future use as building blocks for bigger systems. For example, datapath or cache memory chips can be stacked to provide systems with a wider datapath or a larger bandwidth cache. A package level floorplan was made, based on the chip partitioning, as shown in Figure 4.2. The major critical paths are also shown together with the chip placements.

Figure 4.1: Partitioning of F-RISC/G into a modular chip-set.

Figure 4.2 : F-RISC/G floorplan with ID - Instruction Decoder, DP - Datapath, DCC - Data Cache Controller, ICC - Instruction Cache Controller, IM - Instruction Memory, DM - Data Memory, and DSK - Clock Deskew Chip.

Interchip Timing

Specifying and determining off-chip delays for each and every net was one of the most important tasks, as it required cooperation among the chip and package designers. This was done in an iterative manner as the microarchitecture and the chip placement evolved over the period of the project. After numerous revisions of critical path pad placements, the longest net length was reduced from 10 cm to 5 cm in the floorplan shown in Figure 4.3.

Figure 4.3 : F-RISC/G timing constraints.

The longest length nets are the address and control signals from ICC and DCC chips to IM and DM chips, and the control signals from the instruction decoder to the datapath chips. The bounding boxes in Figure 4.3 show the extent of these critical nets. The major speed-critical paths on the MCM are as follows:

  1. ID - DP instruction broadcast
  2. Communication among DP chips for carry propagation and data shifting
  3. Data cache access cycle (2250 ps)
  4. Instruction RAM access cycle (2000 ps)
  5. I3...I7 cache cycle (1750 ps)
  6. Deskew to ID, DP, ICC, DCC clock routing
  7. Distribution of SYNC and RESET signal from the deskew chip
  8. Broadcast of BRANCH from DP3 to other DPs, ID and ICC (250 ps)
  9. MISS Generating Circuitry - Here the cache controller reads an address from the bus and in case of a miss generates the MISS signal for the CPU

Several other signals crossing the chip boundaries are also critical but they do not decide the speed of the system by themselves and can be arranged to have different off-chip times. All these paths along with specific interchip-communication are discussed in the following sections.

Instruction Decoder to Datapath Communication

Instruction decoder to datapath coupling is the tightest in the whole system. Any slowing down of the signals in this section will ripple through the whole design. Both the instruction decoder and the datapath chips were designed by a single designer [Phil93], and therefore their relative placement was fixed at an earlier stage. For example, the instruction decoder and the four datapath chips have to be placed in the manner shown in Figure 4.4. They cannot be placed, e.g., in the sequence ID-DP3-DP2-DP1-DP0. This is due to the timing of the carry chain signal, which has to propagate from DP0 to DP3 via DP1 and DP2 in the fastest time possible, and the short time required for CINLSS to go from ID to DP0.

Figure 4.4 : Relative placement of ID and DP.

Most of the signals between the ID and DP chips traverse the closest edges of the chips, following path number "1" in Figure 4.5. Paths "2" and "3" are about one chip edge longer than path "1" and therefore require more timing margin between the drivers and receivers. Path "4" is the longest path and therefore determines the critical path length between the instruction decoder and datapath chips. There are very few signals on this path. The worst case operation is an addition with operand B equal to $FFFFFFFF and operand A equal to $00000001. This sends the carry information from the LSB to the MSB. In the worst case the ALU operation is followed by a feed forward of the ALU result to operand B for the next ALU operation. The shifting operation is done by a loopback among the DP chips.

Figure 4.5: Signals between ID and DP.
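The worst case addition described above can be illustrated with a small behavioral sketch. It models the four 8-bit datapath slices as a ripple of carries from slice 0 (DP0) to slice 3 (DP3); the function name and the slice-by-slice ripple model are illustrative assumptions and say nothing about the actual adder circuit inside each chip.

```python
def sliced_add(a, b, carry_in=0, slices=4, width=8):
    """Add two words slice by slice, rippling the carry upward from slice 0.

    Returns the truncated sum and the carry leaving each slice.
    Hypothetical behavioral model, not the F-RISC/G adder implementation.
    """
    mask = (1 << width) - 1
    carry = carry_in
    result = 0
    carries = []
    for i in range(slices):
        a_i = (a >> (i * width)) & mask   # operand A bits handled by slice i
        b_i = (b >> (i * width)) & mask   # operand B bits handled by slice i
        total = a_i + b_i + carry
        result |= (total & mask) << (i * width)
        carry = total >> width            # carry handed to the next slice
        carries.append(carry)
    return result, carries


# Worst case from the text: A = $00000001, B = $FFFFFFFF
result, carries = sliced_add(0x00000001, 0xFFFFFFFF)
print(hex(result), carries)
```

For these operands every slice produces a carry, so the carry generated in the least significant slice has to cross all three chip boundaries before the most significant slice can settle.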

As shown in Figure 4.6, a lower average distance can be obtained for ID-DP communication with a different pad arrangement. If area array pads had been allowed this would have been the best way to place and connect these chips. A pie chart showing the major components of the worst case ID-DP critical path is shown in Figure 4.7.

Figure 4.6 : Stronger coupling between ID and DP.

Figure 4.7: Major components of ID-DP broadcast.

CPU to Memory Communication

The communication between the ALU and the memory is the main speed bottleneck for any high-speed computer and is worse for a low-integration implementation like F-RISC/G. Figure 4.8 provides a view of the signals between the CPU core and the primary memory.

Figure 4.8 : CPU to memory communication.

The distance between the core and the primary cache is large because of the relatively large number of chips involved. The specifications called for a cache access cycle time of 2 ns, resulting in the allocation of two pipeline stages each for instruction access and data access [Phil93]. The instruction cache controller keeps a copy of the remote program counter to issue an instruction every cycle but needs a new address during a branch, and thus the address bus goes to both the instruction and data cache controllers. Each datapath chip supplies 8 bits of address to the primary data cache. The cache controllers check whether the desired address is in the cache memory and, in case of a hit, let the primary cache memory send the data to the core. Each datapath chip also receives 8 bits of data from the data memory. The instruction decoder chip receives 32 bits of instruction from the instruction memory every cycle. The SRCB field of the instruction, denoted by the I3..I7 bits, is received one phase earlier than the other bits.

Primary Cache Communication

The cache critical path, shown in Figure 4.9, represents a complete L1 cache transaction for both instruction and data. In the case of a regular instruction fetch this path goes from the instruction cache controller to the instruction cache memory and ends at the instruction decoder. If it is a branch address then the instruction fetch path starts from the datapath chips and goes to the instruction decoder via the instruction cache controller and the instruction cache memory. In the third case of data load or store this path goes from the datapath chips to the data cache controller and ends at the datapath chips themselves via the data cache memory. The timing information for this path is shown in Table 4-1.

Table 4-1 : Cache critical paths

Transaction Type     Path                  Available Time [ps]
Incremental Fetch    ICC - IM - ID         1750/2000
Branch Fetch         DP - ICC - IM - ID    1750/2000
Load/Store           DP - DCC - DM - DP    2250

The data cache path has 250 ps (one phase) longer available to it than the instruction cache cycle owing to the pipelining of load and store addresses.
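The available times above are all whole multiples of the 250 ps phase. The short sketch below converts each budget into a phase count. It assumes, based on the list of critical paths and the remark that the I3..I7 bits arrive one phase early, that the 1750 ps figure applies to the I3..I7 bits and the 2000 ps figure to the remaining instruction bits; the dictionary keys and function name are illustrative.

```python
PHASE_PS = 250  # duration of one clock phase, as stated in the text

# Available times from Table 4-1 (labels are an interpretation, see lead-in)
budgets_ps = {
    "instruction fetch, I3..I7 bits": 1750,
    "instruction fetch, other bits": 2000,
    "load/store": 2250,
}


def phases(budget_ps, phase_ps=PHASE_PS):
    """Return the number of whole phases in a timing budget."""
    whole, remainder = divmod(budget_ps, phase_ps)
    if remainder:
        raise ValueError(f"{budget_ps} ps is not a whole number of phases")
    return whole


for name, budget in budgets_ps.items():
    print(f"{name}: {budget} ps = {phases(budget)} phases")
```

Expressed this way, the data cache path has one phase more than the full instruction fetch and two phases more than the early I3..I7 bits.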

Figure 4.9 : Memory critical path.

Clock Synchronization

The timing edges to the instruction decoder chip, datapath chips, and the cache controllers are supplied by the clock deskew chip [Nah94] as shown in Figure 4.10. This chip supplies a separate 2 GHz differential clock to each of the receiver chips and deskews it against variations induced by factors such as changes in dielectric constant due to moisture or changes in clock receiver characteristics due to local temperature, so that the clock edges arrive at the same instant at all the receivers. All the chips receiving the master clock return a copy to the deskew chip. A four phase generator inside each receiver chip generates the four internal phases. These phases are non-overlapping 250 ps pulses with a period of 1000 ps. This scheme imposes a number of routing constraints, both on-chip and off-chip, to minimize the skew and put less strain on the deskew chip. The layout environment between the clock receivers and the four phase generators on the chips, including the wire length and crossovers, is kept the same to minimize on-chip skew. The off-chip MCM wires are matched in length to minimize off-chip skew.
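The phase timing described above can be tabulated in a few lines. The sketch assumes that the four phases tile the 1000 ps cycle back to back in order; the actual four phase generator circuit is not described here, and all names are illustrative.

```python
MASTER_CLOCK_HZ = 2e9   # external differential clock from the deskew chip
CYCLE_PS = 1000         # period of each internal phase
N_PHASES = 4


def phase_windows(cycle_ps=CYCLE_PS, n_phases=N_PHASES):
    """Return (start, end) times in ps for each phase within one cycle."""
    width = cycle_ps // n_phases
    return [(k * width, (k + 1) * width) for k in range(n_phases)]


def non_overlapping(windows):
    """True if no window starts before the previous one has ended."""
    ordered = sorted(windows)
    return all(prev[1] <= nxt[0] for prev, nxt in zip(ordered, ordered[1:]))


master_period_ps = 1e12 / MASTER_CLOCK_HZ
windows = phase_windows()
print("master clock period:", master_period_ps, "ps")
print("phase windows:", windows, "non-overlapping:", non_overlapping(windows))
```

Under this assumption each 250 ps phase is as long as half a period of the 2 GHz master clock, and one processor cycle spans two master clock periods.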

RESET and SYNC signals are additional signals needed for processor startup and maintaining synchronization between all chips and are generated by the clock deskew chip. The RESET signal goes to the instruction decoder to reset the program counter and flush the pipeline. The SYNC signal is daisy-chained and is distributed to all the receiver chips so that all the four phase generators can come up in the same phase at the same global instant. The relative timing between the master clock and SYNC puts an additional routing constraint in the MCM routing phase.

Figure 4.10 : Clock distribution scheme.

Overview of Package Design Methodology

The design and analysis domains of multi-chip modules - chip/pad design, routing, electrical, and thermal/thermomechanical - can be considered to be linked by package wiring pitch or, more generally, by interchip distance, as shown in Figure 4.11. The physical design of the package can be done in parallel with chip design if all the pad locations and the interconnections among them are specified. This puts a designer in the routing domain, which deals with the problem of completely routing the MCM. Initially the interchip distances and wiring pitches are guessed based on available design rules and chip connectivity statistics. For a fixed interchip distance, the chances of completely routing a system decrease as the wire pitch increases. Therefore, as a general rule, a larger wire pitch makes routing harder and lowers routability.

Once a wiring pitch is settled upon, the next step is the analysis in the electromagnetic domain. The wiring pitch should not introduce unwanted noise. As the wiring pitch increases the noise goes down. In case of a sufficient noise margin the worst case net is simulated to check the delay specifications in the chip/pad design domain. Driver power may need to be adjusted to get a satisfactory performance. The driver voltage will depend on the maximum interconnect length. This length depends directly upon the wire pitch. If the delays meet the timing specifications the next step is to analyze the design in the thermal domain. Chips may need to be spaced further apart in case of excessive junction temperature based on a fixed cooling method. Thus, all the design curves are linked directly or indirectly to one parameter - wire pitch. The diagonally positioned design domains behave in the same manner with respect to the wire pitch. The routing and chip/pad design domain try to push the wire pitch towards the origin, as shown in Figure 4.11, while the electromagnetic and thermal/stress domain try to pull it away from the origin. All the design domains must be satisfied simultaneously. If any of the design constraints are not satisfied, another design iteration is attempted.

Figure 4.11: MCM design spaces [adapted from Loy94].

Off-Chip Communication

Differential signaling between chips was chosen for its high noise immunity, lower power dissipation than comparable single ended transmission, and faster propagation due to low swings. A differential signal rejects common mode noise such as ground bounce and has very low simultaneous switching noise because of its constant current circuits. Since the on-chip circuits are also fully differential, overall noise immunity in the package is very high. The configuration of a point-to-point net with a differential driver connected to a differential receiver via a 50 Ω transmission line is shown in Figure 4.12. The receiver farthest from the driver has 50 Ω resistors connected to ground to serve as pullups for the driver and terminators for the transmission lines. All the package level nets are either point-to-point or daisy chained nets with short stubs.

Figure 4.12: Schematic of differential off-chip communication.

Skin-effect limited transmission

High-density routes with fast rise time signals require lines wide enough to reduce the skin-effect limited resistance presented to the high frequencies in such signals. This resistance degrades the rise time of the signal, adding to the propagation delay, and can be reduced by using wider lines. The length of the longest line in the design was estimated to be below 5 cm. Its resistance is shown in Figure 4.13 for varying width and thickness at different frequencies. The line resistance varies with the square root of frequency and is shown again in Figure 4.14 for the 36 µm wide and 4.5 µm thick line used in the final design.

Figure 4.13: Resistance vs. frequency with varying wire width (w) and thickness (T) in a 5 cm long Cu wire.

Figure 4.14: Resistance vs. frequency for a 5 cm long, 36 µm wide, and 4.5 µm thick Cu wire.
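The trend in these figures can be approximated with a rough calculation. The sketch below assumes a room-temperature bulk copper resistivity of 1.68e-8 Ω·m and a crude model in which the current is confined to a shell one skin depth deep around the conductor perimeter; neither the resistivity value nor the shell model is taken from this work, and the function names are illustrative. It is meant to show the scaling, not to reproduce the plotted curves.

```python
import math

RHO_CU = 1.68e-8        # ohm*m, assumed bulk copper resistivity
MU_0 = 4e-7 * math.pi   # H/m, permeability of free space


def skin_depth(freq_hz, rho=RHO_CU):
    """Skin depth in metres for a non-magnetic conductor."""
    return math.sqrt(rho / (math.pi * freq_hz * MU_0))


def dc_resistance(length_m, width_m, thickness_m, rho=RHO_CU):
    """Resistance of a rectangular wire with uniform current density."""
    return rho * length_m / (width_m * thickness_m)


def ac_resistance(freq_hz, length_m, width_m, thickness_m, rho=RHO_CU):
    """Crude estimate: current flows in a perimeter shell one skin depth deep.

    Falls back to the DC value when the shell would fill the cross-section.
    """
    delta = skin_depth(freq_hz, rho)
    if 2 * delta >= min(width_m, thickness_m):
        return dc_resistance(length_m, width_m, thickness_m, rho)
    core = (width_m - 2 * delta) * (thickness_m - 2 * delta)
    shell_area = width_m * thickness_m - core
    return rho * length_m / shell_area


# Line from the final design: 5 cm long, 36 um wide, 4.5 um thick
L, W, T = 0.05, 36e-6, 4.5e-6
print(f"DC: {dc_resistance(L, W, T):.2f} ohm")
for f in (1e9, 2e9, 5e9, 10e9):
    print(f"{f / 1e9:4.0f} GHz: delta = {skin_depth(f) * 1e6:.2f} um, "
          f"R = {ac_resistance(f, L, W, T):.2f} ohm")
```

In this model the resistance stays at its DC value until twice the skin depth drops below the 4.5 µm thickness, and at high frequencies it approaches the square-root dependence on frequency noted above.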

Noise containment

The wire pitch depends on the level of noise immunity desired in the design, in addition to the usual limits of manufacturing technology. There are three major types of noise present in a high-speed design - reflection noise, switching noise, and coupling noise. Reflection noise is contained here by controlled impedance structures with matched terminations. Switching noise is generated by a fast change in current requirement caused by the simultaneous switching of output drivers, which causes resistive and inductive voltage drops in the power rails. This noise is an order of magnitude smaller here because of the constant current characteristics of differential logic. Coupling noise depends on wire size and shape, wire spacing, location of ground, and length of parallel segments. It can be reduced by increasing the trace separation, reducing the distance two traces run in parallel, slowing the rise time of the active signal, or moving the traces closer to a reference plane.

The environmental factors that reduce noise margin in the system are power supply gradients, supply voltage regulation, power supply ripple, and temperature gradients due to unequal power dissipation of different chips and gradient in the ambient temperature due to rise in temperature of coolant air and local hot spots. Worst case noise calculations add the coupled and the switching noise together as they can occur at the same time. Reflection noise doesn't add to the worst case noise calculations as it doesn't occur at the same time. Coupling noise can be reduced significantly by using a stripline transmission structure.
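The role of matched terminations in containing reflection noise can be made concrete with the standard reflection coefficient of a terminated transmission line, (Z_L - Z_0) / (Z_L + Z_0). The sketch below uses the 50 Ω line impedance of the off-chip nets; the 10% resistor tolerance is a hypothetical value chosen only for illustration.

```python
def reflection_coefficient(z_load, z_line):
    """Fraction of the incident voltage wave reflected at the termination."""
    return (z_load - z_line) / (z_load + z_line)


Z0 = 50.0          # ohm, line impedance of the off-chip nets
TOLERANCE = 0.10   # hypothetical termination resistor tolerance

for z_load in (Z0 * (1 - TOLERANCE), Z0, Z0 * (1 + TOLERANCE)):
    gamma = reflection_coefficient(z_load, Z0)
    print(f"Z_L = {z_load:5.1f} ohm -> reflection coefficient = {gamma:+.3f}")
```

A perfectly matched termination reflects nothing, and under this assumed tolerance the reflected wave stays at roughly 5% of the incident swing.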

Power Distribution

A proper power distribution scheme is important for several reasons. A reduced supply voltage runs the chip at a slower speed, as shown in Figure 4.15, which plots the variation in the oscillating frequency of four different oscillators with supply voltage. The high current requirement of the chips calls for careful distribution of the voltage to all points on the chips to keep the differential voltage noise under control. This noise is caused by a supply voltage gradient between the driver and receiver supply contacts.

Figure 4.15 : Variation of delay times with respect to supply voltage.

Heat Dissipation

Bipolar transistor operation can be seriously affected by temperature [Bell86]. The beta of the transistors decreases with increasing temperature at the same collector current, as shown in Figure 4.16, which slows down the circuits. The increase in junction temperature also decreases the base-emitter forward bias voltage by 2 mV/°C. This will increase the voltage swings and the total power. Any circuit is designed with a timing margin to keep it safe from variations in temperature and other parameters.

Figure 4.16 : Beta vs. IC curve for a typical HBT transistor.

The circuit delays induced due to the temperature increase will cut into this safety margin. In a bipolar transistor, at thermal equilibrium, the power dissipation is given by

$$P = I_C V_{CE}$$   [4.1]

where $I_C$ is the collector current and $V_{CE}$ is the voltage across the collector and emitter. $I_C$ is affected by the base-emitter voltage $V_{BE}$ and the collector-base reverse saturation current $I_{CBO}$. As the junction temperature increases, $I_{CBO}$ increases, which in turn causes $I_C$ to increase, thus increasing the power dissipation. This process can result in a shift in the dc operating point of the transistor or, in the worst case, in a thermal runaway condition where $I_C$ keeps increasing until the junction overheats and burns out.

To model temperature induced delays, circuits in the critical path were modeled and simulated at higher temperatures using temporary models at different temperatures provided by Rockwell. The circuits simulated were the ALU register file, the L1 cache memory, and the carry chain adder. The speed of the circuit operation is shown in Figure 4.17. The maximum delay variation is about 10% from the room temperature value up to about 70°C. Therefore, a 10% slack on all critical path delays, together with junction temperatures below 70°C, is considered sufficient to meet the speed target.

Figure 4.17: Delay vs. temperature for the critical circuits on the package.

Yield Management

Chip yield is always a major issue during the production phase of a product, but in this technology it has been one of the main issues from the very start because of traditionally low yield levels. Comprehensive Known Good Die (KGD) testing is important because of the compounding nature of package yield. The yield of the package is given as

$$Y = Y_c^{\,n} \, Y_I \, Y_A \, Y_T$$   [4.2]

where $Y_c$ is the average yield of an individual chip, $n$ is the number of chips on the module, $Y_I$ is the yield of the module interconnect, $Y_A$ is the yield of the module assembly, and $Y_T$ is the yield of the module test. If the interconnect, assembly, and test yields are assumed to be 100% and the yield of an individual chip is assumed to be 90%, the yield of a 23-chip module will be $(0.90)^{23}$, or about 8.9%. This is a very low yield, and therefore a lot of emphasis is placed on providing known good die to the foundry.
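The compounding effect in the yield expression is easy to explore numerically. The sketch below evaluates it for the 23-chip, 90% case and also inverts it to find the per-chip yield needed for a given module yield; the function names and the 50% target are illustrative assumptions.

```python
def module_yield(chip_yield, n_chips, y_interconnect=1.0,
                 y_assembly=1.0, y_test=1.0):
    """Package yield: chip yield compounded over n chips, times the other yields."""
    return chip_yield ** n_chips * y_interconnect * y_assembly * y_test


def required_chip_yield(target_module_yield, n_chips):
    """Per-chip yield needed to reach a target, with other yields at 100%."""
    return target_module_yield ** (1.0 / n_chips)


print(f"23 chips at 90%: {module_yield(0.90, 23):.3f}")
print(f"chip yield for a 50% module yield: {required_chip_yield(0.50, 23):.3f}")
```

With 23 chips at 90% each the expression evaluates to roughly 0.089, and a 50% module yield would need about 97% of the inserted die to be good, which is why pre-insertion testing matters so much.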

Pin-out for Signal I/Os and Power

The initial design contained provision for more than a thousand differential signals between the on-board primary cache and the external secondary cache. This restriction is not present in the current version. The package still requires I/O signals for testing and power. The requirements for the test signals are presented in Table 4-2. There are 280 signals needed just for testing these chips, in addition to the power supply. These signals will be augmented by a few additional signals to confirm the speed of the processor. To reduce the signal I/O count, the testing scheme can be designed to share some of these signals among various chips. The I/O connectors and power pin-out for this purpose are explained in chapters 5 and 7.

Table 4-2: Signals required for testing the chips.

Chip      Number on Package    Number of Test Signals    Total I/O
ID        1                    12                        12
DP        4                    12                        48
CC        2                    12                        24
CR        16                   12                        192
Deskew    1                    4                         4
Total     24                                             280

Testing Issues

Successful physical testing and verification - both functional and at-speed - is usually the last step before declaring the design successful, but it has to be kept in focus from the very start of the design process. Numerous strategies and approaches are suggested in the literature for this purpose, but each new problem requires the designer to think again instead of implementing one of these approaches in isolation. Typically, for a mix-n-match MCM, one common problem is the non-uniformity of test schemes in the component chips [Abad94]. This problem is mostly absent from this processor, as all the chips were meant to be put on an MCM from the very start and incorporate a uniform test scheme. All four digital chips - instruction decoder, datapath, cache memory, and cache controller - employ boundary scan testing for both on-wafer and on-MCM tests [Phil93].

The physical design of the scheme differs among these chips, though, because they were implemented by different designers at different times. The testing problem in the current context can be defined as the identification of known good dies (KGD) for insertion into the package and the definition of subsequent testing methods to ensure a working package. The chip level scan can identify known good dies for insertion on the MCM. When the chips are packaged in an MCM, the same boundary scan logic can be used to test the functionality of the inter-chip wiring. The outputs of one chip are sent over the MCM wires to the other chips and scanned out. The full testing scheme is described in chapter 8.

Packaging Technology

Many materials and processes are available for fabricating multichip modules. Care is required in selecting materials that are mutually compatible mechanically, thermally, electrically, and chemically [Lica95]. The GE-HDI process was chosen to design and assemble the proposed package. The GE-HDI process is able to place chip edges very close to each other on a substrate and build a copper-polyimide interconnect structure on top of the chips [Daum93][Gdul93][Lica95]. In a typical process, chips are placed inside cavities milled in a substrate and overlaid with alternating insulator and metal layers to form an interconnect structure. The insulator layers are made by applying adhesive on top of the previous layer and laminating a dielectric film. The first insulator layer uses Ultem (a trademark of GE) thermoplastic adhesive and KaptonE (a trademark of DuPont) film. Ultem can be softened by heating to 210 °C to remove the overlay for rework in case of a faulty chip. The upper insulator layers are made by applying SPIE (a siloxane-polyimide/epoxy blend) adhesive and KaptonE film. The interconnection layers are made by sputtering Ti as a barrier layer followed by sputtered Cu. After that, Cu is electroplated to the required thickness. This copper is further encapsulated by sputtering a Ti layer.

An advantage of this packaging scheme is the separation of the thermal and electrical paths, which lets a designer design aggressively in both the electrical and thermal domains. The process flow followed in this design is illustrated in Figure 4.18 and Figure 4.19.

Figure 4.18 : Process flow.

Figure 4.19 : Process flow (Contd.)

CAD Tools

There is a lack of efficient MCM design tools in the commercial arena. Most of the available tools are souped-up versions of old PCB design tools and do not reflect the IC-like nature of MCM designs when it comes to dealing with complex schematics and routing. Mentor Graphics Corporation's HybridStation suite of tools is used for the design here. Figure 4.20 provides the design flow.

Figure 4.20 : Package CAD flow with the tool names in parentheses.

Librarian is used to create and modify PCB geometry data, catalog files, and mapping files. Package is used to create and modify assignments of logic symbols to physical geometries. Layout is used to place geometries and route traces on PCB designs. Fablink is used to generate manufacturing data, drawings, and reports [Ment93]. AutoTherm is used to perform thermal analysis on a specified package. AutoTherm is 2D only. Engineering change order is executed by back-annotating the design in Layout and going back to the schematics.

Summary

The F-RISC/G package pushes the state of the art in speed, density, and power dissipation and needs careful design in all these domains to achieve a processor with a 2 GHz external clock rate and a 1 ns cycle time. One reason for the high power dissipation is the cache memory. If this memory were implemented in a sufficiently high-speed, low-power technology, the total package power dissipation could drop to the 60-70 W range. The next chapter describes the final package design.