CHAPTER 1

INTRODUCTION AND REVIEW

Introduction

The demand for greater processing power is higher than ever as we celebrate the 50th anniversaries of the invention of the transistor [Moor97] and of ENIAC, the first general-purpose digital computer, and the 25th anniversary of the Intel 4004, the first microprocessor [Patt96]. Processors have doubled in performance every 18 months [Tred96], and researchers are looking at heterojunction-based technologies to reap even greater performance [Jala95]. A number of heterostructure-based high-speed device technologies are available to the circuit designer today [Kroe82] in the Indium Phosphide (InP), Gallium Arsenide (GaAs), and Silicon-Germanium (SiGe) material systems. These technologies promise switching speeds an order of magnitude higher than those of current MOS and bipolar technologies [Kroe82]. The fabrication processes of a few of these technologies have matured enough to provide moderately large-scale integration capability [Deyh95]. The transit-time frequency, ft, is a key parameter for evaluating these devices; values as high as 350 GHz have been observed for certain MESFET, HEMT, and HBT devices [Ali91]. The availability of dense packaging methods makes these technologies attractive for multichip implementation of complex circuits [Greu88].

One promising heterostructure technology is the advanced GaAs heterojunction bipolar transistor (HBT) technology [Kroe57][Asbe91]. The low yield and high power dissipation of HBT circuits limit the number of gates that can be integrated on a chip. A high-speed RISC engine [Ditz80] can nevertheless be realized by densely packing several such chips on a fast package. The Fast Reduced Instruction Set Computer (F-RISC/G) project at Rensselaer has applied this idea to develop a 32-bit processor with a 1-ns cycle time in a 50-GHz GaAs/AlGaAs HBT technology [Phil93]. The processor is partitioned into twenty-four chips, with five custom chips in the chipset.

One reason to push for a faster clock rate, rather than stringing together a number of processors in parallel as in a massively parallel processor (MPP), to reach a desired peak performance is that average applications do not speed up proportionately, owing to the inherently serial nature of an average code stream. Parallel processors are also unable to match the bandwidth of the memory network to the data bandwidth required by the processing core. The cost effectiveness of assembling thousands of slow processors together with gigabytes of memory is also low because of the development time and the low volume of machines shipped. A case in point is the $50 million Intel/Sandia MPP, which has 9624 Pentium Pro processors, each running at 200 MHz, with 500 GB of primary memory [Clar97] and a peak performance of 1.8 teraflops. The actual performance of such MPPs is, in some cases, as low as 15% of the peak level [Clar97]. Moreover, a higher clock rate processor will always increase the total performance of these machines.

Why HBT?

The potential advantages of a bipolar transistor with a wide-gap emitter over a homojunction transistor have been known for a long time [Shoc48][Kroe57]. The gain of a bipolar transistor can be increased either by doping the emitter more heavily relative to the base or by introducing a band-gap difference between the base and emitter. Doping levels are usually already pushed to their limit, but heterostructure transistors offer the additional opportunity to engineer the bandgap [Kroe82]. This, in turn, permits a re-optimization of doping levels and geometries, leading to higher speed devices. The tradeoff between breakdown voltage and transit-time frequency is given by an empirical relation, the Johnson limit [John65], which states that the product of ft and the breakdown voltage for bipolar devices follows

$$f_t \cdot BV_{ceo} \le 300\ \text{GHz} \cdot \text{V} \qquad (1.1)$$

Heterostructure devices can circumvent the above inequality, thereby permitting even higher ft values. The most popular material systems for HBTs are those with closely matched lattice constants, such as AlGaAs/GaAs and InGaAs/InP. The development of molecular beam epitaxy (MBE) and metal-organic chemical vapor deposition (MOCVD) techniques in the 1970s enabled the fabrication of high-quality semiconductor heterostructures [Chan94]. Many other growth techniques, including low-temperature epitaxial growth, are currently being developed to facilitate the fabrication of high-quality semiconductor heterostructure devices.

Another advantage of these devices is their speed in spite of their size. As refractive-lens-based optical lithography systems approach their limit of 193 nm resolution, an increasing amount of resources will be required to reach smaller dimensions in the extreme ultraviolet (EUV) and vacuum ultraviolet (VUV) range [Boko97]. This will not stop the shrinking of HBTs, though, since they are a few steps behind the state of the art in lithography. Future process shrinks, both lateral and vertical, will produce even higher speeds at progressively lower power. Figure 1.1 illustrates how the maximum ft stays constant as the emitter is scaled down in a typical HBT technology, along with a reduction in power.

Figure 1.1: Ft characteristics of typical HBTs with geometry scaling.

Problems

Interconnect loading on the gates tends to erode this performance. The performance of high-speed circuits depends on the interconnect parasitics, in addition to the unloaded performance of the switching device [Bako90] and the degree of integration possible with these technologies. Interconnect delays have not scaled down proportionately with device switching speeds. Under ideal scaling the RC delay remains constant [Bako90], and in fact the delay on the global lines increases because of the increase in chip size. Current trends in technology indicate the increasing dominance of interconnects, both on-chip and off-chip, in determining system performance [Mein96]. The interconnect delay occupies as much as 50% of the available cycle time of a complex processor, which makes accurate prediction of interconnect parasitics of paramount importance [Edel95]. Thus, a proper interconnect engineering approach is required in designing high-performance chips.

Existing CAD tools for extracting on-chip wiring parasitics are not sufficiently accurate because of their poor coverage of the 3-D effects of nearby multilayer conductor geometries and non-planar processes. This is even more evident in GaAs interconnects, which lack a nearby ground because of the semi-insulating nature of the substrate. Thus, 3-D field solvers are used to compute these parasitics. Most of these 3-D field solvers [Oea94] are mesh based and require extensive computing resources for a small area on the chip. The presence of several dielectric layers in a typical GaAs technology requires a smaller mesh size, which makes extraction of interconnect parasitics even more computationally intensive. In practice, these solvers are used only on sections of critical nets, and the results are used to generate heuristics which are then plugged into conventional 2-D extractors. Dense layouts and high power densities also affect the device behavior through self-heating and proximity heating effects [Anho95] and high-frequency coupling between adjacent devices [Najm96]. The traditional measurements of isolated devices and interconnect test structures are inadequate for predicting the circuit behavior in such conditions and need to be supplemented by high-density structures mimicking real circuit conditions. Therefore, experimental verification of the results of these calculations is very important.

Thin-film multichip packages [Lica95] can provide the bandwidth and short interconnect needed to leverage a high-speed technology, such as AlGaAs/GaAs, for multi-GHz machines. The same packaging technology can also be applied to integrate both bipolar and CMOS chips, permitting each type of technology to be used optimally. The next few sections illustrate the effect of the interconnect characterization problem and provide an overview of the F-RISC processor, identifying the constraints it puts on the package.

Importance of Accurate Interconnect Characterization

High-speed digital designs emphasize the behavior of passive circuit elements [John93]. As switching speeds increase, the limited bandwidth of a wire adds a rise-time delay to a signal's intrinsic propagation delay. Circuit performance is dominated by interconnect delays; if these delays can be estimated accurately, the designer can allocate power optimally and control on-chip skew, resulting in faster circuits with lower power dissipation. While interconnect capacitance has a smaller effect on bipolar circuit performance than on CMOS, it is by no means negligible. On the other hand, bipolar circuits suffer from interconnect resistance much more than CMOS circuits do because of their lower output impedance. Line inductance is beginning to become a factor with long, narrow global signal lines and faster switching signals such as master clocks [Lipm97]. Coupling effects, traditionally not a source of concern, have become prominent with increasing aspect ratios [Zhan96], and the increased number of interconnects has made them more pronounced. In GaAs most of the field lines terminate on other wires, increasing the coupling. Until recently, digital simulation tools did not take coupling into account.

An accurate characterization of the fabrication process is even more important with an experimental technology such as GaAs HBTs. Because the manufacturing technology is still evolving, the process parameters do not adhere to the specifications given to the circuit designers. This, in turn, shows up as unpredicted circuit performance. Therefore, a method is required to predict the process parameters from a set of test structures and to design the circuits with those parameter variations taken into consideration.

The effect of slight inaccuracies in delay at the interconnect modeling level manifests itself at the system modeling level where, in modern designs, a multitude of these wires are involved. One direct effect of modeling uncertainty is that the system clock margin has to be increased, slowing the system down. Even slowing down the system may not guarantee safe operation, because the skew present in some signals can violate the logical data flow. This effect can be especially crippling in wave-pipelined systems, which get their speed advantage from accurate delay estimates.

3-D Nature of Electric Field

The semi-insulating GaAs substrate reduces interconnect capacitances, but the lack of a ground plane in close proximity to the interconnect layers results in increased coupling to nearby nodes. Even if the die is made very thin by lapping and a backside ground is provided, the ground plane is far away from the interconnect layers. The total capacitance of a node is dominated by three dimensional fringing fields. In Si circuits the lightly doped silicon substrate acts as a ground plane and interconnect capacitances are dominated by the capacitance to the substrate. Thus, the capacitance of interconnections on Si is a strong function of the interconnect length and width, whereas the GaAs interconnect capacitance is a strong function of the shape and proximity of nearby conductors.

The methods used in many VLSI design tools for interconnect extraction are targeted at CMOS designs and employ empirical equations, such as the Sakurai fitting equations [Saku83], to extract parasitic capacitances. Since the GaAs interconnect capacitances are not dominated by the capacitance to the substrate, these methods are less accurate [Leco94]. The conventional extraction tools suffer from inaccuracies, even for very simple structures, because they assume planar electric fields in capacitance extraction. In a dense geometry the sidewall fringing capacitance dominates both within the same layer and across layers. A 3-D capacitance extraction is necessary to get accurate delay estimates for high-speed GaAs circuits. As mentioned before, analyzing GaAs interconnects is also more difficult than analyzing Si interconnects because several dielectric layers must be included in the analysis. For example, a typical GaAs process can have silicon dioxide (SiO2), silicon nitride (SiN), and polyimide dielectric layers [Asbe91].

Figure 1.2 and Figure 1.3 illustrate the difference between GaAs and Si interconnect capacitances. Figure 1.2 shows the field lines for three 2 µm wide wires (1-2--3) with spacings of 2 µm and 4 µm on a 75 µm thick GaAs substrate. Figure 1.3 shows the field lines for an equivalent interconnect structure on Si with a field oxide thickness of 1 µm. The interconnect capacitances per unit length for the center conductor, computed using METAL, a mesh-based capacitance extractor from OEA International [Oea94], are shown in Table 1-1.

Figure 1.2: Field lines for GaAs interconnect structure (1-2--3).

Figure 1.3: Field lines for Si interconnect structure (1-2--3).

The capacitance to the substrate, C20, is only 17% of the total capacitance for the GaAs conductor, but 78% for the Si conductor. The coupling to the nearby wires is at least twice as strong with the GaAs substrate as with the Si substrate. In addition, the coupling to nearby wires does not decrease as fast with distance in GaAs as in Si. The coupling to conductor 3 is still 68% of C21 in GaAs, even though the spacing to conductor 3 is 4 µm and the spacing to conductor 1 is 2 µm. The coupling to conductor 3 is only 42% of C21 in Si. The typical interconnect structures in a circuit are much more complex than the 2-D interconnect case used here. A typical case with three metal layers is shown in Figure 1.4.

Table 1-1: Comparison of GaAs and Si interconnect capacitance.

| Capacitance | GaAs [fF/µm] | Si [fF/µm] |
|-------------|--------------|------------|
| C20         | 0.022        | 0.141      |
| C21         | 0.066        | 0.028      |
| C23         | 0.045        | 0.012      |
| C22         | 0.133        | 0.181      |
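The percentages quoted for Table 1-1 can be rechecked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: the dictionary layout and the function name `capacitance_ratios` are illustrative assumptions, and only the numbers are taken from the table. It computes the substrate share C20/C22 and the far-to-near coupling ratio C23/C21, which agree with the quoted figures to within the rounding of the tabulated entries.

```python
# Per-unit-length capacitances of the center conductor (wire 2) in fF/um,
# copied from Table 1-1. C20: to substrate; C21, C23: to wires 1 and 3;
# C22: total capacitance of wire 2.
TABLE_1_1 = {
    "GaAs": {"C20": 0.022, "C21": 0.066, "C23": 0.045, "C22": 0.133},
    "Si":   {"C20": 0.141, "C21": 0.028, "C23": 0.012, "C22": 0.181},
}


def capacitance_ratios(caps):
    """Return the substrate share of the total and the far/near coupling ratio."""
    return {
        "substrate_share": caps["C20"] / caps["C22"],
        "far_to_near_coupling": caps["C23"] / caps["C21"],
    }


if __name__ == "__main__":
    for substrate, caps in TABLE_1_1.items():
        ratios = capacitance_ratios(caps)
        print(f"{substrate}: substrate share {ratios['substrate_share']:.1%}, "
              f"C23/C21 {ratios['far_to_near_coupling']:.1%}")
```

In both columns the three components C20, C21, and C23 add up to the tabulated total C22, so the table is internally consistent.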

Figure 1.4: Typical 3-D interconnect geometry.

The importance of 3-D capacitance extraction tools increases as the devices and interconnects are scaled. If the process is scaled by a factor S (S<1), the interconnect length shrinks by S and the interconnect capacitances also shrink by S. The capacitance per unit length stays about constant (it actually increases a little) since both the width and the spacing decrease. However, the maximum device current shrinks with S^2, because the current per unit area in the emitter is about constant; it is already close to the dopant redistribution and electromigration limits. If interconnect capacitances shrink with S and the device current levels shrink with S^2, capacitance-induced delays will make up a larger fraction of the critical path delays. Thus accurate 3-D capacitance extraction is essential for aggressive circuit designs.
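The scaling argument can be made concrete with a short calculation. The sketch below assumes a first-order delay model, t ~ C * dV / I, with a fixed logic swing dV; this model and the function name `scaled_wire_delay_factor` are illustrative assumptions rather than something stated in the text. Under that assumption the capacitance-induced delay grows as 1/S when the process is shrunk.

```python
def scaled_wire_delay_factor(s):
    """Relative capacitance-induced delay after scaling by s (0 < s <= 1).

    Assumption (not from the source): delay ~ C * dV / I with a constant
    logic swing dV. Following the text, the wire capacitance C scales as s
    and the maximum device current I scales as s**2.
    """
    if not 0 < s <= 1:
        raise ValueError("scale factor must satisfy 0 < s <= 1")
    capacitance = s        # interconnect capacitance relative to the unscaled case
    current = s ** 2       # maximum device current relative to the unscaled case
    return capacitance / current  # equals 1 / s


if __name__ == "__main__":
    for s in (1.0, 0.7, 0.5):
        factor = scaled_wire_delay_factor(s)
        print(f"S = {s}: capacitance-induced delay x {factor:.2f}")
```

Because the unloaded gate delay improves with scaling while this term grows, the wiring share of the critical path increases, which is the point made above.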

Design Tool Calibration Along with Process Variation

The processes for most of these advanced technologies are low-volume and geared toward low-density analog designs. This translates into relatively large variations in process parameters such as dielectric thickness, and into topological non-uniformities arising from non-planar processes. Modern CAD tools have been developed for CMOS technology, which still leaves enough design margin to do without full 3-D capacitance extraction. This lack of tools is another reason for designers working with alternative technologies to continuously calibrate their CMOS-based tools.

Advanced Package Design for a Fast RISC Processor

Another interconnect problem surfaces in the off-chip environment. The low integration level of the technology places a number of critical paths off-chip in a tightly coupled system such as F-RISC. High-speed communication requires more power and low-resistance lines, which conflicts with the need for high-density, fine-geometry routing. The processor also demands a high power dissipation capability to keep the junction temperatures at manageable levels. The development of a high-speed, air-cooled package was undertaken for this purpose and is described in chapters 4-8. An overview of the F-RISC processor and the constraints on the package is given in the following sections.

Device and Circuit Technology

The processor uses a baseline 50-GHz GaAs/AlGaAs HBT technology [Asbe91] from Rockwell International, with three-level, fully differential current-mode logic (CML) circuits reaching unloaded gate delays of 20 ps. A CML design is based on using a current switch to pull/push currents from/to its two complementary outputs. The current switch is based on the principle of a differential amplifier, except that one of the two switch transistors is in cut-off mode while the other is in active mode, conducting current. This provides a large degree of common-mode noise rejection and improves decoupling of signal pairs from one another. An example CML gate is shown in Figure 1.5. Resistive current sourcing, as opposed to active current sourcing, is employed in the current trees of these CML gates.

Most of the analog circuitry is also implemented in fully differential form. A characteristic of fully differential communication is that the number of wires doubles, increasing chip area and reducing back-end-of-line yield. Also, a high-performance differential-switch-based design requires balanced differential routing to minimize signal skew. Small differences in parasitic loading between the two signals of a pair can result from local conductor variations even though the wires are routed in parallel tracks. For lumped interconnects, the capacitance on each wire needs to be balanced without too much regard to the individual wire lengths. For distributed RC nets, the interconnect topology needs to be matched much more closely.

Figure 1.5: A two-level three-input representative current-mode-logic (CML) gate.

Architecture

The F-RISC architecture grew out of the first-generation RISC processors [Kate85][Henn84] after the realization that a simple, fast RISC engine could be designed with the latest high-speed devices. For a first design iteration, a RISC data engine does not need to provide the various features available in commercial microprocessors [Patt85]. Most of these features, such as superscalar execution, superpipelining, and register scoreboarding, have trickled down from earlier supercomputer designs. This trend of borrowing features from supercomputer designs will continue, with more device integration, because other ways to increase performance are limited by the freezing of the architecture. The F-RISC/G processor is simple compared to commercial processors in terms of its native instruction set. A block diagram of the processor along with the proposed package is shown in Figure 1.6 [Phil93]. It is a 32-bit CPU with an address bus shared between the instruction and data caches.

The important characteristics of this processor are as follows:

The core system is partitioned into instruction decoder, datapath, memory, and system clock. The memory is further divided into cache controller and cache memory. The part of the memory hierarchy residing in the core is called the L1 cache. A full system will have an L2 cache, an L3 cache, and the main memory. All chip operations are controlled by a four-phase clock distributed over the chip with minimal skew. All the chips are summarized in Table 1-2. The current package includes neither the L2 cache nor the wires between the L1 and L2 caches.

Table 1-2: Summary of F-RISC/G Chipset.

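| Chip                 | # on MCM | Size [mm x mm] | Device Count | Power [W] | Total Power [W] | Heat Flux [W/cm^2] | Signal Pads | Power Pads |
|----------------------|----------|----------------|--------------|-----------|-----------------|--------------------|-------------|------------|
| Datapath*            | 4        | 8.5x9.3        | 9785         | 13.0      | 52.0            | 16.4               | 80          | 124        |
| Instruction Decoder* | 1        | 7.6x8.7        | 7358         | 12.0      | 12.0            | 18.1               | 120         | 80         |
| Cache Memory*        | 16       | 7.0x9.2        | 14300        | 11.2      | 179.2           | 17.3               | 96          | 130        |
| Cache Controller*    | 2        | 9.5x8.3        | 13172        | 12.6      | 25.2            | 15.9               | 118         | 136        |
| Clock Deskew*        | 1        | 6.0x6.0        | -            | 4.0**     | 4.0**           | 11.1               | -           | -          |
| L2 Cache             | 34**     | -              | -            | 5.0**     | 170**           | -                  | -           | -          |

* core chip; ** estimated.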
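The heat flux column of Table 1-2 follows from the per-chip power and die size. The sketch below recomputes it under the assumption that the flux is simply the chip power divided by the full die area; the names `CHIPS` and `heat_flux_w_per_cm2` are hypothetical, and only the numbers come from the table. The recomputed values agree with the tabulated ones to within about 0.1 W/cm^2.

```python
# name: (power per chip [W], width [mm], height [mm], tabulated flux [W/cm^2])
# All values copied from Table 1-2 (core chips only).
CHIPS = {
    "Datapath":            (13.0, 8.5, 9.3, 16.4),
    "Instruction Decoder": (12.0, 7.6, 8.7, 18.1),
    "Cache Memory":        (11.2, 7.0, 9.2, 17.3),
    "Cache Controller":    (12.6, 9.5, 8.3, 15.9),
    "Clock Deskew":        (4.0, 6.0, 6.0, 11.1),
}


def heat_flux_w_per_cm2(power_w, width_mm, height_mm):
    """Average heat flux over the die, assuming uniform dissipation."""
    area_cm2 = (width_mm / 10.0) * (height_mm / 10.0)  # mm -> cm on each side
    return power_w / area_cm2


if __name__ == "__main__":
    for name, (power, width, height, tabulated) in CHIPS.items():
        flux = heat_flux_w_per_cm2(power, width, height)
        print(f"{name}: computed {flux:.2f} W/cm^2, tabulated {tabulated}")
```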

Constraints on the Package

Partitioning the processor at the chip level assumed the availability of high-speed multichip packages. Compared to conventional packaging, an MCM package may improve the system operating frequency by a factor of three, the overall package area by a factor of seven, and the power dissipation by thirty percent [Dai92]. The overall timing constraints of the F-RISC/G processor require a dense package that can provide high-bandwidth, high-speed, low-noise interconnect, large supply current with low gradients, and high heat dissipation capability uniformly over a small area. The next few sections give an overview of all the constraints on the package. These constraints are detailed in chapter 4.

Critical Delays

The processor has four tightly coupled 8-bit datapath chips on the package. Control signals from the instruction decoder to these chips travel on the package, and therefore a number of critical paths lie on the package. The representative critical path is the loop involving carry-chain communication between the datapath slices, which is completed in 1 ns. Other critical paths are the memory address transfer from the CPU to the cache controllers and the subsequent 2.25-ns data and 2-ns instruction transfer cycles. These and other critical paths are described in chapter 4.

High-Density, High-Speed, and Low-Noise Interconnect Structure

The off-chip wires act as transmission lines because of the fast rise times of the driver outputs, on the order of 70 ps to 100 ps, and the long line lengths. The speed of signal propagation on these transmission lines is inversely proportional to the square root of the dielectric constant of the substrate. The timing budget of the critical paths indicates the need for a low-K dielectric [Maie96] to obtain low time-of-flight delays. High-speed propagation also requires low-loss lines to maintain sharp rise times at the receiver inputs. The low-noise requirement demands impedance-matched, terminated wires with low parasitics and wide spacings. High-density routing demands narrow wire pitches. These requirements sometimes push the interconnect structure in opposite directions. For example, the demand for low noise increases the routing pitch while the demand for high density tries to reduce it. These tradeoffs and the final interconnect scheme are described in chapters 4 and 5.
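The dependence of time-of-flight delay on the dielectric constant can be illustrated numerically. The sketch below assumes an ideal lossless line in a homogeneous dielectric, so that v = c / sqrt(eps_r). The line length and the permittivity values in the usage loop are hypothetical examples chosen only for illustration; they are not parameters of the F-RISC/G package.

```python
import math

C0_M_PER_S = 299_792_458.0  # speed of light in vacuum


def time_of_flight_ps(length_mm, rel_permittivity):
    """Time of flight in ps on an ideal line in a homogeneous dielectric.

    Assumes v = c / sqrt(eps_r); losses and dispersion are ignored.
    """
    if length_mm < 0 or rel_permittivity < 1.0:
        raise ValueError("length must be >= 0 and eps_r must be >= 1")
    velocity = C0_M_PER_S / math.sqrt(rel_permittivity)
    return (length_mm * 1e-3) / velocity * 1e12


if __name__ == "__main__":
    # Hypothetical 30 mm net compared across illustrative dielectric constants.
    for eps_r in (2.7, 4.0, 9.5):
        delay = time_of_flight_ps(30.0, eps_r)
        print(f"eps_r = {eps_r}: {delay:.0f} ps over 30 mm")
```

Since the delay scales with sqrt(eps_r), quadrupling the dielectric constant doubles the time of flight for the same routed length, which is why the timing budget favors a low-K dielectric.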

Clock Distribution

A low-skew clock distribution is essential to achieve the target system speed. Seven chips on the package require a 2-GHz clock, from which they generate four 250-ps phases internally. The sources of skew on the package are variations in process, temperature, and humidity, as well as coupling noise and mismatched routing. The clock distribution scheme is described in chapter 5.

High Power Dissipation and Air Cooling

The package dissipates about 220 W of power with an average heat flux of 10.47 W/cm^2. There are two major constraints on the scheme used to dissipate this much power. The first is that the package has to be air-cooled; the second is that the device junction temperature must be held around room temperature, because the device models are valid at 25 °C only. The chips slow down as temperature increases because of a decrease in the current gain of the devices. An active cooling scheme coupled with an air-cooled exchanger was devised to cool the package to such a low junction temperature. The temperature gradients are also kept to a minimum, as any skew in signal transmission cuts into the system cycle time. Extensive thermal modeling of the package is presented in chapter 6.

Power Distribution

The package consumes 221.2 W of power at a dc bias of -5.2 V and 42.5 A. Distributing this power to all the chips with only a small variation is important to maintain the noise margin. The regulation of the external supply voltage should also be kept under control. In addition to these dc limitations, the package should reject power supply ripple. The power distribution scheme is discussed in detail in chapter 5.

Testing

Testability is essential at both the chip level and the package level. The availability of known good die (KGD) for placement on the package is critical to the success of this multichip packaging effort. Since the yield is not expected to be very high, it is very important to minimize the amount of rework once the chips are in place on the package; extracting a chip from the package may also damage other chips. The testing scheme should work within a reduced transistor budget and should be able to confirm the speed of the system. Its off-chip implementation is described in chapter 7.

Summary

An accurate characterization of the interconnect network plays an important role in the design of high-performance chips. The design of a test chip to characterize a fast GaAs HBT process for digital applications is described in chapter 2. The techniques are applicable to other compound semiconductor technologies and to silicon technologies such as silicon-on-insulator (SOI). The test results are given in chapter 3. The results and testing experience from this test chip were applied to fine-tune the chipset and the package of the F-RISC/G processor. The package design is described in chapters 4-7, with conclusions in chapter 8.