The demand for greater processing power is higher than ever as we celebrate the 50th anniversaries of the invention of the transistor [Moor97] and of ENIAC, the first general-purpose digital computer, and the 25th anniversary of the invention of the Intel 4004, the first microprocessor [Patt96]. Processors have doubled in performance every 18 months [Tred96], and researchers are looking at heterojunction-based technologies to reap even greater performance [Jala95]. A number of heterostructure-based high-speed device technologies are available to the circuit designer today [Kroe82] in the Indium Phosphide (InP), Gallium Arsenide (GaAs), and Silicon-Germanium (SiGe) material systems. These technologies promise switching speeds an order of magnitude higher than current MOS and bipolar technologies [Kroe82]. The fabrication processes of a few of these technologies have matured enough to provide moderately large-scale integration capability [Deyh95]. The transit time frequency, ft, is a figure of merit for evaluating these devices. It has been observed as high as 350 GHz for certain MESFET, HEMT, and HBT devices [Ali91]. The availability of dense packaging methods makes these technologies attractive for multi-chip implementation of complex circuits [Greu88].
One promising heterostructure technology is the advanced GaAs heterojunction bipolar transistor (HBT) technology [Kroe57][Asbe91]. The low yield and high power dissipation of HBT circuits limit the number of gates that can be integrated on a chip. A high-speed RISC engine [Ditz80] can nevertheless be realized by densely packing these chips on a fast package. The Fast Reduced Instruction Set Computer (F-RISC/G) project at Rensselaer has applied this idea to develop a 32-bit processor with a 1-ns cycle time in a 50-GHz GaAs/AlGaAs HBT technology [Phil93]. The processor is partitioned into twenty-four chips, with five custom chips in the chipset.
One reason to push for a faster clock rate, rather than stringing together a number of processors in parallel - as in a massively parallel processor (MPP) - to obtain a desired peak performance, is that average applications do not speed up in proportion to the processor count, owing to the inherently serial nature of an average code stream. Parallel processors are also unable to match the memory network bandwidth to the data bandwidth required by the processing core. The cost-effectiveness of assembling thousands of slow processors with gigabytes of memory is also low, owing to the development time and the low volume of shipped machines. A case in point is the $50 million Intel/Sandia MPP, which has 9624 Pentium Pro processors, each running at 200 MHz, and 500 GB of primary memory [Clar97], for a peak performance of 1.8 teraflops. The actual performance of such MPPs is, in some cases, as low as 15% of the peak level [Clar97]. A higher clock rate processor, by contrast, will always increase the total performance of these machines.
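The lack of proportionate speedup can be illustrated with Amdahl's law, the standard model for code with a serial portion. The text does not name this law or give a serial fraction, so the fractions below are purely hypothetical; only the processor count of 9624 comes from the paragraph above.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Ideal speedup when a fixed fraction of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    processors = 9624  # processor count quoted for the Intel/Sandia MPP
    # Hypothetical serial fractions, chosen only for illustration.
    for serial_fraction in (0.0, 0.001, 0.01, 0.1):
        speedup = amdahl_speedup(serial_fraction, processors)
        print(f"serial fraction {serial_fraction:5.3f}: speedup {speedup:8.1f}x")
```

Under this model even a 1% serial fraction caps the speedup below 100 regardless of processor count, whereas a faster clock shortens the serial and parallel portions alike.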
The potential advantages of a bipolar transistor with a wide-gap emitter over a homojunction transistor have been known for a long time [Shoc48][Kroe57]. The gain of a bipolar transistor can be increased by either higher doping of the base or a band-gap difference between the base and emitter. The bases are usually doped up to their limit, but heterostructure transistors offer the additional opportunity to engineer the bandgap [Kroe82]. This, in turn, permits a re-optimization of doping levels and geometries, leading to higher speed devices. The tradeoff between breakdown voltage and transit time frequency is given by an empirical relation - the Johnson limit [John65] - which states that the product of ft and the breakdown voltage for bipolar devices follows

$$ f_t \cdot BV_{ceo} \leq 300\ \text{GHz} \cdot \text{V} \qquad [1.1] $$
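As a rough illustration of this tradeoff, the sketch below reads the Johnson limit as an upper bound of 300 GHz·V on the product of ft and BVceo (an assumption about the direction of the relation) and computes the largest breakdown voltage that bound allows for a given ft. The function name is hypothetical.

```python
JOHNSON_LIMIT_GHZ_V = 300.0  # empirical ft * BVceo figure quoted in the text


def max_breakdown_voltage(ft_ghz: float, limit_ghz_v: float = JOHNSON_LIMIT_GHZ_V) -> float:
    """Largest BVceo (volts) compatible with the bound ft * BVceo <= limit."""
    if ft_ghz <= 0:
        raise ValueError("ft_ghz must be positive")
    return limit_ghz_v / ft_ghz


if __name__ == "__main__":
    # 50 GHz is the baseline HBT technology; 350 GHz is the highest ft cited.
    for ft_ghz in (50.0, 100.0, 350.0):
        print(f"ft = {ft_ghz:5.0f} GHz -> BVceo <= {max_breakdown_voltage(ft_ghz):.2f} V")
```

For the 50-GHz baseline technology this bound works out to about 6 V, and it falls below 1 V at 350 GHz.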
All the heterostructure devices can circumvent the above inequality, thereby permitting even higher ft values. The most popular material systems for HBTs are those with matching lattice constants, such as AlGaAs/GaAs and InGaAs/InP. The development of molecular beam epitaxy and metal-organic chemical vapor deposition (MOCVD) techniques in the 1970s enabled the fabrication of high-quality semiconductor heterostructures [Chan94]. Many other growth techniques, including low-temperature epitaxial growth, are currently being developed to facilitate the fabrication of high-quality semiconductor heterostructure devices.
Another advantage of these devices is their speed in spite of
their relatively large size. As refractive-lens-based optical lithography
systems approach their limit of 193 nm resolution, an increasing
amount of resources will be required to go to smaller dimensions
in the extreme ultraviolet (EUV) and vacuum ultraviolet (VUV)
range [Boko97]. This will not stop the shrinking of HBTs, though,
as they are a few steps behind the state of the art in lithography.
Future process shrinks, both lateral and vertical, will produce
even higher speeds at progressively lower power. Figure 1.1 illustrates
how the maximum ft stays constant as the emitter is scaled
down in a typical HBT technology, along with a reduction in power.
Interconnections loading the gates tend to slow down this performance. The performance of high-speed circuits depends on the interconnect parasitics, in addition to the unloaded performance of the switching device [Bako90] and the degree of integration possible with these technologies. Interconnect delays have not scaled down in proportion to device switching speeds. Under ideal scaling the RC delay remains constant [Bako90], and in fact the delay on global lines increases because of the increase in chip size. Current trends in technology indicate the increasing dominance of interconnects - both on-chip and off-chip - in determining system performance [Mein96]. Interconnect delay occupies as much as 50% of the available cycle time of a complex processor, which underlines the paramount importance of accurately predicting interconnect parasitics [Edel95]. Thus, a proper interconnect engineering approach is required in designing high-performance chips.
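The statement that the RC delay stays constant under ideal scaling can be checked with a small sketch. It assumes a simple parallel-plate capacitance model that ignores fringing and coupling, and illustrative material constants; the `Wire` type and the `length_factor` parameter are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional

RHO_OHM_M = 2.7e-8            # illustrative metal resistivity (assumption)
EPS_F_PER_M = 3.9 * 8.854e-12  # illustrative dielectric permittivity (assumption)


@dataclass(frozen=True)
class Wire:
    length: float                # metres
    width: float                 # metres
    height: float                # metal thickness, metres
    dielectric_thickness: float  # metres


def rc_product(wire: Wire) -> float:
    """R*C of a wire using a parallel-plate capacitance model (no fringing)."""
    resistance = RHO_OHM_M * wire.length / (wire.width * wire.height)
    capacitance = EPS_F_PER_M * wire.length * wire.width / wire.dielectric_thickness
    return resistance * capacitance


def scaled(wire: Wire, s: float, length_factor: Optional[float] = None) -> Wire:
    """Shrink the cross-section by s; length follows s unless length_factor is given."""
    lf = s if length_factor is None else length_factor
    return Wire(wire.length * lf, wire.width * s, wire.height * s,
                wire.dielectric_thickness * s)


if __name__ == "__main__":
    wire = Wire(length=1e-3, width=2e-6, height=1e-6, dielectric_thickness=1e-6)
    local = scaled(wire, s=0.5)                        # local wire shrinks with the process
    global_ = scaled(wire, s=0.5, length_factor=1.2)   # global wire grows with chip size
    print("local RC ratio :", rc_product(local) / rc_product(wire))
    print("global RC ratio:", rc_product(global_) / rc_product(wire))
```

In this model the local wire keeps the same RC product, while a global wire whose length grows with the chip sees its RC product rise as the square of the length factor divided by the square of the scale factor.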
Existing CAD tools for extracting on-chip wiring parasitics are not sufficiently accurate because they poorly cover the 3-D effects of nearby multilayer conductor geometries and non-planar processes. The problem is even more pronounced in GaAs interconnects, which lack a nearby ground because of the semi-insulating nature of the substrate. Thus, 3-D field solvers are used to compute these parasitics. Most of these 3-D field solvers [Oea94] are mesh based and require extensive computing resources even for a small area of the chip. The presence of several dielectric layers in a typical GaAs technology requires a smaller mesh size, which makes the extraction of interconnect parasitics even more computationally intensive. In practice, these solvers are used only on sections of critical nets, and the results are used to generate heuristics that are then plugged into conventional 2-D extractors. Dense layouts and high power densities also affect device behavior through self-heating and proximity heating effects [Anho95] and through high-frequency coupling between adjacent devices [Najm96]. Traditional measurements of isolated devices and interconnect test structures are inadequate for predicting circuit behavior under such conditions and need to be supplemented by high-density structures that mimic real circuit conditions. Therefore, experimental verification of the results of these calculations is very important.
Thin-film multichip packages [Lica95] can provide the bandwidth
and short interconnect needed to leverage a high-speed technology,
such as AlGaAs/GaAs, for multi-GHz machines. The same packaging
technology can also be applied to integrate both bipolar and CMOS
chips permitting each type of technology to be used optimally.
The next few sections illustrate the effect of the interconnect
problem and provide an overview of the F-RISC processor, identifying
the constraints it puts on the package.
High-speed digital designs emphasize the behavior of passive circuit elements [John93]. As switching speeds increase, the limited bandwidth of a wire adds a rise-time delay to a signal's intrinsic propagation delay. Circuit performance is dominated by interconnect delays; if these delays can be estimated accurately, the designer can allocate power optimally and control on-chip skew, resulting in faster circuits with lower power dissipation. While interconnect capacitance has a smaller effect on bipolar circuit performance than on CMOS, it is by no means negligible. On the other hand, bipolar circuits suffer from interconnect resistance much more than CMOS circuits do because of their lower output impedance. Line inductance is becoming a factor with long, narrow global signal lines and fast-switching signals such as master clocks [Lipm97]. Coupling effects, traditionally not a source of concern, have become prominent with increasing aspect ratios [Zhan96]. The increased number of interconnects has made their effect more pronounced. In GaAs most of the field lines terminate on other wires, which increases the coupling. Until recently, digital simulation tools did not take coupling into account.
An accurate characterization of the fabrication process is even more important with an experimental technology such as GaAs HBTs. Because the manufacturing technology is still evolving, the actual process parameters do not always adhere to the specifications given to the circuit designers. This, in turn, shows up as unpredicted circuit performance. Therefore, a method is required to predict the process parameters from a set of test structures and to design the circuits with those parameter variations in mind.
The effect of slight inaccuracies in delay at the interconnect
modeling level manifests itself at the system modeling level where,
in modern designs, a multitude of these wires are involved. One
direct effect of modeling uncertainty is that the system clock
margin has to be increased, which slows the system down. Even
slowing down the system may not guarantee safe operation, because
skew in some signals can violate the logical data
flow. This effect can be especially crippling in wave-pipelined
systems, which get their speed advantage from accurate delay estimation.
The semi-insulating GaAs substrate reduces interconnect capacitances, but the lack of a ground plane in close proximity to the interconnect layers results in increased coupling to nearby nodes. Even if the die is made very thin by lapping and a backside ground is provided, the ground plane is far away from the interconnect layers. The total capacitance of a node is dominated by three-dimensional fringing fields. In Si circuits, the lightly doped silicon substrate acts as a ground plane, and interconnect capacitances are dominated by the capacitance to the substrate. Thus, the capacitance of interconnections on Si is a strong function of the interconnect length and width, whereas the GaAs interconnect capacitance is a strong function of the shape and proximity of nearby conductors.
The methods used in many VLSI design tools for interconnect extraction are targeted at CMOS designs and employ empirical equations, such as the Sakurai fitting equations [Saku83], to extract parasitic capacitances. Since GaAs interconnect capacitances are not dominated by the capacitance to the substrate, these methods are less accurate [Leco94]. Conventional extraction tools suffer from inaccuracies even for very simple structures because they assume planar electric fields in capacitance extraction. In a dense geometry the sidewall fringing capacitance dominates, both within a layer and across layers. A 3-D capacitance extraction is necessary to get accurate delay estimates for high-speed GaAs circuits. As mentioned before, analyzing GaAs interconnects is also more difficult than analyzing Si interconnects because several dielectric layers must be included in the analysis; for example, a typical GaAs process can have silicon dioxide (SiO2), silicon nitride (SiN), and polyimide dielectric layers [Asbe91].
Figure 1.2 and Figure 1.3 illustrate the difference between GaAs and Si interconnect capacitances. Figure 1.2 shows the field lines for three 2 µm wide wires (1-2-3) with spacings of 2 µm and 4 µm on a 75 µm thick GaAs substrate. Figure 1.3 shows the field lines for an equivalent interconnect structure on Si with a field oxide thickness of 1 µm. The interconnect capacitances per unit length for the center conductor are shown in Table 1-1, computed using METAL - a mesh-based capacitance extractor - from OEA International [Oea94].
The capacitance to the substrate, C20, is only 17%
of the total capacitance for the GaAs conductor, but 78% for the Si conductor. The
coupling to the nearby wires is at least twice as strong with
the GaAs substrate as with the Si substrate. In addition,
the coupling to nearby wires does not decrease as fast with distance
in GaAs as in Si. The coupling to conductor 3 is still 68% of
C21 in GaAs even though the spacing to conductor 3
is 4 µm and the spacing to conductor 1 is 2 µm. The coupling to
conductor 3 is only 42% of C21 in Si. The typical interconnect
structures in a circuit are much more complex than the 2-D interconnect
case used here. A typical case with three metal layers is shown
in Figure 1.4.
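The percentages quoted above can be turned into an approximate breakdown of the center conductor's capacitance. The sketch assumes that the total capacitance is just the sum of C20, C21, and the coupling to conductor 3 (called C23 here, a label not used in the text); the actual values are those of Table 1-1, which may include other contributions.

```python
def capacitance_shares(substrate_share: float, c23_over_c21: float) -> dict:
    """Split the total into C20, C21, C23 shares, assuming total = C20 + C21 + C23."""
    if not 0.0 <= substrate_share <= 1.0:
        raise ValueError("substrate_share must lie in [0, 1]")
    if c23_over_c21 < 0.0:
        raise ValueError("c23_over_c21 must be non-negative")
    coupling_share = 1.0 - substrate_share
    c21 = coupling_share / (1.0 + c23_over_c21)
    c23 = coupling_share - c21
    return {"C20": substrate_share, "C21": c21, "C23": c23}


if __name__ == "__main__":
    # Substrate share and C23/C21 ratio as quoted in the text for each substrate.
    for name, share, ratio in (("GaAs", 0.17, 0.68), ("Si", 0.78, 0.42)):
        parts = capacitance_shares(share, ratio)
        summary = ", ".join(f"{key} = {value:.1%}" for key, value in parts.items())
        print(f"{name}: {summary}")
```

Each share is relative to that structure's own total capacitance, so the GaAs and Si shares are not directly comparable in absolute terms.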
The importance of 3-D capacitance extraction tools increases as
the devices and interconnects are scaled. If the process is scaled
by a factor S (S<1), the interconnect length shrinks by S and
interconnect capacitances also shrink by S. The capacitance per
unit length stays about constant - actually increases a little
- since the width and the spacing decrease. However, the maximum
device current shrinks with S² as the current per unit
area in the emitter is about constant since it is already close
to the dopant redistribution and electromigration limits. If interconnect
capacitances shrink with S and the device current levels shrink
with S², capacitance-induced delays will make up a
larger fraction of the critical path delays. Thus accurate 3-D
capacitance extraction is essential for aggressive circuit designs.
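The scaling argument above can be written out as a small sketch. It assumes that the time to charge a wire is proportional to C·ΔV/I and that the logic swing ΔV does not change with scaling; both are simplifying assumptions, not statements from the text.

```python
def relative_charging_delay(s: float) -> float:
    """Wire-charging delay after scaling by s, relative to the unscaled process.

    Capacitance scales with s and device current with s**2; the voltage
    swing is assumed constant, so delay ~ C * dV / I scales as 1 / s.
    """
    if not 0.0 < s <= 1.0:
        raise ValueError("scale factor s must satisfy 0 < s <= 1")
    capacitance = s       # interconnect capacitance shrinks with S
    current = s ** 2      # maximum device current shrinks with S^2
    return capacitance / current


if __name__ == "__main__":
    for s in (1.0, 0.7, 0.5):
        print(f"S = {s:.1f}: wire-charging delay x {relative_charging_delay(s):.2f}")
```

Under these assumptions a shrink by S = 0.5 doubles the wire-charging delay relative to the unscaled process, while the unloaded device speed improves.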
The processes for most of these advanced technologies are low-volume
and geared for low-density analog designs. This translates into
relatively large variation in process parameters such as dielectric
thickness, and into topological non-uniformities caused by non-planar
processes. Modern CAD tools have been developed for CMOS
technology, which still has enough design margin to do without full 3-D capacitance
extraction. This lack of tools is another reason why designers
working with alternative technologies must continuously calibrate
their CMOS-based tools.
Another interconnect problem surfaces in the off-chip environment.
The low integration level of the technology places a number of
critical paths off-chip in a tightly coupled system such as F-RISC.
High-speed communication requires more power and low-resistance
lines, which conflicts with the need for high-density, fine-geometry
routing. The processor also demands high power dissipation
capability to keep the junction temperatures at manageable levels.
A high-speed, air-cooled package was developed
for this purpose and is described in chapters 4-8. An overview
of the F-RISC processor and the constraints on the package is given
in the following sections.
The processor uses a baseline 50-GHz GaAs/AlGaAs HBT technology [Asbe91] from Rockwell International, with three-level, fully differential current-mode logic (CML) circuits reaching unloaded gate delays of 20 ps. A CML design uses a current switch to pull current from, or push current to, its two complementary outputs. The current switch is based on the principle of a differential amplifier, except that one of the two switch transistors is in cut-off mode while the other is in active mode, conducting current. This provides a large degree of common-mode noise rejection and improves decoupling of signal pairs from one another. An example CML gate is shown in Figure 1.5. Resistive current sourcing, as opposed to active current sourcing, is employed in the current trees of these CML gates.
Most of the analog circuitry is also implemented in fully differential form. A characteristic of fully differential communication is the doubling of the number of wires, which increases chip area and reduces back-end-of-the-line yield. Also, a high-performance differential-switch-based design requires balanced differential routing to minimize signal skew. Small differences in parasitic loading between the two signal partners of a pair can result from local conductor variations even though the wires are routed in parallel tracks. For lumped interconnects, the capacitance on each wire needs to be balanced without too much regard to the individual wire lengths. For distributed RC nets, the interconnect topology needs to be matched much more closely.
The F-RISC architecture grew out of the first-generation RISC processors [Kate85][Henn84], after the realization that a simple, fast RISC engine could be designed with the latest high-speed devices. For a first design iteration, a RISC data engine does not need to provide many of the features available in commercial microprocessors [Patt85]. Most of these features - such as superscalar execution, superpipelining, and register scoreboarding - have trickled down from earlier supercomputer designs. This trend of borrowing features from supercomputer designs will continue as device integration grows, since other ways of increasing performance are limited by a freezing of the architecture. The F-RISC/G processor is simple compared to commercial processors in terms of its native instruction set. A block diagram of the processor, along with the proposed package, is shown in Figure 1.6 [Phil93]. It is a 32-bit CPU with an address bus shared between the instruction and data caches.
The important characteristics of this processor are as follows:
The core system is partitioned into the instruction decoder, datapath,
memory, and system clock. The memory is further divided into the cache
controller and cache memory. The part of the memory hierarchy residing
in the core is called the L1 cache. A full system will have an L2 cache,
an L3 cache, and main memory. All chip operations are controlled
by a four-phase clock distributed over the chip with minimal skew.
All the chips are summarized in Table 1-2. The current package
does not include the L2 cache or the wires between the L1 and L2 caches.
Table 1-2 notes: * core chip; ** estimated.
Partitioning at the chip level assumed that high-speed multichip
packages would be available. Compared to conventional packaging,
an MCM package may improve system operating frequency by a factor
of three, overall package area by a factor of seven and power
dissipation by thirty percent [Dai92]. The overall timing constraints
of the F-RISC/G processor require a dense package which can provide
high-bandwidth high-speed low-noise interconnect, large supply
current with low gradients, and high heat dissipation capability
uniformly over a small area. The next few sections give an overview
of all the constraints on the package. These constraints will
be detailed later in chapter 4.
The processor has four tightly coupled 8-bit datapath chips on
the package. Control signals from the instruction decoder to these
chips travel on the package and therefore a number of critical
paths lie on the package. The representative critical path here
is the loop involving carry chain communication between the datapath
slices, which is completed in 1 ns. Other critical paths are the
memory address transfer from the CPU to the cache controllers
and the subsequent 2.25 ns data and 2-ns instruction transfer
cycle. These and other critical paths are described in chapter
The off-chip wires act as transmission lines because of the fast rise times of the driver outputs, on the order of 70 ps to 100 ps, and the long line lengths. The speed of signal propagation on these transmission lines is inversely proportional to the square root of the dielectric constant of the substrate. The timing budget of the critical paths calls for a low-K dielectric [Maie96] to obtain low time-of-flight delays. High-speed propagation also requires low-loss lines to maintain sharp rise times at the receiver inputs. The low-noise requirement demands impedance-matched, terminated wires with low parasitics and wide spacings, while high-density routing demands narrow wire pitches. These requirements sometimes push the interconnect structure in opposite directions. For example, the demand for low noise increases the routing pitch, while the demand for high density tries to reduce it. These tradeoffs and the final interconnect scheme are described in chapters 4 and 5.
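The dependence of time of flight on the dielectric constant can be sketched numerically. The relation v = c / sqrt(εr) is the one stated above; the dielectric constants and the 5 cm line length in the demonstration are illustrative textbook-style values, not figures taken from this package design.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def time_of_flight_ps(relative_permittivity: float, length_cm: float) -> float:
    """Time of flight in picoseconds, using v = c / sqrt(relative permittivity)."""
    if relative_permittivity < 1.0:
        raise ValueError("relative_permittivity must be at least 1")
    if length_cm < 0.0:
        raise ValueError("length_cm must be non-negative")
    velocity = SPEED_OF_LIGHT_M_PER_S / math.sqrt(relative_permittivity)
    return (length_cm * 1e-2) / velocity * 1e12


if __name__ == "__main__":
    line_length_cm = 5.0  # hypothetical package-level net length
    # Illustrative dielectric constants (assumed values for comparison only).
    for label, eps_r in (("low-K polymer", 2.7), ("silicon dioxide", 3.9), ("alumina", 9.8)):
        delay = time_of_flight_ps(eps_r, line_length_cm)
        print(f"{label:16s} (er = {eps_r:3.1f}): {delay:6.1f} ps over {line_length_cm} cm")
```

Since the delay grows with the square root of the dielectric constant, quadrupling the dielectric constant doubles the time of flight over the same line.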
A low-skew clock distribution is essential to meet the target system speed.
Seven chips on the package require a 2 GHz clock, from which each
generates four 250 ps phases internally. The sources of skew on
the package are variations in the process, temperature, humidity,
coupling noise, and mismatched routing. The clock distribution
scheme is described in chapter 5.
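The clock figures quoted above can be related by simple arithmetic. The sketch assumes the four phases are consecutive and non-overlapping, which the text does not state explicitly; the 25 ps skew value is hypothetical.

```python
def clock_period_ps(frequency_ghz: float) -> float:
    """Period of a clock in picoseconds."""
    if frequency_ghz <= 0.0:
        raise ValueError("frequency_ghz must be positive")
    return 1000.0 / frequency_ghz


def cycle_time_ps(phases: int, phase_ps: float) -> float:
    """Cycle time if the phases are consecutive and non-overlapping (assumption)."""
    return phases * phase_ps


if __name__ == "__main__":
    period = clock_period_ps(2.0)      # 2 GHz distributed clock
    cycle = cycle_time_ps(4, 250.0)    # four 250 ps phases
    skew_ps = 25.0                     # hypothetical skew, for illustration only
    print(f"clock period: {period:.0f} ps, processor cycle: {cycle:.0f} ps")
    print(f"a {skew_ps:.0f} ps skew would consume {skew_ps / 250.0:.0%} of one phase")
```

With these assumptions each phase corresponds to half a period of the 2 GHz clock, and four phases add up to the 1 ns processor cycle.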
The package dissipates about 220 W of power with an average heat
flux of 10.47 W/cm². There are two major constraints
on the scheme used to dissipate this much power. The first is
that the package has to be air-cooled; the second is that
the device junction temperature must be held around room
temperature, since the device parameters are valid at 25 °C only. The chips slow
down with an increase in temperature due to a decrease in the current
gain of the devices. An active cooling scheme coupled with an
air-cooled exchanger was devised to cool the package and obtain
such a low junction temperature. Temperature gradients are
also kept to a minimum, as any resulting skew in signal transmission cuts
into the system cycle time. Extensive thermal modeling of the
package is presented in chapter 6.
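The thermal figures above imply a heat-dissipating area and show why passive air cooling alone cannot hold the junctions at room temperature. The sketch below uses the quoted power and heat flux; the ambient temperature and the thermal resistance values are hypothetical, and the function names are introduced here for illustration.

```python
def implied_area_cm2(power_w: float, heat_flux_w_per_cm2: float) -> float:
    """Area over which the power is spread, from power and average heat flux."""
    if heat_flux_w_per_cm2 <= 0.0:
        raise ValueError("heat_flux_w_per_cm2 must be positive")
    return power_w / heat_flux_w_per_cm2


def passive_junction_temp_c(ambient_c: float, power_w: float, theta_c_per_w: float) -> float:
    """Junction temperature for purely passive cooling: T_j = T_ambient + P * theta."""
    return ambient_c + power_w * theta_c_per_w


if __name__ == "__main__":
    power_w = 220.0  # package power quoted in the text
    print(f"implied area: {implied_area_cm2(power_w, 10.47):.1f} cm^2")
    # Hypothetical junction-to-air thermal resistances with 25 C ambient air.
    for theta in (0.05, 0.1, 0.2):
        t_j = passive_junction_temp_c(25.0, power_w, theta)
        print(f"theta = {theta:.2f} C/W -> junction at {t_j:.0f} C")
```

Any positive thermal resistance places the junction above the air temperature, which is consistent with the use of an active cooling stage ahead of the air-cooled exchanger.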
The package consumes 221.2 W of power at a dc bias of -5.2 V and
42.5 A. Distribution of this power to all the chips with a small
variation is important to maintain the noise margin. External
supply voltage regulation should also be under control. In addition
to these dc limitations, the package should reject power supply
ripple. The power distribution scheme is discussed in detail in chapter
Testability is essential at both the chip level and the package
level. Availability of known good die (KGD) for placement on the
package is critical for the success of this multichip packaging
effort. Since the yield is not expected to be very high, it is
very important to minimize the amount of rework once the
chips are in place on the package. Extracting a chip from
the package may damage other chips as well. The testing scheme
should work within a reduced transistor budget and should be able
to confirm the speed of the system. Its implementation off-chip
is described in chapter 7.
An accurate characterization of the interconnect network plays an important role in the design of high-performance chips. The design of a test chip to characterize a fast GaAs HBT process for digital applications is described in chapter 2. The techniques are applicable to other compound semiconductor technologies and to silicon technologies such as silicon-on-insulator (SOI). The test results are given in chapter 3. The results and testing experience from this test chip were applied to fine-tune the chipset and the package of the F-RISC/G processor. The package design is described in chapters 4-7, with conclusions in chapter 8.