The F-RISC (Fast Reduced Instruction Set Computer) project has
as its goal the exploration of the upper speed envelope for the
throughput capability of one computational node, through the use
of advanced HBT technology. F-RISC/G involves development of a
one nanosecond cycle time computer using 50 GHz fmax
GaAs/AlGaAs Heterojunction Bipolar Transistors (HBT).
In the past contract period the primary activities consisted of
a revision of the design of the cache memory chip, and some alterations
of previously "completed" architecture chips. These
revisions and alterations have been made necessary because it
has been discovered that previously existing CAD tools for extracting
IC wiring capacitance are not sufficiently accurate due to poor
coverage of 3D fringing effects. Certain 3D capacitance extraction
tools developed at Rensselaer have indicated that some of the
previous wiring capacitance estimates may be off by as much as
a factor of two. Since all HBT gate drive capabilities are optimized
for these wiring capacitances, this error could result in F-RISC/G
underperforming relative to its intended clock rate. This has
necessitated reworking most of the critical paths, the register
file, and the cache memory chip for the F-RISC/G effort, and developing
several additional small test "chiplets" containing
subcircuits used to check the design principles employed in the
project. In addition, some of the previously reported work on
designing the cache memory with yield enhancing redundancy has
been found unsatisfactory, resulting in too many HBT's, too much
power dissipation, and little yield improvement. Part of the problem
had been the large number of transistors in the boundary scan
circuit together with the large size of the memory chip itself.
A revised strategy for boundary scan, which takes advantage of
the internal organization of the memory itself, has reduced the
transistor count so that no redundancy is necessary, and in the
process has reduced power dissipation in the L1 cache, where most of the
heat in F-RISC/G is generated. Another subcontract (under the HSCD
BAA) was awarded to Rensselaer by Rockwell. An indirect result
of this award is that we have had access to early models of the
new Rockwell 100 GHz fmax GaAs/AlGaAs HBT and companion tentative
design rules, which can serve as the basis of an F-RISC/H superscalar
design with roughly a 2x increase in clock rate, resulting in 4 times
the MIPS rate of F-RISC/G, or 4000 MIPS. The new transistor is about one quarter
of the size of the older 50 GHz model and it appears that wire
lengths may be halved. The collector current at which the fT
peaks is about one quarter that of the present 50 GHz
HBT, so the increase in performance will be accompanied by a reduction
in total power dissipated by a factor of 2!
- Exploration of the Fundamental Limits of High-Speed Architectures.
- Study of GaAs HBTs for Fast Reduced Instruction Set Computer Design including Adequacy of Yield, and Device Performance for High Performance Computing Applications.
- Research in the Architectural Impact of Advanced MultiChip Module (and 3D) Packaging in the GHz Range.
- Examination of Pipelining across Pad Driver Boundaries as a Means for Reducing Adverse Effect of Partitioning due to Yield Limitations in Advanced Technologies.
- Investigation of Power Management in High Power Technologies.
- Study of Appropriate Memory Organization and Management for Extremely Fast RISC Engines using Yield Limited Technologies.
- Use of Adaptive Clock Distribution Circuits for Skew Compensation at High Frequencies.
- Exploration of Superscalar and VLIW organizations in the sub-nanosecond cycle regime.
- Exploration of a combination of HBT and MESFET technologies for lower power, higher yield, but fast cache memory [AASERT Program].
- Exploration of novel HBT technologies, such as HBT's in the SiGe and InP materials systems.
The F-RISC/G (Fast Reduced Instruction Set Computer - version G) project has as its goal the development of a one nanosecond cycle time computer using GaAs/AlGaAs Heterojunction Bipolar Transistor Technology. More generally the project seeks to explore the generic question of how one can achieve with bipolar circuits higher clock rates than expected from silicon based CMOS alone. Traditionally, CMOS has achieved its increasing clock rates from lithography improvements, shortening the lengths of devices and interconnections to achieve higher speed. Bipolar devices, on the other hand, have achieved their speed by reducing the thickness of various device layers, with more recent improvements coming from band gap engineering in heterostructures, and schemes to include built-in acceleration fields in the base transit region (graded base techniques). Since thicknesses of various layers in semiconductor processing can be minimized with proper yield engineering, short device transit times (or high transit time frequency) in principle favor the bipolar device. This is often quantified, at least for analog applications, by the closely related unity current gain or transit time frequency, fT .
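For reference, the usual textbook relation between fT and the device transit delays is sketched below. It is given here as an assumption, since this text does not reproduce the expression itself. The symbols are illustrative names: \tau_{ec} is the total emitter-to-collector delay, made up of the emitter charging time \tau_e, the base transit time \tau_b, the collector space-charge transit time \tau_{sc}, and the collector charging time \tau_c.

```latex
f_T = \frac{1}{2\pi\,\tau_{ec}}, \qquad
\tau_{ec} = \tau_e + \tau_b + \tau_{sc} + \tau_c
```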
The fact that the intrinsic base region for the fastest device is so very thin tends to make Rb high. Also the large area of the buried collector under the entire base tends to make Cbc an important consideration. Device optimization can clearly benefit by lateral shrinking of the device, lowering the area of the base collector interface, and shortening the path through the base intrinsic resistance.
Another parameter for quantifying the performance of the HBT is its maximum oscillation frequency or unity power gain frequency in the common emitter configuration:
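The standard textbook approximation for this frequency is sketched below. It is stated as an assumption rather than as the report's own expression, with R_b the base resistance and C_{bc} the base-collector capacitance. In this form fmax exceeds fT whenever the product 8\pi R_b C_{bc} f_T is less than one.

```latex
f_{max} \approx \sqrt{\frac{f_T}{8\pi\, R_b\, C_{bc}}}
```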
These two canonic frequencies are often quoted in the literature for analog circuit performance, and they give some feeling for how fast the basic device will respond. One must use caution, however, as these parameters often do not give a clear picture about the speeds of logic circuits which utilize these devices. In addition to these "intrinsic" HBT parameters, logic circuit performance also depends on the burden of wiring capacitance and circuit input loading. Nevertheless, the fmax parameter is considered a better indicator of logic circuit performance than fT .
It is possible to achieve clock frequencies in digital circuits close to 25% of fT in small circuits such as frequency counters and serial to parallel converters. Hence, some 12 GHz serial to parallel converters have been fabricated in 50 GHz fT technology using full differential logic circuits. Unloaded gate delays often approach 1/fmax. Hence the unloaded fast gate delay in the same 50 GHz technology is around 18 ps. Clearly circuits will not work any faster than this since in any practical case wiring delays will slow these results down. Of course, the circuits approaching these upper limits in speed also have high power dissipation.
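The two rules of thumb in this paragraph can be written out as a short sketch. The function names are illustrative, and the 25% fraction and the 1/fmax delay are the approximate figures quoted above, not a device model. For a 50 GHz device, 1/fmax evaluates to 20 ps, the same order as the roughly 18 ps quoted.

```python
# Rule-of-thumb speed estimates quoted in the text (not a device model).

def max_clock_ghz(ft_ghz, fraction=0.25):
    """Small-circuit clock estimate as a fraction of fT (about 25% per the text)."""
    return fraction * ft_ghz


def unloaded_gate_delay_ps(fmax_ghz):
    """Unloaded gate delay estimate of about 1/fmax, converted to picoseconds."""
    return 1000.0 / fmax_ghz


if __name__ == "__main__":
    for f_ghz in (50.0, 100.0, 200.0):
        print(f"{f_ghz:.0f} GHz: clock ~{max_clock_ghz(f_ghz):.1f} GHz, "
              f"unloaded delay ~{unloaded_gate_delay_ps(f_ghz):.1f} ps")
```

These are upper bounds only; as the paragraph notes, wiring delays slow any practical circuit below them.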
Interestingly, fmax can be larger or smaller than fT. It can be larger if the RbCbc product is small enough. The fmax is also an important parameter in circuit applications when power must be controlled at high frequencies, and lowering the base resistance Rb, which raises fmax, is the subject of several recent articles. The Rockwell HBT has several unique features which focus on lowering this resistance. Realizing that base resistance has two components, one located in the active base region (intrinsic) and one accounting for the base extension out to the base contact (extrinsic), the Rockwell HBT uses a thicker layer for the extrinsic base and keeps the distance between the emitter and base contact to a minimum.
Operation of the device in a loaded circuit depends on the ability of the transistor to supply enough current to transfer the charge needed for the desired voltage swing in the shortest time when there is a logic transition at the input of the device. Clearly this depends on the transconductance of the device which in the bipolar case is:
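The ideal bipolar transconductance is conventionally written as shown below. It is stated here as the standard textbook form, on the assumption that it matches the expression intended, with I_C the collector current, q the electron charge, k Boltzmann's constant, and T the absolute temperature.

```latex
g_m = \frac{\partial I_C}{\partial V_{BE}} = \frac{q\, I_C}{k\, T}
```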
Theoretically, this means that the device has a transconductance which is limited only by the amount of current which the device can handle, but of course this current determines power dissipation. In practice it is the current density in the emitter which ultimately limits this current because of dopant redistribution and other effects similar to electromigration which degrade the emitter. It is important to note that successful shrinkage of the HBT requires research to permit higher current densities. P-type Be doping in the GaAs/AlGaAs base has exhibited dopant redistribution at high current densities in the base, but carbon doping has improved this significantly. Currents of 3 mA in the Rockwell 50 GHz HBT with emitter stripe of 1.4 µm by 3 µm area led to Be redistribution, or a current density of 4.2 x 10^4 A/cm^2. For comparison, a current density of 5 x 10^5 A/cm^2 will cause electromigration in aluminum lines. Carbon doping can be substantially better than Be for the GaAs/AlGaAs HBT.
Traditionally CMOS has held an advantage in power dissipation. However, whether conventional or dynamic, CMOS utilizes a rail-to-rail logic voltage swing and at higher frequencies the dynamic power dissipation of this style of CMOS circuit can become a limiting factor in its exploitation for the high end of computing. Bipolar circuits, on the other hand, have a high static power dissipation, but can be arranged to have very low dynamic power dissipation in actual logic circuits by using much lower voltage swings.
Traditionally the high power dissipation of bipolar has been required in order to keep current densities up in the emitter for high gain. However, bipolar technology can also be scaled using more aggressive lithography than the current 1 µm minimum feature size, and its current density can be kept high while lowering the total current and therefore the static power. A comparable scaling of interconnections is also assumed in making this statement. At some point the dissipation per gate of these two technologies will converge, and the question will then be which one delivers the highest computational rate for the lowest total power. The answer to this question is surprising. We note that the present F-RISC/G 32 bit integer engine dissipates about 250 W up to and including L2 cache chips and produces 1000 MIPS. A DEC Alpha dissipates 30 W, with another 20 W in cache at L2, for a total of 50 W. The CMOS Alpha is a 64 bit architecture which we will ignore in the comparison since it is rare that 64 bit integers are required. Nevertheless, allowing for a factor of 2.5 between the MIPS rates for these two processors, the Alpha at 1000 MIPS would be about 150 W vs. 250 for the single FRISC/G engine. Hence the crossover is very close. At higher clock rates the relative power per MIPS comparison could easily go the other way. We shall return to this issue when we discuss InP and SiGe HBT and even GaAs/AlGaAs technologies in shrunk lithographies where this situation will prevail.
Noise is another key consideration in predicting the future for computer technology. CMOS is not a balanced (continuous) current logic technology. Devices conduct current in CMOS only while loads are charging. Once these capacitive wiring loads are charged or discharged the current flow ceases. Hence, currents in conventional or dynamic CMOS circuits are constantly switching on and off. The situation is worst in dynamic CMOS where large numbers of gates have their outputs precharged nearly simultaneously. This causes transient current surges in the power distribution system leading to switching noise due to parasitic inductances in the power supply. Bipolar circuits, arranged to redirect a constant current through differing paths in current trees, have the ability to dramatically reduce this noise. This constant current is also the cause for bipolar circuits having a high steady state power loss. Hence, the bipolar designer must attempt to use his logic units in every cycle because unlike in conventional CMOS each current tree unit dissipates static power even when it is not used. The main path for reduction of this power loss is through bipolar device scaling and architectures that efficiently use a small set of logic units. Current Mode Logic (CML) provides for such a rich family of functional cells using only modest numbers of transistors.
This type of circuit is best described as a "steering" circuit and not a "switching" circuit, because the current in the system is held constant, and is simply steered through different paths in the circuit depending on the logic state. In switching circuits the currents are switched on and off leading to switching noise. Now, it is possible to implement current steering using CMOS devices using some unconventional circuits called Current Steering Logic (CSL). However, these do not implement full differential logic with the kinds of very low voltage swings feasible in bipolar steering circuits. Even using E/D mode NMOS it is possible to make silicon MOSFET equivalents of GaAs SCFL current tree circuits. But these also do not get the voltage swing down to the levels possible with the bipolar device. Also a lot of circuit tricks which are possible in conventional or dynamic CMOS don't work on SCFL or CSL styles of circuit design.
Industrial assessments appear to predict that the Si homojunction bipolar device cannot offer performance advantages relative to Si CMOS. This appears to have been based on a relatively obscure relationship derived at RCA in the 1960's which shows that the speed of the silicon homojunction bipolar device is related to its breakdown voltage.
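A relationship of this kind is commonly known as the Johnson limit. Its usual form is sketched below on the assumption that it is the relationship meant here. V_{br} is the breakdown voltage, E_{crit} the critical (breakdown) field of the material, and v_{sat} the saturated carrier drift velocity; the symbol names are illustrative.

```latex
V_{br}\, f_T \;\le\; \frac{E_{crit}\, v_{sat}}{2\pi}
```

For a given material the right-hand side is fixed, so a higher fT is bought with a lower breakdown voltage.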
Clearly, improvements in speed of the Si homojunction bipolar device will come at the cost of decreasing breakdown voltage of the thin base. The HBT permits a circumvention of equation (6). It is the HBT which appears to offer enough additional avenues for speed improvement that it has become the primary focus of our effort. Heterostructure MESFET's, or High Electron Mobility Transistors (HEMT's), could also play an important role in high performance computing, especially in the InP system where complementary HEMT's are possible. However, the transconductance of the FET is much lower than that of the HBT. For example, the transconductance of a state of the art CMOS device with a channel length of 0.1 µm is only about 120 mS/mm of channel width. MESFET's can be as high as 600-1100 mS/mm. By comparison the bipolar device offers as high as 20,000 mS/mm for transconductance. As a result, loading effects in HBT circuits, while not negligible, are considerably lower than in CMOS or FET circuits. Lightly capacitively loaded MESFET circuits can perform as well as lightly loaded, deep submicron CMOS, but in even moderately loaded circuits the interconnect capacitance can dominate the performance. Nevertheless, the lightly loaded situation can be of significance, especially in regular structures such as memories. Hence, our effort should eventually encompass a combination of HBT and MESFET or HEMT device technology. This is expected to impact primarily the cache memory, where the core of the memory could consist of MESFET cells, while the decoder and sense amplifiers could be implemented with fast HBT devices which preserve the low switching noise and low voltage swings desired.
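As a rough plausibility check on the bipolar figure, the sketch below evaluates the ideal relation gm = Ic / (kT/q). The 1.4 mA current and 3 µm stripe length are figures quoted elsewhere in this report for the 50 GHz device; the choice to normalize by emitter stripe length, the 300 K temperature, and the function names are assumptions made only for illustration.

```python
# Ideal bipolar transconductance gm = Ic / (kT/q), normalized per mm of emitter stripe.
# Normalizing by stripe length is an illustrative assumption, not the report's definition.

BOLTZMANN_J_PER_K = 1.380649e-23
ELECTRON_CHARGE_C = 1.602176634e-19


def thermal_voltage_v(temp_k=300.0):
    """Thermal voltage kT/q in volts."""
    return BOLTZMANN_J_PER_K * temp_k / ELECTRON_CHARGE_C


def bipolar_gm_ms(ic_ma, temp_k=300.0):
    """Ideal transconductance in mS for a collector current in mA (mA / V = mS)."""
    return ic_ma / thermal_voltage_v(temp_k)


def gm_per_mm(ic_ma, stripe_length_um, temp_k=300.0):
    """Transconductance in mS per mm of emitter stripe length."""
    return bipolar_gm_ms(ic_ma, temp_k) / (stripe_length_um * 1e-3)


if __name__ == "__main__":
    print(f"gm at 1.4 mA: {bipolar_gm_ms(1.4):.1f} mS")
    print(f"per mm of a 3 um stripe: {gm_per_mm(1.4, 3.0):.0f} mS/mm")
```

Under these assumptions the result is on the order of the 20,000 mS/mm quoted above; parasitic emitter resistance, which this ideal relation ignores, would lower it.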
Of course, none of these arguments is convincing unless the fabrication
yields of HBT circuits are compatible with the design of processors.
Based on yield projections made by Rockwell, our effort is currently
focused on building block components for a partitioned RISC design
with roughly 5000-7000 HBT's per chip. Such chips should be expected
to yield in the range of 10-20%. Cache memory is more challenging,
and will demand higher HBT counts, near 8,000 - 9,000. The inherent
device yields should be high using OMCVD processing as the oval
defect density [the technology specific yield detractor] is exceedingly
low at 5 ovals per square centimeter. Theoretically, this should
make it possible to fabricate 100,000 HBT devices, even today.
However, doping and thickness uniformity problems, and interconnect
defects appear to mask this improved state of affairs in the basic
materials for GaAs/AlGaAs technology. However, the figure of 5
ovals per square centimeter reported to us by Rockwell is only
a best case quotation. In practice, as we learned in our test
chip run, the existing state of affairs in practical runs may
be substantially worse. If the oval densities are instead as poor
as those of even the best MBE epitaxy, they might be as high as
100 per square centimeter, which
would be consistent with reasonable yields of 5000 HBT's. After
incorporation into a real circuit the practical yields might drop
to only 300 HBT's. Hence the success of projects such as F-RISC depends
crucially on quality control of the GaAs epitaxy, and use of high
2.0 Evolution to 100 GHz GaAs/AlGaAs HBT Technology
The work conducted on the F-RISC/G project has relied on Rockwell's 50 GHz baseline HBT process. However, recent evolutionary improvements in the Rockwell process have led to an initial 100 GHz process. The exact process changes which have led to this improvement have not been disclosed. However, from equations (1) and (2) it should be evident that some of this improvement has come from improving the base transit time, and we surmise that part of this is by making the base thinner. The thinner base increases fields within the base, which might appear unfavorable because the saturation drift velocity advantages of GaAs actually move closer to silicon at high field. But the base may be so thin at this point that electrons won't have collisions in the base, and so the saturation drift velocity used to compute mobility is no longer relevant. In this regime the crossing of carriers through the base is termed ballistic. It is possible that for sufficiently small transit regions the GaAs/AlGaAs system will still be better than silicon at high field strengths. According to Rockwell the 100 GHz devices fit into the same outline in layout as the earlier 50 GHz devices, i.e. there was no lateral device shrinkage. So literally one could map the entire F-RISC/G layout onto the new devices and enjoy improved speed advantages. Early simulations with this device, however, do not reflect a doubling of circuit performance. Note that not only did the device not shrink, but the interconnection process had not shrunk either. Because there has been no lateral device or interconnection shrinkage in this initial device, we have estimated from SPICE simulations using Rockwell supplied models of this device that the new transistor could only provide at best a 35% increase in speed in lightly loaded circuit situations, and only 25% in more heavily loaded situations.
Nevertheless, this speed up could be combined with future yield improvements to attain a doubling of clock frequencies for FRISC. Hence we have begun a quiet move towards taking advantage of this new device. Yield would have to double in the Rockwell line to achieve a clock rate doubling with the new transistor. Interestingly if the yield were to quadruple one would not even need the new transistor to double clock rates.
More recently, however, Rockwell has proposed to ARPA an option in its HSCD contract for development of a much more aggressive "digital" version of the 100 GHz process. The earlier offering of the so called "100 GHz" process had been primarily for power HBT applications. The newer device model offered comes much closer to providing unloaded gate delays which are half those of the 50 GHz process, but more importantly, the peak of the fT vs. IC curve at 100 GHz occurs for a transistor which is about one fourth of the area of the 50 GHz HBT and for a current which is also reduced by a factor of four relative to that process. The reduced emitter size is 1.2 µm by 1.2 µm, compared to the 1.4 µm by 3 µm of the 50 GHz process.
Wires in a 50 GHz and proposed 100 GHz processes were simulated to extract differential mode capacitance using METAL, a program made commercially available by OEA, Inc. of San Jose, California. The results are shown below in Table I. Two configurations were tried for wire pairs on metal 1, metal 2, and metal 3 layers. The first pair doesn't have any adjacent lines. The second pair has adjacent ground lines in the same layer. The design rules proposed for the "digital" shrink for wire width and separation are shown:
The striking result from this table is that there is very little change in the per unit length capacitance in the geometries shown between the two processes. One can conclude that due to the semi-insulating substrate the dominant capacitive effect is due to fringing fields. These are not influenced by the smaller areas due to the reduced wire width of the smaller wire rules. Consequently the only effect on capacitance is through shortening of the wire lengths, not on the shrinking of wire widths.
2.1 GaAs/AlGaAs HBT Device Performance Comparisons
This section presents a summary of the results comparing the 50 GHz Q1 HBT (referred to subsequently as Q1), presently used by the F-RISC group for all design work, with the new 1.2 µm x 1.2 µm emitter stripe 100 GHz HBT (referred to subsequently as Qp1.2µmx1.2µm) under development at Rockwell. In order to evaluate the devices, PSPICE netlists for three four-stage ring oscillators (three inverter stages and one buffer stage) were created. The Q1 device ring oscillator uses the F-RISC group's high power buffer standard cell schematic for each buffer instance. This cell has a switching current of 1.4 mA, an amount too high for the Qp1.2µmx1.2µm device to handle because of its reduced emitter size. Therefore, the emitter and collector resistors for the high power cell were modified to create a buffer cell for the Qp1.2µmx1.2µm device with a switching current of approximately 690 µA, the maximum current value which the new device is capable of handling.
Initial PSPICE simulations were then performed on the two ring oscillator circuits neglecting all interconnect capacitance. The device model parameters used in the simulations were those provided by Rockwell Int. on 2/7/94. The third set of ring oscillator simulations were performed using a version of Qp1.2µmx1.2µm with a 1.4 µm x 3.0 µm emitter stripe (subsequently referred to as Qp1.4µmx3.0µm). The increase in the emitter stripe length allows the new device to handle the same amount of current as the Q1 device. The buffer circuits for this set of simulations, therefore, would be able to function with a 1.4 mA switching current, the same amount used in the Q1 buffer circuit. These additional simulations were performed in order to determine if any significant improvement could be achieved through the use of a 100 GHz device with a higher current handling capability than the 100 GHz device currently proposed by Rockwell Int. The buffer gate delays for the three sets of simulations are summarized in Table II. The outstanding result here is that the unloaded gate delay using Qp1.2µmx1.2µm devices is 8.5 ps, roughly 1/2 of that observed for the buffer using 50 GHz Q1 devices. This improvement in switching speed is consistent with the fmax values quoted by Rockwell Int. for the old and new devices.
The capacitance extractor provided with VTITOOLs was used to determine the equivalent capacitance lumped to ground for the internal lines of the high power buffer cell using Q1 devices. The results of the extraction show that although most of the internal interconnect capacitance is fairly small, the cell input and output metal lines which run from the top to the bottom of the cell present a substantial capacitive load.
The internal capacitance values determined by the extractor were incorporated into the PSPICE netlist and a 100 fF load capacitance to ground was placed on each of the external lines connecting the buffers together in a ring to show the effect of interconnect capacitance on the Q1 buffer's performance. The high power buffer cell layout was then redesigned for use with the Qp1.2µmx1.2µm and Qp1.4µmx3.0µm devices. This was essential because the use of these devices also entails the use of a new set of design rules. The device artwork and design rules for the new cell were based on information provided by Rockwell on 2/7/94. The design rule changes involve an approximately equivalent reduction in the metal line widths and metal line spacings. This, coupled with the reduction in size of the 100 GHz devices over that of the Q1 device, has enabled a significant reduction in the size of the buffer cells using the new devices.
A comparison of the three cells can be found in Figure 1. Note that the lengths of the internal metal lines in the new cells were reduced significantly from those found in the cell using Q1 devices, resulting in an overall reduction in the internal interconnect capacitance of the new buffer cells. To estimate the effective capacitance lumped to ground contributed by the internal wire lines in the new buffer cells, the capacitance values determined by the VTITOOLs extractor for the internal cell metal lines of the high power buffer used by the Q1 device were scaled by the ratio of the lengths of the equivalent lines for each cell.
These new internal interconnect capacitance values were incorporated into the PSPICE netlists for the ring oscillators employing the Qp1.2µmx1.2µm and Qp1.4µmx3.0µm devices. Since the widths of the new cells decreased relative to the width of the buffer cell using Q1 devices, it is reasonable to assume that the average length of the metal interconnect lines between cells in a design will decrease accordingly. Therefore, the 100 fF load capacitance for each external interconnect line in the new cells was scaled by the ratio of the new and old cell widths to reflect the decrease in interconnect capacitance which will accompany the decrease in the lengths of metal lines as a result of the implementation of the new design rules. This allows for a more accurate estimate of the performance improvements for a 100 GHz device in actual designs. A new set of PSPICE simulations were then performed for all three ring oscillators.
The results summarized in Table II illustrate that two major factors play a role in determining digital gate delays. They are the device switching speeds and the gate's capacitive loading. With no capacitive loading, a gate's switching delay is entirely determined by the device switching speeds within the gate. Therefore, the factor of two reduction in switching delay between the buffers using the 50 GHz devices and the buffers using the 100 GHz devices was not surprising. However, when capacitive loading effects are considered, additional delay is incurred in order to add or remove charge from the metal interconnect in order to change the metal line voltages. This delay can be approximated by:
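The usual first-order form of this charging delay is sketched below. It is offered as an assumption about the intended expression, with C_{load} the capacitance of the driven interconnect, \Delta V the logic voltage swing, and I_{switch} the switching current of the gate.

```latex
\Delta t \approx \frac{C_{load}\,\Delta V}{I_{switch}}
```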
|Device|Gate delay (no capacitance)|Load (with capacitance)|Switching current|
|Q1|18 psec|100 fF|1.4 mA|
|Qp1.2µmx1.2µm|8.5 psec|54 fF|0.69 mA|
|Qp1.4µmx3.0µm|8.3 psec|60 fF|1.4 mA|
|Qp1.2µmx1.2µm|8.5 psec|50 fF|0.69 mA|
|Qp1.4µmx3.0µm|8.3 psec|50 fF|1.4 mA|
|Q1|18 psec|100 fF|2 mA|
|Qp1.4µmx3.0µm|8.6 psec|50 fF|2 mA|
In CML logic circuits Iswitch, the dynamic switching current, is approximately equal to the static current of the circuit. This indicates that gates driving high capacitive loads need to operate at higher current levels in order to prevent the capacitive loading delay from dominating the switching time. This is one reason why the buffer using the Qp1.2µmx1.2µm device did not show the 100% increase in performance with capacitive loading that was observed with no capacitive loading. Clearly, therefore, there is a need for a 100 GHz device with the same current handling capability as the Q1 device for driving long interconnect lines. Simulation results for the ring oscillator using Qp1.4µmx3.0µm devices and incorporating capacitive loading effects showed a significant improvement in buffer switching times relative to those observed in similar simulations using a ring oscillator with Qp1.2µmx1.2µm devices. However, the goal of a 100% improvement in switching time with capacitive loading was not quite reached. This is because the design rule changes for the 100 GHz devices did not allow a full factor of two scaling of metal interconnect pitches and the 100 GHz devices did not shrink by a full factor of four in area. Thus, the metal line lengths within and external to the buffer cell could not be scaled by a full factor of two. Therefore, as illustrated by equation (7), the full factor of two improvement in capacitive loading delay was not achievable with the new design rules provided by Rockwell given a device with the same current handling capability as Q1.
To illustrate the potential gain of a new technology in which all design rules scaled by a factor of two, another set of simulations were performed with the ring oscillators using the Qp1.2µmx1.2µm and Qp1.4µmx3.0µm devices in which all capacitance values were scaled by a factor of two from those found in the ring oscillator using Q1. The results of these simulations are found in Table II. These simulations show that, within a reasonable margin of error, the buffer using Qp1.4µmx3.0µm devices with a switching current of 1.4 mA shows a 100% improvement in switching speed over the buffer using Q1 devices operating at the same switching current. Therefore, a factor of two improvement in switching speed is achievable, even with heavily loaded gates. This is most easily accomplished if Rockwell can provide us with a version of their proposed 100 GHz device which can handle higher current levels and decrease the constraints on their design rules to allow a factor of two decrease in interconnect line lengths. The use of customized layouts and hand placement and routing of gates along critical paths will also help to reduce interconnect line lengths to ease the burden on Rockwell of providing a full factor of two decrease in design rule scaling.
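The trend described here can be checked with a first-order charging-delay estimate, C times the voltage swing divided by Iswitch, which is assumed to be the form of equation (7). The load and current values below are the ones listed in Table II. The 0.25 V swing is a hypothetical placeholder, and the ratios to the Q1 case do not depend on it. This sketch covers only the wire-charging term, not the intrinsic device delay, so it does not reproduce the PSPICE results.

```python
# First-order wire charging delay, dt = C * dV / I, for the buffer cases in Table II.
# SWING_V is a hypothetical CML logic swing; the ratios below are independent of it.
SWING_V = 0.25

Q1_CASE = "Q1, 100 fF, 1.4 mA"

# (load capacitance in fF, switching current in mA), taken from Table II
CASES = {
    Q1_CASE: (100.0, 1.4),
    "Qp1.2x1.2, 54 fF, 0.69 mA": (54.0, 0.69),
    "Qp1.4x3.0, 60 fF, 1.4 mA": (60.0, 1.4),
    "Qp1.4x3.0, 50 fF, 1.4 mA": (50.0, 1.4),
}


def wire_delay_ps(c_load_ff, i_switch_ma, swing_v=SWING_V):
    """Charging delay in picoseconds (fF * V / mA works out to ps)."""
    return c_load_ff * swing_v / i_switch_ma


def ratios_to_q1():
    """Charging delay of each case divided by that of the Q1 case."""
    base = wire_delay_ps(*CASES[Q1_CASE])
    return {name: wire_delay_ps(c, i) / base for name, (c, i) in CASES.items()}


if __name__ == "__main__":
    for name, ratio in ratios_to_q1().items():
        print(f"{name}: {ratio:.2f} x the Q1 charging delay")
```

In this estimate the small-emitter device gains nothing on the wire-charging term, because its lower current offsets the shorter wires, while the larger-emitter device reaches a factor of two only when the load is fully halved.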
Yield for the new 100 GHz GaAs/AlGaAs process is difficult to
predict since the main mechanisms for yield detraction are unclear
in the present 50 GHz baseline process. However, if the principal
mechanism is the oval density, then any scaling which decreases
the area of the transistor should be beneficial, barring a new
mechanism for yield detraction. If, optimistically, we could assume
the oval defect is the principal yield detractor, then yield should
increase by a factor of 4 in this modest scaling. Comments made
by TI suggest that this may have occurred on their 50 GHz GaAs/AlGaAs
emitters down process also. Hence, if this proves to be the case
many of the assumptions of F-RISC/H would appear viable, namely
wider slice integration (16 bit to 32 bit widths might be possible).
This together with the faster device and ability to incorporate
VLIW superscalar concepts (one integer and one floating point
unit) could produce a machine with roughly 4 times the peak performance
of F-RISC/G or 4000 MIPS, and much greater viability for commercial
exploitation. The more lightly loaded gates could operate at one
quarter of the power, while heavily loaded gates might require
the same current as used prior to the scaling. Hence the processor
speed would quadruple while power consumption would actually be reduced.
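The factor-of-four yield argument can be illustrated with a simple Poisson defect model. The report does not state a yield model, so this is an assumption, and the per-device critical areas (400 and 100 square microns) are hypothetical values chosen only to show a fourfold area reduction. The defect density of 100 per square centimeter is the pessimistic oval density mentioned earlier.

```python
# Poisson yield sketch: Y = exp(-N * A * D). The model and the critical areas are
# illustrative assumptions; only the 4x area ratio reflects the scaling in the text.
import math

UM2_TO_CM2 = 1e-8  # 1 square micron = 1e-8 square centimeters


def poisson_yield(n_devices, critical_area_um2, defects_per_cm2):
    """Probability that none of n_devices is hit by a killer defect."""
    return math.exp(-n_devices * critical_area_um2 * UM2_TO_CM2 * defects_per_cm2)


def devices_at_yield(target_yield, critical_area_um2, defects_per_cm2):
    """Device count that gives target_yield under the same model."""
    return -math.log(target_yield) / (critical_area_um2 * UM2_TO_CM2 * defects_per_cm2)


if __name__ == "__main__":
    old_n = devices_at_yield(0.2, 400.0, 100.0)  # hypothetical 50 GHz device footprint
    new_n = devices_at_yield(0.2, 100.0, 100.0)  # footprint reduced to one quarter
    print(f"devices at 20% yield: {old_n:.0f} -> {new_n:.0f} ({new_n / old_n:.1f}x)")
```

If defects of this kind are the dominant yield detractor, the device count at a fixed yield scales inversely with device area, which is the sense in which a quarter-area transistor supports about four times the integration level.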
The GaAs/AlGaAs material system is not the only one where HBT's may be fabricated. In recent years HBT's with exciting characteristics have been fabricated in the SiGe materials system at IBM. For several years TI, Hughes, and ATT have been working in the InP/InGaAs/AlGaAs alloy system and some of the fastest HBT devices have been reported in this system. Interestingly, successful complementary MESFET and HEMT devices have been reported in the same systems, although not necessarily simultaneously. However, in the case of SiGe, this exact combination of HBT and CMOS has been realized and is being modeled and characterized at IBM at 0.35 µm x 1.0 µm emitter sizes where lower static power dissipation is possible in the HBT. In addition to the lower currents characteristic of scaled systems the Vbe is only 0.7 V, or half that of the GaAs/AlGaAs system (1.4 V). In addition, the "safe" emitter current density without dopant redistribution appears to be much higher (1 x 10^5 A/cm^2) in the SiGe system. This means that the transistor scaling can be much more complete since larger transistors will not be needed for higher currents. The IBM SiGe system offers a 50 GHz fT and a breakdown voltage of 3.5 V. HBT integration levels have been demonstrated at greater than 10,000 HBT's, and recently IBM has signed a joint agreement with Analog Devices to create 1 GHz D/A converters which demonstrate 6K HBT yield levels for commercial purposes. There have even been hints of much larger yields. Yields of 10,000 HBT's would, as we have already stated, make 16 bit slice integration possible, eliminating two pad driver-receiver delays in the major carry critical path. Yields of 20,000 HBT's would permit complete integration of the 32 bit processor.
A preliminary probe of the IBM SiGe HBT process at East Fishkill has been funded as a part of this contract, and permission appears to have been secured to remake, in the IBM line, the RPI test chip previously fabricated in the Rockwell line. It is expected that the yield will be much higher than with the GaAs/AlGaAs process. Our group is awaiting a set of models and characterizations to be completed by IBM. One of our students will have to travel to IBM to work with proprietary design rules to establish whether this direction is as promising as it appears to be.
By comparison InP is at the opposite end of the yield scale with
HBT counts closer to 10-200, and for that reason it might be considered
premature to examine them for serious digital circuit implementation.
However, it was only a decade ago that the yields for GaAs/AlGaAs
HBT circuits were similarly low. The speeds of the InP HBT are
even higher than for the GaAs/AlGaAs HBT's which are currently
available. For example TI describes an HBT in the InP system which
has an fmax of 200 GHz and
this device is still not submicron in size, which suggests modest
digital circuits could be made with unloaded gate delays of as
little as 4.5 ps today! As with SiGe, the InP system has a Vbe
of only about 0.7V, so power dissipation is reduced from this
effect. Even these small chips could be useful as tester chips
for the GaAs/AlGaAs or SiGe HBT circuits since they are so fast.
However, they also create an opportunity to gain circuit design
experience in preparation for a time when yields for even these
delicate circuits may look attractive. TI has offered us an opportunity
to make some small test circuits in this foundry and we will attempt
to do so if the time is available given the other contract deliverables.
4.0 Design of a 1-20 GHz Voltage Controlled Oscillator
The effect of wire loading in the GaAs/AlGaAs HBT system is becoming more evident as experience is gained with it. In particular, despite the use of Polyimide and Au for the interconnect dielectric insulator and metalization, it is evident from the previous section that extreme care must be exercised in connecting the devices. Layout can be almost as important as it is in CMOS circuits to wrest the utmost speed from the HBT. This is evidenced not only by the delays imposed by the increases in rise time and fall time, but also by the decrease in the bandwidth of signals in the system. When pressing the upper limits possible with the HBT one must be aware of the nonlinearities of the HBT, which can generate harmonics and subharmonics. To probe this aspect of the technology we have attempted the design of a "challenge" circuit, targeted to operate at 40% of the fmax for the 50 GHz HBT process.
A high speed voltage-controlled oscillator (VCO) has been developed which can generate differential signals in the range of 1-20 GHz. VCOs will become an integral part of future computers for providing clocks for digital and other synchronous circuits. This VCO consists of a frequency generator, a frequency multiplier and a frequency divider, along with various high-speed buffers, multiplexers and drivers. The design uses 412 transistors and dissipates 2.60 W at 20.0 GHz. The frequency of oscillation is controlled by an external bias voltage. There is also a 12-stage ring oscillator which is included as a means for determining the baseline speed for the fabricated transistors. This project has enabled us to gain more experience with the Rockwell 50 GHz process as well as aspects of high frequency design and the effects of interconnect parasitics. This design has also directly benefited the FRISC project through the improvement of the high-speed register file design and the datapath chip.
4.1 System Design
The high-speed VCO consists of a frequency generator, a frequency multiplier and a frequency divider (see Figure 2). A base frequency is produced by the frequency generator and is controllable by adjusting the bias voltage input to the system. The base frequency can be multiplied by a factor of 2 or 4 by the frequency multiplier. The frequency divider is capable of dividing the frequency by factors of 2, 4 or 8. SPICE simulations have shown the VCO base frequency operating at 2-5 GHz and the multiplier and divider operating at frequencies up to 20 GHz.
The frequency generator (Figure 3) includes four delay elements connected in a ring with an inversion placed in the differential feedback path between the last and first elements. The frequency range of the generator is from 2 to 5 GHz and is controlled by an externally applied bias voltage. Also included in the delay elements are high-gain buffers which are capable of driving the long lines between the generator core and the VCO multiplexers. In order to attain the highest speed possible, the generator was implemented and placed first during the design process.
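As a rough check on the generator, the usual first-order relation for a ring of N delay stages with one net inversion is f = 1/(2 N t_d), since an edge must travel around the ring twice per period. The sketch below applies that textbook relation, not the SPICE results, to the four-element ring in order to estimate the per-element delay implied by the 2-5 GHz tuning range. The function names are illustrative, and the buffers inside each delay element are assumed to be lumped into the element delay.

```python
def ring_frequency_hz(n_stages: int, stage_delay_s: float) -> float:
    """First-order ring oscillator frequency: one period is two trips around the ring."""
    return 1.0 / (2.0 * n_stages * stage_delay_s)


def stage_delay_s(n_stages: int, frequency_hz: float) -> float:
    """Per-stage delay needed for a given oscillation frequency (inverse of the above)."""
    return 1.0 / (2.0 * n_stages * frequency_hz)


N_STAGES = 4  # four delay elements in the frequency generator

slow_delay = stage_delay_s(N_STAGES, 2e9)  # element delay at the 2 GHz end of the range
fast_delay = stage_delay_s(N_STAGES, 5e9)  # element delay at the 5 GHz end of the range

print(f"2 GHz -> {slow_delay * 1e12:.1f} ps per element")
print(f"5 GHz -> {fast_delay * 1e12:.1f} ps per element")
```

Under this idealized model the bias voltage would have to tune each element between roughly 25 ps and 62.5 ps.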
The frequency multiplier (see Figure 3) consists of several high-speed exclusive-OR (XOR) gates which serve to double the frequency of the input signals. These gates are used to generate two signals which are twice the frequency of the generator and are 90° out of phase. The quadrature inputs to each XOR are taken directly from the outputs of each element in the frequency generator. The output signals from both XOR devices are then fed into a third XOR, which generates a signal that is four times the frequency of the generator. The parasitic capacitance of these lines is very important and may have harmful effects which are manifested later in the system. Because the frequency-doubling effect of the XORs is best achieved when the input signals are exactly 90° out of phase, the parasitic capacitances of the input lines must be balanced as closely as possible. In addition, the capacitance of each line in the differential signal pair must also be closely matched to its counterpart to maintain the integrity of the signal. SPICE simulations with extracted capacitance values indicate that the input signals to the 4X XOR are approximately 87.5° out of phase.
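The sensitivity to phase balance can be illustrated with an idealized model: two 50% duty-cycle square waves are XORed, and the output edge rate and duty cycle are measured. This is only a sketch. It assumes ideal square waves with zero rise time and no gate delay, so it says nothing about the actual HBT circuit; the function name and the sampling resolution are arbitrary choices.

```python
def xor_doubler(offset_deg: float, steps_per_period: int = 3600, periods: int = 4):
    """XOR an ideal square wave with a copy of itself delayed by offset_deg.

    Returns (output rising edges per input period, output duty cycle).
    Time is counted in integer steps so the result is free of rounding noise.
    """
    half = steps_per_period // 2
    offset = round(offset_deg * steps_per_period / 360.0)

    def square(step: int) -> int:
        # High for the first half of every period.
        return 1 if (step % steps_per_period) < half else 0

    total = steps_per_period * periods
    out = [square(k) ^ square(k - offset) for k in range(total)]

    # Count rising edges, treating the record as periodic (index -1 wraps around).
    rising = sum(1 for k in range(total) if out[k - 1] == 0 and out[k] == 1)
    duty = sum(out) / total
    return rising / periods, duty


ratio_90, duty_90 = xor_doubler(90.0)
ratio_875, duty_875 = xor_doubler(87.5)

print(f"90.0 deg: {ratio_90:.0f}x frequency, duty cycle {duty_90:.3f}")
print(f"87.5 deg: {ratio_875:.0f}x frequency, duty cycle {duty_875:.3f}")
```

In this model the output frequency is doubled for either offset, but the output duty cycle equals the phase offset divided by 180°. An 87.5° offset therefore gives a duty cycle of about 48.6% instead of 50%, and that asymmetry is what the following XOR stage receives at its inputs.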
The frequency divider (Figure 4) consists of three high-speed toggle flip-flops, each of which may divide the signal by a factor of 2, resulting in frequencies that are 1/2, 1/4 and 1/8 of that of the input. The divider circuit also contains a high-speed multiplexer which selects between the source frequency and the lower (divided) frequencies. To compensate for the additional parasitic capacitance incurred by the inputs to the divider, a high-gain buffer has been inserted into the high-speed path to drive the additional load. To further reduce unwanted parasitic capacitance and resistance, the output lines from the differential amplifier have been made 8 µm wide with 8.5 µm spacing between them. These lines travel approximately 500 µm between the amplifier and the chip pads.
4.2 Refinement Considerations
The most trouble experienced during the design and simulation of the VCO was related to the parasitic capacitance of interconnect within the system. This resulted in problems such as output-loading of subcells and unbalanced signal propagation and amplitude, thereby degrading the output of the system. As a consequence, much care was taken to ensure that the capacitive loading of cell connections was acceptable in terms of the resulting signal characteristics. When necessary, high-powered drivers were inserted into the system to compensate for the interconnect parasitics. In some cells, the driver and/or receiver circuits were modified in order to compensate for the loading. One of the most troublesome cells has been the multiplexer. This cell is critical to the operation of the system because the high-speed signal (i.e. the 20 GHz signal) has to pass through at least two instances of the cell. In addition, extensive SPICE simulations have shown that feedthrough of lower-frequency signals in the multiplexers can be a problem and may result in unwanted lower-frequency components within the high-speed output signal, resulting in a noisy waveform. Attempts to counter feedthrough by redesigning the multiplexer have proved troublesome, hence additional buffers were designed and included. However, these new cells are unsuitable for high-speed signal paths and thus are used only on low-frequency signals (< 10 GHz). The refinement of the multiplexer and output drivers is still continuing. A picture of the high-speed VCO chip layout is shown in Figure 5.
5.0 Capacitance Extraction
Device performance, logic style, and interconnect capacitances determine digital circuit performance. While interconnect capacitance has a smaller effect on bipolar circuit performance than in CMOS, it is not negligible. Circuit performance can even be dominated by interconnect delays in large digital circuits since the current levels need to be kept low to keep the power dissipation at manageable levels. Accurate device models for SPICE and accurate interconnect capacitances are essential for the design of high performance digital systems. If both device and interconnect characteristics can be accurately predicted, the available power can be distributed optimally.
The on chip interconnect parasitics can still be approximated by a capacitance at 10 GHz since even long wires (4-8 mm) are shorter than a wavelength:
and the interconnect resistances (25 Ohms/mm for metal 1, 6.7 Ohms/mm for metal 2) are low compared to the output impedance of the driving gate (175-425 Ohms) depending upon the power level of the CML gate.
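The wavelength comparison can be sketched numerically with the standard relation λ = c / (f · sqrt(ε_eff)). The effective relative permittivities used below are assumptions, not values from this report: about 3.4 for a field confined mostly to polyimide and about 12.9 for a field confined mostly to the GaAs substrate. They are chosen only to bracket the plausible range.

```python
import math

C_MM_PER_S = 2.998e11  # speed of light in vacuum, mm/s


def wavelength_mm(freq_hz: float, eps_eff: float) -> float:
    """Wavelength of a signal at freq_hz in a medium of effective permittivity eps_eff."""
    return C_MM_PER_S / (freq_hz * math.sqrt(eps_eff))


# Assumed bracketing values, not taken from the report.
ASSUMED_EPS_EFF = {
    "polyimide-dominated (assumed 3.4)": 3.4,
    "GaAs-dominated (assumed 12.9)": 12.9,
}

FREQ_HZ = 10e9
LONGEST_WIRE_MM = 8.0  # upper end of the 4-8 mm range quoted above

for label, eps in ASSUMED_EPS_EFF.items():
    lam = wavelength_mm(FREQ_HZ, eps)
    print(f"{label}: wavelength = {lam:.1f} mm, "
          f"longest wire = {LONGEST_WIRE_MM / lam:.2f} wavelengths")
```

With these assumed permittivities the wavelength at 10 GHz falls between roughly 8 mm and 16 mm, so an 8 mm wire is shorter than one wavelength in both cases, though only marginally so at the GaAs-dominated end.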
Interconnect capacitance extraction for GaAs ICs is more challenging than for Si circuits. The silicon substrate is lightly doped and acts as a ground plane, at least at low frequencies. The capacitance of interconnects is therefore dominated by the capacitance to the substrate, and fast 2D capacitance extraction methods tend to yield good results. However, the GaAs substrate is semi-insulating and thus the ground plane is far away. If the backside of the GaAs IC is metalized, the ground plane is 75-175 µm below the interconnect layers. While the semi-insulating GaAs substrate provides lower capacitances, it also increases coupling between adjacent wires. GaAs interconnect capacitance is dominated by coupling to nearby nodes and 3D fringing effects. The 2D capacitance extraction provided by our CAD tools, mainly targeted for CMOS circuit design, is not accurate enough for our purposes. Indeed, some of the 3D capacitance results indicate that the 2D extraction results can be off by as much as a factor of two in dense layouts! Unfortunately, there are currently no tools available that can perform a full 3D extraction of a full GaAs chip. However, finite element methods combined with tiling and a novel capacitance extraction method developed by Prof. Le Coz at RPI might provide this capability soon.
Our group has collaborated with Professor Yannick Le Coz at RPI and Dr. Ralph Iverson from Random Logic Corporation to make their 3D QuickCap program suitable for IC capacitance extraction by providing sample circuits for analysis. Random Logic Corporation plans to commercialize 3D QuickCap through Cadence. In addition, our group has received the OEA tool set through our collaboration with Rockwell, Cadence, and OEA on the HSCD contract. The OEA tool set includes the 2D and 3D capacitance extraction tool, METAL. The two tools use totally different approaches for 3D capacitance extraction:
Method used by 3D METAL
The 3D geometry is volumetrically meshed and the resulting mesh equations are solved by a fast solver. The mesh equations, and hence the capacitance values, depend upon the meshing parameters. The finer the mesh, the lower the final capacitance values. The capacitance values tend to change by as much as 10% if the meshing parameters are refined. While the solver is astonishingly fast, the circuit complexity that can be analyzed is limited by the mesh size. Using adaptive localized meshing could help reduce the mesh size and the run time.
Method used by 3D QuickCap
Each electrode for which all coupling capacitances need to be calculated is automatically enclosed by a "Gaussian" surface. Random walks are generated from this closed surface. If a random walk hits another electrode the coupling capacitance between this electrode and the one enclosed by the "Gaussian" surface gets a capacitance contribution. The computationally expensive part is the generation of the random walks which requires finding the largest cube that can be drawn from the current point on the random walk to the nearest electrode or dielectric interface. The capacitance values are based upon the random walk statistics (how many walks start at the surface enclosing electrode i and terminated at electrode j with a certain reward) and the uncertainty in a capacitance value is given by:
The uncertainty can be reduced using the fact that for each capacitance between two nodes, two independent sets of statistics are available (i,j),(j,i).
To double the accuracy of an extraction, four times more random walks need to be executed. The tool reports the standard deviation in percent for each capacitance. The advantages of this novel method are:
- No meshing is required, resulting in low memory requirements.
- Each random walk is independent of the others, thus the random walk method lends itself very well to partitioning for execution on massively parallel machines.
- A large IC can be partitioned into tiles that can be analyzed independently (each tile and its nearest neighbors are analyzed) to reduce memory requirements.
- A unit cell can easily be repeated in the X or Y direction or in both. This has been important for the analysis of memory cells and address decoder circuits. Note that only the geometry, and not the electrode potentials, is repeated.
- The method requires no floating point hardware.
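The statement that doubling the accuracy takes four times as many walks is the usual 1/sqrt(N) behavior of Monte Carlo estimates. The sketch below illustrates it with a deliberately simplified model, which is an assumption and not QuickCap's actual estimator: each walk scores 1 if it terminates on electrode j and 0 otherwise, and the weighting ("reward") of walks is ignored. The helper names are hypothetical.

```python
import math


def relative_std(hit_probability: float, n_walks: int) -> float:
    """Relative standard deviation of a hit-fraction estimate from n independent walks."""
    p = hit_probability
    return math.sqrt((1.0 - p) / (p * n_walks))


def runtime_for_accuracy(known_runtime_h: float, known_rel_std: float,
                         target_rel_std: float) -> float:
    """Run time scales with the number of walks, i.e. with 1 / (relative std)^2."""
    return known_runtime_h * (known_rel_std / target_rel_std) ** 2


def combine_independent(sigma_ij: float, sigma_ji: float) -> float:
    """Standard deviation after inverse-variance weighting of two independent estimates."""
    return 1.0 / math.sqrt(1.0 / sigma_ij ** 2 + 1.0 / sigma_ji ** 2)


base = relative_std(0.2, 1000)
quadrupled = relative_std(0.2, 4000)
print(f"1000 walks: {base:.3f}, 4000 walks: {quadrupled:.3f}")

# A 10% result in 1 hour implies about 100 hours for 1% on the same machine.
print(f"hours for 1%: {runtime_for_accuracy(1.0, 0.10, 0.01):.0f}")

# Two equally uncertain, independent estimates (i,j) and (j,i) combine to sigma / sqrt(2).
print(f"combined sigma: {combine_independent(0.10, 0.10):.4f}")
```

The same scaling is consistent with the one hour versus 100 hour figures quoted for the 10% and 1% results in Section 5.1.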
5.1 Comparison of Methods and Tools for IC Backannotation
Both tools generate the 3D geometry based upon the 2D CIF or GDSII mask level description and a process or technology description file. METAL currently makes a planar assumption (all metal 1 is at the same level over the substrate, all metal 2 ...). Thus vias turn into pillars. QuickCap can handle non planar interconnect structures. Both tools use basically rectilinear geometries. QuickCap can also handle 45 degree geometries and conformal coatings but at a run time cost.
The complexity of the circuits that can be analyzed by METAL is currently limited by the mesh size. Analyzing a crossing of two differential wire pairs (100 µm long) required 55 Megabytes of memory for the mesh and 1.5 hours on a SPARC10/30 to solve the equations. QuickCap can currently analyze a RAM cell in an infinite array on a Mac II using 1 Megabyte of memory for program plus geometry data and provide a 10% result in one hour. A 1% result would, however, require 100 hours on that platform! The program should nevertheless be able to generate 1% results in a few hours on a workstation. For digital ICs an accuracy of 1% for the total node capacitance would be sufficient since the process variations are larger than this. Both tools have difficulty with the thin 0.45 µm silicon nitride layer in the Rockwell HBT process.
OEA's METAL tool is a commercial tool in a suite of tools for capacitance, resistance and inductance extraction. QuickCap is only in the alpha stage. There might be some future collaboration between OEA and RLC which might lead to a commercial product in a short time.
Both companies are working on tiling to analyze large circuits. OEA is working on a product called Net_An that will be able to extract capacitances of selected nodes in large circuits. However, it is not clear to us whether tiling can be done as cleanly as for the Random Walk Method since the boundary conditions on each tile must be known in order to solve the mesh equations.
Currently the large mesh sizes prohibit the analysis of cells with the complexity of an HBT standard cell with OEA's METAL tool, whereas RLC's QuickCap program can analyze standard cells even while running on a slow platform with only 8 MB of memory. The largest cell analyzed so far is the ALU carry chain cell shown in Figure 6. The analysis took 8 hours and 2 MB of memory. The tool extracted 32 primary nodes with an uncertainty below 10% or 0.5 fF. However, neither tool can currently analyze a full GaAs IC, which is our ultimate goal. The tiling feature and hierarchy will help reduce the complexity and allow us to analyze large circuits in small sections. We will continue to collaborate with both companies in order to get close to our goal: accurate 3D capacitance values for a full GaAs chip.
Based upon the available 3D capacitance extraction results the capacitances can be up to a factor of two higher in dense circuits than predicted by the 2D extraction tool in VTITOOLs. We have therefore lowered interconnect capacitances by taking advantage of the third layer of metal that is now available in the Rockwell HBT process. The following changes were made:
- Datapath and Register file Test Chip: Upgrade Register file
- All standard cell chips: Reduce Capacitance of Standard Cell Feedthroughs using M3
Further, we used metal 3 for double strapping the power rails on existing chips in order to lower power rail voltage drops.
Changing metal layers turned out to be very time consuming since changes in an existing chip tend to have ripple effects and all modified chips need a DRC followed by a layout verification. On analog chips the capacitance has been reduced by spacing differential wires further apart in areas where no other wires are nearby.
The register file memory cell and address decoder have been modified in order to reduce capacitances. The most critical parasitic capacitances for memory performance are the bitline capacitances. The bitlines are long and are shared by all 32 memory cells in a column. The bitline current is 1.6 mA and hence the bitline sensitivity to capacitance is high. In the original memory cell shown in Figure 7, the top and bottom wordlines are in metal 2 and the bitlines are in metal 1. In the new memory cell the top wordline is in metal 3. The design rules for metal 3 are more restrictive than for metal 1 or metal 2. The minimal width is 4 µm and the minimal M3-M2 via is 7 µm by 7 µm. The top wordline is much wider than the bottom wordline since it carries the hold currents and, if the row is selected, it also carries 8 times the bitline current. The voltage drops on this wordline must be small to get a reliable and fast memory. The bottom word line only carries the low hold current for the 8 cells on a memory row.
The 3D capacitances for a unit cell in an infinite array have been extracted with QuickCap. The tool reports the following capacitance values per unit cell (Table IV):
The largest contribution to the bitline capacitances comes from the coupling capacitance to the top wordline (2.45 fF or 26%). The capacitance to ground is small (0.8 fF or 9%). A GaAs wafer lapped down to a thickness of 75 µm is assumed. The worst case effective bitline capacitance (lumped to ground) is 2.46 fF higher. This increase is the coupling capacitance to the bitlines in adjacent columns, since these lines can have opposite signal transitions, in which case the Miller effect increases the effective capacitance.
| Node | Capacitance |
|---|---|
| bitline right | 9.05 fF |
| bitline left | 9.51 fF |
| wordline top | 14.00 fF |
| wordline bottom | 8.42 fF |
In the new memory cell the top wordline is now in metal 3 and thus further away from the bitlines. Further, the spacing between adjacent bitlines has been increased as far as possible to reduce coupling.
| Node | Capacitance |
|---|---|
| bitline right | 8.19 fF |
| bitline left | 8.07 fF |
| wordline top | 11.10 fF |
| wordline bottom | 7.7 fF |
The largest contribution to the bitline capacitances still stems from the coupling capacitance to the top wordline (1.1 fF or 13%). The worst case effective bitline capacitance (lumped to ground) is 1.52 fF higher. The total bitline capacitance of the old memory cell is 16% higher and the worst case effective capacitance is 23% higher. While the effect is small, every picosecond counts in the register file.
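The quoted improvements can be reproduced from the tabulated values. The sketch below assumes that the comparison uses the larger of the two bitline capacitances in each cell. That choice is an inference, made because it matches the 16% and 23% figures; it is not stated in the text.

```python
# Extracted bitline capacitances per unit cell, in fF (from the two tables above).
OLD_CELL_FF = {"bitline right": 9.05, "bitline left": 9.51}
NEW_CELL_FF = {"bitline right": 8.19, "bitline left": 8.07}

# Additional worst case (Miller) coupling to bitlines in adjacent columns, in fF.
OLD_MILLER_FF = 2.46
NEW_MILLER_FF = 1.52


def percent_higher(old: float, new: float) -> float:
    """How much larger 'old' is than 'new', in percent."""
    return 100.0 * (old / new - 1.0)


# Assumption: compare the worse (larger) bitline of each cell.
old_worst = max(OLD_CELL_FF.values())
new_worst = max(NEW_CELL_FF.values())

total_pct = percent_higher(old_worst, new_worst)
effective_pct = percent_higher(old_worst + OLD_MILLER_FF, new_worst + NEW_MILLER_FF)

print(f"total bitline capacitance, old vs new: {total_pct:.1f}% higher")
print(f"worst case effective capacitance, old vs new: {effective_pct:.1f}% higher")
```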
6.0 Low Power Cache Memory Chip
In the existing cache memory chip a slightly lower power version of the register file memory block is used and repeated 10 times. Two of the memory blocks are redundant blocks. Only eight of the 10 blocks need to work in order to get a working memory part. Multiplexers are used to configure the memory such that 8 fully working blocks can be selected. This yield enhancing strategy has several drawbacks. The redundant memory blocks increase power because the unused parts cannot be powered down. Further, additional power is dissipated in the configuration logic. In addition, the redundant blocks and configuration logic together with the additional routing for the configuration logic increases the chip area.
The major concerns are yield and power, since 16 of these memory chips are required for the F-RISC/G instruction and data caches. The cache memory chip must also be fully testable before and after MCM insertion. However, a full boundary scan design is not feasible because each driver and receiver with boundary scan logic requires about 40 devices, and the memory chip needs 64 drivers and 64 receivers for the wide interface to the secondary cache memory. Thus the question is: how can we implement a low power cache chip with a sufficiently low device count to get comparable yield without redundancy but with full testability? We can leverage the third metal layer that is now available in the Rockwell HBT process, together with the new capacitance extraction capabilities. The latter give us more accurate interconnect capacitances, which is important for a low power design since the interconnect delay sensitivities are larger.
In order to reduce the device count in the cache memory chip we must reduce the device count in the memory part. This can be achieved by increasing the width of the memory blocks from 8 to 16 bits. The address decoder and threshold generator logic of the register file uses 285 devices. Thus if we increase the memory width to 16 bits we save 4 decoders and threshold generators or 1152 devices. However, increasing the memory width increases the voltage drop on the wordlines. Assuming the current loading per unit length is kept constant, the voltage drop increases with the square of the length. Thus we can expect a four times higher voltage drop on the bitlines if we use the current levels and layout of the old cache memory design. However, in the new design we can use metal 3 for the bitlines, which has a 20% lower resistance, and we will have lower bitline currents in a low power design. In addition, metal 3 can be routed over devices, thus we can provide wider bitlines and shrink the memory cell size. We need the 3D extraction capability in order to set the bitline currents to meet our 1 ns cycle time target.
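The square-law dependence follows from integrating the IR drop along a uniformly loaded line. As a sketch, assume the line is fed from one end, has resistance r per unit length, and is loaded with current i per unit length; these modeling choices are assumptions for illustration. The current still flowing at position x is i(L - x), so the drop at the far end is r i L^2 / 2. The code below checks this closed form numerically and evaluates the effect of doubling the length, with and without the 20% lower resistance of metal 3. All quantities are normalized.

```python
def end_drop_closed_form(r_per_len: float, i_per_len: float, length: float) -> float:
    """Far-end voltage drop of a line fed from one end with uniform current loading."""
    return 0.5 * r_per_len * i_per_len * length ** 2


def end_drop_numeric(r_per_len: float, i_per_len: float, length: float,
                     segments: int = 10000) -> float:
    """Same drop obtained by summing I*R over short segments (midpoint rule)."""
    dx = length / segments
    drop = 0.0
    for k in range(segments):
        x = (k + 0.5) * dx
        current_here = i_per_len * (length - x)  # current still flowing past x
        drop += current_here * r_per_len * dx
    return drop


R, I, L = 1.0, 1.0, 1.0  # normalized resistance, current loading and length

base = end_drop_closed_form(R, I, L)
doubled = end_drop_closed_form(R, I, 2.0 * L)
doubled_m3 = end_drop_closed_form(0.8 * R, I, 2.0 * L)  # 20% lower resistance

print(f"double length, same metal: {doubled / base:.1f}x the drop")
print(f"double length, 20% lower resistance: {doubled_m3 / base:.1f}x the drop")
print(f"numeric check: {end_drop_numeric(R, I, L):.6f} vs {base:.6f}")
```

In this model the lower resistance alone brings the factor of four down to 3.2, so the remaining reduction would have to come from the lower currents and wider lines mentioned above.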
Some of the tricks which are to be utilized to allow full testability while minimizing device count and power consumption are based on the special nature of the RAM chip. The chip has two personalities: when communicating with the CPU, each cache chip sends or receives four bits, but when communicating with the level 2 cache, each chip sends and receives 64 bits (Figure 9).
The testing scheme makes use of special boundary scan driver and receiver pads for the four bit data path. The 64 bit wide data path utilizes weakly coupled driver / receiver pairs to allow testing of this path with no additional transistors.
Aside from allowing testing of on-chip circuitry, static testing of the MCM traces is possible. Signals which are to be output on the drivers must be loaded into the register file using the four-bit data path and the die test mode of operation. This is necessary as the scan drivers are incapable of presenting signals for output onto the MCM traces. This compromise allows us full testability while requiring fewer transistors than alternate schemes.
To test the L2 drivers and receivers we read in known data using
the four bit path, then we read it out on the wide path, change
the address, write the wide data back in, and try to read out
the contents using the four bit path or scan mode.
7.0 F-RISC/G Package Requirements
The placement of the chips on the MCM is shown in Figure 10. Chips are spaced 2 mm apart. A wiring pitch of 30 µm was selected to keep crosstalk under control. The longest wires are the daisy chained address lines which go from the instruction and data cache controllers to the L1 instruction and L1 data memory chips respectively. The length of the longest line is close to 9 cm. Wiring layers are made of 15 µm wide and 5 µm thick copper. There are only two wiring planes with Parylene as the interlayer dielectric (εr = 2.67). The lines are in a triplate configuration and have a characteristic impedance of 50 ohms. Major timing constraints, which directed the placement of the chips, are shown in Figure 10.
The MCM wires should be high bandwidth lines (>10 GHz) to support the off-chip driver rise times of the order of 55-60 ps. The line attenuation should be low to provide adequate signal at the receiver end. Rise time degradation in the lines should be kept to a minimum, which requires very smooth lines. The package needs to provide termination resistors, bypass capacitors, and adequate cooling capability.
Therefore, in short, the main requirements for the F-RISC/G package
are as follows:
Low Interconnect Delays - A low dielectric constant
polymer will provide high speed signal propagation on the transmission
lines. F-RISC/G assumes a speed of 0.16 mm/ps (or 6 ps/mm of delay)
for the lines.
High Bandwidth and Low Attenuation - The dielectric
should have a negligible loss in a 10 GHz - 20 GHz range. The
losses are typically dominated by skin-effect and resistive losses.
For F-RISC/G we need to be able to send a 400 mV signal with a
rise time of 60 ps over a 9 cm long line and get at least 150
mV signal at the receiver. This requires that the metal layers
have very smooth surfaces and that no surface layers with a high
resistivity are used to promote adhesion.
Terminators and Bypass Capacitors - The package
must provide 616 terminators for the 50 Ohms transmission lines.
The high number of terminators requires that they be an integral
part of the package. Either discrete bypass capacitors with a
high bandwidth (up to 20 GHz) or an integral bypass capacitor
using high dielectric constant material are required.
Power Dissipation - The package must be able to dissipate
250 W of power while keeping the chip temperature below 100 °C.
The maximum heat flux is below 20 W/cm^2.
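The delay and attenuation requirements above can be turned into a small budget calculation. The sketch uses only figures quoted in this section (relative permittivity 2.67, 6 ps/mm assumed delay, 9 cm longest line, 400 mV launched, 150 mV minimum received). The ideal velocity is c/sqrt(εr), which assumes a lossless line in a homogeneous dielectric; the variable names are illustrative.

```python
import math

C_MM_PER_PS = 0.2998  # speed of light in vacuum, mm/ps

EPS_R_PARYLENE = 2.67          # interlayer dielectric quoted for the MCM
ASSUMED_DELAY_PS_PER_MM = 6.0  # delay figure assumed by F-RISC/G
LONGEST_LINE_MM = 90.0         # longest line, close to 9 cm
V_LAUNCH_MV = 400.0
V_MIN_RECEIVED_MV = 150.0

# Ideal (lossless, homogeneous dielectric) propagation velocity and flight time.
ideal_velocity = C_MM_PER_PS / math.sqrt(EPS_R_PARYLENE)
ideal_delay_ps = LONGEST_LINE_MM / ideal_velocity

# Flight time using the more conservative 6 ps/mm planning figure.
budget_delay_ps = LONGEST_LINE_MM * ASSUMED_DELAY_PS_PER_MM

# Largest tolerable end-to-end loss, in dB and per centimeter of line.
max_loss_db = 20.0 * math.log10(V_LAUNCH_MV / V_MIN_RECEIVED_MV)
max_loss_db_per_cm = max_loss_db / (LONGEST_LINE_MM / 10.0)

print(f"ideal velocity: {ideal_velocity:.3f} mm/ps")
print(f"flight time on longest line: {ideal_delay_ps:.0f} ps ideal, "
      f"{budget_delay_ps:.0f} ps at 6 ps/mm")
print(f"allowed loss: {max_loss_db:.2f} dB total, {max_loss_db_per_cm:.2f} dB/cm")
```

With these numbers the flight time on the longest line alone is roughly half of the 1 ns cycle, and the attenuation budget works out to a little under 1 dB/cm.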
8.0 Reticle Status
The following chip set will go onto the RPI reticle scheduled for fabrication by May 94.
- Instruction Decoder
- Datapath Slice
- Modified Version of RPI Test Chip with improved register file
- High Speed VCO Test Chip
- Boundary Scan and Standard Cell Test Chip
- Deskew Test Chip
- Active and Passive Test Structures
The chips for the RPI reticle have been modified to lower capacitances and to reduce voltage drops on the power rails by taking advantage of the new metal 3 layer now available in the Rockwell HBT process. Fabrication has been held back momentarily because the recent 3D capacitance extraction results showed that the capacitances from our current 2D extraction tools can be off by a factor of two in dense layouts.
RPI has just become an alpha site for the 3D QuickCap program ported to a workstation. This gives us the opportunity to extract 3D capacitances in dense wiring areas and adjust the drive capability of gates that have a higher capacitive load than predicted by our 2D extraction tools. Typically only the resistors in a few key gates need to be adjusted in order to significantly improve performance, but we need to have an accurate 3D capacitance extraction tool to locate these gates. Increasing the power level everywhere is not a solution, since it increases power dissipation beyond manageable levels and also increases the voltage drops on power rails.
Rockwell intends to use some of our test chips on their HSCD reticle together with active and passive test structures. This is important since seven fabrication runs are scheduled. This will provide us with chips from many different fabrication runs and allow us to continuously monitor HBT yield and performance.