DARPA Fall 1993 Semiannual

F-RISC/G and Beyond -- Subnanosecond Fast RISC for TeraOPS Parallel Processing Applications

ARPA Contract Numbers DAAL03-90G-0187,

DAAH04-93G-0477,

[AASERT Award DAAL03-92G-0307 for Cache Memory]

Semi-Annual Technical Report

October 1994 - April 1995

Prof. John F. McDonald

Center for Integrated Electronics

Rensselaer Polytechnic Institute

Troy, New York 12180

(518)-276-2919
FAX (518)-276-8761
MACinFAX (518)-276-4882
e-mail: mcdonald@unix.cie.rpi.edu

http://www.cieem.rpi.edu/frisc/s95.html

Abstract

The F-RISC (Fast Reduced Instruction Set Computer) project has as its goal the exploration of the upper speed envelope for the throughput capability of one computational node through the use of advanced HBT technology. F-RISC/G involves development of a one nanosecond cycle time computer using 50 GHz GaAs/AlGaAs Heterojunction Bipolar Transistors (HBT). In the past contract period the primary activities consisted of a revision of the cache memory chip, and some alterations of previously "completed" architecture chips. These revisions and alterations became necessary because previously existing CAD tools for extracting GaAs IC wiring capacitance were found to be insufficiently accurate, owing to poor coverage of the effects of varying nearby conductor geometries. Additional alterations in the ALU and instruction decoder chip became desirable as a result of measurements and errors found through the use of the APTIX/FPGA emulator. This emulator has provided the opportunity to execute numerous verification programs written in F-RISC/G assembly language. Finally, as a result of a companion HSCD subcontract to Rockwell, four RPI test chips have been fabricated. These chips allow us to probe the speed and yield of the 50 GHz baseline HBT process. Improved yield has been confirmed, and ADC information indicates that 2500 HBT circuits will yield at about 50%, which is sufficient for 8K HBT circuits of the F-RISC/G architecture to yield at 11%. However, our testing showed that heavily loaded circuits perform at only 50% of predicted speed. Lightly loaded circuits perform at 66% of the expected speeds. Rockwell has obtained evidence that their Polyimide interlayer dielectrics are too thin. This still leaves an indication that the HBT devices themselves are exhibiting an f_T of only 33 GHz rather than 50 GHz.
All four of the RPI F-RISC/G architecture chips have been re-engineered several times to optimize them and correct minor errors while the issues of speed and yield have been analyzed. The yield now appears satisfactory, but the speeds of the devices and interconnect are still inadequate. We are working with Rockwell to identify sources of these problems. Thus far, Rockwell has been extremely helpful when the problem source could be clearly identified.
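
The yield figures quoted above can be related by a simple scaling argument. The sketch below assumes that defects strike devices independently, so that yield falls off exponentially with device count; this model and the function name `scaled_yield` are our illustrative assumptions, not necessarily the method used to obtain the quoted numbers.

```python
def scaled_yield(ref_yield, ref_count, target_count):
    """Yield of a larger circuit, ASSUMING independent defects per device.

    Under that assumption yield scales as ref_yield ** (target_count / ref_count).
    """
    return ref_yield ** (target_count / ref_count)


# 50% yield at 2500 HBTs, extrapolated to an 8K HBT chip (8000 devices assumed).
y_8k = scaled_yield(0.50, 2500, 8000)
print(f"Estimated yield at 8000 HBTs: {y_8k:.1%}")
```

Under this assumption the extrapolated value comes out at about 11%, in line with the figure quoted in the abstract.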

Project Goals

o Exploration of the Fundamental Limits of High-Speed Architectures.

o Study of GaAs HBT for Fast Reduced Instruction Set Computer Design including Adequacy of Yield, and Device Performance for High Performance Computing Applications.

o Research in the Architectural Impact of Advanced MultiChip Module (and 3D) Packaging in the GHz Range.

o Examination of Pipelining across Pad Driver Boundaries as a Means for Reducing Adverse Effect of Partitioning due to Yield Limitations in Advanced Technologies.

o Investigation of Power Management in High Power Technologies.

o Study of Appropriate Memory Organization and Management for Extremely Fast RISC Engines using Yield Limited Technologies.

o Use of Adaptive Clock Distribution Circuits for Skew Compensation at High Frequencies.

o Exploration of Superscalar and VLIW organizations in the sub-nanosecond cycle regime.

o Exploration of a combination of HBT and MESFET technologies for lower power, higher yield, but fast cache memory [AASERT Program].

o Exploration of novel new HBT technology such as HBT's in the SiGe and InP materials systems.

Introduction

The F-RISC/G (Fast Reduced Instruction Set Computer - version G) project has as its goal the development of a one nanosecond cycle time computer using GaAs/AlGaAs Heterojunction Bipolar Transistor Technology. More generally the project seeks to explore the generic question of how one can achieve with bipolar circuits higher clock rates than expected from silicon based CMOS alone. Traditionally CMOS has achieved its increasing clock rates from lithography improvements, shortening the lengths of devices and interconnections to achieve higher speed. Bipolar devices, on the other hand, have achieved their speed by reducing the thickness of various device layers, with more recent improvements coming from band gap engineering in heterostructures, and schemes to include built-in acceleration fields in the base transit region (graded base techniques). Since the thicknesses of various layers in semiconductor processing can be minimized with proper yield engineering, short device transit times (or high transit time frequency) in principle favor the bipolar device. This is often quantified, at least for analog applications, by the closely related unity current gain frequency, f_T.

(1)

The fact that the intrinsic base region for the fastest device is so very thin tends to make the base resistance R_B high. Also the large area of the buried collector under the entire base tends to make the base-collector capacitance C_BC an important consideration. Device optimization can clearly benefit from lateral shrinking of the device, which lowers the area of the base-collector interface and shortens the path through the intrinsic base resistance.

Another parameter for quantifying the performance of the HBT is its maximum oscillation frequency or unity power gain frequency in the common emitter configuration:

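f_{max} = \sqrt{ f_T / (8 \pi R_B C_{BC}) }    (2)

where R_B is the base resistance and C_BC is the base-collector capacitance.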

These two canonic frequencies are often quoted in the literature for analog circuit performance, and they yield some figure of merit for how fast the basic device will switch. One must use caution, however, as these parameters often do not give a clear picture about the speeds of logic circuits which utilize these devices. In addition to these "intrinsic" HBT parameters logic circuits also depend on the burden of wiring capacitance and circuit input loading feasible with the technology. Nevertheless, the f_max parameter is considered a better indicator of logic circuit performance than f_T.

It is possible to achieve clock frequencies in digital circuits close to 25% of f_T in small circuits such as frequency counters and serial to parallel converters. Hence, some 12 GHz serial to parallel converters have been fabricated in 50 GHz f_T technology using full differential logic circuits. Unloaded gate delays often approach . Hence the unloaded fast gate delay in the same 50 GHz technology is around 18 ps. Clearly circuits will not work any faster than this, since in any practical circuit wiring delays will reduce the switching speeds. Of course, the circuits achieving these upper limits in speed also have high power dissipation.

Interestingly, f_max can be larger or smaller than f_T. It can be larger if the R_B C_BC product is small enough. The base resistance R_B is also an important parameter in circuit applications when power must be controlled at high frequencies, and lowering this parameter is the subject of several recent articles. The Rockwell HBT has several unique features which focus on lowering this resistance. Base resistance has two components, one located in the active base region (intrinsic) and one accounting for the base extension out to the base contact. The Rockwell HBT therefore uses a thicker layer for the extrinsic base, and the distance between the emitter and the base contact is reduced to a minimum.

In this contract period a 20 GHz "challenge" VCO test circuit has been designed and fabricated under companion HSCD BAA funding through a subcontract to Rockwell. This circuit achieves its high utilization of the available 50 GHz device capability through clever use of analog frequency doubler and quadrupler circuits. This process is fairly nonlinear and consequently a fair amount of subharmonic generation arises. These subharmonics are troublesome if the low pass characteristics of resistor capacitor interconnect networks suppress the higher frequencies.

SiGe and InP HBT Technologies

The GaAs/AlGaAs material system is not the only one where HBT's may be fabricated. In recent years HBT's with exciting characteristics have been fabricated in the SiGe materials system at IBM. For several years TI, Hughes, and ATT have been working in the InP/InGaAs/AlGaAs alloy system, and some of the fastest HBT devices reported to date have been made in this system. Interestingly, successful complementary MESFET and HEMT devices have been reported in the same systems, although not necessarily simultaneously. However, in the case of SiGe this exact combination of HBT and CMOS has been realized and is being modeled and characterized at IBM at 0.35 µm x 1.0 µm emitter sizes where lower static power dissipation is required to obtain optimal device speeds. In addition to the lower currents characteristic of scaled systems, the V_BE0 is only 0.7 V, or half that of the GaAs/AlGaAs system (1.4 V). In addition, the "safe" emitter current density without dopant redistribution appears to be much higher in the SiGe system. This means that the transistor scaling can be much more complete since larger transistors will not be needed for higher currents. The IBM SiGe system offers a 50 GHz f_T and a breakdown voltage of 3.5 V. HBT integration levels have been demonstrated at greater than 10,000 HBT's, and recently IBM has signed a joint agreement with Analog Devices to create 1 GHz D/A converters which demonstrate 6K HBT yield levels for commercial purposes. There have even been hints of much larger yields. Yields of 10K HBT's would, as we have already stated, make the integration of 16-bit slices feasible, eliminating two pad driver-receiver delays from the most critical ALU path. Yields of 20K HBT's would permit complete integration of the 32-bit datapath.

A preliminary probe of the IBM SiGe HBT process at East Fishkill has been funded as a part of this contract, and permission appears to have been secured to remake the RPI test chip previously fabricated in the Rockwell line in the IBM line. It is expected that the yield will be much higher than with the GaAs/AlGaAs process. Our group is awaiting a set of models and characterizations to be completed by IBM. One of our students will have to travel to IBM to work with proprietary design rules to establish whether this direction is as promising as it appears to be.

By comparison InP is at the opposite end of the yield scale, with HBT counts closer to 10-200, and for that reason it might be considered premature to examine it for serious digital circuit implementation. However, it was only a decade ago that the yields for GaAs/AlGaAs HBT circuits were similarly low. The speeds of the InP HBT are even higher than for the GaAs/AlGaAs HBT's which are currently available. For example TI describes an HBT in the InP system which has an f_T of 200 GHz, and this device is still not submicron in size, which suggests modest digital circuits could be made with unloaded gate delays of as little as 4.5 ps today! As with SiGe, the InP system has a V_BE0 of only about 0.7 V, so power dissipation is reduced. Even these small chips could be useful as tester chips for the GaAs/AlGaAs or SiGe HBT circuits since they are so fast. However, they also create an opportunity to gain circuit design experience in preparation for a time when yields for even these delicate circuits may look attractive. TI has offered us an opportunity to make some small test circuits in this foundry and we will attempt to do so if the time is available given the other contract deliverables.
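
The two unloaded gate delay figures quoted in this report (18 ps in the 50 GHz GaAs/AlGaAs technology and 4.5 ps for the 200 GHz InP device) can be checked against each other. The sketch below simply forms the product of delay and f_T for each case; the dictionary layout and names are illustrative assumptions, and the resulting constant is an observation drawn from these two data points rather than a formula stated in the report.

```python
# (f_T in Hz, quoted unloaded gate delay in seconds)
cases = {
    "GaAs/AlGaAs, 50 GHz": (50e9, 18e-12),
    "InP, 200 GHz": (200e9, 4.5e-12),
}

# Dimensionless product delay * f_T for each technology.
products = {name: f_t * delay for name, (f_t, delay) in cases.items()}

for name, product in products.items():
    print(f"{name}: delay * f_T = {product:.2f}")
```

Both cases give the same product, which suggests that the two quoted delays were obtained by scaling inversely with f_T.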

A 20 GHz Voltage Controlled Oscillator

The effect of wire loading in the GaAs/AlGaAs HBT system is becoming more evident as we gain more experience. In particular, despite the use of Polyimide as a dielectric with a low dielectric constant and Au for the interconnects, it is evident from the previous section that extreme care must be exercised in connecting the devices. Layout can be almost as important as it is in CMOS circuits to wrest the utmost speed from the HBT. This is evidenced not only by the delays imposed by the increases in rise time and fall time, but also by the decrease in the bandwidth of signals in the system. When pressing the upper limits possible with the HBT one must be aware of the nonlinearities of the HBT devices, which can generate harmonics and subharmonics. To probe this aspect of the technology we have attempted the design of a "challenge" circuit, targeted to operate at 40% of the f_T for the 50 GHz HBT process.

A high speed voltage-controlled oscillator (VCO) has been developed which can generate differential signals in the range of 1-20 GHz. VCOs will become an integral part of future computers for providing clocks for digital and other synchronous circuits. This VCO consists of a frequency generator, a frequency multiplier and a frequency divider, along with various high-speed buffers, multiplexers and drivers. The chip contains 452 transistors and dissipates 2.60 W at 20 GHz. The frequency of oscillation is controlled by an external bias voltage. There is also a 12-stage ring oscillator which is included as a means for determining the baseline speed for the fabricated transistors. This project has enabled us to gain more experience with the Rockwell 50 GHz process as well as aspects of high frequency design using the devices provided. This design has also directly benefited the FRISC project through the improvement of the high-speed register file design and the datapath chip.

System Design

The high-speed VCO consists of a frequency generator, a frequency multiplier and a frequency divider (see Figure 3.1). A base frequency is produced by the frequency generator and is controllable by adjusting the bias voltage input to the system. PSPICE simulations have demonstrated that this base frequency ranges from 2-5 GHz. The base frequency can be multiplied by a factor of 2 or 4 by the frequency multiplier. Simulations have shown the frequency multiplier working up to 20 GHz. The frequency divider is capable of dividing the frequency by factors of 2, 4 or 8. The divider is capable of operating on signals up to 20 GHz.
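
The frequency plan described above can be tabulated with a short sketch. It assumes that the multiplier and divider can each be bypassed (the x1 and /1 settings) and that the divider acts on the multiplied signal; both are our reading of the block diagram rather than facts stated in the text, and all names are illustrative.

```python
BASE_RANGE_GHZ = (2.0, 5.0)  # simulated base frequency range quoted above
MULTIPLIERS = (1, 2, 4)      # x1 ASSUMES the multiplier can be bypassed
DIVIDERS = (1, 2, 4, 8)      # /1 ASSUMES the divider can be bypassed


def output_ranges():
    """Map each (multiplier, divider) setting to its (low, high) output range in GHz."""
    lo, hi = BASE_RANGE_GHZ
    return {(m, d): (lo * m / d, hi * m / d) for m in MULTIPLIERS for d in DIVIDERS}


ranges = output_ranges()
top = max(hi for _, hi in ranges.values())

for (m, d), (lo, hi) in sorted(ranges.items()):
    print(f"x{m} /{d}: {lo:.2f} to {hi:.2f} GHz")
print(f"Highest reachable frequency: {top:.1f} GHz")
```

With the 4X multiplier and no division, the 5 GHz upper base frequency gives the 20 GHz target.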

The frequency generator (Figure 3.2) includes four delay elements connected in a ring with an inversion placed in the differential feedback path between the last and first elements. The frequency range of the generator is from 1 to 5 GHz and is controlled by an externally applied bias voltage. The range of the applied voltage is from -1 to +1 volt. Also included in the delay elements are high-gain buffers which are capable of driving the long lines between the generator core and the VCO multiplexers. In order to attain the highest speed possible, the generator was implemented and placed first during the design process. By arranging the delay elements to fit within a square (see Figure 3.4), interconnect length (and thus parasitic capacitance) was minimized and the effect of other circuits upon the core (e.g. wiring crossovers, etc.) was reduced. Experience in designing this module will assist the FRISC project in designing future VCOs which may be used to provide on-chip clocking signals.

Figure 3.1. High Frequency VCO Block Diagram.

Figure 3.2. VCO Frequency Generator and Multiplier.

The frequency multiplier (also in Figure 3.2) consists of several high-speed exclusive-OR gates which serve to double the frequency of the input signals. Two XORs are used to generate two signals which are twice the frequency of the generator core and which are 90° out of phase. The quadrature inputs to each XOR are taken directly from the outputs of each element in the frequency generator. The output signals from both devices are then fed into a third XOR which generates a signal that is four times the frequency of the generator. The parasitic capacitance of these lines is very important and has harmful effects which are manifested later in the system. Because the frequency-doubling effect of the XORs is best achieved when the input signals are exactly 90° out of phase, the parasitic capacitance of the input lines must be balanced as closely as possible. In addition, the capacitance of each line in the differential signal pair must be closely matched to its counterpart. The design and layout of the XOR subcells was done in such a manner as to achieve nearly identical internal capacitance values. The placement of the XORs in the layout was also chosen to attain the goal of balanced capacitance. PSPICE simulations with extracted capacitance values indicate that the input signals to the 4X XOR are approximately 87.5 degrees out of phase. A high-gain buffer has been placed between the high-speed XOR and the first multiplexor in order to reduce the loading on the output lines.
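
The sensitivity of the XOR doubler to phase error can be illustrated with idealized waveforms. The sketch below models the inputs as ideal 50% duty cycle square waves, which is a simplifying assumption (the real signals have finite rise times and are differential); the function names are illustrative.

```python
def square(cycles):
    """Ideal 50% duty square wave; `cycles` is time measured in input periods."""
    return 1 if (cycles % 1.0) < 0.5 else 0


def xor_doubler(phase_deg, samples=7200):
    """XOR two copies of a square wave offset by phase_deg over one input period.

    Returns (rising edges per input period, duty cycle of the XOR output).
    """
    shift = phase_deg / 360.0
    out = [square(i / samples) ^ square(i / samples - shift) for i in range(samples)]
    # out[i - 1] wraps around at i = 0, which treats the waveform as periodic.
    rising = sum(1 for i in range(samples) if out[i] and not out[i - 1])
    return rising, sum(out) / samples


for phase in (90.0, 87.5):
    edges, duty = xor_doubler(phase)
    print(f"{phase:5.1f} deg: {edges} rising edges per period, duty cycle {duty:.3f}")
```

In this idealized model the output has two rising edges per input period (twice the input frequency) in both cases, but the duty cycle equals the phase offset divided by 180 degrees, so the 87.5 degree case departs slightly from the balanced 50% waveform.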

The frequency divider (Figure 3.3) consists of three high-speed toggle flip-flops, each of which divides its input frequency by a factor of 2, resulting in frequencies that are 1/2, 1/4 and 1/8 of that of the input. The divider circuit also contains a high-speed multiplexor which selects between the source frequency and the lower (divided) frequencies. Because of the decision to minimize capacitance on the critical high-speed path through the system, the divider circuit has been placed outside of this path. To compensate for the additional parasitic capacitance incurred by the inputs to the divider, a high-gain buffer has been inserted into the high-speed path to drive the additional load. To further reduce unwanted parasitic capacitance and resistance, the output lines from the differential amplifier have been sized at 8 µm apiece with 8.5 µm spacing between them. These lines travel approximately 500 µm between the amplifier and the chip pads.

Refinement Considerations

The most time consuming task experienced during the design and simulation of the VCO was related to the parasitic capacitance of interconnect within the system. This resulted in problems such as output-loading of subcells and unbalanced signal propagation and amplitude, thereby degrading the output of the system. As a consequence, much care was taken to ensure that the capacitive loading of cell connections was acceptable in terms of the resulting signal characteristics. When necessary, high-powered drivers were inserted into the system to compensate for the interconnect parasitics. In some cells, the driver and/or receiver circuits were modified in order to compensate for the loading. One of the most troublesome cells was the multiplexor. These cells are critical to the operation of the system because the high-speed (i.e. 20 GHz) signal has to pass through at least two instances of the cell. In addition, extensive PSPICE simulations have shown that feedthrough of lower-frequency signals in the multiplexers can be a problem and may result in unwanted lower-frequency components in the high-speed output signal, resulting in a noisy waveform. Attempts to counter feedthrough by reducing the low-speed signal pull-up resistors within the multiplexor did not produce adequate results, hence the design of special buffers-with-enable was begun. While these new cells did help the feedthrough problem somewhat, some signals remained troublesome. As a consequence, special balanced-capacitance buffers-with-enable were designed. The outputs of these cells have equally balanced capacitance, which results in a further reduction in leakage through the buffer. However, due to their increased capacitance values, they are unsuitable for high-speed signal paths and thus are used only on low-frequency signals (< 10 GHz). A picture of the high-speed VCO layout is shown in Figure 3.4.

Figure 3.3. VCO Frequency Divider.

Figure 3.4. High Speed VCO Layout.

Test Result from the High-Speed Voltage-Controlled Oscillator (VCO)

The high-speed voltage-controlled oscillator (VCO) was fabricated by Rockwell along with the other chips on the high-speed CAD design reticle supplied by the Rensselaer F-RISC group. A wafer was received at Rensselaer in March of 1995 at which point testing was commenced. The VCO is a "challenge" chip to test the limits of both the design tools and the fabrication process. The upper limit was determined to be 20 GHz through the use of PSPICE simulations. In addition to the VCO circuitry, a 12-stage ring oscillator was also included in the VCO layout in order to provide an internal process monitor and gauge the raw speed of the transistors. The measured results indicate that the delay per ring oscillator stage is 20 ps instead of the expected 15 ps based on SPICE simulations with the 50 GHz device model. The layout of this ring oscillator (shown in Figure 3.3.5) differs from others in the reticle in that the transistors are spaced at least 6 µm apart, thereby reducing the possibility of inter-transistor interference or cross-talk. The ring oscillator current switches operate at 2 mA, the maximum device current which also yields optimal f_T.
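
The per-stage delays quoted above translate into ring oscillator frequencies through the usual relation for an inverting ring, f = 1 / (2 N t_d). The sketch below assumes that every one of the 12 stages contributes the same delay and that no additional buffer delay sits in the loop; the function name is illustrative.

```python
def ring_frequency_hz(stages, stage_delay_s):
    """Oscillation frequency of an inverting ring: one period is two trips around the ring."""
    return 1.0 / (2 * stages * stage_delay_s)


STAGES = 12
measured = ring_frequency_hz(STAGES, 20e-12)  # 20 ps per stage (measured)
expected = ring_frequency_hz(STAGES, 15e-12)  # 15 ps per stage (50 GHz model)

print(f"20 ps/stage: {measured / 1e9:.2f} GHz")
print(f"15 ps/stage: {expected / 1e9:.2f} GHz")
print(f"speed ratio: {measured / expected:.2f}")
```

Under these assumptions the ring runs at roughly three quarters of the frequency predicted by the 50 GHz device model.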

Figure 3.3.5 Layout of VCO Ring Oscillator.

Initial test data from the VCO indicate that the circuit is functionally working, but not at the expected performance. The oscillation frequencies generated by the VCO and the ring oscillator do not agree with PSPICE simulations using the baseline 50 GHz device models. Taken in conjunction with other measurements, this appears to be due to increased parasitic capacitance and reduced device performance. In order to investigate the effect of each potential source of error, simulations have been performed with increased parasitic capacitance and/or reduced device performance. Simulation results are shown below in Table 3.1. The measured VCO signal yielded a frequency of 1.1 GHz. Comparing the measured results with the simulated results indicates that the best fit occurs when the devices are derated by 33% and the parasitic capacitances are increased by a factor of 1.45. A picture of the high-speed 4X VCO core signal operating at 13.66 GHz is shown in Figure 3.3.6. This is the highest VCO frequency ever measured by the Rensselaer F-RISC research team on a circuit fabricated with the Rockwell 50 GHz HBT baseline process.

Figure 3.3.6 Measured VCO Frequency (4X core frequency).

Device	Capacitance Values	Maximum Frequency (GHz)	Ring Oscillator (GHz)
50 GHz	All set to 0	N/A	2.86
50 GHz	Extracted values	19.8	2.8
33 GHz	All set to 0	16.3	2.0
33 GHz	Extracted values	14.5	1.98
33 GHz	1.45 x Extracted	13.75	1.96

Table 3.1 - Simulated VCO Results.

VCO Core (control = -1.46V)	2X VCO Core (control = -1.46V)	4X VCO Core (control = -1.46V)	Maximum 4X VCO Frequency	Ring Oscillator
1.67 GHz	3.33 GHz	6.67 GHz	13.66 GHz	2.04 GHz

Table 3.2 - Measured VCO Results.
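
The comparison between Tables 3.1 and 3.2 can be made explicit with a small sketch. The misfit measure used here (sum of squared relative errors in the maximum 4X frequency and the ring oscillator frequency) is our illustrative choice, not necessarily the criterion used in the original analysis, and the first row of Table 3.1 is omitted because it has no maximum frequency entry.

```python
# Measured values from Table 3.2 (GHz).
MEASURED_MAX_GHZ = 13.66
MEASURED_RING_GHZ = 2.04

# (label, simulated maximum frequency, simulated ring oscillator frequency) from Table 3.1.
simulated = [
    ("50 GHz, extracted C", 19.8, 2.8),
    ("33 GHz, C = 0", 16.3, 2.0),
    ("33 GHz, extracted C", 14.5, 1.98),
    ("33 GHz, 1.45 x extracted C", 13.75, 1.96),
]


def misfit(max_ghz, ring_ghz):
    """Sum of squared relative errors against the measured values (illustrative metric)."""
    e_max = (max_ghz - MEASURED_MAX_GHZ) / MEASURED_MAX_GHZ
    e_ring = (ring_ghz - MEASURED_RING_GHZ) / MEASURED_RING_GHZ
    return e_max ** 2 + e_ring ** 2


best = min(simulated, key=lambda row: misfit(row[1], row[2]))

for label, max_ghz, ring_ghz in simulated:
    print(f"{label:28s} misfit = {misfit(max_ghz, ring_ghz):.4f}")
print("Best fit:", best[0])
```

With this metric the 33 GHz device model with 1.45 times the extracted capacitance gives the smallest misfit.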

While it appears that reduced device performance and increased capacitance explain the discrepancy between simulation and measurement, there is insufficient evidence from the VCO alone to justify this claim. Further measurements on the VCO must be obtained over a range of control voltages and settings in order to determine the existence of a strong correlation. The shapes of the waveforms observed to date do display strong agreement with their simulated counterparts, thereby increasing the credibility of the simulations.

Despite reduced operational speed, the functional capabilities of the VCO seem to be nearly complete. To date, the core VCO has been observed along with frequency-multiplied signals at 2X and 4X the core frequency. A frequency-divided signal at ¹/₂ the frequency has also been observed.

Carry Chain Delays

The HSCD reticle contains a modified version of the RPI test chip that was first fabricated in 1993. The chip contains, among other test circuits, an 8-bit carry chain, since the ALU is on one of the most critical delay paths of F-RISC. The 32-bit F-RISC datapath is implemented with four 8-bit slices. The carry chain test circuit is equivalent to the carry propagation circuitry in an 8-bit slice. The layout of the carry chain circuit is shown in Figure 4.1.

Figure 4.1. Layout of Carry Chain Circuit on RPI Test Chip.

The carry chain circuit can be set into oscillation along a long path:

operand B -> 8-bit carry propagate chain -> MUX2 -> AND2 gate -> operand B

or a short path:

carry-in receiver -> AND2 -> MUX2 -> carry-out receiver -> carry-in receiver

The AND2 gates in the long and short oscillation paths are used to select either the long or short path and to stop the oscillation. These AND gates are not needed in the actual carry chain.

We have measured the carry chain delays using a Tektronix sampling scope. Table 4.1 shows a comparison of typical measured carry chain delays with the simulated delays. The data are from wafer 6 of the first HSCD wafer lot. The measured delays are closer to the delays obtained by SPICE simulations using a device model with an f_T of 33 GHz, extracted from an earlier run, than to the expected delays based on the 50 GHz HBT device model from the design manual. Ring oscillator measurements on the same wafer also indicate that the devices on HSCD wafer 6 are slower than expected.

Table 4.1. Measured and Simulated Carry Chain Delays.

Carry Chain Oscillation Path	Measured delays [ps], Wafer 6 Die (0,0), T=25C	SPICE [ps], 33 GHz HBT device model ²	SPICE [ps], 50 GHz HBT device model ¹	Modified circuit [ps], 50 GHz HBT device model ⁴
short chain T_short	258	231 (131)³	181 (93)³	131
long chain T_long	510	501 (362)³	361 (225)³	301

¹ HBT model from Rockwell's design manual (based on wafer 2 S-parameter measurements).

² HBT model extracted from S-parameters from a wafer run in 1993.

³ Intrinsic delay without interconnect capacitances.

⁴ Modified carry chain circuit using higher power levels and improved buffers & receivers; power increased from 306 mW to 403 mW.

The circuit was backannotated using QuickCap, a full 3D capacitance extractor, since the carry chain is on one of the most critical delay paths. The GaAs substrate is semi-insulating and therefore the ground plane (backside metallization) is at least a substrate thickness (625 µm, 75-175 µm for lapped wafers) away from the interconnect layers. The interconnect capacitances are therefore dominated by coupling to nearby conductors, and not by the capacitance to ground as in Si circuits. Hence, a 3D capacitance extraction is necessary to get accurate delay estimates for high speed GaAs circuits. The large distance between the ground plane and the interconnect layers causes problems with finite-element methods because the whole GaAs substrate needs to be meshed. Random-walk methods, such as the one used in QuickCap, can handle this situation much more readily. The run time increases only slightly, because when the ground plane is far away a fraction of the walks take more random hops before terminating on a conductor or the ground plane. Analyzing GaAs interconnects is also more difficult than analyzing Si interconnects because several dielectric layers must be included in the analysis (GaAs, SiO₂, SiN, Polyimide, air).

The 3D geometry is generated from a 2D mask-level description of the circuit and a technology file that describes how each mask is used to grow or etch material during processing. In order to simplify the problem for analysis on a single workstation, a planar assumption is made: all metal 1 is at the same distance from the substrate, and so on. This reduces the geometric interactions between the layers and reduces the number of 3D structures needed to represent the IC geometry. Comparison of planar and non-planar models of standard cells has shown that the planar assumption changes the capacitance values by only a few percent. Currently, 3D extraction of large circuits is very time consuming, since a hierarchical SPICE netlist must be used for running and optimizing large circuits. However, the 3D capacitance extractor tools at our disposal do not support hierarchical extractions and are not integrated in our main CAD tool suite. Thus several time consuming manual steps are required for preparing layouts for extraction and for feeding the extraction results back into the simulation. We have made the developers aware of this problem and urged them to support hierarchical capacitance extractions. Simulation results with and without interconnect capacitance show that the interconnect delays are quite significant. Almost 50% of the delays on the short path are due to interconnect delays.
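
The interconnect share of each simulated delay can be read off Table 4.1 by comparing the total delay with the intrinsic delay given in parentheses. The sketch below does this arithmetic; the dictionary layout and function name are illustrative.

```python
# (total delay, intrinsic delay without interconnect capacitance) in ps, from Table 4.1.
delays_ps = {
    ("33 GHz", "short"): (231, 131),
    ("33 GHz", "long"): (501, 362),
    ("50 GHz", "short"): (181, 93),
    ("50 GHz", "long"): (361, 225),
}


def interconnect_share(total_ps, intrinsic_ps):
    """Fraction of the simulated delay attributable to interconnect capacitance."""
    return (total_ps - intrinsic_ps) / total_ps


shares = {key: interconnect_share(*value) for key, value in delays_ps.items()}

for (model, path), share in shares.items():
    print(f"{model} model, {path} path: {share:.0%} interconnect")
```

For the 50 GHz model the short path comes out at about 49%, which corresponds to the "almost 50%" figure quoted above.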

32-bit ALU Delays

The measured carry chain delays on wafer 6 are matched quite well by SPICE simulations with the 33 GHz device model. SPICE predicts delays that are 10% (short path) and 2% (long path) too low. This can be expected, since large parallel plate measurements indicated that the interconnect capacitances are 1.45 times higher than expected due to a problem with the thickness of the dielectric layers. The 50 GHz model in the design manual was derived from measurements taken from wafer 2 of the same lot. Thus we have two device models that represent the devices on two wafers from the same lot. We can, therefore, see the impact of device speed variations on the critical ALU path which could determine the cycle time of F-RISC. The following gate delays have been estimated from SPICE simulations:

Table 2. Gate, Driver, and Receiver Delays.

Circuit	Symbol	33 GHz devices [ps]	50 GHz devices [ps]
Driver	T_DR	87	68
Receiver	T_REC	55	42
AND2	T_AND	55¹, 60²	37¹, 38²
XOR2	T_XOR	45	36
MSLATCH	T_MSL	60	45
SBUF2	T_SBUF	45	35
MUX4	T_MUX	50	37
DLMUX2	T_DLM	45	35

¹ AND2 gate on short oscillation path

² AND2 gate on long oscillation path

The 32-bit F-RISC datapath is partitioned into four 8-bit slices. The worst case ALU delay therefore includes three chip crossings. The worst case operation is a 32-bit addition with operand B equal to $FFFF'FFFF and operand A equal to $0000'0001. This operation will cause carry propagation from the least significant bit to the most significant bit. In the worst case the ALU operation is followed by a feed forward of the ALU result to operand B for the next ALU operation. The worst case delays on the ALU path with the 8-bit carry select adder are shown in Table 4.3.
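
The worst case operand pattern can be illustrated with a behavioral sketch of a sliced adder. It models only the arithmetic of four 8-bit slices with the carry passed from slice to slice; it says nothing about the carry select circuitry or timing, and the function name is illustrative.

```python
def sliced_add(a, b, slices=4, width=8):
    """Add two operands slice by slice, recording the carry out of each slice."""
    mask = (1 << width) - 1
    carry = 0
    result = 0
    carries = []
    for k in range(slices):
        slice_a = (a >> (k * width)) & mask
        slice_b = (b >> (k * width)) & mask
        total = slice_a + slice_b + carry
        result |= (total & mask) << (k * width)
        carry = total >> width  # carry handed to the next slice (a chip crossing)
        carries.append(carry)
    return result, carries


result, carries = sliced_add(0x00000001, 0xFFFFFFFF)
print(f"result = ${result:08X}, carry out of each slice = {carries}")
```

With these operands every slice produces a carry, so the carry generated in the least significant slice must cross all three chip boundaries before the most significant slice can settle.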

Clearly the 1 ns target cannot be achieved with 33 GHz devices. Even with 50 GHz devices the delays on the ALU path are too long to support a 1 ns cycle time. The delays had to be reduced to allow for 25 ps of skew and provide some slack time. An analysis of the interconnect delays shows that most of the interconnect delay penalties are due to the long connections from the standard cell core to the periphery with the I/O pads. The interconnect capacitances on the carry-in and carry-out signals between the driver/receiver circuits are quite high (450 fF, 250 fF). The short path includes two of these heavily loaded connections from the I/O pad ring to the standard cell core, and SPICE simulations indicate that 49% of the short path delay is due to interconnect delays.

Special cells with a very high drive capability are required to reduce the delays on these highly loaded signals since the gates on the carry chain are already high power gates. Simply increasing the current levels everywhere to the maximum supported by the smallest HBT device (2 mA) is not a viable option because of power dissipation problems in GaAs. The drive capability can be improved by using emitter followers or by using a push-pull super-buffer.

Table 4.3. Estimated 32-bit ALU Delays.

Path	33 GHz devices [ps]	50 GHz devices [ps]	modified circuit with 50 GHz devices [ps]
Slice 1: Operand B -> Carry_Chain -> MUX2 -> Driver; Delay = T_long - T_AND + T_DR	537	391	331
Chip to Chip MCM Delay (10 mm, er = 2.72)	55	55	55
Slice 2: Carry_in Receiver -> MUX2 -> Driver; Delay = T_short - T_AND	203	144	94
Chip to Chip MCM Delay (10 mm, er = 2.72)	55	55	55
Slice 3: Carry_in Receiver -> MUX2 -> Driver; Delay = T_short - T_AND	203	144	94
Chip to Chip MCM Delay (10 mm, er = 2.72)	55	55	55
Slice 4: rec -> SBUF2 -> MUX2 -> XOR2 -> MSLATCH; Delay = T_short - T_AND - T_DR + T_SBUF + T_XOR	206	147	97
Feed forward: MSLATCH -> MUX4 -> DLMUX2; Delay = T_MUX + T_DLM	95	72	72
Total	1409	1063	853
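The totals in Table 4.3 can be checked by summing the stage delays along the critical path. The following is a minimal sketch that copies the per-stage figures from the table and compares each total against the 1 ns cycle minus the 25 ps skew allowance mentioned above. The names `STAGES`, `totals`, and `slack` are illustrative and do not come from the F-RISC tools.

```python
# Per-stage delays in ps, copied from Table 4.3.
# Tuple order: 33 GHz devices, 50 GHz devices, modified circuit with 50 GHz devices.
STAGES = [
    ("Slice 1", (537, 391, 331)),
    ("MCM crossing 1", (55, 55, 55)),
    ("Slice 2", (203, 144, 94)),
    ("MCM crossing 2", (55, 55, 55)),
    ("Slice 3", (203, 144, 94)),
    ("MCM crossing 3", (55, 55, 55)),
    ("Slice 4", (206, 147, 97)),
    ("Feed forward", (95, 72, 72)),
]
CASES = ("33 GHz", "50 GHz", "modified 50 GHz")

CYCLE_PS = 1000  # 1 ns target cycle time
SKEW_PS = 25     # skew allowance stated in the text


def totals(stages=STAGES):
    """Sum the stage delays column by column."""
    return tuple(sum(delays[i] for _, delays in stages) for i in range(len(CASES)))


def slack(total_ps, cycle_ps=CYCLE_PS, skew_ps=SKEW_PS):
    """Time left in the cycle after the path delay and the skew allowance."""
    return cycle_ps - skew_ps - total_ps


if __name__ == "__main__":
    for name, total in zip(CASES, totals()):
        print(f"{name:>16}: total {total} ps, slack {slack(total)} ps")
```

Under these assumptions only the modified circuit with 50 GHz devices leaves positive slack within a 1 ns cycle.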

Since we use three levels of current switches, the outputs of a gate can have three different offset levels (0, -VBE0, -2*VBE0, where VBE0 = 1.4 V). The different levels are necessary to prevent saturation of the current switches, which are stacked up to three levels deep in our current tree logic. Figure 4.2 shows the different buffer circuit configurations.

Figure 4.2. CML, ECL, and Super Buffers.

Figure 4.3. Delay versus Interconnect Loading for HBT Buffer circuits at 25 C.
bi1o1: CML buffer, input at level 1, output at level 1, power = 3 mA * 5.2 V
bi2o2: ECL buffer, input at level 2, output at level 2, power = 5 mA * 5.2 V
bi3o3: ECL buffer, input at level 3, output at level 3, power = 5 mA * 5.2 V
si2o2: Super-buffer, input at level 2, output at level 2, power = 4 mA * 5.2 V

The drive capability of the different buffer circuits is shown in Figure 4.3. The super-buffer clearly has the best drive capability while dissipating less power than a level 2 differential ECL buffer with an emitter follower current of 2 mA. However, the input to the super-buffer must be at the same level as the output or at a lower level. Emitter followers can be added to any current tree/gate and introduce only a small intrinsic delay. The input of a CML or ECL buffer can be at any level; only the long-tail resistor that determines the tree current needs to be adjusted. The CML gate has the lowest power dissipation and the lowest device count, which are the main reasons for preferentially using CML in our designs. However, the CML buffer and the ECL buffer with a level 3 output are more sensitive to interconnect loading than the super-buffer or the level 2 ECL buffer. Hence, signals with high interconnect loads must be at offset level 2, driven either by ECL gates or by a super-buffer. These circuits can be used only on critical paths because of the increase in power and device count.

Completion of Cache Design

The F-RISC/G system is illustrated in Figure 5.1. Instructions supplied by the instruction cache are decoded by the instruction decoder, which sends the operand and destination addresses and control information to the datapath. The data cache is used only by Load and Store instructions.

The level 1 (L1) cache comprises the primary instruction and data caches. Each cache consists of a single cache controller chip and eight 2-kbit RAM chips. The controllers for the two caches are identical, differing only in the settings of some control signals. This approach was necessary in order to minimize fabrication costs, and minimizing the penalty for this restriction represented a significant percentage of the design effort. The cache controller handles all handshaking with the secondary cache and the CPU, and sets the control lines of the RAMs as appropriate. The cache controller and cache RAM designs have been completed.

Figure 5.1. F-RISC / G System

Each RAM chip is configured to store 32 rows of 64 bits and is single-ported. One unique feature of these chips is that they have two distinct "personalities." Each RAM may read or write data four bits at a time using the DIN and DOUT buses, so each 64-bit row of memory may be filled one nibble at a time. A separate 64-bit bi-directional bus (L2BUS) allows reading or writing of an entire row at once. This feature helps to reduce cache miss penalties.
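The two access modes can be illustrated with a small behavioral model. This is only a sketch: the class name `CacheRam`, the method names, and the nibble ordering (nibble 0 in the least significant bits of a row) are assumptions made for illustration and are not taken from the chip design.

```python
ROWS = 32          # rows per RAM chip
ROW_BITS = 64      # bits per row (wide path, L2BUS)
NIBBLE_BITS = 4    # bits per access on the DIN/DOUT path
NIBBLES_PER_ROW = ROW_BITS // NIBBLE_BITS  # 16 nibbles fill one row


class CacheRam:
    """Behavioral model of one 2-kbit RAM chip with nibble and row access."""

    def __init__(self):
        self.rows = [0] * ROWS

    def write_nibble(self, row, nibble, value):
        # Assumed ordering: nibble 0 occupies the least significant 4 bits.
        shift = nibble * NIBBLE_BITS
        mask = 0xF << shift
        self.rows[row] = (self.rows[row] & ~mask) | ((value & 0xF) << shift)

    def read_nibble(self, row, nibble):
        return (self.rows[row] >> (nibble * NIBBLE_BITS)) & 0xF

    def write_row(self, row, value):
        # Wide access: the whole 64-bit row is replaced at once.
        self.rows[row] = value & ((1 << ROW_BITS) - 1)

    def read_row(self, row):
        return self.rows[row]


if __name__ == "__main__":
    ram = CacheRam()
    for i in range(NIBBLES_PER_ROW):
        ram.write_nibble(3, i, i)
    print(hex(ram.read_row(3)))
```

In this model a row written through the wide path can be read back nibble by nibble, and vice versa, which is the property that lets a line fill use the wide bus while the CPU uses the narrow one.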

The key functional components of the cache controller chip are the tag RAM, a three stage pipeline with integrated counter, and a comparator. The organization and interconnection of these functional structures is illustrated in Figure 5.2. The chip additionally includes circuitry to supply appropriate control signals to the major functional units and circuitry which provides at-speed testing capability of wafers or dies. The cache controller was designed for dual use in both the instruction and data caches. For this reason, the first pipeline latch serves also as the Remote Program Counter (RPC) in the ICC configuration.

Figure 5.2. Simplified cache controller block diagram

Two data paths shown on the block diagram are critical and thus were carefully optimized. The first is the 9-bit path from the ABUS, through the master of pipeline latch 1, and out to the cache RAMs. The second critical path is the MISS generating circuitry. This path requires reading an address from the ABUS, addressing and reading the tag RAM, and sending the result into the comparator.
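The MISS path described above (read address, look up the tag RAM, compare) can be sketched as follows. The address split is an assumption for illustration only: a 32-bit address with 4 offset bits and 5 row-index bits, chosen because 4 + 5 matches the 9-bit path to the cache RAMs and 32 rows per RAM. The report does not state the tag width or that the cache is direct-mapped, and the names `split_address` and `TagRam` are hypothetical.

```python
OFFSET_BITS = 4  # assumed: position within a row
INDEX_BITS = 5   # 32 rows per RAM chip


def split_address(addr):
    """Split an address into (tag, index, offset) under the assumed layout."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class TagRam:
    """Assumed direct-mapped tag store: one tag per row index."""

    def __init__(self):
        self.tags = [None] * (1 << INDEX_BITS)

    def is_miss(self, addr):
        # Comparator step: the stored tag must equal the address tag.
        tag, index, _ = split_address(addr)
        return self.tags[index] != tag

    def fill(self, addr):
        # Record the tag after the line has been fetched.
        tag, index, _ = split_address(addr)
        self.tags[index] = tag


if __name__ == "__main__":
    tag_ram = TagRam()
    print(tag_ram.is_miss(0x1234))  # empty cache, so this access misses
    tag_ram.fill(0x1234)
    print(tag_ram.is_miss(0x1234))  # same line is now present
```

The sketch shows why the tag RAM read and the comparison sit in series on the critical path: the MISS signal cannot be produced until both have completed.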

As the F-RISC/G prototype is partitioned, interchip communication becomes an important issue. Large fractions of the cycle time are consumed by communication between chips. Each off-chip communication entails a driver and receiver delay (I/O delay) as well as an MCM time of flight delay. Rise time delays and skew must be considered as well.

Figure 5.3. Data cache communications

Figure 5.3 illustrates the communication that occurs with the primary data cache. The primary cache communicates with the secondary cache, the datapath, and the instruction decoder. Figure 5.4 shows that the primary instruction cache also communicates with all of the core CPU chips as well as the secondary instruction cache.

Figure 5.4. Instruction cache communications

Table 5.4 lists the line lengths and associated delays for communications between the CPU and the primary cache. The line length figures are based on work performed by Atul Garg as part of his doctoral research. In order to determine these line length figures, the entire MCM was routed by hand.

The delay figures are based on a dielectric with e_r=2.67, which translates to a time of flight on the MCM of 5.44 ps/mm. An additional 50 ps per line was allowed for rise time degradation and slack.

Table 5.4. MCM net lengths - CPU / cache signals

Signal	MCM Length (mm)	Delay (ps)
ABUS	39	250
WDC	<39	<250
STALLM	26	190
ACKI	17	140
ACKD	25.5	190
VDA	<39	<250
BRANCH	<39	<250
DATAOUT (upper path)	22	170
DATAOUT (lower path)	27	200
MISSI	18	150
MISSD	25	185
INSTRUCTION	25	190
DATAIN (upper path)	22	170
DATAIN (lower path)	28	200
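The delay estimate described before the table (time of flight plus a 50 ps allowance) can be sketched as follows. The sketch assumes ideal propagation at c / sqrt(e_r); the function names are illustrative. The table entries appear to be rounded, and not every entry follows this estimate exactly, so the output should be read as an approximation rather than a reproduction of the table.

```python
import math

C_MM_PER_PS = 0.299792458  # speed of light in vacuum, in mm/ps


def time_of_flight_ps_per_mm(e_r):
    """Propagation delay per mm for a line in a dielectric with constant e_r."""
    return math.sqrt(e_r) / C_MM_PER_PS


def line_delay_ps(length_mm, e_r=2.67, allowance_ps=50.0):
    """Time of flight plus the fixed allowance for rise time degradation and slack."""
    return length_mm * time_of_flight_ps_per_mm(e_r) + allowance_ps


if __name__ == "__main__":
    print(f"{time_of_flight_ps_per_mm(2.67):.2f} ps/mm")
    for signal, length_mm in [("STALLM", 26), ("ACKI", 17), ("MISSI", 18)]:
        print(f"{signal}: {line_delay_ps(length_mm):.0f} ps")
```

The same function with e_r = 2.72 and no allowance gives roughly 55 ps for a 10 mm line, in line with the chip-to-chip MCM delay used in Table 4.3.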

Secondary Cache Communications

Table 5.5 enumerates the signals used for communication between the primary and secondary caches.

Table 5.5. Secondary cache communications

Signal	Width	From	To	Description
L2ADDR	28	CC	L2	28-bit line address.
L2DONE	1	L2	CC	Indicates that the L2 has completed a transaction. Any data L2 places on the bus must be valid when this is asserted.
L2DIRTY	1	CC	L2	Indicates that the L2 will be receiving an address to be written into.
L2MISS	1	CC	L2	Indicates that the address on L2ADDR is needed by the CPU.
L2CNTRL	2	CC	L2	Cache flush and init information.
L2SYNCH	1	CC	L2	A 1 GHz clock used for synchronizing with L2.
L2VDA	1	CC	L2	The address currently on L2ADDR is valid.

Since little is known about the eventual design of the secondary caches (even the device technology is not certain), as much freedom as possible was given to the designer of the secondary cache while still assuring that the "usual case," the Load hit, is optimized. As a result of this uncertainty and the fact that the secondary caches do not share the synchronized clock used by the primary caches and core CPU, the timing requirements of the secondary caches are very specific. The specification of the handshaking between the L1 and L2 caches is now complete.

As part of the design of the cache controller chips, complete system timing diagrams, incorporating MCM delays and operations within each chip, were created. From these diagrams it was confirmed that at-speed system operation is possible using MCMs with dielectrics in the e_r=2.67 range. Some representative timing diagrams follow.

Figure 5.5. Data Cache Timing - Clean Loads

Figure 5.6. Data Cache Timing - Load Copyback

Figure 5.7. Instruction Cache Miss Timing

Figure 5.8. Data Cache During Instruction Cache Stall

By replacing the carry-in receiver with a super-buffer circuit and adding emitter followers to the carry output multiplexer gate, the ALU delay is reduced from 1063 ps to 853 ps, while the power dissipation of the carry chain increases only from 306 mW to 403 mW.

Figure 5.9. Cache Controller Artwork

Cache RAM Summary

The purpose of the cache RAM chip is to provide high speed memory for the primary caches of F-RISC. This memory part is capable of storing 2 kbits of data. The data can be accessed either 4 bits at a time or 64 bits at a time. Each type of access is accomplished using different data input and output pads, allowing concurrent data access without the use of a single shared data bus. The 4-bit data path is known as the fast path because of the high speed of the memory accesses along this path, while the 64-bit data path is known as the wide path. The large size of the bus connecting the L1 and L2 caches, 512 bits, helps to reduce cache miss penalties. The cache RAM must be capable of either storing or retrieving data on the fast path once every processor cycle. Since the F-RISC processor has a 1 nanosecond cycle time, system timing constraints require that the pipelined cache RAM have a read access time of 750 ps in order to provide instructions and data to the processor in a timely fashion. In addition, the write access time after the application of the write signal on the fast path can be no longer than 750 ps in order to allow data to be stored before it is removed from the data lines by the processor.

No off-the-shelf memory parts are currently available that can access data at the required rates. For this reason, the cache RAM chip was designed using the same Rockwell process utilized in the F-RISC processor chips. This process provides devices with fast enough switching speeds and high enough current handling capabilities to design a memory part that meets the required specifications. The artwork for the cache RAM chip is shown in Figure 5.10.

Capacitance values of the nets on this chip were extracted using two-dimensional and three-dimensional capacitance extraction software. Digital simulations were performed using these capacitance values to ensure the functionality of the control and testing logic as well as to estimate the timing parameters associated with the cache RAM design. SPICE simulations were also performed using the extracted capacitance values on both the digital and analog circuits along the critical paths of this chip to determine critical cache RAM timing parameters. The results of these simulations show that the cache RAM chip will function properly and meet the required design specifications.

Figure 5.10: Cache RAM Chip Artwork

Appendix

High Speed Circuit Design (HSCD) Measurements

HSCD Reticle

Four RPI test chips were submitted to Rockwell for fabrication last July. The layout of the reticle is shown in Figure 11. This reticle contains four RPI chips: a passive test chip, a standard cell test chip, a 20 GHz voltage controlled oscillator (VCO) test chip, and a register file / carry chain test chip. The first wafer (wafer 6 of the HSCD lot) was shipped to RPI in December.

Figure 11: Layout of the RPI-Rockwell Reticle

The mask contains a variety of circuits to determine the basic device performance as a function of power supply voltage, current level, temperature and processing variations. Specifically, the passive test chip contains test structures to measure wiring parasitics on an HBT chip. It also carries ring oscillators and gate delay chains to provide basic delay information as a function of capacitive load and fanout. Other chips contain a number of key circuits used in the main architecture chips. The 20 GHz VCO chip has a high-speed voltage controlled oscillator with several other circuits to test the performance of the process. The register file test chip is an optimized version of the previous test chip fabricated at Rockwell. It also includes the high-speed carry chain macro and associated support circuits. The standard cell test chip contains a number of representative standard cells used in the F-RISC/G chips and tests the implementation of the boundary scan test scheme applied to test the instruction decoder and the datapath chips.

Divider circuits are used to determine flip-flop performance. Several functional circuits are also used including a 2:1 mux, 1:2 demux, 4x4 parallel multiplier and a 7-bit LFSR. These circuits are used to evaluate yield and cell performance in a variety of conditions. Additional test structures were included to measure individual cell and device characteristics.

Passive Test Chip

The layout of the chip is shown in Figure 12. This chip contains both passive test structures and active test structures.

Figure 12: Layout of the Passive Test Chip

The passive structures are meant for measuring wiring parasitics on an AlGaAs/GaAs HBT chip and comparing the measured results with results obtained from CAD tools. The structures are divided into five categories: capacitors, inductors, probe calibration, transmission lines, and resistors.

The active structures are divided into three categories -- coupling, ring oscillators, and device characterization structures. The coupling structures allow measuring the coupling between differentially coupled wires and single-ended wires.

A number of 8-stage ring oscillators are placed on the chip to measure the device performance with respect to the load and the local temperature. These oscillators are made up of standard Q1 and the new round Q1 transistors. The oscillation frequencies of these structures lie in the range of 0.5 GHz - 3.0 GHz. A number of device-characterization structures are also provided close to the ring oscillators to correlate the measurements with the device performance.
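To relate the quoted oscillation frequencies to per-stage gate delay, the textbook ring oscillator relation f = 1 / (2 N t_d) can be used. This is a sketch under the assumption that the relation applies to these 8-stage oscillators. The report does not state how stage delay was extracted from the measurements, and `stage_delay_ps` is an illustrative name.

```python
def stage_delay_ps(freq_ghz, stages=8):
    """Per-stage delay implied by f = 1 / (2 * N * t_d), returned in ps."""
    period_ps = 1000.0 / freq_ghz  # one oscillation period in ps
    return period_ps / (2 * stages)


if __name__ == "__main__":
    for f in (0.5, 3.0):
        print(f"{f} GHz -> {stage_delay_ps(f):.1f} ps per stage")
```

Under this assumption the stated 0.5 GHz - 3.0 GHz range corresponds to loaded stage delays between roughly 125 ps and 21 ps.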

MIM Capacitors Test Results

MIM capacitors are made between the M1 and M2 layers, sandwiching only the nitride layer.

No	Structure Name	Size [µm x µm]	Simulated Cap. [pF]	Measured Cap. [pF]	Change
1	cap4pf	100 x 80	2.08	1.88 +/- 0.16	-9.6 %
2	cap8pf	200 x 160	8.32	7.57 +/- 0.10	-9.0 %

Parallel Plate Capacitors Test Results

The following tables compare the expected and measured parallel plate capacitance figures. Mayo has measured similar parallel plate capacitor structures on one of their wafers. Even though the measurements are from two different wafer lots, both measurements indicate that the capacitance figures are off by 40-50 %. Thus either the dielectric constant or the dielectric thickness is off. After we reported the problem, Rockwell measured the Polyimide thickness on a cross section of an HSCD wafer and informed us that the measured dielectric layer thickness is in the 0.9 - 0.95 µm range instead of the 1.6 µm shown in the design manual.

No	Structure Name	Size [µm x µm]	Simulated Cap. [pF]	Measured Cap. [pF]	Change
1	capm1m2_1	450 x 160	1.09	1.56 +/- 0.01	43 %
2	capm1m2_2	450 x 320	2.18	3.07 +/- 0.02	41 %
3	capm1m2_4	450 x 760	5.18	7.14 +/- 0.18	38 %

Parallel Plate Measurement Results from Mayo (different wafer lot)

No	Structure Type	Size [µm x µm]	Simulated Cap. [pF]	Measured Cap. [pF]	Change
1	M1/M2	250 x 160	0.606	0.858	41 %
2	M2/M3	250 x 160	0.467	0.725	55 %
3	M1/M3	250 x 640	1.055	1.462	39 %
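The "Change" column and the overall capacitance scaling factor can be recomputed from the simulated and measured values. The sketch below uses the RPI parallel plate figures from the table above; the dictionary and function names are illustrative only.

```python
# (simulated pF, measured pF) for the RPI parallel plate structures.
RPI_PARALLEL_PLATE = {
    "capm1m2_1": (1.09, 1.56),
    "capm1m2_2": (2.18, 3.07),
    "capm1m2_4": (5.18, 7.14),
}


def percent_change(simulated, measured):
    """Relative deviation of the measurement from the simulation, in percent."""
    return 100.0 * (measured - simulated) / simulated


def mean_ratio(data):
    """Average measured/simulated ratio over all structures."""
    return sum(meas / sim for sim, meas in data.values()) / len(data)


if __name__ == "__main__":
    for name, (sim, meas) in RPI_PARALLEL_PLATE.items():
        print(f"{name}: {percent_change(sim, meas):+.0f} %")
    print(f"mean measured/simulated ratio: {mean_ratio(RPI_PARALLEL_PLATE):.2f}")
```

The mean ratio for these three structures comes out close to 1.4, which is the factor applied to the interconnect capacitance in the ring oscillator simulations.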

3-D Capacitors

These structures will measure the influence of neighboring conductors, power and ground rails on other layers, and crossovers on the capacitance of a line. The structures have been analyzed using the 2-D/3-D capacitance extraction tools from Random Logic Corporation (QuickCap) and OEA International (Metal).

Ring Oscillator Test Results

Since HBT logic is almost always designed as differential logic, several differential line configurations were included as well. These structures include wires with varying nearby grounded conductors, wires with adjacent differential lines, wires with metal planes on other layers, signal line overcrossings, etc. Because some of the parasitics are difficult to measure directly, some of these structures were incorporated into wire length oscillator circuits. These circuits can be simulated with SPICE using the capacitance values calculated by tools such as Metal by OEA and QuickCap by RLC, and the simulated oscillation frequency can then be compared with the measured one.

Since the structures described above involve some active transistor devices, special probe de-embedding sites are provided to characterize the HBT devices in the same area of the wafer and die at microwave frequencies. There are de-embedded transistors and de-embedded Schottky diodes on the chip.

Figure 13 compares measured and expected ring oscillator delays. The measured delays are from wafer 6 of the first HSCD lot. The simulated delays are based on SPICE simulations with different device models and different interconnect cross sections. The measured results are matched quite well if the capacitance figures are multiplied by a factor of 1.4 to account for the thinner than expected dielectric layers and if a 33 GHz device model is used instead of the 50 GHz device model from the design manual. We have meanwhile made measurements on two additional wafers (wafer 3 and wafer 8). The ring oscillator delays on these wafers are very close to the ones measured on wafer 6. Thus, the device performance on at least three wafers is matched best with a 33 GHz device model.

Figure 13. Measured and Simulated Ring Oscillator Delays.

Summary of Measurement Results

We could verify that the VCO and the modified test circuits are working correctly. On the 2.6 W VCO 'challenge' chip we have measured an oscillation frequency of 13.6 GHz, the fastest HBT circuit we have seen so far. However, our measurements on the four RPI test circuits on the HSCD reticle indicate that at least three wafers have the following two problems.

o The S-parameter measurements of large parallel plate capacitors, which should have a uniform dielectric thickness, show that the interconnect layers are too thin, resulting in a 1.46 times higher capacitance. Measurements at Mayo of similar structures fabricated on another wafer lot indicate that this might not just be a problem on the HSCD run. Rockwell has examined a cross section of an HSCD wafer and reported that the M1-M2 Polyimide thickness is only 0.9 - 0.95 µm instead of 1.6 µm as shown in the design manual. Since our group is designing large digital HBT circuits, interconnect capacitances have a significant impact on performance. Hence, this problem should be resolved before the fabrication of our architecture chip; otherwise the processor will be slowed down by 20%.

o The ring oscillator measurements on the RPI 'passive' HSCD indicate that the devices on the first wafer run are slower than expected. We have seen only very small performance variations (<5%) on the three wafers we obtained from Rockwell. Rockwell has suggested that the devices might be slower than expected because of:

> cooling problems

> resistive or capacitive coupling between adjacent devices

However, preliminary 3D thermal modeling, measurements on a lapped wafer, and measurements on ring oscillators in which the devices are spaced 6 µm further apart do not support these suspicions. One of the wafers was lapped to 7 mils, and some preliminary worst case temperature calculations with QuickCap indicate that the device temperature should be below 50 C with our water cooled chuck. Further, if the device temperature were causing a problem, we should have seen higher ring oscillator frequencies on the lapped dies (7 mil thick) compared to the 25 mil thick wafers. In addition, we can consistently match the measured results within 5-10% by using a 33 GHz device model and increasing the interconnect capacitance by the measured factor of 1.4. We are currently collaborating with Mayo and West Point to get device S-parameter measurements of the wafers and lapped dies. While it is possible that the isolated devices used for S-parameter measurement are performing according to specification, we consider this unlikely. There is sufficient evidence of a second process problem, namely that the devices on at least three wafers have an f_T in the 33-35 GHz range instead of 50 GHz. This problem should be addressed before the architecture reticle run; otherwise the chips will be 30% slower. This is of great concern to us since the HBT fabrication runs are very expensive and we only have sufficient funds for two foundry runs. The combination of the two problems results in a slowdown of heavily loaded circuits by as much as 50%.

Circuit	33 GHz devices [ps]	50 GHz devices [ps]
Driver T_DR	87	68
Receiver T_REC	55	42
AND2 T_AND	55¹,60²	37¹,38²
XOR2 T_XOR	45	36
MSLATCH T_MSL	60	45
SBUF2 T_SBUF	45	35
MUX4 T_MUX	50	37
DLMUX2 T_DLM	45	35
