The F-RISC (Fast Reduced Instruction Set Computer) project has
as its goal the exploration of the upper speed envelope for the
throughput capability of one computational node through the use
of advanced HBT technology. F-RISC/G involves development of a
one nanosecond cycle time computer using 50 GHz
GaAs/AlGaAs Heterojunction Bipolar Transistors (HBT).
In the past contract period the primary activities consisted of
a revision of the cache memory chip, and some alterations of previously
"completed" architecture chips. These revisions and
alterations have been made necessary because it has been discovered
that previously existing CAD tools for extracting GaAs IC wiring
capacitance are not sufficiently accurate due to poor coverage
of the effects of the varying nearby conductor geometries. Additional
alterations in the ALU and instruction decoder chip became desirable
as a result of measurements and errors found through the use of
the APTIX/FPGA emulator. This emulator has provided the opportunity
to execute numerous verification programs written in F-RISC/G
assembly language. Finally, as a result of a companion HSCD subcontract
to Rockwell, four RPI test chips have been fabricated. These chips
allow us to probe the speed and yield of the 50 GHz baseline HBT
process. Improved yield has been confirmed, and ADC information
indicates that 2500 HBT circuits will yield at about 50%, which
is sufficient for 8K HBT circuits of the F-RISC/G architecture
to yield at 11%. However, our testing showed that heavily loaded
circuits perform at only 50% of the predicted speed. Lightly
loaded circuits perform at 66% of the expected speed. Rockwell
has obtained evidence that their Polyimide interlayer dielectrics
are too thin. This still leaves an indication that the HBT devices
themselves are exhibiting an fT of only 33 GHz
rather than 50 GHz. All four of the RPI F-RISC/G architecture
chips have been re-engineered several times to optimize them and
correct minor errors while the issues of speed and yield have
been analyzed. The yield now appears satisfactory, but the speeds
of the devices and interconnect are still inadequate. We are working
with Rockwell to identify sources of these problems. Thus far,
Rockwell has been extremely helpful when the problem source could
be clearly identified.
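The yield figures quoted above can be checked with a short sketch. It assumes (this is not stated in the report) that device failures are independent, so that yield falls off exponentially with device count; the function name `scaled_yield` is hypothetical. Under that assumption a 50% yield at 2500 HBTs scales to roughly 11% at 8K HBTs, which agrees with the quoted value.

```python
def scaled_yield(reference_yield, reference_devices, target_devices):
    """Scale a measured yield to a different device count.

    Assumption (not from the report): failures are independent, so
    yield = reference_yield ** (target_devices / reference_devices).
    """
    return reference_yield ** (target_devices / reference_devices)


if __name__ == "__main__":
    # 50% yield at 2500 HBTs, scaled to the 8K HBTs of an F-RISC/G chip.
    y_8k = scaled_yield(0.50, 2500, 8000)
    print(f"Estimated yield at 8000 HBTs: {y_8k:.1%}")
```

The same function can be used to see how quickly yield drops for the larger integration levels discussed later in the report.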
o Exploration of the Fundamental Limits of High-Speed Architectures.
o Study of GaAs HBT for Fast Reduced Instruction Set Computer Design including Adequacy of Yield, and Device Performance for High Performance Computing Applications.
o Research in the Architectural Impact of Advanced MultiChip Module (and 3D) Packaging in the GHz Range.
o Examination of Pipelining across Pad Driver Boundaries as a Means for Reducing Adverse Effect of Partitioning due to Yield Limitations in Advanced Technologies.
o Investigation of Power Management in High Power Technologies.
o Study of Appropriate Memory Organization and Management for Extremely Fast RISC Engines using Yield Limited Technologies.
o Use of Adaptive Clock Distribution Circuits for Skew Compensation at High Frequencies.
o Exploration of Superscalar and VLIW organizations in the sub-nanosecond cycle regime.
o Exploration of a combination of HBT and MESFET technologies for lower power, higher yield, but fast cache memory [AASERT Program].
o Exploration of novel HBT technology such as HBT's in the
SiGe and InP materials systems.
The F-RISC/G (Fast Reduced Instruction Set Computer - version G) project has as its goal the development of a one nanosecond cycle time computer using GaAs/AlGaAs Heterojunction Bipolar Transistor technology. More generally, the project seeks to explore the generic question of how one can achieve higher clock rates with bipolar circuits than are expected from silicon-based CMOS alone. Traditionally, CMOS has achieved its increasing clock rates from lithography improvements, shortening the lengths of devices and interconnections to achieve higher speed. Bipolar devices, on the other hand, have achieved their speed by reducing the thickness of various device layers, with more recent improvements coming from band gap engineering in heterostructures and from schemes to include built-in acceleration fields in the base transit region (graded base techniques). Since the thicknesses of various layers in semiconductor processing can be minimized with proper yield engineering, short device transit times (or a high transit time frequency) in principle favor the bipolar device. This is often quantified, at least for analog applications, by the closely related unity current gain frequency, fT.
The fact that the intrinsic base region for the fastest device is so very thin tends to make the base resistance rb high. Also, the large area of the buried collector under the entire base tends to make the base-collector capacitance CBC an important consideration. Device optimization can clearly benefit from lateral shrinking of the device, which lowers the area of the base-collector interface and shortens the path through the intrinsic base resistance.
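Another parameter for quantifying the performance of the HBT is its maximum oscillation frequency, or unity power gain frequency, in the common emitter configuration, which is commonly approximated in terms of fT, the base resistance rb and the base-collector capacitance CBC as:
$$f_{max} \approx \sqrt{\frac{f_T}{8 \pi \, r_b \, C_{BC}}}$$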
These two canonic frequencies are often quoted in the literature for analog circuit performance, and they yield some figure of merit for how fast the basic device will switch. One must use caution, however, as these parameters often do not give a clear picture of the speeds of logic circuits which utilize these devices. In addition to these "intrinsic" HBT parameters, logic circuit speed also depends on the burden of wiring capacitance and circuit input loading feasible with the technology. Nevertheless, the fmax parameter is considered a better indicator of logic circuit performance than fT.
It is possible to achieve clock frequencies in digital circuits close to 25% of fT in small circuits such as frequency counters and serial to parallel converters. Hence, some 12 GHz serial to parallel converters have been fabricated in 50 GHz fT technology using full differential logic circuits. Unloaded gate delays often approach 0.9/fT. Hence the unloaded fast gate delay in the same 50 GHz technology is around 18 ps. Clearly circuits will not work any faster than this, since in any practical circuit wiring delays will reduce the switching speeds. Of course, the circuits achieving these upper limits in speed also have high power dissipation.
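The rules of thumb in this paragraph can be written down as a small calculation. The 25% clock fraction is taken from the text; the factor 0.9 in the gate delay is an assumption inferred from the two figures quoted in this report (18 ps at 50 GHz, and 4.5 ps at 200 GHz for the InP device), and the function names are hypothetical.

```python
def max_clock_ghz(ft_ghz, fraction=0.25):
    """Upper clock rate for small digital circuits as a fraction of fT."""
    return fraction * ft_ghz


def unloaded_gate_delay_ps(ft_ghz, k=0.9):
    """Unloaded gate delay in ps, assuming delay = k / fT.

    k = 0.9 is inferred from the delays quoted in the report, not a
    value given explicitly by the authors.
    """
    return k / ft_ghz * 1000.0


if __name__ == "__main__":
    for ft in (50.0, 200.0):
        print(f"fT = {ft:.0f} GHz: clock <= {max_clock_ghz(ft):.1f} GHz, "
              f"unloaded gate delay ~ {unloaded_gate_delay_ps(ft):.1f} ps")
```

Both numbers are best-case bounds; wiring and input loading push real circuits well below them.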
Interestingly, fmax can be larger or smaller than fT. It can be larger if the product of base resistance and base-collector capacitance, rb CBC, is small enough. The base resistance is also an important parameter in circuit applications when power must be controlled at high frequencies, and lowering this parameter is the subject of several recent articles. The Rockwell HBT has several unique features which focus on lowering this resistance. Since base resistance has two components, one located in the active base region (intrinsic) and one accounting for the base extension out to the base contact, the Rockwell HBT uses a thicker layer for the extrinsic base, and the distance between the emitter and base contact is reduced to a minimum.
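To make the comparison between the two frequencies concrete, the sketch below assumes the widely used first-order approximation fmax = sqrt(fT / (8 pi rb CBC)). Under that assumption fmax exceeds fT exactly when rb CBC is below 1 / (8 pi fT). The component values in the usage example (50 ohm, 10 fF) are purely illustrative and are not Rockwell process data; the function names are hypothetical.

```python
import math


def fmax_hz(ft_hz, rb_ohm, cbc_farad):
    """First-order estimate of the maximum oscillation frequency."""
    return math.sqrt(ft_hz / (8.0 * math.pi * rb_ohm * cbc_farad))


def rc_threshold_s(ft_hz):
    """Value of rb*CBC (in seconds) at which fmax equals fT."""
    return 1.0 / (8.0 * math.pi * ft_hz)


if __name__ == "__main__":
    ft = 50e9
    print(f"fmax > fT requires rb*CBC < {rc_threshold_s(ft) * 1e12:.3f} ps")
    # Illustrative values only (assumed, not from the report).
    rb, cbc = 50.0, 10e-15
    print(f"rb*CBC = {rb * cbc * 1e12:.2f} ps -> "
          f"fmax ~ {fmax_hz(ft, rb, cbc) / 1e9:.1f} GHz")
```

For a 50 GHz fT the threshold is a little under 1 ps, which shows why reducing the extrinsic base resistance matters so much.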
In this contract period a 20 GHz "challenge" VCO test
circuit has been designed and fabricated under companion HSCD
BAA funding through a subcontract to Rockwell. This circuit achieves
this high utilization of the available 50 GHz device capability
through clever use of analog frequency doubler and quadrupler
circuits. This process is fairly nonlinear and consequently a
fair amount of subharmonic generation arises. These subharmonics
are troublesome if the low-pass characteristics of resistor-capacitor
interconnect networks suppress the higher frequencies.
The GaAs/AlGaAs material system is not the only one where HBT's may be fabricated. In recent years HBT's with exciting characteristics have been fabricated in the SiGe materials system at IBM. For several years TI, Hughes, and ATT have been working in the InP/InGaAs/AlGaAs alloy system, and some of the fastest HBT devices reported to date have been made in this system. Interestingly, successful complementary MESFET and HEMT devices have been reported in the same systems, although not necessarily simultaneously. However, in the case of SiGe this exact combination of HBT and CMOS has been realized and is being modeled and characterized at IBM at 0.35 µm x 1.0 µm emitter sizes where lower static power dissipation is required to obtain optimal device speeds. In addition to the lower currents characteristic of scaled systems, the VBE0 is only 0.7 V, or half that of the GaAs/AlGaAs system (1.4 V). In addition, the "safe" emitter current density without dopant redistribution appears to be much higher in the SiGe system. This means that the transistor scaling can be much more complete, since larger transistors will not be needed for higher currents. The IBM SiGe system offers a 50 GHz fT and a breakdown voltage of 3.5 V. HBT integration levels have been demonstrated at greater than 10,000 HBT's, and recently IBM has signed a joint agreement with Analog Devices to create 1 GHz D/A converters which demonstrate 6K HBT yield levels for commercial purposes. There have even been hints of much larger yields. Yields of 10K HBT's would, as we have already stated, make the integration of 16-bit slices feasible, eliminating two pad driver-receiver delays from the most critical ALU path. Yields of 20K HBT's would permit complete integration of the 32-bit datapath.
A preliminary probe of the IBM SiGe HBT process at East Fishkill has been funded as a part of this contract, and permission appears to have been secured to remake in the IBM line the RPI test chip previously fabricated in the Rockwell line. It is expected that the yield will be much higher than with the GaAs/AlGaAs process. Our group is awaiting a set of models and characterizations to be completed by IBM. One of our students will have to travel to IBM to work with proprietary design rules to establish whether this direction is as promising as it appears to be.
By comparison InP is at the opposite end of the yield scale with
HBT counts closer to 10-200, and for that reason it might be considered
premature to examine them for serious digital circuit implementation.
However, it was only a decade ago that the yields for GaAs/AlGaAs
HBT circuits were similarly low. The speeds of the InP HBT are
even higher than for the GaAs/AlGaAs HBT's which are currently
available. For example TI describes an HBT in the InP system which
has an fT of 200 GHz and this device is still not submicron
in size, which suggests modest digital circuits could be made
with unloaded gate delays of as little as 4.5 ps today! As with
SiGe, the InP system has a VBE0 of only about 0.7 V,
so power dissipation is reduced. Even these small chips could
be useful as tester chips for the GaAs/AlGaAs or SiGe HBT circuits
since they are so fast. However, they also create an opportunity
to gain circuit design experience in preparation for a time when
yields for even these delicate circuits may look attractive. TI
has offered us an opportunity to make some small test circuits
in this foundry and we will attempt to do so if the time is available
given the other contract deliverables.
The effect of wire loading in the GaAs/AlGaAs HBT system is becoming more evident as we gain more experience. In particular, despite the use of Polyimide as a dielectric with a low dielectric constant and Au for the interconnects, it is evident from the previous section that extreme care must be exercised in connecting the devices. Layout can be almost as important as it is in CMOS circuits to wrest the utmost speed from the HBT. This is evidenced not only by the delays imposed by the increases in rise time and fall time, but also by the decrease in the bandwidth of signals in the system. When pressing the upper limits possible with the HBT one must be aware of the nonlinearities of the HBT devices, which can generate harmonics and subharmonics. To probe this aspect of the technology we have attempted the design of a "challenge" circuit, targeted to operate at 40% of the fT for the 50 GHz HBT process.
A high speed voltage-controlled oscillator (VCO) has been developed which can generate differential signals in the range of 1-20 GHz. VCOs will become an integral part of future computers for providing clocks for digital and other synchronous circuits. This VCO consists of a frequency generator, a frequency multiplier and a frequency divider, along with various high-speed buffers, multiplexers and drivers. The chip contains 452 transistors and dissipates 2.60 W at 20 GHz. The frequency of oscillation is controlled by an external bias voltage. There is also a 12-stage ring oscillator which is included as a means for determining the baseline speed for the fabricated transistors. This project has enabled us to gain more experience with the Rockwell 50 GHz process as well as aspects of high frequency design using the devices provided. This design has also directly benefited the FRISC project through the improvement of the high-speed register file design and the datapath chip.
The high-speed VCO consists of a frequency generator, a frequency multiplier and a frequency divider (see Figure 3.1). A base frequency is produced by the frequency generator and is controllable by adjusting the bias voltage input to the system. PSPICE simulations have demonstrated that this base frequency ranges from 2-5 GHz. The base frequency can be multiplied by a factor of 2 or 4 by the frequency multiplier. Simulations have shown the frequency multiplier working up to 20 GHz. The frequency divider is capable of dividing the frequency by factors of 2, 4 or 8. The divider is capable of operating on signals up to 20 GHz.
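The frequency plan described above can be enumerated with a short sketch. It assumes (the report does not state this explicitly) that the multiplexers allow any multiplier setting (x1, x2, x4) to be combined with any divider setting (/1, /2, /4, /8); the names `MULTIPLIERS`, `DIVIDERS` and `output_frequencies_ghz` are hypothetical.

```python
from fractions import Fraction

MULTIPLIERS = (1, 2, 4)   # core signal, doubled, quadrupled
DIVIDERS = (1, 2, 4, 8)   # bypass, or one to three toggle flip-flops


def output_frequencies_ghz(base_ghz):
    """Distinct output frequencies reachable from one base frequency.

    Assumption: every multiplier setting may be combined with every
    divider setting.
    """
    base = Fraction(base_ghz)
    return sorted({base * m / d for m in MULTIPLIERS for d in DIVIDERS})


if __name__ == "__main__":
    for base in (2, 5):
        freqs = ", ".join(f"{float(f):g}" for f in output_frequencies_ghz(base))
        print(f"base {base} GHz -> {freqs} GHz")
```

With a 5 GHz base frequency the largest entry is the 20 GHz design target, reached only through the x4 multiplier with the divider bypassed.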
The frequency generator (Figure 3.2) includes four delay elements
connected in a ring with an inversion placed in the differential
feedback path between the last and first elements. The frequency
range of the generator is from 1 to 5 GHz and is controlled by
an externally applied bias voltage. The range of the applied voltage
is from -1 to +1 volt. Also included in the delay elements are
high-gain buffers which are capable of driving the long lines
between the generator core and the VCO multiplexers. In order
to attain the highest speed possible, the generator was implemented
and placed first during the design process. By arranging the delay
elements to fit within a square (see Figure 3.4), interconnect
length (and thus parasitic capacitance) was minimized and the
effect of other circuits upon the core (e.g. wiring crossovers,
etc.) was reduced. Experience in designing this module will assist
the FRISC project in designing future VCOs which may be used to
provide on-chip clocking signals.
The frequency multiplier (also in Figure 3.2) consists of several high-speed exclusive-OR gates which serve to double the frequency of the input signals. Two XORs are used to generate two signals which are twice the frequency of the generator core and which are 90° out of phase. These quadrature inputs to each XOR are taken directly from the outputs of each element in the frequency generator. The output signals from both devices are then fed into a third XOR which generates a signal that is four times the frequency of the generator. The parasitic capacitance of these lines is very important and has harmful effects which are manifested later in the system. Because the frequency-doubling effect of the XORs is best achieved when the input signals are exactly 90° out of phase, the parasitic capacitance of the input lines must be balanced as closely as possible. In addition, the capacitance of each line in the differential signal pair must also be closely matched to its counterpart. The design and layout of the XOR subcells were done in such a manner as to achieve nearly identical internal capacitance values. In addition, the placement of the XORs in the layout was chosen to attain the goal of balanced capacitance. PSPICE simulations with extracted capacitance values indicate that the input signals to the 4X XOR are approximately 87.5° out of phase. A high-gain buffer has been placed between the high-speed XOR and the first multiplexer in order to reduce the loading on the output lines.
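Why the phase balance matters can be illustrated with an idealized model. The sketch below treats the XOR inputs as perfect 50% duty-cycle square waves (an assumption; the real signals are band-limited differential waveforms) and works in integer tenths of a degree so the result is exact. The names `square` and `xor_doubler` are hypothetical.

```python
TICKS_PER_PERIOD = 3600  # one input period, in tenths of a degree


def square(tick):
    """Ideal 50% duty-cycle square wave: 1 during the first half period."""
    return 1 if tick % TICKS_PER_PERIOD < TICKS_PER_PERIOD // 2 else 0


def xor_doubler(offset_tenths_deg):
    """XOR a square wave with a phase-shifted copy of itself.

    Returns (duty_cycle, rising_edges_per_input_period) of the output.
    """
    out = [square(t) ^ square(t - offset_tenths_deg)
           for t in range(TICKS_PER_PERIOD)]
    duty = sum(out) / TICKS_PER_PERIOD
    # out[t - 1] wraps around at t = 0, which is correct for a periodic signal.
    rising = sum(1 for t in range(TICKS_PER_PERIOD)
                 if out[t] == 1 and out[t - 1] == 0)
    return duty, rising


if __name__ == "__main__":
    for offset in (900, 875):  # 90.0 and 87.5 degrees
        duty, edges = xor_doubler(offset)
        print(f"{offset / 10:.1f} deg: duty cycle {duty:.4f}, "
              f"{edges} rising edges per input period")
```

In this idealized model the output always has two rising edges per input period (the doubled frequency), but its duty cycle equals the phase offset divided by 180°, so an 87.5° offset gives a slightly asymmetric waveform rather than a clean 50% one.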
The frequency divider (Figure 3.3) consists of three high-speed toggle flip-flops, each of which divides its input signal by a factor of 2, resulting in frequencies that are 1/2, 1/4 and 1/8 of that of the input. The divider circuit also contains a high-speed multiplexer which selects between the source frequency and the lower (divided) frequencies. Because of the decision to minimize capacitance on the critical high-speed path through the system, the divider circuit has been placed outside of this path. To compensate for the additional parasitic capacitance incurred by the inputs to the divider, a high-gain buffer has been inserted into the high-speed path to drive the additional load. To further reduce unwanted parasitic capacitance and resistance, the output lines from the differential amplifier have been sized at 8 µm apiece with 8.5 µm spacing between them. These lines travel approximately 500 µm between the amplifier and the chip pads.
The most time consuming task experienced during the design and
simulation of the VCO was related to the parasitic capacitance
of interconnect within the system. This resulted in problems such
as output-loading of subcells and unbalanced signal propagation
and amplitude, thereby degrading the output of the system. As
a consequence, much care was taken to ensure that the capacitive
loading of cell connections was acceptable in terms of the resulting
signal characteristics. When necessary, high-powered drivers were
inserted into the system to compensate for the interconnect parasitics.
In some cells, the driver and/or receiver circuits were modified
in order to compensate for the loading. One of the most troublesome
cells was the multiplexer. These cells are critical to the operation
of the system because the high-speed (i.e., 20 GHz) signal has
to pass through at least two instances of the cell. In addition,
extensive PSPICE simulations have shown that feedthrough of lower-frequency
signals in the multiplexers can be a problem and may result in
unwanted lower-frequency components in the high-speed output signal,
resulting in a noisy waveform. Attempts to counter feedthrough
by reducing the low-speed signal pull-up resistors within the
multiplexor did not produce adequate results, hence the design
of special buffers-with-enable was begun. While these new cells
did help the feedthrough problem somewhat, some signals remained
troublesome. As a consequence, special balanced-capacitance buffers-with-enable
were designed. The outputs of these cells have equally balanced
capacitance which results in a further reduction in leakage through
the buffer. However, due to their increased capacitance values,
they are unsuitable for high-speed signal paths and thus are used
only on low-frequency signals (< 10 GHz). A picture of the
high-speed VCO layout is shown in Figure 3.4.
The high-speed voltage-controlled oscillator (VCO) was fabricated
by Rockwell along with the other chips on the high-speed CAD design
reticle supplied by the Rensselaer F-RISC group. A wafer was received
at Rensselaer in March of 1995 at which point testing was commenced.
The VCO is a "challenge" chip to test the limits of
both the design tools and the fabrication process. The upper limit
was determined to be 20 GHz through the use of PSPICE simulations.
In addition to the VCO circuitry, a 12-stage ring oscillator was
also included in the VCO layout in order to provide an internal
process monitor and gauge the raw speed of the transistors. The
measured results indicate that the delay per ring oscillator stage
is 20 ps instead of the expected 15 ps based on SPICE simulations
with the 50 GHz device model. The layout of this ring oscillator
(shown in Figure 3.3.5) differs from others in the reticle in
that the transistors are spaced at least 6 µm apart, thereby
reducing the possibility of inter-transistor interference or cross-talk.
The ring oscillator current switches operate at 2 mA, the maximum
device current which also yields optimal fT.
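The stage delays quoted above translate into oscillation frequencies through the usual ring oscillator relation f = 1 / (2 N td), where N is the number of stages and td the delay per stage. That relation is an assumption here (the report gives only the per-stage delays), and the function name is hypothetical.

```python
def ring_oscillator_ghz(stages, stage_delay_ps):
    """Oscillation frequency of an N-stage ring, assuming f = 1 / (2 * N * td).

    A transition must travel around the ring twice per output period.
    """
    period_ps = 2 * stages * stage_delay_ps
    return 1000.0 / period_ps


if __name__ == "__main__":
    expected = ring_oscillator_ghz(12, 15.0)  # 50 GHz device model
    measured = ring_oscillator_ghz(12, 20.0)  # measured stage delay
    print(f"expected {expected:.2f} GHz, measured {measured:.2f} GHz, "
          f"ratio {measured / expected:.2f}")
```

The ratio of 0.75 is simply 15 ps / 20 ps, i.e. the ring runs at three quarters of the speed predicted by the 50 GHz model.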
Initial test data from the VCO indicate that the circuit is functionally
working, but not at the expected performance. The oscillation
frequencies generated by the VCO and the ring oscillator do not
agree with PSPICE simulations using the baseline 50 GHz device
models. Taken together with other measurements, this appears to be due to
increased parasitic capacitance and reduced device performance.
In order to investigate the effect of each potential source of
error, simulations have been performed with increased parasitic
capacitance and/or reduced device performance. Simulation results
are shown below in Table 3.1. The measured VCO signal yielded
a frequency of 1.1 GHz. Comparing this result with the simulated
results indicates that the best fit occurs when the devices are
derated by 33% and the parasitic capacitances are increased by
a factor of 1.46. A picture of the high-speed 4 x VCO core signal
operating at 13.66 GHz is shown in Figure 3.3.6. This is the highest
VCO frequency ever measured by the Rensselaer F-RISC research
team on a circuit fabricated with the Rockwell 50 GHz HBT baseline process.
While it appears that reduced device performance and increased capacitance explain the discrepancy between simulation and measurement, there is insufficient evidence from the VCO alone to justify this claim. Further measurements on the VCO must be obtained over a range of control voltages and settings in order to determine the existence of a strong correlation. The shapes of the waveforms observed to date do display strong agreement with their simulated counterparts, thereby increasing the credibility of the simulations.
Despite reduced operational speed, the functional capabilities
of the VCO seem to be nearly complete. To date, the core VCO has
been observed along with frequency-multiplied signals at 2X and
4X the core frequency. A frequency-divided signal at 1/2
the frequency has also been observed.
The HSCD reticle contains a modified version of the RPI test chip that was first fabricated in 1993. Among other test circuits, the chip contains an 8-bit carry chain, since the ALU is on one of the most critical delay paths of F-RISC. The 32-bit F-RISC datapath is implemented with four 8-bit slices. The carry chain test circuit is equivalent to the carry propagation circuitry in an 8-bit slice. The layout of the carry chain circuit is shown in Figure 4.1.
The carry chain circuit can be set into oscillation along a long path:
operand B -> 8-bit carry propagate chain -> MUX2 -> AND2 gate -> operand B
or a short path:
carry-in receiver -> AND2 -> MUX2 -> carry-out receiver -> carry-in receiver
The AND2 gates in the long and short oscillation path are used
to select either the long or short path and to stop the oscillation.
These AND gates are not needed in the actual carry chain.
We have measured the carry chain delays using a Tektronix sampling
scope. Table 4.1 shows a comparison of typically measured and
the simulated carry chain delays. The data is from wafer 6 of
the first HSCD wafer lot. The measured delays are closer to the
delays obtained by SPICE simulations using a device model with
an fT of 33 GHz extracted from an earlier run than
to the expected delays based on the 50 GHz HBT device model from
the design manual. Ring oscillator measurements on the same wafer
also indicate that the devices on HSCD wafer 6 are slower than expected.
|short chain Tshort|
|long chain Tlong|
1 HBT model from Rockwell's design manual (based on wafer 2 S-parameter measurements).
2 HBT model extracted from S-parameters from a wafer run in 1993.
3 intrinsic delay without interconnect capacitances.
4 modified carry chain circuit using higher power levels and improved buffers & receivers; power increased from 306 mW to 403 mW.
The circuit was backannotated using QuickCap, a full 3D capacitance extractor, since the carry chain is on one of the most critical delay paths. The GaAs substrate is semi-insulating and therefore the ground plane (backside metallization) is at least a substrate thickness (625 µm, or 75-175 µm for lapped wafers) away from the interconnect layers. The interconnect capacitances are therefore dominated by coupling to nearby conductors, and not by the capacitance to ground as in Si circuits. Hence, a 3D capacitance extraction is necessary to get accurate delay estimates for high speed GaAs circuits. The large distance between the ground plane and the interconnect layers causes problems with finite-element methods because the whole GaAs substrate needs to be meshed. Random-walk methods, such as the one used in QuickCap, can handle this situation much more readily. If the ground plane is far away, the run time increases only slightly, because a fraction of the walks generated take more random hops before terminating on a conductor or the ground plane. Analyzing GaAs interconnects is also more difficult than analyzing Si interconnects because several dielectric layers must be included in the analysis (GaAs, SiO2, SiN, Polyimide, air).
The 3D geometry is generated from a 2D mask-level description of the circuit and a technology file that describes how each mask is used to grow or etch material during processing. In order to simplify the problem for analysis on a single workstation, a planar assumption is made: all metal 1 is at the same distance from the substrate, etc. This reduces the geometric interactions between the layers and reduces the number of 3D structures needed to represent the IC geometry. Comparison of planar and non-planar models of standard cells has shown that the planar assumption changes the capacitance values only by a few percent. Currently, 3D extraction of large circuits is very time consuming, since a hierarchical SPICE netlist must be used for running and optimizing large circuits, whereas the 3D capacitance extractor tools at our disposal do not support hierarchical extractions and are not integrated in our main CAD tool suite. Thus several time consuming manual steps are required for preparing layouts for extraction and for feeding the extraction results back into the simulation. We have made the developers aware of this problem and urged them to support hierarchical capacitance extractions. Simulation results with and without interconnect capacitance show that the interconnect delays are quite significant. Almost 50% of the delay on the short path is due to interconnect.
The measured carry chain delays on wafer 6 are matched quite well
by SPICE simulations with the 33 GHz device model. SPICE predicts
delays that are 10% and 2% too low. This can be expected since
large parallel plate measurements indicated that the interconnect
capacitances are 1.45 times higher than expected due to a problem
with the thickness of the dielectric layers. The 50 GHz model in the
design manual was derived from measurements taken from wafer 2
of the same lot. Thus we have two device models that represent
the devices on two wafers from the same lot. We can, therefore,
see the impact of device speed variations on the critical ALU
path which could determine the cycle time of F-RISC. The following
gate delays have been estimated from SPICE simulations:
1 AND2 gate on short oscillation path
2 AND2 gate on long oscillation path
The 32-bit F-RISC datapath is partitioned into four 8-bit slices. The worst case ALU delay therefore includes three chip crossings. The worst case operation is a 32-bit addition with operand B equal to $FFFF'FFFF and operand A equal to $0000'0001. This operation causes carry propagation from the least significant bit to the most significant bit. In the worst case the ALU operation is followed by a feed forward of the ALU result to operand B for the next ALU operation. The worst case delays on the ALU path with the 8-bit carry select adder are shown in Table 4.3.
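The worst case described above can be reproduced with a small behavioral model of the sliced datapath. This is only an arithmetic sketch, not a model of the carry select circuitry or its timing; the function name `sliced_add` is hypothetical.

```python
def sliced_add(a, b, width=32, slice_bits=8):
    """Add two unsigned integers slice by slice, as a sliced datapath would.

    Returns (result, carry_out, chip_crossings), where chip_crossings counts
    the carries handed from one slice to the next more significant slice.
    """
    mask = (1 << slice_bits) - 1
    num_slices = width // slice_bits
    result, carry, crossings = 0, 0, 0
    for i in range(num_slices):
        shift = i * slice_bits
        total = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (total & mask) << shift
        carry = total >> slice_bits
        if carry and i < num_slices - 1:
            crossings += 1  # carry leaves this chip and enters the next
    return result, carry, crossings


if __name__ == "__main__":
    result, carry_out, crossings = sliced_add(0x00000001, 0xFFFFFFFF)
    print(f"result=${result:08X} carry_out={carry_out} crossings={crossings}")
```

With operand A equal to $0000'0001 and operand B equal to $FFFF'FFFF the carry generated in the least significant slice is handed across all three slice boundaries, which is the three chip crossings counted in the text.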
Clearly the 1 ns target cannot be achieved with 33 GHz devices. Even with 50 GHz devices the delays on the ALU path are too long to support a 1 ns cycle time. The delays had to be reduced to allow for 25 ps of skew and provide some slack time. An analysis of the interconnect delays shows that most of the interconnect delay penalties are due to the long connections from the standard cell core to the periphery with the I/O pads. The interconnect capacitances on the carry-in and carry-out signals between the driver/receiver circuits are quite high (450 fF, 250 fF). The short path includes two of these heavily loaded connections from the I/O pad ring to the standard cell core, and SPICE simulations indicate that 49% of the short path delay is due to interconnect delays.
Special cells with a very high drive capability are required to
reduce the delays on these highly loaded signals since the gates
on the carry chain are already high power gates. Simply increasing
the current levels everywhere to the maximum supported by the
smallest HBT device (2 mA) is not a viable option because of power
dissipation problems in GaAs. The drive capability can be improved
by using emitter followers or by using a push-pull super-buffer.
Operand B -> Carry_Chain -> MUX2 -> Driver
Delay = Tlong - TAND + TDR
|Chip to Chip MCM Delay (10 mm, er = 2.72)|
Carry_in Receiver -> MUX2 -> Driver
Delay = Tshort - TAND
|Chip to Chip MCM Delay (10 mm, er = 2.72)|
Carry_in Receiver -> MUX2 -> Driver
Delay = Tshort - TAND
|Chip to Chip MCM Delay (10 mm, er = 2.72)|
rec -> SBUF2 -> MUX2 -> XOR2 -> MSLATCH
Delay = Tshort - TAND - TDR + TSBUF + TXOR
MSLATCH -> MUX4 -> DLMUX2
Delay = TMUX + TDLM
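The chip to chip MCM entries above specify a 10 mm trace in a dielectric with er = 2.72. A rough time of flight for such a trace can be estimated as shown below, assuming the signal propagates entirely within a homogeneous dielectric at c / sqrt(er). This ignores driver, receiver and rise time effects and is not the value from Table 4.3; the function name is hypothetical.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def time_of_flight_ps(length_mm, relative_permittivity):
    """Propagation delay of a trace, assuming velocity = c / sqrt(er)."""
    velocity = SPEED_OF_LIGHT_M_PER_S / math.sqrt(relative_permittivity)
    return (length_mm * 1e-3) / velocity * 1e12


if __name__ == "__main__":
    print(f"10 mm at er = 2.72: {time_of_flight_ps(10.0, 2.72):.1f} ps")
```

Under these assumptions each crossing costs roughly 55 ps of flight time, so three crossings alone consume a significant share of a 1 ns cycle before any driver or receiver delay is added.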
Since we use three levels of current switches the outputs of a
gate can have three different offset levels (0, -VBE0, -2*VBE0, with VBE0
= 1.4 V). The different levels are necessary to prevent saturation
of the current switches which are stacked up to three levels deep
in our current tree logic. Figure 4.2 shows the different buffer
circuits.
The drive capability of the different buffer circuits is shown
in Figure 4.3. The super-buffer clearly has the best drive capability
while dissipating less power than a level 2 differential ECL buffer
with an emitter follower current of 2 mA. However, the input to
the super-buffer must be at the same level as the output or at a lower
level. Emitter followers can be added to any current tree/gate
and introduce only a small intrinsic delay. The input of a CML
or ECL buffer can be at any level, only the long tail resistor
that determines the tree current needs to be adjusted. The CML
gate has the lowest power dissipation and the lowest device count,
the main reasons for preferentially using CML in our designs.
However, the CML buffer and the ECL buffer with a level 3 output
have a higher interconnect loading sensitivity than the super-buffer
or the level 2 ECL buffer. Hence, signals with high interconnect loads
must be at offset level 2, driven either by ECL gates or by a super-buffer.
However, these circuits can be used only on critical paths because
of the increase in power and device count.
The F-RISC/G system is illustrated in Figure 5.1. Instructions supplied by the instruction cache are decoded by the instruction decoder, which sends the operand and destination addresses and control information to the datapath. The data cache is used only by Load and Store instructions.
The level 1 (L1) cache comprises the primary instruction and data
caches. Each cache consists of a single cache controller chip
and eight 2-kbit RAM chips. The controllers for the two caches
are identical, differing only in the settings of some control
signals. This approach was necessary in order to minimize fabrication
costs, and minimizing the penalty for this restriction represented
a significant percentage of the design effort. The cache controller
handles all handshaking with the secondary cache and the CPU,
and sets the control lines of the RAMs as appropriate. The cache
controller and cache RAM designs have been completed.
Each RAM chip is configured to store 32 rows of 64 bits and is single-ported. One unique feature of these chips is that they have two distinct "personalities." Each RAM may read or write data four bits at a time using the DIN and DOUT buses, so each 64-bit row of memory may be filled one nibble at a time. A separate 64-bit bi-directional bus (L2BUS) allows reading or writing of an entire row at once. This feature helps to reduce cache miss penalties.
The key functional components of the cache controller chip are
the tag RAM, a three stage pipeline with integrated counter, and
a comparator. The organization and interconnection of these functional
structures is illustrated in Figure 5.2. The chip additionally
includes circuitry to supply appropriate control signals to the
major functional units and circuitry which provides at-speed testing
capability of wafers or dies. The cache controller was designed
for dual use in both the instruction and data caches. For this
reason, the first pipeline latch serves also as the Remote Program
Counter (RPC) in the ICC configuration.
Two data paths shown on the block diagram are critical and thus were carefully optimized. The first is the 9-bit path from the ABUS, through the master of pipeline latch 1, and out to the cache RAMs. The second critical path is the MISS generating circuitry. This path requires reading an address from the ABUS, addressing and reading the tag RAM, and sending the result into the comparator.
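The MISS path described above (read the address, read the tag RAM, compare) can be illustrated with a minimal sketch. It assumes a direct-mapped organization with one tag per RAM row, and it assumes that the 9-bit RAM address splits into a 5-bit row index (32 rows) and a 4-bit nibble select (16 nibbles per 64-bit row). Neither the field split nor the class and method names are taken from the actual controller design; they are hypothetical.

```python
# 32 rows and 16 nibbles per row come from the text; treating the 9-bit
# RAM address as row index + nibble select is an assumption of this sketch.
NIBBLE_BITS = 4
ROW_BITS = 5


class TagRamSketch:
    """Hypothetical direct-mapped tag store with one tag per RAM row."""

    def __init__(self):
        # None marks a row that holds no valid line yet.
        self.tags = [None] * (1 << ROW_BITS)

    @staticmethod
    def split(address):
        """Split an address into (tag, row, nibble) fields."""
        nibble = address & ((1 << NIBBLE_BITS) - 1)
        row = (address >> NIBBLE_BITS) & ((1 << ROW_BITS) - 1)
        tag = address >> (NIBBLE_BITS + ROW_BITS)
        return tag, row, nibble

    def is_miss(self, address):
        """Read the stored tag for the row and compare it with the address tag."""
        tag, row, _ = self.split(address)
        return self.tags[row] != tag

    def fill(self, address):
        """Record the tag after the row has been loaded from the next level."""
        tag, row, _ = self.split(address)
        self.tags[row] = tag


cache = TagRamSketch()
addr = (0x5 << 9) | (3 << 4) | 7   # tag 5, row 3, nibble 7
first_access_miss = cache.is_miss(addr)
cache.fill(addr)
second_access_miss = cache.is_miss(addr)
```

In this sketch the first access misses and the second hits, while an address with a different tag that maps to the same row would miss again.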
As the F-RISC/G prototype is partitioned, interchip communication
becomes an important issue. Large fractions of the cycle time
are consumed by communication between chips. Each off-chip communication
entails a driver and receiver delay (I/O delay) as well as an
MCM time of flight delay. Rise time delays and skew must also be considered.
Figure 5.3 illustrates the communication that occurs with the
primary data cache. The primary cache communicates with the secondary
cache, the datapath, and the instruction decoder. Figure 5.4 shows
that the primary instruction cache also communicates with all
of the core CPU chips as well as the secondary instruction cache.
Table 4.1 lists the line lengths and associated delays for communications between the CPU and the primary cache. The line length figures are based on work performed by Atul Garg as part of his doctoral research. In order to determine these line length figures, the entire MCM was routed by hand.
The delay figures are based on a dielectric with εr = 2.67,
which translates to a time of flight on the MCM of 5.44 ps/mm.
An additional 50 ps per line was allowed for rise time degradation.
|Signal||MCM Length (mm)||Delay (ps)|
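The time of flight figure quoted above follows from the dielectric constant through the usual relation t = sqrt(εr) / c. The sketch below reproduces it (to within rounding) and adds the 50 ps rise time allowance. The function names and the 10 mm example length are illustrative assumptions, and the driver and receiver (I/O) delays are not included.

```python
import math

C_MM_PER_PS = 0.299792458  # speed of light in vacuum, in mm per ps


def time_of_flight_ps_per_mm(eps_r):
    """Propagation delay per unit length in a uniform dielectric."""
    return math.sqrt(eps_r) / C_MM_PER_PS


def line_delay_ps(length_mm, eps_r=2.67, rise_time_allowance_ps=50.0):
    """Flight time of an MCM line plus the fixed rise time allowance."""
    return length_mm * time_of_flight_ps_per_mm(eps_r) + rise_time_allowance_ps


tof = time_of_flight_ps_per_mm(2.67)   # about 5.45 ps/mm
example_delay = line_delay_ps(10.0)    # hypothetical 10 mm line
```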
Table 5.5 enumerates the signals used for communication between
the primary and secondary caches.
|Signal||Width (bits)||From||To||Description|
|L2ADDR||28||CC||L2||28-bit line address.|
|L2DONE||1||L2||CC||Indicates that the L2 has completed a transaction. Any data L2 places on the bus must be valid when this is asserted.|
|L2DIRTY||1||CC||L2||Indicates that the L2 will be receiving an address to be written into.|
|L2MISS||1||CC||L2||Indicates that the address on L2ADDR is needed by the CPU.|
|L2CNTRL||2||CC||L2||Cache flush and init information.|
|L2SYNCH||1||CC||L2||A 1 GHz clock used for synchronizing with L2.|
|L2VDA||1||CC||L2||The address currently on L2ADDR is valid.|
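As a quick cross-check of the interface, the signals listed above can be tallied by direction. This is only a bookkeeping sketch over the rows as they appear in this section; the `Signal` tuple and helper name are hypothetical, and any data bus lines not listed in the table are not counted.

```python
from collections import namedtuple

Signal = namedtuple("Signal", "name width source dest")

# Rows transcribed from the table above (CC = cache controller).
L2_SIGNALS = [
    Signal("L2ADDR", 28, "CC", "L2"),
    Signal("L2DONE", 1, "L2", "CC"),
    Signal("L2DIRTY", 1, "CC", "L2"),
    Signal("L2MISS", 1, "CC", "L2"),
    Signal("L2CNTRL", 2, "CC", "L2"),
    Signal("L2SYNCH", 1, "CC", "L2"),
    Signal("L2VDA", 1, "CC", "L2"),
]


def lines_by_direction(signals):
    """Sum the signal widths for each (source, dest) pair."""
    totals = {}
    for s in signals:
        key = (s.source, s.dest)
        totals[key] = totals.get(key, 0) + s.width
    return totals


totals = lines_by_direction(L2_SIGNALS)
total_lines = sum(totals.values())
```

With the rows as listed, 34 lines run from the cache controller to the L2 and a single line (L2DONE) runs back.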
Since little is known about the eventual design of the secondary caches (even the device technology is not certain), as much freedom as possible was given to the designer of the secondary cache while still assuring that the "usual case," the Load hit, is optimized. As a result of this uncertainty and the fact that the secondary caches do not share the synchronized clock used by the primary caches and core CPU, the timing requirements of the secondary caches are very specific. The specification of the handshaking between the L1 and L2 caches is now complete.
As part of the design of the cache controller chips, complete system timing diagrams, incorporating MCM delays and operations within each chip, were created. From these diagrams it was confirmed that at-speed system operation is possible using MCMs with dielectrics in the εr = 2.67 range. Some representative timing diagrams follow.
By replacing the carry-in receiver with a super-buffer circuit
and adding emitter followers to the carry output multiplexer gate
the ALU delays are reduced from 1063 ps to 853 ps while only increasing
the power dissipation of the carry chain from 306 mW to 403 mW.
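The trade-off quoted above can be put in relative terms with a few lines of arithmetic. The helper name is illustrative; the four input figures are the ones given in the text.

```python
def relative_change(before, after):
    """Fractional change from 'before' to 'after' (negative means a reduction)."""
    return (after - before) / before


delay_change = relative_change(1063.0, 853.0)   # ALU delay, ps
power_change = relative_change(306.0, 403.0)    # carry chain power, mW

delay_pct = round(delay_change * 100, 1)
power_pct = round(power_change * 100, 1)
```

That is, roughly a 20% reduction in ALU delay is bought with roughly a 32% increase in carry chain power.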
The purpose of the cache RAM chip is to provide high speed memory for the primary caches of F-RISC. This memory part is capable of storing 2 kbits of data. The data can be accessed either 4 bits at a time or 64 bits at a time. Each type of access uses different data input and output pads, allowing concurrent data access without a single shared data bus. The 4-bit data path is known as the fast path because of the high speed of the memory accesses along this path, while the 64-bit data path is known as the wide path. The large size of the bus connecting the L1 and L2 caches, 512 bits, helps reduce cache miss penalties. The cache RAM must be capable of either storing or retrieving data on the fast path once every processor cycle. Since the F-RISC processor has a 1 nanosecond cycle time, system timing constraints require that the pipelined cache RAM have a read access time of 750 ps in order to provide instructions and data to the processor in a timely fashion. In addition, the write access time after the application of the write signal on the fast path can be no longer than 750 ps in order to allow data to be stored before it is removed from the data lines by the processor.
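The capacity and bus widths quoted for the cache RAM are mutually consistent, as the following sketch shows. It assumes that the eight RAM chips of a cache operate in parallel, so that their wide paths together form the 512-bit bus and their fast paths together form one 32-bit word; the second point is an inference, not a statement from the text. The class and variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheRamChip:
    rows: int = 32             # rows per chip (from the text)
    bits_per_row: int = 64     # wide path width (from the text)
    fast_path_bits: int = 4    # fast path width (from the text)

    @property
    def capacity_bits(self):
        return self.rows * self.bits_per_row

    @property
    def nibbles_per_row(self):
        return self.bits_per_row // self.fast_path_bits


CHIPS_PER_CACHE = 8  # eight RAM chips per primary cache (from the text)

chip = CacheRamChip()
cache_capacity_bits = CHIPS_PER_CACHE * chip.capacity_bits
wide_bus_bits = CHIPS_PER_CACHE * chip.bits_per_row
fast_word_bits = CHIPS_PER_CACHE * chip.fast_path_bits  # assumes parallel operation
```

Each chip holds 2048 bits (2 kbit), each cache holds 16 kbit, and the combined wide paths give the 512-bit L1/L2 bus.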
No off-the-shelf memory parts currently available can access data at the required rates. For this reason, the cache RAM chip was designed using the same Rockwell process utilized in the F-RISC processor chips. This process provides devices with switching speeds fast enough and current handling capabilities high enough to build a memory part that meets the required design specifications. The artwork for the cache RAM chip is shown in Figure 5.10.
Capacitance values of the nets on this chip were extracted using
two-dimensional and three-dimensional capacitance extraction software.
Digital simulations were performed using these capacitance values
to ensure the functionality of the control and testing logic as
well as to estimate the timing parameters associated with the
cache RAM design. SPICE simulations were also performed using
the extracted capacitance values on both the digital and analog
circuits along the critical paths of this chip to determine critical
cache RAM timing parameters. The results of these simulations
show that the cache RAM chip will function properly and meet the
required design specifications.
Four RPI test chips were submitted to Rockwell for fabrication
last July. The layout of the reticle is shown in Figure 11.
This reticle contains four RPI chips - passive test chip, standard
cell test chip, 20 GHz voltage controlled oscillator (VCO) test
chip, and register file / carry chain test chip. The first wafer
(wafer 6 of the HSCD lot) was shipped to RPI in December.
The mask contains a variety of circuits to determine the basic device performance as a function of power supply voltage, current level, temperature and processing variations. Specifically, the passive test chip contains test structures to measure wiring parasitics on an HBT chip. It also carries ring oscillators and gate delay chains to provide basic delay information as a function of capacitive load and fanout. Other chips contain a number of key circuits used in the main architecture chips. The 20 GHz VCO chip has a high-speed voltage controlled oscillator on the chip with several other circuits to test the performance of the process. The register file test chip is an optimized version of the previous test chip fabricated at Rockwell. It also includes the high-speed carry chain macro and associated support circuits. The standard cell test chip contains a number of representative standard cells used in the F-RISC/G chips and tests the implementation of the boundary scan test scheme applied to test the instruction decoder and the datapath chips.
Divider circuits are used to determine flip-flop performance.
Several functional circuits are also used including a 2:1 mux,
1:2 demux, 4x4 parallel multiplier and a 7-bit LFSR. These circuits
are used to evaluate yield and cell performance in a variety of
conditions. Additional test structures were included to measure
individual cell and device characteristics.
The layout of the chip is shown in Figure 12. This chip contains
both passive test structures and active test structures.
The passive structures are meant for measuring wiring parasitics on an AlGaAs/GaAs HBT chip and comparing the measured results with results obtained from CAD tools. The structures are divided into five categories - capacitors, inductors, probe calibration, transmission lines, and resistors.
The active structures are divided into three categories -- coupling, ring oscillators, and device characterization structures. The coupling structures allow measuring the coupling between differentially coupled wires and single-ended wires.
A number of 8-stage ring oscillators are placed on the chip to
measure the device performance with respect to the load and the
local temperature. These oscillators are made up of standard Q1
and the new round Q1 transistors. The oscillation frequencies
of these structures lie in the range of 0.5 GHz - 3.0 GHz. A number
of device-characterization structures are also provided close
to the ring oscillators to correlate the measurements with the
MIM capacitors are made between the M1 and M2 layers, sandwiching only the nitride layer.
The following tables compare the expected and measured parallel plate capacitance figures. Mayo has measured similar parallel plate capacitor structures on one of their wafers. Even though the measurements are from two different wafer lots, both indicate that the capacitance figures are off by 40-50%. Thus either the dielectric constant or the dielectric thickness is off. After we reported the problem, Rockwell measured the Polyimide thickness on a cross-section of an HSCD wafer and informed us that the measured dielectric layer thickness is in the 0.9 - 0.95 µm range instead of the 1.6 µm shown in the design manual.
Parallel Plate Measurement Results from Mayo (different wafer lot)
These structures will measure the influence of neighboring conductors, power and ground rails on other layers, and crossovers on the capacitance of a line. The structures have been analyzed using the 2-D/3-D capacitance extraction tools from Random Logic Corporation (QuickCap) and OEA International (Metal).
Since HBT circuits are almost always designed with differential logic, it was felt that several differential line configurations were also required. These structures include wires with varying nearby grounded conductors, wires with adjacent differential lines, wires with metal planes on other layers, signal line overcrossings, etc. To address perceived difficulties in measuring some of the parasitics, some of these structures were incorporated into wire length oscillator circuits. These circuits can be simulated with SPICE using the capacitance values calculated by tools such as METAL by OEA and QuickCap by RLC, so that the simulated frequency of oscillation can be compared with the measured one.
Since the structures described above involve some active transistor devices, special probe de-embedding sites are provided in the same area of the wafer and die to characterize the HBT devices at microwave frequencies. The chip contains de-embedded transistors and de-embedded Schottky diodes.
Figure 13 compares measured and expected ring oscillator delays.
The measured delays are from wafer 6 of the first HSCD lot. The
simulated delays are based on SPICE simulations with different
device models and different interconnect cross-sections. The measured
results are matched quite well if the capacitance figures are
multiplied by a factor 1.4 to account for the thinner than expected
dielectric layers and by using a 33 GHz device model instead of
the 50 GHz device model from the design manual. We have meanwhile
made measurements on two additional wafers (wafer 3 and wafer
8). However, the ring oscillator delays on these wafers are very
close to the ones measured on wafer 6. Thus, the device performance
on at least three wafers is matched best with a 33 GHz device model.
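To relate ring oscillator frequencies to gate delays, a minimal sketch is given here. It assumes the usual relation f = 1 / (2 N t_d) for an N-stage ring, where t_d is the average per-stage delay; the report does not state which relation was used, and the function name is illustrative. The inputs are the 8-stage rings and the 0.5 GHz - 3.0 GHz frequency range mentioned earlier.

```python
def stage_delay_ps(freq_ghz, stages):
    """Average per-stage delay of a ring oscillator, from f = 1 / (2 * N * t_d)."""
    period_ps = 1000.0 / freq_ghz
    return period_ps / (2 * stages)


fastest = stage_delay_ps(3.0, 8)   # upper end of the quoted frequency range
slowest = stage_delay_ps(0.5, 8)   # lower end of the quoted frequency range
```

Under this assumption the quoted frequency range corresponds to per-stage delays between roughly 21 ps and 125 ps.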
We have verified that the VCO and the modified test circuits are
working correctly. On the 2.6 W VCO 'challenge' chip we have measured
an oscillation frequency of 13.6 GHz, the fastest HBT circuit
we have seen so far. However, our measurements on the four RPI
test circuits on the HSCD reticle indicate that at least three
wafers have the following two problems.
o The S-parameter measurements of large parallel plate capacitors
which should have a uniform dielectric thickness show that the
interconnect layers are too thin, resulting in a 1.46 times higher
capacitance. Measurements at Mayo of similar structures fabricated
on another wafer lot indicate that this might not just be a problem
on the HSCD run. Rockwell has examined a cross-section of an HSCD
wafer and reported that the M1-M2 Polyimide thickness is only
0.9 - 0.95 µm instead of 1.6 µm as shown in the design
manual. Since our group is designing large digital HBT circuits,
interconnect capacitances have a significant impact on performance.
Hence, this problem should be resolved before the fabrication
of our architecture chip, otherwise the processor will be slowed
down by 20%.
o The ring oscillator measurements on the RPI 'passive' HSCD indicate that the devices on the first wafer run are slower than expected. We have seen only very small performance variations (<5%) on the three wafers we obtained from Rockwell. Rockwell has suggested that the devices might be slower than expected because of:
> cooling problems
> resistive or capacitive coupling between adjacent devices
However, preliminary 3D thermal modeling, measurements on a lapped
wafer, and measurements on ring oscillators in which the devices
are spaced 6 µm further apart do not support these suspicions.
One of the wafers was lapped to 7 mils and some preliminary worst
case temperature calculations with QuickCap indicate that the
device temperature should be below 50 °C with our water cooled
chuck. Further, if the device temperature were causing a problem,
we should have seen higher ring oscillator frequencies on the
lapped dies (7 mil thick) compared to the 25 mil thick wafers.
In addition, we consistently can match the measured result within
5-10% using a 33 GHz device model and by increasing the interconnect
capacitance by the measured factor of 1.4. We are currently collaborating
with Mayo and West Point to get device S-parameter measurements
of the wafers and lapped dies. While it is possible that the isolated
devices used for S-parameter measurement are performing according
to specification we consider this unlikely. There is sufficient
evidence of a second process problem, namely the devices on at
least three wafers have an fT in the 33-35 GHz range
instead of 50 GHz. This problem should be addressed before the
architecture reticle run, otherwise the chips will be 30% slower.
This is of great concern to us since the HBT fabrication runs
are very expensive and we only have sufficient funds for two foundry
runs. The combination of the two problems results in a slowdown
of heavily loaded circuits by as much as 50%.
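How the two problems combine can be illustrated with a crude first-order model. This is not the SPICE methodology used for the results above. It assumes that gate delay scales inversely with device fT, and that a fraction `load_fraction` of the nominal delay is due to interconnect capacitance and scales with the capacitance factor. Both the model and the parameter names are assumptions made for illustration only.

```python
def relative_speed(ft_measured_ghz, ft_nominal_ghz, cap_factor, load_fraction):
    """Achieved speed as a fraction of the predicted speed.

    load_fraction is the assumed share of the nominal delay that is
    proportional to interconnect capacitance (0 = unloaded, 1 = fully
    wire dominated).
    """
    device_slowdown = ft_nominal_ghz / ft_measured_ghz
    wiring_slowdown = (1.0 - load_fraction) + load_fraction * cap_factor
    return 1.0 / (device_slowdown * wiring_slowdown)


# 33 GHz instead of 50 GHz devices, 1.4 times higher interconnect capacitance.
lightly_loaded = relative_speed(33.0, 50.0, 1.4, load_fraction=0.0)
heavily_loaded = relative_speed(33.0, 50.0, 1.4, load_fraction=1.0)
```

Under these assumptions an unloaded circuit runs at about two thirds of the predicted speed and a fully wire-dominated circuit at a little under half, which is in line with the slowdown of up to 50% stated above.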