F-RISC/G - A 1.0 GigaOPS Fast RISC Processor for Superworkstation and TeraOPS Parallel Processing Applications

ARPA Contract Number DAAL03-90G-0187

[AASERT Award DAAL03-92G-0307 for Cache Memory]

Semi-Annual Technical Report

October 1992 - March 1993



Prof. John F. McDonald

Center for Integrated Electronics

Rensselaer Polytechnic Institute

Troy, New York 12180



(518)-276-2919
FAX (518)-276-8761
MACinFAX (518)-276-4882
email: mcdonald@unix.cie.rpi.edu

Abstract

The F-RISC/G (Fast Reduced Instruction Set Computer -- version G) project has as its goal the development of a one nanosecond cycle time computer using GaAs/AlGaAs Heterojunction Bipolar Transistors (HBT). In the past contract period the primary activities consisted of testing the RPI HBT test chip, finalizing the boundary scan approach to be used for testing the chips in the FRISC/G system, and incorporating this scheme into the layouts for the final 5 HBT chip types to be used in the architecture. These integrated circuits are being prepared for submission to Rockwell in two triple wide reticle runs beginning in mid-May. The results of the preliminary test chip run were obtained from a set of 58 dies supplied to us by Rockwell in December of 1992. Of these 58 dies, approximately half were found to have sufficient yield to verify that the CAD tools developed earlier in the program work properly, and that the circuit style selected for implementation of the cell library [full differential Current Mode Logic with long tail resistor current sourcing] performed correctly. These were the two challenges for which the design team was primarily responsible. Partial confirmation of working pieces of the large register file was obtained. Unfortunately the speeds of the circuits were only about 75% of what was predicted. The cause was successfully traced at Rensselaer by parameter fitting (using cooperatively measured data on test structures at Rockwell) to poor epi quality of the wafers used in our lots. The experience gained in tracing this problem demonstrated the ability of our group, working in conjunction with Rockwell, to track yield problems. In spite of operating at only 75% of the expected performance, the positive results of this contract period include working circuits that ran at 2.66 GHz, suggesting that with the process improvements expected in the forthcoming 4 inch Rockwell line, sufficient numbers of dies working at proper speed would be available to make the demonstration FRISC/G MCM system.

Project Goals

Exploration of the Fundamental Limits of High-Speed Architectures.

Study of GaAs HBT for Fast Reduced Instruction Set Computer Design including Adequacy of Yield, and Device Performance for High Performance Computing Applications.

Research in the Architectural Impact of Advanced MultiChip Module (and 3D) Packaging in the GHz Range.

Examination of Pipelining across Pad Driver Boundaries as a Means for Reducing Adverse Effect of Partitioning due to Yield Limitations in Advanced Technologies.

Investigation of Power Management in High Power Technologies.

Study of Appropriate Memory Organization and Management for Extremely Fast RISC Engines using Yield Limited Technologies.

Use of Adaptive Clock Distribution Circuits for Skew Compensation at High Frequencies.

Exploration of Superscalar and VLIW organizations in the sub-nanosecond cycle regime.

Exploration of a combination of HBT and MESFET technologies for lower power, higher yield, but fast cache memory [AASERT Program].

Introduction

The F-RISC/G (Fast Reduced Instruction Set Computer - version G) project has as its goal the development of a one nanosecond cycle time computer using GaAs/AlGaAs Heterojunction Bipolar Transistor Technology. More generally the project seeks to explore the generic question of how one can achieve with bipolar circuits higher clock rates than expected from Silicon based CMOS. Traditionally CMOS has achieved its increasing clock rates from lithography improvements, shortening the lengths of devices and interconnections to achieve higher speed. Bipolar devices, on the other hand, have achieved their speed by reducing the thickness of various device layers, with more recent improvements coming from band gap engineering in heterostructures, and schemes to include built-in acceleration fields in the base transit region (graded base techniques). Since thicknesses of various layers in semiconductor processing can be minimized with proper yield engineering, short device transit times (or high transit time frequency) in principle favor the bipolar device. This is often quantified, at least for analog applications, by the closely related unity current gain frequency, $f_T$:

$f_T = \frac{1}{2 \pi \tau_{EC}}$    (1)

where $\tau_{EC}$ is the total emitter-to-collector delay time.

The final device performance, however, depends also on other device parasitics such as the base resistance, $r_b$, and the collector-base capacitance, $C_{bc}$, as quantified in the so-called maximum oscillation frequency or unity power gain frequency:

$f_{max} = \sqrt{\frac{f_T}{8 \pi r_b C_{bc}}}$    (2)

These two canonic frequencies are often quoted in the literature for analog circuit performance, and they give some feeling for how fast the basic device will respond. One must use caution, however, as these parameters often do not give a clear picture of the speeds of logic circuits which utilize these devices. In addition to these "intrinsic" HBT parameters, logic circuits also depend on the burden of wiring capacitance and circuit input loading feasible with the technology. Nevertheless, the parameter $f_{max}$ is considered a better indicator of logic circuit performance than $f_T$.

It is possible to achieve clock frequencies in digital circuits close to 25% of $f_T$ in small circuits such as frequency counters and serial to parallel converters. Hence, some 12 GHz serial to parallel converters have been fabricated in 50 GHz $f_T$ technology using full differential logic circuits. Unloaded gate delays often approach $1/f_T$; hence the unloaded fast gate delay in the same 50 GHz technology is around 18 ps. Of course, the circuits achieving these upper limits in speed also have high power dissipation.
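
As a quick numerical check of these rules of thumb (our arithmetic, using the figures quoted above): $0.25 \times 50\,\mathrm{GHz} = 12.5\,\mathrm{GHz}$, consistent with the 12 GHz converters, and $1/f_T = 1/(50\,\mathrm{GHz}) = 20\,\mathrm{ps}$, close to the quoted 18 ps unloaded gate delay.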

Interestingly, $f_{max}$ can be larger or smaller than $f_T$. It can be larger if the $r_b C_{bc}$ product is small enough. The base resistance $r_b$ is an important parameter in circuit applications, and lowering this parameter is the subject of several recent articles.

Traditionally CMOS has held an advantage in power dissipation. However, CMOS utilizes a rail-to-rail logic voltage swing, and at higher frequencies the dynamic power dissipation of CMOS can become a limiting factor in its exploitation for the high end of computing. Bipolar circuits, on the other hand, have a high static power dissipation, but can be arranged to have very low dynamic power dissipation in actual logic circuits by using much lower voltage swings. Traditionally the high power dissipation of bipolar has been required in order to keep current densities up in the emitter for high gain. However, if bipolar technology is scaled using more aggressive lithography than the current 1 µm minimum feature size, this current density can be kept high while lowering the total current and therefore the static power. A comparable scaling of interconnections is also assumed in making this statement. At some point the dissipation per gate of these two technologies will intercept, and the question will then be which one delivers the highest computational rate for the lowest total power. The answer to this question may well prove surprising.
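
To make the crossover argument concrete (an illustrative first-order comparison, not a calculation from this report): per gate,

$P_{CMOS} \approx \alpha C_L V_{DD}^2 f$,    $P_{bipolar} \approx V_{EE} I_{EE}$,

so CMOS dissipation grows linearly with the clock frequency $f$ while differential bipolar dissipation is essentially static; the two intersect near $f^{*} \approx V_{EE} I_{EE} / (\alpha C_L V_{DD}^2)$, above which the low-swing bipolar gate is the less power-hungry one.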

Noise is another key consideration in predicting the future for computer technology. CMOS is not a balanced (continuous) current logic technology. Devices conduct current in CMOS only while loads are charging. Once these capacitive wiring loads are charged or discharged the current flow ceases. Hence, currents in CMOS circuits are constantly switching on and off. This causes transient current surges in the power distribution system, leading to switching noise due to parasitic inductances in the power supply. Bipolar circuits arranged to redirect a constant current through differing paths in current trees have the ability to dramatically reduce this noise. This constant current is also the cause for bipolar circuits having a high steady state power loss. Hence, the bipolar designer must attempt to use his logic units in every cycle because, unlike in CMOS, each unit dissipates static power even when it is not used. The main path for reduction of this power loss is through bipolar device scaling and architectures that efficiently use a small set of logic units.

Industrial assessments appear to predict that the Si bipolar device alone cannot offer performance advantages relative to Si CMOS. It is the HBT which appears to offer enough additional avenues for speed improvement that it has become the primary focus of our effort. However, Heterostructure MESFET's, or High Electron Mobility Transistors (HEMT's), could also play an important role in high performance computing. The transconductance of the FET is much lower than that of the HBT. For example, the transconductance of a MESFET can be as high as 600 mS/mm, while the bipolar device offers as much as 20,000 mS/mm. As a result, loading effects in HBT circuits, while not negligible, are considerably lower than in CMOS or FET circuits. Lightly loaded MESFET circuits can perform well, just like lightly loaded, deep submicron CMOS; in even moderately loaded circuits, however, the loading degradation can dominate the performance. Nevertheless, the lightly loaded situation can be of significance especially in regular structures such as memories. Hence, our effort should eventually encompass a combination of HBT and MESFET or HEMT device technology. This is expected to impact primarily the cache memory, where the core of the memory could consist of MESFET cells, while the decoder and sense amplifiers could be implemented with fast HBT devices which preserve the low switching noise and low voltage swings desired.

Of course, none of these arguments is convincing unless the fabrication yields of HBT circuits are compatible with the design of processors. Based on yield projections made by Rockwell, our effort is currently focused on building block components for a partitioned RISC design with roughly 5000-7000 HBT's per chip. Such chips should be expected to yield in the range of 10-20%. Cache memory is more challenging, and will demand higher HBT counts near 12,000. The inherent device yields should be high using OMCVD processing, as the oval defect density [the technology specific yield detractor] is exceedingly low at 5 ovals per square centimeter. Theoretically, this should make it possible to fabricate chips with 100,000 HBT devices, even today. However, doping and thickness uniformity problems, and interconnect defects, appear to mask this improved state of affairs in the basic materials for GaAs/AlGaAs technology.

It was considered prudent by both Rockwell and Rensselaer to first create a small test chip to assess whether the interface to the foundry, the cell library, and device yields are ready for launching a full scale fabrication effort. This also provided an opportunity to verify CAD models and tools, as well as the underlying logic circuit class, namely full differential CML. The next section describes the results of this investigation, which is still in progress since an additional test chip run is planned. Information gathered with the test chip is considered extremely important for the success of the more ambitious FRISC chip set.

1 Test Results from the RPI HBT Test Chip

1.1 Test Chip Description

The RPI HBT test chip contains several key components of the FRISC-G architecture with embedded test circuitry for high speed performance evaluation with modest test equipment. Figure 1.1 shows the floorplan of the test chip. The main components are the 8x32 bit register file, the 8 bit carry chain, the voltage controlled oscillator (VCO), and single ended ECL and differential open collector output drivers/receivers. The register file and the carry chain, the critical component of the ALU, are the most time critical components of FRISC. The VCO generates the on chip clock signal. Its frequency is controlled by an analog input voltage. The VCO is built out of voltage controlled delay elements which are critical for the clock skew compensation chip for the processor and cache subsystems.

The embedded test circuitry consists of a write circuit, address and data linear feedback shift registers (LFSRs), comparators, and multiplexers, as well as the control logic for selecting a test, observing different internal nodes, or selecting an external clock source. The write circuit generates the register file write pulse and can also be put into oscillation. The linear feedback shift registers generate a 31 cycle long pseudo random sequence for testing the register file. The data and address LFSRs are clocked by the same clock signal; hence, the two LFSR patterns are synchronized, and their cycle offset is set by the initial values loaded into the LFSRs. The memory can therefore be tested with different bit patterns by loading different initial values into the LFSRs. In the register file write test phase the pattern from the data LFSR is written into the register file. In the read test phase the data is read from the register file and compared to the expected data pattern generated by the data LFSR. The match output is a signature signal that goes high if a bit error is detected. A secondary test chip output allows the outputs of two register file columns to be observed directly. The length of the read/write cycle can be adjusted with the analog VCO control voltage. Figure 1.2 shows a block diagram of the test chip.
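
For concreteness, a 31-cycle sequence implies a five-bit maximal-length LFSR. The Python sketch below shows one such generator; the actual feedback taps and seed used on the test chip are not given in this report, so the tap choice here (bits 5 and 3, one maximal-length pair) is an assumption.

    def lfsr31(seed=0b00001):
        """Five-bit Fibonacci LFSR; any nonzero seed walks all 31 states."""
        state = seed & 0x1F
        while True:
            yield state
            # assumed taps: bits 5 and 3 (one maximal-length configuration)
            feedback = ((state >> 4) ^ (state >> 2)) & 1
            state = ((state << 1) | feedback) & 0x1F

    gen = lfsr31()
    sequence = [next(gen) for _ in range(31)]
    assert len(set(sequence)) == 31   # full 31-cycle pseudo-random period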

Figure 1.1: Floorplan of Test Chip

The test chip was designed with the VLSITOOLS CAD tool set, augmented by our own tools for differential routing, simulation of differential current tree logic, and analysis of voltage drops on power rails. The register file, the carry chain, and the VCO are custom macros. The embedded test circuitry is implemented with differential current mode logic standard cells with three levels of current steering.

1.2 Test Chip Fabrication

The RPI test chip was fabricated by Rockwell on their HBT pilot line in Newbury Park, CA. The test chip was part of a multiproject wafer and occupies a quarter of a reticle. Since some of the chips on the multiproject wafer were analog, we did not get the desired digital interconnect process option (no SiNx but a thicker Polyimide layer), and the process was adjusted to improve breakdown voltages.

Figure 1.2: Block Diagram of Test Chip

However, Rockwell supplied the wafers lapped to 3 mil and 7 mil thickness at no extra cost. We received 58 "likely to be functional" dies. An additional 17 reject dies were supplied for setting up our test equipment. Table 1.1 summarizes the most important DC parametric measurements for the multiproject wafer run.

Table 1.1: DC Parametric Test Results for Lot 37885 from Rockwell

Wafer   Vbe (mean, std)    Beta (mean, std)   BVCBO (mean)
#5      1.34 V, 0.6%       38.7, 29.9%        15.0 V
#11     1.21 V, 32.7%      36.1, 64.5%        13.7 V
#12     1.33 V, 1.1%       34.2, 36.1%        15.0 V
#16     1.35 V, 0.8%       34.0, 35.0%        15.0 V

The measured Betas are only about one third of what was expected. Rockwell reported that the Betas were field dependent due to alignment problems with the CENSOR 3" stepper. The pads on wafer #11 are not open; most likely the pads still have a residual film of Polyimide. Even moving the probe back and forth many times could not penetrate the film on the pads. Hence, none of the chips from wafer #11 could be tested.

1.3 Test Equipment

The RPI test chip was designed for testing with a multichannel ceramic probe from Cascade. The six channel probe with a bandwidth of 5 GHz provides power through two ground and two power pins, with two 0.5 nF bypass capacitors mounted within 6 ps of the probe pins. A separate Ground-Power-Ground probe with a 0.5 nF bypass capacitor is used for powering up the register file. Two additional high frequency (40 GHz) Ground-Signal-Ground probes can be used to supply an external clock and to contact the high speed secondary output port. The probes are mounted on a probe station donated by Tektronix. The probe station has a water cooled chuck with a Z-axis controller. The lapped GaAs chips are picked up with a vacuum wand and placed on the chuck adapter, which has a 10 mil vacuum hole. The thermal interface to the chuck adapter is improved by putting a drop of deionized water on the backside of the chip.

Figure 1.3 shows the test instrumentation used in our high frequency testing laboratory. The Tektronix 7104 mainframe with an S4 sampling head has a bandwidth of 18 GHz. The S51 trigger countdown head allows triggering the scope on a subharmonic of the input signal. A power divider is used to split the high speed input signal into two signals with half the amplitude. One of these signals is fed through an attenuator to the trigger countdown head. The attenuator is necessary to damp out the reflection from the unmatched input of the trigger countdown head.

1.4 Initial Confirmation of the LFSR Circuits

One of the initial successes of the test chip program was the identification of a number of dies with functional Linear Feedback Shift Register (LFSR) circuits which were included in the design as pseudo random address and data generators for testing the register file at speed (200 ps). While not terribly complex circuits, their correct operation immediately verified a significant portion of the Computer Aided Design (CAD) tool set for the HBT design effort including macrocell libraries, and the differential router. Additionally, it provided a good check on the logic thresholds which could be expected, and validated the full differential current mode logic with resistor sourcing which had been one of the largest uncertainties in the design effort.

Figure 1.4 shows an oscilloscope photo of an LFSR working at 2.0 GHz, and Figure 1.5 shows the ideal output of the LFSR circuit.

Figure 1.3: Test Equipment Setup

Figure 1.4: Oscilloscope Photo of working LFSR Circuit at 2.0 GHz.

Figure 1.5: LFSR Pseudo Random Output

We can see from the oscilloscope photo that some switching noise is present. This is most likely due to the large variation of the current gain Beta on this particular run. The average Beta is only 34 (with a 36% standard deviation), and Re is 35 Ohm instead of 15 Ohm. Thus, the logic level regeneration capability (voltage gain) of the logic elements is reduced. Much of the observed noise may well be due to mismatches in the gain of the current switches along the different paths through the current trees of the master slave latches. The signal was measured through a probe with a bandwidth of 5 GHz, which limits the rise time to 75 ps.
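
The quoted limit is consistent with the usual first-order bandwidth relation (our arithmetic): $t_r \approx 0.35/BW = 0.35/(5\,\mathrm{GHz}) = 70\,\mathrm{ps}$, in line with the 75 ps figure used above.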

1.5 Test Results for Write Circuit

Table 1.2 provides a summary of the measured results and simulation results for the write circuit on the RPI HBT test chip. The write oscillator circuit contains a string of high, medium, and low power buffers in series with a 2 input and a 4 input multiplexer. The buffers and multiplexers are standard cells that were placed and routed with VLSITOOLS using our differential routing modifications. Figure 1.6 shows the layout of the write circuit. The purpose of the write circuit is to generate a delayed write pulse for the register file, but our test chip control logic allows closing the loop through the buffers and multiplexers to set the circuit into oscillation for propagation delay measurements.

Mainly results measured on wafer 12 will be reported, since wafer 12 is the best wafer received thus far as far as yield and performance are concerned. Further, Rockwell supplied measured S-parameters only for the small devices (Q1) on wafer 12.

The yields on wafer 12 are low. For the VCO and the write oscillator to work and be observable, only 106 devices must function. Even so, only 7 of the 15 chips from wafer 12 had working write circuits.

The power dissipation of the test chips is close to the expected value of 1.27 W (5.2 V * 245 mA). The power supply voltage on the chip is 200 mV lower than the figures recorded: measurements of the power supply voltages on the chip with a needle probe showed a 200 mV voltage drop in the cables and probes. A multichannel Cascade probe with a bandwidth of 5 GHz is used to supply power and probe six signal pads on the test chip. The circuit performance is typically measured at a voltage of -6 V instead of the nominal -5.2 V to compensate for the voltage drop on the probes and cables and for the higher Vbe voltages on this run.

Figure 1.6: Layout of Write Oscillator

Table 1.2: Summary of Write Circuit Oscillator Test Results

Die   Measurement Results / Comments
13    whole chip did not function properly
14    oscillator did not function properly
18    whole chip did not function properly
20    whole chip did not function properly
22    T = 880 ps; A = 320 mV; Pa = -6.0 V; I = ? mA; Temp = cooled
      T = 880 ps; A = 340 mV; Pa = -6.0 V; I = 267 mA; Temp = RT
23    T = 880 ps; A = 250 mV; Pa = -5.9 V; I = 254 mA; Temp = cooled
      T = 880 ps; A = 220 mV; Pa = -5.7 V; I = 236 mA; Temp = RT
24    not tested, slides on chuck
26    whole chip did not function properly
28    oscillator did not function properly
29    whole chip ceased to function during testing
30    T = 900 ps; A = ?; Pa = -5.5 V; I = ? mA; Temp = RT
31    T = 900 ps; A = 340 mV; Pa = -6.0 V; I = 283 mA; Temp = RT
33    oscillator did not function properly
34    whole chip did not function properly
37    whole chip did not function properly

Key

T    => oscillator period
A    => oscillator amplitude (listed as twice that observed through 2x attenuation)
Pa   => power supply voltage
I    => power supply current
Temp => temperature

The measured speeds were below what we expected. Figure 1.7 shows the SPICE results for the write circuit. To check our CAD tools we have updated our back annotation process to reflect the interconnect structure and substrate thickness of wafer 12. Originally, our capacitance extraction assumed a 10 mil thick substrate and a digital interconnect process option that would have replaced the high dielectric constant (er = 7) SiNx layer with an additional 0.25 µm of Polyimide (er = 3.5) between metal 1 and metal 2. This would have reduced the capacitive coupling between adjacent runs in metal 1 and the metal 1 - metal 2 crossover capacitances. However, the test chip was fabricated on a multiproject wafer together with analog designs. Hence, the chips were lapped for free to 3 mil or 7 mil thickness but processed without the digital process option. Further, Rockwell indicated that the process was tweaked to get better analog performance: higher breakdown voltages, but lower Beta. Since Rockwell could not provide measured interconnect figures, the interconnect capacitances were modeled with MagicCad and an in house random walk capacitance modeling tool. After adjusting the capacitance figures, the coupling capacitance and crossover capacitance increased by up to 50%. Table 1.3 lists the worst case capacitance figures used for our extraction:

Table 1.3: Estimated Worst Case Interconnect Capacitance for Rockwell HBT Process

                                        3 mil substrate     7 mil substrate
M1-M1 (2 um lines with 3 um spacing)    C10* = 0.18 fF/um   C10* = 0.17 fF/um
M2-M2 (3 um lines with 3 um spacing)    C10* = 0.14 fF/um   C10* = 0.13 fF/um
M1-M1 (under grounded M2)               C10* = 0.23 fF/um   C10* = 0.23 fF/um
M2-M2 (over grounded M1)                C10* = 0.20 fF/um   C10* = 0.20 fF/um

C10* = equivalent worst case capacitance to ground for differential signal mode

mx41l_1 = Simulated with Rockwell SPICE Model, freq=1.36 GHz

mx41l-2 = Simulated with Extracted SPICE Model, freq=1.08 GHz

Figure 1.7: SPICE Results for Write Circuit

The capacitance extraction is based on worst case assumptions. It is assumed that each differential pair runs parallel to a 30 µm wide ground plane with a spacing of 3 µm. Under this assumption, the capacitance of a net is a simple function of its length and the overlap with metal 2. Thus, the extraction can be performed with our CMOS based tools. Further, we can use the same extraction process for digital simulations. However, in the long term we need to have measured electromagnetic characteristics of the interconnect structures to improve our interconnect modeling.
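
As an illustration of this length-plus-overlap model, the sketch below recomputes a net capacitance from the 7 mil coefficients in Table 1.3 (the actual extractor and its interfaces are not described in this report; the function and example values are ours).

    # Worst-case C10* coefficients from Table 1.3, 7 mil substrate (fF/um)
    C_M1_BESIDE_GROUND = 0.17     # M1-M1, 2 um lines with 3 um spacing
    C_M1_UNDER_M2      = 0.23     # M1 running under grounded M2

    def net_capacitance_fF(length_um, m2_overlap_um):
        """Worst-case capacitance to ground of a differential M1 net."""
        plain   = (length_um - m2_overlap_um) * C_M1_BESIDE_GROUND
        crossed = m2_overlap_um * C_M1_UNDER_M2
        return plain + crossed

    # Example: a 300 um net with 120 um of M2 crossing over it
    print(net_capacitance_fF(300, 120))   # -> 58.2 fF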

Table 1.4: Simulated and Measured Write Oscillator Periods

Measured WOSC period on wafer 12                                        880-900 ps
Simulated WOSC period with Rockwell SPICE model and no back annotation      254 ps
Simulated WOSC period with Rockwell SPICE model and back annotation         736 ps
Simulated WOSC period with extracted SPICE model and back annotation        920 ps

The Rockwell SPICE model was updated by fitting SPICE model parameters to measured S-parameters from wafer 12 at VCE = 5 V and IC = 1 mA. Figure 1.8 compares the expected S21 parameters with the measured S21 parameters, and with the S21 parameters from the SPICE model with the extracted model parameters. The S-parameter test site pad capacitance was determined by a 3-D capacitance extraction program to be 32 fF on the base and collector nodes. The pad capacitance would normally be measured on an empty test site with the same pad arrangement, such that the influence of the pads could be taken out of the measurement. However, an empty test site is not available on the Rockwell test mask. Thus the pad capacitance had to be estimated with a 3-D capacitance extraction program and taken into account in the parameter fitting procedure. The Ft of the Rockwell SPICE model for the given bias condition is 45 GHz, but the extracted Ft is only 30 GHz. The back annotated simulation with worst case interconnect capacitances and extracted SPICE model parameters yields results that are close to the measured ones. (See Table 1.4.)
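
One standard way to obtain an Ft estimate of this kind is to convert the measured S-parameters to the forward current gain h21 and extrapolate its -20 dB/decade roll-off to unity gain. The sketch below shows that idea only; the actual procedure used here fitted full SPICE model parameters, so this function is illustrative rather than a description of our tool.

    import numpy as np

    def estimate_ft(freq_hz, s11, s12, s21, s22):
        """Estimate f_T from two-port S-parameters via |h21| extrapolation."""
        # Standard S-parameter to h21 conversion for a two-port
        h21 = -2.0 * s21 / ((1.0 - s11) * (1.0 + s22) + s12 * s21)
        # On the -20 dB/decade slope, |h21| * f is constant and equals f_T
        return np.median(np.abs(h21) * freq_hz)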

ms21 = measured S21 Parameters of Q1 device on Wafer 12

rs21 = simulated S21 Parameters with Rockwell SPICE Model, Ft=46.7 GHz

es21 = simulated S21 Parameters with Extracted SPICE Model, Ft=29.5 GHz

Figure 1.8: Measured and Simulated S21 Parameters for Wafer 12

1.6 Test Results for Voltage-Controlled Oscillator

The voltage controlled oscillator generates the on chip clock signal. The frequency of the clock can be adjusted with an external control voltage. The VCO was designed to adjust the register file read/write cycle in the range of 120 ps to 350 ps, since simulations predict a read/write cycle time of 200 ps. The VCO range had to be limited to avoid a large frequency-versus-voltage gain that could cause noise/jitter problems. The analog input is filtered by an R-C element and routed on the chip as a differential wire pair with one signal grounded to avoid large jitter due to noise on the VCO control input. Further, all the DC signal input connectors on the probes are bypassed and terminated on the chip side. The VCO is implemented as a custom macro with the same voltage controlled delay elements that will be used for the deskew chip. Figure 1.9 shows the layout of the VCO circuit. A yield of 104 working devices is required for the VCO signal to be observable on the output pad. A yield of 330 working devices is necessary for the VCO and both LFSRs to be observable.

Table 1.5 shows the results of testing the test chip VCO for all dies from wafer 12. The simulation results along with the measured test chip data are shown in Figure 1.10.

Table 1.5: VCO/LFSR Test Results for Wafer 12

WAFER 12 (7 mil thick)

Chip   VCO (GHz)    Comments
13     0
14     1.0 - 2.5    One LFSR works fully
18     0
20     0
22+    1.0 - 2.5    One LFSR works fully
23+    1.0 - 2.4
24     ?            Problems with chip moving on chuck (wax?).
26+    0
28     0.91 - 2.4
29+    1.0 - 2.5
30     0.9 - 2.7    Both LFSRs work fully
31     1.0 - 2.7    Both LFSRs work fully
33     1.0 - 2.4    One LFSR works fully
34     0
37+    0

+ = chips indicated by Rockwell as being "visually good" but which have many defects

Figure 1.9: Layout of VCO

Figure 1.10: Simulation Results for VCO

1.7 Test Results for the Carry Chain Circuit

The 8 bit carry chain circuit tests the two critical paths through the FRISC ALU. Because of yield considerations, the 32 bit datapath needs to be implemented with four 8 bit slices. Each slice has an 8 bit carry select adder. The critical path through the ALU is the operand-in to carry-out delay on the first slice, plus the signal transmission delays and carry-in to carry-out delays on the other three slices. Both the critical path on the first slice and that on the other bit slices can be set into oscillation on the test chip. The layout of the carry chain circuit is shown in Figure 1.11. The carry chain was laid out as a custom macro to minimize interconnect delays. A yield of 153 working devices is required for the carry chain output signal to be observable. The measured results are summarized in Table 1.6.

Figure 1.11: Layout of Carry Chain

Table 1.6: Test Results for Carry Chain

WAFER 12 (7 mil thick)

Chip   Op->Cout           Cin->Cout          Comments
13     950 ps, 250 mVpp   500 ps, 125 mVpp   Vee = -5.5 V, Ip = 230 mA
14     925 ps, 200 mVpp   500 ps, 150 mVpp   Vee = -6.0 V, Ip = 282 mA
18     X                  X
20     X                  X
22+    ?                  ?
23+    X                  X
25     ?                  ?                  chip moving on chuck
26+    X                  X
28     ?                  ?
29+    ?                  ?
30     ?                  ?                  both LFSRs work, reserved for rf test
31     ?                  ?                  both LFSRs work, reserved for rf test
33     950 ps, 280 mVpp   525 ps, 220 mVpp   Vee = -6.2 V, Ip = 286 mA
34     X                  X
35+    X                  X

WAFER 5 (3 mil thick)

Chip   Op->Cout           Cin->Cout          Comments
27     875 ps, 350 mVpp   475 ps, 300 mVpp   Vee = -6.0 V

Figure 1.12 shows SPICE plots for the backannotated carry chain circuit with the standard Rockwell SPICE model and with a SPICE model extracted from measured S-parameters. The simulations with both models predict a much shorter oscillation period than measured on the test chip: 642 ps simulated versus 875 ps measured on the fastest chip. We are currently performing additional tests with needle probes to investigate this disagreement.

t_0,b_0 = no backannotation and Rockwell SPICE Model, Period = 268 ps

t_1,b_1 = backannotation and Rockwell SPICE Model, Period = 495 ps

t_2,b_2 = backannotation and Extracted SPICE Model, Period = 642 ps

Figure 1.12: Simulated Carry Chain Waveforms (Operand -> Carry_out loop)

1.8 Summary of Test Chip Conclusions

The RPI test chip is the first large HBT chip fabricated by Rockwell that was designed and laid out by workers not affiliated directly with Rockwell, so in many ways this exercise tested the interface to the foundry. Despite being the first outside design, the testing has shown that the circuits are functional. Further, we could prove that our embedded test strategy allows us to test these circuits at 2.7 GHz with a very modest amount of test equipment. However, this particular run had a yield and a performance problem. The power dissipation of the chip was close to the nominal power dissipation: 1.3 W without the register file powered up, and 3.1 W with the register file powered up, at Vee = -5.2 V. However, the chips were tested at supply voltages of up to -6.8 V to compensate for the high Vbe voltages on this run and to lower jitter. (The countdown trigger head signal, after the power splitter and attenuator, needs an amplitude of about 50 mVpp to avoid large jitter.) The power dissipation is typically 1.6 W without the register file and 3.5 W with the register file at Vee = -5.8 V.

The yields on our best wafer are only reasonable for circuits with less than 300 devices. Thus it is unlikely that we will find a fully working 8x32 bit register file with 1800 devices, but we are continuing our search for a working column on one of the test chips. The speed of the tested circuits was below what we had expected based on simulations. For example, the maximum VCO frequency was expected to be 3.5 GHz, against the best measured case of 2.7 GHz. This result tracks the measured performance slowdown of the Rockwell dividers on wafer 12. Despite the lower measured performance, the register file can still be tested with the VCO at the expected 200 ps access time since only half a cycle is allocated for register file read/write access.

Our investigation showed two reasons for the slowdown:

First, the interconnect capacitance on the test chip is higher than assumed. Since the test chip was fabricated on a multiproject run together with analog chips, the interconnect structure included a SiNx layer for MICS and the wafer was lapped to 3 mils. This resulted in higher coupling, crossover, and ground capacitances.

Second, the measured S-parameter data from Q1 devices on wafer 12 indicate that the Ft at 1 mA bias current is 30 GHz and not 45 GHz as predicted by the SPICE model. Simulations with the new interconnect capacitance figures and extracted SPICE model parameters match the measured performance quite well, with the exception of the carry chain circuit, despite the worst case capacitance extraction methodology and the limitations of the SPICE model.

Many of the LFSR circuits failed at high frequency. Instead of the full length sequence they reverted to a short sub-sequence. This indicates that the latches are very sensitive to the clock rise time and amplitude, especially if the logic gain is low. Thus we need to fully backannotate the boundary scan circuitry and simulate it in SPICE to make sure the boundary scan logic will work even if the gain of the devices is low. The boundary scan circuitry has very high fanout loads and long interconnections that degrade voltage swings and rise times. Without working boundary scan logic the chips are essentially untestable.

2 Testing for Yield Limited High Speed Technology

Testability is a prime concern not only for the test chip discussed in the previous section, but more importantly for the 5 chips needed for F-RISC, and in general for any complex, high-speed IC design. It is becoming increasingly difficult for design engineers to test high-performance chips by stimulating all inputs and sampling all outputs. The probes, which often require transmission line SMA connectors, are physically too big to permit large numbers of controlled impedance transmission line channels. Also, no test equipment is currently available that would allow at-speed testing of the circuits used in F-RISC/G. It was, therefore, essential to develop a testing scheme that reduces the need to probe the chip directly when operating at high speeds. Boundary scan with embedded test circuitry is the key element of our test strategy. The goals of our testing scheme were:

Minimal impact on processor cycle time

Low transistor count for minimal impact on yield

Functional testing of the chip, both as a bare die and as mounted on the MCM package

Low impact on power and heat removal requirements of the chips

Determination of circuit functionality and performance

Circuit changes and wiring restricted to the pad ring

Test the wiring on the MCM substrate and measure the signal delays on the MCM

Testing can be performed with two six channel ceramic probes and two high bandwidth G-S-G ceramic probes

The goals are listed in order of priority. Since some of the goals are mutually contradictory and therefore require trade-offs, the aim of the test plan is to avoid compromising the most important ones. Most important is the desire to minimize the impact on the performance of the chip. Testing time and test pattern complexity are sacrificed to meet the demands of our design.

Built-in self-test

For our test chip, we investigated built-in self-test (BIST) techniques. In our test chip, the BIST circuitry accounted for a large percentage of the transistor count. For the final FRISC chips, we cannot afford to spend these resources on BIST. Also, the BIST logic slows down the chip operation by adding test circuitry to the critical paths. This loading is acceptable in a test vehicle that is not meant to be part of a real system. However, for chips that are to be part of an actual high-speed system, the expansion of the critical paths is unacceptable.

Boundary scan testing

Boundary scan testing is becoming increasingly popular since the adoption of the ANSI/IEEE standard 1149.1-1990 [Maun92; Webe92]. The same logic can be used to test the interconnections among chips on the MCM [Karp91]. In a boundary scan system, a shift register connects all the input receivers. During testing, the outputs of the shift register replace the normal circuit inputs. The output drivers are also connected by a shift register that allows the output signals to be sampled and read out.

To test the operation of the multi-chip module (MCM), these shift registers switch roles. The output shift register is loaded with a test pattern, which is then applied to the output pads. After crossing the MCM, these signals arrive at the input pads of another chip where they are sampled and stored into the input shift register. In combination, these shift registers allow the chip to be tested without a logic analyzer and with only a few signal connections to the chip. Additional elements are added to this combined shift register to control various portions of the boundary scan. We chose boundary scan testing as the best scheme for F-RISC/G. It combines low transistor counts with the potential for testing circuits at speed [Phil93].

2.1 Re-inventing Boundary Scan

There were two challenges inherent in adapting the ANSI/IEEE boundary scan scheme to the F-RISC/G project:

The ANSI/IEEE standard introduces a layer of logic between the internal chip circuits and the I/O circuits. The additional delay would be unacceptable for our project.

The standard does not provide any means for testing of chips at speed. We must include this feature in our testing scheme.

2.1.1 Novel input and output circuits

We have developed innovative CML input and output circuits which provide most of the functions of ANSI/IEEE boundary scan without compromising chip performance.

Receiver circuit

Figure 2.1 shows the proposed boundary scan receiver (input) circuit. The pad receiver is part of a combined multiplexer/latch, the circuit for which is shown in Figure 2.2. This circuit can be thought of as a two-input multiplexer with a latch attached to one of the inputs. One feature of this circuit is that the pad input can have a higher voltage swing than the on-chip inputs. To provide an increased noise margin, the off-chip differential lines have a higher voltage swing than on-chip signals. Single-ended signals use ECL levels, which have a higher swing as well. During normal chip operation, the INP_SEL line is low, hence the signal from the pads is transmitted directly to the circuit through the input multiplexer. Because the receiver input signal is connected to a top-level current switch in the multiplexer, the delay penalty relative to a standard receiver is minimal.

Driver Circuit

The boundary scan driver (output) circuit is shown in Figure 2.3. The output signal from the internal chip logic goes directly to a standard driver circuit and then to the pad. Thus, the delay penalty during normal chip operation is minimal. Only one multiplexer input load is added to the fanout of the circuit generating the output signal. An output signal multiplexer cannot be added to the driver circuit without increasing the delay. Thus, the scan latch cannot directly stimulate the output pads. For MCM testing, the chip outputs must be controlled by setting the chip inputs and using the chip functionality to form the desired output pattern.

Figure 2.1: Boundary scan receiver

Figure 2.2: Boundary scan receiver circuit

Figure 2.3: Boundary scan driver

Figure 2.4: Timing diagram of at-speed testing

2.1.2 At-speed boundary scan testing

We have included sequencing logic which enables the boundary scan to test the speed performance of the on-chip circuitry. Figure 2.4 shows the basic operation of this sequencer. The time between the presentation of the input pattern and the sampling of the output result can be controlled. After the results are sampled, they are shifted out along the scan path for observation. If the correct results were obtained, then we can be sure that the circuit operated properly in the given amount of time (D). By shortening the time between the input presentation and the output sampling, the delay of the circuit can be measured.

The chips for F-RISC/G are controlled by a four-phase clock, as shown at the bottom of Figure 2.4. Through the use of elements added to the scan path, any one of these phases can be chosen for use as the basis for the pattern presentation strobe (INP_HOLD) and another for the sampling strobe (LATCH_INPUT). Before being used, these selected phases are sent through a digital delay line with eight taps. This delay line provides a resolution of approximately 45 ps. Thus, the strobe can be delayed 25% into the next phase.
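
A quick consistency check (our arithmetic, assuming the first tap is undelayed): the delay line spans up to $7 \times 45\,\mathrm{ps} = 315\,\mathrm{ps}$, which exceeds a 250 ps phase by 65 ps, or roughly the quoted 25% of the next phase.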

2.1.3 Scan clock distribution

However, as indicated in the figure, the pattern is not presented exactly at the beginning of the strobe, and the results are not sampled exactly at the beginning of the sampling strobe, either. The skew of these strobe lines is a significant factor. Even if only a 6 ps/mm time of flight delay through the Polyimide dielectric is considered, the delay/skew between the first latch and the last latch on the boundary scan line is almost 100 ps, since the boundary scan register wraps all the way around the chip. Special drivers are required to drive the scan clock because of the high fanout and the long interconnect. Further, the scan register must be split into two parts to reduce the fanout and the R-C delays.

We have done extensive SPICE simulation of the circuit to determine the estimated skew at each point along the scan path. First, the strobes had to be driven by special high-powered buffers with a low sensitivity to both capacitive and active loads. Second, at various places along the scan path, we included extra sampling latches. These latches sample the phases of the system clock. By adjusting the sampling time and observing which phase is being sampled at which location, we can measure the skew at strategic points along the scan path. In the example shown in Figure 2.4, we can see that the INP_HOLD strobe is a delayed copy of the third clock phase (φ3). However, at the point along the scan path where the pattern transition is shown, the pattern is revealed during the fourth clock phase. Thus, the φ4 latch at this point will show the data.

2.2 Impact on F-RISC/G design

Implementation of this scheme would have an impact on the design of F-RISC/G in two main areas: cycle time and resource usage. Both are important considerations.

2.2.1 Cycle Time

The design has a minimal impact on the cycle time, although exact numbers will require extensive SPICE simulation of backannotated circuits; first-order estimates follow.

For the driver circuit, the only change is an extra load on the input line. For the most critical cases, the driver will be driven by a high-powered gate. The input capacitance, CB, of a medium-powered gate is assumed, plus 20 fF of additional capacitive load to represent the extra wiring in the pad driver layout. The additional delay is given by:

ΔTD = Shigh_power * (CB + CL) = 112 ps/pF * (29 fF + 20 fF) = 5.5 ps

For the receiver, the calculation is more difficult. The receiver had to be redesigned, and the new design optimizes the path from the pad to the output. The impact comes from two sources. First, the delay through the receiver will increase because of the additional loading within the multiplexer. Second, the output of the receiver has an additional load. An estimate of the additional delay is given by:

ΔTR = Sreceiver * (CC + CB + CL) = 70 ps/pF * (4 fF + 29 fF + 20 fF) = 3.7 ps
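
The same first-order estimate can be written as a two-line helper (a sketch; the sensitivities in ps/pF and the loads in fF are the figures quoted above):

    def added_delay_ps(sensitivity_ps_per_pF, loads_fF):
        """Incremental delay = drive sensitivity x total added load."""
        return sensitivity_ps_per_pF * sum(loads_fF) / 1000.0  # fF -> pF

    print(added_delay_ps(112, [29, 20]))     # driver:   ~5.5 ps
    print(added_delay_ps(70, [4, 29, 20]))   # receiver: ~3.7 ps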

2.2.2 Transistor and Power Resources

Both the driver and the receiver circuitry have additional circuits added to support the boundary scan. The drivers require 16 extra transistors; the receivers, 24. Each consumes approximately 6.24 mW of additional power. The power figure assumes a master stage equivalent to a medium powered gate (1.0 mA) and a slave stage of 1/3 the power of the standard low powered gate (0.2 mA).

The datapath has 90 signal inputs and outputs. This scheme adds 1.9K transistors and consumes 0.56 W of power. The datapath without boundary scan has 6.8K transistors and consumes 6.7 W. The instruction decoder device count has increased from 4.3K transistors to 6K. These numbers do not include the extra power and transistors for the pads added by the boundary scan scheme, for the clock manipulation circuitry, or for the changes to the phase generator module, which together add a few hundred transistors to each chip.
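
These figures are mutually consistent (our arithmetic, assuming a mix of driver and receiver pads): $90 \times 6.24\,\mathrm{mW} \approx 0.56\,\mathrm{W}$, and 90 pads at 16 to 24 extra transistors each gives roughly 1.4K to 2.2K devices, bracketing the quoted 1.9K.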

2.3 Testing Plan

To control the circuits described in the previous sections, we will need nine pads. This arrangement requires two of the largest 5 GHz Cascade Microwave probes, which can control six lines each. For the boundary scan testing, our experience with the test chip will be valuable.

To reduce pinout requirements when the chips are mounted on the MCM, all the chips can be connected into one scan path sharing a common SCAN CLOCK signal. Also, other control lines can be shared among multiple chips. Certain signals must be separate for each chip so that the MCM lines can be tested by having some chips read their scan registers for inputs while others read the incoming MCM lines. Of course, on the MCM the clock inputs will be provided by the clock deskew unit. In total, the number of MCM signal pins required to support the boundary scan scheme will be six, plus one per chip on the module.

2.4 Core Processor Chips

The instruction decoder has passed its full-chip simulation and is prepared for tape out in May. Test cases and expected results were defined for seventy-five instructions which fully test the chip. Various situations involving cache misses, page faults, and multiple simultaneous interrupts were simulated. The floorplan is shown in Figure 2.5. The layout is shown in Figure 2.6.

Figure 2.5: Instruction decoder floor plan

Figure 2.6: Instruction decoder layout

The datapath chip has also passed its full-chip simulation. Its floorplan is shown in Figure 2.7. The register file and carry chain used in the test chip have been improved for the datapath chip so as to increase the efficiency of routing and to further reduce the loads on critical paths. The layout is shown in Figure 2.8.

Figure 2.7: Datapath floor plan

Figure 2.8: Datapath Layout

3 Clock Deskew Scheme

The overall deskew scheme is shown in Figure 3.1. Each of the FRISC/G chips receives a differential 2 GHz clock from the clock distribution / skew compensation chip. Each chip has a phase generator that generates the four phases for the internal chip timing. This circuit is explained in more detail in the "Clock Phase Generator Circuit" section below. The clock signals that are sent to each of the chips are returned to the skew compensation chip on matched transmission lines, routed in parallel to the distribution lines to each of the chips. The clock distribution chip has a PLL for each clock output channel. The PLLs automatically adjust the variable delay lines on each channel such that all returned clocks arrive after a one cycle round trip delay. If the distribution and return path delays are matched, the clock signals arrive at each chip simultaneously. Of course, this scheme only works if the forward and return delay paths on a given channel are matched, and there are some dissimilarities due to temperature and process variations on these paths. However, the delay paths only need to be matched between the forward and return path on a given channel. The variable delay lines can adapt continuously for variations in the transmission line delays to the different chips in the system due to thermal drift or aging.

The PLL circuits on the clock skew compensation chip track slowly varying propagation delay variations. The PLL employs the same Voltage Controlled Delay Line (VCDL) used for the VCO on the test chip. Since the input to the PLL is a stable 2 GHz signal, the PLL loop bandwidth can be chosen to be as small as feasible with the available area for the loop filter capacitors. The total number of PLL circuits required in the scheme is equal to the number of clock channels provided (to the chips) minus one. The minus one comes from the fact that one of the "loops" (a clock signal sent and returned) can be used as a reference channel. When all the PLL circuits are in the locked condition, the signals at the inputs (t0, t1, t2, etc.) are all in phase with each other. When t0, t1, t2, etc. are in phase with each other, it can also be seen that t0*, t1*, t2*, etc. are in phase with each other (Figure 3.2). Since periodic signals are indistinguishable from each other if the phase differences between them are integer multiples of their period, the clock signals provided at each of the chips can be considered to arrive simultaneously, which, at first glance, seems to achieve the desired result: the deskewing of the clock signals at the inputs to each of the chips.

Upon closer inspection, however, it becomes evident that an additional restriction must be imposed. The clocks t0 and t1 could be in phase, but with loop 0 having a round trip delay of n periods and loop 1 a round trip delay of n+1 periods. Under this condition, the phase difference between t0* and t1* is half a period. Since the phase detector in the PLL cannot detect whether this situation has occurred, the maximum round trip delay difference between all loops must be restricted to less than one cycle for any setting of the variable delay lines. With this restriction, the loops will always have the same round trip delay and the chips receive their clock signals without chip-to-chip skew.
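
Stated compactly (a restatement of the argument above, with $T$ the 500 ps master clock period): each locked loop $i$ satisfies $d_{fwd,i} + d_{ret,i} = n_i T$, and with matched forward and return paths $d_{fwd,i} = n_i T / 2$. Two loops locking at $n$ and $n+1$ therefore deliver clocks offset by $T/2$, which is why all round trip delays must fall within one period of each other.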

An additional design consideration comes from the need to distribute a four phase clock reset/start-up signal. There is no viable way to generate a reset signal and distribute it to all the chips in less than one clock period (500 ps) such that the four phase clock generators on all the chips can be synchronized to come up in the same phase, unless one is willing to deskew the reset lines in addition to the clock lines. If the option of deskewing the reset lines is not taken because it would double the size of the clock distribution problem, the chips must be synchronized by the following procedure:

Startup Procedure

assert the four phase clock generator reset signal, all chips are in phase 1

wait until all PLL circuits are in lock

stop the master clock (2 GHz)

deassert the reset signal

wait m cycles (until all chips have seen the falling edge of the reset signal)

restart the master clock

If multiple cycles are available for the distribution of the reset signal, its routing becomes noncritical and it can be routed as a daisy chain. However, the clock can only be stopped for a period of time that is very short compared to the rise time of the loop filters in the PLLs; otherwise the PLLs will be out of lock when the master clock is restarted.

Figure 3.1: Clock Skew Compensation Scheme

Figure 3.2: Clock Deskew Scheme Concept Timing Diagram

3.1 Clock Phase Generator Circuit

The clock phase generator on each chip generates the four clock phases for the timing of each chip from its deskewed copy of the master clock. Each phase is 250 ps long in FRISC/G. The phase generator circuit is based on a Tektronix patent (4,833,695) with some modifications for differential operation. This circuit generates a pulse at each of the rising and falling edges of the master clock signal, with the pulse duration equal to half the master clock period. This is accomplished by using a mix of single ended and differential logic circuitry. The circuit has the advantage that the master clock frequency must be only twice the system clock frequency for a four phase clock with a 1 ns cycle. Thus the master clock frequency is only 2 GHz instead of the 4 GHz required by a shift register solution. In addition to the architectural requirement of generating four phases, the four phase generator circuit for the F-RISC/G project must adhere to the requirements of the boundary scan testing scheme for the fabricated chips.
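
A behavioral sketch of this edge-to-phase mapping (illustrative only; the function and names are ours): each rising or falling edge of the 2 GHz master clock launches one 250 ps phase, so the four phases tile one 1 ns system cycle.

    def phase_at(t_ps):
        """Which of the four 250 ps phases is active at time t (in ps)."""
        # 2 GHz master clock -> 500 ps period; each half period is a phase slot
        return (int(t_ps) // 250) % 4 + 1   # phases numbered 1..4

    assert [phase_at(t) for t in (0, 250, 500, 750, 1000)] == [1, 2, 3, 4, 1]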

A reset signal is needed to force the phase generator circuit to a known state so that the clock phases at each of the chips on an MCM are synchronized. This reset signal is to be provided externally. A simple master-slave latch circuit is sufficient to synchronize the reset signal with the clock. The master-slave latch constantly samples the reset signal at each positive edge of the master clock. The output of the latch then is a clock synchronized reset signal that is available to the rest of the phase generator circuit as the "reset" signal. The SPICE simulation in Figure 3.3 includes the effect of the reset signal on the clock phases.

Figure 3.3: Waveform of Four Phase Generator

Figure 3.4: Four Phase Clock Generator

The boundary scan test strategy also requires the ability to stop the four phase clock generator N clock phases after the input vector has been applied. The freeze signal freezes the state of the clock generator. When the freeze clock control signal is deasserted, the phase generator must restart from the frozen state. Simply resetting the clock is not an acceptable solution. The circuit shown in Figure 3.4 was developed in this project and its SPICE simulation results are shown in Figure 3.3.

The additional circuit senses both the master clock and the state of the D-latch before generating the multiplexer control signal, to determine the proper switching moment. The inverted XOR gate outputs a positive signal whenever the D-latch state and the master clock signal are matched, which marks a safe switching window for the multiplexer. Since the master clock signal is a periodic square wave, the output of the XOR gate also exhibits a periodic square wave. This signal is fed to the master-slave latch with clear. Since the clear input is high while the freeze control signal is in effect, the master-slave latch output is "0". The master-slave latch will go high in the first safe switching window after the freeze signal is deasserted. Once a "1" is latched onto the master-slave latch, the AND gate is disabled, effectively blocking further signal pulses from the XOR gate. As shown in Figure 3.4, latching a "1" into the master-slave latch selects the multiplexer input connected to the master clock signal, effectively restarting the clock phase generator process from where it was frozen.

4 Primary Cache Memory

4.1 Primary Cache Chips

There will be two basic chips in the F-RISC/G primary cache memory system. The first is a controller chip containing memory for storing the tags as well as logic to handle events such as interrupts and pipeline stalls due to cache misses. The second chip will be the actual data memory chip, which has a capacity of 2 kbits. Both of these chips will be approximately 8 mm by 8 mm. Several copies of the data memory chip will be used in each cache to increase its size.

The data memory chip has 155 signal pads and requires at least 42 power pads, some of which may be incorporated into the large number of single-ended pads in the double pad ring necessitated by the wide buses to the L2 cache. The final tape out for fabrication is scheduled for May 1993. For the present revision the size of the chip is 8 mm by 7.5 mm. This size may shrink slightly since the power and ground pads can be incorporated into the second pad ring of single-ended pads. The provision for two pad rings has been influenced by the availability of the most likely sources for the MCM package, namely IBM, TI & GE/HDI, IST, or ISA.

4.2 Design of the Primary Cache

The most important issue in the design of the primary cache has been the lack of suitable memories. Commercial parts that support a 1 ns cycle time are simply not available, forcing the design of custom memory chips in the same technology as the processor. Originally, a single cache chip was pursued, and it was to be configurable for use as either a controller chip or a simple memory chip (with the controller logic disabled). However, cache simulations showed that we could not meet our CPI design goal of less than 2.0 using this scheme: too few memory chips could be placed close enough to the controllers to meet the timing constraints of a 1 ns system. The decision was made to have two cache chips instead, allowing the memory chip to have over twice as much RAM as the controller and still contain roughly the same number of devices.

The processor uses a Harvard architecture with separate instruction and data caches. This allows both an instruction and a data word to be fetched in parallel in order to improve the throughput of the processor. It does, however, preclude the use of programs which modify their instruction streams, as is sometimes useful in certain areas (e.g., artificial intelligence).

The four different configurations needed for the primary caches are: Instruction Cache Controller (ICC), Data Cache Controller (DCC), Instruction Cache Memory (I-RAM), and Data Cache Memory (D-RAM). The need for separate configurations for the ICC and DCC, or the I-RAM and D-RAM, stems from the need to write back data only, and not instructions. The instruction and data versions of each chip will be physically identical; the configuration will be selected by the pin connections on the MCM.

The data and instruction caches are the same size (2 Kbytes each, consisting of eight copies of the memory chip), are direct-mapped, and have a line (block) size of 512 bits. A 512-bit wide bus is used between the two levels of cache. This wide bus saves time, since an entire line can be transferred at once, and compensates for the small cache size. There are eight RAMs (32 x 64 bits) in each cache; each stores its respective four bits of every word (e.g., the first RAM stores the first four bits of each of the sixteen words in a line). These bits are multiplexed to the chip outputs based on the lowest four bits of the address. The number of data output pads is thereby reduced from 32 (the count if an entire word were stored in each chip) down to 4, which helps reduce the overall size of the chip, since gate count estimates and preliminary layouts showed it to be pad-limited. For the data cache there is also a byte operations chip between the data memories and the processor (it will not be implemented at this time, since contract funds are as yet unavailable for its fabrication). The address bit mapping is shown in Figure 4.1.

Figure 4.1: Address Bit Mapping
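
As a concrete illustration of this bit-slice organization, the following C sketch models a read across the eight RAM chips. The data layout (the four bits of word w occupying bit positions 4w..4w+3 of a chip's 64-bit line) and all names are our assumptions for illustration.

    #include <stdint.h>

    enum { CHIPS = 8, SLICE = 4, LINES = 32 };

    /* each chip is a 32-line x 64-bit array: 4 bits of each of 16 words */
    typedef struct { uint64_t line[LINES]; } ram_chip;

    /* Assemble one 32-bit word from the eight 4-bit slices. */
    uint32_t cache_read(const ram_chip chip[CHIPS], uint32_t addr)
    {
        uint32_t word_sel = addr & 0xF;          /* lowest 4 bits: output mux */
        uint32_t line_sel = (addr >> 4) & 0x1F;  /* next 5 bits: RAM line     */
        uint32_t word = 0;

        for (int c = 0; c < CHIPS; c++) {
            /* each chip drives only 4 data output pads */
            uint32_t nib = (uint32_t)(chip[c].line[line_sel]
                                      >> (word_sel * SLICE)) & 0xF;
            word |= nib << (c * SLICE);
        }
        return word;
    }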

Several different designs were evaluated for the primary cache. The current design was chosen because the larger reticle size now available permits a second cache chip dedicated specifically to storage; the previous design used a single configurable chip for both the controller and the storage. Some relevant statistics for the different designs are shown in Table 4.1. The key design criterion has been CPI, with power dissipation the next highest priority and yield the third consideration.

Table 4.1: Cache Designs Trade-offs

Size      Devices   I/O   CPI    Power   Density
4 kbit    11,000    132   2.54   20 W    10.7 W/cm²
1 kbyte   13,500    155   2.15   38 W    11.3 W/cm²
2 kbyte   13,500    155   1.89   68 W    12.0 W/cm²

The F-RISC chip set has been designed with a capability for a 4x increase in memory density. Such an increase would require new technology, since the present chips are already quite large and dissipate considerable power. Rockwell has begun development of a combined HBT/MESFET technology in which 0.5 µm channel-length MESFETs can be fabricated in otherwise unused HBT emitter material regions on the chip. This promises the capability to integrate both devices in the same process with neither suffering reduced performance due to the presence of the other. HBT devices could be employed in the decoders and sense amplifiers of the cache chips while MESFETs are used in the memory core. This would reduce the size of the chip, increase its yield, and reduce its power, while sacrificing little speed. Small, but very fast, memories have already been demonstrated by Rockwell.

4.3 Controller Chips

There are two different configurations of the controller chip. The instruction cache controller has 68 input pads and 45 output pads. The data cache controller has 68 inputs and 46 outputs. The additional output is for the dirty bit signal to the second level cache, so that a modified cache line can be written back.

The DCC uses 24-bit tags, each consisting of 23 address bits and one dirty bit. A comparator checks the address latched from the CPU against the corresponding tag. If they do not match, the DCC sends a miss signal to the processor and waits for the processor to acknowledge the miss. The processor responds with an acknowledge signal, plus a stall signal if it actually needs the data. If the data is not really needed because the instruction was flushed, the processor sends the acknowledge but no stall, so that the cache knows not to fetch the data from higher-level memory. If the stall signal is received, the DCC sends a miss signal to the second-level data cache and initiates the process of getting the proper data for the CPU.
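
Ignoring the cycle-by-cycle sequencing of the handshake, the decision logic reduces to a few cases. The C fragment below is a hedged sketch; the tag layout follows the text, but the function and all names are illustrative.

    /* 24-bit DCC tag: 23 address bits plus a dirty bit */
    typedef struct { unsigned tag : 23; unsigned dirty : 1; } dcc_tag;

    /* Returns nonzero if a transaction to the L2 data cache is required. */
    int dcc_check(const dcc_tag tags[32], unsigned addr, int ack, int stall)
    {
        unsigned line = (addr >> 4) & 0x1F;  /* direct-mapped line index */
        unsigned tag  = addr >> 9;           /* upper 23 address bits    */

        if (tags[line].tag == tag)
            return 0;                /* hit                              */
        if (ack && stall)
            return 1;                /* miss needed: go to the L2 cache  */
        return 0;                    /* flushed (ack, no stall) or still */
                                     /* waiting for the acknowledge      */
    }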

The ICC functions similarly to the DCC, with two differences: it uses only 23-bit tags, since it never needs to write modified data back out to the second level, and it contains a remote program counter.

The cache controller chip layout is shown in Figure 4.2. The chip contains 11,500 transistors.

Figure 4.2: Cache Controller Chip Artwork.

4.4 RAM Chips

As with the controller chip, there are different configurations of the RAM chip for the instruction and data caches. The access time of the RAM is currently targeted at 600 ps. The main difference between the I-RAM and the D-RAM is that the former needs no output pads to the second level: the I-RAM configuration has 75 input pads and 4 output pads, while the D-RAM has 83 inputs and 72 outputs. The lowest 9 address lines from the CPU select the appropriate bits on each chip; the lowest four control the output multiplexers, and the remaining 5 address the 32 lines (blocks) in the RAM. Artwork for the memory chip is shown in Figure 4.3. The chip contains 13,500 transistors.

Figure 4.3: Memory Chip Artwork.

4.5 Byte Operations Chip

The last cache chip handles byte operations. It receives three control lines from F-RISC/G which tell it how to align the bytes in a data word being sent from the cache. The chip consists primarily of large shifters and uses 67 input pads and 64 output pads. It will not be implemented at this time due to budget limitations in the current contract; however, the designs include all the electronic "hooks" needed to incorporate these chips once fabricated.

4.6 Performance Evaluation

There is currently no cache simulator specific to F-RISC/G. An estimate of the CPI for the processor is given in Table 4.2. This estimate is based on a cache simulator for a RISC machine similar to F-RISC, together with estimates of the code latencies inherent in the architecture of the processor. It assumes a very large second-level cache and the use of an optimizing compiler for code generation. The second-level cache effective turnaround time is assumed to be 5 ns.

Table 4.2: CPI Calculations

Cause of Increase in CPI    Amount
Instruction Miss            0.23
LOAD Miss                   0.13
STORE Miss                  0.04
Copyback                    0.04
Code with Latencies         1.45
Total                       1.89


The design space for the cache is shown in Figure 4.4. It shows how CPI varies with bus width and memory decoder depth as the size of the cache is increased. The lowest level shown in the chart, 1.45, is the inherent CPI floor of the processor due to branch and load latencies.
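
The arithmetic behind Table 4.2 is simply additive: each miss type contributes its per-instruction frequency times its penalty, and only the resulting products appear in the report. A minimal check (ours), using those products directly:

    #include <stdio.h>

    int main(void)
    {
        double base   = 1.45;  /* code with branch/load latencies */
        double imiss  = 0.23;  /* instruction miss contribution   */
        double ldmiss = 0.13;  /* LOAD miss contribution          */
        double stmiss = 0.04;  /* STORE miss contribution         */
        double copybk = 0.04;  /* copyback contribution           */

        /* CPI = base + sum of miss contributions */
        printf("CPI = %.2f\n",
               base + imiss + ldmiss + stmiss + copybk);  /* 1.89 */
        return 0;
    }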

4.7 Interface with Second Level Cache (L2)

The second-level caches will also be located on the F-RISC/G MCM package, since the delay penalty for crossing MCM boundaries is simply too large. They can use 3 ns SRAMs and should provide at least 32 Kbytes for each of the data and instruction caches. A recent revision to the primary cache (adding an extra signal and passing the address to the second level at the same time the miss is sent to the processor for acknowledgment) has been made in an effort to speed up the LOAD miss sequence.

Figure 4.4: Cache design space

4.8 Other Considerations for Primary Cache

One other consideration for the primary cache is the write-allocation policy. In F-RISC/G the primary cache cannot write directly to main memory; all communications and data transfers must go through the second-level cache. At current yields, allowing the processor to write to main memory directly would simply incur too much overhead.

Another issue is cache coherency. This is particularly critical in configurations of F-RISC containing multiple nodes operating in an SPMD (single program, multiple data) fashion. There is not enough real estate to keep the caches of the various nodes coherent at the primary level; any such schemes will be left to the L2 caches, or perhaps some higher level of memory.

5 Secondary Cache Memory

5.1 Design Considerations

The design of the first level of cache memory requires that the effective transaction time for the second level (L2) cache be 5 ns in order to meet overall system throughput goals.

A survey of available SRAM technology reveals that it is possible to obtain 1k x 32-bit SRAMs with an access time of 3 ns [Brow92]. If GaAs MESFET technology is used, each of these chips can be expected to dissipate 5 watts. Large gate arrays with gate delays of around 177 ps, such as the Vitesse VSC20K8R [Vite91], are available and can be used to implement the cache logic and tag RAM on a single chip. Given these access times, an L2 miss or a dirty write-back will exceed the 5 ns limit. The solution is to provide a level 2 cache that is significantly larger than the level 1 cache, so that misses are rare, and to implement a write-back scheme for the L1 cache.

The estimated time required for an L2 hit is the time required for the L1 Miss signal to reach and propagate through the L2 control logic, plus the time needed to access the L2 SRAMs and propagate the stored tag through the comparator. Time of flight on the MCM is 6 ps/mm; assuming a worst-case distance of 5 cm between the L1 and L2 caches, around 600 ps is consumed by round-trip MCM delays. Logic delays are expected to consume less than 1 ns.
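
These figures combine as sketched below, using only the numbers quoted above; the fully serial sum is pessimistic, since the early-start mechanism (Section 5.3) overlaps part of the SRAM access.

    #include <stdio.h>

    int main(void)
    {
        double tof   = 6.0;     /* MCM time of flight, ps/mm         */
        double dist  = 50.0;    /* worst-case L1-to-L2 distance, mm  */
        double sram  = 3000.0;  /* 3 ns commercial SRAM access, ps   */
        double logic = 1000.0;  /* control logic + tag compare, ps   */

        double flight = 2.0 * tof * dist;        /* 600 ps round trip */
        double serial = flight + sram + logic;   /* 4.6 ns, no overlap */
        /* early start hides part of the SRAM access, pulling the read
           hit toward the 4 ns figure of Table 5.1 */
        printf("round trip %.0f ps, serial hit %.1f ns\n",
               flight, serial / 1000.0);
        return 0;
    }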

5.2 Size and Partitioning

The L1 cache uses a 512-bit (16-word) block size and contains 32 rows. In determining the L2 block size, blocks smaller than the L1 block were ruled out because they would require multiple bus accesses per L1 miss. A larger block size has the disadvantages of requiring multiplexing logic to select which portion of the L2 block to send to the L1 cache and of requiring more L2 SRAM chips. As a result, the same 512-bit block size will be implemented.

Given 1024 x 32 SRAMs, a 16-word by 1024-line cache can be implemented with 16 chips. This is 32 times the size of the L1 cache.
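
A quick check of this sizing arithmetic (the variable names are ours):

    #include <stdio.h>

    int main(void)
    {
        int sram_bits  = 1024 * 32;   /* one 1024 x 32 SRAM chip      */
        int chips      = 16;
        int block_bits = 512;         /* same block size as L1        */

        int total_bits = sram_bits * chips;        /* 524,288 bits    */
        int lines      = total_bits / block_bits;  /* 1024 lines      */
        int kbytes     = total_bits / 8 / 1024;    /* 64 Kbytes       */

        printf("%d lines, %d Kbytes, %dx the 2 Kbyte L1 cache\n",
               lines, kbytes, kbytes / 2);         /* 1024, 64, 32    */
        return 0;
    }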

The L2 cache will use a Harvard architecture, with each cache (instruction and data) being 64 Kbytes in size and requiring 16 SRAM chips and a single controller chip. The two cache controllers can be identical if the dirty circuitry in the instruction cache is left unused.

Each cache contains one controller / tag RAM chip which interfaces to 16 SRAM chips, the third level of memory, and the L1 cache; see Figure 5.1.

Figure 5.1: L2 Cache Partitioning

5.3 Level 1 / Level 2 Cache Communications

The L1 cache presents the L2 control logic with three control signals: L1 Miss, L1 Dirty, and L1 Valid; see Figure 5.2. L1 Valid is used to decrease the time required for an L2 read hit. Since the L1 cache passes address information from the CPU to the L2 cache, the L1 Valid signal tells the L2 cache that the address on the bus is valid. The L2 cache can then start to access its SRAMs as soon as the L1 cache receives an address, instead of waiting for the first level of cache to determine that an L2 transaction is needed. This early start has no benefit in the event of an L1 dirty miss, since the L1 cache must send an address stored in its SRAMs rather than the address presented to it by the CPU.

Figure 5.2: Intra-Cache Communication

5.4 Cache Organization

Due to timing constraints, the level 2 cache will use a direct-mapped architecture. If the effective access time of the third level of memory is 25 ns, if any L2 transaction is assumed to incur a 1 ns penalty in logic and SRAM-to-controller delays, and if time-of-flight delays and the "early start" mechanism are factored in, then the time required for each type of L2 transaction can be estimated; see Table 5.1. The approximate L2 cache cycle timing is shown in Figure 5.3.

Figure 5.3: Approximate L2 Cache Cycle Timing

Table 5.1: Estimated L2 Transaction Times

L2 Read Hit:           4 ns
L2 Read Miss, Clean:  28 ns
L2 Read Miss, Dirty:  58 ns
L1 Dirty, L2 Clean:    7 ns
L1 Dirty, L2 Dirty:   32 ns
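
Our reading of how the five transaction classes are selected is sketched below in C; the enum, lookup table (values copied from Table 5.1), and case analysis are illustrative, not from the report.

    typedef enum { READ_HIT, READ_MISS_CLEAN, READ_MISS_DIRTY,
                   L1_DIRTY_L2_CLEAN, L1_DIRTY_L2_DIRTY } l2_txn;

    /* estimated transaction times from Table 5.1, in ns */
    static const double txn_ns[] = { 4.0, 28.0, 58.0, 7.0, 32.0 };

    l2_txn classify(int l1_dirty, int l2_hit, int l2_dirty)
    {
        if (l1_dirty)  /* the dirty L1 victim must be written into L2 first */
            return l2_dirty ? L1_DIRTY_L2_DIRTY : L1_DIRTY_L2_CLEAN;
        if (l2_hit)
            return READ_HIT;
        return l2_dirty ? READ_MISS_DIRTY : READ_MISS_CLEAN;
    }
    /* e.g., txn_ns[classify(0, 0, 1)] yields the 58 ns dirty-miss case */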

5.5 Write Policy / Level 3 Memory Interface

In order to minimize bus traffic between the L2 cache and the third level of memory, a write-back scheme will be implemented for the L2 cache. This has the advantage of reducing time-consuming write requests to the third level of memory.

The assumed effective access time for level 3 memory is 25 ns. A third level of cache may be desirable, both to achieve this access time and to interface the level 2 output signals (L2 Valid, L2 Done, and L2 Miss) to conventional RAM.

5.6 Conclusions Concerning L2 Cache

Due to the timing requirements imposed by the L1 cache design, there are relatively few design decisions to be made regarding the L2 cache. Size requirements dictate that commercially available parts be used, and timing dictates that the cache be direct-mapped. In order to improve overall throughput, a write-back scheme and early-start capability will be implemented. The 5 ns goal for average access time can be realized, but only if the cache can be made large enough to ensure that most read transactions are hits. A third level of cache could improve matters by reducing the L2 miss penalty, but the technology for this section of the design is fairly conventional, except for the width of the bus transactions.

6 Package Design

Investigation of the design of a high speed package for the F-RISC/G system is continuing at Rensselaer. A multi-chip module (MCM) will be used to interconnect the F-RISC/G chipset. This package must provide high-bandwidth interconnects between the chips and must handle signals with rise times on the order of 75 ps. The package needs to provide integrated bypass capacitors, terminating resistors and cooling for the system. The next section will describe the issues under consideration for the design of the MCM.

6.1 Packaging Issues

The placement of the F-RISC/G core chips on an MCM is shown in Figure 6.1. Each chip measures 7.5 mm x 8 mm except the clock deskew chip, which measures 4.5 mm x 8 mm. The routing channels between chips are 3 mm wide for routability. Table 6.1 shows the time allotted to each inter-chip transfer and the corresponding maximum routing distance.

Figure 6.1: Placement of F-RISC/G Core

Table 6.1: Maximum Available Routing Distances for F-RISC/G MCM

Type of Transfer      Time Allotted (ps)   Routing Distance (mm)
Address to I-Cache           750                   135
Data from I-Cache            500                    90
Address to D-Cache           750                   135
Data to D-Cache             1000                   180
Data from D-Cache            500                    90

Figure 6.1 also shows the floorplan of the F-RISC/G MCM. The core chipset contains eight different types of chips, twenty-five chips in total. This core is surrounded by a second set of chips implementing the second-level caches, which comprise two cache controller chips and thirty-two memory chips. The revised floorplan, with the second-level cache chips shaded, is shown in Figure 6.2.

Figure 6.2: Placement of F-RISC/G chips with Second Level Cache

The transmission-line behavior of the MCM interconnect forces termination of signals traveling long distances. A patterned NiCr layer can be used for the terminating resistors. Bypass capacitors must be provided to suppress noise in the power distribution system, which is caused mainly by switching I/O drivers.

The high-speed gates necessary to meet our speed criteria consume up to 10 mW each. The F-RISC/G core is estimated to dissipate about 180 W, and this number increases once the second-level cache chips are placed on the same MCM. Removal of this heat is critical. The semi-insulating GaAs substrate used in the Rockwell process has a high thermal resistance; lapping the substrate down reduces it, and Rockwell can lap the chips down to 3 mils.

Figure 6.3 shows an early approach to this problem. This package consisted of copper wiring layers separated by a Parylene dielectric. The relative permittivity (εr) of Parylene is 2.7, giving a propagation velocity of 0.18 mm/ps (a time of flight of around 6 ps/mm). Ground planes are provided for all signal layers to yield controlled-impedance, high-bandwidth interconnect. The proposed wiring pitch is about 25 µm and the metal layers are 5 µm thick. The chips were to be mounted flip-chip style, with bypass capacitors and termination resistors integrated in the final version.

Figure 6.3: Cross-section of a Thin-film Multi-chip Module
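
The 0.18 mm/ps figure follows directly from the permittivity, since v = c / sqrt(εr) in a homogeneous dielectric. A short check (ours), including one routing-distance entry from Table 6.1:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double c  = 0.2998;         /* speed of light, mm/ps             */
        double er = 2.7;            /* relative permittivity of Parylene */

        double v   = c / sqrt(er);  /* 0.182 mm/ps                       */
        double tof = 1.0 / v;       /* 5.5 ps/mm, "around 6 ps/mm"       */

        /* 750 ps allotted -> ~137 mm; Table 6.1 lists 135 mm, which
           uses the rounded 0.18 mm/ps velocity */
        printf("v = %.3f mm/ps, tof = %.1f ps/mm, 750 ps -> %.0f mm\n",
               v, tof, 750.0 * v);
        return 0;
    }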

Work continues on several fronts to complete the design of the package for the F-RISC/G chipset. The MCM interconnect is being simulated to accurately predict the behavior of the system at high speeds. This work also includes determining the placement of the terminators and bypass capacitors, as well as evaluating the heat dissipated in the terminators.

References

[Brow92] Brown, R. B. and T. N. Mudge. GaAs MESFET characteristics for high-performance computing. Proceedings of the High Performance Computing PI Meeting, Vol. I, p. 73, Fig. 17, 1992.

[Karp91] Karpenske, D. and C. Talbot. Testing and diagnosis of multichip modules. Solid State Technology, Vol. 34, No. 6, pp. 24-26, June 1991.

[Maun92] Maunder, C. M. and R. E. Tulloss. Testability on TAP. IEEE Spectrum, pp. 34-37, February 1992.

[Phil93] Philhower, R. A. Spartan RISC architecture for yield-limited technology. Ph.D. dissertation, Rensselaer Polytechnic Institute, Troy, New York, expected 1993.

[Vite91] Vitesse. 1991 Product Data Book, p. 1-10, 1991.

[Webe92] Weber, S. JTAG finally becomes an off-the-shelf solution. Electronics, Vol. 65, No. 9, p. 13, 10 August 1992.