F-RISC/G and Beyond -- Subnanosecond
Fast RISC for
TeraOPS Parallel Processing Applications



ARPA Contract Numbers DAAL03-90G-0187,

DAAH04-93G-0477,

[AASERT Award DAAL03-92G-0307 for Cache Memory]

Semi-Annual Technical Report

April 1994 - October 1994



Prof. John F. McDonald

Center for Integrated Electronics

Rensselaer Polytechnic Institute

Troy, New York 12180


(518)-276-2919

FAX (518)-276-8761

MACinFAX (518)-276-4882

e-mail: mcdonald@unix.cie.rpi.edu

Abstract

Previous work on the F-RISC/G project had completed the design of at least one version of each of the key architecture chips for the 1 ns F-RISC/G computer. However, preliminary results from the first RPI test chip suggested it would be prudent to delay fabrication of these chips. Fortunately, the award of a companion HSCD subcontract from Rockwell provided a mechanism whereby a series of additional test chip fabrication probes could be tried to establish the status of the fabrication process. These collateral fabrication runs have made it possible for us to avoid premature expenditure of the foundry funds and to defer commitment of the architecture chips to a point in time where yield and performance could be verified. An additional benefit of this approach has been the provision of extra time to check the F-RISC chip designs more thoroughly. It was during this process that the importance of utilizing 3D capacitance analysis tools in network extraction of wiring delays became evident. Through use of a new tool developed by Y. LeCoz at RPI, the F-RISC research group uncovered a large number of instances where our designs had underspecified the driving power of various logic circuits. In certain cases this explained residual differences between measured and computed waveforms taken from the first RPI test chip. Extensive rework of most of the architecture chips was found necessary, especially for the register file and the cache memory chips. Another benefit from the HSCD program was that, for the first time, validated HBT thermal models became available. This provided the group a chance to incorporate more realistic "slack" in the timing to accommodate thermal slowdown of the HBT at the elevated temperatures likely to be encountered in the MCM. A rather detailed tradeoff between cooling capability in the MCM and the ultimate speed was obtained. Four test chips were submitted under HSCD funding, including a test circuit for a 20 GHz VCO. Additional checking of the architecture has become possible through the use of APTIX Field Programmable Circuit Board technology along with FPGA emulators for each of the GaAs chips. This very thorough extra check revealed at least two additional functional problems with the previous designs that might have jeopardized the success of the project, one of which was excited only on simulated power-up. The emulator board also permits continued software development to proceed while the actual processor chips move toward fabrication. Finally, during this contract period the group continued its examination of the Rockwell models for the new 100 GHz process and for a new 80 GHz round emitter HBT which could be used in a subnanosecond F-RISC/H superscalar effort.

List of Participants

> John F. McDonald (Professor and Principal Investigator)

> Hans J. Greub (Assistant Professor)

> Arik Airapetian (Visiting Scientist in the test area)

> John S. VanEtten (Graduate Student)

> Atul Garg (Graduate Student)

> Peter Campbell (Graduate Student)

> Cliff Maier (Graduate Student)

> Sam Steidl (Graduate Student)

> Steve Carlough (Graduate Student)

Project Goals

> Exploration of the Fundamental Limits of High-Speed Architectures through use of Advanced Device Technology.

> Study of GaAs HBTs for Fast Reduced Instruction Set Computer Design including Adequacy of Yield, and Device Performance for High Performance Computing Applications.

> Research the Impact of Advanced MultiChip Module and 3D Packaging on Processor Architectures for Sub-nanosecond Computing.

> Examination of Pipelining across Pad Driver Boundaries as a Means for Reducing Adverse Effect of Partitioning due to Yield Limitations in Advanced Technologies.

> Investigation of Power Management in High Power/ High Performance Technologies.

> Study of Appropriate Memory Organization and Management for Extremely Fast RISC Engines using Yield Limited Technologies.

> Use of Adaptive Clock Distribution Circuits for Skew Compensation at High Frequencies.

> Exploration of Superscalar and VLIW organizations in the sub-nanosecond cycle regime.

> Exploration of a combination of HBT and MESFET technologies for lower power, higher yield, but fast cache memory [AASERT Program].

> Exploration of novel HBT technologies such as HBTs in the SiGe and InP materials systems.

Introduction

The present focus of the F-RISC group is the Heterojunction Bipolar Transistor (HBT) and its impact on computer design. While the industrial emphasis has been primarily on CMOS technology, it is acknowledged by many experts that progress with this technology is approaching some significant barriers. GaAs MESFET, H-MESFET, and HEMT devices may offer some prolongation of the usefulness of the FET. However, there are many reasons to reexamine the bipolar device and its role in the future. Many of these have been addressed in earlier semiannual reports, and they will not be repeated here. However, as is mentioned in a recent text on high-speed devices by Chang and Kai, it can be observed that the speed of a device is determined by the transit time of a carrier through a control region. In a FET the transit time is limited by the length of the channel, which is dictated by the patterning capability of lithography among other things. In the bipolar device this time is determined by the thickness of the base, which in turn is dictated by epitaxial technology. Epitaxy is capable of making much shorter control regions than lithography.

More significantly, the current handled by the FET in switching circuit applications is forced to travel horizontally through a channel whose minimum width is dictated by lithography and whose thickness is by necessity much smaller than a design rule if the device is to have good turn-off characteristics and low leakage. In the bipolar device the current passes vertically through the base, whose length and width are both set minimally by lithographic design rules. Hence, if a certain amount of current must be switched to charge load capacitances over a given voltage swing, the current density in the device can be inherently lower for the bipolar device, or conversely a greater amount of current is available for switching capacitive loads. The transconductance of the bipolar device is much larger per unit width of the device than for the FET.

Finally as we have also observed in previous reports the threshold control on bipolar devices is superior to that of the FET. This enables the use of much lower voltage swings, which in turn demand even less current to accomplish a logic transition. Consistent with this goal is the fact that with bipolar circuits it is possible to use current steering logic and full differential mode to reduce switching noise to a minimum and increase noise margin simultaneously.

The present disadvantage of the heterojunction bipolar device is that its progress toward lithographic shrinkage has been slower than for CMOS or MESFET technology. This technology lag has two effects. The first is that the large scale of the present HBT devices denies them the advantage of the shorter interconnections found with most FET processing, so some of the HBT's wire-charging advantages are compromised. The second is that the total current in the HBT must be large to achieve the biasing current densities demanded to reach the peak of the transit-time frequency curve versus collector current. A smaller bipolar device could reach this density for optimum peak transit time with a lower total amount of current. This in turn could result in a faster circuit which actually burns less power for the same small voltage swing at the higher frequency. The reason for this seemingly contradictory notion is that the dynamic power dissipation of the bipolar circuits can be much smaller than in FET processing (by several orders of magnitude), but the biasing currents are set by the size of the device and this peak in the current density curve for the transit time frequency. A smaller device with concurrent shrinkage of the wiring capacitance tends to skew this comparison towards the bipolar device.

Part of our research program is to identify these generic trends and to use the F-RISC architecture as a test case for verifying the hypothesis that ultimately the bipolar device might be faster and use less power than CMOS in high power systems. A key factor in this study (which is still not resolved) is the impact of higher current densities on these devices, primarily through dopant redistribution. A lithographic shrink by a factor of two increases the current density in both FET and bipolar technology by a factor of two, barring other changes, for situations where the device drives wire capacitance, because a lithographic shrink in any of the submicron technologies brings only a linear reduction in wire capacitance due to edge effects. The problem is more severe in GaAs technology because of the semi-insulating substrate, and we see this effect at larger wire geometries, but submicron CMOS is also encountering this scaling problem. Of course, as has also been mentioned, the HBT seems uniquely better able to handle these current densities. In this report we continue the study of the scaling process for HBT technology that was begun in the previous semiannual report.

Although the 100 GHz Rockwell HBT process is still under development, a preliminary sampling of performance improvements can be had by examining a new 80 GHz round emitter variation on the 50 GHz baseline HBT. Although there would be no wire capacitance scaling associated with using these round emitter devices, many wires are quite short in F-RISC/G, and it is possible that incorporation of this transistor in some of the critical path calculations might reveal that some speedup of the architecture would derive from simply using these transistors in place of the conventional rectangular emitter HBTs.

Access to the 100 GHz process is enhanced by the participation of the RPI design group in some of the HSCD objectives. Although most of the HSCD work is currently using the 50 GHz baseline our participation in that program has already provided an early look at the 100 GHz and 80 GHz round emitter models. Additionally there is a large fabrication effort in the HSCD program, which has provided more access to the yield and performance modeling for Rockwell's process than we could otherwise afford. This has permitted submission of several test chips, including a revised version of the first RPI test chip for fabrication. One of these new test chips focuses on a set of wiring oscillators, whose performance will give a better confirmation of the importance we have ascribed to 3D wiring effects in our layouts.

Companion HSCD project

A reticle containing test chips was submitted to Rockwell for fabrication in July. The layout of the reticle is shown in Figure 1. This reticle contains four chips: a passive test chip, a standard cell test chip, a 20 GHz voltage controlled oscillator (VCO) test chip, and a register file test chip.

FIGURE 1: LAYOUT OF THE RPI-ROCKWELL RETICLE

The passive test chip contains test structures to measure wiring parasitics on a HBT chip. It is described in the next section. Other chips contain a number of key circuits used in the main architecture chips. The 20 GHz VCO chip is described in section 0. The register file test chip is an optimized version of the previous test chip fabricated at Rockwell. It is described in section 0. The standard cell test chip contains a number of representative standard cells used in the F-RISC/G chips and is described in section 2.4.

Passive test chip

A passive test chip was designed and sent along with other test chips to Rockwell in July. The layout of the chip is shown in Figure 2. This chip contains both the passive test structures and the active test structures.

FIGURE 2: LAYOUT OF THE PASSIVE TEST CHIP

The passive structures are meant for measuring wiring parasitics on an AlGaAs/GaAs HBT chip and comparing the measured results with results obtained from CAD tools. The structures are divided into five categories - capacitors, inductors, probe calibration, transmission lines, and resistors.

The active structures are divided into three categories - coupling, device characterization, and ring oscillators. The coupling structures allow measuring the coupling between differentially coupled wires and single-ended wires. A number of device-characterization structures are provided to fully characterize the devices used in other active structures. The ring-oscillators are loaded with different interconnect capacitances to show the effect of capacitive loading on the wires. These oscillators are made up of standard Q1 and the new round Q1 transistors. The oscillation frequencies of these structures lie in the range of 1.5 GHz - 2.0 GHz.

Capacitance Extraction

FIGURE 3: FIELD LINES FOR SI INTERCONNECT STRUCTURE (1-2--3)

FIGURE 4: FIELD LINES FOR GAAS INTERCONNECT STRUCTURE (1-2--3)

The semi-insulating GaAs substrate reduces interconnect capacitances, but the lack of a ground plane in close proximity to the interconnect layers results in increased coupling to nearby nodes. Even if the wafer is lapped, the ground plane is at least 75 µm away from the interconnect layers. In Si circuits the lightly doped silicon substrate acts as a ground plane and interconnect capacitances are dominated by the capacitance to the substrate, at least at low frequencies. Thus, the capacitance of interconnections on Si is a strong function of the interconnect length whereas the GaAs interconnect capacitance is a strong function of the shape and proximity of nearby conductors. The methods used in many VLSI design tools for interconnect extraction are targeted for CMOS designs and use, for example, the Sakurai fitting equations to extract parasitic capacitances. However, since GaAs interconnect capacitances are not dominated by the capacitance to the substrate these methods are less accurate.

Figure 4 and Figure 3 illustrate the difference between GaAs and Si interconnect capacitances. Figure 4 shows the field lines for three 2 µm wide wires (1-2--3) with a spacing of 2 µm and 4 µm on a 75 µm thick GaAs substrate. Figure 3 shows the field lines for an equivalent interconnect structure on Si with a field oxide thickness of 1 µm. The interconnect capacitances per unit length for the center conductor are shown in Table 1.

TABLE 1 - GAAS AND SI INTERCONNECT CAPACITANCE

Capacitance    GaAs [fF/µm]    Si [fF/µm]
C20            0.022           0.141
C21            0.066           0.028
C23            0.045           0.012
C22            0.133           0.181

FIGURE 5: TYPICAL INTERCONNECT GEOMETRY

The capacitance to the substrate, C20, is only 17% of the total capacitance for GaAs, but 78% for Si. The coupling to the nearby wires is at least twice as strong. In addition, the coupling to nearby wires does not decrease as quickly with distance in GaAs as in Si. The coupling to conductor 3 is still 68% of C21 in GaAs even though the spacing to conductor 3 is 4 µm and the spacing to conductor 1 is 2 µm. The coupling to conductor 3 is only 42% of C21 in Si. The typical interconnect structures in a circuit are much more complex than the 2-D interconnect case used here to illustrate the difference between GaAs and Si coupling capacitances. Especially in standard cells the conductor geometries and spacing vary widely and all three metal layers are used. The strong coupling in GaAs interconnects and the complex geometries therefore require 3-D capacitance extraction to obtain accurate interconnect capacitance figures.
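
As a quick consistency check, the percentages quoted above can be reproduced directly from the Table 1 entries (note that C22, the total capacitance of the center conductor, equals the sum of the three partial capacitances). A minimal Python snippet doing the arithmetic:

    # Partial capacitances of the center conductor from Table 1, in fF/µm.
    c = {"GaAs": {"C20": 0.022, "C21": 0.066, "C23": 0.045},
         "Si":   {"C20": 0.141, "C21": 0.028, "C23": 0.012}}

    for tech, caps in c.items():
        total = sum(caps.values())                      # equals C22 in Table 1
        print(f"{tech}: C20/total = {caps['C20']/total:.0%}, "
              f"C23/C21 = {caps['C23']/caps['C21']:.0%}")
    # GaAs: C20/total = 17%, C23/C21 = 68%
    # Si:   C20/total = 78%, C23/C21 = 43% (the text quotes 42%, from unrounded values)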

Accurate device models and interconnect parasitics are essential for high speed circuit designs. If the delays can be estimated accurately the designer can allocate power optimally and control on-chip skew effectively, resulting in faster circuits with lower power dissipation. The importance of 3-D capacitance extraction tools increases as the devices and interconnects are scaled. If the Rockwell HBT process is scaled by a factor S (S<1), the interconnect length shrinks by S and interconnect capacitances also shrink by S. The capacitance per unit length stays about constant since the width and the spacing decrease. However, the maximum device current shrinks with S². The current per cm² in the emitter is about constant since it is already close to the dopant redistribution and electromigration limits. Switching to carbon doping for the HBT base layer has increased the dopant redistribution limit by a factor of two. If interconnect capacitances shrink with S and the device current levels shrink with S², interconnect capacitance induced delays will make up a larger fraction of the critical path delays. Thus accurate 3-D capacitance extraction tools are essential for subnanosecond computing.
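
The scaling argument can be made concrete with a small illustrative calculation. Assuming, as stated above, that the total wire capacitance on a scaled net shrinks with S while the available device current shrinks with S² (with the logic swing held fixed), the wire-charging delay C·ΔV/I grows as 1/S relative to the device; the numbers below are purely illustrative and not tied to any particular Rockwell design rule.

    # Relative wire-charging delay t = C*dV/I under a shrink by factor S:
    # C ~ S (capacitance per unit length roughly constant, length shrinks by S),
    # I ~ S^2 (emitter area shrinks, current density already near its limit),
    # dV (logic swing) held fixed.
    def relative_wire_delay(S, C0=1.0, I0=1.0, dV=1.0):
        C = C0 * S              # total wire capacitance on the scaled net
        I = I0 * S ** 2         # available switching current
        return C * dV / I       # grows as 1/S as the process is shrunk

    for S in (1.0, 0.7, 0.5):
        print(f"S = {S}: wire-charging delay x{relative_wire_delay(S):.2f}")
    # S = 1.0 -> x1.00,  S = 0.7 -> x1.43,  S = 0.5 -> x2.00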

We currently have the OEA tool Metal and the RLC tool QuickCap available for 3-D capacitance extraction. Metal uses 3-D volumetric meshing and a very fast mesh equation solver. Unfortunately the mesh size very quickly becomes too large, even for small GaAs circuits. OEA is working on a new tool called NETAN which will only analyze the capacitance of a net to its nearest neighbors and use localized meshing. NETAN will be able to extract capacitances for larger nets.

RPI is a beta site for RLC's QuickCap extractor, which is still under development. For all our new designs and the analysis of previous designs we have been using QuickCap. QuickCap uses a floating random-walk algorithm for capacitance extraction that does not require meshing. We have been able to analyze circuits with up to 2000 nodes (an area of 1 mm by 2 mm) and extract the capacitances of 10 nodes in 24 hours. It is not feasible to extract all node capacitances in large circuits because of memory constraints and because the extraction time increases linearly with the number of capacitances extracted. We found performance problems with circuits that contain empty areas where circuitry had been removed to reduce memory requirements and avoid paging. The number of walks per CPU second fell significantly because many of the random walks would rattle in the thin oxide layer of the HBT process for a long time before terminating. Random walks terminate only once they hit an electrode or the ground plane. The current release, QuickCap.07, contains a fix that improves performance on such circuits by up to a factor of 16.
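
The random-walk idea can be illustrated with the toy sketch below, which launches walkers beside a target conductor on a small 2-D grid and records where they terminate and how long they wander. This is only a cartoon, not QuickCap's actual floating random-walk algorithm (which uses floating-point jump surfaces in three dimensions and produces calibrated capacitances); the grid, the restart rule on self-hits, and the boundary conditions are arbitrary assumptions chosen to show why walks terminate quickly when an absorbing surface is close by and wander for many steps when it is not.

    import random

    W, H = 60, 40                                    # grid size, arbitrary units
    target   = {(x, 20) for x in range(30, 34)}      # net of interest ("conductor 2")
    neighbor = {(x, 20) for x in range(24, 28)}      # nearby wire ("conductor 1")
    start    = (34, 20)                              # launch point beside the target

    def one_walk():
        x, y = start
        steps = 0
        while True:
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x = (x + dx) % W                         # periodic sideways boundary
            y = min(y + dy, H - 1)                   # reflecting lid on the box
            steps += 1
            if y == 0:
                return "ground", steps               # absorbed by the substrate ground
            if (x, y) in neighbor:
                return "neighbor", steps             # absorbed by the nearby wire
            if (x, y) in target:
                x, y = start                         # crude restart on a self-hit

    random.seed(1)
    hits, total_steps, N = {"ground": 0, "neighbor": 0}, 0, 1000
    for _ in range(N):
        where, steps = one_walk()
        hits[where] += 1
        total_steps += steps

    print("fraction terminating on the neighbor:", hits["neighbor"] / N)
    print("fraction terminating on ground:      ", hits["ground"] / N)
    print("average walk length (steps):         ", total_steps / N)

Moving the absorbing ground plane closer to the conductors shortens the average walk dramatically, which gives a feel for why walk length, and hence extraction time, depends so strongly on the surrounding geometry.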

RLC is currently working on a parallel version of QuickCap. The new version will divide the chip area into overlapping tiles and distribute the analysis of the tiles onto multiple SUN workstations. Parallelization of the random walk algorithm should yield a linear speedup with the number of processors since each random walk is independent of every other. Thus very little communication between the subtasks running on the different processors is required. Tiling will reduce the memory requirements for each subtask and avoid paging.

We need improved/faster extraction tools to be able to get extraction results overnight. Currently we can fully extract a standard cell or a few critical nodes in a larger block overnight; however, the extraction of a large number of nodes in a large circuit such as the register file is very time consuming (>24 CPU hours). The parallel version will also help speed up the extraction of small layout cells. The experience with the register file has shown that we need to use the extractor to trade off different layout alternatives and to minimize capacitances on critical nodes such as the bitlines.

20 GHz "Challenge" Chip Update

The 20 Ghz "Challenge" chip, a voltage-controlled oscillator (VCO), was sent to Rockwell in July 1994. At present, a test plan is being prepared for the chip. During the design of the VCO, the 2D VTI Tools extractor was the sole capacitance estimation tool available. While we know this tool may be inaccurate for certain layout topologies (in particular dense circuits with minimally spaced wires) we believe that the impact of any inaccuracies in modelling capacitance are offset by the design methodology used within the VCO. For example, the 2D extractor is optimized for dense, minimally-spaced standard cell routing channels. While the VCO does have some extremely dense sections, all long wires have been spaced approximately 3 to 4 times the minimal spacing requirement, thus the 2D extractor (based upon the assumption of minimally spaced wires) will overestimate capacitance on these nodes. The RCL QuickCap extractor will eventually be used to generate accurate capacitance numbers which may then be used in a PSPICE simulation. Results from simulation will then be compared to experimental values measured from the fabricated chip. The design of the VCO had benefit for other aspects of FRISC, notably the Deskew chip. The high-speed, symmetric XOR cell used in the VCO as a frequency multiplier is also a highly-sensitive phase detector and is incorporated into the Deskew phase-locked loop (PLL) circuit.

Optimization of the Register File used in the RPI Testchip and Datapath Chip

After the modifications to the memory cells and the address decoders were completed (as described in the last semiannual report), simulations with PSPICE (which included the wiring capacitances extracted with our new 3-D capacitance extractor) revealed that the register file was still too slow. In order to improve the access time, other cells were examined using the QuickCap capacitance extraction tool. As a result, the threshold voltage generator, address-line drivers, read-write logic and sense amplifiers were modified. In addition, the availability of a third level of metal opened up new design possibilities which were explored and integrated into the optimized register file. Figure 6 depicts the location of the changes within the register file. These changes are described below.

Most of the changes were made possible by the recent process upgrade to a third level of metal which could be routed over devices. This allowed the designer to produce layouts with less capacitance and more symmetry, thereby improving the circuit speed while reducing skew within a differential signal pair. Because the register file is an analog circuit which is highly sensitive to capacitance, symmetry in layout is critical. Based upon experience with the 20 GHz "Challenge" Chip, the designer of the VCO was selected to redesign the register file. Because the register file was already incorporated into two other layouts, it was also extremely important to maintain the original signal input/output locations. Although this constraint was always met, it did reduce the symmetry of the layout and thus the overall optimality.

FIGURE 6: REGISTER FILE MODIFICATIONS

Threshold Voltage Generator

There were a number of reasons for optimizing this circuit. Most importantly, parts of this circuit must match exactly the layout and orientation of both the memory cell and the wordline pullup resistors; hence the optimization of the memory cells dictated the redesign of the Threshold Voltage Generator. Further justification came from the use of a two-level metal process for the original design. As a result, the layout was unnecessarily complex for use with a three-level metal process, and it was therefore decided that the circuit would be redesigned from scratch in order to fully utilize the new process. The new layout also allowed the use of monolithic microwave integrated circuit (MMIC) capacitors, and as a result, the overall size of the layout was reduced considerably.

Address Line Drivers

As with the Threshold Voltage Generator, the original Address Line Driver was designed for a two-level metal process, resulting in a dense, asymmetrical layout with high parasitic capacitance. In order to efficiently utilize the new process, this circuit was also redesigned from scratch. Drawing upon experience with the high-speed VCO, the design methodology focused explicitly upon creating balanced, symmetric signal paths to ensure matched delay. As a result, the new optimized layout was significantly smaller than the original design. The savings in area were transferred to reducing capacitance on adjacent address lines by increasing the spacing between lines and between the driver and the lines. The Address Line Driver optimization was constrained by the original position of the register file input connections.

Power Rail Metallization Changes

In optimizing the Address Line Drivers, it became possible to optimize the power rails within the register file. The original design required several alternating power and ground connections to the address driver side of the chip simply because a power connection placed between two address line drivers could not be extended beyond those two cells. By placing the power and ground rails in the third level of metal, the rails may be routed over the cells and thus all drivers may share the same supply rails. This helps reduce voltage droop along the rails and allows more flexibility in providing power to the register file macro.

Address Line Metallization Changes

The Address Line Drivers are used as a buffer between the register file address line inputs and the internal address lines. The internal lines run the height of the macro and are connected to the 32 address line decoders. Crossover capacitance on the internal address lines can be significant and should be minimized, hence the metallization scheme was modified to take advantage of the third level of metal. By changing the address lines from metal2 to metal3, the crossover capacitance between the decoder inputs and the address lines was significantly reduced.

Sense Amplifier Changes

The Sense Amplifiers were modified in order to reduce crossover capacitance and increase drive current capabilities. The internal supply rails were rerouted over devices using metal3 and the VSS rail was split into two rails in order to reduce capacitance. The drive current was boosted by replacing a normal Q1 transistor with a high-current Q3 device. The Sense Amplifier optimization was constrained by the original position of the register file output connections.

Addition of Read/Write Buffer

A buffer was added to the Read/Write input signal to drive the eight Read/Write Logic cells. This buffer reduced the loading on the input signal and thus improved the access time of the register file. The addition of the buffer was made possible by the reduced area of the redesigned threshold voltage generator cell. The Read/Write Buffer placement and routing was constrained by the original position of the register file input connections.

Read/Write Logic Changes

The Read/Write Logic was also optimized to take advantage of the third level of metal. Power rails were repositioned within the cell in order to reduce capacitance. In addition, the circuit was redesigned to remove a device and improve symmetry between the signal paths. The Read/Write Logic optimization was constrained by the original position of the register file input connections.

Clock Distribution

The clock distribution of subnanosecond clock signals on an MCM is difficult since even relatively small amounts of skew can make up a significant fraction of the short clock cycle. For example, if data is transferred synchronously between two chips on the MCM within a 500 ps cycle and the clock skew is 50 ps only 400 ps are available for the transfer in the worst case. In addition, there will be skew in the on-chip clock distribution tree that provides the clock for the input and output latches on the two chips which can further reduce the available data transfer time. Thus a low skew clock distribution scheme on the MCM and on the chips is essential for subnanosecond computers.

We have developed a clock distribution scheme with active skew compensation based on digital delay lines and phase locked loops (PLLs). The skew compensation scheme can compensate for slowly varying delays due to temperature effects or water take-up, a problem with polyimides. A test chip has been designed, laid out, and verified for evaluation of the clock distribution scheme at 2 GHz. The test chip contains several additional features to measure clock jitter and to increase testability and observability of key control signals.

Figure 7 shows the clock distribution scheme. A clock distribution chip provides a clock distribution channel for each clocked chip on the MCM. Each channel is essentially a PLL clock loop. The master clock is sent through a digital delay line on the forward path and through a clock driver over an MCM transmission line to a clocked chip. The clocked chip receives the clock signal, feeds it to its four-phase clock generator, and returns the clock signal to the clock distribution chip on a matched transmission line. The clock distribution chip receives the clock return signal and sends it through a matched digital delay line to the phase detector of a PLL controller. The controller adjusts the control voltage of the digital delay lines such that the phase difference, or phase error, between the master clock and the clock return signal is zero. In the ideal case all delays on the forward and return path are exactly matched and the clock arrives at the four-phase generator on the receiving chip at 0.5·n·Tclk if the clock loop round trip delay is n·Tclk and the PLL is in lock. Once all N clock channels are in lock, each receiving chip receives the master clock with a delay of 0.5·n·Tclk, provided the delays on each clock channel are constrained such that the clock delay multiplier n is the same for all clock channels.
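
A minimal behavioral sketch of this lock condition is given below. It models a single channel, lumps the one-way interconnect and driver/receiver delays into a single number, assumes perfectly matched forward and return paths and delay lines, and replaces the analog loop filter with a simple discrete-time integrator; the 180 ps flight time and the loop gain are illustrative assumptions, not measured values.

    # One clock-distribution channel: the controller nulls the phase error between
    # the master clock and the returned clock, which forces the round-trip delay to
    # n*Tclk and therefore the clock arrival at the chip to 0.5*n*Tclk.
    TCLK = 500.0                        # 2 GHz master clock period, in ps
    one_way_fixed = 180.0               # assumed MCM flight + driver/receiver delay
    line_delay = 0.0                    # adjustable delay added by each delay line

    def phase_error(round_trip):
        """Phase of the returned clock relative to the master, wrapped to +/- Tclk/2."""
        err = round_trip % TCLK
        return err - TCLK if err > TCLK / 2 else err

    gain = 0.2                          # integrating controller gain (illustrative)
    for _ in range(40):
        round_trip = 2 * (one_way_fixed + line_delay)     # forward + matched return
        line_delay -= gain * phase_error(round_trip)

    arrival = one_way_fixed + line_delay
    print(round(2 * arrival, 2), "ps round trip (n*Tclk)")     # -> 500.0
    print(round(arrival, 2), "ps clock arrival (0.5*n*Tclk)")  # -> 250.0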

The clock distribution chip further contains a system startup controller that generates the Sync signal that synchronizes the four-phase generators on the receiving chips. The four-phase generator switches to the next phase at every clock signal transition, thus a clock phase is only 250 ps long. Without synchronization the clocked chips might receive the clock without skew, but be in a different phase. The master clock must be stopped for a clock period in order to distribute the Sync signal to all receiving chips since the 250 ps delay between clock transitions is not sufficient to distribute the Sync signal to all chips on the MCM.

In order to prevent the clock loops from locking with different clock delay multipliers the following conditions must be met:

max(Delay_of_Delay_Line) + max(Transmission_Line_Delay_Mismatch) < Tclk

min(Delay_of_Delay_Line) - max(Transmission_Line_Delay_Mismatch) > -Tclk

FIGURE 7: CLOCK SKEW COMPENSATION

The maximum delay of the digital delay lines with respect to the initial delay (the Init signal forces the delay control signal to zero) is 125 ps and the minimum delay is -125 ps; thus the maximum tolerable delay mismatch between the clock distribution channels must be below 125 ps for a 2 GHz clock signal.

Phase Locked Loop Controller

FIGURE 8. THREE STATE PHASE DETECTOR

The phase locked loop controller adjusts the control voltage of the digital delay lines such that the phase difference between the master clock and the return clock is zero, and the PLL stays in lock even if the interconnect or driver/receiver delays vary slowly. The controller is more complicated than in a PLL for frequency control since no VCO is present and some of the non-ideal behavior of phase detectors becomes important. The phase difference, or phase error, is measured with the three state phase detector shown in Figure 8. The phase detector actually has a fourth state (11) with both output signals, UP and DOWN, high simultaneously. If the phase detector is in state (11) it gets cleared by the AND gate after the propagation delay through the AND and the reset delay of the master-slave latch. If one of the input signals (V, R) goes through a positive transition while the phase detector is in state (11) or the clear signal is still active, the transition gets lost and the phase detector switches characteristics. An ideal three state phase detector has two such characteristics, offset by one clock cycle. The switch happens as soon as the phase difference moves outside of the permissible phase range of the phase detector.

Figure 9 shows the HBT phase detector characteristic for a 2 GHz clock signal. The trace shows the averaged phase error signal. The actual phase error signal generated from the Up, Down signals of the phase detector is a positive or negative pulse train. The actual phase range is only -π to π instead of the -2π to 2π range of the ideal phase detector, even though the latches have been optimized for a fast reset.

It is important to note that the sign of the phase error signal changes if the phase detector switches characteristics. Which characteristic the phase detector is on when the PLL starts up depends on initial conditions. Since the phase detector can be on characteristic 1 or 2 when the PLL starts up, the error signal generated from the UP, DOWN signals for the PLL can have either sign!
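
This startup ambiguity can be reproduced with a simple behavioral model of an ideal three state phase detector (the reset delay through the AND gate is neglected, which is an idealization). With the return clock leading the master clock by 100 ps, the averaged error per cycle comes out as either -100 ps or +400 ps, i.e. the same skew read on the two characteristics offset by one 500 ps clock cycle, depending only on which edge the detector happens to see first.

    # Ideal three-state phase detector: UP is set by a master-clock (R) edge, DOWN by
    # a return-clock (V) edge, and both are cleared whenever both are high.  Returns
    # the averaged UP-minus-DOWN time per return-clock cycle.
    def pfd_error_per_cycle(ref_edges, fb_edges):
        events = sorted([(t, "R") for t in ref_edges] + [(t, "V") for t in fb_edges])
        up = dn = False
        net, last_t = 0.0, events[0][0]
        for t, kind in events:
            if up and not dn:
                net += t - last_t            # UP pulse: positive phase error
            elif dn and not up:
                net -= t - last_t            # DOWN pulse: negative phase error
            last_t = t
            if kind == "R":
                up = True
            else:
                dn = True
            if up and dn:                    # state (11): cleared by the AND gate
                up = dn = False
        return net / len(fb_edges)

    T = 500.0                                # 2 GHz clock; all times in ps
    ref = [i * T for i in range(1, 21)]      # master clock edges
    fb  = [t - 100 for t in ref]             # return clock leads by 100 ps

    print(pfd_error_per_cycle(ref, fb))      # -100.0: detector starts on one characteristic
    print(pfd_error_per_cycle(ref, fb[1:]))  # +400.0: same skew, the other characteristic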

FIGURE 9. HBT PHASE DETECTOR

If the phase detector comes up in the wrong state or characteristic the PLL will have positive feedback and drive the PLL output voltage to its upper or lower limit; the PLL latches up! The controller must detect this situation and force the phase detector to change to the other characteristic. Unfortunately the phase detector is then close to a zero of the current characteristic and the phase difference will be out of the range for the characteristic that we would like to switch to. Thus the phase detector will switch right back to the characteristic that led to the latch-up. An indirect approach must be taken to force a switch to the characteristic that provides negative feedback.

FIGURE 10. CONTROLLER FOR PLL CLOCK LOOP

Figure 10 shows the PLL controller needed for each clock distribution channel. If the phase detector is on the wrong characteristic when the PLL starts up (situation 1 in Figure 8) the controller detects a PLL latch up with the two comparators that check whether the loop filter output voltage has reached the upper or lower voltage limit (situation 2). The loop filter has been replaced with an integrator to increase loop gain and reduce the steady state error of the PLL. If either limit is reached the corresponding comparator sets a latch that will force the Up, Down signal converter to output either high or low voltage. This will drive the phase difference outside of the range of the current phase detector characteristic and thus force a change over to the characteristic that provides negative feedback. The change in sign is detected by a novel differential Schmitt Trigger circuit which will reset the latch (situation 3).

FIGURE 11. PLL CONTROLLER WAVEFORMS

Once the phase detector has changed characteristics the negative feedback loop will drive the PLL into lock (situation 4). Figure 11 shows the PLL controller waveforms and phase error of the PLL for the case where the loop initially latches up. The final phase error is below 5 ps. These PLL waveforms were generated with SPICE. PLLs are difficult to design since they take a very long time to simulate. The transient analysis has to go through hundreds of clock cycles until the steady state is reached. It took 36 hours of CPU time on a Sun10 to generate the traces shown.

Testability

FIGURE 12. DESKEW TEST CHIP (2.6 MM X 3.0 MM)

Since the deskew chip will be inserted on an MCM, the chip must be fully testable on the wafer for Known Good Die identification. Two additional delay lines have been included in each clock distribution channel to close the clock loop on the chip and simulate slowly varying interconnect delays. This is achieved by applying a slowly varying sawtooth waveform on the TestV input and applying the Test signal. Each channel has a Test_Point signal output to measure skew in test mode. For a more coarse evaluation of a clock channel the phase detector lock signal can also be observed. The lock detector has a window of -15 ps to 15 ps. On the deskew test chip the Test_Point signals of the two clock channels implemented are connected to four-phase generators and the phase 1 signals are connected to an XOR phase detector. The XOR output signal is connected to an output driver for direct measurements of skew. Figure 12 shows the layout of the deskew test chip with two clock distribution channels, a system startup controller, and the additional features to increase testability and observability. The deskew test chip contains 1030 HBT devices in an area of 2.6 mm x 3.0 mm and dissipates 2 W.

L1 Cache Chips

In early May, the design of the level 1 (L1) cache controller was initiated. The design of this chip was delayed while the L1 cache chip was in the design process. Due to the Harvard architecture used in FRISC-H, the L1 cache controller must be able to operate as either a data cache controller (DCC) or instruction cache controller (ICC). The L1 controller was designed to keep the time penalty for a miss at an absolute minimum yet use a simple state machine as a controller. The state machine has seven states and uses three master-slave latches.

The L1 cache controller uses a three-stage pipeline in order to properly handle cache misses. Included within the L1 controller is a remote program counter (RPC) which generates the next address for the cache (unless the CPU explicitly sends an address). This counter is integrated with the first pipeline latch in order to reduce the latency of the controller.

When an address misses within the L1 cache, it takes 3 cycles for the miss signal to be acknowledged by the CPU, hence a three-stage pipeline was required. The interface between the L1 and L2 caches is asynchronous, allowing maximum design flexibility for the L2 cache controller. However, the addresses generated within or sent to the L1 controller are immediately sent out to the L2 controller, while the transmission of the address of a dirty line in the cache is delayed until the miss is handled by the L1 controller. This requires that the L2 cache controller also be pipelined.

The cache architecture for F-RISC/G requires a total of 32 Kbit of memory to accommodate the level one instruction and data caches (16 Kbit per cache). The access times of the cache must be fast enough to provide new instructions and data to the core processor every computation cycle (1 ns in length). Therefore, it was necessary to design the memory parts which comprise this 32 Kbit memory in the same HBT technology used in the core processor (the Rockwell 50 GHz AlGaAs/GaAs HBT baseline process). This process restricts the number of devices per chip to the extent that it is only possible to place 2 Kbit of memory on a single chip. This means that a total of 16 cache RAM chips are required to implement the level one cache memory. The cache RAM chip was originally designed using 10 instances of the register file designed for the data path chip. Due to the redundancy scheme incorporated into the design, the two additional register files could be swapped in if any of the original eight register files proved defective. The power requirements to implement this scheme, however, were too large, especially considering the fact that the cache chips comprise the majority of GaAs chips on the MCM. Therefore, the cache RAM chip was redesigned in order to reduce its power consumption. There were three main areas in which power savings were achieved: a new cache memory macro block was designed, a specialized boundary-scan at-speed testing scheme was designed, and finally, the redundancy in the cache RAM chip was eliminated.

The cache memory block design has been discussed in a previous Semi-Annual Technical Report, so the results will only be reviewed briefly. The original register file used in the data path chip is a 32x8 bit macro block. The cache macro block design for the new cache RAM chip was based on this original design, but widened to 32x16 bits. The access time for this memory macro can be significantly higher since at most one read access is required per macro block on any given processor cycle. Therefore, power could be reduced on the word line and bit line drivers at the expense of the memory access time. In addition, only one cache macro block is required for every two register file blocks used previously, eliminating a set of address drivers and decoders for each cache macro block used. Although adjustments must be made to compensate for the longer word lines in the new cache macro block, overall power savings are achieved through the widening of the macro block and by increasing the cache macro block's access time relative to that of the original register file block.
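
The resulting memory organization can be summarized with a little arithmetic; the figure of four macro blocks per RAM chip is taken from the read-path description later in this section, the rest from the text above.

    bits_per_macro  = 32 * 16       # widened cache macro block (was 32 x 8 register file)
    macros_per_chip = 4             # four cache macro blocks per cache RAM chip
    chips_per_cache = 8
    caches          = 2             # separate instruction and data caches

    bits_per_chip = bits_per_macro * macros_per_chip
    print(bits_per_chip)                                  # 2048  -> 2 Kbit per chip
    print(bits_per_chip * chips_per_cache)                # 16384 -> 16 Kbit per cache
    print(bits_per_chip * chips_per_cache * caches)       # 32768 -> 32 Kbit total
    print(bits_per_macro == 2 * (32 * 8))                 # True: one macro replaces two
                                                          # of the original register files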

The original boundary scan test scheme designed for the core processor chips was intended to test a synchronously clocked chip with a large number of critical input and output signals at speed. Therefore, a fair amount of logic was used to implement an on-chip tester which can store input test patterns in a boundary scan shift register, apply these patterns at the input pads of the chip in a precisely timed manner, run through a complete processor cycle of the chip, sample the results appearing at the output pads at the end of the cycle in a precisely timed manner, and finally shift the sampled results out of the boundary scan chain.

The nature of the cache RAM chip allows the design of a much simpler on-chip tester. This is because the cache RAM chip is asynchronous; there are no clocks which need to be controlled by the on-chip circuitry. Also, we are mainly concerned with the RAM access times for cache transactions involving the processor. Transactions with the level 2 cache occur much less frequently and are limited to a greater extent by the speed of the level 2 cache chips. This means that, since each cache RAM chip only has a four bit high speed data bus and a nine bit address bus for processor communication, there are relatively few signals which need special scan latches for test pattern storage and which must be precisely timed for at-speed testing of the chip.

Functional testing of the cache macro blocks and control circuitry is provided in a separate testing mode which utilizes a counter to provide block addresses and a shifter to provide data patterns which are written into the selected memory block. These patterns are then read out onto a special test pad where they can be viewed with an oscilloscope. Special tests are also provided to test the pads which communicate with the level two cache and to test the MCM wires. These tests take advantage of the cache macro blocks for pattern storage and thus require little additional hardware. The simplicity of this test scheme reduces the power consumed by the tester by an order of magnitude.
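
A toy software model of the functional-test pattern source described above is sketched below: a counter steps through addresses while a shift register supplies the data pattern written to, and later read back from, each location. The widths, the seed pattern, and the ideal memory model are illustrative assumptions, not the actual tester design.

    ADDR_BITS = 7                    # 128 words of 16 bits = 2 Kbit, one RAM chip's worth
    DATA_BITS = 16

    def pattern_generator(seed=0b1000_0000_0000_0001):
        """Yield (address, data) pairs: a counter for addresses, a shifter for data."""
        mask = (1 << DATA_BITS) - 1
        data = seed & mask
        for addr in range(1 << ADDR_BITS):
            yield addr, data
            # rotate the shift register one place to the left each cycle
            data = ((data << 1) | (data >> (DATA_BITS - 1))) & mask

    memory = {}
    for addr, data in pattern_generator():
        memory[addr] = data                              # "write" pass
    errors = sum(1 for addr, data in pattern_generator() if memory[addr] != data)
    print("mismatches:", errors)                         # 0 for an ideal memory model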

The redundancy inherent in the cache RAM chip, as originally designed, would require an additional cache macro block as well as multiplexer logic to allow selection of the cache macro blocks being used and control logic to select the working blocks. Because of the cache block layout, this design would make the cache RAM chip size prohibitively large. In addition, the power consumption of the additional block is significant. Use of two register file macros for the redundant logic block would make the cache RAM chip size less prohibitively large due to the shape of the register file layout, but would require additional devices and increase power requirements even further. An additional problem with the redundancy scheme is that the multiplexers and control logic required to effectively use the redundant memory must operate correctly for the chip to work at all, offsetting the yield enhancement produced by the addition of redundant memory. Also, the multiplexers create additional delay along the read access critical path of the chip. For these reasons, we decided to redesign the cache RAM chip with no redundancy.

The slower access time of the cache macro block with respect to the register file access time made the chip level design challenging. This is because the time of flight of signals on the MCM constrains the cache RAM chip read access time to 750 ps. Since the cache macro block design requires an access time of 450 ps, the propagation of the address bits from the address pad inputs to the cache macro blocks, the selection of the proper data output from the addressed cache macro block, and the propagation of the selected output data to the processor data out pads must be completed in 300 ps. Figure 13 illustrates the critical path for a read access of the cache RAM chip.

The address input pads are composed of latch and multiplexer pairs implemented in a single current tree. The purpose of the multiplexer in each pad cell is to select either the pad input during normal operation or an address bit on the boundary scan chain in testing mode. During normal operation, the latch is left open to allow the pad input to propagate to the cache macro blocks unhindered. In testing mode, however, the latch blocks the corresponding scan chain bit from propagating to the cache macro blocks until it is time to begin a test, at which time all the pad latches are opened simultaneously to present the test pattern. The buffers between the address receiver pads and the cache macro blocks are used to buffer the address lines and shift the common mode voltage level of the address signal pairs from level 2 to level 1. Shifting is necessary because the cache macro block address inputs require a level 1 input. The address pad receiver outputs cannot be at level 1 since they drive high capacitance wires; level 1 gates do not drive high capacitance wires as well as level 2 or level 3 ECL style gates. Thus, we would like to drive the address signals at level 2 for as long as possible on chip to reduce capacitive delays. Two sets of address buffers are used to reduce fanout and capacitive loading.

There are two sets of 4-input multiplexers for the selection of the cache output data. The first set of data output multiplexers selects one of four nibbles based on the two low order address bits. The second set of data output multiplexers then selects the data from one of the four cache macro blocks based on the value of the second and third address bits. 4-input multiplexers can only be implemented efficiently by constraining the multiplexer data inputs to level 1 inputs. This means that we are required to drive the outputs of the first set of data select multiplexers at level 1. Because the driving capability of gates with level 1 outputs for highly capacitive lines is not as good as the driving capability of level 2 or level 3 output gates, we would like to minimize the average lengths of these data output lines. Space considerations, however, require us to place the first set of multiplexers in the cache macro blocks near the sense amplifiers. The second set of data output multiplexers, therefore, is placed in the center of the chip to minimize the distance for the interconnect between the two sets of multiplexers. A level 2 output was selected for the second set of data output multiplexers to allow them to effectively drive the interconnect to the processor data output pads. The data output pads consist of a set of latches to store the data nibble and a set of drivers to place the data onto the pads.
The intrinsic delays of the components along the read access critical path, as well as the capacitively induced wire delays associated with the chip interconnect along this path, derived from Spice simulations, are shown in Figure 13. The capacitive delays induced by the interconnect account for a large percentage of the overall on-chip delay, making placement and routing of components along this path critical to achieving a fast enough read access. Capacitance extraction of the interconnect obtained from the route of the cache RAM chip was performed to obtain the capacitance values used in the Spice simulations. These simulations show that the total delay meets the design specification of 750 ps.

FIGURE 13: READ ACCESS CRITICAL PATH FOR CACHE RAM CHIP
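
The budget described above is simple enough to state explicitly; the split is taken directly from the figures quoted in the text.

    ram_chip_budget_ps = 750     # allowed cache RAM read access, set by MCM flight times
    macro_access_ps    = 450     # cache macro block access time
    on_chip_margin_ps  = ram_chip_budget_ps - macro_access_ps
    print(on_chip_margin_ps, "ps")   # -> 300 ps for address fan-in, the two levels of
                                     #    output multiplexing, and the data output pads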

L2 Cache Chips

The Level 2 (L2) cache for the F-RISC/G processor must meet a cycle time of 5 ns, including the time-of-flight across the MCM in getting to and from the Level 1 (L1) cache. It must be at least 32 Kbytes in size, in order to approximate an infinite cache as compared to L1, so that the statistical performance analysis is valid. Due to the wide 512-bit bus between L1 and L2, used to provide single-cycle transfers of a cache block, the L2 memories will be divided up into 4 chips for the data cache and 4 chips for the instruction cache. As with L1, it is not possible to simultaneously switch that many drivers on fewer chips and still maintain a good margin for voltage drop on the power supply.

The L2 cache mirrors the L1 cache, in that it is direct-mapped and contains separate instruction and data caches. There are controller chips which are separate from the memory chips, in each. A wide bus is used between L1 and L2, but this technique will not be used between L2 and the next higher level of memory. That interface will be determined by whatever off-the-shelf parts are used for L3, which will probably be standard DRAMs. The transfers between L2 and L3 will be multiple cycle transfers, due to the narrower bus.

The L2 and L1 caches operate asynchronously. When data must be written from L2 to the next higher level of memory, the sending of the new data requested by L1 is given priority over writing the "dirty" data back to L3. The processor is then allowed to continue in parallel with transferring the "dirty" data out to L3. Should the processor then require yet another block to be loaded into the same location before the previous writeback has completed, an exception occurs and the transfer of the new data to L1 must wait for the previous event to finish.

Two different foundries are currently being investigated for fabrication of the L2 cache memory chips and controller chips. The first is the soon-to-be-on-line 0.5 micron Bi-CMOS process from MOSIS which should be available early in 1995. Analysis is being done to see if this line can meet the L2 requirements for speed and cycle time, since it provides a relatively low-cost solution, which is readily available. The second choice being evaluated is a CMOS/SiGe HBT hybrid process from IBM. It provides very fast HBTs for use in critical circuits such as sense amplifiers, along with low power CMOS devices for the core memory cells. While it will easily meet the L2 chip requirements, it is a more costly process and may not really be necessary for the application. Yield is not an issue with either foundry, as 8 Kbytes of memory can easily be put on a single memory chip with either process. Power is the chief design constraint, as using low power in L2 will help to offset the amount dissipated by the other core chips on the MCM, and simplify the cooling scheme for the overall system.

MCM Layout and Discussion of Temperature Slack Requirements

The F-RISC/G high-speed circuits dissipate approximately the same amount of power during both static and dynamic circuit conditions. The efficient removal of this heat is essential for a number of reasons. The beta of the transistors decreases with increasing temperature at the same collector current (as shown in Figure 14), which results in slowing down the circuits. The increase in junction temperature also decreases the base-emitter forward bias voltage by 2 mV/°C. This will increase the voltage swings and the total power. Any synchronous circuit is designed with a timing margin to keep it safe from variations in fabrication process parameters and from delays incurred due to both on-chip and off-chip clock skew and signal rise time degradation. The circuit delays induced by the temperature increase will cut into this safety margin. This problem is most crucial in the critical paths of the circuit.

FIGURE 14: BETA VS IC CURVE FOR THE STANDARD Q1 TRANSISTOR

It was therefore imperative to model the temperature induced delay to predict the circuit performance at higher temperatures and accordingly insert slack into the critical path for the reliable operation of the chips at higher operating temperatures. A preliminary thermal analysis and simulation of the multichip module showed that the chip temperatures can go as high as 75 °C.

Therefore, for a first order approximation, circuits in the critical path were modeled and simulated at higher temperatures using new models for different temperatures provided by Rockwell. These circuits are the ALU register file, the L1 cache memory, and the carry chain adder. The variation of circuit speed with temperature is shown in Figure 15. As can be seen, the maximum variation in the speed is about 10% from its nominal value. Therefore, it was decided to require a 10% temperature slack for all critical path delays. The important thing to note here is that increased temperature only slows down the devices while the interconnect is not affected.

FIGURE 15: DELAY VS TEMPERATURE FOR THE CRITICAL CIRCUITS
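
One straightforward way to apply the 10% temperature slack to the MCM-level budgets listed in the next section is sketched below; dividing the whole budget by 1.10 is slightly conservative since, as noted above, the interconnect portion of each path is not affected by temperature.

    # Room-temperature design targets implied by a 10% thermal derating of the
    # critical-path budgets quoted later in this report.
    budgets_ps = {"Data Path Cache Cycle": 2250,
                  "Instruction RAM Cache Cycle": 2000,
                  "I3...I7 Cache Cycle": 1750}
    thermal_derating = 1.10

    for name, budget in budgets_ps.items():
        print(f"{name}: {budget} ps budget -> design to {budget / thermal_derating:.0f} ps nominal")
    # Data Path Cache Cycle: 2250 ps budget -> design to 2045 ps nominal, and so on.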

The modeling of temperature slack imposed new constraints on the timing of the critical paths in the F-RISC/G system, which in turn required a fresh inquiry into the placement and routing of all the chips on the multichip module. There are six critical paths on the MCM, as follows:

  1. Data Path Cache Cycle (2250 ps)
  2. Instruction RAM Cache Cycle (2000 ps)
  3. I3...I7 Cache Cycle (1750 ps)
  4. ID - DP instruction broadcast (250 ps)
  5. Deskew to ID, DP, L1-ICC, L1-DCC clock routing
  6. Broadcast of BRANCH from DP3 to other DP chips, ID and ICC (250 ps)

It turns out that satisfying the first two constraints will satisfy all the other constraints as well. The cache critical path is illustrated in Figure 16. This critical path corresponds to a read or write hit operation, and therefore does not involve the second or higher levels of cache.

FIGURE 16: MEMORY CRITICAL PATH

While the CPU critical path is a single-point broadcast from the instruction decoder to the four data path chips, the memory critical path is much more complicated. Components of the critical path are labeled A-H in the Figure 16 and are described in Table 2. Aside from the number of chips involved in the critical path, the combination of clocked and flow-through logic makes the analysis more complicated.

TABLE 2: CRITICAL PATH COMPONENTS

Delay   Components of Delay
A       Driver Delay + On-Chip Skew
B       MCM Time of Flight + Skew
C       Receiver Delay + 2 Multiplexor Delays + D-Latch Delay + On-Chip Skew
D       Driver Delay + On-Chip Skew
E       MCM Time of Flight + Skew
F       RAM Read Access Time
G       MCM Time of Flight + Skew
H       Receiver + D-Latch Delay + On-Chip Skew

In Figure 16, the components symbolized with rectangles are clocked, while those represented with circles are flow-through. The critical path represents a complete level 1 cache transaction from issuing of an address to the reception of the requested data.

The four datapath chips generate the address to be requested on phase 1 of the system four-phase clock. On phase 3 the address is assumed to be on the inputs of a D-latch within each of the two cache controller chips (instruction cache controller and data cache controller). From this constraint the first of two limiting equations for the memory critical path is:

500 ps ≥ A + B + C ( 0.1 )

This equation is based on the fact that the input latches of the cache controller chips and the output latch of the data path chips are clocked two phases apart. In the F-RISC/G prototype, each phase is 250 ps long, thus A + B + C < 500 ps. Once 500 ps is subtracted from the total allowed critical path time, the following constraints remain:

Instruction cache: 1500 ps ≥ D + E + F + G + H ( 0.2 )

Data cache: 1750 ps ≥ D + E + F + G + H ( 0.3 )

There is an additional constraint involving the instruction cache path which produces the SRCB instruction field. This path must be one phase faster than the other paths in the instruction cache.

MCM Floorplan

A thin-film Multi-Chip Module (MCM) was chosen as the package for the F-RISC/G prototype [Greu90, Phil93]. The initial MCM floorplan made several assumptions which later had to be modified or abandoned. Among them were that there would be a Byte Operations chip on the initial prototype, and that there would be four RAM chips per cache. The number of RAM chips per cache had to be increased due both to lower than expected yields in the RPI Test Chip run and to simulations which suggested that a very wide cache (512-bit blocks) would be necessary to achieve a CPI below 2. The initial floorplan is shown in Figure 17.

FIGURE 17: INITIAL MCM FLOORPLAN

In [Phil93] several assumptions were made in order to show that this floorplan would work based on the chip partitioning assumed. Chip dimensions were assumed to be 5 mm wide by 8 mm high, smaller than the currently assumed 1 cm x 1 cm. In addition, a dielectric with a dielectric constant of 2.7 was assumed.

Assuming a dielectric constant for Parylene of 2.65 [Maji89], the time of flight on the MCM would be 5.43 ps/mm. Using General Electric's "High-Density Interconnect" (GE-HDI) package [Hall93], it should be possible to mount the chips with an interchip separation of 1 mm.

Based on the floorplan shown in Figure 17 and a chip size of 1 cm x 1 cm, the MCM time of flight delays shown in Table 3 can be estimated.

TABLE 3: INITIAL TIME OF FLIGHT ESTIMATES

Estimated Distance (mm) Estimated Time of flight (ps)
Datapath to both cache controllers
45
244.35
Cache controller to 4 RAMs
39
211.77
4 RAMs to CPU
12
65.16

Assuming a receiver delay of 70 ps, a driver delay of 35 ps, a latch delay of 25 ps, a RAM access time of 750 ps, and a clock skew of 20 ps, the critical path would then be 1736.93 ps for each cache, representing 13% slack for the instruction cache and 23% slack for the data cache.

Unfortunately, yield figures from the fabrication of the RPI Test Chip [Phil93] were such that it was deemed necessary to use eight 2 Kb cache RAMs in each cache. A simple estimate of the effect of doubling the number of RAM chips can be made by assuming that the interconnect doubles in length; this would result in a critical path delay of 2013.86 ps.
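As a cross-check, the sketch below reproduces the Table 3 time-of-flight entries and the 2013.86 ps figure from the quoted distances and the 5.43 ps/mm Parylene value; the doubling is applied to the two RAM-related segments, which is consistent with the arithmetic above.

    # Time-of-flight estimates for the initial MCM floorplan (Table 3).
    C_MM_PER_PS = 0.2998          # free-space speed of light, mm/ps

    def tof_ps(length_mm, eps_r=2.65):
        # Signal velocity is c / sqrt(eps_r); Parylene eps_r = 2.65 gives
        # about 5.43 ps/mm, as quoted in the text.
        return length_mm * (eps_r ** 0.5) / C_MM_PER_PS

    segments_mm = {
        "datapath -> cache controllers": 45,
        "cache controller -> 4 RAMs":    39,
        "4 RAMs -> CPU":                 12,
    }
    for name, mm in segments_mm.items():
        print(f"{name}: {tof_ps(mm):6.2f} ps")      # 244.35, 211.77, 65.16

    # Doubling the RAM count doubles the two RAM-related segments, adding
    # roughly 277 ps to the 1736.93 ps estimate (about 2013.86 ps total).
    print(1736.93 + tof_ps(39) + tof_ps(12))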

FIGURE 18: BROADCAST FROM DATA PATH TO CACHE CONTROLLERS

Clearly, a new floorplan is necessary in order to achieve the desired cycle time. The floorplan which was arrived at is shown in Figure 18.

As can be seen from this figure, both cache controllers are now located directly beneath Datapath chips, reducing the distance of the address broadcast from 45 mm to approximately 31 mm.

FIGURE 19: CACHE CONTROLLER ADDRESS BROADCAST

By placing two sets of output address pads on each cache controller, it was possible to allow each controller to broadcast to two sets of four chips, as illustrated in Figure 19. This greatly reduces the time of flight delay as compared to broadcasting to all eight chips at once. The approximate distance for one of these four-chip broadcasts is 63 mm.

Finally, each Datapath chip need only communicate with 2 Data RAM chips, and each Instruction RAM can send its outputs directly to the Instruction Decoder as shown in Figure 20. From this diagram, the approximate length of the longest Instruction RAM to Instruction Decoder transfer is 3 mm.

FIGURE 20: CRITICAL DATA AND INSTRUCTION PATH TO CPU

Based on this new route and the previously specified assumptions, it is possible to calculate the complete critical path as shown in Table 4. The totals shown assume that the datapath and cache controller chips are clocked two phases apart on the A-C path. On-chip skew of 25 ps and off-chip skew of 10 ps are included between clocked latches. A rise-time degradation of 45 ps is assumed for off-chip connections.

It is interesting to note the effect of adding eight more RAMs to each cache. Assuming two extra sets of address pads were included on each Cache Controller chip, the effect on the address broadcast from the Cache Controllers to the RAMs would be to increase the path length by approximately 1 cm if the new RAMs were placed below the Cache Controllers. The distances that the data would have to travel to the Datapath chips and that the instructions would have to travel to the Instruction Decoder would increase similarly. As a result, the critical path would be too long and the target cycle time could not be met.

TABLE 4: CRITICAL PATH USING UPDATED MCM PLACEMENT

Delay    eR = 2.2     eR = 2.38    eR = 2.65    eR = 3.24
A        120 ps       120 ps       120 ps       120 ps
B        219.3 ps     225.4 ps     234.2 ps     252 ps
C        110 ps       110 ps       110 ps       110 ps
D        120 ps       120 ps       120 ps       120 ps
E        357.5 ps     370 ps       387.85 ps    424 ps
F        760 ps       760 ps       760 ps       760 ps
G        60.8 ps      61.4 ps      62.3 ps      64 ps
H        110 ps       110 ps       110 ps       110 ps
TOTAL    1908.3 ps    1921.4 ps    1940 ps      1978 ps


If only four chips were added, however, the effect on the cycle time could probably be minimized. Due to the organization of each RAM into four 32 bit x 16 bit cache RAM blocks, however, it is difficult to partition these RAMs usefully among the four Datapath chips: each Datapath chip sends and receives 8 bits of data, which would then have to be divided somehow among three RAMs.

Impact of 3D packaging

A major source of delay in the F-RISC/G system is the MCM interconnect, accounting for roughly 30% of the critical path. These delays can be minimized using 3D chip stacking. Two configurations are possible: a "short loaf," in which the chips are stacked on top of each other, and a "long loaf," in which the chips are turned on their sides and connected edge to edge. Reducing the delay by as much as 1000 ps would eliminate one pipeline stage and could improve the clock speed by as much as 25% with the same architecture.

Packaging

A future packaging possibility is the use of three-dimensional chip stacking. By placing chips in stacks it is possible to reduce interconnect delays. One possible technique would be to dig a trench in the MCM into which the chips would be placed, turned on their sides (Figure 21). Interposed routing and thermal conduction layers would be placed between the chips as needed. In this configuration, routing is only possible on three sides of the stack.

FIGURE 21: THREE DIMENSIONAL CHIP STACK ON MCM

Since the Cache RAM chip is not pad-limited, it should be possible to restrict its pads to only three sides without having a large effect on the overall chip area.

As the main component of communication delay on the MCM is the broadcast of the address from the Cache Controller to the Cache RAMs, it makes the most sense to apply three-dimensional stacking there first.

FIGURE 22: MCM LAYOUT WITH 3-D STACKING

Figure 22 shows one possible MCM floorplan utilizing 3-D stacking. In this configuration, the cache controller and associated RAM chips would share a stack.

FIGURE 23: POSSIBLE STACK ROUTING

TABLE 5: CRITICAL PATH USING 3D STACKING

Delay    eR = 2.65
A        70 + 20 = 90 ps
B        178.1 + 20 = 198.1 ps
C        35 + 2 x 25 + 25 + 20 = 130 ps
D        70 + 20 = 90 ps
E        55 + 20 = 75 ps
F        750 ps
G        Instruction Cache: 65 + 20 = 85 ps;  Data Cache: 92 + 20 = 112 ps
H        35 + 25 + 20 = 80 ps
TOTAL    Instruction Cache: 1498.1 ps;  Data Cache: 1525.1 ps

Figure 23 illustrates a possible way in which the stack may be routed. It should be possible to route address lines from the Cache Controller to the Cache RAMs and the 32 Data Buses using only one or two sides of the stack. This would leave the remaining one or two sides for the 512 bit L2 Cache Bus.

Using this technique, a rough approximation of the critical path may be made, as shown in Table 5. This represents a decrease in critical path length of approximately 15%. If the datapath chips were also stacked on top of each other, in a separate stack from the caches, the data cache critical path would decrease by approximately 30 ps more (Figure 24). In addition, it may then be possible to clock the Datapath and Cache Controller chips only one clock phase apart, resulting in a savings of another 250 ps. This would result in an overall critical path length of approximately 1220 ps, a savings of 30%. Further savings may be possible by stacking the Instruction Decoder with the Datapath chips.
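The Table 5 totals follow directly from summing the listed components; the short sketch below reproduces that bookkeeping using the values in the table (all in picoseconds).

    # Summing the Table 5 components for the 3-D stacked floorplan (eR = 2.65).
    common = {
        "A": 70 + 20,                 # driver + skew
        "B": 178.1 + 20,              # stack time of flight + skew
        "C": 35 + 2 * 25 + 25 + 20,   # receiver + 2 muxes + latch + skew
        "D": 70 + 20,
        "E": 55 + 20,
        "F": 750,                     # RAM read access
        "H": 35 + 25 + 20,            # receiver + latch + skew
    }
    g_segment = {"Instruction Cache": 65 + 20, "Data Cache": 92 + 20}

    for cache, g_ps in g_segment.items():
        total = sum(common.values()) + g_ps
        print(f"{cache}: {total:.1f} ps")   # 1498.1 ps and 1525.1 ps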

FIGURE 24: ALTERNATIVE 3-D STACKING LAYOUT

100 GHz Models, Power Curves, and the Round Emitter Transistors

Rockwell International is in the process of developing an AlGaAs-GaAs HBT process with smaller dimensions and significantly improved performance with respect to the current baseline AlGaAs-GaAs HBT (Q1). A preliminary Spice model was released by Rockwell for an HBT in this process with emitter stripe dimensions of 1.2 x 1.2 µm2 (referred to subsequently as Q1.2x1.2). Spice simulations reported in a previous semiannual report show that the gate delay of a buffer with no load capacitance from interconnect can be reduced by a factor of two using the Q1.2x1.2 device while consuming only half the power. However, because of the high switching speed of the HBT devices we use, capacitive effects associated with the interconnect cannot be neglected in determining a gate's switching speed. Changing the voltage on a gate's output involves adding or removing charge. This implies that when driving highly capacitive interconnect, more current, and hence more power, must be used to change the voltage on the line in the same amount of time. Thus, in order to understand the implications of implementing F-RISC/H in this new Rockwell process, we would like to know the relative power required to double the switching speed of a CML gate in this process, with respect to the power consumed by the baseline process, for a given interconnect load capacitance.

FIGURE 25: CML LEVEL 1 BUFFER SCHEMATIC

In order to simulate the switching speed of a CML gate, a Spice netlist for a chain of buffers was created. Each buffer consists of a differential pair driven by a resistive load current source (see Figure 25). Two pull-up resistors are provided to create two output signals with a differential voltage of about 240 mV. A differential voltage of this value is sufficient to force the majority of current through one device exclusively in the differential pair. This allows a digital value of one or zero, represented as either 240 mV or -240 mV differential voltage applied to the input of the buffer, to be propagated to the output. We are interested in the propagation delay of the second to last buffer in the chain with a specified interconnect load capacitance on the output lines of this gate. The last buffer in the chain serves as a typical load for the gate of interest. The other gates in the chain serve to present a realistic input to the gate of interest. The simulation consists of simply providing a differential pulse at the input of the chain and allowing it to propagate along the chain.

The first set of simulations performed used the Q1 device in a chain of low power buffers. The buffer design is one used in present digital circuits. The low power buffer was chosen for comparison because it draws a steady-state current of about 570 µA; the Q1.2x1.2 maximum current is limited to 690 µA. Higher currents through this device cause dopant redistribution, which destroys the device. Therefore, the Q1.2x1.2 can only be used in relatively low power CML circuits. In addition, FT simulations of the Q1.2x1.2 device show that its FT peaks at an operating current between 500 µA and 600 µA with a value of about 77 GHz (see Figure 26). This means we should see peak performance from the Q1.2x1.2 at current levels similar to those used in the Q1 low power buffer design. Figure 27 shows a plot of the FT of the Q1 device as a function of operating current. The FT value at the current level used by the low power buffer is about 39 GHz. This indicates that the unloaded gate delay of a Q1.2x1.2 low power buffer operating at the same current as a Q1 low power buffer should be about a factor of two smaller.

FIGURE 26: FT FOR Q1.2X1.2 AS A FUNCTION OF COLLECTOR CURRENT

FIGURE 27: FT FOR Q1 AS A FUNCTION OF COLLECTOR CURRENT

The simulations of the Q1 low power buffer chain were performed using a number of load capacitance values. The results, shown in Figure 28, illustrate that there is an intrinsic gate delay, indicated by the zero interconnect load capacitance result, as well as an additional delay which is linearly dependent on the load capacitance. This is expected since, during a gate transition, the input differential voltage approaches zero and then changes sign, at which point the transistor previously conducting current is cut off while the opposite transistor begins conducting. The current through this transistor approaches the level of the current source and remains relatively constant while bringing the output voltage down to the level 1 low value. Meanwhile, after the transistor which previously conducted current cuts off, current flows through the pull-up resistor to charge the other output line to a voltage level near VCC. Since the average transient current available to displace charge on the buffer output lines remains constant as the load capacitance changes, the delay due to interconnect capacitance scales linearly with the load capacitance value (t = CV/I). This explains the linear nature of the Spice plot of gate delay as a function of load capacitance.
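This linear behavior can be captured in a one-line model: propagation delay = intrinsic delay + C x dV / I. The sketch below uses the 240 mV swing and 570 µA tail current quoted above; the intrinsic delay value is an illustrative placeholder, not an extracted model parameter.

    # First-order CML buffer delay model: t_pd = t_intrinsic + C * dV / I.
    # Handy unit fact: fF * mV / µA works out to exactly ps.
    def buffer_delay_ps(c_load_fF, t_intrinsic_ps=60.0,   # placeholder intrinsic delay
                        swing_mV=240.0, tail_uA=570.0):   # swing and current from the text
        return t_intrinsic_ps + c_load_fF * swing_mV / tail_uA

    for c_fF in (0.0, 5.0, 10.0, 20.0):
        print(f"C_load = {c_fF:4.1f} fF -> {buffer_delay_ps(c_fF):6.1f} ps")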

FIGURE 28: Q1 LOW POWER BUFFER PROPAGATION DELAY AS A FUNCTION OF INTERCONNECT LOAD CAPACITANCE

Next, simulations were performed using a buffer chain utilizing Q1.2x1.2 devices. For a given interconnect load capacitance, the resistors of the Q1.2x1.2 buffers were adjusted to provide a particular steady-state current and an output differential voltage equal to that of the Q1 low power buffer. For each simulation the buffer propagation delay was compared with the propagation delay of a Q1 low power buffer with twice the load capacitance. After each simulation, the resistors were then readjusted and the simulation repeated until a propagation time of one half that of the Q1 low power buffer was found. This process was repeated for a number of load capacitances. The purpose of these simulations, as mentioned above, is to determine how much power is required to reduce the gate delay of a Q1.2x1.2 buffer by a factor of two from that of a Q1 buffer if the Q1.2x1.2 buffer sees half the interconnect load capacitance. We assume a factor of two decrease in load capacitance for the new process because the shrink in process design rules, combined with hand-optimized layout, should allow the average load capacitance on a gate to decrease by this factor. The results for this group of simulations are summarized in Figure 29.

FIGURE 29: RELATIVE BUFFER POWER REQUIRED TO DECREASE PROPAGATION DELAY OF Q1.2X1.2 BUFFER BY A FACTOR OF TWO WITH RESPECT TO A Q1 LOW POWER BUFFER

From the plot in Figure 29, one finds that the power of the Q1.2x1.2 buffer required to obtain the desired gate delay reduction is greatly diminished for small interconnect load capacitances with respect to the power required by the Q1 buffer. This is because, with low interconnect capacitance, the load capacitance is dominated by the device capacitances of the load HBTs. Because of the process shrink, the device capacitances of the Q1.2x1.2 are greatly reduced from those of the Q1 device. Thus, less charge needs to be displaced by a Q1.2x1.2 buffer to change the voltage on the output lines when the device capacitance dominates, and hence the power consumed by the buffer can be decreased without increasing the switching delay. As the load capacitance due to the interconnect increases, the device load capacitance becomes insignificant. At this point, the charge that must be displaced by the Q1.2x1.2 buffer is about half that which must be displaced by the Q1 buffer (due to the factor of two difference in interconnect load capacitance). Therefore, if the Q1.2x1.2 buffer is required to change the voltage on the output line by the same amount as the Q1 buffer in half the time, the ratio of charge to time is nearly constant (Q/t = CV/t = I). This implies that both buffers must have the same current drive to achieve the required performance improvement, which means that both buffers must consume the same amount of power when the load capacitance is dominated by the interconnect capacitance if the factor of two decrease in switching delay is to be achieved.
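The argument in the preceding paragraph can be summarized with a toy capacitor-charging model. This is an illustration of the trend only, not a SPICE result, and the device-capacitance numbers are placeholders.

    # Toy model: each buffer charges C_total = C_device + C_interconnect with
    # its tail current.  To halve the delay while seeing half the interconnect
    # capacitance, the Q1.2x1.2 buffer needs
    #     I_new / I_old = 2 * (C_dev_new + C_int/2) / (C_dev_old + C_int),
    # and at a fixed supply voltage, power scales with the tail current.
    def relative_power(c_int_fF, c_dev_old_fF=8.0, c_dev_new_fF=2.0):  # placeholder C values
        return 2.0 * (c_dev_new_fF + c_int_fF / 2.0) / (c_dev_old_fF + c_int_fF)

    for c in (0.0, 2.0, 6.0, 20.0):
        print(f"C_int = {c:4.1f} fF -> P(Q1.2x1.2) / P(Q1) ~ {relative_power(c):.2f}")
    # The ratio is well below 1 for small C_int and approaches 1 as the
    # interconnect capacitance dominates, matching the simulated trend.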

FIGURE 30: Q1 MEDIUM POWER BUFFER PROPAGATION DELAY AS A FUNCTION OF INTERCONNECT LOAD CAPACITANCE

Pspice simulations were also performed on a buffer chain utilizing Q1 medium power buffers with a number of interconnect load capacitance values. A Q1 medium power buffer draws a steady-state current of about 940 µA. The results of the simulations, shown in Figure 30, are similar to those obtained using the Q1 low power buffer. However, the slope of the line in Figure 30 is less steep than that observed in Figure 28, indicating that the Q1 medium power buffer has a smaller propagation delay than the Q1 low power buffer for a given interconnect load capacitance. This is expected since the Q1 medium power buffer has a higher current drive than the Q1 low power buffer, and hence can displace a larger amount of charge from the buffer output lines in a given period of time. Pspice simulations were then performed using a chain of Q1.2x1.2 buffers to determine the power required to achieve a factor of two reduction in propagation delay with respect to the Q1 medium power buffer, given that the interconnect load capacitance has decreased by a factor of two for the Q1.2x1.2 buffer. The results of these simulations, summarized in Figure 31, are similar to those found in Figure 29. The relative power consumption with respect to the Q1 medium power buffer is slightly higher than with respect to the Q1 low power buffer, however. This is in part due to the fact that the FT of the Q1 device is about 43 GHz at the current level of the medium power buffer (see Figure 27), while the Q1.2x1.2 device FT peaks at only 77 GHz. Hence, a larger driving current is required to alter the base charge more quickly to compensate for the inability of the device to react to the change in input voltage twice as quickly as the Q1 device. Also, the I-V characteristics of the Q1.2x1.2 device are less sharp than those of the Q1 device. Hence, the increase in base charge per unit area required to increase the collector current by a specified amount is larger for the Q1.2x1.2 device than for the Q1 device. Thus, when the load capacitance is predominantly device capacitance, the relative power with respect to the Q1 buffer increases as a function of the power of the Q1 buffer, due to the excess charge which must be displaced in the Q1.2x1.2 load devices. At high interconnect capacitance levels, the power consumption of the Q1.2x1.2 buffer approaches that of the Q1 medium power buffer for the same reasons described above in the low power buffer comparison. Note, however, that above a 6 fF interconnect load capacitance, the current required to achieve the factor of two reduction in gate delay is too large for the Q1.2x1.2 device to handle safely. This means that an HBT similar to the Q1.2x1.2 device is needed with a longer emitter stripe to allow the device to handle greater current. Although the larger size will increase the device capacitances, the greater current handling capability will make such a device more useful in CML logic which must drive interconnect with high capacitance.

FIGURE 31: RELATIVE BUFFER POWER REQUIRED TO DECREASE PROPAGATION DELAY OF Q1.2X1.2 BUFFER BY A FACTOR OF TWO WITH RESPECT TO A Q1 MEDIUM POWER BUFFER

The results of the power comparisons for implementation of our CML gate library in the new Rockwell HBT process, with respect to the implementation in the baseline HBT process, show that the power used by gates in the new process decreases for small interconnect load capacitances. These results are very encouraging when one considers that, in CMOS processes, the power consumed by a gate increases as a function of the switching rate of the gate. Even with high capacitance loads, a CML gate implemented with the Q1.2x1.2 devices will, at worst, consume the same amount of power as a similar gate implemented with the Q1 device, and yet retain the factor of two decrease in gate delay. It is important, however, to have good design tools in order to produce optimized layouts which insure that the average interconnect load capacitance on critical paths in a design decreases by a factor of two with respect to the interconnect load capacitances observed in baseline process designs.

Rockwell has recently introduced round emitter transistors with a slightly larger footprint than the standard Q1 transistors. These transistors show a peak FT in the range of 70-80 GHz, as compared to 50 GHz for the standard transistors. Therefore, it is possible to drop these transistors in place of standard Q1 transistors and obtain a speed improvement in the critical path with the same technology. The passive test chip contains ring oscillators made with these new transistors. Figure 32 shows the layout of an inverter using the round emitter transistors and the standard transistors. The round emitter transistors are only slightly larger than the standard transistors.

FIGURE 32: LAYOUT OF AN INVERTER WITH (A) ROUND EMITTER TRANSISTORS AND (B) STANDARD TRANSISTORS

FRISC System Emulation

Although the GaAs HBT main architecture chips have individually been rigorously designed and verified, a complete check of the architecture, including all the interchip interactions, had not been done. To insure that the whole architecture functions correctly, a Field Programmable Gate Array (FPGA) based emulator is under development. Each GaAs chip of the architecture is mapped by a translator from its netlist to a Xilinx FPGA compatible form for individual placement and routing within an FPGA chip. The interchip routing was implemented using a new technology from APTIX known as the Field Programmable Circuit Board (FPCB).

Emulator Hardware

The emulator is realized through the use of an Aptix AP4-FPCB, shown in Figure 33. This Field Programmable Circuit Board can contain up to 4 Field Programmable Interconnect Components (FIPIC), and up to 16 core Field Programmable Gate Arrays (FPGA's). Each chip in the actual architecture is mapped to one of the core FPGA's on the Aptix board. The FPGA's are interconnected through the four Field Programmable Interconnect Components (FIPIC's). With this system, any I/O pin on any FPGA can be routed to any other I/O pin on any of the other FPGA's. Thus, the complete architecture can be put into this programmable hardware which allows for easy modifications and verification.

Development Process

The first step in developing the emulator was the netlist translator. The translator (developed here by Sam Steidl) is used to move the GaAs schematics generated in the Compass tools over to the VIEWlogic CAD tools, which are used to interface with the Xilinx router for the FPGA's. The translator swaps each occurrence of a Compass primitive with an equivalent VIEWlogic primitive and preserves the interconnect between the primitives. Thus, we can be assured that the translator output (VIEWlogic schematics) is logically equivalent to the input (Compass schematics). The translated schematics can then be exported from the VIEWlogic tools, and the PPR router from Xilinx can generate a bit stream file which is used to program the FPGA's on the APTIX board.
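The core of the translator is a primitive-for-primitive substitution that leaves the connectivity untouched. The fragment below is only a schematic illustration of that idea; the primitive names and the one-line-per-instance netlist format are hypothetical, not the actual Compass or VIEWlogic formats.

    # Illustrative primitive substitution: map each Compass primitive name to
    # its VIEWlogic equivalent while passing the instance and net names through.
    PRIMITIVE_MAP = {            # hypothetical names, for illustration only
        "CML_NAND2":  "NAND2",
        "CML_DLATCH": "DLATCH",
        "CML_BUF":    "BUF",
    }

    def translate(netlist_lines):
        translated = []
        for line in netlist_lines:
            fields = line.split()
            if fields and fields[0] in PRIMITIVE_MAP:
                fields[0] = PRIMITIVE_MAP[fields[0]]    # swap the primitive
            translated.append(" ".join(fields))         # interconnect preserved
        return translated

    print(translate(["CML_NAND2 U1 A B Y", "CML_DLATCH L1 D CLK Q"]))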

The next step was to generate the interconnect between the chips. This interconnect matches the actual multi-chip module (MCM) routing between the chips in the original architecture. The interconnect is entered into the VIEWlogic CAD tool, where it references the previously mentioned FPGA schematic files. It is then exported in the same manner as the chip files, only now the APTIX router is used, and a bit stream is created which is used to program the FIPIC's. With these netlists exported, the entire processor can be simulated using Viewsim (the VIEWlogic digital simulator).

The outputs from the routers are then brought over to the PC where the Aptix development system resides. The FIPIC's are programmed on the Aptix board from the PC via serial communication with the Aptix Host Interface Module (which in turn communicates with the AP4 Aptix board). The FPGA's are then programmed using the Xchecker cable and download program provided by Xilinx. Once the clock signal is applied to the Aptix board, an automatic startup sequence is activated which initializes the processor. This startup sequence was designed and routed into another FPGA. Once the processor is initialized, programs can begin execution on the emulator. The Aptix Development System automatically routes up to 64 lines on each FIPIC to an interface pod which is connected to the HP logic analyzer. The logic analyzer is then automatically configured, and signals are automatically labeled via serial communication with the PC. This saves the several hours that would otherwise be required to hook up logic analyzer probes and configure the analyzer, so the system can be tested in a relatively short period of time.

Emulator Results

The translator has proved to provide reliable netlists and has shown itself to be a very valuable tool for interfacing our two CAD systems. Although the testing procedure is not yet complete, the emulator has already revealed a minor problem with the instruction decoder of the FRISC processor. The swift identification of this problem made its correction a painless task. A series of test routines has been developed using the software simulator for the FRISC processor. These routines exhaustively test every feature of the architecture. Once it has been verified that every test routine runs correctly, we can be assured that the final GaAs architecture is correct.

Considerations

While the emulator and the HBT FRISC processor are logically equivalent, timing errors may occur since the FPGA's and the FIPIC's in no way preserve the relative delay characteristics of the GaAs chips or the MCM. Thus, if an error occurs on the emulator, it must first be determined whether it arises because the original chips depended upon different delay characteristics. If this is the case, the proper delays must be introduced into the emulator for it to operate correctly. This is an unlikely case, however: since the processor is a synchronous system, if the emulator system clock is slow enough (less than 6 MHz), there should be no timing problems on the emulator.

FIGURE 33: APTIX AP4-FPCB

Design Verification Programs of F-RISC/G

The purpose of the verification programs is to verify the design of F-RISC/G. Each verification program is provided with its correct output results. A verification program has executed correctly if its results (obtained in the cache memory) coincide with the given correct results.

The verification programs are written in F-RISC/G Assembler and they can be used for design verification of F-RISC/G at any of the following stages:

a) design verification of a mathematical model of F-RISC/G, i.e. of a software description of F-RISC/G;

b) design verification of a physical model of F-RISC/G, i.e. of a hardware model of F-RISC/G realized on the Aptix board;

c) design verification of the real F-RISC/G chips.

A set of 198 verification programs has been written for design verification of F-RISC/G. The total volume of these verification programs is about 15,600 instructions.

The design verification programs are written according to the following three main principles:

1. A set of verification programs must be complete, i.e. it must verify the execution of every possible F-RISC/G instruction (the quantity of F-RISC/G instruction variations, shown in Table 6, is 5,710).

2. A set of verification programs must form a strict sequence. A few verification programs at the beginning verify the diagnostic core of the machine, that is, the "Register File - Cache Memory - Register File" and "Register - Register" communication in F-RISC/G. Each subsequent verification program in the sequence uses only F-RISC/G instructions that have already been verified by the previous programs, plus a few new instructions to be verified. This approach makes the diagnosis of design errors easier.

3. The result of each simple step of the verification program must be stored in the cache memory. This also makes the diagnosis of design errors easier, because wrong data at a specific address of the cache memory determines exactly which one or few instructions were executed incorrectly (see the sketch below).
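A sketch of this checking convention follows; the addresses and data values are illustrative only, and in practice the expected results accompany each verification program.

    # Compare the cache memory contents left by a verification program with
    # the expected results; a mismatch pinpoints the offending instruction(s).
    expected = {0x0100: 0x0000002A, 0x0104: 0x00000007}   # illustrative values

    def check(cache_dump):
        return [(addr, want, cache_dump.get(addr))
                for addr, want in expected.items()
                if cache_dump.get(addr) != want]

    observed = {0x0100: 0x0000002A, 0x0104: 0x00000006}   # one wrong result
    for addr, want, got in check(observed):
        print(f"address {addr:#06x}: expected {want:#010x}, got {got:#010x}")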

TABLE 6: QUANTITY OF VARIATIONS OF THE F-RISC/G INSTRUCTIONS

Instruction   Quantity      Instruction   Quantity
ADD           24            JUMP          4224
SUB           24            LOAD          44
AND           32            PSWAND        16
OR            32            PSWOR         16
XOR           32            PSWXOR        16
ADDI          24            NOOP          1
SUBI          24            SHIFTL        12
ANDI          32            SHIFTA        12
ORI           32            ROTATE        12
XORI          32            STORE         44
BRANCH        768           TRAP          192
INTPC         1             TRAPI         64

Most of the verification programs verify the execution of the possible F-RISC/G instructions. Three additional verification programs perform more complicated calculations, namely:

1) addition of the elements of an array stored in the cache memory,

2) sorting the elements of an array stored in the cache memory into increasing order,

3) multiplication of two operands.

Some of the verification programs were already used for the verification of the mathematical model of F-RISC/G and for the verification of the physical model of F-RISC/G.

Conclusions

During the current contract period several key problems were located in the design of the F-RISC/G architecture chips. These included timing deficiencies due to underestimates of the impact of 3D capacitance effects in the wiring, which stimulated the use of new 3D capacitance codes developed by Prof. Yannick Le Coz and his colleague Ralph Iverson. Considerable effort to revise the cache memory critical paths resulted from these studies, and the datapath register file was also revisited. A second collection of errors was located through the use of an APTIX Field Programmable Circuit Board and FPGA emulator technology. Several refinements of the back translator which produces this emulation directly from the layouts of the GaAs chips were also required to insure the correctness of the emulation. A series of four new or revised test chips was submitted under HSCD funding to obtain further yield and performance monitoring of the Rockwell process, in order to judge whether the time is propitious for submission of the main architecture chips for fabrication.

Finally, the 100 GHz models and the new 80 GHz round emitter variants of the 50 GHz process were explored to determine whether they can be used in faster clocked versions of the F-RISC architecture. The 100 GHz models suggest that this HBT is not yet fully optimized, since only the maximum frequency, and not the transit time frequency, appears to be at 100 GHz.

References

[Greu90] Greub, H. J., J. F. McDonald, and T. Creedon. "Key components of the fast reduced instruction set computer (FRISC) employing advanced bipolar differential logic and wafer scale multichip packaging." IEEE Bipolar Circuits and Technology Meeting, Minneapolis, Minnesota, pp. 19-22, 1988.

[Phil93] Philhower, R. "Spartan RISC Architecture for Yield-Limited Technology." Ph.D. dissertation, Rensselaer Polytechnic Institute, Troy, New York, December 1993.

[Sze81] Sze, S. M. "Physics of Semiconductor Devices, 2nd Edition" John Wiley & Sons, New York, New York, 1981.

[Maji89] Majid, N., Dabral, S., and McDonald, J. F. "The Parylene-Aluminum Multilayer Interconnection System for Wafer Scale Integration and Wafer Scale Hybrid Packaging." Journal of Electronic Materials, Vol. 18, No. 2, 1989.

[Hall93] Haller, T. R., et al. "High frequency performance of GE high density interconnect modules." IEEE Transactions on Components, Hybrids, and Manufacturing Technology, Vol. 16, No. 1, pp. 21-27, February 1993.