F-RISC/G and Beyond -- Subnanosecond

Fast RISC for

TeraOPS Parallel Processing Applications



ARPA Contract Numbers DAAL03-90G-0187,

DAAH04-93G-0477,

[AASERT Award DAAL03-92G-0307 for Cache Memory]

Semi-Annual Technical Report

April 1993 - October 1993



Prof. John F. McDonald

Center for Integrated Electronics

Rensselaer Polytechnic Institute

Troy, New York 12180

(518)-276-2919
FAX (518)-276-8761
MACinFAX (518)-276-4882
e-mail: mcdonald@unix.cie.rpi.edu


Abstract

The goal of the F-RISC/G (Fast Reduced Instruction Set Computer - version G) project is the development of a one nanosecond cycle time computer using GaAs/AlGaAs Heterojunction Bipolar Transistors (HBT). During this contract period the primary activities were the final completion and tape-out of the three architecture chips for the FRISC/G one nanosecond machine, and the development of several additional small test "chiplets" containing subcircuits from these and other key integrated circuits. The artwork for these chips has been assembled in one of the larger (quad sized, 20 mm x 20 mm) reticles now available at Rockwell to make a multichip fabrication run. The three architecture chips are the datapath byte chip (with an internal 32 byte register file), the FRISC/G instruction decoder, and an 8 bit slice of the Level 1 (L1) cache chip. The test chiplets include a separate circuit to test the "at-speed" boundary scan circuitry and verify its ability to test large chips, and special test circuitry to verify that the adaptive deskew circuit functions correctly with 2 GHz clock pulses; the deskew circuit is linked to the boundary scan circuit so that a single cycle of clocked execution can be initiated at speed, with proper deskew, between scan-in and scan-out operations for test purposes. Additional test chiplets contain a single block of the cache memory chip, a circuit designed to probe the upper limits of the speed possible with the 50 GHz Rockwell baseline process (a high speed VCO with a frequency multiplier to reach 20 GHz), and a revised, improved version of the original test chip submitted in 1992. Inclusion of RPI's Test Chip '92 on a second fabrication run is motivated by the need to recheck the previously diagnosed circuits, which under-performed in the first fabrication run. In that early run Rockwell confirmed that the fT was only about 70% of the expected value. In addition, the yield was not up to expectations. The new fabrication run involves several yield enhancing changes. The new reticle is not only larger, but the Canon stepper is expected to be more accurately aligned, which should lead to higher yields. By including the same test chip on this second run we obtain a direct monitor of the changes in yield resulting from this change and the other changes accompanying it in the Rockwell line, which include the switch to four inch wafers and improvements in silicon nitride processing. Additional work includes development of a translator for creating an FPGA based emulator for each of the FRISC/G architecture chips, making software development easier as well as verifying the logical correctness of the FRISC/G and successor designs. Thermomechanical analysis of the FRISC/G MCM package has also been pursued. Finally, a MESFET emulator of FRISC/G has been fabricated.

Project Goals

> Exploration of the Fundamental Limits of High-Speed Architectures.

> Study of GaAs HBT for Fast Reduced Instruction Set Computer (F-RISC) Design including Adequacy of Yield, and Device Performance for High Performance Computing Applications.

> Research in the Architectural Impact of Advanced Multichip Module (and 3D) Packaging in the GHz Range.

> Examination of Pipelining across Pad Driver Boundaries as a Means for Reducing Adverse Effect of Partitioning due to Yield Limitations in Advanced Technologies.

> Investigation of Power Management in High Power Technologies.

> Study of Appropriate Memory Organization and Management for Extremely Fast RISC Engines using Yield Limited Technologies.

> Use of Adaptive Clock Distribution Circuits for Skew Compensation at High Frequencies.

> Exploration of Superscalar and VLIW organizations in the sub-nanosecond cycle regime.

> Exploration of a combination of HBT and MESFET technologies for lower power, higher yield, but fast cache memory [AASERT Program].

> Exploration of novel HBT technologies such as HBT's in the SiGe and InP materials systems.

Milestones for next 12 months

> fabrication of reticle

> documentation of test sets and procedures

> upgrade of test equipment:

. . . two six channel probes

. . . two micro-manipulators with tilt adjustment

. . . construction of test jig for boundary scan testing

> characterization of devices

> testing of architecture chips with boundary scan logic

> testing of test chips with built in self test logic

> identification of known good dies for MCM insertion

> design of circuits for next reticle run

Publications

K. Nah, R. Philhower, H. Greub, and J. F. McDonald, "A 500 ps 32x8 Register File Implemented in GaAs/AlGaAs HBTs", GaAs Symposium 1993, Technical Digest, pp. 71-74.

C. K. Tien, K. Lewis, R. Philhower, H. J. Greub and J. F. McDonald, "F-RISC/I: A 32 Bit RISC Processor Implemented in GaAs HMESFET SBFL", GaAs Symposium 1993, Technical Digest, pp. 145-148.

Introduction

The F-RISC/G (Fast Reduced Instruction Set Computer - version G) project has as its goal the development of a one nanosecond cycle time computer using GaAs/AlGaAs Heterojunction Bipolar Transistor technology. More generally, the project explores the question of how bipolar circuits can achieve higher clock rates than are expected from silicon based CMOS. Traditionally, CMOS has achieved its increasing clock rates from lithography improvements, shortening the lengths of devices and interconnections to achieve higher speed. Bipolar devices, on the other hand, have achieved their speed by reducing the thickness of various device layers, with more recent improvements coming from band gap engineering in heterostructures and schemes to include built-in acceleration fields in the base transit region (graded base techniques). Since the thicknesses of the various layers in semiconductor processing can be minimized with proper yield engineering, short device transit times (or high transit frequencies) in principle favor the bipolar device. This is often quantified, at least for analog applications, by the closely related unity current gain frequency, f_T:

f_T = 1 / [2π (τ_E + τ_B + τ_C + τ_CC)]        (1)

where τ_E is the emitter charging time, τ_B the base transit time, τ_C the collector depletion layer transit time, and τ_CC the collector charging time.

The final device performance, however, also depends on other device parasitics, notably the base resistance, r_b, and the collector-base junction capacitance, C_bc, as quantified in the so-called maximum oscillation frequency, or unity power gain frequency, in the emitter-up configuration:

f_max = √( f_T / (8π r_b C_bc) )        (2)

These two canonic frequencies are often quoted in the literature for analog circuit performance, and they give some feeling for how fast the basic device will respond. One must use caution, however, as these parameters often do not give a clear picture of the speed of logic circuits built from these devices. In addition to these "intrinsic" HBT parameters, logic circuit performance also depends on the burden of wiring capacitance and circuit input loading feasible with the technology. Nevertheless, f_max is considered a better indicator of logic circuit performance than f_T.

It is possible to achieve clock frequencies in digital circuits close to 25% of f_T in small circuits such as frequency counters and serial-to-parallel converters. Hence, 12 GHz serial-to-parallel converters have been fabricated in the 50 GHz technology using fully differential logic circuits. Unloaded gate delays often approach 1/f_T; hence the unloaded fast gate delay in the same 50 GHz technology is around 18 ps. Of course, the circuits achieving these upper limits in speed also have high power dissipation.
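
As a rough illustration, the short Python sketch below applies these rules of thumb (the 25% of f_T clock limit and the ~1/f_T unloaded gate delay, both taken as stated above) to the 50 GHz process; it is only a sanity check of the quoted figures, not a circuit simulation.

# Rule-of-thumb numbers for the 50 GHz baseline process, using the
# approximate relations quoted in the text.
f_T = 50e9                       # unity current gain frequency [Hz]
max_clock = 0.25 * f_T           # ~12.5 GHz, consistent with the 12 GHz converters
gate_delay = 1.0 / f_T           # ~20 ps, close to the ~18 ps unloaded gate delay
print(f"max small-circuit clock ~ {max_clock / 1e9:.1f} GHz")
print(f"unloaded gate delay     ~ {gate_delay * 1e12:.0f} ps")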

Interestingly, f_max can be larger or smaller than f_T. It can be larger if the base resistance, r_b, is small enough. r_b is an important parameter in circuit applications, and lowering it is the subject of several recent articles. The Rockwell HBT has several unique features which focus on lowering this resistance. Recognizing that the base resistance has two components, one located in the active base region (intrinsic) and one accounting for the base extension out to the base contact (extrinsic), the Rockwell HBT uses a thicker layer for the extrinsic base, and efforts to minimize the distance between the emitter and the base contact are in progress.

Traditionally, CMOS has held an advantage in power dissipation. However, conventional or dynamic CMOS uses a rail-to-rail logic voltage swing, and at higher frequencies the dynamic power dissipation of this style of CMOS circuit can become a limiting factor, particularly for high-end computing applications. Bipolar circuits, on the other hand, have a high static power dissipation, but can be arranged to have very low dynamic power dissipation in actual logic circuits by using much lower voltage swings. Traditionally the high power dissipation of bipolar has been required in order to keep emitter current densities high enough for high gain. However, bipolar technology can also be scaled using more aggressive lithography than the current 1 µm minimum feature size, so current density can be kept high while lowering the total current and therefore the static power. A comparable scaling of interconnections is also assumed in making this statement. At some point the dissipation per gate of these two technologies will intersect, and the question will then be which one delivers the highest computational rate for the lowest total power. The answer to this question is surprising.

We note that the present F-RISC/G 32 bit integer engine dissipates about 250 watts up to and including L2 cache chips and produces 1000 MIPS. A DEC Alpha dissipates 30 watts, with another 30 watts in cache at L2, for a total of 60 watts. The CMOS Alpha is a 64 bit architecture which we will ignore in the comparison since it is rare that 64 bit integers are required. Nevertheless, allowing for a factor of 2.5 between the MIPS rates for these two processors, the Alpha at 1000 MIPS would be about 150 watts vs. 250 for the single FRISC/G engine, hence the crossover is very close. At higher clock rates the relative power per MIPS comparison could easily go the other way. We shall return to this issue when we discuss InP and SiGe HBT technologies in shrunk lithographies where this situation will prevail.
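
The comparison above amounts to the following arithmetic, sketched in Python using only the numbers quoted in this section (the 2.5x factor between the two MIPS rates is the stated assumption):

frisc_g_power = 250.0     # W, F-RISC/G integer engine including L2 cache
frisc_g_mips  = 1000.0    # MIPS
alpha_power   = 60.0      # W, Alpha CPU plus L2 cache
mips_ratio    = 2.5       # assumed factor between the two MIPS rates

alpha_power_at_1000_mips = alpha_power * mips_ratio            # ~150 W
print(f"F-RISC/G:                  {frisc_g_mips / frisc_g_power:.1f} MIPS/W")
print(f"Alpha scaled to 1000 MIPS: {1000.0 / alpha_power_at_1000_mips:.1f} MIPS/W")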

Noise is another key consideration in predicting the future of computer technology. CMOS is not a balanced (continuous) current logic technology. Devices in CMOS conduct current only while loads are charging. Once these capacitive wiring loads are charged or discharged the current flow ceases; hence, currents in conventional or dynamic CMOS circuits are constantly switching on and off. This causes transient current surges in the power distribution system, leading to switching noise due to parasitic inductances in the power supply. Bipolar circuits arranged to redirect a constant current through differing paths in current trees can dramatically reduce this noise. This constant current is also the reason bipolar circuits have a high steady-state power loss. Thus the bipolar designer must attempt to use the logic units in every cycle because, unlike in conventional CMOS, each current tree unit dissipates static power even when it is not used. The main path to reducing this power loss is through bipolar device scaling and architectures that efficiently use a small set of logic units. Current Mode Logic (CML) provides such a rich family of functional cells using only modest numbers of transistors.

Current Mode Logic is best described as a "current steering" circuit rather than a "switching" circuit, because the current in the system is held constant and is simply steered through different paths in the circuit depending on the logic state. In switching circuits the currents are switched on and off, leading to switching noise. It is possible to implement current steering with CMOS devices using unconventional circuits called Current Steering Logic (CSL). However, these do not implement fully differential logic with the very low voltage swings feasible in bipolar CML circuits. Even with E/D-mode NMOS it is possible to make silicon MOSFET equivalents of GaAs SCFL current tree circuits, but these circuits also do not get the voltage swing down to the levels possible with the bipolar device. In addition, many circuit tricks that are possible in conventional or dynamic CMOS do not carry over to SCFL or CSL styles of circuit design.

Industrial assessments appear to predict that the Si bipolar device alone cannot offer performance advantages relative to Si CMOS. Only the HBT technology appears to offer enough additional avenues for speed improvement; thus it has become the primary focus of our effort. However, Heterostructure MESFET's, or High Electron Mobility Transistors (HEMT's), could also play an important role in high performance computing. The transconductance of the FET is much lower than that of the HBT. For example, the transconductance of a MESFET can be as high as 600 mS/mm, while the bipolar device offers transconductances as high as 20,000 mS/mm. As a result, loading effects in HBT circuits, while not negligible, are considerably lower than in CMOS or FET circuits. Lightly loaded MESFET circuits can perform well, just like lightly loaded, deep submicron CMOS. However, even in moderately loaded circuits interconnect loading can dominate the performance. Nevertheless, the lightly loaded situation can be significant, especially in regular structures such as memories. Hence, our effort should eventually encompass a combination of HBT and MESFET or HEMT device technology. This is expected to impact primarily the cache memory, where the core of the memory could consist of MESFET cells, while the decoders and sense amplifiers could be implemented with fast HBT devices which preserve the low switching noise and low voltage swings desired.

Of course, none of these arguments is convincing unless the fabrication yields of HBT circuits are compatible with the design of processors. Based on yield projections made by Rockwell, our effort is currently focused on building block components for a partitioned RISC design with roughly 7000-10,000 HBT's per chip. Such chips should be expected to yield in the range of 10-25%. Cache memory is more challenging, and will demand higher HBT counts near 14,000. The inherent device yields should be high using OMCVD processing as the oval defect density (the technology specific yield detractor) is exceedingly low at 5 ovals per square centimeter. Theoretically, this should make it possible to fabricate 100,000 HBT devices. However, doping and thickness uniformity problems, and interconnect defects appear to mask this improved state of affairs in the basic materials for GaAs/AlGaAs technology.

Evolution to 100 GHz GaAs/AlGaAs HBT Technology

The work conducted on the FRISC/G project has relied on Rockwell's 50 GHz baseline HBT process. However, recent evolutionary improvements in the Rockwell process have led to a 100 GHz process. The exact process changes which have led to this improvement have not been disclosed. However, from equations (1) and (2) it should be evident that some of this improvement has come from reducing the base transit time, we surmise by making the base thinner. The thinner base increases the fields within the base, which might appear unfavorable because at high field strength the drift velocity of GaAs moves closer to that of silicon. However, the base may now be so thin that electrons suffer few collisions while crossing it, so the saturation drift velocity is no longer relevant. It is possible that for sufficiently small transit regions the GaAs/AlGaAs system will still be better than silicon at high field strengths.

According to Rockwell, the 100 GHz devices fit into the same layout outline as the earlier 50 GHz devices, i.e. there was no lateral device shrinkage, so one could map the entire FRISC/G layout onto the new devices and enjoy improved speed. Early simulations with this device, however, do not reflect a doubling of circuit performance. Note that not only did the device not shrink, but the interconnection process has not shrunk either. Because there has been no lateral device or interconnection shrinkage, we estimate that the new transistor can provide at best a 35% increase in speed in lightly loaded circuit situations. Nevertheless, this speedup could be combined with future yield improvements to attain a doubling of clock frequencies for FRISC. Hence, we have begun a low-key effort to take advantage of this new device. Yield would have to double in the Rockwell line to achieve a clock rate doubling with the new transistor. Interestingly, if the yield were to quadruple one would not even need the new transistor to double clock rates. Nevertheless, we are eager to take advantage of improvements wherever they materialize.

Since a yield improvement would permit incorporation of a floating point unit into the design, a plan for migrating the simple integer FRISC/G into a superscalar design has begun. The new processor, called FRISC/H, could have significant commercial potential since it could produce between 4000 and 6000 peak MIPS at perhaps only twice the power level of FRISC/G. Clearly, in this range of performance the GaAs/AlGaAs HBT FRISC/H would equal the MIPS per watt ratio of an equivalent array of 10 Alphas, though in 32 bit format.

SiGe and InP HBT Technologies

The GaAs/AlGaAs material system is not the only one in which HBT's may be fabricated. In recent years HBT's with exciting characteristics have been fabricated in the SiGe materials system at IBM. For several years TI and AT&T have been working in the InP/InGaAs/AlGaAs alloy system, and some of the fastest HBT devices reported to date have been fabricated in this system. Interestingly, successful complementary MESFET and HEMT devices have also been reported in the same systems, although not necessarily simultaneously. In the case of SiGe, however, this exact combination has been realized and well characterized at IBM at 0.5 micron emitter sizes, where lower static power dissipation is possible in the HBT. These materials systems require a good lattice match among the crystalline semiconductor species. For both of these systems one finds not only a good lattice match, but various alloying ratios where similar lattice matching is possible.

Reticle Overview

Figure 1. RPI Reticle 93

The reticle prepared for the new 4 inch Rockwell GaAs/AlGaAs HBT line contains three large architecture chips and six small test chiplets. Rockwell's new Canon stepper provides a large 20 mm by 20 mm reticle, and thus allows a much larger payload. Further, the new stepper provides much tighter overlay alignment control than the old one. The upgrade to 4 inch wafer processing and the new stepper have already proven to improve yields on Rockwell's MESFET line. Hence, we can anticipate better yield for the HBT baseline process as well. Figure 1 shows the reticle floor plan and Table I lists the chips, chiplets, and test keys included on the RPI reticle.

The three architecture chips consist of an 8 bit slice of the 32 bit FRISC datapath, the FRISC instruction decoder, and a 2 Kbit primary cache memory chip. The datapath chip contains a 32x8 bit register file, an ALU, a program counter, a status register, several pipeline registers, feed forward logic, and seven program counter history registers. The instruction decoder contains the instruction decoders for the seven pipeline stages, the register tag and feed forward control logic, and the interrupt and CPU state control logic. The instruction decoder is the most complex chip as far as the design is concerned since it has to handle interrupts and exceptions as well as pipeline stalls. The 2 Kbit cache memory chip has the largest device count of the chips on this run. It contains 9 low power register file memory blocks, of which eight must be working to obtain a working cache memory chip. A redundant 9th block was included since yield predictions showed that, without redundancy, the yield would be too low to obtain a working cache memory chip from a run of only eight 4 inch wafers.

RPI Reticle 93
Architecture Chips:  8 bit Datapath Chip
                     Instruction Decoder
                     2 Kbit Cache Memory Chip
Test Chiplets:       RPI Test Chip 92
                     Deskew Test Chip
                     Boundary Scan Test Chip
                     High Speed Test Chip
Test Keys:           RPI Test Keys
                     Rockwell Test Keys

Table I

In addition to the architecture chips, six small chiplets are included on the reticle. A copy of the RPI Test Chip '92 has been placed on the reticle for reference. This chip was tested last year and all circuits were found to be functional. This chip verified our CAD tools and the differential Current Mode Logic (CML) library. However, the devices from the previous run had an fT of only 30 GHz and the yield was too low to get a fully working register file.

The deskew test chip contains two skew compensated clock distribution channels controlled by a Phase Locked Loop (PLL) with voltage controlled delay lines and additional circuitry to verify the skew compensation scheme with minimal test equipment.

The boundary scan test chip contains several ring oscillators built with standard cells of different power levels and the novel boundary scan logic with additional delay test circuitry. The purpose of this chip is to verify the speed or delay test capability and the accuracy of the novel boundary scan circuitry.

The cache memory block test chip contains a cache memory block with additional test circuitry to accurately measure the access time of the memory block through a frequency measurement and to extensively test the memory block with different bit patterns.

The reticle contains a high speed test chip that contains a Voltage Controlled Oscillator (VCO) with a frequency multiplier (2X, 4X), a static divider chain, and high bandwidth receivers and drivers.

In addition to the Rockwell test keys, we have included on the reticle test keys for S-parameter measurements of all active devices. Besides the test sites with the active device, a test site without the device but with all the interconnect is provided for de-embedded S-parameter measurements. The pad and probe parasitics are sufficiently large to require calibration sites to remove their effect from the measurements. These "empty" calibration sites are missing in the Rockwell test key set.

All the architecture chips contain the new boundary scan circuitry with additional test circuitry to measure delays with a resolution of about 50 ps. Thus the chips can be tested without expensive test equipment, in die or wafer form, using multichannel Cascade probes. Known good dies can then be mounted on the MCM and the same boundary scan and speed test circuitry can be used to verify that the dies are still working properly after MCM insertion. The chiplets have been designed with on chip test circuitry such that they can also be tested with a sampling scope and one or two multichannel Cascade probes.

The major chip design work since the last report includes:

> the inclusion of boundary scan features in the architecture chips.

> the inclusion of redundancy in the cache memory chip.

> the resizing of all output drivers to increase the nominal voltage swing from 300 mV to 450 mV. This was necessary because a paper on the measured losses of GE MCM interconnects reported much higher surface resistance and attenuation than expected.

> the design and layout of five small chiplets and a set of device test keys for the coming reticle run.

Besides the 4 inch wafer processing upgrade, Rockwell has also added a third layer of interconnect to the HBT process. The third metal layer is thicker than the metal 1 and metal 2 layers, and thus has lower resistance, and it is separated by a thicker polyimide dielectric layer. However, its minimum line geometries are much coarser than those of metal 1 and metal 2, so metal 3 is used mainly for power rails. We have taken advantage of the lower metal 1 to metal 3 capacitance in power rail feedthroughs by replacing the metal 2 power rails of the feedthroughs with metal 3. The standard cell library has not yet been upgraded to take full advantage of the third metal layer, since this is at least a 6 month effort. However, two of the custom chiplets have been designed to take full advantage of the new metal layer.


Figure 2. FRISC Architecture

Datapath Chip

The datapath chip implements an 8 bit wide slice of the 32 bit FRISC datapath shown in Figure 2. The datapath chip contains a 32x8 bit register file, a carry select Adder/ALU, a shifter, a program counter, a status register, pipeline registers RES_EX, RES_D1, and RES_D2, the Data_IN register, the Data_OUT1 and Data_OUT2 registers, the feed forward logic, and seven program counter history registers. Figure 3 shows the floorplan of the datapath chip.

Each datapath chip has an 8 bit wide data input bus, a data output bus, and an address output bus. All datapath control signals come from the instruction decoder. Two important control signals go from the datapath to the instruction decoder. The BRANCH signal tells the instruction decoder the result of a branch condition evaluation; this signal is also fed to the instruction cache controller with the remote program counter. The arithmetic trap signal tells the instruction decoder that a two's complement overflow has occurred in the last operation.

The register file is implemented with one large custom macro that contains the output latches. The rest of the chip is implemented with our differential CML standard cell library using the VTITools router with our own extensions for differential routing and differential CML design.

Placement of the datapath components is very difficult because of the many critical paths and the irregular logic for the processor status word. One of the most critical paths is the loop from ALU input operand A through the ALU and the RES_EX pipeline register, through the feed forward logic, and back to operand A. The delay on this loop must be below 1 ns, and in the worst case the path involves three chip crossings for the computation of the 32 bit ALU result. All chip operations are controlled by a four phase clock, and the four clock phases had to be distributed over the chip with minimal skew. Special high power clock drivers have been developed to reduce interconnect and fanout sensitivities on the clock distribution trees.

The datapath chip includes the new boundary scan logic with the novel speed testing circuitry. The boundary scan feature is essential for testing the chip on the wafer and on the MCM. The boundary scan control circuitry allows us to scan in a test vector, present it to the circuit with a variable delay from the start of one of the four clock phases, and capture the outputs after a variable delay from the start of one of the clock phases. The variable delays can be adjusted from 0 to 280 ps in steps of 40 ps. The start phase for test pattern presentation and output sampling, as well as the delays, are determined by a shift register at the beginning of the scan path.
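
As an illustration of how these settings translate into a delay measurement, the Python sketch below enumerates the sampling points reachable from a chosen launch phase and reports the quantized delay that would be read out. The 250 ps phase spacing and the "first sampling point at or after the true delay" rule are assumptions made for illustration only, not the exact behavior of the scan controller.

# Conceptual sketch of the at-speed delay measurement described above.
PHASE_SPACING_PS = 250                 # assumed spacing of the four clock phases
OFFSETS_PS = range(0, 281, 40)         # programmable delay: 0 .. 280 ps in 40 ps steps

def sample_times(launch_phase=0):
    """All sampling times (ps) reachable by choosing a capture phase and offset,
    measured from the launch of the test vector at the start of launch_phase."""
    times = []
    for capture_phase in range(4):
        for off in OFFSETS_PS:
            t = ((capture_phase - launch_phase) % 4) * PHASE_SPACING_PS + off
            times.append(t)
    return sorted(set(times))

def measured_delay(true_delay_ps, launch_phase=0):
    """Quantized delay reported by the scan logic: first sampling time that
    already captures the new output value."""
    for t in sample_times(launch_phase):
        if t >= true_delay_ps:
            return t
    return None   # slower than the full measurement window

print(measured_delay(170))   # -> 200 ps with these assumed settings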

Figure 3. Datapath Chip Floorplan

8 bit Datapath Slice Summary
Size:          9.1 mm x 7 mm
Device Count:  9785
Power:         12.8 W
I/O Pins:      72 Differential Inputs
               6 Single Ended Inputs
               22 Differential Drivers
               18 VCC Power Pads
               16 VEE Power Pads

Table II

Instruction Decoder Chip

Figure 4. Instruction Decoder Block Diagram

Figure 4 shows a block diagram and Figure 5 shows the floorplan of the instruction decoder. The instruction decoder consists mainly of random logic. It was the most difficult chip to design since it has to handle exceptions and stalls in the deep FRISC pipeline. Further, the communication delays from the instruction decoder to the datapath and cache memories in the partitioned FRISC architecture are a significant fraction of a clock cycle making the timing of the instruction decoder more critical. A broadcast from the instruction decoder to all four datapath slices can take up to 240 ps and communication to the primary cache memories can take up to two clock phases (400 ps).

The communication delays depend, of course, on the MCM package. The figures above are for TAB die attachment and Polyimide interconnect dielectric. The delays improve if C4 die attach and a Parylene dielectric are used.

The instruction decoder chip receives a 32 bit instruction word from the instruction cache, the instruction and data cache miss signals, the user interrupt signal, the system error exception signal, and the branch and arithmetic trap signals from the FRISC datapath. It generates all the control signals for the four datapath slices and the stall and cache miss acknowledge signals for the instruction and data caches. The instruction decoder stalls the instruction pipeline on an instruction or data cache miss.


Figure 5. Instruction Decoder Floorplan

A cache miss is only valid if the decoder issues the acknowledge signal, otherwise the cache miss can be ignored because the instruction fetch or data transfer has been flushed. The instruction decoder flushes one or several of the instructions in the I1, I2, DE pipeline stages after a branch has been taken. Due to timing constraints the instructions are actually flushed one cycle later when these instructions are in the I2, DE, EX stages. This delayed action requires an additional control signal Flush_DP that flushes the EX pipeline stage on the datapath.

The pipelined instruction decoding is implemented with seven decoder stages, one for each pipeline stage. The first two stages I1, I2 contain only a valid flag. The 32 bit instruction word is available on the chip at the end of the I2 stage. Most of the decoding is performed in the DE (Decode) and EX (Execute) stages. Some of the instruction fields like the source register address fields are immediately dispatched to the datapath chips. After the EX stage only a few bits are needed to encode the instruction moving from one pipeline stage to the next.

Each stage has a valid flag that indicates whether the instruction is valid or not. An instruction can be invalidated by an exception or by a flush of one or several of the setup pipeline stages after a branch has been taken. Further, the result address field and a flag that indicates whether the instruction generates a valid result are kept, since the result address is needed for tagging the result registers in the EX, D1, and D2 stages and, if the instruction generates a valid result, for finally writing the result into the register file in the DW stage.

In addition, two bits are needed to encode the type of the instruction (LOAD, STORE, ALU, or BRANCH). LOAD/STORE instructions can take an exception in the last pipeline stage (DW), and LOAD instructions must also latch the Data_IN register in the DW stage.

Branch instructions must check the BRANCH signal from the datapath in the EX phase and flush one or several instructions if the branch was taken. The type of branch (branch with execute and branch with squash) is encoded in the instruction field that is normally used for the destination address. Further, the exception handler needs to know the type of the instruction.

The instruction decoder also contains the CPU state machine (Normal, Protected, Exception, Protected Mode Exception) and the interrupt controller. The interrupt controller generates the interrupt vector for the highest priority pending interrupt. The following interrupts are implemented: Reset, System_Error, Data_Page_Fault, Arithmetic_Trap, Software_Trap, Instruction_Page_Fault, Interrupt, and User_Interrupt. FRISC enforces in-order instruction completion and features a precise interrupt mechanism.

A large section of the instruction decoder chip is dedicated to the register tag and feed forward control logic. The two source register addresses of an instruction must be compared with the tags of the Result_EX, Result_D1, Result_D2, and Data_IN registers. If there is a match with a valid tag, the feed forward control signals are set such that the most recent result or data replaces the stale data fetched from the register file.
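
A minimal sketch of this selection logic, assuming that the most recent matching result (EX, then D1, then D2, then Data_IN) takes priority; the function and signal names are illustrative, not the actual control signals on the chip:

def select_operand(src_addr, reg_file_value, tagged_results):
    """tagged_results: list of (tag, valid, value) ordered from most to least recent,
    e.g. [Result_EX, Result_D1, Result_D2, Data_IN]."""
    for tag, valid, value in tagged_results:
        if valid and tag == src_addr:
            return value          # feed forward the newer result
    return reg_file_value         # otherwise use the (possibly stale) register file read

# Example: r5 was written by the instruction now in D1, so its result is forwarded.
tagged = [(7, True, 0x11),   # Result_EX
          (5, True, 0x22),   # Result_D1
          (5, False, 0x33),  # Result_D2 (invalidated)
          (2, True, 0x44)]   # Data_IN
print(hex(select_operand(5, 0x99, tagged)))   # -> 0x22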

Instruction Decoder Chip Summary
Size:          7 mm x 8 mm
Device Count:  7358
Power:         11.5 W
I/O Pins:      45 Differential Inputs
               6 Single Ended Inputs
               65 Differential Outputs
               1 Single Ended Output
               25 VCC Power Pads
               22 VEE Power Pads

Table III

The instruction decoder chip includes the new boundary scan logic with the novel speed testing circuitry. The boundary scan feature is essential for testing the chip on the wafer and on the MCM. The boundary scan control circuitry allows a test vector to be scanned in, presented to the circuit with a variable delay from the start of one of the four clock phases, and the outputs to be captured after a variable delay from the start of one of the clock phases. The variable delays can be adjusted from 0 to 280 ps in steps of 40 ps. The start phase for test pattern presentation and output sampling, as well as the delays, are determined by a shift register at the beginning of the scan path.

Cache Memory Chip

The data and instruction caches each have 2 Kbytes of memory, consisting of eight copies of the cache memory chip. They are direct-mapped, and have block sizes of 512 bits. A bus that is exactly one block wide is used between the two different levels of cache, in order to speed up the swapping of blocks and increase cache performance. This wide bus saves time since an entire block can be transferred at once, but requires switching different sections of the bus at slightly different times to prevent problems with the power supply.

There are eight memory chips (32 x 64 bits) in each cache. Each memory chip stores four bits of each 32 bit cache word. These bits are multiplexed to the output of the chip based on the lowest four bits of the address. The number of output pads is thus reduced from 32 (which would be the case if an entire word were stored in each chip) down to 4, which helps reduce the overall size of the chip since it is pad-limited. A block diagram of the cache memory chip is shown in Figure 6.
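
The sketch below illustrates one possible bit-slicing and address decomposition consistent with this description; the particular nibble-per-chip assignment and the 5-bit row / 4-bit output mux split are assumptions for illustration, since the report only states that each chip holds four bits of every word and muxes them out on the low four address bits.

# Illustrative mapping of a 32-bit cache word across the eight 32x64-bit memory chips.
BITS_PER_CHIP = 4
N_CHIPS = 8                      # 8 chips x 4 bits = one 32-bit word

def split_word(word):
    """Return the 4-bit slice each chip would store for this 32-bit word (assumed interleaving)."""
    return [(word >> (BITS_PER_CHIP * c)) & 0xF for c in range(N_CHIPS)]

def chip_address(word_index):
    """Assumed decomposition of the 9-bit word index (512 words of 32 bits = 2 KB)
    into a 5-bit row address and the low 4 bits that drive the on-chip output mux."""
    return {"row": word_index >> 4, "mux_select": word_index & 0xF}

print(split_word(0xDEADBEEF))    # per-chip nibbles
print(chip_address(0x1A3))       # {'row': 26, 'mux_select': 3}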

Figure 6. Block diagram of Cache Memory Chip

The chip contains 16,537 transistors, a large part of which are used for boundary scan testing and one redundant memory block. Only about 13,000 are used during normal chip operation. The chip is 8 mm x 9.4 mm and is estimated to dissipate 10.3 watts of power. The addition of a third level of metallization has improved power distribution on chip and cut down on the power rail voltage droop, which is a concern with the resistive-sourced standard cells used in the logic on the chip. A floor plan for the chip is shown in Figure 7 and the final artwork is shown in Figure 8.


Figure 7. Floorplan of Cache Memory Chip

Performance Enhancements

In order to speed up the LOAD and STORE operations, an extra signal was added to the interface between the primary and secondary caches to allow the L2 cache to anticipate fetching the required data. The F-RISC/G architecture stipulates that the core processor has control over whether or not to take action on a cache miss. There are times when a miss may be ignored because the data is not needed (in certain cases where interrupts have occurred, for example). Previously, the L1 caches had no authority to tell the L2 cache to get the data. Instead, they were merely allowed to inform the processor, which decided whether or not to act and then broadcast the decision back to the L1 cache so that it could then send an address and get the data from the L2. Under the current scheme, the L1 cache sends the address for the data to the L2 controller in parallel with sending the miss to the core processor. The L2 cache is then allowed to compare the address with the contents of its tag memories at the same time the core processor is deciding whether the miss is valid. If the core processor does not act on the miss, the L1 simply tells the L2 to ignore the miss and not send the data (or not try to get it from main memory if it is not in L2). If the miss is valid, the L2 has already saved the time of checking its tags and can immediately put the data out on the bus to L1 (or begin to get it from main memory). This saves a full cycle on L1 cache misses that are acted on, which are the overwhelming majority, and cuts one nanosecond off the miss penalty. There is also no penalty for making the L2 check its tags instead of idling if the miss turns out not to be valid. Figure 9 is a timing diagram showing the difference between the two schemes.
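
A cycle-level sketch of where the saving comes from, assuming one cycle each for the core's miss validation and the L2 tag check; these step durations are assumptions chosen only to show the overlap, and the actual timing is the one given in Figure 9.

TAG_CHECK = 1   # assumed: cycles for L2 to compare an address with its tags
VALIDATE  = 1   # assumed: cycles for the core to decide whether the miss is valid

def old_prefix():
    # miss reported -> core validates -> L1 sends address -> L2 checks tags -> data
    return VALIDATE + TAG_CHECK

def enhanced_prefix():
    # the address reaches L2 together with the miss report, so the tag check
    # overlaps the core's validation; data can start as soon as validation completes
    return max(VALIDATE, TAG_CHECK)

print(f"cycles saved on a validated miss: {old_prefix() - enhanced_prefix()}")  # -> 1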


Figure 8. Cache Memory Chip Artwork

Redundancy

In order to increase yield of the memory chip, redundancy has been added by duplicating one memory block. Replacement of a block involves less overhead than replacing columns individually, and seems to give the best results with respect to the tradeoff between glue logic added and the amount of redundancy.

The replacement strategy is hard-wired, to avoid any additions to the MCM to handle setting configuration registers, as is commonly done with other column replacement strategies during power-up. Three output pads are used to determine which of the eight memory blocks to mask out and replace with the redundant block. These pads will be forced high or low to mask out a block on each of the sixteen memory chips on the MCM.


Figure 9. Enhanced Miss Scheme Timing

Yield Estimates for Cache Memory Chip

An analysis of the tradeoffs for two different redundancy schemes was conducted. The first strategy, and the one used in the cache memory chip, is that of substituting an extra 32x8 block of memory for one of the eight blocks needed per chip. This scheme is the simplest to implement, and uses the least amount of glue logic (i.e., it uses less power and dissipates less heat, two key considerations when designing with bipolar devices). The second strategy consists of replacing individual columns in each of the eight blocks. While its yield numbers are higher, it has the drawback of requiring power-on configuration. The analysis is as follows:

Single Block Replacement

Let c be the area constant: the number of devices under consideration divided by the number of devices for which the yield calculation has been done by the foundry (all foundry yield estimates are for 5000 devices). Then the probability that a certain number of devices will all work becomes:

Pr(N devices working) = y^c, where c = N/5000 and y = process yield

In order to calculate yield for the single block replacement strategy, the probability of getting at least eight of the nine blocks working is multiplied by the probability of getting the rest of the devices on the chip in the non-redundant logic to work:

Pr(chip working) = { 9 Pr(block)^8 [1 - Pr(block)] + Pr(block)^9 } Pr(logic working)

Since there are 4600 devices in the non-redundant part of the logic and 1400 devices in a block, the area constants become c(block)=0.28 and c(logic working)=0.92 in the above equation. The results of the calculations are shown in Table IV.

Foundry yield on 5000 devices (%)    Estimated chip yield (%)
 5                                    0.04
10                                    0.33
15                                    1.07
20                                    2.41
25                                    4.46
30                                    7.32
35                                   11.01
40                                   15.53

Table IV

Single Column per Block Replacement

First, the probability of having at least eight of the nine columns in a block working is determined, without the decoder, threshold generator, etc. There are 128 devices in the cells and 12 devices in the R/W circuit, sense amp, etc., giving 140 devices per column. Therefore c = 140/5000 = 0.028 when calculating the probability of one column working.

Pr(eight columns working) = 9 Pr(column)^8 [1 - Pr(column)] + Pr(column)^9

Next, the other 198 devices for the multiplexers and latches used to re-configure the block and the other approximately 400 devices for decoder, threshold generator, buffers, etc. are added back in, so here c=0.12

Pr(entire block working) = Pr(eight columns working) (y^0.12)

Having calculated the yield for one block, that number is multiplied by itself for each of the eight blocks.

Pr(eight blocks working) = Pr(entire block working)^8

The analysis is finished by considering the 2400 devices for logic (boundary scan, etc.) and an additional 500 for the power-up logic needed for this scheme (the MUX control lines are not hard-wired as in the previous case, which means there would have to be considerable logic added to the MCM to handle sending the configuration latch pattern into each of the eight chips on power-up). The area constant becomes c=2900/5000=0.58.

Pr(chip working) = Pr(eight blocks working) (y^0.58)
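
The following Python sketch reproduces both calculations; the exponents are the device-count ratios given above, and the factor of 9 is the binomial coefficient for choosing which of the nine blocks (or columns) is allowed to fail.

# Yield estimates for the two redundancy schemes, using Pr = y**c area scaling.
def block_replacement_yield(y):
    p_block = y ** (1400 / 5000.0)            # one 32x8 memory block
    p_logic = y ** (4600 / 5000.0)            # non-redundant logic
    at_least_8_of_9 = 9 * p_block**8 * (1 - p_block) + p_block**9
    return at_least_8_of_9 * p_logic

def column_replacement_yield(y):
    p_col = y ** (140 / 5000.0)               # one column (cells + R/W, sense amp)
    cols_ok = 9 * p_col**8 * (1 - p_col) + p_col**9
    p_block = cols_ok * y ** 0.12             # add decoder, mux/latches, buffers
    return p_block**8 * y ** (2900 / 5000.0)  # eight blocks plus scan and power-up logic

for y in [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]:
    print(f"foundry yield {y:4.0%}:  block {block_replacement_yield(y):6.2%}"
          f"   column {column_replacement_yield(y):6.2%}")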

The final results are shown in Table V. A graph of the estimated chip yield for the two schemes is shown in Figure 10. For comparison a curve showing predicted yield without redundancy is included. Raw yield is not the only criterion used in evaluating the different schemes. While the column replacement gives better yield, it adds 0.9 watts more power to the chip versus block replacement. The additional transistors are estimated to increase the size of the chip from 8 mm x 9.4 mm to 8 mm x 10.3 mm, which cannot fit on the reticle with the other chips. Adding the extra devices also causes concerns about powering up such a large chip to test it with a limited number of probes. The current artwork uses the block replacement scheme for these reasons, although a rework is being considered in an effort to further increase yield of the chip.

Foundry yield on 5000 devices (%)    Estimated chip yield (%)
 5                                    0.25
10                                    1.20
15                                    2.84
20                                    5.23
25                                    8.27
30                                   11.89
35                                   16.03
40                                   20.66

Table V

Figure 10. Graph of yield for various redundancy schemes

Cache Memory Chip Summary
Memory Capacity:  256 bytes
Size:             8 mm x 9.4 mm
Device Count:     16537
Power:            10.3 W
I/O Pins:         128 Single Ended I/O*
                  5 Differential Drivers
                  21 Differential Receivers
                  15 VCC Power Pads
                  19 VEE Power Pads

Table VI

* Changing these busses to differential is under consideration.

High Speed Test Chiplet

The high speed test chiplet contains a high speed VCO (1-5 GHz) with a frequency multiplier (2X, 4X), a static divider chain, a high speed ring oscillator, several high bandwidth multiplexers and a high bandwidth driver and receiver. Figure 11 shows the block diagram of the high speed test chiplet. The chip has multiple paths for observing and testing different circuits.


Figure 11. High Speed Test Chiplet

The chip can be tested with a six channel Cascade probe and a high bandwidth Microwave probe to observe the output of the high bandwidth driver. Figure 12 shows the floorplan with the different probe sites for the high speed test chiplet.

The main purpose of the high speed test chip design was to find the limits of the 50 GHz HBT process and to gain more experience with digital circuits operating at their frequency limit, including their power requirements. The comparison of measured results with back-annotated SPICE results will give us feedback on the accuracy of the CAD modeling for microwave design. The biggest concerns are the SPICE models (the Rockwell models include no thermal modeling, and hence no self heating effects) and the parasitic extraction for back-annotation.

Capacitance extraction is more difficult for GaAs circuits since the interconnect capacitance is mainly due to coupling between adjacent wires because of the semi-insulating GaAs substrate. In Si circuits the capacitance is typically dominated by the capacitance to the substrate. An accurate extraction of the interconnect parasitics would require 3-D modeling. Our current extraction tool can only compute interconnect capacitance as a function of length and overlap with other metal layers. The capacitance per unit length is based upon a worst case assumption, a differential wire pair with minimal spacing and width and two adjacent metal plates.
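
A sketch of this simplified extraction model, with placeholder coefficients; the actual worst-case per-unit-length and overlap values depend on the Rockwell process and are not reproduced here.

# Simplified worst-case extraction: length term plus other-layer overlap term.
C_PER_MM   = 0.12e-12    # F/mm, assumed worst-case coupling per unit length (placeholder)
C_PER_SQMM = 0.03e-12    # F/mm^2, assumed plate capacitance per unit overlap area (placeholder)

def net_capacitance(length_mm, overlap_mm2):
    return C_PER_MM * length_mm + C_PER_SQMM * overlap_mm2

# Example: a 3 mm differential run crossing 0.2 mm^2 of other-layer metal.
print(f"{net_capacitance(3.0, 0.2) * 1e15:.0f} fF")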

Figure 12. The Floorplan of High Speed Test Chiplet

Back-annotated SPICE simulations indicate operation up to 20 GHz. Figure 13 shows the layout of the high speed chiplet. All of the components have custom layouts and are hand placed to optimize performance. Several design and layout iterations were required to achieve the 20 GHz back-annotated SPICE result.

High Speed Chiplet Summary
Size:          1.9 mm x 1.6 mm
Device Count:  412
Power:         2.45 W
I/O Pins:      8 Single Ended Inputs
               1 Differential Driver
               1 Differential Receiver
               10 VCC Power Pads
               4 VEE Power Pads

Table VII

Figure 13. The Layout of the High Speed Test Chiplet

The Boundary Scan Test Chip

In order to test the boundary scan scheme employed in the instruction decoder and data path slice chips, a small test chip was developed. The purpose of this test chip is primarily to allow measurements of at-speed testing resolution and to verify our boundary scan design and implementation. In addition, the chip was designed to allow us to investigate the performance of our standard cell library by incorporating chains of buffers at various logic and power levels.

Figure 14. Block Diagram of Boundary Scan Test Chip

Implementation

The boundary scan test chip consists of boundary scan control logic, a four-phase clock generator, and three standard cell loops. The Carry Chain Loop (CCLOOP) contains a copy of the custom carry chain macro used in the Data Path chips in series with a medium powered buffer chain. The carry chain may be put into oscillation to measure its performance, or it may be stimulated with an input pulse which then propagates through the chain. In the pulse mode of operation, the novel delay testing circuitry added to the boundary scan would be used to measure the time of propagation. Multiplexers are provided to select different oscillation paths. Hence, the frequency of oscillation and propagation time of the carry chain loop can also be varied.


Figure 15. Floorplan of Boundary Scan Test Chip

Speed testing is accomplished by varying the sampling point of the boundary scan latches by selecting a clock phase and delay offset. This offset and phase can be adjusted until the output pulse is captured. For calibration each of the four clock phases can be sampled by the boundary scan latches. The output of the four phase clock is also externally visible for timing calibration.

The Level 1 (L1LOOP) and Level 2 (L2LOOP) loops work similarly, but contain selectable chains of low, medium, and high powered buffers. The types of gates included in the chain are externally selectable to allow comparison of performance for each type of gate.

Boundary Scan Test Chip Summary
Size:               3.5 mm x 6.2 mm
Device Count:       2437
I/O Pins:           13 Differential Inputs
                    1 Single Ended Input
                    7 Single Ended Outputs
                    10 VCC Power Pads
                    4 VEE Power Pads
Power Dissipation:  3.45 W

Table VIII

Cache Memory Block Test Chip

This test chip contains a single 32 x 8 bit memory block; nine such blocks are used on the cache memory chip. It contains the same multiplexer and logic delays as the full chip, but allows probing the signals directly without the use of boundary scan. The purpose of this is two-fold: it allows checking the performance penalty of the boundary scan, and it allows checking the speed of the memory macro in the event of a boundary scan failure on the full chip. The logic permits writing the classic checkerboard and inverted checkerboard patterns, as well as all high or all low values, to the memory. The access time of the memory can be determined with a frequency measurement. All 256 bits can be tested. A block diagram of the chip is shown in Figure 16.
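
A small generator for the test patterns named above, assuming a conventional row-alternating checkerboard; the exact phase convention used by the on-chip pattern logic is not specified in this report.

# Test patterns for a 32-word x 8-bit memory block.
ROWS, BITS = 32, 8

def pattern(kind):
    if kind == "all_high":
        return [0xFF] * ROWS
    if kind == "all_low":
        return [0x00] * ROWS
    if kind == "checkerboard":          # 0b10101010 / 0b01010101 on alternate rows (assumed)
        return [0xAA if r % 2 == 0 else 0x55 for r in range(ROWS)]
    if kind == "inverted_checkerboard":
        return [w ^ 0xFF for w in pattern("checkerboard")]
    raise ValueError(kind)

for name in ("checkerboard", "inverted_checkerboard", "all_high", "all_low"):
    words = pattern(name)
    print(f"{name:22s} first rows: {[format(w, '08b') for w in words[:2]]}")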

Figure 16. Block diagram of cache memory block test chip

There are 2516 devices in the chip. Power dissipation is estimated to be around 1.4W. The floor plan is shown in Figure 17. Final artwork for the chip is shown in Figure 18.


Figure 17. Floorplan of Cache Memory Block Test Chip


Figure 18: Artwork for memory block test chip.

Cache Memory Block Chiplet Summary
Memory Capacity:  32 bytes
Size:             3.4 mm x 4.1 mm
Device Count:     2516
Power:            1.4 W
I/O Pins:         6 Single Ended I/O
                  2 VCC Power Pads
                  2 VEE Power Pads

Table IX

Deskew Test Chiplet

A test vehicle for the deskew scheme has been designed. It deskews two clock distribution channels, which is sufficient for testing all of the circuit components of the deskew circuit. The main objective of the test vehicle is the verification of the designs of the subcircuits comprising the deskew circuit.

Additional delay elements have been added to the deskew test to emulate slowly varying interconnect delays on the MCM. Further, two clock phase generators are included to measure the clock skew that would be seen by two chips in the system. These clock phase generators are driven by the clock sent over the two clock distribution channels with the additional elements (labeled "delay") emulating slowly varying interconnect delays.

The digital clock signals are delayed by voltage controlled delay elements (VCDEs) as shown in Figure 19. The additional delay elements are also implemented with VCDEs. One set of delay blocks has a fixed delay, whereas the other set has a delay that is controlled by an external control voltage. This control permits determination of the maximum tolerable initial skew and the tolerable skew variation. The testing strategy is to match the delay of the variable set of VCDEs to that of the fixed set, then vary the delay in the VCDEs until the clock is no longer deskewed. Balanced delay Exclusive-ORs (XORs) are used to monitor the skew. Each of the two inputs of an XOR is connected to matching locations on different clock loops as shown in Figure 19. One of the two XORs directly monitors the skew between the clocks at the ends of the distribution channels (Clock Test), while the other detects the skew of the first clock phase (Phase Test). Figure 20 shows the key control circuits of the deskew chip with the phase locked loop (PLL).
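
A conceptual model of this test sweep; the compensation range and the skew threshold below are assumed numbers chosen only to show the shape of the measurement, not parameters of the actual deskew circuit.

# Sweep the injected delay mismatch and report when the loop can no longer deskew.
COMP_RANGE_PS = 100     # assumed range of the skew-compensating delay line
SKEW_LIMIT_PS = 20      # assumed residual skew at which the XOR monitor flags failure

def residual_skew(injected_mismatch_ps):
    """Skew left over after the loop applies up to COMP_RANGE_PS of correction."""
    correction = min(abs(injected_mismatch_ps), COMP_RANGE_PS)
    return abs(injected_mismatch_ps) - correction

for mismatch in range(0, 201, 25):
    ok = residual_skew(mismatch) <= SKEW_LIMIT_PS
    print(f"injected mismatch {mismatch:3d} ps -> "
          f"residual {residual_skew(mismatch):3d} ps  deskewed={ok}")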

Table X shows the probe sites of the deskew test chip shown in Figure 21. There are three distinct types of pins in Table X: test output pins, diagnosis pins, and control pins. Pins such as Phase Test and Clock Test are designed to signal the properly deskewed state of the master clock, while pins such as SYNC, DOWN, and UP are designed to diagnose the possible causes of faults if the test chip fails. Analog control pins, such as the VCDE Delay Control and VCO Control, form the third type and are used to characterize the performance range of the deskew circuit.

Figure 19. Clock Loop Portion of the Test Vehicle

Figure 20. PLL Portion of the Test Vehicle

The chip has been designed for testing with two six channel Cascade probes. A Cascade signal probe has ten pins in total: six signal pins and four power and ground pins. In Figure 22, which contains the floorplan of the deskew chip, there is an additional probe site, Probe Site 4. This site is not included in Table X because it is not used in test mode; it contains the differential driver and receiver ports of the two clock distribution channels. Table XI shows a summary of the test chip.

Figure 21. Layout of the Test Vehicle for the Deskew Chip

Site 1                Site 2              Site 3
Phase Test            Clock Phase 1, A    Clock Freeze
Clock Test            Clock Phase 1, B    SYNC
VCDE Delay Control    Clock Phase 4, A    RESET
Delay Select          Clock Phase 4, B    DOWN
VCO Control           Phase Freeze        UP
INIF                  Filter Output

Table X

Figure 22. Floorplan of the Deskew Test Vehicle

Summary of the Deskew Test Chiplet
Clock Distribution Channels:  2
Power Dissipation:            1.85 W
Transistor Count:             1,030
Chip Size:                    2.6 mm x 3.0 mm
I/O Pins:                     2 Differential Inputs
                              2 Differential Outputs
                              4 Single Ended Inputs
                              13 Single Ended Outputs
                              8 VCC Power Pads
                              8 VEE Power Pads

Table XI

MCM Work

The specification of the F-RISC/G multi-chip module (MCM) is being approached using the methodology shown in Figure 23. The design and analysis space of the F-RISC/G MCM spans four main design domains - Routing, Electromagnetic, Chip/Pad design, and Package/Thermal - which must be satisfied simultaneously. These domains are not completely independent of each other, and the wire pitch has been found to be the common denominator linking all of them.

The routing domain deals with the problem of completely routing the MCM. Chip connectivity statistics along with the basic wiring congestion influence the routability of the system. Normally, as the number of available routing tracks decreases, the chance of completely routing a system decreases too, except in highly regular systems. Therefore, as a general rule, it can be assumed that as the wire pitch increases (and the number of routing tracks decreases) the routing becomes harder and the routability goes down.

The design of the multi-chip module started with the specification of a netlist describing all the interchip connections. This netlist was used in conjunction with the timing constraints supplied by the chip designers to lay out the F-RISC/G core, which is surrounded by the level 2 cache chips as shown in Figure 24. The congestion along the datapath chips led to an initial specification of the wiring pitch as 10 microns. The layout was successfully routed with two layers of interconnect using the RPI-MCM router as shown in Figure 25. An analysis of the routing characteristics indicated a few wires violating timing constraints. This was traced to daisy chained signals going from the cache controllers to all the L1 cache chips, and it prompted a conversion of the daisy chained nets into point-to-point nets by using extra driver pads on the cache controller chip. This results in a maximum net length of about 10 cm. The number of nets is plotted against net length in Figure 26.

A wire pitch of 10 microns completely routed the system and took the design process into the electromagnetic domain shown in Figure 23. The longest pair of wires was simulated using MagiCAD with different wire pitches to predict the coupling between the wires. The coupling between the wires increases (moving up the curve in the electromagnetic domain) as the wiring pitch decreases, and the noise margin goes down. Therefore the wiring pitch decided upon in the routing domain gives the maximum coupling present in the system. This takes the process into the pad (or driver) design domain.

One of the design considerations in this MCM is the signal propagation from the center of the MCM to the outlying memory chips. The differential receivers require at least 150 mV at their inputs for a logic swing. The driver to receiver network was modeled in MagiCAD as lossy transmission lines using Parylene as the interlayer dielectric (dielectric constant = 2.65) and copper (bulk resistivity = 1.67 µohm-cm) as the interconnect metal. Simulations were done to obtain geometries for 50 ohm impedance lines. The driver was modeled as a current source in parallel with a 10 kohm resistance; this driver is capable of providing a 60 ps rise time pulse. Solder bumps were modeled as a 50 pH series inductance, and the pad capacitance was modeled as a 100 fF capacitance between signal and ground. The lines are terminated with 50 ohm resistances at the receiver.

MagiCAD simulations showed that a wire width of 10 microns with a wire thickness of 4 or 5 microns was not sufficient to ensure a clean logic swing at the receiver at the end of a 10 cm line. Therefore a wire width of 12 microns was simulated with wire thicknesses of 4 and 5 microns. Figures 27 and 28 show the results as plots of receiver voltage versus line length.

Simulations were done by varying driver current and metal resistivity. Increasing the driver current keeps the signal strong for longer lengths while increased metal resistivity attenuates it more. As the receiver voltage needed at the end of a line is at least 150 mV, the effects of increased resistivity and longer lines were considered to arrive at a wire width of 12 microns and thickness of 5 microns.
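
As a rough first-order cross-check of these geometry choices, the sketch below computes the DC series resistance and the corresponding resistive attenuation over a 10 cm line for each width/thickness combination. It ignores skin effect, dielectric loss, and the bump and pad parasitics, all of which are included in the MagiCAD runs that drove the final decision, so it shows only the trend, not the actual receiver voltages.

import math

RHO = 1.67e-8          # ohm*m, copper bulk resistivity quoted above
Z0  = 50.0             # ohm, nominal line impedance

def resistive_attenuation(width_um, thick_um, length_cm):
    """Fraction of the launched swing surviving series resistance only."""
    area = (width_um * 1e-6) * (thick_um * 1e-6)      # cross-section in m^2
    r_total = RHO * (length_cm * 1e-2) / area         # total series resistance, ohm
    return math.exp(-r_total / (2.0 * Z0))            # low-loss line approximation

for w, t in [(10, 4), (10, 5), (12, 4), (12, 5)]:
    a = resistive_attenuation(w, t, 10.0)             # 10 cm worst-case net
    print(f"{w} um x {t} um: {100 * (1 - a):4.1f}% resistive loss over 10 cm")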

Once the maximum net length is determined for a particular grid size and shown to be within the noise margin specifications, the routed MCM is analyzed from a thermomechanical stability point of view. Excessive thermal stresses can occur if the cooling is not sufficient. Similarly, mechanical stresses in the system can be high enough that wires may break. An increase in wiring pitch increases the spacing between the chips, which in turn increases the cooling ability of the system. As the wiring pitch is doubled the space between the chips is also doubled, and the number of horizontal tracks within the chip height is reduced to half its previous value. Therefore the effect of wiring pitch variation on the MCM area required to route the same connections between any two chips is much worse than linear.

Thermal modeling of the proposed F-RISC/G MCM is in progress and will help determine the final wire pitch. The diagonally positioned design domains behave in the same manner with respect to the wire pitch: the routing and chip/pad design domains try to push the wire pitch towards the origin, while the electromagnetic and package/thermal domains try to pull it away from the origin.


Figure 23. MCM Design Cycle


Figure 24. Placement of F-RISC/G Chips with Second Level Cache (shown as shaded)

Figure 25. Routing of F-RISC/G Core


Figure 26. Wire Length Distribution on 25 Chip MCM

Figure 27. Receiver voltage vs. wire length for cw12412 structure

Figure 28. Receiver voltage vs. wire length for cw12512 structure