J. F. McDonald
Rensselaer Polytechnic Institute
Troy, New York
I. Motivation: 3D Chip Integration for Shortened Interconnection Lengths
It is well known that modern VLSI chips contain two classes of wiring: wiring within tightly designed (often hand-crafted) macros of logic functions, and wiring that interconnects those macros. Kang (1987) demonstrated that a histogram of chip wire lengths shows two distinct peaks, one at about 0.1L and a second at 0.5L, where L is the die size. More recently, Meindl has derived closed-form probability distributions for 2D chips using Rent's rule. Within the tightly designed macros, wire contributions to delay are often only in the 10-20% range; these short wires account for the first peak, while the inter-macro wiring is responsible for the second peak of longer connections. All of these interconnections must charge and discharge in one clock cycle, so the clock cycle can be no faster than the charge and discharge time of the longest of these lines; ideally, in fact, the clock rate would be dictated by just that single longest line. For example, a 12 picosecond gate delay in the 0.5 micron IBM process grows to 1400 picoseconds when loaded with 1.5 centimeters of wire on the second level of metal. In modern designs these very long lines are beginning to exhibit wire delay that is quadratic in length, owing to the presence of both capacitive and resistive parasitics in the wire. It is these long wires that 3D chip integration seeks to shorten.
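The quadratic dependence follows from the distributed-RC (Elmore) delay of the wire: with resistance r and capacitance c per unit length, a line of length L contributes roughly 0.5·(rL)·(cL). A minimal sketch of this scaling, where the per-unit-length parasitics are illustrative assumptions rather than measured values for the IBM process:

```python
def wire_delay(length_m, r_per_m=6.0e4, c_per_m=2.0e-10):
    """Distributed-RC (Elmore) wire delay: 0.5 * (r*L) * (c*L).

    r_per_m (ohm/m) and c_per_m (F/m) are assumed illustrative
    parasitics, chosen to land near the 1400 ps figure in the text.
    """
    return 0.5 * r_per_m * c_per_m * length_m ** 2

d1 = wire_delay(0.015)  # 1.5 cm line: on the order of a nanosecond
d2 = wire_delay(0.030)  # doubling the length...
print(d1)
print(d2 / d1)          # ...quadruples the delay: 4.0
```

The key point is the L² term: halving long-wire lengths with 3D stacking buys a 4× reduction in their RC delay.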
II. Fast RISC Demonstrator
It would be desirable to demonstrate unequivocally that faster computers are possible with 3D chip integration. Saraswat at Stanford and Reif at MIT have analyzed various models of microprocessors; in each analysis a strategy was derived for obtaining the wire length distribution of a 3D chip with the same architecture. The Rensselaer 3D effort has the legacy of a DARPA-sponsored RISC processor, designed completely at Rensselaer, with a 2 GHz clock rate. Because the processor was designed entirely within Rensselaer, not only are the wire length distributions freely available, but so is the entire netlist specifying the architecture. All timing delays for the interconnections, and the roles that they play, are completely open within the Rensselaer design database.
Figure 1. Four GaAs HBT Byteslice Integrated Circuits as Fabricated by the Rockwell Newbury Park Facility for the DARPA Fast RISC (F-RISC/G) Project.
Figure 2. Architecture of the F-RISC/G. The GaAs HBT Circuits Shown in Figure 1 Compose the Blocks to the Left of the Extremely Wide 128 Byte Bus. The L2 Block is a Conventional BiCMOS L2 Cache Memory.
The F-RISC/G is a GaAs HBT byteslice architecture for a 32 bit RISC engine running at a 2 GHz clock (four phases at 250 ps per phase). It is pipelined approximately at the level of conventional integer pipes for microprocessors. Four byteslice chip types are used, namely the Data Path (DP), the Instruction Decoder (ID), the Cache Controller (CC) and the Cache RAM (CR). These are shown in Figure 1, with the DP being the large chip in the upper right, the ID the large chip on the lower left, the CR at the upper left corner and the CC at the bottom right. The extreme right and lower strips of the reticle shown in Figure 1 carry other test structures, including a calibration chip containing a 5 GHz single-ported register file and an adder ring oscillator. These test structures helped guide the project toward its final goal by providing an early look at the speed possible with this level of integration.
A key feature of the F-RISC/G is an extremely wide L1-L2 cache bus of 1024 bits (128 bytes). The L1 cache is only 8 Kbytes in size (half for data and half for instructions). Such small caches are necessary to permit them to keep up with the processor speed, a characteristic that will increasingly be exhibited by faster machines. To mitigate the excess miss events this causes, on a miss the next cache line is moved from L2 as an entire line in one cycle; for F-RISC/G this is 1024 bits, or 32 words.
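The bus-width figures above are mutually consistent: 1024 bits is 128 bytes, or 32 words for a 32-bit machine. A trivial check:

```python
BUS_BITS = 1024   # L1-L2 cache bus width
WORD_BITS = 32    # 32-bit RISC word

bytes_per_transfer = BUS_BITS // 8
words_per_transfer = BUS_BITS // WORD_BITS
print(bytes_per_transfer)  # 128 bytes: one full cache line per cycle
print(words_per_transfer)  # 32 words
```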
Because the yield of GaAs HBT technology is so low, the architecture is partitioned into 23 chips taken from the basic set of four slices: 4 DP chips, 1 ID chip, 2 CC chips and 16 CR chips.
The 23 GaAs HBT chips that comprise the core of the F-RISC/G need to be unified on a common substrate capable of supporting the 2 GHz signals with a reasonable number of harmonics (nine odd harmonics would demand roughly 36 GHz of bandwidth for the wiring).
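One way to read that bandwidth requirement (the counting convention here is an assumption): retaining the first nine odd harmonics of the 2 GHz square-wave clock means passing everything up through the 17th harmonic, i.e. about 34 GHz, close to the quoted 36 GHz figure:

```python
F_CLOCK = 2e9  # 2 GHz clock fundamental

# A square wave contains only odd harmonics; keep the first nine:
# the 1st, 3rd, 5th, ..., 17th.
odd_harmonics = [(2 * k - 1) * F_CLOCK for k in range(1, 10)]
print(odd_harmonics[-1] / 1e9)  # highest retained component, GHz: 34.0
```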
Figure 3. GE/HDI MCM Supporting the 23 GaAs HBT Chip Core of the F-RISC/G Processor.
Figure 4. Mentor CAD Layout for GE/HDI Showing Chip-to-Chip Wiring (MCM Wiring for F-RISC/G).
Figure 5. F-RISC/G Wire Length Distribution on a Log-Log Plot.
The wire length distribution shown in Figure 5 includes both the chip-to-chip wires of the MCM in Figure 4 and the wires from within the 23 chips of the architecture, flattened in the database.
Note that some of the longest wires are in the chip-to-chip distribution for the MCM, the longest being about 5 cm. This length, along with the GE/HDI polyimide dielectric of the wiring tape, would dictate a maximum clock frequency of only about 3.5 GHz. That is just the toggling frequency of this single longest wire in the architecture, assuming it is a terminated transmission line in the MCM propagating at the speed of light in the dielectric, with the 40 GHz bandwidth available for 5 cm of GE/HDI wiring. Direct 3D chip stacking of all 23 GaAs chips would reduce this longest wire to only about 1.0 cm, an improvement by a factor of 5 in the longest wire. Assuming the wiring bandwidth could keep up, this suggests that the largest burden of wire could clock at about 16 GHz. Proper pipelining of the architecture and sufficiently fast devices should make this the achievable clock speed; hence this is the goal of the F-RISC/H project, namely a 16 GHz clock machine. It should be mentioned that the internal speed of the present 50 GHz process used for F-RISC/G could capture only about half this speed, so a switchover to 100 GHz devices is required to pursue the 16 GHz machine.
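The 3.5 GHz and 16 GHz ceilings can be checked from time of flight on the terminated line. A minimal sketch; the polyimide relative permittivity used below (about 3.0) is an assumed typical value, not a measured GE/HDI parameter:

```python
C0 = 3.0e8     # free-space speed of light, m/s
EPS_R = 3.0    # assumed relative permittivity of the polyimide dielectric

v = C0 / EPS_R ** 0.5              # propagation velocity on the line
for length_m in (0.05, 0.01):      # longest MCM wire vs. stacked-chip wire
    t_flight = length_m / v        # one-way time of flight, seconds
    print(length_m, 1.0 / t_flight)  # toggle-rate ceiling, Hz
```

With these assumptions the 5 cm wire tops out near 3.5 GHz and the 1 cm wire near 17 GHz, consistent with the factor-of-5 improvement cited above.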
The present implementation in the Rockwell 2 micron GaAs HBT technology dissipates 240 watts, including all the drivers needed to support the 1024 bit bus (80 watts are required for that task alone). Hence there is little chance of 3D integration of the processor in the 23 chip GaAs partition. Instead a switch to the IBM SiGe HBT BiCMOS technology is recommended. The IBM 0.2 micron 7HP minimum design rules are roughly 1/10 those of the Rockwell GaAs HBT and, as might be expected, the DC bias current at the peak of the fT vs. Ic curve for 7HP is roughly 1/10 that of the predecessor process. In addition, the turn-on voltage Vbe for the SiGe HBT is only half that of the GaAs HBT. The net reduction in power is therefore a factor of 20, to only 12 watts. Even this small amount will represent a challenge for 3D integration.
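The factor-of-20 estimate follows directly from the two scalings quoted above (bias current down 10×, turn-on voltage down 2×); a quick check of the arithmetic:

```python
P_GAAS = 240.0            # watts, Rockwell GaAs HBT implementation
CURRENT_SCALE = 1 / 10    # 7HP bias current ~1/10 of GaAs at peak fT
VOLTAGE_SCALE = 1 / 2     # SiGe Vbe ~half the GaAs value

p_sige = P_GAAS * CURRENT_SCALE * VOLTAGE_SCALE
print(p_sige)  # 12.0 watts
```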
The changeover to a SiGe HBT implementation is under way. Because SiGe HBT technology exhibits at least a 4-8X improvement in yield over the GaAs of F-RISC/G, further chip optimization is possible; the SiGe effort has been named F-RISC/H. Because of these yield improvements the architecture can be recast as a 4 chip stack in the 7HP line. Some use of BiCMOS in the L1 cache is possible to reduce power further. Our target is 10 watts dissipated in a four chip stack, with each chip expected to be about 1 cm × 1 cm.
III. Future Work
This quarterly report has focused on the inspiration for the 3D chip project, namely the previously sponsored DARPA work on the F-RISC/G design. It is clear from the 23 chip MCM that a 23 chip stack would remove all the wires between 1 and 5 cm, which occur primarily on the MCM, replacing them with wires on the order of 1 cm in length. But to capitalize on this idea the rest of the F-RISC architecture must be designed to push the envelope of processor design. A 16 GHz processor would have to make additional architectural adaptations. For example, experience gained with the DARPA Tera machine indicates the need for multi-threading the architecture to mitigate cache stalls as the gulf between processor speed and bulk memory increases.

One of the tightest timing considerations will be in the register file. The register file forms the heart of any data path, and its size and complexity can easily limit the clock frequency. Consequently, early work on the SiGe version of F-RISC has focused on just what performance can be wrung out of these building blocks. One feature may prove exceedingly important in reaching these extreme frequencies, namely the ability to micropipeline the register file, allowing decoding, word line chargeup and bit line selection to be piped. Register file banking may be required to mimic the features of multithreading. A target of just 8 threads has been set. This will require that the size of each bank be held as small as possible while still providing a reasonable number of registers to each thread; current thinking is that for an integer pipe machine 8 registers may be sufficient. Pipelining of the file would then place the location decoding for the file in the first pipe stage, 8-word selection onto the bit lines in a second stage, and finally selection of the bank in the third.
Eight-to-one selection can be accomplished in a single current tree followed by a flip-flop, which would place the 7HP register file in the range of 16 GHz.
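The three-stage micropipeline described above can be sketched behaviorally. This is a toy model under stated assumptions (8 banks for 8 threads, 8 registers per bank, one latch between stages), not the F-RISC/H design database:

```python
# Toy behavioral model of a 3-stage micropipelined, banked register file:
# stage 1 decodes the address, stage 2 charges the word line and reads the
# bit lines within the addressed bank, stage 3 performs the 8-to-1 bank
# selection (the single current tree plus flip-flop described in the text).
NUM_BANKS = 8       # one bank per hardware thread (8 threads assumed)
REGS_PER_BANK = 8   # 8 integer registers per thread, per the text

class MicropipelinedRegFile:
    def __init__(self):
        self.banks = [[0] * REGS_PER_BANK for _ in range(NUM_BANKS)]
        self.stage1 = None   # latch: decoded (bank, reg) address
        self.stage2 = None   # latch: (bank, word driven onto bit lines)
        self.result = None   # output of the 8-to-1 bank selector

    def write(self, bank, reg, value):
        self.banks[bank][reg] = value

    def clock(self, read_addr=None):
        """Advance one pipeline cycle; read_addr is (bank, reg) or None."""
        # Stage 3: bank selection (current tree + flip-flop)
        self.result = self.stage2[1] if self.stage2 is not None else None
        # Stage 2: word-line chargeup / bit-line readout
        if self.stage1 is not None:
            bank, reg = self.stage1
            self.stage2 = (bank, self.banks[bank][reg])
        else:
            self.stage2 = None
        # Stage 1: address decode
        self.stage1 = read_addr

rf = MicropipelinedRegFile()
rf.write(3, 5, 42)
rf.clock(read_addr=(3, 5))  # cycle 1: decode
rf.clock()                  # cycle 2: word line / bit lines
rf.clock()                  # cycle 3: bank select
print(rf.result)            # 42, available after three pipe stages
```

The latency of any one read is three cycles, but a new read can be launched every cycle, which is exactly what lets the short per-stage logic depth track a 16 GHz clock.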