F-RISC/G and Beyond - Subnanosecond
Fast RISC for
TeraOPS Parallel Processing Applications

ARPA Contract Numbers DAAL03-90G-0187,
DAAH04-93G-0477,
[AASERT Award DAAL03-92G-0307 for Cache Memory]
Semi-Annual Technical Report
May 1995 - January 1996


Prof. John F. McDonald
Center for Integrated Electronics
Rensselaer Polytechnic Institute
Troy, New York 12180

Phone - (518)-276-2919
Fax - (518)-276-8761
MacinFax - (518)-276-4882
Email - mcdonald@unix.cie.rpi.edu
URL - http://inp.cie.rpi.edu/research/mcdonald/frisc/reports


Abstract

The F-RISC (Fast Reduced Instruction Set Computer) project has as its goal the exploration of the upper speed envelope of one computational node through the use of advanced HBT technology. F-RISC/G involves development of a one nanosecond cycle time computer using 50 GHz GaAs/AlGaAs Heterojunction Bipolar Transistors (HBT). In the past contract period the primary activity was a revision of all chips to avoid distributed RC effects, since our interconnect analysis showed that distributed RC effects can double the interconnect delays on 8 mm long Metal 1 lines and cause an excess delay of 150 ps. This distributed effect was not considered in the earlier simulations because our CAD tools did not support distributed RC interconnect delays. The reduced dielectric thicknesses discussed in the last report, together with the shrink of the metal to compensate for the thinner dielectric layers, push long Metal 1 lines into the quadratic delay regime, which has to be avoided in order to meet the 1 ns cycle time goal. In the last contract period we also designed, in collaboration with Rockwell, new devices with scaled emitters for faster switching performance at lower power. Further, test structures with ring oscillators for measuring the performance of the new devices and the RC delay effect were submitted to Rockwell for inclusion on the next HSCD reticle. A structure with one of the new devices has already been fabricated at Rockwell on a reticle of the Rockwell Science Center, thanks to K. C. Wang. Preliminary testing showed that the scaled emitter devices switch faster at lower current/power levels. We have also upgraded our design tools to be able to model distributed RC delay effects in our simulations. We are working with the vendor of our CAD tool suite to adapt the distributed RC modeling capability from CMOS to the Rockwell HBT technology.

Project Goals



The Distributed RC Problem

Our interconnect characterization effort, which we performed as part of an HSCD subcontract to Rockwell, showed that the measured interconnect capacitances were higher than expected based upon the dielectric layer thicknesses and constants listed in the 1995 design manual. Measurements at Rockwell showed that the interlayer dielectrics were thinner than expected, 0.9 µm instead of 1.6 µm, at Metal 1 to Metal 2 crossovers. The spin-on polyimide does not planarize as well as expected; the thickness of the polyimide dielectric layer is a function of the structure of the previously deposited metal layers. The lack of planarization and the reduced thicknesses increased our interconnect capacitance, resulting in a speed degradation of about 20%.

To compensate for the higher capacitance of the wires we shrank the width of Metal 1 from 2.0 µm to 1.6 µm, taking advantage of the new wire design rules in the Rockwell HBT process. Shrinking the wires reduces the interconnect capacitance to about the level that we expected based on nominal dielectric layer thicknesses. However, an analysis of distributed RC indicated that Metal 1 wires longer than 2 mm show distributed RC delay effects, resulting in a quadratically growing interconnect delay. This causes excessive interconnect delays that were not included in our simulations, since our tools previously supported only a linear interconnect delay model. Even using 2.0 µm wide Metal 1 wires to lower the wire resistance does not significantly reduce the quadratic delay effects, since the wider wires have larger coupling and crossover capacitances. This is surprising, since distributed RC delays are typically relevant only in interconnects with submicron design rules. We notice distributed RC effects in our HBT circuits earlier because the drive impedance of bipolar drivers is low compared to CMOS; hence the wire resistance exceeds the output impedance of the drivers after only a few millimeters of interconnect. Since the targeted cycle time for our HBT chips is very short, a 50 ps excess delay due to distributed RC effects is significant, while a 50 ps delay increase in a CMOS circuit would have a much lower impact.

In the Rockwell polyimide/Au interconnect process, only Metal 1 shows significant distributed RC delay effects. The effect is only significant if the total Metal 1 wire length from the driver to a receiver on a net exceeds 2 mm. Metal 2 and Metal 3 are much thicker and have lower interconnect capacitances per unit length, since they are further away from the GaAs substrate, which has a dielectric constant of 13.1. The other dielectrics in the Rockwell process (SiO2 εr=3.9, SiN εr=6.9, polyimide εr=2.9) have significantly lower dielectric constants. Table 1 shows the characteristics of the three interconnect layers in the Rockwell HBT process, and Figure 1 shows a vertical Metal 3 channel crossing a Metal 2 wiring channel and a standard cell.

Table 1. Interconnect Layer Properties
Rockwell's Polyimide/Au Interconnect Layers

Layer                          Thickness [µm]   Resistance [Ohm/mm]   Capacitance [fF/mm]
Metal 1 - W=1.6 µm P=6.0 µm    0.65             35                    143
Metal 2 - W=2.0 µm P=6.0 µm    1.40             14                    101
Metal 3 - W=3.0 µm P=9.0 µm    1.60             11                    123
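
To get a feel for how the layers compare, the per-millimeter values in Table 1 can be plugged into a common first-order estimate for the intrinsic delay of a distributed RC line, roughly 0.4*r*c*L^2. This is only a sketch: the coefficient 0.4 is a textbook approximation for the 50% point and is an assumption here, not a figure from this report, and the function name is illustrative. The report's own delay numbers come from a full distributed RLC analysis.

```python
# Per-unit-length values copied from Table 1: (resistance [Ohm/mm], capacitance [fF/mm]).
LAYERS = {
    "Metal 1": (35.0, 143.0),
    "Metal 2": (14.0, 101.0),
    "Metal 3": (11.0, 123.0),
}


def distributed_rc_delay_ps(r_ohm_per_mm, c_ff_per_mm, length_mm):
    """Approximate intrinsic delay of a distributed RC line: 0.4 * r * c * L^2.

    The 0.4 coefficient is an assumed textbook approximation.
    Ohm * fF = 1e-15 s = 1e-3 ps, hence the final scale factor.
    """
    return 0.4 * r_ohm_per_mm * c_ff_per_mm * length_mm ** 2 * 1e-3


for name, (r, c) in LAYERS.items():
    row = ", ".join(
        f"{length} mm: {distributed_rc_delay_ps(r, c, length):6.1f} ps"
        for length in (2, 4, 8)
    )
    print(f"{name}: {row}")
```

Under these assumptions the term grows fourfold each time the length doubles. For Metal 1 at 8 mm it comes to roughly 128 ps, the same order as the 150 ps excess delay quoted in this report.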

Figure 1. Vertical Metal 3 Channel of Standard Cell and Horizontal Metal 2 Wiring Channel.


Strategies for Solving the Distributed RC Problem

Moving Long Metal 1 Interconnections to Metal 3

Since Metal 3 is thicker than Metal 1, its sheet resistance is only 16 mΩ per square instead of 55 mΩ per square. In addition, the M2-M3 dielectric is thicker than the M1-M2 dielectric, which reduces coupling and capacitance to Metal 2, and Metal 3 is farther away from the GaAs substrate with its high dielectric constant of 13.1. However, the minimal width and spacing of Metal 3 are 3 µm, compared with 1.6 µm for Metal 1 and 2.0 µm for Metal 2. Also, while the minimal width and spacing of the interconnect were reduced in 1995, the vias were not scaled. An M2-M3 via requires a 5 µm by 5 µm Metal 2 and Metal 3 area and an M1-M2 via requires a 3 µm by 3 µm area. The large size of the vias can be seen in Figure 1, which shows a Metal 3 wiring channel crossing a standard cell with two power rails and a standard cell wiring channel.

The RC problem can be avoided if segments of nets that contain long Metal 1 wires between the driver and a receiver are moved to Metal 3. However, the routing density of Metal 3 is much lower than Metal 1 and Metal 3 is already used for the power rails. Thus the power rails have to be cut and bridged with Metal 2 as shown in Figure 1. Metal 1 is used for vertical interconnects on the F-RISC/G architecture chips and Metal 3 is used for horizontal power rails. Fortunately, our estimates indicated that only 20-40 nets are affected on each chip. Thus the reduced wire density on Metal 3 is no impediment.

Figure 2. Test Structure for Metal 1 Capacitance.

Let us assume the Metal 3 power rails are broken into 3 vertical stripes to make room for a total of 40 differential Metal 3 wires. The use of at most three vertical Metal 3 channels keeps the increase in the power rail resistance minimal, since at most 0.5 µm of a standard cell power rail must be bridged with Metal 2. Metal 3 can be routed on top of the existing M1-M2 wiring channels. Figure 2 and Figure 3 show the two test structures used for 3-D capacitance extraction of typical Metal 1 and Metal 3 interconnect capacitances. The structures contain a Metal 2 wiring channel with 20 differential signals and a standard cell with two Metal 3 power rails to emulate an average routing area on an F-RISC/G architecture chip. The signal delays on Metal 1 and Metal 3 are analyzed with the 3-D QuickCap capacitance extractor with nominal and reduced interlayer dielectric thicknesses and a substrate thickness of 100 µm (a lapped wafer).

Based on these results a move from Metal 1 to Metal 3 can both lower the interconnect capacitance on long nets and eliminate the distributed RC effect! Additional QuickCap runs indicate that the capacitance increase on Metal 2 wiring due to additional Metal 3 overlap in the wiring channels is small (<5%).

The structure simulates a Metal 3 wiring channel crossing a standard cell row and a wiring channel with 20 differential Metal 2 wires. The channel cut in the Metal 3 power rails is closed with Metal 2 and M2-M3 vias as shown in Figure 1. The wiring pitch for Metal 3 is varied from 6 µm to 10 µm. The Metal 3 width is 3 µm, the minimal Metal 3 design rule.

Figure 3. Test Structure for Metal 3 Capacitance.

Assuming the following net configuration:

Figure 4. Net Configuration for Delay Analysis.

Figure 5. Signal at Receiver End of Metal 1 and Metal 2 RLC Lines.

The net configuration includes the pull-up resistor of the CML driver, the switched current source (driver), the output capacitance Co, which limits the driver rise/fall time, the distributed RLC line, the input capacitance of the receiver Cr, and the receiver switching threshold Th.

Figure 6. Signal at Receiver End of Metal 3 RLC Lines.


A distributed analysis of a net for the differential signal mode using a numerical Laplace transform of order 18 yields the signal waveforms shown in Figure 5 and Figure 6 with Is = 1.6 mA (high power gate with output at level 1), Is*Rp = 250 mV, a maximum signal rise time for an unloaded driver of 50 ps, and a Cr of 50 fF. The capacitance was extracted with 3-D QuickCap using reduced and nominal dielectric thicknesses, and the inductance was estimated from L*C = εr/c0^2 with εr = 2.9.
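
The inductance estimate can be made concrete with a short calculation. The sketch below applies L*C = εr/c0^2 to a given capacitance per unit length and also derives the lossless characteristic impedance and the time of flight. The example capacitance of 143 fF/mm is the Metal 1 value from Table 1, used here only for illustration; which capacitance was paired with each line in the actual analysis is not stated, and the function name is hypothetical.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum [m/s]


def line_parameters(c_ff_per_mm, eps_r=2.9):
    """Return (L [nH/mm], Z0 [Ohm], time of flight [ps/mm]) using L*C = eps_r / c0^2."""
    c_per_m = c_ff_per_mm * 1e-15 / 1e-3          # fF/mm -> F/m
    l_per_m = eps_r / (C0 ** 2 * c_per_m)         # H/m
    z0 = math.sqrt(l_per_m / c_per_m)             # lossless characteristic impedance
    tof_ps_per_mm = math.sqrt(l_per_m * c_per_m) * 1e12 * 1e-3  # s/m -> ps/mm
    l_nh_per_mm = l_per_m * 1e9 * 1e-3            # H/m -> nH/mm
    return l_nh_per_mm, z0, tof_ps_per_mm


l_nh, z0, tof = line_parameters(143.0)
print(f"L = {l_nh:.3f} nH/mm, Z0 = {z0:.1f} Ohm, time of flight = {tof:.2f} ps/mm")
```

Note that with this estimate the time of flight depends only on εr, so it is the same for every layer; the layers differ in how much the series resistance adds on top of it.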

The Metal 1 line is dominated by losses, and the signal rise time is severely degraded on long lines, causing long delays and a jitter problem in the presence of noise. The Metal 3 lines have much less resistance and show some ringing, since on-chip interconnects are not terminated.
Figure 7 shows Metal 1 and Metal 3 interconnect delays as a function of interconnect length. The Metal 3 delays are almost linear up to 10 mm while the Metal 1 delays are clearly quadratic! The delay of an 8 mm long Metal 1 line is twice as long as expected from a linear delay model! The delay on a net running from top to bottom on a chip can therefore be 150 ps longer than originally simulated. Even spacing Metal 1 wires further apart leads to no significant improvement. Increasing the spacing of Metal 3 from 6 µm to 7 µm improves delays on long lines. Further increases in the Metal 3 spacing have no significant effect on delays, but reduce interconnect density. Considering the size of the M3-M2 vias, a routing pitch of at least 9 µm is recommended. A 9 µm Metal 3 pitch allows in-line vias to be placed in the Metal 3 channels. Thus with a 9 µm pitch the transition from Metal 3 to Metal 2 or Metal 1 can be done in the existing Metal 2 and Metal 3 wiring tracks of the nets to be moved to Metal 3. The M3W30P90r (Metal 3, width = 3.0 µm, pitch = 9.0 µm, reduced interlayer dielectric thickness) delays are a few percent lower than the original M1W20P60nR delays, which do not include the wire resistance and are based on a sensitivity towards capacitance of 112 Ohm for a high power gate with the output at level 1. Table 2 lists the interconnect delays for Metal 1, Metal 2, and Metal 3 differential interconnections as a function of wire length. The table shows the delays with nominal (suffix n) and reduced (suffix r) dielectric thicknesses. The M1W20P60nR entries represent the expected delays with nominal dielectric thicknesses and no distributed RLC effects (Rwire=0). Table 3 lists the equivalent capacitance to ground (C11+C12=C22+C12) of differential Metal 1, Metal 2, and Metal 3 interconnects.

Figure 7. Metal 1 and Metal 3 RLC Line Delays.

Table 2. Distributed RLC Interconnect Delays for M1, M2 and M3 Layers.
Interconnect Delays [ps]

Length         2 mm    4 mm    6 mm    8 mm   10 mm
M1W16P60r      56.7   121.7   206.7   309.2   429.2
M1W20P60n      51.7   114.2   186.7   276.7   379.2
M2W20P60r      39.2    74.2   114.2   159.2   206.7
M2W30P60n      39.2    79.7   121.7   166.8   214.2
M3W30P90r      39.2    71.7   104.2   139.2   171.7
M1W20P60nR     42.3    78.0   115.6   152.3   188.9
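
The quadratic trend can be read directly from the numbers in Table 2. The following sketch takes three rows as printed and computes their second differences (roughly constant and positive for a quadratic, near zero for a linear trend), along with the excess of the M1W16P60r delay over the linear M1W20P60nR model at 8 mm. The variable and function names are illustrative.

```python
LENGTHS_MM = [2, 4, 6, 8, 10]

# Delay rows copied from Table 2 [ps].
DELAYS_PS = {
    "M1W16P60r":  [56.7, 121.7, 206.7, 309.2, 429.2],
    "M3W30P90r":  [39.2, 71.7, 104.2, 139.2, 171.7],
    "M1W20P60nR": [42.3, 78.0, 115.6, 152.3, 188.9],
}


def second_differences(values):
    """Second finite differences of an equally spaced series, rounded to 0.1 ps."""
    first = [b - a for a, b in zip(values, values[1:])]
    return [round(b - a, 1) for a, b in zip(first, first[1:])]


for name, delays in DELAYS_PS.items():
    print(f"{name:11s} second differences [ps]: {second_differences(delays)}")

i8 = LENGTHS_MM.index(8)
excess = DELAYS_PS["M1W16P60r"][i8] - DELAYS_PS["M1W20P60nR"][i8]
ratio = DELAYS_PS["M1W16P60r"][i8] / DELAYS_PS["M1W20P60nR"][i8]
print(f"8 mm: excess over linear model = {excess:.1f} ps, ratio = {ratio:.2f}")
```

The Metal 1 second differences stay near 17 to 20 ps per step, while those for Metal 3 scatter around zero, which matches the description of Figure 7.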

Table 3. Equivalent Capacitance for Differential M1, M2, and M3 Interconnects.
Equivalent Capacitance To Ground for Differential Interconnects.

Interconnect   C11=C22 [fF/mm]   C12 [fF/mm]
M1W16P60r      143.5             22.0
M1W20P60n      135.1             28.6
M1W16P70r      141.0             19.4
M1W16P80r      137.7             18.1
M2W20P60r      100.5             20.8
M2W30P60n       93.8             21.3
M3W30P60r      135.1             25.2
M3W30P70r      125.4             17.5
M3W30P80r      124.1             13.7
M3W30P90r      122.8             13.4
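
As a consistency check, the linear (Rwire=0) delay model can be reproduced from Table 3. The sketch below assumes that the M1W20P60nR row of Table 2 follows delay = S*(Ceq*length + Cr), with the sensitivity S = 112 Ohm and the receiver capacitance Cr = 50 fF quoted in the text. This closed form is an inference from the reported numbers, not a formula stated by the authors, and the function names are hypothetical.

```python
# (C11, C12) in fF/mm, copied from Table 3.
TABLE3 = {
    "M1W16P60r": (143.5, 22.0),
    "M1W20P60n": (135.1, 28.6),
    "M3W30P60r": (135.1, 25.2),
    "M3W30P70r": (125.4, 17.5),
    "M3W30P80r": (124.1, 13.7),
    "M3W30P90r": (122.8, 13.4),
}


def equivalent_capacitance(name):
    """Equivalent capacitance to ground, C11 + C12, in fF/mm."""
    c11, c12 = TABLE3[name]
    return c11 + c12


def linear_delay_ps(name, length_mm, sensitivity_ohm=112.0, c_receiver_ff=50.0):
    """Assumed linear model: delay = S * (Ceq * length + Cr). Ohm * fF = 1e-3 ps."""
    total_ff = equivalent_capacitance(name) * length_mm + c_receiver_ff
    return sensitivity_ohm * total_ff * 1e-3


for length in (2, 4, 6, 8, 10):
    print(f"{length:2d} mm: {linear_delay_ps('M1W20P60n', length):6.1f} ps")

for name in ("M3W30P60r", "M3W30P70r", "M3W30P80r", "M3W30P90r"):
    print(f"{name}: Ceq = {equivalent_capacitance(name):.1f} fF/mm")
```

Under this assumption the computed delays agree with the M1W20P60nR entries to within about 1 ps. The Metal 3 values also show that most of the capacitance reduction comes from the first step in pitch, from 6 µm to 7 µm.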

Voltage Drop on Power Rails

Moving long Metal 1 interconnections to Metal 3 requires that the standard cell power rails that run horizontally be cut and bridged with Metal 2 to make room for vertical Metal 3 interconnect channels. The Metal 2 bridges and the M2-M3 vias added on the power rails increase the voltage drop on the power rails, and a worst case analysis is required to determine whether the Metal 2 bridges cause a significant voltage drop.

Figure 8 shows the voltage drop on a 7.2 mm long Metal 3 power rail and on a 7.2 mm power rail with three bridges for 14 differential Metal 3 wires with a 9 µm pitch at 1/5, 1/2, and 4/5 of the length. The three Metal 2 power rail bridges are each 288 µm wide. Clearly, even under worst case assumptions the Metal 2 bridges and vias do not significantly increase the voltage drop on the power rails.

Figure 8. Worst Case Voltage Drop on Power Rail.


The M2-M3 via resistance is 0.004 Ω for a minimal via. Since the current density is high, the number of vias on each side of a bridge should be at least five. The total via resistance is then 6*0.004 Ω / 5 = 0.005 Ω. Hence, even if the maximum current reaches 100 mA, the voltage drop on the vias is not significant.
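
The via estimate can be written out as a short calculation. The sketch assumes that the factor of six corresponds to three bridges with one via group on each side, all in series along the rail, and that each group consists of five minimal vias in parallel. The function and parameter names are illustrative.

```python
def via_chain(r_via_ohm=0.004, vias_per_side=5, bridges=3, current_a=0.100):
    """Return (total via resistance [Ohm], voltage drop [V]) along one power rail.

    Assumes two via groups per bridge in series, each group being
    `vias_per_side` minimal vias in parallel.
    """
    groups_in_series = 2 * bridges
    r_total = groups_in_series * r_via_ohm / vias_per_side
    return r_total, r_total * current_a


r_total, v_drop = via_chain()
print(f"total via resistance = {r_total * 1e3:.1f} mOhm")
print(f"voltage drop at 100 mA = {v_drop * 1e3:.2f} mV")
```

With the report's numbers this gives 4.8 mΩ, which rounds to the quoted 0.005 Ω, and a drop of about 0.5 mV at 100 mA.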

Datapath Chip Modifications

As with the other chips in the Rensselaer F-RISC processor, the Datapath chip was modified to fix the RC-delay problems. The Datapath chip is more regular than the Instruction Decoder chip, but less so than the Cache Controller or Cache RAM chips.
Before the Datapath chip could be modified, the lines in question had to be identified. Since our design tools did not model distributed RC effects, identifying the nets that needed to be modified was difficult. Analysis of the capacitance values stored in the file used during the routing and simulation phases of the initial chip design indicated that about 30 lines per chip had large enough capacitances to include several millimeters of Metal 1. Note that the selection of nets was limited to those which were actively involved in the operation of the chip, excluding any nets associated with the boundary-scan testing scheme, which operates at much lower clock rates.
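
The selection step can be pictured as a simple filter over the extracted capacitance values. The sketch below is illustrative only: the net names, the `bscan_` prefix used to mark boundary-scan nets, and the threshold are hypothetical, and the conversion from capacitance to a possible Metal 1 length uses the 143 fF/mm figure from Table 1. A large capacitance only indicates that a net could contain a long Metal 1 run; the layout still has to be inspected.

```python
M1_CAP_FF_PER_MM = 143.0  # Metal 1 capacitance per unit length, from Table 1


def candidate_nets(net_caps_ff, min_m1_mm=2.0, limit=30, skip_prefix="bscan_"):
    """Return up to `limit` (net, capacitance) pairs, largest capacitance first.

    A net qualifies if its capacitance is at least what `min_m1_mm` of
    Metal 1 would contribute and it is not a boundary-scan net.
    """
    threshold_ff = min_m1_mm * M1_CAP_FF_PER_MM
    picked = [
        (net, cap)
        for net, cap in net_caps_ff.items()
        if cap >= threshold_ff and not net.startswith(skip_prefix)
    ]
    picked.sort(key=lambda item: item[1], reverse=True)
    return picked[:limit]


# Hypothetical extracted capacitances in fF.
example = {"alu_carry": 910.0, "bus_a3": 450.0, "bscan_clk": 1200.0, "local_q": 60.0}
print(candidate_nets(example))
```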
Datapath Interconnect Optimization
There were two modes of interconnect optimization for the Datapath chip: local and global. Global optimization involved observing the entire path of a signal, from source to destination(s), while local optimization would consider only interconnect for a smaller area of the chip. Figure 9 contains an example of local and global optimization of a signal. Note that inputs to all standard cells in the F-RISC library pass through the entire cell, hence connections may be made on either side of the cell. This can be seen in the top row of Figure 9, part (a) in which a signal enters the cell on the bottom of the row, makes a connection within the cell, exits from the top and continues on to other standard cells.

(a) Original Routing (b) Local Optimization (c) Global Optimization

Figure 9. Local and Global Optimization

Local and Global Interconnect Optimization

Local optimization is confined to a small area of the chip which is usually limited by the resolution of the designer's workstation. In other words, the designer is constrained to optimizing only the traces which can be easily observed on the workstation screen. In the example, the optimizations are limited to rerouting wire segments which are adjacent to other segments of the same net, thereby eliminating the need for one feedthrough block and reducing the length of the net in the upper-left corner.
Global optimization is achieved in a similar manner to local optimization; however, special procedures must be undertaken in order to observe the entire net simultaneously. As can be seen in the figure, global optimization considers the entire net and, barring any unresolvable wiring conflicts, can produce better results than local optimization. Unfortunately, the special procedures necessary just to observe entire nets and begin the optimization process were cumbersome and time-consuming. First, potential candidates for global optimization had to be identified. Next, the MASKAP layout-versus-schematics (LVS) tool was used to produce plots of the affected nets in order to understand and identify any possible optimization. When a problem net was found, it was located and further examined using the Compass CAD tools in order to identify any constraints on the optimization (e.g. congested wiring channels, lack of available feedthroughs). Due to the time-consuming nature of global interconnect optimization, only the 30 lines with the highest capacitance values were modified. Fortunately, local optimization was much easier to accomplish and was performed wherever possible.

Optimization Methods

The only optimization methods available were rerouting signal traces and changing the interconnect metallization from one level to another with more favorable electrical characteristics. Both of these methods were often constrained by the existing interconnect within the wiring and feedthrough channels. In order to compensate for the wiring congestion, often other interconnect traces were rerouted to "make-way" for the desired nets.
Rerouting signal traces was often a matter of removing certain segments of a net and adding segments elsewhere. The design of the F-RISC standard cell library required that cell connections be available on both the top and bottom sides. Often the router would pass a signal through a feedthrough in order to connect to one side of the cell when the connection could be made more easily on the other side. The inefficiency of the router introduced many unnecessary feedthroughs and segments into the layout.
Analysis of recent test results from the latest Rensselaer fab run has produced new data on the electrical parameters of the interconnect. Based upon these numbers, it was determined that signal traces should avoid using first-level metal (metal-1) as much as possible. Second-level metal was preferred as it has a lower sheet resistance and thus shows only very small RC-delay effects. During the optimization process, metal-1 segments were replaced by metal-2 whenever possible.
To further mitigate the effects of RC-delays, it was decided that the nets with the largest capacitance should be rerouted using third-level metal (metal-3) in order to take advantage of the significantly lower sheet resistance. In addition, metal-3 can be routed over active devices, allowing more freedom during the routing process. However, metal-3 is used nearly exclusively in the Datapath chip power rails. In order to use metal-3 as interconnect, it was necessary to define channels for the metal-3 signals and swap the metal-3 power rails to second-level metal.
Global Interconnect Optimization Methods
As mentioned previously, high-capacitance wires were rerouted using metal-3 in order to reduce RC effects and shorten the overall length of the nets. The rerouting process sometimes required the creation of a metal-3 vertical wiring "channel" within a standard cell area, hence any metal-3 power rails had to be moved down to metal-2 in the vicinity of the channel. The channels were placed near the center of standard-cell areas in order to reduce the effect of higher-resistance metal-2 upon the power rail voltage droop.
Another option which was utilized was to reroute signals outside of a standard-cell area. The original chip route would place signals through a standard-cell block even if there were no connections within the block. This method of routing did not reduce the overall length of the wire and may occasionally have increased it. In addition, these routes dramatically increased the parasitic capacitance upon the wires. Optimization of these nets was accomplished by removing the wire segments from the standard-cell area and rerouting the signal in space adjacent to the standard-cell block. This reduced the parasitic capacitance upon the signal itself as well as for the wires inside the standard-cell area. Metal-3 was used as the interconnect due to its lower resistance.
Examples of global optimization are shown in Figure 10. Blue lines represent signal wires placed by the router while the red lines show the optimized route paths. The first plot shows three data bus lines which have been optimized via a metal-3 channel placed adjacent to the standard-cell area (the metal-3 channels are contained within the green borders.) The second plot shows three lines which have very different topologies, but share metal-3 channels which are contained within standard-cell areas. There is a channel in both the upper-right and bottom standard-cell areas, each of which was created by briefly moving the metal-3 power rails down to metal-2. The metal-3 interconnect runs vertically over the power rails and devices so as to reduce the overall length. The driver and receiver components of the nets are depicted in the Figure using the letters "D" and "R".

(a) Optimization with external M3 channel (b) Optimization with internal M3 channels

Figure 10. Global interconnect optimizations


Local Interconnect Optimization Methods

As with the global optimization methods, rerouting of signals and the usage of metal-2 were the primary local optimization methods available. As shown in Figure 9, often segments could be removed and placed elsewhere in order to shorten the length of a net and reduce the number of feedthroughs. In addition, there were often cases where the differential signal components contained extraneous crossovers (as shown in Figure 11). These crossovers were introduced by the Cutter tool developed at Rensselaer. This tool was developed in order to produce tightly-coupled differential routes using common CMOS VLSI CAD tools. The Cutter tool was rewritten in 1995 to avoid this situation.

(a) Extraneous differential-signal crossovers (b) Optimized layout

Figure 11. Differential-signal crossovers introduced by Cutter


Cache Memory and Cache Controller Chip Modifications

Due to advances in our CAD tools since the design of the instruction decoder and datapath chips was completed, the cache controller and cache RAM chips did not suffer from the sort of serpentine routes on critical lines that the earlier designs had. Specifically, "Cutter," the program created by the F-RISC research group to create differential pair routing from single-ended routes, was modified to allow a closer correspondence between net names on the layout and net names in the netlist, and to require less manual routing and hand-modification. With the time saved by these modifications, it was possible to dedicate more time toward optimizing the original routes. In addition, the most critical path through the RAM chip had been simulated using SPICE and QuickCap, a three-dimensional capacitance extractor, which had not been available at the time of the core CPU design.
An additional factor was that the timing in the cache was already more critical than most CPU paths (the adder carry-chain being a notable exception). As a result, more care was taken in placing cells in such a way as to minimize capacitive loading. When the problem of RC delay became apparent, the steps which had been taken to minimize capacitive delay also helped to minimize this new problem. The main area where modifications to the two chips were necessary was within the RAM and tag RAM blocks. Before the undesirability of routing long nets in Metal 1 had been exposed, the blocks made extensive use of Metal 1 to route data from the top to the bottom of the block. Metal 1 was used for long wire routes, switching to higher level metal layers only when necessary to cross over other Metal 1 wires. This design philosophy is illustrated in Figure 12 for a wire channel within a cache memory block.

Figure 12. Cache memory wire channel using Metal 1.

With the discovery that the relatively high sheet resistance of Metal 1 wires with respect to Metal 2 and Metal 3 wires results in increased signal delay when distributed RC effects are considered, steps were taken to eliminate long Metal 1 wires from the design in order to ensure that cache performance specs were still met. New design rules from Rockwell allowed us to reduce the width of Metal 2 wires to 2 µm, the old minimum width of the Metal 1 wires originally used to route the cache chips. This allowed us to change long Metal 1 wires in the cache blocks to Metal 2, decreasing the wire resistance without impacting the wire capacitance significantly. This is exemplified in Figure 13, which shows improvements in the cache memory block in Figure 12.
The core CPU chips are composed mostly of standard cell routing areas and have vertical aspect ratios. Metal 1 in standard cell routing areas posed a large problem in these chips because the router used Metal 1 mainly for vertical wires, resulting in long Metal 1 routes within standard cell areas of the core CPU chips. The cache RAM chip, however, is composed mostly of hand routed memory blocks, relying only on a minimum amount of standard cell routing. Therefore, vertical Metal 1 in standard cell areas did not pose a large problem in the Cache RAM chip. The Cache Controller chip, on the other hand, relied more on standard cell routing in its design. However, the Cache Controller chip has a horizontal aspect ratio, resulting in shorter vertical Metal 1 lines than observed in the core CPU chips. Long Metal 1 routes that were discovered in the standard cell areas of the cache chips were optimized by replacing Metal 1 with Metal 2 and Metal 3 in the same manner that was described for the core CPU chips.

Figure 13. Redesigned cache memory wire channel using Metal 2 in place of Metal 1 to improve RC wire delays

The signals which would have been adversely affected are particularly critical. The primary memory critical path, the data RAM access, is dominated by the RAM block access time. The access time is heavily dependent on the interconnect delay on the data output lines and the address broadcast. These lines would have been delayed beyond what our simulations showed by the RC-type effects. The secondary cache critical path, the tag RAM comparison, would also have been adversely affected by RC effects. By changing metallization levels to minimize resistance on critical lines, the RC-delay problem in the cache has been neutralized.

Instruction Decoder Modifications

The instruction decoder is 7.17 mm wide by 8.07 mm high. Since all vertical connections were routed in Metal 1, any signal which traverses half of the chip in the vertical direction must contain a minimum of 4 mm of Metal 1, and signals crossing almost the entire chip can contain up to 8 mm of Metal 1.

Figure 14. Interconnect Capacitance Distribution on Instruction Decoder Chip.

Figure 14 shows that the majority of the nets are short and have very low interconnect capacitances. These nets do not suffer from the distributed RC effect. There are only a few long nets that can have long Metal 1 sections between the driver and a receiver. Some of the long nets are also boundary scan configuration nets and are not speed critical.

The Instruction Decoder chip contains four pad areas and one large standard cell area. The region between the pads and the standard cell area contains power and ground routing, but much of the available area is not heavily utilized. Rather than use the strategy implemented in the Datapath chip, which would have involved creating a Metal 3 signal channel down the center of the standard cell area, the regions outside the standard cell area were used for Metal 3 wiring channels instead. Signals running more than 2 mm in Metal 1 were routed outside the standard cell area in Metal 2 and Metal 3. This strategy not only reduced the capacitance and resistance on these critical lines, but also reduced the overall capacitance of the chip.

Figure 15 shows two lines in the instruction decoder which required rerouting due to the overall length of Metal 1 between the driver and the furthest receiver. The nets were rerouted in Metal 2 and Metal 3 as shown in Figure 16. The overall capacitance of these nets was reduced, and the overall length in Metal 1 was reduced to well under half a millimeter. Over a hundred nets were optimized in a similar fashion, and each time a net was removed from the standard cell area, its routing resources were used to optimize other less critical nets, further dropping the capacitance of the overall design. By removing excess delay in the interconnect, the chip will still operate at the required rate should the fabrication process generate devices that fall short of the required operating speed.

Figure 15. Two Signals Requiring Rerouting.



Figure 16. Signals After Rerouting.


RC Simulation

The processor architecture and layout have been verified using back-annotated simulations in which the capacitance on each net is extracted and an appropriate delay is associated with the net according to its capacitance and the strength of the driver. The extraction is performed on the layout of the final chips, and the extracted capacitance values are then used in the simulations of the schematic files. The simulation models a linear delay increase of a gate with respect to the interconnect capacitance, but it does not address problems with signal skew (in the event that a net has multiple receivers) or consider the resistance of the interconnect.

Figure 17. Distributed RC Model Of Test Structure

The problem with signal skew was addressed by carefully designing the clocking scheme so that signal skew cannot affect the system. An example of one such design strategy is to ensure that cascaded latches are clocked two clock phases away from each other. This allows for up to 250 ps of skew in either direction on the two latches without risk of the data slipping through one latch and corrupting the data in the next stage of the latch pipe.

Extensive SPICE simulations have shown that the resistive effects of the interconnect on nets containing less than 2 mm of Metal 1 are quite small, but increase quadratically as the length of Metal 1 increases. The effect significantly degrades signal rise/fall times with Metal 1 lengths above 4 mm.

With the recent upgrade of our CAD tools, we are now able to correctly simulate our designs with distributed RC effects taken into account. Once these tools are set up and calibrated to our process, our layouts are run through the interconnect extractor, which can automatically determine the delay from each driver to every receiver that the driver sources. These delays are then used in a top level simulation to ensure that the chip will run at the required operating speeds, and that the various delays on the nets do not adversely affect the behavior of the overall system.

Figure 18. Delay Model Of Test Structure

Figure 19. Test Structure For Calibrating RC Effects

Figure 17 shows the distributed RC network of a test structure used to calibrate the CAD tools. The extractor determines the resistance and capacitance of each of the nets and generates a delay model for each net (shown in Figure 18). This model is then simulated, and the simulator results are verified against SPICE simulations of an identical network. The layout used for this RC model is shown in Figure 19.


Ring Oscillator Test Structures Overview

During the summer of 1995, it was determined that better performance could be obtained from the Rockwell HBT process using devices with smaller base-emitter spacings. Rockwell did not have updated standard process devices available, but they did provide a number of experimental device layouts which had smaller device areas and promised better switching characteristics. In order to investigate the speed and yield of the new designs, the F-RISC group developed a series of ring-oscillators. Each oscillator used the same layout but with different devices. In the end, two devices were selected as candidates for obtaining the best performance.
The original device layouts had an emitter size of 1.6 µm x 2.0 µm which was shrunk to 1.2 µm x 1.6 µm in order to reduce the emitter capacitance. These shrunken devices were then inserted into the ring-oscillators and submitted to Rockwell for fabrication. Of these two circuits, only one was fabricated due to lack of available chip space. The finished wafers came out of fab at the end of December and will be sent to RPI for testing once the owner of the wafer run has finished their testing procedure. While the Rockwell-supplied devices were being fabricated, the F-RISC group further analyzed and modified one of the experimental devices, adding a third active area to the base in order to reduce the base resistance.
The ring-oscillators provided by the F-RISC research group use a 50-stage circuit which incorporates the experimental HBT devices (see Figure 20). The circuit requires non-standard bipolar power supplies in order to use measurement equipment with internal 50 Ω terminators.

Figure 20. 50-stage ring-oscillator layout

Layout Information

The layout dimensions for either oscillator are 1838 µm by 176 µm. There are 102 experimental HBT devices and 4 standard Rockwell HBT devices in an oscillator. The experimental devices are single-emitter HBTs and were produced by Rensselaer based upon larger devices received from Rockwell. The source devices from Rockwell were also experimental devices and were only modified by shrinking the emitter dimensions and position. The standard Rockwell devices are three-emitter high-power HBT transistors.

Power Supplies

The ring-oscillator requires non-standard bipolar power supply levels in order to interface easily with terminated measurement instruments. The supply levels are listed below.

Vss : 0 V
Vdd: 2.4 V

Simulations indicate that the circuit should draw 106 mA, producing a power dissipation of about 0.25 W.

Output Signal

The output from the ring-oscillator is single-ended and is available from the middle pad of the left GSG probe. The expected signal levels are listed below. Signal swing should be about 400 mV. Note that any measurement equipment connected to the ring-oscillator output should be terminated with a 50 Ω resistor to get optimal performance (see the Ring Oscillator Circuit section for more details).

Vhigh: 0.8 V
Vlow: 0.4 V

Ring Oscillator Circuit

The ring-oscillator consists of 50 inverter stages and an output-driver stage. The ring-oscillator uses differential current-mode logic (CML), but produces a single-ended output signal. A schematic of the ring-oscillator is shown in Figure 21.
To generate a single-ended signal, the ring-oscillator core signal is split between the output pad and an internal 50 Ω termination. Initially, the core signal passes through a high-current differential amplifier. The differential output of the amplifier is then separated, with one signal internally terminated and the other connected directly to the output pad. If the measurement instrument is not terminated with a 50 Ω resistor, the output signals of the differential amplifier will be unbalanced and thus the output will be compromised.

Figure 21. Ring-oscillator schematic

Submission of Novel Ring Oscillator Test Structures


The need to find a suitable device configuration that will work at speed, and to model the newly found distributed RC effects, led to the design, layout and submission of more test structures to Rockwell. Specifically, two entirely new ring oscillator structures were submitted for inclusion in the Mayo/Rockwell HSCD reticle. These structures are called structure3 and structureRC. Structure3 is an unloaded ring oscillator structure, while structureRC uses a ring oscillator structure to measure distributed RC effects. These structures are described in more detail here.

Structure3 - Unloaded Ring Oscillator

Figure 22. Layout of Structure3

Features:

1. Size - 1.0 mm x 1.6 mm
2. Device Count - 1057 Transistors, 8 Schottky Diodes, 782 Resistors
3. Device Type - Standard and Non-Standard Q1
4. Current Level - 2 mA, 1.6 mA, 0.8 mA
5. Voltage Swing - 250 mV
6. Number of Stages - 30
7. 1 VSS (-5.2 V) and 2 GND pads
8. High-Speed Output - 1
9. DC Outputs - 3 , DC Inputs - 3

The unloaded ring oscillator structure shown in Figure 22 contains eight 30-stage oscillators to test the unloaded gate delays of different versions of Q1 transistors at varying current levels. These transistors were laid out by Rockwell after the standard Q1 transistor showed slower switching speed than expected, and were then modified at RPI to scale the emitter sizes. The structure has a very compact layout and contains more than 1000 transistors in an area of 1.6 mm². It can therefore also be used as a yield indicator. A testing scheme built into the structure allows all eight oscillators to be measured in just one probe touchdown. A voltage and current monitor is built into the structure so that the measurements can be calibrated simultaneously with the oscillator frequency measurements.
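
As an illustration of how an unloaded gate delay might be derived from a measured oscillation frequency, the sketch below uses the usual ring-oscillator relation in which one period corresponds to two passes around the ring. It assumes that all stages are identical and that the multiplexers and output driver sit outside the loop; the function name `stage_delay_ps` and the example frequency are hypothetical, not measured data.

```python
def stage_delay_ps(freq_ghz, stages):
    """Per-stage delay from ring oscillator frequency.

    Assumes period = 2 * stages * t_stage, i.e. a transition must travel
    around the ring twice to complete one full cycle.
    """
    period_ps = 1000.0 / freq_ghz
    return period_ps / (2 * stages)


# Hypothetical reading for one of the 30-stage oscillators.
example_freq_ghz = 0.8
print(stage_delay_ps(example_freq_ghz, 30), "ps per stage")
```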

Figure 23. Schematic of Structure3

A schematic of the structure is shown in Figure 23. Four ring oscillators feed each of two 4-to-1 multiplexers, which in turn feed a 2-to-1 multiplexer. The output of this multiplexer drives a 50-ohm driver. The inputs to the structure are 3 DC select signals that select one of the eight oscillators.

StructureRC - RC Ring Oscillator


The RC structure tests the distributed delay effect in the HBT process interconnections. Based on our calculations, Metal 1 wires start showing the quadratic RC delay effect after a length of only 1-2 mm. This is disastrous for any large chip. The test structure uses 8 mm long tapped Metal 1 delay lines in a very compact layout and simplifies testing by using a special ring-oscillator configuration. Eight ring oscillators are formed from 8 taps on the delay line, and all 8 can be measured with one probe touchdown. This structure is very important for testing and calibrating extraction tools for HBT circuits that are more than a couple of mm on a side. The layout of the structure is shown in Figure 24.

Figure 24. Layout of StructureRC


Features :

1. Size - 1.55 mm x 1.6 mm
2. Device Type - Standard Q1
3. Current Level - 2 mA
4. Voltage Swing - 250 mV
5. 2 VSS (-5.2 V) and 2 GND pads
6. High-Speed Output - 1
7. DC Outputs - 2
8. DC Inputs - 3

A schematic of the structure is shown in Figure 25. The inputs to the structure are 3 level-3 DC signals to a 3-to-8 decoder, which generates 8 select lines. These lines in turn select one of the eight ring oscillators.
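
The selection logic can be modeled behaviorally as a one-hot decoder. The sketch below is a logical illustration only; the function name `decode_3to8` and the bit ordering of the select inputs are assumptions, since the report does not specify how the three DC inputs map to oscillator numbers.

```python
def decode_3to8(s2, s1, s0):
    """One-hot 3-to-8 decoder: exactly one of eight select lines is active.

    s2 is taken as the most significant select bit (an assumption).
    """
    index = (s2 << 2) | (s1 << 1) | s0
    return [1 if line == index else 0 for line in range(8)]


# Walk through all eight settings of the three DC inputs.
for value in range(8):
    bits = ((value >> 2) & 1, (value >> 1) & 1, value & 1)
    print(bits, decode_3to8(*bits))
```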

Figure 25. Schematic of StructureRC


Optimized Stage Size Carry Select Adder


In the design of adders, the critical path is typically in the carry chain. The carry into any bit of an adder depends on all previous inputs. With a ripple-carry adder, the simplest type of adder, the carry out from one bit is found directly from the carry from the previous bit, which is computed from the carry of the bit before that, and so forth. The carry chain thus extends in one path through each and every bit in turn, and the longest path length through the adder is directly proportional to the number of bits in the adder.
The carry select adder attempts to shorten this path through parallelization. The adder is divided into several stages. Within a stage, some other carry scheme is used, such as ripple carry. However, to avoid the delay of waiting for the carry in to the first bit of the stage, separate sets of logic are used to find the carries and sums for both possible carry-in values (1 and 0). By the time the carry in is available, all possible sums and carry outs have been computed. The carry in can then be used to select the proper output via a multiplexor at each output. The longest path in this case consists of the carry through the first stage plus one multiplexor for each stage where the carry in selects the proper carry output.
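
A behavioral sketch of the scheme may help make the data flow concrete. It models only the logic, not timing or the CML circuits, and the function names and the default stage widths are illustrative assumptions rather than the F-RISC implementation.

```python
def ripple_add(a_bits, b_bits, carry_in):
    """LSB-first ripple-carry addition; returns (sum_bits, carry_out)."""
    sums, carry = [], carry_in
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sums, carry


def carry_select_add(a, b, carry_in=0, widths=(8, 8, 8, 8)):
    """Carry select addition over stages of the given widths (LSB stage first).

    Each stage computes sums and carry out for both possible carry-in values;
    the actual carry from the previous stage then selects one of the results.
    """
    total = sum(widths)
    a_bits = [(a >> i) & 1 for i in range(total)]
    b_bits = [(b >> i) & 1 for i in range(total)]
    result, carry, pos = 0, carry_in, 0
    for width in widths:
        seg_a = a_bits[pos:pos + width]
        seg_b = b_bits[pos:pos + width]
        sums0, cout0 = ripple_add(seg_a, seg_b, 0)  # assumes carry in = 0
        sums1, cout1 = ripple_add(seg_a, seg_b, 1)  # assumes carry in = 1
        sums, carry = (sums1, cout1) if carry else (sums0, cout0)  # the "multiplexor"
        for i, bit in enumerate(sums):
            result |= bit << (pos + i)
        pos += width
    return result, carry


print(carry_select_add(0xFFFFFFFF, 1))
```

Unequal stage widths can be passed through the `widths` argument, for example `widths=(3, 4, 5, 6, 7, 7)`.
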
The performance of a full-custom, single-macro varied stage carry select adder was compared with the even-length stage carry select adder spread across four dies currently used in the F-RISC Project microprocessor. For future stages of the project, it has been proposed that a factor of two can be gained in adder speed through architectural adjustments and migrating to a higher yield technology where the datapath could fit on one die, before device speed gains are even considered.

Optimum Stage Sizes

Pedagogical examples of the carry select adder usually consider equal-size stages; e.g., for a 32-bit adder, four stages of 8 bits each. The time to cross the longest path is 8 bits' worth of the lower level carry scheme plus a multiplexor delay for each stage. However, varying the size of each stage can produce an even shorter delay. Optimally, the carry in should arrive just as the carry out possibilities have been generated, and these signals should enter the carry out multiplexor at the same time. Thus each stage should be slightly shorter than its successor.
If cn is the number of bits in stage n, tg is the delay time across the "carry" gate, tm is the delay time across a multiplexor, and N is the number of stages, then the delay of the carry chain is

assuming

Figure 26. 32-bit Varied Stage-Size Carry Select Adder.

If a constant step in size is used between each stage, then that step
.
If B is the total number of bits in the adder, then (considering that the final stage may not need to be cN to meet size requirements)

Table 4. Estimation of delay versus number of stages.

Stages   Carry Stage Widths     Carry Settling Time [# Gate Delays]
1        32 (1)                 32
2        16,16                  18
3        12,13,7                15
4        7,8,9,8                11
5        5,6,7,8,6              10
6        3,4,5,6,7,7            9
7        2,3,4,5,6,7,5          9
8        1,2,3,4,5,6,7,4        9
9        No Possible Sequence   N/A

(1) Ripple carry only. No select needed for one stage.

Solving the latter expression for c1 yields

Substituting that into the following expression for total delay



results in this expression for td in terms of N, B and s:

Figure 27. Adder Bit Slice Logical Schematic.


For the design described here, initial simulations indicated that s was approximately 1. Given that, and B=32, a table of td/tg versus N could be calculated.
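
For reference, the relations described in this subsection can be written out as follows. This is a reconstruction from the definitions of cn, tg, tm, N, B and s given in the text and from the longest-path description of the carry select adder; it assumes that s is the ratio of multiplexor delay to carry gate delay and that the bit-count constraint is taken as an equality, so it should be read as one consistent form rather than the exact expressions of the original derivation.

```latex
% Assumed reconstruction; symbols c_n, t_g, t_m, N, B, s follow the text.
\begin{align}
t_d &= c_1 t_g + N t_m
  && \text{(ripple through stage 1, then one multiplexor per stage)} \\
c_{n+1} t_g &= c_n t_g + t_m
  && \text{(carry in arrives as the next stage finishes rippling)} \\
s &= c_{n+1} - c_n = \frac{t_m}{t_g} \\
B &\le \sum_{n=1}^{N} c_n = N c_1 + \frac{s N (N-1)}{2} \\
c_1 &= \frac{B}{N} - \frac{s (N-1)}{2}
  && \text{(taking equality)} \\
\frac{t_d}{t_g} &= \frac{B}{N} + \frac{s (N+1)}{2}
\end{align}
```

With B = 32 and s = 1, the last expression gives about 8.8 gate delays at N = 6, in line with the 9 gate delays listed for six stages in Table 4.
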
The step size constraint cannot hold for nine or more stages. The minimum delay has been reached at N=6, with a stage size sequence of 3,4,5,6,7,7.
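
The stage sequences of Table 4 can be approximated with a short search. The sketch below is an illustration under assumptions not spelled out in the report: stage widths are integers, the step between stages is one bit, the smallest workable first stage is chosen, the final stage absorbs whatever bits remain, and the settling time is counted as c1 gate delays plus one multiplexor delay per stage (none for a single stage). The function names `stage_widths` and `carry_settling_time` are hypothetical.

```python
def stage_widths(total_bits, n_stages, step=1):
    """Smallest-first-stage sequence with a constant step between stages.

    The last stage takes the remaining bits and may be shorter than the
    stepped value. Returns None if no sequence fits.
    """
    for c1 in range(1, total_bits + 1):
        full = [c1 + k * step for k in range(n_stages)]
        if sum(full) >= total_bits:
            last = total_bits - sum(full[:-1])
            if last < 1:
                return None  # earlier stages already exceed the bit count
            return full[:-1] + [last]
    return None


def carry_settling_time(widths, s=1):
    """Settling time in gate delays: first-stage ripple plus s per multiplexor."""
    muxes = len(widths) if len(widths) > 1 else 0  # one stage needs no select
    return widths[0] + s * muxes


for n in range(1, 10):
    widths = stage_widths(32, n)
    if widths is None:
        print(n, "no possible sequence")
    else:
        print(n, widths, carry_settling_time(widths))
```

Under these assumptions the sketch returns 3,4,5,6,7,7 for six stages and finds no sequence for nine stages, in line with Table 4. It is not guaranteed to reproduce every row; the three-stage entry in the table, for example, uses a different split.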

Logic Design


In the design described here, ripple carry was used inside each stage. For each bit of the adder, two carries and two sums are generated:

For the first bit of each stage, the carry inputs are known explicitly to be 1 and 0, so the equations can be simplified:

Figure 28. Adder First Bit Slice - Logical Schematic.

The carry out from the previous stage is used to select the sum at each bit and the carry out from this stage.
The logical schematics for each cell are shown in Figure 27 and Figure 28.
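
For a ripple-carry stage, the two carries and two sums per bit take the standard full-adder form. The block below is a sketch in assumed notation (a_i and b_i for the operand bits, superscripts 0 and 1 for the results computed with a stage carry in of 0 and 1, c_sel for the carry selected from the previous stage); the report's own symbols may differ.

```latex
% Assumed notation; standard full-adder equations for the two carry chains.
\begin{align}
s_i^{0} &= a_i \oplus b_i \oplus c_i^{0}, &
c_{i+1}^{0} &= a_i b_i + (a_i \oplus b_i)\, c_i^{0} \\
s_i^{1} &= a_i \oplus b_i \oplus c_i^{1}, &
c_{i+1}^{1} &= a_i b_i + (a_i \oplus b_i)\, c_i^{1}
\end{align}

% First bit of a stage: the carry inputs are the constants 0 and 1.
\begin{align}
s^{0} &= a \oplus b, & c^{0} &= a b \\
s^{1} &= \overline{a \oplus b}, & c^{1} &= a + b
\end{align}

% Selection by the carry out of the previous stage.
\begin{equation}
s_i = \overline{c_{sel}}\, s_i^{0} + c_{sel}\, s_i^{1}
\end{equation}
```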

Circuit Design


The gates for this design were designed employing GaAs HBT CML. In this logic family, any function of up to three variables can be realized as one gate.

Figure 29. Carry Generation Circuit.


For each bit, two carries and two sums need to be generated, and the two sums need to be multiplexed. Thus, five current trees are needed per bit. Carry, sum, and multiplexor trees are shown in Figure 29, Figure 30, Figure 31, and Figure 33. For the first bit of each stage, where the static carry inputs lead to the simplification described above, the corresponding current switch can be eliminated, reducing circuit size.

Figure 30. Sum Generation Circuit.

Figure 31. Carry Multiplexor Circuit.

As well, the two carries out of each stage need to be multiplexed. The selected carry out is a Level 2 offset signal since for a long, loaded line (the selected carry signal drives a multiplexor in every bit of the succeeding stage), Level 2 signals offer the best loaded switching speed, due to the emitter followers on the outputs.

Figure 32. add1mid Layout.

Figure 33. Sum Multiplexor Circuit.

Layout Design

This layout was created with three bottom level cells. add1mid computes the two carries and two sums for each bit, and also contains the multiplexor to select the correct sum. add1head is a specialization of add1mid for the first bit of each stage to take into account the explicit carry inputs. cmux comes at the end of the stage and selects the correct carry out to feed to the next stage.
Figure 36 illustrates the combination of these cells into a three bit stage.


Figure 34. cmux Layout.

Figure 35. add1head Layout.

A layout whose width per bit approaches the bitline pitch of the register file was desired, so much effort was put into making the layout as narrow as possible. Although this design falls quite short of the register file pitch, a quite dense layout was achieved. In some cases, portions of different cells overlap, taking advantage of free space in adjacent cells.
Furthermore, only about half of the width of a register file block is due to bit spacing; address decoding accounts for the other half, increasing the average width per bit. Each register file block is eight bits wide, necessitating that four be used with this adder. Whereas the adder is approximately 5.8 mm wide, each register file block--with address decoders--is 0.97 mm wide.
There are also indications that an increase in horizontal density can be achieved by extending the cells in the vertical dimension. Note that there are regions where six transistors sit side by side across the cell. The cells could obviously be narrowed if these arrangements were limited to four transistors across, for example.

Figure 36. A three bit carry select stage.

Table 5. Representative timings from SPICE simulation.
b31 rising to s31 rising 57.2 ps
b32 falling to s31 falling 53.7 ps
cin falling to s0 rising 26.9 ps
cin falling to s31 rising 452.8 ps
cin falling to cout falling 460.6 ps
b0 rising to s0 falling 45.6 ps
b0 rising to s31 falling 521.3 ps
b0 rising to cout falling 527.7 ps

Simulation of the design


A SPICE extraction of the layout, with interconnect capacitances, was used to simulate the design to determine delays through the circuit. Some representative timings taken from the SPICE output trace are shown in Table 5. The longest path through the adder should be from the b0 input to the cout output; thus, the delay of the adder could be characterized as approximately 528 ps.

Conclusions and Areas for Further Development


It was hoped that this adder design would exhibit a factor of two improvement over the present F-RISC adder. As the F-RISC adder delay is slightly under 1 ns, the 528 ps for this design does not currently reach that goal. One possible way to improve this mark is to reexamine the determination that the ratio of tg to tm is about 1. As the load on each multiplexor is directly proportional to the size of the subsequent stage, tm should be larger when driving a larger stage. Perhaps the size step between stages needs to be larger, or should vary with stage size. Another alternative is to use something other than ripple carry within each stage. A faster intrastage carry could allow a larger first stage for the same delay and reduce the number of stages needed. A circuit-level solution would be to adjust the nominal current level in each current tree. This adder operates at the F-RISC standard cell high power level, with a current of 1.6 mA per current tree. Utilizing 2.0 mA in the carry chain circuits would offer a faster switching time. Utilizing superbuffers on the carry between stages, which tends to be the longest and most highly loaded line in a stage, would also be advantageous.
Another area of interest was the adder width as compared to the register file. The bitline pitch for the register file is approximately 60 µm, while the pitch for adjacent add1mid cells is 160 µm. However, this comparison does not account for the address decoding in the register file. If the width of the whole block is considered, the adder is 5.8 mm wide while four register files side by side are 3.88 mm wide, without even considering wiring channels for the address lines. If the adder bitline pitch could be reduced by 50 µm, the two blocks could match fairly well overall. The reduction could easily be achieved by another iteration of layout adjustment; the width of the add1mid cell, for example, is held high by an arrangement of six transistors side by side which could easily be realized in a four-wide arrangement, and current work indicates that this offers a cell width approaching 100 µm. This and other adjustments would trade width for height, which would only negatively affect timing that is not on the critical path. A denser circuit will also have reduced RC delays on its nets, and can operate at a higher speed.
It is important to keep in mind that a datapath employing this adder is not physically realizable at present in the technology considered due to yield limitations. The adder plus four register files already exceeds the device count available. This design is useful as a comparative study, to separate the effects of a technology migration into a linear combination of yield-driven architectural changes and device speed advantages.

Figure 37. SPICE Simulation of 32-bit Carry Select Adder.