A 500 ps 32 × 8 bit Register File Implemented in GaAs/AlGaAs HBTs
The speed of state-of-the-art computers has increased steadily over the years. In most designs, the register file access delays lie on a critical delay path of the microprocessor and thus can limit its performance. Hence, the access times of the register files must scale with the cycle times of the processors [MIYANG86, CHUANG86, CHAPPE91, TAKADA90].
In this chapter, a 32 × 8 bit register file with a measured 500 ps access time is described. It represents a key circuit of the Fast Reduced Instruction Set Computer (FRISC) design project at Rensselaer. The goal of the FRISC project is to demonstrate that a very fast RISC processor can be developed using high speed but yield-limited technologies. A RISC architecture with a seven-stage pipeline implemented in GaAs/AlGaAs heterojunction bipolar transistor (HBT) technology is used to achieve this goal. Further, differential current-mode logic (CML) circuits and thin-film multi-chip module (MCM) packaging are employed.
The goal of the register file design effort for F-RISC/G is to reduce the register access time to 194 ps. The test chip fabrication run was realized with a preliminary experimental 30 GHz process, but the actual F-RISC/G circuitry is to be fabricated in a 50 GHz baseline process.
The memory and test circuitry are fabricated using Rockwell International's GaAs/AlGaAs Heterojunction Bipolar Transistor (HBT) process [ASBECK89]. The logic circuitry is implemented with differential current-mode logic (CML) gates using three levels of current switches [NAH91,GREUB91]. The CML circuits typically have a loaded gate delay of 25 ps. The three-level current switch stacking yields high-functionality gates and helps offset the area penalty for the differential wiring [BARISH92]. Differential logic was chosen to increase noise immunity and allow lower voltage swings.
6.1. Circuit Description
The architecture of the register file circuit is shown in Fig. 6.1. The register file is organized as 32 × 8 bit words. It has five address drivers, eight write circuits, eight bit line drivers, 32 address decoders, 256 Schottky clamped memory cells, and one threshold voltage generator. Conventional designs have been used for the decoder, the memory cell, and the sense amplifier [MIYANG86,CHUANG88]. Voltage sensing rather than current sensing is employed so that the sense amplifier and write circuit can be placed at opposite ends of the memory cell array, which greatly simplifies the layout. However, new designs were developed for the address driver, the write circuit, and the threshold generator.
6.1.1. Address Line Driver
In each of the five address line drivers, three large transistors, Q5, Q9, and Q12, provide the large currents needed to deselect 31 of the 32 rows (Fig. 6.2). A 15 mA current flows from Q5 into one of the two address lines. The loads of the switching pair transistors, Q9 & Q12, are active pull-ups, Q14 & Q15 and Q16 & Q17. During any transition, one of the pull-ups presents a very low resistance to its address line while the other pull-up is cut off and thus presents a high resistance. When an address row is selected, the current in Q6 is switched to Q11 and the current in Q5 is switched to Q12. Q16 & Q17 are then cut off while Q14 & Q15 quickly charge up the address line (ADROL). The address line voltage characteristic is determined by the load resistance and the parasitic capacitance of the address lines. At any specific time, one of the address driver outputs, ADROL or ADROR, sees a load resistance of approximately re // re of the two pull-up transistors, because only one pair of the active pull-up transistors is in active mode at any given time. In the situation described above, ADROL has the load resistance of re // re, whereas a high load resistance is present at ADROR since Q16 & Q17 are in cut-off mode. This resistance and the high parasitic capacitance of the address line dominate its rise time. Without the active pull-ups, the line would have to be charged up through the decoders, which is slower. A small current through R9 keeps Q14 & Q15 in active mode after a high transition. R9 also prevents ADROL from slowly drifting up to the ground level through the base-emitter junctions of Q14 and Q15; the proper logic high voltage level for the address lines is one Vbe drop below ground. Since Q16 & Q17 are in cut-off mode, most of the current in Q12 is available for deselecting decoders; the rest of the current is composed of the current through R9 and the small currents in Q16 & Q17.
The current in the address decoder of the selected row drops to nearly zero.
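The rise-time argument of this section can be summarized by a first-order estimate. This is only a sketch: the symbols V_T (thermal voltage), I_E (emitter current of one pull-up transistor), and C_line (total parasitic capacitance of one address line) are introduced here for illustration and are not values given in the text, and a small-signal resistance is only a rough guide for a large-signal transition.

```latex
% Load seen by the rising address line: two pull-up emitters in parallel
R_{\text{load}} \approx r_e \parallel r_e = \frac{r_e}{2},
\qquad r_e \approx \frac{V_T}{I_E}

% Single-pole time constant and 10%-90% rise time of the address line
\tau \approx \frac{r_e}{2}\, C_{\text{line}},
\qquad t_r \approx 2.2\,\tau
```

Under these assumptions, raising the pull-up current lowers r_e and shortens the rise time, which is consistent with keeping Q14 & Q15 in active mode through R9.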
Fig. 6.1. Overall Register File circuit
Fig. 6.2. Address Driver circuit
6.1.2. Memory Cell
Each row of memory cells has a resistor as the current source. This design choice was made after consideration of the transistor characteristics and the desire to keep the power supply voltage compatible with the standard ECL supply voltage of -5.2 V. A GaAs/AlGaAs HBT has a Vbe of 1.4 V, much larger than the 0.7 V of a typical Si bipolar transistor. At least 1 V difference is desired between the selected and the unselected word line voltages to prevent disturbing the state of the unselected cells during a write (Fig. 6.3). There are two Vbe drops to R10 in the form of Q12 and either Q13 or Q14, in addition to the voltage drops across R6, R7, and R10. Therefore, the voltage available for implementing a current source is only about 1 V. This is not enough for an active current source.
Fig. 6.3. Word Line voltage swing
The resistive current source, on the other hand, does not have the above drawback since there is no active device that must be turned on. Of course, an active current source is better able to provide a steady current level than a passive one, and should be used whenever possible; for example, the register file described in this paper could be implemented with active current sources if a power supply voltage of -6 V could be used. This, of course, would also increase power dissipation.
6.1.3. Write Circuit
The voltage on the Vth line is maintained at such a level that during a read cycle the bases of Q3 and Q11 (WIL and WIR) are 3/5 of the way between the collector voltages of a selected memory cell (Fig. 6.1). Q7 and Q8 equally divide the current from Q10, which has the same current level as that through Q9 in the write mode. Thus, the voltages at WIL and WIR are at the mid points of their write cycle levels. Since WIL and WIR are in between the voltage levels at the bases of Q13 and Q14 of a selected memory cell, one of the bit line currents, Q17 and Q18, would be flowing in either Q3 or Q13 while the other would be flowing in Q11 or Q14. The current flow depends on the state of the selected memory cell. For example, if the base of Q14 is high whereas the base of Q13 is low, the Q18 bit line current would be flowing through Q14 instead of Q11 and the Q17 bit line current would be flowing through Q3 instead of Q13. As a result, the base of Q16 (BIT LINE RIGHT) would be higher than the base of Q15 (BIT LINE LEFT) in the sense amplifier by about 3/5 of the way between the collector voltages of Q13 and Q14.
When a write pulse is applied, the current in the right half of the circuit, Q7 and Q8, is switched to the left half of the circuit, Q5 and Q6. The base voltage of either Q3 or Q11 is then lower than the base voltages of Q13 and Q14, irrespective of the state of the selected memory cell (Fig. 6.4). The bit line current switches from either Q3 to Q13 or Q11 to Q14, and the data is written into the memory. No switching occurs if the same data is written. The data input signals from the outside are level shifted by 2.2 V to prevent saturation of Q5 and Q6.
Fig. 6.4. Write operation and critical internal Threshold Voltage signals
6.1.4. Threshold Voltage Generator
The threshold voltage generator output maintains the Vth line at the proper voltage as explained above. The circuit (Fig. 6.5) has four major blocks: a circuit tracking the voltages in a selected memory cell; a circuit tracking the write circuit; a high gain amplifier; and two circuits tracking the bit line current sources.
The memory cell tracking circuit is located on the left side of Fig. 6.5. The common emitter current source resistor, R5, is eight times greater than the resistor used in a memory cell row plus R10 of the memory cell, since the row resistor is shared by eight memory cells. The second emitter of Q2 is connected to the collector of Q3, since Q2 is tracking a transistor that is in a cut-off state. The second emitter of Q3 is connected to a circuit tracking a bit line current source. Q1 is tracking a word line transistor connected to a decoder.
The ratio of the resistors on the right side of the memory tracking circuit, R2:R4, is 3:2. The circuit located on the right side tracks the write circuit in a read cycle. Q15, Q19, and Q18 track Q7, Q8, and Q10 of Fig. 6.1, respectively. Q11 & Q12 track Q11 & Q18 and Q3 & Q17 of Fig. 6.1. Q16 tracks the word line transistor of the selected memory row. Q20 and Q21 track the high level of the write signal buffer.
The differential amplifier is designed to have a large gain. Capacitors (C1, C2) and resistors (R6, R14) are used to stabilize the feedback amplifier by introducing a dominant pole at low frequency. The input terminals of the differential amplifier are connected to two circuits that should be maintained at the same voltage level.
Fig. 6.5. Threshold Generator circuit
6.1.5. Key Interconnect Parasitic Capacitances
The key interconnect capacitances of the register file are located at the bit lines, the word lines, and the address lines. The parasitic capacitances of these lines have significant impact on the final access time. Hence, it is critical that these capacitances are minimized and accurately estimated.
The net capacitances extracted for the SPICE simulations are based on the capacitance model predicted by LINPAR. The power levels in the address line driver and bit line driver circuits have been increased appropriately to compensate for the capacitances. Since only two layers of metal were available and no metal could be routed over the devices, the layout was quite difficult and time consuming. Further, the voltage drops in the power, word, and address lines had to be considered during layout.
6.2. Test Scheme
When the RISC processor chips are fabricated, it will be difficult to characterize the performance of the register file. Therefore, a test chip was designed with embedded test circuitry to allow performance measurements of the memory block. At the projected subnanosecond access time, traditional memory testing techniques would not be feasible [SCHUSTE92]. Limiting factors include: skew in probes, cards, and cables; limited probe bandwidth; pattern generation; and signal count limits. The available high-bandwidth probes cannot provide sufficient signals to control all inputs and outputs of the memory. Measurement of signal delays is possible with time domain reflectometry (TDR). However, it is quite difficult and expensive to control these delays accurately if many high speed signal interconnections are involved.
To address these limitations, built-in self-test (BIST) circuitry was added to the register file. Only a few, non-correlated signals are required for testing. A ceramic microwave probe can provide a 5 GHz bandwidth on six signal channels [CASCAD91]. The BIST circuitry had to be implemented with a minimal device count because of yield and power dissipation considerations.
6.2.1. Testing Methodology
A block diagram for the testing scheme is shown in Fig. 6.6. The system is synchronized by an on-chip voltage-controlled oscillator (VCO). By including the oscillator on-chip, we both simplify the test setup and have the opportunity to characterize a voltage-controlled delay line, which is a critical analog circuit component in the FRISC system. Provisions were made to use an off-chip clock generator if the VCO is found to be inoperative.
The clock drives two pattern generators that generate the addresses and data for testing. The pattern generators have the same period and thus remain synchronized. Requirements on the pattern generators include: low transistor count (to avoid yield hits), low power consumption, and simultaneous update of all bits. Also, the address pattern had to generate at least one worst-case transition with all five address bits changing (00000 → 11111). The pattern generators can be initialized with different states, thus allowing various test patterns to be written into the register file.
Two five-bit, linear-feedback shift registers (LFSRs) were chosen as the pattern generators. The LFSRs require only one gate beyond the five master/slave latches with asynchronous load and the clock buffer. Other choices such as counters either required more transistors or did not update all bits simultaneously. The rich functionality of differential current-mode logic allows the master/slave latches to be realized with only two current trees.
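The behavior of such a pattern generator can be sketched in a few lines of Python. This is a minimal illustration, not the circuit used on the chip: the feedback taps (corresponding to the polynomial x^5 + x^2 + 1) and the function names are assumptions, chosen only because they need a single XOR gate and give the 31-state cycle expected of a five-bit LFSR.

```python
def lfsr5_step(state: int) -> int:
    """Advance a 5-bit Fibonacci LFSR by one clock.

    Assumed taps: bit 4 XOR bit 2 (polynomial x^5 + x^2 + 1).
    """
    feedback = ((state >> 4) ^ (state >> 2)) & 1  # the single extra gate
    return ((state << 1) | feedback) & 0b11111    # all five bits update together


def lfsr5_sequence(seed: int, length: int) -> list[int]:
    """Return `length` successive states, starting with the seed itself."""
    if not 1 <= seed <= 31:
        raise ValueError("seed must be a non-zero 5-bit value")
    states = []
    state = seed
    for _ in range(length):
        states.append(state)
        state = lfsr5_step(state)
    return states


def lfsr5_period(seed: int) -> int:
    """Count the clocks needed to return to the seed state."""
    if not 1 <= seed <= 31:
        raise ValueError("seed must be a non-zero 5-bit value")
    state = lfsr5_step(seed)
    count = 1
    while state != seed:
        state = lfsr5_step(state)
        count += 1
    return count
```

With these assumed taps, every non-zero seed walks through all 31 non-zero states, and the step from 10101 to 01010 flips all five bits at once, which is the kind of simultaneous transition the address pattern is meant to exercise.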
To begin a testing cycle, the LFSRs are initialized, and a pattern is written into memory. Once the pattern is complete, the write signal is turned off. A timing diagram for performance testing is shown in Fig. 6.7. The rising edge of the clock causes an address change. On the falling clock edge, the register file output is latched and compared to the expected output (which is still being generated by the data LFSR). If the two signals are identical, then the access time of the register file is less than half of the clock period. The clock frequency is increased until failures are observed. In this manner, the access time of the register file can be determined.
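The measurement procedure above can be sketched as a clock sweep. This is a minimal sketch under stated assumptions: `half_period_ps`, `access_time_bound_ps`, and the toy device model `toy_register_file` are hypothetical names introduced for illustration, and the model simply assumes that a read passes whenever the data settles within half a clock period.

```python
from typing import Callable, Iterable, Optional


def half_period_ps(clock_ghz: float) -> float:
    """Time between the rising and the falling clock edge, in picoseconds."""
    return 1000.0 / (2.0 * clock_ghz)


def access_time_bound_ps(clock_sweep_ghz: Iterable[float],
                         passes_at: Callable[[float], bool]) -> Optional[float]:
    """Raise the clock until the first failure; return the half period of the
    highest passing clock as an upper bound on the access time.

    Returns None if even the slowest clock in the sweep fails.
    """
    bound = None
    for clock_ghz in sorted(clock_sweep_ghz):
        if not passes_at(clock_ghz):
            break  # failures observed: stop increasing the frequency
        bound = half_period_ps(clock_ghz)
    return bound


def toy_register_file(true_access_ps: float) -> Callable[[float], bool]:
    """Hypothetical device model: a read passes if the output has settled
    before the falling clock edge latches it."""
    return lambda clock_ghz: true_access_ps < half_period_ps(clock_ghz)
```

For a toy device with a 450 ps access time and a sweep of 0.8, 1.0, 1.2, and 1.4 GHz, the last passing clock is 1.0 GHz, so the sketch reports a 500 ps bound. The resolution of the bound is set by the frequency step of the sweep.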
Fig. 6.6. Block diagram of the Register File test scheme
Fig. 6.7. Timing diagram for performance test
6.2.2. Detecting BIST Failures
The desired output from the embedded test circuitry is a constant low signal, indicating a match between the expected pattern and the pattern read from memory. However, various system failures could mimic this output. Additional features were added to the BIST system to detect such conditions. Both LFSR patterns, the VCO signal, and the write signal are observable at the system output to ensure that they are working. Both LFSRs must show the expected 31-cycle pattern before the match output can be accepted; the condition of fully functional LFSRs must be satisfied before proceeding further with the register file testing.
If the write circuit were to fail to shut off, then the proper data pattern would appear at the memory outputs even if the memory cells themselves were not functional. Another failure mode would be a match circuit output that is stuck at zero, always indicating correct operation. These failures can both be detected by resetting the LFSRs to generate a different pattern. If the performance testing is run without writing this new pattern into the memory, failures will be detected. If these failures do not appear, then a malfunction in the testing logic is indicated.
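The reseeding check described above can be illustrated with a small behavioral model. This is a sketch under several assumptions: the LFSR taps are illustrative, the stored data word is simply the five-bit state of the data LFSR (the mapping onto the eight data bits of the real chip is not modeled), and the fault flags `write_stuck_on` and `match_stuck_low` are hypothetical names for the two failure modes.

```python
PERIOD = 31  # states in one cycle of a 5-bit LFSR


def lfsr5_step(state: int) -> int:
    """One clock of a 5-bit LFSR (assumed taps: bit 4 XOR bit 2)."""
    feedback = ((state >> 4) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & 0b11111


def write_pattern(memory: list[int], addr_seed: int, data_seed: int) -> None:
    """Write one full LFSR cycle of data into the addressed words."""
    addr, data = addr_seed, data_seed
    for _ in range(PERIOD):
        memory[addr] = data
        addr, data = lfsr5_step(addr), lfsr5_step(data)


def count_mismatches(memory: list[int], addr_seed: int, data_seed: int,
                     write_stuck_on: bool = False,
                     match_stuck_low: bool = False) -> int:
    """Read one full cycle and count cycles flagged as mismatches."""
    errors = 0
    addr, data = addr_seed, data_seed
    for _ in range(PERIOD):
        if write_stuck_on:
            memory[addr] = data          # write never shuts off
        mismatch = memory[addr] != data  # comparator against expected data
        if match_stuck_low:
            mismatch = False             # output always reports a match
        errors += int(mismatch)
        addr, data = lfsr5_step(addr), lfsr5_step(data)
    return errors


def bist_is_trustworthy(memory: list[int], addr_seed: int,
                        new_data_seed: int, **faults: bool) -> bool:
    """Reseed the data LFSR without rewriting; a healthy test must now fail."""
    return count_mismatches(memory, addr_seed, new_data_seed, **faults) > 0
```

In this model a healthy setup reports no mismatches with the original seeds and a mismatch on every cycle after reseeding, whereas either fault keeps the output clean after reseeding and is thereby exposed.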
6.2.3. Final Design
The fabricated chip (Fig. 6.8) contains 2600 transistors on a 3.2 mm x 3.4 mm die. Along with the register file (which measures 1 mm x 2 mm and has a power dissipation of 1.6 W) the chip includes a copy of the 8-bit ALU carry chain and I/O test circuits. Table 6.1 shows the transistors used for each portion of the chip.
The chip was designed to be tested with a six-channel ceramic microwave probe with a 5 GHz bandwidth. Except for the VCO control voltage, all inputs were static digital signals. The probe additionally provided two power connections and two ground connections, with two 500 pF bypass capacitors. The register file itself was powered through a separate probe to allow the testing of the BIST logic without the additional power consumption of the memory macro. The presence or absence of power to the register file was sensed on-chip and was used to control one of the output multiplexers. This allowed the observation of six output signals using only two selection lines on the six-channel probe.
Fig. 6.8. Photograph of fabricated test chip
6.3. Test Results
The test chip was designed for an experimental GaAs/AlGaAs HBT process. Fifty-six testable dies were obtained from the fabrication run. Of these, seventeen had visible defects and were intended as practice dies. However, the test results for these dies did not differ significantly from those of the perfect dies. The wafers were lapped by the foundry to either a 3 or 7 mil thickness and diced. The fabricated devices had a measured fT of 30 GHz at 1 mA.
6.3.1. BIST Support Circuitry
The VCO was designed to produce frequencies from 1.2 to 3.3 GHz with a control voltage ranging from -1 V to 1 V. This would test the register file with access times between 150 ps and 420 ps. The fastest observed VCO reached 2.7 GHz with an average of 2.2 GHz. A VCO running at 1 GHz is shown in Fig. 6.9. The output signal is attenuated by a factor of two by the measurement system setup. An LFSR operating at 1 GHz is shown in Fig. 6.10. The expected 31-cycle pattern was generated.
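The relation between the VCO frequency and the testable access time follows from the half-period criterion of the test scheme. The worked numbers below are a quick check of the figures quoted above; the symbol t_max is introduced here for illustration.

```latex
% Largest access time that still passes at clock frequency f_clk
t_{\max} = \frac{1}{2 f_{\text{clk}}}

% Design range of the VCO
f_{\text{clk}} = 3.3\ \text{GHz}: \quad t_{\max} \approx 152\ \text{ps}
\qquad
f_{\text{clk}} = 1.2\ \text{GHz}: \quad t_{\max} \approx 417\ \text{ps}

% Observed VCO frequencies
f_{\text{clk}} = 2.7\ \text{GHz}: \quad t_{\max} \approx 185\ \text{ps}
\qquad
f_{\text{clk}} = 1\ \text{GHz}: \quad t_{\max} = 500\ \text{ps}
```

The first pair reproduces the quoted 150 ps to 420 ps window after rounding, and the last value is the bound that applies when the VCO runs at 1 GHz.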
Fig. 6.9. VCO Output, 50 mV/div, 200 ps/div
Fig. 6.10. LFSR output operating at 1 GHz, 50 mV/div, < 5 ns/div
Fig. 6.11. Register File output with one error, 50 mV/div, 5 ns/div
Fig. 6.12. Forced output errors, 50 mV/div, 5 ns/div
Fig. 6.13. LFSR output operating at 2.5 GHz, 50 mV/div
Table 6.1. Number of transistors in the Test Chip
Register File Testing Logic
6.3.2. Register File Memory
On one die, the entire BIST circuitry was operational simultaneously. The memory test waveform for this chip is shown in Fig. 6.11. The period of the pulses matches the LFSR period, thus the column has an error in one bit position. Fig. 6.12 shows the errors generated when the LFSRs were reset to a different state without reloading the memory. The error pulses verify the correct operation of the output comparators. When the LFSRs were set back to their original state, the waveform in Fig. 6.11 reappeared. Since the VCO was operating at 1 GHz during this test, the access time of the register file was below 500 ps. Individual LFSRs were observed to operate at clock speeds of up to 2.5 GHz (Fig. 6.13). One drawback of the testing scheme was the need for both LFSRs to operate correctly at the same clock frequency simultaneously. This factor turned out to have a significant negative effect on the number of testable chips with functioning register files. The circuits have been designed for an expected yield of 20 % for 5 K HBTs.
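To put the yield target in perspective, it can be scaled to the size of the test chip with a simple independent-defect model. This is an illustration only, not a measured or reported value: it assumes that every transistor fails independently with the same probability, and the symbols y (per-device yield) and Y(N) (yield of a circuit with N devices) are introduced here.

```latex
% Independent-defect model: circuit yield for N devices
Y(N) = y^{N}, \qquad Y(5000) = 0.20

% Scaling to the 2600 transistors of the test chip
Y(2600) = 0.20^{\,2600/5000} = 0.20^{\,0.52} \approx 0.43
```

Under this assumption the design target would correspond to roughly four in ten fully functional test chips; the preliminary process used for the fabrication run need not have met that target.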
[GAO90, GREUB90, GREUBTHES, HAFIZI90, HAGLEY91, KAWAR78, MATSUE91, MIYAN84, and OKUY88]