Chapter Three


Register File Characterization

 

As with many other processors, the register file is an essential part of the critical path. Unfortunately, experimental results from the RPI testchip indicated that this critical circuit might not achieve the performance levels required by the F-RISC/G processor. As better device and interconnection models became available, the register file once again became the center of attention. In order to determine if it would meet the performance goals, a series of simulations were performed using various models to get a feel for just how fast the circuit would operate. This chapter describes the analysis and characterization of the register file in preparation for modifications to meet the required performance goals. A detailed overview of the circuit operation is also presented along with signal timing specifications.

Introduction

Any microprocessor of recent vintage contains a bank of registers called a register file. The registers are used to provide on-chip memory for the processor in order to avoid the long delays that occur when memory outside of the chip is accessed. All processors have registers but oftentimes the registers are divided into different classes based upon how they can be used, such as address or data registers. With the advent of reduced instruction-set computer (RISC) designs, the register file has played a more prominent role. RISC processors typically have a large number of registers in order to maintain their throughput and avoid off-chip data transactions. One of the earliest RISC processors, the RISC II, had 138 registers that were used for passing parameters between software procedures. Today most processor designs including F-RISC/G use 32 to 64 registers.

The register file is a very important component of the processor critical path. In order to meet the RISC goal of 1 processor cycle per instruction, the register file must be capable of delivering two pieces of data and recording one per cycle. These requirements are based upon typical processor operations such as A=B+C in which two operands (B and C) must be obtained from the register file and the result (A) must be stored afterwards. To meet this goal, most register files have multiple ports that allow simultaneous READ and WRITE operations to occur. Although multiple ports increase the size and complexity of the circuit, the register file has nearly the entire cycle to complete the operations. In contrast, the F-RISC/G processor has a single-port register file that is accessed three times during the 1 ns cycle [PHILH93]. For this reason, the F-RISC/G register file must have READ access times below 200 ps and a WRITE time below 500 ps.

Historical Review

The original F-RISC/G register file design was initiated in December 1991 and completed in April 1992. The design was based upon the Rockwell GaAs HBT process that, at the time, incorporated fixed device layouts provided by the foundry and two interconnection levels. Since then, there have been several process generations that have relaxed the minimum design rules and added a third level of metallization. These changes have since been incorporated into the register file layout.

Although F-RISC/G is a 32-bit processor, the register file is actually a 32x8 circuit, providing only 8 bits of data. This organization was selected based upon the bit-slice design of the processor using 4 slices. Consequently, each F-RISC/G Datapath chip is a complete 8-bit datapath that is combined with 3 other slices to form a 32-bit processor.

The original register file designer, taking into consideration the yield and power constraints of the early Rockwell GaAs HBT process, selected a single-port design in order to avoid the additional devices required to implement a second port. Unfortunately, this placed very severe timing requirements upon the circuit, requiring two READ accesses and one WRITE every nanosecond. In order to emulate a two-port register file, the single-port design has two multiplexers attached to the output signals (Figure 3.1). Each multiplexer in turn passes the data on to a latch that holds the information until the processor is ready to accept it. Although the latches alone would be sufficient to hold the register file data until the processor is ready, there is another situation in which data arriving from off-chip needs to be fed directly into the latches. As a result, both the latches and the multiplexers are required and must be included in the register file cycle time. For this reason, the actual time available to perform a READ operation is 200 ps, which allows up to 50 ps for the data to pass through the output multiplexers and latches.

Figure 3.1- F-RISC/G system diagram around the register file

As described in Chapter 1, the register file was included in the original RPI testchip to test the circuit and process performance. Experimental results indicated that the circuits were functionally correct but operated significantly slower than predicted. Since the original design, the register file has been optimized and redesigned several times with each iteration prompted by "more accurate" capacitance and device information. Currently the register file has a simulated READ access time of 189.4 ps and consumes 2.05 W of power.

Register File Architecture and Operation

In keeping with the spartan RISC philosophy of the Fast RISC design group, the register file uses a rather standard bipolar memory design [COOPE80] in order to minimize the amount of circuitry required. Others have developed more complex circuits that improve the performance through either circuit or architectural modifications [CHUAN88, MIYAN84, YOSHI83, HIROS90], but the changes are usually either not applicable to the differential F-RISC/G circuits or impractical in terms of device count. The register file is an asynchronous circuit, changing state based upon the address and WRITE signals rather than an external clock.

As shown in Figure 3.2, there are basically six components within the register file. These are the memory cells, address drivers, word decoders, read/write logic, sense amplifiers and threshold voltage generator. The address drivers boost the address input signals in order to drive the long internal lines. The word decoders select a row in the memory cell array and enable that row to perform a READ or WRITE operation. The memory cells store the logical state information. Sense amplifiers are used to detect a small change on the bitlines during a READ operation and generate an output signal. By controlling the bitline voltages the read/write logic dictates whether a READ or WRITE is performed. The threshold voltage generator is used by the read/write logic to properly bias the bitlines, ensuring that the register file is able to perform the READ and WRITE operations.

Figure 3.2 - Register file architecture

READ Operation

Because the register file is asynchronous, a READ cycle begins when the WRITE signal is low and the address changes (the logical path through the register file for a READ operation is shown below in Figure 3.3). A low WRITE signal is required to disable the read/write logic and prevent the memory cells from being modified. When a new address is applied to the input pins, the word decoders select one row of memory cells within the register file. The selected memory cells then drive the bitlines to whatever logical value is stored within them. The sense amplifier at the bottom of the array detects the impending change on the bitlines and switches the output signals accordingly. Because the sense amplifier can detect small changes on the bitlines, the data can appear on the output nodes before the internal bitlines have switched fully. This speeds up the register file and helps to provide some isolation between performance and the bitline capacitance. One example of the READ cycle timing signals is shown in Figure 3.4.

Figure 3.3 - READ path through register file architecture

Figure 3.4 - Register file READ cycle timing

The tahold signal is the length of time that the address must be held constant in order to ensure that the corresponding data will appear upon the output pins at time taccess. The twsetup time is the maximum delay before WRITE must go low if a READ is to be performed. Similarly, twhold is how long WRITE must be held low in order to maintain the READ operation.

One interesting characteristic to note is that twsetup is positive, indicating that the WRITE signal may be high although the desired READ address has been placed on the input pins. Because the internal delay for the address signals is significantly larger than for the WRITE signal, the WRITE may be sent low after the READ address is applied. Of course, this also means that WRITE must be held low by the same amount after the READ address is removed, otherwise a WRITE will be performed at the end of the cycle. For this reason, twhold is typically the same as tahold. As discussed later, these internal latencies provide the opportunity to increase the register file throughput by applying new address signals before the data has appeared on the output pins. This technique is commonly referred to as wave-pipelining.

WRITE Operation

A WRITE cycle is nearly identical to a READ except that a value is being "pushed" into the memory cells rather than out onto the bitlines. Figure 3.5 depicts the WRITE path through the register file. As with a READ cycle, a row in the register file is selected via the address decoder circuit. The data to be stored is sent to the read/write logic that then drives the bitlines accordingly, forcing the selected memory cells to change state and store the new data. The register file has approximately 250 ps in which to perform the WRITE. Figure 3.6 contains an example of the WRITE cycle timing.

Figure 3.5 - WRITE path through register file architecture

Figure 3.6 - Register File WRITE Cycle Timing

The WRITE cycle timing signals are similar to the READ timing but with a few important changes. Due to the difference between the address and WRITE signal internal delays, the address must be applied before the WRITE signal (tasetup). Although it is not required, WRITE can be applied up to twsetup before the data is present because the data internal latency is less than for WRITE. Once applied, the data must be held for tdhold in order to ensure that the memory cells change state. Due again to the internal delays, the address and WRITE signals must be held for a certain length of time after the data is first applied to allow the memory cells to change state. The data written into the memory cells appears on the output pins after a delay of tWRITE. Note that the address and WRITE signals should not be held beyond tasetup and twsetup after the data changes, otherwise the new data may be written into the cells.
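The READ and WRITE behavior described above can be summarized in a small behavioral sketch. It models only the logical function of a 32x8 single-port array (no setup, hold or access times), and the class and function names are hypothetical, chosen for illustration rather than taken from the F-RISC/G design. The second function assumes the access order of two READs followed by one WRITE in a processor cycle, with the two output latches modeled as plain variables.

```python
class RegisterFileModel:
    """Behavioral sketch of a 32x8 single-port, asynchronous register file.

    Timing is deliberately not modeled; only the logical effect of the
    address, WRITE and data inputs is captured.
    """

    ROWS = 32   # 5 address bits
    WIDTH = 8   # bits per row in one datapath slice

    def __init__(self):
        self.cells = [0] * self.ROWS

    def access(self, address, write, data_in=0):
        """Apply the inputs and return the value seen at the outputs."""
        if not 0 <= address < self.ROWS:
            raise ValueError("address out of range")
        if write:
            # WRITE high: the read/write logic forces the selected row.
            self.cells[address] = data_in & ((1 << self.WIDTH) - 1)
        # In both modes the selected row appears at the outputs.
        return self.cells[address]


def emulate_two_port_cycle(rf, addr_a, addr_b, addr_dest, operation):
    """One processor cycle: two READs into latches, then one WRITE."""
    latch_a = rf.access(addr_a, write=False)   # first READ
    latch_b = rf.access(addr_b, write=False)   # second READ
    result = operation(latch_a, latch_b) & 0xFF
    rf.access(addr_dest, write=True, data_in=result)
    return latch_a, latch_b, result


rf = RegisterFileModel()
rf.access(1, write=True, data_in=20)
rf.access(2, write=True, data_in=22)
print(emulate_two_port_cycle(rf, 1, 2, 3, lambda b, c: b + c))
```

The example performs A=B+C on one 8-bit slice, which is why the result is masked to 8 bits; carries between slices are outside the scope of this sketch.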

Circuit Operation

A simplified schematic of the new register file circuit is shown below in Figure 3.7 with the address driver and threshold voltage generator circuits left out. As mentioned earlier, the circuit is intentionally kept simple compared to other designs in order to keep device count low.

Figure 3.7 - Simplified register file circuit schematic

Address Decoder

The address decoding is performed using single-ended logic. This is accomplished by treating the differential address signals as two single-ended signals, each of which is connected to 16 of the 32 rows in the register file. A row is selected when all of the signals connected to the decoder inputs are high. This turns off the decoder transistor connected to the address lines, which raises the potential of the word line driver device. As a result, the word line potential also rises and selects the row. Because the address signal is differential, the rows connected to one single-ended component of the differential signal must not be connected to the other half as well; otherwise some rows would be permanently deselected.
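The decoding rule above can be illustrated with a short logical sketch. It assumes five differential address bits (for 32 rows), splits each into a "true" and a "comp" single-ended line, and connects every row to exactly one of the two lines per bit. The names used here are hypothetical, and only logic levels are modeled, not the transistor-level behavior.

```python
ADDRESS_BITS = 5
ROWS = 1 << ADDRESS_BITS  # 32 rows


def address_lines(address):
    """Split each differential address bit into two single-ended levels."""
    lines = {}
    for bit in range(ADDRESS_BITS):
        value = (address >> bit) & 1
        lines[(bit, "true")] = value == 1
        lines[(bit, "comp")] = value == 0
    return lines


def row_connections(row):
    """Each row taps one half of every differential pair, never both."""
    return [(bit, "true" if (row >> bit) & 1 else "comp")
            for bit in range(ADDRESS_BITS)]


def selected_rows(address):
    """A row is selected when all of its connected lines are high."""
    lines = address_lines(address)
    return [row for row in range(ROWS)
            if all(lines[conn] for conn in row_connections(row))]


def fanout(line):
    """Number of rows connected to one single-ended line."""
    return sum(line in row_connections(row) for row in range(ROWS))


print(selected_rows(19))   # the single selected row
print(fanout((0, "true")))  # rows loading one single-ended line
```

With this wiring every address selects exactly one row, and each single-ended line loads 16 of the 32 rows, matching the description in the text.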

Read/Write Logic

The purpose of the read/write logic is to set the bitline voltages depending upon the mode of the register file (e.g. READ or WRITE). The circuit is copied in the threshold voltage generator in order to create a feedback loop that can track VBE changes in the read/write logic during operation and compensate for them. As the circuit name implies, it has basically two modes of operation: READ mode and WRITE mode.

READ Mode

During READs, the read/write logic attempts to set the bitlines to a mid-range value based upon the threshold voltage reference. Because this mid-range potential is between the high and low collector voltages in the selected memory cell, the read/write logic establishes the voltage on the lower bitline while the memory cell establishes the high bitline voltage. Changing the threshold voltage modifies the mid-range potential as well as the bit line swing.

WRITE Mode

The read/write logic stores new data in a memory cell by allowing one bitline to become sufficiently low to turn on the "off" device in the selected memory cell. When the base-emitter voltage of that device exceeds VBE, the device begins to conduct current. This in turn pulls down the voltage on the base of the other ("on") device in the cell, placing it into the cut-off state. The read/write logic also raises the other bitline in order to reduce the base-emitter voltage of the previously conducting device below VBE and cut it off.

Care must be taken when forcing a bitline high during a WRITE, otherwise the recovery time for the operation may be unnecessarily increased. When the read/write logic forces the bitline potential high, excess charge is placed upon the node in order to perform the WRITE. Afterwards, this charge must be dissipated through the constant current-sources in the sense amplifier. Consequently the transition between WRITE and READ modes will take longer when more charge is placed upon the bitline during the WRITE.

Memory Cell

The memory cell is a regenerative circuit that (by definition) retains its logical state until forced into a different state. It is essentially composed of two dual-emitter transistors with Schottky diode and resistive pull-ups. By cross-coupling the base connections to the collector of the other device, a positive feedback loop is established, hence only one device may be conducting current at any time. As described in the previous section, a new state may be stored in the memory cell by manipulating the bitline voltages and forcing the memory cell devices on (off) or off (on).

Design Requirements

In order for the register file to function properly, there are a number of design requirements that must be met under all circumstances. Failure to adhere to these requirements may result in marginal operation or even circuit failure. While these are by no means all of the requirements, they are the most important in terms of proper operation.

Word line swing ≥ 800 mV

Memory cell hold voltage ≥ 200 mV

Threshold voltage should be set to 50-60% of the swing of a selected memory cell
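As a rough illustration, the three requirements can be expressed as a simple check. This is a sketch under stated assumptions: the first two limits are read as minimum values, the threshold is assumed to be measured from the low level of the selected cell's swing, and the function and parameter names are hypothetical.

```python
def check_design_requirements(word_line_swing_mv, hold_voltage_mv,
                              threshold_mv, cell_swing_mv):
    """Return a pass/fail flag for each of the three listed requirements.

    threshold_mv is assumed to be measured relative to the low level of
    the selected memory cell's swing (an assumption of this sketch).
    """
    threshold_fraction = threshold_mv / cell_swing_mv
    return {
        "word_line_swing": word_line_swing_mv >= 800.0,
        "hold_voltage": hold_voltage_mv >= 200.0,
        "threshold_fraction": 0.50 <= threshold_fraction <= 0.60,
    }


# Illustrative operating point only; these are not measured values.
print(check_design_requirements(850.0, 220.0, 275.0, 500.0))
```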

Register File Performance

To date there have been two fabrication runs that contained the RPI testchip, the original 1992 run and the HSCD run. The target performance of the register file was a sub-200 ps READ access time. Testing of the testchip from the original fabrication run was severely hampered by the device yield, and consequently only one register file was measured. The HSCD run had significantly higher yield and provided many testable chips; however, the register file circuits have only recently been fully tested, and consequently the results were obtained too late to influence the redesign process.

To test the circuit, two linear-feedback shift registers (LFSRs) must work at the same frequency. Of the 56 dies available from the 1992 run, only one had both LFSRs working, and then only at 1 GHz; consequently, the access time was measured at 500 ps. Recent results from the HSCD run have shown the register file to be operational at 222.7 ps. The testing results are discussed later in this section and also in Appendix C.

Before the register file performance on the HSCD fabrication run was verified experimentally, it was not clear just how fast the circuit would operate. To predict the maximum speed of the register file, a series of SPICE simulations was performed with different interconnection and device models (see Table 3.1). The original Rockwell design rule manual device model and the estimated 2-D capacitance numbers predict rather good performance (186.1 ps), but the latest models (2-sided switching device model, anisotropic reduced interlevel dielectric model) predict 343.6 ps, an 84.6% increase.

Q1, Q3 Device Models             | Interconnection Model    | READ Access Time
---------------------------------|--------------------------|-----------------
1990 Rockwell                    | 1990 Rockwell, 2-D model | 186.1 ps
1990 Rockwell                    | anisotropic reduced ILD  | 236.4 ps
1996 Rockwell                    | anisotropic reduced ILD  | 243.0 ps
30 GHz Q1, 1990 Rockwell Q3      | anisotropic reduced ILD  | 234.2 ps
30 GHz Q1, 2-sided Q3            | anisotropic reduced ILD  | 256.2 ps
2-sided switching Q1, 2-sided Q3 | anisotropic reduced ILD  | 343.6 ps
2-sided switching Q1, 2-sided Q3 | no capacitance           | 141.1 ps
3-sided switching Q1, 2-sided Q3 | anisotropic reduced ILD  | 328.2 ps

Table 3.1 - Comparison of simulated register file access times for different models
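The comparisons drawn from Table 3.1 are simple arithmetic on the tabulated access times. The sketch below reproduces the quoted increase between the original and latest models, and the difference between the capacitively loaded and capacitance-free simulations with the same device models. The dictionary keys and function name are illustrative labels, not terms from the simulation setup.

```python
# READ access times in picoseconds, copied from Table 3.1.
READ_ACCESS_PS = {
    "original models (1990 Rockwell, 2-D)": 186.1,
    "latest models (2-sided, anisotropic reduced ILD)": 343.6,
    "latest device models, no capacitance": 141.1,
}


def percent_increase(old, new):
    """Relative increase of new over old, in percent."""
    return 100.0 * (new - old) / old


original = READ_ACCESS_PS["original models (1990 Rockwell, 2-D)"]
latest = READ_ACCESS_PS["latest models (2-sided, anisotropic reduced ILD)"]
unloaded = READ_ACCESS_PS["latest device models, no capacitance"]

increase_pct = percent_increase(original, latest)
parasitic_delay_ps = latest - unloaded

print(f"increase over original models: {increase_pct:.1f}%")
print(f"delay attributable to parasitics: {parasitic_delay_ps:.1f} ps")
```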

The best possible circuit performance occurs when there is no parasitic capacitance and is referred to as the intrinsic performance. The intrinsic performance is a direct reflection of the device speed and sets an absolute lower bound on the access time that can be reached by optimizing parasitic capacitance. With a difference of 202.5 ps (143.2%) between the intrinsic and capacitively-loaded performance, parasitic capacitance clearly is a problem.

A detailed breakdown of the READ access time is shown in Figure 3.8 for the various device and interconnection models. Note that the largest component of the access time is due to the memory cells and bitlines. The address drivers and address lines are the second largest contributor followed by the wordline and its driver and then the sense amplifiers.

Figure 3.8 - Breakdown of READ access time for different models

Recently, experimental data from the testchip on the HSCD fabrication run has been obtained and compared to the predicted values above. The fastest performance has been measured at 222.7 ps, close to the value predicted by the 50 GHz models and well below the access time predicted by the 2-sided switching model. Other experimental data from the testchip has verified the accuracy of the 2-sided switching model; hence the test results for the analog register file indicate that the circuit does not fit well within the digital switching circuit paradigm. Unfortunately, this information arrived after the redesign process. On the other hand, the access time of the redesigned register file is expected to be well below the required value, providing a substantial safety margin (see Chapter 4). A photograph of the HSCD testchip register file MATCH output with an intentional mismatch is shown below in Figure 3.9 at the maximum observed frequency of 4.49 GHz, corresponding to a READ access time of 222.7 ps. Figure 3.10 shows the same signal on the verge of failure at 4.56 GHz (219.7 ps). Due to the testing scheme used on the chip, the actual access time of the register file is ½ of the waveform period and the pattern consists of 31 bits. The HSCD testchip results and analysis are contained in Appendix C.

Figure 3.9 - Testchip register file MATCH output with intentional mismatch at 4.49 GHz

 

Figure 3.10 - Register file MATCH output on the verge of failure at 4.56 GHz

 

Characterization of Original Register File

In order to determine where to focus the optimization process, a series of characterizations were performed in SPICE using the 3-sided device switching model and the anisotropic dielectric interconnection model. Because the largest physical nodes in the register file layout are the address, bit and word lines, the characterization process began there. These three nets presented the best opportunity to affect large amounts of parasitic capacitance and significantly improve performance.

The characterization process varied the capacitance of each component over a wide range and measured the READ access times using SPICE. The resistance of the lines was relatively small but was included in order to account for any RC effects. QuickCap was used with the anisotropic dielectric interconnection model to estimate the capacitance of each structure.
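The sensitivity extracted from such a sweep is the slope of access time versus capacitance. A minimal sketch of that calculation is given below using an ordinary least-squares fit. The sweep values are hypothetical placeholders generated from a made-up linear relation purely to exercise the function; they are not SPICE results, and the function name is illustrative.

```python
def least_squares_slope(xs, ys):
    """Slope of the best-fit straight line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# HYPOTHETICAL sweep: capacitance steps and the resulting access times.
# Placeholder numbers only, generated from an invented linear relation.
capacitance_steps = [100.0, 150.0, 200.0, 250.0, 300.0]
access_times = [300.0 + 0.1 * c for c in capacitance_steps]

sensitivity = least_squares_slope(capacitance_steps, access_times)
print(f"sensitivity: {sensitivity:.3f} time units per capacitance unit")
```

A fit over the whole sweep, rather than a two-point difference, also gives a quick check of how linear the response is, which matters for the RC discussion later in the chapter.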

Address Line Characterization

QuickCap estimated the maximum parasitic capacitance of the thirty-two 1.5 mm x 8 µm metal-2 address lines at about 570 fF. An additional 95 fF was added between the conductors to reflect the Miller capacitance (i.e. the capacitance between adjacent address lines). The sensitivity analysis was performed over a range of 470 fF to 670 fF centered about the extracted value (the Miller capacitance was not varied). The simulation results indicate that the address line capacitance has relatively little effect upon the overall READ access times and virtually no effect upon the WRITE time. A plot of the READ sensitivity to capacitance is shown in Figure 3.11.

Figure 3.11 - READ sensitivity to address line capacitance

Bit Line Characterization

There are sixteen 3.0 µm wide metal-1 bitlines within the register file that are approximately 1.6 mm long and cross over the 64 upper and lower word lines. Due to their length, relatively high total wire resistance and large number of cross-overs, the bit lines were assumed to be the most critical physical component by far. Although this was disproven when the wordline sensitivity was calculated, the magnitude of the bitline capacitance makes them the largest contributor to the register file delay.

Figure 3.12 - READ sensitivity to bit line capacitance

Word Line Characterization

Each row in the register file contains both an upper and lower word line. The upper wordline provides current to the memory cells within the row and the lower word line provides a path to ground for the hold current. To select a row, the upper word line potential is raised, pulling the internal memory cell voltages and the lower word line along with it. The wordline provides all of the current for the memory cells and half of the bitline current. Each metal-3 wordline is approximately 0.49 mm long and has approximately 110 fF of capacitance (80 fF for the lower wordline).

Figure 3.13 - READ sensitivity to upper word line capacitance

Sensitivity Comparison

The READ access time sensitivities of all three components are listed in Table 3.2 and shown in Figure 3.14. Although the word and bit lines have nearly identical sensitivities, the sheer magnitude of the bit line capacitance (and thus the opportunity for improving performance) makes the bit lines the preferred starting point for optimization. An added benefit is that reducing the bit line capacitance may also reduce the word line parasitics, resulting in a win-win opportunity.

Conductor    | READ Sensitivity
-------------|-----------------
Address line | 0.023 ps/pF
Bit line     | 0.110 ps/pF
Word line    | 0.109 ps/pF

Table 3.2 - Comparison of address, bit and word line READ sensitivity to capacitance

Figure 3.14 - Comparison of READ sensitivities

RC / Transmission Line Limitations

Before creating an entirely new memory circuit layout, it is useful to determine the limits of the current design. The simulated sensitivity to bit, word and address line capacitance is roughly linear, indicating that the circuit is operating in the linear regime and is not RC-limited. Due to the relatively small area of the register file and the large feature sizes, it is highly unlikely that any nodes are RC-limited. The address and word lines are both fairly wide (to handle the high current densities) and consequently have low resistance. In contrast, the bitlines have a relatively significant amount of resistance (~50 Ω) due to their use of the first level of metallization, their long length and their relatively narrow width. Table 3.3 shows the time constants for the register file address, bit and word lines.

Node          | Total Resistance | Total Capacitance | Time Constant τ
--------------|------------------|-------------------|----------------
Address lines | 2.5 Ω            | 340 fF            | 0.85 ps
Bit lines     | 51 Ω             | 250 fF            | 12.75 ps
Word lines    | 1 Ω              | 88 fF             | 0.088 ps

Table 3.3 - Time constants for register file address, bit and word lines
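The time constants in Table 3.3 follow directly from the product of total resistance and total capacitance. The short check below recomputes them, taking the resistances in ohms and the capacitances in femtofarads; the dictionary and function names are illustrative.

```python
# (total resistance in ohms, total capacitance in fF), from Table 3.3.
NODES = {
    "Address lines": (2.5, 340.0),
    "Bit lines": (51.0, 250.0),
    "Word lines": (1.0, 88.0),
}


def time_constant_ps(resistance_ohm, capacitance_ff):
    """RC time constant in picoseconds (1 ohm x 1 fF = 1 fs = 0.001 ps)."""
    return resistance_ohm * capacitance_ff * 1e-3


for node, (r_ohm, c_ff) in NODES.items():
    print(f"{node}: {time_constant_ps(r_ohm, c_ff):.3f} ps")
```

Even the largest value, for the bitlines, is small compared with the READ access times of roughly 200 ps or more discussed earlier, which supports the conclusion that the nodes are not RC-limited.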

While the time constants may change between different configurations, it is unlikely that RC effects will become important unless the circuit size is dramatically increased. The address and word lines carry large amounts of current (about 15-20 mA apiece), therefore the lines must be relatively wide to keep the current density within the process limits. This consequently increases the parasitic capacitance somewhat but also keeps the resistance low. The bitlines may become RC-limited if their length increases significantly in future designs.

To determine if transmission line effects are significant, the electrical length of a signal can be compared to the actual conductor lengths. The electrical length l of a signal is given by

$$ l = \frac{u_p}{f} \qquad \text{(Eq. 1)} $$

where $u_p$ is the phase velocity of a signal in the system and $f$ is the signal frequency. The phase velocity is related to the relative permeability and permittivity of the dielectric by

$$ u_p = \frac{c}{\sqrt{\mu_r \varepsilon_r}} \qquad \text{(Eq. 2)} $$

where $c$ is the speed of light. For most materials, $\mu = \mu_0$ (that is, $\mu_r = 1$). Within the Rockwell process, the relative dielectric constant $\varepsilon_r$ is listed as 2.9, but our experience has indicated that a value of 3.5-4.0 generates results that more closely match experimental measurements. This value in turn yields a phase velocity of $1.5 \times 10^{10}$ cm/s. Simulations using capacitively loaded devices indicate that the maximum signal risetime is about 20 ps, which corresponds to an electrical length of about 30 mm. Given that the register file is less than 2.1 mm on a side and the longest node is 1.65 mm, it is unlikely that transmission line effects are present.
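The phase velocity quoted above can be checked numerically. The sketch below assumes a non-magnetic dielectric (relative permeability of 1) and the standard relation between phase velocity and relative permittivity. It also compares the longest node with the quoted electrical length of 30 mm rather than deriving that length, because the conversion from risetime to signal frequency is not stated in the text. Function and constant names are illustrative.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def phase_velocity_cm_per_s(eps_r, mu_r=1.0):
    """Phase velocity in cm/s for relative permittivity eps_r.

    mu_r defaults to 1, i.e. a non-magnetic dielectric (assumption).
    """
    return 100.0 * SPEED_OF_LIGHT_M_PER_S / math.sqrt(mu_r * eps_r)


for eps_r in (3.5, 4.0):
    print(f"eps_r = {eps_r}: {phase_velocity_cm_per_s(eps_r):.3e} cm/s")

# Longest node versus the electrical length quoted in the text.
LONGEST_NODE_MM = 1.65
QUOTED_ELECTRICAL_LENGTH_MM = 30.0
length_ratio = LONGEST_NODE_MM / QUOTED_ELECTRICAL_LENGTH_MM
print(f"longest node / electrical length = {length_ratio:.3f}")
```

A relative dielectric constant of 4.0 gives very nearly 1.5 x 10^10 cm/s, and the longest node is only a few percent of the quoted electrical length.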

Summary

This chapter has described the operating and timing requirements for the F-RISC/G 32x8 register file. The basic architecture has been described and each of the major components has been examined in detail. Both the READ and WRITE modes of operation have been discussed along with the input and output signal timing. The performance was examined using several device and interconnection models in order to determine their relative effects. The READ access time was broken down into the individual delays from each circuit block to determine which provided the largest contribution to the overall delay. The performance of the register file was then characterized in terms of parasitic capacitance for the three largest nodes in the circuit, namely the address, bit and word lines, in order to find the best starting point for optimizing the circuit performance.

Original Register File Statistics

Physical Dimensions | 0.96 mm x 2.1 mm
Logical Dimensions  | 32 rows x 8 bits
Power               | 2.05 W
Longest Node        | 1647.0 µm (bitlines)
Device count        | 1466 transistors, 530 diodes

Table 3.4 - Original Register File statistics