CHAPTER 3

F-RISC/G Implementation

The latest implementation of the F-RISC processor, F-RISC/G, has been designed for current-mode logic (CML) with GaAs/AlGaAs HBTs from Rockwell International. All on-chip wiring is differential with the wires in each pair placed in adjacent tracks to improve noise immunity. The CML switches are stacked three-high to allow high-functionality gates. The gates use a non-standard resistive current source.

The low yield of Rockwell's HBT process forced us to partition the F-RISC/G central processor into five chips—an instruction decoder and a four-chip bit-sliced datapath. The caches, designed by another student, are direct-mapped with write-back logic. For the lowest possible inter-chip communication latency, the chips would be packaged on a thin-film multi-chip module (MCM). A novel system for clock distribution, also designed by another student, employs feedback to reduce chip-to-chip clock skew.

The bit-sliced datapath architecture forced some minor architectural changes: a severely limited shifter, simplified protected-mode support, and new restrictions on interrupt handlers.

Shortly after the publication of the Berkeley RISC processor, the idea of a GaAs RISC was raised[Keir85]. The pairing seems natural. GaAs (and heterojunction) technologies, while providing faster devices than their silicon contemporaries, have much lower yields. The low yields, and relatively high prices, had previously relegated GaAs mostly to a supporting role providing a few small, fast parts to speed up critical paths in multi-chip designs. Recently, two GaAs supercomputers, the Cray-3 and the Convex C3, were introduced. Each contains thousands of small GaAs MESFET chips[Cost93]. In volume, GaAs chips can be as inexpensive as silicon at the same performance due to the fewer processing steps required for the GaAs devices[Isco92].

Because RISC processors use only a fraction of the transistors included in a CISC design, RISCs require relatively few GaAs chips. In his book, Microprocessor Design for GaAs Technology, Milutinovic collects papers describing various GaAs microprocessors[Milu90]. Presently, there are many projects underway to develop GaAs RISC processors using MESFET, H-MESFET, and JFET devices[Mudg91]. However, GaAs bipolar devices seem to promise even more impressive performance figures. Also, since the bipolar devices do not rely on the extremely fine-line lithography of modern CMOS, they may eventually surpass CMOS for yield and cost at high performance.

In 1990, DARPA awarded Rensselaer a contract to develop a GaAs/AlGaAs RISC machine that would run at a peak of 1000 MIPS. This machine was christened "F-RISC/G." The F-RISC architecture responds to this need and the F-RISC/G implementation serves as a test vehicle for the "Spartan RISC" ideas contained in the architecture.

3.1 TECHNOLOGY

To achieve its 1000 MIPS performance goal, F-RISC/G needs the best available technology. Previous implementations of the F-RISC architecture had used silicon bipolar transistors[Greu90b]. Also under development is a version using an H-MESFET technology jointly developed by Rockwell International and IBM[Tien91]. More recently, work has begun on adapting the F-RISC/G implementation for use in Xilinx FPGAs and Si/Ge HBTs.

Figure 3.1: Layer structure of Rockwell HBT device
(Adapted from [Asbe89].)

3.1.1 Heterojunction Bipolar Transistors

The circuits in F-RISC/G use an AlGaAs/GaAs heterojunction bipolar transistor (HBT) technology from Rockwell International[Asbe89]. The devices are formed in MBE-grown epitaxial layers on semi-insulating substrates as shown in figure 3.1. The devices of primary interest have 1.4 μm × 3 μm emitter stripes and a DC current gain as high as 400. Some key SPICE parameters are in table 3.1. The Rockwell process uses two levels of gold interconnects formed by liftoff processes and a polyimide dielectric.

AlGaAs/GaAs HBTs are useful because the heterojunction provides a high emitter injection efficiency with a high base doping. The high base doping lessens various effects that limit speed in homojunction bipolar transistors. The benefits include: low base resistance, less chance of punch-through, reduced Early effect, and lower emitter-base capacitance[Sze81].

Table 3.1
Key SPICE Parameters and Sizes

Emitter area    1.4 μm × 3 μm
fT              50 GHz
Re              15 Ω
Rb              140 Ω
Rc              50 Ω
VBE             1.35 V
Cjeo            7 fF
Cjco            11 fF

3.1.2 Current Mode Logic

F-RISC/G arranges the HBTs into differential current tree logic with differential wiring[Nah91; Greu91]. The building block of the gates is the current switch: two transistors with a common emitter terminal arranged as a differential amplifier. A simple gate with one current switch is shown in figure 3.2. The lower resistor, RS, acts as a current source. Depending on the input signal, the two transistors steer that current either through the left or through the right load resistor. The resulting voltage difference is the output signal. The voltage swing is fixed at 250 mV by the current source and the load resistors, giving a large ratio between the currents in the on and off transistors[Trea89]. The switching characteristics of this gate are shown in figure 3.3.
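The arithmetic behind these fixed quantities is simple enough to check directly. The following sketch (in Python) computes the load resistance implied by the 250 mV swing for the three gate tail currents given later in this section, along with the on/off current ratio of an ideal differential pair; the thermal voltage is an assumed room-temperature value, and the resistances are illustrative rather than values from the F-RISC/G cell library.

    import math

    V_T = 0.0259   # assumed thermal voltage at room temperature, volts
    V_SW = 0.250   # fixed differential voltage swing, volts

    # Tail currents for the three F-RISC/G gate power levels
    for name, i_s in [("high", 1.6e-3), ("medium", 1.0e-3), ("low", 0.6e-3)]:
        r_l = V_SW / i_s   # load resistor implied by the 250 mV swing
        print(f"{name:6s}: R_L = {r_l:5.0f} ohms")

    # Ideal differential pair: the ratio of on to off collector current
    # for a differential input Vd is exp(Vd / V_T)
    print(f"on/off current ratio at 250 mV: {math.exp(V_SW / V_T):,.0f}")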

Figure 3.2: Differential current switch

The use of differential wiring provides three advantages. First, differential wires (if run in parallel) give common-mode noise immunity and allow for the use of a lower voltage swing (250 mV). Second, the need to generate and distribute three reference voltages is eliminated. Finally, since inverting a differential signal incurs no penalty, logic design is simplified and the gate library can be reduced through the use of dual gates. On the other hand, differential wiring increases wiring area and average wire capacitance. Each time one line in a differential pair switches, the other line switches in the opposite direction. Since these lines are placed in close proximity (for noise immunity), the effective capacitance doubles. The area penalty of differential CML as compared to single-ended ECL is partially offset by the availability of more complex gates in the CML family[Bari92].

Figure 3.3: Switching characteristics of high-powered gate

Gate Design with Stacked Current Switches

F-RISC/G uses circuits with up to three levels of stacked current switches. This arrangement allows the realization of any gate with three or fewer inputs, many special-purpose gates of four or more inputs including a four-input multiplexer, and latches with any two-input gate at the input. Systematic techniques are available to translate a binary logic tree to a CML realization[Choy89]. Two examples of complex gates are shown in figure 3.4.

The ALU gate, shown in figure 3.4(a), can generate the XOR, AND, or OR of its A and B inputs.

The lower portion of the circuit forms a 2-to-4 decoder. The two odd parity terms together become the XOR term. Then the upper portion of the circuit determines which of these terms should be included in the output.

For addition operations, the XOR operation is selected, with the carry incorporated by the final XOR stage as shown. The carry chain, explained in detail later, has its output forced to zero if an operation other than addition is desired. In this case, the final XOR stage passes the ALU result unchanged.
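A small functional model makes this decoder-and-select structure concrete. This is a sketch only: the two select inputs and their encoding are assumptions (chosen to be consistent with the two-bit ALUOP bus in table 3.7), not the actual control wiring of the F-RISC/G gate. The OR case falls out of asserting both selects, since OR = AND + XOR.

    def alu_gate(a, b, sel_and, sel_xor):
        """Functional model of the three-function ALU gate (booleans)."""
        # Lower portion: 2-to-4 decode of A and B into minterms
        m_and = a and b                          # the A.B term
        m_odd = (a and not b) or (not a and b)   # the two odd-parity terms: XOR
        # Upper portion: select which terms appear in the output
        return (sel_and and m_and) or (sel_xor and m_odd)

    # Hypothetical select encodings: AND = (1,0), XOR = (0,1), OR = (1,1)
    for a in (False, True):
        for b in (False, True):
            assert alu_gate(a, b, True, False) == (a and b)
            assert alu_gate(a, b, False, True) == (a != b)
            assert alu_gate(a, b, True, True) == (a or b)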

Figure 3.4 (a):
Three-function ALU gate

Figure 3.4 (b):
Carry propagation gate

Figure 3.4 (b) shows one bit of the carry chain. This gate takes one bit from each of the two operands, A and B, and the input carry, CIN. The operation of the gate is similar to the operation of the ALU gate. Its output is the carry output to the next bit on the carry chain. It realizes the function:

COUT = A·B + (A ⊕ B)·CIN

Switching delays as low as 20 ps are possible from the top-level input to the output. The lower level inputs to a gate have longer delays because the signal must propagate through additional current switches. (These delays are summarized for high-powered gates in table 3.2.) Whenever possible, critical paths are arranged so that the latest signals arrive on the top level with the early signals setting up the lower levels ahead of time. This idea can be seen in the carry propagation gate of figure 3.4(b). The critical carry-in signal arrives at the top level of the gate. In the less critical ALU gate, the levels for the A and B signals were chosen to be the same as in the carry propagation gate.
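The carry function above, together with the delays of table 3.2, gives a rough feel for why the carry chain dominates the ALU timing. The sketch below assumes every stage switches from its top-level input at the published 24 ps level-1-to-level-1 delay; the actual F-RISC/G carry chain may use faster dedicated gates, so the total is illustrative only.

    def carry_out(a, b, cin):
        """One bit of the carry chain: COUT = A.B + (A XOR B).CIN"""
        return (a and b) or ((a != b) and cin)

    def ripple(a_bits, b_bits, cin):
        """Ripple the carry through a slice, least significant bit first."""
        for a, b in zip(a_bits, b_bits):
            cin = carry_out(a, b, cin)
        return cin

    T_TOP_LEVEL = 24     # ps, level-1 input to level-1 output (table 3.2)
    BITS_PER_SLICE = 8

    # With A and B set up early on the lower levels, only the carry
    # ripples through the top level of each gate.
    print("one slice:", BITS_PER_SLICE * T_TOP_LEVEL, "ps")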

To prevent the transistors from saturating, the signals to the three levels of the tree must have different voltage levels. As shown in figure 3.5, each level is offset from the adjacent levels by a fixed level-shift voltage. Each gate generates, as its output, only one of the three voltage levels. Triple-level current switch logic therefore complicates logic design significantly because signals cannot be arbitrarily connected to gates. Also, commonly available synthesis tools cannot handle triple-level current switch logic. Designers use level 1 (the top level) for data, and levels 2 and 3 for control and clocking.

Table 3.2
Propagation Delay in HBT CML Gates

                          Output Level
Input Level         1          2          3

     1            24 ps      31 ps      36 ps
     2            29 ps      40 ps      45 ps
     3            34 ps      49 ps      54 ps

Circuit Design

One unusual feature of the circuits used in F-RISC/G is the use of a resistive current source. This decision was prompted by the high VBE of the Rockwell devices—1.35 V compared to 0.85 V for a typical silicon bipolar transistor[Greu91]. The required supply voltage depends on the VBE of the transistors as follows:

for an active current source: VEE = 4·VBE + VS

for a resistive current source: VEE = 3·VBE + VS

where VS is the desired voltage across the current source or source resistor. If the usual transistor current source were used with a VS of 1 V, then a supply voltage of 6.4 V would be required. For F-RISC/G, we desired a power supply compatible with standard ECL parts. With the resistive current source, a standard 5.2 V power supply is sufficient.
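The two supply equations can be checked against the quoted figures. A minimal sketch, assuming the 1.35 V VBE from table 3.1 and VS = 1 V:

    V_BE = 1.35   # volts, Rockwell HBT (table 3.1)
    V_S = 1.0     # volts across the current source or source resistor

    v_active = 4 * V_BE + V_S      # transistor current source
    v_resistive = 3 * V_BE + V_S   # resistive current source
    print(f"active:    {v_active:.2f} V")     # 6.40 V, as quoted above
    print(f"resistive: {v_resistive:.2f} V")  # 5.05 V, within a 5.2 V supply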

Figure 3.5: Three-input NAND gate showing input voltage levels

The most important circuit parameter for a CML gate is the current flowing through the tree. Figure 3.6 shows the delay behavior of a gate as the current changes. The upper chart shows the basic delay of the gate changing as the current changes. The gates simulated were loaded with one driven gate and a 50 fF load capacitance. The lower chart shows the gate's sensitivity to load capacitance. The higher this sensitivity, the more the gate speed degrades as the fan-out or interconnect length increases.

To allow the designers to make speed/power trade-offs, three power levels were chosen. Each gate is available as high power, medium power, and low power. High-powered gates use 1.6 mA; medium-powered gates use 1.0 mA; low-powered gates use 0.6 mA.

Figure 3.6: Design curves for an HBT CML gate with level 1 output

Theoretical Analysis

Various authors have attempted to provide a theoretical framework in which to analyze the behavior of CML and ECL gates.

Ronald Treadway provides some interesting DC techniques[Trea89]. He gives equations for gate propagation, assuming that the current switch has zero delay. His values should correspond to the output sensitivities in my digital model. Table 3.3 shows the results of such a comparison. Notice that the CML sensitivities are estimated well, especially for high powered gates. (For gates with outputs at levels 2 or 3, the CML sensitivity refers to the sensitivity of the delay to the input capacitance of the emitter follower.)

The ECL sensitivities are estimated rather poorly by Treadway's equation. I suspect that the difference is in the differential logic. Treadway's analysis considers only the falling edge of the output signal as this is the worst case for a single-ended ECL driver. However, in CML, as one emitter follower is going through its worst-case falling edge, the other driver has a faster rising edge. The differential pair switches with an intermediate speed.

Table 3.3
Comparison of Theoretical Output Sensitivities to SPICE Results

Power,        Treadway    SPICE       Percent    Treadway    SPICE       Percent
Output        CML Sens.   CML Sens.   Deviation  ECL Sens.   ECL Sens.   Deviation
Level         (ps/pF)     (ps/pF)                (ps/pF)     (ps/pF)

High, 1         110         112         –2 %
High, 2         182         177          3 %        95          55         72 %
High, 3         182         185         –1 %        94          83         13 %
Medium, 1       172         179         –4 %
Medium, 2       228         228          0 %       167          70        138 %
Medium, 3       229         245         –7 %       164         118         39 %
Low, 1          284         293         –3 %
Low, 2          456         482         –5 %       329         120        174 %
Low, 3          456         515        –11 %       329         199         65 %

Impact of Resistor Current Source

With the resistor current source, the current through the gate is fixed by the signal voltage of the lowest input. For 2-input gates, this causes an ambiguity: the lower input could be either at level 2 or at level 3, depending on the circuit requirements (see figure 3.7). The source resistor will have a different voltage drop in each case. Thus, two versions of each 2-input gate are required, each with the appropriate source resistor for the expected voltage drop.

Figure 3.7: Two configurations of 2-input gates

The passive current source has an important disadvantage. The current through a gate (and thus its output swing) is not constant. From basic principles,

IS = (VI – VBE – VEE) / RS        (3.1)

where VI is the voltage of the lowest gate input and VEE is, in this case, the local supply voltage at the bottom of the current tree. Looking at the top of the tree,

VSW = IS · RL        (3.2)

where RL is the load resistance at the top of the tree.

Equations 3.1 and 3.2 have some important implications for designers. First, control of the supply voltage across the chip is critical to ensuring consistent gate performance. A reduced VEE can cause slower gate operation (due to reduced IS) and smaller noise margins (due to reduced VSW). I have designed a CAD tool that models the power rails on the chip and identifies regions where the voltage drops are unacceptably high.
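Equations 3.1 and 3.2 quantify this supply sensitivity. The sketch below assumes a high-power gate (1.6 mA tail current), a nominal 1 V drop across the source resistor, and a load resistor sized for the 250 mV swing; all three values are illustrative assumptions rather than F-RISC/G library data.

    I_NOM = 1.6e-3            # amperes, nominal tail current (assumed)
    V_RS_NOM = 1.0            # volts across the source resistor (assumed)
    R_S = V_RS_NOM / I_NOM    # 625 ohms
    R_L = 0.250 / I_NOM       # ~156 ohms, sized for the 250 mV swing

    def swing(rail_droop):
        """Output swing when the local supply magnitude sags by rail_droop volts."""
        i_s = (V_RS_NOM - rail_droop) / R_S   # equation 3.1
        return i_s * R_L                      # equation 3.2

    for droop in (0.0, 0.05, 0.10):
        print(f"droop {droop * 1000:3.0f} mV -> swing {swing(droop) * 1000:.0f} mV")

Under these assumptions a 100 mV sag in the local rail costs 10% of both the tail current and the output swing, which is the proportional degradation described above.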

Second, common-mode noise, which would be rejected by the differential amplifier, can be detrimental to gate performance by reducing noise margins and slowing gate operations as described above. Among two-input gates, the "top" configuration (in figure 3.7) is less susceptible to this effect, due to its higher RS. Thus, designers are advised to choose the "top" gates whenever possible.

3.1.3 Advanced Packaging

If a multiple-chip system is to have a cycle time of 1000 ps, then clearly the system requires advanced packaging. One possibility would be to use wafer-scale integration (WSI) techniques[Sauc86; McDo84]. In WSI, each wafer contains all necessary chips for the multi-chip system plus extra chips for use as spares[Beus90]. The working chips are connected into a system using discretionary wiring or fusible links, much as in a PROM. The wafer is packaged as a whole.

Much has been written about the merits and demerits of such an approach. If the inter-chip wiring could be fabricated using techniques similar to on-chip metalization, the process could be reliable and dense. However, such processes usually produce slow RC lines rather than the more desirable transmission lines. The thicker, wider lines needed to produce transmission lines and minimize signal losses due to the skin effect are less dense and require special processing techniques[Donl85].

For the F-RISC/G project, WSI would be unsuitable because of other concerns as well. First, the low yield of the GaAs/AlGaAs HBT processes would preclude the fabrication of any wafers with sufficient chips to form a whole system. Also, the need to fabricate many chip designs on the wafer, in differing numbers proportional to system requirements, would complicate the lithographic processes and require changing reticles during stepping operations. Finally, as we shall see later in this chapter, the timing constraints of the F-RISC/G module do not allow connections to remote "spare" chips to replace a malfunctioning die. The working chips must be positioned with minimal separation.

Multi-chip Module Packaging

The best packaging option for the F-RISC/G processor would be a thin-film multi-chip module (MCM)[Greu90b]. Similar to WSI, an MCM contains multiple chips in a single package. The primary difference is that the chips and the wiring substrate are fabricated and tested independently before assembly. Each can be optimized separately. The thin-film dielectric provides the low dielectric constant needed for high-speed, transmission-line interconnect. One disadvantage of MCM technology as compared to WSI is that the working dies are separated from the wafer and subjected to handling. Die breakage and poor connections to the wiring substrate are important concerns[Donl86].

Initially, it was assumed that the package would be developed and manufactured by the advanced packaging group at Rensselaer. However, it is increasingly apparent that this will not be the case. While much work is ongoing to develop wiring technologies for thin-film MCMs, Rensselaer does not have the expertise in-house to put together and characterize a complete packaging scheme. Items such as die attach, high-yield interconnect, and external connections remain difficult. Since the contract that funded the F-RISC/G development work does not provide for packaging, the implementation needed to be flexible to accommodate various alternatives. Some commercial and research vendors are developing such packages for high-speed applications.

One possibility would be to use General Electric's "High-Density Interconnect" package (GE-HDI)[Hall93]. It is a commercial packaging scheme that has already been used in various military applications and commercial test vehicles. Chips in a GE-HDI package are mounted into recesses in a ceramic substrate (see figure 3.8). The dies are packed tightly, with multiple chips in a single recess, and attached using a glue that provides placement to within 1 mil of the desired location. Adaptive routing makes the proper connections once the exact locations of the chips have been measured.

Figure 3.8: View of GE-HDI package before wiring layers are added

Inter-Chip Transfers

F-RISC/G is a multi-chip processor with a system clock of 1 ns. Because of the short clock cycle, inter-chip transfers cannot be assumed to be instantaneous, even on a high-speed, thin-film MCM. ECL poses especially difficult problems[Hamb93]. The floorplan of the proposed MCM is shown in figure 3.9. The cache memory has been simplified for the purposes of illustration. For a fuller treatment of the cache, see [VEtt93].

An important consideration in developing a timing strategy for F-RISC/G is the time spent transferring signals between and among chips. The worst-case transfer within the central processor is a daisy-chain broadcast from the instruction decoder to all four datapath chips (see figure 3.10). The minimum distance for a broadcast signal is:

dmin = 3·dchip + 4·dsep        (3.3)

where dchip is the chip width and dsep is the chip-to-chip spacing.

However, conditions do not often permit the minimum routing specified in equation 3.3. If the control signal must be broadcast early in the DE stage, then it is better to broadcast from the left side of the instruction decoder. Transfers on the MCM are faster than transfers across the chips. Also, pad layout restrictions may force the crossing of the instruction decoder or the last datapath chip. In the average case, the distance for the broadcast signal is:

davg = 4·dchip + 4·dsep

Figure 3.9: Floorplan of the F-RISC/G MCM package

Our initial studies in this area assumed the use of C4 (Controlled Collapse Chip Connections—IBM's flip-chip technology) to attach the chips to an MCM using a parylene dielectric. The signal propagation velocity on this MCM would be[Bako90]:

v = c / √εr ≈ 0.18 mm/ps        (3.4)

The major question was: Could the broadcast depicted in figure 3.10 be achieved in a single clock phase? The criterion for a single-phase broadcast is:

tDR + d/v + Δφ ≤ Tφ

where Δφ is the maximum allowed clock skew between chips, tDR is the total driver and receiver delay, d is the transfer distance, and Tφ is the duration of one clock phase. Since the allowable clock skew is the dependent variable, a more useful form is:

Δφ ≤ Tφ – tDR – d/v        (3.5)

In my estimates, I allowed 70 ps for driver and receiver delay and 20 ps for each gate delay. On the MCM, I assumed chips measuring 5 mm × 8 mm were placed 1 mm apart. The allowable clock skew for a single-phase transfer was 20 ps, which is barely acceptable. Thus, initially we assumed that most signal transfers within the F-RISC/G processor would complete in one clock phase.

Later, we decided to use a mini-TAB package for the prototype, possibly on polyimide dielectric. The inter-chip signal delays were recalculated. Two changes were made to the assumptions. First, the chip-to-chip distance increased to 3 mm. Also, the chip size increased to 8 mm due to the addition of boundary scan testing (see chapter 4) and the need for extra power pads to compensate for the increased inductance in the power leads. Repeating the same calculation yields a maximum skew below zero. Thus, the system timing needed to be changed to allow two clock phases (500 ps) for any transfers of this length. Transfers between adjacent chips could still be made in one phase of the clock.

Figure 3.10: Distances for broadcast within F-RISC/G central processor

More recently, we have been considering the GE-HDI package. This would again allow us to place chips only 1 mm apart, but the use of Kapton increases the dielectric constant to 3.5. Invoking equation 3.4 gives v ≈ 0.16 mm/ps, a loss of over 10% in propagation velocity. Taken together, though, the closer chip spacing has the greater effect. Using the results of equation 3.5, the new MCM is 20 ps faster in the average broadcast, 25 ps faster in the best case, and about equal in the worst-case adjacent-chip transfer. The savings are not enough to allow single-phase broadcasts, but could allow an additional gate before or after the transfer. Table 3.4 compares the MCM technologies using 8 mm chips.

Table 3.4
Comparison of Multi-chip Module Technologies

C4 on Parylene TAB on Parylene GE-HDI
Dielectric Parylene Polyimide Kapton
Propagation velocity 0.18 mm/ps 0.18 mm/ps 0.16 mm/ps
Chip attachment C4 TAB GE-HDI
Chip spacing 1 mm 3 mm 1 mm
Shortest adjacent transfer 6 ps 17 ps 6 ps
Longest adjacent transfer 94 ps 106 ps 106 ps
Shortest broadcast 156 ps 200 ps 175 ps
Average broadcast 200 ps 244 ps 225 ps
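The flight-time entries in table 3.4 follow directly from the stated geometry. The sketch below reproduces them, assuming 8 mm chips, the distance expressions of equation 3.3 (minimum broadcast) and the average-case expression above, and, for adjacent transfers, one gap (shortest) or two die widths plus one gap (longest); the gap counts are inferred from figure 3.10.

    CHIP = 8.0   # mm, chip width

    def flight_times(v_mm_per_ps, gap_mm):
        distances = {
            "shortest adjacent": gap_mm,
            "longest adjacent": 2 * CHIP + gap_mm,
            "shortest broadcast": 3 * CHIP + 4 * gap_mm,   # equation 3.3
            "average broadcast": 4 * CHIP + 4 * gap_mm,
        }
        return {k: round(d / v_mm_per_ps) for k, d in distances.items()}

    for name, v, gap in [("C4 on parylene", 0.18, 1.0),
                         ("TAB on parylene", 0.18, 3.0),
                         ("GE-HDI", 0.16, 1.0)]:
        print(name, flight_times(v, gap))

Under these assumptions the computed times match table 3.4 to the picosecond.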

3.2 IMPLEMENTATION DECISIONS

The F-RISC architecture description leaves some latitude for adapting the architecture to the particular implementation technology. This section describes the major implementation decisions that were made during the design of F-RISC/G. Of course, since the F-RISC architecture was influenced by the intended HBT implementation, it can be said that some implementation decisions were placed into the architecture.

Figure 3.11: System diagram for GaAs/AlGaAs HBT implementation of F-RISC

Figure 3.12 shows a diagram of the F-RISC/G implementation with clock signals indicated for the major latches in the design. The RFA and RFB latches serve to hold the two register file results—they are required by the single-port register file.

Figure 3.12: F-RISC/G system diagram showing timing information
(Figure adapted from [Greu90a])

3.2.1 Chip Descriptions

Due to the small capacity of the HBT chips, the architecture must be partitioned into multiple chips. This partitioning is shown in figure 3.11. The core processor is divided into two sections: the instruction decoder and the datapath. Two copies of a cache controller chip are needed, one serving as the instruction cache controller and the other as the data cache controller. The cache memories consist of sixteen memory chips—eight for each cache. The optional data alignment chip shown in figure 3.11 will not be present in the initial F-RISC/G prototype. It would have extracted bytes and half-words from the memory. Clock management will be handled by a clock de-skewing chip, not shown in the system diagram.

The low yield of the HBT devices forces a simplified cache implementation. F-RISC/G will have direct-mapped caches. The data cache will be write-back. (The instruction cache is read-only in the F-RISC architecture.) Each cache will be divided into 32 lines. This size was chosen so that the main register file memory can be copied and used directly as tag memory for the cache. The cache line size will be 16 words. To minimize the time of bus transfers between the primary cache and the second-level cache, the bus will be 16 words wide. Thus the entire cache line can be refilled in one bus cycle.

Table 3.5
Chip Names and Abbreviations

Chip                            Number    Abbreviation

Datapath                           4         DP
Instruction decoder                1         ID
Instruction cache controller       1         ICC
Data cache controller              1         DCC
Cache memory                      16         DM
Byte operations                    1         BYTE
Clock de-skew                      1         DSK

Throughout the rest of this document, the abbreviations shown in table 3.5 are used to refer to the above chips.

Instruction Decoder

The instruction decoder contains the pipeline controllers, state machines, register tags, and control logic required to operate the F-RISC architecture. I designed and laid out this chip.

Datapath

The datapath contains the ALU, the shifter, the program counter history, and the register file. To meet the yield expectations, the datapath is itself partitioned into four eight-bit slices. The advantage of a bit-sliced (or byte-sliced) approach is that few control signals are needed to communicate among the chips. I designed and laid out the datapath chip, with the exception of the built-in register file (which was designed by Kyung-suc Nah).

Cache Controllers

Each of the two caches will communicate with the instruction decoder through its cache controller. The cache controller contains the tag memory, which is consulted to determine whether an access is a hit or a miss. Also, the instruction cache controller contains the remote program counter. This chip is being designed by John Van Etten.

Cache Memories

Each memory chip contains eight copies of the register file with added redundancy to improve yield prospects. This chip is being designed by John Van Etten.

Byte Operations

This chip performs the multiplexing, masking, and shifting necessary to extract bytes and half-words from a word read by the cache. Also, it inserts a byte into a word for STORE instructions. Because LOAD and STORE instructions use this chip at different points in their pipelines, it must be bi-directional and able to perform two operations simultaneously. It is not being designed at present.

Clock De-skew Chip

Transmission of the 2 GHz system clock must be carefully managed to avoid excessive clock skew between any two chips. This function is handled by a special analog clock de-skewing chip, which is being designed by another student (Kyung-suc Nah). The clock distribution system is discussed briefly in section 3.2.3.

3.2.2 Instruction Set Changes

After considering the technology choices and the partitioning, certain changes were made to the architecture and instruction set to allow an easier implementation. The complete instruction set of F-RISC/G is documented in appendix C. Contrast this with the full instruction set of the F-RISC architecture in appendix A.

Shifter

Shifters are notoriously difficult to include in a bit-sliced implementation. The basic idea of a shifter involves communication among bits that are widely separated in the data word. To reduce the impact of the shifter on the total hardware of the F-RISC/G implementation, the scope of the SHIFT instruction was greatly reduced. The shift amount and direction were limited to a one-bit right shift. This eliminated both the need for the communication ports associated with a larger shifter and the need to decode the shift amount as specified in the SHIFT instruction. While the programming rules indicate that a fixed shift amount of one bit right should be specified in all cases, the actual amount specified is ignored by the hardware. A one-bit left shift is available through the addition operation.

To support extended (> 32-bit) shifts, the /SHEX flag on a SHIFT instruction will cause the contents of the C flag (in the PSW) to be placed in the MSB of the result. Also, the LSB of the operand will be available as the new C flag.
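A behavioral sketch of these shift semantics on 32-bit words follows; the function names and flag plumbing are illustrative, not F-RISC/G signal names.

    MASK = 0xFFFFFFFF

    def shift_right_1(x, c_flag=0, shex=False):
        """One-bit right shift. With /SHEX, the old C flag enters the MSB
        and the operand's LSB becomes the new C flag."""
        new_c = x & 1
        result = x >> 1
        if shex:
            result |= (c_flag & 1) << 31
        return result & MASK, new_c

    def shift_left_1(x):
        """One-bit left shift, obtained through the adder as x + x."""
        return (x + x) & MASK

Chaining shift_right_1 across the words of a multi-word operand, most significant word first, implements the extended (> 32-bit) right shift described above.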

PSW Bits for Logical Operations

The F-RISC architecture does not define the carry (C) and overflow (V) flags in the PSW after a logical instruction (AND, OR, XOR, ANDI, ORI, XORI) with the /SCC flag. The F-RISC/G implementation guarantees that these flags will be zero. Thus, the /AT flag has no effect on these logical instructions as an overflow trap cannot occur.

Write-back Cache

As noted in the chapter covering the F-RISC architecture, the write-back cache in F-RISC/G requires the STORE instruction to occupy the cache for two cycles. Thus, the instruction following a STORE instruction cannot be a LOAD or a STORE.

Protected Mode

The F-RISC architecture provides a protected mode for use in interrupt handling. This mode has two purposes. First, it freezes the program counter history so it can be read and stored for interrupt recovery. During the beginning of an interrupt service routine (ISR) the history registers should point into the interrupted program—not into the ISR itself. Second, protected mode disables external interrupts so that ISR code is not interrupted during critical sections. In addition, internal interrupts are redirected to alternate interrupt vectors to indicate that the processor was in protected mode when the interrupt occurred. ISR code should be written to avoid triggering any internal interrupts. The code should be locked into memory to avoid page faults.

Protected mode is automatically entered upon an interrupt. When the ISR has successfully saved the state of the processor, it exits protected mode using the /RES flag on a JUMP instruction. Later, when the ISR is preparing to resume the interrupted program, it must re-enter protected mode because the interrupt recovery code cannot be restarted if it is interrupted. The F-RISC architecture provides for a /SPM (set protected mode) flag on the MPSW instruction to force the processor into protected mode. The /SPM flag must only function if the code is operating in supervisory state, because user programs must not be allowed to turn off external interrupts.

In F-RISC/G, the protected mode bit is maintained on the instruction decoder and the supervisor mode bit is maintained on a datapath slice. Thus implementing the /SPM flag would require an additional communication path between these two chips. Instead, in F-RISC/G the /SPM flag is disabled under all conditions. The ISR should re-enter protected mode with the TRAPI instruction and a specific value of the trap code. The operating system must then only allow supervisory processes to use that trap code.

INTPC Instruction

The INTPC instruction is used to read the value of PC_DW during interrupt processing. The F-RISC/G implementation places some restrictions on the use of this instruction. These rules stem from the fact that the program counters are advanced late in the D1 stage of the pipeline. Since the PC_DW register is read early in the DE stage of an INTPC instruction, this results in a three-cycle latency from one INTPC to the next. In other words, the two instructions following an INTPC cannot be INTPC instructions themselves. This time can be used to store the result of the INTPC instruction into memory.

The second rule comes from the fact that once the INTPC instruction enters its D1 stage, the program counters will advance even if a preceding instruction has an interrupt in its D2 or DW stage. Thus, the operating system must ensure that the instructions surrounding an INTPC are never interrupted. This provision should be straightforward as all external interrupts are disabled in protected mode (when INTPC is executed) and the operating system can ensure that no page faults or user-requested interrupts (such as TRAP instructions or arithmetic faults) can occur.

Interrupt Latency

All external interrupts, except page faults, experience a three-cycle latency between the raising of the interrupt line and the recognition of the interrupt by the processor. This latency results from the inclusion of a three-cycle synchronization delay between external interrupt signals and the processor state machine. Likewise, the ID bit in the PSW (which disables user interrupts) has an additional latency of three cycles after it is changed by an MPSW instruction.

Overflow of Remote Program Counter

The F-RISC architecture provides memory protection by forcing the most-significant bit of all addresses generated by user processes to one. Thus, the user cannot LOAD from or STORE to addresses with a most-significant bit of zero. Also, user mode processes cannot BRANCH or JUMP to instruction addresses with the most-significant bit reset. However, the remote program counter can complicate matters. If a user process is executing at address FFFFFFFF (hex), the RPC will increment to 00000000 (hex) for the next instruction. This address is in the supervisor space. Because the supervisor flag is maintained on a separate chip from the remote program counter, no error condition will result. This would appear to be a hole in the security.

However, there is, in fact, no potential for mayhem. Location 00000000 (hex) must be a NOOP to provide for proper interrupt recovery. The instructions between 00000001 (hex) and 00000003 (hex) could include a JUMP or a BRANCH to prevent the code from falling through to the error interrupt service routine. This JUMP will be subject to the memory protection features, and thus will result in a jump to a location in the user's space.

Another possibility is to disable the carry into the most-significant bit of the RPC. While this would prevent the situation from occurring, it would also permit code that would not be portable to an F-RISC machine with a larger address space. In the end, however, the point is moot for an engineering prototype such as F-RISC/G.

Initial Reset

Because of the limited fan-in of CML gates and our desire to reduce the complexity of the F-RISC/G instruction decoder, the F-RISC/G CPU has some implementation-specific initialization requirements.

1. RESET interrupts always use the "protected mode" interrupt vector, 20 (hex), regardless of the previous state of the processor.

2. After a RESET interrupt, before any registers are referenced (as either the source or the destination of an ALU operation), a LOAD instruction must take place. This LOAD instruction will serve to reset the latch tag on the DIN latch.

3.2.3 Clocking

The student designing the register file has determined that a single-port register file is the best choice for F-RISC/G based on considerations of transistor count, speed, area, and power consumption. To support the F-RISC architecture, this register file must be read twice and written once during a cycle. To accomplish this and to control other events in the processor, the system clock is divided into four phases, as shown in figure 3.13. Each of the phases is nominally 250 ps. The phases are not guaranteed to be non-overlapping. They may overlap slightly or be slightly disjoint.

The register file is read during each of φ2 and φ3. The address is set up for the write in the beginning of φ4 and the write occurs during the end of φ4 and the beginning of φ1.

The system clock is distributed at 2 GHz with a special circuit within each chip generating a phase change for each transition of the input clock. The chip-to-chip transmission of the system clock uses a novel approach to reduce the skew among the chips. A central clock distribution and deskew chip generates a clock signal for each system chip and applies a controlled phase shift to each signal. Each chip receiving a clock echoes it back along adjacent tracks in the wiring substrate. Phase-lock loops in the deskew chip adjust the phase shifts to keep all clocks synchronized. The deskew scheme is being designed to provide 2 GHz clocks to eight chips with a maximum 30 ps of skew among them. That is, each signal may differ from the ideal by only 15 ps.

Figure 3.13: Four-phase clocking scheme for F-RISC/G

On-chip Clock Distribution

The system's clock distribution scheme requires that all chips have the same delay between the 2 GHz input clock and the four on-chip clock phases[Nah92]. Only in this way can we ensure that φ1 occurs at the same time on all chips in the system. Therefore, the on-chip clock buffering is critical to the proper timing of events within the F-RISC/G system. Figure 3.14 shows various clock buffering schemes. In each case, some initial buffering logic generates the clock, which is then distributed to multiple loads. Each of these loads is a super-buffer that drives five to eight loads. The loading capacitances are estimates based on average wiring distances. They do not include the active loads of the driven gates—this loading is calculated separately.

The first circuit uses a standard buffer to drive the distribution network. As the clock distribution is performed at levels 2 and 3, this buffer has an emitter follower at its output. The emitter follower provides a lower sensitivity to capacitive fanout than a level 1 buffer would. The circuit in figure 3.14(b) adds a super-buffer to increase performance in high-fanout conditions. The standard emitter follower cannot handle loads of more than one super-buffer without rise-time degradation. In (c) the initial buffer is replaced by a super-buffer. This serves only to reduce the total delay. Finally, circuit (d) uses a special clock buffer (developed for the boundary scan testing scheme described in chapter 4) for additional drive capability. This would be ideal for the highest fanout conditions.

(a)

(b)

(c)

(d)

Figure 3.14: Various clock buffering schemes

If the fanout were always constant, only one of these circuits would be necessary. However, in practice the number of super-buffers driven in the final stage can vary from one to five. Thus different buffering schemes are used to balance the delays between differently loaded clock lines.

Table 3.6 shows the initial estimates of delays using the various schemes. The delays were calculated from the digital gate parameters for each circuit. Both the capacitive loads shown and the applicable active loads are added. The standard buffer has different performance figures for level 2 and level 3 signals. Thus, these cases are treated separately. Some uninteresting cases are omitted.

The cases marked with a dagger (†) are chosen for initial use in F-RISC/G chips. The basic delay time is fixed by the high-fanout cases. In these cases, the extra power consumed by the clock buffer is tolerable because it is the only way to obtain good signals with high fanout. Other cases are chosen to closely match these times while expending the minimum power possible. These figures and designs should be taken as starting points only. The loadings on the various buffers will change based on the locations and numbers of loads. Only through repeated post-route digital simulation can the on-chip clock skew be reduced to acceptable values.

Table 3.6
Performance of Various Clock Buffering Schemes

                        Number of super-buffers driven
                     1         2         3         4         5

(a) Level 2        79 ps     97 †
(a) Level 3        91       118
(b) Level 2        99 †     109
(b) Level 3       107       117
(c)                76        86        97 †     108
(d)                          89        92        95 †      98 †

† Configuration chosen for F-RISC chips

3.3 COMMUNICATIONS

There is considerable communication between and among the chips in the F-RISC/G architecture. An understanding of this communication is essential for a complete understanding of the implementation. This section and the following one on timing should be read in parallel. No one said this would be easy. Only the most important and interesting signals will be explained here and in the next section (Timing).

The following sub-sections divide the signals into categories. Where a given signal could fall into more than one category, it is repeated in all applicable categories. All signals are differential.

3.3.1 Datapath Control

Most of the communication lines within the CPU are used by the instruction decoder to control the datapath chips. These signals are listed in table 3.7. The B operand inversion signal needs to go from the instruction decoder to two locations on the datapath. The first, the actual B operand inverter, is covered by the INVB signal. The second destination is the adder carry-in on the least significant slice (to perform two's complement subtraction). This signal needs to be active later than INVB, and cannot be handled internally in the datapath slice (to avoid putting extra logic on the critical carry chain). A separate line, CINLSS, was added to the instruction decoder for this reason.

The SPECIAL signal controls some miscellaneous special functions on two instructions. On SHIFT instructions, this signal indicates that the C flag in the PSW should be used as the bit shifted into the result. On JUMP instructions, SPECIAL indicates that the S and OS bits should be reset—this copies the /RES flag on the instruction.

Table 3.7
Datapath Control Communication Lines

From To Size Description
ALUOP ID DP 2 ALU operation select
CIN,COUT DP DP 3 Ripple carry signals for adder
CINI, COUTI DP DP 3 Ripple carry signals for program counter adder
CINLSS ID DP0 1 Carry in for adder LSB; late copy of INVB
FFRA, FFRB ID DP 4 Feed-forward multiplexer select lines
IMM ID DP 16 Immediate constant from instruction
IMMSEL ID DP 1 Load constant in high or low halfword
INVB ID DP 1 Invert B operand
MBYA, MBYB ID DP 2 Substitute DIN for register value
OPA, OPB ID DP 3 Operand selection lines
RESSEL ID DP 1 Store ALU or shifter result
RFA_A, RFA_BD ID DP 10 Register file addresses
RFSEL ID DP 1 Store DIN or RES_EX into register file
RFWR ID DP 1 Write into register file
SHIFTH, SHIFTL DP DP 3 Ripple shift signals
SPECIAL ID DP 1 Extended shift operation
STALL ID DP 1 Processor stall signal

3.3.2 Condition Evaluation, Branching and the Processor Status Word

Another area of communication between the instruction decoder and the datapath is in the evaluation of conditional branches. The signals for this function are listed in table 3.8. The BRA signal is generated on the DP3 chip. This signal must be distributed to the other datapath slices (so they can store the previous result as the new PC_I1 value), to the instruction decoder (so it can flush instructions as necessary), and to the instruction cache controller (so it reads the address bus). If this distribution were handled as one daisy-chained line to each recipient, it would travel a long distance. Thus, the signal is sent twice from the DP3 chip. One path, BRA1, goes to the other datapath slices; the other path, BRA2, goes to the decoder and cache controller.

The zero flag in the PSW is the OR of signals from each datapath slice. These signals converge on DP3 from the other slices. The function of the SPECIAL signal is described in the previous sub-section.

Table 3.8
Condition Evaluation, Branching, and the Processor Status Word
Communication Lines

From To Size Description
BRA1 DP3 DP2, DP1, DP0 1 Branch taken (goes from BRAOUT1 on DP3 to BRAIN on DP0, DP1, DP2)
BRA2 DP3 ID, ICC 1 Branch taken (originates from BRAOUT2)
CC ID DP 4 Condition code for BRANCH evaluation
FLUSHDP ID DP 1 Flush EX stage if branch was taken
SCC ID DP 1 Store condition code
SPECIAL ID DP 1 Leave supervisor mode on JUMP
UI (I#) — DP 3 External inputs to PSW
UO (O#) DP — 3 External outputs from PSW
WPSW ID DP 1 Write the PSW from the result
ZIN0 DP0 DP3 1 Zero flag (originates as ZOUT on DP0)
ZIN1 DP1 DP3 1 Zero flag (originates as ZOUT on DP1)
ZIN2 DP2 DP3 1 Zero flag (originates as ZOUT on DP2)

Table 3.9
Instruction and Data Cache Communication Lines

From To Size Description
ABUS DP DCC, ICC 32 Address transfer to cache
ACKD ID DCC 1 Acknowledgement of data cache miss
ACKI ID ICC 1 Acknowledgement of instruction cache miss
DBUSI DM DP 32 Data transfer path for LOAD
DBUSO DP DM 32 Data transfer path for STORE
IBUS IM ID 32 Instruction transfer from cache
IOCTRL ID DCC 3 Control bits for LOADs and STOREs
MISSD DCC ID 1 Data cache miss
MISSI ICC ID 1 Instruction cache miss
STALLM ID ICC, DCC 1 Cache memory stall
TRAPD DCC ID 1 Data cache page fault
TRAPI ICC ID 1 Instruction cache page fault
VDA ID DCC 1 ADDR bus has valid data address
WDC ID DCC 1 Write line for data cache
WDOUT ID DP 1 Write OUT_EX latch on datapath chips (during STORE instruction)
3.3.3 Instruction and Data Caches

Many signals go to and from the caches. They are listed in table 3.9. Many of these lines are time-critical as the cache misses are closely synchronized with the processor. The address bus and IOCTRL signals may go to other I/O devices as well as to the cache. These other I/O devices may wish to decode the STALLM signal to detect stalls in the other cache. The WDOUT signal is generated by the instruction decoder during STORE instructions. It activates the OUT_EX latch (see figure 3.12) which holds the data to be sent on the DBUSO lines.

3.3.4 Interrupts and Traps

The communication lines for interrupts and traps, listed in table 3.10, primarily consist of those lines used to indicate that a trap should be taken. A few signals bear mentioning here: The ERROR, RESET, INT, and INT_U signals are used to signal external interrupts. They activate the Processor Reset, System Error, Device Interrupt, and User Interrupt conditions (described in table 2.16), respectively. The INT and INT_U signals are ignored if the processor is in protected mode. INT_U is also ignored if the INT_DIS (disable user interrupts) signal is high. The INT_DIS signal comes from the UO line on the lowest datapath slice. The user can set the value by using the MPSW instruction.

The PROT signal is sent from the instruction decoder (which maintains this line as part of the processor state machine) to the UI line on the lowest datapath slice. In this way, the PROT signal appears in the PSW and can be accessed by users.

Table 3.10
Interrupt and Trap Communication Lines

From To Size Description
ATRAP DP ID 1 Arithmetic trap?
ERROR — ID 1 External error condition
FLUSH ID DP 1 Reset PSW on exception
INT — ID 1 Device interrupt (masked in protected mode)
INTDIS DP0 ID 1 Disable user interrupts
INTU — ID 1 User interrupt (masked in protected mode or by INTDIS)
PCLOCK ID DP 1 Stop program counters from advancing
PROT ID DP0 1 Protected mode flag for PSW
RESET DSK ID 1 Processor reset
TRAPD DCC ID 1 Data cache page fault
TRAPI ICC ID 1 Instruction cache page fault

3.3.5 Clock, Synchronization, and Configuration

The clock deskew scheme requires that all modules that receive the master clock return a copy of it to the deskew chip. The signals used for clock distribution and deskew are listed in table 3.11. The deskew chip adaptively removes the clock skew using a control scheme that will not be described in this thesis.

Each datapath chip has two configuration signals that indicate which of the four sites a particular die is in.

Table 3.11
Clock and Synchronization Communication Lines

From To Size Description
CLK DSK ALL 1 Input clock
CLKRTN ALL DSK 1 Clock return for deskew
CONFH, CONFL † — DP 2 Datapath configuration signals
SYNC DSK ALL 1 Clock synchronization signal

† Single-ended ECL signals

3.4 TIMING

3.4.1 Datapath Timing

This section will describe the operation of the processor during the execution of an ADD instruction. Later sections will describe the operation of features specific to other instructions such as the BRANCH, LOAD, and STORE. A diagram of the F-RISC/G ALU timing is shown in figure 3.15; the system diagram is in figure 3.12.

I1 Stage

By the time they are needed during the middle of the I1 stage, the program counter on the datapath and the remote program counter have been incremented to calculate the instruction address. If this instruction is the target of a BRANCH instruction, then the address bus (ABUS) from the datapath will contain the instruction address by φ3 of I1. The instruction address is then transferred from the instruction cache controller to the instruction memory chips, which read the desired instruction. Simultaneously, the instruction cache controller checks the tags to ensure that there is no cache miss.

I2 Stage

Assuming there is no cache miss, the instruction is transferred from the instruction memory to the instruction decoder beginning in φ3.

DE Stage

When the instruction arrives at the instruction decoder, the most urgent task is to send the register file addresses to the datapath chips. The operands themselves must be ready by φ4 of the DE stage. To accomplish this, the B operand is read from the register file during φ2 with the A operand read during the next phase. The bulk of the instruction bits arrives at the instruction decoder at the beginning of the DE stage. The portion of the instruction containing the B operand address (RFB) arrives early (φ4 of I2), so it can be ready on the datapath slices by the end of φ1 of DE. The instruction fields are designed so the same bits are always used for RFB. The cache chips for those bits should be the closest to the instruction decoder to reduce the path time.

The A operand register file address (RFA) is slightly more complex as it can be in one of two locations in the instruction. (See figure 2.13.) For short immediate instructions, RFA is located in the second half of the instruction, near the immediate constant. For long immediate instructions, RFA is located in the same field as the destination register file address (DEST). Thus, in these instructions RFA and DEST specify the same register.

Figure 3.15: F-RISC/G timing during an ALU instruction

While the transfers of RFA and RFB proceed, the instruction decoder determines whether one of the tagged latches contains a more recent value for the selected registers. The results of these decisions are transmitted to the feed-forward multiplexers and the register file output latches on the datapath (as the FFRAH, FFRAL, FFRBH, FFRBL, MBYA, and MBYB signals). Also, the immediate constant fields are decoded and transferred. The instruction set provides three different types of immediate constants (compare the ALU, ALUI, and STORE instructions). The need for interrupt response adds a fourth source for the constant—the trap vector.

Also during this time, the control signals for the operand selection multiplexers (i.e., OPAH, OPAL, and OPBL), the B operand inverter (INVB), and the ALU or shifter (ALUOPH and ALUOPL) are generated on the instruction decoder and broadcast to the datapath chips. The datapath selects the operands and prepares them for the start of the ALU cycle at φ4. When φ4 arrives, the result register latches the previous cycle's result and the OPA and OPB latches open to present the operands for the current cycle. The calculation then begins in the ALU.

EX Stage

The ALU completes the calculation of the instruction result by the end of φ3. During the next phase, the result is stored in the RES_EX register as the next instruction begins its calculation. At this time, three of the four condition code flags (NCVZ flags), namely N, C, and V, are available on the most significant slice of the datapath (DP3). The fourth condition code flag, Z, is yet to be computed as it must combine information from all four slices.

The recording of the NCVZ flags is optional and is controlled by /SCC flag on many instructions. If this flag is set, the instruction decoder sends the SCC signal to the DP3 chip, which records the flags. During the execution of the MPSW instruction, which updates all PSW flags, the SCC signal is sent to force the recording of the new NCVZ flags. In this case, however, these flags are copied directly from the top four bits of the ALU result.

N and C are the MSB of the result and the carry out of the fourth slice, respectively. The N flag can come from either the ALU result or the shifter result. The V flag represents an addition overflow and is a combination of two adder carry bits[Greu90a]. This flag is transferred to the instruction decoder (as the ATRAP signal) in case the instruction requested an arithmetic trap on an overflow (the /AT option). Because they are ready immediately, the N, C, and V flags can be latched when the result is.

However, the Z flag is more complex as it requires a synthesis from all 32 bits of the result. During φ4 of the EX stage, an 8-input OR of each slice's result is calculated. During the first two phases of D1, these results are placed on the ZOUT output from each datapath slice. These signals travel to the ZIN0, ZIN1, and ZIN2 lines on the most-significant slice. This slice reads the three signals and forms the Z flag.

D1 Stage

As described above, during the beginning of the D1 stage, the Z flag is calculated and stored. At φ3, the result continues its progress toward the register file, moving from RES_EX to RES_D1. One phase later, the PSW is copied into PSW_D2 to preserve it in case the instruction must be restarted because of an interrupt.

D2 Stage

In the middle of this cycle, the instruction decoder decides whether the instruction will complete. Either an arithmetic trap (from an ALU overflow) or a problem in the preceding instruction (in its DW stage now) would cause the instruction to abort. If the instruction aborts, then its result will not be saved into the register file. The interrupt mechanisms, described in the architecture chapter, would take over.

During the D2 stage, the destination address (DEST) is transferred from the instruction decoder to the datapath slices. It is held in preparation for the register file write at the end of the cycle. Again, during φ3, the result moves to the last feed-forward register, RES_D2. One phase later, the PSW is copied into PSW_DW.

Once the result is positioned in RES_D2 and the DEST address has been transferred, the register file write can begin. During the beginning of φ4, the register file is addressed. Then, about 150 ps later, the write pulse occurs and the register file is written. After the 250 ps write pulse, the register file is given 100 ps of settling time before the next address change occurs.
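The write-cycle segments quoted above sum exactly to the two clock phases the write spans (the end of D2 in φ4 and the start of DW in φ1). A quick check, assuming the nominal 250 ps phase:

    PHASE = 250           # ps, nominal clock phase

    address_setup = 150   # address stable before the write pulse begins
    write_pulse = 250     # duration of the write pulse
    settle = 100          # settling time before the next address change

    total = address_setup + write_pulse + settle
    assert total == 2 * PHASE   # fills phi4 of D2 and phi1 of DW
    print(total, "ps")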

DW Stage

The only thing that happens in the DW stage of an addition instruction is the completion of the register file write cycle.

3.4.2 BRANCH Instruction Timing

The differences between a BRANCH (or JUMP) instruction and an ALU instruction are that a BRANCH must check the condition code to see if a conditional branch will be taken, flush instructions in the latency region, and update the program counters. Also, the BRANCH and JUMP instructions do not write their result into the register file. The complete operation of a BRANCH instruction can be seen in figure 3.16.

The fetching and preparation of operands for a BRANCH instruction are the same as for any other instruction, and have been described previously. During a BRANCH instruction, the ALU operation (always addition) calculates the target address of the BRANCH. Instead of being stored in the result register, the target address is stored in the ADDR register for transfer to the cache if the branch is taken.

To Branch or Not to Branch

One of the clear differences between a BRANCH instruction and an ADD instruction is that BRANCH instructions can change the flow of instructions into the processor. The condition code flags (NCVZ) in the PSW are evaluated using the condition code field in the instruction to form a branch condition—in other words, to decide whether or not to branch.

In figure 3.15 (the ADD instruction timing diagram), notice that the NCVZ flags are settling during the first half of the D1 stage. If the instruction preceding the BRANCH sets these flags, then the flags are settling during the EX stage of the BRANCH. Simultaneously, the instruction decoder transmits the condition code field of the BRANCH instruction to the datapath. Only the most significant slice (MSS) of the datapath need receive this because the BRA signal is calculated only there. To simplify the design of the datapath, during non-BRANCH instructions the instruction decoder sends 0000 (meaning "don't branch") as the condition code. Thus the datapath never needs to be explicitly signaled when a BRANCH is taking place. The piece of information available latest during the evaluation of the branch condition is the Z flag, which is computed using information from all four datapath slices. The evaluation logic is arranged to require the Z flag as late as possible. For this reason, the condition code field is sent earlier than would otherwise be required.

Figure 3.16: Timing of the F-RISC/G BRANCH instruction

The transmission of the BRA signal is critical as it must go from the datapath MSS to the other three datapath slices (which must be told to copy the new address into the PC_I1 register) and to the instruction cache controller (to update the remote program counter) as well as to the instruction decoder. Among its other responses to a BRANCH being taken, the instruction decoder must flush some or all of the instructions following the BRANCH (depending on the /LAT field of the BRANCH instruction). When the instruction decoder receives the branch result, the most advanced of these instructions is midway through its EX stage. By this time, it is too late to block transmission of the condition code for that instruction (in the case of two back-to-back BRANCH instructions) or the SCC and WPSW signals for PSW manipulation. This problem is solved by having the instruction decoder send a FLUSHDP signal indicating that SCC, WPSW, and branch conditions should be ignored if a successful BRANCH has just occurred.
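The structure of the condition evaluation can be sketched functionally. The 4-bit encodings below are hypothetical (the actual F-RISC/G condition-code assignments are not reproduced here); the two properties the sketch does take from the text are that 0000 means "don't branch" and that several useful conditions need the late-arriving Z flag.

    def branch_taken(cc, n, c, v, z):
        """Hypothetical condition-code evaluation from the NCVZ flags."""
        conditions = {
            0b0000: False,           # never: sent for all non-BRANCH instructions
            0b0001: True,            # always
            0b0010: z,               # equal
            0b0011: not z,           # not equal
            0b0100: n != v,          # signed less than
            0b0101: (n != v) or z,   # signed less than or equal
            0b0110: not c,           # unsigned less than (borrow convention assumed)
        }
        return conditions[cc]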

Taking the Branch

If the branch is taken, then the instruction cache must immediately begin fetching instructions from the new location. The result of the addition operation is stored in the ADDR register and broadcast over the ABUS to the instruction cache controller. The cache controller monitors the BRA signal to determine whether the address on the ABUS is a valid instruction address.

3.4.3 LOAD Instruction Timing

The LOAD and STORE instructions access the F-RISC/G data cache. The LOAD instruction is illustrated in figure 3.17. The STORE instruction will be covered in the next section.

The sequence of a LOAD instruction diverges from that of an ALU instruction during the last phase of the EX stage. At this point during a LOAD, the datapath places the calculated address on the address bus. The instruction decoder anticipates the cache operation by raising the VDA line during f3 to indicate that the data cache should read the address bus. The data cache controller reads the address. It then begins the process of reading the memory and checking the tags. If there is a cache miss, the MISS_D signal is sent to the instruction decoder by the end of the D2 stage.

Figure 3.17: F-RISC/G timing for a LOAD instruction

Before being sent to the datapath, the data from the data cache is transferred to the byte operations chip. This chip extracts byte and halfword values from the full word read from memory, aligning the desired halfword or byte in the lower halfword or lowest byte of the result. All unused upper bits are set to zero. The byte chip is controlled by the /IOCTL field of the LOAD instruction. This chip is not being designed at this time.
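As a rough illustration of the byte chip's task on a LOAD, the sketch below extracts and zero-extends a byte or halfword from a 32-bit word. The width/offset interface and the byte numbering within the word are assumptions; the actual behavior is selected by the /IOCTL encoding, which is not reproduced here.

    # Sketch of byte-chip alignment on a LOAD. word is the 32-bit value
    # read from the data cache; width (1, 2, or 4 bytes) and offset stand
    # in for the /IOCTL field. Byte numbering within the word is assumed.
    def align_load(word, width, offset):
        if width == 4:
            return word & 0xFFFFFFFF                  # full word
        if width == 2 and offset in (0, 2):
            return (word >> (8 * offset)) & 0xFFFF    # halfword, zero-extended
        if width == 1 and offset in (0, 1, 2, 3):
            return (word >> (8 * offset)) & 0xFF      # byte, zero-extended
        raise ValueError("unsupported width or offset")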

As has been explained earlier, the data from the LOAD instruction does not arrive at the datapath in time to be saved directly into the register file. A special latch, DIN, is included to hold the result of a LOAD until it can be saved into the register file. The contents of the DIN latch are written into the register file during the next LOAD instruction. In other words, during the D2 and DW stages of this LOAD instruction, the result of the previous LOAD instruction is being stored into that instruction's destination register.

3.4.4 STORE Instruction Timing

The timing of a STORE instruction is shown in figure 3.18. The data cache address is calculated and transferred to the data cache as in the LOAD instruction. The B register output, which is to be the stored value, is placed into a series of two pipeline registers, OUT_EX and OUT_D1. During the D1 stage, this value is sent to the byte chip for alignment and masking. The result is then sent to the data cache to be stored into the memory.

Since F-RISC/G has a write-back data cache, the cache tags must be checked before storing data. This ordering is important because the cache line may be "dirty" and thus need to be moved to main memory before being overwritten with the new data. Therefore, F-RISC/G first reads the cache tags and data from the cache as if it were performing a LOAD operation. The extra reading of the cache data will speed up the processor in the case of a cache miss on dirty data.
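The ordering can be made concrete with a small cache model. Everything about the model below (a word-addressed, direct-mapped cache held in a dict) is invented for illustration; only the sequence it demonstrates follows the text: read the tag and old data as if performing a LOAD, write the old line back if it is dirty, then store the new data.

    # Minimal write-back STORE sketch. Addresses are word addresses; the
    # direct-mapped cache is a dict {index: {"tag", "data", "dirty"}} and
    # main memory a dict {addr: word}. All structure here is invented.
    LINES = 256

    def store(cache, memory, addr, value):
        index, tag = addr % LINES, addr // LINES
        line = cache.get(index)
        if line is None or line["tag"] != tag:      # tags checked first
            if line is not None and line["dirty"]:
                # The dirty line was already read out during the tag check,
                # so it can go to memory without a second cache read.
                memory[line["tag"] * LINES + index] = line["data"]
            line = {"tag": tag, "data": memory.get(addr, 0), "dirty": False}
            cache[index] = line
        line["data"] = value                        # now overwrite the line
        line["dirty"] = True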

Figure 3.18: Timing for the F-RISC/G STORE instruction

3.5 Critical Paths

Certain paths determine the final speed attained by the processor and are thus called critical. When designing an implementation, the designer must be aware of what operations are critical in the particular implementation. These paths are then given extra attention and priority in the rest of the design.

One such path has already been described in the preceding sections: the BRA signal loop between the instruction decoder and the datapath.

3.5.1 Adder Carry Chain

In most processor designs, the adder carry chain is on the critical path. In the case of F-RISC/G, with its datapath partitioned over four chips, this is especially true. Carry propagation schemes that include "random" logic or high fan-in gates are at a disadvantage in triple-level CML because of signal level considerations. Carry look-ahead is especially hurt by the many ways in which the same signals are combined; it is difficult to assign voltage levels to signals in this case.

The chosen configuration is a carry-select adder with eight-bit blocks. These blocks correspond to the datapath slices. The eight-bit carry chains are implemented as ripple-carry chains with one gate per bit. This gate is shown in figure 3.4 (a); the whole scheme is in figure 3.19. The top of figure 3.19 depicts the eight-bit carry-select adder on each of the four datapath slices. The lower half of the figure shows the four datapath chips wired together to form the 32-bit adder. The performance of this adder is clearly affected by the partitioning: there are three chip crossings on the 32-bit carry chain.
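A behavioral model may make the arrangement clearer. In the sketch below, each eight-bit slice ripples two candidate results before its carry-in arrives; the incoming carry then merely selects between them, so the three chip crossings sit on the select path rather than the ripple path. This is a sketch of the scheme, not a gate-level description.

    # Sketch of the 32-bit carry-select adder built from four 8-bit slices.
    def ripple8(a, b, cin):
        """Eight-bit ripple-carry add with one carry gate per bit."""
        s, c = 0, cin
        for i in range(8):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            s |= (ai ^ bi ^ c) << i
            c = (ai & bi) | (c & (ai ^ bi))
        return s, c

    def add32(a, b, cin=0):
        result, carry = 0, cin
        for k in range(4):                     # four datapath chips
            a8, b8 = (a >> 8 * k) & 0xFF, (b >> 8 * k) & 0xFF
            s0, c0 = ripple8(a8, b8, 0)        # both candidates computed
            s1, c1 = ripple8(a8, b8, 1)        # before the carry arrives
            s, carry = (s1, c1) if carry else (s0, c0)  # select: chip crossing
            result |= s << 8 * k
        return result & 0xFFFFFFFF, carry

For example, add32(0xFFFFFFFF, 1) returns (0, 1), exercising the carry selection at all three chip crossings.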

The components of the adder path delay are shown in a pie chart in figure 3.20. The multi-chip implementation of the adder accounts for 42% of the delay. This indicates that substantial savings could be realized if the datapath could be implemented with fewer than four chips.

Figure 3.19: Carry select adder used for F-RISC/G

3.5.2 Register File Addresses and Related Signals

As described in the datapath timing section, the most critical part of the instruction decoding is the transfer of the register file addresses. Before any operations can be performed on the datapath, the operands must be prepared and latched. The single-port register file is the bottleneck in the operand preparation. Thus, the instruction memory must be arranged so that the first bits of the instruction to arrive at the instruction decoder are the register file address bits. Of these fifteen bits (operand A can be set from one of two five-bit fields), the operand B address is the most critical, as it is the earlier of the two addresses required by the register file.

Figure 3.20: Components of the adder critical path

Also important at this time are the memory bypass signals (MBYA and MBYB), because they control the latches that hold the register file values. While the datapath requires the bypass signals later than it requires the addresses themselves, the extra computation needed to generate the bypass signals warrants treating these signals as critical as well.

Both sets of feed-forward signals (FFRAH, FFRAL, FFRBH, and FFRBL) must arrive at the datapath by the time the later memory bypass signal, MBYA, arrives. Treating them as critical signals allows them to arrive slightly earlier and thus set up the feed-forward multiplexers before the register file data arrives.

3.6 Core Processor Chips

Under contract with the Advanced Research Projects Agency (ARPA), I have developed detailed plans for the two core processor chips in the F-RISC/G implementation: the datapath chip and the instruction decoder. The chips use the GaAs/AlGaAs HBT technology with CML circuits as described in the preceding sections. The overall organizations of these chips may provide insights for future implementations of the F-RISC architecture. Also, the transistor count and power dissipation figures can provide some understanding of the nature of the technology.

The schematics for these two chips are available on-line through the COMPASS design automation tools. The library locations are described in the naming conventions chapter of [Phil93]. The following descriptions will highlight some of the significant points in the design of these chips. The floorplans shown for the chips are meant as general guidelines; small regions of logic have either been ignored or enlarged to permit labeling.

3.6.1 Datapath

Most of the datapath chip was straightforward to design. The overall structure of the datapath is shown in figure 3.12. The most difficult portion was the control logic for the PSW and PC registers. The layout of this chip was also difficult, as there are many feed-back paths and almost all buses are critical. A floorplan for the datapath chip is shown in figure 3.21. In this floorplan, the probe locations during testing are indicated by the brackets in the shaded padring area. The large brackets containing two 'V's represent the six-signal probes from Cascade Microtech; the smaller brackets with only one 'V' are ground-power-ground probes that provide additional power supply. These probes are discussed in the following chapter. A plot of the chip is in figure 3.22. A summary of the devices and power in the datapath chip is found in table 3.12.

Figure 3.21: Floorplan of the datapath chip


Figure 3.22: Layout of the datapath chip

Table 3.12
Datapath Power and Device Breakdown

                                          Device count    Power consumption (mW)
Core chip logic                               4713                 5356
Register File                                 1785                 1768
Clock distribution                             575                 1531
Drivers and Receivers (without scan)           444                 1515
Boundary scan latches                         1912                  995
Boundary scan control                          585                 1406
Boundary scan verification                     311                  227
Total                                         9785                12798

The following sub-sections describe the most important blocks in the datapath schematics:

Rfaddr

This sheet latches the register file addresses coming in from the instruction decoder and selects the proper one for presentation to the register file. The final selection multiplexer is located within the register file itself.

Regfdat

This is a diagram of the register file as implemented for the F-RISC/G datapath chip. It contains one of the address selection multiplexers as well as two sets of output latches for the A and B operands.

ALU

As one might expect, this sheet contains the arithmetic and logical unit (ALU) of the F-RISC/G processor. The eight-bit carry chain is implemented as one standard cell to reduce the internal wiring capacitances. The gates from figure 3.4 are used.

Simple-shifter

This sheet includes the one-bit right shifter. Besides shifting its input (which requires no gates other than those that select the type of shift), this circuit provides the access path to the PC_DW register.

Resregs

The feed-forward registers are provided by this sheet. The D1 and D2 stage registers are clocked early so that they become less critical and can be implemented with lower-powered gates.

Addrreg

This is the address register. It provides the address for the caches and implements the memory protection scheme by forcing the MSB of the outgoing address to be high unless the processor is in protected mode.
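A one-line behavioral sketch of this rule follows. The 32-bit address width, and the reading of "protected mode" as the privileged mode, are assumptions drawn from context.

    # Sketch of the address-register protection rule: unless the processor
    # is in protected mode, the MSB of the outgoing address is forced high.
    # The 32-bit address width is an assumption.
    def out_address(addr, protected_mode):
        if protected_mode:
            return addr & 0xFFFFFFFF
        return (addr | 0x80000000) & 0xFFFFFFFF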

PC_block

This sheet contains the program counter registers. It consists of three sub-blocks: pci1, inc, and pc_history. The pci1 sub-block holds the program counter for the I1 stage. The output levels of the bits differ to provide the correct input levels for the incrementer. The incrementer, contained in the inc sheet, is a simple counter with a high-speed carry generate and propagate unit. The pc_history sub-block maintains the program counter history registers and advances them as directed by the instruction decoder.

PSW

The different bits of the processor status word are implemented differently according to their functions, and each type of bit is in a separate sheet. The zero bit (Z) is generated in the zgen sheet; the results from all four datapath chips are combined on the most significant slice. The remaining condition code bits are generated in the ncvz_regs sheet, where all four condition code bits are stored. Four operations on these registers are implemented: hold the previous value, update the value from the ALU or shifter result (/SCC flag), update the value directly from the result bits (MPSW instruction), and restore the old values for exception processing.
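The four operations can be expressed as a small selection function. The mode names below are descriptive labels chosen for this sketch, not the actual control signal names.

    # Sketch of the ncvz_regs update modes. Flag sets are dicts with keys
    # "N", "C", "V", "Z"; the mode strings are illustrative labels only.
    def update_flags(flags, mode, alu_flags=None, result_bits=None, saved=None):
        if mode == "hold":
            return dict(flags)            # keep the previous value
        if mode == "scc":
            return dict(alu_flags)        # /SCC: from the ALU or shifter result
        if mode == "mpsw":
            return dict(result_bits)      # MPSW: directly from the result bits
        if mode == "restore":
            return dict(saved)            # exception processing: old values
        raise ValueError(mode)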

The uio_regs sheet provides for the I and O bits. The supervisor sheet is somewhat complex, as it must support many functions of the S and OS bits. The following functions are supported: copy S into OS (setting S) during an interrupt, copy OS into S at the end of an interrupt, and update S and OS from the ALU result if the S bit was previously set. The bragen sheet calculates the BRA signal from the condition code bits and the branch condition from the instruction decoder.

DP_clock_tree

This sheet contains the clock distribution tree and buffers.

Config

This sheet reads the external configuration signals and determines whether this slice is the least significant slice (LSS) or the most significant slice (MSS). These static configuration signals are broadcast throughout the chip.

3.6.2 Instruction Decoder

The instruction decoder chip consists mostly of random logic. Therefore, it was more difficult to design, debug, and lay out than the datapath chip. The floorplan of the instruction decoder is shown as figure 3.23. The final chip layout is plotted in figure 3.24. A summary of the power consumption and transistor count in the instruction decoder chip is shown in table 3.13. Only about 45% of the devices and 35% of the power are spent generating the design functions of the instruction decoder.

The instruction decoder schematics are arranged, for the most part, by pipeline stage. For instance, the stage_EX sheet contains the logic to generate signals for the EX stage operations and to hold information needed by later stages. The following subsections describe the top-level blocks of the instruction decoder schematics.

Stage_I1

This is one of the simplest pages of schematics. The logic controls the I1 stage of the pipeline. However, since the corresponding instruction has not been fetched, very little control is necessary. Only one latch is included; it indicates whether the instruction in I1 is valid or has been flushed during a BRANCH.

Stage_I2

The I2 stage is not much more complex. Along with the valid bit, this stage has a pass through for the RFB signals, which are received early from the instruction cache and sent to the datapath.

Figure 3.23: Floorplan of the instruction decoder


Figure 3.24: Plot of instruction decoder layout

Stage_DE

This stage is more complex than the previous two. The major duties of the DE stage are as follows: store the incoming instruction, assemble the immediate constant, generate some control lines to select operands on the datapath, and pre-calculate control signals needed early in the next stage. A latch records whether this instruction is valid.

The latch_inst sub-sheet holds the incoming instruction, nullifies it if it should not be executed, and substitutes the dummy BRANCH instruction during interrupt processing. Its companion, imm, generates the immediate constant for the datapath. Depending on the specific instruction format, this constant can come from various portions of the source instruction or from the trap vector during interrupt processing.

Table 3.13
Instruction Decoder Power and Device Breakdown

                                          Device count    Power consumption (mW)
Core chip logic                               3280                 4074
Clock distribution                             622                 1648
Drivers and Receivers (without scan)           520                 3005
Boundary scan latches                         2040                 1213
Boundary scan control                          585                 1406
Boundary scan verification                     311                  227
Total                                         7358                11573

Stage_EX

This sheet controls the EX stage of the pipeline. While the DE stage passes on most of the instruction to its successor, by the EX stage it is more economical to send only those control lines needed by future stages. Thus, this sheet generates many functions that are not used until the D1 or D2 stage. This stage also forms the first result tag—for the RES_EX register. The one sub-sheet, latch_opcode, holds the five-bit opcode and translates its bits to levels appropriate for decoding.

Certain functions on this sheet are disabled by the FL_EX_EA signal. This signal, an abbreviation for "FLush EX stage EArly," cancels certain critical signals if the instruction in the EX stage is being flushed by the instruction in the D1 stage. These signals could cause undesired side effects if allowed to propagate. The rest of the control signals are flushed when the instruction moves to the D1 stage.

The "validity bit" in this stage differs from those in the preceding stages in that it refers to the program counter register (on the datapath) rather than the instruction itself. The distinction is important in protected mode, when these program counter registers are not advancing normally. The extra pipeline latch on this path is included to compensate for skew between the normal clock signal and the qualified clock, ADV3, which is inactive in protected mode.

Stage_D1

The major event in the D1 stage is the incoming BRANCH result. The incoming BRA signal from the datapath's evaluation of the condition code is used to flush selected instructions from the pipeline, as indicated by the latency bits on the BRANCH instruction. This stage also controls the program counter history registers. This logic is complex, as described in sections 2.5.2 and 2.5.4. Also, the arithmetic overflow indication is received from the datapath at this time. The interrupt is deferred until the next cycle.

Stage_D2

Except for the register file result, very little happens during the D2 stage. Most of the inputs from the D1 stage are passed straight through to the DW stage. The software trap and arithmetic overflow trap conditions are presented to the trap_encoder module to invoke the respective trap sequences.

At the end of the D2 stage, the register file will be updated. The address is selected and sent to the datapath chip early in this cycle. The validity bit on the selected tag will be used to determine if the write should take place.

Stage_DW

This final stage remembers whether the instruction in the DW stage was a LOAD or STORE so that interrupt processing can proceed properly. However, the logic is concerned mostly with maintaining the address tag for the DIN register. The DIN latch tag should be updated under the following circumstances (a behavioral sketch follows the list):

· A successful LOAD instruction should validate the tag to match the invalid one in the current RES_D2 feed-forward latch.

· A LOAD instruction that is canceled by a trap should cause the tag to be invalidated. The previous contents have already been written into the register file, and the latch now contains whatever data was on the data bus when the trap occurred.

· If a valid tag in the RES_D2 latch matches the tag in the DIN register, the DIN register should be invalidated. This accounts for the case where an ALU instruction overwrites the destination of a LOAD instruction while that result is still waiting to be stored into the register file.
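The sketch below combines these three rules into one update function. The (register, valid) tag representation and the predicate names are invented for illustration.

    # Sketch of the DIN tag update rules listed above. Tags are modeled as
    # (register_number, valid) pairs; all names here are illustrative.
    def update_din_tag(din_tag, res_d2_tag, load_in_dw, trap):
        reg, valid = din_tag
        d2_reg, d2_valid = res_d2_tag
        if load_in_dw:
            if trap:
                return (reg, False)       # canceled LOAD: invalidate the tag
            return (d2_reg, True)         # successful LOAD: validate the tag
        if valid and d2_valid and d2_reg == reg:
            return (reg, False)           # RES_D2 overwrote the LOAD's target
        return (reg, valid)               # otherwise hold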

RF_addr_gen

This sheet contains the logic to generate the register file addresses and control signals for the datapath chip. Very little logic appears here: only the multiplexers that select between the B-operand and destination addresses for the shared bus, and the logic to cancel writes when a trap is in progress.

Tagcheck

Herein are the gates that compare the register file addresses to the tags for the feed-forward registers and the DIN register on the datapath. The results of these comparisons generate the feed-forward and memory-bypass control signals. Also, the D2 stage tag and the DIN register tag are compared to determine if the DIN register should be invalidated. The latter function was described in figure 2.15 and the surrounding text.
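A behavioral sketch of these comparisons is given below. The priority order among the feed-forward stages (the most recent result first) and all of the names are assumptions made for illustration.

    # Sketch of the Tagcheck comparisons. Each tag is a (register, valid)
    # pair; the stage priority (EX before D1 before D2) is an assumption.
    def tag_checks(addr_a, addr_b, tags):
        # tags: {"EX": tag, "D1": tag, "D2": tag, "DIN": tag}
        def match(addr, tag):
            reg, valid = tag
            return valid and reg == addr
        def feed_forward(addr):
            for stage in ("EX", "D1", "D2"):
                if match(addr, tags[stage]):
                    return stage          # select this feed-forward register
            return None                   # no bypass: use register file data
        return {"FFA": feed_forward(addr_a), "FFB": feed_forward(addr_b),
                "MBYA": match(addr_a, tags["DIN"]),   # memory bypass
                "MBYB": match(addr_b, tags["DIN"])}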

Statemach

The processor state machine (figure 2.28) and the logic to generate the STALL signal are contained on this page. The operation of this logic is highly critical as its outputs are used to control widely separated functions on the instruction decoder.

Especially time-critical is the generation of the STALL signal, which is an OR of the ACKI and ACKD lines that control the caches. ACKI is generated if the DE stage instruction is valid, the cache signals a miss, and there is no pending trap. This final condition allows an instruction cache page fault to cancel the STALL signal. ACKD is generated similarly, except that only certain interrupts (those listed in table 2.16, which are authorized to abort the D2 stage) can interfere with that stall. Some of the inputs for these signals (including the TRAP signal from the trap encoder) are available only at f2 of the clock, yet the f3 clock must be gated with the STALL signal for use in various functions.
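The structure of this logic can be sketched as follows. The argument names are descriptive stand-ins, and the qualification of ACKD is paraphrased from the text.

    # Sketch of STALL generation: an OR of the two cache acknowledges,
    # each of which can be canceled by an appropriate trap.
    def stall_logic(de_valid, icache_miss, trap_pending,
                    d2_access_valid, dcache_miss, d2_abort_trap):
        acki = de_valid and icache_miss and not trap_pending
        ackd = d2_access_valid and dcache_miss and not d2_abort_trap
        return acki or ackd, acki, ackd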

Trap_encoder

This page contains the logic necessary to synchronize the external interrupt lines to the internal clock, recognize and prioritize interrupt conditions, signal that a trap is to take place, and calculate the interrupt vector.
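As a rough illustration of the prioritization and vector calculation, the sketch below scans a fixed priority list. The condition names, their order, and the vector numbers are hypothetical placeholders; the actual assignments are not reproduced here.

    # Sketch of trap prioritization and vector selection. The priority
    # order and vector numbers below are hypothetical placeholders.
    PRIORITY = [("external_interrupt", 1), ("icache_page_fault", 2),
                ("dcache_page_fault", 3), ("arithmetic_overflow", 4),
                ("software_trap", 5)]

    def encode_trap(pending):
        """pending is the set of currently active condition names."""
        for name, vector in PRIORITY:
            if name in pending:
                return name, vector       # highest-priority condition wins
        return None                       # no trap this cycle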

ID_clock_tree

The clock distribution system is grouped in this module. A combination of super-buffers, clock buffers, and normal high-powered buffers and gates is used to reduce clock skew within the chip.

BS_control
Scan_ports

These two sheets contain the control logic for boundary scan testing. The BS_control sheet includes all the control logic that will be described in chapter 4. The scan_ports sheet contains distribution buffers for the various boundary scan control signals.

3.7 Conclusions

A 1000 MHz implementation of the F-RISC architecture has been designed. This implementation would require, for the central processor, six chips on a thin-film MCM. Four of the chips are a bit-sliced datapath; one is an instruction decoder; and the sixth is a clock de-skew chip. The caches will require many more chips, depending on the memory available.

Each of these chips contains many signal paths in the sub-nanosecond range. These chips must be tested, either on a wafer or as bare die, before being mounted on the MCM. Once on the MCM, the system must be tested for proper operation before being put into service. The timing resolution of conventional test equipment and standard boundary scan techniques is not sufficient for at-speed testing of these chips. Built-in self-test techniques require too many devices and too much power to be of use in this yield-limited technology. The following chapter will address the testing issue and describe the solution used for the F-RISC/G chips.