The Fast Reduced Instruction Set Computer (F-RISC) project has been undertaken to explore the highest clock rates possible for computers using some of the most advanced devices that have been developed in the US. The project has capitalized on existing GaAs/AlGaAs Heterojunction Bipolar Transistors (HBTs) and microwave-compatible Multichip Modules (MCMs) as the vehicles to achieve these goals. The project can be expected to impact applications ranging from "super" workstations and parallel processing nodes in TeraOPS computers to virtual reality engines for simulation, media access controllers for fast microwave communication networks, and direct Digital Signal Processing (DSP) at high frequencies. These latter applications might be suitable for radar, high-speed encryption/decryption, and data compression/decompression. This final report discusses one extremely important subtopic for this effort, namely the creation of memory fast enough to keep up with Fast RISC technology. The project was funded as a part of the Augmentation Awards for Science and Engineering Research Training (AASERT) program. Two students were sponsored under this funding. One of these has already completed his Ph.D. requirements and has accepted employment at EXPONENTIAL, a small start-up company in California with partial ownership by Apple, which is using some of the bipolar techniques being studied in the F-RISC project for implementation of a fast PowerPC (IBM 604e). Hence some of the insights provided by F-RISC may already have begun to find applications in the commercial arena.
The goal established for the ARPA/ARO grants of the F-RISC series has been to create a demonstration Fast RISC integer engine with a 2 GHz clock rate and a peak throughput of 1,000 MIPS. Rockwell International offered the Rensselaer team the opportunity to employ their 50 GHz baseline HBT process for this project. Typical gate delays for that HBT process were revealed by Rockwell to be approximately 25 picoseconds, and with reasonable pipelining it has been possible to create an architecture that could respond in about 10 gate delays per clock phase, or 250 picoseconds. Given the low initial yield expected with this process, a multichip architecture rather than a monolithic single-chip microprocessor was proposed. Typical chip yields of 20% at 5,000 HBTs were originally assumed for the purpose of the demonstration, but this needed to be upgraded to 8,000 HBTs during the course of the project. Most of the additional devices were needed to make the chips testable at microwave frequencies using boundary-scan-based, embedded at-speed test circuitry. Fortunately, Rockwell's yields improved during the period of this project to meet this requirement.
The final project report for the parent or "core" contract (for which this AASERT was created) has already been published and submitted to ARO, and it details many of the underlying technical issues involved in creating the building blocks of the 1000 MIPS, 2 GHz clock CPU logic. More will be revealed with the completion of the final report for the follow-on contract DAAL-04-H93-0477. This report, however, is focused on the creation of the first level of memory for the F-RISC, namely the L1 cache memory. F-RISC/G, the present design effort, demands an extremely fast memory which can access sequential memory locations at the peak instruction rate of 1000 MIPS. Data fetch operations, performed by the traditional LOAD/STORE type of access mechanism, also need to be fast to avoid degrading the CPI [Clocks Per Instruction] rating of the machine. Branches in the instruction stream, also following traditional RISC design, can take several clock cycles, but the main instruction fetch must proceed at a rate fast enough to match the CPU's peak demands.
These fast L1 cache memory chips suffer from the present yield and heat limitations of the GaAs/AlGaAs HBT technology and are not as large as conventional L1 cache memories for CMOS. Consequently, a "cache miss" (in which the CPU tries to access a location not in this fast memory) is a higher probability event than might be found in CMOS architectures. To compensate for this our design for L1 provides a very wide data path for L1 to communicate with an L2 memory. This data path is 1024 bits wide, or 32 full words. Hence, if a cache miss occurs, an entire line of cache is exchanged with the slower but larger L2 memory (which itself can be implemented in CMOS). In this way the speed difference between L1 and L2 and the higher cache miss frequency of the small L1 is partially offset by the high bandwidth of one cache line transfer. The cache controller (CC) chip for L1 has to be a very fast memory management circuit, but also suffers from the same yield and heat limitations of the present HBT technology. Hence, an elegant tradeoff between the complexity of the cache controller and the net CPI has been conducted as a part of this study.
This first goal of the program paves the way toward other, still higher clock rate systems that could be created in the future. For example, during the period of this work it became clear that a yield improvement for the 50 GHz baseline process to 30,000 HBTs could create the opportunity to double the speed of the system to 2,000 MIPS with some minor architectural changes. Furthermore, Rockwell revealed that a 100 GHz upgrade to the 50 GHz baseline process might make another clock doubling possible to achieve 4,000 MIPS. A superscalar upgrade of the design might then achieve 8,000 MIPS. Finally, the existence of still faster HBTs, up to 320 GHz, suitable for digital design was disclosed to the Rensselaer team, suggesting that 3-4 times higher speeds will eventually become feasible. Because these speeds are well above any projections for CMOS in the SIA roadmap, the Rensselaer team selected Rockwell as an industrial partner for the F-RISC project.
To date the project has accomplished nearly all of its goals. An integrated circuit HBT cell library has been developed, CAD tools unique to the project requirements have been developed and tested, the four architecture chips for F-RISC have been designed and checked extensively, and finally test circuits have been fabricated to help verify process, device, and circuit models. The four architecture integrated circuits are to be fabricated with funds still associated with the budget for this project, which have been committed to Rockwell through a purchase order.
Challenges emerged during the project as speed discrepancies were discovered between the original HBT SPICE models supplied by Rockwell and measured transistor performance in fabricated test structures. Additional discrepancies were discovered regarding thickness of the polyimide interlevel dielectric (ILD) in different circuit regions on our test chips. This latter problem was discovered on chips fabricated under companion funding, subcontracted to us by Rockwell under the HSCD BAA. With Rockwell's collaboration we are currently investigating device speed improvements with smaller emitter areas and scaling interconnections to address these challenges.
Follow-on contract work has already been awarded under the HPCS BAA which concentrates on solving the speed problem, creating device and interconnect layouts that compensate for the device modeling error, and fabricating demonstration architecture chips. Solutions for regaining this speed are being sought in a manner that permits use of the existing architecture chips with only simple transistor substitutions and interconnection transformations, a strategy which thereby preserves most of the investment in the architecture from this contract. In addition, the follow-on work will continue to fabricate chips until a sufficient number of Known Good Die (KGD) are available to populate several MCM prototypes, and to design the MCM layout. At that point funding would be required to insert these chips into an MCM to build a Fast RISC module. A proposal for this work has been submitted under BAA 95-06, for mixed-mode MCMs. That proposal has been assigned a status of "selectable," subject presumably to satisfactory performance under the present HPCS BAA and availability of funding.
Table of Contents
I. FOREWORD i
I.1. List of Figures vii
II. FINAL REPORT 1
II.1. F-RISC / G Overview and Statement of the Problem Studied 1
II.2. Technology 2
II.2.A. GaAs Heterojunction Bipolar Transistors 3
II.2.B. Current Mode Logic 4
II.2.C. Thin Film Multi-Chip Modules 4
II.3. Memory Hierarchies 5
II.3.A. Write Policies 6
II.3.B. Address Mapping 7
II.3.C. Architecture 8
II.4. Summary of the Most Important Results 9
II.5. Trace-driven Simulation 9
II.6. Cache Transactions 19
II.6.A. Load Hit 19
II.6.B. Clean Load Miss 19
II.6.C. Dirty Load Miss 20
II.6.D. Store Hit 21
II.6.E. Clean Store Miss 21
II.6.F. Dirty Store Miss 22
II.7. Secondary Cache 22
II.8. Yield and Redundancy Analysis 24
II.8.A. Block Replacement 25
II.8.B. Nibble Replacement 26
II.8.C. Column Replacement 26
II.8.D. Additional Columns 27
II.8.E. Analysis 28
II.9. F-RISC / G Cache Implementation 32
II.10. Advanced Packaging 34
II.11. Cache Pipeline 39
II.12. Cache RAM 44
II.12.A. Cache RAM Architecture 44
II.12.B. Cache RAM Design 45
II.12.C. Cache RAM Timing 49
II.12.D. Cache RAM Details 49
II.13. Cache Controller 49
II.13.A. Chip Architecture 50
II.13.B. Instruction and Data Configuration 53
II.13.C. Cache Controller Design 53
II.14. Communications 54
II.14.A. Intra-cache Communications 57
II.14.B. Secondary Cache Communications 58
II.14.C. MCM Placement 59
II.15. Virtual Memory Support 61
II.16. Timing 62
II.16.A. Load Timing 62
II.16.B. Store Timing 65
II.16.C. Instruction Fetch Timing 68
II.16.D. Other Cache Stalled 70
II.16.E. Processor Start-up 71
II.17. L2 Cache Design 74
II.18. At-Speed Testing Scheme 78
II.19. Evaluation of the F-RISC/G Boundary Scan Scheme 78
II.20. Test Scheme Design 84
II.20.A. Special Tests 90
II.21. Implementation and Test Plan 92
II.21.A. External Connections 92
II.21.B. Testing Control Logic 93
II.22. Beyond F-RISC / G 95
II.23. Cache Organization and Partitioning 96
II.23.A. Effective Use of Higher Device Integration 96
II.23.B. Remote Program Counter 97
II.23.C. Pipeline Partitioning 98
II.23.D. Column Associativity 100
II.23.E. Superscalar / VLIW CPU 101
II.24. Future Packaging 102
II.25. Improved Virtual Memory Support 106
III. LIST OF ALL PUBLICATIONS AND TECHNICAL REPORTS 108
IV. LIST OF ALL SCIENTIFIC PERSONNEL SHOWING ADVANCED DEGREES EARNED BY THEM WHILE EMPLOYED ON THE PROJECT 113
V. LIST OF INVENTIONS BY NAME 114
VI. REFERENCES 115
VII. BIBLIOGRAPHY 119
VIII. APPENDIX 121
VIII.1. Register File / Cache RAM Optimization Process 122
VIII.2. Register File Circuit Sensitivity Analysis and Component Modifications 124
VIII.3. Register File Circuit Modifications 129
VIII.4. Register File Optimization Summary 134
List of Figures
FIGURE 2.1: CROSS SECTION OF ROCKWELL HBT (FROM ASBECK) 3
FIGURE 2.2: DIFFERENTIAL CURRENT SWITCH 4
FIGURE 3.1: MEMORY HIERARCHY 6
FIGURE 5.1: SIMULATIONS: 2 KB HARVARD CACHES, DIRECT-MAPPED, SPICE TRACE 12
FIGURE 5.2: SIMULATIONS: 2 KB HARVARD CACHES, DIRECT-MAPPED, TEX TRACE 13
FIGURE 5.3: SIMULATIONS: 2 KB HARVARD CACHES, DIRECT MAPPED, GCC TRACE 13
FIGURE 5.4: SIMULATION RESULTS FOR BENCHMARK SUITE 14
FIGURE 5.5: 2 KB HARVARD CACHES, DIRECT-MAPPED, BLOCK SIZE EQUALS BUS
FIGURE 5.6: CPI AS A FUNCTION OF SET SIZE, BLOCK SIZE 15
FIGURE 5.7: EFFECT OF SET SIZE 16
FIGURE 5.8: EFFECT OF ARCHITECTURE ON CPI 17
FIGURE 5.9: EFFECT OF HARVARD CACHE SIZE ON CPI 17
FIGURE 6.1: LOAD HIT 19
FIGURE 6.2: CLEAN LOAD MISS 20
FIGURE 6.3: DIRTY LOAD MISS 20
FIGURE 6.4: STORE HIT 21
FIGURE 6.5: CLEAN STORE MISS 21
FIGURE 6.6: DIRTY STORE MISS 22
FIGURE 7.1: REQUIRED SECONDARY CACHE HIT TIME 24
FIGURE 8.1: BLOCK REPLACEMENT 25
FIGURE 8.2: NIBBLE REPLACEMENT 26
FIGURE 8.3: EXTRA COLUMN PER BLOCK REPLACEMENT 27
FIGURE 8.4: CHIP YIELDS AS A FUNCTION OF FAULT PROBABILITY 31
FIGURE 9.1: F-RISC / G SYSTEM 33
FIGURE 10.1: CRITICAL PATH DIAGRAM 34
FIGURE 10.2: DATA CACHE CRITICAL PATH 35
FIGURE 10.3: ADDRESS TRANSFER FROM CPU TO CACHES 36
FIGURE 10.4: SINGLE BUS ADDRESS TRANSFER FROM CONTROLLER TO RAMS 37
FIGURE 10.5: DUAL BUS ADDRESS TRANSFER FROM CONTROLLER TO RAMS 37
FIGURE 10.6: INSTRUCTION TRANSFER - RAM TO ID 38
FIGURE 10.7: DATA TRANSFER - RAM TO DATAPATH 38
FIGURE 11.1: SEQUENTIAL CACHE OPERATION 41
FIGURE 11.2: PIPELINED CACHE OPERATION 41
FIGURE 11.3: SYSTEM PIPELINE - SEQUENTIAL LOADS 42
FIGURE 11.4: SYSTEM PIPELINE - "SEQUENTIAL" STORES 43
FIGURE 11.5: PIPELINE DIAGRAM WITH BUBBLE 44
FIGURE 11.6: PIPELINE ROTATE 44
FIGURE 12.1: CACHE RAM BLOCK DIAGRAM 46
FIGURE 12.2: CACHE RAM FLOORPLAN 47
FIGURE 12.3: CACHE RAM CHIP LAYOUT 48
FIGURE 13.1: SIMPLIFIED CACHE CONTROLLER BLOCK DIAGRAM 50
FIGURE 13.2: CACHE CONTROLLER FLOORPLAN 51
FIGURE 13.3: CACHE CONTROLLER LAYOUT 52
FIGURE 13.4: REMOTE PROGRAM COUNTER 52
FIGURE 14.1: LOAD CRITICAL PATH COMPONENTS 54
FIGURE 14.2: COMPONENTS OF ADDER CRITICAL PATH (ADAPTED FROM [PHIL93]) 55
FIGURE 14.3: ABUS PARTITIONING 56
FIGURE 14.4: MCM LAYOUT 59
FIGURE 16.1: DATA CACHE TIMING - CLEAN LOADS 64
FIGURE 16.2: SAMPLE LOAD COPYBACK CODE FRAGMENT 64
FIGURE 16.3: DATA CACHE TIMING - LOAD COPYBACK 66
FIGURE 16.4: DATA CACHE TIMING - STORE COPYBACK 67
FIGURE 16.5: INSTRUCTION CACHE MISS TIMING 68
FIGURE 16.6: DATA CACHE DURING INSTRUCTION CACHE STALL 70
FIGURE 16.7: INSTRUCTION CACHE DURING A DATA CACHE STALL 71
FIGURE 16.8: INSTRUCTION CACHE AT START-UP 72
FIGURE 16.9: INSTRUCTION CACHE DURING TRAP 73
FIGURE 17.1: SECONDARY CACHE BLOCK DIAGRAM 75
FIGURE 17.2: LOAD COPYBACK IN F-RISC / G CACHE 76
FIGURE 17.3: INSTRUCTION DECODER BLOCK DIAGRAM 77
FIGURE 19.1: RECEIVER USED IN F-RISC / G CORE BOUNDARY SCAN SCHEME 80
FIGURE 19.2: SIMPLIFIED AT-SPEED TESTING TIMING DIAGRAM 81
FIGURE 19.3: DRIVER USED IN F-RISC / G CORE BOUNDARY SCAN SCHEME 82
FIGURE 19.4: F-RISC/G CORE BOUNDARY SCAN SCHEME 83
FIGURE 20.1: READ TIMING IN CONTINUOUS MODE 85
FIGURE 20.2: SINGLE SHOT TIMING 86
FIGURE 20.3: L2 PATH DRIVER / RECEIVER 86
FIGURE 20.4: BOUNDARY SCAN RECEIVER 88
FIGURE 20.5: BOUNDARY SCAN DRIVER 89
FIGURE 21.1: BOUNDARY SCAN CONTROLLER STATE DIAGRAM 94
FIGURE 23.1: MODIFIED CACHE CONTROLLER BLOCK DIAGRAM 98
FIGURE 23.2: COLUMN ASSOCIATIVE CACHE - SLOW HIT 99
FIGURE 24.1: 3-D CHIP STACK 102
FIGURE 24.2: 3-D RAM STACK MCM LAYOUT 102
FIGURE 24.3: 3-D RAM AND CPU STACK MCM LAYOUT 103
FIGURE 24.4: CACHE CRITICAL PATHS 104
FIGURE 1: REGISTER FILE PERFORMANCE WITH VARIOUS DEVICE AND INTERCONNECT MODELS 121
FIGURE 2: ACCESS TIME SENSITIVITY TO ADDRESS, BIT AND WORD LINE CAPACITANCE 122
FIGURE 3: MEMORY CELL LAYOUTS 124
FIGURE 4: WORDLINE SENSITIVITY TO TOTAL DECODER RESISTANCE 125
FIGURE 5: WORDLINE AND ADDRESS-LINE CURRENT SENSITIVITY TO DECODER RESISTOR RATIO (TOTAL RESISTANCE = 440 OHMS) 126
FIGURE 6: ADDRESS LINE SENSITIVITY TO DECODER RESISTANCE RATIO 127
FIGURE 7: READ/WRITE LOGIC BITLINE SWINGS DURING WRITE (PULL-UP RESISTANCE = 600 OHMS) 128
FIGURE 8: ACCESS TIMES AND BITLINE SWING DURING WRITE FOR VARIOUS READ/WRITE LOGIC PULL-UP RESISTOR VALUES 128
FIGURE 9: ACCESS TIME SENSITIVITY TO BITLINE CURRENT 129
FIGURE 10: INTERNAL SIGNAL SWINGS DUE TO RELATIVELY STATIC AND DYNAMIC ADDRESS CHANGES 130
FIGURE 11: ORIGINAL AND REDUCED-WORDLINE-SWING MEMORY CELL DESIGNS 131
FIGURE 12: "BRIDGE" RESISTOR BETWEEN BITLINES 133
FIGURE 13: SENSITIVITIES TO BITLINE BRIDGE RESISTOR 134
FIGURE 14: PERFORMANCE SENSITIVITY TO BITLINE BRIDGE RESISTOR
To increase a computer's speed one can improve the algorithm used to calculate the results, devote more resources to the problem, or increase the efficiency of each resource. The goal behind F-RISC is speeding up computations by decreasing the clock cycle time. By stripping the architecture down to the bare necessities, thereby simplifying the circuitry necessary to implement it, and by using the most advanced available packages, devices, and interconnect technologies, we hope to operate F-RISC at the highest possible clock speed.
The F-RISC / G implementation contains a seven stage pipeline as shown in Table 0.1. The I1, I2, D1, D2, and DW stages are all dedicated to memory accesses.
|Pipeline Stage||Function|
|Instruction Fetch 1||Transfer instruction address to cache on branch|
|Instruction Fetch 2||Receive instruction from cache|
|Data Read 1||Transfer data address|
|Data Read 2||Receive data from cache if a LOAD|
|Data Write||Cache modified if STORE; write result into register|
Like many modern RISC implementations, F-RISC relies on deep pipelines and a cache hierarchy to achieve high throughput. Table 1.2 enumerates a number of state-of-the-art RISC processors, along with F-RISC, and illustrates some of the key architectural features of these processors.
The use of cache memory hierarchies has become paramount in computer design. Irrespective of the expense of massive amounts of extremely high speed memory, packaging technology has not yet evolved to the point where the entire main memory of a high speed computer can be placed in close enough proximity to the core CPU to allow reasonable access times.
Table 1.3 lists some of the design details of the processors shown in Table 1.2. As shown in the table, the F-RISC / G core CPU and primary cache alone are expected to dissipate 250 W (or 2.5 W/cm²), illustrating the obvious problems that would be associated with packing even more memory onto the multi-chip module (MCM).
|Processor||Process||Transistors|
|UltraSPARC||0.5 µm CMOS||5,200,000|
|Alpha 21164||0.5 / 0.4 µm CMOS||9,300,000|
|PA-RISC 7200||0.55 µm CMOS||1,260,000|
|PowerPC 620||0.5 µm CMOS||6,900,000|
|MIPS R10000||0.5 µm CMOS||5,900,000|
|F-RISC/G||1.2 µm GaAs HBT||200,000|
The technologies used in the F-RISC / G prototype, while providing the performance necessary to achieve its 1 ns cycle time, also impose several difficulties in its design. Most notable among these limitations are low device integration and comparatively high static power dissipation.
The devices used in the F-RISC/G CPU are based on AlGaAs/GaAs heterojunction bipolar transistor technology as supplied by Rockwell International. Figure 2.1 shows a cross-section of the Rockwell HBT device. The baseline process produces transistors with a nominal fT of 50 GHz.
The primary advantage of using HBTs is that the heterojunction provides good emitter injection efficiency, lower base resistance than in bipolar junction transistors (BJTs), improved punch-through resistance, and lower emitter-base capacitance (Cje) [Sze81].
The GaAs / AlGaAs system also offers advantages. Among them, a large bandgap can be achieved; electron mobility is high (on the order of 5000-8000 cm²/V·s in pure material vs. 800-2000 for Si), reducing base transit time and charge storage at the emitter junction; and a semi-insulating substrate is available (resistivity on the order of 5 × 10⁸ Ω·cm) [Sze90; Long90].
F-RISC / G makes use of differential current tree logic and differential wiring [Nah91; Greu91]. These circuits are built out of differential pairs of transistors called current switches which are arranged in a common emitter configuration.
Figure 2.2 shows a simple differential current switch. The current source Is may either be passive (a resistor) or active (a transistor).
F-RISC / G uses passive sourcing and a 0.25 V logic swing. Three levels of switches are stacked, allowing complex logic functions to be realized. Passive sourcing was selected because of the high VBE of the devices supplied by Rockwell (1.35 V): with three stacked switch levels, and with the desire that F-RISC / G be compatible with standard ECL parts (VEE = -5.2 V), too little voltage headroom remains for an active current source.
Differential circuitry is used due to its common-mode noise immunity and the elimination of the common reference voltage required in Emitter-Coupled Logic (ECL). An added benefit of using differential wiring is that inversions can be accomplished merely by flipping wires. The use of differential wiring can increase capacitance and requires more routing area.
At the cutting edge of device technology device yield tends to be a serious problem, leading to low device integration. This makes construction of useful circuits a serious challenge. If several small die can be interconnected with little penalty, the challenge of poor yield can, in part at least, be circumvented.
The goal of any packaging scheme in a high-performance system is to provide high-bandwidth interconnection between system components. Other requirements include the ability to integrate bypass capacitors (to reduce noise), terminating resistors (to reduce reflections), the capability to dissipate heat, and physical protection of the die. Since the highest performance devices are usually products of immature technologies (and thus suffer from poor yields), a technology which allows packaging of multiple die into a single high-speed circuit is important. Thin film multi-chip modules (MCMs) are a form of wafer-scale hybrid packaging that can provide all of these features.
It is often not technologically feasible to allow the CPU to communicate directly with a quantity of RAM equivalent to the processor's address space. In order to circumvent this difficulty it is possible to use a cache memory hierarchy in which small amounts of high speed memory are used in concert with large amounts of lower speed memory to approximate the performance of large amounts of high speed memory. This is an idea dating back to the University of Manchester's Atlas machine's "one-level storage system" in the early 1960s [Kilb62]. The idea is for the most frequently used data to be stored in the high speed memory, while less frequently used data is fetched from main memory into the high speed memory as needed.
The small high-speed memory is conventionally called a cache. The dimensions of the cache (depth and width), organization of the cache, behavior during writes, communications with higher levels of memory, and even the quantity of caches (do the data and instruction streams get cached separately?) are all parameters which the cache designer must set.
Figure 3.1 illustrates a typical cache memory hierarchy. Typically, lower levels of memory are faster than higher levels of memory, whereas higher levels of memory are larger than lower levels of memory. There may be levels of memory above the secondary cache. The highest level of memory is called main memory. Main memory itself may be smaller than the address space, and an external storage device can be used to simulate having a larger main memory, with elements, called pages, swapped between the main memory and storage device as needed. This is a technique known as virtual memory.
An important issue in cache design is what to do when the CPU attempts to write something into a particular memory address (i.e. perform a STORE). Although most cache accesses will be reads (all instruction fetches will be reads, and few instructions write to memory [Henn90]), writes are common enough that some care must be taken in handling them.
When writing into the cache, there are two options: in a write-through design, every STORE is propagated immediately to the higher levels of memory; in a copy-back (write-back) design, modified blocks are written to the higher levels only when they are displaced from the cache.
In a copy-back design, the cache tag RAMs will likely store a bit which determines whether the cache block in question has been modified by the CPU since it was retrieved from higher levels of memory. If the CPU has modified the block by performing a STORE, and the higher levels of memory have not been updated, the block in the cache is called dirty. If the CPU has not modified the block, or the block was modified but the higher levels of memory have been informed, the block is called clean.
To prevent the need to wait for the write-through to occur on every write, it is common to include buffers between the cache levels. This buffering scheme can, however, erode the simplicity advantage of the write-through design.
A second issue in handling memory writes is what to do in the event of a write miss. In a copy-back cache the block occupying this frame will be transferred to higher levels of memory. In a write-through cache this is not necessary. In either case it is still necessary to actually perform the write that was requested by the CPU. There are two methods of accomplishing this: in a fetch-on-write (write-allocate) cache, the missed block is fetched into the cache and then modified; in a write-around (no-write-allocate) cache, the write is passed directly to the higher levels of memory without allocating a block frame.
Fetch on write is used when the expectation is that the modified block will be modified many times in close temporal proximity. Write-around caches are usually used in write-through designs, since modifications to memory have to filter through to main memory anyway.
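The interaction of the dirty bit with store hits and misses can be sketched as follows; this is an illustrative model of a copy-back, fetch-on-write policy, not the F-RISC/G controller logic, and the names are hypothetical:

```python
# Sketch of copy-back store handling with a dirty bit per block frame
# (illustrative; parameters and structure are assumptions, not the
# F-RISC/G cache controller design).
class Frame:
    def __init__(self):
        self.tag, self.dirty = None, False

def store(frame, tag, writeback_log):
    """Handle a STORE to one direct-mapped frame, fetch-on-write policy."""
    if frame.tag == tag:                 # store hit: just mark dirty
        frame.dirty = True
        return "hit"
    if frame.dirty:                      # dirty store miss: copy back first
        writeback_log.append(frame.tag)
    frame.tag, frame.dirty = tag, True   # fetch the block, then modify it
    return "miss"

log = []
f = Frame()
assert store(f, 5, log) == "miss" and log == []   # clean store miss
assert store(f, 5, log) == "hit"                  # store hit
assert store(f, 9, log) == "miss" and log == [5]  # dirty store miss copies back
```

The `writeback_log` stands in for the transfer to higher levels of memory; in a write-through design no dirty bit would be needed and the copy-back branch would disappear.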
Each level of the cache hierarchy may differ in dimension. A block from a lower level of memory or the CPU must be mapped to the proper location in each cache. Caches are classified by the manner in which this mapping is accomplished.
In a direct mapped cache, each main memory address maps to one and only one location within the cache. Multiple main memory addresses may map to a single cache address, however.
The direct mapped cache has the advantage of being simple to implement. In addition, it is simple to determine whether or not a given main memory address is stored in the cache because there is only one possible place it can be stored. On the other hand, if there is a need to store another address which maps to the same block frame, the block is displaced, at the very least entailing a block transfer penalty, and, at worst, causing the displaced block to be swapped back into the cache again if it is needed shortly.
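The mapping just described is simple arithmetic on the memory address; the sketch below uses illustrative parameters (a 2 KB cache with 32-byte blocks, chosen to match the 2 KB caches simulated later in this report), not the F-RISC/G configuration:

```python
# Direct-mapped address decomposition: each address maps to exactly
# one block frame (the index); the tag distinguishes the many memory
# blocks that share that frame. Sizes are illustrative assumptions.
BLOCK_SIZE = 32          # bytes per cache block
NUM_BLOCKS = 64          # 2 KB cache / 32-byte blocks

def decompose(addr):
    """Split a byte address into (tag, index, offset)."""
    offset = addr % BLOCK_SIZE
    index = (addr // BLOCK_SIZE) % NUM_BLOCKS
    tag = addr // (BLOCK_SIZE * NUM_BLOCKS)
    return tag, index, offset

# Two addresses exactly one cache size apart land in the same frame
# (same index, different tag) -- storing one displaces the other.
a = 0x1040
b = a + BLOCK_SIZE * NUM_BLOCKS
assert decompose(a)[1] == decompose(b)[1]   # same index
assert decompose(a)[0] != decompose(b)[0]   # different tags
```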
In a fully associative cache, any main memory address can be stored in any block in the cache. One advantage of a fully associative cache is that as long as there is a free block in the cache, then there is no need to ever replace the cache contents with other addresses. If all of the blocks are full, then ideally one would wish to replace the block which will not be needed for the longest time [Matt70]. Since this is difficult to implement, alternate cache replacement algorithms such as "least-recently used" [Henn90] are often used.
A disadvantage of this approach is that when a block is sought the entire cache must be searched. This will either require more hardware (a comparator for each block frame) or more time (the contents of each block frame serially being passed through a single comparator). A second disadvantage is that implementing the algorithms needed to decide where to store blocks (a decision must be made as to where to store each block, since any frame is a legitimate destination), and the multiplexing logic needed to route the data appropriately, require a comparatively large amount of hardware.
In a set-associative cache each address in the main memory address space maps to a limited quantity of locations in the cache. The group of possible block locations to which a particular main memory address block can be mapped is called a set. (Thus, a direct-mapped cache is a set-associative cache with one block frame per set, while a fully associative cache consists of a single set.)
The set-associative cache requires fewer hardware resources than the fully associative cache: the multiplexors which route data into the data memories can be smaller (since a given block may be stored in fewer possible block frames), and fewer comparators are necessary within the tag RAM if a parallel search is to be done.
An advantage of the set associative cache over the direct-mapped cache is that if a particular block frame is occupied, it is possible that alternate block frames within the same set will be available for new data.
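This advantage can be seen in a minimal model of a 2-way set-associative lookup with least-recently-used replacement; the parameters are illustrative assumptions, not the F-RISC/G design:

```python
# Minimal 2-way set-associative cache with LRU replacement (a sketch;
# all sizes are illustrative assumptions).
BLOCK_SIZE, NUM_SETS, WAYS = 32, 32, 2

class SetAssocCache:
    def __init__(self):
        # each set holds up to WAYS tags, most-recently-used last
        self.sets = [[] for _ in range(NUM_SETS)]

    def access(self, addr):
        """Return True on a hit, False on a miss (which fills a frame)."""
        index = (addr // BLOCK_SIZE) % NUM_SETS
        tag = addr // (BLOCK_SIZE * NUM_SETS)
        ways = self.sets[index]
        if tag in ways:
            ways.remove(tag); ways.append(tag)   # refresh LRU order
            return True
        if len(ways) == WAYS:
            ways.pop(0)                          # evict least-recently used
        ways.append(tag)
        return False

c = SetAssocCache()
a, b = 0x0000, BLOCK_SIZE * NUM_SETS   # same set, different tags
assert not c.access(a)   # cold miss
assert not c.access(b)   # would conflict in a direct-mapped cache...
assert c.access(a)       # ...but both blocks fit in a 2-way set
assert c.access(b)
```

The final two accesses hit because the set still has a free frame for the second tag, exactly the situation described above.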
In the 1940s, researchers at Harvard University built the Mark series of computing machines. The Mark-III and the Mark-IV were stored program machines, with separate memories for the instructions and data. While this type of architecture is rare today, it is common for a machine to have a shared main memory but separate instruction and data caches; this is called a Harvard architecture. The alternative, a single shared cache, is called a unified or von Neumann architecture.
The advantage of separating the data and instruction caches is that it makes it simpler for instructions and data to be simultaneously fetched from memory. In systems in which the CPU is pipelined and can fetch instructions and operands simultaneously, it is a great advantage to have separate buses to memory to fetch these items, rather than forcing a stall and fetching them one at a time.
On the other hand, when a Harvard architecture is used, it is generally impossible to modify the instruction stream (the instruction cache is capable of reading memory but not writing to it), and thus self-modifying code, an important element of certain types of artificial intelligence systems, becomes much more difficult to implement. In addition, since the two caches cannot share cache RAMs, the cache hierarchy is less able to adapt to changing conditions by partitioning different amounts of memory to holding instructions and data. It is possible, however, to optimize each cache individually, since the two caches need not be identical.
The summary of research activity during the first three contract years followed roughly the plan presented in the contract proposal:
The key contribution to the field of high-speed computer design is an analysis of whether it is possible to achieve competitive computing performance from a high-speed, low-integration process. In an effort to provide this performance, various memory architectures were investigated within this particular design regime. This analysis provided insights into the effect of interconnect on the architecture and implementation of high-speed digital circuits. Additional insight is provided concerning the value of the HBT in providing higher speed memory, which can keep up with the multi-GHz computing rates possible with this device when it is used in ALU design.
In evaluating cache designs it is helpful to consider metrics which take into account the performance of the system as a whole. One useful metric is Cycles per Instruction (CPI), which is defined as the average number of processor cycles required to execute a single instruction.
One technique frequently used to determine the performance of cache designs is trace-driven simulation. The idea behind this technique is to analyze the behavior of workloads (sets of processes) which the target design is intended to execute. A cache trace is an ordered list of memory references made by this target workload. Typically the trace is obtained by introducing hardware between a CPU and memory which captures all memory references. An alternative method would be to simulate the target code on a software simulator capable of storing memory accesses. In order to evaluate the performance of different cache designs in the F-RISC/G processor, the DineroIII trace-driven simulator [Hill84] was used with three representative traces.
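The technique can be illustrated with a toy simulator in the spirit of (but much simpler than) DineroIII; the cache parameters are illustrative assumptions:

```python
# Toy trace-driven simulation: replay a list of memory references
# through a direct-mapped cache model and report the miss ratio.
# Parameters are illustrative, not the F-RISC/G configuration.
BLOCK_SIZE, NUM_BLOCKS = 32, 64   # a 2 KB direct-mapped cache

def simulate(trace):
    frames = [None] * NUM_BLOCKS      # stored tag per block frame
    misses = 0
    for addr in trace:
        index = (addr // BLOCK_SIZE) % NUM_BLOCKS
        tag = addr // (BLOCK_SIZE * NUM_BLOCKS)
        if frames[index] != tag:      # tag mismatch or empty frame: miss
            frames[index] = tag
            misses += 1
    return misses / len(trace)

# Sequential word accesses exhibit spatial locality: only the first
# touch of each 32-byte block misses.
trace = list(range(0, 4096, 4))       # 1024 sequential word addresses
assert simulate(trace) == (4096 // BLOCK_SIZE) / len(trace)
```

A real trace-driven study replaces the synthetic `trace` list with the captured reference stream and the model with the full cache organization under evaluation.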
In order to use cache traces to predict the overall performance of a cache architecture under a wider variety of workloads, three problems must be overcome [Ston90].
General purpose machines such as F-RISC are particularly difficult to represent with a single trace, and thus multiple traces are often used. Typically, suites of traces such as those encompassed by the SPEC92 benchmarks are used. As an alternative to using traces captured from actual running code, one may use traces generated by statistical models of memory reference patterns (synthetic traces).
The required length of the cache trace is estimated to grow as the cache size raised to the 1.5 power [Ston90]. As cache size increases, the number of misses which occur in a trace of given length decreases, making the misses statistically less significant.
In order to determine how long a trace must be, a statistical model can be introduced. It may be assumed that cache misses are a Bernoulli process with probability m of "success" and h = 1 - m of "failure." This is not a particularly good approximation, because in a Bernoulli process cache misses would occur independently of each other, which is clearly not the case in a cache, but it should serve as a useful lower bound.
Using this model the true mean miss ratio can be estimated as m ± √(m(1-m)/N), where N is the number of references in the trace. So, if we want the trace-based cache evaluation procedure to produce an accurate result we must minimize the deviation term.
The traces used in evaluating the F-RISC cache were on the order of 1,000,000 references in length. Since the F-RISC cache is small, a comparatively poor hit rate of 0.85 will be assumed (solely for the purpose of determining whether the cache traces are of sufficient length; the actual miss rate can be determined from simulations). In this case, the true mean miss ratio would be estimated as 0.15 ± sqrt(0.15 × 0.85 / 10^6) ≈ 0.15 ± 3.6 × 10^-4, meaning that the deviation is sufficiently small to suppose that the traces are long enough to properly represent the workloads they were derived from.
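The trace-length criterion can be checked with a short calculation. The following sketch assumes the Bernoulli model above; the function name and the use of a 0.15 miss ratio with a 1,000,000-reference trace follow the figures quoted in the text:

```python
import math

def miss_ratio_stderr(m: float, n: int) -> float:
    """Standard error of the sample miss ratio under a Bernoulli model
    with miss probability m and n trace references."""
    return math.sqrt(m * (1.0 - m) / n)

# Figures assumed in the text: miss ratio 0.15, ~1,000,000 references.
se = miss_ratio_stderr(0.15, 1_000_000)
print(f"miss ratio = 0.15 +/- {se:.1e}")
```

A deviation on the order of 4 × 10^-4 against a miss ratio of 0.15 supports the claim that the traces are long enough.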
The three traces used in the DineroIII simulations are tex, spice, and gcc which are representative of three common code bases frequently executed on workstation class machines. Table 5.1 shows some of the important characteristics of these traces.
The CPI for a processor is a compilation of CPI figures from several sources. For example, the processor may be designed so that every ALU operation requires one cycle to execute, in which case the CPI due to ALU operations would be 1. If the processor were always able to execute two instructions in parallel, then the CPI due to ALU operations would be 0.5.
In the case of the F-RISC / G processor, the pipeline structure is such that the CPU component of CPI is expected to be approximately 1.45 [Phil93, Greu90]. This CPI component is sometimes called the latency component of CPI. Although the CPU is designed to execute with a peak CPI of 1.0, pipeline latencies (the inability to keep parts of the processor busy due to code dependencies) result in reduced performance.
In addition to CPI penalties caused by pipeline inefficiencies, the cache contributes to an increased CPI. This component of CPI is known as the stall component of CPI. Some percentage of the instructions executed by any given code will result in cache accesses (in fact, all instructions will result in instruction cache accesses), and the overall CPI of the processor depends on the cache's ability to perform the requested transaction in the time allotted. Whenever the primary cache must resort to communicating with the secondary cache, the effect is to stall the processor and prevent it from accomplishing any useful work.
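The stall component can be sketched as a simple function of miss rates and miss penalty. The rates below are purely illustrative (the actual figures come from the trace simulations); only the 5-cycle miss penalty is taken from the design point assumed later in the text:

```python
def stall_cpi(i_miss_rate, d_access_frac, d_miss_rate, miss_penalty):
    """Stall component of CPI: every instruction is fetched from the
    instruction cache, while only the fraction of instructions that
    are LOADs/STOREs (d_access_frac) touch the data cache."""
    return (i_miss_rate + d_access_frac * d_miss_rate) * miss_penalty

# Illustrative (not measured) rates with a 5-cycle miss penalty.
print(stall_cpi(0.05, 0.3, 0.10, 5))  # ≈0.40 extra cycles per instruction
```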
It is clear from these results that small block sizes perform best for this trace. Because the cache memory is so small, a large block size imposes a great penalty in terms of the percentage of available space occupied by a single block; when a new block address must be stored, there are few block frames available to hold it.
For a given block size the best performance occurs when the bus width is equal to the block width. If the bus width were smaller than the block size, then multiple bus accesses, each incurring a miss penalty, would be necessary (unless a hardware-intensive buffering scheme were used - in which case occasional stalls would still occur.) Figure 5.1 is a graph showing the cache stall component of CPI as a function of block size and bus width for the Spice trace.
Figure 5.2, plotted on the same scale as Figure 5.1, shows that the magnitude of the CPI results obtained using the Tex trace is lower overall than the results obtained with the Spice trace. Once again, at a given bus width, smaller block sizes seem to yield superior performance.
The optimum point (1.41) occurs with a block size and bus width of 64 bytes. Given the estimated latency CPI component of 1.45, the total CPI taking into account both latency and stall CPI components would be 1.86. Figure 5.4 shows a plot of the weighted mean stall CPI for all three cache traces as a function of block size and bus width.
Having determined that the optimum configuration occurs when block size is equal to bus width, it is possible to plot the stall component CPI as a function of equal block sizes and bus widths as shown in Figure 5.5, where the results of each trace are plotted along with the weighted mean. From this plot it is clear that the minimum CPI occurs at a bus width and block size of 64 bytes (512 bits). It is also interesting to note that at half that size (32 bytes) the CPI is approximately 1.44, which is only 0.03 cycles per instruction worse than the optimal point.
Figure 5.6 is a plot of stall CPI as a function of set size and equal block and bus sizes for a Harvard architecture with dual 2 kB caches and copyback. From this plot it can be seen that larger set sizes tend to produce better CPIs, although the bus width and block size seem to have a larger effect on the overall CPI.
Figure 5.7 shows the effect of varying set size for the three block sizes and bus widths which provide the best results for 2 kB Harvard caches employing copyback and the timing constraints mentioned earlier. As can be seen, the CPI does not improve markedly as set size is increased beyond 4. The effect of moving from a direct-mapped (set size = 1) cache to a 4-way set associative cache, however, is fairly significant.
Figure 5.8 illustrates the effect of various cache architectures on stall CPI. The graph shows CPI as a function of block and bus width for a Harvard cache with 2 kB per cache, a unified cache with 4 kB of single-ported RAM, and a unified cache with 4 kB of dual-ported RAM. A direct-mapped cache employing copyback is assumed.
In a unified cache with dual-ported RAM it is possible to read both an instruction and data simultaneously, while, for a single-ported RAM scheme, it is possible to perform only one access at a time.
As the graph shows, the Harvard cache tends to perform the best. For the unified cache designs, the use of dual-ported RAM provides the best results except at extreme block sizes.
For the single ported unified cache, at most one cache access, instruction or data, can be accomplished at any time; every data access therefore delays an instruction fetch by one cycle in addition to any miss penalties. As a result, the equation used to calculate CPI from the DineroIII output takes the form:

CPI_stall = (instruction miss ratio + data access fraction × data miss ratio) × miss penalty + data access fraction × 1
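The contention effect can be made concrete by comparing the Harvard and single-ported unified cases side by side. The rates below are illustrative stand-ins, not the DineroIII results:

```python
def stall_cpi_harvard(i_miss, d_access, d_miss, penalty):
    # Separate instruction and data caches: only misses stall the pipeline.
    return (i_miss + d_access * d_miss) * penalty

def stall_cpi_unified_1port(i_miss, d_access, d_miss, penalty):
    # One RAM port: every data access also blocks an instruction fetch
    # for one cycle, on top of the usual miss penalties.
    return (i_miss + d_access * d_miss) * penalty + d_access * 1.0

# Illustrative rates only -- the actual figures come from DineroIII.
args = (0.05, 0.3, 0.10, 5)
print(stall_cpi_harvard(*args), stall_cpi_unified_1port(*args))
```

Under this model the single-ported unified cache always pays an extra stall equal to the data-access fraction, which is consistent with the Harvard cache performing best in Figure 5.8.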
Figure 5.9 shows the effect of varying cache size given a Harvard direct mapped cache employing copyback and a 64 byte block and bus width. The stall component of CPI drops below 1.5 at a cache size of 2048 bytes per cache. At twice that memory size there is comparatively little improvement in CPI, and there is little doubt that it would be extremely difficult to implement that much memory given the interconnect lengths that would be required and the difficulty in removing the heat from that many bipolar RAM blocks.
Based on these cache simulations, the design point which was chosen for the F-RISC / G prototype cache is as listed in Table 5.2. Assuming a miss penalty of 5 cycles, the predicted stall CPI for this design is approximately 1.41.
|Ins. Cache Size:||2 kB|
|Data Cache Size:||2 kB|
|Write Policy:||Copyback, Write Allocate|
|Bus Width:||512 bits (64 bytes)|
|Block Width:||512 bits (64 bytes)|
Table 5.3 shows the results of the cache simulations broken down by type of event. The probability of each event occurring is also given. Based merely on these events, the stall CPI would add to 0.73. What remains unaccounted for are the 68% of instructions which may be either ALU operations or BRANCHes. Each ALU or BRANCH operation can be assumed to take 1 cycle (since BRANCH misses are already accounted for in the "Instruction miss" category). Therefore, the net stall CPI would be 0.73 + 0.68 = 1.41, as reported above. Note that the "Reads" figure presented in Table 5.3 includes the read cycle that occurs at the beginning of each STORE, thus the write penalty would only be 1 additional cycle.
|Event||Count||Probability||Cycles||CPI Contribution|
|ALU / Branch||1442343||.68||1||.68|
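The accounting in the paragraph above is a simple weighted sum of probability times cycle cost per event; the following sketch uses only the two aggregate contributions quoted in the text:

```python
def net_cpi(contributions):
    """Sum per-event CPI contributions (probability x cycles per event)."""
    return sum(p * c for p, c in contributions)

# From the text: all cache-related events together contribute 0.73, and
# ALU/BRANCH instructions (probability .68, 1 cycle each) contribute 0.68.
print(net_cpi([(0.73, 1), (0.68, 1)]))  # ≈1.41
```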
From the perspective of the CPU, the only operations which can take place in the cache are a load, in which information is fetched from the primary cache and sent to the CPU, and a store, in which information is sent from the CPU to the primary cache for storage. Depending on the state of the cache when the CPU attempts a transaction, however, things can get more complicated.
Perhaps the most complicated situation occurs in caches which employ copyback for writes. In such a cache not only is it necessary to handle hits and misses, but clean and dirty operations become an issue, as well.
For a copyback cache the possible transactions are a load hit, clean load miss, dirty load miss, store hit, clean store miss, and dirty store miss.
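The six copyback transaction types enumerated above can be captured by a small classifier; the function name and interface are illustrative, not part of the F-RISC design:

```python
def classify_transaction(op: str, hit: bool, victim_dirty: bool) -> str:
    """Classify a copyback-cache transaction into one of the six cases:
    load/store hit, clean/dirty load miss, clean/dirty store miss.
    op is 'load' or 'store'; victim_dirty matters only on a miss."""
    if hit:
        return f"{op} hit"
    return f"{'dirty' if victim_dirty else 'clean'} {op} miss"

assert classify_transaction("load", hit=True, victim_dirty=False) == "load hit"
assert classify_transaction("store", hit=False, victim_dirty=True) == "dirty store miss"
```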
A load hit occurs when the CPU attempts to read from an address which is available in the cache. In this case the CPU generates an address which gets sent to the cache. The cache will usually simultaneously check its tag RAM to see if the requested address is available and start reading from the appropriate location in the data RAM. The data will be sent to the CPU when it is found.
A clean load miss occurs when the CPU attempts to retrieve data which is not available in the cache. In this case the cache must retrieve the requested data from the next level of memory.
Figure 6.2 is a timing diagram for a clean load miss. Note that the secondary cache may begin its access in parallel with the primary cache, and that the data arriving from the secondary cache may or may not be written into the primary cache simultaneously with being sent to the CPU.
When a dirty load miss occurs, the block fetched from the secondary cache can not be stored in the primary cache until the modified data already in that primary cache location is updated in the higher levels of memory.
Figure 6.3 shows a representative timing diagram for the dirty load miss case, also known as a load copyback. Note that in this particular implementation the address from the CPU is sent to the secondary cache in parallel with the primary cache, allowing the secondary cache to get an early start retrieving the required data. By the time the primary cache realizes that a dirty miss has occurred (at the end of the tag comparison step) and has sent the dirty data to the secondary cache, the secondary cache has finished retrieving the requested data.
A store hit occurs when the CPU attempts to write into a memory address which is already available in the primary cache. Regardless of whether the contents of that address have already been modified by the CPU, the primary cache need not perform a copyback to higher levels of memory.
In a copyback cache, if the CPU attempts to perform a store into an address which is not available in the cache, the cache must determine whether the block frame into which the requested address will be stored already contains an address which has been modified in the cache but not in higher levels of memory (and therefore needs to be copied back.) If the data has not been modified (the line is not dirty), then no copyback need take place.
If the CPU attempts to write into a block already occupied by data which has been modified in the cache but not in the higher levels of memory, a copyback must take place.
Figure 6.6 is a timing diagram for a typical dirty store miss. The primary cache checks its tag RAM while the secondary cache begins its access. By the time the primary cache has determined that a dirty condition exists and sent the dirty data to the secondary cache, the secondary cache will hopefully have finished its read.
The assumed access miss penalty for the primary cache, as previously stated, was 5 ns. This means that in order to achieve the CPI figures reported in [Phil93] the average time to access the next higher level of memory must be 5 ns, including the time required to transfer the address and receive the data.
Simulating a primary cache employing a Harvard architecture with 2048 byte direct-mapped caches, copyback, and 512-bit wide blocks, secondary cache traces were created for the tex, spice, and gcc traces. Table 7.1 lists the lengths of these traces.
The secondary cache was simulated, assuming 512-bit wide blocks, a direct-mapped unified architecture, and copyback. The hit rates for various secondary memory sizes are given in Table 7.2. A unified cache was simulated since information about whether particular transactions occurred as a result of an instruction or data miss is lost in the trace translation process, but these figures should provide a rough estimate of the performance of various secondary cache configurations.
Figure 7.1 is a plot showing the required secondary cache hit time given a secondary cache hit rate and a tertiary cache mean access time.
Based on these figures, one can reach the conclusion that a 1.88 overall CPI can be achieved if a direct-mapped Harvard secondary cache with 512-bit wide blocks, 32 kB per cache of RAM, copyback with write allocation, and a 20 ns mean access time for the tertiary cache is employed. This would require the hit time on the secondary cache to be around 3.5 ns.
Assuming a tertiary cache penalty of 20 ns and a secondary cache hit rate of 0.90, the required hit access time of the secondary cache to achieve the 5 ns overall access time is 4.84 ns. By having the secondary cache detect that it has an additional 2 ns in which to service two thirds of the incoming requests, the mean hit time can be relaxed by 1.34 ns, which allows the secondary cache RAMs to be significantly slower.
In order to achieve higher chip yield despite the high likelihood of critical faults in yield-limited technologies, it is common to utilize redundancy and fault-tolerance in digital circuits. RAM chips are particularly well suited to this type of solution, as it is usually possible simply to include redundant memory cells which can be swapped for failed cells in the event of failure.
CML circuits draw power whether they are switching or not. As a result, the inclusion of extra circuitry for the purposes of redundancy must be carefully weighed against problems with heat dissipation. Also, the inclusion of excess circuitry will increase die area, and, therefore, increase signal transmission delays, thus possibly slowing down the circuit. A further speed penalty must be paid due to the need to introduce multiplexing logic to select which cell to actually use, although, since the multiplexors are statically set at start-up, the delay is due to propagation only, and not switching.
In order to minimize the amount of multiplexing logic needed, block replacement, column replacement, or nibble replacement are the most logical redundancy schemes. In order to minimize the area requirements of a replacement scheme, adding additional columns to each block is an additional possibility.
In block replacement, an additional cache RAM block is placed on the RAM chip and swapped for a block containing a non-functional memory cell at system startup. The replacement block must be equal in dimension to the other blocks.
Figure 8.1 shows how block replacement might be implemented given the F-RISC / G RAM chip architecture. The cache RAM normally consists of four 16-bit wide cache RAM blocks. When communicating with the secondary cache, all 64 bits of the cache RAM blocks are used. When communicating with the CPU, however, one nibble is selected from among the four blocks.
In the F-RISC / G RAM chips, each of the four cache RAM blocks is divided into four nibbles, one of which is supplied to a multiplexor which further selects between the nibbles provided by each block. The nibble output of this multiplexor is sent to the CPU.
One nibble replacement scheme would add a cache RAM block providing the capability to replace any four malfunctioning nibbles. While the block replacement scheme allows an infinite number of faults in exactly one cache RAM block, the nibble replacement scheme allows an infinite number of faults scattered among exactly four nibbles, which may be physically located in any of the cache RAM blocks. Figure 8.2 illustrates one possible implementation of nibble replacement for the F-RISC / G RAM chip.
Overall, the nibble replacement scheme requires 2400 additional transistors for multiplexing, in addition to the hardware cost of the additional block and control circuitry.
In a column replacement scheme, an extra block is added to provide replacement memory columns for any malfunctioning columns. In the F-RISC / G RAM chip, an additional cache RAM block would enable any 16 columns to be replaced. Thus, an infinite number of faults could be scattered across any 16 columns, and the RAM chip would still function.
Adding an additional block to the cache RAM would require a great deal of space. The only way to do this and keep the die size under 1 cm2 would be to hand craft all of the additional circuitry to minimize the space it requires. An alternative possibility would be to add additional columns to each block, solely for the purpose of replacing malfunctioning columns within that block.
Figure 8.3 is a block diagram for a possible implementation of additional column replacement in F-RISC / G. Each block contains one or more extra columns which can be swapped for malfunctioning columns using multiplexing circuitry. The C multiplexors are used to allow the output of any of the primary columns to be overridden by the contents of any of the additional columns. If there is only one additional column, then each of these multiplexors must accept two inputs, requiring 384 devices in all.
In determining whether some sort of redundant circuitry should be included in the RAM chip to aid in fault-tolerance, the fault process must be modeled. Once a model for fault distribution is established, the circuitry of the chip must be analyzed to determine which circuits can be replicated efficiently, and how many of those circuits must work in order to have a fully functioning chip.
The simplest assumption has the chip consisting entirely of circuitry which is duplicated to provide fault tolerance. For example, if block replacement is used, one might model the chip as consisting of five cache RAM blocks, any four of which must work completely in order to have a fully functioning chip. If fewer than four cache RAM blocks work, then the chip is no good.
Assume the chance of a single block containing no critical defects (no defects which prevent that block from fully functioning) is equal for each of the blocks on the chip and is represented as YBLOCK. Further suppose that block failures occur independently of each other. Then the number of working blocks which can be expected is characterized by the binomial distribution:

P(X = x) = C(N, x) · YBLOCK^x · (1 − YBLOCK)^(N−x)

where N is the total number of blocks, x is the number of functioning blocks, and P(X = x) is the probability of exactly x functioning blocks given N total blocks.
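The binomial model can be evaluated directly. This sketch sums the binomial tail for the block-replacement case (five blocks, any four of which must work); the unit yield of 0.9 is purely illustrative:

```python
from math import comb

def p_chip_works(n_total, n_required, y_unit):
    """P(at least n_required of n_total replicated units are fault-free),
    summing the binomial tail under independent unit failures."""
    return sum(comb(n_total, x) * y_unit**x * (1.0 - y_unit)**(n_total - x)
               for x in range(n_required, n_total + 1))

# Block replacement: five blocks on chip, any four must work.
# Y_BLOCK = 0.9 is an illustrative value, not a measured yield.
print(round(p_chip_works(5, 4, 0.9), 5))  # 0.91854
```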
Similarly, YCOLUMN, the probability of a fault free column, and YNIBBLE , the probability of a fault free nibble, can be substituted into the equation to find the probability of working chips given column or nibble replacement.
Given that four working blocks (block replacement), sixteen working nibbles (nibble replacement), or sixty-four working columns (column replacement) are necessary for each RAM chip to work, Table 8.1 gives the probability of a working chip, still ignoring support circuitry, for each replacement scheme:
|Single Extra Column Replacement|
|Three Extra Columns Replacement|
In order to introduce faults in the support circuitry to the model, the simplest assumption to make is that all of the circuitry on the chip is either replicated for redundancy, or support circuitry which, if a fault occurs, results in a non-functioning chip. If the probability of a chip having enough functioning replicated circuitry to function is Yfault-tolerant and the probability of the support circuitry containing no faults is Yfault-intolerant, then the probability of a fully functioning chip is Yfault-tolerant × Yfault-intolerant.
In order to calculate the relative desirability of each replacement scheme, it is necessary to relate YBLOCK, YCOLUMN, and YNIBBLE, and determine the Yfault-intolerant for each scheme.
The first step is to generate a model for critical defects. If it is assumed that critical defects occur independently of each other, then the binomial distribution is appropriate. If it is further assumed that failures occur as point defects scattered across the wafer's surface, then the probability of a defect being located at any specific point is extremely small, and the number of possible points where a defect can occur is extremely large. The Poisson distribution is a limiting case of the binomial distribution with small probability of "success" and large number of "trials." For this reason, it is often used as a first-pass model for defect distributions in many types of systems [Devo91]. The probability of k defects on a wafer, assuming a Poisson distribution of defects, would be P(k) = λ^k e^(−λ) / k!, where λ is the population mean number of defects per wafer. Since the probability of a working circuit is the probability of zero defects occurring in it, k would be 0, so the equation would reduce to Y = e^(−λ).
The number of defects per wafer is, by definition, the number of defects per unit area (p) multiplied by the area (a) of the wafer (λ = pa). As an approximation, the area of a column is 1/16 that of a block, and the area of a nibble is 1/4 that of a block. In general, the yield for a circuit of area a would be Y = e^(−pa). Once p is known for the process, the yield for a circuit of size a can be solved for.
Given the area relationships between a block, column, and nibble, YBLOCK, YCOLUMN, and YNIBBLE can be solved for in terms of only the column area (c) and p:

YCOLUMN = e^(−pc), YNIBBLE = e^(−4pc), YBLOCK = e^(−16pc)

The pc term is the average number of critical defects expected per column.
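These yield relationships follow directly from the Poisson model and the stated area ratios (a nibble is four columns' area, a block sixteen); the function names below are for illustration:

```python
from math import exp

# pc = average number of critical defects expected per column.
def y_column(pc): return exp(-pc)
def y_nibble(pc): return exp(-4.0 * pc)   # nibble area = 4 column areas
def y_block(pc):  return exp(-16.0 * pc)  # block area = 16 column areas

for pc in (0.01, 0.1, 0.5):
    print(pc, y_column(pc), y_nibble(pc), y_block(pc))
```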
The approximate device penalty for implementing each type of replacement scheme can be found in Table 8.2.
A rough estimate of the minimum support circuitry required without any replacement scheme is 2500 devices. A column contains approximately 140 devices (each bit uses only two devices, but these special devices are approximately twice as large as the devices used for other circuits). Making the rather poor assumption that the area required by a circuit is proportional to the number of devices in the circuit (the cache RAM blocks are much more dense than most other circuitry), the yield of the support circuitry for a replacement scheme requiring D support devices can be written in terms of only p and c as Yfault-intolerant ≈ e^(−(D/140)pc); for the baseline 2500-device support circuitry this is approximately e^(−17.9pc).
Since the circuitry in the cache RAM blocks is much more dense than the rest of the circuitry in the cache RAM, a defect occurring in the area of the cache block may be more likely to cause one or more faults. The calculations would more properly be done by determining the critical area for each type of circuit. The critical area is the area of a circuit which is sensitive to a defect of some given size, and is difficult to calculate for complicated circuits without specialized software tools. Since the non-cache RAM block areas, while having lower device density, tend to have more metallization, the critical area of a non-cache RAM block circuit with a given number of transistors may be on par with the critical area of a RAM block circuit with comparable device count, even while occupying more physical space.
Figure 8.4 shows the chip yield for each replacement scheme given a cp figure. At low fault probabilities the chip yield (the probability that a chip will work) approaches 1 for each of the five schemes. The differences between the schemes are minute, with single extra column replacement performing the best. Although adding three columns per block would result in more flexibility and allow multiple bad columns per block, the additional support circuitry required results in reduced yield.
When yield is particularly bad, none of the schemes performs particularly well. When one in ten columns contains a fault, the chance of a working chip using any of these schemes is essentially zero.
The RPI Testchip contained several testable circuits of comparable complexity to a cache RAM block column (the chip actually contained a version of the cache RAM block, but, unfortunately, too few of the chips had enough working circuitry to allow any meaningful results to be obtained.)
Table 8.3 [Phil93] shows the test results for some circuits of comparable critical area to a cache RAM block column. From these results cp is found to range from 0.5 to 0.73, which, as Figure 8.4 illustrates, is the region where these fault-tolerance strategies have little beneficial effect.
The interesting conclusion to be drawn is that implementing fault tolerant strategies in extremely low-yield technologies may not be worthwhile. This is particularly the case when the low-yield technology is being used because of its high performance. In such cases, the performance penalty induced by the implementation of fault-tolerance may negate the performance benefit of the low-yield technology. Also, even if fault-tolerance is implemented, simple, inflexible schemes may provide superior chip yield results to more flexible, and therefore more complicated, solutions.
While few of the design constraints on the F-RISC / G cache resulted from architectural issues, the design of the F-RISC / G core processor constrained the design of the cache to a great degree.
The F-RISC/G system is illustrated in Figure 9.1. The Central Processing Unit (CPU) is comprised of four datapath (DP) chips and a single instruction decoder (ID) chip. Instructions supplied by the instruction cache are decoded by the instruction decoder, which sends the decoded operands and control information to the datapath.
The data cache is used only for LOAD and STORE instructions (as with most RISC systems, F-RISC allows access to memory only through these instructions.)
The Level 1 (L1) Cache is comprised of the primary instruction and data caches. Each cache consists of a single cache controller chip and eight RAM chips. Each of the two cache controllers must perform slightly different functions, but configuration circuitry is used to permit a single design to function in either the instruction or data cache.
Each RAM chip is configured to store 32 rows of 64 bits and is single-ported. One unique feature of these chips, however, is that they have two distinct "personalities." Each RAM may read or write data four bits at a time using the DIN and DOUT buses; each 64-bit row of memory may be filled one nibble at a time. A separate 64-bit bi-directional bus (L2BUS) allows reading or writing of an entire row at once. The wide bus is used to communicate directly with the secondary cache, and thus is less time critical than the four-bit bus which is used to communicate data directly to the CPU datapath.
Each cache must be able to handle one new memory access each cycle. Were the processor and cache to operate serially, this would require, for the data cache, that an address be communicated from the datapath to the data cache controller, that the tag be compared, that the address be forwarded to the cache RAMs, that the RAMs perform a read and multiplex the appropriate data to the output pads, and that the data be communicated back to the datapath in less than a nanosecond. All of the memory subsystem data critical paths are shown in Figure 10.1 while this particular critical path is diagrammed in Figure 10.2.
|Driver Delay + On-Chip Skew|
|MCM Time of Flight + Skew|
|Receiver Delay + 2 Multiplexor Delays + D-Latch Delay + On-Chip Skew|
|Driver Delay + On-Chip Skew|
|MCM Time of Flight + Skew|
|RAM Read Access Time|
|MCM Time of Flight + Skew|
|Receiver + D-Latch Delay + On-Chip Skew|
The cache RAM blocks were designed to be accessed for reads in 500 ps, and the cache RAM as a whole requires 750 ps from address presentation to data valid. This clearly makes it unlikely that the entire cache operation can be performed in 1 ns.
As a result, the cache and CPU are pipelined, so the effective allowed time for the data cache is 2250 ps (1850 ps-2100 ps for the instruction cache). Specifically, two CPU pipeline stages are allocated for each memory operation. The instruction fetch takes place during the I1 and I2 stages of the CPU pipeline. Data reads take place during the D1 and D2 stages, while data writes are additionally allotted the DW stage. The D1 and I1 CPU stages correspond to the A cache stage, while the D2 and I2 stages correspond to the D cache stage [Phil93].
The data cache controller must be able to receive the address, latch it, run it through a multiplexor (which is used to select alternate address components in the event of a primary cache miss - specifically the tag stored in the tag RAM), and drive it onto the MCM lines. Allowing for slack and capacitive loading, 330 ps is a reasonable time allowance for these operations. A similar amount of time should be allotted to the datapath to drive the address and receive the data. This leaves approximately 840 ps for communications between chips. Note that the address transfer between the datapath and the cache controllers is further constrained by latch clocking to approximately 500 ps (or, more precisely, to approximately an integer number of clock phases - two phases seems to be the minimum attainable delay.)
Assuming a dielectric constant for Paralyne of 2.65 [Maji89], the time of flight on the MCM would be 5.43 ps/mm. Allowing for clock skew between chips, rise time degradation of MCM signals, and some slack due to variations in MCM dielectric constant and dielectric thicknesses, an MCM time of flight of 5.75 ps/mm is reasonable for the purposes of this analysis. This would mean that the total MCM distance allowed for this critical path is approximately 146 mm. These times do not take into account the resistance of the lines, which results in an R-C charging effect which increases rise time at both the drivers and the receivers; it is hoped that these lines will be wide enough to minimize this problem. If ρ is the interconnect metal resistivity, ε the dielectric permittivity, l the line length, t the interconnect thickness, and d the dielectric thickness, the R-C charging effect can be approximated as [Salm93]:

τ ≈ ρεl² / (td)

(the line width cancels between the resistance term ρl/(wt) and the capacitance term εwl/d).
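The 5.43 ps/mm figure can be checked from the dielectric constant alone, and one standard R-C charging approximation (resistance times parallel-plate capacitance, with line width cancelling) coded directly; the function names and any sample values are illustrative:

```python
import math

def tof_ps_per_mm(eps_r: float) -> float:
    """Signal time of flight in a dielectric: sqrt(eps_r) / c."""
    c_mm_per_ps = 0.299792458  # speed of light in mm per picosecond
    return math.sqrt(eps_r) / c_mm_per_ps

def rc_delay(rho, eps, length, t, d):
    """R-C charging estimate: R = rho*l/(w*t), C = eps*w*l/d, so the
    line width w cancels and RC = rho*eps*l**2/(t*d)."""
    return rho * eps * length**2 / (t * d)

print(round(tof_ps_per_mm(2.65), 2))  # 5.43 ps/mm for Paralyne
```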
Looking at this portion of the cache subsystem critical path more closely, the datapath chips and the cache controllers are each clocked by a global de-skewed system clock [Nah94]. The pipeline latch on the cache controller which receives the address from the CPU is clocked approximately 500 ps after the address is formed in the datapath. This means that there is 500 ps allowed for the datapath I/O drivers, the MCM time of flight, the cache I/O receivers, and associated skew, slack, and rise time degradation allowances.
The next stage of the critical path is the transfer of the address from the cache controller to the RAMs. Each cache controller must send a 9-bit address to each of 8 RAMs. Were each cache controller to incorporate only one set of address output drivers, then this 9-bit bus must be long enough to reach each of the eight RAM chips, as shown in Figure 10.4.
If the cache controller is given a second set of address drivers for this 9-bit bus, then the length of the longest address transfer from cache controller to most cache RAMs is significantly reduced (Figure 10.5).
If a LOAD or an instruction fetch is taking place, then when the cache RAMs receive the address they are expected to read the appropriate location and send the data to either the instruction decoder (instruction cache) or the datapath chips (data cache). The CPU data and instruction word size is 32 bits, so in each cache each of the eight chips provides 4 bits of data.
In the instruction cache, the eight cache RAMs must send four bits of data each to the instruction decoder (Figure 10.6). The length of the longest net for this portion of the critical path is determined by the longest distance between any RAM in the instruction cache and the instruction decoder.
For the data cache, each datapath chip communicates with two data RAM chips. The length of the longest net for this portion of the critical path is therefore determined by the longest distance between a RAM in the data cache and its associated datapath slice. Since each of these nets must connect only three chips, as opposed to the instruction cache in which each net must connect nine chips, one would expect these nets to be shorter than in the instruction cache.
The constraints on the critical paths are:
Instruction cache (worst case): 1560 ps ≥ D+E+F+G+H
Data cache: 1790 ps ≥ D+E+F+G+H
Simulations based on preliminary MCM placement and routing predict a time of approximately 1584 ps for the data cache (including skew), which leaves approximately 206 ps for the byteops chip should one eventually be incorporated. The predicted time for the instruction cache is 1504 ps on the fast path, and 1589 ps on the slow path (which has a constraint of 1675 ps). Table 10.2 shows a breakdown of the timing for the cache subsystem critical paths.
|Stage||Component||Data (ps)||I-cache fast (ps)||I-cache slow (ps)|
|A||Address I/O (datapath)||145||145||145|
|B||Address Transfer (DP to CC)||170||170||170|
|C, D||Address I/O (CC)||334||334||334|
|E||Cache RAM Address Transfer (CC to RAM)||300||300||300|
|F||RAM Access Time||750||750||750|
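As a sanity check, the predicted path times and constraints quoted above can be tallied directly. This is simple arithmetic on the numbers in the text, not a timing model; the data cache slack is the 206 ps margin quoted for an eventual byteops chip.

```python
# Predicted critical-path times (ps) from simulation, and their
# constraints, as quoted in the text above
constraints = {"data": 1790, "instr_fast": 1560, "instr_slow": 1675}
predicted   = {"data": 1584, "instr_fast": 1504, "instr_slow": 1589}

slack = {path: constraints[path] - predicted[path] for path in constraints}
for path, s in slack.items():
    assert s >= 0, f"{path} violates its timing constraint"
```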
The F-RISC/G CPU contains a seven stage pipeline. Both the instruction and data caches are allotted two pipeline cycles to complete a fetch, and the data cache is allowed three cycles to complete a store. In the event of an acknowledged miss (a miss which is not ignored by the CPU due to an interrupt or trap) the CPU pipeline is stalled.
Table 11.1 shows the operations which take place in either cache during a fetch. Cache Controller and RAM operations may take place in parallel where appropriate.
|Tag RAM read||Receive Address|
|Send miss||Send data|
|Wait for acknowledge|
The operations shown in Table 11.1 can be divided into three stages as shown in Table 11.2. Figure 11.1 shows cache operation over time if the cache is operating sequentially. The numbers in the table represent addresses sent by the CPU to the cache to be fetched. Although not every address will miss, it is assumed that the cache hardware and CPU / Cache interface require regularity of operations, so each address must pass through the miss handling stage. If each cache stage takes one cache cycle, then each fetch requires three cache cycles. In addition, the cache can only handle one address every three cycles.
By incorporating pipelining, however, it is possible to allow the cache to operate in parallel with the CPU. Although each cache fetch will still require three cache cycles, the cache can handle three addresses in any three cycle period. By isolating the cache hardware through the use of "pipeline latches," it is possible to attain this type of behavior.
|Read Address||Receive Address|
|Tag RAM read||Receive Address|
|Send Results||Tag compare|
|Send miss||Send data|
|Wait for acknowledge|
Figure 11.2 shows how the pipelined cache would behave over several consecutive fetch requests.
As can be seen from Figure 11.2, each cache stage is isolated so that at any given time it can deal with an address different from each of the other stages. While each address still requires three cycles, the cache is capable of completing a fetch during each cycle, under peak conditions.
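The behavior in Figure 11.2 can be illustrated with a toy model of the three-stage cache pipeline. The stage count and cycle timing come from the text; the queue model itself is only a sketch.

```python
from collections import deque

def pipelined_completion_times(n_addresses):
    """Model a three-stage cache pipeline that advances every cycle."""
    pipe = deque([None, None, None])   # oldest entry (M stage) leaves first
    addrs = list(range(1, n_addresses + 1))
    done, t = {}, 0
    while addrs or any(s is not None for s in pipe):
        t += 1
        finished = pipe.popleft()      # transaction leaving the M stage
        if finished is not None:
            done[finished] = t
        pipe.append(addrs.pop(0) if addrs else None)  # new address enters A
    return done
```

Once the pipeline fills, completions arrive one per cycle even though each address spends three cycles in flight, matching the peak-throughput claim above.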
There are times, however, when the pipeline is not operating at peak efficiency. When the pipeline first starts up it is empty, and several cycles (one for each pipeline stage) are required before the first cache transaction is completed. The process of loading addresses into the empty pipeline is called a pipeline fill. Any time the pipeline must be filled, a performance penalty is incurred. Figure 11.3 shows how the cache pipelines (address and data) are integrated into the CPU pipeline. This figure assumes that each instruction is a LOAD, and that no misses take place.
Since the F-RISC/G prototype uses a copyback cache, each data cache STORE requires that the tag RAM be both read from and written to, even in the event of a cache hit. Even if dual-ported RAM were available, the read and write operations could not take place simultaneously because, in the event of a miss, the old RAM contents are needed.
As a result of the requirement for the cache to perform two memory operations during a STORE, an extra pipeline stage (DW) is added to the CPU pipeline to allow time for both operations to take place. If an additional stage were instead added to the cache pipeline to handle STOREs, then every cache transaction, whether a LOAD or a STORE, would require four cycles. The alternative would be to include hardware to engage the additional pipeline stage only when STOREs are taking place, an unpleasant option given the yield and power dissipation concerns raised by the use of the GaAs HBT process.
As shown in Figure 11.4, in which the grayed out squares represent instructions which do not access memory, address 1 spends two cycles in each pipeline stage, moving to each successive stage after the first cycle in the previous stage.
One issue differentiating the cache pipelines from the CPU pipeline is the fact that not every instruction handled by the CPU results in a data cache access. F-RISC, like most RISC architectures, limits data memory access to the LOAD and STORE instructions; ALU, BRANCH, and other instructions will not require access to the data memory.
If the data cache pipeline were allowed to advance only when the CPU requested a new transaction, then transactions already in the pipeline would be prevented from advancing toward completion. As a result, the pipelines advance during every cycle, and a valid field is kept in each pipeline stage to indicate whether the transaction currently stored in that stage is the result of an actual CPU request or merely an invalid address captured off of the CPU address bus.
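A minimal sketch of this advance-every-cycle behavior, with a valid bit per stage; the stage ordering and tuple layout are illustrative assumptions.

```python
# The cache pipeline shifts every cycle; a valid bit marks whether a
# stage holds a real CPU request or a garbage address captured off the
# shared address bus
def advance(pipe, bus_addr, cpu_requested):
    """pipe is [A, C, M]; returns the new pipe and the entry leaving M."""
    leaving = pipe[2]
    return [(bus_addr, cpu_requested), pipe[0], pipe[1]], leaving

pipe = [(None, False)] * 3
pipe, _ = advance(pipe, 0x20, True)     # LOAD: a real request
pipe, _ = advance(pipe, 0x1234, False)  # ALU op: bus value is not a request
pipe, _ = advance(pipe, 0x24, True)
```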
The pipelining behavior previously described applies only to normal LOAD or STORE transactions to the primary cache. In the event of a primary cache miss, the cache that misses will assert the MISS line, and, if the miss is acknowledged, the CPU will stall its pipeline.
By the time the miss acknowledgment arrives at the cache, however, the cache pipeline has already advanced twice. As a result, the transaction which caused the miss is in the M cache pipeline stage at the time the ACK is received. The address in the M stage needs to be sent to the cache RAMs and the tag RAM in order to handle the miss (the secondary cache has already stored it in its own pipeline).
This is accomplished by executing a pipeline rotate. When a miss is acknowledged, the address in the M stage is sent to the A stage, with the other stages advancing as normal. Figure 11.6 is a pipeline diagram for a miss occurring on address 1. At time 4 the pipeline rotates in response to an ACK at time 3.
Once the pipeline rotates and address 1 is again in the A stage, the tag RAM and the cache RAMs are properly addressed to handle the miss and copyback as necessary.
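The rotate can be expressed as a simple permutation of the three stages; this is only a sketch, with stage names taken from the text.

```python
# Pipeline rotate on an acknowledged miss: the M-stage address returns
# to the A stage while the A and C stages advance normally
def rotate(pipe):
    a, c, m = pipe          # pipe is [A, C, M]
    return [m, a, c]

# address 1 (in M) returns to A; addresses 3 and 2 advance
rotated = rotate([3, 2, 1])
```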
Each primary cache of the F-RISC/G system consists of a single cache controller chip which performs memory management functions, and eight cache RAM chips.
The cache RAM chips used in each cache (eight chips per cache) are 64 bits wide and 32 words deep (2 kb each). Each chip has two I/O buses. One bus, the high speed bus or CPU bus, is 4 bits wide and consists of separate input and output lines. The second bus, the so-called L2 bus or wide bus, is bi-directional and 64 bits wide. The cache RAM chips are designed to provide a read access time at the pads of 750 ps.
A block diagram for the cache RAM is shown in Figure 12.1.
Each RAM chip contains 64 bi-directional I/O data pads (d[0:63]) which are intended for communications with higher levels of memory. The pull-ups to VDD which are required on all CML circuit trees are included on the cache RAM rather than the secondary cache in order to optimize the pads' driving capabilities. An external signal (CRRECEIVE) is provided to the RAM from the cache controller to control whether these pads drive or receive, although the pads are automatically set to receive when the cache controller asserts CRWRITE, which is always the desired behavior when writing into the RAM.
Separate four-bit high speed buses (di[0:3] and do[0:3]) are provided for communications with the CPU. A nine bit address bus (a[0:8]) is used.
There are also external WRITE, LATCH, HOLD, and WIDE signals which are used for normal RAM operations. The LATCH signal prevents the inputs on the din and a buses from being presented to the core circuitry, which allows pipelining of the cache (since it permits the cache RAM inputs to change before a cache RAM transaction completes). The HOLD signal prevents the contents of the dout bus from changing despite changes on the din or a buses, and is also used for cache pipelining. The WIDE signal selects between the din and d buses when performing a write into memory; reads from memory are always presented to both buses. The WRITE signal, when asserted, causes the data on the selected inputs to be written into the RAM location selected by the external address pads.
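A hypothetical behavioral model of these control signals, assuming the 9-bit address splits into a 5-bit row and a 4-bit select within the 64-bit row (consistent with the CRABUS description in Table 14.5); the widths and bit ordering here are assumptions, not the actual circuit.

```python
# Behavioral sketch of the cache RAM control signals
class CacheRAM:
    def __init__(self):
        self.rows = [0] * 32       # 32 rows of 64 bits
        self.latched = (0, 0)      # (address, narrow data) behind LATCH
        self.dout = 0              # 4-bit output behind HOLD

    def cycle(self, a, din, d_wide=0, write=False,
              latch=False, hold=False, wide=False):
        if not latch:                      # LATCH freezes the core inputs
            self.latched = (a, din)
        a, din = self.latched
        row, sel = a >> 4, a & 0xF         # assumed row/word split
        if write:
            if wide:                       # WIDE: 64-bit write from L2 bus
                self.rows[row] = d_wide
            else:                          # narrow: 4-bit write from CPU
                mask = 0xF << (4 * sel)
                self.rows[row] = (self.rows[row] & ~mask) | (din << (4 * sel))
        if not hold:                       # HOLD freezes the data output
            self.dout = (self.rows[row] >> (4 * sel)) & 0xF
        return self.dout
```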
The testing circuitry includes both boundary scan and built-in self-test elements. The majority of the circuitry used for testing is encompassed in the latches which are used to hold captured core outputs and scanned-in core inputs, and built-in self-test circuitry such as a counter which is used to generate 32-bit addresses, and an 8-bit rotator which is used to generate input data patterns (see "Test Scheme Design"). While the hardware cost of implementing this testing scheme is not negligible, untestable circuitry is useless, and the scheme was optimized where possible to minimize this penalty.
Unlike most boundary-scan schemes, the sampling and input latches are located in the core rather than in the pad ring. These latches and associated multiplexors and control circuitry take up most of the standard cell area.
The latches on the four bit input bus serve the second purpose of preventing the inputs to the core from changing when the LATCH signal is asserted during normal operations.
The cache RAM critical path was simulated with SPICE, an analog circuit simulator, using capacitances extracted by Quickcap, a three-dimensional capacitance extractor, and by the Compass VTITools two-dimensional capacitance extractor.
Critical capacitances within the cache RAM block were extracted with Quickcap, and SPICE simulations were performed to confirm that the RAM block should have an access time of 500 ps. In addition, the complete RAM critical path from address pad I/O to data out I/O was simulated in SPICE to assure a net access time of 750 ps.
The cache RAM is 6.703 mm wide and 9.347 mm high. The majority of the on-chip circuitry is dedicated to the basic RAM functionality and to the I/O pads; the testing and control circuitry represent a small fraction of the transistor count. Table 12.1 is a breakdown of the transistor usage on the cache RAM chip by circuit.
|Circuit||Devices||Power (mW)|
|I/O||886||3490|
|Testing (not including latches)||224||102|
|Testing: Rotator and Counter||254||291|
|Multiplexing / Distribution||501||49|
The F-RISC / G system contains two cache controllers: one each for the data and instruction caches. Each of these chips is responsible for handling all communications between the core CPU and the cache RAMs in the primary caches, as well as the secondary cache and I/O devices.
Although the responsibilities of the two cache controllers differ slightly, it was decided to design a single, configurable controller, due both to the cost and time required to design an extra chip; the operation of the controllers in both caches is similar enough that methods were found to minimize the penalty for using a single chip.
The key functional components of the cache controller chip are the tag RAM, a three stage pipeline with integrated counter, and a comparator. The organization and interconnection of these functional structures is illustrated in Figure 13.1. The chip sends out 26 or 28 rather than 21 or 23 address bits to the secondary cache in order to allow sub-block replacement in the secondary cache. The chip additionally includes circuitry to supply appropriate control signals to the major functional units and circuitry which provides at-speed testing capability of un-mounted die as well as functional testing capability of mounted die.
The cache controller was designed for use in both the instruction and data caches. For this reason the first pipeline latch serves also as the Remote Program Counter (RPC) in the ICC configuration. Figure 13.4 shows the manner in which the two caches share a common CPU address bus and how the RPC can be loaded from this bus. If two separate cache controller chips had been designed, it would have been possible to include only two pipeline latches in the DCC as at any given time only two addresses need be stored (the third always being available on the bus.) Since the hardware for the RPC had to be included, however, it was decided that it also act as a latch in order to reduce problems caused by hazards and skew on signal lines while at the same time minimizing chip configuration and initialization logic.
The cache controller contains a pad, IS_DCC?, which is used to enable the chip to be configured for either the instruction or data cache controller. For data cache use the signal is asserted by hardwiring it on the MCM.
Additionally, when the chip is intended for the data cache, the BRANCH pad should be asserted by hardwiring it on the MCM; the ICC will have the BRANCH signal asserted by the instruction decoder whenever a branch is to occur. This signal is used to determine whether the first pipeline stage (the remote program counter) is loaded or counts.
Since it is impossible to perform a STORE into the instruction cache, the WDC line must be hardwired low. In addition, the instruction cache must retrieve an address on every cycle, so VDA should be tied high.
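The hardwired configuration described above amounts to a handful of MCM straps, sketched here with the signal names from the text.

```python
# MCM strap settings that configure the common controller die for
# either cache, per the text above
def mcm_straps(is_data_cache):
    if is_data_cache:
        return {"IS_DCC?": 1,   # selects data cache configuration
                "BRANCH": 1}    # hardwired asserted on the MCM
    return {"WDC": 0,           # no STOREs into the instruction cache
            "VDA": 1}           # an instruction fetch occurs every cycle
```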
The cache controller chip is 8.365 mm high and 9.472 mm wide. Table 13.1 shows an approximate device usage breakdown for the cache controller chip. As in the cache RAM, a large percentage of the power is dissipated in the RAM blocks and the I/O pads.
Table 13.2 compares the critical features of the F-RISC/G chip set. Despite being designed by different people, all of the chips are similar in size, area, and power dissipation. The cache controller and datapath chips are of comparable complexity (were the unnecessary columns removed from the tag RAM block, this would be even more the case), while the cache RAM and instruction decoder, though quite different in nature, are similar in size and complexity. This comparison suggests that it might be worthwhile in future designs to move some of the functionality of the cache controller into the instruction decoder.
|Circuit||Devices||Power (mW)|
|I/O||2548||2810|
|Write byte decoding||80||85|
|Tag RAM blocks||3420||4000|
|Pipeline and RPC||2664||2208|
As the F-RISC/G prototype is partitioned, inter-chip communication becomes an important issue. Large fractions of the cycle time are consumed by communication between chips.
Figure 14.1 shows a breakdown of the components of the LOAD critical path in the data cache, assuming that the Byte Operations chip is present. As can be seen, off-chip communications accounts for over 40% of the critical path. This is a unique design space that required special attention throughout the design process. Interestingly, these numbers are similar to those for the F-RISC/G adder critical path, as shown in Figure 14.2 adapted from [Phil93].
Table 14.1 lists the communications signals sent from the core CPU to the primary cache. Aside from an address and data, the CPU also sends out several handshaking and control signals. These signals inform the caches of stalls and determine whether a LOAD or STORE is to take place.
|Signal||Bits||From||To||Description|
|ABUS||32||DP||DCC, ICC||Word (Instruction cache) or Byte (Data cache) address. Shared by both caches.|
|WDC||1||ID||DCC||Signals data cache to perform store.|
|STALLM||1||ID||DCC, ICC||Signals both caches to stall.|
|ACKI||1||ID||ICC||Signals instruction cache that it has caused a stall.|
|ACKD||1||ID||DCC||Signals data cache that it has caused a stall.|
|VDA||1||ID||DCC||Address on bus is valid for data cache.|
|IOCNTRL||3||ID||DCC, ICC||Flush / Initialize / Write alignment|
|BRANCH||1||ID||ICC||Instruction cache should set RPC to address on bus.|
|DATAOUT||32||DP||DRAM||Word of data to be stored in data cache.|
The IOCNTRL lines are a 3-bit field that is part of the LOAD and STORE instructions and is sent to both cache controllers. These bits inform the caches when the system startup routine is complete, and inform the data cache of aligned byte or half-word writes. The meanings of the control bits are shown in Table 14.2.
As the data cache receives a byte address from the datapath (unlike the instruction cache, which uses word addresses), support is provided through IOCNTRL for reads and writes of any byte, half-word, or word in the processor's address space. Reading a non-word-aligned byte or half-word, however, requires the presence of the Byte Operations chip on the MCM. Support for non-word-aligned partial-word STOREs is provided in the DCC.
|IOCNTRL||Meaning|
|000||Read or write entire word|
|001||Read or write half-word|
|010||Read or write byte|
|011||Force a miss on this address|
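One plausible way the DCC could combine these encodings with the two low-order (byte) address bits to produce per-byte write enables; Table 14.5 lists CRWRITE as 4 bits wide, which suggests one enable per byte, but the exact bit ordering here is an assumption.

```python
# Hypothetical mapping from IOCNTRL and the byte address to a 4-bit
# write-enable mask (one bit per byte of the 32-bit word)
def write_enables(iocntrl, byte_addr):
    if iocntrl == 0b000:                       # whole word
        return 0b1111
    if iocntrl == 0b001:                       # aligned half-word
        return 0b0011 if (byte_addr & 2) == 0 else 0b1100
    if iocntrl == 0b010:                       # single byte
        return 1 << (byte_addr & 3)
    return 0b0000                              # 011: forced miss, no write
```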
In order to avoid the need to design two different cache controllers, the cache controller chip is designed internally to handle word addresses. On the DCC, each bit of the word-address portion of ABUS is wired to the pad corresponding to its position in the word address, and the two low-order ABUS bits (the byte address) are wired to the high-order pads (see Figure 14.3). The controller ignores these two bits when handling tags and presenting addresses to other chips, and uses them only when setting the RAMs into write mode.
Table 14.3 lists the signals sent from the cache to the CPU. These consist mostly of requested data, but also include signals to inform the CPU that a miss has occurred and the requested data will not be available in time.
|Signal||Bits||From||To||Description|
|MISSI||1||ICC||ID||A miss has taken place in the instruction cache.|
|MISSD||1||DCC||ID||A miss has taken place in the data cache.|
|INSTRUCTION||32||IRAM||ID||32 bit Instruction|
|DATAIN||32||DRAM||DP||Word of data for the datapath.|
The primary caches each consist of a single cache controller chip and eight cache RAM chips. While there is no inter-cache communication (i.e. the instruction and data caches do not communicate with each other), there is extensive communication between each cache controller and its associated RAM chips.
|Signal||MCM Length (mm)||Delay (ps)|
|DATAOUT||upper path: 22; lower path: 27|| |
|INSTRUCTION||fast bits: 13; slow bits: 24|| |
|DATAIN||upper path: 22; lower path: 28|| |
Table 14.5 lists communications lines between the cache controllers and RAMs. The CRWRITE line is used to write into the cache RAMs. The CRWIDE line is used to toggle between the 4-bit per RAM CPU data path and the 64-bit per RAM L2 data path. The CRDRIVE line is used to control the bi-directional drivers / receivers used on the RAMs for communicating with the L2 cache.
|Signal||Bits||From||To||Description|
|CRABUS||9||CC||RAM||5 bit row address and 4 bit word address.|
|CRWRITE||4||DCC||DRAM||Write enable.|
|HOLD||1||CC||RAM||Prevent RAM outputs from changing.|
|INLAT||1||CC||RAM||Allow 4-bit data input to pass through input latch.|
|CRWIDE||1||CC||RAM||Select wide input path (64 bit) for write from L2.|
|CRDRIVE||1||CC||RAM||Control bi-directional L2 bus.|
|Signal||From||To||Description|
|L2ADDR||CC||L2||23-bit line address.|
|L2DONE||L2||CC||Indicates that the L2 has completed a transaction. Any data L2 places on the bus must be valid when this is asserted.|
|L2DIRTY||CC||L2||Indicates that the L2 will be receiving an address to be written into.|
|L2MISS||CC||L2||Indicates that the address on L2ADDR is needed by the CPU.|
|L2VALID||L2||CC||Indicates that the current data in the cache row specified by the cache tag currently being transacted is correct. De-asserted by L2 during TRAP.|
|L2SYNCH||CC||L2||A 1 GHz clock used for synchronizing with L2.|
|L2VDA||CC||L2||The address currently on L2ADDR is valid.|
The HOLD and INLAT signals are used to latch the RAM 4-bit data outputs and inputs, respectively. The length of each of these lines or buses is less than 45 mm, for an estimated flight time of 300 ps.
Table 14.6 enumerates the signals used for communication between the primary and secondary caches.
Each cache controller will send out a 28-bit cache line address as soon as it is received from the CPU. This allows the L2 cache to read its tag RAM simultaneously with the L1 cache. The cache controller will assert L2DIRTY as soon as it completes its tag RAM access if the accessed line is dirty. The L2 will not receive the address stored in the primary cache's tag RAM until later, however, and only if it is required (that is, if a stall occurs).
The cache controller asserts L2MISS only if a miss occurs and the CPU acknowledges the miss. Whenever the address on the L2ADDR bus is valid, L2VDA is asserted.
Since the secondary caches do not have a synchronized clock, the L2SYNCH signal is used to inform the secondary caches that valid data is on the control and address lines. When the L2SYNCH signal goes high the data on the L2 communications lines is valid. It remains so for approximately 500 ps. If the MCM routing is done carefully, it may be possible to assure that the L2 communications signals are valid for as long as L2SYNCH is asserted.
The L2DONE signal is asserted by the L2 to indicate that it has performed the requested operations, both modifying its RAMs as appropriate and placing requested data on the bus. Any data being sent by the L2 must be on the bus for 750 ps prior to L2DONE being asserted.
In the event that the primary cache has to perform a copyback, the secondary cache will first receive the address (originating from the CPU and passing through the primary cache controller) that caused the copyback, along with the L2DIRTY signal and the data to be copied back, which should be latched at that point. Two more addresses will appear on the bus to the L2 (although they may or may not be valid), followed by the address that had been stored in the tag RAM (the address of the data being copied back).
This "out of order" execution, in which the L2 may perform the read before the write on a copyback from the primary cache, allows maximum flexibility for the secondary cache designer (for example, if two-port RAM is available).
The F-RISC / G processor is designed to be mounted on a thin-film Multi-Chip Module (MCM). Four datapath chips, the instruction decoder, the two cache controller chips, and the sixteen cache RAM chips are all mounted on this MCM. In order to achieve the timing necessary to operate the processor with a 1 ns cycle time, the chip placements on the MCM had to be carefully considered.
Figure 14.4 shows the placement of the core CPU and primary cache chips on the MCM. The placement of the datapath (DP) and instruction decoder (ID) chips is determined by the constraints of the CPU adder critical path. [Phil93] provides an analysis of this aspect of the MCM floorplan. [Phil93] reports that the worst case communication between the core processor chips is the "daisy-chain" broadcast from the instruction decoder to each of the four datapath chips. Due to the layout of the instruction decoder, the signals to be broadcast must often be driven from the side of the chip farthest from the datapath chips. The sizes of all of the F-RISC / G core and cache chips are given in Table 14.7. These chips are all significantly larger than the 8 mm x 8 mm size which Philhower assumed in his calculations, due mostly to the late inclusion of terminating resistors in the pads. These restrictions severely constrained the placement options for all of the cache chips on the MCM. In low device-integration, partitioned designs, the placement of the core CPU chips will, as a rule, constrain the placement of the cache chips in this way, so long as speed is the primary concern.
The F-RISC/G CPU is designed with rudimentary support for virtual memory. Specifically, control and communications lines are provided to enable the caches to signal the CPU in the event of a page fault, as shown in Table 15.1.
|Signal||From||To||Description|
|TRAPD||Cache||CPU||Data cache page fault|
|TRAPI||Cache||CPU||Instruction cache page fault|
|I1, I2, I3||Cache||CPU||Status lines sensed by PSW|
|O1, O2, O3||CPU||Cache||Status lines controlled by PSW|
The word addresses supplied by the CPU to the instruction cache and the byte addresses supplied by the CPU to the data cache are virtual addresses in that they refer to a location in the CPU's memory space without regard to their actual presence in physical memory. The CPU doesn't care where a particular virtual address maps to, as long as when data is requested from that address it is available.
Since the virtual instruction space is 2^32 words in size and the data memory space is 2^30 words in size, it is unlikely that the amount of physical RAM available in main memory will span the entire virtual memory space. In a typical virtual memory system, hardware and software are provided to allow the virtual memory to be divided into pages, each of which may exist either in physical memory or on a secondary storage device, such as a disk drive. When the CPU requests a transaction to an address which is in a page not currently in physical memory, a page fault occurs, and the page which is needed is loaded from secondary storage, replacing another page already in physical memory if necessary. Since the amount of time necessary to access the secondary storage device, transfer the existing memory page to this device, locate the required page on the disk, and retrieve it back into memory is extremely long compared to the CPU cycle time, it is desirable for the cache to inform the CPU of the problem and allow the CPU to proceed with other instructions while the page swap occurs, if possible. This is typically performed by the operating system, which will context switch to another waiting, unrelated process.
Due to the hardware cost of such a system, the virtual-to-physical address translation cannot occur in the primary cache. Instead, it is expected that some higher level of memory, perhaps the level just before main memory, will handle the translation of virtual addresses into physical addresses. When a page fault occurs at this level of memory, the CPU is informed via the TRAPD or TRAPI signal. The CPU then handles the interrupt by branching to the appropriate trap vector. It is presumed that the operating system has installed code at the appropriate trap vector to handle page faults. The higher levels of memory will send "DONE" signals all the way down to the primary cache, which will recover from its stall and lower the MISS line as if it had the correct data. The cache must then be re-validated through a flush of the incorrect address. The CPU will lower the STALL and ACK in response to the primary cache lowering its MISS, and will prevent it from going high again in response to the incoming TRAP.
Typically, the CPU, upon receiving the TRAP, will perform instructions which don't involve the memory location which page faulted, and, when the page is finally available, will re-issue the request. The CPU contains pipeline stages which enable it to re-issue a LOAD or STORE which result in a page fault.
The exact behavior of the CPU in response to a memory page fault depends on the contents of the CPU pipeline and the state of the caches at the time the page fault occurs.
As mentioned earlier, the cache memory hierarchy has its own critical paths. The most critical of these is the path from address generation at the CPU to data reception by the CPU.
Figure 16.1 is a timing diagram of data cache LOAD operations. This timing diagram is based on the back-annotated (post-route) netlists for the cache controller, instruction decoder, and datapath chips. The vertical timing lines represent synchronized clock phase 1. Slightly after phase 1 of the first cycle, the CPU puts address 20 (hex) on the ABUS (Table 16.1).
It arrives at the data cache controller during phase 2 where it passes through the master of pipeline latch 1. The WDC and VDA lines are stable prior to the address. On the DCC, the tag RAM receives its inputs (address and data) from the master of pipeline latch 1, while the slave is used to feed the comparator. The tag RAM read access time is approximately 500 ps.
After the cache RAMs supply the data to the CPU, the only remaining task for the cache is to inform the CPU that the data is available and to re-synchronize with the CPU's pipeline.
The situation is more complicated if the cache row corresponding to the cache access is marked as dirty. If a miss occurs and the cache row is dirty, the primary cache must send the current contents of that row to the secondary cache before overwriting it with the data requested by the CPU.
Figure 16.2 is an example of code that would result in this condition. The first line of code stores the contents of register 2 into cache row 2 (the row is calculated by bits 4 through 8 of the address). The corresponding tag would be 0, and the dirty bit would be set to indicate that the CPU has changed the contents of this address and that the higher levels of memory are out of date.
The two ADDI instructions are used to set register 3 to 3FFFFE20 (the use of two instructions is necessary since no F-RISC instructions accept 32 bit literal values). Finally, the LOAD instruction should fetch the contents of 3FFFFE20 into register 1.
ADDI R3=0+FE20 ; the ADDI instructions are used to
ADDI R3=0+3FFF /LDH ; assemble 3FFFFE20 as the address for the LOAD
LOAD R1=[0+R3] ; put the contents of 3FFFFE20 into R1
3FFFFE20 corresponds to cache row 2 and tag value 1FFFFF. Since row 2 previously held tag 0, a miss will occur. Since the dirty bit for row 2 is set, a copyback must first take place.
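The row and tag quoted here can be checked directly, taking the row as bits 4 through 8 of the byte address (per the text above) and the tag as the bits above them.

```python
# Decompose the example address into cache row and tag
addr = 0x3FFFFE20
row = (addr >> 4) & 0x1F     # 5-bit row: bits 4 through 8
tag = addr >> 9              # remaining high-order bits
assert row == 2 and tag == 0x1FFFFF
```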
Figure 16.3 is the timing diagram for this example. The STORE request is received by the primary cache at time 9375. In order to show the worst case, only one cycle of latency is allowed on this timing diagram between the STORE and subsequent LOAD. The LOAD request is received at time 11375.
Figure 16.4 is a timing diagram showing consecutive STORE instructions. When a STORE is to take place, the instruction decoder signals the cache controller by asserting the WDC signal. Since the signal is derived from the instruction word and can be sent directly from the instruction decoder rather than the datapath chips, the signal arrives a few hundred picoseconds before the address (at time 9075 in this example).
Every STORE instruction is allocated two cycles by the CPU. The second cycle is necessary because a STORE requires a read from and a write to the tag RAM.
For the first of the two cycles, the cache controller will be in the READ state. While in this state, the cache controller checks the tag RAM in order to determine whether a hit has occurred. As far as the cache controller is concerned, the first half of a STORE instruction proceeds identically to a LOAD instruction.
The cache controller latches the address from the CPU during the first half of the STORE, so the CPU does not have to keep the address stable for two cycles. During the second cycle the comparator calculates the result.
The instruction cache timing is, in most respects, similar to the timing of the data cache during a LOAD. This is particularly true when a BRANCH occurs.
The instruction cache controller contains a remote program counter (RPC) which is used to generate addresses to fetch and send to the CPU. This occurs without any intervention from the datapath or instruction decoder. In the event of a BRANCH, the address is received off of the ABUS, as in the data cache.
Unlike in the data cache, it is not necessary to delay the data sent to the CPU using the HOLD signal, since the instruction cache timing is much more constrained.
When the CPU starts up, a "phantom" BRANCH to location 20hex is injected into the pipeline. Figure 16.5 illustrates how such a BRANCH might take place. As in the data cache, the target address is expected to be available at the cache controller at approximately 375 ps after "phase 1" (simulation time 9375). The actual BRANCH signal arrives approximately a phase earlier.
The timing of the instruction cache is more critical than that in the data cache. The architecture was designed to support a byte-operations chip in the data cache; by not including it, the timing in the data cache became fairly relaxed. The instruction cache has only 1850 ps - 2100 ps in which to perform a fetch, versus 2250 ps in the data cache. Bits 3-7 of the instruction word must arrive at the instruction decoder a phase earlier than the remaining 27 bits.
In order to allow bits 3-7 (the "fast" bits) to arrive more quickly, the two RAMs which supply these bits to the instruction decoder were placed as close to the ID as possible without increasing the distance from the instruction cache controller.
If the CPU determines that the request to the cache can not be flushed, it must stall, and will assert the STALLM line, which is shared by both caches.
Upon receiving STALLM, each cache will move into the MISS state. At the time this occurs, neither cache knows whether or not it is the cache which caused the stall. In order to inform the appropriate cache that it is responsible for the stall, the instruction decoder will assert the appropriate acknowledgment line (either ACKI or ACKD).
The cache that receives both the ACK and the STALLM will progress through the normal miss cycle as previously described. The other cache will behave almost identically, but will skip the WAIT state, thus preventing any cache state information from being overwritten. This cache will move directly into the RECOVER state and, one cycle later, will enter the STALL state, where it will idle while awaiting STALLM to be de-asserted. Since the pipeline rotate occurs only in the RECOVER state (rather than in the STALL state), the pipeline in the non-stalled cache will be correct when the CPU recovers from the stall.
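The two state sequences can be summarized in a short sketch (my reading of the behavior described above, not generated from the real controller netlist):

```python
# State sequences followed by the two caches after STALLM arrives.
# The acknowledged cache runs the full miss cycle; the other skips
# WAIT (so no cache state is overwritten) and idles in STALL.

def stall_sequence(received_ack):
    if received_ack:
        return ["MISS", "WAIT", "RECOVER", "STALL"]
    # Non-acknowledged cache: straight to RECOVER, where the pipeline
    # rotate still occurs, then one cycle later into STALL.
    return ["MISS", "RECOVER", "STALL"]

assert "WAIT" in stall_sequence(True)
assert "WAIT" not in stall_sequence(False)
assert stall_sequence(False)[-1] == "STALL"
```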
When a cache determines that a miss has occurred and that it will not be able to satisfy the CPU's request in the time allotted, the cache controller will assert the appropriate MISS line (MISSI for the instruction cache, or MISSD for the data cache).
One of the most important responsibilities of the cache is to enable the processor to correctly start up. When the processor is powered on, or reset, it needs to be fed the appropriate start-up instructions, and the data cache must be invalidated or pre-loaded with valid data.
When the processor is initialized, it inserts an unconditional BRANCH to location 20hex into the pipeline. It is the responsibility of the instruction cache to fetch this instruction upon receiving the BRANCH signal and the address.
Figure 16.8 illustrates the timing at the instruction cache controller during processor start-up. The cache controller will receive the branch request and must realize that a miss must occur, regardless of whether the tag in the tag RAM accidentally matches the tag of the start-up address (0). This is accomplished through coordination with the secondary cache, since too little handshaking exists between the CPU and the cache to enable this to be self-contained.
The secondary cache will receive the global RESET line (as well as all external trap and interrupt lines) and is responsible for initializing the CPU and the cache in the proper sequence.
Figure 16.9 illustrates the operation of the instruction cache during a page fault or during a trap which happens to occur coincident with a secondary cache transaction. The cache must take special measures to preserve the integrity of the tag RAM during such an event. When a page fault occurs, at least one primary cache (the one corresponding to the fault) is awaiting data from the secondary cache.
The primary cache will be in the WAIT state, with the tag RAM and cache RAM WRITE signals asserted. The cache RAMs will be performing a wide WRITE, awaiting the data from the secondary cache. The tag RAM will be writing in the new tag from the pipeline (originating from the CPU) along with the appropriate value of DIRTY. The old tag will have already been sent to the secondary cache during the READ stage of that memory access cycle.
When the trap occurs (presumably at the main memory level of the memory hierarchy in the case of a page fault), the trap is sent to the secondary cache. The secondary cache will then de-assert the L2VALID line. This bit is stored in the appropriate row of the tag RAM, along with the appropriate tag. If the bit is set to "valid," then future cache operations on that tag will proceed as normal. If, however, the data transfer from the secondary cache is interrupted by a trap, then the secondary cache sets the bit to "invalid," and if another operation takes place on that tag, it automatically causes a miss to take place.
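The effect of the L2VALID bit on a later lookup can be modeled as follows (the data layout here is a hypothetical stand-in for the actual tag RAM row):

```python
# Illustrative model of the L2VALID bit stored with each tag.
# An entry marked invalid forces a miss even when the tag matches.

def lookup(tag_ram, index, tag):
    entry_tag, l2valid = tag_ram[index]
    # A matching tag only counts as a hit if the transfer that filled
    # the line was not interrupted by a trap (L2VALID still set).
    return entry_tag == tag and l2valid

ram = {7: (0x3C, True)}
assert lookup(ram, 7, 0x3C) is True
ram[7] = (0x3C, False)                 # secondary cache de-asserted L2VALID
assert lookup(ram, 7, 0x3C) is False   # forced miss despite matching tag
```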
In the event that a STORE into the data cache caused the page fault, it is questionable as to whether the transaction should be interrupted. If the cache were to simply mark the tag as invalid, the data stored by the CPU would be lost, and the CPU would have no way of knowing about it. Since STOREs are comparatively rare, and STORE misses even more so, the best decision is simply to stall the processor until the primary cache has valid data.
Since it takes approximately 750 ps to write into the tag RAM, and the data should be stable for a considerable period before that, the secondary cache should wait two cycles after deasserting L2VALID before sending the trap signal through to the primary caches and CPU.
The primary cache responds to the trap signal by resetting to the READ state. The MISS signals may be spuriously asserted by the primary cache while the trap is held high (the trap is tied to the INIT signal pad), but the secondary cache has enough information to ignore them, and the CPU ignores misses which occur while processing a trap.
The design of the F-RISC / G prototype's primary cache imposes certain constraints on the design of the secondary cache.
Figure 17.1 shows a block-diagram for a possible secondary cache configuration. A 32 kB Harvard architecture is assumed. Pipeline latches are included in order to enable the secondary cache to recover addresses in the event that a secondary cache miss occurs on an address that is eventually determined to be needed by the primary cache. (By the time the L2MISS signal issued by the primary cache reaches the secondary cache, the secondary cache may have received two additional addresses. If an additional valid address is received before the correct data for the previous address is fetched from either the secondary cache data RAM or the tertiary cache, and the data for the previous address is needed by the primary cache, then the address must be stored in the secondary cache as the primary cache will not re-send it - when the primary cache determines the data is needed, it sends the address of the data already present in the primary cache, instead.)
A pipeline latch is needed on the data RAM outputs in order to handle primary cache copyback situations.
Figure 17.2 illustrates the interaction of the F-RISC / G caches during a load copyback. The primary cache sends an address to the secondary cache before it is determined whether the primary cache needs the address. By the time the miss signal is sent to the secondary cache, assuming the secondary cache has not received additional valid addresses (the primary cache will assert the L2VDA signal when a valid address is on the bus), the secondary cache has already had at least a cycle to perform a read. The secondary cache must finish the read, and, using the copyback address and data which is sent to the secondary cache following the L2MISS signal, perform a write. While the write is being performed, the data read from the secondary cache must be latched. Once the data on the bidirectional bus is no longer needed, the secondary cache can assert the L2DONE signal and put the data on the bus (the data should be on the bus for a phase before L2DONE is asserted.)
It is important to note that the five cycle mean access time for the secondary cache was based on calculations for the stall component of CPI. Therefore, the required five cycle limit implies that, on average, accesses to the secondary cache result in a stall of only five cycles. Since, in the event of a primary cache hit, the data is required at the CPU at approximately the same time the secondary cache receives the miss signal in the event of a primary cache miss, the five cycles allotted to the secondary cache begin approximately when the secondary cache receives the L2MISS signal. This means that, on average, a primary cache read miss has 7 ns to be completed. (The data cache has an additional phase, while the instruction cache fast bits have one phase fewer).
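The relationship between the five-cycle mean access time and the stall component of CPI can be illustrated with a simple calculation (the miss rate below is an assumed figure for illustration only; the five-cycle penalty is from the text):

```python
# Back-of-the-envelope stall CPI: average stall cycles added per
# instruction by primary cache misses serviced by the secondary cache.

def stall_cpi(miss_rate, mean_penalty_cycles):
    return miss_rate * mean_penalty_cycles

# e.g. an assumed 4% primary miss rate with the five-cycle mean penalty
assert abs(stall_cpi(0.04, 5) - 0.20) < 1e-12
```

The point of the budget is the converse: given a target stall CPI and a measured miss rate, the mean secondary cache access time must not exceed the quotient of the two.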
The goal of the testing schemes for the cache chips is to provide an exhaustive static testing capability of on-chip circuitry as well as all off-chip drivers and receivers. At-speed testing of the core circuitry and critical I/O paths is also a necessity. Other design goals are:
Bob Philhower, in his doctoral thesis [PHIL93], details several alternative testing schemes for high-speed circuitry, so these alternatives will merely be summarized here. [Maie94] also contains information on Philhower's scheme and on the scheme implemented for the testing of the cache RAM.
Philhower determined that an LSSD latch would require 50% more power and 167% more transistors than the non-LSSD D-latch in the CML circuitry used in the F-RISC/G prototype. Master-slave latches would be similarly affected.
There is a recently adopted ANSI/IEEE standard 1149.1-1990 [Maun86; Maun92; Webe92] which allows full static testing of chips with lower transistor penalties than LSSD designs. Philhower discusses this standard in his thesis, and determined that while the IEEE standard was not a reasonable choice for testing low-yield parts, modifications were possible that would allow the use of even fewer transistors, and would also allow at-speed testing. An advantage of boundary scan over LSSD is that rather than replacing all on-chip latches with LSSD devices (which, at an extreme, could include modifying any register files or RAM blocks to allow serial loading), only the I/O drivers are modified.
The scheme proposed and implemented by Philhower has several key features:
There are, however, some problems with utilizing his scheme in the cache RAM chip:
Although the scheme used in the datapath and instruction decoder chips is unsuitable for use in the cache RAM, many of the ideas first implemented in these chips can be modified for use in the RAM chips. Figure 19.1 shows the receiver used in Philhower's boundary scan scheme (the "core boundary scan scheme"). Figure 19.3 shows a schematic for the driver used in the core boundary scan scheme. Figure 19.4 is a schematic of the overall boundary scan control system used in the core CPU chips.
The Philhower scheme contains several key modifications to the ANSI/IEEE standard. Perhaps the most important of these is the capability to perform speed testing of on-chip circuitry. In order to accomplish this, the scheme takes advantage of the four-phase clocking available on-chip in the instruction decoder and datapath chips.
The idea behind the at-speed testing in this scheme is that when a given clock phase is asserted the input pattern is presented to the on-chip circuitry. On some other clock phase, the outputs of the core are sampled. The time between clock phases is known, so all that remains is to scan the sampled data off chip for examination. Figure 19.2 illustrates this type of at-speed testing. If the sampled pattern is as expected, it is clear that the on-chip circuitry is capable of operating in the allowed time period.
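A minimal conceptual model of this at-speed test, with `core` standing in for the circuit under test (all names here are illustrative, not from the actual scheme):

```python
# Present a pattern on one clock phase, sample the core outputs on a
# later phase, then scan the sample out for comparison. Because the
# separation between the phases is known, a matching sample proves
# the core settled within that time window.

def at_speed_test(core, pattern, expected):
    presented = pattern          # asserted at the "present" phase
    sampled = core(presented)    # captured at the "sample" phase
    return sampled == expected   # scanned out and checked off-chip

inverter_core = lambda bits: [1 - b for b in bits]
assert at_speed_test(inverter_core, [0, 1, 1], [1, 0, 0])
assert not at_speed_test(inverter_core, [0, 1, 1], [0, 1, 1])
```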
Philhower proposed using a master - slave latch as a sampling element.
The scheme further allows any clock phase to be selected for either pattern presentation or pattern sampling, allows the first or second occurrence of a given phase to be used to trigger sampling, and allows a selectable offset from the clock phase to be used to trigger pattern presentation and sampling.
This scheme allows a great deal of flexibility, but requires a large amount of logic to implement. Besides the obvious overhead of supplying a four-phase clock generator, four-input multiplexors are required for both the pattern present and pattern sample circuitry to select a clock phase, and further circuitry is required to select the delay offsets from the selected clock phases. Still more circuitry is required to select from between the first or second occurrence of the clock phase used to trigger the pattern sample signal.
Another feature of this design is that the scan latches are located in the pad ring, near the drivers and receivers. As the scan latches which act as a shift register are also used to sample core outputs, the scan clock received by some latches may be delayed by the pattern sampling logic. As all scan latches are clocked in parallel, skew becomes an important issue; a skew of one gate delay between adjacent latches can cause improper scanning operation.
The overall idea behind the cache RAM testing scheme is to use a limited number of boundary scan latches and a minimum amount of control logic to allow access time and functional testing of the RAM.
As shown in the block diagram (Figure 12.1), the RAM chip contains a large number of pads. d[0:63] are the bi-directional pads which are used for communication with the secondary (L2) cache. Since L2 cache transactions are, by design, relatively rare, the speed of this 64 bit wide path is not as critical as that of the four bit path, di[0:3] and do[0:3]. The four bit path is used for communication with the core CPU (specifically a datapath chip). Due to the large quantity of pads required on the RAM chip and current yield limitations it is not feasible to include a full boundary scan receiver or driver in each one. This is likely to be a problem in any exotic technology, as pin-outs tend to remain relatively constant even as RAM dimensions decrease.
Continuous Testing Mode
Continuous mode operation is one of the two ways in which the RAM chip can be tested. In this mode, on-chip circuitry is used to generate patterns with which the cache memory blocks can be filled. The five bit counter is used to generate consecutive row addresses (each memory block contains 32 rows), and the eight bit rotator is used to produce a four-bit pattern which can be used to fill one of the four nibbles in any of the four register files.
In addition to the counter and rotator, four other scan latches can be loaded and used to provide a block and nibble address.
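The two pattern generators can be sketched as follows (the rotator seed and the choice of which four bits feed the fill pattern are my assumptions; only the widths come from the text):

```python
# Continuous-mode pattern generators: a five-bit row counter (each
# memory block has 32 rows) and an eight-bit rotator whose low four
# bits supply the nibble fill pattern.

def row_addresses():
    a = 0
    while True:
        yield a
        a = (a + 1) % 32          # five-bit counter wraps at 32

def rotator_nibbles(seed=0b10010110):
    r = seed
    while True:
        yield r & 0xF                       # low nibble as fill pattern
        r = ((r << 1) | (r >> 7)) & 0xFF    # rotate left by one bit

rows = row_addresses()
seq = [next(rows) for _ in range(33)]
assert seq[:3] == [0, 1, 2] and seq[32] == 0   # wrap-around after row 31
nibs = rotator_nibbles()
assert all(0 <= next(nibs) < 16 for _ in range(8))
```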
Once the cache memory blocks are loaded with test patterns, the test circuitry can be switched to read mode. Any of the four bits do[0:3] may be viewed on the scope output pad (the bit to be viewed is externally selectable). This mode is particularly useful for generating scope output which can be used to quickly confirm that large portions of a chip work.
Single Shot Mode
A single shot test consists of applying one set of inputs to the core circuitry synchronously to a high speed clock and sampling the outputs of the core circuitry synchronously to a second edge on that same clock. Possible operations include reading a single address from the core memory onto either the 4 or 64 bit data paths, writing data to a single address, or changing a control line (HOLD, WIDE, or WRITE).
This type of test is similar to the method used in the F-RISC/G core chips and cache controller. The chief disadvantage of this test is that it makes exhaustive testing of all memory cells tedious. Since the results are stored in the driver latches and must be shifted out serially, the results are not satisfying in the sense of producing clear, scope-measurable output. This type of test is useful, however, for performing tests of the MCM traces, the L2 drivers and receivers, and for testing worst case access times (in which the row, block, and nibble addresses do not increment consecutively.)
The boundary scan system used on the cache RAM chip requires three distinct types of I / O pads. The I / O pads used for communication with the L2 cache consist of 64 bidirectional pads, each of which consists of a tri-state driver and receiver connected to each other via the pad. This allows the signal from the driver to be used as an input to the receiver if no other signal is present at the pad.
Figure 20.3 is a schematic representation of the L2 Driver / Receiver circuit.
The connection between the driver output and receiver input allows static testing of the L2 drivers and receivers. The idea is that a pattern is loaded into the RAM, output onto the drivers, and, through the connection, read back into the receivers.
In addition to these driver / receiver pads, a special boundary scan receiver is used. This receiver is used for the address pads, 4-bit data input pads, and control signals. The purpose of these receivers is to capture inputs from the MCM during MCM testing, and to supply surrogate inputs to the core during die testing.
A large resistor is used to connect the secondary address source (the scan latch) to each of the pads. If a signal is asserted on the pad externally, it will overwhelm this weak, secondary signal. This resistor allows testing of the pad path through the multiplexor. The pattern is inverted as it passes through the resistor, allowing us to test whether or not the multiplexor has switched.
Table 20.1 summarizes the control signals which need to be supplied to the boundary scan receivers.
Figure 20.4 shows, schematically, the boundary scan receiver. Two receivers are shown connected to each other on the scan chain. As can be seen in this figure, the transparent latch and I / O receiver are laid out in a single pad cell, while the scan latch is a separate cell. This differs from the scheme used in the instruction decoder and datapath chips, which use receiver cells with embedded scan latches.
|INP_SEL||This signal selects the inputs to the multiplexors on the inputs of the transparent latches. For normal (non-testing) operation, and to test the pad signal path through the input latch and multiplexor, the INP_SEL signal must be set to select the pads (low). For die testing, the INP_SEL signal must select the outputs of the scan latches (high). For MCM testing, the INP_SEL signal must select the pads.|
|PP (Pattern Present)||This is the write signal for the transparent latches. This signal must be asserted to present the patterns stored in the scan chain or from the pads to the core.|
|CSC (Close Scan Chain)||This signal is used to "close the scan chain," causing each scan latch to receive its input from the previous latch on the scan chain rather than from the pad latches. This signal should be set to select the pad latch outputs for MCM testing (in order to allow the scan chain to capture the pad inputs.)|
|PS (Pattern Sample)||Used to latch in the incoming pattern during an MCM test. The external scan clock line generates this signal if in the TEST state during an MCM test.|
|SC (Scan Clock)||This is an externally provided signal used to trigger the master-slave scan latches.|
An additional special I / O cell used in this boundary scan scheme is the boundary scan driver. This driver is used only on the four-bit data output path. It is used both to put signals on the bus to the datapath and to sample these signals during testing. This allows testing of the drivers.
The drivers consist, logically, of a multiplexor and a master-slave latch. The latch has a multiplexor on its input which can be selected to allow either of two inputs to be latched. There is also a transparent latch which can be used to latch the data output from the register files, even if the input address changes. The transparent latch contains pull-ups, allowing it to be used to receive the tri-state signals generated for use on the 4-bit bus.
Like the driver used in the core CPU chips, this driver does not use an extra multiplexor to allow the contents of the scan latch to be output on the MCM traces for MCM trace testing. The reason that this multiplexor is not necessary is that the core can usually be set to a state to produce the desired outputs. In the cache chip, this is clearly always the case.
Table 20.2 lists the control signals used in the boundary scan drivers. Figure 20.5 shows the logical representation of the drivers. The tri-state receivers, off-chip drivers, and D-latches are laid out in a combined pad cell, while the scan latch is located in the core.
|PS (Pattern Sample)||This is the write signal for the scan latches when used to sample core outputs. In this configuration, the outputs from the core feed the output drivers, and feed back into the scan latches.|
|CSC (Close Scan Chain)||This signal is used to "close the scan chain," causing each scan latch to receive its input from the previous latch on the scan chain rather than from the output drivers. When CSC is asserted, scanning operation can take place.|
|HOLD||This signal is the same signal used to latch the outputs of the register file within the register file macro. It is used here to hold the outputs, even if the address changes, thereby changing the 4 bits selected for output on the 4-bit data path.|
|SC (Scan Clock)||This is an externally provided signal used to trigger the master-slave scan latches. When used to trigger the scan latches dedicated to the four-bit output path, this signal must be OR'd with PS.|
It is important that the feedback path from the pad to the scan latch be as short as possible in order to minimize capacitive loading on the driver. This must be taken into account during routing, and is a possible reason to include scan latches in the pad ring as was done in the Philhower scheme.
The final special I/O cell required for the boundary scan scheme is a clock receiver for the SCAN CLOCK and SCAN signals. These signals arrive single-ended at the pads, and have slow rise times (and thus are more sensitive to noise) which could cause deleterious effects to the state of the chip undergoing test. The clock receiver pad converts the single-ended signal to full differential, and contains a Schmitt trigger to prevent glitches near the high and low transition points from causing multiple logical transitions.
Aside from the capability to test the core circuitry, several other critical tests are made possible by this testing scheme, among them the ability to test the MCM traces, the drivers and receivers, and the L2 data path.
Using the capabilities provided by the driver and receiver circuitry, it is possible to test all of the MCM traces which connect to the cache RAM dies. In order to test the control or four-bit input traces, it is necessary only to select the pads rather than the scan latches for input during single-shot testing.
The chips which are to provide these signals must first be set to output a test vector on the appropriate drivers using the testing scheme provided by those chips. The cache RAMs are then set to perform a single shot test, and the test is started. When the pattern sampling in the drivers normally takes place, the scan latches in the receivers scan as well. As a result, the signals arriving on the receivers are latched by the receivers, and can be scanned out for analysis.
In order to test the output latches on the four-bit wide path, signals which are to be output on the drivers must be loaded into the register file using the four-bit data path and the die test mode of operation.
A single shot test is then initiated, and the contents of the selected register file will be presented to the MCM traces.
In order to test the L2 traces it is necessary only to perform a single shot test with the WIDE control bit set.
It is important to be able to test the off-chip drivers and receivers without mounting the die on the MCM. In order to do this, the weak coupling between the scan latch and the pad in the drivers and receivers is used.
To test the boundary scan receivers, the pad inputs to the input multiplexors are selected and a single shot test is performed. The contents of the scan latches will then proceed through the receivers, mux, D-latch, and into the scan latch through the inverting resistor. The result is that if the input path is functioning properly, the contents of the scan latches in the receivers will be inverted.
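The expected outcome of this receiver self-test is simply the bitwise inverse of what was loaded, which can be expressed as a one-line check (a toy model of the behavior described above):

```python
# Receiver self-test: with the pad inputs selected, the scan-latch
# contents feed back through the weak inverting resistor path, so a
# working receiver path returns the inverse of the loaded pattern.

def receiver_self_test_expected(scan_bits):
    return [1 - b for b in scan_bits]   # inversion through the weak path

loaded = [1, 0, 1, 1]
assert receiver_self_test_expected(loaded) == [0, 1, 0, 0]
```

Any bit position that comes back un-inverted indicates a fault somewhere along that pad's receiver, multiplexor, or latch path.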
To test the boundary scan drivers a single shot test takes place as normal. The outputs of the off-chip drivers are always latched by the driver scan latches if the drivers are functioning normally.
In order to test the L2 Driver / Receiver pads it is necessary to perform several consecutive single shot tests.
First, the register files must be loaded with the desired 64-bit pattern using the four-bit wide data path, four bits at a time. This will require 16 separate test runs, one for each nibble.
Once this is accomplished, a WIDE READ test must be performed. This has the effect of causing the test vector to be read from the register files onto the driver / receiver pads, where it is then available for input from the receivers. A second WIDE READ must next take place, this time with the HOLD control signal asserted. This has the effect of causing the register files to latch the vector, preventing it from changing during the next single shot test, which is a WIDE WRITE, presumably to a new row address. This combination of tests has the effect of reading a 64 bit double-word from a particular row and writing it back into a new row. Single shot or continuous mode testing can then be used to detect if the test vector was transferred correctly.
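The full sequence can be summarized in a toy model (`ram` here is a hypothetical stand-in holding rows of 16 nibbles; the step structure follows the description above):

```python
# Sketch of the L2 driver/receiver test sequence: load a 64-bit
# pattern nibble by nibble, WIDE READ it onto the pads, hold it,
# then WIDE WRITE it into a new row and verify the round trip.

def l2_pad_test(ram, src_row, dst_row, pattern_nibbles):
    # Step 1: load the pattern four bits at a time (16 test runs)
    for i, nib in enumerate(pattern_nibbles):
        ram[src_row][i] = nib
    # Step 2: WIDE READ drives the row onto the driver/receiver pads
    pads = list(ram[src_row])
    # Step 3: WIDE READ with HOLD latches the vector in the files
    held = list(pads)
    # Step 4: WIDE WRITE stores the held vector into a new row
    ram[dst_row] = held
    # Verify: read back and compare against the original pattern
    return ram[dst_row] == list(pattern_nibbles)

ram = {0: [0] * 16, 1: [0] * 16}
assert l2_pad_test(ram, 0, 1, list(range(16)))
```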
Since this testing scheme is simple, the hardware required to implement it is also simple. The testing circuitry can be divided into several sub-blocks: the state machine, the state machine decoder, the counter, the rotator, and the I / O pads.
Boundary scan control is limited to 12 pins. Table 21.1 lists the pins which are intended to be probed by the two Cascade probes. While the signal which controls whether die or MCM trace testing takes place might have been more conveniently placed on the pad ring, the decision was made to put it on the scan chain instead in order to allow the two channel select pads which select a bit for display on the scope to be placed on the pad ring.
|SCAN:||When asserted, scanning operation takes place when the scan clock pulses.|
|BS_IN:||Used to provide data serially to be shifted onto the scan chain.|
|BS_OUT:||Used to read data on the output of the scan chain.|
|SCOPE:||Displays any one signal on the four-bit data out bus, as selected by CH0 and CH1.|
|CH0:||The low order selection bit for the multiplexor which selects between the four data out bus signals.|
|CH1:||The high order selection bit for the multiplexor which selects between the four data out bus signals.|
|HS_CLK:||The high speed clock used for at-speed testing. It also clocks the boundary scan state machine.|
|SCAN_CLK:||The scan clock. It serves the dual use of selecting between read and write when in continuous die test mode.|
|WRITE:||An analog signal used to set the digital delay offset of the write pulse.|
|W_DEL:||Used to view the digital delay offset of the write pulse.|
|C_SYNC:||Generated by roll-over of the address counter in continuous operation. Used to provide a scope sync signal.|
|SS:||When asserted, single-shot operation. Otherwise, continuous operation.|
In addition to the pads shown in this table, NORMAL is used to determine whether normal die operation or testing operation is to take place. This signal requires a special receiver which, in the absence of an external signal, is pulled low. This represents a shortcoming of the testing scheme in that it is possible that a die can pass all tests, and yet not work in normal operation because of a failure in this pad or in the asynchronous logic used to override the internal testing signals.
Controlling the boundary scan test mechanism with a limited number of pads proved to be challenging. Since many control signals are needed on chip, and only twelve pads are available for all testing functions (control, input, and observation), a state machine was designed to control the testing from on chip based only on a few external signals. A decoder is used to produce the needed on-chip signals based on the current testing state. This scheme is similar to that used on the core CPU chips, although the decoding logic is decidedly simpler. Due to the specialized nature of the RAM chip, it was possible to eliminate some complication while imposing only minor testing inconveniences.
A state machine is used by the boundary scan testing system to generate control signals for the drivers, receivers, and on-chip logic. This state machine implements four states: Scan, Test_Run, S.S. (Single shot), and Done. The state machine is driven synchronously by the high speed clock (HSCLOCK). An additional testing state, Normal, is entered asynchronously when the external TEST signal is brought low.
The state of the controller is determined by the SCAN signal (which is provided by the test probe) and the NORMAL pad. The TEST signal is normally pulled up, but should be pulled down when on the MCM after testing is complete.
Table 21.2 summarizes each of the five states. The Normal state is actually not implemented from within the state machine, but is imposed by gating the decoder outputs with the TEST signal. As a result, if this signal goes low at any time, the Normal mode of operation is asynchronously entered.
|Normal||This state is intended for the normal operation of the RAM chip within the F-RISC architecture. In this mode, all core inputs come from the receiver pads, and all core outputs go to the driver pads.|
|Scan||When in this state, the scan chain is closed, and toggling the scan clock results in shifting the boundary scan shift register.|
|Test_Run||When in this mode, a test is run. The exact details of what this entails depend on the contents of the SS pad and the scan clock pad.|
|Single Shot||Upon entering this state, the input pattern is presented to the core circuitry.|
|Done||Upon entering this state, the output pattern is sampled. The pp signal remains high to allow continuous mode operation to work. If a single shot write has taken place, the write latch is cleared to prevent timing problems on succeeding single shot tests.|
The Scan state is the default state during testing mode. Any time the SCAN signal goes high, the state machine enters this state, which is used to conduct scanning operations. At power up, the SCAN signal should be asserted to put the testing control circuitry into a known state.
Upon entering DONE state, the write scan latch is cleared to prevent timing problems on subsequent single-shot tests. (During scanning operation, the write latch would otherwise store the 1, and at the beginning of the next test, a runt write pulse would occur.) During continuous mode the machine is kept in this state while testing proceeds. During single shot mode, the testing halts after the patterns are sampled.
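A toy next-state function captures the behavior described above (the exact transition encoding is my reading of the text, not the real decoder; the real machine is clocked synchronously by HSCLOCK, with Normal imposed asynchronously by gating):

```python
# Model of the four synchronous test states plus the asynchronous
# Normal override. SCAN high always returns the machine to Scan;
# TEST low forces Normal regardless of the synchronous state.

def next_state(state, scan, test):
    if not test:
        return "NORMAL"          # asynchronous override via gating
    if scan:
        return "SCAN"            # SCAN asserted: conduct scanning
    transitions = {
        "SCAN": "TEST_RUN",
        "TEST_RUN": "SS",        # single shot: present input pattern
        "SS": "DONE",            # sample outputs, clear write latch
        "DONE": "DONE",          # idle here until SCAN is raised
    }
    return transitions.get(state, "SCAN")

assert next_state("TEST_RUN", scan=False, test=True) == "SS"
assert next_state("SS", scan=True, test=True) == "SCAN"
assert next_state("DONE", scan=False, test=False) == "NORMAL"
```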
The F-RISC architecture has been implemented in several technologies and several more are planned. As the implementations do not have to support a massive software base, it is permissible to make changes to it to suit research purposes. F-RISC/H, the successor to the current design, has begun to take shape on paper. It is hoped that faster devices, higher device integration, multiple, perhaps deeper, pipelines (using VLIW technology), better CAD tools, and more advanced packaging will allow at least an eight times increase in single node throughput.
The most important lesson to be learned from the F-RISC / G cache design is the necessity to design the entire processor holistically; if the cache is relegated to secondary importance the overall design will suffer. In the F-RISC / G case, the fact that the cache was designed after the CPU design was frozen resulted in tighter than necessary critical paths in the cache, excessive circuit complexity in the cache controllers, and unnecessarily long MCM routing.
In [Przy90], this idea is validated:
"Guideline 8: Design the memory hierarchy in tandem with the CPU. It is an easy trap to fall into: design the CPU, shrink its cycle time as much as possible, then design the best cache possible that can cycle at the same rate. If either the cache or CPU must be designed first, it should be the cache. Look at the resources available to it and determine what attainable cycle time and size pair yield the highest performance. Then build the CPU with whatever resources are left. Better yet, consider the whole problem: the system design. Partition the resources among the various functional units so that they are all matched in cycle time and together they yield the best overall system-level performance."
The current partitioning of the F-RISC/G system was chosen so as to minimize the length of the primary CPU critical paths, and thus maximize CPU clock rate. The partitioning was in large part enforced by economic and technical limitations which may, in the future, become less important. The use of even more exotic packaging alternatives, the ability to produce larger die with more devices per die, and faster devices would all affect the partitioning decisions made in the F-RISC/G system.
The overall processor throughput would be greatly improved if the device yield were higher, allowing larger device integration.
When millions of transistors are available, cache designers tend to incorporate complicated cache architectures rather than dedicating all of the transistors to cache memory. As memory sizes become arbitrarily large, rather than using additional transistors to make a small percentage change in the size of the cache, it is preferable to use the devices to improve the control circuitry so as to make better use of the RAM already available.
Improvements in control logic which can be accomplished when the device budget is high include adding multiple cache sets, branch prediction, and complicated replacement algorithms.
In the low transistor regime, however, it makes more sense to simply increase the size of the cache RAM. This is confirmed by the Dinero simulations.
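The trade-off can be illustrated with a toy model. The constants and the inverse-square-root scaling rule below are illustrative assumptions only, not results taken from the Dinero runs:

```python
import math

# Toy model: assume miss rate falls roughly as 1/sqrt(capacity), a common
# rule of thumb (base_rate and base_size are hypothetical values).
def miss_rate(size_bytes, base_rate=0.20, base_size=512):
    """Estimated miss rate for a cache of the given capacity in bytes."""
    return base_rate * math.sqrt(base_size / size_bytes)

# With a small transistor budget, doubling a 512-byte cache buys far more
# than spending the same devices on a marginal increase of a 64 KB cache.
small_gain = miss_rate(512) - miss_rate(1024)       # 512 B -> 1 KB
large_gain = miss_rate(65536) - miss_rate(69632)    # 64 KB -> 68 KB
print(small_gain > 10 * large_gain)                 # True
```

Under any such diminishing-returns curve, the low-transistor regime is exactly where additional RAM pays off most.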
In the F-RISC/G design, the instruction cache controller contains the Remote Program Counter, as was discussed in Chapter 2. The benefit of including this circuitry on the cache controller is that it reduced the address bus traffic between the CPU and the caches; only on branches does the cache controller need to receive an address from the CPU.
By reducing this traffic, it was possible to have both caches share an address bus, reducing pin-out, device counts, and routing complexity on the datapath chips. An important disadvantage of this combined bus is that the load capacitance on the bus is approximately twice as high as each of two separate buses would have been. This increases the cycle time for each data cache access by approximately 150 ps.
In addition, the inclusion of the remote program counter results in an additional multiplexor delay on the tag RAM and cache RAM address bits. Moreover, if not for the RPC, the low order address bits could pass directly to the cache RAMs, shortening the cache memory cycle time by approximately 300 ps (presuming the tag RAM and comparator could produce a result quickly enough). Alternatively, more RAM chips could be included, slowing the access time but increasing the hit rate.
Aside from splitting the address bus into instruction and data address buses, the RPC could be eliminated if the cache controller contained some of the functionality of the instruction decoder, or, alternatively, if the cache controller functions were merged into the instruction decoder.
If the device yield were sufficient, the cache controller could decode PC-relative unconditional BRANCH instructions immediately. Conditional BRANCHes would require control signals to be exchanged between the datapath and cache controller. Also, BRANCH instructions which do not use PC relative addressing would have to rely entirely on the datapath. This particular solution merely complicates the cache controller while adding further communications complexity between the cache and CPU. The possible advantage is that PC-relative addressing may be so common that an additional cycle of latency for non-PC-relative BRANCH instructions can be tolerated.
The other option is to incorporate a tag RAM and comparator on the instruction decoder, eliminating the instruction cache controller entirely. This would require the address to be communicated from the datapath to the instruction decoder on every cycle, which would do little to improve the communications latency problem. If, however, the comparator and tag RAMs were distributed across the datapath instead, each slice of the comparison could be performed very quickly, and it would be necessary only to accumulate one signal across the four slices, indicating whether a miss occurred. This signal accumulation is similar to what occurs in the F-RISC / G design with the adder carry chain.
In the instruction cache, the inclusion of the remote program counter mandates that the nine low order address bits be routed to the cache controller and, from there, to the cache RAMs. Even in the data cache, which has no need for an RPC (and even if the instruction cache were modified to eliminate its RPC), it would still be necessary to route these address bits through the cache controllers in order to handle misses: the addresses must be saved because, by the time the cache RAMs need them (to allow the secondary cache to write into them), the CPU will have placed other addresses on the bus. In the current implementation, these pipeline latches are located in the cache controller.
If the latches for these nine address bits are moved to the cache RAMs (which are already among the smallest chips in the system) or to the datapath, it may be possible to eliminate a pair of I/O delays, as well as decrease the MCM time-of-flight; this could decrease the length of the primary critical path by approximately 500 ps.
If the pipeline latches are included on the cache RAMs, then the cache RAMs must be informed when a miss has occurred (either by the cache controller or the datapath), and must receive a Valid Address strobe to indicate when to advance the pipeline.
If, on the other hand, the datapath is responsible for maintaining the pipeline, these signals are not necessary. In fact, the datapath already includes program counter history registers which contain the necessary addresses for the instruction cache. At the time the instruction decoder receives MISSI, the PC_I2 register will contain the address which the cache RAMs will need in order to process the miss. For a data cache miss, at the time MISSD arrives at the instruction decoder the RES_D1 register will contain the address which data cache RAMs will need. The pipelines will advance once before the datapath can be informed of the miss.
The datapath already includes a path from RES_D1 through the ALU and out to the address bus. In order to remove the cache controller from the address critical path, it should only be necessary to modify the control circuitry in the CPU.
Before such modifications can be implemented in the datapath and instruction decoder, careful consideration should be given to the effects of such changes on the main CPU critical paths. Such modifications would clearly increase the size of the core CPU chips, and if the increase in size is significant, the cycle time of the core CPU will be adversely affected.
Figure 23.1 is a block diagram for the cache controller given a split bus, removal of the remote program counter and the pipeline latches, and the use of the pipeline latches already present in the datapath.
While such modifications to the chip set appear practical, it is unclear how much circuitry would have to be added to the core CPU to accomplish all of the objectives of the cache control circuitry. Copybacks, cache start-up, and page faults are all fairly complicated, and care must be taken to ensure that the cache and tag RAMs are addressed properly at all times. It may be necessary to maintain some pipeline latches on the cache controller in order to handle the write and valid signals, as well. However, since there would be no RPC, there would no longer be any purpose in the third pipeline stage; the minimum two is all that need be implemented.
It should also be noted that while such modifications immediately speed up the secondary cache memory critical path by eliminating several gate delays (multiplexor and latch delays between the I/O receivers and the tag RAM) and decreasing the size of the die, further work may be necessary to reduce the tag RAM / comparator cycle time sufficiently to prevent it from becoming the primary critical path. As much of the comparator delay is caused by routing through the top standard cell area to the tag RAM, removing the pipeline latches which spatially dominate that portion of the chip would greatly reduce the comparator propagation delay. In addition, if time permitted, the tag RAM could be reduced from 32 bits to 24 bits wide, and the 25% savings in power could be used to increase the speed of the block if necessary.
One method of improving the hit rate of a direct-mapped cache with little additional hardware is the column-associative cache [Agar93].
A column associative cache is essentially a hybrid between a direct-mapped cache and a set-associative cache in that there is sufficient hardware only to check a single tag at a time but there are multiple possible locations in which any block may be stored.
The simplest implementation would allow only two locations to store any particular block. One location would be the normal location where the block would reside in a standard direct-mapped cache. The second location should be easily computed from the address, say by simply inverting one of the line address bits.
When data is to be retrieved from the cache and it is not found in the primary location, the CPU and secondary cache are informed as in the normal direct-mapped cache. While the secondary cache is retrieving the necessary data, the primary cache checks the alternate location. If the data is found, the CPU is informed that the data is available, and the CPU can end its stall.
As a result, there are three possibilities: the data is in the primary location, in which case the CPU need not stall; the data is in the alternate location, in which case the CPU need stall only for one cache RAM cycle (in the F-RISC / G case, one CPU cycle); or the data is in neither location, in which case the CPU must stall for the same amount of time as on a miss in a conventional direct-mapped cache (presuming that the stall time would have been more than a cycle).
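The three outcomes can be sketched as a small model. The line count, tag layout, the bit chosen for rehashing, and the miss penalty are illustrative assumptions; only the stall costs (no stall, one cycle, or the full miss penalty) follow the description above:

```python
LINE_BITS = 5                          # 32-line cache (assumed for illustration)
REHASH_BIT = 1 << (LINE_BITS - 1)      # alternate location: invert one line bit

def access(tags, addr, miss_penalty=10):
    """Return the stall (in cache RAM cycles) for one column-associative access."""
    index = addr & ((1 << LINE_BITS) - 1)
    tag = addr >> LINE_BITS
    alt = index ^ REHASH_BIT                             # second candidate slot
    if tags.get(index) == tag:
        return 0                                         # fast hit: no stall
    if tags.get(alt) == tag:
        tags[index], tags[alt] = tags[alt], tags.get(index)  # swap: next hit is fast
        return 1                                         # slow hit: one extra cycle
    tags[alt] = tags.get(index)                          # victim moves to alternate slot
    tags[index] = tag                                    # fill from the secondary cache
    return miss_penalty

tags = {}
print(access(tags, 0x123))             # cold miss -> 10
print(access(tags, 0x123))             # fast hit  -> 0
```

The swap on a slow hit mirrors the usual column-associative policy of promoting the block to its primary slot so that repeated accesses pay no penalty.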
This idea has particular merit in F-RISC / G since the CPU can handle variable stall times.
Figure 23.2 shows the approximate timing of a slow cache hit in a column associative cache. The data will arrive at the CPU later than if the primary location contained the data, but more quickly than if it had been necessary to go to the secondary cache.
One of the ways that the F-RISC architecture is likely to be expanded in the future is through the implementation of fine-grained parallelism. This would entail adding additional parallel pipelines to the system, each of which is capable of independently processing instructions. Two methods of accomplishing this are superscalar architectures and Very Long Instruction Word (VLIW) architectures.
In each of these architectures, parallel pipelines and additional functional units are added to the processor to enable multiple instruction streams to be executed simultaneously.
In superscalar architectures the hardware will typically examine the incoming instruction stream for code dependencies and is responsible for scheduling instructions for execution. The instructions need not execute in the order in which they occur in the code.
In VLIW architectures, the compiler is largely responsible for determining which instructions can be executed in parallel, and the instruction word is widened to accommodate multiple parallel instructions. Typically there is far less decoding by the hardware, which makes it ideal for low-yield technologies. The negative aspect of VLIW, however, is that it makes it difficult to maintain code compatibility among successive generations of processors.
These types of architectures raise special complications in designing the memory hierarchy.
In any processor with multiple conventional pipes, there might be several instruction fetches occurring simultaneously. In addition, several memory LOADs and STOREs may also be occurring.
The problem becomes significantly more complicated in superscalar systems where the CPU buffers many upcoming instructions and, based on dependencies between instructions, executes the ones it determines can be executed together (out-of-order-issuing).
In order to prevent instruction fetch latencies, the instruction cache must be capable of providing instructions to each pipe on every clock cycle. In the case of the VLIW architecture, a single, very wide instruction would be transferred, while, in the case of a superscalar architecture, several smaller instructions would have to be fetched.
The VLIW case requires a simpler hardware implementation than the superscalar case. A single instruction address would either be transferred from the CPU or generated using a remote program counter. The cache controller circuitry would then address its cache RAMs, each of which would send some bits of the instruction word to the instruction decoding circuitry. The main difference between the VLIW and single-pipe implementations is in the width of the data bus between the cache and the CPU.
A VLIW architecture could be implemented fairly easily (with respect to the cache) if limitations are placed on the types of instructions which can be placed in parallel. Specifically, unless the cache RAMs are multi-ported or interleaved, only one LOAD or STORE instruction can be performed at a time. There is a possible exception, however. If multiple LOADs or STOREs are to be performed to addresses within the same cache line, then they can be executed in parallel if the data and address buses are sufficiently wide. Such an event would probably tend to occur only on consecutive word addresses. For example, a programmer may wish to load two consecutive registers with two consecutive words from memory. If the CPU contains two ALUs, it would then be possible to fetch a 64 bit long word from memory, perform an operation on it, and put it back in memory in three cycles. Such a capability is particularly useful for floating point operations.
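The pairing test reduces to checking that both accesses fall in the same cache line. A minimal sketch, with an assumed 16-byte line size:

```python
LINE_BYTES = 16        # assumed cache line size, for illustration

def can_pair(addr_a, addr_b, line_bytes=LINE_BYTES):
    """Two LOADs/STOREs may issue together only if they touch the same
    cache line (one wide RAM access instead of two ported accesses)."""
    return addr_a // line_bytes == addr_b // line_bytes

print(can_pair(0x100, 0x104))   # consecutive words, same line -> True
print(can_pair(0x10C, 0x110))   # straddles a line boundary    -> False
```

Note that even consecutive word addresses fail the test when they straddle a line boundary, which is why the compiler (or hardware) must verify the line match rather than merely checking for adjacency.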
As the width of the instruction word is increased the cost to add more ALUs and functional units increases linearly. The cost in terms of memory bandwidth is much more severe. Multi-porting the cache is extremely expensive in terms of speed and hardware, but, if multiple simultaneous accesses to memory are not allowed, it will be difficult to make full use of the parallel pipelines. The functional units need data on which to operate, and that data will always originate in the cache.
Due to this memory bottleneck, it may be desirable to increase the cycle time of the cache if doing so would provide a net increase in speed. For example, if double-porting the cache results in less than a doubling of the cache cycle time, the net CPI may decrease enough to yield an overall gain in throughput.
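This trade-off reduces to comparing instructions per second before and after the change. The figures below are hypothetical; the point is only that a slower but double-ported cache can win if it cuts CPI enough:

```python
def throughput_ratio(old_cycle_ps, old_cpi, new_cycle_ps, new_cpi):
    """Relative instructions per second: throughput ~ 1 / (CPI * cycle time)."""
    return (old_cpi * old_cycle_ps) / (new_cpi * new_cycle_ps)

# Hypothetical numbers: double-porting stretches a 1000 ps cache cycle by
# 40% but removes enough memory stalls to cut CPI from 1.6 to 1.1.
print(throughput_ratio(1000, 1.6, 1400, 1.1) > 1.0)   # net win -> True
```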
One of the most promising new packaging technologies is three-dimensional (3-D) packaging. In 3-D packaging, rather than laying out the chips in a single layer on a flat module, they are stacked vertically. Since the chips are much thinner than they are wide or long, the distance between chips is much reduced. If a way can be found to take advantage of this vertical communication distance, then the overall cycle time can be much reduced.
From a practical point of view, one of the most difficult problems with stacking chips vertically rather than distributing them on a surface is that the vertical chip stack has poor thermal qualities.
In order to use 3-D stacking to eliminate this delay, the RAM chips may be combined in a stack with the appropriate cache controller. While this removes only about one clock phase from the critical path (not enough, by itself, to affect overall CPU throughput), the communications delay caused by the vertical separation between chips remains very small as more chips are added, so the CPI may be reduced by including more RAM chips and thereby increasing the cache hit rate.
Figure 24.2 shows the MCM layout given this type of chip stack. An added benefit to stacking the chips this way is that the other communications components of the cache subsystem critical path are significantly reduced as well. Data transfer between the cache RAMs and the CPU requires fewer than two chip edges.
Table 24.1 shows that the estimated critical path delay using this scheme is reduced to 1500 ps. This may be fast enough to eliminate the D1 stage of the CPU pipeline.
|A||Address I/O (datapath):||145||145||45|
|B||Address Transfer (DP to CC):||170||100||< 10|
|C,D||Address I/O (CC):||334||334||90|
|E||Cache RAM Address Transfer (CC to RAM):||< 50||< 50||< 50|
|F||RAM Access Time:||750||750||750|
|G||Data Transfer:||< 50||< 50||< 50|
|Total||< 1499||< 1429||< 995|
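The totals in Table 24.1 can be checked by summing the per-segment delays, treating the "<" entries as their stated maxima:

```python
# Per-segment delays in ps for the three configurations of Table 24.1:
# current MCM layout, cache-RAM/controller stack, and full 3-D stack.
segments = {
    "current":     [145, 170, 334, 50, 750, 50],
    "cache stack": [145, 100, 334, 50, 750, 50],
    "full stack":  [45, 10, 90, 50, 750, 50],
}
totals = {name: sum(delays) for name, delays in segments.items()}
print(totals)   # {'current': 1499, 'cache stack': 1429, 'full stack': 995}
```

The 750 ps RAM access dominates every configuration; the stacking options attack only the transfer and I/O terms.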
Additional speed improvements could be made by stacking the datapath chips (Figure 24.3). It is doubtful that the instruction decoder could be included in the stack due to the complexity of the resulting stack routing.
In this arrangement the CPU critical path would be greatly reduced, which would allow the cycle time to be decreased accordingly. In addition, the address broadcast from the datapath to the caches will increase in speed, resulting in a modest decrease in critical path length. Since this decrease is small, if the CPU cycle time is reduced it may be necessary to remain with a seven stage pipeline. There is a possibility that further gains may be possible by tailoring the drivers and receivers of the chips to take advantage of the reduced load capacitances.
A final 3-D stacking solution would be to incorporate all of the core CPU and cache chips in a single stack. The benefits of this arrangement over the three stack arrangement are difficult to quantify, and depend largely on the quality of the inter-chip route.
Figure 24.4 shows the primary critical path as dark lines, and the comparator critical path as dashed lines. The comparator critical path is limited to approximately 2.5 ns (the exact time depends on which cache is involved).
Table 24.2 gives the path breakdown for this sub-critical path in the current F-RISC/G implementation. Most of the path delay is caused by on-chip logic in the cache controller.
|Address I/O (datapath):|
|Address Transfer (DP to CC):|
|Address I/O (CC):|
|Tag RAM Access Time:|
|Comparator, MUX, Latch Time:|
If the primary critical path is reduced through chip stacking to around 1 ns, this secondary critical path length must be reduced as well, or no benefit is gained. In the single chip stack implementation, the time could probably be reduced to under 1700 ps. Hand crafting and optimizing the layout of the comparator could shave off perhaps another clock phase or so. Still more time can be saved by re-partitioning the cache pipeline (Section 6.1.3, Pipeline Partitioning).
At the time this document was being prepared, a "2-D and a half" package, incorporating two MCMs stacked back to back, was being investigated.
The manner in which virtual memory is supported in the F-RISC / G prototype is inefficient, largely due to compromises made in the cache design. Due to cost, power, and timing constraints, it was impossible to implement a translation look-aside buffer in the primary cache. Doing so would enable the primary cache to perform virtual-to-physical address translations within the normal cache access time as long as the virtual address was in the cache.
Without this support, a higher level of cache memory must make the translation and perform page swapping as necessary. When a single thread (or, equivalently, several threads accessing the same page frame) is being executed, there is little difference between these techniques. When multiple threads, each accessing its own page frames, are being executed in a multi-tasking environment, the F-RISC / G prototype cache will perform very poorly. While the primary cache can store addresses from multiple pages, due to the small size of the cache, each time the processor switches tasks it is likely that the entire cache will need to be swapped to the second cache level.
In the F-RISC / G prototype, the cache is a virtual cache, meaning that virtual addresses, rather than physical addresses, are cached. As a result, each time the operating system switches processes, the virtual addresses in the cache will map to differing physical addresses, resulting in page faults. If each process is given the same range of virtual addresses to work with, then in order to switch processes it is necessary for the operating system to flush the entire cache (via the IOCTRL mechanism, which is, in itself, very inefficient). While the data cache could be flushed with 32 consecutive LOADs or STOREs, flushing the instruction cache without external hardware intervention would require 497 cycles.
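The data cache flush cost follows from sweeping its 32 lines with one access each. A sketch, with the line size assumed for illustration:

```python
LINES, LINE_BYTES = 32, 16    # 32-line data cache; line size assumed

def flush_sequence(base=0):
    """Addresses for the 32 consecutive LOADs that sweep the data cache,
    touching one distinct line each so every entry is evicted."""
    return [base + i * LINE_BYTES for i in range(LINES)]

print(len(flush_sequence()))   # 32
```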
An alternative would be to have external hardware monitor the IOCTRL lines and execute the cache initialization routine, which would invalidate the entire cache in far less time.
 ``Cell Library for Current Mode Logic using an Advanced Bipolar Process,'' (J. F. McDonald, H. J. Greub, T. Yamaguchi, and T. Creedon), I.E.E.E. J. Sol. State Cir., Special Issue on VLSI (D. Bouldin, guest editor), Vol. JSSC-26(#5), pp. 749-762, May, 1991.
 ``F-RISC/I: Fast Reduced Instruction Set Computer with GaAs H-MESFET Implementation," Proc. I.E.E.E. Int. Conf. on Computer Des., (J. F. McDonald, C. K. Tien, C. C. Poon, H. Greub) Boston, MA, (I.E.E.E. Cat. # CH3040-3/91/0000/0293), pp. 293-296, October 14-16, 1991.
 ``F-RISC/G: AlGaAs/GaAs HBT Standard Cell Library, ''Proc. I.E.E.E. Int. Conf. on Computer Des., (J. F. McDonald, K. Nah, R. Philhower, J. S. Vanetten, S. Simmons, V. Tsinker, Maj. J. Loy, and H. Greub), Boston, MA, (I.E.E.E. Cat. # -3/91/0297), pp. 297-300, October, 1991.
 ``Wideband Wafer-Scale Interconnection in a Wafer Scale Hybrid Package for a 1000 MIPS Highly Pipelined GaAs/AlGaAs HBT Reduced Instruction Set Computer,'' Proc. 1992 Int. Conf. on Wafer Scale Integration, ICWSI-4, San Francisco, January 20, 1992, Reprinted Hardbound by Computer Science Press, V. K Jain, and P. W. Wyatt, Eds. [I.E.E.E. CS#2482], pp. 145-154. (J. F. McDonald, R. Philhower, J. S. Van Etten, S. Dabral, K. Nah, and H. Greub).
 ``Bypass Capacitance for WSI/WSHP Applications,'' Proc. Fifth Int. Conf. on WSI, ICWSI93, San Francisco, CA, M. Lea, Ed., I.E.E.E. Computer Soc. Press, pp. 218-228, February, 1993 (J. F. McDonald, H. Greub, R. Philhower, J. Van Etten, K. S. Nah, P. Campbell, C. Maier, Lt. C. J. Loy, P. Li, L. You, and T.-M. Lu).
 ``Fluorinated Parylene as an Interlayer Dielectric for Thin Film MultiChip Modules,'' spring 1992 meeting of the Materials Research Society, Reprinted in Vol. 264 of the MRS Symposium Proceedings, Electronic and Packaging Materials Science VI, Paul S. Ho, K. A. Jackson, C.-Y. Li and G. F. Lipscomb, Eds., pp. 83-90, 1993 (J. F. McDonald, S. Dabral, X. Zhang, W. M. Wu, G.-R. Yang, C. Lang, H. Bakhru, R. Olsen and T.-M. Lu)
 ``A 500ps 32 X 32 Register File Implemented in GaAs/AlGaAs HBT's,'' Proc. I.E.E.E. GaAs Symposium [I.E.E.E. Cat. 93CH3346-4], San Jose, Oct. 1993, pp. 71-75, (J. F. McDonald, K. S. Nah, R. Philhower, and H. Greub).
 ``F-RISC/I: A 32 Bit RISC Processor Implemented in GaAs H-MESFET Super Buffer Logic,'' Proc. I.E.E.E. GaAs Symposium [I.E.E.E. Cat. #93CH3346-4], San Jose, CA, Oct. 1993, pp. 145-148, (J. F. McDonald, C. K. Tien, K. Lewis, R. Philhower, and H. J. Greub).
 ``Frequency Domain (1kHz-40GHz) Characterization of Thin Films for Multichip Module Packaging Technology,'' (J. F. McDonald, W.-T. Liu, S. Cochrane, X.-M. Wu, P. K. Singh, X. Zhang, D. B. Knorr, E. J Rymaszewski, J. M. Borrego, and T.-M. Lu), Elect. Lett., Jan. 20, 1994, Vol. 30(#2), pp. 117-118.
 ``Poly-tetrafluoro-p-xylylene as a Dielectric for Chip and MCM Applications,'' (J. F. McDonald, S. Dabral, G.-Y. Yang, X. Zhang, & T.-M. Lu, J. Vac. Sci. and Technol., B 11(#5), Sept./Oct. 1993, pp. 1825-1830.
 ``Application of a Floating-Random-Walk Algorithm for Extracting Capacitances in a Realistic HBT Fast-RISC RAM Cell,'' (J. F. McDonald, Y. L. Le Coz, R. B. Iverson, H. J. Greub, and P. M. Campbell), Proc. I.E.E.E. VLSI Multi-Layer Interconnect Conf., V-MIC94, Santa Clara, CA, June, 1994, pp. 542-544.
 ``Design of a Package for a High Speed Processor Made with Yield Limited Technology,'' (J. F. McDonald, A. Garg, J. Loy, and H. Greub), Proc. I.E.E.E. Fourth Great Lakes Symposium on VLSI, March 4-5, 1994, Notre Dame University, Indiana, [I.E.E.E. Cat. #94TH0603-1, Comp. Soc. # 5610-02], pp. 110-113.
 ``Wiring Pitch Integrates MCM Wiring Domains,'' (J. F. McDonald, J. Loy, A. Garg, M. Krishnamoorthy), Proc. I.E.E.E. Fourth Great Lakes Symposium on VLSI, March 4-5, 1994, Notre Dame University, Indiana, [I.E.E.E. Cat. #94TH0603-1, Comp. Soc. # 5610-02], pp. 110-113.
 ``Differential Routing of MCMs - CIF: The Ideal Bifurcation Medium,'' (J. F. McDonald, J. Loy, A. Garg, M. Krishnamoorthy), Proc. I.E.E.E. Int. Conf. on Computer Des., Cambridge, MA, [I.E.E.E. Cat. # 94CH35712], pp. 599-603, October 10-12, 1994.
 ``Thermal Design of an Advanced Multichip Module for a RISC Processor,'' (J. F. McDonald, A. Garg, J. Loy, H. Greub, T.-L. Sham), Proc. I.E.E.E. Int. Conf. on Computer Des., Cambridge, MA, [I.E.E.E. Cat. # 94CH35712], pp. 608-611, October 10-12, 1994.
 ``Three Dimensional Stacking with Diamond Sheet Heat Extraction for Subnanosecond Machine Design,'' (J. F. McDonald, H. Greub, A. Garg, P. Campbell, S. Carlough, C. Maier), Proc. 1995 Int. Conf. on Wafer Scale Integration, ICWSI-7, San Francisco, January 20-22, 1995, Reprinted in Hardbound by Society Press, S. K. Tewksbury, and G. Chapman, Eds. [I.E.E.E. CS #2482], pp. 62-71.
 ``Design of a 32-bit Monolithic Microprocessor Based on GaAs H-MESFET Technology,'' in review for I.E.E.E. Transactions on VLSI Systems,'' (J. F. McDonald, C.-K. V. Tien, K. Lewis, H. J. Greub, and T. Tsen).
 ``A Very Wide Bandwidth Digital VCO Implemented in GaAs HBT's Using Frequency Multiplication and Division,'' (J. F. McDonald, P. M. Campbell, H. J. Greub, A. Garg, S. Steidl, C. Maier, and S. Carlough), Proc. 17th Ann. I.E.E.E. GaAs Symposium, San Diego, CA, October 29-November 1, 1995.
 ``Metal-Parylene Interaction Systems,'' (J. F. McDonald, S. Dabral, X. Zhang, B. Wang, G.-R. Yang, and T.-M. Lu), Mat. Res. Soc. Proc., Vol. 381, in ``Low K Dielectrics,'' S. Murarka, T.-M. Lu, T.-S. Kuan, and H. C. Ting, Eds., pp. 205-218, 1995.
 ``Low Dielectric Constant Polymers for On-Chip Interlevel Dielectrics with Copper Metallization,'' (J. F. McDonald, R. J. Gutmann, T. Paul Chow, D. J. Duquette, T.-M. Lu, and S. P. Murarka), Mat. Res. Soc. Proc., Vol. 381, in ``Low K Dielectrics,'' S. Murarka, T.-M. Lu, T.-S. Kuan, and H. C. Ting, Eds., pp. 177-195.
 ``Unterminated Bonds in Parylene-N Films,'' (J. F. McDonald, X. Zhang, B. Wang, and S. Dabral), Semiconductor International, Vol. 14, December 1995, pp. 89-94.
 ``Crystallinity Properties of Parylene N Affecting ILD Applications,'' Paper H2.11 at the ICMCTF 95, reprinted in Thin Sol. Films, Vol. 270, (1-2), pp. 508-511, December 1, 1995.
 ``Improvement of Parylene N Deposition Rate by Electric Field,'' I.E.E.E. Dielectrics for Ultralarge Scale Integration Multilayer Interconnection Conference, DUMIC 96, Santa Clara Mariott Hotel, Santa Clara, CA, pp. 214-221, February 20-21, 1996 (J. F. McDonald, B. Wang, T.-M. Lu, and G. Yang)
 ``Very Fast RISC Processors - Subnanosecond Computing using High Performance HBT Devices and MCM packaging,'' Gov. Microcir. Applic. Conf., GOMAC96, Orlando, Hyatt Orlando Inn, Orlando, Florida, March 29-21, 1996, pp. 217-221, (J. F. McDonald and H. Greub).
 ``Chip Pad Migration is a Key Component in High Performance MCM Design,'' (J. F. McDonald, J. Loy. A. Garg, and M. Krishnamoorthy), Sixth Great Lakes Symposium on VLSI, GLSVLSI96, (I.E.E.E Cat. 96TB100041), Iowa State University, March 22-23, 1996, pp. 96-99.
 ``Fast RISC Design using 100 GHz HBT and Microwave HDI MCM Technology,'' (J. F. McDonald and H. Greub), RADC Workshop on Academic Electronics in New York State, Embassy Suites Hotel, Syracuse, New York, June 13-14, 1996, pp. 259-264.
 ``Dual Damascene Structure Fabrication with Parylene-N as the ILD and Copper as the Interlayer Metal,'' (J. F. McDonald, B. Wang, C. Steinbruchel, and R. Tacito), 13th annu. VLSI MultiLevel Interconnection Conference, VMIC96, Santa Clara Marriott, Santa Clara, CA, June 18-20, 1996, pp. 58-60.
 ``A Floating Random Walk Method for Computing Interconnection Capacitances,'' (J. F. McDonald, Y. L. LeCoz, H. J. Greub, A. Garg, and R. Iverson), 13th annu. VLSI MultiLevel Interconnection Conference, VMIC96, Santa Clara Marriott, Santa Clara, CA, June 18-20, 1996, pp. 230-232.
 Lt. Cmdr. James Loy, "Differential Routing Tools for High Speed GaAs HBT CML Circuits," Ph.D. 1993.
 Robert Philhower, "Spartan RISC Architecture for Yield Limited Technology," Ph.D. 1993.
 Kyung Suc Nah, "An Adaptive Clock Deskew Scheme and a 500 ps 32 by 8 Bit Register File for a High Speed Digital System," Ph.D. 1994.
 C.-K. Vincent Tien, "System Design, Analysis, Implementation and Performance Evaluation of a 32 Bit RISC Processor Based on GaAs HMESFET Technology," Ph.D. 1994.
 Cliff Maier, ``High Speed Microprocessor Cache Memory Hierarchy for Yield Limited Technology,'' Ph.D. August, 1996.
No formal patent applications have been filed during this grant due to lack of funds for legal expenses. However, it is possible that the ideas presented in Appendix C on clock deskew circuitry could qualify for a patent if one were to be submitted.
[Agar93] Agarwal, A. and S. D. Pudar, "Column-associative caches: A technique for reducing the miss rate of direct-mapped caches," Proc. 20th Annual International Symposium on Computer Architecture (ISCA), San Diego, Calif., May 16-19, 1993. Computer Architecture News 21:2 (May), pp. 179-190.
[Beac88] Beach, W. F. and Austin, T. M. "Parylene as dielectric for the next generation of high density circuits," proceedings of the 2nd International SAMPLE Electronics Conference, June 14-16, 1988 pp 25-45.
[Bens95] Benschneider, Bradley J., A. J. Black, W. J. Bowhill, S. M. Britton, D. E. Dever, et al., "A 300-MHz 64-b quad-issue CMOS RISC microprocessor," IEEE Journal of Solid-State Circuits, Vol. 30, No. 11, Nov. 1995, pp. 1203-1214.
[Casc91] Cascade Microtech, Incorporated. "Multicontact high-speed integrated circuit probes." Beaverton, Oregon, 1991.
[Dabr93] S. Dabral, X. Zhang, X. M. Wu, G.-R. Yang, L. You, H. Bakhru, R. Olson, J. A. Moore, T.-M. Lu, and J. F. McDonald, "α,α′,α″,α‴ Poly-tetrafluoro-p-xylene as an interlayer dielectric for thin film multichip modules and integrated circuits," Journal of Vacuum Science and Technology, B 11(5), Sep/Oct 1993.
[Deve91] Devore, Jay S. Probability and Statistics for Engineering and the Sciences, Third Edition. Pacific Grove, California. Brooks / Cole Publishing, 1991.
[Dill88] Dillinger T. E. VLSI Engineering. pp. 624-93, Englewood Cliffs, New Jersey: Prentice Hall, 1988.
[Faus95] Faust, Bruce. "Designing Alpha-based systems." Byte Magazine, pp. 239-240, June 1995
[GE95] G.E. Corporate Research & Development Advanced Electronics Assemblies Program, "Microwave High Density Interconnect Design Guide." February 1995
[Greu90] Greub, H. J. "FRISC - A fast reduced instruction set computer for implementation with advanced bipolar and hybrid wafer scale technology." Ph.D. dissertation, Rensselaer Polytechnic Institute, Troy, New York, December 1990.
[Greu91] Greub, H. J., et. al. "High-performance standard cell library and modeling technique for differential advanced bipolar current tree logic." IEEE Journal of Solid-State Circuits, Vol. 26, No. 5, pp. 749-62, May 1991.
[Hall93] Haller, T. R., et. al. "High frequency performance of GE high density interconnect modules." IEEE Transactions on Components, Hybrids, and Manufacturing Technology, Vol. 16, No. 1, pp. 21-27, February 1993.
[Henn96] Hennessy, J. L., and D. A. Patterson. Computer Architecture: A Quantitative Approach, second edition, San Mateo, California: Morgan Kaufmann, 1996.
[Hill84] Hill, Mark D. and Alan Jay Smith. "Experimental evaluation of on-chip microprocessor cache memories," Proc. Eleventh International Symposium on Computer Architecture, June 1984, Ann Arbor, MI, 1984.
[Kilb62] Kilburn, T., D. B. G. Edwards, M. J. Lanigan, and F. H. Sumner. "One-Level Storage System," IRE Transactions on Electronic Computers, Vol. EC-11, No. 2, pp. 223-236, April 1962.
[Lev95] Lev, Lavi A., A. Charnas, M. Tremblay, A. R. Dalal, B. A. Frederick, et. al., "A 64-b microprocessor with multimedia support," IEEE Journal of Solid-State Circuits, Vol. 30, No. 11, Nov. 1995, pp. 1227-1236/
[Long90] Long, S. I., S. E. Butner. Gallium Arsenide Digital Integrated Circuit Design, New York, McGraw-Hill Publishing Company, 1990.
[Maie94] Maier, C. "A testing scheme for a sub-nanosecond access time static RAM" Masters Thesis, Rensselaer Polytechnic Institute, 1994.
[Maji89] Majid, N., Dabral, S., and J. F. McDonald. "The parylene-aluminum multilayer interconnection system for wafer scale integration and wafer scale hybrid packaging." Journal of Electronic Materials, Vol. 18, No.2, pp. 301-311, 1989.
[Matt70] Mattson, R. L., J. Gecsei, D. R. Slutz, and I. L. Traiger. "Evaluation techniques for storage hierarchies." IBM Systems Journal, 9, pp. 78-117, 1970.
[Maun86] Maunder, C. "Paving the way for testability standards." IEEE Design and Test of Computers, Vol. 3, No. 4, p. 65, 1986.
[Maun92] Maunder, C. M. and R. E. Tulloss. "Testability on TAP." IEEE Spectrum, pp. 34-37, February 1992.
[Nah91] Nah, K., R. Philhower, J. S. Van Etten, S. Simmons, V. Tsinker, J. Loy, H. Greub, and J. F. McDonald. "F-RISC/G: AlGaAs/GaAs HBT standard cell library," Proc. 1991 IEEE International Conference on Computer Design: VLSI In Computers & Processors, pp. 297-300, 1991.
[Nah94] Nah, K. "An adaptive clock deskew scheme and a 500 ps 32 by 8 bit register file for a high speed digital system" Ph.D. Dissertation, Rensselaer Polytechnic Institute, 1994.
[Phil93] Philhower, B. "Spartan RISC architecture for yield-limited technologies" Ph.D. Dissertation, Rensselaer Polytechnic Institute, 1993.
[Przy90] Przybylski, S. A. Cache and Memory Hierarchy Design: A Performance-Directed Approach. San Mateo, California: Morgan Kaufmann, 1990.
[Salm93] Salmon, Linton G. "Evaluation of thin film MCM materials for high-speed applications." IEEE Trans. On Components, Hybrids, and Manufacturing Technology, Vol. 16, No. 4, June 1993.
[Ston90] Stone, Harold S. High Performance Computer Architecture, Second Edition. Reading, Massachusetts. Addison-Wesley, 1990.
[Sze81] Sze, S. M. Physics of Semiconductor Devices. Second Edition, pp. 182-3, New York: John Wiley and Sons, 1981.
[Sze90] Sze, S. M. High-Speed Semiconductor Devices. pp 371-373, New York: John Wiley and Sons, 1990.
[Tien95] Tien, C.-K. "System design analysis, implementation, and testing of a 32-bit GaAs microprocessor" Doctoral Thesis, Rensselaer Polytechnic Institute, 1995.
[Webe92] Weber, S. "JTAG finally becomes an off-the-shelf solution." Electronics, Vol. 65, No. 9, p. 13, 10 August 1992.
[Zhan95] Xin Zhang, "Parylene as an interlayer dielectric,"
Ph.D. Dissertation, Rensselaer Polytechnic Institute, 1995.
 C. Y. Chang, and Francis Kai, GaAs High-Speed Devices, John Wiley, 1994.
 R. Anholt, Electrical and Thermal Characterization of MESFETs, HEMTs, and HBTs, Artech House, 1995.
 D. J. Roulston, Bipolar Semiconductor Devices, McGraw Hill, 1990.
 B. Jalali, and S. J., Pearton, Eds., InP HBTs, Growth, Processing and Applications, Artech House, 1995
 R. Williams, Modern GaAs Processing Techniques, Artech House, 1991.
 U. Ciligiroglu, Systematic Analysis of Bipolar and MOS Transistors, Artech House, 1994.
 F. Ali, and A. Gupta, Eds., HEMTs & HBTs, Artech House, 1991.
 J. W. Mayer and S. S. Lau, Electronic Materials Science for Integrated Circuits in Si and GaAs, Macmillian, 1990.
 N. Kanopoulos, Gallium Arsenide Digital Integrated Circuits, Prentice Hall, 1989.
 S. Long, and S. Butner, Gallium Arsenide Digital Integrated Circuit Design, McGraw Hill, 1990.
 V. Milutinovic, Ed., Microprocessor Design for GaAs Technology, Prentice Hall Advanced Reference Series in Engineering, 1990.
 M. Katevenis, Reduced Instruction Set Computer Architectures, MIT Press, 1984.
 J. R. Ellis, Bulldog: A compiler for VLIW Architectures, MIT Press, 1985.
 S. S. Sapatnekar, and S.-M. Kang, Design Automation for Timing Driven Layout Synthesis, Kluwer Academic Publishers, 1993.
 R. Jain, The Art of Computer Systems Performance Analysis, J. Wiley & Sons, 1991.
 S. A. Przybylski, Cache and Memory Hierarchy Design, Morgan Kaufman Publishers, 1990.
 H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison Wesley Publishers, Inc., 1990.
 D. A. Patterson, & J. L. Hennessy, Computer Organization & Design - The Hardware Software Interface, Morgan Kaufman Publishers, 1994.
 E. J. Rymaszewski, Handbook of Microelectronics Packaging, Van Nostrand, 1990.
 F. E. Gardiol, Lossy Transmission Lines, Artech House,
Optimization of the Register File and Cache RAM Blocks Used in the RPI Datapath and Cache RAM Chips
From the HSCD test structures, we were able to glean more information about the Rockwell interconnect and device performance. This information was back-annotated into our models and CAD tools in order to predict the performance of our circuits. The back-annotated simulations showed that the memory circuits would perform significantly below their required levels (the target access time was 200 ps for the register file and 450 ps for the cache RAM block), due in part to increased parasitic capacitance and in part to degraded device performance. Figure 1 shows a comparison of the register file access times for different interconnect and device models. The latest device models are called "2-sided" and "3-sided" and the most recent interconnect model is called "anisotropic".
Because the register file had tighter performance requirements
than the cache RAM blocks, it was selected for optimization first.
Although any layout improvements from the register file could
be fed into the cache RAM block, separate circuit optimizations
were required due to the differences in size and power. For the
most part, the cache RAM block optimization followed the same
process as used in the register file and was somewhat easier
because of the looser access time requirement. To date, both the register
file and cache RAM blocks have been redesigned to meet their performance
requirements. The register file has a 195 ps READ access time
and the cache RAM block has a 400 ps READ access time. These performance
metrics (along with some safety margin) have been incorporated
into the redesign of the datapath and cache RAM chips.
The optimization process began with layout because the process design rules had changed but the physical design had not been updated. There are three sets of large nodes in the register file and cache RAM blocks, namely the address lines, bit lines, and word lines. The capacitance of these nodes has a direct effect upon performance. From Figure 1, it can be seen that the largest contribution to the access time comes from the memory cells and bitlines, followed by the address drivers and address lines, and finally the word drivers and word lines.
Although the relative contributions to delay were known, the effect of layout optimizations upon each delay component was not. A series of SPICE simulations was performed in which the capacitance of the address, bit, and word lines was varied in order to determine the sensitivity of the circuit delay to each component. The results (shown in Figure 2) indicated that the bit and word lines are the most sensitive, suggesting that the optimization process should focus upon these nodes. Because the bit line capacitance is relatively large (~3X the word line value), the bit lines became the primary focus of layout optimization.
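The shape of this sensitivity analysis can be sketched with a first-order linear model. The capacitance and sensitivity numbers below are hypothetical placeholders, not the extracted Rockwell values; the actual sweeps were performed in SPICE with back-annotated parasitics.

```python
# First-order sketch of the capacitance sensitivity sweep.
# Baseline node capacitances (fF) and sensitivities (ps per fF) are
# illustrative stand-ins for what the SPICE sweeps actually measured.
BASELINE_C = {"bit": 300.0, "word": 100.0, "address": 150.0}
SENSITIVITY = {"bit": 0.35, "word": 0.30, "address": 0.10}

BASE_ACCESS_PS = 250.0  # nominal access time before scaling any node


def access_time(scale, node):
    """Estimate the access time when one node's capacitance is scaled."""
    delta_c = (scale - 1.0) * BASELINE_C[node]
    return BASE_ACCESS_PS + SENSITIVITY[node] * delta_c


def rank_nodes():
    """Rank nodes by access-time impact for a +20% capacitance increase."""
    impact = {n: access_time(1.2, n) - BASE_ACCESS_PS for n in BASELINE_C}
    return sorted(impact, key=impact.get, reverse=True)


if __name__ == "__main__":
    print(rank_nodes())
```

With these placeholder numbers the bit lines dominate, matching the conclusion of the SPICE study.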
Now that the sensitivity of the access time to the node capacitances was known, the emphasis shifted to minimizing capacitance through layout changes. The primary focus was the bit lines in the memory cells. The word lines were also optimized indirectly as a side-effect of the bit line changes. Although the register file had much lower sensitivity to the address lines, they were optimized anyway in order to squeeze out as much performance as possible.
Bit / Word Line Optimizations
A number of memory cell layouts have been progressively developed, some made possible by process design rule changes. To date, eleven distinct memory cell layouts (Figure 3) have been produced, along with numerous variations. The first four iterations produced the most significant improvements in parasitic capacitance, but they alone were not sufficient to meet the target performance numbers. Circuit modifications were then undertaken and a new memory cell was developed (described in the next section).
The original memory cell had several disadvantages which were solved with the addition of metal-3 to the Rockwell process. The primary problem was the parasitic capacitance between the metal-1 bit and the metal-2 word lines. The first iteration placed the top word line in metal-3 to reduce the crossover capacitance. This helped somewhat but it wasn't sufficient. The justification for leaving the lower word line in metal-2 was to avoid the large metal-2/metal-3 via which would be required to connect a metal-3 word line to the metal-1 resistor connection. For the upper word line, this via could be hidden underneath the resistors, but for the lower word line a via would complicate routing and possibly increase the coupling between the lower word line of one row and the upper word line of the next. In the end, it was decided that routing the lower word line in metal-3 was necessary despite the disadvantages, so the second cell iteration was produced.
The next redesign opportunity arose when Rockwell reduced the dimensions of the HBT devices and relaxed the minimum feature sizes. These changes allowed the memory cell to be packed more tightly, creating more room for the bit lines and reducing their parasitic capacitance. The smaller feature sizes also allowed the resistors to be shrunk which became important in later redesigns. The effects of the process/design changes can be clearly seen in the fourth iteration of the memory cell: the resistors and devices are smaller, the devices are placed closer together and the interconnect is routed closer to the devices. Since the core of the cell is now more compact than before, the coupling to the bit lines is reduced because the adjacent structures are further away. More importantly, a smaller core allows the bitline - bitline spacing to be increased. Because the majority of the bitline coupling is with the neighboring bitline, any reduction can significantly improve the overall bit line parasitic capacitance. After the core is redesigned for maximum compactness, the bitline-bitline spacing is adjusted to determine the optimum spacing for minimal parasitic capacitance.
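The benefit of increased bitline-bitline spacing can be estimated with a simple parallel-plate model: treating adjacent bitlines as facing plates, the coupling term falls off roughly as the inverse of the spacing. The dimensions and dielectric constant below are illustrative assumptions, not the actual process values.

```python
# First-order parallel-line estimate of bitline-to-bitline coupling.
EPS_0 = 8.854e-12   # permittivity of free space, F/m
EPS_R = 3.9         # relative permittivity of the dielectric (assumed)


def coupling_cap(length_um, thickness_um, spacing_um):
    """Parallel-plate approximation of line-to-line coupling (farads).
    Fringing fields are ignored, so this understates the true value."""
    area = (length_um * 1e-6) * (thickness_um * 1e-6)
    return EPS_0 * EPS_R * area / (spacing_um * 1e-6)


if __name__ == "__main__":
    # Doubling the bitline-bitline spacing halves this coupling term.
    print(coupling_cap(500.0, 1.0, 2.0))
    print(coupling_cap(500.0, 1.0, 4.0))
```

Even this crude model shows why spacing recovered from a more compact cell core translates directly into lower bitline capacitance.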
Once it became apparent that layout modifications alone would not be sufficient to recover the "lost" performance, a series of SPICE simulations were performed to determine the sensitivity of the register file to component value changes. There are numerous components in the register file which have an impact upon the performance, but several components are particularly important. These are the address decoder resistors, the read/write logic pull-ups and current source resistors, the sense amplifier bitline current source resistor, and the memory cell resistor ratio in the threshold voltage generator. Some components affect several nodes in the circuit with conflicting requirements, presenting a difficult and complex optimization problem.
Address Decoder: Wordline Voltage
The address decoder directly sets both the address line and wordline voltage swings. The wordline is the mechanism by which a row of memory cells is selected and enabled to place their logical values on the bitlines. As a result, the switching time of the wordlines directly impacts the overall access time of the register file. The wordline swing is determined by the total resistance in the address decoder, the ratio of the resistors, and the VBE of the devices.
In Figure 4, the effect of different total decoder resistance values upon the access time and wordline swing are shown for a top:bottom resistor ratio of 1:1. When the total resistance is increased, the wordline swing also grows because the voltage drop across the total resistance increases, forcing the wordline driver base lower and thus the wordline voltage as well. The upper value of the wordline voltage is fixed at VCC-VBE because the base is pulled to VCC when all five of the address decoder Q1s are cut-off. From the second plot in Figure 4, a lower bound on the total resistance of about 420 Ω can be determined which will satisfy the minimum wordline static swing of 850 mV.
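The relationship between decoder resistance and static wordline swing reduces to a simple IR calculation. The decoder current below is a hypothetical placeholder; the real curves came from the SPICE sweeps of Figure 4.

```python
# Illustrative model of static wordline swing vs. total decoder
# resistance. The upper wordline level is fixed at VCC - VBE, so only
# the drop below that level matters here.
I_DECODER = 2.0e-3  # current steered through the decoder resistors (assumed)


def wordline_swing(r_total_ohms):
    """Static swing: the drop across the total decoder resistance pulls
    the wordline-driver base (and hence the wordline) down from its
    fixed upper level of VCC - VBE."""
    return I_DECODER * r_total_ohms


def min_resistance(target_swing=0.850):
    """Smallest total decoder resistance meeting the target static swing."""
    return target_swing / I_DECODER


if __name__ == "__main__":
    print(wordline_swing(440.0))  # swing at a 440-ohm design point
    print(min_resistance())       # lower bound for an 850 mV swing
```

With the assumed 2 mA the bound comes out near 425 Ω, close to the ~420 Ω figure read off the SPICE plots.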
Figure 5 depicts the effect of various decoder resistor ratios for a total resistance of 440 Ω. Although it appears that the wordline swing should not be affected by the resistance ratio, the changing ratio does reduce the current through the Q1 devices. The different current levels in turn affect the voltage drop across the total decoder resistance and thus the wordline swing. This is just one example of the intricate and complex balance between different parts of the register file circuit.
Address Decoder: Address Line Voltage and Current
The address lines are also affected by the resistors in the address decoder. The ratio determines the voltage swing on the address lines which in turn determines the current. The maximum address line voltage is fixed at VCC-VBE but the decoder ratio determines the minimum voltage and thus the total swing. Current flows through the address lines only when the address line is low, hence the current decreases with increasing swing (or, alternatively, the current decreases with decreasing minimum address line voltage).
Read/Write Logic: Bitline Voltage
The read/write logic has a significant effect upon the bitlines, primarily in the WRITE mode. In order to overwrite the state of a memory cell, the read/write logic pushes the bitlines to relatively extreme high and low voltages in order to cut-off and turn-on the memory cell devices. The speed of the WRITE as well as the recovery time are determined primarily by the bitline swing (a larger range results in faster WRITEs but a slower recovery and vice versa). During a READ, the circuit attempts to set the bitlines to a mid-range value. This specifies the low bitline voltage and thus clamps the lower part of the bitline swing.
The read/write logic uses the threshold voltage along with resistive pull-ups and a resistor current source to generate the bitline voltages. The actual bitline potentials depend upon the value of the pull-up resistors and the amount of current flowing through them (set by the resistive current source). For a READ, current flows equally through both resistors, halving the effective resistance and producing a mid-range voltage of Vth − I·R/2. During a WRITE, current flows through only one of the resistors, hence the voltage swing runs from Vth down to Vth − I·R. Because the read/write logic uses the threshold voltage Vth as a reference and power supply, drawing excessive current from the threshold voltage generator can seriously stress the generator circuit and reduce its robustness. For this reason, the current drawn by the read/write logic is limited and should be kept low if possible.
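The voltage generation described above can be sketched as follows. The threshold voltage, pull-up, and current-source values are assumed for illustration; the actual design values came from the sensitivity analyses.

```python
# Sketch of the read/write logic's bitline voltage generation.
V_TH = -1.2        # threshold/reference voltage (assumed value, volts)
R_PULLUP = 400.0   # each pull-up resistor (assumed, ohms)
I_SOURCE = 1.0e-3  # current set by the resistive current source (assumed)


def read_voltages():
    """READ: the source current splits equally through both pull-ups,
    so each bitline sits at the same mid-range level, V_TH - I*R/2."""
    v_mid = V_TH - I_SOURCE * (R_PULLUP / 2.0)
    return v_mid, v_mid


def write_voltages():
    """WRITE: all the current flows through one pull-up, so one bitline
    stays at V_TH while the other drops by the full I*R."""
    return V_TH, V_TH - I_SOURCE * R_PULLUP


if __name__ == "__main__":
    print(read_voltages())   # both bitlines at the mid-range value
    print(write_voltages())  # full swing between the two bitlines
```

Note that the READ level falls exactly halfway between the two WRITE levels, which is what clamps the lower part of the bitline swing during a READ.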
Figure 7 below shows the time required to perform a WRITE and the bitline swing for a range of read/write logic current source resistances (also shown are the static READ access times). As the current source resistance increases, the current through the pull-ups decreases and the bitline swings are reduced. This leads to longer WRITEs and eventually (at higher resistance values) failure to overwrite the memory cell state.
The pull-ups in the read/write logic have an even larger effect upon the WRITE time but can adversely affect the READ times by lowering the low bitline voltage and increasing the total bitline swing. Most importantly, a larger pull-up resistor increases the time required to switch from a WRITE to READ because the internal read/write logic swings are higher and thus more charge must be dissipated to change modes. In the end, however, the choice for the current source resistance was made to reduce the strain on the threshold voltage generator and the pull-up value was optimized for this current. Figure 8 below shows the effect of different pull-up values on the access times and the bitline swing during a WRITE.
Sense Amplifier: Bitline Current
The sense amplifiers contain the current source for the bitlines. By varying the current through the bitlines, the delay due to parasitic capacitance can be significantly reduced. However, care must be taken not to burn out the devices in the memory cells, hence the maximum bitline current is limited.
The bitline current source is simply a high-current device and
a resistor connected between the emitter and VEE. A
bias generator sets the base voltage and produces a constant voltage
drop across the tail resistor, thereby determining the bitline
current. Because the bitlines exhibit the most sensitivity to
capacitance of all the large nets in the register file, they offer
the most opportunity for improvement. By increasing the current
flowing through the bitlines, they can be discharged quickly and
thus improve the switching time. Figure 9 demonstrates the sensitivity
of the register file access time to the bitline current.
After the sensitivity analysis was performed and the circuit components were fine-tuned, it was obvious that component value changes alone would also not be sufficient to meet the performance requirements. In order to correct some of these problems and improve performance, various circuit modifications were explored and analyzed. The optimization process focused upon the static and dynamic circuit performance in order to reduce the static access time. The difference between static and dynamic performance was primarily felt at two locations in the circuit: the wordlines and bitlines.
Wordline Swing / Memory Cell
Because the wordlines provide the means for selecting a row within the register file and determine the bitline swing, their switching time has a direct impact upon the performance of the register file. The layout modifications substantially reduced the parasitic capacitance, and the resistance of the line is minuscule, so the RC delay does not contribute significantly. The device switching speed, however, does contribute and is dominant, hence the swing of the wordline has a direct effect upon the switching time.
As can be seen in Figure 10(b), the wordline swing for static and dynamic signals are significantly different. Because the static swing is higher, the time required to switch the wordline after it has charged fully (due to a "static" address) is greater than if the address had changed in the previous cycle (i.e. a "dynamic" address). The wordline swing determines the swing on the memory cell collector nodes which drive the bitlines, thus when the wordline switching time increases, it directly affects the bitline switching time. Ideally, the static and dynamic swings should be equal, eliminating any difference between access times.
Several clamping circuits were investigated as a way to restrict the high static wordline swing, but severe operating requirements hampered this effort. Some of the problems were the high wordline current (approximately 20 mA), the large 0.8 V drop across the Schottky diodes, and the need to fit the clamp circuit within a small area in order to maintain the original register file dimensions. In the end, no satisfactory circuit was found.
Wordline voltage divider
Because the switching time is actually based upon when the bitlines switch rather than the wordlines, there is the possibility of improving the access time without reducing or limiting the wordline swing. By lowering the internal memory cell swings, the bitline swing is also reduced and thus switches faster when the wordlines start to change.
One drawback to reducing the internal swings was the reduction of the dynamic wordline swing and a corresponding increase in the dynamic access time. However, although the dynamic access time increases significantly, it is still less than the static access time. Since only the longest access time is important from the standpoint of the F-RISC/G datapath chip, the relatively fast dynamic access time of the original design provided no benefit and could be sacrificed for the benefit of the static access time.
To reduce the internal memory cell swings, a simple voltage divider was created in the memory cells by placing a resistor between the wordline and the previous wordline connection point (see Figure 11). This resistor provides a voltage drop and creates an "effective" wordline. The actual potential drop across the resistor depends upon the selected/deselected state of the memory cell due to the different current levels. Because the drop is proportional to the current, it reduces the effective wordline potential in the selected state much more than in the deselected state.
(a) Original memory cell design (b) Memory cell with wordline-voltage divider
Simulations in SPICE agree with the analysis above. The static access times are reduced at the expense of the dynamic access times. The addition of a small voltage-divider resistor into the memory cells was simpler to implement than a clamping circuit for each row in the register file and did not increase device count, further justifying this method.
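The divider's state-dependent drop can be illustrated with assumed currents and resistor value; none of the numbers below are the actual design values.

```python
# Sketch of the wordline voltage-divider effect (Figure 11b): the drop
# across the added cell resistor is proportional to the cell current,
# so the effective wordline falls much further in the selected state
# than in the deselected state.
R_DIVIDER = 50.0     # added memory-cell resistor (assumed, ohms)
I_SELECTED = 2.0e-3  # cell current when the row is selected (assumed)
I_DESELECTED = 0.2e-3  # cell current when the row is deselected (assumed)


def effective_wordline(v_wordline, selected):
    """Effective wordline = driven wordline minus the IR drop across the
    divider resistor at the cell's present current level."""
    i = I_SELECTED if selected else I_DESELECTED
    return v_wordline - i * R_DIVIDER


if __name__ == "__main__":
    v = -0.85  # example driven wordline level, VCC - VBE (assumed)
    print(effective_wordline(v, True))   # selected: large drop
    print(effective_wordline(v, False))  # deselected: small drop
```

Because the drop scales with current, the internal swing shrinks exactly where it matters (the selected row) without a per-row clamp circuit.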
Bitline Swing: Read/Write Logic
During a READ, only the high bitline is actually driven by the memory cells. The low bitline voltage is set by the read/write logic based upon the threshold voltage. For a WRITE, the read/write logic sets both bitline voltages in order to overwrite the memory cell state. To do this, it must drive the bitlines to values which force the devices in the memory cell on or off and thereby store the logical value. The bitline voltages during a WRITE determine in part the speed of the operation, with larger bitline swings corresponding to faster WRITEs. However, after the WRITE operation is over, a new address may be presented for a READ, and the register file must respond with the appropriate data within 200 ps. If the WRITE bitline voltages are too extreme, the switching time of the bitlines may be significantly delayed due to the excess charge from the WRITE. One way to avoid this situation is to raise the lower bitline voltage while lowering the high one.
The original design of the read/write logic applied high and low bitline voltages of equal magnitude relative to the voltage of the read/write logic during a READ. A sensitivity analysis was performed in which the magnitude of the bitline swing during a WRITE was varied and the time to store the data was measured. The results indicated that the WRITE time was within the specifications while the bitline swing was below the normal READ levels, meaning that the swing during a WRITE did not have to be adjusted.
Even though it was not required in the register file, adjusting the bitline swings during a WRITE was necessary in the cache RAM optimization. During the register file redesign it was not yet known that the WRITE bitline voltage swings could be left unchanged, so a new read/write logic circuit was developed which reduced the high bitline excursion. The read/write logic operates by generating three distinct voltages: a mid-range voltage for both bitlines during a READ, and high and low voltages for the bitlines during a WRITE. All of the voltages are based upon the threshold voltage, and the mid-range and low voltages are generated using resistors.
Bitline Swing: Bridge Resistor
To improve the switching performance of the bitlines, a "bridge" resistor was connected between them (Figure 12). The bridge resistor attempts to equalize the bitline voltages (and thereby improve the switching speed) but is large enough to maintain the bitline swing between address changes. The actual value of the bridge resistor was determined using a sensitivity analysis which examined the access time, WRITE time, bitline swing, memory cell device current and current through the bridge resistor.
The bridge resistor affected many parts of the register file circuit.
It increased the current through the memory cell devices significantly
because, in addition to sinking current from the bitline current
sources, current was also coming from the other bitline. It also
increased the WRITE time for the same reason but to a greater
extent due to the larger bitline swings during the WRITE. Despite
all of these negatives, the bridge resistor reduced the register
file access time significantly. Figure 13 shows the effect of
various bridge resistor values upon the bitline swing, the bitline
current and the current through the resistor itself. In Figure
14, the static and dynamic READ access times and the WRITE time
are shown relative to the bridge resistor value.
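The extra cell current described above follows directly from Ohm's law across the bridge resistor; the values below are assumptions for illustration, not the design points of Figures 13 and 14.

```python
# Sketch of the extra memory-cell current introduced by the bridge
# resistor: with a voltage difference across the bitlines, the selected
# cell sinks the bridge current on top of the bitline-source current.

def bridge_current(bitline_swing, r_bridge):
    """Current flowing through the bridge resistor for a given swing."""
    return bitline_swing / r_bridge


def cell_device_current(i_bitline_source, bitline_swing, r_bridge):
    """Selected cell sinks the bitline source current plus the current
    arriving from the other bitline through the bridge resistor."""
    return i_bitline_source + bridge_current(bitline_swing, r_bridge)


if __name__ == "__main__":
    # Smaller bridge resistor -> more equalizing current (faster
    # switching), but more stress on the memory-cell devices.
    for r in (1000.0, 2000.0, 3000.0):
        print(r, cell_device_current(2.0e-3, 0.4, r))
```

This is the core trade-off the sensitivity analysis resolved: the resistor must be small enough to equalize the bitlines quickly, yet large enough to keep the added cell current and WRITE-time penalty acceptable.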
The 32x8 register file circuit and layout have been optimized to
achieve a 195 ps READ access time while dissipating 2.01 W. The 32x16 cache
RAM block has a 400 ps READ access time with a power dissipation
of 1.5 W. The external dimensions of the register file have remained
the same, while the cache RAM blocks were increased by 7 µm
to accommodate the bridge resistor (the cache RAM block requires
a 3 kΩ resistor, significantly larger than the register file's).