High Speed Microprocessor Cache Memory Hierarchies for Yield-Limited Technologies

Note: This html version of the dissertation may differ slightly from the final, archived version. Last minute changes, data translation problems, and the like may result in some differences. In particular, footnotes did not come across. As a result, some citations may not be made correctly (sorry!).

The printed version of this document is available through the RPI Library, or on microfilm from NMI.

Cliff Maier

A DISSERTATION SUBMITTED TO THE GRADUATE FACULTY OF

RENSSELAER POLYTECHNIC INSTITUTE

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Major Subject: Electrical Engineering

Approved by the Examining Committee:

____________________________	____________________________
John F. McDonald, Thesis Advisor	Kenneth Rose, Prof., ECSE
____________________________	____________________________
T. Paul Chow, Prof. ECSE	B. Szymanski, Prof. Comp. Sci.

Rensselaer Polytechnic Institute

Troy, New York

August 1996

(For Graduation December 1996)

High Speed Microprocessor Cache Memory Hierarchies For Yield-Limited Technologies

Cliff Maier

AN ABSTRACT OF A THESIS SUBMITTED TO THE GRADUATE FACULTY OF

RENSSELAER POLYTECHNIC INSTITUTE

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Major Subject: Electrical Engineering

The original of the complete thesis is on file

in the Rensselaer Polytechnic Institute Library

Examining Committee:

John F. McDonald, Thesis Advisor

T. Paul Chow, Prof. ECSE

B. Szymanski, Prof. Comp. Sci.

Kenneth Rose, Prof. ECSE

Rensselaer Polytechnic Institute

Troy, New York

August 1996

(For Graduation December 1996)

TABLE OF CONTENTS

1. Introduction and Historical Review

1.1 F-RISC / G Overview

1.2 Technology

1.2.1 GaAs Heterojunction Bipolar Transistors

1.2.2 Current Mode Logic

1.2.3 Thin Film Multi-Chip Modules

1.3 Memory Hierarchies

1.3.1 Write Policies

1.3.2 Address Mapping

1.3.3 Architecture

2. Cache Architecture and Microarchitecture

2.1 Trace-driven Simulation

2.2 Cache Transactions

2.2.1 Load Hit

2.2.2 Clean Load Miss

2.2.3 Dirty Load Miss

2.2.4 Store Hit

2.2.5 Clean Store Miss

2.2.6 Dirty Store Miss

2.3 Secondary Cache

2.4 Yield and Redundancy Analysis

2.4.1 Block Replacement

2.4.2 Nibble Replacement

2.4.3 Column Replacement

2.4.4 Additional Columns

2.4.5 Analysis

3. F-RISC / G Cache Implementation

3.1 Advanced Packaging

3.2 Clock Synchronization

3.3 Cache Pipeline

3.4 Cache RAM

3.4.1 Cache RAM Architecture

3.4.2 Cache RAM Design

3.4.3 Cache RAM Timing

3.4.4 Cache RAM Details

3.5 Cache Controller

3.5.1 Chip Architecture

3.5.2 Instruction and Data Cache Configuration

3.5.3 Clocking

3.5.4 Cache Controller Design

3.6 Communications

3.6.1 CPU and Primary Cache Communications

3.6.2 Intra-cache Communications

3.6.3 Secondary Cache Communications

3.6.4 MCM Placement

3.7 Virtual Memory Support

3.8 Timing

3.8.1 Load Timing

3.8.2 Store Timing

3.8.3 Instruction Fetch Timing

3.8.4 Other Cache Stalled

3.8.5 Processor Start-up

3.9 L2 Cache Design

4. At-Speed Testing Scheme

4.1 Evaluation of the F-RISC/G Boundary Scan Scheme

4.2 Test Scheme Design

4.2.1 Overall Scheme

4.2.2 Special Drivers and Receivers

4.2.3 Special Tests

4.3 Implementation and Test Plan

4.3.1 External Connections

4.3.2 Testing Control Logic

4.3.3 Timing

4.3.4 Continuous Mode Testing

4.3.5 Testing Sequence

4.4 Individual Chips

4.4.1 Cache Controller

4.4.2 Cache RAM

5. Beyond F-RISC / G

5.1 Cache Organization and Partitioning

5.1.1 Use of Higher Device Integration

5.1.2 Remote Program Counter

5.1.3 Pipeline Partitioning

5.1.4 Temporal and Spatial Interleaving

5.1.5 Column Associativity

5.1.6 Superscalar / VLIW CPU

5.1.7 Multiprocessing

5.1.8 Cache Pre-processing

5.2 Future Packaging

5.3 Improved Virtual Memory Support

5.4 CAD Improvements

5.5 Clocking

5.6 Radio Frequency Data Memory

5.7 Conclusions

References

Appendix A: Dinero Simulation Results

Appendix B: Cache Controller Scan Chain

Appendix C: Cache Controller Schematics

Appendix D: Cache RAM Schematics

LIST OF FIGURES

FIGURE 1.1: CROSS SECTION OF ROCKWELL HBT (MODIFIED FROM [NAH91])

FIGURE 1.2: DIFFERENTIAL CURRENT SWITCH

FIGURE 1.3: MEMORY HIERARCHY

FIGURE 1.4: DIRECT MAPPED CACHE

FIGURE 1.5: FULLY ASSOCIATIVE CACHE

FIGURE 1.6: SET ASSOCIATIVE CACHE

FIGURE 1.7: CACHE ARCHITECTURES

FIGURE 2.1: TYPICAL MEMORY REFERENCE PATTERN

FIGURE 2.2: NAIVE MEMORY REFERENCE PATTERN

FIGURE 2.3: SIMULATIONS: 2KB HARVARD CACHES, DIRECT-MAPPED, SPICE TRACE

FIGURE 2.4: SIMULATIONS: 2KB HARVARD CACHES, DIRECT-MAPPED, TEX TRACE

FIGURE 2.5: SIMULATIONS: 2KB HARVARD CACHES, DIRECT MAPPED, GCC TRACE

FIGURE 2.6: SIMULATION RESULTS FOR BENCHMARK SUITE

FIGURE 2.7: 2 KB HARVARD CACHES, DIRECT-MAPPED, BLOCK SIZE EQUALS BUS WIDTH

FIGURE 2.8: 512 BIT BUS RAM PARTITIONING

FIGURE 2.9: 256 BIT BUS RAM PARTITIONING

FIGURE 2.10: CPI AS A FUNCTION OF SET QUANTITY, BLOCK SIZE, ASSUMING LRU

FIGURE 2.11: EFFECT OF SET SIZE

FIGURE 2.12: EFFECT OF REPLACEMENT ALGORITHM

FIGURE 2.13: EFFECT OF ARCHITECTURE ON CPI

FIGURE 2.14: EFFECT OF HARVARD CACHE SIZE ON CPI

FIGURE 2.15: LOAD HIT

FIGURE 2.16: CLEAN LOAD MISS

FIGURE 2.17: DIRTY LOAD MISS

FIGURE 2.18: STORE HIT

FIGURE 2.19: CLEAN STORE MISS

FIGURE 2.20: DIRTY STORE MISS

FIGURE 2.21: REQUIRED SECONDARY CACHE HIT TIME

FIGURE 2.22: CACHE RAM BLOCK FLOORPLAN

FIGURE 2.23: BLOCK REPLACEMENT

FIGURE 2.24: NIBBLE REPLACEMENT

FIGURE 2.25: EXTRA COLUMN PER BLOCK REPLACEMENT

FIGURE 2.26: CHIP YIELDS AS A FUNCTION OF FAULT PROBABILITY

FIGURE 3.1: F-RISC / G SYSTEM

FIGURE 3.2: CRITICAL PATH DIAGRAM

FIGURE 3.3: DATA CACHE CRITICAL PATH

FIGURE 3.4: SIGNAL TIME OF FLIGHT

FIGURE 3.5: ADDRESS TRANSFER FROM CPU TO CACHES

FIGURE 3.6: SINGLE BUS ADDRESS TRANSFER FROM CONTROLLER TO RAMS

FIGURE 3.7: DUAL BUS ADDRESS TRANSFER FROM CONTROLLER TO RAMS

FIGURE 3.8: INSTRUCTION TRANSFER - RAM TO ID

FIGURE 3.9: DATA TRANSFER - RAM TO DATAPATH

FIGURE 3.10: SEQUENTIAL CACHE OPERATION

FIGURE 3.11: PIPELINED CACHE OPERATION

FIGURE 3.12: SYSTEM PIPELINE - SEQUENTIAL LOADS

FIGURE 3.13: SYSTEM PIPELINE - "SEQUENTIAL" STORES

FIGURE 3.14: SAMPLE CODE WHICH CAUSES A DATA CACHE BUBBLE

FIGURE 3.15: PIPELINE DIAGRAM WITH BUBBLE

FIGURE 3.16: PIPELINE ROTATE

FIGURE 3.17: CACHE RAM BLOCK DIAGRAM

FIGURE 3.18: CACHE RAM LAYOUT

FIGURE 3.19: CACHE RAM FLOORPLAN

FIGURE 3.20: CACHE RAM PARTITIONING FOR F-RISC / G

FIGURE 3.21: ADDRESS PARTITIONING ON CACHE RAM

FIGURE 3.22: SIMPLIFIED CACHE CONTROLLER BLOCK DIAGRAM

FIGURE 3.23: CACHE CONTROLLER FLOORPLAN

FIGURE 3.24: CACHE CONTROLLER LAYOUT

FIGURE 3.25: REMOTE PROGRAM COUNTER

FIGURE 3.26: CACHE CONTROLLER STATE DIAGRAM

FIGURE 3.27: LOAD CRITICAL PATH COMPONENTS

FIGURE 3.28: COMPONENTS OF ADDER CRITICAL PATH (ADAPTED FROM [PHIL93])

FIGURE 3.29: DATA CACHE COMMUNICATIONS

FIGURE 3.30: INSTRUCTION CACHE COMMUNICATIONS

FIGURE 3.31: ABUS PARTITIONING

FIGURE 3.32: MCM LAYOUT

FIGURE 3.33: ADDRESS BROADCAST TO CACHE CONTROLLERS

FIGURE 3.34: ADDRESS BROADCAST TO RAMS

FIGURE 3.35: RESULTS FROM CACHE TO CPU

FIGURE 3.36: GE-HDI MCM CROSS-SECTION

FIGURE 3.37: DATA CACHE TIMING - CLEAN LOADS

FIGURE 3.38: INSTRUCTION DECODER BLOCK DIAGRAM

FIGURE 3.39: SAMPLE LOAD COPYBACK CODE FRAGMENT

FIGURE 3.40: DATA CACHE TIMING - LOAD COPYBACK

FIGURE 3.41: DATA CACHE TIMING - STORE COPYBACK

FIGURE 3.42: TIMING AT CACHE RAM DURING STORE

FIGURE 3.43: SAMPLE STORE COPYBACK CODE FRAGMENT

FIGURE 3.44: INSTRUCTION CACHE MISS TIMING

FIGURE 3.45: CACHE WAIT TIMING

FIGURE 3.46: DATA CACHE DURING INSTRUCTION CACHE STALL

FIGURE 3.47: INSTRUCTION CACHE DURING A DATA CACHE STALL

FIGURE 3.48: INSTRUCTION CACHE AT START-UP

FIGURE 3.49: INSTRUCTION CACHE DURING TRAP

FIGURE 3.50: SECONDARY CACHE BLOCK DIAGRAM

FIGURE 3.51: LOAD COPYBACK IN F-RISC / G CACHE

FIGURE 4.1: RECEIVER USED IN F-RISC / G CORE BOUNDARY SCAN SCHEME

FIGURE 4.2: SIMPLIFIED AT-SPEED TESTING TIMING DIAGRAM

FIGURE 4.3: DRIVER USED IN F-RISC / G CORE BOUNDARY SCAN SCHEME

FIGURE 4.4: F-RISC/G CORE BOUNDARY SCAN SCHEME

FIGURE 4.5: SAMPLING BEHAVIOR OF MASTER - SLAVE LATCH (MODIFIED FROM [PHIL93])

FIGURE 4.6: PARALLEL SCAN CLOCKING

FIGURE 4.7: RAM CHIP TESTING SCHEME

FIGURE 4.8: WRITE TIMING IN CONTINUOUS MODE

FIGURE 4.9: READ TIMING IN CONTINUOUS MODE

FIGURE 4.10: SINGLE SHOT TIMING

FIGURE 4.11: L2 PATH DRIVER / RECEIVER

FIGURE 4.12: BOUNDARY SCAN RECEIVER

FIGURE 4.13: BOUNDARY SCAN DRIVER

FIGURE 4.14: MCM SCAN PATH

FIGURE 4.15: CASCADE PROBE HEAD

FIGURE 4.16: BOUNDARY SCAN STATE TRANSITIONS

FIGURE 4.17: BOUNDARY SCAN CONTROLLER STATE DIAGRAM

FIGURE 4.18: TIMING DIAGRAM

FIGURE 5.1: CYCLE TIME / CPI TRADE-OFF

FIGURE 5.2: PROPOSED REVISED SYSTEM DIAGRAM

FIGURE 5.3: PROPOSED DATAPATH BLOCK DIAGRAM (ADAPTED FROM [PHIL93])

FIGURE 5.4: MODIFIED TIMING DIAGRAM

FIGURE 5.5: MODIFIED CACHE CONTROLLER BLOCK DIAGRAM

FIGURE 5.6: NIBBLE INTERLEAVING

FIGURE 5.7: WORD INTERLEAVING

FIGURE 5.8: 1 KB L0 CACHE

FIGURE 5.9: 1 KB VS. 2 KB L0 CACHE

FIGURE 5.10: COLUMN ASSOCIATIVE CACHE - SLOW HIT

FIGURE 5.11: ASSOCIATIVITY SCHEMES

FIGURE 5.12: SHARED MEMORY MULTI-PROCESSOR IMPLEMENTATION

FIGURE 5.13: SNOOPING CACHE COMMUNICATIONS

FIGURE 5.14: ARCHITECTURE TRANSLATION

FIGURE 5.15: 3-D CHIP STACK

FIGURE 5.16: CHIP WITH INTERPOSER

FIGURE 5.17: 3-D RAM STACK MCM LAYOUT

FIGURE 5.18: 3-D RAM AND CPU STACK MCM LAYOUT

FIGURE 5.19: CACHE CRITICAL PATHS

FIGURE 5.20: COARSE CLOCKING

FIGURE 5.21: RADIO FREQUENCY DATA MEMORY INTERFACING

LIST OF TABLES

TABLE 1.1: F-RISC / G SEVEN-STAGE PIPELINE (ADAPTED FROM [PHIL93])

TABLE 1.2: ARCHITECTURAL FEATURES OF CONTEMPORARY RISC PROCESSORS

TABLE 1.3: CONTEMPORARY RISC PROCESSOR TECHNOLOGY

TABLE 2.1: TRACE CHARACTERISTICS

TABLE 2.2: SPICE TRACE RESULTS

TABLE 2.3: TEX TRACE RESULTS

TABLE 2.4: GCC TRACE RESULTS

TABLE 2.5: SIMULATION RESULTS ON BENCHMARK SUITE

TABLE 2.6: COMPARISON OF CACHE ARCHITECTURES

TABLE 2.7: F-RISC / G PRIMARY CACHE PARAMETERS

TABLE 2.8: STALL CPI COMPONENTS

TABLE 2.9: SECONDARY CACHE TRACE LENGTHS

TABLE 2.10: SECONDARY CACHE SIMULATION RESULTS

TABLE 2.11: ANALYSIS OF SECONDARY CACHE TRACE FOR HEAD STARTS

TABLE 2.12: EXPECTED CHIP YIELD

TABLE 2.13: REPLACEMENT SCHEME DEVICE PENALTY

TABLE 2.14: RPI TESTCHIP TESTING RESULTS [PHIL93]

TABLE 3.1: DELAYS ALONG CRITICAL PATH

TABLE 3.2: CRITICAL PATH TIMINGS

TABLE 3.3: CACHE OPERATIONS DURING A FETCH

TABLE 3.4: STAGES OF CACHE OPERATION

TABLE 3.5: DATA CACHE OPERATIONS DURING STORE

TABLE 3.6: CACHE RAM DEVICE COUNT

TABLE 3.7: CACHE CONTROLLER DEVICE COUNT

TABLE 3.8: COMPARISON OF F-RISC / G CHIPS

TABLE 3.9: CPU TO CACHE COMMUNICATIONS

TABLE 3.10: IOCNTRL SETTINGS

TABLE 3.11: CACHE TO CPU COMMUNICATIONS

TABLE 3.12: CRITICAL PATH TIMING CONSTRAINTS

TABLE 3.13: MCM NET LENGTHS - CPU / CACHE SIGNALS

TABLE 3.14: INTRACACHE COMMUNICATIONS

TABLE 3.15: SECONDARY CACHE COMMUNICATIONS

TABLE 3.16: CHIP DIMENSIONS

TABLE 3.17: COMPARISON OF F-RISC / G PACKAGES

TABLE 3.18: VIRTUAL MEMORY CONTROL

TABLE 3.19: CPU TRAP BEHAVIOR

TABLE 3.20: BACK-ANNOTATED SIGNAL TIMINGS

TABLE 4.1: BOUNDARY SCAN RECEIVER CONTROL SIGNALS

TABLE 4.2: BOUNDARY SCAN DRIVER CONTROL SIGNALS

TABLE 4.3: BOUNDARY SCAN PROBE ASSIGNMENTS

TABLE 4.4: BOUNDARY SCAN CONTROLLER STATES

TABLE 4.5: CACHE RAM SCAN CHAIN

TABLE 5.1: ESTIMATE OF CRITICAL PATH LENGTHS USING 3-D STACKING

TABLE 5.2: SECONDARY CRITICAL PATH BREAKDOWN

Acknowledgments
This research was sponsored in part by the Advanced Research Projects Agency under contracts AASERT DAAL03-92G-0307 for cache memory, ARPA/ARO DAAH04-93-G-0477, and DARPA/ARO DAAL03-90-G-0187.

I would like to thank my F-RISC/G comrades, Atul Garg, Pete Campbell, Steve Carlough, Matt Ernest, and Sam Steidl in particular, without whose help this would have taken a lot longer. Bob Philhower and Jim Loy deserve thanks for showing me how it all is done. John Van Etten is thanked for his preliminary research in this area, as is Kyung-Suc Nah for his work on the memory blocks. All of my committee, John McDonald, Kenneth Rose, Paul Chow, and Boleslaw Szymanski are thanked for their valuable suggestions, contributions, and guidance. Thanks, too, to Hans Greub, who unfortunately could not be there at the very end. Thanks also to those friends who don't inhabit the "Sun Room": Bob Laffen and Denise Combs for smiling so much, Lisa Schmeiser for giving me a reason to smile myself, and Karin Karg for reminding me each time I lost my motivation to finish my research that the "real world" is a nice place to be.

Finally, thanks, grandma, grandpa, and most especially dad, for your assistance and support all these many years, and for putting up with my increasingly infrequent visits.

Abstract

The speed of a microprocessor is determined by many factors: device speed, interconnect, architecture and micro-architecture, compiler design, memory bandwidth, and packaging all play a role. While it is tempting to optimize only the speed of the execution units, the overall speed of the processor will suffer if these other factors are neglected.

At the cutting edge of device technology one must trade off quantity of devices for increased device switching speed. To avoid squandering this increased device speed, designs in this regime call for different decisions than those made for more conventional technologies.

In order to explore the problems to be overcome in this regime, researchers at Rensselaer Polytechnic Institute created the F-RISC / G 1000 MHz CPU. The memory hierarchy for this microprocessor was particularly challenging, and many of the problems associated with low device integration and interconnect-dominated cycle times had to be overcome. A subnanosecond access time SRAM chip was designed, as was a dual-purpose cache controller chip. Schemes for improving the RAM die yield through incorporation of redundancy were analyzed, and a determination about the use of redundant circuitry in high-speed low-device-yield circuits was made. An at-speed testing scheme for the SRAM chip was developed and implemented. Options for the secondary cache and processor packaging were also evaluated.