Chapter 1

Introduction and Historical Review

Computer users have shown a remarkable ability to tax the computing resources available to them, no matter the scale. While many researchers are attempting to achieve increases in computing speed by increasing parallelism, there is little doubt that faster nodes will also be required in the future. The F-RISC project has exposed fundamental obstacles facing those who attempt to extend the speed of today's processing nodes. One of the most difficult problems is keeping the processor fed with instructions and data, a problem that can only be solved through careful memory hierarchy design. In designing these memory hierarchies, interactions with the CPU, the types of code to be run on the processor, and the limits of the technologies to be used must all be taken into account. Based on the limitations imposed by the F-RISC architecture and the technologies used in the F-RISC / G prototype, a cache with a hit access time of 1 ns and a CPI of 1.88 is proposed.

  1.1. F-RISC / G Overview

There are a limited number of ways to increase the speed by which computers can perform calculations. One can improve the algorithm used to calculate the results, devote more resources to the problem, or increase the efficiency of each resource.

Much work is being done in searching for optimal algorithms to solve many types of problems. The gains to be achieved by this area of research, while substantial, are nonetheless finite. In addition, certain classes of problems cannot be attacked efficiently except by brute-force methods.

Getting more work accomplished during each cycle can be done by increasing parallelism or ensuring that all functional units are kept busy doing productive work at all times. There are several things which limit how parallelism can be incorporated into the solution; these factors include the nature of the problem to be solved, the overhead incurred by communications between parallel functional units, and programming difficulties.

By focusing on the speed of each resource, it is possible to increase the speed with which a solution is achieved. The basis of the F-RISC architecture is to speed up computations by decreasing the clock cycle time. By stripping the architecture down to the bare necessities, thereby simplifying the circuitry needed to implement it, and by using the most advanced available packaging, device, and interconnect technologies, F-RISC aims to operate at the highest possible clock speed. Any improvements made in single-node processing capability can be applied to highly parallel machines and can be combined with improved computational algorithms.

There have been several implementations of the F-RISC architecture, each based on a different device technology. The current implementation, F-RISC / G, uses a GaAs/ AlGaAs Heterojunction Bipolar Transistor (HBT) technology supplied by Rockwell International. Other recent implementations include F-RISC / I, a simplified GaAs MESFET version of the architecture, and FPGA and SiGe bipolar implementations.

The F-RISC / G implementation contains a seven-stage pipeline as shown in Table 1.1. The I1, I2, D1, D2, and DW stages are all dedicated to memory accesses.

I1   Instruction Fetch 1   Transfer instruction address to cache on branch
I2   Instruction Fetch 2   Receive instruction from cache
DE   Decode                Decode instruction
EX   Execute               Execute instruction
D1   Data Read 1           Transfer data address
D2   Data Read 2           Receive data from cache if a LOAD
DW   Data Write            Cache modified if STORE; write result into register

TABLE 1.1: F-RISC / G SEVEN-STAGE PIPELINE (ADAPTED FROM [PHIL93])

Processor       Year   Bits   Clock    Pipe     Primary Cache (i/d)
                               (MHz)   Stages   kB       Assoc.
UltraSPARC      1995   64     167      9        16/16    1/1
Sparc64         1995   64     143      10       4/0      1/-
Alpha 21164     1994   64     300+     7        8/8      1/1
PA-RISC 7200    1994   32     140      5        1/2      na/64
PowerPC 620     1995   64     130      4        32/32    8/8
MIPS R10000     1995   64     200      5        32/32    2/2
F-RISC/G        1995   32     1000     7        2/2      1/1
TABLE 1.2: ARCHITECTURAL FEATURES OF CONTEMPORARY RISC PROCESSORS

Like many modern RISC implementations, F-RISC relies on deep pipelines and a cache hierarchy to achieve high throughput. Table 1.2 enumerates key features of F-RISC and a number of other state of the art RISC processors.

The use of cache memory hierarchies has become paramount in computer design. Even setting aside the expense of massive amounts of extremely high speed memory, packaging technology has not yet evolved to the point where the entire main memory of a high speed computer can be placed in close enough proximity to the core CPU to allow reasonable access times. Even if gigabytes of RAM could be placed within millimeters of the CPU, it would be difficult if not impossible to remove the heat dissipated by such dense circuitry.
Processor       Technology          Power (W)   Size (mm2)    Devices
UltraSPARC      0.5 um CMOS         28          315           5,200,000
Sparc64         0.4 um              50          2745 (MCM)    22,700,000
Alpha 21164     0.5 / 0.4 um CMOS   50          299           9,300,000
PA-RISC 7200    0.55 um CMOS        30          210           1,260,000
PowerPC 620     0.5 um CMOS         30          311           6,900,000
MIPS R10000     0.5 um CMOS         30          298           5,900,000
F-RISC/G        1.2 um GaAs HBT     250         10000 (MCM)   200,000

TABLE 1.3: CONTEMPORARY RISC PROCESSOR TECHNOLOGY

Table 1.3 lists some of the implementation details of the processors shown in Table 1.2. As shown in the table, the F-RISC / G core CPU and primary cache alone are expected to dissipate 250 W (or 2.5 W / cm2), illustrating the obvious problems that would be associated with packing even more memory onto the multi-chip module (MCM).

  1.2. Technology

The technologies which are used in the F-RISC / G prototype, while providing the performance necessary to achieve its 1 ns cycle time, also impose several limitations on its design. Most notable among these are low device integration levels and relatively high power dissipation.

  1.2.1. GaAs Heterojunction Bipolar Transistors

The active devices used in the F-RISC/G CPU are based on GaAs/AlGaAs Heterojunction Bipolar Transistor (HBT) technology as supplied by Rockwell International. Figure 1.1 shows a cross-section of the Rockwell HBT device. The baseline process produces transistors with a nominal fT of 50 GHz.

The devices are formed on semi-insulating substrate using MBE-grown epitaxial layers. The most commonly used device features 1.4 um x 3 um emitter strips. The Rockwell process provides three levels of interconnect formed through liftoff. A polyimide dielectric is used between layers.


FIGURE 1.1: CROSS SECTION OF ROCKWELL HBT (MODIFIED FROM [NAH91])

The advantage of using HBTs is that the heterojunction provides good emitter injection efficiency, lower base resistance than in bipolar junction transistors (BJTs), improved punch-through resistance, and lower emitter-base capacitance (Cje) [Sze81].

The GaAs / AlGaAs system also offers advantages: a large bandgap can be achieved; electron mobility is high (on the order of 5000-8000 cm2/(V s) in pure material vs. 800-2000 cm2/(V s) for Si), reducing base transit time and charge storage at the emitter junction; and a semi-insulating substrate (resistivity on the order of 5 x 10^8 ohm-cm) is available [Sze90; Long90].

  1.2.2. Current Mode Logic

F-RISC / G makes use of differential Current Mode Logic (CML) and differential wiring [Nah91; Greu91]. These circuits are built from differential pairs of transistors (current switches) arranged in a common-emitter configuration.

Figure 1.2 shows a simple differential current switch. The current source Is may either be passive (a resistor) or active (a transistor). If VIN is sufficiently positive, then the left transistor will turn on (into the active region) and the right transistor will shut off. This forces the current drawn by the current source to flow through the left transistor, and reduces VCE for the left transistor toward 0 V. This pulls the bottom-most of the two output rails toward the emitter voltage, while the upper-most of the two output rails is held high. As a result, the VOUT differential signal is positive. Likewise, if VIN is significantly negative, VOUT will be negative, making this a simple differential buffer.


FIGURE 1.2: DIFFERENTIAL CURRENT SWITCH

F-RISC / G uses passive sourcing and a 0.25 V logic swing. Passive sourcing was selected due to the high VBE of the devices supplied by Rockwell (1.35 V). Three levels of switches are stacked, allowing complex logic functions to be realized. The supply voltage for passive sourcing is:

VEE = 3VBE + VL = 4.05 V + VL

where VL is the desired source voltage. For active sourcing:

VEE = 4VBE + VL = 5.4 V + VL

It was desired that F-RISC / G be compatible with standard ECL parts (VEE = 5.2 V); since the active-sourcing requirement of 5.4 V + VL exceeds this supply, a passive source must be used.
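
A quick numeric check of these supply requirements is sketched below in Python (the VBE value is the 1.35 V quoted above; the value chosen for VL is purely illustrative):

    V_BE = 1.35   # base-emitter drop of the Rockwell HBT, in volts (from the text)
    V_L  = 0.50   # illustrative value for the desired source voltage VL (assumption)

    V_EE_passive = 3 * V_BE + V_L   # 4.05 V + VL
    V_EE_active  = 4 * V_BE + V_L   # 5.40 V + VL; already exceeds the 5.2 V ECL supply

    print(V_EE_passive, V_EE_active)   # 4.55 5.9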

Differential wiring is used due to its common-mode noise immunity (wires are run in parallel tracks of equal length) and the elimination of the common reference voltage required in Emitter-Coupled Logic (ECL). An added benefit of using differential wiring is that inversions can be accomplished by merely crossing wires. The use of differential wiring does cause increased capacitance, however, and requires more routing area.

  1.2.3. Thin Film Multi-Chip Modules

At the cutting edge of device technology yield tends to be a serious problem, meaning that in order to harvest a reasonable quantity of working die it is necessary to reduce device integration levels. This makes construction of useful circuits a serious challenge. If several small die can be interconnected with little penalty, the challenge of poor yield can, in part at least, be circumvented. Two approaches to providing this type of interconnection technology are Wafer Scale Integration (WSI) and Wafer Scale Hybrid Packaging (WSHP) [Maji88].

The goal of any packaging scheme in a high-performance system is to provide high-bandwidth interconnection between system components. Other requirements include the ability to integrate bypass capacitors (to reduce noise), terminating resistors (to reduce reflections), the capability to dissipate heat, and physical protection of the die. Since high performance devices are usually immature technologies (and thus suffer from poor yields), a technology which allows packaging of multiple die into a single high-speed circuit is important. Thin film multi-chip modules (MCMs) are a form of WSHP that can provide all of these features.

  1.3. Memory Hierarchies

It is often not technologically feasible to allow the CPU to communicate directly with a quantity of RAM equivalent to the processor's address space. This assertion is made for the following reasons: massive amounts of extremely high speed memory are prohibitively expensive; packaging technology cannot place that much memory close enough to the CPU to allow reasonable access times; and the heat dissipated by such dense circuitry could not readily be removed.

In order to circumvent these difficulties it is possible to use a cache memory hierarchy in which small amounts of high speed memory are used in concert with large amounts of lower speed memory to approximate the performance of large amounts of high speed memory. This idea dates back to the University of Manchester's Atlas machine's "one-level storage system" in the early 1960's [Kilb62]. The idea is for the most frequently used data to be stored in the high speed memory, while less frequently used data is fetched from main memory into the high speed memory as needed.

The small high-speed memory is conventionally called a "cache." In designing a cache there are several micro-architectural decisions to be made. The dimensions of the cache (depth and width), organization of the cache, operations during writes, communications with higher levels of memory, and even the quantity of caches (do the data and instruction streams get cached separately?) are all parameters which can be adjusted.

The beneficial results of caching are due chiefly to the principles of temporal and spatial locality. These principles are the basis of the hypothesis that at any given time an executing program will tend to access a small portion of its code base, and need to access a small portion of its data address space. If the small portions of the code and data which need to be accessed reside in the cache, then the cache will perform well.

The principles can be stated as follows: temporal locality holds that an item which has been referenced recently is likely to be referenced again soon, while spatial locality holds that items whose addresses are near the address of a recently referenced item are likely to be referenced soon.

Although it is often more useful to evaluate the performance of the processor as a whole, there are several metrics which may be used to evaluate a cache by itself; these are discussed below.

Figure 1.3 illustrates a typical cache memory hierarchy. In this diagram the primary cache is considered to be a "lower" level of hierarchy than the secondary cache. Typically, lower levels of memory are faster than higher levels of memory, whereas higher levels of memory are larger than lower levels of memory. There may be levels of memory above the secondary cache. The highest level of memory is called main memory or main store. Main memory itself may be smaller than the address space, and an external storage device can be used to simulate having a larger main memory, with elements, called "pages," swapped between the main memory and storage device as needed. This technique is known as "virtual memory."


FIGURE 1.3: MEMORY HIERARCHY

A cache "hit" occurs when a lower level of the cache hierarchy can read or write to the next higher level of memory without that higher level needing to transact with a still higher level of memory; the data that is required to complete the CPU instruction is already stored in the cache. A "miss" occurs when the intermediate memory level is forced to first transact with a higher memory level before handling a read or write request. This means that the data that the CPU requires to complete its instruction is not available in the cache, and must be fetched from elsewhere. There are three reasons that a miss could occur:

Capacity

The amount of memory used by the processes running on the CPU may be greater than the capacity of the cache. As a result, some data already in the cache must be moved off to the next higher level of memory, and returned if needed.

Compulsory

The cache will initially contain no valid data, so as the CPU starts up every memory reference will miss until the cache is filled. Compulsory misses may also occur when the processor undergoes a "context switch" in which the cache is flushed and replaced with data and instructions for another process.

Conflict

In some caches a particular block may be needed by multiple addresses, but can only contain one address' data at a time, forcing some data to be sent to the next higher level of memory and perhaps eventually returned.

The "hit ratio" for a cache level is the fraction of total memory transactions that the cache level can handle without having to communicate with higher levels of memory. The "miss ratio" is simply defined as 1 - hit ratio.

The time required for a cache to process a hit (which typically includes the time required to determine if the required data is present as well as the time required to submit a request to the cache and receive the results) is called the "hit time." The "miss penalty" is the additional time required to process a miss. Both of these figures may be expressed as an absolute time (e.g. 1 ns for a hit) or in terms of a quantity of processor cycles.

"Access time" specifically refers to the amount of time required to receive data from a memory after issuing an address.

While these numbers are clearly important, a more useful measure of a cache's performance is the "average memory access time" which is defined as [Henn90]:

average memory access time = hit time + miss ratio × miss penalty

The miss penalty can be broken into two important components: "transfer time" and "access time." The transfer time is the time required to send an address from the lower level of cache (where the miss occurred) to the higher level of cache and the time required to send back the requested data. The access time is simply the amount of time required for the RAM to access the requested data. The transfer time is a function of interconnect and packaging technology, while the access time is a function of device technology.
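
As a worked example of this formula, the following Python sketch computes the average memory access time from a hit time, miss ratio, and miss penalty; the numbers used are placeholders rather than measured F-RISC / G figures:

    def average_memory_access_time(hit_time, miss_ratio, miss_penalty):
        # average memory access time = hit time + miss ratio * miss penalty [Henn90]
        return hit_time + miss_ratio * miss_penalty

    # Placeholder numbers: a 1 ns hit time, a 5% miss ratio, and a 20 ns miss
    # penalty (itself the sum of transfer time and access time) give
    # 1 + 0.05 * 20 = 2 ns.
    print(average_memory_access_time(1.0, 0.05, 20.0))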

The fundamental unit of data on which cache transactions take place is the "block." Each cache level may use different sized blocks, but when a higher level of cache sends data to the next lower level of cache, the dimension of the block transferred must be that used by the lower level of cache. Anytime the CPU needs to access a word within a given block, the entire block must be brought into the cache if not already present. The space in the cache in which a block is stored is called a "block frame" or "line."

The data buses between levels of hierarchy need not be as large as the blocks which are transferred on them, in which case multiple sub-block transfers must take place for each transaction. For example, if the cache block is twice as wide as the data bus, then two half-block transfers are required any time the cache must communicate with a higher level of memory.

When the CPU requests data from memory, the primary cache intercepts the request and determines whether the requested data is immediately available. If the primary cache does not have the requested data, it issues a request to the secondary cache. Assuming the secondary cache has the requested data, it will send it to the primary cache. The amount of data sent is equal to the size of the primary cache block, not the CPU word. If the secondary cache does not have the required data, it must, in turn, retrieve it from the next higher level of memory. The amount of data that the secondary cache would have to retrieve is equal to the secondary cache block size.

By using a block size bigger than the CPU word size, the cache takes advantage of the principle of spatial locality; if the CPU needs a particular word at a given instant, it will likely need the rest of the words in the block shortly. Rather than incur the overhead of multiple communications between caches, it can be beneficial to send several words at once, hence the transfer of an entire block at once. Determining exactly how big to make the blocks for each level of hierarchy can be complicated, and is often accomplished through simulations.

Each cache has room for a fixed number of blocks. A given CPU address may map to a subset of the available cache block locations. Since the cache will not have room to hold the entire CPU address space, several memory addresses will map to each cache block frame. Each cache will have a content-addressable or associative memory (called a "tag RAM") which stores a directory providing the mapping between the CPU address space and the cache block memory. Whenever the cache needs to reference its data memory it must search the tag RAM to determine which blocks it has stored in its block frames.

The directory entries which are written into the tag RAMs are called "tags" and are usually directly extracted from the address of the block. There is one tag entry for each block frame. For example, if, in a given cache, any block may be stored in any block frame, then, as each block is stored, the tag entry for a given frame will be the block address of the block being stored in that frame. As an alternate example, suppose a cache has only two block frames, one of which is permitted to store even block addresses and the other of which holds only odd block addresses. Then the tag entries would not need to contain the lowest order bit (which determines "oddness" or "evenness"), since to find an odd or even block address the cache knows precisely which block frame to check, and needs only to know if the specific block it is looking for is present. In short, the portion of the block address which is necessary to reconstruct the full block address, given the frame in which the block is stored, is stored as the tag.
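
As a concrete generalization of the second example above (a cache in which the low-order bits of the block address select the frame), the following Python sketch splits a block address into a frame index and a tag; the eight-frame cache size is an illustrative assumption:

    NUM_FRAMES = 8        # assumed number of block frames (a power of two)
    INDEX_BITS = 3        # log2(NUM_FRAMES)

    def split_block_address(block_addr):
        frame = block_addr & (NUM_FRAMES - 1)   # low-order bits select the block frame
        tag   = block_addr >> INDEX_BITS        # remaining bits are stored in the tag RAM
        return tag, frame

    print(split_block_address(9))   # block 9 = 0b1001 -> tag 1, frame 1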

  1.3.1. Write Policies

An important issue in cache design is what to do when the CPU attempts to write something into a particular memory address (i.e. perform a STORE). Although most cache accesses will be reads (after all, all instruction fetches will be reads, and few instructions write to memory [Henn90]), writes are common enough that some care must be taken in handling them.

The reason that writes are complicated is that, unlike reads, the actual transaction cannot take place simultaneously with determining whether the proper address is stored in the cache. On a LOAD (a cache read), the cache can check the associative memory it uses to determine whether the appropriate address is present in parallel with actually reading the requested data. If the required data is available, then by the time the cache realizes that a hit has occurred, the data has already been read from the data RAMs, and no penalty is incurred for having had to check the tag RAM. If, during a STORE, the cache were to write the new data into its data memory while checking its associative memory, and the proper address were not available in the cache, the cache would have overwritten data that it might need later. For this reason it is necessary to check the cache first, and only write into it once the proper address is found. This is also necessary because the CPU word size is usually smaller than the cache block size, so the cache may have to read an entire block, modify one word, and write the modified block back into the cache. All this makes a write take longer than a read.
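
The read-modify-write sequence described above can be sketched as follows (a simplified Python model assuming a four-word block; the data structures are illustrative, not the F-RISC / G implementation):

    WORDS_PER_BLOCK = 4   # assumed block size in CPU words

    def store_word(data_ram, frame, word_offset, value):
        block = list(data_ram[frame])    # read the entire block from the data RAM
        block[word_offset] = value       # modify only the addressed word
        data_ram[frame] = block          # write the modified block back

    data_ram = {1: [0, 0, 0, 0]}
    store_word(data_ram, frame=1, word_offset=2, value=0xABCD)
    print(data_ram[1])                   # [0, 0, 43981, 0]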

When actually writing into the cache, there are two possible policies: in a "write-through" cache, the data is written both into the cache and to the next higher level of memory; in a "copy-back" (or "write-back") cache, the data is written only into the cache, and the higher levels of memory are updated only when the modified block is eventually displaced.

In a copy-back design the cache tag RAM's will likely store a bit which determines whether the cache block in question has been modified by the CPU in the time since it was retrieved from higher levels of memory. If the CPU has modified the block by performing a STORE, and the higher levels of memory have not been updated, the block in the cache is called "dirty." If the CPU has not modified the block, or the block was modified but the higher levels of memory have been updated, the block is called "clean."

If a block with a different address needs to be stored in the space occupied by a dirty block, the dirty block must be written to the next higher level of memory or its modifications will be lost. The advantage to this type of cache over a write-through cache is that the time consumed in writing to the next level of memory need only take place when this situation occurs. In a write-through cache, multiple writes to the same memory block would each result in block transfers to the next higher level of memory, whereas, in a copy-back design, no block transfers occur until another block needs to occupy the modified block's frame.
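
A minimal sketch of this dirty-block bookkeeping, assuming one tag, one dirty bit, and one data entry per block frame (illustrative Python, not the F-RISC / G hardware):

    class BlockFrame:
        def __init__(self):
            self.tag = None      # block address currently held in this frame
            self.data = None
            self.dirty = False   # set when the CPU modifies the block

    def write_block(frame, tag, data, next_level):
        if frame.tag is not None and frame.tag != tag and frame.dirty:
            next_level[frame.tag] = frame.data   # copy the dirty block back before eviction
        frame.tag, frame.data, frame.dirty = tag, data, True

    next_level = {}
    f = BlockFrame()
    write_block(f, tag=3, data=[1, 2, 3, 4], next_level=next_level)  # no copy-back yet
    write_block(f, tag=7, data=[5, 6, 7, 8], next_level=next_level)  # block 3 copied back
    print(next_level)   # {3: [1, 2, 3, 4]}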

Write through designs do, however, have the advantage of being simpler to implement, particularly in multiprocessor designs with shared memory in which it is important that the highest, shared, level of memory be consistent with the caches at all times.

To avoid waiting for the write-through to complete on every write, it is common to include buffers between the cache levels. Such buffering, however, erodes some of the simplicity advantage of the write-through design.

A second issue in handling memory writes is what to do in the event of a write miss. A write miss occurs when the CPU attempts to write to a location whose cache block frame is occupied by a block with a different main memory address. In a copy-back cache the block occupying this frame will be transferred to higher levels of memory. In a write-through cache this is not necessary. In either case it is still necessary to actually perform the write that was requested by the CPU. There are two methods of accomplishing this: "fetch on write," in which the missing block is first brought into the cache and then modified, and "write around," in which the write is passed on to the next higher level of memory without placing the block in the cache.

Fetch on write is used when the expectation is that the block will be modified many times in close temporal proximity. Write-around caches are usually used in write-through designs, since modifications to memory have to filter through to main memory anyway.
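
The distinction between the two write-miss policies can be sketched as follows (illustrative Python; dictionaries stand in for the cache and the next level of memory):

    def write_miss_fetch_on_write(cache, next_level, block_addr, word, value):
        block = list(next_level[block_addr])   # fetch the missing block into the cache
        block[word] = value
        cache[block_addr] = block              # later writes to this block will hit

    def write_miss_write_around(cache, next_level, block_addr, word, value):
        block = list(next_level[block_addr])   # the cache contents are left unchanged
        block[word] = value
        next_level[block_addr] = block         # the write goes to the next level only

    cache, main = {}, {9: [0, 0, 0, 0]}
    write_miss_write_around(cache, main, block_addr=9, word=0, value=7)
    print(cache, main)   # {} {9: [7, 0, 0, 0]}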

  1.3.2. Address Mapping

Each level of the cache hierarchy may be of differing dimensions, which will be different in dimension from the main memory. As a result, a block from a lower level of memory or the CPU must be mapped to the proper location in each cache. Caches are frequently classified by the manner in which this mapping is accomplished.

Direct Mapped

In a direct mapped cache each main memory address maps to one and only one location within the cache. Multiple main memory addresses may map to a single cache address, however, if the cache size is smaller than the main memory address space. Typically the mapping is accomplished using the modulus function. For example, if the cache contains eight blocks as shown in Figure 1.4, then main memory block address 9 would map to block 1 in the cache (9 mod 8 = 1). Note that main memory block addresses are used; for example, if there are four words per block, then word addresses 36 through 39 would be contained in block 9.

FIGURE 1.4: DIRECT MAPPED CACHE

The direct mapped cache has the advantage of being simple to implement. In addition, it is simple to determine whether or not a given main memory address is stored in the cache, because there is only one possible place it can be stored. On the other hand, if there is a need to store another address which maps to the same block frame, the resident block is displaced, at the very least entailing a block transfer penalty, and, at worst, causing the displaced block to be swapped back into the cache again if it is needed shortly.
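
A minimal model of the direct-mapped lookup, using the modulus mapping of Figure 1.4, is sketched below (illustrative Python; an eight-frame cache storing whole block addresses as tags for simplicity):

    NUM_FRAMES = 8
    tags = [None] * NUM_FRAMES   # tag RAM: one entry per block frame
    data = [None] * NUM_FRAMES   # data RAM: one block per frame

    def lookup(block_addr):
        frame = block_addr % NUM_FRAMES      # block 9 -> frame 1, as in Figure 1.4
        hit = (tags[frame] == block_addr)    # a single comparison decides hit or miss
        return hit, frame

    def fill(block_addr, block):
        frame = block_addr % NUM_FRAMES
        tags[frame] = block_addr             # in practice only the high-order bits are stored
        data[frame] = block

    fill(9, ["w36", "w37", "w38", "w39"])
    print(lookup(9))    # (True, 1)
    print(lookup(1))    # (False, 1): block 1 maps to the same frame and would displace block 9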

FIGURE 1.5: FULLY ASSOCIATIVE CACHE

Associative

In a fully associative cache, any main memory address can be stored in any block in the cache. For example, as shown in Figure 1.5, main memory block address 9 could be stored in any block of the eight-entry cache.

One advantage of a fully associative cache is that as long as there is a free block in the cache, then there is no need to ever replace the cache contents with other addresses. If all of the blocks are full, then ideally one would wish to replace the block which will not be needed for the longest time [Matt70]. Since this is difficult to accomplish, alternate cache replacement algorithms such as "least-recently used" [Henn90] are often used.
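
A fully associative cache with least-recently-used replacement can be modelled compactly in Python (an illustrative sketch; the hardware cost of such a search is discussed in the next paragraph):

    from collections import OrderedDict

    class FullyAssociativeLRU:
        def __init__(self, num_frames):
            self.num_frames = num_frames
            self.frames = OrderedDict()          # block address -> block data, oldest first

        def access(self, block_addr, fetch):
            if block_addr in self.frames:        # hit: mark the block most recently used
                self.frames.move_to_end(block_addr)
                return self.frames[block_addr]
            if len(self.frames) >= self.num_frames:
                self.frames.popitem(last=False)  # evict the least-recently-used block
            block = fetch(block_addr)            # miss: fetch the block from the next level
            self.frames[block_addr] = block
            return block

    cache = FullyAssociativeLRU(num_frames=8)
    cache.access(9, fetch=lambda addr: "block %d" % addr)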

A disadvantage of this approach is that when a block is sought, the entire cache must be searched. This requires either a large amount of hardware (a comparator for each block frame) or a large amount of time (the contents of each block frame being passed serially through a single comparator). A second disadvantage is that the logic needed to decide where to store each block (a decision must be made, since any frame is a legitimate destination) and the multiplexing logic needed to route the data appropriately require a comparatively large amount of hardware.

Set Associative

A set associative cache is similar to a direct-mapped cache, but for each address in the main memory address space there is more than one possible block location in the cache. The group of block locations to which a particular main memory block can be mapped is called a set.

FIGURE 1.6: SET ASSOCIATIVE CACHE

Figure 1.6 shows an example of a set associative cache with eight set addresses and two blocks per set. Using the modulus mapping that was used in the direct-mapped cache, main memory block address 9 is once again mapped to cache set 1. It may occupy either block within that set, however. The location where a particular block is stored may be chosen using random, least-recently-used, or other algorithms, as in the fully associative cache.

The fully associative and direct-mapped caches can be treated as special cases of the set associative cache. In the case of the fully associative cache there is only one set, encompassing the entire cache. In the case of the direct-mapped cache, the set size is one block, and there is one set for each block in the cache.
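
This relationship can be made explicit by parameterizing a small sketch by its associativity (illustrative Python; setting the associativity to one yields a direct-mapped cache, and setting it to the number of frames yields a fully associative cache):

    class SetAssociativeCache:
        def __init__(self, num_frames, associativity):
            assert num_frames % associativity == 0
            self.num_sets = num_frames // associativity
            self.associativity = associativity
            self.sets = [dict() for _ in range(self.num_sets)]   # block address -> data

        def lookup(self, block_addr):
            s = self.sets[block_addr % self.num_sets]   # select the set, as in Figure 1.6
            return block_addr in s                      # search only within that set

        def fill(self, block_addr, block):
            s = self.sets[block_addr % self.num_sets]
            if len(s) >= self.associativity:
                s.pop(next(iter(s)))   # evict some block from the set (policy not modelled)
            s[block_addr] = block

    # associativity = 1 -> direct mapped; associativity = num_frames -> fully associative.
    cache = SetAssociativeCache(num_frames=16, associativity=2)
    cache.fill(9, ["w36", "w37", "w38", "w39"])
    print(cache.lookup(9))   # True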

The set associative cache requires fewer hardware resources than the fully associative cache: the multiplexors which route data into the data memories can be smaller (since a given block may be stored in fewer possible block frames), and fewer comparators are needed within the tag RAM if a parallel search is to be done. The direct-mapped cache, on the other hand, requires no multiplexors and only one comparator.

An advantage of the set associative cache over the direct-mapped cache is that if a particular block frame is occupied, it is possible that alternate block frames within the same set will be available for new data.

  1.3.3. Architecture

In the 1940's, researchers at Harvard University built the Mark series of computing machines. The Mark-III and the Mark-IV were stored program machines, with separate memories for the instructions and data. While this type of architecture is rare today, it is common for a machine to have a shared main memory, but separate instruction and data caches; this is called a "Harvard architecture." The alternative, a single, shared cache, is called a "unified architecture" or "Von Neumann architecture." Figure 1.7 illustrates each type of cache architecture.

The advantage of separating the data and instruction caches is that it becomes simpler for instructions and data to be fetched from memory simultaneously. In systems in which the CPU is pipelined and can fetch instructions and operands simultaneously, it is a great advantage to have separate buses to memory for these items, rather than forcing a stall and fetching them one at a time.


FIGURE 1.7: CACHE ARCHITECTURES

On the other hand, when a Harvard architecture is used, it is generally impossible to modify the instruction stream (the instruction cache is capable of reading memory but not writing to it), and thus self-modifying code, an important element of certain types of artificial intelligence systems, becomes much more difficult to implement. In addition, since the two caches cannot share cache RAMs, the cache hierarchy is less able to adapt to changing conditions by partitioning different amounts of memory to holding instructions and data. It is possible, however, to optimize each cache individually, since the two caches need not be identical.