Computer users have shown a remarkable ability to tax the computing
resources available to them, no matter the scale. While many researchers
are attempting to achieve increases in computing speed by increasing
parallelism, there is little doubt that in the future faster nodes
will be required. The F-RISC project has led to the discovery
of fundamental obstacles ahead for those who are attempting to
extend the speed of today's processing nodes. One of the most
difficult problems will be keeping the processor fed with instructions
and data, a problem that can only be solved through considerate
memory hierarchy design. In designing these memory hierarchies,
interactions with the CPU, the types of code to be run on the
processor, and the limits of the technologies to be used must
be taken into account. Based on the limitations imposed by the
F-RISC architecture and the technologies used in the F-RISC /
G prototype, a cache with hit access time of 1 ns and a CPI of
1.88 is proposed.
There are a limited number of ways to increase the speed by which computers can perform calculations. One can improve the algorithm used to calculate the results, devote more resources to the problem, or increase the efficiency of each resource.
Much work is being done in searching for the optimal algorithm to solve many types of problems. The gains to be achieved by this area of research, while substantial, are nonetheless finite. In addition, certain classes of problems can not be efficiently attacked except by throwing brute-force methods at them.
Getting more work accomplished during each cycle can be done by increasing parallelism or ensuring that all functional units are kept busy doing productive work at all times. There are several things which limit how parallelism can be incorporated into the solution; these factors include the nature of the problem to be solved, the overhead incurred by communications between parallel functional units, and programming difficulties.
By focusing on the speed of each resource, it is possible to increase the speed with which a solution is achieved. The basis of the F-RISC architecture is to speed up computations by decreasing the clock cycle time. By simplifying the circuitry necessary to implement the architecture by stripping the architecture down to the bare necessities, and using the most advanced available packages, devices, and interconnect technologies, the goal of F-RISC is to operate at the highest possible clock speed. Any improvements which can be made in single-node processing capability can be applied to highly parallel machines and can benefit from improved computational algorithms.
There have been several implementations of the F-RISC architecture, each based on a different device technology. The current implementation, F-RISC / G, uses a GaAs/ AlGaAs Heterojunction Bipolar Transistor (HBT) technology supplied by Rockwell International. Other recent implementations include F-RISC / I, a simplified GaAs MESFET version of the architecture, and FPGA and SiGe bipolar implementations.
The F-RISC / G implementation contains a seven stage pipeline
as shown in Table 1.1. The I1, I2, D1, D2,
and DW stages are all dedicated to memory accesses.
Instruction Fetch 1 | Transfer instruction address to cache on branch | |
Instruction Fetch 2 | Receive instruction from cache | |
Decode | Decode instruction | |
Execute | Execute instruction | |
Data Read 1 | Transfer data address | |
Data Read 2 | Receive data from cache if a LOAD | |
Data Write | Cache modified if STORE; write result into register |
Stages | |||||||
UltraSPARC | |||||||
Sparc64 | |||||||
Alpha 21164 | |||||||
PA-RISC 7200 | |||||||
PowerPC 620 | |||||||
MIPS R10000 | |||||||
F-RISC/G |
Like many modern RISC implementations, F-RISC relies on deep pipelines and a cache hierarchy to achieve high throughput. Table 1.2 enumerates key features of F-RISC and a number of other state of the art RISC processors.
The use of cache memory hierarchies has become paramount in computer
design. Irrespective of the expense of massive amounts of extremely
high speed memory, packaging technology has not yet evolved to
the point where the entire main memory of a high speed computer
can be placed in close enough proximity to the core CPU to allow
reasonable access times. Even if gigabytes of RAM could be placed
within millimeters of the CPU, it would be difficult if not impossible
to remove the heat dissipated by such dense circuitry.
UltraSPARC | ||||
Sparc64 | ||||
Alpha 21164 | ||||
PA-RISC 7200 | ||||
PowerPC 620 | ||||
MIPS R10000 | ||||
F-RISC/G |
Table 1.3 lists some of the implementation details of the processors
shown in Table 1.2. As shown in the table, the F-RISC / G core
CPU and primary cache alone are expected to dissipate 250 W (or
2.5 W / cm2), illustrating the obvious problems that
would be associated with packing even more memory onto the multi-chip
module (MCM).
The technologies which are used in the F-RISC / G prototype,
while providing the performance necessary to achieve its 1 ns
cycle time, also impose several limitations on its design. Most
notable among these is poor device integration and relatively
high power dissipation.
The active devices used in the F-RISC/G CPU are based on GaAs/AlGaAs Heterojunction Bipolar Transistor (HBT) technology as supplied by Rockwell International. Figure 1.1 shows a cross-section of the Rockwell HBT device. The baseline process produces transistors with a nominal fT of 50 GHz.
The devices are formed on semi-insulating substrate using MBE-grown
epitaxial layers. The most commonly used device features 1.4 um x
3 um emitter strips. The Rockwell process provides three levels
of interconnect formed through liftoff. A polyimide dielectric
is used between layers.
The advantage of using HBTs is that the heterojunction provides good emitter injection efficiency, lower base resistance than in bipolar junction transistors (BJTs), improved punch-through resistance, and lower emitter-base capacitance (Cje) [Sze81].
The GaAs / AlGaAs system also offers advantages. Among them,
a large bandgap can be achieved, electron mobility is high (on
the order of 5000-8000 cm2/(V s) in pure material vs. 800-2000 cm2/(V s) for Si),
reducing base transit time and charge storage at the emitter junction,
and a semi-insulating substrate is available (on the order of
5 x 108 ohm-cm ) [Sze90; Long90].
F-RISC / G makes use of differential Current Mode Logic (CML) and differential wiring [Nah91; Greu91]. These circuits are built out of differential pairs of transistors, current switches, which are arranged in a common emitter configuration.
Figure 1.2 shows a simple differential current switch. The current
source Is may either be passive (a resistor)
or active (a transistor). If VIN is sufficiently positive,
then the left transistor will turn on (into the active region)
and the right transistor will shut off. This forces the current
drawn by the current source to flow through the left transistor,
and reduces VCE for the left transistor toward 0 V.
This pulls the bottom-most of the two output rails toward the
emitter voltage, while the upper-most of the two output rails
is held high. As a result, the VOUT differential signal
is positive. Likewise, if VIN is significantly negative,
VOUT will be negative, making this a simple differential
buffer.
F-RISC / G uses passive sourcing and a 0.25 V logic swing. Passive sourcing was selected due to the high VBE of the devices supplied by Rockwell (1.35 V). Three levels of switches are stacked, allowing complex logic functions to be realized. The supply voltage for passive sourcing is:
where VL is the desired source voltage. For active
sourcing:
It was desired that F-RISC / G be compatible with standard ECL parts (VEE = 5.2 V) so a passive source must be used.
Differential wiring is used due to its common-mode noise immunity
(wires are run in parallel tracks of equal length) and the elimination
of the common reference voltage required in Emitter-Coupled Logic
(ECL). An added benefit of using differential wiring is that inversions
can be accomplished by merely crossing wires. The use of differential
wiring does cause increased capacitance, however, and requires
more routing area.
At the cutting edge of device technology yield tends to be a serious problem, meaning that in order to harvest a reasonable quantity of working die it is necessary to reduce device integration levels. This makes construction of useful circuits a serious challenge. If several small die can be interconnected with little penalty, the challenge of poor yield can, in part at least, be circumvented. Two approaches to providing this type of interconnection technology are Wafer Scale Integration (WSI) and Wafer Scale Hybrid Packaging (WSHP) [Maji88].
The goal of any packaging scheme in a high-performance system
is to provide high-bandwidth interconnection between system components.
Other requirements include the ability to integrate bypass capacitors
(to reduce noise), terminating resistors (to reduce reflections),
the capability to dissipate heat, and physical protection of the
die. Since high performance devices are usually immature technologies
(and thus suffer from poor yields), a technology which allows
packaging of multiple die into a single high-speed circuit is
important. Thin film multi-chip modules (MCMs) are a form of WSHP
that can provide all of these features.
It is often not technologically feasible to allow the CPU to
communicate directly with a quantity of RAM equivalent to the
processor's address space. This assertion is made for the following
reasons:
In order to circumvent these difficulties it is possible to use a cache memory hierarchy in which small amounts of high speed memory are used in concert with large amounts of lower speed memory to approximate the performance of large amounts of high speed memory. This idea dates back to the University of Manchester's Atlas machine's "one-level storage system" in the early 1960's [Kilb62]. The idea is for the most frequently used data to be stored in the high speed memory, while less frequently used data is fetched from main memory into the high speed memory as needed.
The small high-speed memory is conventionally called a "cache." In designing a cache there are several micro-architectural decisions to be made. The dimensions of the cache (depth and width), organization of the cache, operations during writes, communications with higher levels of memory, and even the quantity of caches (do the data and instruction streams get cached separately?) are all parameters which can be adjusted.
The beneficial results of caching are due chiefly to the principles of temporal and spatial locality. These principles are the basis of the hypothesis that at any given time an executing program will tend to access a small portion of its code base, and need to access a small portion of its data address space. If the small portions of the code and data which need to be accessed reside in the cache, then the cache will perform well.
The principles can be stated as follows:
Although it is often more useful to evaluate the performance of the processor as a whole, there are several metrics which may be used to evaluate a cache by itself.
Figure 1.3 illustrates a typical cache memory hierarchy. In this
diagram the primary cache is considered to be a "lower"
level of hierarchy than the secondary cache. Typically, lower
levels of memory are faster than higher levels of memory, whereas
higher levels of memory are larger than lower levels of memory.
There may be levels of memory above the secondary cache. The highest
level of memory is called main memory or main store. Main memory
itself may be smaller than the address space, and an external
storage device can be used to simulate having a larger main memory,
with elements, called "pages," swapped between the main
memory and storage device as needed. This technique is known as
"virtual memory."
A cache "hit" occurs when a lower level of the cache
hierarchy can read or write to the next higher level of memory
without that higher level needing to transact with a still higher
level of memory; the data that is required to complete the CPU
instruction is already stored in the cache. A "miss"
occurs when the intermediate memory level is forced to first transact
with a higher memory level before handling a read or write request.
This means that the data that the CPU requires to complete its
instruction is not available in the cache, and must be fetched
from elsewhere. There are three reasons that a miss could occur:
Capacity
The amount of memory used by the processes running on the CPU
may be greater than the capacity of the cache. As a result, some
data already in the cache must be moved off to the next higher
level of memory, and returned if needed.
Compulsory
The cache will initially contain no valid data, so as the CPU
starts up every memory reference will miss until the cache is
filled. Compulsory misses may also occur when the processor undergoes
a "context switch" in which the cache is flushed and
replaced with data and instructions for another process.
Conflict
In some caches a particular block may be needed by multiple addresses,
but can only contain one address' data at a time, forcing some
data to be sent to the next higher level of memory and perhaps
eventually returned.
The "hit ratio" for a cache level is the fraction of total memory transactions that the cache level can handle without having to communicate with higher levels of memory. The "miss ratio" is simply defined as 1 - hit ratio.
The time required for a cache to process a hit (which typically includes the time required to determine if the required data is present as well as the time required to submit a request to the cache and receive the results) is called the "hit time." The "miss penalty" is the additional time required to process a miss. Both of these figures may be expressed as an absolute time (e.g. 1 ns for a hit) or in terms of a quantity of processor cycles.
"Access time" specifically refers to the amount of time required to receive data from a memory after issuing an address.
While these numbers are clearly important, a more useful measure
of a cache's performance is the "average memory access time"
which is defined as [Henn90]:
The miss penalty can be broken into two important components: "transfer time" and "access time." The transfer time is the time required to send an address from the lower level of cache (where the miss occurred) to the higher level of cache and the time required to send back the requested data. The access time is simply the amount of time required for the RAM to access the requested data. The transfer time is a function of interconnect and packaging technology, while the access time is a function of device technology.
The fundamental unit of data on which cache transactions take place is the "block." Each cache level may use different sized blocks, but when a higher level of cache sends data to the next lower level of cache, the dimension of the block transferred must be that used by the lower level of cache. Anytime the CPU needs to access a word within a given block, the entire block must be brought into the cache if not already present. The space in the cache in which a block is stored is called a "block frame" or "line."
The data buses between levels of hierarchy need not be as large as the blocks which are transferred on them, in which case multiple sub-block transfers must take place for each transaction. For example, if the cache block is twice as wide as the data bus, then two half-block transfers are required any time the cache must communicate with a higher level of memory.
When the CPU requests data from memory, the primary cache intercepts the request and determines whether the requested data is immediately available. If the primary cache does not have the requested data, it issues a request to the secondary cache. Assuming the secondary cache has the requested data, it will send it to the primary cache. The amount of data sent is equal to the size of the primary cache block, not the CPU word. If the secondary cache does not have the required data, it must, in turn, retrieve it from the next higher level of memory. The amount of data that the secondary cache would have to retrieve is equal to the secondary cache block size.
By using a block size bigger than the CPU word size, the cache takes advantage of the principle of spatial locality; if the CPU needs a particular word at a given instant, it will likely need the rest of the words in the block shortly. Rather than incur the overhead of multiple communications between caches, it can be beneficial to send several words at once, hence the transfer of an entire block at once. Determining exactly how big to make the blocks for each level of hierarchy can be complicated, and is often accomplished through simulations.
Each cache has room for a fixed number of blocks. A given CPU address may map to a subset of the available cache block locations. Since the cache will not have room to hold the entire CPU address space, several memory address will map to each cache block frame. Each cache will have an content-addressable or associative memory (called a "tag RAM") which stores a directory providing the mapping between the CPU address space and the cache block memory. Whenever the cache needs to reference its data memory it must search the tag RAM to determine which blocks it has stored in its block frames.
The directory entries which are written into the tag RAMs are
called "tags" and are usually directly extracted from
the address of the block. There is one tag entry for each block
frame. For example, if, in a given cache, any block may be stored
in any block frame, then, as each block is stored, the tag entry
for a given frame will be the block address of the block being
stored in that frame. As an alternate example, suppose a cache
has only two block frames, one of which is permitted to store
even block addresses and the other of which holds only odd block
addresses. Then the tag entries would not need to contain the
lowest order bit (which determines "oddness" or "evenness")
since to find an odd or even block address the cache knows precisely
which block frame to check, and needs only to know if the specific
block it's looking for is present. In short, the portion of the
block address which is necessary to reconstruct a block address
when the frame in which the block is stored is known is stores
as the tag.
An important issue in cache design is what to do when the CPU attempts to write something into a particular memory address (i.e. perform a STORE). Although most cache accesses will be reads (after all, all instruction fetches will be reads, and few instructions write to memory [Henn90]), writes are common enough that some care must be taken in handling them.
The reason that writes are complicated is that, unlike reads, the actual transaction can not take place simultaneously to determining if the proper address is stored in the cache. On a LOAD (a cache read), the cache can check the associative memory it uses to determine if the appropriate address is present simultaneously to actually reading the requested data. If the required data is available, then by the time the cache realizes that a hit has occurred, the data has already been read from the data RAMs, and no penalty is incurred for having had to check the tag RAM. If, during a STORE, the cache were to write the new data into its data memory while checking its associative memory, and the proper address were not available in the cache, the cache will have overwritten data that it might need later. For this reason it is necessary to first read the cache, and only write into the cache once the proper address is found. This is also done because the CPU word size is usually smaller than the cache block size, so the cache may have to read an entire block, modify one word, and write the modified block back into the cache. All this makes a write take longer than a read.
When actually writing into the cache, there are two possible
options.
In a copy-back design the cache tag RAM's will likely store a bit which determines whether the cache block in question has been modified by the CPU in the time since it was retrieved from higher levels of memory. If the CPU has modified the block by performing a STORE, and the higher levels of memory have not been updated, the block in the cache is called "dirty." If the CPU has not modified the block, or the block was modified but the higher levels of memory have been updated, the block is called "clean."
If a block with a different address needs to be stored in the space occupied by a dirty block, the dirty block must be written to the next higher level of memory or its modifications will be lost. The advantage to this type of cache over a write-through cache is that the time consumed in writing to the next level of memory need only take place when this situation occurs. In a write-through cache, multiple writes to the same memory block would each result in block transfers to the next higher level of memory, whereas, in a copy-back design, no block transfers occur until another block needs to occupy the modified block's frame.
Write through designs do, however, have the advantage of being simpler to implement, particularly in multiprocessor designs with shared memory in which it is important that the highest, shared, level of memory be consistent with the caches at all times.
To prevent the need to wait for the write through to occur on every write, it is common to include buffers between the cache levels. This buffering scheme can mitigate the simplicity advantages of the write through design, however.
A second issue in handling memory writes is what to do in the
event of a write miss. A write miss occurs when the CPU attempts
to write to a location whose cache block address is occupied by
other main memory addresses. In a copyback cache the block occupying
this frame will be transferred to higher levels of memory. In
a write-through cache this is not necessary. In either case it
is still necessary to actually perform the write that was requested
by the CPU. There are two methods of accomplishing this:
Fetch on write is used when the expectation is that the modified
block will be modified many times in close temporal proximity.
Write around caches are usually used in write through designs
since modifications to memory have to filter through to main memory,
anyway.
Each level of the cache hierarchy may be of differing dimensions,
which will be different in dimension from the main memory. As
a result, a block from a lower level of memory or the CPU must
be mapped to the proper location in each cache. Caches are frequently
classified by the manner in which this mapping is accomplished.
In a direct mapped cache each main memory address maps to one
and only one location within the cache. Multiple main memory addresses
may map to a single cache address, however, if the cache size
is smaller than the main memory address space. Typically the
mapping is accomplished using the modulus function. For example,
if the cache contains eight blocks as shown in Figure 1.4, then
main memory block address 9 would map to block 1 in the cache
(9 mod 8 = 1). Note that main memory block addresses are used.
For example, if there are four words per block, then address 32
through 35 would be contained in block 9.
CACHE |
| |||||||||||||||
MAIN
MEMORY | ... | |||||||||||||||
8 | 9 | 10 | 11 | ... |
The direct mapped cache has the advantage of being simple to
implement. In addition, it is simple to determine whether or not
a given main memory address is stored in the cache because there
is only one possible place it can be stored. On the other hand,
if there is a need to store another address which maps to the
same block frame, the block is displaced, at the very least entailing
a block transfer penalty, and, at worst, causing the displaced
to be swapped back into the cache again if it is needed shortly.
CACHE |
| |||||||||||||||
MAIN
MEMORY | ... | |||||||||||||||
8 | 9 | 10 | 11 | ... |
In a fully associative cache, any main memory address can be stored in any block in the cache. For example, as shown in Figure 1.5, main memory block address 9 could be stored in any block of the eight-entry cache.
One advantage of a fully associative cache is that as long as there is a free block in the cache, then there is no need to ever replace the cache contents with other addresses. If all of the blocks are full, then ideally one would wish to replace the block which will not be needed for the longest time [Matt70]. Since this is difficult to accomplish, alternate cache replacement algorithms such as "least-recently used" [Henn90] are often used.
A disadvantage of this approach is that when a block is sought,
the entire cache must be searched. This will either require lots
of hardware (a comparator for each block frame) or a lot of time
(the contents of each block frame serially being passed through
a single comparator). A second disadvantage is that implementing
the algorithms needed to decide where to store blocks (somehow
a decision must be made as to where to store each block, since
any frame is a legitimate destination), and the multiplexing logic
needed to route the data appropriately, require a comparatively
large amount of hardware.
A set associative cache is similar to a direct-mapped cache,
but for each address in the main memory address space there is
more than one possible block location in the cache. The possible
block locations to which a particular main memory address block
can be mapped is called a set.
CACHE |
| |||||||||||||||
MAIN
MEMORY | ... | |||||||||||||||
8 | 9 | 10 | 11 | ... |
Figure 1.6 shows an example of a set associative cache with eight set addresses and two blocks per set. Using the modulus mapping that was used in the direct-mapped cache, main memory block address 9 is once again mapped to cache set 1. It may occupy either block within that set, however. The location where a particular block is stored may be chosen using random, least-recently-used, or other algorithms, as in the fully associative cache.
The fully associative and direct-mapped caches can be treated as special cases of the set associative cache. In the case of the fully associative cache there is only one set, encompassing the entire cache. In the case of the direct-mapped cache, the set size is one block, and there is one set for each block in the cache.
The set associative cache requires fewer hardware resources than the fully associative cache in that the multiplexors which route data into the data memories can be smaller (since the number of possible block frames in which a given block may be stored is typically fewer in number), and the number of comparators necessary within the tag RAM if a parallel search is to be done is fewer. On the other hand, the direct-mapped cache requires no multiplexors and only one comparator.
An advantage of the set associative cache over the direct-mapped
cache is that if a particular block frame is occupied, it is possible
that alternate block frames within the same set will be available
for new data.
In the 1940's, researchers at Harvard University built the Mark series of computing machines. The Mark-III and the Mark-IV were stored program machines, with separate memories for the instructions and data. While this type of architecture is rare today, it is common for a machine to have a shared main memory, but separate instruction and data caches; this is called a "Harvard architecture." The alternative, a single, shared cache, is called a "unified architecture" or "Von Neumann architecture." Figure 1.7 illustrates each type of cache architecture.
The advantages of separating the data and instruction caches
is that it makes it simpler for instructions and data to be simultaneously
fetched from memory. In systems in which the CPU is pipelined
and can fetch instructions and operand simultaneously, it is a
great advantage to have separate buses to memory to fetch these
items, rather than forcing a stall and fetching them one at a
time.
On the other hand, when a Harvard architecture is used, it is
generally impossible to modify the instruction stream (the instruction
cache is capable of reading memory but not writing to it), and
thus self-modifying code, an important element of certain types
of artificial intelligence systems, becomes much more difficult
to implement. In addition, since the two caches can not share
cache RAM's, the cache hierarchy is less able to adapt to changing
conditions by partitioning different amounts of memory to holding
instructions and data. It is possible, however, to optimize each
cache since they need not be identical.