Chapter 2

Cache Architecture and Microarchitecture

In designing a cache for a given processor, the organization and parametric issues which do not involve the actual circuit-level design must be settled so as to provide maximum performance while requiring a minimum number of devices. These issues are labeled "architecture" although, unlike the traditional computer engineering definition of the word, the cache architecture is implementation specific and should be largely transparent to the programmer. Given the technology to be used in implementing a processor, it is possible to evaluate the design space to determine the cache parameters and architecture which result in the best overall system performance. Architectural parameters include cache dimensions and organization, communications bus widths, and write behavior.

"Architecture," in computer engineering terms, is usually defined as the part of the computer design which is visible to the programmer [Henn96], either in terms of functionality or relative timings (the difference in timings of various operations on a given processor due to available functional units, as opposed to the overall speed differences between one processor and another.) In the general sense this would include issues such as the instruction set, addressing modes, quantity of registers, virtual address space, trap behavior, and number of functional units.

Design issues which apply only to a given implementation and are not visible to the programmer are termed "implementation" or "design" issues to differentiate them from architectural issues.

"Microarchitecture" usually refers to high-level issues such as dataflow organization which are not visible to the programmer.

Since any cache design should be essentially transparent to the programmer (except insofar as varying the cache design may affect the relative timings of memory-accessing instructions), almost all of the actual cache design would be implementation specific given the traditional definition. For this reason the line between cache architecture and cache microarchitecture is less important.

  2.1. Trace-driven Simulation

In evaluating cache designs it is helpful to consider metrics which take into account the performance of the system as a whole, given a particular cache configuration. One useful metric is Cycles per Instruction (CPI), which is defined as the average number of processor cycles required to execute a single instruction.

Of course, determining how long a given instruction will require to execute entails knowing whether the instruction accesses the cache, whether the cache contains the required data or needs to perform a copyback, and many other state-dependent details.

One technique frequently used to determine the performance of cache designs is trace-driven simulation. The idea behind this technique is to analyze the behavior of workloads (sets of processes) actually intended to operate on a target design. A "cache trace" is a time-ordered list of memory references made by this target workload. Typically the trace is obtained by introducing hardware between a CPU and memory which captures which addresses are referenced when, and in what manner. An alternative method would be to simulate the target code on a software simulator capable of storing memory accesses. In order to evaluate the performance of different cache designs in the F-RISC/G processor, the DineroIII trace-driven simulator [Hill84] was used, with three traces based on actual running code.

In order to use cache traces as a method for determining the overall performance of a cache architecture over a wider variety of uses, three problems must be overcome [Ston90].

General purpose machines such as F-RISC are particularly difficult to represent with a single trace, and thus multiple traces are often used. Typically, suites of traces such as those encompassed by the SPEC92 benchmarks are used. As an alternative to traces captured from actual running code, one may use traces generated by statistical models of memory reference patterns (synthetic traces).

The length of the cache trace is particularly important because the trace length required is estimated to grow as the cache size raised to the 1.5 power [Ston90]. As cache size increases the number of misses which occur in a trace of given length decreases, thus making them statistically less significant.

In order to determine how long a trace must be, a statistical model can be introduced. It may be assumed that cache misses are a Bernoulli process with probability m of "success" and h = 1 - m of "failure." This is not a particularly good approximation because in a Bernoulli process cache misses would occur independently of each other, which is clearly not the case in the cache, but it should serve as a useful lower bound. If the miss-generating process is Bernoulli in nature, then the distribution of misses is binomial [Deve91].

If the trace length is T, the expected number of misses is M = mT. The variance of M is mTh = Mh. The expected miss ratio would then be M/T, and the variance of the expected miss ratio would be Mh/T^2 = mh/T. The standard deviation, which is frequently used as a measure of how close to the expected value the true mean of the data is likely to be, is the square root of the variance, or sqrt(mh/T). Most of the variation will be found within two standard deviations of the mean.

The true mean miss ratio can therefore be estimated as m +/- 2 sqrt(mh/T). So, if we want the trace-based cache evaluation procedure to produce an accurate result, we must minimize the deviation term.

The traces used in evaluating the F-RISC cache were on the order of 1,000,000 references in length. Since the F-RISC cache is small, a comparatively poor hit rate of 0.85 will be assumed (solely for the purpose of determining whether the cache traces are of sufficient length; the actual miss rate can be determined from simulations). In this case, the true mean miss ratio would be estimated to be 0.15 +/- 0.000714, meaning that the deviation is sufficiently small to conclude that the traces are long enough to properly represent the workloads they were derived from.
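The bound above can be checked with a short script (a sketch; the 0.15 miss rate and 1,000,000-reference trace length are the assumptions stated in the text):

```python
import math

def miss_ratio_bound(m, trace_length):
    """Two-standard-deviation bound on the measured miss ratio, assuming
    misses form a Bernoulli process (so the miss count is binomial)."""
    h = 1.0 - m                              # "failure" (hit) probability
    sigma = math.sqrt(m * h / trace_length)  # std. dev. of the miss ratio
    return 2.0 * sigma

# Assumed miss rate 0.15 over a 1,000,000-reference trace, as in the text.
print(0.15, "+/-", round(miss_ratio_bound(0.15, 1_000_000), 6))  # +/- 0.000714
```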

The three traces used in the DineroIII simulations are tex, spice, and gcc which are representative of three common code bases frequently executed on workstation class machines. Table 2.1 shows some of the important characteristics of these traces.

[Table 2.1: characteristics of the traces, including data read and data write counts per trace]

The value of CPI for a given trace is calculated by dividing the number of cycles required to execute the trace by the number of instructions in the trace. The CPI for a processor is a compilation of CPI figures from several sources. For example, the processor may be designed so that every ALU operation requires one cycle to execute, in which case the CPI due to ALU operations would be 1. If the processor were always able to execute two instructions in parallel, then the CPI due to ALU operations would be 0.5.

In the case of the F-RISC / G processor, the pipeline structure is such that the CPU component of CPI is expected to be approximately 1.45 [Phil93, Greu90]. This CPI component is sometimes called the latency component of CPI. Although the CPU is designed to execute with a peak CPI of 1.0, pipeline latencies (the inability to keep parts of the processor busy due to code dependencies) result in reduced performance.

In addition to CPI penalties caused by pipeline inefficiencies, the cache contributes to an increased CPI. This component of CPI is known as the stall component of CPI. Some percentage of the instructions executed by any given code will result in cache accesses (in fact, all instructions result in instruction cache accesses), and the overall CPI of the processor depends on the cache's ability to perform the requested transaction in the time allotted. Whenever the primary cache must resort to communicating with the secondary cache, the effect is to stall the processor and prevent it from accomplishing any useful work.

Table 2.2 shows the results of the cache simulation using the SPICE trace, with associativity set to 1, a Harvard architecture, and 2048 byte caches. Copyback and write allocate are assumed. A miss penalty of 5 cycles is used in producing all of the results given in this chapter. The numbers shown are CPI, ignoring losses due to inefficiencies in pipeline management. The following equation is used to calculate stall CPI from the data provided by DineroIII for the Harvard architecture case:

CPI = (Instructions + 5 (Bus Traffic/Bus Width)) / Instructions

Table 2.2: Stall CPI vs. block size and bus width (Spice trace)

Block                                    Bus Width (Bytes)
Size        1        2        4        8       16       32       64      128     256     512    1024   2048
    1    2.350
    2    3.666   2.3330
    4    6.300   3.6502   2.3251
    8    7.550   4.2750   2.6375   1.8187
   16   10.034   5.5168   3.2584   2.1292   1.5646
   32   13.606   7.3031   4.1515   2.5757   1.7878   1.3939
   64   23.438  12.2189   6.6094   3.8047   2.4023   1.7011   1.3505
  128   55.223  28.1117  14.5558   7.7779   4.3889   2.6944   1.8472   1.424
  256  125.32   63.1575  32.0787  16.539    8.7696   4.8848   2.9424   1.971   1.4856
  512  356.89  178.940   89.9704  45.485   23.2426  12.1213   6.5606   3.780   2.3902  1.695
 1024  871.15  436.076  218.538  109.76    55.3845  28.1922  14.5961   7.798   4.3990  2.70   1.850
 2048 2242.7  1121.8    561.413  281.20   141.103   71.0516  36.0258  18.513   9.7565  5.378  3.189  2.094


Bus traffic and bus width must share the same units, typically words. Instructions are used rather than accesses in the numerator since in a Harvard architecture the cache can respond to both an instruction and a data access simultaneously, so the minimum number of accesses that will be made if every access is a hit is equal to the number of instructions fetched.
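The Harvard-architecture stall-CPI equation above can be sketched in code (the argument names and the sample figures below are illustrative, not taken from the simulations):

```python
def harvard_stall_cpi(instructions, bus_traffic, bus_width, miss_penalty=5):
    """Stall CPI for a Harvard cache:
    CPI = (Instructions + penalty * (BusTraffic / BusWidth)) / Instructions.
    bus_traffic and bus_width must share the same units (typically words)."""
    bus_cycles = miss_penalty * (bus_traffic / bus_width)
    return (instructions + bus_cycles) / instructions

# Hypothetical figures: 1,000,000 instructions, 320,000 words of bus traffic,
# a 16-word (64-byte) bus.
print(harvard_stall_cpi(1_000_000, 320_000, 16))  # 1.1
```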

It is clear from these results that for this trace small block sizes perform best. This is due to the fact that, since the cache memory is so small, a large block size imposes a great penalty in terms of the percentage of available space taken up by a single block. When a new block address needs to be stored there are few places to store it when block sizes are large.



Memory accesses tend to be clustered around several physically separated addresses (Figure 2.1) rather than a single centralized address (Figure 2.2) [Ston90]. One cluster may represent the instruction pointer and nearby addresses, while others may represent the various data structures in use by various processes. Each of these clusters of likely references would tend to be stored in a separate block frame or group of frames in the cache, so long as enough block frames exist. If there are very few block frames, then a large percentage of the time that an access to a cluster that is not currently in the cache occurs, another cluster of frames likely to be referenced soon must be removed from the cache. In this case more block frames are preferable to more space per frame.


Table 2.2 also shows that for a given block size the best performance occurs when the bus width is equal to the block width. If the bus width were smaller than the block size, then multiple bus accesses would be necessary. If the cache is not needed on the next cycle it would be possible to un-stall the CPU while the remainder of the block is transferred, although this costs more hardware. Figure 2.3 is a graph showing the cache stall component of CPI as a function of block size and bus width for the Spice trace.
Table 2.3: Stall CPI vs. block size and bus width (Tex trace)

Block                                    Bus Width (Bytes)
Size        1        2        4        8       16       32       64      128     256     512    1024   2048
    1    1.756
    2    1.772   1.3858
    4    1.798   1.3991   1.1996
    8    1.863   1.4314   1.2157   1.1079
   16    2.132   1.5662   1.2831   1.1416   1.0708
   32    3.05    2.0251   1.5126   1.2563   1.1281   1.0641
   64    6.018   3.509    2.2546   1.6273   1.3136   1.157    1.0784
  128   18.673   9.836    5.4182   3.2091   2.1046   1.552    1.2761   1.1381
  256  112.58   56.79    28.895   14.947    7.9737   4.487    2.7434   1.8717  1.4359
  512  370.35  185.68    93.338   47.169   24.085   12.542    6.7711   3.8856  2.4428  1.7214
 1024 1153.2   577.11   289.05   145.03    73.013   37.007   19.003   10.001   5.5008  3.2504  2.125
 2048 3072.4  1536.7    768.84   384.92   192.96   96.980   48.990   24.995  12.998   6.9988  4.0    2.5


Table 2.3 shows the simulation results for the Tex trace. The magnitude of the CPI figures is less than that obtained with the Spice trace; however, the overall behavior is similar. The best results occur with equal block sizes and bus widths, and the best overall result (1.064) occurs with a block size and bus width of 32 bytes. The Spice trace exhibited the best CPI (1.351) with a bus size and block width of 64 bytes.

Figure 2.4, plotted on the same scale as Figure 2.3, shows that the magnitude of the CPI results obtained using the Tex trace is lower overall than the results obtained with the Spice trace. Once again, at a given bus width, smaller block sizes seem to yield superior performance.


Table 2.4 and Figure 2.5 show the results of running the simulation with the gcc trace. As with the other traces, the best results for a given bus width occur at smaller block sizes. The optimal point (1.56) is a block size and bus width of 256 bytes.
Table 2.4: Stall CPI vs. block size and bus width (gcc trace)

Block                                    Bus Width (Bytes)
Size        1        2        4        8       16       32       64      128     256     512    1024   2048
    1    3.651
    2    6.278   3.6390
    4   11.53    6.2685   3.6342
    8   13.68    7.3409   4.1704   2.5852
   16   17.76    9.3805   5.1902   3.0951   2.047
   32   25.921  13.460    7.2304   4.1152   2.557    1.7788
   64   43.190  22.095   11.547    6.2738   3.636    2.3184   1.6592
  128   82.083  41.542   21.271   11.135    6.067    3.5338   2.2669   1.6334
  256  143.3    72.17    36.585   18.792    9.8963   5.4481   3.2241   2.1120  1.556
  512  322.41  161.70    81.352   41.176   21.088   11.044    6.0221   3.5110  2.256   1.6278
 1024 1013.0   507.00   254.00   127.50    64.250   32.625   16.813    8.9063  4.953   2.9766  1.9883
 2048 1990.8   995.9    498.46   249.73   125.36    63.183   32.09    16.546   8.773   4.8865  2.9432  1.9716



Table 2.5: Weighted average stall CPI over the three traces

Block                                    Bus Width (Bytes)
Size        1         2        4        8       16       32       64      128     256     512    1024   2048
    2     4.062     2.531
    4     6.898     3.949    2.474
    8     8.133     4.567    2.783    1.892
   16    10.564     5.782    3.391    2.195    1.598
   32    15.020     8.010    4.505    2.753    1.876    1.438
   64    26.934    13.967    7.483    4.242    2.621    1.810    1.405
  128    54.527    27.763   14.382    7.691    4.345    2.673    1.836    1.418
  256   128.143    64.572   32.786   16.893    8.946    4.973    2.987    1.993   1.497
  512   348.433   174.716   87.858   44.429   22.715   11.857    6.429    3.714   2.357   1.679
 1024  1000.236   500.618  250.809  125.905   63.452   32.226   16.613    8.807   4.903   2.952  1.976
 2048  2385.306  1193.153  597.077  299.038  150.019  75.510   38.255   19.627  10.314   5.657  3.328  2.164



Table 2.5 shows the weighted (by length) average of the CPI results for each of the three traces. This table summarizes the behavior of the cache given a Harvard architecture with direct-mapped, 2 kB caches.

The optimum point (1.41) occurs with a block size and bus width of 64 bytes. Given the estimated latency CPI component of 1.45, the total CPI taking into account both latency and stall CPI components would be 1.86. This is due to the fact that pipeline latencies would allow instructions to access the cache once every 1.45 cycles on average at most. The cache simulator assumed that instructions could access the cache every cycle as an upper bound. Adding the 0.45 to the stall CPI therefore serves as a crude estimate of the overall CPI.

Figure 2.6 shows a plot of the weighted mean stall CPI for all three cache traces as a function of block size and bus width.

Having determined that the optimum configuration occurs when block size is equal to bus width, it is possible to plot the stall component CPI as a function of equal block sizes and bus widths as shown in Figure 2.7, where the results of each trace are plotted along with the weighted mean. From this plot it is clear that the minimum CPI occurs at a bus width and block size of 64 bytes (512 bits). It is also interesting to note that at half that size (32 bytes) the CPI is approximately 1.44, which is only 0.03 cycles per instruction worse than the optimal point.

Although there is little apparent CPI penalty to be paid for utilizing a 256 bit block size rather than a 512 bit block size, technical issues complicate the matter for F-RISC / G. The fundamental RAM block used for both the cache tags and the cache data in F-RISC / G is 16 bits wide by 32 rows deep. An 8 bit wide by 32 row higher speed RAM block is also available. (The larger RAM block offers an advantage in that using it rather than two smaller blocks incurs a smaller hardware penalty, since address decoding hardware can be shared.)


In the case of a 512 bit block size, it is possible to place four of the larger blocks or eight of the smaller blocks on each die, and simply organize the blocks in parallel within each die to give 512 input and output lines (64 per chip). Figure 2.8 illustrates the partitioning if the larger RAM blocks are used.

Given the low yield of the process used in fabricating the F-RISC / G chips, the maximum number of the larger blocks which can be placed on a die is 4, and the maximum number of the smaller blocks which can be placed on a die is 8. Implementing a 2048 byte cache requires 8 of these die.


If the bus is 256 bits wide one would ideally like to use half the number of RAM chips, placing 8 large blocks per chip or 16 small blocks per chip, where three quarters of the blocks on each chip are used to increase the depth of the cache. In the case of F-RISC / G, yield is insufficient to allow such an organization. Instead it would be necessary to use 16 chips and use half of the blocks on each chip to increase cache depth. Multiplexors and control logic would need to be employed on I/O paths to select which block to read from and which block to write into.



Figure 2.9 illustrates how this partitioning would be accomplished. Note that the circuitry required to select the proper block for reading and writing represents a large hardware penalty in addition to imposing additional delays on the cache critical paths.

Figure 2.10 is a plot of stall CPI as a function of set size and equal block and bus sizes for a Harvard architecture with dual 2 kB caches and copyback. The Least Recently Used (LRU) replacement algorithm, which is fairly simple to implement for very small numbers of sets, is assumed. From this plot it can be seen that larger numbers of sets tend to produce better CPIs, although the bus width and block size seem to have a larger effect on the overall CPI.

Figure 2.11 shows the effect of varying set size for the three block sizes and bus widths which provide the best results for 2 kB Harvard caches employing copyback and the timing constraints mentioned earlier. As can be seen, the CPI does not improve markedly as set size is increased beyond 4. The effect of moving from a direct-mapped (set size=1) cache to a 4-way set associative cache, however, is fairly significant.

Increasing set size is not without its costs, however. First of all, as the set size increases, it is necessary to employ multiple comparators in order to check each line of the target set against the CPU tag simultaneously. In addition, the tag RAM size must increase proportionally to the number of lines per set: if it is possible to store n tags in a particular set, then the tag RAM must increase in size by a factor of n. The net result is that almost all of the cache control circuitry must grow by a factor equal to the number of lines per set.


In a low-yield technology such as that employed in F-RISC / G, it may not be possible to include all of this circuitry on a single die. In this case the control circuitry must be partitioned among several die. One possible partition would have each set divided into n die where n is the number of lines per set. Each die would contain a tag RAM and a comparator. Some mechanism of communication between the die would be necessary in order to determine whether any of the die contained the appropriate data. One slice would likely be configured as the "master," and each of the other slices would report the results of the tag search to it.

In using a set associative scheme one must pay a speed penalty in addition to the aforementioned hardware penalty. At minimum there will be two multiplexor delays on the critical path, as it is necessary to select between multiple locations for each RAM read or write. Deciding where to store data within each set (implementing a replacement algorithm) can also cause a delay. In a low-yield technology in which the controller circuitry must be partitioned, one must also pay driver / receiver and communications delays. In the case of F-RISC / G the cache critical path requirements could not be met with these additional delays.

The apparent advantage of a set associative cache is mitigated somewhat if one considers the complicated hardware required to implement replacement algorithms. The popular Least Recently Used algorithm, for example, requires memory hardware to keep track of which line in a set was last used, hardware to update this memory when appropriate, and comparison hardware to determine which line "wins" when the write into the data memories is to take place.

Other algorithms, such as "first-in-first-out" (FIFO) and "random," are somewhat simpler to implement. Simulations show, however, that these algorithms do not usually match the performance of LRU.
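The relative behavior of these replacement policies can be sketched with a toy set-associative simulator (the trace below is illustrative, not one of the traces used in this chapter):

```python
import random
from collections import OrderedDict

def count_misses(trace, n_sets, ways, policy="lru"):
    """Misses for an n_sets-set, `ways`-way cache over a list of block
    addresses, under 'lru', 'fifo', or 'random' replacement."""
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = 0
    for block in trace:
        s = sets[block % n_sets]
        if block in s:
            if policy == "lru":
                s.move_to_end(block)          # refresh recency on a hit
            continue
        misses += 1
        if len(s) >= ways:                    # set is full: evict a victim
            if policy == "random":
                del s[random.choice(list(s))]
            else:
                s.popitem(last=False)         # LRU: least recently used;
                                              # FIFO: first inserted (no refresh)
        s[block] = True
    return misses

trace = [0, 1, 0, 2, 0, 3, 0]                 # block 0 is "hot"
print(count_misses(trace, 1, 2, "lru"))       # 4: LRU keeps the hot block
print(count_misses(trace, 1, 2, "fifo"))      # 5: FIFO eventually evicts it
```

On this toy trace LRU retains the frequently referenced block while FIFO evicts it once it becomes the oldest resident, illustrating why LRU tends to win.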


Figure 2.12 shows the effectiveness of different replacement algorithms on the gcc and spice traces (assuming 2kB Harvard caches with a 512-bit block / bus size.) As can be seen from the figure, LRU does perform significantly better than FIFO or random replacement; using any of these algorithms, however, would still be a CPI improvement over a direct-mapped cache. There is little doubt, however, that implementing a set associative cache of any size would negatively impact cycle time.


Figure 2.13 illustrates the effect of various cache architectures on stall CPI. The graph shows CPI as a function of block and bus width for a Harvard cache with 2 kB per cache, a unified cache with 4 kB of single-ported RAM, and a unified cache with 4 kB of dual-ported RAM. A direct-mapped cache employing copyback is assumed.

In a unified cache with dual-ported RAM it is possible to read both an instruction and data simultaneously, while, for a single-ported RAM scheme, it is possible to perform only one access at a time.

As the graph shows, the Harvard cache tends to perform the best. For the unified cache designs, the use of dual-ported RAM provides the best results except at extreme block sizes.

A dual-ported unified cache would be expected to perform comparably to a Harvard cache since both can satisfy two requests at once, and this is indeed the case. The unified cache has a slight advantage at small block sizes because there are then many blocks, and the unified cache can allocate them between instruction and data references at a fine resolution; the Harvard cache suffers from the assumption that 50% of the total cache size should go to each cache.

The equation used for calculating CPI from DineroIII output for a unified cache with dual ported RAM is identical to that used for the Harvard architecture since in both cases the cache is capable of performing a data and instruction access simultaneously.

For the single ported unified cache, at most one cache access, instruction or data, can be accomplished at any time. As a result, the equation used to calculate CPI from the DineroIII output is as follows:

CPI = (Total Accesses + 5 (Bus Traffic/Bus Width)) / Instructions
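For comparison with the Harvard-architecture equation given earlier, this can be sketched as follows (the argument names and sample figures are illustrative):

```python
def unified_stall_cpi(instructions, total_accesses, bus_traffic, bus_width,
                      miss_penalty=5):
    """Stall CPI for a single-ported unified cache: every access, instruction
    or data, occupies the cache for a cycle, so total accesses replace the
    instruction count in the numerator."""
    bus_cycles = miss_penalty * (bus_traffic / bus_width)
    return (total_accesses + bus_cycles) / instructions

# Hypothetical: 1,000,000 instructions generating 1,300,000 total accesses,
# 320,000 words of bus traffic over a 16-word bus.
print(unified_stall_cpi(1_000_000, 1_300_000, 320_000, 16))  # 1.4
```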


Figure 2.14 shows the effect of varying cache size given a Harvard direct mapped cache employing copyback and a 64 byte block and bus width. The stall component of CPI drops below 1.5 at a cache size of 2048 bytes per cache. At twice that memory size there is comparatively little improvement in CPI, and there is little doubt that it would be extremely difficult to implement that much memory given the interconnect lengths that would be required and the difficulty in removing the heat from that many bipolar RAM blocks.
Cache Dimensions:
  Wide: When a miss occurs, a large number of consecutive words can be transferred to the cache at once, and no more misses will occur until an address outside that block is accessed. A large amount of hardware and signal routing is required. In addition, a large number of I/O drivers may switch simultaneously.
  Deep: The probability of having to replace a line is decreased. Either deeper RAM blocks (which tend to be slower due to increased bit line capacitance) or many blocks are needed.
Write Policies:
  Write-through: Reduces control logic complexity and cache coherency problems in multi-processing systems. Requires either a large write-through buffer or will result in many stalls.
  Copyback: Greatly reduces bus traffic to the secondary cache. Complicates control logic in the cache controller.
Write Location:
  Write allocate: Tends to reduce CPI since words that the CPU modifies tend to be used again. Complicates cache control logic.
  Write around: Simplifies cache control logic. Far simpler to implement in write-through designs.
Mapping:
  Direct mapped: Greatly reduced cache control complexity. Increases miss rate.
  Set associative: Reduced miss rate compared to a direct-mapped cache. Not as hardware / time intensive as a fully associative design. Requires more hardware and delay than direct-mapped.
  Associative: Minimizes miss rate. Replacement algorithm is expensive to implement in hardware.
Cache Organization:
  Harvard: Simpler control hardware. No need for multiple ports. Self-modifying code is not supported. Rigid memory allocation.
  Unified: Flexible memory allocation can increase hit rate. Supports self-modifying code. Multiple memory ports needed.


Table 2.6 shows qualitatively how varying the cache architecture and dimensions can affect the cache cycle time and hit rate, as well as other penalties and benefits of each design decision.

Based on these cache simulations, the design point which was chosen for the F-RISC / G prototype cache is as listed in Table 2.7.
Instruction Cache Size:  2 kB
Data Cache Size:         2 kB
Write Policy:            Copyback, Write Allocate
Bus Width:               512 bits (64 bytes)
Block Width:             512 bits (64 bytes)


Assuming a miss penalty of 5 cycles, the predicted stall CPI for this design is approximately 1.41.

Table 2.8 shows the results of the cache simulations broken down by type of event. The probability of each event occurring is also given. Based on these events alone, the stall CPI would sum to 0.73. What remains unaccounted for is the 68% of instructions which are either ALU operations or BRANCHes. Each ALU or BRANCH operation can be assumed to take 1 cycle (since BRANCH misses are already accounted for in the "Instruction miss" category). Therefore, the net stall CPI would be 0.73 + 0.68 = 1.41, as reported above. Note that the "Reads" figure presented in Table 2.8 includes the reading cycle that occurs at the beginning of each STORE; thus the write penalty is only 1 additional cycle.
Event                 # Occurrences   Probability   Cycles   Weight
Reads:                       440985       .21          1       .21
Writes:                      254081       .12          1       .12
ALU / Branch:               1442343       .68          1       .68
Instruction misses:           77527       .04          5       .20
LOAD misses:                  50635       .02          5       .10
STORE misses:                 17881       .01          5       .05
Copybacks:                    27179       .01          5       .05
TOTAL:                                   1.0                  1.41
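The bottom line of the table can be reproduced directly from the probability and cycle columns:

```python
# (probability, cycles) for each event class, taken from Table 2.8
events = {
    "reads":              (0.21, 1),
    "writes":             (0.12, 1),
    "alu_branch":         (0.68, 1),
    "instruction_misses": (0.04, 5),
    "load_misses":        (0.02, 5),
    "store_misses":       (0.01, 5),
    "copybacks":          (0.01, 5),
}
stall_cpi = sum(p * cycles for p, cycles in events.values())
print(round(stall_cpi, 2))  # 1.41
```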


To calculate the latency CPI it is necessary to examine the way in which instructions can be rearranged by the compiler in order to avoid latencies. It is expected that the compiler will be able to fill approximately one half of the latencies with useful instructions [Greu90]. The latency for a LOAD (including the cache read portion of a STORE) is 3, and the probability of this occurring is 0.21. Assuming an optimizing compiler capable of filling half of the latency, LOADs and STOREs would contribute 0.32 to the overall CPI. Taken BRANCHes also incur a latency penalty, but the probability of occurrence is not derivable from the cache simulations. If a 10% likelihood and an optimizing compiler capable of filling half of the latency are assumed, then the contribution to CPI from taken BRANCHes is 0.15. The total CPI is therefore predicted to be 1.88.
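The latency accounting above can be laid out explicitly. The 10% taken-BRANCH frequency is the assumption stated in the text; the 3-cycle taken-BRANCH latency is inferred here from the 0.15 figure:

```python
def latency_cpi(prob, latency_cycles, fill_fraction=0.5):
    """Extra CPI from an interlock, assuming the compiler fills
    fill_fraction of the latency slots with useful instructions [Greu90]."""
    return prob * latency_cycles * (1.0 - fill_fraction)

stall_cpi  = 1.41                   # from the cache simulations (Table 2.8)
load_cpi   = latency_cpi(0.21, 3)   # LOADs (incl. STORE read cycle): ~0.32
branch_cpi = latency_cpi(0.10, 3)   # taken BRANCHes (assumed 10%): 0.15
total_cpi = stall_cpi + load_cpi + branch_cpi
print(total_cpi)  # ~1.875, i.e. about 1.88
```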

  2.2. Cache Transactions

From the perspective of the CPU, the only operations which can take place in the cache are a "LOAD," in which information is fetched from the primary cache and sent to the CPU, and a "STORE," in which information is sent from the CPU to the primary cache for storage. Depending on the state of the cache when the CPU attempts a transaction, however, things can get more complicated.

Perhaps the most complicated situation occurs in caches which employ copyback for writes. In such a cache not only is it necessary to handle hits and misses, but clean and dirty operations become an issue, as well.

For a copyback cache the possible transactions are a load hit, clean load miss, dirty load miss, store hit, clean store miss, and dirty store miss.
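The six cases can be enumerated mechanically; as a trivial sketch (the dirty flag refers to the victim block frame, and applies only on a miss):

```python
def classify_transaction(op, hit, victim_dirty):
    """Name the copyback-cache transaction for a CPU request.
    op is 'load' or 'store'; victim_dirty applies only on a miss."""
    if hit:
        return op + " hit"
    return ("dirty " if victim_dirty else "clean ") + op + " miss"

print(classify_transaction("load", True, False))    # load hit
print(classify_transaction("store", False, True))   # dirty store miss
```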

  2.2.1. Load Hit

A load hit occurs when the CPU attempts to read from an address which is available in the cache. In this case the CPU generates an address which gets sent to the cache. The cache will usually simultaneously check its tag RAM to see if the requested address is available and start reading from the appropriate location in the data RAM. The data will be sent to the CPU when it is found.


Figure 2.15 shows a timing diagram for this operation. The address needed by the tag RAM consists of the tag, which is compared to the data stored in the tag RAM, and the line address, which indicates which block (or set of blocks) to examine. The data RAM needs only the line address in order to access the requested data. The tag comparison and the data RAM access take place simultaneously in order to reduce the overall transaction time.
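The address split can be sketched as follows. For a 2 kB direct-mapped cache with 64-byte blocks, as chosen earlier in this chapter, there are 32 lines, giving 5 line-address bits and 6 offset bits; the field widths are parameters here:

```python
def split_address(addr, line_bits, offset_bits):
    """Split an address into (tag, line, offset) for the parallel
    tag-compare and data-RAM read described above."""
    offset = addr & ((1 << offset_bits) - 1)
    line = (addr >> offset_bits) & ((1 << line_bits) - 1)
    tag = addr >> (offset_bits + line_bits)
    return tag, line, offset

# 2 kB direct-mapped cache, 64-byte blocks: 5 line bits, 6 offset bits.
print(split_address(7244, 5, 6))  # (3, 17, 12)
```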

  2.2.2. Clean Load Miss

A clean load miss occurs when the CPU attempts to retrieve data which is not available in the cache. In this case the cache must retrieve the requested data from the next level of memory.

Whatever the method used, the block retrieved from the next level of memory will be mapped to a specific location in the primary cache. If this location contains data which has been modified by the CPU, and if those modifications have not been reflected in the higher levels of memory (as in a copyback cache), then the primary cache must first send the next level of memory the data stored in that cache location before storing anything new there.


In the clean load case, the data already stored in the cache has not been modified, so this step is not necessary.

Figure 2.16 is a timing diagram for a clean load miss. Note that the secondary cache may begin its access in parallel with the primary cache, and that the data arriving from the secondary cache may or may not be written into the primary cache simultaneously with being sent to the CPU.

  2.2.3. Dirty Load Miss

When a dirty load miss occurs, the block fetched from the secondary cache can not be stored in the primary cache until the modified data already in that primary cache location is updated in the higher levels of memory.


Figure 2.17 shows a representative timing diagram for the dirty load miss case, also known as a "load copyback." Note that in this particular implementation the address from the CPU is sent to the secondary cache in parallel with the primary cache, allowing the secondary cache to get an early start retrieving the required data. By the time the primary cache realizes that a dirty miss has occurred (at the end of the tag comparison step) and has sent the dirty data to the secondary cache, the secondary cache has finished retrieving the requested data.

In the implementation shown in this figure, the secondary cache latches this data and stores the dirty data sent by the primary cache before sending the originally requested data. If the primary to secondary cache communications consists of separate input and output buses, then it would be possible for the secondary cache to send the requested data as soon as it is ready, in parallel with receiving the dirty data from the primary cache.

  2.2.4. Store Hit

A store hit occurs when the CPU attempts to write into a memory address which is already available in the primary cache. Regardless of whether the contents of that address have already been modified by the CPU, the primary cache need not perform a copyback to higher levels of memory.


Figure 2.18 is a timing diagram for a typical store hit. Note that two cache RAM cycles are required for stores, since it is necessary to first read the destination block in case a copyback needs to take place before actually performing the write. In the figure, the block write takes place after the compare, when the cache hit is detected.

  2.2.5. Clean Store Miss

In a copyback cache, if the CPU attempts to perform a store into an address which is not available in the cache, the cache must determine whether the block frame into which the requested address will be stored already contains an address which has been modified in the cache but not in higher levels of memory (and therefore needs to be copied back.) If the data has not been modified (the line is not dirty), then no copyback need take place.


Figure 2.19 is a timing diagram for a typical clean store miss. Since the CPU word size is typically smaller than the cache block size, and the cache always transfers blocks to and from higher levels of memory, the cache must perform the write in two stages. First, the entire block fetched from the secondary cache is written into the primary cache's data RAM; then the portion of that block which the CPU has asked to modify (presumably one word) is overwritten with the data supplied by the CPU.

  1. Dirty Store Miss

If the CPU attempts to write into a block already occupied by data which has been modified in the cache but not in the higher levels of memory, a copyback must take place.


Figure 2.20 is a timing diagram for a typical dirty store miss. The primary cache checks its tag RAM while the secondary cache begins its access. By the time the primary cache has determined that a dirty condition exists and has sent the dirty data to the secondary cache, the secondary cache should have finished its read access.

The two caches can perform their writes in parallel. Once again, the primary cache must first write the entire block from the secondary cache and then write the data that the CPU actually wanted to modify.

  1. Secondary Cache

The assumed access miss penalty for the primary cache, as previously stated, was 5 ns. This means that in order to achieve the CPI figures reported in [Phil93] the average time to access the next higher level of memory must be 5 ns, including the time required to transfer the address and receive the data.

It would be impossible to implement the entire 32-bit address space of the F-RISC / G processor in RAM with such a short access time, so a secondary level of cache is required.

Using the DineroIII simulator it is possible to generate a secondary cache trace. By simulating the primary cache with appropriate parameters, the transactions which would have occurred between cache levels can be stored into a file which can be converted to a new cache trace.

By simulating a primary cache employing a Harvard architecture with 2048-byte direct-mapped caches, copyback, and 512-bit wide blocks, secondary cache traces were created from the tex, spice, and gcc traces. Table 2.9 lists the lengths of these traces.

Cache Size (kB)    Hit Rate

  2    0.085   0.292   0.090   0.099
  4    0.335   0.597   0.310   0.335
  8    0.579   0.749   0.476   0.529
 16    0.731   0.824   0.657   0.693
 32    0.909   0.875   0.798   0.843
 64    0.961   0.909   0.863   0.901
128    0.969   0.909   0.911   0.932

The secondary cache was simulated, assuming 512-bit wide blocks, a direct-mapped unified architecture, and copyback. The hit rates for various secondary cache sizes are given in Table 2.10. A unified cache was simulated since information about whether particular transactions occurred as a result of an instruction or data miss is lost in the trace translation process, but these figures should provide a rough estimate of the performance of various secondary cache configurations.

The secondary cache mean access time, T, can be written as a function of the hit rate, h, the miss penalty for the secondary cache (the tertiary cache mean access time), P, and the time it takes the secondary cache to process a hit, H.

T = h H + (1 - h) P

The CPI calculations in this chapter assumed T = 5 cycles. From this equation it is possible to determine the required secondary cache hit time given a tertiary cache mean access time and a secondary cache hit rate. Figure 2.21 is a plot showing the required secondary cache hit time given a secondary cache hit rate and a tertiary cache mean access time.
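Solving the equation above for H gives the maximum allowable secondary cache hit time directly. A small sketch of this calculation in Python (the function name is illustrative; the numeric inputs are taken from the surrounding discussion):

```python
def required_hit_time(T, h, P):
    """Solve T = h*H + (1 - h)*P for the secondary cache hit time H.

    T: target mean access time (ns), h: hit rate, P: miss penalty (ns).
    """
    return (T - (1.0 - h) * P) / h

# 32 kB secondary cache (h = 0.909 from Table 2.10), 20 ns tertiary
# mean access time, 5 ns target mean access time:
H = required_hit_time(T=5.0, h=0.909, P=20.0)
print(round(H, 2))  # about 3.5 ns, matching the figure quoted in the text
```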

Based on these figures, one can conclude that an overall CPI of 1.88 can be achieved if a direct-mapped Harvard secondary cache with 512-bit wide blocks, 32 kB of RAM per cache, copyback with write allocation, and a 20 ns mean access time for the tertiary cache is employed. This would require a hit time on the secondary cache of around 3.5 ns.


Since the primary cache sends the secondary cache any address which is received from the CPU as soon as it reaches the primary cache controller, in some cases the primary cache may have more time to complete its access. If the secondary cache receives no addresses between when it initially receives an address and when the primary cache signals that a miss has occurred, the secondary cache can use the intervening time to access that cache address. In the F-RISC / G system the secondary cache may receive two additional addresses between receiving an address and its associated miss signal. If no addresses are received, the secondary cache has two additional cycles to perform its cache access. If the fraction of uninterrupted accesses is U, then the mean access time of the secondary cache can be written as:

T = (1 - U) h H + U h (H - 2) + U (1 - h) (P - 2) + (1 - U) (1 - h) P

It was possible to analyze the secondary cache traces to determine how often the secondary cache would be able to get a two cycle head start on its memory access. From this analysis it was determined that 66% of secondary cache accesses can benefit from this head start (Table 2.11).
Table 2.11: secondary cache accesses with a head start vs. without a head start.


Assuming a tertiary cache penalty of 20 ns and a secondary cache hit rate of 0.90, the required hit access time for the secondary cache to achieve the 5 ns overall access time is 4.84 ns. By having the secondary cache detect that it has an additional 2 ns to access two thirds of the incoming requests, the required mean hit time can be relaxed by 1.34 ns, which allows the secondary cache RAMs to be significantly slower.
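The head-start equation above simplifies algebraically to T = hH + (1 - h)P - 2U, so the required hit time can again be solved for directly. A sketch in Python (illustrative names; inputs are rounded values from the surrounding text, so small differences from the quoted 4.84 ns figure are expected):

```python
def required_hit_time_with_headstart(T, h, P, U, head_start=2.0):
    """Solve T = (1-U)hH + Uh(H-2) + U(1-h)(P-2) + (1-U)(1-h)P for H.

    The equation collapses to T = h*H + (1-h)*P - head_start*U, since an
    uninterrupted request starts 2 ns early whether it hits or misses.
    """
    return (T - (1.0 - h) * P + head_start * U) / h

# Tertiary penalty 20 ns, hit rate 0.90, 66% of accesses get a head start:
H = required_hit_time_with_headstart(T=5.0, h=0.90, P=20.0, U=0.66)
print(round(H, 2))  # roughly 4.8 ns
```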

In the case of the instruction cache, the situation is even better. The instruction cache will fetch a new instruction every cycle (barring stalls and traps). If no branches occur, then the secondary cache will be required to fetch a new block once every sixteen cycles, since each block contains sixteen instructions. As a result, if the secondary cache is capable of keeping track of how long it has been since a branch has occurred, it can take advantage of the large block size to significantly reduce the constraints on its data RAM access speed. Since cache simulations suggest that only 4% of all instruction fetches result in an access to the secondary cache, and, of these accesses, nearly all will be consecutive branches, the instruction cache can buy quite a bit of time by predicting ahead of time which block will be needed next (through a remote program counter). If a branch occurs too close to the end of a valid data fetch, the cache can be made to restart its access. Determining the exact proportion and distribution of branch instructions among miss instructions, however, is a difficult proposition, requiring special trace-analysis software.

  1. Yield and Redundancy Analysis

In order to achieve higher chip yield despite the high likelihood of critical faults in yield-limited technologies, it is common to utilize redundancy and fault-tolerance techniques. RAM chips are particularly well suited to these techniques as it is usually possible simply to include redundant memory cells which can be swapped for failed cells in the event of failure.

CML circuitry, unlike CMOS, does not benefit from low static power consumption; CML circuits draw power whether they are switching or not. As a result the inclusion of extra circuitry for the purposes of redundancy must be carefully considered. Additionally, the inclusion of redundant circuitry will usually increase die area, and, therefore, possibly increase signal transmission delays and slow down the circuit. A further speed penalty must be paid due to the need to introduce multiplexing logic to select which redundant structure to actually use, although, since the multiplexors are statically set at start-up, the delay is due to propagation only, and not switching.

In order to minimize the amount of multiplexing logic needed, "block replacement," "column replacement," and "nibble replacement" are the most obvious redundancy schemes. To minimize the area requirements of a replacement scheme, adding extra columns to each block is a further possibility.


  1. Block Replacement

In block replacement an additional cache RAM block is placed on the RAM chip and swapped at system startup for a block containing a non-functional memory cell. The replacement block must be equal in dimension to the other blocks.

Figure 2.22 is a floorplan of the F-RISC / G cache RAM block. The block consists of sixteen columns of thirty-two rows. In order for a particular column to work, each memory cell in the column, the read / write logic, the threshold voltage circuit, all of the line drivers, the address decoder, and the sense amp must fully function.


Figure 2.23 shows how block replacement might be implemented given the F-RISC / G RAM chip microarchitecture. The cache RAM normally consists of four 16-bit wide cache RAM blocks. When communicating with the secondary cache, all 64 bits of the cache RAM blocks are used. When communicating with the CPU, however, one nibble is selected from among the four blocks.

The top multiplexor (A), which is used to select a nibble from among the cache RAM blocks, already exists in the cache RAM, but would have to be expanded to include inputs from the redundant block. Replacing the four 4-input multiplexors with four 5-input muxes would require at least 24 additional transistors (since F-RISC / G uses 3-level CML gates, adding an additional input to a four-input multiplexor requires an entire new gate, which imposes an additional hardware penalty of one current switch).

The next layer of multiplexors (B), which is used to select one of four nibbles from among each of the cache RAM blocks to send to the top multiplexor, also exists, and is actually part of each cache RAM block.

The next lower level of multiplexors (C), would have to be added to select from either the redundant block or the main blocks when reading out to the secondary cache. This would entail adding 64 2-input multiplexors which would require 384 transistors.

Each cache block has a set of multiplexors on its input which is used to select from either the wide data path or the narrow data path. A multiplexor would have to be added (D) to the input of the redundant block to enable it to receive data that was actually intended for one of the other, malfunctioning blocks. This multiplexor would actually have to multiplex four 16-bit buses, and would likely be implemented as sixteen 4-input multiplexors (requiring 222 transistors).

There is little time penalty to be paid on a read through the narrow data path (only multiplexor A becomes more complicated). The L2 path has an additional multiplexor delay on reads. On writes from the narrow path, there is no additional time penalty for implementing this scheme. Writes from the L2 path incur an additional multiplexor delay due to D. Of course, this discussion includes only logic delays; by making the chip larger and requiring more routing area and loading on data and address lines, circuit speed will be adversely affected.

Implementing block replacement would require 630 additional devices for multiplexing, in addition to the hardware costs of the additional block and control circuitry.

  1. Nibble Replacement

In the F-RISC / G RAM chips, each of the four cache RAM blocks is divided into four nibbles, one of which is supplied to a multiplexor which further selects between the nibbles provided by each block. The nibble output of this multiplexor is sent to the CPU.

Since the cache RAMs are partitioned along nibble boundaries, it might seem reasonable to implement a nibble replacement policy in order to add fault tolerance.

One nibble replacement scheme would add an additional cache RAM block providing the capability to replace any four malfunctioning nibbles. While the block replacement scheme allows an infinite number of faults in exactly one cache RAM block, the nibble replacement scheme allows an infinite number of faults scattered among exactly four nibbles, which may be physically located in any of the cache RAM blocks.

Figure 2.24 illustrates one possible implementation of nibble replacement for the F-RISC / G RAM chip.


This scheme is more complicated to implement than block replacement. The previously existing nibble-selecting multiplexors for the four-bit data path (A and B) can be left as they are, although the B multiplexors may have to be removed from the cache RAM blocks. In order to allow any of the four redundant nibbles to be swapped for any other malfunctioning nibble (for both the wide and narrow data paths), the C multiplexors are added. These multiplexors must select from between five four-bit buses. Since sixteen such multiplexors are required, this scheme has a fairly high hardware cost (1280 transistors). In addition, this layer of multiplexors imposes two additional gate logic delays on reads on both the wide and narrow data paths.

In order to accommodate writes, the D and E multiplexors must be capable of selecting any four out of sixteen incoming data nibbles. This is significantly more complicated and time-consuming than what is necessary for the block replacement scheme, requiring 1120 additional transistors.

Overall, the nibble replacement scheme requires 2400 additional transistors for multiplexing, in addition to the hardware cost of the additional block and control circuitry.

  1. Column Replacement

In a column replacement scheme, an extra block is added to provide replacement memory columns for any malfunctioning columns. In the F-RISC / G RAM chip, an additional cache RAM block could enable any 16 columns to be replaced, thus an infinite number of faults could be scattered across any 16 columns and the RAM chip would still function.

The block diagram for column replacement for the F-RISC / G RAM chip is similar to that for nibble replacement. Instead of selecting from five 4-bit buses, the C multiplexors must select four out of 20 bits (at a cost of 360 transistors). This would require three levels of multiplexors, and thus three gate delays.

The B and A multiplexors would function as before. The D and E multiplexors would have to be capable of selecting any four of the 64-bit input lines (1176 transistors). The input multiplexors on the block would each have to be able to select any one of the 4-bit input lines. This would require still more hardware (56 transistors) than the nibble replacement case, as well as adding more delay.

The multiplexors in the column replacement scheme would require 1592 transistors. The additional block and control circuitry represent other hardware costs.

  1. Additional Columns

Adding an additional block to the cache RAM would require a great deal of space. The only way to do this and keep the die size under 1 cm2 would be to hand craft all of the additional circuitry to minimize the space it requires. An alternative possibility would be to add additional columns to each block, solely for the purpose of replacing malfunctioning columns within that block.

Figure 2.25 is a block diagram for a possible implementation of additional column replacement in F-RISC / G. Each block contains one or more extra columns which can be swapped for malfunctioning columns using multiplexing circuitry. The C multiplexors are used to allow the output of any of the primary columns to be overridden by the contents of any of the additional columns. If there is only one additional column, then each of these multiplexors must accept two inputs, requiring 384 devices in all.


If there are three additional columns per block, then the C multiplexors would require 896 transistors. Implementing more than three additional columns per block would require an additional layer of multiplexors, eliminating the possibility of achieving the required RAM access time.

Additional multiplexing circuitry is also required on the inputs of the additional columns to select between any of sixteen wide inputs or four narrow inputs. This requires three levels of multiplexors, which will significantly degrade the write access time of the RAM. 360 devices would be required for one additional column, while 1080 would be required for three.

  1. Analysis

In determining whether redundant circuitry should be included in the RAM chip to aid fault tolerance, the fault process must be modeled. Once a model for fault distribution is established, the circuitry of the chip must be analyzed to determine which circuits can be replicated efficiently, and which circuits, and how many of them, must work in order to have a fully functioning chip.

Once circuitry is replicated in order to add redundancy (and thus fault-tolerance), certain parts of the chip can experience faults while still allowing the full chip to function, while other sections of the chip can not tolerate any faults if a functional chip is to be obtained. For example, merely adding an additional cache RAM block does no good if a critical fault occurs on the control circuitry which determines which blocks to use, or on the I / O circuitry which allows one to bring the required data out of the chip.

The simplest assumption has the chip consisting entirely of circuitry which is duplicated to provide fault tolerance. For example, if block replacement is used, one might model the chip as consisting of five cache RAM blocks, any four of which must work completely in order to have a fully functioning chip. If fewer than four cache RAM blocks work, then the chip is no good.

Assume the chance of a single block containing no critical defects (no defects which prevent that block from fully functioning) is equal for each of the blocks on the chip and is represented as YBLOCK. Further suppose that block failures occur independently of each other. Then the number of working blocks which can be expected is characterized by the binomial distribution:

P(X = x) = [N! / (x! (N - x)!)] YBLOCK^x (1 - YBLOCK)^(N - x)

where N is the total number of blocks, x is the number of functioning blocks, and P(X = x) is the probability of exactly x functioning blocks given N total blocks.

Similarly, YCOLUMN, the probability of a fault free column, and YNIBBLE , the probability of a fault free nibble, can be substituted into the equation to find the probability of working chips given column or nibble replacement.
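The binomial expression above is straightforward to evaluate numerically. A sketch in Python using only the standard library (the function names are illustrative):

```python
from math import comb

def p_exactly(N, x, y):
    """Probability of exactly x fault-free units out of N, where each unit
    is fault-free independently with probability y (the binomial P(X=x))."""
    return comb(N, x) * y**x * (1.0 - y)**(N - x)

def p_at_least(N, k, y):
    """Probability that at least k of N units are fault-free."""
    return sum(p_exactly(N, x, y) for x in range(k, N + 1))

# Block replacement: 5 blocks on the chip, at least 4 must work.
# With YBLOCK = 0.9 this gives about 0.919.
print(p_at_least(5, 4, 0.9))
```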

Given that four working blocks (block replacement), sixteen working nibbles (nibble replacement), or sixty-four working columns (column replacement) are necessary for each RAM chip to work, Table 2.12 gives the probability of a working chip, still ignoring support circuitry, for each replacement scheme:
Block Replacement:               5 YBLOCK^4 (1 - YBLOCK) + YBLOCK^5
Nibble Replacement:              SUM(x=16..20) [20! / (x! (20-x)!)] YNIBBLE^x (1 - YNIBBLE)^(20-x)
Column Replacement:              SUM(x=64..80) [80! / (x! (80-x)!)] YCOLUMN^x (1 - YCOLUMN)^(80-x)
Single Extra Column Replacement: YCOLUMN^16 + 16 YCOLUMN^16 (1 - YCOLUMN)
Three Extra Columns Replacement: SUM(x=13..16) [16! / (x! (16-x)!)] YCOLUMN^16 (1 - YCOLUMN)^(16-x)
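The expressions in Table 2.12 can be evaluated directly to compare the schemes. A sketch in Python (illustrative function names; each function implements the corresponding table row, still ignoring support circuitry):

```python
from math import comb

def y_block(y):
    # 5 blocks, at least 4 fault-free (y = YBLOCK)
    return 5 * y**4 * (1 - y) + y**5

def y_nibble(y):
    # 20 nibbles, at least 16 fault-free (y = YNIBBLE)
    return sum(comb(20, x) * y**x * (1 - y)**(20 - x) for x in range(16, 21))

def y_column(y):
    # 80 columns, at least 64 fault-free (y = YCOLUMN)
    return sum(comb(80, x) * y**x * (1 - y)**(80 - x) for x in range(64, 81))

def y_one_extra_column(y):
    # all 16 primary columns work, or one fails and its working spare covers it
    return y**16 + 16 * y**16 * (1 - y)

def y_three_extra_columns(y):
    # x primary columns work; the y**16 factor also counts the 16 - x
    # working spares needed to cover the failed primary columns
    return sum(comb(16, x) * y**16 * (1 - y)**(16 - x) for x in range(13, 17))
```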


In order to introduce faults in the support circuitry to the model, the simplest assumption to make is that all of the circuitry on the chip is either replicated for redundancy, or support circuitry which, if a fault occurs, results in a non-functioning chip. If the probability of a chip having enough functioning replicated circuitry to function is Yfault-tolerant and the probability of the support circuitry containing no faults is Yfault-intolerant, then the probability of a fully functioning chip is Yfault-tolerant × Yfault-intolerant.

In order to calculate the relative desirability of each replacement scheme, it is necessary to relate YBLOCK, YCOLUMN, and YNIBBLE, and determine the Yfault-intolerant for each scheme.

The first step is to generate a model for critical defects. If it is assumed that critical defects occur independently of each other, then the binomial distribution is appropriate. If it is further assumed that failures occur as point defects scattered across the wafer's surface, then the probability of a defect being located at any specific point is extremely small, and the number of possible points where a defect can occur is extremely large. The Poisson distribution is a limiting case of the binomial distribution with small probability of "success" and large number of "trials." For this reason, it is often used as a first-pass model for defect distributions in many types of systems [Devo91]. The probability of k defects on a wafer, assuming a Poisson distribution of defects, would be e^(-lambda) lambda^k / k!, where lambda is the population mean rate of defects per wafer. Since the probability of a working wafer is the probability of zero defects occurring on the wafer, k would be 0, so the equation reduces to Ywafer = e^(-lambda).

An alternative analysis would assume "clustering" of faults across the wafer or reticle. Such an assumption would tend to disfavor the inclusion of redundant circuitry, however, since faults will tend to cluster in such a way that they wipe out large portions of a few die.

The number of defects per wafer is, by definition, the number of defects per unit area (p) multiplied by the area (a) of the wafer (lambda = pa). As an approximation, the area of a column is 1/16 that of a block, and the area of a nibble is 1/4 that of a block. In general, the yield for a circuit of area a would be Y = e^(-pa). Once p is known for the process, the yield for a circuit of size a can be found.

Given the area relationships between a block, column, and nibble, YBLOCK, YCOLUMN, and YNIBBLE can be solved for in terms of only the column area (c) and p:

YBLOCK = e^(-16pc)

YNIBBLE = e^(-4pc)

YCOLUMN = e^(-pc)

The pc term is the average number of critical defects expected per column.
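These relationships are simple to tabulate. A sketch in Python (illustrative names; pc is the expected number of critical defects per column, as above):

```python
from math import exp

def unit_yields(pc):
    """Fault-free probabilities for a column, nibble, and block, given the
    expected number of critical defects per column (pc), using Y = e^(-pa)
    with a nibble ~4 column areas and a block ~16 column areas."""
    return {
        "column": exp(-pc),
        "nibble": exp(-4 * pc),
        "block": exp(-16 * pc),
    }

print(unit_yields(0.01))
```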

The approximate device penalty for implementing each type of replacement scheme can be found in Table 2.13.
Block replacement:    630
Nibble replacement:   2400
Column replacement:   1592
1 extra column:       744 (384 + 360)
3 extra columns:      1976 (896 + 1080)


A rough estimate of the minimum support circuitry required without any replacement scheme is 2500 devices. A column contains approximately 140 devices (each bit uses only two devices, but these special devices are approximately twice as large as the devices used for other circuits). Making the rather poor assumption that the area required by a circuit is proportional to the number of devices in the circuit (the cache RAM blocks are much more dense than most other circuitry), the yield of the support circuitry for each replacement scheme can be written in terms of only p and c:

Yfault-intolerant,BLOCK = e^(-22pc)

Yfault-intolerant,NIBBLE = e^(-35pc)

Yfault-intolerant,COLUMN = e^(-30pc)

Yfault-intolerant,1-EXTRA-COLUMN = e^(-23pc)

Yfault-intolerant,3-EXTRA-COLUMNS = e^(-32pc)


Since the circuitry in the cache RAM blocks is much more dense than the rest of the circuitry in the cache RAM, a defect occurring in the area of the cache block may be more likely to cause one or more faults. The calculations would more properly be done by determining the "critical area" for each type of circuit. The critical area is the area of a circuit which is sensitive to a defect of some given size, and is difficult to calculate for complicated circuits without specialized software tools. Since the non-cache RAM block areas, while having lower device density, tend to have more metallization, the critical area of a non-cache RAM block circuit with a given number of transistors may be on par with the critical area of a RAM block circuit with comparable device count, even while occupying more physical space.

Figure 2.26 shows the chip yield for each replacement scheme as a function of cp. At low fault probabilities the chip yield (the probability that a chip will work) approaches 1 for each of the five schemes. The differences between the schemes are minute, with single extra column replacement performing the best. Although adding three columns per block would provide more flexibility and allow multiple bad columns per block, the additional support circuitry required results in reduced yield.
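Combining a scheme's fault-tolerant yield with its fault-intolerant support yield reproduces the kind of comparison shown in Figure 2.26. A sketch for two of the schemes in Python (illustrative names; the single-extra-column version assumes the per-block expression applies independently to each of the four blocks, which is an assumption, not something stated in the text):

```python
from math import exp

def chip_yield_block(pc):
    """Block replacement: at least 4 of 5 blocks working, times the
    support-circuitry yield e^(-22pc)."""
    y = exp(-16 * pc)                    # fault-free block probability
    tolerant = 5 * y**4 * (1 - y) + y**5
    return tolerant * exp(-22 * pc)

def chip_yield_one_extra_column(pc):
    """One spare column per block, four blocks, support yield e^(-23pc)."""
    y = exp(-pc)                         # fault-free column probability
    per_block = y**16 + 16 * y**16 * (1 - y)
    return per_block**4 * exp(-23 * pc)

# At pc = 0.5 (one critical defect per two columns) both yields collapse
# toward zero, matching the discussion of Figure 2.26.
print(chip_yield_block(0.5), chip_yield_one_extra_column(0.5))
```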


When yield is particularly bad, none of the schemes performs very well. When one in ten columns contains a fault, the chance of a working chip using any of these schemes is essentially zero.

For several reasons it was decided that implementing fault tolerance in F-RISC / G was impractical. Any of these schemes will negatively impact the cache RAM access time. Aside from adding a minimum of one gate delay (block replacement or additional-column replacement) to the read critical path, the RAM chip will be significantly larger. This increase in size causes several difficulties. First, the chip size must be less than 1 cm2 in order for the die to fit on the reticle. Second, increased die size will result in additional loading delays as interconnect becomes longer. In most cases the logic delays are already overshadowed by loading delays on the F-RISC / G critical paths. Even if a smaller cache RAM block were used (there is an 8-bit wide design with faster access time which is used in the datapath chips as a register file), thus requiring approximately half the area, the additional multiplexors and support circuitry needed would require a great deal of space.

The critical path in the cache is too tight to survive these sorts of delays. In addition, analysis of the results of testing the RPI Testchip indicated that yield, at the time, was so bad that little would be gained from adding fault-tolerance.

The RPI Testchip contained several testable circuits of comparable complexity to a cache RAM block column (the chip actually contained a version of the cache RAM block, but, unfortunately, too few of the chips had enough working circuitry to allow any meaningful results to be obtained.)

Table 2.14 [Phil93] shows the test results for some circuits of comparable critical area to a cache RAM block column. From these results cp is found to range from 0.5 to 0.73, which, as Figure 2.26 illustrates, is the region where these fault-tolerance strategies have little beneficial effect.

Table 2.14 lists the device count and test results for the write oscillator, carry chain, and address LFSR circuits.


The conclusion to be drawn is that implementing fault-tolerant strategies in extremely low-yield technologies may not be worthwhile. This is particularly the case when the low-yield technology is being used because of its high performance. In such cases, the performance penalty induced by the implementation of fault tolerance may negate the performance benefit of the low-yield technology. Also, even if fault tolerance is implemented, simple, inflexible schemes may provide superior chip yield to more flexible, and therefore more complicated, solutions.