In designing a cache for a given processor, the organization
and parametric issues which don't involve the actual circuit-level
design need to be streamlined to provide maximum performance while
requiring a minimum quantity of devices. These issues are labeled
"architecture" although, unlike the traditional computer
engineering definition of the word, the cache architecture is
implementation specific, and should be largely transparent to
the programmer Given the technology to be used in implementing
a processor, it is possible to evaluate the design space to determine
the cache parameters and architecture which result in the best
overall system performance. Architectural parameters include cache
dimensions and organization, communications bus widths, and write
behavior.
"Architecture," in computer engineering terms, is usually defined as the part of the computer design which is visible to the programmer [Henn96], either in terms of functionality or relative timings (the difference in timings of various operations on a given processor due to available functional units, as opposed to the overall speed differences between one processor and another.) In the general sense this would include issues such as the instruction set, addressing modes, quantity of registers, virtual address space, trap behavior, and number of functional units.
Design issues which apply only to a given implementation and are not visible to the programmer are termed "implementation" or "design" issues to differentiate them from architectural issues.
"Microarchitecture" usually refers to high-level issues such as dataflow organization which are not visible to the programmer.
Since any cache design should be essentially transparent to the
programmer (except insofar as varying the cache design may effect
the relative timings of memory-accessing instructions), almost
all of the actual cache design would be implementation specific
given the traditional definition. For this reason the line between
cache architecture and cache microarchitecture is less important.
In evaluating cache designs it is helpful to consider metrics which take into account the performance of the system as a whole, given a particular cache configuration. One useful metric is Cycles per Instruction (CPI), which is defined as the average number of processor cycles required to execute a single instruction.
Of course, determining how long a given instruction will require to execute entails knowing whether the instruction accesses the cache, whether the cache contains the required data or needs to perform a copyback, and many other state-dependent details.
One technique frequently used to determine the performance of cache designs is trace-driven simulation. The idea behind this technique is to analyze the behavior of workloads (sets of processes) actually intended to operate on a target design. A "cache trace" is a time-ordered list of memory references made by this target workload. Typically the trace is obtained by introducing hardware between a CPU and memory which captures which addresses are referenced when, and in what manner. An alternative method would be to simulate the target code on a software simulator capable of storing memory accesses. In order to evaluate the performance of different cache designs in the F-RISC/G processor, the DineroIII trace-driven simulator [Hill84] was used, with three traces based on actual running code.
In order to use cache traces as a method for determining the
overall performance of a cache architecture over a wider variety
of uses, three problems must be overcome [Ston90].
General purpose machines such as F-RISC are particularly difficult to represent with a single trace, and thus multiple traces are often used. Typically, suites of traces such as those encompassed by the SPEC92 benchmarks are used. Alternatively to using traces captured from actual running code, one may use traces generated by statistical models of memory reference patterns (synthetic traces).
The length of the cache trace is particularly important because the trace length required is estimated to grow as the cache size raised to the 1.5 power [Ston90]. As cache size increases the number of misses which occur in a trace of given length decreases, thus making them statistically less significant.
In order to determine how long a trace must be, a statistical model can be introduced. It may be assumed that cache misses are a Bernoulli process will probability m of "success" and h=1-m of "failure." This is not a particularly good approximation because in a Bernoulli process cache misses would occur independently of each other, which is clearly not the case in the cache, but it should serve as a useful lower bound. If the miss-generating process is Bernoulli in nature, then the distribution of misses is binomial [Deve91].
If the trace length is T the expected number of misses is M=mT. The variance from M is mTh=Mh. The expected miss ratio would then be M / T and the variance of the expected miss ratio would be M h / T2 = m h / T. The standard deviation, which is frequently used as a measure of how close to the expected value the true mean of the data is likely to be, is the square root of the variance, or SQRT(m h / T). Most of the variation will be found within 2 standard deviations from the mean.
The true mean miss-ratio can therefore be estimated by m +/- SQRT(m h / T). So, if we want the trace-based cache evaluation procedure to produce an accurate result we must minimize the deviation term.
The traces used in evaluating the F-RISC cache were on the order of 1,000,000 references in length. Since the F-RISC cache is small, a comparatively poor hit rate of .85 will be assumed (solely for the purposes of determining whether the cache traces are of sufficient length. The actually miss rate can be determined from simulations.) In this case, the true mean miss ratio would be estimated to be 0.15+/- 0.000714 , meaning that the variance is sufficiently small to suppose that the cache lengths are long enough to properly represent the workload they were derived from.
The three traces used in the DineroIII simulations are tex, spice, and gcc which are representative of three common code bases frequently executed on workstation class machines. Table 2.1 shows some of the important characteristics of these traces.
| tex | |||
| spice | |||
| gcc |
The value of CPI for a given trace is calculated by divided the number of cycles required to execute the trace by the number of instructions in the trace. The CPI for a processor is a compilation of CPI figures from several sources. For example, the processor may be designed so that every ALU operation requires one cycle to execute, in which case the CPI due to ALU operations would be 1. If the processor were always able to execute two instructions in parallel, then the CPI due to ALU operations would be 0.5.
In the case of the F-RISC / G processor, the pipeline structure is such that the CPU component of CPI is expected to be approximately 1.45 [Phil93, Greu90]. This CPI component is sometimes called the latency component of CPI. Although the CPU is designed to execute with a peak CPI of 1.0, pipeline latencies (the inability to keep parts of the processor busy due to code dependencies) result in reduced performance.
In addition to CPI penalties caused by pipeline inefficiencies, the cache contributes to an increased CPI. This component of CPI is known as the stall component of CPI. Some percentage of the instructions executed by any given code will result in cache accesses (in fact, all instruction will result in instruction cache accesses), and the overall CPI of the processor depends on the cache's ability to perform the requested transaction in the time allotted. Whenever the primary cache must resort to communicating with the secondary cache, the effect is to stall the processor and prevent it from accomplishing any useful work.
Table 2.2 shows the results of the cache simulation using the SPICE trace, with associativity set to 1, a Harvard architecture, and 2048 byte caches. Copyback and write allocate are assumed. A miss penalty of 5 cycles is used in producing all of the results given in this chapter. The numbers shown are CPI, ignoring losses due to inefficiencies in pipeline management. The following equation is used to calculate stall CPI from the data provided by DineroIII for the Harvard architecture case:
| Bus Width (Bytes) | |||||||||||||||||
| 1 | 2.350 | ||||||||||||||||
| 2 | 3.666 | 2.3330 | |||||||||||||||
| 4 | 6.300 | 3.6502 | 2.3251 | ||||||||||||||
| 8 | 7.550 | 4.2750 | 2.6375 | 1.8187 | |||||||||||||
| 16 | 10.034 | 5.5168 | 3.2584 | 2.1292 | 1.5646 | ||||||||||||
| 32 | 13.606 | 7.3031 | 4.1515 | 2.5757 | 1.7878 | 1.3939 | |||||||||||
| 64 | 23.438 | 12.2189 | 6.6094 | 3.8047 | 2.4023 | 1.7011 | 1.3505 | ||||||||||
| 128 | 55.223 | 28.1117 | 14.5558 | 7.7779 | 4.3889 | 2.6944 | 1.8472 | 1.424 | |||||||||
| 256 | 125.32 | 63.1575 | 32.0787 | 16.539 | 8.7696 | 4.8848 | 2.9424 | 1.971 | 1.4856 | ||||||||
| 512 | 356.89 | 178.940 | 89.9704 | 45.485 | 23.2426 | 12.1213 | 6.5606 | 3.780 | 2.3902 | 1.695 | |||||||
| 1024 | 871.15 | 436.076 | 218.538 | 109.76 | 55.3845 | 28.1922 | 14.5961 | 7.798 | 4.3990 | 2.70 | 1.850 | ||||||
| 2048 | 2242.7 | 1121.8 | 561.413 | 281.20 | 141.103 | 71.0516 | 36.0258 | 18.513 | 9.7565 | 5.378 | 3.189 | 2.094 | |||||
Bus traffic and bus width must share the same units, typically words. Instructions are used rather than accesses in the numerator since in a Harvard architecture the cache can respond to both an instruction and a data access simultaneously, so the minimum number of accesses that will be made if every access is a hit is equal to the number of instructions fetched.
It is clear from these results that for this trace small block
sizes perform best. This is due to the fact that since the cache
memory is so small, a large block size imposes great penalty in
terms of percentage of available space taken up by a single block.
When a new block address needs to be stored there are few places
to store it when block sizes are large.
FIGURE 2.2
: NAIVE MEMORY REFERENCE PATTERN
Memory accesses tend to be clustered around several physically
separated addresses (Figure 2.1) rather than a single centralized
address (Figure 2.2) [Ston90]. One cluster may represent the instruction
pointer and nearby addresses, while others may represent the various
data structures in use by various processes. Each of these clusters
of likely references would tend to be stored in a separate block
frame or group of frames in the cache, so long as enough block
frames exist. If there are very few block frames, then a large
percentage of the time that an access to a cluster that is not
currently in the cache occurs, another cluster of frames likely
to be referenced soon must be removed from the cache. In this
case more block frames are preferable to more space per frame.
FIGURE 2.3: SIMULATIONS: 2KB
HARVARD CACHES, DIRECT-MAPPED, SPICE TRACE
| Bus Width (Bytes) | |||||||||||||||||
| 1 | 1.756 | ||||||||||||||||
| 2 | 1.772 | 1.3858 | |||||||||||||||
| 4 | 1.798 | 1.3991 | 1.1996 | ||||||||||||||
| 8 | 1.863 | 1.4314 | 1.2157 | 1.1079 | |||||||||||||
| 16 | 2.132 | 1.5662 | 1.2831 | 1.1416 | 1.0708 | ||||||||||||
| 32 | 3.05 | 2.0251 | 1.5126 | 1.2563 | 1.1281 | 1.0641 | |||||||||||
| 64 | 6.018 | 3.509 | 2.2546 | 1.6273 | 1.3136 | 1.157 | 1.0784 | ||||||||||
| 128 | 18.673 | 9.836 | 5.4182 | 3.2091 | 2.1046 | 1.552 | 1.2761 | 1.1381 | |||||||||
| 256 | 112.58 | 56.79 | 28.895 | 14.947 | 7.9737 | 4.487 | 2.7434 | 1.8717 | 1.4359 | ||||||||
| 512 | 370.35 | 185.68 | 93.338 | 47.169 | 24.085 | 12.542 | 6.7711 | 3.8856 | 2.4428 | 1.7214 | |||||||
| 1024 | 1153.2 | 577.11 | 289.05 | 145.03 | 73.013 | 37.007 | 19.003 | 10.001 | 5.5008 | 3.2504 | 2.125 | ||||||
| 2048 | 3072.4 | 1536.7 | 768.84 | 384.92 | 192.96 | 96.980 | 48.990 | 24.995 | 12.998 | 6.9988 | 4.0 | 2.5 | |||||
Table 2.3 shows the simulation results for the Tex trace. The magnitude of the CPI figures is less than that obtained with the Spice trace, however the overall behavior is similar. The best results occur with equal block sizes and bus widths, and the best overall result (1.064) occurs with a block size and bus width of 32 bytes. The Spice trace exhibited the best CPI (1.351) with a bus size and block width of 64 bytes.
Figure 2.4, plotted on the same scale as Figure 2.3, shows that
the magnitude of the CPI results obtained using the Tex trace
is lower overall all than the results obtained with the Spice
trace. Once again, at a given bus width, smaller block sizes seem
to yield superior performance.
FIGURE 2.4: SIMULATIONS: 2KB
HARVARD CACHES, DIRECT-MAPPED, TEX TRACE
Table 2.4 and Figure 2.5 show the results of running the simulation
with gcc trace. As with the other traces, the best results for
a given bus width occur at smaller block sizes. The optimal point
(1.56) is a block size and bus width of 256 bytes.
| Bus Width (Bytes) | ||||||||||||||||||||||
| 1 | 3.651 | |||||||||||||||||||||
| 2 | 6.278 | 3.6390 | ||||||||||||||||||||
| 4 | 11.53 | 6.2685 | 3.6342 | |||||||||||||||||||
| 8 | 13.68 | 7.3409 | 4.1704 | 2.5852 | ||||||||||||||||||
| 16 | 17.76 | 9.3805 | 5.1902 | 3.0951 | 2.047 | |||||||||||||||||
| 32 | 25.921 | 13.460 | 7.2304 | 4.1152 | 2.557 | 1.7788 | ||||||||||||||||
| 64 | 43.190 | 22.095 | 11.547 | 6.2738 | 3.636 | 2.3184 | 1.6592 | |||||||||||||||
| 128 | 82.083 | 41.542 | 21.271 | 11.135 | 6.067 | 3.5338 | 2.2669 | 1.6334 | ||||||||||||||
| 256 | 143.3 | 72.17 | 36.585 | 18.792 | 9.8963 | 5.4481 | 3.2241 | 2.1120 | 1.556 | |||||||||||||
| 512 | 322.41 | 161.70 | 81.352 | 41.176 | 21.088 | 11.044 | 6.0221 | 3.5110 | 2.256 | 1.6278 | ||||||||||||
| 1024 | 1013.0 | 507.00 | 254.00 | 127.50 | 64.250 | 32.625 | 16.813 | 8.9063 | 4.953 | 2.9766 | 1.9883 | |||||||||||
| 2048 | 1990.8 | 995.9 | 498.46 | 249.73 | 125.36 | 63.183 | 32.09 | 16.546 | 8.773 | 4.8865 | 2.9432 | 1.9716 | ||||||||||
FIGURE 2.5: SIMULATIONS: 2KB HARVARD
CACHES, DIRECT MAPPED, GCC TRACE
| Bus Width (Bytes) | |||||||||||||||||||||||
| 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | ||||||||||||
| 1 | 2.645 | ||||||||||||||||||||||
| 2 | 4.062 | 2.531 | |||||||||||||||||||||
| 4 | 6.898 | 3.949 | 2.474 | ||||||||||||||||||||
| 8 | 8.133 | 4.567 | 2.783 | 1.892 | |||||||||||||||||||
| 16 | 10.564 | 5.782 | 3.391 | 2.195 | 1.598 | ||||||||||||||||||
| 32 | 15.020 | 8.010 | 4.505 | 2.753 | 1.876 | 1.438 | |||||||||||||||||
| 64 | 26.934 | 13.967 | 7.483 | 4.242 | 2.621 | 1.810 | 1.405 | ||||||||||||||||
| 128 | 54.527 | 27.763 | 14.382 | 7.691 | 4.345 | 2.673 | 1.836 | 1.418 | |||||||||||||||
| 256 | 128.143 | 64.572 | 32.786 | 16.893 | 8.946 | 4.973 | 2.987 | 1.993 | 1.497 | ||||||||||||||
| 512 | 348.433 | 174.716 | 87.858 | 44.429 | 22.715 | 11.857 | 6.429 | 3.714 | 2.357 | 1.679 | |||||||||||||
| 1024 | 1000.236 | 500.618 | 250.809 | 125.905 | 63.452 | 32.226 | 16.613 | 8.807 | 4.903 | 2.952 | 1.976 | ||||||||||||
| 2048 | 2385.306 | 1193.153 | 597.077 | 299.038 | 150.019 | 75.510 | 38.255 | 19.627 | 10.314 | 5.657 | 3.328 | 2.164 | |||||||||||
FIGURE 2.6: SIMULATION RESULTS FOR
BENCHMARK SUITE
Table 2.5 shows the weighted (by length) average of the CPI results for each of the three traces. This table summarizes the behavior of the cache given a Harvard architecture with direct-mapped, 2 kB caches.
The optimum point (1.41) occurs with a block size and bus width of 64 bytes. Given the estimated latency CPI component of 1.45, the total CPI taking into account both latency and stall CPI components would be 1.86. This is due to the fact that pipeline latencies would allow instructions to access the cache once every 1.45 cycles on average at most. The cache simulator assumed that instructions could access the cache every cycle as an upper bound. Adding the 0.45 to the stall CPI therefore serves as a crude estimate of the overall CPI.
Figure 2.6 shows a plot of the weighted mean stall CPI for all three cache traces as a function of block size and bus width.
Having determined that the optimum configuration occurs when block size is equal to bus width, it is possible to plot the stall component CPI as a function of equal block sizes and bus widths as shown in Figure 2.7, where the results of each trace are plotted along with the weighted mean. From this plot it is clear that the minimum CPI occurs at a bus width and block size of 64 bytes (512 bits). It is also interesting to note that at half that size (32 bytes) the CPI is approximately 1.44, which is only 0.03 cycles per instruction worse than the optimal point.
Although there is little apparent CPI penalty to be paid for
utilizing a 256 bit block size rather than a 512 bit block size,
technical issues complicate the matter for F-RISC / G. The fundamental
RAM block used both for cache tags and cache RAM data in F-RISC
/ G is 16 bits wide by 32 rows deep. An 8 bit wide by 32 row higher
speed RAM block is also available. (The larger RAM block offers
an advantage in that using it rather than two smaller blocks incurs
a smaller hardware penalty since address decoding hardware can
be shared.)
FIGURE 2.7: 2 KB HARVARD CACHES, DIRECT-MAPPED, BLOCK SIZE EQUALS BUS WIDTH
In the case of a 512 bit block size, it is possible to place four of the larger block or eight of the smaller block on each die, and simply organize each block in parallel within each die to give 512 input and output lines (64 per chip). Figure 2.8 illustrates the partitioning if the larger RAM blocks are used.
Given the low yield of the process used in fabricating the F-RISC
/ G chips, the maximum number of the larger blocks which can be
placed on a die is 4, and the maximum number of the smaller blocks
which can be placed on a die is 8. Implementing a 2048 byte cache
requires 8 of these die.
FIGURE 2.8: 512 BIT BUS RAM
PARTITIONING
If the bus is 256 bits wide one would ideally like to use half
the number of RAM chips, placing 8 large blocks per chip or 16
small blocks per chip, where three quarters of the blocks on each
chip are used to increase the depth of the cache. In the case
of F-RISC / G, yield is insufficient to allow such an organization.
Instead it would be necessary to use 16 chips and use half of
the blocks on each chip to increase cache depth. Multiplexors
and control logic would need to be employed on I/O paths to select
which block to read from and which block to write into.
FIGURE 2.9: 256 BIT BUS RAM
PARTITIONING
FIGURE 2.10: CPI AS A FUNCTION OF
SET QUANTITY, BLOCK SIZE, ASSUMING LRU
Figure 2.9 illustrates how this partitioning would be accomplished. Note that the circuitry required to select the proper block for reading and writing represents a large hardware penalty in addition to imposing additional delays on the cache critical paths.
Figure 2.10 is a plot of stall CPI as a function of set size and equal block and bus sizes for a Harvard architecture with dual 2 kB caches and copyback. The Least Recently Used replacement algorithm, which is is fairly simple to implement for very small numbers of sets, is assumed. From this plot it can be seen that larger numbers of sets tend to produce better CPI's, although the bus width and block size seem to have a larger effect on the overall CPI.
Figure 2.11 shows the effect of varying set size for the three block sizes and bus widths which provide the best results for 2 kB Harvard caches employing copyback and the timing constraints mentioned earlier. As can be seen, the CPI does not improve markedly as set size is increased beyond 4. The effect of moving from a direct-mapped (set size=1) cache to a 4-way set associative cache, however, is fairly significant.
Increasing set size is not without its costs, however. First
of all, as the set size increases, it is necessary to employ multiple
comparators in order to check each line of the target set to the
CPU tag simultaneously. In addition, the tag RAM size must increase
proportionally to the number of lines per set. If it is possible
to store n tags in a particular set, then the tag RAM must
increase in size by a factor of n. The net result is that
almost all of the cache control circuitry must be increased by
a factor equal to the number of lines per set.
FIGURE 2.11: EFFECT OF SET SIZE
In a low-yield technology such as that employed in F-RISC / G, it may not be possible to include all of this circuitry on a single die. In this case the control circuitry must be partitioned among several die. One possible partition would have each set divided into n die where n is the number of lines per set. Each die would contain a tag RAM and a comparator. Some mechanism of communication between the die would be necessary in order to determine whether any of the die contained the appropriate data. One slice would likely be configured as the "master," and each of the other slices would report the results of the tag search to it.
In using a set associative scheme one must pay a speed penalty in addition to the aforementioned hardware penalty. At minimum there will be two multiplexor delays on the critical path, as it is necessary to select between multiple locations for each RAM read or write. Deciding where to store data within each set (implementing a replacement algorithm) can also cause a delay. In a low-yield technology in which the controller circuitry must be partitioned, one must also pay driver / receiver and communications delays. In the case of F-RISC / G the cache critical path requirements could not be met with these additional delays.
The apparent advantage of a set associative cache is mitigated somewhat if one considers the complicated hardware required to implement replacement algorithms. The popular Least Recently Used algorithm, for example, requires memory hardware to keep track of which line in a set was last used, hardware to update this memory when appropriate, and comparison hardware to determine which line "wins" when the write into the data memories is to take place.
Other algorithms, such as "first-in-first-out" (FIFO),
and "random" are somewhat simpler to implement. Simulations
show, however, that these algorithms to not ususally match the
performance of LRU.
FIGURE 2.12: EFFECT OF REPLACEMENT
ALGORITHM
Figure 2.12 shows the effectiveness of different replacement
algorithms on the gcc and spice traces (assuming 2kB Harvard caches
with a 512-bit block / bus size.) As can be seen from the figure,
LRU does perform significantly better than FIFO or random replacement;
using any of these algorithms, however, would still be a CPI improvement
over a direct-mapped cache. There is little doubt, however, that
implementing a set associative cache of any size would negatively
impact cycle time.
Figure 2.13 illustrates the effect of various cache architectures on stall CPI. The graph shows CPI as a function of block and bus width for a Harvard cache with 2 kB per cache, a unified cache with 4 kB of single-ported RAM, and a unified cache with 4 kB of dual-ported RAM. A direct-mapped cache employing copyback is assumed.
In a unified cache with dual ported RAM it is possible to read both an instruction and data simultaneously, while, for a single-ported RAM scheme, it is possible to perform only one access at time.
As the graph shows, the Harvard cache tends to perform the best. For the unified cache designs, the use of dual-ported RAM provides the best results except at extreme block sizes.
A dual-ported Unified cache would be expected to perform comparably to a Harvard cache since both can satisfy two requests at once, which is the case. The Unified cache has a slight advantage at small block sizes because in this case there are many blocks and the Unified cache can allocate them between caches with a fine resolution; the Harvard cache suffers from the assumption that 50% of total cache size should go to each cache.
The equation used for calculating CPI from DineroIII output for
a unified cache with dual ported RAM is identical to that used
for the Harvard architecture since in both cases the cache is
capable of performing a data and instruction access simultaneously.
For the single ported unified cache, at most one cache access,
instruction or data, can be accomplished at any time. As a result,
the equation used to calculate CPI from the DineroIII output is
as follows:
CPI = (Total Accesses + 5 (Bus Traffic/Bus Width)) / Instructions
FIGURE 2.14: EFFECT OF HARVARD CACHE SIZE ON CPI
Figure 2.14 shows the effect of varying cache size given a Harvard direct mapped cache employing copyback and a 64 byte block and bus width. The stall component of CPI drops below 1.5 at a cache size of 2048 bytes per cache. At twice that memory size there is comparatively little improvement in CPI, and there is little doubt that it would be extremely difficult to implement that much memory given the interconnect lengths that would be required and the difficulty in removing the heat from that many bipolar RAM blocks.
| Cache Dimensions: | ||
| Wide | When a miss occurs, a large number of consecutive words can be transferred to the cache at once, and no more misses will occur until an addressed outside that block is accessed. | A large amount of hardware and signal routing is required. In addition, a large number of I/O drivers may switch simultaneously. |
| Deep | The probability of having to replace a line is decreased. | Either deeper RAM blocks (which tend to be slower due to increased bit line capacitance) or many blocks are needed. |
| Write Policies: | ||
| Write-through | Reduces control logic complexity and cache coherency problems in multi-processing systems. | Requires either a large write through buffer or will result in many stalls. |
| Copyback | Greatly reduces bus traffic to secondary cache. | Complicates control logic in cache controller. |
| Write Location: | ||
| Write allocate: | Tends to increase CPI since words that the CPU modifies will tend to be used again. | Complicates cache control logic. |
| Write around: | Simplifies cache control logic. | Far simpler to implement in write-through designs. |
| Associativity: | ||
| Direct mapped | Greatly reduced cache control complexity. | Increases miss rate. |
| Set Associative | Reduced miss rate compared to direct-mapped cache. Not as hardware / time intensive as an associative design. | Requires more hardware and delay than direct-mapped. |
| Associative | Minimizes miss rate. | Replacement algorithm is expensive to implement in hardware. |
| Architecture: | ||
| Harvard | Simpler control hardware. No need for multiple ports. | No self modifying code. Rigid memory allocation. |
| Unified | Flexible memory allocation can increase hit rate. Self-modifying code. | Multiple memory ports needed. |
Table 2.6 shows qualitatively how varying the cache architecture and dimensions can affect the cache cycle time and hit rate, as well as other penalties and benefits of each design decision.
Based on these cache simulations, the design point which was
chosen for the F-RISC / G prototype cache is as listed in Table 2.7.
| Architecture: | Harvard |
| Instruction Cache Size: | 2kB |
| Data Cache Size: | 2kB |
| Write Policy: | Copyback, Write Allocate |
| Bus Width: | 512 bits (64 bytes) |
| Block Width: | 512 bits (64 bytes) |
Assuming a miss penalty of 5 cycles, the predicted stall CPI for this design is approximately 1.41.
Table 2.8 shows the results of the cache simulations broken down
by type of event. The probability of each event occurring is also
given. Based merely on these events, the stall CPI would add to
0.73. What remains unaccounted for are 68% of the instructions
which may be either ALU operations or BRANCH's.
Each ALU or BRANCH operation can be assumed to take 1 cycle (since
BRANCH misses are already
accounted for in the "Instruction miss" category.) Therefore,
the net stall CPI would be 0.73 + 0.68 = 1.41, as reported above.
Note that the "Reads" figure presented in Table 2.8
includes the reading cycle that occurs at the beginning of each
STORE, thus the write
penalty would only be 1 additional cycle.
| Event | # Occurances | Probability | Cycles | Weight |
| Instructions: | 2137409 | |||
| Reads: | 440985 | .21 | 1 | .21 |
| Writes: | 254081 | .12 | 1 | .12 |
| ALU / Branch | 1442343 | .68 | 1 | .68 |
| Instruction misses: | 77527 | .04 | 5 | .20 |
| LOAD misses: | 50635 | .02 | 5 | .10 |
| STORE misses: | 17881 | .01 | 5 | .05 |
| Copybacks: | 27179 | .01 | 5 | .05 |
| TOTAL | 1.0 | 1.41 |
To calculate the latency CPI it is necessary to examine the way
in which instructions can be rearranged by the compiler in order
to avoid latencies. It is expected that the compiler will be able
to fill approximately one half of the latencies with useful instructions
[Greu90]. The latency for a LOAD (including the cache read portion
of a STORE) is 3, and the probability of this occurring is 0.21.
Assuming an optimizing compiler capable of filling half of the
latency, LOADs and STOREs
would contribute 0.32 to the overall CPI. Taken BRANCHs
also incur a latency penalty, but the probability of occurrence
is not derivable from the cache simulations. If a 10% likelihood
and an optimizing compiler capable of filling half of the latency
is assumed, then the contribution to CPI from taken BRANCHs
is 0.15. The total CPI is therefore predicted to be 1.88.
From the perspective of the CPU, the only operations which can take place in the cache are a "LOAD," in which information is fetched from the primary cache and sent to the CPU, and a "STORE," in which information is sent from the CPU to the primary cache for storage. Depending on the state of the cache when the CPU attempts a transaction, however, things can get more complicated.
Perhaps the most complicated situation occurs in caches which employ copyback for writes. In such a cache not only is it necessary to handle hits and misses, but clean and dirty operations become an issue, as well.
For a copyback cache the possible transactions are a load hit,
clean load miss, dirty load miss, store hit, clean store miss,
and dirty store miss.
A load hit occurs when the CPU attempts to read from an address
which is available in the cache. In this case the CPU generates
an address which gets sent to the cache. The cache will usually
simultaneously check its tag RAM to see if the requested address
is available and start reading from the appropriate location in
the data RAM. The data will be sent to the CPU when it is found.
FIGURE 2.15: LOAD HIT
Figure 2.15 shows a timing diagram for this operation. The address needed by the tag RAM consists of the tag, which is used for comparison to the data stored in the tag RAM, and the line address, which tells which block (or set of blocks) to look at. The data ram needs merely the line address in order to access the requested data. The tag comparison and the data RAM access take place simultaneously in order to reduce the overall transaction time.
A clean load miss occurs when the CPU attempts to retrieve data which is not available in the cache. In this case the cache must retrieve the requested data from the next level of memory.
Whatever the method used, the block retrieved from the next level
of memory will be mapped to a specific location in the primary
cache. If this location contains data which has been modified
by the CPU, and if those modifications have not been reflected
in the higher levels of memory (as in a copyback cache), then
the primary cache must first send the next level of memory the
data stored in that cache location before storing anything new
there.
FIGURE 2.16: CLEAN LOAD MISS
In the clean load case, the data already stored in the cache has not been modified, so this step is not necessary.
Figure 2.16 is a timing diagram for a clean load miss. Note that
the secondary cache may begin its access in parallel with the
primary cache, and that the data arriving from the secondary cache
may or may not be written into the primary cache simultaneously
to sending it to the CPU.
When a dirty load miss occurs, the block fetched from the secondary
cache can not be stored in the primary cache until the modified
data already in that primary cache location is updated in the
higher levels of memory.
FIGURE 2.17: DIRTY LOAD MISS
Figure 2.17 shows a representative timing diagram for the dirty load miss case, also known as a "load copyback." Note that in this particular implementation the address from the CPU is sent to the secondary cache in parallel with the primary cache, allowing the secondary cache to get an early start retrieving the required data. By the time the primary cache realizes that a dirty miss has occurred (at the end of the tag comparison step) and has sent the dirty data to the secondary cache, the secondary cache has finished retrieving the requested data.
In the implementation shown in this figure, the secondary cache
latches this data and stores the dirty data sent by the primary
cache before sending the originally requested data. If the primary
to secondary cache communications consists of separate input and
output buses, then it would be possible for the secondary cache
to send the requested data as soon as it is ready, in parallel
with receiving the dirty data from the primary cache.
A store hit occurs when the CPU attempts to write into a memory
address which is already available in the primary cache. Regardless
of whether the contents of that address have already been modified
by the CPU, the primary cache need not perform a copyback to higher
levels of memory.
FIGURE 2.18: STORE HIT
Figure 2.18 is a timing diagram for a typical store hit. Note
that two cache RAM cycles are required for stores, since it is
necessary to first read the destination block in case a copyback
needs to take place before actually performing the write. In the
figure, the block write takes place after the compare, when the
cache hit is detected.
In a copyback cache, if the CPU attempts to perform a store into
an address which is not available in the cache, the cache must
determine whether the block frame into which the requested address
will be stored already contains an address which has been modified
in the cache but not in higher levels of memory (and therefore
needs to be copied back.) If the data has not been modified (the
line is not dirty), then no copyback need take place.
FIGURE 2.19
: CLEAN STORE MISS
Figure 2.19 is a timing diagram for a typical clean store miss.
Since the CPU word size is typically smaller than the cache block
size, and the cache always transfers blocks to and from higher
levels of memory, it is necessary for the cache to perform the
write in two stages. First, the entire block fetched from the
secondary cache is written into the primary cache's data RAM,
then the portion of that block which the CPU has asked to be modified
(presumably one word) is written into with the data supplied by
the CPU.
If the CPU attempts to write into a block already occupied by
data which has been modified in the cache but not in the higher
levels of memory, a copyback must take place.
FIGURE 2.20: DIRTY STORE MISS
Figure 2.20 is a timing diagram for a typical dirty store miss. The primary cache checks its tag RAM while the secondary cache begins its access. By the time the primary cache has determined that a dirty condition exists and send the dirty data to the secondary cache, the secondary cache will hopefully have finished its read access.
The two caches can perform their writes in parallel. Once again,
the primary cache must first write the entire block from the secondary
cache and then write the data that the CPU actually wanted to
modify.
The assumed access miss penalty for the primary cache, as previously stated, was 5 ns. This means that in order to achieve the CPI figures reported in [Phil93] the average time to access the next higher level of memory must be 5 ns, including the time required to transfer the address and receive the data.
It would be impossible to implement the entire 32 bit address space of the F-RISC / G processor in RAM with such a short access time, so a secondary level of cache is required.
Using the DineroIII simulator it is possible to generate a secondary cache trace. By simulating the primary cache with appropriate parameters, the transactions which would have occurred between cache levels can be stored into a file which can be converted to a new cache trace.
Simulating a primary cache employing a Harvard architecture with
2048 byte direct-mapped caches, copyback, and 512-bit wide blocks,
secondary cache traces were created for the tex, spice, and gcc
traces. Table 2.9 lists the lengths of these traces.
| Trace | Fetches |
| Spice | 64003 |
| Tex | 9405 |
| gcc | 99891 |
|
| |||
| 2 | 0.085 | 0.292 | 0.090 | 0.099 |
| 4 | 0.335 | 0.597 | 0.310 | 0.335 |
| 8 | 0.579 | 0.749 | 0.476 | 0.529 |
| 16 | 0.731 | 0.824 | 0.657 | 0.693 |
| 32 | 0.909 | 0.875 | 0.798 | 0.843 |
| 64 | 0.961 | 0.909 | 0.863 | 0.901 |
| 128 | 0.969 | 0.909 | 0.911 | 0.932 |
The secondary cache was simulated, assuming 512-bit wide blocks, a direct-mapped unified architecture, and copyback. The hit rates for various secondary memory sizes is given in Table 2.10. A unified cache was simulated since information about whether particular transactions occurred as a result of an instruction or data miss is lost in the trace translation process, but these figures should provide a rough estimate of the performance of various secondary cache configurations.
The secondary cache mean access time, T, can be written
as a function of the hit rate, h, the miss penalty for
the secondary cache (the tertiary cache mean access time), P,
and the time it takes the secondary cache to process a hit, H.
The CPI calculations in this chapter assumed T = 5 cycles. From this equation it is possible to determine the required secondary cache hit time given a tertiary cache mean access time and a secondary cache hit rate. Figure 2.21 is a plot showing the required secondary cache hit time given a secondary cache hit rate and a tertiary cache mean access time.
Based on these figures, one can reach the conclusion that a 1.88
overall CPI can be achieved if a direct-mapped Harvard secondary
cache with 512-bit wide blocks, 32 kB per cache of RAM, copyback
write allocation, and 20 ns mean access time for the tertiary
cache is employed. This would require the hit time on the secondary
cache to be around 3.5 ns.
FIGURE 2.21: REQUIRED SECONDARY
CACHE HIT TIME
Since the primary cache sends the secondary cache any address
which is received from the CPU as soon as it reaches the primary
cache controller, in some cases the primary cache may have more
time to complete its access. If the secondary cache receives no
addresses between when it initially receives an address and when
the primary cache signals that a miss has occurred, the secondary
cache can use the intervening time to access that cache address.
In the F-RISC / G system the secondary cache may receive two additional
addresses between receiving an address and its associated miss
signal. If no addresses are received, the secondary cache has
two additional cycles to perform its cache access. If the fraction
of uninterrupted accesses is U, then the mean access time
of the secondary cache can be written as:
It was possible to analyze the secondary cache traces to determine
how often the secondary cache would be able to get a two cycle
head start on its memory access. From this analysis it was determined
that 66% of secondary cache accesses can benefit from this head
start (Table 2.11).
| Head start | No head start | |
| gcc | 62731 | 37122 |
| spice | 45368 | 18634 |
| tex | 5441 | 3926 |
| TOTAL | 113540 | 59682 |
Assuming a tertiary cache penalty of 20 ns, a secondary cache hit rate of 0.90, the required hit access time of the secondary cache to achieve the 5 ns overall access time is 4.84 ns. By having the secondary cache detect that it has had an additional 2 ns to access two thirds of the incoming requests, the mean hit time can be slowed by 1.34 ns, which allows the secondary cache RAMs to be significantly slower.
In the case of the instruction cache, the situation is even better.
The instruction cache will fetch a new instruction every cycle
(barring stalls and traps). If no branches occur, then the secondary
cache will be required to fetch a new block once every sixteen
cycles, since each block contains sixteen instructions. As a result,
if the secondary cache is capable of keeping track of how long
it has been since a branch has occurred, it can take advantage
of the large block size to significantly reduce the constraints
on its data RAM access speed. Since cache simulations suggest
that only 4% of all instruction fetches result in an access to
the secondary cache, and, of these accesses, nearly all will be
consecutive branches, the instruction cache can buy quite a bit
of time by predicting ahead of time which block will be needed
next (through a remote program counter). If a branch occurs too
close to the end of a valid data fetch, the cache can be made
to restart its access. Determining the exact proportion and distribution
of branch instructions among miss instructions, however, is a
difficult proposition, requiring special trace-analysis software.
In order to achieve higher chip yield despite the high likelihood of critical faults in yield-limited technologies, it is common to utilize redundancy and fault-tolerance techniques. RAM chips are particularly well suited to these techniques as it is usually possible simply to include redundant memory cells which can be swapped for failed cells in the event of failure.
CML circuitry, unlike CMOS, does not benefit from low static power consumption; CML circuits draw power whether they are switching or not. As a result the inclusion of extra circuitry for the purposes of redundancy must be carefully considered. Additionally, the inclusion of redundant circuitry will usually increase die area, and, therefore, possibly increase signal transmission delays and slow down the circuit. A further speed penalty must be paid due to the need to introduce multiplexing logic to select which redundant structure to actually use, although, since the multiplexors are statically set at start-up, the delay is due to propagation only, and not switching.
In order to minimize the amount of multiplexing logic needed,
"block replacement," "column replacement,"
and "nibble replacement" are the most obvious redundancy
schemes. In order to minimize the area requirements of a replacement
scheme, adding additional columns to each block is an additional
possibility.
FIGURE 2.22: CACHE RAM BLOCK
FLOORPLAN
In block replacement an additional cache RAM block is placed on the RAM chip and swapped for a block containing a non-functional memory cell at system startup. The dimension of the replacement block must be equal in dimension to that of the other blocks.
Figure 2.22 is a floorplan of the F-RISC / G cache RAM block.
The block consists of sixteen columns of thirty-two rows. In order
for a particular column to work, each memory cell in the column,
the read / write logic, the threshold voltage circuit, all of
the line drivers, the address decoder, and the sense amp must
fully function.
FIGURE 2.23: BLOCK REPLACEMENT
Figure 2.23 shows how block replacement might be implemented giving the F-RISC / G RAM chip micro-architecture. The cache RAM normally consists of four 16-bit wide cache RAM blocks. When communicating with the secondary cache, all 64 bits of the cache RAM blocks are used. When communicating with the CPU, however, one nibble is selected from among the four blocks.
The top multiplexor (A), which is used to select a nibble from among the cache RAM blocks, already exists in the cache RAM, but would have to be expanded to include inputs from the redundant block. Replacing the four 4-input multiplexors with four 5-input muxes would require at least 24 additional transistors (since F-RISC / G uses 3-level CML gates, adding an additional input to a four-input multiplexor requires an entire new gate, which imposes an additional hardware penalty of one current switch).
The next layer of multiplexors (B), which is used to select one of four nibbles from among each of the cache RAM blocks to send to the top multiplexor, also exists, and is actually part of each cache RAM block.
The next lower level of multiplexors (C), would have to be added to select from either the redundant block or the main blocks when reading out to the secondary cache. This would entail adding 64 2-input multiplexors which would require 384 transistors.
Each cache block has a set of multiplexors on its input which is used to select from either the wide data path or the narrow data path. A multiplexor would have to be added (D) to the input of the redundant block to enable it to receive data that was actually intended for one of the other, malfunctioning blocks. This multiplexor would actually have to multiplex four, 16-bit buses, and would likely be implemented as sixteen 4-input multiplexors (requiring 222 transistors).
There is little time penalty to be paid on a read through the narrow data path (only multiplexor A becomes more complicated). The L2 path has an additional multiplexor delay on reads. On writes from the narrow path, there is no additional time penalty for implementing this scheme. Writes from the L2 path incur an additional multiplexor delay due to D. Of course, this discussion includes only logic delays; by making the chip larger and requiring more routing area and loading on data and address lines, circuit speed will be adversely affected.
Implementing block replacement would require 630 additional devices
for multiplexing, in addition to the hardware costs of the additional
block and control circuitry.
In the F-RISC / G RAM chips, each of the four cache RAM blocks is divided into four nibbles, one of which is supplied to a multiplexor which further selects between the nibbles provided by each block. The nibble output of this multiplexor is sent to the CPU.
Since the cache RAMs are partitioned along nibble boundaries, it might seem reasonable to implement a nibble replacement policy in order to add fault tolerance.
One nibble replacement scheme would have an additional cache RAM block which would allow the capability to replace any four malfunctioning nibbles. While the block replacement scheme allows an infinite number of faults in exactly one cache RAM block, the nibble replacement scheme allows an infinite number of faults scattered among exactly four nibbles, which may be physically located in any of the cache RAM blocks.
Figure 2.24 illustrates one possible implementation of nibble
replacement for the F-RISC / G RAM chip.
FIGURE 2.24: NIBBLE REPLACEMENT
This scheme is more complicated to implement than block replacement. The previously existing nibble-selecting multiplexors for the four-bit data path (A and B) can be left as they are, although the B multiplexors may have to be removed from the cache RAM blocks. In order to allow any of the four redundant nibbles to be swapped for any other malfunctioning nibble (for both the wide and narrow data paths), the C multiplexors are added. These multiplexors must select from between five four-bit buses. Since sixteen such multiplexors are required, this scheme has a fairly high hardware cost (1280 transistors). In addition, this layer of multiplexors imposes two additional gate logic delays on reads on both the wide and narrow data paths.
In order to accommodate writes, the D and E multiplexors must be capable of selecting any four out of sixteen incoming data nibbles. This is significantly more complicated and time-consuming that what is necessary for the block replacement scheme, requiring 1120 additional transistors.
Overall, the nibble replacement scheme requires 2400 additional
transistors for multiplexing, in additional to the hardware cost
of the additional block and control circuitry.
In a column replacement scheme, an extra block is added to provide replacement memory columns for any malfunctioning columns. In the F-RISC / G RAM chip, an additional cache RAM block could enable any 16 columns to be replaced, thus an infinite number of faults could be scattered across any 16 columns and the RAM chip would still function.
The block diagram for column replacement for the F-RISC / G RAM chip is similar to that for nibble replacement. Instead of selecting from five 4-bit buses, the C multiplexors must select four out of 20 bits (at a cost of 360 transistors). This would require three levels of multiplexors, and thus three gate delays.
The B and A multiplexors would function as before. The D and E multiplexors would have to be capable of selecting any four of the 64-bit input lines (1176 transistors). The layer of input multiplexors on the block would have to be able to each select any one of the 4-bit input lines. This would require still more hardware (56 transistors) than the nibble replacement case, as well as adding more delay.
The multiplexors in the column replacement scheme would require
1592 transistors. The additional block and control circuitry represent
other hardware costs.
Adding an additional block to the cache RAM would require a great deal of space. The only way to do this and keep the die size under 1 cm2 would be to hand craft all of the additional circuitry to minimize the space it requires. An alternative possibility would be to add additional columns to each block, solely for the purpose of replacing malfunctioning columns within that block.
Figure 2.25 is a block diagram for a possible implementation
of additional column replacement in F-RISC / G. Each column contains
one or more extra columns which can be swapped for malfunctioning
columns using multiplexing circuitry. The C multiplexors
are used to allow the output of any of the primary columns to
be overridden by the contents of any of the additional columns.
If there is only one additional column, then each of these multiplexors
must accept two inputs, requiring 384 devices in all.
FIGURE 2.25: EXTRA COLUMN PER BLOCK REPLACEMENT
If there are three additional columns per block, then the C multiplexors would require 896 transistors. Implementing more than three additional columns per block would require an additional layer of multiplexors, eliminating the possibility of achieving the required RAM access time.
Additional multiplexing circuitry is also required on the inputs
of the additional columns to select between any of sixteen wide
inputs or four narrow inputs. This will require three levels of
multiplexors, which will very negatively affect the write access
time of the RAM. 360 devices would be required for one additional
column, while 1080 would be required for three additional columns.
In determining whether some sort of redundant circuitry should be included in the RAM chip in order to aid in fault-tolerance the fault process must be modeled. Once a model for fault distribution is established, the circuitry of the chip must be analyzed to determine which circuits can be replicated efficiently, and how many circuits must work and in what quantity they must work in order to have a chip which fully functions.
Once circuitry is replicated in order to add redundancy (and thus fault-tolerance), certain parts of the chip can experience faults while still allowing the full chip to function, while other sections of the chip can not tolerate any faults if a functional chip is to be obtained. For example, merely adding an additional cache RAM block does no good if a critical fault occurs on the control circuitry which determines which blocks to use, or on the I / O circuitry which allows one to bring the required data out of the chip.
The simplest assumption has the chip consisting entirely of circuitry which is duplicated to provide fault tolerance. For example, if block replacement is used, one might model the chip as consisting of five cache RAM blocks, any four of which must work completely in order to have a fully functioning chip. If fewer than four cache RAM blocks work, then the chip is no good.
Assume the chance of a single block containing no critical defects
(no defects which prevent that block from fully functioning) is
equal for each of the blocks on the chip and is represented as
YBLOCK. Further suppose that block failures
occur independently of each other. Then the number of working
blocks which can be expected is characterized by the binomial
distribution:
N is the total number of blocks, x is the number of functioning blocks, and P(X=x) is the probability of exactly x functioning blocks given N total blocks.
Similarly, YCOLUMN, the probability of a fault free column, and YNIBBLE , the probability of a fault free nibble, can be substituted into the equation to find the probability of working chips given column or nibble replacement.
Given that four working blocks (block replacement), sixteen working
nibbles (nibble replacement), or sixty-four working columns (column
replacement) are necessary for each RAM chip to work, Table 2.12
gives the probability of a working chip, still ignoring support
circuitry, for each replacement scheme:
| Block Replacement | 4Yblock4(1-Yblock)+Yblock |
| Nibble Replacement | SUM(x=16..20)[20!/(x!(20-x)!)]Ynibblex(1-Ynibble)20-x |
| Column Replacement | SUM(x=64..80)[80!/(x!(80-x)!)]Ycolumnx(1-Ycolumn)80-x |
| Single Extra Column Replacement | Ycolumn16+16Ycolumn16(1-Ycolumn) |
| Three Extra Columns Replacement | SUM(x=13..16)[16!/(x!(16-x)!)]Ycolumn16(1-Ycolumn)16-x |
In order to introduce faults in the support circuitry to the model, the simplest assumption to make is that all of the circuitry on the chip is either replicated for redundancy, or support circuitry which, if a fault occurs, results in a non-functioning chip. If the probability of a chip having enough functioning replicated circuitry to function is Yfault-tolerant and the probability of the support circuitry containing no faults is Yfault-intolerant, then the probability of a fully functioning chip is Yfault-tolerant x Yfault-intolerant.
In order to calculate the relative desirability of each replacement scheme, it is necessary to relate YBLOCK, YCOLUMN, and YNIBBLE, and determine the Yfault-intolerant for each scheme.
The first step is to generate a model for critical defects. If
it is assumed that critical defects occur independently of each
other, then the binomial distribution is appropriate. If it is
further assumed that failures occur as point defects scattered
across the wafer's surface, then the probability of a defect being
located at any specific point is extremely small, and the number
of possible points where a defect can occur is extremely large.
The Poisson distribution is a limiting case of the binomial distribution
with small probability of "success" and large number
of "trials." For this reason, it is often used as a
first-pass model for defect distributions in many types of systems
[Devo91]. The probability of k defects on a wafer, assuming
a Poisson distribution of defects, would be e-lambdalambdak/k!, where lambda is the population
mean rate of defects per wafer. Since the probability of a working
wafer is the probability of zero defects occurring in the block,
k would be 0, so the equation would reduce to Ywafer=e-lambda.
An alternative analysis would assume "clustering" of
faults across the wafer or reticle. Such an assumption would tend
to disfavor the inclusion of redundant circuitry, however, since
faults will tend to cluster in such a way as they wipe out large
portions of a few die.
The number of defects per wafer is, by definition, the number
of defects per unit area (p) multiplied by the area
(a) of the wafer (lambda=pa). As an approximation, the
area of a column is 1/16 that of a block, and the area of a nibble
is 1/4 that of a block. In general, the yield for a circuit of
area a would be . Once p is known for the process,
the yield for a circuit of size a can be solved for.
Given the area relationships between a block, column, and nibble,
YBLOCK, YCOLUMN, and YNIBBLE
can be solved for in terms of only the column area (c)
and p:
The pc term is the average number of critical defects
expected per column.
The approximate device penalty for implementing each type of
replacement scheme can be found in Table 2.13.
A rough estimate of the minimum support circuitry required without
any replacement scheme is 2500 devices. A column contains approximately
140 devices (each bit uses only two devices, but these special
devices are approximately twice as large as the devices used for
other circuits). Making the rather poor assumption that the area
required by a circuit is proportional to the number of devices
in the circuit (the cache RAM blocks are much more dense than
most other circuitry), the yield of the support circuitry for
each replacement scheme can be written in terms of only p
and c:
Since the circuitry in the cache RAM blocks is much more dense
than the rest of the circuitry in the cache RAM, a defect occurring
in the area of the cache block may be more likely to cause one
or more faults. The calculations would more properly be done by
determining the "critical area" for each type of circuit.
The critical area is the area of a circuit which is sensitive
to a defect of some given size, and is difficult to calculate
for complicated circuits without specialized software tools. Since
the non-cache RAM block areas, while having lower device density,
tend to have more metallization, the critical area of a non-cache
RAM block circuit with a given number of transistors may be on
par with the critical area of a RAM block circuit with comparable
device count, even while occupying more physical space.
Figure 2.26 shows the chip yield for each replacement scheme
given a cp figure. At low fault probabilities the chip
yield (the probability that a chip will work) approaches 1 for
each of the five schemes. The differences between the scheme are
minute, with single extra column replacement having performing
the best. Although adding three columns per block would result
in more flexibility and allow multiple bad columns per block,
the additional support circuitry required results in reduced yield.
FIGURE 2.26: CHIP YIELDS AS
A FUNCTION OF FAULT PROBABILITY
When yield is particularly bad, none of the schemes performs
very well. When one in ten columns contains a fault, the chance
of a working chip using any of these schemes is essentially zero.
For several reasons it was decided that implementing fault-tolerance
in F-RISC / G was impossible. Any of these algorithms will negatively
impact the cache RAM access time. Aside from adding a minimum
of one gate delay (block replacement or additional column per
block replacement) to the read critical path, the RAM chip will
be significantly larger. This increase in size causes several
difficulties. First of all, chip size must be less than 1 cm2
in order for the die to fit on the reticle. Secondly, increased
die size will result in additional loading delays as interconnect
becomes longer. In most cases the logic delays are already overshadowed
by loading delays on the F-RISC / G critical paths. Even if a
smaller cache RAM block were used (there is an 8-bit wide design
with faster access time which is used in the datapath chips as
a register file), thus requiring approximately half the area,
the additional multiplexors and support circuitry which would
be needed would require a great deal of space.
The critical path in the cache is too tight to survive these
sorts of delays. In addition, analysis of the results of testing
the RPI Testchip indicated that yield, at the time, was so bad
that little would be gained from adding fault-tolerance.
The RPI Testchip contained several testable circuits of comparable
complexity to a cache RAM block column (the chip actually contained
a version of the cache RAM block, but, unfortunately, too few
of the chips had enough working circuitry to allow any meaningful
results to be obtained.)
Table 2.14 [Phil93] shows the test results for some circuits
of comparable critical area to a cache RAM block column. From
these results cp is found to range from 0.5 to 0.73, which,
as Figure 2.26 illustrates, is the region where these fault-tolerance
strategies have little beneficial effect.
.
The conclusion to be drawn is that implementing fault tolerant
strategies in extremely low-yield technologies may not be worthwhile.
This is particularly the case when the low-yield technology is
being used because of its high-performance. In such cases, the
performance penalty induced by the implementation of fault-tolerance
may negate the performance benefit of the low-yield technology.
Also, even if fault-tolerance is implemented, simple, inflexible
schemes may provide superior chip yield results to more flexible,
and therefore, more complicated, solutions.
Circuit VCO Write oscillator Carry chain Data LFSR Address LFSR