F-RISC existed as an architecture long before the F-RISC/G
design team committed it to GaAs. The instruction set was designed
to allow a reasonable amount of programming flexibility while
keeping the hardware simple; the overriding idea was to
minimize hardware complexity in order to maximize clock rates.
While the fastest available devices tend to have low yields, it
is sometimes possible to sacrifice some device speed for greatly
increased device integration. Some of the future iterations of
F-RISC are expected to explore these areas. In addition, the technology
of F-RISC can be applied to other high-speed digital design arenas,
such as network switching. Many of the ideas advanced in this
chapter illustrate that the overall throughput of F-RISC / G could
probably have been improved markedly had the design process proceeded
differently.
The F-RISC architecture has been implemented in several technologies and several more implementations are planned. As the architecture doesn't have to support a massive software base as of yet, it is permissible to make changes to it to suit research purposes. Any such changes should fit with the F-RISC philosophy: blinding speed at any cost. F-RISC/H, the successor to the current design, has begun to take shape on paper. It is hoped that faster devices, higher device integration, multiple, perhaps deeper pipelines (using VLIW technology), better CAD tools, and more advanced packaging will allow at least an eight times increase in single node throughput.
It is possible, however, that some of these technologies can be tested with the current F-RISC/G design, leveraging the considerable effort and expense that went into it to allow experimentation with a few of these more advanced technologies before committing to the whole package for F-RISC / H.
The most important lesson to be learned from the F-RISC / G cache design is the necessity to design the entire processor holistically; if the cache is relegated to secondary importance, the overall design will work more slowly than it could have. In the F-RISC / G case, the fact that the cache was designed after the CPU design was frozen resulted in tighter than necessary critical paths in the cache, excessive circuit complexity in the cache controllers, and unnecessarily long MCM routing.
In [Przy90], this idea is validated:
"Guideline 8: Design the memory hierarchy in tandem with the CPU
It is an easy trap to fall into: design the CPU, shrink its cycle time as much as possible, then design the best cache possible that can cycle at the same rate. If either the cache or CPU must be designed first, it should be the cache. Look at the resources available to it and determine what attainable cycle time and size pair yield the highest performance. Then build the CPU with whatever resources are left. Better yet, consider the whole problem: the system design. Partition the resources among the various functional units so that they are all matched in cycle time and together they yield the best overall system-level performance."
The F-RISC / G design process proceeded exactly as in the "trap"
that Przybylski describes, and, as a result, the overall throughput
of the design was sacrificed in order to attain minimum cycle
times. Additionally, some of the techniques in this chapter would
allow higher throughput with no impact on cycle time, but were
not available as options due to the frozen status of the CPU design.
FIGURE 5.1: CYCLE TIME / CPI TRADE-OFF
When test data revealed that the Rockwell process produced worse-than-expected interconnect performance (due to an anisotropic dielectric with high planar dielectric constant), a great deal of effort was required to improve the routing and final layout. Many of these problems could have been avoided entirely if the CPU and cache had been designed together.
In a commercial processor, trading off throughput for a shorter
cycle time would be intolerable. Since F-RISC is a research effort,
there is some flexibility in the design objectives. Future
versions of F-RISC, however, may very well benefit from different
design priorities. Figure 5.1 illustrates the trade-off between
CPI and cycle time; throughput can be maintained if CPI is decreased
as cycle time is increased. More importantly, if a slight increase
in cycle time buys a large reduction in CPI, throughput will
increase despite the longer cycle time.
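The trade-off can be made concrete with a short calculation. The following is a minimal sketch, assuming throughput is simply the reciprocal of CPI times cycle time; the function names and the two design points are hypothetical illustrations, not measured F-RISC values.

```python
def throughput_mips(cpi: float, cycle_time_ns: float) -> float:
    """Throughput in millions of instructions per second.

    One instruction takes cpi * cycle_time_ns nanoseconds on average,
    so 1000 / (cpi * cycle_time_ns) instructions complete per microsecond.
    """
    return 1000.0 / (cpi * cycle_time_ns)


def break_even_cpi(old_cpi: float, old_cycle_ns: float, new_cycle_ns: float) -> float:
    """CPI that keeps throughput unchanged when the cycle time changes."""
    return old_cpi * old_cycle_ns / new_cycle_ns


# Hypothetical design points (assumptions for illustration only).
fast_clock = throughput_mips(cpi=2.0, cycle_time_ns=1.0)  # short cycle, high CPI
slow_clock = throughput_mips(cpi=1.5, cycle_time_ns=1.1)  # 10% longer cycle, lower CPI

print(f"short cycle: {fast_clock:.0f} MIPS, longer cycle: {slow_clock:.0f} MIPS")
print(f"break-even CPI at 1.1 ns: {break_even_cpi(2.0, 1.0, 1.1):.3f}")
```

Any design whose CPI falls below the break-even value at the longer cycle time delivers more throughput than the original design point.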
The current partitioning of the F-RISC/G system was chosen so
as to minimize the length of the primary CPU critical paths, and
thus maximize CPU clock rate. The partitioning was in large part
enforced by economic and technical limitations which may, in the
future, become less important. The use of even more exotic packaging
alternatives, the ability to produce larger die with more devices
per die, and faster devices would all affect the partitioning
decisions made in the F-RISC/G system.
The overall processor throughput would be greatly improved if the device yield were higher, allowing larger device integration. While my research was focused on low-yield architectures, it is interesting to consider the effect of even doubling the number of devices per die (which would still result in extremely low device integration as compared to other technologies).
When millions of transistors are available cache designers tend to incorporate complicated cache architectures rather than dedicating all of the transistors to cache memory. As memory sizes become arbitrarily large, rather than using additional transistors to make a small percentage change in the size of the cache, it is preferable to use the devices to improve the control circuitry to make better use of the RAM already available.
Improvements in control logic which may be feasible when the device budget is high include adding multiple cache sets, branch prediction, and complicated replacement algorithms.
In the low transistor regime, however, it makes more sense to
simply increase the size of the cache RAM. This is borne out by
the Dinero simulations.
In the F-RISC/G design, the instruction cache controller contains the Remote Program Counter, as was discussed in Chapter 2. The benefit of including this circuitry on the cache controller is that it reduced the address bus traffic between the CPU and the caches; only on branches does the cache controller need to receive an address from the CPU.
By reducing this traffic it was possible to have both caches share an address bus, reducing pin-out, device counts, and routing complexity on the datapath chips. An important disadvantage of this combined bus is that the load capacitance on the bus is approximately twice as high as each of two separate buses would have been. This increases the cycle time for each data cache access by approximately 150 ps.
In addition, the inclusion of the remote program counter results in an additional multiplexor delay on the tag RAM address bits and cache RAM address bits. If not for the RPC, the low order address bits could pass directly to the cache RAMs. This would shorten the cache memory cycle time by approximately 300 ps (presuming the tag RAM and comparator could produce a result quickly enough). Alternatively, more RAM chips could be included, slowing the access time but increasing hit rate.
Aside from splitting the address bus into instruction and data address buses, the RPC could be eliminated if the cache controller contained some of the functionality of the instruction decoder, or, alternatively, if the cache controller functions were merged into the instruction decoder.
If the device yield were sufficient, the cache controller could decode PC-relative unconditional BRANCH instructions immediately. Conditional BRANCHes would require control signals to be exchanged between the datapath and cache controller. Also, BRANCH instructions which do not use PC relative addressing would have to rely entirely on the datapath. This particular solution merely complicates the cache controller while adding further communications complexity between the cache and CPU. The possible advantage is that PC-relative addressing may be so common that an additional cycle of latency for non-PC-relative BRANCH instructions can be tolerated.
The other option is to incorporate a tag RAM and comparator on
the instruction decoder, eliminating the instruction cache controller
entirely. This would require the address to be communicated from
the datapath to the instruction decoder on every cycle, which
would do little to improve the communications latency problem.
If, however, the comparator and tag RAMs were distributed across
the datapath instead, each slice of the comparison could be performed
very quickly, and it would be necessary only to accumulate one
signal across the four slices, indicating whether a miss occurred.
This signal accumulation is similar to what occurs in the F-RISC
/ G design with the adder carry chain.
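The sliced comparison can be modeled in a few lines. This is a behavioral sketch only: the 8-bit slice width and the function names are assumptions made for illustration, and the real signal would be accumulated in hardware rather than software.

```python
SLICES = 4       # four datapath slices, as in the text
SLICE_BITS = 8   # assumed slice width; not specified for the tag in the text


def split(value: int, slices: int = SLICES, bits: int = SLICE_BITS) -> list[int]:
    """Cut a tag into per-slice fields, least significant slice first."""
    mask = (1 << bits) - 1
    return [(value >> (i * bits)) & mask for i in range(slices)]


def sliced_miss(stored_tag: int, address_tag: int) -> bool:
    """Each slice compares only its own bits; a miss is the OR of the slice results."""
    per_slice_mismatch = [s != a for s, a in zip(split(stored_tag), split(address_tag))]
    return any(per_slice_mismatch)  # the single signal accumulated across slices


print(sliced_miss(0x12345678, 0x12345678))  # tags agree in every slice
print(sliced_miss(0x12345678, 0x92345678))  # only the top slice disagrees
```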
In the instruction cache, the inclusion of the remote program counter mandates that the nine low order address bits be routed to the cache controller, and, from there, to the cache RAMs. In the data cache, where there is no need for an RPC, and if the instruction cache were modified to eliminate the RPC, it would still be necessary to route these address bits through the cache controllers in order to handle misses; the addresses must be saved since by the time the cache RAMs need them (to allow the secondary cache to write into them), the CPU will have placed alternate addresses on the bus. In the current implementation, these pipeline latches are located in the cache controller.
If the latches for these nine address bits are moved to the cache
RAMs (which are already among the smallest chips in the system)
or to the datapath, it may be possible to eliminate a pair of
I/O delays, as well as decrease the MCM time-of-flight; this could
decrease the length of the primary critical path by approximately
500 ps.
FIGURE 5.2: PROPOSED REVISED SYSTEM DIAGRAM
Figure 5.2 illustrates the F-RISC/G system with the elimination of the remote program counter, split 9-bit address buses to each cache, an L2 block size of 512 bits (sub-block placement would require additional address bits to be routed from the cache RAMs to the secondary cache), and the addition of pipeline latches to the cache RAMs or datapath.
If the pipeline latches are included on the cache RAMs, then the cache RAMs must be informed as to when a miss has occurred (either by the cache controller or the datapath), and must receive a Valid Address strobe to indicate when to advance the pipeline.
If, on the other hand, the datapath is responsible for maintaining the pipeline, these signals are not necessary. In fact, the datapath already includes program counter history registers which contain the necessary addresses for the instruction cache. At the time the instruction decoder receives MISSI, the PC_I2 register will contain the address which the cache RAMs will need in order to process the miss. For a data cache miss, at the time MISSD arrives at the instruction decoder the RES_D1 register will contain the address which data cache RAMs will need. The pipelines will advance once before the datapath can be informed of the miss.
The datapath already includes a path from the RES_D1 register through the ALU and out to the address bus. In order to remove the cache controller from the address critical path it should only be necessary to modify the control circuitry in the CPU.
FIGURE 5.3: PROPOSED DATAPATH BLOCK DIAGRAM (ADAPTED FROM [PHIL93])
The program counter in the datapath (PC_I1 in [Phil93]) is either loaded or incremented on each phase 3. Currently, the only path from the program counter to the address bus is through two PC history registers and then through the ALU. As a result, the PC cannot be put on the address bus for at least an entire cycle after it increments. In addition, since the PC normally increments each cycle (except on branches), the ALU would never be available for other use.
Figure 5.3 is a block diagram for a proposed modification to
the F-RISC/G datapath. This particular implementation contains
as few modifications from the current design as possible, and
is not necessarily the optimum configuration.
The address bus is shown to be split so that the ALU need only become involved on BRANCH instructions. There is a direct path from the PC to the instruction address bus, and a multiplexor is added to select from the current PC, or the PC_DE history register (in the event of a miss). The output of the ALU is also an input to the multiplexor, in order to allow branches to take place.
The path from the RES_D2 register through the ALU, which would be used in the event of a data cache miss, is marked as well. The control logic in the datapath would have to be modified to cause the appropriate pipelined address to be put on the correct bus in the event of a cache miss.
Figure 5.4 shows that implementing these changes in the instruction decoder may result in critical path savings of up to 650 ps. While this diagram shows a pipeline in the cache controller, it is possible to remove the cache pipeline from the cache controller, route a copy of the address bus to the cache controller, and use it to address the tag RAM. This is because the CC Pipe 1 master latch, which is used to address the tag and cache RAMs, mirrors what the datapath would put on the data address bus given these modifications. By increasing the complexity and pad count on the instruction decoder and datapath chips, it is possible to greatly simplify the cache controller (eliminating approximately two thousand devices), and reduce total system device count.
Before such modifications can be implemented in the datapath
and instruction decoder, careful consideration should be given
to the effects of such changes on the main CPU critical paths.
Such modifications would clearly increase the size of the core
CPU chips, and if the increase in size is significant, the cycle
time of the core CPU will be adversely affected.
Figure 5.5 is a block diagram for the cache controller given a split bus, removal of the remote program counter and the pipeline latches, and the use of the pipeline latches already present in the datapath.
While such modifications to the chip set appear practical, it is unclear how much circuitry would have to be added to the core CPU to accomplish all of the objectives of the cache control circuitry. Copybacks, cache start-up, and page faults are all fairly complicated, and care must be taken to ensure that the cache and tag RAMs are addressed properly at all times. It may be necessary to maintain some pipeline latches on the cache controller in order to handle the write and valid signals, as well. However, since there would be no RPC, there would no longer be any purpose in the third pipeline stage; the minimum two is all that need be implemented.
It should also be noted that while such modifications immediately
speed up the secondary cache memory critical path by eliminating
several gate delays (multiplexor and latch delays between the
I/O receivers and the tag RAM) and decreasing the size of the
die, further work may be necessary to reduce the tag RAM / comparator
cycle time sufficiently to prevent it from becoming the primary
critical path. As much of the comparator delay time is caused
by routing through the top standard cell area to the tag RAM,
removing the pipeline latches which spatially dominate that portion
of the chip would greatly reduce the comparator propagation delay.
In addition, if time permitted, the tag RAM could be reduced from
32 bits to 24 bits wide, and the 25% savings in power could be
used to increase the speed of the block if necessary.
In the current implementation, accessing a word of memory requires the use of all eight RAM chips in a cache. The memories are interleaved so that four bits of the word are contained in each RAM. This scheme is beneficial in reducing MCM routing density; only four data lines need to be routed to each RAM. Figure 5.6 illustrates the F-RISC / G nibble interleaving scheme.
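The nibble interleaving can be sketched as a scatter and gather of 4-bit fields. The assignment of bits 4i through 4i+3 to RAM i is an assumption made for illustration (the actual bit-to-chip mapping is not given here), and the function names are hypothetical.

```python
NUM_RAMS = 8      # eight RAM chips per cache
NIBBLE_BITS = 4   # four data lines routed to each RAM


def scatter(word: int) -> list[int]:
    """Split a 32-bit word into eight nibbles, one per RAM chip."""
    return [(word >> (NIBBLE_BITS * i)) & 0xF for i in range(NUM_RAMS)]


def gather(nibbles: list[int]) -> int:
    """Reassemble the word from the nibble supplied by each RAM chip."""
    word = 0
    for i, nibble in enumerate(nibbles):
        word |= (nibble & 0xF) << (NIBBLE_BITS * i)
    return word


nibbles = scatter(0xDEADBEEF)
print([hex(n) for n in nibbles])
print(hex(gather(nibbles)))
```

Every access touches all eight chips, which is why only four data lines per chip are needed but no two words can be accessed at once.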
If a single RAM supplied all 32 bits of data, then all 32 data
lines would have to be routed to all 8 RAM chips, and a tri-state
bus would have to be used. This would also greatly increase the
capacitive load on the data bus. Future implementations of F-RISC
may benefit from this or another interleaving scheme, however
(Figure 5.7).
FIGURE 5.6: NIBBLE INTERLEAVING
If a single RAM chip were responsible for each word, for example,
it would be possible to simulate multi-ported RAM so long as the
words to be accessed simultaneously were not contained on the
same RAM. This technique could allow a relaxation of the prohibition
against a LOAD or STORE immediately following a STORE;
if the second instruction accesses a memory location contained
in a different RAM than the first instruction, the write to the
first address can proceed simultaneously with the access to the
second address.
It can either be left to the compiler to schedule memory accesses
(in which case the programming model would be made to depend on
the cache implementation), or, preferably, hardware can be included
in the cache to perform the second operation if possible, and
otherwise stall the CPU.
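The hardware check described above amounts to comparing the RAM selected by two consecutive accesses. The sketch below assumes that the low-order word address bits select the RAM and that a conflict costs exactly one stall; both are assumptions for illustration, as are the function names.

```python
NUM_RAMS = 8  # one word per RAM chip under word interleaving


def ram_for(word_address: int) -> int:
    """Assumed mapping: low-order word address bits select the RAM chip."""
    return word_address % NUM_RAMS


def can_overlap(store_address: int, next_address: int) -> bool:
    """A STORE and the following access can overlap only if they use different RAMs."""
    return ram_for(store_address) != ram_for(next_address)


def count_stalls(accesses: list[tuple[str, int]]) -> int:
    """Count accesses that directly follow a STORE to the same RAM chip."""
    stalls = 0
    for (prev_op, prev_addr), (_, addr) in zip(accesses, accesses[1:]):
        if prev_op == "STORE" and not can_overlap(prev_addr, addr):
            stalls += 1
    return stalls


trace = [("STORE", 0), ("LOAD", 1), ("STORE", 2), ("LOAD", 10)]
print(count_stalls(trace))  # addresses 2 and 10 fall in the same RAM
```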
FIGURE 5.7: WORD INTERLEAVING
The difficulty here would be the tag RAM; like the cache RAMs, the tag RAM must be able to support a read and write in the same cycle in order for this technique to work. This could possibly be accomplished by increasing the access speed of or interleaving the tag RAM.
This type of interleaving will probably be a necessity in any future F-RISC VLIW design; if at all possible the goal should be to allow at least one LOAD or STORE, and probably more, to occur during each cycle in order to keep the ALUs busy.
A related issue is the bus width to the secondary cache. While the secondary cache must always send the primary cache an entire line in order to gain the full benefit of large block size (taking advantage of the principle of spatial locality), it is less clear that the primary cache must always copy back an entire block to the secondary cache.
In F-RISC / G, the data bus between the primary and secondary
caches is bi-directional and there is little to be gained by allowing
the cache to copyback sub-blocks. In designs where a bi-directional
bus is not used, multiple dirty bits per block could be used to
allow sub-block copybacks, thus reducing the number of MCM traces
(assuming multiplexing takes place on the cache RAMs). If word
interleaving is used to allow reads and writes to occur simultaneously,
it would be possible to copyback just the dirty word in a block
while simultaneously writing into other words of the block (reading
from other words would not be an issue - in order for a copyback
to occur, a miss must have occurred, in which case the entire
block is invalid.)
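A per-word dirty mask is one way to picture sub-block copybacks. The sketch below assumes one dirty bit per 32-bit word and the 512-bit block of the proposed system diagram; the mask encoding and function names are hypothetical.

```python
WORD_BITS = 32
BLOCK_BITS = 512                           # block size from the proposed system diagram
WORDS_PER_BLOCK = BLOCK_BITS // WORD_BITS  # 16 words per block


def copyback_words(dirty_mask: int) -> list[int]:
    """Return the word offsets whose dirty bit is set (bit w covers word w)."""
    return [w for w in range(WORDS_PER_BLOCK) if (dirty_mask >> w) & 1]


def copyback_bits(dirty_mask: int) -> int:
    """Bits transferred to the secondary cache if only dirty words are copied back."""
    return len(copyback_words(dirty_mask)) * WORD_BITS


mask = 0b0000_0000_0000_0101  # words 0 and 2 were written
print(copyback_words(mask), copyback_bits(mask), "of", BLOCK_BITS, "bits")
```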
FIGURE 5.8: 1 KB L0 CACHE
Rather than interleaving the accesses spatially among several RAM chips it may be desirable to "temporally interleave" accesses. A method of accomplishing this is to increase the size of the bus between the primary cache and the CPU and assume that some memory, an "L0 cache", exists on the CPU to hold excess data. The idea would be to transmit to the CPU more information than is immediately needed, thus allowing the CPU to use the L0 cache to access data while the L1 cache is busy with a new access.
Figure 5.8 shows the locus of primary cache access times and
L0 block transfer sizes which achieve a 1.86 CPI given 1024 bit
L0 caches. While this cache size is clearly too large to be feasible
in F-RISC/G, it may be possible to implement in future designs.
The figure results from several Dinero simulations run on a suite
of five traces. A Harvard write-back architecture is assumed.
FIGURE 5.9: 1 KB VS. 2 KB L0 CACHE
As the figure shows, for small L0 block sizes it is possible to allow the primary cache to take multiple cycles to perform a transfer (as opposed to "1 cycle" in the current design) and achieve the same performance as in the current design. This effect can be leveraged to reduce the number of pipeline stages assigned to memory accesses, or to reduce the speed, and hence the power consumption, of the cache. A large bus size between the L0 and L1 caches increases the benefit of this technique.
Figure 5.9 shows the difference in allowed access time between 1024 bit and 2048 bit L0 cache (Harvard) assuming a 128 bit bus between the CPU and primary cache. As can be seen, the smaller cache size works nearly as well as the larger, suggesting that a small cache may be sufficient. If multiple pipelines are used in F-RISC / H, then the size of the L0 cache may need to be increased.
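The filtering effect of an L0 cache can be illustrated with a toy model. This is not a substitute for the Dinero simulations: it assumes a direct-mapped L0 (the organization is not specified in the text), 32-bit words, and a synthetic sequential reference stream; the function name is hypothetical.

```python
def l0_hit_rate(word_addresses, l0_bits=1024, block_bits=128, word_bits=32):
    """Fraction of references served by a direct-mapped L0 (assumed organization).

    Each miss fetches one block_bits-wide block from the primary cache,
    so later references to the same block are served without an L1 access.
    """
    if not word_addresses:
        return 0.0
    words_per_block = block_bits // word_bits
    num_lines = l0_bits // block_bits
    lines = [None] * num_lines           # block number held by each L0 line
    hits = 0
    for address in word_addresses:
        block = address // words_per_block
        index = block % num_lines
        if lines[index] == block:
            hits += 1
        else:
            lines[index] = block         # block transfer from the primary cache
    return hits / len(word_addresses)


# Purely sequential stream: one L1 transfer per four words with a 128-bit bus.
print(l0_hit_rate(list(range(16))))
```

With a sequential stream, three of every four references are absorbed by the L0, which is the slack that lets the primary cache take more than one cycle per transfer.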
It is expected that the L0 instruction cache will be located
on the same die as the instruction decoding hardware; the decoding
and the fetch can be combined and circuits can be reduced in complexity.
The L0 data cache will need to be located on the same die as the
pipelines in order to have sufficiently small access time.
One method of improving the hit rate of a direct-mapped cache with little additional hardware is the column-associative cache [Agar87].
A column associative cache is essentially a hybrid between a direct-mapped cache and a set-associative cache in that there is sufficient hardware only to check a single tag at a time but there are multiple possible locations in which any block may be stored.
The simplest implementation would allow only two locations to store any particular block. One location would be the normal location where the block would reside in a standard direct-mapped cache. The second location should be easily computed from the address, say by simply inverting one of the line address bits.
When data is to be retrieved from the cache and it is not found in the primary location, the CPU and secondary cache are informed as in the normal direct-mapped cache. While the secondary cache is retrieving the necessary data, the primary cache checks the alternate location. If the data is found, the CPU is informed that the data is available, and the CPU can end its stall.
As a result, there are three possibilities: the data is in the primary location, in which case the CPU need not stall; the data is in the alternate location, in which case the CPU need stall only for one cache RAM cycle (in the F-RISC / G case, one CPU cycle); or the data isn't in either location, in which case the CPU need stall for the same amount of time as in a miss in a conventional direct-mapped cache (presuming that the stall time would have been more than a cycle).
This idea has particular merit in F-RISC / G since the CPU can
handle variable stall times.
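The three outcomes can be sketched with a small behavioral model. The alternate location is found by inverting the top index bit, as suggested above; the replacement policy (moving the displaced block to the other location), the class name, and the index width are assumptions made for illustration and are not taken from the F-RISC / G design.

```python
from enum import Enum


class Outcome(Enum):
    FAST_HIT = "fast hit"   # found in the primary location
    SLOW_HIT = "slow hit"   # found in the alternate location
    MISS = "miss"           # found in neither location


class ColumnAssociativeCache:
    def __init__(self, index_bits: int = 5):
        self.index_mask = (1 << index_bits) - 1
        self.flip_mask = 1 << (index_bits - 1)       # invert the top index bit
        # Each line stores the full block address so either location can be verified.
        self.lines = [None] * (1 << index_bits)

    def access(self, block_address: int) -> Outcome:
        primary = block_address & self.index_mask
        alternate = primary ^ self.flip_mask
        if self.lines[primary] == block_address:
            return Outcome.FAST_HIT
        if self.lines[alternate] == block_address:
            return Outcome.SLOW_HIT
        # Assumed replacement policy: keep the displaced block in the other location.
        if self.lines[primary] is not None:
            self.lines[alternate] = self.lines[primary]
        self.lines[primary] = block_address
        return Outcome.MISS


cache = ColumnAssociativeCache(index_bits=3)
for block in (0, 8, 0, 8):
    print(block, cache.access(block).value)
```

Blocks 0 and 8 would evict each other on every access in a direct-mapped cache of this size; here the second reference to block 0 is served from the alternate location instead.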
FIGURE 5.10: COLUMN ASSOCIATIVE CACHE - SLOW HIT
FIGURE 5.11: ASSOCIATIVITY SCHEMES
Figure 5.10 shows the approximate timing of a slow cache hit in a column associative cache. The data will arrive at the CPU later than if the primary location contained the data, but more quickly than if it had been necessary to go to the secondary cache.
Using a variety of traces, the F-RISC / G system was simulated in Dinero with various associativity schemes. For the column-associative case it was assumed that a fast hit takes one cycle, a slow hit has a two cycle penalty, and a miss has a five cycle penalty.
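The penalties assumed in these simulations combine into an average stall per access as follows. The two-cycle and five-cycle penalties come from the text; the hit and miss rates below are hypothetical placeholders, not Dinero results.

```python
def stall_cycles_per_access(p_slow_hit: float, p_miss: float,
                            slow_penalty: int = 2, miss_penalty: int = 5) -> float:
    """Average stall cycles per cache access; fast hits contribute no stall."""
    return p_slow_hit * slow_penalty + p_miss * miss_penalty


# Hypothetical rates: the column-associative cache converts some misses to slow hits.
direct_mapped = stall_cycles_per_access(p_slow_hit=0.0, p_miss=0.05)
column_assoc = stall_cycles_per_access(p_slow_hit=0.03, p_miss=0.02)
print(f"direct-mapped: {direct_mapped:.2f}, column-associative: {column_assoc:.2f}")
```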
Figure 5.11 shows the effect of several of the simpler associativity schemes on the predicted cache CPI based on a suite of five traces. As can be seen, using a column associative cache seems to buy half the benefit of going to a fully associative cache of the same size. The cost of implementing a column associative cache is minimal - it is necessary only to modify the control circuitry on the cache controller. As the number of sets is increased the relative decrease in CPI diminishes.
Although the effects of implementing column associativity may vary in other design spaces, it should definitely be investigated in future F-RISC implementations.
One of the ways that the F-RISC architecture is likely to be expanded in the future is through the implementation of fine-grained parallelism. This would entail adding additional parallel pipelines to the system, each of which is capable of independently processing instructions. Two methods of accomplishing this are "superscalar" architectures and "Very Long Instruction Word" (VLIW) architectures.
In each of these architectures, parallel pipelines and additional functional units are added to the processor to enable multiple instruction streams to be executed simultaneously.
In superscalar architectures the hardware will typically examine the incoming instruction stream for code dependencies and is responsible for scheduling instructions for execution. The instructions need not execute in the order in which they occur in the code.
In VLIW architectures the compiler is largely responsible for determining which instructions can be executed in parallel, and the instruction word is widened to accommodate multiple parallel instructions. Typically there is far less decoding by the hardware, which makes it ideal for low-yield technologies. The negative aspect of VLIW, however, is that it makes it difficult to maintain code compatibility among successive generations of processors.
These types of architectures raise special complications in designing the memory hierarchy.
In any processor with multiple conventional pipes, there might be several instruction fetches occurring simultaneously. In addition, several memory LOADs and STOREs may also be occurring.
The problem becomes significantly more complicated in superscalar systems where the CPU buffers many upcoming instructions and, based on dependencies between instructions, executes the ones it determines can be executed together (out-of-order-issuing).
In order to prevent instruction fetch latencies, the instruction cache must be capable of providing instructions to each pipe on every clock cycle. In the case of the VLIW architecture, a single, very wide instruction would be transferred, while, in the case of a superscalar architecture, several smaller instructions would have to be fetched.
The VLIW case requires a simpler hardware implementation than the superscalar case. A single instruction address would either be transferred from the CPU or generated using a remote program counter. The cache controller circuitry would then address its cache RAMs, each of which would send some bits of the instruction word to the instruction decoding circuitry. The main difference between the VLIW and single-pipe implementations is in the width of the data bus between the cache and the CPU.
A VLIW architecture could be implemented fairly easily (with respect to the cache) if limitations are placed on the types of instructions which can be placed in parallel. Specifically, unless the cache RAMs are multi-ported or interleaved, only one LOAD or STORE instruction can be performed at a time. There is a possible exception, however. If multiple LOADs or STOREs are to be performed to addresses within the same cache line, then they can be executed in parallel if the data and address buses are sufficiently wide. Such an event would probably tend to occur only on consecutive word addresses. For example, a programmer may wish to load two consecutive registers with two consecutive words from memory. If the CPU contains two ALUs, it would then be possible to fetch a 64-bit long word from memory, perform an operation on it, and put it back in memory in three cycles. Such a capability is particularly useful for floating point operations.
As the width of the instruction word is increased the cost to add more ALUs and functional units increases linearly. The cost in terms of memory bandwidth is much more severe. Multi-porting the cache is extremely expensive in terms of speed and hardware, but, if multiple simultaneous accesses to memory are not allowed, it will be difficult to make full use of the parallel pipelines. The functional units need data on which to operate, and that data will always originate in the cache.
Due to this memory bottleneck, it may be desirable to increase the cycle time of the cache if doing so would provide a net increase in speed. For example, if double-porting the cache results in less than a doubling of the cache cycle time, the net CPI may be improved.
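The break-even condition can be checked with simple arithmetic. The sketch below assumes that both ports can be kept busy every cycle, which is an optimistic simplification; the function names and cycle times are hypothetical.

```python
def cache_bandwidth(ports: int, cycle_time_ns: float) -> float:
    """Peak accesses per nanosecond, assuming every port is used each cycle."""
    return ports / cycle_time_ns


def double_porting_pays_off(single_cycle_ns: float, dual_cycle_ns: float) -> bool:
    """True when two ports at the slower cycle time beat one port at the faster one."""
    return cache_bandwidth(2, dual_cycle_ns) > cache_bandwidth(1, single_cycle_ns)


# Hypothetical cycle times: a 60% slower dual-ported cache still gains bandwidth.
print(double_porting_pays_off(single_cycle_ns=1.0, dual_cycle_ns=1.6))
print(double_porting_pays_off(single_cycle_ns=1.0, dual_cycle_ns=2.0))
```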
An additional consideration in VLIW architectures is bubbles
in the instruction stream. It may not be possible for the compiler
to schedule instructions for each functional unit at all times.
In order to increase memory bandwidth it may be desirable to avoid
sending wide instructions with void bit-fields corresponding to
unused pipelines. An alternate approach is to add a bit-field
to each instruction field to indicate which pipeline the instruction
is intended for. If pipeline two can not be used, for example,
a cache transfer may contain two instructions intended for pipeline
one, to be executed sequentially. The second instruction would
be stored by the issue unit until the appropriate time. Given
the small number of pipelines in the system, the logic necessary
to accomplish this should not be extensive.
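The issue logic described above can be sketched as one small queue per pipeline. The class name, the zero-based pipeline numbering, and the (pipeline, instruction) pair encoding are assumptions made for illustration; a real issue unit would decode the pipeline bit-field from the instruction word.

```python
from collections import deque


class IssueUnit:
    """Holds tagged instructions until their pipeline is free to accept them."""

    def __init__(self, num_pipelines: int = 2):
        self.queues = [deque() for _ in range(num_pipelines)]

    def accept(self, transfer):
        """transfer: (pipeline_id, instruction) pairs from one cache transfer."""
        for pipeline_id, instruction in transfer:
            self.queues[pipeline_id].append(instruction)

    def issue(self):
        """Issue at most one instruction per pipeline; None marks an idle pipeline."""
        return [queue.popleft() if queue else None for queue in self.queues]


unit = IssueUnit(num_pipelines=2)
# One transfer carrying two instructions for the same pipeline (index 0 here).
unit.accept([(0, "ADD"), (0, "SUB")])
print(unit.issue())
print(unit.issue())
```

The second instruction is held for one cycle and then issued in order, so the transfer carries no void field for the unused pipeline.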
The F-RISC / G MCM would make an interesting node of a multiprocessor machine. If a thousand F-RISC / G's could somehow be wired together in a useful manner then the resulting system would, at peak, be capable of performing 10^12 instructions per second, reaching the infamous "tera-op" barrier.
Memory organization is a major area of research in
multiprocessing. If all of the processors in a multiprocessor
system share a common memory address space, then the cache represents
a major problem. If each processor has its own cache, then care
must be taken to ensure that if data in one processor's cache
is modified, then other processors are made aware of it. This
is known as the problem of "cache coherency."
FIGURE 5.12: SHARED MEMORY MULTI-PROCESSOR IMPLEMENTATION
Copyback caches are particularly difficult to keep coherent. If each CPU has its own primary cache (Figure 5.12), then when a processor modifies a memory address, main memory will not necessarily be updated to reflect the change. In addition, any of the other caches which hold the out-of-date contents of that address will be in the dark as well. This latter problem is present in write-through architectures as well.
The F-RISC/G processor has the capability to operate in a write-through mode, since a miss can be forced by setting the IOCTRL field in the STORE instruction appropriately. The design does not, however, have all the features one would desire in an efficient multi-processing cache system.
In order to handle cache coherency, it is necessary either to ensure that each of the caches is at all times up-to-date, or to maintain state information for each cache block indicating whether the line needs to be updated with the contents of main memory.
There are several methods by which the cache could be kept coherent. The cache could, each time it is accessed, poll each of the other caches to determine if an updated copy of the contents of that address exists. This would be very time consuming, however, and would greatly reduce the speed of the cache.
A better alternative is to implement write-through in the primary cache and allow the secondary cache to asynchronously notify each of the primary caches of the change. The secondary cache could keep a directory of which caches need to be updated since it has sufficient information as to which caches had previously requested or updated that address. It is also possible to implement special hardware between the primary and secondary caches to implement this functionality.
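The bookkeeping such a directory performs can be sketched as a map from block address to the set of caches holding that block. This is a minimal illustration of an update-style directory under write-through; the class and method names are hypothetical, and no replacement or eviction handling is modeled.

```python
class Directory:
    """Tracks which primary caches have requested or updated each block."""

    def __init__(self):
        self.holders = {}  # block address -> set of cache ids

    def record_read(self, cache_id: int, block: int) -> None:
        """A primary cache fetched the block from the secondary cache."""
        self.holders.setdefault(block, set()).add(cache_id)

    def record_write(self, cache_id: int, block: int) -> set:
        """A write-through arrived; return the other caches that must be updated."""
        holders = self.holders.setdefault(block, set())
        to_update = holders - {cache_id}
        holders.add(cache_id)
        return to_update


directory = Directory()
directory.record_read(0, 0x40)
directory.record_read(1, 0x40)
print(directory.record_write(0, 0x40))  # cache 1 holds a stale copy
```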
"Directory based" cache coherency protocols use centralized hardware (contained in the secondary cache or between the primary and secondary caches) to maintain the information as to which caches contain which versions of which addresses.
An alternative is a "snooping" cache, in which all
primary caches share a bus with the secondary cache and thus can
keep track of modifications to memory addresses.
FIGURE 5.13: SNOOPING CACHE COMMUNICATIONS
Either solution will result in a net decrease in speed of memory access in the F-RISC / G prototype. A directory based system would require that the directory be accessed in parallel to the tag and data RAMs. This access will likely become the critical path, as the directory is likely to be located a large physical distance from the cache controller. In addition, more control logic will be necessary to deal with the directory and to replace blocks when necessary.
A snooping protocol greatly increases loading on the bus to the secondary cache. If RC delays dominate due to large routing distances, then the delay on the bus will increase quadratically as a function of the number of processors in the system. On the other hand, the bus to the secondary cache already exists, and thus implementing a snooping protocol eliminates the need to create an entirely new communications path (Figure 5.13). The tag RAM can also be used as a repository for sharing-status information, although there are good reasons not to do so if it can be avoided.
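The quadratic growth follows from the Elmore delay of a distributed RC line, roughly r·c·L²/2, if one assumes that the bus length L grows linearly with the number of processors. The sketch below makes that assumption explicit; the function name, the per-processor pitch, and the resistance and capacitance figures are hypothetical and are not taken from the F-RISC / G design.

```python
def rc_bus_delay(n_processors, pitch_m, r_per_m, c_per_m):
    """Elmore delay (seconds) of a distributed RC bus.

    Assumption: every processor adds `pitch_m` metres of bus, so the total
    length is n_processors * pitch_m.  r_per_m is in ohms per metre and
    c_per_m in farads per metre.
    """
    length = n_processors * pitch_m
    return 0.5 * r_per_m * c_per_m * length ** 2


# Doubling the processor count quadruples the RC component of the delay.
d2 = rc_bus_delay(2, pitch_m=0.02, r_per_m=1000.0, c_per_m=1e-10)
d4 = rc_bus_delay(4, pitch_m=0.02, r_per_m=1000.0, c_per_m=1e-10)
ratio = d4 / d2
```

The model ignores driver resistance and the capacitive load of each attached receiver, both of which would add further delay as processors are added.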
In a typical implementation, the tag RAM would contain extra status bits for each cache block. When any cache misses on a read, each of the other caches must check its tag RAM and determine whether it has a modified version of the block, and, if so, must put it on the bus. At any time only one cache will have permission to modify a block. When a write to memory occurs, each cache checks to see if it has a copy of the block and either invalidates it or updates it to remain current.
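The behaviour described above can be sketched as a three-state write-invalidate protocol. This is a simplified model under stated assumptions, not the F-RISC / G controller: the state names, the classes `Bus` and `SnoopingCache`, and the treatment of each block as a single value are all choices made for illustration, and the update variant mentioned in the text is not shown.

```python
INVALID, SHARED, MODIFIED = "I", "S", "M"


class Bus:
    """Shared bus to the secondary cache; `memory` stands in for it."""

    def __init__(self):
        self.caches = []
        self.memory = {}

    def read_miss(self, requester, block):
        # Every other cache checks its tag RAM; a cache holding a modified
        # copy puts it on the bus and drops back to SHARED.
        for cache in self.caches:
            if cache is not requester and cache.state.get(block) == MODIFIED:
                self.memory[block] = cache.data[block]
                cache.state[block] = SHARED
        return self.memory.get(block, 0)

    def invalidate_others(self, requester, block):
        # Only one cache may hold permission to modify a block.
        for cache in self.caches:
            if cache is requester:
                continue
            if cache.state.get(block, INVALID) == MODIFIED:
                self.memory[block] = cache.data[block]
            if block in cache.state:
                cache.state[block] = INVALID


class SnoopingCache:
    def __init__(self, bus):
        self.state = {}  # block -> status bits (absent means INVALID)
        self.data = {}
        self.bus = bus
        bus.caches.append(self)

    def read(self, block):
        if self.state.get(block, INVALID) == INVALID:
            self.data[block] = self.bus.read_miss(self, block)
            self.state[block] = SHARED
        return self.data[block]

    def write(self, block, value):
        if self.state.get(block, INVALID) != MODIFIED:
            self.bus.invalidate_others(self, block)
        self.state[block] = MODIFIED
        self.data[block] = value
```

In this model a write by one cache followed by a read from another forces the owner to supply the block, after which both copies are shared; a subsequent write by either cache invalidates the other copy.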
In order to ensure that cache activities in other processors don't affect a particular CPU's throughput, a second tag RAM can be put on the cache controller specifically for use in snooping, thus allowing normal cache operations to proceed in parallel. Only when the cache finds it must perform some coherency-maintaining operation is there a chance of stalling the CPU.
The disadvantage of this technique is that the size of the cache controller is approximately doubled: two comparators and two tag RAMs are necessary. In addition, while there would be no unnecessary contention stalls, communications with the secondary cache are limited by the speed of the bus, and clock synchronization across CPUs becomes important; if a processor is out of synchronization with the others, then it may proceed with an illegal write because its snooping tag RAM was not updated in time.
One of the most intriguing ideas for extending the speed and capabilities of the F-RISC architecture is the inclusion of processing logic in the cache memory subsystem. The idea is to include extended processing capabilities in the cache subsystem while the CPU handles a small set of core instructions.
Perhaps one of the most useful ways in which cache pre-processing can be incorporated into F-RISC is through architecture translation. While F-RISC/G has the ability to process at 1000 peak MIPS, the software base for the F-RISC architecture is essentially non-existent. Cache pre-processing holds the promise of allowing the F-RISC processor to run a large base of existing software for other architectures at speeds greater than that of processors based on the native architecture.
The idea is that programs in main memory could be either native
F-RISC binaries, or binaries intended for other architectures
like MIPS or SPARC. As the foreign binaries are transferred down
the cache hierarchy toward the CPU, the processing capabilities
of the cache convert them to native F-RISC code which the core
CPU executes at full speed. If the code in the primary cache is
native code, then the vast majority of the time there will be
no speed penalty for making this translation.
FIGURE 5.14: ARCHITECTURE TRANSLATION
Figure 5.14 illustrates architecture translation within the secondary cache. There should be only one multiplexor delay on the path through the secondary cache for native code. Since the logic in the secondary cache is expected to be slower than that in the core CPU, this delay is not inconsiderable. Nonetheless, the penalty must be paid only when the secondary cache misses and must transfer a block from the next higher level of memory; since the target hit rate for the primary caches is 95%, the secondary cache can be expected to miss on an extremely small percentage of instructions.
The translation penalty is reduced even more when one considers that once a translated instruction is transmitted to the primary cache, a miss on that data in the primary or secondary cache will result in the translated data being passed back up through the cache hierarchy; the cache will not have to perform the translation again.
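A rough feel for the size of this penalty can be had from the following sketch. Only the 95% primary hit rate comes from the text; the secondary hit rate and the translation cost used in the example are hypothetical placeholders, as is the function name.

```python
def avg_translation_penalty(primary_hit, secondary_hit, translate_cycles):
    """Average translation cost per instruction fetch, in cycles.

    Translation is paid only when both the primary and the secondary cache
    miss, so the cost is weighted by the product of the two miss rates.
    """
    return (1.0 - primary_hit) * (1.0 - secondary_hit) * translate_cycles


# 95% primary hit rate (target from the text); the 99% secondary hit rate
# and the 100-cycle translation cost are assumptions for illustration.
penalty = avg_translation_penalty(0.95, 0.99, 100)
```

With these placeholder figures the average cost is about 0.05 cycles per fetch, which illustrates why translation low in the hierarchy is attractive even if the translation logic itself is slow.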
Another example of cache pre-processing which can result in a faster architecture is the byte operations chip. While this chip has not been designed or fabricated, the instruction decoder and cache controller contain hooks to allow it to be integrated into the F-RISC/G prototype.
The byte operations chip is essentially a byte multiplexor, allowing particular bytes within a word to be written into and read from. Unfortunately, in the current F-RISC/G implementation the cache critical paths don't allow much slack; it may be possible to fabricate a byte operations chip which works with the current system timings, but it would undoubtedly require a great deal of power in order to operate quickly enough. The byte-ops chip could be combined with architecture translation to allow either little or big endian architectures to be emulated in hardware.
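The function of such a byte multiplexor, including the choice of byte order, can be sketched as below. The 32-bit word width, the function names, and the `big_endian` flag are assumptions made for illustration; no byte operations chip has been designed, so this does not describe actual hardware.

```python
WORD_MASK = 0xFFFFFFFF  # assumption: 32-bit words


def _shift(index, big_endian):
    """Bit offset of byte `index` (0..3) within the word."""
    if not 0 <= index <= 3:
        raise ValueError("byte index must be 0..3")
    return 8 * ((3 - index) if big_endian else index)


def read_byte(word, index, big_endian=False):
    """Select one byte out of a word."""
    return (word >> _shift(index, big_endian)) & 0xFF


def write_byte(word, index, value, big_endian=False):
    """Return `word` with one byte replaced, leaving the others untouched."""
    shift = _shift(index, big_endian)
    cleared = word & ~(0xFF << shift) & WORD_MASK
    return cleared | ((value & 0xFF) << shift)
```

Switching `big_endian` changes only which byte lane a given index selects, which is the sense in which the same multiplexor could serve emulation of either byte order.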
Still another manner in which processing in the cache can be
used to advantage is through the introduction of dedicated branch
circuitry. The instruction cache controller, for example, could
clearly handle unconditional branches without intervention of
the datapath chips. In the event of conditional branches, the
cache could conceivably speculatively fetch both possible addresses
(if dual porting were available or the memories were interleaved
in such a way as to make that possible), or, at the very least,
pass the branch target address to the secondary cache so that
if a miss occurs the secondary cache will have the needed data
available more rapidly. Alternatively an active or passive branch
prediction scheme could be implemented in the cache controller.
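If "active" prediction is read as dynamic, history-based prediction, one widely used form is a table of two-bit saturating counters. The sketch below shows that technique only as an example of what a cache controller might hold; the table size, the indexing by low-order address bits, and the class name are assumptions and are not part of any F-RISC design.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by branch address."""

    def __init__(self, entries=16):
        self.entries = entries
        # Counter values 0..1 predict not taken, 2..3 predict taken.
        # Start every entry at 1 (weakly not taken).
        self.counters = [1] * entries

    def predict(self, pc):
        """Return True if the branch at `pc` is predicted taken."""
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        """Move the counter one step toward the actual outcome."""
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Because the counter saturates, a single mispredicted iteration at the end of a loop does not flip the prediction for the next time the loop is entered.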
The F-RISC / G memory hierarchy spends much of its time communicating data and addresses between chips. If future packaging were able to eliminate or at least seriously reduce these delays, then the processor cycle time could be decreased significantly.
One of the most promising new packaging technologies which could have a great effect on reducing the cycle time of future processors is three-dimensional (3-D) packaging. In 3-D packaging, rather than laying out the chips in a single layer on a flat module, they are stacked vertically. Since the chips are much thinner than they are wide or long, the distance between chips is much reduced. If a way can be found to take advantage of this vertical communication distance, then the overall cycle time can be much reduced.
From a practical point of view, one of the most difficult problems with stacking chips vertically rather than distributing them on a surface is that the vertical chip stack has poor thermal qualities.
A single chip or MCM package provides a comparatively large surface area through which to dissipate excess heat generated on-chip. If chips are vertically stacked, then the top and bottom chips on the stack may have a surface through which to dissipate heat, but the chips sandwiched in-between can only dissipate heat laterally through the edges (which, due to the small edge surface area, is not very helpful) or into neighboring chips.
In the nascent stage of a device technology, such as F-RISC/G's
GaAs HBTs, power tends to be a particular problem. When using
an exotic device technology, however, high circuit frequency is
usually the primary goal, making it desirable to try to overcome
this thermal issue.
One possible method of attacking this problem is shown in Figure 5.15. Recognizing that the chips themselves do not conduct heat well in the lateral directions, diamond sheets are interposed between the die. The diamond sheets have "fins" which extend beyond the dimensions of the die, allowing them to conduct heat into an appropriate thermo-conductive substance.
A second critical problem with the three-dimensional chip stack is inter-chip signal routing. Traditional integrated circuit fabrication techniques do not allow signals to be routed through the backside of a die. As a result, signals must be routed to the edges of the stack, and, from there routed along the die stack edge to other die in the stack.
In the solution proposed in Figure 5.15, only one edge of the stack is available for routing (the others being interrupted by heat-conducting fins which prevent a smooth surface onto which metallization can be deposited.) Furthermore, due to the difficulty of providing interconnect on such a surface, it is unlikely that more than one routing layer can be provided.
The difficulty of routing is further exacerbated when a chip stack is to contain multiple identical chips. For example, a chip stack may contain many RAM chips, each of which accepts the same address but which provides different I/O buses (as is the case in F-RISC/G). Assuming these chips were redesigned with all of the I/O pads on one edge of the chip (which would be undesirable for other reasons), the parallel routing lanes on the edge of the chip stack would result in the I/O buses of each chip being shorted together, since the pads are located in the same place on each chip. The nine bit address bus between the cache controller and the RAM chips is the only bus which does not suffer from the problem of aligned die pad locations, since all of the RAM chips receive the same address. Several control signals, such as the signals used to latch the inputs and outputs of the RAM chips, are also immune to this problem. The data buses, however, are a problem.
One solution would be to fabricate several cache RAM chips, each
with different pad-outs (a solution which quickly becomes very
expensive), or to provide extra I/O's which are located in different
routing channels (which could, perhaps, be accessed by rotating
some chips with relation to the others).
FIGURE 5.16: CHIP WITH INTERPOSER
A simpler and cheaper solution would be to use "interposer" dies which contain the multi-layer interconnect necessary to route signals from the die I/O touchdowns to the stack edge solder connectors. The interposers would contain no active devices and thus would be considerably cheaper to manufacture than several varieties of each architecture chip (four distinct data path chips, two distinct cache controllers, etc.)
Figure 5.16 shows such an interposer die, connected to one of the architectural dies. Such an interposer could be fabricated with several metallization layers, some of which may be dedicated to power and ground planes, which, aside from helping to eliminate problems such as voltage droop and ringing, also have thermal dissipation advantages.
The use of interposers is not without its disadvantages, however. Since all routing must be brought to one side of the stack, the nets which originate on the opposite side of the stack must be routed the length of at least one chip edge before they reach the stack routing channel. In the worst case this extra routing will be necessary at both the driving and receiving chips, resulting in a minimum net length of two chip edges. If the planar route could be accomplished in a shorter distance, then the advantage of the chip stack is eliminated.
If these technical obstacles can be overcome, the use of 3-D chip stacks could greatly increase the clock rate of the F-RISC/G CPU, even if the device technology is not improved.
Using the "conventional" planar MCM arrangement, the largest communications component of delay in the cache memory critical path is the address transfer from each of the cache controllers to the cache RAMs (estimated at 300 ps). Eliminating this delay would allow the use of lower power cache RAMs (1.05 ns access time vs. 750 ps for the current RAM chip).
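The access time quoted for the lower power RAM is simply the current access time plus the address transfer time that stacking would remove, as the short calculation below shows. The constant and function names are illustrative; the two delay figures are the ones given in the text.

```python
ADDRESS_TRANSFER_PS = 300    # CC-to-RAM address transfer (estimate from text)
CURRENT_RAM_ACCESS_PS = 750  # access time of the current RAM chip


def relaxed_ram_access_ps(current_access_ps, eliminated_delay_ps):
    """RAM access time allowed if the eliminated delay is handed to the RAM
    while the overall critical path length stays the same."""
    return current_access_ps + eliminated_delay_ps


budget_ps = relaxed_ram_access_ps(CURRENT_RAM_ACCESS_PS, ADDRESS_TRANSFER_PS)
```

The result, 1050 ps, matches the 1.05 ns figure above; any residual delay along the stack edge would reduce the budget correspondingly.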
In order to use 3-D stacking to eliminate this delay, the RAM
chips may be combined in a stack with the appropriate cache controller.
While this merely removes around a clock phase from the critical
path (which is not enough to have any effect on overall CPU throughput),
the slight communications delay caused by the vertical separation
between chips on the stack remains very small as more chips are
added, so the CPI may be reduced by including more RAM chips and
increasing the cache hit rate.
A | Address I/O (datapath):
B | Address Transfer (DP to CC):
C,D | Address I/O (CC):
E | Cache RAM Address Transfer (CC to RAM):
F | RAM Access Time:
G | Data Transfer:
Total:
Figure 5.17 shows the MCM layout given this type of chip stack. An added benefit to stacking the chips this way is that the other communications components of the cache subsystem critical path are significantly reduced as well. Data transfer between the cache RAMs and the CPU requires fewer than two chip edges.
Table 5.1 shows that the estimated critical path delay using this scheme is reduced to 1500 ps. This may be fast enough to eliminate the D1 stage of the CPU pipeline.
Additional speed improvements could be made by stacking the datapath
chips (Figure 5.18). It is doubtful that the instruction decoder
could be included in the stack due to the complexity of the resulting
stack routing.
In this arrangement the CPU critical path would be greatly reduced, which would allow the cycle time to be decreased accordingly. In addition, the address broadcast from the datapath to the caches will increase in speed, resulting in a modest decrease in critical path length. Since this decrease is small, if the CPU cycle time is reduced it may be necessary to remain with a seven stage pipeline. There is a possibility that further gains may be possible by tailoring the drivers and receivers of the chips to take advantage of the reduced load capacitances.
A final 3-D stacking solution would be to incorporate all of the core CPU and cache chips in a single stack. The benefits of this arrangement over the three stack arrangement are difficult to quantify, and depend largely on the quality of the inter-chip route.
Specifically, the calculations for two and three chip stacks were based on the assumption that signals needed up to 50 ps to traverse the stack. This is based on the conjecture that interposer routing will be required and that the dielectric used is equivalent to that used in the "conventional" MCM. If the pad locations on the various die are optimized (and multiple layouts of each die type are economically feasible), it is possible that these "vertical" distances can be traversed much more quickly.
In addition, if all of the chips are in the same stack, it will
probably be possible to eliminate I/O drivers and receivers completely,
replacing them with superbuffers as required. As shown in Figure 5.18,
it may be possible to reduce the cache cycle time to 995 ps. Of
course, once this cycle time is reduced to that level, other paths
in the cache may become critical. The most important such path
is the comparator path, in which the address, once it arrives
at the cache controller, is used to access the tag RAM. Once the
tag RAM is read, the tag is compared to the address from the CPU.
FIGURE 5.19: CACHE CRITICAL PATHS
Figure 5.19 shows the primary critical path as dark lines, and
the comparator critical path as dashed lines. The comparator critical
path is limited to approximately 2.5 ns (the exact time depends
on which cache is involved).
Address I/O (datapath):
Address Transfer (DP to CC):
Address I/O (CC):
Tag RAM Access Time:
Comparator, MUX, Latch Time:
MISS Transfer:
Total:
Table 5.2 gives the path breakdown for this sub-critical path in the current F-RISC/G implementation. Most of the path delay is caused by on-chip logic in the cache controller.
If the primary critical path is reduced through chip stacking
to around 1 ns, this secondary critical path length must be reduced
as well, or no benefit is gained. In the single chip stack implementation,
the time could probably be reduced to under 1700 ps. Hand crafting
and optimizing the layout of the comparator could shave off perhaps
another clock phase or so. Still more time can be saved by re-partitioning
the cache pipeline (5.1.3 Pipeline Partitioning).
The manner in which virtual memory is supported in the F-RISC / G prototype is inefficient, largely due to compromises made in the cache design. Due to cost, power, and timing constraints, it was impossible to implement a translation lookaside buffer in the primary cache. Doing so would enable the primary cache to perform virtual-to-physical address translations within the normal cache access time as long as the virtual address was in the cache.
Without this support, a higher level of cache memory must make the translation and perform page swapping as necessary. When a single thread (or, equivalently, several threads accessing the same page frame) is being executed, there is little difference between these techniques. When multiple threads, each accessing individual page frames, are being executed in a multi-tasking environment, the F-RISC / G prototype cache will perform very poorly. While the primary cache would store addresses from multiple pages, due to the small size of the cache, each time the processor switches tasks it is likely that the entire cache will need to be swapped to the second cache level.
In the F-RISC / G prototype, the cache is a "virtual cache," meaning that virtual addresses, rather than physical addresses, are cached. As a result, each time the operating system switches processes, the virtual addresses in the cache will map to differing physical addresses, resulting in a page fault. If each process is given the same range of virtual addresses to work with, then in order to switch processes it is necessary for the operating system to flush the entire cache (via the IOCTRL mechanism, which is, in itself, very inefficient). While the data cache could be flushed with 32 consecutive LOADs or STOREs, flushing the instruction cache without external hardware intervention would require 497 cycles.
An alternative would be to have external hardware monitor the
IOCTRL lines and execute
the cache initialization routine which would invalidate the entire
cache in far less time.
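The need for the flush can be illustrated with a toy model of a direct-mapped, virtually addressed cache. The 32-line size is inferred from the 32 consecutive LOADs or STOREs mentioned above and should be read as an assumption; the class and method names are hypothetical, and the model tracks tags only, not data.

```python
class VirtualCache:
    """Direct-mapped cache indexed and tagged by virtual block number."""

    def __init__(self, n_lines=32):
        self.n_lines = n_lines
        self.valid = [False] * n_lines
        self.tags = [None] * n_lines

    def access(self, vblock):
        """Return True on a hit; on a miss, fill the line and return False."""
        index = vblock % self.n_lines
        tag = vblock // self.n_lines
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit

    def flush(self):
        """Invalidate every line; returns the number of lines touched."""
        for i in range(self.n_lines):
            self.valid[i] = False
        return self.n_lines


cache = VirtualCache()
first = cache.access(5)   # process A misses and fills the line
second = cache.access(5)  # same virtual block: reported as a hit
touched = cache.flush()   # context switch: every line must be invalidated
third = cache.access(5)   # process B now correctly misses
```

Without the flush, the second process would hit on the first process's line because the tag carries no process identity, which is exactly the hazard a virtual cache presents when address ranges overlap.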
Given a fixed technology, one would expect that the quality and capabilities of the CAD tools used would have a comparatively minor effect on the overall quality of the design. This is only true insofar as by throwing sufficient manpower at the design, the dilemma of poor CAD tools can be overcome.
The cache RAM and cache controller went through a slightly different design process than that described by Philhower in [Phil93]. Changes in the technology (the addition of a third layer of metallization, design rule changes, and the like) as well as tighter timing constraints resulted in much more of the work being done by hand.
Far more use was made of 2-D and 3-D capacitance extraction tools and SPICE in the design of the cache chips than was used in the core CPU. The Cutter program [Loy93] was modified to reduce the amount of human interaction required in producing matched pair differential routes. Unlike previous chips, which were essentially computer routed but hand "cut," the cache controller and cache RAM depended heavily on hand routing, but were cut automatically.
Based on SPICE and back-annotated simulations, the overall quality of the route in the cache chips was higher than in the core CPU chips (a necessity given the tight cache timing and large quantity of cache chips.)
Given more time or better CAD tools, however, it would have been possible to achieve large speed gains on the cache controller. The 32 bit tag RAM block, for example, is simply a waste of chip area and power.
One layer of XOR gates from the comparator could have been included in the cache RAM block, consolidating space and decreasing cycle time, if there had been need to do so. The pipeline block is another area where hand crafted layout would have been far superior to the layout produced by the VTITools placer and router, and where gains in speed would have been possible.
Much of this hand crafting was considered at various points of the design, and rejected when it was determined that the design would be sufficiently fast without it.
Another area in which the CAD tools could stand some improvement is in the area of trace simulation. While the DineroIII simulator is a workable tool, the traces which were used are suspect. It is hoped that the version of F-RISC which is being implemented in FPGA's will eventually provide a memory profiling capability, essentially allowing the user to run real programs on it (at greatly reduced speed) while it gathers cache access statistics which can be fed back into DineroIII.
The initial stages of the design effort were hampered by the lack of a high-level description of the cache. An incomplete Verilog model was available, but the design group didn't have a license for the Verilog software. In addition, the model was based on incorrect assumptions regarding device and interconnect timing, an incorrect MCM floorplan, and an out-of-date cache architecture. Even its assumptions regarding CPU operation were in some cases incorrect, because modifications were made to the design which had little impact on CPU operation but which were critical for the cache design given the latest information regarding slower-than-expected devices and interconnect.
The procedure used in the design of the cache controller was particularly complicated since it was left to this chip to correctly interface with the CPU, the design of which was already frozen, the cache RAMs, which, due to quantity on the MCM, were unable to incorporate additional circuitry to make interfacing simpler, and the secondary cache, the design and even technology of which were still undefined.
Using the VTITools schematic capture and digital simulation tools, it was possible to design the chip in such a way that communications with the cache RAMs were well tested. More difficult was communicating with the CPU, since there was no way to run the behavioral model and the designer wasn't available. The FPGA emulation project provided a partial solution by allowing phase accurate simulations of CPU operation.
An additional difficulty with the CAD tools is that they didn't
allow for re-simulating the circuitry after extracting interconnect
resistances. The instruction decoder and datapath designs were
completed based on the assumption that transmission line delays
dominated and RC delays were negligible. It wasn't until an investigation
of slower-than-expected interconnect on test wafers was undertaken
that the difficulties imposed by RC delays were fully understood.
This occurred after the cache controller and cache RAM were placed
and routed, and while the process of shrinking wires to meet new,
more aggressive design rules was underway. After much effort a
method was discovered to force the simulator to take RC delays
into account, and the chips were modified to account for these
additional delays.
One of the most difficult aspects of the cache design was the problem of clocking. Aside from the fact that clock skew can interfere with pipeline operation, if there are too few clock phases per cycle the design must rely heavily on routing and buffer delays to provide intermediate clocks.
The problems of coarse clock phasing can be illustrated by the differences between the F-RISC/G core CPU design and the cache controller design. Since the core CPU was designed before the cache controller, most of the I/O signals are timed so that they are expected or sent on one of the four clock phases. While there were "early" and "late" versions of these clocks available, the exact timing of these signals depended heavily on on-chip placement and routing (since they were created by delaying clock phases by sending them through chains of buffers).
Due to the difficulty involved in accurately profiling the timing
of these signals and replicating that timing on other chips, the
cache controller I/O timing was needlessly complicated. Furthermore,
by relying on such a coarse clock, time was often wasted.
FIGURE 5.20: COARSE CLOCKING
Figure 5.20 illustrates this problem. Ideally one would want to clock a latch as soon as possible after the latest possible time the data at the latch will be valid. Any delay before the latch is clocked results in added cycle time.
While the use of a finite number of clocks will usually mandate that at least some latches in a design will experience this sort of clock lag, on critical paths it is necessary to minimize this to the degree possible.
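The clock lag described here can be quantified with a small sketch. The four phases per cycle come from the text; the 1000 ps cycle is taken from the 1 ns cycle time target and treated as an assumption, as are the function name and the example arrival times. "Early" and "late" clock variants are ignored.

```python
import math


def clock_lag_ps(data_valid_ps, cycle_ps=1000, phases=4):
    """Time wasted between data becoming valid and the next clock phase.

    Assumes evenly spaced phase edges at multiples of cycle_ps / phases
    and that a latch can only be clocked on one of those edges.
    """
    phase_ps = cycle_ps / phases
    next_edge_ps = math.ceil(data_valid_ps / phase_ps) * phase_ps
    return next_edge_ps - data_valid_ps


lag_four_phase = clock_lag_ps(260)             # data valid just after a phase edge
lag_eight_phase = clock_lag_ps(260, phases=8)  # finer phasing for comparison
```

Data arriving 10 ps after a phase edge waits 240 ps with four phases but only 115 ps with eight, which shows how a finer clock reduces, but does not eliminate, the wasted time.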
Using a "single wire" clock in future designs may be
desirable. This would force more rigorous control over clocking
through balanced buffer trees, but would reduce the amount of
power required in these trees considerably.
Due to the high speed of the F-RISC / G processor, the chip set can be adapted for many uses aside from general purpose processing; it is necessary only to properly interface external devices to the cache through the SRAM chips and load the appropriate program into the processor. For this reason the SRAM chip could be used in a vast array of digital systems. Aside from its extremely fast access time, the cache RAM's ability to multiplex between a 64 and 4 bit bus could be useful as well.
In addition, while the core of the CPU is synchronous to a high
speed clock, the communications between the primary and secondary
cache are asynchronous, meaning that a wide variety of external
devices can be wired to the 512 bit data bus for special-purpose
systems. In fact, if memory mapping is used, many such devices
could be wired to the bus simultaneously.
One possible use of this bus is in systems which perform active analysis and filtering of high speed (radio frequency) electromagnetic waveforms. A system has been proposed where high speed analog-to-digital converters would be used to sample an incoming radar signal. The digitally sampled waveforms would be transferred through the L2 data bus into the primary cache, and from there into the CPU. The CPU would filter and transform this data and send it back through the primary cache onto the L2 bus where it would be received by digital-to-analog converters and amplifiers which would be capable of producing radar waveforms which cancel the incoming radar signal. Such a system would have broad military use, enabling aircraft to actively cancel incoming radar so that there is no net reflection returned to the radar broadcast station.
Figure 5.21 shows how high speed A/D and D/A converters could be interfaced to the L2 data bus. The F-RISC/G CPU is fast enough that one would expect the returned radar signature of an aircraft so protected to be very small. The system can also be used for radio communications at extremely high frequencies. Several mechanisms exist within the cache design to allow this type of system to be implemented without any modifications to F-RISC/G.
Aside from the asynchronous nature of the interface between the primary and secondary caches, which allows arbitrary devices to be connected to the bus, it is possible to disable the copyback mechanism in the primary cache. If the copyback mechanism were not disabled, it would take twice as long to force STOREd data onto the L2 bus. Copyback can be disabled by correctly setting the IOCTRL bit field during the STORE.
Additionally, it is possible to override the comparator on the cache controller and force a LOAD miss to occur; this is necessary to perform LOADs from the A/D converter. This is also accomplished by properly setting the IOCTRL bit field.
The nominal 50 GHz Rockwell HBT process is capable of being the basis of a 1 ns cycle time CPU with limited instruction set and complexity. In order to get the best use of these devices, the cache design must be kept as simple as possible. A 2kB per cache Harvard architecture with copyback and a single way set is sufficient to achieve a 1.86 overall CPU CPI figure given the trace data available.
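The geometry implied by these figures can be worked out as in the sketch below. The 2 kB capacity is from the text; the 512-bit block size is inferred from the width of the bus to the secondary cache, and byte addressing is assumed purely for illustration, so the field boundaries shown should not be taken as the actual F-RISC / G address format.

```python
def cache_geometry(cache_bytes=2048, block_bits=512):
    """Return (bytes per block, number of blocks) for a direct-mapped cache."""
    block_bytes = block_bits // 8
    n_blocks = cache_bytes // block_bytes
    return block_bytes, n_blocks


def split_address(addr, cache_bytes=2048, block_bits=512):
    """Split a byte address into (tag, index, offset).

    Assumption: byte addressing; the real machine may address words.
    """
    block_bytes, n_blocks = cache_geometry(cache_bytes, block_bits)
    offset = addr % block_bytes
    index = (addr // block_bytes) % n_blocks
    tag = addr // (block_bytes * n_blocks)
    return tag, index, offset


geometry = cache_geometry()
fields = split_address(0x12345)
```

Under these assumptions each cache holds 32 blocks of 64 bytes, which is consistent with the 32 accesses said to be needed to flush the data cache.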
Future efforts would benefit from more precise trace data and, hopefully, greater device integration levels. Problems in the CPU microarchitecture and design, particularly the inclusion of the Remote Program Counter and the inability of the datapath to put Missed addresses on the bus at the proper time, greatly complicated, and slowed, the cache design. Without these problems, a more complicated cache architecture would have been possible - of particular interest is the column associative cache, which could, in itself, have lowered overall CPI to 1.80. It is also possible that a two-way set associative cache could have been implemented, lowering overall CPI to 1.76.
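The relative benefit of these alternatives can be estimated directly from the CPI figures, as sketched below. The calculation assumes that instruction count and cycle time are unchanged, which is optimistic, since a more complicated cache could lengthen the cycle; the function name is illustrative.

```python
def speedup(baseline_cpi, new_cpi):
    """Throughput ratio for equal instruction count and cycle time."""
    return baseline_cpi / new_cpi


DIRECT_MAPPED_CPI = 1.86       # implemented design
COLUMN_ASSOCIATIVE_CPI = 1.80  # estimate from the text
TWO_WAY_CPI = 1.76             # estimate from the text

column_gain = speedup(DIRECT_MAPPED_CPI, COLUMN_ASSOCIATIVE_CPI)
two_way_gain = speedup(DIRECT_MAPPED_CPI, TWO_WAY_CPI)
```

Under those assumptions the column associative cache would be worth roughly 3% and the two-way set associative cache roughly 6% in throughput.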
Among the greatest problems facing future designers is the increasing importance of interconnect delays. Exotic 3-D chip stacks and other schemes buy enough cycle time to support a generation or two more of F-RISC designs, but eventually there will be no benefit to increasing device speed due to the overwhelming dominance of interconnect delay; a single-chip implementation will be necessary.
The cache RAM test scheme has wide applicability to the class of testing problems involving high-speed, moderate pin-out circuits. Skew is minimized by the reverse clocking (rather than broadcast) scheme used, but as clock speeds and the number of pads to speed test increase, skew will eventually limit the resolution of these tests.
[Agar93] Agarwal, A. and Pudar, S.D., "Column-associative caches: A technique for reducing the miss rate of direct-mapped caches," 20th Annual International Symposium on Computer Architecture ISCA '20, San Diego, Calif., May 16-19. Computer Architecture News 21:2 (May), 179-90.
[Beac88] Beach, W. F. and Austin, T. M. "Parylene as dielectric for the next generation of high density circuits," Proceedings of the 2nd International SAMPE Electronics Conference, June 14-16, 1988, pp. 25-45.
[Bens95] Benschneider, Bradley J., A. J. Black, W. J. Bowhill, S. M. Britton, D. E. Dever, et. al., "A 300-MHz 64-b quad-issue CMOS RISC microprocessor," IEEE Journal of Solid-State Circuits, Vol. 30, No. 11, Nov. 1995, pp. 1203-1214.
[Casc91] Cascade Microtech, Incorporated. "Multicontact high-speed integrated circuit probes." Beaverton, Oregon, 1991.
[Chan92] Chang, H. and J. A. Abraham. "Delay test techniques for boundary scan based architectures" IEEE 1992 Custom Integrated Circuits Conference, pp 13.2.1-13.2.4, 1992.
[Dabr93] S. Dabral, X. Zhang, X. M. Wu, G. -R. Yang, L. You, H. Bakhru, R. Olson, J. A. Moore, T. -M. Lu, and J. F. McDonald, "α,α,α',α' Poly-tetrafluoro-p-xylene as an interlayer dielectric for thin film multichip modules and integrated circuits," Journal of Vacuum Science and Technology, B 11(5), Sep/Oct 1993.
[Deve91] Devore, Jay S. Probability and Statistics for Engineering and the Sciences, Third Edition. Pacific Grove, California. Brooks / Cole Publishing, 1991.
[Dill88] Dillinger T.E. VLSI Engineering. pp. 624-93, Englewood Cliffs, New Jersey: Prentice Hall, 1988.
[Faus95] Faust, Bruce. "Designing Alpha-based systems." Byte Magazine, pp. 239-240, June 1995
[Fris95] A. Frisch, M. Aigner, T. Almy, H. Greub, M. Hazra, S. Mohr, N. Naclerio, W. Russell and M. Stebnisky, "Supplying Known Good Die for MCM Applications using Low Cost Embedded Testing," IEEE International Test Conference, Washington DC, October 23-25, 1995.
[GE95] G.E. Corporate Research & Development Advanced Electronics Assemblies Program, "Microwave High Density Interconnect Design Guide." February 1995.
[Greu90] Greub, H. J. "FRISC - A fast reduced instruction set computer for implementation with advanced bipolar and hybrid wafer scale technology." Ph.D. dissertation, Rensselaer Polytechnic Institute, Troy, New York, December 1990.
[Greu91] Greub, H. J., et al. "High-performance standard cell library and modeling technique for differential advanced bipolar current tree logic." IEEE Journal of Solid-State Circuits, Vol. 26, No. 5, pp. 749-62, May 1991.
[Hall93] Haller, T. R., et al. "High frequency performance of GE high density interconnect modules." IEEE Transactions on Components, Hybrids, and Manufacturing Technology, Vol. 16, No. 1, pp. 21-27, February 1993.
[Henn96] Hennessy, J. L., and D. A. Patterson. Computer Architecture: A Quantitative Approach, Second Edition. San Mateo, California: Morgan Kaufmann, 1996.
[Hill84] Hill, Mark D. and Alan Jay Smith. "Experimental evaluation of on-chip microprocessor cache memories," Proc. Eleventh International Symposium on Computer Architecture, June 1984, Ann Arbor, MI, 1984.
[Kilb62] Kilburn, T., D. B. G. Edwards, M. J. Lanigan, and F. H. Sumner. "One-Level Storage System," IRE Transactions on Electronic Computers, Vol. EC-11, No. 2, pp. 223-236, April 1962.
[Lev95] Lev, Lavi A., A. Charnas, M. Tremblay, A. R. Dalal, B. A. Frederick, et al., "A 64-b microprocessor with multimedia support," IEEE Journal of Solid-State Circuits, Vol. 30, No. 11, Nov. 1995, pp. 1227-1236.
[Long90] Long, S. I., S. E. Butner. Gallium Arsenide Digital Integrated Circuit Design, New York, McGraw-Hill Publishing Company, 1990.
[Loy93] Loy, J. R. "Managing Differential Signal Placement," Ph.D. Thesis, Rensselaer Polytechnic Institute, August 1993.
[Maie94] Maier, C. "A testing scheme for a sub-nanosecond access time static RAM," Master's Thesis, Rensselaer Polytechnic Institute, 1994.
[Maji89] Majid, N., Dabral, S., and J. F. McDonald. "The parylene-aluminum multilayer interconnection system for wafer scale integration and wafer scale hybrid packaging." Journal of Electronic Materials, Vol. 18, No.2, pp. 301-311, 1989.
[Matt70] Mattson, R. L., J. Gecsei, D. R. Slutz, and I. L. Traiger. "Evaluation techniques for storage hierarchies." IBM Systems Journal, 9, pp. 78-117, 1970.
[Maun86] Maunder, C. "Paving the way for testability standards." IEEE Design and Test of Computers, Vol. 3, No. 4, p. 65, 1986.
[Maun92] Maunder, C. M. and R. E. Tulloss. "Testability on TAP." IEEE Spectrum, pp. 34-37, February 1992.
[Nah91] Nah, K., R. Philhower, J. S. Van Etten, S. Simmons, V. Tsinker, J. Loy, H. Greub, and J. J. McDonald. "F-RISC/G: AlGaAs/GaAs HBT standard cell library," Proc. 1991 IEEE International Conference on Computer Design: VLSI In Computers & Processors, pp. 297-300, 1991.
[Nah94] Nah, K. "An adaptive clock deskew scheme and a 500 ps 32 by 8 bit register file for a high speed digital system," Ph.D. Dissertation, Rensselaer Polytechnic Institute, 1994.
[Phil93] Philhower, B. "Spartan RISC architecture for yield-limited technologies," Ph.D. Dissertation, Rensselaer Polytechnic Institute, 1993.
[Przy90] Przybylski, S. A. Cache and Memory Hierarchy Design: A Performance-Directed Approach. San Mateo, California: Morgan Kaufmann, 1990.
[Salm93] Salmon, Linton G. "Evaluation of thin film MCM materials for high-speed applications." IEEE Trans. On Components, Hybrids, and Manufacturing Technology, Vol. 16, No. 4, June 1993.
[Ston90] Stone, Harold S. High Performance Computer Architecture, Second Edition. Reading, Massachusetts. Addison-Wesley, 1990.
[Sze81] Sze, S. M. Physics of Semiconductor Devices. Second Edition, pp. 182-3, New York: John Wiley and Sons, 1981.
[Sze90] Sze, S. M. High-Speed Semiconductor Devices. pp 371-373, New York: John Wiley and Sons, 1990.
[Tien95] Tien, C-K. "System design analysis, implementation, and testing of a 32-bit GaAs microprocessor," Doctoral Thesis, Rensselaer Polytechnic Institute, 1995.
[Webe92] Weber, S. "JTAG finally becomes an off-the-shelf solution." Electronics, Vol. 65, No. 9, p. 13, 10 August 1992.
[Zhan95] Zhang, Xin. "Parylene as an interlayer dielectric," Ph.D. Dissertation, Rensselaer Polytechnic Institute, 1995.
The figures given are words transferred between the primary and secondary cache.
| Spice | Tex | gcc
1 | 52842 | 22579 | 100393
2 | 104348 | 23049 | 199864
4 | 207449 | 23843 | 399011
8 | 256356 | 25770 | 480230
16 | 353560 | 33820 | 634696
32 | 493384 | 61232 | 943728
64 | 1024032 | 149872 | 1597648
128 | 2122208 | 527808 | 3070464
256 | 4865472 | 3332352 | 5390080
512 | 13928576 | 11030912 | 12171008
1024 | 34056192 | 34411264 | 38321920
2048 | 87734272 | 91727872 | 75351040
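
For readers who want to work with these figures, the following is a minimal Python sketch of how one of the pipe-delimited tables could be loaded and how the growth in words transferred from one row to the next could be computed. It is an illustration only: the function names `parse_traffic_table` and `growth_ratios` are hypothetical, the meaning of the first column is not stated in this section (it is simply called the row key here), and only the first four rows of the table above are reproduced.

```python
# Rows copied from the first four lines of the table above.
# The first column is unlabeled in the source, so it is treated
# here as an opaque integer "row key" (an assumption).
TABLE = """\
1 | 52842 | 22579 | 100393
2 | 104348 | 23049 | 199864
4 | 207449 | 23843 | 399011
8 | 256356 | 25770 | 480230
"""

TRACES = ("Spice", "Tex", "gcc")


def parse_traffic_table(text):
    """Parse pipe-delimited rows into {key: {"Spice": n, "Tex": n, "gcc": n}}."""
    table = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        cells = [c for c in cells if c]  # drop empty (trailing) cells
        if len(cells) != 4 or not all(c.isdigit() for c in cells):
            continue  # skip header lines and malformed rows
        key, *counts = (int(c) for c in cells)
        table[key] = dict(zip(TRACES, counts))
    return table


def growth_ratios(table, trace):
    """Ratio of words transferred between consecutive row keys for one trace."""
    keys = sorted(table)
    return {(a, b): table[b][trace] / table[a][trace]
            for a, b in zip(keys, keys[1:])}


traffic = parse_traffic_table(TABLE)
for (a, b), ratio in growth_ratios(traffic, "Spice").items():
    print(f"{a} -> {b}: x{ratio:.2f}")
```

The parser ignores empty cells, so it accepts rows with or without trailing delimiters, and it skips any line that is not four integers, which covers the header row.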

| Spice | Tex | gcc
1 | 42253 | 22430 | 91522
2 | 83231 | 22531 | 172915
4 | 165231 | 22701 | 363568
8 | 193898 | 23148 | 425360
16 | 251996 | 24936 | 543240
32 | 374264 | 32056 | 773048
64 | 701456 | 61856 | 1254544
128 | 1380576 | 205408 | 2318592
256 | 2593664 | 2499968 | 4942464
512 | 6492672 | 8878080 | 12083968
1024 | 22738688 | 26765824 | 32429568

| Spice | Tex | gcc
1 | 34069 | 22415 | 88816
2 | 66934 | 22472 | 176659
4 | 132665 | 22582 | 352652
8 | 154262 | 22614 | 411748
16 | 205496 | 22684 | 521936
32 | 319656 | 22832 | 731784
64 | 545904 | 22976 | 1170800
128 | 1109696 | 23296 | 2063808
256 | 2578752 | 625984 | 4110656
512 | 5939968 | 4323968 | 10512768

| Spice | Tex | gcc
1 | 32006 | 22415 | 88483
2 | 62804 | 22472 | 175994
4 | 124453 | 22582 | 351304
8 | 144724 | 22614 | 388366
16 | 186544 | 22684 | 518104
32 | 282600 | 22832 | 721928
64 | 516208 | 22976 | 1140688
128 | 1107616 | 23296 | 2011840
256 | 2639296 | 24512 | 3935936

| Spice | Tex | gcc
1 | 31732 | 22415 | 88599
2 | 62252 | 22472 | 176250
4 | 123408 | 22582 | 351719
8 | 144220 | 22614 | 408474
16 | 186664 | 22684 | 513104
32 | 252912 | 22832 | 718080
64 | 470992 | 22976 | 1123520
128 | 1075168 | 23296 | 1978400

| Spice | Tex | gcc
1 | 31528 | 22415 | 88859
2 | 61837 | 22472 | 176795
4 | 122534 | 22582 | 352938
8 | 142198 | 22614 | 411102
16 | 184884 | 22684 | 512732
32 | 246304 | 22832 | 712792
64 | 484320 | 22976 | 1132080

| Spice | Tex | gcc
1 | 54838 | 22801 | 95467
2 | 108827 | 23886 | 190326
4 | 216886 | 25843 | 381089
8 | 291244 | 30420 | 469854
16 | 439760 | 131876 | 638220
32 | 740968 | 262264 | 957824
64 | 1349840 | 576192 | 1660800
128 | 2841856 | 1709344 | 3356032
256 | 6415424 | 4209216 | 7978240
512 | 17354880 | 13380992 | 21006976
1024 | 51411200 | 36640512 | 61892608
2048 | 177528832 | 13999714 | 184351232

| Spice | Tex | gcc
1 | 41326 | 22458 | 83023
2 | 81566 | 22557 | 165189
4 | 162070 | 22753 | 330992
8 | 212236 | 22800 | 405584
16 | 297120 | 25052 | 543720
32 | 448616 | 30488 | 788312
64 | 824048 | 54016 | 1285872
128 | 1675936 | 215936 | 2341664
256 | 3905600 | 2283520 | 5244928
512 | 10390272 | 8999936 | 12749568
1024 | 26640384 | 30102016 | 35143424
2048 | 91744256 | 91728384 | 98295808

| Spice | Tex | gcc
1 | 32835 | 22458 | 79473
2 | 64529 | 22557 | 158071
4 | 127950 | 22753 | 316486
8 | 162080 | 22800 | 386080
16 | 219480 | 22904 | 515068
32 | 320480 | 23048 | 742096
64 | 611200 | 23248 | 1181696
128 | 1181056 | 34688 | 2060576
256 | 2841024 | 1891264 | 4023040
512 | 6770944 | 8523776 | 9860992
1024 | 19251712 | 26764544 | 28868864

| Spice | Tex | gcc
1 | 25929 | 22458 | 77033
2 | 51233 | 22557 | 153218
4 | 112836 | 22753 | 306939
8 | 131724 | 22800 | 376198
16 | 169352 | 22904 | 504544
32 | 251792 | 23048 | 725352
64 | 529088 | 23248 | 1152224
128 | 1143296 | 23648 | 2001856
256 | 2616384 | 24768 | 3808832
512 | 5682048 | 5272064 | 8806784

| Spice | Tex | gcc
1 | 28464 | 22458 | 76017
2 | 55721 | 22557 | 151210
4 | 110246 | 22753 | 302962
8 | 125826 | 22800 | 371690
16 | 157676 | 22904 | 501204
32 | 225168 | 23048 | 721432
64 | 454576 | 23248 | 1144369
128 | 1081728 | 23648 | 1988224
256 | 2626944 | 24768 | 3774086

| Spice | Tex | gcc
1 | 28536 | 22458 | 75808
2 | 55873 | 22557 | 150787
4 | 118572 | 22753 | 301938
8 | 125318 | 22800 | 369144
16 | 153820 | 22904 | 496376
32 | 217952 | 23048 | 724320
64 | 442704 | 23248 | 1147808
128 | 1126432 | 23648 | 1979840

Memory Size
32 | 3234984 | 2986000 | 3121744
64 | 2907976 | 1859144 | 2707496 | 5242960 | 4359408 | 5168928
128 | 2037192 | 1427176 | 2329128 | 4196672 | 2978896 | 4313488 | 8601728 | 6330720 | 8808064
256 | 1693272 | 1088936 | 1912816 | 2956720 | 2216784 | 3442576 | 6091104 | 4516864 | 7024992
512 | 1293424 | 863048 | 1525752 | 2267168 | 1833312 | 2682416 | 4519904 | 3400128 | 5333504
1024 | 875968 | 683472 | 1224848 | 1521872 | 1538016 | 2113744 | 3095424 | 2481472 | 4085024
2048 | 570440 | 61232 | 943728 | 1024032 | 149872 | 1597648 | 2122208 | 527808 | 3070464
4096 | 403304 | 41664 | 634376 | 685072 | 85424 | 1035584 | 1349408 | 272608 | 1974336
8192 | 207736 | 30496 | 429896 | 341984 | 53712 | 696752 | 640416 | 153120 | 1276480
Signal | ||
CPUMISS | ||
STALLM | ||
WDC | ||
VDA | ||
CRADR4 | ||
CRADR5 | ||
CRADR6 | ||
BRANCH | ||
CRADR7 | ||
CRADR8 | ||
ADR16 | ||
ADR17 | ||
ADR18 | ||
ADR19 | ||
ADR20 | ||
ADR21 | ||
ADR22 | ||
ADR23 | ||
ADR24 | ||
ADR25 | ||
ADR26 | ||
ADR27 | ||
ADR28 | ||
ADR29 | ||
ADR30 | ||
ADR31 | ||
ADR15 | ||
ADR14 | ||
ADR13 | ||
ADR12 | ||
ADR11 | ||
ADR10 |
ADR9 | ||
ADR8 | ||
ADR7 | ||
ADR6 | ||
ADR5 | ||
ADR4 | ||
ADR0 | ||
ADR3 | ||
ADR2 | ||
ADR1 | ||
CRADR0 | ||
CRADR1 | ||
CRADR2 | ||
CRADR3 | ||
CRDILTCH1 | ||
CRADR4 | ||
CRADR5 | ||
CRADR6 | ||
CRADR7 | ||
CRADR8 | ||
CRADR8 | ||
CRADR7 | ||
CRADR6 | ||
CRADR5 | ||
CRADR4 | ||
CRADR3 | ||
CRADR2 | ||
CRADR1 | ||
CRADR0 | ||
CRHOLD | ||
CRDILTCH1 | ||
CRRECEIVE |
CRADR8 | ||
CRADR8 | ||
CRADR7 | ||
CRADR6 | ||
CRADR5 | ||
CRADR4 | ||
CRADR3 | ||
CRADR2 | ||
CRADR1 | ||
CRADR0 | ||
CRHOLD | ||
CRDILTCH1 | ||
CRRECEIVE | ||
CRRECEIVE | ||
CRWIDE | ||
CRWIDE | ||
ACK | ||
L2ADR0 | ||
L2ADR1 | ||
L2ADR2 | ||
L2ADR3 | ||
L2ADR4 | ||
L2ADR5 | ||
L2ADR6 | ||
L2ADR7 | ||
L2ADR8 | ||
L2ADR9 | ||
L2ADR10 | ||
L2ADR11 | ||
L2ADR12 | ||
L2ADR13 | ||
L2ADR14 | ||
L2ADR15 | ||
L2ADR16 | ||
L2ADR17 | ||
L2ADR18 | ||
L2ADR19 | ||
L2ADR20 | ||
L2ADR21 |
L2ADR22 | ||
L2VDA | ||
L2MISS | ||
L2DIRTY | ||
L2SYNCH | ||
IS_DCC? | ||
IOCNTRL0 | ||
IOCNTRL1 | ||
CRWRITE0 | ||
CRWRITE1 | ||
SAMPLE CLOCK CONTROL | ||
SAMPLE DELAY MUX SELECT 2 | Configuration | |
SAMPLE DELAY MUX SELECT 1 | Configuration | |
SAMPLE DELAY MUX SELECT 0 | Configuration | |
SAMPLE PHASE MUX SELECT 1 | Configuration | |
SAMPLE PHASE MUX SELECT 0 | Configuration | |
PRESAMPLE OVERRIDE WAIT | Configuration | |
PRESAMPLE DELAY MUX SELECT 2 | Configuration | |
PRESAMPLE DELAY MUX SELECT 1 | Configuration | |
PRESAMPLE DELAY MUX SELECT 0 | Configuration | |
PRESAMPLE PHASE MUX SELECT 1 | Configuration | |
PRESAMPLE PHASE MUX SELECT 0 | Configuration | Scan in |