Chapter 5

Beyond F-RISC / G

F-RISC existed as an architecture long before the F-RISC/G design team committed it to GaAs. The instruction set was designed to minimize hardware complexity, and thereby maximize clock rate, while still allowing a reasonable amount of programming flexibility. While the fastest available devices tend to have low yields, it is sometimes possible to sacrifice some device speed for greatly increased device integration. Future iterations of F-RISC are expected to explore these areas. In addition, the technology of F-RISC can be applied to other high-speed digital design arenas, such as network switching. Many of the ideas advanced in this chapter illustrate that the overall throughput of F-RISC/G could probably have been improved markedly had the design process proceeded differently.

The F-RISC architecture has been implemented in several technologies, and several more implementations are planned. Since the architecture does not yet have to support a large software base, it is permissible to modify it to suit research purposes. Any such changes should fit the F-RISC philosophy: blinding speed at any cost. F-RISC/H, the successor to the current design, has begun to take shape on paper. It is hoped that faster devices, higher device integration, multiple and perhaps deeper pipelines (using VLIW techniques), better CAD tools, and more advanced packaging will allow at least an eightfold increase in single-node throughput.

It is possible, however, that some of these technologies can be tested with the current F-RISC/G design, leveraging the considerable effort and expense that went into it to allow experimentation with a few of these more advanced technologies before committing to the whole package for F-RISC / H.

The most important lesson to be learned from the F-RISC/G cache design is the necessity of designing the entire processor holistically; if the cache is relegated to secondary importance, the overall design will run more slowly than it could have. In the F-RISC/G case, the fact that the cache was designed after the CPU design was frozen resulted in tighter than necessary critical paths in the cache, excessive circuit complexity in the cache controllers, and unnecessarily long MCM routing.

In [Przy90], this idea is validated:

"Guideline 8: Design the memory hierarchy in tandem with the CPU

It is an easy trap to fall into: design the CPU, shrink its cycle time as much as possible, then design the best cache possible that can cycle at the same rate. If either the cache or CPU must be designed first, it should be the cache. Look at the resources available to it and determine what attainable cycle time and size pair yield the highest performance. Then build the CPU with whatever resources are left. Better yet, consider the whole problem: the system design. Partition the resources among the various functional units so that they are all matched in cycle time and together they yield the best overall system-level performance."

The F-RISC / G design process proceeded exactly as in the "trap" that Przybylski describes, and, as a result, the overall throughput of the design was sacrificed in order to attain minimum cycle times. Additionally, some of the techniques in this chapter would allow higher throughput with no impact on cycle time, but were not available as options due to the frozen status of the CPU design.

FIGURE 5.1: CYCLE TIME / CPI TRADE-OFF

When test data revealed that the Rockwell process produced worse-than-expected interconnect performance (due to an anisotropic dielectric with high planar dielectric constant), a great deal of effort was required to improve the routing and final layout. Many of these problems could have been avoided entirely if the CPU and cache had been designed together.

In a commercial processor, trading off throughput for a shorter cycle time would be intolerable. Since F-RISC is a research effort, however, there is some flexibility in the design objectives, and future versions of F-RISC may very well benefit from different design priorities. Figure 5.1 illustrates the trade-off between CPI and cycle time; throughput can be maintained if CPI is decreased as cycle time is increased. More importantly, if a slight increase in cycle time permits a large reduction in CPI, throughput will increase despite the longer cycle.
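As a concrete illustration of this trade-off, the short Python sketch below compares the throughput of two hypothetical design points; the CPI and cycle time values are purely illustrative and are not measured F-RISC/G figures.

    # Throughput is inversely proportional to the product of CPI and cycle time.
    def throughput_mips(cpi, cycle_time_ns):
        return 1e3 / (cpi * cycle_time_ns)

    base = throughput_mips(cpi=2.5, cycle_time_ns=1.0)   # hypothetical fast-clock design
    alt  = throughput_mips(cpi=1.9, cycle_time_ns=1.2)   # 20% longer cycle, lower CPI
    print(base, alt)    # the second design achieves higher throughput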

  5.1 Cache Organization and Partitioning

The current partitioning of the F-RISC/G system was chosen so as to minimize the length of the primary CPU critical paths, and thus maximize CPU clock rate. The partitioning was in large part enforced by economic and technical limitations which may, in the future, become less important. The use of even more exotic packaging alternatives, the ability to produce larger die with more devices per die, and faster devices would all affect the partitioning decisions made in the F-RISC/G system.

  5.1.1 Use of Higher Device Integration

The overall processor throughput would be greatly improved if the device yield were higher, allowing larger device integration. While my research was focused on low-yield architectures, it is interesting to consider the effect of even doubling the number of devices per die (which would still be extremely low device integration compared to other technologies).

When millions of transistors are available, cache designers tend to incorporate complicated cache architectures rather than dedicating all of the transistors to cache memory. As memory sizes become arbitrarily large, rather than using additional transistors to make a small percentage change in the size of the cache, it is preferable to use the devices to improve the control circuitry so as to make better use of the RAM already available.

Improvements in control logic which may be feasible when the device budget is high include adding multiple cache sets, branch prediction, and complicated replacement algorithms.

In the low transistor regime, however, it makes more sense to simply increase the size of the cache RAM. This is borne out by the Dinero simulations.

  5.1.2 Remote Program Counter

In the F-RISC/G design, the instruction cache controller contains the Remote Program Counter (RPC), as discussed in Chapter 2. The benefit of including this circuitry on the cache controller is that it reduces the address bus traffic between the CPU and the caches; only on branches does the cache controller need to receive an address from the CPU.

By reducing this traffic it was possible to have both caches share an address bus, reducing pin-out, device counts, and routing complexity on the datapath chips. An important disadvantage of this combined bus is that its load capacitance is approximately twice as high as that of each of two separate buses. This increases the cycle time for each data cache access by approximately 150 ps.

In addition, the inclusion of the remote program counter results in an extra multiplexor delay on the tag RAM and cache RAM address bits. Were it not for the RPC, the low order address bits could pass directly to the cache RAMs, shortening the cache memory cycle time by approximately 300 ps (presuming the tag RAM and comparator could produce a result quickly enough). Alternatively, more RAM chips could be included, slowing the access time but increasing the hit rate.

Aside from splitting the address bus into instruction and data address buses, the RPC could be eliminated if the cache controller contained some of the functionality of the instruction decoder, or, alternatively, if the cache controller functions were merged into the instruction decoder.

If the device yield were sufficient, the cache controller could decode PC-relative unconditional BRANCH instructions immediately. Conditional BRANCHes would require control signals to be exchanged between the datapath and cache controller. Also, BRANCH instructions which do not use PC relative addressing would have to rely entirely on the datapath. This particular solution merely complicates the cache controller while adding further communications complexity between the cache and CPU. The possible advantage is that PC-relative addressing may be so common that an additional cycle of latency for non-PC-relative BRANCH instructions can be tolerated.

The other option is to incorporate a tag RAM and comparator on the instruction decoder, eliminating the instruction cache controller entirely. This would require the address to be communicated from the datapath to the instruction decoder on every cycle, which would do little to improve the communications latency problem. If, however, the comparator and tag RAMs were distributed across the datapath instead, each slice of the comparison could be performed very quickly, and it would be necessary only to accumulate one signal across the four slices, indicating whether a miss occurred. This signal accumulation is similar to what occurs in the F-RISC / G design with the adder carry chain.
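A minimal Python sketch of this distributed comparison follows; the four-slice partitioning matches the F-RISC/G datapath, but the field widths and helper names are assumptions made purely for illustration.

    SLICES = 4
    SLICE_BITS = 8     # assumed number of tag bits held by each datapath slice

    def slice_hit(stored_slice, addr_slice):
        # each slice compares only the tag bits it holds
        return stored_slice == addr_slice

    def tag_hit(stored_tag, addr_tag):
        # accumulate a single hit/miss result across the slices,
        # analogous to the adder carry chain
        hit = True
        for i in range(SLICES):
            shift, mask = i * SLICE_BITS, (1 << SLICE_BITS) - 1
            hit = hit and slice_hit((stored_tag >> shift) & mask,
                                    (addr_tag >> shift) & mask)
        return hit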

  5.1.3 Pipeline Partitioning

In the instruction cache, the inclusion of the remote program counter mandates that the nine low order address bits be routed to the cache controller and, from there, to the cache RAMs. Even in the data cache, which has no RPC, and even if the instruction cache were modified to eliminate its RPC, it would still be necessary to route these address bits through the cache controllers in order to handle misses; the addresses must be saved because, by the time the cache RAMs need them (to allow the secondary cache to write into them), the CPU will have placed new addresses on the bus. In the current implementation, these pipeline latches are located in the cache controller.

If the latches for these nine address bits are moved to the cache RAMs (which are already among the smallest chips in the system) or to the datapath, it may be possible to eliminate a pair of I/O delays, as well as decrease the MCM time-of-flight; this could decrease the length of the primary critical path by approximately 500 ps.

FIGURE 5.2: PROPOSED REVISED SYSTEM DIAGRAM

Figure 5.2 illustrates the F-RISC/G system with the elimination of the remote program counter, split 9-bit address buses to each cache, an L2 block size of 512 bits (sub-block placement would require additional address bits to be routed from the cache RAMs to the secondary cache), and the addition of pipeline latches to the cache RAMs or datapath.

If the pipeline latches are included on the cache RAMs, then the cache RAMs must be informed when a miss has occurred (either by the cache controller or the datapath), and must receive a Valid Address strobe to indicate when to advance the pipeline.

If, on the other hand, the datapath is responsible for maintaining the pipeline, these signals are not necessary. In fact, the datapath already includes program counter history registers which contain the necessary addresses for the instruction cache. At the time the instruction decoder receives MISSI, the PC_I2 register will contain the address which the cache RAMs will need in order to process the miss. For a data cache miss, at the time MISSD arrives at the instruction decoder the RES_D1 register will contain the address which data cache RAMs will need. The pipelines will advance once before the datapath can be informed of the miss.

The datapath already includes a path from RES_D1 through the ALU and out to the address bus. In order to remove the cache controller from the address critical path, it should only be necessary to modify the control circuitry in the CPU.

FIGURE 5.3: PROPOSED DATAPATH BLOCK DIAGRAM (ADAPTED FROM [PHIL93])

The program counter in the datapath (PC_I1 in [Phil93]) is either loaded or incremented on each phase 3. Currently, the only path from the program counter to the address bus is through two PC history registers and then through the ALU. As a result, the PC cannot be put on the address bus until at least an entire cycle after it increments. In addition, since the PC normally increments each cycle (except on branches), the ALU would never be available for other use.

Figure 5.3 is a block diagram for a proposed modification to the F-RISC/G datapath. This particular implementation contains as few modifications from the current design as possible, and is not necessarily the optimum configuration.

FIGURE 5.4: MODIFIED TIMING DIAGRAM

The address bus is shown to be split so that the ALU need only become involved on BRANCH instructions. There is a direct path from the PC to the instruction address bus, and a multiplexor is added to select from the current PC, or the PC_DE history register (in the event of a miss). The output of the ALU is also an input to the multiplexor, in order to allow branches to take place.

The path from the RES_D2 register through the ALU, which would be used in the event of a data cache miss, is marked as well. The control logic in the datapath would have to be modified to cause the appropriate pipelined address to be put on the correct bus in the event of a cache miss.

Figure 5.4 shows that implementing these changes in the instruction decoder may result in critical path savings of up to 650 ps. While this diagram shows a pipeline in the cache controller, it is possible to remove the cache pipeline from the cache controller, route a copy of the address bus to the cache controller, and use it to address the tag RAM. This is because the CC Pipe 1 master latch, which is used to address the tag and cache RAMs, mirrors what the datapath would put on the data address bus given these modifications. By increasing the complexity and pad count on the instruction decoder and datapath chips, it is possible to greatly simplify the cache controller (eliminating approximately two thousand devices), and reduce total system device count.

Before such modifications can be implemented in the datapath and instruction decoder, careful consideration should be given to the effects of such changes on the main CPU critical paths. Such modifications would clearly increase the size of the core CPU chips, and if the increase in size is significant, the cycle time of the core CPU will be adversely affected.

FIGURE 5.5: MODIFIED CACHE CONTROLLER BLOCK DIAGRAM

Figure 5.5 is a block diagram for the cache controller given a split bus, removal of the remote program counter and the pipeline latches, and the use of the pipeline latches already present in the datapath.

While such modifications to the chip set appear practical, it is unclear how much circuitry would have to be added to the core CPU to accomplish all of the objectives of the cache control circuitry. Copybacks, cache start-up, and page faults are all fairly complicated, and care must be taken to ensure that the cache and tag RAMs are addressed properly at all times. It may be necessary to maintain some pipeline latches on the cache controller in order to handle the write and valid signals, as well. However, since there would be no RPC, there would no longer be any purpose in the third pipeline stage; the minimum two is all that need be implemented.

It should also be noted that while such modifications immediately speed up the secondary cache memory critical path by eliminating several gate delays (multiplexor and latch delays between the I/O receivers and the tag RAM) and decreasing the size of the die, further work may be necessary to reduce the tag RAM / comparator cycle time sufficiently to prevent it from becoming the primary critical path. Since much of the comparator delay is caused by routing through the top standard cell area to the tag RAM, removing the pipeline latches which spatially dominate that portion of the chip would greatly reduce the comparator propagation delay. In addition, if time permitted, the tag RAM could be reduced from 32 bits to 24 bits wide, and the 25% savings in power could be used to increase the speed of the block if necessary.

  5.1.4 Temporal and Spatial Interleaving

In the current implementation, accessing a word of memory requires the use of all eight RAM chips in a cache. The memories are interleaved so that four bits of the word are contained in each RAM. This scheme is beneficial in reducing MCM routing density; only four data lines need to be routed to each RAM. Figure 5.6 illustrates the F-RISC / G nibble interleaving scheme.

If a single RAM supplied all 32 bits of data, then all 32 data lines would have to be routed to all 8 RAM chips, and a tri-state bus would have to be used. This would also greatly increase the capacitive load on the data bus. Future implementations of F-RISC may benefit from this or another interleaving scheme, however (Figure 5.7).

FIGURE 5.6: NIBBLE INTERLEAVING

If a single RAM chip were responsible for each word, for example, it would be possible to simulate multi-ported RAM so long as the words to be accessed simultaneously were not contained in the same RAM. This technique could allow a relaxation of the prohibition against a LOAD or STORE immediately following a STORE; if the second instruction accesses a memory location contained in a different RAM than the first instruction, the first address can be written while the second is read. Either the compiler can be left to schedule memory accesses (in which case the programming model would depend on the cache implementation), or, preferably, hardware can be included in the cache to perform the second operation when possible and otherwise stall the CPU.
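The bank-conflict check involved is simple; the sketch below, which assumes eight RAM chips and word interleaving of consecutive word addresses, shows the condition under which the second access could proceed. The function names are illustrative.

    NUM_RAMS = 8

    def ram_index(word_address):
        # with word interleaving, consecutive words map to consecutive RAM chips
        return word_address % NUM_RAMS

    def can_overlap(store_word_addr, next_word_addr):
        # the following LOAD/STORE may proceed in the same cycle as the STORE
        # only if the two words reside on different RAM chips
        return ram_index(store_word_addr) != ram_index(next_word_addr)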

FIGURE 5.7: WORD INTERLEAVING

The difficulty here would be the tag RAM; like the cache RAMs, the tag RAM must be able to support a read and write in the same cycle in order for this technique to work. This could possibly be accomplished by increasing the access speed of or interleaving the tag RAM.

This type of interleaving will probably be a necessity in any future F-RISC VLIW design; if at all possible the goal should be to allow at least one LOAD or STORE, and probably more, to occur during each cycle in order to keep the ALUs busy.

A related issue is the bus width to the secondary cache. While the secondary cache must always send the primary cache an entire line in order to gain the full benefit of a large block size (taking advantage of the principle of spatial locality), it is less clear that the primary cache must always copyback an entire block to the secondary cache.

In F-RISC / G, the data bus between the primary and secondary caches is bi-directional and there is little to be gained by allowing the cache to copyback sub-blocks. In designs where a bi-directional bus is not used, multiple dirty bits per block could be used to allow sub-block copybacks, thus reducing the number of MCM traces (assuming multiplexing takes place on the cache RAMs). If word interleaving is used to allow reads and writes to occur simultaneously, it would be possible to copyback just the dirty word in a block while simultaneously writing into other words of the block (reading from other words would not be an issue - in order for a copyback to occur, a miss must have occurred, in which case the entire block is invalid.)
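A sketch of the bookkeeping needed for sub-block copybacks is given below, assuming one dirty bit per word and a sixteen-word block; both the block size and the class interface are illustrative assumptions rather than the F-RISC/G implementation.

    WORDS_PER_BLOCK = 16

    class BlockState:
        def __init__(self):
            self.dirty = [False] * WORDS_PER_BLOCK

        def write_word(self, word_index):
            self.dirty[word_index] = True

        def words_to_copyback(self):
            # only the dirty words need to be driven back to the secondary cache
            return [i for i, d in enumerate(self.dirty) if d]

        def refill(self):
            # after a miss the entire block is replaced, so all dirty bits clear
            self.dirty = [False] * WORDS_PER_BLOCK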

FIGURE 5.8: 1 KB L0 CACHE

Rather than interleaving accesses spatially among several RAM chips, it may be desirable to "temporally interleave" accesses. One method of accomplishing this is to widen the bus between the primary cache and the CPU and to provide some memory, an "L0 cache," on the CPU to hold the excess data. The idea is to transmit to the CPU more information than is immediately needed, allowing the CPU to access data from the L0 cache while the L1 cache is busy with a new access.
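A rough sketch of the idea, assuming a 128-bit (four-word) transfer from the L1 cache, is shown below; the buffer organization and names are assumptions for illustration only.

    WORDS_PER_TRANSFER = 4     # assumed 128-bit L0/L1 bus, 32-bit words

    class L0Buffer:
        def __init__(self):
            self.base = None                       # word address of the first word held
            self.data = [None] * WORDS_PER_TRANSFER

        def fill(self, base_word_addr, words):
            self.base, self.data = base_word_addr, list(words)

        def read(self, word_addr):
            # serve the access locally if possible; otherwise the L1 cache
            # must be accessed (and may still be busy with the previous transfer)
            if self.base is not None and 0 <= word_addr - self.base < WORDS_PER_TRANSFER:
                return self.data[word_addr - self.base]
            return None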

Figure 5.8 shows the locus of primary cache access times and L0 block transfer sizes which achieve a 1.86 CPI given 1024 bit L0 caches. While this cache size is clearly too large to be feasible in F-RISC/G, it may be possible to implement in future designs. The figure results from several Dinero simulations run on a suite of five traces. A Harvard write-back architecture is assumed.

FIGURE 5.9: 1 KB VS. 2 KB L0 CACHE

As the figure shows, for small L0 block sizes it is possible to allow the primary cache to take multiple cycles to perform a transfer (as opposed to "1 cycle" in the current design) and achieve the same performance as in the current design. This effect can be leveraged to reduce the number of pipeline stages assigned to memory accesses, or to reduce the speed, and hence the power consumption, of the cache. A large bus size between the L0 and L1 caches increases the benefit of this technique.

Figure 5.9 shows the difference in allowed access time between 1024 bit and 2048 bit L0 cache (Harvard) assuming a 128 bit bus between the CPU and primary cache. As can be seen, the smaller cache size works nearly as well as the larger, suggesting that a small cache may be sufficient. If multiple pipelines are used in F-RISC / H, then the size of the L0 cache may need to be increased.

It is expected that the L0 instruction cache will be located on the same die as the instruction decoding hardware; the decoding and the fetch can be combined and circuits can be reduced in complexity. The L0 data cache will need to be located on the same die as the pipelines in order to have sufficiently small access time.

  5.1.5 Column Associativity

One method of improving the hit rate of a direct-mapped cache with little additional hardware is the column-associative cache [Agar87].

A column associative cache is essentially a hybrid between a direct-mapped cache and a set-associative cache in that there is sufficient hardware only to check a single tag at a time but there are multiple possible locations in which any block may be stored.

The simplest implementation would allow only two locations to store any particular block. One location would be the normal location where the block would reside in a standard direct-mapped cache. The second location should be easily computed from the address, say by simply inverting one of the line address bits.

When data is to be retrieved from the cache and it is not found in the primary location, the CPU and secondary cache are informed as in the normal direct-mapped cache. While the secondary cache is retrieving the necessary data, the primary cache checks the alternate location. If the data is found, the CPU is informed that the data is available, and the CPU can end its stall.

As a result, there are three possibilities: the data is in the primary location, in which case the CPU need not stall; the data is in the secondary location, in which case the CPU need stall only for one cache RAM cycle (in the F-RISC/G case, one CPU cycle); or the data is in neither location, in which case the CPU must stall for the same amount of time as on a miss in a conventional direct-mapped cache (presuming that the stall time would have been more than a cycle).
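The lookup sequence can be sketched as follows; the index width is an assumption, the alternate location is formed by inverting one line-address bit as described above, and the stored tag is assumed to cover enough of the block address that either location can be checked unambiguously.

    INDEX_BITS = 5      # assumed 32-line cache, for illustration

    def primary_index(line_address):
        return line_address & ((1 << INDEX_BITS) - 1)

    def alternate_index(line_address):
        # invert one of the line address bits
        return primary_index(line_address) ^ (1 << (INDEX_BITS - 1))

    def lookup(tags, line_address, tag):
        idx = primary_index(line_address)
        if tags[idx] == tag:
            return ("fast hit", idx)       # no stall
        alt = alternate_index(line_address)
        if tags[alt] == tag:
            return ("slow hit", alt)       # stall for one cache RAM cycle
        return ("miss", idx)               # wait for the secondary cache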

This idea has particular merit in F-RISC / G since the CPU can handle variable stall times.

FIGURE 5.10: COLUMN ASSOCIATIVE CACHE - SLOW HIT

FIGURE 5.11: ASSOCIATIVITY SCHEMES

Figure 5.10 shows the approximate timing of a slow cache hit in a column associative cache. The data will arrive at the CPU later than if the primary location contained the data, but more quickly than if it had been necessary to go to the secondary cache.

Using a variety of traces, the F-RISC / G system was simulated in Dinero with various associativity schemes. For the column-associative case it was assumed that a fast hit takes one cycle, a slow hit has a two cycle penalty, and a miss has a five cycle penalty.
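Using those penalties, the memory-stall contribution to CPI follows directly from the fast-hit, slow-hit, and miss fractions measured by Dinero; the example rates below are purely illustrative.

    def memory_cpi(slow_hit_rate, miss_rate, slow_penalty=2, miss_penalty=5):
        # fast hits add no stall cycles
        return slow_hit_rate * slow_penalty + miss_rate * miss_penalty

    print(memory_cpi(slow_hit_rate=0.02, miss_rate=0.03))   # ~0.19 extra CPI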

Figure 5.11 shows the effect of several of the simpler associativity schemes on the predicted cache CPI based on a suite of five traces. As can be seen, using a column associative cache seems to buy half the benefit of going to a fully associative cache of the same size. The cost of implementing a column associative cache is minimal - it is necessary only to modify the control circuitry on the cache controller. As the number of sets is increased the relative decrease in CPI diminishes.

Although the effects of implementing column associativity may vary in other design spaces, it should definitely be investigated in future F-RISC implementations.

  5.2 Superscalar / VLIW CPU

One of the ways that the F-RISC architecture is likely to be expanded in the future is through the implementation of fine-grained parallelism. This would entail adding additional parallel pipelines to the system, each of which is capable of independently processing instructions. Two methods of accomplishing this are "superscalar" architectures and "Very Long Instruction Word" (VLIW) architectures.

In each of these architectures, parallel pipelines and additional functional units are added to the processor to enable multiple instructions to be executed simultaneously.

In superscalar architectures the hardware typically examines the incoming instruction stream for dependencies and is responsible for scheduling instructions for execution. The instructions need not execute in the order in which they occur in the code.

In VLIW architectures the compiler is largely responsible for determining which instructions can be executed in parallel, and the instruction word is widened to accommodate multiple parallel instructions. Typically there is far less decoding by the hardware, which makes it ideal for low-yield technologies. The negative aspect of VLIW, however, is that it makes it difficult to maintain code compatibility among successive generations of processors.

These types of architectures raise special complications in designing the memory hierarchy.

In any processor with multiple conventional pipes, there might be several instruction fetches occurring simultaneously. In addition, several memory LOAD's and STORE's may also be occurring.

The problem becomes significantly more complicated in superscalar systems, where the CPU buffers many upcoming instructions and, based on dependencies between instructions, executes those it determines can be executed together (out-of-order issue).

In order to prevent instruction fetch latencies, the instruction cache must be capable of providing instructions to each pipe on every clock cycle. In the case of the VLIW architecture, a single, very wide instruction would be transferred, while, in the case of a superscalar architecture, several smaller instructions would have to be fetched.

The VLIW case requires a simpler hardware implementation than the superscalar case. A single instruction address would either be transferred from the CPU or generated using a remote program counter. The cache controller circuitry would then address its cache RAMs, each of which would send some bits of the instruction word to the instruction decoding circuitry. The main difference between the VLIW and single-pipe implementations is in the width of the data bus between the cache and the CPU.

A VLIW architecture could be implemented fairly easily (with respect to the cache) if limitations are placed on the types of instructions which can be issued in parallel. Specifically, unless the cache RAMs are multi-ported or interleaved, only one LOAD or STORE instruction can be performed at a time. There is a possible exception, however. If multiple LOAD's or STORE's are to be performed to addresses within the same cache line, then they can be executed in parallel if the data and address buses are sufficiently wide. Such an event would probably tend to occur only on consecutive word addresses. For example, a programmer may wish to load two consecutive registers with two consecutive words from memory. If the CPU contains two ALUs it would then be possible to fetch a 64 bit long word from memory, perform an operation on it, and put it back in memory in three cycles. Such a capability is particularly useful for floating point operations.

As the width of the instruction word is increased the cost to add more ALUs and functional units increases linearly. The cost in terms of memory bandwidth is much more severe. Multi-porting the cache is extremely expensive in terms of speed and hardware, but, if multiple simultaneous accesses to memory are not allowed, it will be difficult to make full use of the parallel pipelines. The functional units need data on which to operate, and that data will always originate in the cache.

Due to this memory bottleneck, it may be desirable to increase the cycle time of the cache if doing so would provide a net increase in speed. For example, if double-porting the cache results in less than a doubling of the cache cycle time, the net CPI may be improved.
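A small worked example of this argument is given below; all of the numbers are hypothetical and serve only to show that a lower CPI can outweigh a longer cache cycle.

    def relative_throughput(cpi, cycle_time):
        return 1.0 / (cpi * cycle_time)

    single_port = relative_throughput(cpi=2.4, cycle_time=1.0)
    dual_port   = relative_throughput(cpi=1.8, cycle_time=1.25)   # 25% longer cycle
    print(dual_port > single_port)    # True: the net effect is still a speedup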

An additional consideration in VLIW architectures is bubbles in the instruction stream. It may not be possible for the compiler to schedule instructions for each functional unit at all times. In order to increase memory bandwidth it may be desirable to avoid sending wide instructions with void bit-fields corresponding to unused pipelines. An alternate approach is to add a bit-field to each instruction field to indicate which pipeline the instruction is intended for. If pipeline two can not be used, for example, a cache transfer may contain two instructions intended for pipeline one, to be executed sequentially. The second instruction would be stored by the issue unit until the appropriate time. Given the small number of pipelines in the system, the logic necessary to accomplish this should not be extensive.
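A sketch of this packing and issue scheme, assuming two pipelines and illustrative data structures, follows.

    NUM_PIPES = 2
    issue_buffer = [[] for _ in range(NUM_PIPES)]    # per-pipeline holding queues

    def pack(slots):
        # slots: list of (pipe_id, instruction) pairs; unused pipes are simply absent,
        # so no void bit-fields need to be transferred from the cache
        return list(slots)

    def issue(packed):
        # queue each instruction for its target pipeline, then issue one per pipe
        for pipe_id, instr in packed:
            issue_buffer[pipe_id].append(instr)
        return [issue_buffer[p].pop(0) if issue_buffer[p] else None
                for p in range(NUM_PIPES)]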

  5.3 Multiprocessing

The F-RISC/G MCM would make an interesting node for a multiprocessor machine. If a thousand F-RISC/G's could somehow be wired together in a useful manner, the resulting system would, at peak, be capable of performing 10^12 instructions per second, reaching the infamous "tera-op" barrier.

Memory organization is a major area of research in the area of multiprocessing. If all of the processors in a multiprocessor system share a common memory address space, then the cache represents a major problem. If each processor has its own cache, then care must be taken to ensure that if data in one processor's cache is modified, then other processors are made aware of it. This is known as the problem of "cache coherency."

FIGURE 5.12: SHARED MEMORY MULTI-PROCESSOR IMPLEMENTATION

Copyback caches are particularly difficult to keep coherent. If each CPU has its own primary cache (Figure 5.12), then if a processor modifies a memory address main memory will not necessarily be updated to reflect the change. In addition, any of the other caches which hold the out-of-date contents of that address will be in the dark as well. This latter problem is present in write-through architectures as well.

The F-RISC/G processor has the capability to operate in a write-through mode, since a miss can be forced by setting the IOCTRL field in the STORE instruction appropriately. The design does not, however, have all the features one would desire in an efficient multi-processing cache system.

In order to handle cache coherency, it is necessary either to ensure that each of the caches is at all times up-to-date, or to maintain state information for each cache block indicating whether the line needs to be updated with the contents of main memory.

There are several methods by which the cache could be kept coherent. The cache could, each time it is accessed, poll each of the other caches to determine if an updated copy of the contents of that address exists. This would be very time consuming, however, and would greatly reduce the speed of the cache.

A better alternative is to implement write-through in the primary cache and have the secondary cache asynchronously inform each of the primary caches of the change. The secondary cache could keep a directory of which caches need to be updated, since it has sufficient information as to which caches had previously requested or updated that address. It is also possible to implement special hardware between the primary and secondary caches to provide this functionality.

"Directory based" cache coherency protocols use centralized hardware (contained in the secondary cache or between the primary and secondary caches) to maintain the information as to which caches contain which versions of which addresses.

An alternative is a "snooping" cache, in which all primary caches share a bus with the secondary cache and thus can keep track of modifications to memory addresses.

FIGURE 5.13: SNOOPING CACHE COMMUNICATIONS

Either solution will result in a net decrease in speed of memory access in the F-RISC / G prototype. A directory based system would require that the directory be accessed in parallel to the tag and data RAMs. This access will likely become the critical path, as the directory is likely to be located a large physical distance from the cache controller. In addition, more control logic will be necessary to deal with the directory and to replace blocks when necessary.

A snooping protocol greatly increases loading on the bus to the secondary cache. If RC delays dominate due to large routing distances, then the delay on the bus will increase quadratically as a function of the number of processors in the system. On the other hand, the bus to the secondary cache already exists, and thus implementing a snooping protocol eliminates the need to create an entirely new communications path (Figure 5.13). The tag RAM can also be used as a repository for sharing-status information, although there are good reasons not to do so if it can be avoided.

In a typical implementation, the tag RAM would contain extra status bits for each cache block. When any cache misses on a read, each of the other caches must check its tag RAM and determine whether it has a modified version of the block, and, if so, must put it on the bus. At any time only one cache will have permission to modify a block. When a write to memory occurs, each cache checks to see if it has a copy of the block and either invalidates it or updates it to remain current.
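The snooping behaviour just described can be sketched as below, assuming a write-invalidate policy and one modified bit per block; the data structures and the bus interface are illustrative placeholders, not a defined F-RISC interface.

    class SnoopingCache:
        def __init__(self):
            self.blocks = {}      # block address -> {"modified": bool}

        def snoop_read_miss(self, block_addr, bus):
            # another cache missed on a read; supply the block if we hold it modified
            entry = self.blocks.get(block_addr)
            if entry and entry["modified"]:
                bus.supply(block_addr)        # hypothetical bus operation
                entry["modified"] = False

        def snoop_write(self, block_addr):
            # another cache wrote this block; invalidate our stale copy
            # (updating it in place is the alternative policy)
            self.blocks.pop(block_addr, None)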

In order to ensure that cache activities in other processors don't affect a particular CPU's throughput, a second tag RAM can be put on the cache controller specifically for use in snooping, thus allowing normal cache operations to proceed in parallel. Only when the cache finds it must perform some coherency-maintaining operation is there a chance of stalling the CPU.

The disadvantage of this technique is that the size of the cache controller is approximately doubled: two comparators and two tag RAMs are necessary. In addition, while there would be no unnecessary contention stalls, communications with the secondary cache are limited by the speed of the bus, and clock synchronization across CPUs becomes important; if a processor is out of synchronization with the others, then it may proceed with an illegal write because its snooping tag RAM was not updated in time.

  5.4 Cache Pre-processing

One of the most intriguing ideas for extending the speed and capabilities of the F-RISC architecture is the inclusion of processing logic in the cache memory subsystem. The idea is to include extended processing capabilities in the cache subsystem while the CPU handles a small set of core instructions.

Perhaps one of the most useful ways in which cache pre-processing can be incorporated into F-RISC is through architecture translation. While F-RISC/G has the ability to process at 1000 peak MIPS, the software base for the F-RISC architecture is essentially non-existent. Cache pre-processing holds the promise of allowing the F-RISC processor to run a large base of existing software for other architectures at speeds greater than that of processors based on the native architecture.

The idea is that programs in main memory could be either native F-RISC binaries, or binaries intended for other architectures like MIPS or SPARC. As the foreign binaries are transferred down the cache hierarchy toward the CPU, the processing capabilities of the cache convert them to native F-RISC code which the core CPU executes at full speed. If the code in the primary cache is native code, then the vast majority of the time there will be no speed penalty for making this translation.

FIGURE 5.14: ARCHITECTURE TRANSLATION

Figure 5.14 illustrates architecture translation within the secondary cache. There should be only one multiplexor delay on the path through the secondary cache for native code. Since the logic in the secondary cache is expected to be slower than that in the core CPU, this delay is not inconsiderable. Nonetheless, the penalty must be paid only when the secondary cache misses and must transfer a block from the next higher level of memory; since the target hit rate for the primary caches is 95%, the secondary cache can be expected to miss on an extremely small percentage of instructions.

The translation penalty is reduced even more when one considers that once a translated instruction is transmitted to the primary cache, a miss on that data in the primary or secondary cache will result in the translated data being passed back up through the cache hierarchy; the cache will not have to perform the translation again.

Another example of cache pre-processing which can result in a faster architecture is the byte operations chip. While this chip has not been designed or fabricated, the instruction decoder and cache controller contain hooks to allow it to be integrated into the F-RISC/G prototype.

The byte operations chip is essentially a byte multiplexor, allowing particular bytes within a word to be written into and read from. Unfortunately, in the current F-RISC/G implementation the cache critical paths don't allow much slack; it may be possible to fabricate a byte operations chip which works with the current system timings, but it would undoubtedly require a great deal of power in order to operate quickly enough. The byte-ops chip could be combined with architecture translation to allow either little or big endian architectures to be emulated in hardware.

Still another manner in which processing in the cache can be used to advantage is through the introduction of dedicated branch circuitry. The instruction cache controller, for example, could clearly handle unconditional branches without intervention of the datapath chips. In the event of conditional branches, the cache could conceivably speculatively fetch both possible addresses (if dual porting were available or the memories were interleaved in such a way as to make that possible), or, at the very least, pass the branch target address to the secondary cache so that if a miss occurs the secondary cache will have the needed data available more rapidly. Alternatively an active or passive branch prediction scheme could be implemented in the cache controller.

  5.5 Future Packaging

The F-RISC / G memory hierarchy spends much of its time communicating data and addresses between chips. If future packaging were able to eliminate or at least seriously reduce these delays, then the processor cycle time could be decreased significantly.

One of the most promising new packaging technologies which could have a great effect on reducing the cycle time of future processors is three-dimensional (3-D) packaging. In 3-D packaging, rather than laying out the chips in a single layer on a flat module, they are stacked vertically. Since the chips are much thinner than they are wide or long, the distance between chips is much reduced. If a way can be found to take advantage of this vertical communication distance, then the overall cycle time can be much reduced.

From a practical point of view, one of the most difficult problems with stacking chips vertically rather than distributing them on a surface is that the vertical chip stack has poor thermal qualities.

A single chip or MCM package provides a comparatively large surface area through which to dissipate excess heat generated on-chip. If chips are vertically stacked, then the top and bottom chips on the stack may have a surface through which to dissipate heat, but the chips sandwiched in-between can only dissipate heat laterally through the edges (which, due to the small edge surface area, is not very helpful) or into neighboring chips.

In the nascent stage of a device technology, such as F-RISC/G's GaAs HBTs, power tends to be a particular problem. When using an exotic device technology, however, high circuit frequency is usually the primary goal, making it desirable to try to overcome this thermal issue.

FIGURE 5.15: 3-D CHIP STACK

One possible method of attacking this problem is shown in Figure 5.15. Recognizing that the chips themselves do not conduct heat well in the lateral directions, diamond sheets are interposed between the die. The diamond sheets have "fins" which extend beyond the dimensions of the die, allowing them to conduct heat into an appropriate thermo-conductive substance.

A second critical problem with the three-dimensional chip stack is inter-chip signal routing. Traditional integrated circuit fabrication techniques do not allow signals to be routed through the backside of a die. As a result, signals must be routed to the edges of the stack, and, from there routed along the die stack edge to other die in the stack.

In the solution proposed in Figure 5.15, only one edge of the stack is available for routing (the others being interrupted by heat-conducting fins which prevent a smooth surface onto which metallization can be deposited.) Furthermore, due to the difficulty of providing interconnect on such a surface, it is unlikely that more than one routing layer can be provided.

The difficulty of routing is further exacerbated when a chip stack is to contain multiple identical chips. For example, a chip stack may contain many RAM chips, each of which accepts the same address but which provides different I/O buses (as is the case in F-RISC/G). Assuming these chips were redesigned with all of the I/O pads on one edge of the chip (which would be undesirable for other reasons), the parallel routing lanes on the edge of the chip stack would result in the I/O buses of each chip being shorted together, since the pads are located in the same place on each chip. The nine bit address bus between the cache controller and the RAM chips is the only bus which does not suffer from the problem of aligned die pad locations, since all of the RAM chips receive the same address. Several control signals, such as the signals used to latch the inputs and outputs of the RAM chips, are also immune to this problem. The data buses, however, are a problem.

One solution would be to fabricate several cache RAM chips, each with different pad-outs (a solution which quickly becomes very expensive), or to provide extra I/O's which are located in different routing channels (which could, perhaps, be accessed by rotating some chips with relation to the others).

FIGURE 5.16: CHIP WITH INTERPOSER

A simpler and cheaper solution would be to use "interposer" dies which contain the multi-layer interconnect necessary to route signals from the die I/O touchdowns to the stack edge solder connectors. The interposers would contain no active devices and thus would be considerably cheaper to manufacture than several varieties of each architecture chip (four distinct data path chips, two distinct cache controllers, etc.)

Figure 5.16 shows such an interposer die connected to one of the architectural dies. Such an interposer could be fabricated with several metallization layers, some of which may be dedicated to power and ground planes; aside from helping to eliminate problems such as voltage droop and ringing, these planes also have thermal dissipation advantages.

The use of interposers is not without its disadvantages, however. Since all routing must be brought to one side of the stack, nets which originate on the opposite side of the stack must be routed the length of at least one chip edge before they reach the stack routing channel. In the worst case this extra routing will be necessary at both the driving and receiving chips, resulting in a minimum net length of two chip edges. If the planar route could be accomplished in a shorter distance, then the advantage of the chip stack is eliminated.

If these technical obstacles can be overcome, the use of 3-D chip stacks could greatly increase the clock rate of the F-RISC/G CPU, even if the device technology is not improved.

Using the "conventional" planar MCM arrangement, the largest communications component of delay in the cache memory critical path is the address transfer from each of the cache controllers to the cache RAMs (estimated at 300 ps). Eliminating this delay would allow the use of lower power cache RAMs (1.05 ns access time vs. 750 ps for the current RAM chip).

In order to use 3-D stacking to eliminate this delay, the RAM chips may be combined in a stack with the appropriate cache controller. While this merely removes around a clock phase from the critical path (which is not enough to have any effect on overall CPU throughput), the slight communications delay caused by the vertical separation between chips on the stack remains very small as more chips are added, so the CPI may be reduced by including more RAM chips and increasing the cache hit rate.

FIGURE 5.17: 3-D RAM STACK MCM LAYOUT
                                              Two Cache         Three       Single
                                              RAM / CC Stacks   Stacks      Stack
A    Address I/O (datapath):                  145               145         45
B    Address Transfer (DP to CC):             170               100         < 10
C,D  Address I/O (CC):                        334               334         90
E    Cache RAM Address Transfer (CC to RAM):  < 50              < 50        < 50
F    RAM Access Time:                         750               750         750
G    Data Transfer:                           < 50              < 50        < 50
     Total:                                   < 1499            < 1429      < 995

TABLE 5.1: ESTIMATE OF CRITICAL PATH LENGTHS USING 3-D STACKING (ALL DELAYS IN PS)

Figure 5.17 shows the MCM layout given this type of chip stack. An added benefit to stacking the chips this way is that the other communications components of the cache subsystem critical path are significantly reduced as well. Data transfer between the cache RAMs and the CPU requires fewer than two chip edges.

Table 5.1 shows that the estimated critical path delay using this scheme is reduced to 1500 ps. This may be fast enough to eliminate the D1 stage of the CPU pipeline.

Additional speed improvements could be made by stacking the datapath chips (Figure 5.18). It is doubtful that the instruction decoder could be included in the stack due to the complexity of the resulting stack routing.

FIGURE 5.18: 3-D RAM AND CPU STACK MCM LAYOUT

In this arrangement the CPU critical path would be greatly reduced, which would allow the cycle time to be decreased accordingly. In addition, the address broadcast from the datapath to the caches will increase in speed, resulting in a modest decrease in critical path length. Since this decrease is small, if the CPU cycle time is reduced it may be necessary to remain with a seven stage pipeline. There is a possibility that further gains may be possible by tailoring the drivers and receivers of the chips to take advantage of the reduced load capacitances.

A final 3-D stacking solution would be to incorporate all of the core CPU and cache chips in a single stack. The benefits of this arrangement over the three stack arrangement are difficult to quantify, and depend largely on the quality of the inter-chip route.

Specifically, the calculations for two and three chip stacks were based on the assumption that signals needed up to 50 ps to traverse the stack. This is based on the conjecture that interposer routing will be required and that the dielectric used is equivalent to that used in the "conventional" MCM. If the pad locations on the various die are optimized (and multiple layouts of each die type are economically feasible), it is possible that these "vertical" distances can be traversed much more quickly.

In addition, if all of the chips are in the same stack, it will probably be possible to eliminate I/O drivers and receivers completely, replacing them with superbuffers as required. As shown in Table 5.1, it may be possible to reduce the cache cycle time to 995 ps. Of course, once this cycle time is reduced to that level, other paths in the cache may become critical. The most important such path is the comparator path, in which the address, once it arrives at the cache controller, is used to access the tag RAM. Once the tag RAM is read, the tag is compared to the address from the CPU.

FIGURE 5.19: CACHE CRITICAL PATHS

Figure 5.19 shows the primary critical path as dark lines, and the comparator critical path as dashed lines. The comparator critical path is limited to approximately 2.5 ns (the exact time depends on which cache is involved).
                                   Planar MCM    Single Chip Stack
Address I/O (datapath):            145           45
Address Transfer (DP to CC):       170           < 10
Address I/O (CC):                  334           90
Tag RAM Access Time:               500           500
Comparator, MUX, Latch Time:       1000          1000
MISS Transfer:                     120-200       < 50
Total:                             2350          1695

TABLE 5.2: SECONDARY CRITICAL PATH BREAKDOWN (ALL DELAYS IN PS)

Table 5.2 gives the path breakdown for this sub-critical path in the current F-RISC/G implementation. Most of the path delay is caused by on-chip logic in the cache controller.

If the primary critical path is reduced through chip stacking to around 1 ns, this secondary critical path length must be reduced as well, or no benefit is gained. In the single chip stack implementation, the time could probably be reduced to under 1700 ps. Hand crafting and optimizing the layout of the comparator could shave off perhaps another clock phase or so. Still more time can be saved by re-partitioning the cache pipeline (5.1.3 Pipeline Partitioning).

  5.6 Improved Virtual Memory Support

The manner in which virtual memory is supported in the F-RISC / G prototype is inefficient, largely due to compromises made in the cache design. Due to cost, power, and timing constraints, it was impossible to implement a translation lookaside buffer in the primary cache. Doing so would enable the primary cache to perform virtual-to-physical address translations within the normal cache access time as long as the virtual address was in the cache.

Without this support, a higher level of cache memory must make the translation and perform page swapping as necessary. When a single thread (or, equivalently, several threads accessing the same page frame) is being executed, there is little difference between these techniques. When multiple threads, each accessing its own page frames, are being executed in a multi-tasking environment, the F-RISC/G prototype cache will perform very poorly. While the primary cache can store addresses from multiple pages, due to the small size of the cache it is likely that each time the processor switches tasks the entire cache will need to be swapped to the second cache level.

In the F-RISC/G prototype, the cache is a "virtual cache," meaning that virtual addresses, rather than physical addresses, are cached. As a result, each time the operating system switches processes, the virtual addresses in the cache will map to different physical addresses, resulting in a page fault. If each process is given the same range of virtual addresses to work with, then in order to switch processes the operating system must flush the entire cache (via the IOCTRL mechanism, which is, in itself, very inefficient). While the data cache could be flushed with 32 consecutive LOAD's or STORE's, flushing the instruction cache without external hardware intervention would require 497 cycles.

An alternative would be to have external hardware monitor the IOCTRL lines and execute the cache initialization routine which would invalidate the entire cache in far less time.

  5.7 CAD Improvements

Given a fixed technology, one would expect the quality and capabilities of the CAD tools to have a comparatively minor effect on the overall quality of the design. This is only true insofar as the shortcomings of poor CAD tools can be overcome by throwing sufficient manpower at the design.

The cache RAM and cache controller went through a slightly different design process than that described by Philhower in [Phil93]. Changes in the technology (the addition of a third layer of metallization, design rule changes, and the like) as well as tighter timing constraints resulted in much more of the work being done by hand.

Far more use was made of 2-D and 3-D capacitance extraction tools and SPICE in the design of the cache chips than was used in the core CPU. The Cutter program [Loy93] was modified to reduce the amount of human interaction required in producing matched pair differential routes. Unlike previous chips, which were essentially computer routed but hand "cut," the cache controller and cache RAM depended heavily on hand routing, but were cut automatically.

Based on SPICE and back-annotated simulations, the overall quality of the route in the cache chips was higher than in the core CPU chips (a necessity given the tight cache timing and large quantity of cache chips.)

Given more time or better CAD tools, however, it would have been possible to achieve large speed gains on the cache controller. The 32 bit tag RAM block, for example, is simply a waste of chip area and power.

One layer of XOR gates from the comparator could have been included in the cache RAM block, consolidating space and decreasing cycle time, if there had been need to do so. The pipeline block is another area where hand crafted layout would have been far superior to the layout produced by the VTITools placer and router, and where gains in speed would have been possible.

Much of this hand crafting was considered at various points of the design, and rejected when it was determined that the design would be sufficiently fast without it.

Another area in which the CAD tools could stand some improvement is in the area of trace simulation. While the DineroIII simulator is a workable tool, the traces which were used are suspect. It is hoped that the version of F-RISC which is being implemented in FPGA's will eventually provide a memory profiling capability, essentially allowing the user to run real programs on it (at greatly reduced speed) while it gathers cache access statistics which can be fed back into DineroIII.

The initial stages of the design effort were hampered by the lack of a high-level description of the cache. An incomplete Verilog model was available, but the design group didn't have a license for the Verilog software. Moreover, the model was based on incorrect assumptions regarding device and interconnect timing, an incorrect MCM floorplan, and an out-of-date cache architecture. Even its assumptions regarding CPU operation were in some cases incorrect, since modifications had been made to the design which had little impact on CPU operation but which were critical for the cache design given the latest information regarding slower-than-expected devices and interconnect.

The procedure used in the design of the cache controller was particularly complicated since it was left to this chip to interface correctly with the CPU, whose design was already frozen; with the cache RAMs, which, due to their quantity on the MCM, could not incorporate additional circuitry to make interfacing simpler; and with the secondary cache, whose design and even technology were still undefined.

Using the VTITools schematic capture and digital simulation tools it was possible to design the chip in such a way that communications with the cache RAMs were well tested. Communicating with the CPU was more difficult, since there was no way to run the behavioral model and the designer was not available. The FPGA emulation project provided a partial solution by allowing phase-accurate simulations of CPU operation.

An additional difficulty with the CAD tools is that they did not allow the circuitry to be re-simulated after extracting interconnect resistances. The instruction decoder and datapath designs were completed based on the assumption that transmission line delays dominated and RC delays were negligible. It wasn't until an investigation of slower-than-expected interconnect on test wafers was undertaken that the difficulties imposed by RC delays were fully understood. This occurred after the cache controller and cache RAM were placed and routed, and while the process of shrinking wires to meet new, more aggressive design rules was underway. After much effort a method was found to force the simulator to take RC delays into account, and the chips were modified to account for these additional delays.

  5.8 Clocking

One of the most difficult aspects of the cache design was the problem of clocking. Aside from the fact that clock skew can interfere with pipeline operation, if there are too few clock phases per cycle the design must rely heavily on routing and buffer delays to provide intermediate clocks.

The problems of coarse clock phasing are illustrated by the differences between the F-RISC/G core CPU design and the cache controller design. Since the core CPU was designed before the cache controller, most of its I/O signals are timed to be expected or sent on one of the four clock phases. While "early" and "late" versions of these clocks were available, their exact timing depended heavily on on-chip placement and routing, since they were created by delaying the clock phases through chains of buffers.

Because it was difficult to profile the timing of these signals accurately and to replicate that timing on other chips, the cache controller I/O timing was needlessly complicated. Furthermore, relying on such a coarse clock often wasted time.

FIGURE 5.20: COARSE CLOCKING

Figure 5.20 illustrates this problem. Ideally, a latch would be clocked as soon as possible after the latest time at which its data can become valid; any additional delay before the latch is clocked adds directly to cycle time.

While the use of a finite number of clocks will usually mandate that at least some latches in a design will experience this sort of clock lag, on critical paths it is necessary to minimize this to the degree possible.
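The penalty can be estimated directly. Assuming a 1 ns cycle divided into four equally spaced phases, a latch cannot close until the first edge at or after the moment its data become valid; the data-valid times in this sketch are purely illustrative.

    #include <stdio.h>
    #include <math.h>

    /* Wasted latch time under coarse clocking: the latch cannot close
     * until the next available clock edge after its data become valid.
     * A 1 ns cycle with four phases gives edges every 250 ps.          */
    static double clock_lag(double data_valid_ps, double edge_spacing_ps)
    {
        double next_edge = ceil(data_valid_ps / edge_spacing_ps) * edge_spacing_ps;
        return next_edge - data_valid_ps;
    }

    int main(void)
    {
        double valid_times[] = { 260.0, 510.0, 740.0 };  /* illustrative */
        size_t i;

        for (i = 0; i < sizeof valid_times / sizeof valid_times[0]; i++)
            printf("data valid at %5.1f ps -> %5.1f ps of clock lag\n",
                   valid_times[i], clock_lag(valid_times[i], 250.0));
        return 0;
    }

For example, data that become valid at 260 ps must wait until the 500 ps edge, wasting 240 ps, while data valid at 740 ps wait only 10 ps; a finer clock (or a "single wire" clock, discussed next) shrinks this worst-case lag.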

Using a "single wire" clock in future designs may be desirable. This would force more rigorous control over clocking through balanced buffer trees, but would reduce the amount of power required in these trees considerably.

  1. Radio Frequency Data Memory

Due to the high speed of the F-RISC/G processor, the chip set can be adapted to many uses besides general-purpose processing; it is necessary only to interface external devices to the cache through the SRAM chips and to load the appropriate program into the processor. The SRAM chip itself could likewise be used in a wide variety of digital systems: aside from its extremely fast access time, the cache RAM's ability to multiplex between a 64-bit and a 4-bit bus could be useful as well.

In addition, while the core of the CPU is synchronous to a high speed clock, the communications between the primary and secondary cache are asynchronous, meaning that a wide variety of external devices can be wired to the 512 bit data bus for special-purpose systems. In fact, if memory mapping is used, many such devices could be wired to the bus simultaneously.

FIGURE 5.21: RADIO FREQUENCY DATA MEMORY INTERFACING

One possible use of this bus is in systems which perform active analysis and filtering of high speed (radio frequency) electromagnetic waveforms. A system has been proposed where high speed analog-to-digital converters would be used to sample an incoming radar signal. The digitally sampled waveforms would be transferred through the L2 data bus into the primary cache, and from there into the CPU. The CPU would filter and transform this data and send it back through the primary cache onto the L2 bus where it would be received by digital-to-analog converters and amplifiers which would be capable of producing radar waveforms which cancel the incoming radar signal. Such a system would have broad military use, enabling aircraft to actively cancel incoming radar so that there is no net reflection returned to the radar broadcast station.

Figure 5.21 shows how high speed A/D and D/A converters could be interfaced to the L2 data bus. The F-RISC/G CPU is fast enough that one would expect the returned radar signature of an aircraft so protected to be very small. The system can also be used for radio communications at extremely high frequencies. Several mechanisms exist within the cache design to allow this type of system to be implemented without any modifications to F-RISC/G.

Aside from the asynchronous nature of the interface between the primary and secondary caches, which allows arbitrary devices to be connected to the bus, the copyback mechanism in the primary cache can be disabled. If copyback were left enabled, it would take twice as long to force STOREd data onto the L2 bus. Copyback is disabled by setting the IOCNTRL bit field appropriately during the STORE.

Additionally, it is possible to override the comparator on the cache controller and force a LOAD miss to occur; this is necessary to perform LOADs from the A/D converter. This is also accomplished by properly setting the IOCNTRL bit field.
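At the level of the program running on the CPU, the inner loop of such a system might look like the sketch below. The mapped converter addresses, the filter routine, and the way the IOCNTRL settings are attached to individual LOADs and STOREs are all hypothetical placeholders, since those details are specific to the instruction set and system design.

    #include <stdint.h>

    /* Hypothetical memory-mapped addresses for the converters on the
     * L2 data bus; the real values would depend on the system design. */
    #define ADC_PORT ((volatile uint32_t *)0x80000000u)
    #define DAC_PORT ((volatile uint32_t *)0x80000010u)

    /* Placeholder for the signal-processing step run on each sample. */
    static uint32_t filter(uint32_t sample)
    {
        return ~sample;              /* stand-in: invert the waveform */
    }

    /* Inner loop of the proposed canceller.  In the real system every
     * LOAD would carry an IOCNTRL setting that forces a cache miss
     * (so the A/D is always read), and every STORE an IOCNTRL setting
     * that disables copyback (so data go straight onto the L2 bus).   */
    void cancel_loop(void)
    {
        for (;;) {
            uint32_t sample = *ADC_PORT;     /* forced-miss LOAD        */
            *DAC_PORT = filter(sample);      /* copyback-disabled STORE */
        }
    }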

  1. Conclusions

The nominal 50 GHz Rockwell HBT process is capable of serving as the basis of a 1 ns cycle time CPU with a limited instruction set and limited complexity. To get the best use of these devices, the cache design must be kept as simple as possible. A Harvard architecture with 2 kB per cache, copyback, and a single line per set (direct mapped) is sufficient to achieve an overall CPU CPI of 1.86 given the available trace data.

Future efforts would benefit from more precise trace data and, hopefully, greater device integration levels. Problems in the CPU microarchitecture and design, particularly the inclusion of the Remote Program Counter and the inability of the datapath to put missed addresses on the bus at the proper time, greatly complicated and slowed the cache design. Without these problems a more sophisticated cache architecture would have been possible; of particular interest is the column-associative cache, which by itself could have lowered overall CPI to 1.80. It is also possible that a two-way set-associative cache could have been implemented, lowering overall CPI to 1.76.
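Since single-node throughput at a fixed 1 ns cycle time scales inversely with CPI, the benefit of these alternatives follows directly from the figures above:

    CPI(direct) / CPI(column-associative) = 1.86 / 1.80 = 1.03
    CPI(direct) / CPI(two-way)            = 1.86 / 1.76 = 1.06

that is, roughly a 3% and 6% improvement in throughput, respectively, with no change in cycle time.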

Among the greatest problems facing future designers is the increasing importance of interconnect delay. Exotic 3-D chip stacks and other packaging schemes may buy enough cycle time to support another generation or two of F-RISC designs, but eventually increasing device speed will bring no benefit because interconnect delay will dominate; a single-chip implementation will then be necessary.

The cache RAM test scheme has wide applicability to the class of testing problems involving high-speed, moderate pin-out circuits. Skew is minimized by the reverse clocking (rather than broadcast) scheme used, but as clock speeds and the number of pads to be speed-tested increase, skew will eventually limit the resolution of these tests.


REFERENCES

[Agar93] Agarwal, A. and S. D. Pudar. "Column-associative caches: A technique for reducing the miss rate of direct-mapped caches," Proc. 20th Annual International Symposium on Computer Architecture, San Diego, Calif., May 16-19, 1993. Computer Architecture News, Vol. 21, No. 2, pp. 179-190, May 1993.

[Beac88] Beach, W. F. and T. M. Austin. "Parylene as dielectric for the next generation of high density circuits," Proceedings of the 2nd International SAMPE Electronics Conference, June 14-16, 1988, pp. 25-45.

[Bens95] Benschneider, Bradley J., A. J. Black, W. J. Bowhill, S. M. Britton, D. E. Dever, et al., "A 300-MHz 64-b quad-issue CMOS RISC microprocessor," IEEE Journal of Solid-State Circuits, Vol. 30, No. 11, Nov. 1995, pp. 1203-1214.

[Casc91] Cascade Microtech, Incorporated. "Multicontact high-speed integrated circuit probes." Beaverton, Oregon, 1991.

[Chan92] Chang, H. and J. A. Abraham. "Delay test techniques for boundary scan based architectures" IEEE 1992 Custom Integrated Circuits Conference, pp 13.2.1-13.2.4, 1992.

[Dabr93] S. Dabral, X. Zhang, X. M. Wu, G.-R. Yang, L. You, H. Bakhru, R. Olson, J. A. Moore, T.-M. Lu, and J. F. McDonald, "α,α,α',α' Poly-tetrafluoro-p-xylene as an interlayer dielectric for thin film multichip modules and integrated circuits," Journal of Vacuum Science and Technology, B 11(5), Sep/Oct 1993.

[Deve91] Devore, Jay S. Probability and Statistics for Engineering and the Sciences, Third Edition. Pacific Grove, California. Brooks / Cole Publishing, 1991.

[Dill88] Dillinger, T. E. VLSI Engineering, pp. 624-693. Englewood Cliffs, New Jersey: Prentice Hall, 1988.

[Faus95] Faust, Bruce. "Designing Alpha-based systems." Byte Magazine, pp. 239-240, June 1995

[Fris95] A. Frisch, M. Aigner, T. Almy, H. Greub, M. Hazra, S. Mohr, N. Naclerio, W. Russell and M. Stebnisky, "Supplying Known Good Die for MCM Applications using Low Cost Embedded Testing," IEEE International Test Conference, Washington DC, October 23-25, 1995.

[GE95] G.E. Corporate Research & Development Advanced Electronics Assemblies Program, "Microwave High Density Interconnect Design Guide." February 1995

[Greu90] Greub, H. J. "FRISC - A fast reduced instruction set computer for implementation with advanced bipolar and hybrid wafer scale technology." Ph.D. dissertation, Rensselaer Polytechnic Institute, Troy, New York, December 1990.

[Greu91] Greub, H. J., et al. "High-performance standard cell library and modeling technique for differential advanced bipolar current tree logic." IEEE Journal of Solid-State Circuits, Vol. 26, No. 5, pp. 749-762, May 1991.

[Hall93] Haller, T. R., et al. "High frequency performance of GE high density interconnect modules." IEEE Transactions on Components, Hybrids, and Manufacturing Technology, Vol. 16, No. 1, pp. 21-27, February 1993.

[Henn96] Hennessy, J. L., and D. A. Patterson. Computer Architecture: A Quantitative Approach, Second Edition. San Mateo, California: Morgan Kaufmann, 1996.

[Hill84] Hill, Mark D. and Alan Jay Smith. "Experimental evaluation of on-chip microprocessor cache memories," Proc. Eleventh International Symposium on Computer Architecture, June 1984, Ann Arbor, MI, 1984.

[Kilb62] Kilburn, T., D. B. G. Edwards, M. J. Lanigan, and F. H. Sumner. "One-Level Storage System," IRE Transactions on Electronic Computers, Vol. EC-11, No. 2, pp. 223-236, April 1962.

[Lev95] Lev, Lavi A., A. Charnas, M. Tremblay, A. R. Dalal, B. A. Frederick, et al., "A 64-b microprocessor with multimedia support," IEEE Journal of Solid-State Circuits, Vol. 30, No. 11, Nov. 1995, pp. 1227-1236.

[Long90] Long, S. I., S. E. Butner. Gallium Arsenide Digital Integrated Circuit Design, New York, McGraw-Hill Publishing Company, 1990.

[Loy93] Loy, J. R. "Managing Differential Signal Placement." Ph.D. Thesis, Rensselaer Polytechnic Institute, August 1993.

[Maie94] Maier, C. "A testing scheme for a sub-nanosecond access time static RAM" Masters Thesis, Rensselaer Polytechnic Institute, 1994.

[Maji89] Majid, N., Dabral, S., and J. F. McDonald. "The parylene-aluminum multilayer interconnection system for wafer scale integration and wafer scale hybrid packaging." Journal of Electronic Materials, Vol. 18, No.2, pp. 301-311, 1989.

[Matt70] Mattson, R. L., J. Gecsei, D. R. Slutz, and I. L. Traiger. "Evaluation techniques for storage hierarchies." IBM Systems Journal, 9, pp. 78-117, 1970.

[Maun86] Maunder, C. "Paving the way for testability standards." IEEE Design and Test of Computers, Vol. 3, No. 4, p. 65, 1986.

[Maun92] Maunder, C. M. and R. E. Tulloss. "Testability on TAP." IEEE Spectrum, pp. 34-37, February 1992.

[Nah91] Nah, K., R. Philhower, J. S. Van Etten, S. Simmons, V. Tsinker, J. Loy, H. Greub, and J. F. McDonald. "F-RISC/G: AlGaAs/GaAs HBT standard cell library," Proc. 1991 IEEE International Conference on Computer Design: VLSI in Computers & Processors, pp. 297-300, 1991.

[Nah94] Nah, K. "An adaptive clock deskew scheme and a 500 ps 32 by 8 bit register file for a high speed digital system" Ph. D. Dissertation, Rensselaer Polytechnic Institute, 1994.

[Phil93] Philhower, B. "Spartan RISC architecture for yield-limited technologies" Ph.D. Dissertation, Rensselaer Polytechnic Institute, 1993.

[Przy90] Przybylski, S. A. Cache and Memory Hierarchy Design: A Performance-Directed Approach. San Mateo, California: Morgan Kaufmann, 1990.

[Salm93] Salmon, Linton G. "Evaluation of thin film MCM materials for high-speed applications." IEEE Trans. On Components, Hybrids, and Manufacturing Technology, Vol. 16, No. 4, June 1993.

[Ston90] Stone, Harold S. High Performance Computer Architecture, Second Edition. Reading, Massachusetts. Addison-Wesley, 1990.

[Sze81] Sze, S. M. Physics of Semiconductor Devices. Second Edition, pp. 182-3, New York: John Wiley and Sons, 1981.

[Sze90] Sze, S. M. High-Speed Semiconductor Devices. pp 371-373, New York: John Wiley and Sons, 1990.

[Tien95] Tien, C-K. "System design analysis, implementation, and testing of a 32-bit GaAs microprocessor" Doctoral Thesis, Rensselaer Polytechnic Institute, 1995.

[Webe92] Weber, S. "JTAG finally becomes an off-the-shelf solution." Electronics, Vol. 65, No. 9, p. 13, 10 August 1992.

[Zhan95] Xin Zhang, "Parylene as an interlayer dielectric," Ph. D. Dissertation, Rensselaer Polytechnic Institute, 1995.


APPENDIX A

Dinero Simulation Results

The figures given are the number of words transferred between the primary and secondary cache for each of the three traces (Spice, Tex, and gcc).

Harvard Architecture, Copyback, 2kB caches

1 Line Per Set
Block/Bus Width (Bytes)   Spice        Tex          gcc
1                         52842        22579        100393
2                         104348       23049        199864
4                         207449       23843        399011
8                         256356       25770        480230
16                        353560       33820        634696
32                        493384       61232        943728
64                        1024032      149872       1597648
128                       2122208      527808       3070464
256                       4865472      3332352      5390080
512                       13928576     11030912     12171008
1024                      34056192     34411264     38321920
2048                      87734272     91727872     75351040

2 Lines Per Set
Block/Bus Width (Bytes)   Spice        Tex          gcc
1                         42253        22430        91522
2                         83231        22531        172915
4                         165231       22701        363568
8                         193898       23148        425360
16                        251996       24936        543240
32                        374264       32056        773048
64                        701456       61856        1254544
128                       1380576      205408       2318592
256                       2593664      2499968      4942464
512                       6492672      8878080      12083968
1024                      22738688     26765824     32429568

4 Lines Per Set
Block/Bus Width (Bytes)   Spice        Tex          gcc
1                         34069        22415        88816
2                         66934        22472        176659
4                         132665       22582        352652
8                         154262       22614        411748
16                        205496       22684        521936
32                        319656       22832        731784
64                        545904       22976        1170800
128                       1109696      23296        2063808
256                       2578752      625984       4110656
512                       5939968      4323968      10512768

8 Lines Per Set
Block/Bus Width (Bytes)   Spice        Tex          gcc
1                         32006        22415        88483
2                         62804        22472        175994
4                         124453       22582        351304
8                         144724       22614        388366
16                        186544       22684        518104
32                        282600       22832        721928
64                        516208       22976        1140688
128                       1107616      23296        2011840
256                       2639296      24512        3935936

16 Lines Per Set
Block/Bus Width (Bytes)   Spice        Tex          gcc
1                         31732        22415        88599
2                         62252        22472        176250
4                         123408       22582        351719
8                         144220       22614        408474
16                        186664       22684        513104
32                        252912       22832        718080
64                        470992       22976        1123520
128                       1075168      23296        1978400

32 Lines Per Set
Block/Bus Width (Bytes)   Spice        Tex          gcc
1                         31528        22415        88859
2                         61837        22472        176795
4                         122534       22582        352938
8                         142198       22614        411102
16                        184884       22684        512732
32                        246304       22832        712792
64                        484320       22976        1132080

Unified Architecture, Copyback, 4kB

1 Line Per Set
Block/Bus Width (Bytes)   Spice        Tex          gcc
1                         54838        22801        95467
2                         108827       23886        190326
4                         216886       25843        381089
8                         291244       30420        469854
16                        439760       131876       638220
32                        740968       262264       957824
64                        1349840      576192       1660800
128                       2841856      1709344      3356032
256                       6415424      4209216      7978240
512                       17354880     13380992     21006976
1024                      51411200     36640512     61892608
2048                      177528832    13999714     184351232

2 Lines Per Set
Block/Bus Width (Bytes)   Spice        Tex          gcc
1                         41326        22458        83023
2                         81566        22557        165189
4                         162070       22753        330992
8                         212236       22800        405584
16                        297120       25052        543720
32                        448616       30488        788312
64                        824048       54016        1285872
128                       1675936      215936       2341664
256                       3905600      2283520      5244928
512                       10390272     8999936      12749568
1024                      26640384     30102016     35143424
2048                      91744256     91728384     98295808

4 Lines Per Set
Block/Bus Width (Bytes)   Spice        Tex          gcc
1                         32835        22458        79473
2                         64529        22557        158071
4                         127950       22753        316486
8                         162080       22800        386080
16                        219480       22904        515068
32                        320480       23048        742096
64                        611200       23248        1181696
128                       1181056      34688        2060576
256                       2841024      1891264      4023040
512                       6770944      8523776      9860992
1024                      19251712     26764544     28868864

8 Lines Per Set
Block/Bus Width (Bytes)   Spice        Tex          gcc
1                         25929        22458        77033
2                         51233        22557        153218
4                         112836       22753        306939
8                         131724       22800        376198
16                        169352       22904        504544
32                        251792       23048        725352
64                        529088       23248        1152224
128                       1143296      23648        2001856
256                       2616384      24768        3808832
512                       5682048      5272064      8806784

16 Lines Per Set
Block/Bus Width (Bytes)   Spice        Tex          gcc
1                         28464        22458        76017
2                         55721        22557        151210
4                         110246       22753        302962
8                         125826       22800        371690
16                        157676       22904        501204
32                        225168       23048        721432
64                        454576       23248        1144369
128                       1081728      23648        1988224
256                       2626944      24768        3774086

32 Lines Per Set
Block/Bus Width (Bytes)   Spice        Tex          gcc
1                         28536        22458        75808
2                         55873        22557        150787
4                         118572       22753        301938
8                         125318       22800        369144
16                        153820       22904        496376
32                        217952       23048        724320
64                        442704       23248        1147808
128                       1126432      23648        1979840

Harvard Architecture, Copyback, Direct Mapped

              Blocksize=32 Bytes             Blocksize=64 Bytes             Blocksize=128 Bytes
Memory Size   SPICE    TEX      GCC          SPICE    TEX      GCC          SPICE    TEX      GCC
32            3234984  2986000  3121744      -        -        -            -        -        -
64            2907976  1859144  2707496      5242960  4359408  5168928      -        -        -
128           2037192  1427176  2329128      4196672  2978896  4313488      8601728  6330720  8808064
256           1693272  1088936  1912816      2956720  2216784  3442576      6091104  4516864  7024992
512           1293424  863048   1525752      2267168  1833312  2682416      4519904  3400128  5333504
1024          875968   683472   1224848      1521872  1538016  2113744      3095424  2481472  4085024
2048          570440   61232    943728       1024032  149872   1597648      2122208  527808   3070464
4096          403304   41664    634376       685072   85424    1035584      1349408  272608   1974336
8192          207736   30496    429896       341984   53712    696752       640416   153120   1276480

APPENDIX B

Cache Controller Scan Chain
Signal                           Type of I/O      Normally Asserted
CPUMISS                          Driver           <2
STALLM                           Receiver         >1
WDC                              Receiver         >>1
VDA                              Receiver         <<1
CRADR4                           Driver           4
CRADR5                           Driver           4
CRADR6                           Driver           4
BRANCH                           Receiver         >>>1
CRADR7                           Driver           4
CRADR8                           Driver           4
ADR16                            Receiver         >>>1
ADR17                            Receiver         >>>1
ADR18                            Receiver         >>>1
ADR19                            Receiver         >>>1
ADR20                            Receiver         >>>1
ADR21                            Receiver         >>>1
ADR22                            Receiver         >>>1
ADR23                            Receiver         >>>1
ADR24                            Receiver         >>>1
ADR25                            Receiver         >>>1
ADR26                            Receiver         >>>1
ADR27                            Receiver         >>>1
ADR28                            Receiver         >>>1
ADR29                            Receiver         >>>1
ADR30                            Receiver         >>>1
ADR31                            Receiver         >>>1
ADR15                            Receiver         >>>1
ADR14                            Receiver         >>>1
ADR13                            Receiver         >>>1
ADR12                            Receiver         >>>1
ADR11                            Receiver         >>>1
ADR10                            Receiver         >>>1
ADR9                             Receiver         >>>1
ADR8                             Receiver         >>>1
ADR7                             Receiver         >>>1
ADR6                             Receiver         >>>1
ADR5                             Receiver         >>>1
ADR4                             Receiver         >>>1
ADR0                             Receiver         >>>1
ADR3                             Receiver         >>>1
ADR2                             Receiver         >>>1
ADR1                             Receiver         >>>1
CRADR0                           Driver           4
CRADR1                           Driver           4
CRADR2                           Driver           4
CRADR3                           Driver           4
CRDILTCH1                        Driver           <1
CRADR4                           Driver           4
CRADR5                           Driver           4
CRADR6                           Driver           4
CRADR7                           Driver           4
CRADR8                           Driver           4
CRADR8                           Driver           4
CRADR7                           Driver           4
CRADR6                           Driver           4
CRADR5                           Driver           4
CRADR4                           Driver           4
CRADR3                           Driver           4
CRADR2                           Driver           4
CRADR1                           Driver           4
CRADR0                           Driver           4
CRHOLD                           Driver           >4
CRDILTCH1                        Driver           <1
CRRECEIVE                        Driver           <<1
CRADR8                           Driver           4
CRADR8                           Driver           4
CRADR7                           Driver           4
CRADR6                           Driver           4
CRADR5                           Driver           4
CRADR4                           Driver           4
CRADR3                           Driver           4
CRADR2                           Driver           4
CRADR1                           Driver           4
CRADR0                           Driver           4
CRHOLD                           Driver           >4
CRDILTCH1                        Driver           <1
CRRECEIVE                        Driver           <<1
CRRECEIVE                        Driver           <<1
CRWIDE                           Driver           <<1
CRWIDE                           Driver           <<1
ACK                              Receiver         >4
L2ADR0                           Driver           >>3
L2ADR1                           Driver           >>3
L2ADR2                           Driver           >>3
L2ADR3                           Driver           >>3
L2ADR4                           Driver           >>3
L2ADR5                           Driver           >>3
L2ADR6                           Driver           >>3
L2ADR7                           Driver           >>3
L2ADR8                           Driver           >>3
L2ADR9                           Driver           >>3
L2ADR10                          Driver           >>3
L2ADR11                          Driver           >>3
L2ADR12                          Driver           >>3
L2ADR13                          Driver           >>3
L2ADR14                          Driver           >>3
L2ADR15                          Driver           >>3
L2ADR16                          Driver           >>3
L2ADR17                          Driver           >>3
L2ADR18                          Driver           >>3
L2ADR19                          Driver           >>3
L2ADR20                          Driver           >>3
L2ADR21                          Driver           >>3
L2ADR22                          Driver           >>3
L2VDA                            Driver           >>1
L2MISS                           Driver           >>1
L2DIRTY                          Driver           >>2
L2SYNCH                          Driver           >>1
IS_DCC?                          Receiver         Static
IOCNTRL0                         Receiver         >>1
IOCNTRL1                         Receiver         >>1
CRWRITE0                         Driver           <2
CRWRITE1                         Driver           <2
SAMPLE CLOCK CONTROL             Configuration
SAMPLE DELAY MUX SELECT 2        Configuration
SAMPLE DELAY MUX SELECT 1        Configuration
SAMPLE DELAY MUX SELECT 0        Configuration
SAMPLE PHASE MUX SELECT 1        Configuration
SAMPLE PHASE MUX SELECT 0        Configuration
PRESAMPLE OVERRIDE WAIT          Configuration
PRESAMPLE DELAY MUX SELECT 2     Configuration
PRESAMPLE DELAY MUX SELECT 1     Configuration
PRESAMPLE DELAY MUX SELECT 0     Configuration
PRESAMPLE PHASE MUX SELECT 1     Configuration
PRESAMPLE PHASE MUX SELECT 0     Configuration
(Scan in)