Chapter Eight


Beyond F-RISC/G

 

It has become customary for members of the F-RISC group to impart a few suggestions for the next generations of F-RISCers to guide them in the development and implementation of future architectures. Typically, this advice has been related to architectural issues but there is also a need for suggestions that are closer to the physical levels of design. In this chapter, a brief review of what has transpired is presented along with some parting words of wisdom that may light a path for future designs.

Introduction

The F-RISC/G design began in 1990 with the expectation that the design, implementation and fabrication processes would be completed within three to four years. In hindsight, the goals were achievable but the fabrication technology was not on par with what the original designers expected and what the vendor promised. For this reason alone, the design process has required extensive rework and a whole new generation of F-RISCers. One benefit of working on someone else’s design is that you get to see what they were trying to do and how you could do it differently or better (e.g. you can complain about their design and promote something completely different without having to make it work). To this end I’ve included some comments regarding the work I’ve done and some of the future directions which have been floating around the group.

Multi-ported Register Files

Most processors make use of a register file with multiple READ and WRITE ports to allow several operations to occur simultaneously. On the other hand, the F-RISC/G design uses a single-port memory in order to keep the total device count low (yield was one of the most serious considerations at the start of the project). While device count was a serious issue then and is so today, in retrospect the use of a single-port circuit has had serious repercussions. The problem with a single-port design is that the READ operations must be completed within a fraction of the cycle time. Tremendous effort was required to drop the access time below 200 ps and also allow for some safety margin, which at times didn’t seem possible. If a multi-port design were used, each operation would have almost 1 ns to complete.

The number of ports is one of the questions the architects of the next generation will have to answer. Certainly three ports (two READ, one WRITE) are desirable but perhaps more could be put to good use. However, the implementation of multi-ported register files within the three-level metallization Rockwell process presents problems that need to be addressed. The most pressing problems are the sheer size of a multi-port memory cell and the parasitic capacitance of the bit and word lines. With the restriction that only one metal layer can cross over devices, adding more ports increases the size of the cell by about 6-10 µm in width and about 7 µm in height, or about 20% in width and ~14% in height when compared to the single-port cell.

Triple-ported Register File Analysis

The use of a multi-port memory circuit does have some drawbacks in terms of both circuit design and physical layout. The circuit design becomes more complex simply due to the additional ports and requires more investigation of the interaction between circuit components. Figure 8.1 shows a three-port memory cell schematic from [CHANG87] that is somewhat similar to the F-RISC/G memory cell.

The addition of two more READ ports is accomplished through the use of emitter-followers between the memory cell collector nodes and the bitlines. The WRITE mechanism is also different, using two discrete devices to adjust the collector voltages rather than one dual-emitter transistor. Because the F-RISC/G single-port memory circuit is not conducive to multi-port operation, it is somewhat difficult to draw conclusions regarding the performance of a multi-port circuit. However, most of the circuit changes should not be extensive and the multi-port circuit would most likely behave similarly to the single-port. Based upon this assumption, the single-port circuit is used with the parasitics of a multi-port memory cell to investigate the difference in performance.

Figure 8.1- Three-port memory cell (2 READ, 1 WRITE)

The layout of a multi-port memory is difficult due to the additional address, bit, and word lines. For comparison purposes, I have designed a three-port memory cell that has two READ ports and one WRITE port (Figure 8.2). The three-port cell is 92.5 µm × 57.0 µm while the original single-port memory cell is 59.1 µm × 51.1 µm, a difference of 56.5% × 11.5%. The READ bitlines have a 4.0 µm spacing between them and are each 2.0 µm wide (the same width as in the single-port design). The word lines are 6.0 µm wide and are minimally spaced based upon the width of the metal-2/metal-3 vias.
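The size comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch, treating the quoted cell dimensions as micrometres; the helper names are mine and are introduced only for illustration. The area figure it prints is derived from the quoted dimensions rather than taken from a measurement.

```python
# Cell dimensions (width, height) in micrometres, as quoted in the text.
THREE_PORT = (92.5, 57.0)
SINGLE_PORT = (59.1, 51.1)


def percent_increase(new, old):
    """Return the relative increase of new over old, in percent."""
    return (new - old) / old * 100.0


def area(cell):
    """Return the cell area in square micrometres."""
    width, height = cell
    return width * height


width_growth = percent_increase(THREE_PORT[0], SINGLE_PORT[0])
height_growth = percent_increase(THREE_PORT[1], SINGLE_PORT[1])
# Derived quantity: not stated in the text, computed from the dimensions.
area_growth = percent_increase(area(THREE_PORT), area(SINGLE_PORT))

print(f"width  +{width_growth:.1f}%")
print(f"height +{height_growth:.1f}%")
print(f"area   +{area_growth:.1f}%")
```

Under these numbers the three-port cell occupies roughly three quarters more area than the single-port cell, which is why the wire spacing used in the single-port layout cannot simply be carried over.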

Figure 8.2- Comparison of three-port (left) and single-port (right) memory cells

An analysis with QuickCap indicated that the bitline capacitance has increased by 45.3% to 340.5 fF while the wordlines have increased by 166.5% to 222.5 fF (see Table 8.1). The dramatic increases in capacitance are due simply to the addition of more bit and word lines, creating more crossovers and closer wire spacings. In order to achieve the 200 ps READ time, the single-port memory cell had large spacing between adjacent bit and word lines to keep the parasitic capacitance low. The three-port cell cannot afford this luxury without a tremendous increase in total area. However, simulations indicate that the significantly larger capacitances are not necessarily a problem.

Node                   Total Capacitance   Miller Capacitance
Bit A0                 337.1 fF            8.67 fF
Bit A1                 340.5 fF            7.27 fF
Bit B0                 314.5 fF            94.5 fF
Bit B1                 309.0 fF            90.3 fF
Bit C0                 381.5 fF            94.5 fF
Bit C1                 377.4 fF            90.3 fF
Word READ (Port A)     162 fF              94.5 fF
Word READ (Port B)     222.5 fF            94.5 fF
Word WRITE (Port C)    93 fF               30.5 fF

Table 8.1- Three-port memory cell parasitic capacitance values (32 X 8 array)
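As a cross-check between Table 8.1 and the percentages quoted earlier, the sketch below picks out the worst-case READ-port values and backs out the single-port capacitances they imply. It assumes that ports A and B are the READ ports and port C the WRITE port (following the word line labels), and that the quoted 340.5 fF and 222.5 fF are worst-case READ values. The single-port baselines it prints are derived from the stated percentages, not measured values, and the function names are illustrative only.

```python
# Table 8.1 values in femtofarads: node -> (total, Miller).
TABLE_8_1 = {
    "Bit A0": (337.1, 8.67),
    "Bit A1": (340.5, 7.27),
    "Bit B0": (314.5, 94.5),
    "Bit B1": (309.0, 90.3),
    "Bit C0": (381.5, 94.5),
    "Bit C1": (377.4, 90.3),
    "Word READ (Port A)": (162.0, 94.5),
    "Word READ (Port B)": (222.5, 94.5),
    "Word WRITE (Port C)": (93.0, 30.5),
}

# Assumption: A and B are READ ports, C is the WRITE port.
READ_BITLINES = ["Bit A0", "Bit A1", "Bit B0", "Bit B1"]
READ_WORDLINES = ["Word READ (Port A)", "Word READ (Port B)"]


def worst_total(nodes):
    """Return the largest total capacitance among the given nodes."""
    return max(TABLE_8_1[name][0] for name in nodes)


def implied_baseline(new_value, percent_increase):
    """Back out the original value from a new value and its percent increase."""
    return new_value / (1.0 + percent_increase / 100.0)


worst_read_bit = worst_total(READ_BITLINES)
worst_read_word = worst_total(READ_WORDLINES)

# Percent increases (45.3% and 166.5%) are the ones quoted in the text.
single_port_bit = implied_baseline(worst_read_bit, 45.3)
single_port_word = implied_baseline(worst_read_word, 166.5)

print(f"worst READ bitline:  {worst_read_bit:.1f} fF "
      f"(implied single-port {single_port_bit:.1f} fF)")
print(f"worst READ wordline: {worst_read_word:.1f} fF "
      f"(implied single-port {single_port_word:.1f} fF)")
```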

To evaluate the impact of the multiport cell upon the total performance, the bit and word line capacitance values for the new cell were used with the single-port register file SPICE simulation netlist. The triple-port circuit is nearly identical to the single-port with the exception of the memory cells. Although three sets of bit and word lines are now required, the same circuits may be used for the single-port sense amplifiers, read/write logic, and word decoders. Some modifications may be required to accommodate the emitter-follower circuits in the memory cells but should not present much of an obstacle or increased delay.

Simulations indicated that the READ access time rose from 189.4 ps to 239.1 ps, a 26.2% increase. The simulations did not include the new emitter-followers inside the memory cells which create the READ ports, but these would typically contribute 10-20 ps of delay, making the total READ time ~249-259 ps, or up to 36.7% larger. For a 1 ns cycle time, each memory access should have up to 750 ps or so for completion; consequently, the increased bit and word line capacitance does not have much impact.
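The timing argument can be laid out as a short calculation. This is only a sketch of the arithmetic using the figures quoted above; the 750 ps budget is the approximate value given in the text, and the variable names are mine.

```python
# Figures quoted in the text, in picoseconds.
SINGLE_PORT_READ_PS = 189.4
THREE_PORT_READ_PS = 239.1           # simulated, without emitter-followers
FOLLOWER_DELAY_PS = (10.0, 20.0)     # estimated emitter-follower delay range
READ_BUDGET_PS = 750.0               # approximate budget in a 1 ns cycle


def percent_increase(new, old):
    """Return the relative increase of new over old, in percent."""
    return (new - old) / old * 100.0


simulated_increase = percent_increase(THREE_PORT_READ_PS, SINGLE_PORT_READ_PS)

# Best and worst case once the emitter-follower delay is added.
best_total = THREE_PORT_READ_PS + FOLLOWER_DELAY_PS[0]
worst_total = THREE_PORT_READ_PS + FOLLOWER_DELAY_PS[1]
worst_increase = percent_increase(worst_total, SINGLE_PORT_READ_PS)

# Remaining margin against the per-access budget in the worst case.
slack = READ_BUDGET_PS - worst_total

print(f"simulated increase: {simulated_increase:.1f}%")
print(f"total READ time:    {best_total:.1f} to {worst_total:.1f} ps")
print(f"worst-case increase: {worst_increase:.1f}%")
print(f"worst-case slack:    {slack:.1f} ps")
```

Even in the worst case the access uses only about a third of the available time, so the larger parasitics are absorbed by the relaxed budget rather than by the circuit.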

Increased Metallization Layers

Time and time again, physical layout was constrained by the metallization rather than the devices themselves. The Rockwell process is currently limited to three levels of metallization with increasing via sizes. More troublesome is the restriction that only metal-3 is allowed to cross over devices. Many modern CMOS processes have four or more layers of interconnection, simplifying routing and layout. For our purposes, two layers of metal that can cross devices would be preferable for localized interconnection. Currently, most of the chip area is occupied by interconnection but with better metallization the area could be drastically reduced. In particular, a multi-ported register file would benefit from better metallization because the cell size could be greatly reduced which should also reduce the overall parasitics. Of course, routing wires over devices would also increase the parasitic capacitance of the nets but the greatly reduced wire length may offset this.

In addition to two local levels of interconnection, there should also be one or two global wire layers for routing across the chip. Because these nets tend to be long, the minimum feature size could be relaxed for these levels. In order to reduce the onset of RC effects the dielectric thickness between these layers and the next should be relatively large. This would then focus most of the capacitance between the two wires in a differential pair that could consequently be managed by increasing the spacing between wires.
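To illustrate why long global nets favor relaxed (wider) wires, the sketch below uses the common approximation that the 50% step delay of a distributed RC line is about 0.38 times the product of its total resistance and total capacitance. All numeric values are hypothetical and chosen only for illustration; they are not Rockwell process data. The sketch also assumes that the capacitance per unit length is unchanged when the wire is narrowed, which is a simplification.

```python
def distributed_rc_delay_ps(r_ohm_per_um, c_ff_per_um, length_um):
    """Approximate 50% delay of a distributed RC line (0.38 * R * C), in ps."""
    r_total = r_ohm_per_um * length_um            # ohms
    c_total = c_ff_per_um * length_um * 1e-15     # farads
    return 0.38 * r_total * c_total * 1e12        # seconds -> picoseconds


# Hypothetical per-unit-length values (illustration only).
R_WIDE = 0.02     # ohm/um for a relaxed-width global wire
R_NARROW = 0.04   # ohm/um: half the width, so twice the resistance
C_WIRE = 0.1      # fF/um, assumed the same for both widths

for length_um in (100, 1000, 5000):
    wide = distributed_rc_delay_ps(R_WIDE, C_WIRE, length_um)
    narrow = distributed_rc_delay_ps(R_NARROW, C_WIRE, length_um)
    print(f"{length_um:5d} um: wide {wide:8.4f} ps, narrow {narrow:8.4f} ps")
```

Because both the total resistance and the total capacitance grow with length, the delay grows with the square of the length. The penalty for a narrow wire is negligible on short local nets but becomes significant on nets that cross the chip.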

Finally, at least one metal layer (and preferably two) should be kept exclusively for power distribution. Some of the fixes to the F-RISC/G chips have required the use of metal-3, which creates conflicts with the power distribution network. By using one level exclusively for power, these conflicts are avoided and power may be routed directly to the distribution points where it would be fed down to the circuits themselves. The via sizes for upper metallization levels are often a concern but the regularity of the power network should help to offset any problems.

Improved Voltage-Controlled Oscillator Designs

The VCO was sort of an anomaly in terms of the F-RISC project. The circuit itself had no relation to the processor and was done merely as a vehicle to test the capabilities of the Rockwell process. In this effort it was successful and helped us design and implement circuits that performed better. It also provided us with valuable experience in several key areas that will help with clock generation and low-skew circuit design for future generations of F-RISC.

Sadly, however, there does not seem to be any place for the VCO itself in the F-RISC project except as a process monitor and test circuit. This is especially unfortunate due to the wide bandwidth capabilities of the VCO, from 13.66 GHz to 0.255 GHz (assuming that the untested divider circuit is functional). This circuit is definitely useful within a phase-locked loop (PLL) and could provide a frequency-tracking range that was previously unavailable, but the applications for such a circuit aren’t obvious. Most PLLs do not require such a broad bandwidth capability, although the tremendous increase in communications and networking may still provide a need and an opportunity for further research.

Given the difficulty in testing the VCO without an oscilloscope trigger signal, it is difficult to provide suggestions regarding improvements. In general, future work on the VCO should focus upon improving the frequency range of the circuit, especially for the high-gain differential output amplifier. This circuit required the most design time and is still a limiting factor due to the high output swing required. The novel XOR circuit is definitely useful outside of the VCO and consequently is used as a phase detector in the F-RISC/G clock deskew circuit. Although it was not investigated directly, the phase detection capabilities of the XOR should be examined more closely for application in PLLs. With the simplicity and perfect symmetry of the XOR, it is a good phase detector with very high sensitivity to phase error. However, an analytical comparison to the dual Gilbert-multiplier presented in Chapter 2 is needed before the characteristics of the XOR can be stated with absolute certainty.

Conclusions

The development of F-RISC/G has taken significantly more time than anticipated, due primarily to deficiencies both within the fabricated devices and within our design tools. We have no capabilities to modify the process, so we must make up for the shortcomings with better designs that require better design methodologies and tools. This requirement is not unique to F-RISC. Indeed, as circuit speeds in general rise, so do the requirements placed upon the designers. Many of the high-speed processors today require the talents of hundreds of circuit designers and many more layout artists who analyze, optimize and often hand-craft the circuits. To gain a competitive edge, they must recognize the sources of delay in a circuit and compensate for them.

The tremendous increase in device speed has contributed to the need for better interconnection analysis. As devices get faster, the delay due to interconnection becomes larger in comparison and must be either minimized or compensated to obtain higher performance. A number of designs attempt to hide the interconnection delays through the use of wave pipelining, in which a new signal is applied before the old one has reached its destination. While this approach is feasible, it is also more difficult to design, especially in a poorly characterized process.

Minimizing the interconnection delay and its effect upon the circuit requires highly accurate CAD tools and good process models. During my tenure within the F-RISC group, we have acquired tools with much more accuracy and have developed models from experimental measurements that we believe are rather accurate (although to date we’ve been proven wrong more often than right). The most important component in determining the interconnection delay is the parasitic capacitance, which itself is determined by the dielectric parameters. Now that we have a highly accurate capacitance estimation tool in QuickCap, it is time to focus more upon the dielectric properties and develop better test structures to measure them.

Future generations of F-RISC will rely heavily upon the physical implementation of the circuits. The current wiring restrictions definitely limit the performance of F-RISC simply because the majority of the chip area is relegated to wiring. Reductions in minimum feature sizes will also help reduce the area consumed by wiring but to a limited extent. Longer nets will not be able to take advantage of smaller width requirements due to the onset of RC delay effects. If the interconnections (including vias) could be routed on top of devices, this would allow the devices to be packed much closer together. This in turn would reduce the area of standard cells probably by 50% on average before wire routing again becomes a limiting factor. The use of wiring channels could be drastically reduced with relaxed wiring restrictions but probably not completely eliminated using only two levels of interconnection. With the use of more efficient layout tools and better interconnection analysis capabilities, the amount of performance which has previously been lost due to the physical implementation should decrease and help future versions of F-RISC achieve even higher levels of performance.