Critical ALU Path Optimization and Implementation in a BiCMOS Process for Gigahertz Range Processors
By
Matthew W. Ernest
A Thesis Submitted to the Graduate Faculty of Rensselaer Polytechnic Institute in Partial Fulfillment of the Requirements of the Degree of
Doctor of Philosophy
Major subject: Electrical Engineering
Approved:
John F. McDonald, ECSE
Committee Chair
Mukkai Krishnamoorthy, CSCI
Committee Member
Michael Savic, ECSE
Committee Member
Paul Schoch, ECSE
Committee Member
Rensselaer Polytechnic Institute
Troy, New York
December 2002
© Copyright 2002, Matthew W. Ernest
All Rights Reserved
1.3 Parallel circuits and prefix computation
1.4 Silicon Germanium bipolar and BiCMOS for high-speed processors
1.5 The DARPA2 and SMI00 Reticles
Chapter 2: Addition as a Parallel Prefix Problem
2.3 Depth/Size tradeoff in prefix circuits
2.4 The Carry is a Prefix Operation
Chapter 3: Digital Circuit Design with Bipolar Transistors and Current Steering Logic
3.2.1 Series Gating and Emitter Followers
3.2.3 An Issue of Nomenclature
3.3 Bipolar Circuits and Designing Logic for Speed
3.3.2 Latency versus Bandwidth
3.4 Noise margin and voltage swing
3.5 Device Sizing in Loaded Buffers
3.5.2 Current-Switch Transistor Size
Chapter 4: Carry Select Optimization
4.1 Introduction and Background
4.1.1 On adders and critical paths
4.1.2 On yield-limited technologies
4.2 Origin and Theory of Carry Select Addition
4.3 Optimization of Carry Select Stage Sizes
4.4 Optimal 32-bit ALU with Carry Select Addition
4.4.3 Layout of a Monolithic ALU
4.5 Considerations Affecting the Layout
4.5.1 Simulation of the design
4.6 FRISC “Byte-Slice” Carry Select Implementation
4.6.1 A Multi-Chip Processor and Another Look at Yield Limitation
4.6.2 Comparison to Optimized Adder, and Other Possibilities
Chapter 5: The Pseudo-Carry Lookahead Adder
5.2 Pseudo-carry Theory of Operation
5.2.1 Generalized pseudo-carry equations
5.3.1 Logical structure of the carry tree and circuit implementation
5.3.3 Measurement and Analysis of DARPA2 Test Structure
5.3.3.3 Interconnect Parasitics
5.4.2 Observable paths in Test structure
5.4.3 Expanded transistor sizes
5.4.5 Interconnect changes due to design kit and fabrication options
5.4.7 Interconnect parasitic extraction
5.4.8 Cell Layout for SMI00 Reticle
5.4.9 Test Structure Layout for SMI00 Reticle
5.4.10 Measurement and Analysis of the Second Test Structure
5.4.10.1 Continuing Parasitic Analysis
5.4.10.2 Continuing Temperature Concerns
Chapter 6: QuickCap Usage and Design Flow in the FRISC Group at RPI
6.1.1 QuickCap 3D capacitance extraction
6.1.2 Parasitic Extraction with Cadence Design System and the SiGe5HP Design Kit
6.1.3 Using CDS and QuickCap on SiGe5HP Designs
6.2.1 CDS/QuickCap Theoretical Design Flow
6.2.2 CDS/QuickCap/SmartSpice Theoretical Design Flow
6.2.4.1 Schematic to Spectre netlist, via Affirma netlister
6.2.4.2 Schematic to HSpice netlist, via Affirma netlister
6.2.4.3 Layout to GDSII stream, via PIPO
6.2.4.4 GDSII stream to CAP and SPICE, via gds2cap
6.2.4.5 HSpice and SPICE to SPICE and rename, via SvS
6.2.4.6 CAP and rename to SPICE fragment, via QuickCap
6.2.4.7 Spectre and SPICE fragment, via Spectre simulator
6.2.5 Pending Tasks and Experiments
6.2.5.1 _G<number> filter for hspiceS netlister
6.2.5.2 GDSII rewriting attributes
6.2.5.3 Pin Conversion on Stream Out
6.2.5.4 Make gds2cap understand the hierarchy of our GDSII streams
6.2.5.5 Make Affirma include the SPICE fragment from QuickCap
6.2.5.6 Keep technology files up to date with current design kit
6.2.5.7 Understanding QuickCap parameters
6.2.5.8 Migration to Parallel Processing Facilities
6.2.5.9 Break up FEOL and BEOL layers into separate technology files
6.2.5.10 Identify SmartSpice-isms
6.2.5.11 Handle losses in substrate
6.3.6.2 PROGRAM ERROR IN EstimateResistance()
6.4.1 Name mapping alternatives in the Cadence Design System to QuickCap transition
6.4.1.1 Keep pin information as attribute number
6.4.1.2 User-Defined Property Mapping File
6.4.1.4 Convert Pin Label Layer to Pin Layer
6.4.1.5 Scripted label file generator
6.4.1.6 SvS (Schematic versus Schematic)
6.4.2 Back-annotation in the QuickCap to Cadence transition
6.4.2.1 Intermediate SPICE to Spectre parsing script
6.4.2.2 Intermediate numeric to Spectre parsing script
6.4.2.3 Cadence SPICE reader for Spectre netlister
6.4.2.4 cap+spef and SPEF interchange format
6.4.3 Cell Hierarchy in Extraction
6.4.3.1 Cell Name Mapping
6.4.3.2 Exporting Hierarchies
6.4.3.3 Insertion in Spectre Netlist
Chapter 7: State of the Art and Future Directions
7.3 Exploitation of the idempotency of the prefix operation
7.4 Full Arithmetic-Logic Unit
7.5 Increasing depth of series-gating in lookahead gates
7.6 Dotted-emitter/dotted-collector circuitry
7.8 Other emitter-follower enhancements
7.10 Utilization of BiCMOS circuit designs
7.11 SiGe 7HP, 8T and Further Processes
7.12 Adaptations for non-CPU applications
Chapter 8: Research Conclusions
Appendix B: DARPA02 TESTPCL Pseudo-carry Lookahead Test Structure Netlists and Schematics
Appendix C: SMI00 TESTPCLL2u Pseudo-carry Lookahead Test Structure Netlists and Schematics
Appendix D: Auxiliary Files for QuickCap Usage
D.3 QuickCap Technology File Declarations
D.4 Details of the SiGe5HP technology files
Figure 1‑1: A very simple representation of the basic operations of a processor
Figure 1‑2: Basic units of a pipelined RISC processor, and connections between them
Figure 1‑3: Three-dimensional rendering of a Silicon-Germanium HBT
Figure 2‑1: Logical diagram of ripple carry
Figure 2‑2: Prefix graph for ripple carry
Figure 2‑4: Prefix graph for carry select
Figure 2‑5: Logical diagram of (flat) carry lookahead
Figure 2‑6: Logical diagram of block lookahead
Figure 2‑7: Prefix graph for Kogge-Stone adder
Figure 2‑8: Prefix graph for Brent-Kung adder
Figure 3‑1: A bipolar current switch, configured as a digital buffer
Figure 3‑2: Series-gating of current switches
Figure 3‑4: Graphical solution for voltage swing/noise margin relations
Figure 3‑5: Buffer delay, for buffer transistor size equal to emitter-follower size
Figure 3‑6: Fully differential gates to perform lookahead for two or three bits
Figure 3‑7: Lookahead gate with mixed single-ended and differential inputs for two or three bits
Figure 4‑1: Representation of a Carry Select Adder
Figure 4‑2: Logical Organization of the Optimized ALU
Figure 4‑3: Carry Generation Circuit
Figure 4‑4: Sum Generation Circuit
Figure 4‑5: Carry Selection Circuit
Figure 4‑6: ALU Function Generator
Figure 4‑10: Five-bit carry select stage from optimized ALU layout
Figure 4‑12: SPICE Simulation of 32-bit Carry Select Adder
Figure 4‑13: Comparison of Adder and Register File Areas
Figure 5‑1: Prefix graph for the pseudo-carry lookahead test structure on the DARPA02 reticle
Figure 5‑2: Blocks arranged in a pseudo-carry lookahead tree
Figure 5‑3: Carry tree test structure
Figure 5‑4: Buffer delay vs. tail current for 9805A design kit
Figure 5‑5: Layout of the PCLA test structure on the DARPA02 reticle
Figure 5‑6: HSpice simulation of DARPA02 test structure
Figure 5‑7: Oscilloscope trace of the high-speed output
Figure 5‑8: Annotated microphotograph of fabricated test structure
Figure 5‑9: Breakdown of measured delay by source
Figure 5‑10: Variation of HBT device model resistance parameters
Figure 5‑11: Variation of HBT device model capacitance parameters
Figure 5‑13: Extended test structure for the SMI reticle
Figure 5‑14: Delay for minimum pitch wiring using various methods of parasitic estimation
Figure 5‑20: Layout for the PCLA test structure on the SMI00 reticle
Figure 5‑21: Simulation output of the PCLA test structure on the SMI00 reticle
Figure 5‑22: Quarter wafer carrying the SMI00 reticle
Figure 5‑23: A closer view of the sites on the SMI00 reticle
Figure 5‑25: Oscilloscope output of the SMI00 test structure
Figure 6‑1: Proposed RPI CDS/QuickCap design flow
Figure 6‑2: 3D rendering of a SiGeHP NPN via QuickPrint and POV-Ray
Figure 7‑1: Schematic of the Intel Pentium 4 "double-pumped" adder, from [HINT01]
Figure 7‑2: Completing the sum with parallel term
Figure 7‑3: Clearing the carry for logic operations
Figure 7‑4: Moving the carry-clearing circuitry off the critical path
Figure 7‑5: Increasing height of decision tree and variation of delay as a function of input level
Figure 7‑6: Dotted-emitter (dotted-OR)
Figure 7‑7: Dotted-collector (dotted-AND)
Figure 7‑8: Limiting the low-level output of the "dotted-AND"
Figure 7‑9: Proposed dotted and/or implementation of three-way lookahead function
Figure 7‑10: Typical emitter-follower with passive current sources
Figure 7‑11: Emitter-follower with cross-coupled active pulldowns
Figure 7‑12: Peak of the f_{T} curve for a 0.12 by 0.8 micron device in the 8T process
Figure 7‑13: Delay as a function of driven gate loads for 5HP and 8T processes
Figure B‑1: Schematic for top-level cell “testpcl”
Figure B‑2: Schematic for cell “Vref_8mA”
Figure B‑3: Schematic for cell “PadReceiver_ESD_3o_1u”
Figure B‑4: Schematic for cell “PadDriver_8m_1u”
Figure B‑5: Schematic for cell “orb2q”
Figure B‑6: Schematic for cell “buf2q”
Figure B‑7: Schematic for cell “latch”
Figure B‑8: Schematic for cell "staticq"
Figure B‑9: Schematic for cell "bufq"
Figure B‑10: Schematic for cell "vref2"
Figure B‑11: Schematic for cell "ef3"
Figure B‑12: Schematic for cell "ef2"
Figure B‑13: Schematic for cell "etree2"
Figure B‑14: Schematic for cell "pcl14noc"
Figure B‑15: Schematic for cell "pcl18c"
Figure B‑16: Schematic for cell "and3"
Figure B‑17: Schematic for cell "pclsub2"
Figure B‑18: Schematic for cell "ef4"
Figure B‑19: Schematic for cell "etree3"
Figure B‑20: Schematic for cell "pcl6noc"
Figure B‑21: Schematic for cell "pcl6c"
Figure B‑22: Schematic for cell "vref1"
Figure B‑23: Schematic for cell "hstart"
Figure B‑24: Schematic for cell "istart"
Figure B‑25: Schematic for cell "pclsub2c"
Figure B‑26: Schematic for cell "hsc1q"
Figure C‑1: Schematic for cell "testpclL2u"
Figure C‑2: Schematic for cell "padd_RF_SE"
Figure C‑3: Schematic for cell "padr_ESD_DC_SE_3o_1u"
Figure C‑4: Schematic for cell "efL2u"
Figure C‑5: Schematic for cell "Vref_8mA_45_5l"
Figure C‑6: Schematic for cell "and2bqL3u"
Figure C‑7: Schematic for cell "etree2L2u"
Figure C‑8: Schematic for cell "efL6u"
Figure C‑9: Schematic for cell "vref2L2u_a"
Figure C‑10: Schematic for cell "mslatchL2u"
Figure C‑11: Schematic for cell "buf2qL2u"
Figure C‑12: Schematic for cell "efL4u"
Figure C‑13: Schematic for cell "pclrosc16L2u"
Figure C‑14: Schematic for cell "pclrosc16cL2u"
Figure C‑15: Schematic for cell "staticq"
Figure C‑16: Schematic for cell "pclrosc4L2u"
Figure C‑17: Schematic for cell "and3L2u"
Figure C‑18: Schematic for cell "etree3L2u"
Figure C‑19: Schematic for cell "pclrosc6L2u"
Figure C‑20: Schematic for cell "pclrosc4cL2u"
Figure C‑21: Schematic for cell "and2bL2u"
Figure C‑22: Schematic for cell "pclroscc2L2u"
Figure C‑23: Schematic for cell "ef2L4u"
Figure C‑24: Schematic for cell "istartL2u"
Figure C‑25: Schematic for cell "hstartL2u"
Table 3‑1: Buffer delay in picoseconds for one load
Table 3‑2: Buffer delay in picoseconds for two loads
Table 3‑3: Buffer delay in picoseconds for three loads
Table 3‑4: Buffer delay in picoseconds for six loads
Table 4‑1: Estimation of delay versus number of stages: s=1, B=32
Table 4‑2: Estimation of delay versus number of stages: s=2, B=32
Table 4‑3: Representative Timings From SPICE Simulation
Table 5‑1: Effects of erroneous resistor modeling in 9805A design kit
Table 5‑2: Delay for each feedback path in the SMI00 test structure, at 75°C
Table 5‑3: Minimum measured delay from the SMI00 test structure
Table 5‑5: Delay for various temperatures
Table 7‑1: Published adder/carry data
Table 7‑2: Delays for latched lookahead gate
For their roles in assisting and inspiring this research, thanks go out to:
· John F. McDonald, thesis advisor
· Mukkai Krishnamoorthy, Michael Savic, and Paul Schoch, members of the doctoral committee
· Hans Greub and Russ Kraft, current and former faculty associated with the FRISC group
· The numerous students of the FRISC group over the years
This research was sponsored in part by the Defense Advanced Research Projects Agency (DARPA) under contracts N66001968606, DAAH0493G04777, and N00173991G013.
Binary addition is a simple, ubiquitous component of computational circuits. One can hardly imagine a computer that did not add; to many it would not even merit the name. In both general-purpose and application-specific processors the adder delay is a strong metric for cycle time.
This research spans three areas that contribute to adder speed: the logical arrangement of carry generation, circuits to implement that arrangement, and high-speed semiconductor devices to realize those circuits.
Carry generation belongs to a class of parallel computation problems known as parallel prefixes. The basis of this work’s logical design is pseudo-carry lookahead, a method that uses tree-like structures to minimize gate depth on critical paths and trades delay from critical paths to non-critical paths.
The logical forms that reduce the serial computations necessary for addition are interrelated with the circuit forms that allow the fastest generation of those computations. Special circuits that compute lookahead in a single gate reduce signal path length and allow signals to be driven at high speed.
Silicon Germanium HBTs provide high-speed devices while leveraging the mature lithography of traditional silicon processes. Not only can fast circuits be built, but high integration accommodates not just large units such as adders but the whole systems into which those adders are embedded.
The combination of these three areas has allowed the construction of a 32-bit pseudo-carry lookahead circuit with a delay of 146 ps in a 50 GHz f_{T} SiGe process. In addition, directions for future work have been established that lead to delays on the order of 32 ps.
“Speed has always been important otherwise one wouldn't need the computer.”
Seymour Cray
Binary addition is one of the smallest “complex” Boolean circuits, something beyond just the Boolean primitives. It is one of the simplest applications of Boolean logic that has meaning and usefulness in the real world.
Adders are at the center of every computer. Adding machines were in fact the forebears of computers, which were envisioned as calculators for more advanced arithmetic. Circuits for the computation of higher-order mathematical functions such as multiplication employ adders as subcircuits [SWAR90].
Nor are adders restricted to general-purpose processors. Digital signal processors and network processors also perform arithmetic in their operations [SWAR97]. A fast adder can improve their performance as well, possibly with even greater effect than on a general-purpose processor. As the bottleneck in computation shifts between the processor and data I/O, there remains a demand for fast addition [FLYN01].
Being both a simple circuit and a building block for a very popular system, binary adders are strong candidates for special-case optimization. The simplicity of the operation makes the problem of delay optimization tractable. The ubiquity of the operation allows the savings due to optimization to be reaped many times over, so that design effort is transformed into latency reduction with high efficiency. Extensive logic design and hand-crafted circuits for adders thus carry the possibility of great returns, in terms of processor performance, on the effort invested. Examination of the logic of addition exposes the underlying parallel computation issue. Exploiting this parallelism reduces delay by doing work in different areas at the same time where possible, instead of serially. Crafting the circuits by hand extracts the optimum performance under the specific conditions imposed by the parallel logic. Through adjustments in areas such as gate topology, input symmetry, and reordering of functions, delay can be moved off the critical paths that define the overall delay of the adder and onto other, shorter paths.
These activities could in theory be undertaken on any circuit in a processor. However, available designer effort is finite, and not every circuit will produce great gains for the overall system when more design effort is applied to it.
After the trivial ripple carry adder, early research led to carry-skip [KILB59] and carry-select [BEDR62], both of which have the advantage of a nearly bit-slice arrangement in the physical layout [OKLO85]. Full examination of parallelization led to block carry lookahead [BREN82]. Although it requires a large area for its speed increases, carry lookahead often attracts the interest of bipolar designers, who are already driving hard toward the fastest possible circuits [BEWE88], and that of BiCMOS designers as well [KUO93]. Recent work has focused not on general-purpose stand-alone adders but on adders incorporated into other circuits, such as multipliers, which present an uneven input profile to the adder [STEL96].
A processor in its most basic form may be described as running in a loop of the following actions:
· Retrieve some data from memory
· Perform some operation on that data
· Place the result back into memory.
For an actual processor there of course needs to be some sort of controlling state machine that manages these actions, and the memory system requires rather more detail. However, this description begins to provide a framework for identifying where the delay that determines the cycle time is produced.
An empirical analysis of processor cycle times shows a strong correlation with this simplified model. Delay can be modeled as the sum of the time for the arithmetic-logic unit (ALU) to perform the operation plus the wiring delay of the longest connection from the controlling unit to the operations unit [SAIH95].
If a more sophisticated example of a pipelined RISC processor is examined, it can still be seen that the delay between pipeline latches, and thus the cycle time, is at a minimum the ALU delay plus the delay of the control signals to it. The register-file delay plus its control signal delay is another possible bound on the cycle time, but since the longest delay between pipeline stages defines the clock cycle for all stages, the relative importance of the ALU delay versus the register-file delay is determined merely by which is currently the larger value. Nothing inherent in the general properties of either functional unit makes its delay more important than the other's. The ALU delay is thus well worth investigating.
The critical path in adders often lies in the carry signals. Each carry depends on all preceding operand bits. This kind of computation belongs to a general category called “prefix operations”. These have turned out to be one of the fundamental areas of the study of parallel algorithms. Recognizing the parallelism that can be brought to bear in solving prefix problems is key to the development of fast adders [BREN82].
The connectivity of a carry circuit can be abstracted as a directed graph. The inputs are the operand bits at each bit position. “Processor” nodes, in contrast to simple buffer nodes, combine partial carries via the appropriate block carry operation for the carry method being considered. The nodes that represent the carry outputs for each bit position must have as predecessors in the carry graph the operand bits at every position of lesser or equal significance.
Given this graph representation, an analogy can be drawn between improving adder latency and minimizing the depth of spanning trees. The shallower the tree that can be constructed, the fewer gates each path to the carry outputs must pass through. However, there are constraints on both the construction and the drive capability of the gate circuitry that translate into finite capabilities for the “processor” nodes. These restrictions typically manifest themselves as limits on the fan-in and fan-out of the “processor” nodes, with function complexity possibly a concern as well. Thus, the mathematical basis of latency reduction for the arithmetic carry is established as the bounded fan-in/fan-out depth reduction of spanning trees.
The first area that this work will pursue is the mathematics of parallel prefixes and the mapping between prefixes and types of adders. This will provide an analytical basis for delay optimization of addition.
The very first transistor was formed with a bipolar junction. When MOSFETs were developed, bipolar circuits held a speed advantage. However, CMOS was able to take over more and more of the VLSI arena because of its lower demands for both standby current and supply voltage. Its increased use led to the more rapid maturing of CMOS lithography compared to bipolar, creating a further advantage in transistor density. Still, when the utmost speed was called for, bipolar circuits were needed.
The FRISC project generated a body of work concerned with the development of a spartan RISC processor in a Gallium Arsenide HBT process from Rockwell. Heterostructures and III-V technologies offer higher device speeds than silicon bipolar devices, let alone silicon MOSFETs. However, the yield of these technologies is terribly low for highly integrated systems. For a 10,000-transistor die in the Gallium Arsenide technology, testing often passes no more than one in three. Other technologies, such as Indium Phosphide, might offer faster devices but with even worse yields. They are typically seen only at very low integration levels in applications such as communications amplifiers, which might use only a handful of devices and can therefore be produced with usable yields. For the digital designer looking for high integration, the “exotic” nature of these technologies has retarded the maturation of the semiconductor fabrication and presents great difficulty from a practical manufacturing standpoint.
Silicon Germanium BiCMOS processes, such as the SiGeHP [AHLG97] from IBM, provide access to the speeds of bipolar circuits while leveraging the mature lithography of CMOS [CRES98]. It is possible to fold the steps needed for a polysilicon emitter and a graded germanium base into an existing CMOS process. This makes small feature sizes available to the bipolar layers via lithographic methods whose development cost is amortized over the high production volumes of CMOS chips; such a level of development would be difficult to fund for a bipolar-only process. In turn, the integration and power characteristics of the CMOS circuits can be used outside the speed-critical areas, bringing onto the bipolar die processing that previously required a separate package. This also permits combining quality CMOS devices with quality bipolar devices into single circuits in ways that were not previously possible.
The silicon germanium heterojunction bipolar transistor (HBT) also possesses advantages over highperformance IIIV HBTs. The silicon substrate possesses advantageous mechanical properties, such as lowdefect wafers, high thermal conductivity, mechanical strength, and acceptance of wide ranges of doping.
Bipolar devices provide the ability to construct effective small-swing differential logic circuits. If two HBTs are arranged as a differential current switch, the switching of the current from one device to the other is an exponential function of the differential voltage across the inputs of the pair. This results in a high gain for the circuit, which can be traded off against a very small differential input swing to produce short switching times.
The construction of HBTbased digital gates will be addressed, as a necessary preamble to the development of an arithmetic unit. Description and analysis of differential current switches and smallswing outputs will be related to device and circuit parameters and how they affect circuit latency and loaddriving capability.
A major difficulty facing research in high-speed circuitry is that the advanced semiconductor manufacturing required often outstrips the budget of such a research program; some processes might not be available to such a market at all. Whenever an opportunity to get circuits onto a fabrication run presents itself, every effort is made to utilize the available space as fully as possible. However, since the timing of the fabrication run is in the hands of the third party supplying the reticle space, this can disrupt the design cycle and reduce the efficiency with which the design can be verified. This unfortunately causes delays in analysis and communication and leads to design decisions that common sense would fault. Part of the task of university research writing, then, is on the one hand to identify and explain the decisions influenced by these non-technical third-party issues, and on the other to anticipate and prepare for such situations by creating a design process that is less sensitive to such factors and can produce a reliable design at more frequent intervals.
Two such fabrication opportunities arose during this work. DARPA sponsored a two-run multi-user reticle program to expand the research opportunities in silicon germanium for the projects under its direction. The first reticle carried a three-port register file from the FRISC group. The second reticle carried a new register file and a SERDES [KRAW00], as well as the pseudo-carry lookahead test structure described in this work.
Sierra Monolithics, Inc., also graciously donated space on two reticles for test structures built by the FRISC group. The first such reticle included an improved SERDES. The second included a BiCMOS FPGA design, along with the updated pseudo-carry lookahead test structure discussed in this work.
This work will cover the design, analysis, fabrication, and testing of the carry test structures for these two reticles. Central to this work will be the analysis of the measured results. Sources of error need to be identified and the portions of the design process that created these error sources corrected.
Arithmetic carries belong to the set of functions called “prefix problems”. This kind of function generates a series of results where each term depends on the previous term, i.e. each result is the “prefix” for the next. Efficient solutions to prefix problems depend on generating certain subsections of the prefixes in parallel and then combining them to produce the complete results. Prefixes are one of the core ideas of parallel computation.
Consider a series of terms over some associative operator “·”, e.g. x_{0}, x_{0} · x_{1}, x_{0} · x_{1} · x_{2}, etc. If we look at each term of the series, F_{n} = x_{0} · … · x_{n-1} · x_{n}, it is clear that, due to the associativity of our operator, each term may be rewritten as a recurrence F_{n} = F_{n-1} · x_{n}. In other words, each term of the series is generated by applying our associative operator to a new variable and a prefix that turns out to be the previous term in the series. The set of problems to which such a construction applies are referred to collectively as “prefix problems”. Prefix problems form the theoretical basis of quite a few practical computational circuits, notably among them carry trees for addition. The very idea of prefix circuits was first introduced as part of a fast binary adder [OFMA63].
If the series is generated purely by direct application of the recurrence relation, the time to generate the output of a prefix problem clearly grows linearly with the size of the input. To reduce this time, methods of parallel prefix generation are required. Indeed, prefix computation has become fundamental to the field of parallel algorithms [LAKS94], dating back to the scan operator in APL, which applies an operator to a vector to produce the vector of all the prefix results [IVER62].
A serial prefix circuit with n inputs can be shown by inspection to require n-1 operations and time n-1 to complete. It can be shown for any prefix circuit that the lower bound on the sum of the time and the size of the circuit is 2n-2 [SNIR86]. This suggests that increasing the number of operations could be used to reduce the depth of the circuit. In VLSI circuits, ever-increasing integration makes this tradeoff easy to make as well as desirable.
Since the operator used to build the prefixes is (by definition) associative, the prefix circuit can be built up by binary division. The inputs are divided into a lower-order and a higher-order half, and the prefixes for each half computed. The results from the lower-order half are then applied to the partial prefixes of the higher-order half to produce the complete prefixes. This division is applied recursively to produce the prefixes within each half. The size of an n-input circuit is twice the size of an n/2-input circuit plus the operations that complete the prefixes for the higher-order half, while the depth is one more than the depth for n/2 inputs. Unrolling the recurrence, the size is (n/2) log n and the depth is log n. A different divide-and-conquer strategy was used in [BREN82] to reduce the size to O(n).
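The binary-division construction can be sketched in a few lines of Python (illustrative only; the function name is mine). Each recursion level adds one operator delay to the depth and n/2 combining operations to the size, matching the counts in the text:

```python
def dc_prefix(xs, op):
    """Divide-and-conquer prefix over an associative operator:
    solve each half, then apply the low half's final prefix to
    every result of the high half."""
    n = len(xs)
    if n == 1:
        return xs[:]
    lo = dc_prefix(xs[:n // 2], op)
    hi = dc_prefix(xs[n // 2:], op)
    # The combine step: extra operations at this level, depth + 1.
    return lo + [op(lo[-1], h) for h in hi]

print(dc_prefix([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))
# [1, 3, 6, 10, 15, 21, 28, 36]
```

Associativity is what makes the combine step legal: op(lo[-1], h) regroups the chain of operations without changing its value.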
Addition can be shown to be a type of prefix operation. Specifically, it is the carry generation that constitutes a system of prefixes. The sum at each bit position is a function of the operand bits at that position and the carry-out from the preceding position. The carry-out of position n depends on the operands at position n and the carry from the preceding position n-1, which can be expressed in terms of the recurrence relation for a prefix operation given above [BREN82]. The basic serial computation of prefixes is the equivalent of ripple carry. If we take the delay of a gate that computes “·” and the area that the circuit occupies as our basic units, generating prefixes serially over n terms takes n-1 time units but occupies a space of only n-1. At the other extreme, each prefix could be computed independently in constant time, at great expense in area (and with highly impractical fan-in and fan-out requirements). For a large set of these carry structures, the relationship between depth and size (in terms of processing nodes) is so strong that it is possible to expand or contract the prefix graph with a non-heuristic algorithm to pass from one structure to another [ZIMM96].
Ripple carry is the simplest form of addition carry, deriving straight from the definition of addition, or from the recurrence relation for a prefix operation. As mentioned above, it is simply the serial generation of prefixes applied to the case of addition as the associative operator. It consists of two single-digit additions per bit. For each bit, the operands are "half-added", the lowest digit of the result being the sum and the rest being a carry. The carry is then added to the sum of the next bit, producing the addition result. This mechanism of adding the carry at one position to the sum at the next position creates the next carry; the carry signals are said to "ripple" from one bit to the next. The logical structure needed at each bit, as well as the interconnections between bits, are identical. The time to produce all result digits and the necessary circuit area are both linear functions of the number of bits.
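The ripple mechanism can be made concrete with a short bit-level sketch (illustrative Python, not from the thesis; bit lists are least-significant first):

```python
def ripple_carry_add(a_bits, b_bits, carry_in=0):
    """Bit-serial ripple-carry addition. At each position the
    operands are 'half-added' (XOR for the sum digit, AND for a
    carry), then the incoming carry is folded in, producing the
    carry that ripples on to the next position."""
    sums, carry = [], carry_in
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))  # next carry ripples onward
    return sums, carry

# 5 + 3 on 4 bits (LSB first): 1010 + 1100 -> 0001, no carry out.
s, c = ripple_carry_add([1, 0, 1, 0], [1, 1, 0, 0])
print(s, c)  # [0, 0, 0, 1] 0
```

Because the carry at each position waits on the one before it, both the delay and the circuit area grow linearly with the operand width, exactly as stated above.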
Improvements on carry times involve computation of prefixes over groups in order to generate intermediate results in parallel. Carryselect [BEDR62] is based on computing carries over specific groups of bits based on assumed inputs. Partial prefixes that depend only on the bits within a single group are generated in parallel for each group. Then the appropriate partial prefixes from different groups are combined in series to create complete prefixes. The carry out of one group is the carry in of the succeeding group, building a complete prefix for all for the preceding bits in series as it progresses through each group. For each group, partial prefix generation is accomplished by including two complete sets of the circuitry for a local carrycomputation method. One copy has an assumed carryin of one, the other a carryin of zero. When the carryin from the previous group is determined, selecting the proper output for each partial prefix can generate the complete prefix. All of the "assumed carry" values can be generated in parallel, and then the proper alternatives selected in turn when the carryin is applied.
Though it is not the fastest of carry structures, carry select offers some implementation advantages: it is easy and quick to implement due to the small number of subcell types needed, and it requires not much more than twice the area of serial carry generation.
Carry lookahead [BREN82] uses a tree structure to parallelize carry generation and obtain an O(log n) computation time. The tree structure is based on two intermediate signals, the "carry propagate" and the "carry generate". If the generate signal at a node is asserted, there is an unconditional carry out at that position, i.e. a carry is "generated" at that point. If the propagate is asserted, the carry out follows the carry in, i.e. a carry is "propagated" through that circuit.
When units creating generate and propagate signals are combined into groups, the generate for the group is asserted if any unit has its generate asserted and all subsequent propagates are asserted. The group propagate is asserted if all unit propagates are asserted. In this manner, the partial prefixes are built up at each node until the root node is reached, at which point the complete set of prefixes has been generated.
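The combining rule for group generate and propagate is itself an associative operator, which is what makes the tree construction possible. A sketch, where the (g, p) tuple ordering is an illustrative convention:

```python
def gp_combine(hi, lo):
    """Combine (generate, propagate) pairs; hi covers the more
    significant subgroup, lo the less significant one."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    # group generates if the high part generates, or the high part
    # propagates a carry generated in the low part
    return (g_hi or (p_hi and g_lo), p_hi and p_lo)
```

Because this operator is associative, the pairs can be combined in any tree shape, which is exactly the freedom the prefix topologies below exploit.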
The Brent-Kung topology builds its trees by dividing the inputs into odd and even sets. A partial prefix is constructed from each odd input and its next higher even input. These partial prefixes then become the inputs for a recursive subdivision. The outputs of this subdivision provide the complete prefixes for the even outputs, which are then combined with the next higher odd input to produce the rest of the complete prefixes.
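The odd/even recursion can be sketched for a generic associative operator; this models the graph topology only, not gate-level detail:

```python
def brent_kung_prefix(xs, op):
    """All running prefixes of xs under associative op, via the
    Brent-Kung recursion: pair inputs, recurse on the half-size
    problem, then fill in the remaining positions."""
    n = len(xs)
    if n == 1:
        return list(xs)
    # combine each input with its next-higher neighbor
    pairs = [op(xs[i], xs[i + 1]) for i in range(0, n - 1, 2)]
    if n % 2:
        pairs.append(xs[-1])
    sub = brent_kung_prefix(pairs, op)    # recursive subdivision
    out = [xs[0]] + [None] * (n - 1)
    for i in range(1, n):
        if i % 2:                          # these prefixes come from the recursion
            out[i] = sub[i // 2]
        else:                              # the rest need one more combination
            out[i] = op(sub[i // 2 - 1], xs[i])
    return out
```

With `op` as the (g, p) combining rule, this recursion is the carry tree; with ordinary addition it computes running sums.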
While the Brent-Kung prefix topology does exhibit O(log n) depth, maximum fanout also increases with log n unless a factor of 2 increase in depth is allowed in order to insert buffers to reduce the fanout at each level. Either of these alternatives has a significant negative impact on delay when realized in circuitry. However, another tradeoff is available for prefix structures, this one reducing fanout by using a much higher number of circuits. The Kogge-Stone prefix graph [KOGG73] exhibits a constant maximum fanout, but the number of operation nodes increases with n log n.
Carry-skip is a very old type of arithmetic carry [KILB59], based on a concept originally invented by Charles Babbage in 1837. Carry-skip is based on dividing the operand bits into groups and computing carries within each group of bits via a simple method, such as ripple carry, in parallel, as with carry-select. The underlying property is that for any group of bits, either the carry-out is generated solely by the current group or it is solely propagated from the previous group, but not both. Each block not only contains a serial path for building prefixes, but also a decision circuit to determine whether the result of the partial prefix generated by the block would actually impact the result of a complete prefix that included it. If for every bit the propagate signal (see carry lookahead) is true but the generate signal is not, then the carry-out for the block is equal to the carry-in. In all other cases, i.e. there is at least one carry generate asserted or one bit where neither the generate nor the propagate is asserted, the carry out can be computed solely from the bits in the group.
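The group-level decision can be sketched as follows; the all-propagate test stands in for the skip gate, and the group width of 4 is an arbitrary illustrative choice:

```python
def carry_skip_add(a, b, n, group=4):
    """n-bit carry-skip addition (sketch): each group either passes the
    incoming carry unchanged or produces its own carry-out."""
    result, carry = 0, 0
    for lo in range(0, n, group):
        ga = (a >> lo) & ((1 << group) - 1)
        gb = (b >> lo) & ((1 << group) - 1)
        # every bit propagates iff a XOR b is all ones within the group
        all_propagate = (ga ^ gb) == (1 << group) - 1
        s, cout = 0, carry                 # ripple within the group
        for i in range(group):
            ai, bi = (ga >> i) & 1, (gb >> i) & 1
            s |= (ai ^ bi ^ cout) << i
            cout = (ai & bi) | (cout & (ai ^ bi))
        result |= s << lo
        # skip path: carry-out equals carry-in when every bit propagates
        carry = carry if all_propagate else cout
    return result, carry
```

When `all_propagate` holds, `cout` equals `carry` anyway; the explicit branch models the decision circuit that lets the hardware bypass the ripple path.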
Figure 2‑9: Carry skip 
Looking at the diagram of carry skip in Figure 2‑9 may give the impression that it depends on a serial path through every bit. However, whenever a signal could propagate through every bit of a block that has a skip, the skip path around that block also becomes open and passes any incoming signal "around" the block with constant delay. As a prefix operation, the computation of each piece of a prefix is accompanied by a determination of whether the accumulated result for the prefix is impacted by the piece in question. Considering the topology of the prefix graph, without regard to the actual operation occurring at each node, this turns out to be very similar to carry-select. The main difference lies in the presence of divergent edges as well as convergent edges.
Carry-skip can be quite fast, with the critical path bypassing whole blocks with a single gate. However, the delay through carry-skip is highly dependent on a complicated relationship between block sizes [KANT93]. Instead of a progression from one end to the other or growth orthogonal to the span of bits, the blocks in carry-skip must expand from each end towards the middle. The longest path actually runs through the first and last blocks while bypassing the intermediate blocks. Increasing the operand size also is not a simple expansion on the end of the adder, and the implementation requires less regular interconnect than other designs [CHAN92]. However, the physical layout of carry-skip can be amenable to a bit-slice organization in a manner similar to carry-select [OKLO85].
There is a Boolean function of a higher order of complexity than AND, OR, and NOT which is of some note in the field of parallel computation. This function is called the "threshold function". Imagine taking a weighted sum of the inputs to a gate. Internally, the gate could compare that sum to some value, and the output of the gate would be dependent on that comparison. In other words, the gate determines whether the weighted sum exceeds a given threshold, hence the term threshold function.
If we make a more precise definition of the action of the threshold function, such as
T(X) = 1 if Σ (w_{i}x_{i}) ≥ t,
0 if Σ (w_{i}x_{i}) < t,
we can see that any threshold gate would perform the same basic action, and any feasible action for a specific instance can be specified once given the weight vector W and the threshold t. Threshold gates can be categorized by the characteristics of the weights and thresholds supplied to them. Categories in common use include small weights, integer weights, and bounded weights. Threshold circuits are key in the field of neural computation. The threshold function closely resembles the action of an element of a neural network. These networks can be realized with threshold gates in less than the exponential size that would be required with simple AND-OR circuits, making the fabrication of networks of useful size feasible.
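A threshold gate is trivial to model in software; the weight and threshold choices in the checks below (unit weights realizing MAJORITY and AND of three inputs) are illustrative:

```python
def threshold_gate(weights, t, inputs):
    """T(X) = 1 iff the weighted sum of the inputs meets threshold t."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= t else 0

# With unit weights, t = 2 gives MAJORITY-of-3 and t = 3 gives AND-of-3.
```

The same gate body realizes different functions purely through the choice of W and t, which is the categorization discussed above.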
It is known that of the set of all possible n-variable Boolean functions, the great majority require a number of gates that increases exponentially just to compute at all, let alone with a small depth and thus a short delay [SHAN49]. Implementation of a circuit whose size grows exponentially is unlikely to be feasible. No means of determining which functions can be implemented with circuits of less than exponential size has thus far been found, short of building such a circuit. As should be apparent by now, it turns out that an AND-OR circuit for binary addition can be constructed in less than exponential size. However, an AND-OR circuit that computes binary addition at the theoretical lower bound of depth 2 would require exponential size [YAO85][HAST86].
While the main interest of this work is in making every effort to minimize circuit delay, through such means as restricting gate depth and fanout, if the circuit were of such extraordinary size as to be difficult to fabricate or to integrate within a complete processor the research would be moot. At this point threshold circuits come into the picture, since in a number of cases it has been possible to replace an AND-OR circuit of exponential size with a threshold circuit of the same depth but with more reasonable size constraints. The effect is one of replacing a large number of simple AND-OR Boolean gates with a smaller number of more complex threshold gates.
The difficulty with employing threshold circuits is one of fanin. The replacement circuit using threshold gates expects unbounded fanin. High fanin is frequently a "feature" of threshold circuits, and is a contributing factor in their prevalence in neural computation [SIU95]. However, once the implementation of high-speed digital logic is considered, high fanin can be seen as a liability instead. Each method of accepting additional inputs into a circuit contributes to the delay of the gate. This will be shown to be as prohibitive in terms of gate delay as the AND-OR circuit was in terms of size. Fanin as high as would be necessary for even a modest-size adder could make each gate so slow as to result in a net increase in total circuit delay, despite the fact that the circuit would be less than half as deep in terms of gates. Even if such a gate could be constructed with the technology used in this work, using it would not be practical, and threshold gates thus will not be considered for the present work.
The computational design for an addition circuit has a mathematical analog in the prefix problem. As the prefix problem entails generating the terms of a series in parallel, so does addition circuitry involve parallel computation paths. Study of parallel prefixes and their solution forms the theoretical basis of advanced arithmetic circuits.
Current-steering logic is a category of integrated transistor digital logic gate using a constant current source. Instead of turning current on and off to pull an output line high or low, a constant current is switched between two (or more) possible paths. This kind of digital logic gate is well suited to implementation with bipolar transistors due to the exponential relation between input voltage and output current. A small voltage swing on an input can still rapidly control a large current change, which in turn can rapidly produce a small voltage change on an output.
The basic building block of current-steering logic is the "current switch". The current switch and its operating principles will be described. Following that is a discussion of how current switches can be combined to make gates that compute complex functions. Since current switches composed of bipolar transistors are being considered, important characteristics of bipolar transistors and how they relate to the speed and load-driving ability of the gates need to be considered. Simulated results for optimizing gate delay are included in that discussion, as well as interconnect parasitics. Finally, a gate will be constructed which computes a function that is key in lookahead carry structures.
Consider a pair of bipolar transistors with a common emitter connection. Given a fixed tail current, the proportion of the current drawn through each collector can be found to be a function of the difference between the two base voltages. As the voltage of one base moves from slightly less than that of the other base to slightly more, the current is “switched” from the second collector to the first.
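The exponential base-emitter relation gives the familiar tanh-shaped split of the tail current between the two collectors. A sketch, assuming ideal matched devices and neglecting base current:

```python
import math

def current_split(i_tail, dv, vt=0.02585):
    """Collector currents of a differential pair for a base voltage
    difference dv (V); vt is the thermal voltage near room temperature."""
    i1 = i_tail / (1.0 + math.exp(-dv / vt))  # ideal exponential devices
    return i1, i_tail - i1
```

A differential input of only a few thermal voltages is enough to steer essentially all of the tail current to one side, which is why small swings suffice.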
If resistors of a particular size are connected from each collector to a voltage rail, the switching of current from 0 to I_{CS} can be converted to a switching of voltage differential from the rail from 0 to I_{CS}R. This voltage switching can in turn be used to drive another current switch. What might be described in analog terms as a differential common-emitter amplifier thereby in digital terms operates as a current-switched buffer or inverter. (Note that with two pullup resistors a fully differential output is produced, so that inversion is a "free" gate.) Traditionally, this upper voltage rail has been connected to ground since the output voltage is most sensitive to that rail. A negative voltage supply in that case is used to drive the lower rail.
If one base is tied to a fixed reference voltage, swinging the voltage on the opposite base from V_{ref} – V_{sw} to V_{ref} + V_{sw} will switch the current to the collector of the transistor not connected to the reference. The pullup resistor connected to that collector can be sized to swing the voltage on the collector from V_{ref} + V_{sw} to V_{ref} – V_{sw}. If a signal above the reference is considered a logical 1 and a signal below the reference a logical 0, the result is a logical NOT operation. This circuit can be referred to as a single-ended current switch, since the input is represented by a single line that switches against a reference. If multiple transistors are connected in parallel (common collector and emitter connections), raising the base voltage of any of them will switch the current. If each base is driven by a different signal, the output will produce the logical NOR of those signals.
Recall that the current switch inherently provides a differential output. If, instead of taking one output and connecting it to a current switch opposite a fixed reference voltage, both output rails are connected to the bases of the following switch, a differential current switch is created.
A current switch diverts current from a single source at the common emitter to one of two connections at the collectors. The path from each collector through the current source in turn looks like a controllable current source connected to each collector node. The current source in a current switch could be replaced with a connection to the collector in another current switch. This topology is called "series-gating". The output of the entire circuit would be a logical combination of the inputs of each series-gated level, depending on the connection. A tree built from N levels of series-gated differential current pairs can be used to generate any function of N variables or multiplex up to 2^{N–1} signals.
The common emitter node of a current switch is V_{BE, on} below the highest input voltage. Attempting to drive two series-gated levels of current switches with the same input voltages would mean that V_{CE} of the transistors of the lower switch would be driven to 0 volts and the transistors would be put into saturation. The inputs of a lower level must be driven at a lower voltage relative to the inputs of a higher level. Emitter followers make this possible.
Emitter followers are the equivalent of common-collector amplifiers connected to the outputs of a current-switched gate. With current flowing through the transistor, the output of the emitter follower will be V_{BE, on} below the direct output of the current-switched gate. When driving a lower series-gated level with emitter follower outputs, V_{CE} is equal to V_{BE, on}, maintaining the nominal V_{CE} of a non-series-gated current switch. While there are reasons to desire a larger V_{CE}, this value is easily generated by the simple emitter follower circuit and depends only on device matching issues that are already required in the current-switched gate itself.
Emitter followers also have a benefit for driving large loads. While there is additional device delay while signals propagate through an additional transistor, the sensitivity of the gate delay to loading (RC parasitics in interconnect, high fanout) is reduced. The sensitivity of the output voltage to current drawn out of the gate is also reduced, although the effects are not typically significant to begin with.
A current switch pair can control the flow of current, but a source of the current is still needed. Current sources can be categorized as active or passive based on the presence of a transistor in the source circuit.
A bipolar active current source is constructed from a BJT with an emitter resistor. A reference voltage V_{CS} applied to the base of the BJT will in turn fix the voltage across the emitter resistor at V_{CS} – V_{BE, on} – V_{EE}. The current source will then supply (V_{CS} – V_{BE, on} – V_{EE})/R_{E} into the collector of the source's transistor. Small variations in the common emitter node of the current switch connected to the current source that occur during switching will have little effect on the supplied current. However, the supply voltage must be large enough to allow biasing of the BJT as well as the resistor. If the transistor enters the saturated region, less than the nominal source current will be available. Also required would be a voltage reference to generate V_{CS}. A current mirror would be set up to force a transistor base to the correct bias voltage for a set current, which would then be applied to the base of the current source transistor. If the R_{E} in the reference and the R_{E} in the current source are equal, the current through the source mirrors the current in the reference circuit. This line could be global, and each reference circuit could be shared by many current sources. The nominal value of the current source could be scaled in relation to the current of the reference by adjusting the R_{E} of the current source. The ratio of the currents would be the inverse ratio of the R_{E} values.
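The nominal source current follows directly from the voltages named above; the numeric values in the check below are purely illustrative placeholders, not process figures:

```python
def active_source_current(v_cs, v_be_on, v_ee, r_e):
    """Nominal current of a BJT current source with emitter resistor:
    the emitter resistor sees V_CS - V_BE,on - V_EE across it."""
    return (v_cs - v_be_on - v_ee) / r_e
```

For example, with an assumed V_CS of –3.4 V, V_BE,on of 0.85 V, V_EE of –5.2 V, and R_E of 1 kΩ, the source supplies 0.95 mA; halving R_E would double the current, which is the scaling mechanism described above.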
A passive current source uses only a resistor connected to the common emitter node of the current switch. The input voltages set the biasing voltage, since the common emitter node will be at a potential V_{BE, on} below the higher input voltage. The voltage across the resistor will vary somewhat during switching, due to the difference between the swinging input voltage and the changing baseemitter voltage due to the changing current. The passive current source is also sensitive to variation in V_{EE} in a fairly linear manner. Changing the power supply requires adjusting each gate individually, while updating an active source needs no more than an adjustment to the source in the reference circuit.
There are two kinds of differential output bipolar logic circuits used in the present work. The nomenclature used for these circuits in the current literature is not clear. The commonly used terms are Emitter Coupled Logic (ECL) and Current Mode Logic (CML). As identified by the "authorities" in the field [TREA89], these categories become no more than distinctions without difference, in ways that have been found to be more confusing than useful. Some use ECL to describe solely single-ended inputs switching against a reference, or single-height current switches without series-gating, whereas CML is used for differential inputs in tall series-gated trees. Others use ECL to refer to the presence of emitter-follower outputs, while CML implies their lack. The term "differential current switch" (DCS) logic has also been used to describe current switches used in combination with dotted-emitter outputs. It becomes evident that this system of nomenclature does not successfully describe the circuits in question. An effort has been made to explicitly describe the major characteristics and their relevance to the circuit designs of note. The primary distinctions that appear in the current work are single-ended versus differential inputs, and presence versus absence of emitter followers.
Figure 3‑3: Simulated baseemitter potential versus collector current for a 0.5 x 1.0 micron device, V_{ce}=0.3 
The speed of a bipolar transistor is most importantly characterized by the transition or toggle frequency f_{T}. This is the frequency at which the short circuit gain in a common emitter configuration reaches unity. At low collector currents, the f_{T} is controlled by the depletion capacitances and the collector current, while being bound at high currents by the base transit time. The peak of the f_{T} versus I_{C} curve is quoted as a figure of merit for the device, which will be mainly a function of the forward transit time.
The maximum oscillation frequency f_{M OSC} is defined at the point of unity power gain as opposed to unity current gain. While this is a more complex arrangement, requiring a load matched to the output resistance at each I_{C} of interest, it more closely matches the environment of a transistor embedded in a circuit driving loads and being driven itself. A value for f_{M OSC} can be found computationally from f_{T}:
f_{M OSC} = (f_{T}/(8πC_{jc}r_{bb}))^{1/2}.
Both the collector capacitance [ARMS95] and base resistance [BARN75] reduce this frequency. Note that while reducing the intrinsic base thickness would reduce the forward transit time and f_{T}, this would also increase r_{bb}, thereby creating a need for compromise in transistor design.
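The relation above is straightforward to evaluate; the device values in this check are illustrative placeholders, not figures from the process used in this work:

```python
import math

def f_mosc(f_t, c_jc, r_bb):
    """Maximum oscillation frequency from f_T (Hz), collector junction
    capacitance C_jc (F), and base resistance r_bb (ohms)."""
    return math.sqrt(f_t / (8 * math.pi * c_jc * r_bb))
```

Note the square-root dependence: halving r_bb or C_jc improves f_MOSC by only about 40 percent, while the product of the two appearing together is why both terms matter in the compromise described above.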
The factors of “latency” and “bandwidth” both fall under the bailiwick of the high-speed circuit designer. In a functional block comprising many gates, minimizing latency is not necessarily equivalent to maximizing bandwidth, and this dichotomy is represented by the division between logic and communications circuits. Logic gates need only enough bandwidth to ensure that the outputs of each functional unit or block of logic have switched one block delay after the input switching. As the cycle time of the input signals would be more closely related to the block delay, which is many multiples of the gate delay, having enough bandwidth to prevent the rise and fall times from overwhelming the delay through the gates is sufficient. On the other hand, since fanin and functional complexity lead to increased device parasitics loading each line, long chains of simple gates reduce the latency of each individual gate, improving the bandwidth of the entire chain while greatly extending its total latency.
Toggle frequencies by themselves are primarily measures of the bandwidth of the input signal that can be pushed through the circuit. However, the propagation delay, or latency, of the current switch, and hence the speed of the entire circuit, which is our primary interest, is dependent on more than the f_{T} and f_{M OSC} of the transistors used. Indeed, even when taking the limit as f_{T} approaches infinity the propagation delay remains nonzero. The base resistance significantly affects the delay when it becomes greater than about half of the source resistance, and the impact of the capacitance of the collector-base junction increases as emitter current drops [BARN81].
An examination of the sensitivity of propagation delay in ECL ring oscillators [CHOR88] demonstrates the effects of these parameters on the delay of the circuit. The largest sensitivities appear for the terms involving base resistance and collector capacitance. It should be noted that the projections for a delayoptimized 0.5 micron device in that work show that the two largest contributors to the total delay are the transit time and the RC load time constant. However, the sensitivities to base resistance and collector capacitance still stand, so that deviations could still impact performance noticeably. In addition, the third and fourth largest contributors of delay involve base resistance and collector capacitance respectively. These sensitivities are for gates with a fanin of only one. With more complex gates, collector capacitance and similar parameters become even more important in comparison to f_{T} [JOUP94].
Circuit design needs for logic gates diverge from the needs of communications circuits in certain areas:
· Fanin: Logic circuits should be designed with high fanin to reduce total gate depth for a functional block. Communications circuits should be built from low-fanin gates in trees or chains to increase the bandwidth of each individual gate. The break-even point between a single gate with a large number of inputs and a depth-of-2 tree with fewer inputs per gate is much higher with respect to latency than it is with respect to bandwidth.
· Output stages: Communications circuits can be improved with a common-base stage between the current-switching tree proper and the output nodes and pullups. This eliminates the Miller effect on the collector junction capacitance between the bases at the switching inputs and the collectors at the output nodes, which are moving in opposite directions and at high gain. Logic gates are hampered by the additional delay since all input-to-output paths now pass through an additional device. With series-gating, there is an increase in delay for each level lower in the tree.
The voltage swing necessary on the input of a current switch is chosen so as to provide for a large noise margin for the gate [TREA89]. Maximum noise margin occurs at the input voltage when the gate is at unity gain:
v_{n} = V_{s}/2 + V_{s}v_{t}/(V_{s} – v_{t}) – v_{t} ln(V_{s}/v_{t} – 2)
Simultaneously, the baseemitter voltage must be sufficient for the transistor to be in the forward active region while the collectoremitter voltage must be sufficient to keep the transistor out of saturation:
V_{BE, on} = V_{CE} + V_{s} + v_{n}
Figure 3‑4: Graphical solution for voltage swing/noise margin relations 
Solving both of these equations simultaneously provides the appropriate output voltage swing to maximize the noise margin. Figure 3‑4 shows a graphical solution for the relations between noise margin and voltage swing. The point where the two curves cross identifies the proper voltage swing to maximize the noise margin, as well as the noise margin that is available. V_{s} is the maximum voltage between high and low values for a singleended input switching against a constant reference voltage. The maximum voltage differential across the inputs of a current switch is half of the magnitude of V_{s}, as the reference voltage is nominally half way between the high and low input levels. The swing for a differential pair is V_{s}/2, since opposing inputs of the current switch are always at the opposite extremes of the voltage swing and switching is not being performed against a static reference voltage.
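The crossing point can also be found numerically by bisection. In this sketch the thermal voltage, V_{BE, on}, and V_{CE} values are assumed for illustration only and are not the process figures used in this work:

```python
import math

VT = 0.02585        # thermal voltage near room temperature (V), assumed
V_BE_ON = 0.85      # illustrative turn-on voltage (V)
V_CE = 0.30         # illustrative minimum collector-emitter voltage (V)

def noise_margin(vs):
    """Noise margin at the unity-gain point for swing vs."""
    return vs / 2 + vs * VT / (vs - VT) - VT * math.log(vs / VT - 2)

def constraint(vs):
    """Swing allowed by the bias relation V_BE,on = V_CE + V_s + v_n."""
    return V_BE_ON - V_CE - vs

# bisect for the swing where the two curves cross
lo, hi = 0.1, 0.6
for _ in range(60):
    mid = (lo + hi) / 2
    if noise_margin(mid) > constraint(mid):
        hi = mid
    else:
        lo = mid
vs_opt = (lo + hi) / 2
```

With these assumed values the curves cross near a 0.4 V swing; the same procedure applies with the actual device parameters in place of the placeholders.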
While derivation of figures of merit may provide a point from which to design, eventually the transistors must be placed into a circuit. The loading that a current switch sees due to pullups, emitter followers, subsequent gates, and interconnect parasitics can affect the bias point that minimizes delay.
A simple test case that allows examination of sizing to account for loading should refrain from applying ideal inputs directly to the gate under test. A chain of three buffers with emitter followers driven by the ideal source will produce a signal comparable to what might be observed by the circuit when in place. Similarly, three buffers following the gate under test present a realistic load on the output signal and provide the possibility of comparing the delays between several gates to check for variations. Since the current-switching gates under examination all use the current switch/pullup/emitter-follower arrangement, using buffers gives a good representative of the set of gates as a whole. While we are interested in the delay on a per-gate basis, it is important to replicate the loading and driving of the gate under test in a fair manner. Since a buffer is being considered representative of any gate and the driver and load would just be other gates, it is reasonable to represent them with buffers as well. This allows inputs to be generated in simulation with simple ideal sources, with the first buffer shaping them into a more realistic waveform, and automatically includes the correct impedance for driver and load instead of requiring calculations for every sizing variation examined.
For the current investigation the assumption will be made that each buffer instance and each emitter follower instance will be biased similarly. While using different-sized gates is useful when loading due to unusual fanout or wiring length occurs, or if the path under consideration extends directly and solely to off-chip connections, examination of the first case will be reserved for specific occurrences while the second case does not apply to the type of function that is the focus of this research. The second case does come into play in the bandwidth-oriented design of communication circuits, especially in the case of multiplexing two signals of a given frequency into one signal at twice that frequency.
The parameters that are free to vary are primarily the transistor sizes in the buffer and emitter follower, and secondarily the tail currents in the buffer and emitter follower. For the purpose of streamlining the design process, the tail currents relative to transistor size will be fixed early on. Any gain from adjusting tail current to suit specific loading is likely to be too small to be worthwhile, and the required detail of the specific circuit conditions cannot in any case be supplied until a preliminary design has been completed.
Since delay of the circuit is the object of optimization, the major figures of interest regarding tail current are:
· the peak f_{T} current of a transistor of a given emitter size,
· the rated current of a transistor of a given emitter size.
While the initial bias point is the peak f_{T} current, the introduction of loading in a delayoriented design leads to seeking improvements in delay when the current is increased even beyond that point. However, this current cannot be increased without bound. The transistor is rated for a maximum current density beyond which the device will fail. (Successive design kits have progressively reduced this current rating. The 9805 kit had a 2 mA per square micron current density limit. By the time the 1999B kit was available, the current density had been reduced to 1.4 mA per square micron.) Furthermore, the transistor models have known inaccuracies past the peak f_{T} current, and the devicemodeling engineers have not focused on improving the models in this regime.
In the typical usage of these devices the concern is usually either bandwidthoriented design, where biasing right at the peak f_{T} current is often wanted, or noise where the biasing is much less than the peak f_{T} current. These areas, being the focus of commercial development, receive the most modeling effort. Logic circuitry has thus far been mainly considered only in support of bandwidthoriented communications circuits. Designing for latency minimization is somewhat beyond the pale.
On the other hand, it can be shown that at the highest current where the models are qualified, the peak f_{T} current, the delay is still improving. This would seem to justify exceeding that point by some slight amount, but the informal opinion of designers in the area is that it is unwise to exceed it by much. Devoting fabrication funds to the characterization of the current-versus-delay relation would take away from functional "payload" circuits. The task itself would be expected to be daunting, in light of the fact that the device engineers have not even seen fit to address it. In that light, an upper bound of the peak f_{T} current must be set for the tail current.
In the meantime, the rated current has decreased to the point where it is the major limiting factor on the "excess" current anyway. Biasing at less than the current for minimum delay can be used for power savings on gates off the critical path; however, power consumption has not been one of the areas of concern for the current work. In addition, the design is such that a great percentage of gates are involved in the critical path, limiting the amount of power that could be traded off for delay.
Once given a delay-minimizing current density, it is tempting to then use the smallest transistors possible to reduce power consumption and circuit area. However, it was noted above that the transistor speed is very sensitive to variation in base resistance. Small devices have high base resistances. While increasing the device size also increases the collector capacitance, there is a point where the combined effects are at a minimum, and this point is above the minimum device size. Comparing the results for different-sized current switch devices shows a larger jump between 1-micron and 2-micron emitter length devices than between other intervals. This indicates that it is worth the power to increase the fundamental emitter length to 2 microns.
1 load
EF Size (um) \ Buffer Size (um):   1      2      3      4      6      8      10     12     14     16     18     20
1                                  16.1   14.5   14.5   14.7   15.3   16