Critical ALU Path Optimization and Implementation in a BiCMOS Process for Gigahertz Range Processors

By

Matthew W. Ernest

A Thesis Submitted to the Graduate Faculty of Rensselaer Polytechnic Institute in Partial Fulfillment of the Requirements of the Degree of

Doctor of Philosophy

Major subject: Electrical Engineering

Approved:


 

                                                

John F. McDonald, ECSE

Committee Chair

 

                                                

Mukkai Krishnamoorthy, CSCI

Committee Member

 

 

                                                

Michael Savic, ECSE

Committee Member

 

                                                

Paul Schoch, ECSE

Committee Member

 


 

 

Rensselaer Polytechnic Institute

Troy, New York

December 2002


© Copyright 2002, Matthew W. Ernest

All Rights Reserved


 Table of Contents

Table of Contents. iii

Table of Figures. xi

Table of Tables. xvii

Acknowledgment.. xviii

Abstract.. xix

Chapter 1: Introduction.. 1

1.1      Wanting a fast adder.. 1

1.2      A metric for cycle time. 3

1.3      Parallel circuits and prefix computation.. 5

1.4      Silicon Germanium bipolar and BiCMOS for high-speed processors. 6

1.5      The DARPA2 and SMI00 Reticles. 8

Chapter 2: Addition as a Parallel Prefix Problem... 10

2.1      Introduction.. 10

2.2      The Prefix Operator.. 10

2.3      Depth/Size tradeoff in prefix circuits. 11

2.4      The Carry is a Prefix Operation.. 12

2.5      Ripple Carry.. 13

2.6      Carry select.. 14

2.7      Carry look-ahead.. 16

2.8      Carry skip. 18

2.9      Threshold Circuits. 20

2.10     Conclusions. 22

Chapter 3: Digital Circuit Design with Bipolar Transistors and Current Steering Logic   23

3.1      Introduction.. 23

3.2      The Current Switch.. 24

3.2.1       Series Gating and Emitter Followers. 25

3.2.2       Current Sources. 27

3.2.3       An Issue of Nomenclature. 28

3.3      Bipolar Circuits and Designing Logic for Speed.. 29

3.3.1       fT and fMOSC. 29

3.3.2       Latency versus Bandwidth. 30

3.4      Noise margin and voltage swing.. 32

3.5      Device Sizing in Loaded Buffers. 34

3.5.1       Tail Current 35

3.5.2       Current-Switch Transistor Size. 37

3.5.3       Emitter-Follower Size. 38

3.5.4       Interconnect Parasitics. 41

3.6      The Look-Ahead Gate. 42

3.7      Conclusions. 45

Chapter 4: Carry Select Optimization.. 47

4.1      Introduction and Background.. 47

4.1.1       On adders and critical paths. 47

4.1.2       On yield limited technologies. 48

4.2      Origin and Theory of Carry Select Addition.. 48

4.3      Optimization of Carry Select Stage Sizes. 50

4.4      Optimal 32-bit ALU with Carry Select Addition.. 55

4.4.1       Logic Design. 55

4.4.2       Circuit Design. 57

4.4.3       Layout of a Monolithic ALU.. 62

4.5      Considerations Affecting the Layout.. 67

4.5.1       Simulation of the design. 68

4.6      F-RISC “Byte-Slice” Carry Select Implementation.. 70

4.6.1       A Multi-Chip Processor and Another Look at Yield Limitation. 71

4.6.2       Comparison to Optimized Adder, And Other Possibilities. 72

4.7      Conclusions. 75

Chapter 5: The Pseudo-Carry Look-Ahead Adder.. 76

5.1      Goals and objectives. 76

5.1.1       Previous work: FRISC-G.. 76

5.1.2       The next FRISC.. 76

5.1.3       SiGe. 77

5.1.4       Pseudo-carry Look-ahead. 77

5.2      Pseudo-carry Theory of Operation.. 77

5.2.1       Generalized pseudo-carry equations. 81

5.3      The DARPA2 Reticle. 82

5.3.1       Logical structure of the carry tree and circuit implementation. 82

5.3.2       Test structure. 83

5.3.3       Measurement and Analysis of DARPA2 Test Structure. 88

5.3.3.1 Resistor model 90

5.3.3.2 Temperature. 91

5.3.3.3 Interconnect Parasitics. 92

5.3.3.4 Device models. 94

5.4      SMI December 2000 Reticle. 96

5.4.1       Carry-in to a PCLA.. 97

5.4.2       Observable paths in Test structure. 99

5.4.3       Expanded transistor sizes. 100

5.4.4       Transistor bias point 100

5.4.5       Interconnect changes due to design kit and fabrication options. 101

5.4.6       Simulation temperature. 102

5.4.7       Interconnect parasitic extraction. 102

5.4.8       Cell Layout for SMI00 Reticle. 108

5.4.9       Test Structure Layout for SMI00 Reticle. 112

5.4.10      Measurement and Analysis of the Second Test Structure. 117

5.4.10.1 Continuing Parasitic Analysis. 119

5.4.10.2 Continuing Temperature Concerns. 119

5.5      Conclusions. 120

Chapter 6: QuickCap Usage and Design Flow in the FRISC Group at RPI 121

6.1      Introduction.. 121

6.1.1       QuickCap 3D capacitance extraction. 121

6.1.2       Parasitic Extraction with Cadence Design System and the SiGe5HP Design Kit. 121

6.1.3       Using CDS and QuickCap on SiGe5HP Designs. 122

6.2      Design Tool Integration.. 123

6.2.1       CDS/QuickCap Theoretical Design Flow.. 123

6.2.2       CDS/QuickCap/SmartSpice Theoretical Design Flow.. 125

6.2.3       Manual Method. 126

6.2.4       Component Status. 127

6.2.4.1 Schematic to Spectre netlist, via Affirma netlister 127

6.2.4.2 Schematic to HSpice netlist, via Affirma netlister 127

6.2.4.3 Layout to GDSII stream, via PIPO.. 127

6.2.4.4 GDSII stream to CAP and SPICE, via gds2cap. 127

6.2.4.5 HSpice and SPICE to SPICE and rename, via SvS. 127

6.2.4.6 CAP and rename to SPICE fragment, via QuickCap. 128

6.2.4.7 Spectre and SPICE fragment, via Spectre simulator 128

6.2.5       Pending Tasks and Experiments. 128

6.2.5.1 _G<number> filter for hspiceS netlister 128

6.2.5.2 GDSII rewriting attributes. 129

6.2.5.3 Pin Conversion on Stream Out 130

6.2.5.4 Make gds2cap understand the hierarchy of our GDSII streams. 130

6.2.5.5 Make Affirma include the SPICE fragment from QuickCap. 131

6.2.5.6 Keep technology files up to date with current design kit. 131

6.2.5.7 Understanding QuickCap parameters. 132

6.2.5.8 Migration to Parallel Processing Facilities. 132

6.2.5.9 Break up FEOL and BEOL layers into separate technology files. 132

6.2.5.10 Identify SmartSpice-isms. 133

6.2.5.11 Handle losses in substrate. 133

6.3      Tool Procedure. 133

6.3.1       GDSII Stream out 134

6.3.2       gds2xxx. 134

6.3.2.1 Technology File. 134

6.3.2.2 Command Line. 135

6.3.3       SvS. 136

6.3.3.1 Definitions File. 137

6.3.3.2 Command Line. 137

6.3.4       QuickCap. 138

6.3.4.1 Command Line. 138

6.3.5       QuickPrint 140

6.3.6       Error Messages. 141

6.3.6.1 Gds2xx. 142

6.3.6.2 PROGRAM ERROR IN EstimateResistance() 142

6.4      Problems and Shortcomings. 142

6.4.1       Name mapping alternatives in the Cadence Design System to QuickCap transition. 142

6.4.1.1 Keep pin information as attribute number 142

6.4.1.2 User-Defined Property Mapping File. 143

6.4.1.3 Convert Pin to. 143

6.4.1.4 Convert Pin Label Layer to Pin Layer 144

6.4.1.5 Scripted label file generator 145

6.4.1.6 SvS (Schematic versus Schematic). 145

6.4.2       Back-annotation in the QuickCap to Cadence transition. 145

6.4.2.1 Intermediate SPICE to Spectre parsing script 145

6.4.2.2 Intermediate numeric to Spectre parsing script 145

6.4.2.3 Cadence SPICE reader for Spectre netlister 146

6.4.2.4 cap+spef and SPEF interchange format 146

6.4.3       Cell Hierarchy in Extraction. 147

6.4.3.1 Cell Name Mapping. 147

6.4.3.2 Exporting Hierarchies. 147

6.4.3.3 Insertion in Spectre Netlist 148

6.4.4       RC Extraction. 148

Chapter 7: State of the Art and Future Directions.. 149

7.1      Introduction.. 149

7.2      The State of the Art.. 149

7.3      Exploitation of the idempotency of the prefix operation.. 151

7.4      Full Arithmetic-Logic Unit.. 152

7.5      Increasing depth of series-gating in look-ahead gates. 156

7.6      Dotted-emitter/dotted-collector circuitry.. 157

7.7      Increased operand width.. 161

7.8      Other emitter-follower enhancements. 162

7.9      Micropipelining.. 164

7.10     Utilization of BiCMOS circuit designs. 166

7.11     SiGe 7HP, 8T and Further Processes. 167

7.12     Adaptations for non-CPU applications. 170

Chapter 8: Research Conclusions.. 172

Appendix A: References.. 174

Appendix B: DARPA02 TESTPCL Pseudo-carry Look-ahead Test Structure Netlists and Schematics   183

B.1      Testpcl. 183

B.2      Vref_8mA.. 186

B.3      PadReceiver_ESD_3o_1u.. 187

B.4      PadDriver_8m_1u.. 189

B.5      orb2q.. 191

B.6      buf2q.. 193

B.7      latch.. 194

B.8      staticq.. 196

B.9      bufq.. 197

B.10        vref2. 199

B.11        ef3. 200

B.12        ef2. 201

B.13        etree2. 203

B.14        pcl14noc.. 204

B.15        pcl18c.. 207

B.16        and3. 210

B.17        pclsub2. 211

B.18        ef4. 212

B.19        etree3. 213

B.20        pcl6noc.. 216

B.21        pcl6c.. 218

B.22        vref1. 220

B.23        hstart.. 221

B.24        istart.. 223

B.25        pclsub2c.. 224

B.26        hsc1q.. 226

Appendix C: SMI00 TESTPCLL2u Pseudo-carry Look-ahead Test Structure Netlists and Schematics   228

C.1      testpclL2u.. 228

C.2      padd_RF_SE. 230

C.3      padr_ESD_DC_SE_3o_1u.. 232

C.4      efL2u.. 234

C.5      Vref_8mA_45_5l. 235

C.6      and2bqL3u.. 236

C.7      etree2L2u.. 237

C.8      efL6u.. 239

C.9      vref2L2u_a.. 240

C.10        mslatchL2u.. 241

C.11        buf2qL2u.. 243

C.12        efL4u.. 244

C.13        pclrosc16L2u.. 245

C.14        pclrosc16cL2u.. 248

C.15        staticq.. 250

C.16        pclrosc4L2u.. 251

C.17        and3L2u.. 253

C.18        etree3L2u.. 254

C.19        pclrosc6L2u.. 256

C.20        pclrosc4cL2u.. 258

C.21        and2bL2u.. 259

C.22        pclroscc2L2u.. 260

C.23        ef2L4u.. 261

C.24        istartL2u.. 262

C.25        hstartL2u.. 264

Appendix D: Auxiliary Files for QuickCap Usage. 266

D.1      Strmout.template. 266

D.2      Pinlabels.il. 266

D.3      QuickCap Technology File Declarations. 267

D.4      Details of the SiGe5HP technology files. 268

D.5      .SvSdefs. 273

Table of Figures

Figure 1‑1: A very simple representation of the basic operations of a processor.. 3

Figure 1‑2: Basic units of a pipelined RISC processor, and connections between them... 4

Figure 1‑3: Three-dimensional rendering of a Silicon-Germanium HBT.. 8

Figure 2‑1: Logical diagram of ripple carry.. 12

Figure 2‑2: Prefix graph for ripple carry.. 13

Figure 2‑3: Carry select.. 14

Figure 2‑4: Prefix graph for carry select.. 15

Figure 2‑5: Logical diagram of (flat) carry look-ahead.. 16

Figure 2‑6: Logical diagram of block look-ahead.. 17

Figure 2‑7: Prefix graph for Kogge-Stone adder.. 18

Figure 2‑8: Prefix graph for Brent-Kung adder.. 18

Figure 2‑9: Carry skip. 19

Figure 3‑1: A bipolar current switch, configured as a digital buffer.. 24

Figure 3‑2: Series-gating of current switches. 26

Figure 3‑3: Simulated base-emitter potential versus collector current for a 0.5 x 1.0 micron device, Vce=0.3  29

Figure 3‑4: Graphical solution for voltage swing/noise margin relations. 33

Figure 3‑5: Buffer delay, for buffer transistor size equal to emitter-follower size. 37

Figure 3‑6: Fully differential gates to perform look-ahead for two or three bits. 42

Figure 3‑7: Look-ahead gate with mixed single-ended and differential inputs for two or three bits  43

Figure 3‑8: Annotated layout of a two-way look-ahead gate with differential and single-ended switches identified.. 44

Figure 3‑9: Annotated layout of a three-way look-ahead gate with differential and single-ended switches identified.. 46

Figure 4‑1: Representation of a Carry Select Adder.. 49

Figure 4‑2: Logical Organization of the Optimized ALU.. 57

Figure 4‑3: Carry Generation Circuit.. 58

Figure 4‑4: Sum Generation Circuit.. 59

Figure 4‑5: Carry Selection Circuit.. 60

Figure 4‑6: ALU Function Generator.. 61

Figure 4‑7: HEAD cell Layout.. 62

Figure 4‑8: MID cell Layout.. 64

Figure 4‑9: CMUX Cell Layout.. 66

Figure 4‑10: Five bit carry select stage from optimized ALU layout.. 67

Figure 4‑11: Testing the ALU.. 69

Figure 4‑12: 32-SPICE Simulation of 32-bit Carry Select Adder.. 70

Figure 4‑13: Comparison of Adder and Register File Areas. 74

Figure 5‑1: Prefix graph for the pseudo-carry look-ahead test structure on the DARPA02 reticle  81

Figure 5‑2: Blocks arranged in a pseudo-carry look-ahead tree. 83

Figure 5‑3: Carry tree test structure. 84

Figure 5‑4: Buffer delay vs. tail current for 9805A design kit.. 85

Figure 5‑5: Layout of the PCLA test structure on the DARPA02 reticle. 86

Figure 5‑6: HSpice simulation of DARPA02 test structure. 87

Figure 5‑7: Oscilloscope trace of the high-speed output.. 88

Figure 5‑8: Annotated microphotograph of fabricated test structure. 89

Figure 5‑9: Breakdown of measured delay by source. 90

Figure 5‑10: Variation of HBT device model resistance parameters. 95

Figure 5‑11: Variation of HBT device model capacitance parameters. 96

Figure 5‑12: Prefix graph for the pseudo-carry look-ahead test structure on the SMI reticle, employing the "32-plus" carry-in method.. 98

Figure 5‑13: Extended test structure for the SMI reticle. 99

Figure 5‑14: Delay for minimum pitch wiring using various methods of parasitic estimation   106

Figure 5‑15: Delay for sparse pitch wiring using various methods of parasitic estimation. The horizontal axis is wire length in microns, while the vertical axis is delay in picoseconds, including intrinsic gate delay. 107

Figure 5‑16: Annotated layout of a two-way look-ahead gate with differential and single-ended switches identified.. 108

Figure 5‑17: Annotated layout of a three-way look-ahead gate with differential and single-ended switches identified.. 109

Figure 5‑18: Layout of the "hstart" gate used to start the generation of pseudo-carries from the operand bits. 110

Figure 5‑19: "Long and thin" layout of an emitter-follower, shifting a signal one to three levels down. 110

Figure 5‑20: Layout for the PCLA test structure on the SMI00 reticle. 111

Figure 5‑21: Simulation output of the PCLA test structure on the SMI00 reticle. 112

Figure 5‑22: Quarter wafer carrying the SMI00 reticle. 114

Figure 5‑23: A closer view of the sites on the SMI00 reticle. 115

Figure 5‑24: Annotated microphotograph of the pseudo-carry look-ahead test structure on the SMI00 reticle. 116

Figure 5‑25: Oscilloscope output of the SMI00 test structure. 117

Figure 6‑1: Proposed RPI CDS/QuickCap design flow... 124

Figure 6‑2: 3D rendering of a SiGeHP NPN via QuickPrint and POVRay.. 140

Figure 7‑1: Schematic of the Intel Pentium 4 "double-pumped" adder, from [HINT01] 150

Figure 7‑2: Completing the sum with parallel term... 153

Figure 7‑3: Clearing the carry for logic operations. 154

Figure 7‑4: Moving the carry-clearing  circuitry off the critical path.. 155

Figure 7‑5: Increasing height of decision tree and variation of delay as a function of input level  156

Figure 7‑6: Dotted-emitter (dotted-OR) 158

Figure 7‑7: Dotted-collector (dotted-AND) 158

Figure 7‑8: Limiting the low-level output of the "dotted-and". 159

Figure 7‑9: Proposed dotted and/or implementation of three-way look-ahead function   160

Figure 7‑10: Typical emitter-follower with passive current sources. 163

Figure 7‑11: Emitter-follower with cross-coupled active pull-downs. 163

Figure 7‑12: Peak of the fT curve for a 0.12 by 0.8 micron device in the 8T process. 168

Figure 7‑13: Delay as a function of driven gate loads for 5HP and 8T processes. 169

Figure B‑1: Schematic for toplevel cell “testpcl”. 183

Figure B‑2: Schematic for cell “Vref_8mA”. 186

Figure B‑3: Schematic for cell “PadReceiver_ESD_3o_1u”. 187

Figure B‑4: Schematic for cell “PadDriver_8m_1u”. 189

Figure B‑5: Schematic for cell “orb2q”. 191

Figure B‑6: Schematic for cell “buf2q”. 192

Figure B‑7: Schematic for cell “latch”. 194

Figure B‑8: Schematic for cell "staticq". 196

Figure B‑9: Schematic for cell "bufq". 197

Figure B‑10: Schematic for cell "vref2". 198

Figure B‑11: Schematic for cell "ef3". 200

Figure B‑12: Schematic for cell "ef2". 201

Figure B‑13: Schematic for cell "etree2". 202

Figure B‑14: Schematic for cell "pcl14noc". 204

Figure B‑15: Schematic for cell "pcl18c". 207

Figure B‑16: Schematic for cell "and3". 210

Figure B‑17: Schematic for cell "pclsub2". 211

Figure B‑18: Schematic for cell "ef4". 213

Figure B‑19: Schematic for cell "etree3". 214

Figure B‑20: Schematic for cell "pcl6noc". 216

Figure B‑21: Schematic for cell "pcl6c". 218

Figure B‑22: Schematic for cell "vref1". 220

Figure B‑23: Schematic for cell "hstart". 221

Figure B‑24: Schematic for cell "istart". 223

Figure B‑25: Schematic for cell "pclsub2c". 224

Figure B‑26: Schematic for cell "hsc1q". 226

Figure C‑1: Schematic for cell "testpclL2u". 228

Figure C‑2: Schematics for cell "padd_RF_SE". 230

Figure C‑3: Schematic for cell "padr_ESD_DC_SE_3o_1u". 232

Figure C‑4: Schematic for cell "efL2u". 234

Figure C‑5: Schematic for cell "Vref_8mA_45_5l". 235

Figure C‑6: Schematic for cell "and2bqL3u". 236

Figure C‑7: Schematic for cell "etree2L2u". 237

Figure C‑8: Schematic for cell "efL6u". 238

Figure C‑9: Schematic for cell "vref2L2u_a". 240

Figure C‑10: Schematic for cell "mslatchL2u". 241

Figure C‑11: Schematic for cell "buf2qL2u". 243

Figure C‑12: Schematic for cell "efL4u". 244

Figure C‑13: Schematic for cell "pclrosc16L2u". 245

Figure C‑14: Schematic for cell "pclrosc16cL2u". 248

Figure C‑15: Schematic for cell "staticq". 250

Figure C‑16: Schematic for cell "pclrosc4L2u". 251

Figure C‑17: Schematic for cell "and3L2u". 252

Figure C‑18: Schematic for cell "etree3L2u". 254

Figure C‑19: Schematic for cell "pclrosc6L2u". 256

Figure C‑20: Schematic for cell "pclrosc4cL2u". 258

Figure C‑21: Schematic for cell "and2bL2u". 259

Figure C‑22: Schematic for cell "pclroscc2L2u". 260

Figure C‑23: Schematic for cell "ef2L4u". 261

Figure C‑24: Schematic for cell "istartL2u". 262

Figure C‑25: Schematic for cell "hstartL2u". 264


Table of Tables

Table 3‑1: Buffer delay in picoseconds for one load.. 38

Table 3‑2: Buffer delay in picoseconds for two loads. 39

Table 3‑3: Buffer delay in picoseconds for three loads. 40

Table 3‑4: Buffer delay in picoseconds for six loads. 40

Table 4‑1: Estimation of delay versus number of stages: s=1, B=32. 54

Table 4‑2: Estimation of delay versus number of stages: s=2, B=32. 55

Table 4‑3: Representative Timings From SPICE Simulation.. 69

Table 5‑1: Effects of erroneous resistor modeling in 9805A design kit.. 91

Table 5‑2: Delay for each feedback path in the SMI00 test structure, at 75°C.. 113

Table 5‑3: Minimum measured delay from the SMI00 test structure. 118

Table 5‑4: Delay for various capacitance extraction methods for the SMI00 test structure, updated to IBM SiGe 5HP v2.5 design kit.. 118

Table 5‑5: Delay for various temperatures. 119

Table 7‑1: Published adder/carry data.. 151

Table 7‑2: Delays for latched look-ahead gate. 165

 


Acknowledgment

For their roles in assisting and inspiring this research, thanks go out to:

·        John F. McDonald, thesis advisor

·        Mukkai Krishnamoorthy, Michael Savic, and Paul Schoch, members of the doctoral committee

·        Hans Greub and Russ Kraft, current and former faculty associated with the F-RISC group

·        The numerous students of the F-RISC group over the years

 

This research was sponsored in part by the Defense Advanced Research Projects Agency (DARPA) under contracts N66001-96-8606, DAAH04-93-G-04777, and N00173-99-1-G013.

 


Abstract

Binary addition is a simple, ubiquitous component of computational circuits. One can hardly imagine a computer that did not add; to many it would not even merit the name. In both general-purpose and application-specific processors, the adder delay is a strong metric for cycle time.

This research spans three areas that contribute to adder speeds: logical arrangement of carry generation, circuits to implement that arrangement, and high-speed semiconductor devices to realize those circuits.

Carry generation belongs to a class of parallel computation problems known as parallel prefixes. The basis of this work’s logical design is pseudo-carry look-ahead, a method that uses tree-like structures to minimize gate depth on critical paths and trades delay from critical paths to non-critical paths.

The logical forms that reduce the serial computations necessary for addition are interrelated with the circuit forms that allow the fastest generation of those computations. Special circuits to compute look-ahead in a single gate reduce signal path length and allow driving of signals at high speeds.

Silicon Germanium HBTs provide high-speed devices while leveraging the mature lithography of traditional silicon processes. Not only can fast circuits be built, but the high level of integration accommodates not just large units such as adders but the whole systems into which adders would be embedded.

The combination of these three areas has allowed the construction of a 32-bit pseudo-carry look-ahead circuit with a delay of 146 ps in a 50 GHz fT SiGe process. In addition, directions for future work have been established that lead to delays on the order of 32 ps.


Chapter 1:  Introduction

“Speed has always been important otherwise one wouldn't need the computer.”

Seymour Cray

1.1      Wanting a fast adder

Binary addition is one of the smallest “complex” Boolean circuits, something beyond just the Boolean primitives. It is one of the simplest applications of Boolean logic that has meaning and usefulness in the real world.

Adders are at the center of every computer. Adding machines were in fact the forebears of computers, which were envisioned as calculators for more advanced arithmetic. Circuits for the computation of higher-order mathematical functions such as multiplication employ adders as sub-circuits [SWAR90].

Adders are also not restricted to general-purpose processors. Digital signal processors and network processors also need to perform arithmetic in their operations [SWAR97]. A fast adder can improve their performance, possibly with even greater effect than on a general-purpose processor. As the bottleneck in computation shifts between the processor and data I/O, there remains a demand for fast addition [FLYN01].

Being both a simple circuit and a building block for a very popular system, binary adders are strong candidates for special-case optimization. The simplicity of the operation makes the problem of delay optimization tractable. The ubiquity of the operation allows the savings due to optimization to be reaped multiple times, so that design effort is transformed into latency reduction with high efficiency. Extensive logic design and handcrafted circuits for adders carry the possibility of great returns, in terms of processor performance, on the effort invested. Examination of the logic of addition exposes the underlying parallel computation issue. Exploiting this parallelism reduces delay by doing work in different areas at the same time where possible, instead of serially. Crafting the circuits by hand yields the optimum performance for the specific conditions created by the parallel logic. Through adjustments in areas such as gate topology, input symmetry, and re-ordering of functions, delay can be moved off the critical paths that define the overall delay of the adder and onto other, shorter paths.

These activities could in theory be undertaken on any circuit in a processor. However, available designer effort is finite, and not every circuit will produce great gains for the overall system when more design effort is applied to it.

After the trivial ripple carry adder, early research led to carry-skip [KILB59] and carry-select [BEDR62], both of which have the advantage of a nearly bit-slice arrangement in the physical layout [OKLO85]. Full examination of parallelization led to block carry look-ahead [BREN82]. Although requiring a large area for its speed increases, carry look-ahead often attracts the interest of bipolar designers, who are already driving hard toward the fastest possible circuits [BEWE88], and that of BiCMOS designers as well [KUO93]. Recent work has focused not on general-purpose stand-alone adders but on adders incorporated in other circuits, such as multipliers, which present an uneven input profile to the adder [STEL96].

Figure 1‑1: A very simple representation of the basic operations of a processor


1.2      A metric for cycle time

A processor in its most basic form may be described as running in a loop of the following actions:

·        Retrieve some data from memory

·        Perform some operation on that data

·        Place the result back into memory.

For an actual processor, there of course needs to be some sort of controlling state machine that manages these actions, and the memory system requires rather more detail. However, this description begins to provide a framework for identifying where the delay that creates the cycle time is produced.

An empirical analysis of processor cycle times shows a strong correlation with this simplified model. Delay can be modeled as the sum of the time for the arithmetic-logic unit (ALU) to perform the operation plus the wiring delay of the longest connection from the controlling unit to the operations unit [SAIH95].

Figure 1‑2: Basic units of a pipelined RISC processor, and connections between them


If a more sophisticated example of a pipelined RISC processor is examined, it can still be seen that the delay between pipeline latches, and thus the cycle time, is at a minimum the ALU delay plus the delay of the control signals to it. The register-file delay plus its control signal delay is another candidate for bounding the cycle time, but since the longest delay between pipeline stages defines the clock cycle for all stages, the relative importance of the ALU delay versus the register-file delay is determined merely by which is currently the larger value. Nothing inherent in the general properties of either functional unit makes its delay more important than the other. The ALU delay is still worthy of investigation.

1.3      Parallel circuits and prefix computation

The critical path in adders often runs through the carry signals. Each carry depends on all preceding operand bits. This kind of computation belongs to a general category called “prefix operations”. These have turned out to be one of the fundamental areas of the study of parallel algorithms. Recognizing the parallelism that can be brought to bear in solving prefix problems is key to the development of fast adders [BREN82].

The connectivity of a carry circuit can be abstracted as a directed graph. The inputs are the operand bits at each bit position. “Processor” nodes, in contrast to simple buffer nodes, combine partial carries via the appropriate block carry operation for the carry method being considered. The nodes that represent the carry outputs for each bit position must have as predecessors in the carry graph the operand bits for each bit position of lesser or equal significance.

Given this graph representation, the analogy can be drawn between improving adder latency and minimizing the depth of spanning trees. The shallower the tree that can be constructed, the fewer gates each path to the carry outputs must pass through. However, there are constraints on both the construction and the drive capability of the gate circuitry that translate into finite capabilities for the “processor” nodes. These restrictions typically manifest themselves as limits on fan-in and fan-out of the “processor” nodes, with function complexity possibly being a concern as well. Thus, the mathematical basis of latency reduction of the arithmetic carry is established as the bounded fan-in/fan-out reduction of spanning trees.

The first area that this work will pursue is the mathematics of parallel prefixes and the mapping between prefixes and types of adders. This will provide an analytical basis for delay optimization of addition.

1.4      Silicon Germanium bipolar and BiCMOS for high-speed processors

The very first transistor was formed with a bipolar junction. When MOSFETs were developed, bipolar circuits held a speed advantage. However, CMOS was able to take over more and more of the VLSI arena since it had lower needs for both stand-by current and supply voltage. Its increased use led to the more rapid maturing of CMOS lithography as compared to bipolar, creating a further advantage in terms of transistor density. Still, when the utmost speed was called for, bipolar circuits were needed.

The F-RISC project generated a body of work concerned with the development of a spartan RISC processor in a Gallium Arsenide HBT process from Rockwell. Hetero-structures and III-V technologies offer higher device speeds than silicon bipolar devices, let alone silicon MOSFETs. However, the yield of these technologies is terribly low for highly integrated systems. For a 10,000-transistor die in the Gallium Arsenide technology, testing often passes no more than one in three. Other technologies, such as Indium Phosphide, might offer faster devices but with even worse yields. They are typically seen only at very low integration levels in applications such as communications amplifiers, which might use only a handful of devices and can therefore be produced with usable yields. For the digital designer looking for high integration, the “exotic” nature of these technologies has retarded the maturation of the semiconductor fabrication and presents great difficulty from a practical manufacturing standpoint.

Silicon Germanium BiCMOS processes, such as the SiGeHP [AHLG97] from IBM, provide access to the speeds of bipolar circuits while leveraging the mature lithography of CMOS [CRES98]. It is possible to combine the steps needed for a polysilicon emitter and a graded germanium base into an extant CMOS process. This makes small feature sizes available to the bipolar layers via lithographic methods whose development is amortized over the high production volumes of CMOS chips. Such high levels of development would be difficult to cover with a bipolar-only process. In turn, the integration and power characteristics of the CMOS circuits can be used outside the speed-critical areas, bringing processing onto the bipolar die that previously required a separate package. This also permits combining quality CMOS devices with quality bipolar devices into single circuits in ways that were not previously possible.

The silicon germanium heterojunction bipolar transistor (HBT) also possesses advantages over high-performance III-V HBTs. The silicon substrate possesses advantageous mechanical properties, such as low-defect wafers, high thermal conductivity, mechanical strength, and acceptance of wide ranges of doping.

Bipolar devices provide the ability to construct effective small-swing differential logic circuits. If two HBTs are arranged as a differential current switch, the switching of the current from one device to the other is an exponential function of the differential voltage across the inputs of the pair. This results in a high gain for the circuit, which can be traded off against a very small differential input swing to produce short switching times.

Figure 1‑3: Three-dimensional rendering of a Silicon-Germanium HBT


The construction of HBT-based digital gates will be addressed, as a necessary preamble to the development of an arithmetic unit. Description and analysis of differential current switches and small-swing outputs will be related to device and circuit parameters and how they affect circuit latency and load-driving capability.

1.5      The DARPA2 and SMI00 Reticles

A major difficulty that research in high-speed circuitry faces is that the advanced semiconductor manufacturing required often outstrips the budget of such a research program. Some processes might not even be available to such a market at all. Whenever an opportunity to get circuits on a fabrication run presents itself, every effort is made to utilize the space available as much as possible. However, since the timing of the fabrication run is in the hands of the third party supplying the reticle space, this can cause disruptions to the design cycle that reduce the efficiency with which the design can be verified. This unfortunately causes delays in analysis and communication and leads to design decisions that common sense would fault. Part of the task of university research writing, then, is on the one hand to identify and explain those decisions influenced by third-party, non-technical issues, and on the other to anticipate and prepare for such situations by creating a design process with reduced sensitivity to such factors, one that can produce a reliable design at more frequent intervals.

Two such fabrication opportunities arose during this work. DARPA sponsored a two-run multi-user reticle program to expand the research opportunities in silicon germanium for the projects under its direction. The first reticle carried a three-port register file from the F-RISC group. The second reticle carried a new register file and a SERDES [KRAW00], as well as the pseudo-carry look-ahead test structure described in this work.

Sierra Monolithics, Inc., also graciously donated space on two reticles for test structures built by the F-RISC group. The first such reticle included an improved SERDES. The second included a BiCMOS FPGA design, along with an updated pseudo-carry look-ahead test structure that is discussed in this work.

This work will cover the design, analysis, fabrication, and testing of the carry test structures for these two reticles. Central to this work will be the analysis of the measured results. Sources of error need to be identified and the portions of the design process that created these error sources corrected.


Chapter 2:  Addition as a Parallel Prefix Problem

2.1      Introduction

Arithmetic carries belong to the set of functions called “prefix problems”. This kind of function generates a series of results where each term depends on the previous term, i.e. each result is the “prefix” for the next. Efficient solutions to prefix problems depend on generating certain sub-sections of prefixes in parallel and then combining them to produce the complete results. Prefixes are one of the core ideas of parallel computation.

2.2      The Prefix Operator

Consider a series of terms on some associative operator “·”, e.g. x0, x0 · x1, x0 · x1 · x2, etc. If we look at each term of the series, Fn = x0 · x1 · … · xn-1 · xn, it’s clear that due to the associativity of our operator each term may be rewritten as a recurrence Fn = Fn-1 · xn. In other words, each term of the series is generated by applying our associative operator to a new variable and a prefix that turns out to be the previous term in the series. The set of problems to which such a construction applies is referred to collectively as “prefix problems”. Prefix problems form the theoretical basis of quite a few practical computational circuits, notably among them carry trees for addition. The very idea of prefix circuits was first introduced as part of a fast binary adder [OFMA63].
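
As a concrete illustration (not drawn from the thesis itself), the recurrence can be evaluated as a straightforward serial scan; the function and variable names below are illustrative choices only.

    from operator import add

    def serial_prefix(xs, op=add):
        """Serial prefix (scan): F[n] = F[n-1] (op) x[n], directly from the recurrence."""
        results = []
        acc = None
        for x in xs:
            acc = x if acc is None else op(acc, x)
            results.append(acc)
        return results

    # Example: prefix sums of [1, 2, 3, 4] -> [1, 3, 6, 10]
    print(serial_prefix([1, 2, 3, 4]))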

If the series is generated purely by means of direct application of the recurrence relation, it is apparent that the time to generate the output of a prefix problem will grow linearly with the size of the input. To reduce this time, methods of parallel prefix generation are required. Indeed, prefix computation has become fundamental to the field of parallel algorithms [LAKS94], dating back to the COMPRESS operator in APL [IVER62].

2.3      Depth/Size tradeoff in prefix circuits

A serial prefix circuit with n inputs can be shown by inspection to require n − 1 operations and time n − 1 to complete. It can be shown for any prefix circuit that the lower bound of the sum of the time and the size of the circuit is 2n − 2 [SNIR86]. This suggests that increasing the number of operations could be used to reduce the depth of the circuit. In VLSI circuits, the ever-increasing integration makes this tradeoff easy to make as well as desirable.

Since the operator used to build the prefixes is associative (by definition), the prefix circuit can be built up by binary division. The inputs are divided into a lower-order and a higher-order half, and the prefixes for each half are computed. The results from the lower-order half are then applied to the partial prefixes for the higher-order half to produce the complete prefixes. This division is applied recursively to produce the prefixes for each half. The size of an n-input circuit is twice the size of an n/2 circuit plus the operations needed to produce the complete prefixes for the higher-order half, while the depth is one more than the depth for n/2 inputs. Unrolling the recurrence, the size is (n/2) log n and the depth is log n. A different divide-and-conquer strategy was used in [BREN82] to reduce the size to O(n).
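
The binary-division scheme described above can be sketched in a few lines; this is a minimal illustration under the stated assumption of an associative operator, not circuitry from this work.

    def parallel_prefix(xs, op):
        """Divide-and-conquer prefix: solve each half recursively, then apply the
        last prefix of the low half to every prefix of the high half. The depth
        grows as log n, and each combination level contributes n/2 operations."""
        n = len(xs)
        if n == 1:
            return list(xs)
        mid = n // 2
        low = parallel_prefix(xs[:mid], op)
        high = parallel_prefix(xs[mid:], op)
        # In hardware these n/2 combination operations occur in parallel.
        return low + [op(low[-1], h) for h in high]

    # Prefix sums again: [1, 3, 6, 10]
    print(parallel_prefix([1, 2, 3, 4], lambda a, b: a + b))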

2.4      The Carry is a Prefix Operation

Figure 2‑1: Logical diagram of ripple carry


It can be shown that addition is a type of prefix operation. Specifically, it is the carry generation that constitutes a system of prefixes. The sum at each bit position is a function of the operand bits at that position and the carry-out from the preceding position. The carry-out of position n depends on the operands at position n and the carry from the preceding position n−1, which can be expressed in terms of the recurrence relation for a prefix operation given above [BREN82]. The basic serial computation of prefixes is the equivalent of ripple carry. If we consider the delay of a gate that computes “·” and the area that the circuit occupies as our basic units, generating prefixes in series over n terms will take n − 1 time units but occupies a space of only n − 1. At the other extreme, each prefix could be computed independently in constant time at great expense in area (and with highly impractical fan-in and fan-out requirements). For a large set of these carry structures, the relationship between depth and size (in terms of processing nodes) is so strong that it is possible to expand or contract the prefix graph with a non-heuristic algorithm to pass from one structure to another [ZIMM96].

Figure 2‑2: Prefix graph for ripple carry


2.5      Ripple Carry

Ripple carry is the simplest form of addition carry, deriving straight from the definition of addition, or from the recurrence relation for a prefix operation. As mentioned above, it is simply the serial generation of prefixes as applied to the case of addition as the associative operator. It consists of two single-digit additions per bit. For each bit, the operands are "half-added", the lowest digit of the result being the sum and the rest being a carry. The carry is then added to the sum of the next bit, producing the addition result. This mechanism of adding the carry at one position to the sum at the next position creates the next carry. The carry signals are said to "ripple" from one bit to the next. The logical structure needed at each bit, as well as the interconnections between bits, are identical. The time to produce all result digits and the necessary circuit area are both linear functions of the operand width.
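
A minimal behavioral sketch of ripple carry follows; the bit ordering (least-significant bit first) and the names are choices made here for illustration, not taken from the thesis.

    def ripple_carry_add(a_bits, b_bits, carry_in=0):
        """Ripple-carry addition, LSB first. Each position 'half-adds' its operand
        bits, then folds in the carry from the previous position; the carry
        recurrence is c[i+1] = g[i] OR (p[i] AND c[i])."""
        sum_bits, carry = [], carry_in
        for a, b in zip(a_bits, b_bits):
            sum_bits.append(a ^ b ^ carry)
            carry = (a & b) | (carry & (a ^ b))   # generate OR (propagate AND carry-in)
        return sum_bits, carry

    # 6 + 3 = 9: bits are LSB first, so 6 -> [0, 1, 1, 0] and 3 -> [1, 1, 0, 0]
    print(ripple_carry_add([0, 1, 1, 0], [1, 1, 0, 0]))   # -> ([1, 0, 0, 1], 0)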

2.6      Carry select

Figure 2‑3: Carry select


Improvements on carry times involve computation of prefixes over groups in order to generate intermediate results in parallel. Carry-select [BEDR62] is based on computing carries over specific groups of bits based on assumed inputs. Partial prefixes that depend only on the bits within a single group are generated in parallel for each group. Then the appropriate partial prefixes from different groups are combined in series to create complete prefixes. The carry out of one group is the carry in of the succeeding group, building a complete prefix for all of the preceding bits in series as it progresses through each group. For each group, partial prefix generation is accomplished by including two complete sets of the circuitry for a local carry-computation method. One copy has an assumed carry-in of one, the other a carry-in of zero. When the carry-in from the previous group is determined, selecting the proper output for each partial prefix generates the complete prefix. All of the "assumed carry" values can be generated in parallel, and then the proper alternatives selected in turn as the carry-in is applied.
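
As a behavioral sketch (assuming ripple carry within each group, and reusing the ripple_carry_add function sketched earlier), carry select can be expressed as follows; the group size is an arbitrary illustrative choice.

    def carry_select_add(a_bits, b_bits, group_size=4):
        """Carry-select sketch: each group precomputes its sum and carry-out for
        both possible carry-ins (0 and 1); the real carry-in then selects one."""
        sum_bits, carry = [], 0
        for i in range(0, len(a_bits), group_size):
            ga, gb = a_bits[i:i + group_size], b_bits[i:i + group_size]
            s0, c0 = ripple_carry_add(ga, gb, 0)   # assumed carry-in of 0
            s1, c1 = ripple_carry_add(ga, gb, 1)   # assumed carry-in of 1
            sum_bits += s1 if carry else s0        # select once the real carry-in arrives
            carry = c1 if carry else c0
        return sum_bits, carry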

Figure 2‑4: Prefix graph for carry select


Though it is not the fastest of carry structures, carry select offers some implementation advantages: it is very easy and quick to implement due to the small number of subcell types needed, and it requires not much more than twice the area of serial carry generation.

Figure 2‑5: Logical diagram of (flat) carry look-ahead


2.7      Carry look-ahead

Carry look-ahead [BREN82] uses a tree structure to parallelize carry generation and obtain an O(log n) computation time. The tree structure is based on two intermediate signals, the "carry propagate" and the "carry generate". If the generate signal at a node is asserted, there is an unconditional carry out at that position, i.e. a carry is "generated" at that point. If the propagate is asserted, the carry out follows the carry in, i.e. a carry is "propagated" through that circuit.

When units creating generate and propagate signals are combined into groups, the generate for the group is asserted if any unit has its generate asserted and all subsequent propagates are asserted. The group propagate is asserted if all unit propagates are asserted. In this manner, the partial prefixes are built up at each node until the root node is reached, at which point the complete set of prefixes has been generated.
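
The block combination just described is itself an associative operator on (generate, propagate) pairs, which is what makes the tree construction possible; a small sketch (with names chosen here for illustration, not from the thesis) is:

    def gp_combine(high, low):
        """Look-ahead combination of (generate, propagate) pairs. The combined
        group generates if the high part generates, or if the high part
        propagates a carry generated by the low part; it propagates only when
        both parts propagate."""
        g_hi, p_hi = high
        g_lo, p_lo = low
        return (g_hi | (p_hi & g_lo), p_hi & p_lo)

    # Bit-level pairs come from the operand bits: g = a & b, p = a ^ b
    print(gp_combine((0, 1), (1, 1)))   # -> (1, 1): the high part propagates the low part's carry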

Figure 2‑6: Logical diagram of block look-ahead


The Brent-Kung topology builds its trees by dividing the inputs into odd and even sets. A partial prefix is constructed from each odd input and its next higher even input. These partial prefixes then become the inputs for a recursive subdivision. The outputs of this subdivision provide the complete prefixes for the even outputs, which are then combined with the next higher odd input to produce the rest of the complete prefixes.

Figure 2‑7: Prefix graph for Kogge-Stone adder


Figure 2‑8: Prefix graph for Brent-Kung adder


While the Brent-Kung prefix topology does exhibit O(log n) depth, maximum fan-out also increases with log n unless a factor of 2 increase in depth is allowed in order to insert buffers to reduce the fan-out at each level. Either of these alternatives has a significant negative impact on delay when realized in circuitry. However, another trade-off is available for prefix structures, this one reducing fan-out by using a much higher number of circuits. The Kogge-Stone prefix graph [KOGG73] exhibits a constant maximum fan-out, but the number of operation nodes increases with n log n.
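
A Kogge-Stone style scan over (generate, propagate) pairs can be sketched as below, reusing the gp_combine operator defined above; this is an illustrative model only, not the circuit implementation used later in this work.

    def kogge_stone_prefix(pairs, op):
        """Kogge-Stone prefix: at level k each position combines with the position
        2**k places lower, giving log2(n) depth, constant fan-out, and on the
        order of n*log2(n) operation nodes."""
        level, dist = list(pairs), 1
        while dist < len(level):
            level = [op(level[i], level[i - dist]) if i >= dist else level[i]
                     for i in range(len(level))]
            dist *= 2
        return level   # level[i] covers bit positions 0..i

    # Example with 4 bit positions (LSB first); pairs[i] = (a_i & b_i, a_i ^ b_i)
    a, b = [0, 1, 0, 1], [0, 1, 1, 0]
    print(kogge_stone_prefix([(x & y, x ^ y) for x, y in zip(a, b)], gp_combine))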

2.8      Carry skip

Carry-skip is a very old type of arithmetic carry [KILB59], based on a concept originally invented by Charles Babbage in 1837. Carry-skip is based on dividing the operand bits into groups and computing carries within each group of bits via a simple method, such as ripple carry, in parallel, as with carry-select. The underlying property is that for any group of bits, either the carry-out is generated solely by the current group or it is solely propagated from the previous group, but not both. Each block not only contains a serial path for building prefixes, but also a decision circuit to determine whether the result of the partial prefix generated by the block would actually impact the result of a complete prefix that included it. If for every bit the propagate signal (see carry look-ahead) is true but the generate signal is not, then the carry-out for the block is equal to the carry-in. In all other cases, i.e. there is at least one carry generate asserted or one bit where neither the generate nor the propagate is asserted, the carry out can be computed solely from the bits in the group.
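
A behavioral sketch of carry-skip (again assuming ripple carry inside each block, and reusing the ripple_carry_add sketch from earlier) might look like the following; in hardware the point is that the skip path costs only a single gate delay.

    def carry_skip_add(a_bits, b_bits, group_size=4, carry_in=0):
        """Carry-skip sketch: carries ripple within each block, but if every bit
        of the block propagates (and so none generates), the block's carry-out
        is simply its carry-in, taken through the skip path."""
        sum_bits, carry = [], carry_in
        for i in range(0, len(a_bits), group_size):
            ga, gb = a_bits[i:i + group_size], b_bits[i:i + group_size]
            block_propagate = all(x ^ y for x, y in zip(ga, gb))
            s, ripple_out = ripple_carry_add(ga, gb, carry)
            sum_bits += s
            carry = carry if block_propagate else ripple_out   # skip path or local carry
        return sum_bits, carry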

Figure 2‑9: Carry skip


Looking at the diagram of carry skip in Figure 2‑9 may give the impression that it depends on a serial path through every bit. However, it is known that if it were possible for a signal to propagate through each block that has a skip, the skip path would also become open and allow any incoming signal “around” the block with constant delay. As a prefix operation, the computation of each piece of a prefix is accompanied by a determination of whether the accumulated result for the prefix is impacted by the piece in question. Considering the topology of the prefix graph, without regard to the actual operation occurring at each node, this turns out to be very similar to a carry-select. The main difference lies in the presence of divergent edges as well as convergent edges.

Carry-skip can be quite fast, with the critical path bypassing whole blocks with a single gate. However, the delay through carry-skip is highly dependent on a complicated relationship between block sizes [KANT93]. Instead of a progression from one end to the other or growth orthogonal to the span of bits, the blocks in carry-skip must expand from each end towards the middle. The longest path actually runs through the first and last blocks while bypassing the intermediate blocks. Increasing the operand size is also not a simple expansion on the end of the adder, and the implementation requires less regular interconnect than other designs [CHAN92]. However, the physical layout of carry-skip can be amenable to a bit-slice organization in a manner similar to carry-select [OKLO85].

2.9      Threshold Circuits

There is a Boolean function of a higher order of complexity than AND, OR, and NOT which is of some note in the field of parallel computation. This function is called the "threshold function". Imagine taking the weighted average of the inputs to a gate. Internally, the gate could compare that average to some value, and the output of the gate would be dependent on that comparison. In other words, the gate determines whether the weighted average exceeds a given threshold, hence the term threshold function.

If we make a more precise definition of the action of the threshold function, such as

T(X) =       1 if Σ (wi·xi) ≥ t,

                  0 if Σ (wi·xi) < t,

we can see that any threshold gate would perform the same basic action, and any feasible action for a specific instance can be specified once given the weight vector W and the threshold t. Threshold gates can be categorized by the characteristics of the weights and thresholds supplied to them. Categories in common use include small weights, integer weights, and bounded weights. Threshold circuits are key in the field of neural computation. The threshold function closely resembles the action of an element of a neural network. These networks can be realized with threshold gates in less than the exponential size that would be required with simple AND-OR circuits, making the fabrication of networks of useful size feasible.
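
The definition above translates directly into a small model; the majority-gate example is an illustrative choice of weights and threshold, not drawn from the thesis.

    def threshold_gate(x, weights, t):
        """Threshold function: output 1 when the weighted sum of the inputs meets
        or exceeds the threshold t, and 0 otherwise."""
        return 1 if sum(w * xi for w, xi in zip(weights, x)) >= t else 0

    # A 3-input majority gate: unit weights and a threshold of 2
    print(threshold_gate([1, 0, 1], [1, 1, 1], 2))   # -> 1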

It is known that, of the set of all possible n-variable Boolean functions, the great majority require a number of gates that increases exponentially just to compute at all, let alone with a small depth and thus a short delay [SHAN49]. Implementation of a circuit whose size grows exponentially is unlikely to be feasible. No means of determining which functions can be implemented with circuits of less than exponential size has thus far been found, short of building such a circuit. As should be apparent by now, it turns out that an AND-OR circuit for binary addition can be constructed in less than exponential size. However, there is a theoretical lower bound of 2 gates on the depth of an AND-OR circuit that computes binary addition, and a circuit of that depth would require exponential size [YAO85][HAST86].

While the main interest of this work is in making every effort to minimize circuit delay, through such means as restricting gate depth and fan-out, if the circuit were of such extraordinary size as to be difficult to fabricate or to integrate within a complete processor the research would be moot. At this point threshold circuits come into the picture, since in a number of cases it has been possible to replace an AND-OR circuit of exponential size with a threshold circuit of the same depth but with more reasonable size constraints. The effect is one of replacing a large number of simple AND-OR Boolean gates with a smaller number of more complex threshold gates.

The difficulty with employing threshold circuits is one of fan-in. The replacement circuit using threshold gates expects unbounded fan-in. High fan-in is frequently a "feature" of threshold circuits, and is a contributing factor in their prevalence in neural computation [SIU95]. However, once the implementation of high-speed digital logic is considered, high fan-in can be seen as a liability instead. Each method of accepting additional inputs into a circuit contributes to the delay of the gate. This will be shown to be as prohibitive in terms of gate delay as the AND-OR circuit was in terms of size. Fan-in as high as would be necessary for even a modest-size adder could make each gate so slow as to result in a net increase in total circuit delay, despite the fact that the circuit would be less than half as deep in terms of gates. Even if such a gate could be constructed with the technology used in this work, using it would not be practical, and thus threshold gates will not be considered in the present work.

2.10     Conclusions

The computational design for an addition circuit has a mathematical analog in the prefix problem. As the prefix problem entails generating terms of a series in parallel, so does addition circuitry involve parallel computation paths. Study of parallel prefixes and their solution forms the theoretical basis of advanced arithmetic circuits.


Chapter 3:  Digital Circuit Design with Bipolar Transistors and Current Steering Logic

3.1      Introduction

Current-steering logic is a category of integrated transistor digital logic gate using a constant current source. Instead of turning current on and off to pull an output line high or low, a constant current is switched between two (or more) possible paths. This kind of digital logic gate is well suited to implementation with bipolar transistors due to the exponential relation between input voltage and output current. A small voltage swing on an input can still rapidly control a large current change, which in turn can rapidly produce a small voltage change on an output.

The basic building block of current-steering logic is the "current switch". The current switch and its operating principles will be described. Following that is a discussion of how current switches can be combined to make gates that compute complex functions. Since current switches comprised of bipolar transistors are being considered, important characteristics of bipolar transistors and how they relate to the speed and load-driving ability of the gates need to be considered. Simulated results for optimizing gate delay are included in that discussion, as well as interconnect parasitics. Finally, a gate will be constructed which computes a function that is key in look-ahead carry structures.

3.2      The Current Switch

Consider a pair of bipolar transistors with a common emitter connection. Given a fixed tail current, the proportion of the current drawn through each collector can be found to be a function of the difference between the two base voltages. As the voltage of one base moves from slightly less than that of the other base to slightly more, the current is “switched” from the second collector to the first.
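
The textbook large-signal relation for an ideal bipolar differential pair makes this behavior concrete; the formula and the numbers below are standard device-physics illustrations (using a thermal voltage VT of roughly 26 mV at room temperature), not measurements or models from this work.

    import math

    def current_split(delta_v, i_tail, v_t=0.026):
        """Collector currents of an ideal bipolar differential pair. The tail
        current divides exponentially with the base-voltage difference delta_v,
        so a swing of only a few V_T steers essentially all of the current."""
        ic1 = i_tail / (1.0 + math.exp(-delta_v / v_t))
        return ic1, i_tail - ic1

    # A +/-100 mV differential input steers about 98% of a 1 mA tail current
    print(current_split(+0.1, 1.0e-3))
    print(current_split(-0.1, 1.0e-3))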

Figure 3‑1: A bipolar current switch, configured as a digital buffer


If resistors of a particular size are connected from each collector to a voltage rail, the switching of current from 0 to ICS can be converted to a switching of voltage differential from the rail from 0 to ICSR. This voltage switching can in turn be used to drive another current switch. What might be described in analog terms as a differential common-emitter amplifier thereby in digital terms will operate as a current-switched buffer or inverter. (Note that with two pull-up resistors a fully differential output is produced, so that inversion is a “free” gate.) Traditionally, this upper voltage rail has been connected to ground since the output voltage is most sensitive to that rail. A negative voltage supply in that case is used to drive the lower rail.

If one base is tied to a fixed reference voltage, swinging the voltage on the opposite base from Vref – Vsw to Vref + Vsw will switch the current to the collector of the transistor not connected to the reference. The pull-up resistor connected to that collector can be sized to swing the voltage on the collector from Vref + Vsw to Vref - Vsw. If a signal above the reference is considered a logical 1 and a signal below the reference a logical 0, the result is a logical NOT operation. This circuit can be referred to as a single-ended current switch, since a single line that switches against a reference represents the input. If multiple transistors are connected in parallel (common collector and emitter connections), raising the base voltage of any of them will switch the current. If each base is driven by a different signal, the output will produce the logical NOR of those signals.

Recall that the current switch inherently provides a differential output. If instead of taking one output and connecting it to a current switch opposite a fixed reference voltage both output rails are connected to a base, a differential current switch is created.

3.2.1                  Series Gating and Emitter Followers

A current switch diverts current from a single source at the common emitter to one of two connections at the collectors. The path from each collector through the current source in turn looks like a controllable current source connected to each collector node. The current source in a current switch could be replaced with a connection to the collector in another current switch. This topology is called "series-gating". The output of the entire circuit would be a logical combination of the inputs of each series-gated level, depending on the connection. A tree built from N levels of series-gated differential current pairs can be used to generate any function of N variables or multiplex up to 2^(N−1) signals.

Figure 3‑2: Series-gating of current switches


The common emitter node of a current switch sits VBE,on below the highest input voltage. Attempting to drive two series-gated levels of current switches with the same input voltages would mean that the VCE of the transistors of the lower switch would be driven to 0 volts and the transistors would be put into saturation. The inputs of a lower level must be driven at a lower voltage relative to the inputs of a higher level. Emitter followers make this possible.

Emitter followers are the equivalent of common-collector amplifiers connected to the outputs of a current-switched gate. With current flowing through the transistor, the output of the emitter follower will be VBE,on below the direct output of the current-switched gate. When driving a lower series-gated level with emitter-follower outputs, VCE is equal to VBE,on, maintaining the nominal VCE of a non-series-gated current switch. While there are reasons to desire a larger VCE, this value is easily generated by the simple emitter-follower circuit and depends only on device-matching issues that are already required in the current-switched gate itself.

Emitter followers also have a benefit for driving large loads. While there is additional device delay while signals propagate through an additional transistor, the sensitivity of the gate delay to loading (RC parasitics in interconnect, high fan-out) is reduced. The sensitivity of the output voltage to current drawn out of the gate is also reduced, although the effects are not typically significant to begin with.

3.2.2                  Current Sources

A current switch pair can control the flow of current, but a source of the current is still needed. Current sources can be categorized as active or passive based on the presence of a transistor in the source circuit.

A bipolar active current source is constructed from a BJT with an emitter resistor. A reference voltage VCS applied to the base of the BJT will in turn fix the voltage across the emitter resistor at VCS − VBE,on − VEE. The current source will then supply (VCS − VBE,on − VEE)/RE into the collector of the source’s transistor. Small variations at the common emitter node of the current switch connected to the source that occur during switching will have little effect on the supplied current. However, the supply voltage must be large enough to allow biasing of the BJT as well as the resistor. If the transistor enters the saturated region, less than the nominal source current will be available. Also required is a voltage reference to generate VCS. A current mirror would be set up to force a transistor base to the correct bias voltage for a set current, which would then be applied to the base of the current-source transistor. If the RE in the reference and the RE in the current source are equal, the current through the source mirrors the current in the reference circuit. This line could be global, and each reference circuit could be shared by many current sources. The nominal value of the current source could be scaled in relation to the current of the reference by adjusting the RE of the current source. The ratio of the currents would be the inverse ratio of the RE values.
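
A short arithmetic sketch of the nominal source current and the mirror scaling described above; the supply and component values are hypothetical placeholders, not values from the designs in this work.

    def active_source_current(v_cs, v_be_on, v_ee, r_e):
        """Nominal active-source current: the emitter resistor sees
        V_CS - V_BE,on - V_EE, so I = (V_CS - V_BE,on - V_EE) / R_E."""
        return (v_cs - v_be_on - v_ee) / r_e

    # Illustrative values only: V_EE = -3.4 V, V_CS = -2.0 V, V_BE,on = 0.85 V and
    # R_E = 1.1 kOhm give about 0.5 mA; halving R_E would double the current,
    # matching the inverse-R_E scaling described above.
    print(active_source_current(-2.0, 0.85, -3.4, 1100.0))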

A passive current source uses only a resistor connected to the common emitter node of the current switch. The input voltages set the bias, since the common emitter node sits VBE,on below the higher input voltage. The voltage across the resistor varies somewhat during switching, both because the input voltage swings and because the base-emitter voltage changes with the changing current. The passive current source is also sensitive to variation in VEE in a fairly linear manner. Changing the power supply therefore requires adjusting each gate individually, whereas updating an active source needs no more than an adjustment to the reference circuit.
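The two biasing schemes can be compared numerically from the expressions above. This is a sketch only: the supply, resistor, and reference voltages below are round example numbers chosen for illustration, not values taken from the designs in this work.

# Active source: current set by the reference voltage VCS and emitter resistor RE.
def active_source_current(vcs, vbe_on, vee, re_ohm):
    return (vcs - vbe_on - vee) / re_ohm

# Passive source: current set by the input high level through a single resistor to VEE.
def passive_source_current(vin_high, vbe_on, vee, r_ohm):
    return (vin_high - vbe_on - vee) / r_ohm

VBE_ON = 0.9   # assumed turn-on base-emitter voltage (V)
VEE = -3.4     # assumed negative supply (V)

i_active = active_source_current(vcs=-2.0, vbe_on=VBE_ON, vee=VEE, re_ohm=500.0)
i_passive = passive_source_current(vin_high=0.0, vbe_on=VBE_ON, vee=VEE, r_ohm=2500.0)
print("active: %.2f mA, passive: %.2f mA" % (1e3 * i_active, 1e3 * i_passive))

# Mirror scaling: a source current is the reference current times RE_ref / RE_source,
# so halving RE doubles the supplied current.
i_scaled = i_active * (500.0 / 250.0)
print("halved RE: %.2f mA" % (1e3 * i_scaled))

# Passive sensitivity to the supply: dI/dVEE = -1/R, so a 100 mV droop on VEE shifts
# this example's tail current by 40 uA (about 4%), while an active source tracks its
# reference instead.
print("passive shift for 100 mV on VEE: %.0f uA" % (1e6 * 0.1 / 2500.0))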

3.2.3                  An Issue of Nomenclature

There are two kinds of differential-output bipolar logic circuits used in the present work. The nomenclature for these circuits in the current literature is not clear. The commonly used terms are Emitter Coupled Logic (ECL) and Current Mode Logic (CML). As noted by the “authorities” in the field [TREA89], these categories amount to distinctions without a difference, in ways that prove more confusing than useful. Some use ECL to describe only single-ended inputs switching against a reference, or single-height current switches without series-gating, whereas CML is reserved for differential inputs in tall series-gated trees. Others use ECL to indicate the presence of emitter-follower outputs, while CML implies their absence. The term “differential current switch” (DCS) logic has also been used to describe current switches combined with dotted-emitter outputs. Evidently this system of nomenclature does not adequately describe the circuits in question. In the present work an effort has been made to describe the major characteristics explicitly, along with their relevance to the circuit designs of note. The primary distinctions that appear here are single-ended versus differential inputs, and the presence versus absence of emitter followers.

3.3      Bipolar Circuits and Designing Logic for Speed

3.3.1                  fT and fM OSC

Figure 3-3: Simulated base-emitter potential versus collector current for a 0.5 x 1.0 micron device, VCE = 0.3 V

The speed of a bipolar transistor is most importantly characterized by the transition, or toggle, frequency fT. This is the frequency at which the short-circuit current gain in a common-emitter configuration reaches unity. At low collector currents fT is controlled by the depletion capacitances and the collector current, while at high currents it is bounded by the base transit time. The peak of the fT versus IC curve is quoted as a figure of merit for the device and is mainly a function of the forward transit time.

The maximum oscillation frequency fM OSC is defined at the point of unity power gain as opposed to unity current gain. While this is a more complex arrangement, requiring a load matched to the output resistance at each IC of interest, it more closely matches the environment of a transistor embedded in a circuit driving loads and being driven itself. A value for fM OSC can be found computationally from fT:

f_{M\,OSC} = \sqrt{\dfrac{f_T}{8\pi\, r_{bb}\, C_{jc}}}

Both the collector capacitance [ARMS95] and the base resistance [BARN75] reduce this frequency. Note that while reducing the intrinsic base thickness reduces the forward transit time and so raises fT, it also increases rbb, creating a need for compromise in transistor design.
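As a numerical illustration of the relation, the fT, Cjc, and rbb values below are assumed round numbers, not extracted device parameters:

import math

def f_mosc(f_t_hz, c_jc_farad, r_bb_ohm):
    """Maximum oscillation frequency from fT, collector-base capacitance, and base resistance."""
    return math.sqrt(f_t_hz / (8 * math.pi * c_jc_farad * r_bb_ohm))

# e.g. fT = 50 GHz, Cjc = 5 fF, rbb = 100 ohm gives roughly 63 GHz
print("%.0f GHz" % (f_mosc(50e9, 5e-15, 100.0) / 1e9))

The formula also makes the compromise explicit: thinning the base to raise fT raises rbb under the same square root.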

3.3.2                  Latency versus Bandwidth

The factors of “latency” and “bandwidth” both fall within the bailiwick of the high-speed circuit designer. In a functional block comprising many gates, minimizing latency is not necessarily equivalent to maximizing bandwidth, and this dichotomy is reflected in the division between logic and communications circuits. Logic gates need only enough bandwidth to ensure that the outputs of each functional unit or block of logic have switched one block delay after the inputs switch. Since the cycle time of the input signals is tied to the block delay, which is many multiples of the gate delay, it is sufficient to have enough bandwidth that rise and fall times do not overwhelm the delay through the gates. Conversely, because fan-in and functional complexity increase the device parasitics loading each line, long chains of simple gates improve the bandwidth of the entire chain while greatly extending its latency: the latency of each individual gate is reduced, which improves the bandwidth of the whole.

Toggle frequencies by themselves are primarily measures of the bandwidth of the input signal that can be pushed through the circuit. However, the propagation delay, or latency, of the current switch, and hence the speed of the entire circuit, which is our primary interest, depends on more than the fT and fM OSC of the transistors used. Indeed, even in the limit as fT approaches infinity the propagation delay remains non-zero. The base resistance significantly affects the delay once it exceeds roughly half of the source resistance, and the impact of the collector-base junction capacitance increases as the emitter current drops [BARN81].

An examination of the sensitivity of propagation delay in ECL ring oscillators [CHOR88] demonstrates the effects of these parameters on circuit delay. The largest sensitivities appear for the terms involving base resistance and collector capacitance. It should be noted that the projections for a delay-optimized 0.5 micron device in that work show that the two largest contributors to the total delay are the transit time and the RC load time constant. However, the sensitivities to base resistance and collector capacitance still stand, so deviations could still impact performance noticeably. In addition, the third and fourth largest contributors to delay involve base resistance and collector capacitance, respectively. These sensitivities are for gates with a fan-in of only one; with more complex gates, collector capacitance and similar parameters become even more important relative to fT [JOUP94].

Circuit design needs for logic gates diverge from the needs of communications circuits in several areas:

·        Fan-in: Logic circuits should be designed with high fan-in to reduce the total gate depth of a functional block. Communications circuits should be built from low fan-in gates arranged in trees or chains to increase the bandwidth of each individual gate. The break-even point between a single gate with a large number of inputs and a depth-2 tree of gates with fewer inputs each is much higher with respect to latency than it is with respect to bandwidth.

·        Output stages: Communications circuits can be improved with a common-base stage between the current-switching tree proper and the output nodes and pull-ups. This eliminates the Miller multiplication of the collector-base junction capacitance between the switching inputs (bases) and the output nodes (collectors), which move in opposite directions at high gain. Logic gates are hampered by the additional delay, since every input-to-output path now passes through an additional device. With series-gating, the delay increases further for each level lower in the tree.

3.4      Noise margin and voltage swing

The voltage swing necessary on the input of a current switch is chosen to provide a large noise margin for the gate [TREA89]. The maximum noise margin occurs at the input voltage at which the gate has unity gain:

v_n = \dfrac{V_s}{2} + \dfrac{V_s v_t}{V_s - v_t} - v_t \ln\!\left(\dfrac{V_s}{v_t} - 2\right)

Simultaneously, the base-emitter voltage must be sufficient for the transistor to be in the forward active region while the collector-emitter voltage must be sufficient to keep the transistor out of saturation:

V_{BE,on} = V_{CE} + V_s + v_n

Figure 3-4: Graphical solution for voltage swing/noise margin relations


Solving these two equations simultaneously gives the output voltage swing that maximizes the noise margin. Figure 3-4 shows a graphical solution of the relations between noise margin and voltage swing; the point where the two curves cross identifies both the proper voltage swing and the noise margin that results. Here Vs is the separation between high and low levels for a single-ended input switching against a constant reference voltage. Because the reference sits nominally halfway between the high and low levels, the maximum voltage differential across the inputs of the current switch is Vs/2. A differential pair therefore needs a swing of only Vs/2, since its opposing inputs always sit at opposite extremes of the swing and no static reference voltage is involved.
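A numeric version of the graphical solution is straightforward. In the sketch below, the VBE,on and VCE values are assumed example numbers, not the process values used in this work; only the two relations themselves are taken from the text above.

import math

VT = 0.02585     # thermal voltage at room temperature (V)
VBE_ON = 0.90    # assumed turn-on base-emitter voltage (V) -- example value
VCE = 0.30       # assumed collector-emitter voltage target (V) -- example value

def nm_from_gain(vs):
    """Noise margin at the unity-gain input point for swing Vs (first relation)."""
    return vs / 2 + vs * VT / (vs - VT) - VT * math.log(vs / VT - 2)

def nm_from_bias(vs):
    """Noise margin allowed by the bias constraint VBE,on = VCE + Vs + vn (second relation)."""
    return VBE_ON - VCE - vs

# Bisection on the difference of the two curves; their crossing is the swing that
# maximizes the noise margin, analogous to the graphical solution of Figure 3-4.
lo, hi = 3 * VT, VBE_ON - VCE
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if nm_from_gain(mid) < nm_from_bias(mid):
        lo = mid
    else:
        hi = mid
vs = 0.5 * (lo + hi)
print("swing Vs ~ %.0f mV, noise margin ~ %.0f mV" % (1e3 * vs, 1e3 * nm_from_gain(vs)))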

3.5      Device Sizing in Loaded Buffers

While derivation of figures of merit may provide a starting point for design, eventually the transistors must be placed into a circuit. The loading that a current switch sees from pull-ups, emitter followers, subsequent gates, and interconnect parasitics can shift the bias point that minimizes delay.

A simple test case for examining sizing under load should refrain from applying ideal inputs directly to the gate under test. A chain of three buffers with emitter followers, driven by the ideal source, produces a signal comparable to what the circuit would see in place. Similarly, three buffers following the gate under test present a realistic load on the output signal and make it possible to compare the delays of several gates to check for variations. Since the current-switching gates under examination all use the current switch/pull-up/emitter-follower arrangement, using buffers gives a good representative of the set of gates as a whole. While the delay of interest is on a per-gate basis, it is important to replicate the loading and driving of the gate under test in a fair manner. Since a buffer is considered representative of any gate, and the driver and load would simply be other gates, it is reasonable to represent them with buffers as well. This allows inputs to be generated in simulation with simple ideal sources, with the first buffers shaping them into realistic waveforms, and it automatically includes the correct driver and load impedances instead of requiring separate calculations for every sizing variation examined.

For the current investigation it will be assumed that every buffer instance and every emitter-follower instance is biased similarly. Differently sized gates are useful when a line is loaded by unusual fan-out or wiring length, or when the path under consideration runs directly and solely to off-chip connections. The first case will be reserved for specific occurrences; the second does not apply to the type of function that is the focus of this research, although it does arise in the bandwidth-oriented design of communications circuits, especially when multiplexing two signals of a given frequency into one signal at twice that frequency.

3.5.1                  Tail Current

The parameters that are free to vary are primarily the transistor sizes in the buffer and emitter follower, and secondarily their tail currents. To streamline the design process, the tail currents relative to transistor size are fixed early on. Any gain from tuning tail current to suit specific loading is likely too small to be worthwhile, and the required detail of the specific circuit conditions cannot in any case be supplied until a preliminary design has been completed.

Since the delay of the circuit is the object of optimization, the major figures of interest regarding the tail current are:

·        the peak fT current of a transistor of a given emitter size,

·        the rated current of a transistor of a given emitter size.

While the initial bias point is the peak-fT current, loading in a delay-oriented design means that the delay continues to improve as the current is increased even beyond that point. However, this current cannot be increased without bound. The transistor is rated for a maximum current density beyond which the device will fail. (Successive design kits have progressively reduced this rating: the 9805 kit allowed 2 mA per square micron, while by the 1999B kit the limit had been reduced to 1.4 mA per square micron.) Furthermore, the transistor models have known inaccuracies past the peak-fT current, and the device-modeling engineers have not focused on improving the models in this regime.
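For example, for a 0.5 x 1.0 micron emitter like the device of Figure 3-3 (emitter area 0.5 square microns), the quoted density limits translate into the following tail-current ceilings; this is a back-of-the-envelope check, nothing more:

# Maximum tail current implied by the quoted current-density ratings.
EMITTER_AREA_UM2 = 0.5 * 1.0   # 0.5 x 1.0 micron emitter
for kit, limit_ma_per_um2 in (("9805", 2.0), ("1999B", 1.4)):
    print("%s kit: %.1f mA maximum" % (kit, limit_ma_per_um2 * EMITTER_AREA_UM2))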

In the typical usage of these devices the concern is usually either bandwidth-oriented design, where biasing right at the peak-fT current is often wanted, or low-noise design, where the bias is well below the peak-fT current. These areas, being the focus of commercial development, receive the most modeling effort. Logic circuitry has thus far been considered mainly in support of bandwidth-oriented communications circuits; designing for latency minimization is somewhat beyond the pale.

On the other hand, it can be shown that at the highest current where the models are qualified, the peak-fT current, the delay is still improving. This would seem to justify exceeding that point by some slight amount, but the informal opinion of designers in the area is that it is unwise to exceed it by much. Devoting fabrication funds to characterizing the current-versus-delay relation would take away from functional “payload” circuits, and the task itself would be daunting, given that the device engineers have not even seen fit to address it. In that light, the peak-fT current must be set as an upper bound on the tail current.

In the meantime, the rated current has decreased to the point where it is the major limit on the “excess” current anyway. Biasing at less than the current for minimum delay can save power on gates off the critical path; however, power consumption has not been one of the areas of concern for the current work. In addition, the design is such that a large percentage of gates lie on the critical path, limiting the amount of power that could be traded for delay.

3.5.2                  Current-Switch Transistor Size

Figure 3-5: Buffer delay, for buffer transistor size equal to emitter-follower size


Once given a delay-minimizing current density, it is tempting to then use the smallest transistors possible to reduce power consumption and circuit area. However, it was noted above that the transistor speed is very sensitive to variation in base resistance. Small devices have high base resistances. While increasing the device size also increases the collector capacitance, there is a point where the combined effects are at a minimum and this point is above the minimum device size. Comparing the results for different sized current switch devices shows a larger jump between 1-micron and 2-micron emitter length devices than between other intervals. This indicates that it is worth the power to increase the fundamental emitter length to 2 microns.
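One way to see why the optimum sits above the minimum size is a first-order model, offered here purely as an illustration (the coefficients a and b are not fitted to the design kit): the base-resistance contribution to delay falls roughly as the inverse of the emitter length L_E, while the collector-capacitance contribution grows with it,

t(L_E) \approx \dfrac{a}{L_E} + b\,L_E, \qquad \dfrac{dt}{dL_E} = 0 \;\Rightarrow\; L_E^{*} = \sqrt{a/b}

so whenever \sqrt{a/b} exceeds the minimum drawable emitter length, stepping up in size reduces delay, consistent with the jump observed between the 1-micron and 2-micron devices.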

3.5.3                  Emitter-Follower Size

1 load

                      EF Size (um)
Buffer Size (um)      1      2      3      4      6      8      10     12     14     16     18     20
1                     16.1   14.5   14.5   14.7   15.3   16     16.7