An Optimization Of Critical ALU Paths And Implementation Considerations


Matthew W. Ernest

A Thesis Submitted to the Graduate

Faculty of Rensselaer Polytechnic Institute

in Partial Fulfillment of the Requirements of the Degree of



John F. McDonald

Thesis Advisor

Rensselaer Polytechnic Institute

Troy, New York

December 1996

  1. Table of Contents

Table of Contents I

Table of Tables iv

Table of Figures v

Abstract vi

Chapter 1 : Introduction and Background 1

On adders and critical paths 1

On yield limited technologies 2

Chapter 2 : Origin and Theory Of Carry Select Addition 3

Chapter 3 : Optimization Of Carry Select Stage Sizes 6

Chapter 4 : Optimal 32-bit ALU with Carry Select Addition 12

Logic Design 12

Circuit Design 14

Layout Of A Monolithic ALU 19

Considerations Affecting the Layout 24

Simulation of the design 26

Chapter 5 : F-RISC "Byte-Slice" Carry Select Implementation 29

A Multi-Chip Processor and Another Look at Yield Limitation 29

Comparison To Optimized Adder, And Other Possibilities 31

Chapter 6 : Discussion And Conclusions 35

Appendix A: Referenced Literature 37

  1. Table of Tables

Table 1: Estimation of delay versus number of stages: s=1, B=32 10

Table 2: Estimation of delay versus number of stages: s=2, B=32 11

Table 3: Representative Timings From SPICE SImulation 27

  1. Table of Figures

Figure 1: Representation Of A Carry Select Adder 3

Figure 2: Logical Organization Of The Optimized ALU 13

Figure 3: Carry Generation Circuit 15

Figure 4: Sum Generation Circuit 16

Figure 5: ALU Function Generator 18

Figure 6: Carry Selection Circuit 19

Figure 7: HEAD cell Layout 20

Figure 8: MID cell Layout 21

Figure 9: CMUX Cell Layout 23

Figure 10:Five bit carry select stage from optimized ALU layout 24

Figure 11: Testing The ALU 26

Figure 12: 32-SPICE Simulation of 32-bit Carry Select Adder 28

Figure 13: Comparison Of Adder And Register File Areas 34

  1. Abstract

An optimization of critical paths in high-speed addition circuit is described. Specifically, a 32-bit adder is optimized on a specific AlGaAs/GaAs technology, reducing the critical path delay to an O() function, from O(n) in naïve designs. This optimization is contrasted with the actual adder design of the F-RISC, a multi-chip microprocessor designed for this technology, the design of which was affected by yield considerations. Different measures of "optimal" are considered in reference to the manner in which changes in constraints of different technologies affect design considerations and how to adapt to them.

  1. Introduction and Background

  1. On adders and critical paths

The carry circuitry of any adder is typically a critical path. The carry out of an adder possibly depends on every input, thus requiring a carefully optimized circuit. Often the optimization is carried out in terms of the mythical "gate delay". Of course, once the design gets down to the circuit designer, actual per circuit timings are used and efforts are made to minimize them. However, if circuit behavior is accounted for at a higher level, this can lead to a much different optimization with improved performance under these degrading, i.e. real, circuit effects.

Specifically, a carry-select adder is examined in this work, as two instances are available for comparison. One, an application of the aforementioned higher level recognition of low level behavior, is a carry select optimization carried out in isolation from other circuitry. The goal was simply to make a fast adder, without regard to some larger circuitry to embed it in: a greedy optimization. The second example comes from the F-RISC processor. In this case, even ordinary optimizations were left out in light of other factors that affected overall speed of the processor and also the ability to manufacture such a design.

  1. On yield limited technologies

In the production silicon process used in commercial electronics, the circuit designer is typical free to apply whatever amount of resources to meet the design goals. Be it circuit speed, functional complexity or other criteria, any appropriate number of transistors may be used.

However, in researching the extremes of microelectronics performance, yield-limited technologies come into play. These high performance semiconductor technologies suffer from comparatively high failure rates during fabrication, such that only a severely limited number of transistors may be used per die to have any practical chance of fully working circuits.

Figure 1: Representation Of A Carry Select Adder

  1. Origin and Theory Of Carry Select Addition

In the design of adders, the critical path is typically in the carry chain. The carry into any bit of an adder depends on all previous inputs. With a ripple-carry adder, the simplest type of adder, the carry out from one bit is found directly from the carry from the previous bit, which is computed from the carry of the bit before that, and so forth. The carry chain thus extends in one path through each and every bit in turn, and the longest path length through the adder is directly proportional to the number of bits in the adder.

The carry select adder attempts to shorten this path through parallelization. The carry into any bit is solely a function of the less significant operand bits. Every carry bit is independent; adders have been designed which generated all of these carries simultaneously. However, generating all of the independent carries in parallel requires an investment in hardware that may be considered excessive for the gains it yields. The carry select adder was developed by Bedrij [BEDR62] as compromise that does some of the carry generation in parallel and hides serial delays of some sections behind the delays of others.

The adder is divided into several stages. With in a section, some other carry scheme is used, such as ripple carry. However, to avoid the delay waiting for the carry in to the first bit of the section, separate sets of logic are used to find the carry and sums for both possible carry in values (1 and 0). By the time the carry in is available, all possible sums and carry outs have been computed. The carry in can then be used to select the proper output via a multiplexor at each output. The longest path in this case consists of the carry through the first stage plus one multiplexor for each stage where the carry in selects the proper carry output.

Pedagogical examples of the carry select adder often consider only equal size stages; e.g., a 32-bit adder is presented with four stages of 8 bits each. The time to cross the longest path would be 8-bits worth of the lower level carry scheme plus four multiplexor delays for each select stage. A common textbook by Weste and Eshraghian [WEST93] does vary the stage size, but uses a naive increment of one bit per stage and only mentions the selection delay in passing.

  1. Optimization Of Carry Select Stage Sizes

Varying the size of each stage of a carry select adder, as opposed to the simple method of equal size stages, can reduce the critical path delay. Optimally, the carry in should arrive just as the carry out possibilities have been generated, and these signals should enter the carry out multiplexor at the same time. Thus each stage should be slightly shorter than its successor. How much shorter depends upon the relative delay through the multiplexor as compared to that of the "gate" generating the lower level carry. These are typically considered the same "gate delay", but depending on the technology can not only be very different but can also vary significantly due to circuit context.

If cn is the number of bits in stage n, tg is the delay time across the "carry" gate, tm is the delay time across a multiplexor, and N is the number of stages, then the delay of the carry chain is then


is true.

If a constant step in size is used between each stage, then it follows that the step size should be


If B is the total number of bits in the adder, then (considering that the final stage may not need to be cN to meet size requirements)

Solving the latter expression for c1 yields

Substituting that into the following expression for total delay

results in this expression for td in terns of N, B and s:

N is required to be an integer, and td/tg is typically considered as an integer as well, so if we consider N and td/tg as the ceiling of approximate variables

this equality can be stated:


Since B and s are given, the minimization of delay is solely a function of N. At this minimum, we know that

so it can then be found that


it must be that

Since the N2 term is positive, the inequality will be true for N between the roots of the equality. So finding the roots

Arbitrarily, the lowest value of N may be chosen, given td/tg from above, by:

Table 1: Estimation of delay versus number of stages: s=1, B=32

Stage widths
Carry Settling Time

[# Gate Delays]
No Possible Sequence1
1 Ripple carry only. No select needed for one stage

2 There is no possible sequence of length nine that also maintains the step size. Whereas N=9 and c1 >0, the delay for the nine stage adder would be at least 10 td, it is obviously not a consideration here.

Given these values for td/tg and N, c1 can then be found.

In a facile design of a carry select adder, s would be assumed to be 1. Given that, and B=32, a table of td/tg versus N could be calculated as shown in Table 1.

In the semiconductor technology and digital logic family being discussed considered, s can actually be up to 2.Now if we consider s=2, one can surmise by inspection that these slow multiplexors will make additional stages "costly". This changes our stage arrangement as shown in Table 2.

Table 2: Estimation of delay versus number of stages: s=2, B=32

Stage widths
Carry Settling Time

[# Gate Delays]
No Possible Sequence1
1 Ripple carry only. No select needed for one stage

2 There is no possible sequence of length seven that also maintains the step size. Whereas N=7 and c1 >0, the delay for the nine stage adder would be at least 15 td, it is obviously not a consideration here.

  1. Optimal 32-bit ALU with Carry Select Addition

  1. Logic Design

In general, ripple carry is used inside each stage. For each bit of the adder, two carries and two sums are generated:

For the first bit of each stage, the carry inputs are known explicitly to be 1 and 0, so the equations can be simplified:

The carry out from the previous stage is used to select the sum at each bit and the carry out from this stage.

In this particular design, however, a full ALU was built. Not only does it add, but bitwise logical function are computed as well. Carries are generated as described above for the general case. The sums on the other hand, are replaced by several generating relations. The ALU "function" is generated by

Figure 2: Logical Organization Of The Optimized ALU

AND and XOR are control signals that indicate what function to generate. For an OR function, both AND and XOR should be asserted. When performing addition, XOR is the desired function.

Instead of generating two sums and selecting between them, in this case the carry in signals are selected and used to generate one sum. It allows the carry path to be "shut down" for logical functions where there is no carry. The internal carry is generated by


The SEL signal is the select generated by the carry out of the previous stage. The CLR signal clears the carry path, for generating logic function with no carry.

The final sum is generated by


Thus it can be seen that when the CLR line is asserted to clear the carries, the generated function is available directly on the "sum" outputs of the ALU.

As in the general case, the first bit of each stage can be simplified due to the implicit incoming carry values, thus

The rest of logic remains the same. Compare Figure 1 and Figure 2 to see the modifications in the ALU.

  1. Circuit Design

The gates for this design were designed employing GaAs HBT CML[NAH91]. Current switches steer a fixed gate current to pull down one line of a differential pair. Using these particular transistors, this logic family allows current switches up to three deep, so any function of up to three variables can be realized as one gate, and some "selection" operations with more variables.

Figure 3: Carry Generation Circuit

For each bit, the following signals are needed: two carry outs, a carry select, an ALU function and a sum. Thus, five current trees are needed per bit. Carry is a majority function, shown in Figure 3. The sum in Figure 4 is generated from a simple XOR gate. The carry select in Figure 6 multiplexes the two incoming carries, with a clear to shut off the carry path. The ALU function shown in Figure 5 generates both the AND and the XOR of the operands which are selected by control signals.

For the first bit of each stage, where the static carry inputs lead to simplification as described above for the Logic Design, the carry generation and selection can be simplified. A statically driven current switch can be replaced with one short and one open connection oriented according to the static signal desired. The corresponding current switch can be eliminated, reducing circuit size. The two carry generators become just an AND and an OR. The carry select loses the full multiplexor, and is a function of just the select and the clear.

Figure 4: Sum Generation Circuit

Due to the construction of CML gates, each signal has a "level" that indicates the high voltage of the differential pair in reference to ground. Level 1 is at ground, Level 2 is 1 Vbe drop below ground, and Level 2 is 2 Vbe drops below ground. Level 2 signals in general offer the best performance for a long, loaded line (the selected carry signal drives a multiplexor in every bit of the succeeding stage), due to the emitter followers on the outputs. However, the delay of the emitter follower itself as well as the space used by the additional transistors imposes a penalty for short unloaded signals. Thus, the carry our signals that only drive the next bit are driven at Level 1, whereas the select lines are at Level 2 since they drive every bit of the following cell. The A and B operands of the ALU are at Levels 2 and 3 respectively, a requirement imposed by other parts of the processor. The control signals are eventually buffered at every stage in the layout, and have no critical timing needs, so they can be placed at any convenient level.

Figure 5: ALU Function Generator

Figure 6: Carry Selection Circuit

  1. Layout Of A Monolithic ALU

The layout of a monolithic ALU as a single macro cell was achieved with three smaller macro cells. Two cells contain the complete functionality for one bit, akin to a bit-slice design: the "MID" cell with the full circuitry for use in the middle of a carry select stage, and the "HEAD" cell with reduced carry generation for use at the beginning of a select stage. Each of these cells contains two carry generators, a carry selector, an ALU function generator and a sum generator. The data flow was generally considered to be

Figure 7: HEAD cell Layout

The carry generators are placed near the "top" to reduce any interconnect delay on the operands. In both cells, the carry generators are side by side and define the cell width, reducing interconnect delay on the carries as well. The carry selector is placed below that on the right, again reducing interconnect distance on the carries.

Figure 8: MID cell Layout

The ALU function generator is placed to the left of that. Since the sum generator depends on both the carry selector and the ALU function generator, it is located below those two units, with the sum exiting out the bottom of the cell. In addition, the MID cell contains a "sneak path" for the selected carry for use in computing ALU overflow. Since overflow depends on the last carry and the next-to-the-last carry, the last MID cell in each stage has an external connection to this carry, and the last stage can supply the carry for the overflow computation. The MID cell can be seen in Figure 8, and the HEAD cell can be seen in Figure 7.

The CMUX cell contains the carry multiplexor at the end of every stage. The multiplexor is located at the top of cell, where the carries come in on the right and the selected carry goes out the left. The CMUX cell also contains buffers for the control signals for the next stage. In addition, a superbuffer drives a copy of the selected carry signal that drives the carry selectors in the HEAD and MID cells, separate from the carry that drives the next carry multiplexor. These buffers are used to reduce the loading due to fan out. The control signals drive all 32 bits, which is quite a large fan out and needs to be split up. The superbuffer for the copy of the selected carry also has a lower sensitivity to capacitance than a regular buffer, reducing the performance hit cause by the increasing number of loads per stage. This has the added benefit of loading each carry multiplexor similarly with a superbuffer and another carry multiplexor, leading to similar delays for each.

There is an additional "START" cell that drives the control signals and carry in to the first stage. Conceivably, the control signals could be driven from the left, by buffers in the CMUX for that stage as opposed to buffers in some cell previous to that stage. However, there would still be the need for the superbuffer for the copy of the carry in to the first stage.

Figure 9: CMUX Cell Layout

Figure 10 illustrates the combination of these cells into a carry select stage. Shown is the low end of the complete ALU, focusing on the five bit wide first stage.

Figure 10:Five bit carry select stage from optimized ALU layout

  1. Considerations Affecting the Layout

It has been demonstrated time and again that interconnect delay plays a tremendous, almost overwhelming, role in this HBT technology used. The resistance as well as the capacitance contributes significantly, so delay on a wire increases quadratically with distance. Therefore, reducing interconnect distance was of vital importance.

A layout whose width per bit approaches the bitline pitch of the F-RISC register file was desired, to eliminate horizontal interconnect used to line up the bits of the register file and the ALU. Much effort was put into making the layout as narrow as possible, which also has a direct effect on the carry delay. Figure 13 shows a comparison of the sizes of the carry select adder described here and the current F-RISC register file. Only about half of the width of a register file block is due to bit spacing; address decoding accounts for the other half, increasing the average width per bit. Each register file block is eight bits wide, necessitating that four be used with this adder. Whereas the adder is approximately 3.69 mm wide, each register file block--with address decoders--is 0.97 mm wide.

Increases in horizontal density can be achieved by extending the cells in the vertical dimension. This becomes a trade off between overall carry delay, and delay in generating the sum for the high order bit. Although the main concern is for the carry chain, it is possible to slow down the sum enough so that the critical path changes

Figure 11: Testing The ALU

  1. Simulation of the design

A SPICE extraction of the layout, with interconnect capacitance, was used to simulate the design to determine delays through the circuit for addition. Table 3 shows the testing setup. CLR is held low to activate the carry path, and AND and XOR are low and high respectively to produce an addition. With the A operand bits low, and high order bit of the B operand high, toggling either the least significant bit of B or Cin will produce a change on Cout

Some representative timings taken from the SPICE output trace are shown in Table 3 , while a complete output trace is show in Figure 12. The longest path through the carry chain is from the b0 input to the cout output; thus, the delay of the adder could be characterized as approximately 283 ps.

Table 3: Representative Timings From SPICE SImulation

cin falling to s31 rising
235.6 ps
cin falling to cout falling
172.5 ps
b0 rising to s31 falling
347.0 ps
b0 rising to cout falling
282.7 ps

Figure 12: 32-SPICE Simulation of 32-bit Carry Select Adder

  1. F-RISC "Byte-Slice" Carry Select Implementation

The optimizations of the carry-select adder described in chapter 4 were undertaken without consideration of context. However, an adder is no island, cut off from other circuits. When one considers the design of an entire processor, additional constraints affect critical path lengths. The F-RISC/G [PHIL93] is an experimental processor actually designed in the process mentioned previously. Its design offers an example of how other factors may force a design that may be considered sub-optimal using only longest path delay as a metric, but which in light of yield limitations may be desirable or even necessary.

  1. A Multi-Chip Processor and Another Look at Yield Limitation

Yield limitation implies a fabrication error probability that is relatively high in comparison to that of other, more mature technologies. While this probability may still be quite small and any one device may still be likely to work, this chance of failure per device leads to complex circuits suffering from significant failure rates.

As shown in Philhower's thesis [PHIL93], the F-RISC/G implementation uses a bit sliced "datapath" built from 4 chips that each carry 8-bits: a "byte-slice" if you will. The datapath must contain an integer ALU, a register file, feed-forward logic due to pipelining, and other logic such as condition codes. This wide array of circuitry would require a large transistor budget and would have to be split across several chips to obtain any reasonable yield. To understand this, consider some "device" that has a 50% yield, i.e. every one fabricated has an even chance of working. Let's say we need 4 of these devices. If all four are fabricated on the same die, the net yield for the die will be 6.25%, or 1 out of 16. In other words, only one out of sixteen chips could be expected to work. On the other hand, if one device was fabricated per die, we could expect to fabricate eight chips to produce a working set of four. The "byte-slice" design also offered the opportunity to use identical chips for each slice, simplifying design of the datapath, and also the cache, due to regularity.

In the process utilized for the F-RISC/G, yield considerations place a practical limit of 8-10K transistors per chip to obtain any reasonable yield. The eight bit register file itself uses over 2K transistors; 32 bits of register file alone would "fill up" a chip. The F-RISC datapath was therefore divided into 4 identical datapath chips. Each chip holds an eight bit slice of the datapath. On this chip are an eight bit register file and eight bits of the ALU, as well as other logic. The carry chain of the ALU, having been identified as a critical path from the start, is a single macro cell in the layout; the rest of the ALU is built from standard cells. The carry chain macro is set to one side of the chip, carry multiplexor near the I/O pad ring, to minimize on-chip signal travel.

The carry chain macro holds an eight bit ripple carry and a carry multiplexor at the end. For each bit, the carry chain produces the to carries to the next bit, as well as selecting a carry in for that bit to be used in the rest of the ALU logic. The carry in for each bit may be disabled to used the ALU for logical operations, as opposed to arithmetic operations.

  1. Comparison To Optimized Adder, And Other Possibilities

The performance of a full-custom, single-macro varied stage carry select adder was compared with the even-length stage carry select adder spread across four dies currently used in the F-RISC Project microprocessor. For future stages of the project, it has been proposed that a factor of two can be gained in adder speed through architectural adjustments and migrating to a higher yield technology where the datapath could fit on one die, before device speed gains are even considered.

The original goal for the adder design was a factor of two improvement over the present F-RISC adder delay. As the F-RISC adder delay is slightly under 1 ns, the estimated 283 ps for the carry and 347 for the sum, this design meets that mark, but there are still improvements that could be made for a possible factor of four improvement. One possible way to improve this mark is to reexamine the determination that the ratio of tg to tm is about 1. As the load on each multiplexor is directly proportional to the size of the subsequent stage, tm should be larger when diving a larger stage. Perhaps the size step between stages needs to be larger, or should vary with stage size. Another alternative is to use something other than ripple carry with each stage. A faster intrastage carry could result in a larger first stage for the same delay and reduce the number of stages needed. A circuit level solution would be to more carefully analyze the carry select delay as a function of stage length. In Chapter 3, the analysis assumed a fixed ratio for the carry multiplexor delay. However, since the multiplexor drives an RC line proportional in length to the number of bits in the next stage, the delay of the carry select should increase with stage length. This would cause long stages to make a greater contribution to overall delay.

Another area of interest was the adder width as compared to the register file. The bitline pitch for the register file is approximately 60 mm, while the pitch for adjacent MID cells is 102.5 mm. However, the area for address decoding in the register file has not been accounted for. If width of total block is considered, the adder is 3.69 mm wide while four register files side by side is 3.88 mm wide without even considering wiring channels for the address lines.

It is important to keep in mind that a datapath employing this adder is not physically realizable at present in the technology considered due to yield limitations. The adder plus four register files already exceeds the device count available on one chip. This could possibly be handled by slicing horizontally, creating a register file chip and an ALU chip. The extra interchip crossing from register file to ALU is offset by three eliminated crossings in the carry chain, so there still should be a net gain in speed.

Figure 13: Comparison Of Adder And Register File Areas

  1. Discussion And Conclusions

The carry select optimization developed here offers a carry delay which is O(). The simple method usually used with equal sized stages has a delay that is O(n), since the size of each stage is directly proportional to the total size. Although this does not match the performance of completely parallel carry generation, it gives performance gains for no more area and that the simple carry select. For example, carry lookahead generation is O(log n) for delay but O(n log n) in area [HWAN79], while the optimized carry select is O() for delay but O(n) in area: a carry select will use up less that twice the area of plain ripple carry addition. Furthermore, the layout utilizes only four different cells, allowing adders of arbitrary size to be quickly assembled.

When compared to the F-RISC/G datapath which originally prompted the adder analysis, a direct implementation is not forthcoming due to design considerations make in light of yield limitation of the process used. This underscores how "optimality" can be a shifting quality. A bit-slice datapath design necessitates either equal stage sizes or an unreasonable number of different chips. Using optimized stage sizes would likely mean building the ALU as a monolithic circuit, which implies separating the ALU and the register file. Thus the information needed for a useful "optimization" expands as the extremes of performance are reached and models of operation become more and more specific.

  1. Referenced Literature

[BEDR62] O.J. Bedrij, "Carry-Select Adder", IRE Trans. On Electronic Computers, vol. EC-11 No. 3, pp. 340-346;June 1962.

[HWAN79] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley and Sons, New York, 1979.

[NAH91] K. Nah, R. Philhower, J.S. Van Etten, V. Tsinker, J. Loy, H. Greub, J.F. McDonald, "F-RISC/G: AlGaAs/GaAs HBT Standard Cell Library", Proc. IEEE ICCD, Oct 1991.

[PHIL93] R. Philhower, "Spartan RISC Architecture For Yield-Limited technologies", Ph.D. Thesis, Rensselaer Polytechnic Institute, 1993

[WEST93] N.H.E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd Edition, Addison-Wesley, Reading, Mass.,1993.