Full text of "DTIC ADA388019: Design of a Synchronous Pipelined Multiplier and Analysis of Clock Skew in High-Speed Digital Systems"


NAVAL POSTGRADUATE SCHOOL 
Monterey, California 



THESIS 


DESIGN OF A SYNCHRONOUS PIPELINED MULTIPLIER 
AND ANALYSIS OF CLOCK SKEW IN HIGH-SPEED 
DIGITAL SYSTEMS 

by 

John R. Calvert, Jr. 


December 2000 

Thesis Advisor: Douglas J. Fouts 

Thesis Co-Advisor: Herschel H. Loomis, Jr. 


Approved for public release; distribution is unlimited. 





REPORT DOCUMENTATION PAGE

Form Approved OMB No. 0704-0188

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, 
searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send 
comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to 
Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 
22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington DC 20503. 

1. AGENCY USE ONLY (Leave blank)

2. REPORT DATE: December 2000

3. REPORT TYPE AND DATES COVERED: Master’s Thesis

4. TITLE AND SUBTITLE: Design of a Synchronous Pipelined Multiplier and Analysis of Clock Skew in High-Speed Digital Systems

5. FUNDING NUMBERS

6. AUTHOR(S): Calvert, John R. Jr.

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Naval Postgraduate School, Monterey, CA 93943-5000

8. PERFORMING ORGANIZATION REPORT NUMBER

9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES)

10. SPONSORING / MONITORING AGENCY REPORT NUMBER
11. SUPPLEMENTARY NOTES 

The views expressed in this thesis are those of the author and do not reflect the official policy or position of the 
Department of Defense or the U.S. Government. 

12a. DISTRIBUTION/AVAILABILITY STATEMENT 

Approved for public release; distribution is unlimited. 

12b. DISTRIBUTION CODE 

13. ABSTRACT (maximum 200 words)

Digital systems implemented with high-speed transistor technologies face a variety of design challenges in an 
effort to keep pace with the accelerating demand for performance. As device switching frequencies climb comfortably 
into the gigahertz range, clock skew in digital systems threatens to limit the advantages of synchronous pipelined 
designs. This research investigates the limitations of clock skew on high-speed digital systems by designing and 
simulating an 8x8 bit synchronous, pipelined multiplier using indium phosphide (InP) heterojunction bipolar transistor 
(HBT) technology. Fundamentals of circuit analysis and the principles of junction transistor behavior are 
applied to design an optimal family of logic devices using current-mode logic. All testing and simulation data is based 
upon results obtained from Tanner SPICE design tools. Using the building blocks of this logic family, an array 
multiplier is constructed and further configured into five distinct pipeline implementations. By employing a different 
number of pipeline stages in each implementation, the trade-offs of pipelining are illustrated and clock skew is 
analyzed at a variety of throughput rates. Finally, the impact of clock skew on throughput performance is quantified 
and summarized as a reference point for further research into asynchronous control techniques. 

14. SUBJECT TERMS: Clock Skew, Pipelined Logic, Current-Mode Logic, Indium-phosphide Heterojunction Bipolar Transistors, High-Speed Logic

15. NUMBER OF PAGES: 152

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT: Unclassified

18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified

19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified

20. LIMITATION OF ABSTRACT: UL

NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. 239-18





Approved for public release; distribution is unlimited. 

DESIGN OF A SYNCHRONOUS PIPELINED MULTIPLIER AND ANALYSIS 
OF CLOCK SKEW IN HIGH-SPEED DIGITAL SYSTEMS 

John R. Calvert, Jr. 

Major, United States Marine Corps 
B.S., United States Naval Academy, 1990 

Submitted in partial fulfillment of the 
requirements for the degree of 

MASTER OF SCIENCE IN ELECTRICAL ENGINEERING 

from the 

NAVAL POSTGRADUATE SCHOOL 
December 2000 

Author: 

Approved by: 

Department of Electrical and Computer Engineering 


ABSTRACT 


Digital systems implemented with high-speed transistor 
technologies face a variety of design challenges in an 
effort to keep pace with the accelerating demand for 
performance. As device switching frequencies climb 
comfortably into the gigahertz range, clock skew in digital 
systems threatens to limit the advantages of synchronous 
pipelined designs. This research investigates the 
limitations of clock skew on high-speed digital systems by 
designing and simulating an 8x8 bit synchronous, pipelined 
multiplier using indium phosphide (InP) heterojunction 
bipolar transistor (HBT) technology. Fundamentals 
of circuit analysis and the principles of junction 
transistor behavior are applied to design an optimal family 
of logic devices using current-mode logic. All testing and 
simulation data is based upon results obtained from Tanner 
SPICE design tools. Using the building blocks of this logic 
family, an array multiplier is constructed and further 
configured into five distinct pipeline implementations. By 
employing a different number of pipeline stages in each 
implementation, the trade-offs of pipelining are illustrated 
and clock skew is analyzed at a variety of throughput rates. 
Finally, the impact of clock skew on throughput performance 
is quantified and summarized as a reference point for 
further research into asynchronous control techniques. 





TABLE OF CONTENTS 


I. INTRODUCTION
    A. THE RELEVANCE OF HIGH-SPEED LOGIC
    B. THE PROBLEM OF CLOCK SKEW
    C. THE DESIGN OF A TEST CIRCUIT
    D. THESIS OUTLINE

II. BACKGROUND
    A. CLOCK SKEW
    B. PRINCIPLES OF PIPELINING
    C. LOGIC DESIGN OF A COMBINATIONAL MULTIPLIER
    D. BJT/HBT LOGIC
        1. BJT/HBT Principles and Characteristics
        2. BJT/HBT Logic Families

III. HBT CML LOGIC CIRCUIT DESIGN
    A. DESIGN OVERVIEW
    B. INVERTER DESIGN
        1. Circuit Topology
        2. Initial Conditions and Design Parameters
        3. DC Analysis
        4. AC/Transient Analysis
        5. Final Design Summary: Inverter
    C. LOGIC NOR GATE DESIGN
        1. Overview and Analysis
        2. Final Design Summary: OR/NOR
        3. Implementation of the AND Function
    D. ADDER DESIGN
        1. Implementation
        2. Performance Analysis
    E. PRACTICAL CURRENT SOURCE DESIGN
        1. Circuit Topologies
        2. Performance Analysis

IV. HBT CML LATCH AND REGISTER DESIGN
    A. LATCH DESIGN
        1. Circuit Topology
        2. Initial Conditions and Design Parameters
        3. DC Analysis
        4. AC/Transient Analysis
        5. Special Latch Implementations
        6. Final Design Summary: D-Latch
    B. FLIP-FLOP DESIGN (D-TYPE)
        1. Overview and Analysis
        2. Final Design Summary
    C. CLOCK DRIVER DESIGN
        1. Overview
        2. Analysis and Results
        3. Final Design Summary: Clock Driver

V. HBT CML PIPELINED MULTIPLIER DESIGN
    A. LOGIC STAGE DESIGN
        1. Overview
        2. Carry-Save Adders
        3. Carry-Completion Adders
    B. REGISTER STAGE DESIGN
    C. CLOCK DISTRIBUTION
    D. MULTIPLIER IMPLEMENTATIONS
    E. PERFORMANCE EVALUATION
        1. Evaluation Procedures
        2. Performance Results of Each Implementation
        3. Comparative Analysis

VI. ANALYSIS OF CLOCK SKEW
    A. QUANTIFYING CLOCK SKEW
    B. ANALYSIS PROCEDURES
    C. RESULTS

VII. CONCLUSIONS

LIST OF REFERENCES

INITIAL DISTRIBUTION LIST



EXECUTIVE SUMMARY 


The electronic subsystems of future overhead collection 
platforms will require extremely high performance digital 
logic for performing such tasks as data 
compression/decompression, data encryption, spread spectrum 
modulation, etc. To accomplish this, bit rates must reach 
into the gigabits per second range. Such speed obviously 
requires digital logic which will function correctly at 
clock rates of tens of gigahertz. The need for such high 
performance has led to the implementation of logic systems 
using indium phosphide (InP) heterojunction bipolar 
transistor (HBT) technology. However, clock frequency and 
pipeline throughput in digital systems implemented with InP 
HBT technology are significantly limited by clock, control 
signal, and data skew, which is a much larger percentage of 
the clock period than it is in lower-speed digital systems 
implemented with complementary metal oxide semiconductor 
(CMOS) technology. Therefore, the presence of clock skew in 
high-speed digital systems defines a limitation for the 
advantages of synchronous pipelined architectures. 

It is the purpose of this thesis to design a 
synchronous 8x8 bit pipelined multiplier as a high-speed 
digital test circuit using InP HBT technology and 
furthermore, to quantify the impact of clock skew on 
throughput. This work represents the initial phase of a 
larger research project to determine if asynchronous 
pipeline control will yield greater overall pipeline 
throughput in high-performance InP HBT digital integrated 
circuits and if the resulting elimination of the clock 
distribution tree will reduce power consumption, device 
count and layout area. All simulation data is based upon 
the results obtained from Tanner SPICE design tools. 



Having received InP HBT device specifications from 
Hughes Research Laboratories, this project commenced with 
the design of an HBT logic family utilizing current-mode 
logic. Each circuit was designed and optimized for a 
minimum power-delay product while driving a maximal fanout 
load of four logic gates. This design effort produced the 
four essential circuit functions necessary for the practical 
implementation of any synchronous logic circuit: an 
inverter/buffer gate, an OR/NOR gate, a D-type latch, and a 
practical current source. 

Using the building blocks of this logic family, an 
array multiplier was constructed and further configured into 
five distinct pipeline implementations: one-, two-, four-, 
six-, and ten-stage pipelines. 
A comparative analysis of their performance effectively 
illustrated the trade-offs of pipelining, i.e., the cost of 
the additional registers was shown to outpace the increase 
in throughput beyond a six-stage implementation. At a 
maximum throughput of 4.35 gigahertz, the six-stage 
pipelined multiplier was the most efficient design (in the 
absence of clock skew). The highest throughput achieved was 
5.56 gigahertz by the costly ten-stage implementation. 
Power consumption ranged from 4.4 to 14 watts. 

In the final analysis, clock skew was not simulated 
because SPICE simulations effectively eliminate skew from 
their calculations. Rather, the impact of clock skew was 
determined by applying numerical analysis to the no-skew 
simulation results. A range of possible skew values was 
considered in order to demonstrate a performance trend. The 
results confirmed that digital system throughput rates which 
are obtained as a function of higher clock rates will 
experience the most drastic performance reductions in the 
presence of clock skew. Also, it was shown for a typical 
value of skew in this circuit that the efficiency curve 
shifts to indicate that the four-stage pipeline is the most 
efficient implementation, vice the six-stage pipeline. 
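
The numerical treatment described above can be illustrated with a minimal sketch. It assumes, in line with Equation (2-1) of Chapter II, that skew simply adds to the no-skew clock period. Only the 4.35 and 5.56 gigahertz figures come from this summary; the skew values and the function names are hypothetical and are not results of this work.

```python
def throughput_with_skew(f_no_skew_ghz, t_skew_ps):
    """Throughput in GHz after t_skew_ps is added to the no-skew clock period."""
    period_ps = 1000.0 / f_no_skew_ghz          # a 1 GHz clock has a 1000 ps period
    return 1000.0 / (period_ps + t_skew_ps)


def relative_loss(f_no_skew_ghz, t_skew_ps):
    """Fraction of the no-skew throughput that is lost to skew."""
    return 1.0 - throughput_with_skew(f_no_skew_ghz, t_skew_ps) / f_no_skew_ghz


# No-skew throughput figures quoted in the summary (GHz).
no_skew = {"six-stage": 4.35, "ten-stage": 5.56}

for skew_ps in (10.0, 20.0, 40.0):              # hypothetical skew values
    for name, f in no_skew.items():
        print(f"{name:9s} skew={skew_ps:4.0f} ps -> "
              f"{throughput_with_skew(f, skew_ps):.2f} GHz "
              f"({100 * relative_loss(f, skew_ps):.1f}% loss)")
```

Under this additive model the fractional loss is t_skew / (T + t_skew), so for any fixed skew the implementation with the shorter clock period gives up the larger share of its throughput.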

The design products and test results from this thesis 
provide a reference point for further research into 
alternative clocking/control techniques. Specifically, it 
is intended that future research use the CML HBT logic 
family designed in this thesis in order to implement the 
same array multiplier circuit using asynchronous control 
techniques. One such endeavor is already in progress as 
LtCol. Kirk Shawhan, USMC, investigates the use of local 
completion signals which employ request/acknowledge 
handshake signals to control the flow of data vice the use 
of a global clock signal. 






ACKNOWLEDGMENT 


My most immediate expression of gratitude is to the One 
that made me ... to my Creator, my Lord, and my Savior — 
Jesus Christ. I am convinced that the meticulous detail and 
extensive effort required to design even the simplest of 
electronic devices bears unmistakable testimony to the 
Divine and Intelligent design of our universe. 

But remember the Lord your God, for it is he who gives you 
the ability ... Deuteronomy 8:18 

...whatever you do, do it all for the glory of God. 

I Corinthians 10:31 

This thesis is dedicated to my wife, Laura. Though I 
have no expectation that she will ever read beyond this 
page, certainly it would never have been written without her 
encouragement, support, and patience. While I have 
considered myself fortunate to be learning about the 
mysteries of microelectronics, she has been making the most 
thrilling discoveries and profound impact each day as the 
mother of our beloved daughter, Victoria, who is 22 months 
old today. I love them both dearly. 

I wish to thank my advisors, Professor Douglas Fouts 
and Professor Herschel Loomis, for sharing their time, their 
knowledge, and their passion for this field of study. Their 
guidance was essential and their enthusiasm was contagious. 

I am also thankful to have shared this academic journey 
with such a quality core of fellow students and service 
members. Specifically, it is with great admiration that I 
thank LtCol. Kirk Shawhan, USMC — my perpetual study 
partner and project mentor. He has patiently endured my 
questions and generously shared his ingenious grasp of the 
most challenging concepts. My graduate school experience 
would have been a much different story without his 
assistance and more importantly, his friendship. 





I. INTRODUCTION 

A. THE RELEVANCE OF HIGH-SPEED LOGIC 

The demand for increased processing speeds in digital 
electronics has driven the clock period of logic circuits 
from a scale of microseconds to one of picoseconds over the 
past twenty years. This remarkable trend is the synergistic 
result of technological advancements and innovations in 
device physics, very-large-scale integrated (VLSI) circuit 
fabrication, and digital systems architecture. Moore's Law 
accurately predicted this trend of improvement 35 years ago, 
and current expectations are that the trend will continue 
(Moore, 1997). Consider the anticipation of such 
technologies as real-time multimedia satellite 
communications and broadband networks. These applications 
will require extremely high performance digital logic that 
can function reliably at clock rates of tens of gigahertz. 

B. THE PROBLEM OF CLOCK SKEW 

There are a variety of technological hurdles to clear 
before achieving such clock speeds, and it is the purpose of 
this thesis to explore one particular hurdle in the course 
of digital systems architecture: the problem of clock skew 
in high-speed logic. Clock skew is the difference between 
arrival times of the clock signal at different synchronous 
clocked devices (Harris, 1999). As clock frequencies reach 
into the multi-gigahertz range, clock skew is an increasing 
concern for high-speed circuit designers because it accounts 
for an increasing portion of the clock period — leaving 
less of the clock period to be budgeted for logic and 
latching delays. What was once a near negligible quantity 
has now become a significant design constraint. (Wakerly, 
2000) 

C. THE DESIGN OF A TEST CIRCUIT 

This thesis presents the design of a high-speed logic 
test circuit and the simulation of its performance in order 
to identify and quantify the effects of clock skew. It 
should be noted that these results are intended to serve as 
a reference for future research involving potential 
solutions for the reduction of clock skew. The following 
paragraphs develop the necessary specifications of the test 
circuit. 

To ensure valid results, it is important that the 
problem be simulated in an accurate context. Therefore, it 
is necessary to select a logic family based upon a 
transistor model that is capable of realizing multi-gigahertz 
clock speeds. Although complementary metal-oxide-semiconductor 
(CMOS) technologies dominate VLSI 
applications, for comparable fabrication technologies, a 
bipolar circuit is approximately 2.5 times faster than a 
functionally similar CMOS circuit (Foley, 1994). Typically, 
such high-speed bipolar circuits employ emitter coupled 
logic (ECL) or current mode logic (CML). Notably, these 
logic families consume significantly more power than field 
effect transistor (FET) logic families; however, the trade-off 
is accepted here for the purpose of achieving sufficient 
clock speeds. For these reasons, current mode logic is 
employed to design a family of logic gates based upon the 
transistor specifications for an indium phosphide (InP) 
heterojunction bipolar transistor (HBT), courtesy of Hughes 
Research Laboratories. 

Additionally, it is important that the architecture and 
functionality of the test circuit provide a relevant context 
for evaluation. It should be noted here that the shorter 
clock periods discussed above are not exclusively the result 
of faster gate delays (i.e. faster transistors) but are also 
the result of pipelined architectures which require fewer 
gate delays per clock cycle. In keeping with this 
characteristic of high-speed logic circuits, the test 
circuit implements a pipelined architecture. As for circuit 
functionality, an 8x8 bit multiplier was chosen to provide 
sufficient complexity for pipeline implementation. 

D. THESIS OUTLINE 

The purpose of this thesis is to design, simulate, and 
evaluate the performance of a high-speed (InP HBT) 8x8-bit 
pipelined multiplier in the presence of clock skew. The 
discussion begins with the review and development of several 
fundamental topics in Chapter II: clock skew, pipelining 
principles, logic-level design of a multiplier, and 
transistor-level design of BJT/HBT logic. Based upon that 
foundation, Chapters III through V present the hierarchical 
design of the pipelined multiplier from the bottom up. 
Respectively, these chapters address logic circuit design, 
clock-driven circuit design, and pipeline design. Each of 
the design chapters presents a complete discussion of 
pertinent design issues, low-level simulation, performance 
optimization, and final design specifications. Finally, 
Chapter VI records the analysis of clock skew and 
Chapter VII summarizes the conclusions of the entire work. 



II. BACKGROUND 


A. CLOCK SKEW 

Clock skew is the difference between the arrival times 
of the clock signal at two different clock-driven devices, 
as illustrated in Figure (2-1). This difference is 
dependent upon multiple issues including normal component 
variations, wire propagation delay, RC delays, propagation 
distance, environmental variations (such as operating 
temperature), and clock loading. Notably, all of these 
contributing factors have been increasing relative to gate 
delays. (Harris, 1999) 



Figure 2-1. Clock Skew (After Wakerly). 

In traditional logic designs which employ flip-flops 
and operate at extremely high clock frequencies, clock skew 
has become a significant portion of the total clock period. 
For a fixed-length clock period, this effectively reduces 
the amount of time available for computation. Equation 
(2-1) quantifies the terms which contribute to the minimum 
clock period ($T_{\text{min}}$) of a traditional synchronous logic 
circuit. 

(2-1) $T_{\text{min}} = t_{\text{skew}} + t_{\text{logic}} + t_{\text{Flip-Flop}}$

where $t_{\text{Flip-Flop}} = t_{\text{setup}} + (t_{\text{prop}})_{\text{max}}$
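
As a brief illustration of Equation (2-1), the sketch below adds up the terms of the minimum clock period and reports the share taken by skew. The function name and every delay value are hypothetical, chosen only to make the arithmetic easy to follow; they are not measurements from this thesis.

```python
def min_clock_period(t_skew, t_logic, t_setup, t_prop_max):
    """Minimum clock period per Equation (2-1); all arguments share one time unit."""
    t_flip_flop = t_setup + t_prop_max      # register overhead term of Equation (2-1)
    return t_skew + t_logic + t_flip_flop


# Hypothetical delays in picoseconds.
t_skew = 20.0
period = min_clock_period(t_skew=t_skew, t_logic=150.0, t_setup=10.0, t_prop_max=20.0)

print(f"minimum clock period: {period:.0f} ps")
print(f"share of the period consumed by skew: {100 * t_skew / period:.0f}%")
```

Because the skew term does not shrink when the logic delay is reduced, its share of the period grows as the other terms are made smaller.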


The simplest and most direct technique for minimizing 
clock skew would seem to be the implementation of a uniform 
clock distribution hierarchy which provides a local clock 
signal to a smaller portion of the entire circuit, i.e., a 
subcircuit. For signals that remain within the subcircuit, 
clock skew is reduced. The maximum propagation delay from 
the local clock source to the farthest clock input of the 
subcircuit can be kept within a desirable tolerance. But 
inevitably, signals must travel between subcircuits. This 
is an increasingly common occurrence when the maximum size 
of the subcircuit is restricted by practical limitations for 
fanout and power consumption — especially true in the case 
of current-driven logic. 

The local clock signals are not without skew relative 
to each other. Although the delay paths for each branch of 
the clock distribution tree may contain the same number of 
gate delays, the switching behavior along each path varies 


6 



within a narrow range. Thus, when a signal from one 
subcircuit must drive logic in another subcircuit, the 
worst-case value of relative clock skew must be assumed. 

An extensive clock distribution tree is employed in 
this thesis to provide local clock signals for circuit 
elements of a pipelined multiplier. Ultimately, the purpose 
is to quantify the clock skew experienced in a high-speed 
logic circuit and explore the impact of clock skew as the 
clock period is reduced. 

B. PRINCIPLES OF PIPELINING 

As referenced in the previous section, the minimum 
clock period is governed by the relationship presented in 
Equation (2-1). For a given block of combinational logic 
with an associated propagation time of $t_{\text{logic}}$, the minimum 
clock period is required to be even greater. In the face of 
a large, complex combinational circuit (Figure 2-2a) this 
could impose undesirable restrictions on clock speed. 

However, a pipelined approach suggests that the 
combinational logic can be broken down into discrete levels 
of operation, known as pipeline levels (Figure 2-2b). Each 
pipeline level will contain fewer levels of logic than the 
original combinational circuit, and ideally, each pipeline 
level will contain the same number of logic levels in order 
to achieve near-equal propagation delays. Then, by adding 
appropriately sized registers between these levels (Figure 
2-2c), the function of the original combinational logic can 
be achieved by sequentially sending operands through the 
series of pipeline levels. 

Furthermore, this can be done at a higher clock rate 
since the period is now governed by Equation (2-2), where 
$t_{\text{logic}}$ has now become $t_{\text{pipe-level}}$. 

(2-2) $T_{\text{min}} = t_{\text{skew}} + t_{\text{pipe-level}} + t_{\text{Flip-Flop}}$

The improvement in clock speed is quantified as the 
percentage of speedup, Equation (2-3). (Pollard, 1990) 

(2-3) $\text{Speedup} = \dfrac{\text{Time for } M \text{ operations WITHOUT pipelining}}{\text{Time for } M \text{ operations WITH pipelining}}$

Of course, this benefit is not without cost. There are 
several trade-offs involved such as increases in the number 
of components, power consumption, control complexity, chip 
area, and a variety of associated costs for design and 
fabrication. Additionally, the propagation latency for a 
single set of signals traveling through the pipeline is 
increased due to the additional delays contributed by the 
intermediate register(s) in the pipeline. Equation (2-4) 
expresses this increase in latency as a function of the 
number of pipeline stages (m) and the total register delay 
(Loomis, 2000). 

(2-4) $\text{Latency Increase} = (m-1)\, t_{\text{Flip-Flop}}$
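
The relationships in Equations (2-2) through (2-4) can be tried out with a minimal sketch. Two modeling choices here are assumptions and are not taken from the cited sources: the logic delay is split evenly across the m pipeline levels, and M operations are taken to need (m + M - 1) clock cycles once the pipeline has to fill. The function name and the delay values are hypothetical.

```python
def pipeline_metrics(m, M, t_logic, t_ff, t_skew=0.0):
    """Return (pipelined clock period, speedup, latency increase) for m stages."""
    t_unpiped = t_skew + t_logic + t_ff          # Equation (2-1)
    t_piped = t_skew + t_logic / m + t_ff        # Equation (2-2), even split assumed
    time_without = M * t_unpiped                 # one result per long clock period
    time_with = (m + M - 1) * t_piped            # fill the pipe, then one result per clock
    speedup = time_without / time_with           # Equation (2-3)
    latency_increase = (m - 1) * t_ff            # Equation (2-4)
    return t_piped, speedup, latency_increase


# Hypothetical delays in picoseconds, 1000 operations.
for stages in (1, 2, 4, 6, 10):
    period, speedup, extra = pipeline_metrics(stages, M=1000, t_logic=400.0, t_ff=25.0)
    print(f"m={stages:2d}  period={period:6.1f} ps  "
          f"speedup={speedup:4.2f}  latency increase={extra:5.1f} ps")
```

With these numbers the speedup grows more slowly than the stage count, because the register delay is paid in every stage while only the logic delay is divided.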



Figure 2-2. Example of Pipelining (After Loomis). 



Though the significant increase in delay for a single 
operation may seem to be a tragic loss, it is the remarkable 
increase in data throughput which accompanies the increase 
in clock speed that ultimately motivates the designer to 
adopt a pipelined architecture. 

In the context of this project, a pipelined 
architecture will facilitate the achievement of high clock 
speeds in the implementation of a relatively large, complex 
combinational circuit: a combinational multiplier. 

C. LOGIC DESIGN OF A COMBINATIONAL MULTIPLIER 

A combinational multiplier takes two n-bit operands and 
performs n shift and n add operations to generate a 2n-bit 
product. Most algorithms are implemented based upon the 
paper-and-pencil-like procedure of shifted product 
components as shown in Figure (2-3). Each individual bit of 
the multiplier ($y_0$ through $y_{n-1}$) is successively multiplied 
times the entire n-bit multiplicand. With each subsequent 
multiplier bit, the resulting product component is shifted 
by one bit position, starting with an initial shift of zero 
and concluding with n-1. (Wakerly, 2000) 
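
The paper-and-pencil procedure can be sketched in a few lines. This is an illustrative behavioral model for unsigned operands only, not the circuit designed in this thesis, and the function names are hypothetical.

```python
def partial_products(x, y, n=8):
    """Return the n shifted product components of x * y for n-bit unsigned operands."""
    # Bit i of the multiplier y selects either the multiplicand x or zero,
    # and the selected term is shifted left by i bit positions.
    return [(x if (y >> i) & 1 else 0) << i for i in range(n)]


def multiply(x, y, n=8):
    """Sum the shifted product components to form the 2n-bit product."""
    return sum(partial_products(x, y, n))


print(partial_products(13, 11, n=4))
print(multiply(13, 11, n=4))
```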

The worst-case delay for this type of multiplication is 
governed by the carry propagation out of the most 
significant bit position and into the follow-on stage of 
addition. By utilizing carry-save addition (Figure 2-4), 
this propagation delay is eliminated for the initial n-1 
stages of addition; however, an extra stage is required to 
complete the addition of the final two resulting terms, as 
will be explained shortly. 

Figure 2-3. Multiplication as a sum of partial product 
terms (From Wakerly). 

Figure 2-4. An 8x8 bit multiplier implemented with seven 
carry-save adder stages and one ripple-carry adder for 
carry completion (From Wakerly). 

The first carry-save addition stage takes two binary 
addends and generates an n-bit modulo-two sum and a shifted 
n-bit carry term (shifted by one bit). Subsequent carry- 
save addition stages take three binary addends: the 
previous partial sum, the shifted carry term, and the next 
subsequent product term. These are also added to produce an 
n-bit modulo-two sum and a shifted n-bit carry term. As 
each carry-save addition occurs, the least significant bit 
(LSB) of each partial sum becomes the next bit of the final 
product, proceeding toward the most significant bit (MSB). This is 
repeated until the nth product term has been added, and all 
that remains are a sum term and a shifted carry term. At 
this point, a carry-completion adder computes the most 
significant n+1 bits of the product. This procedure 
accounts for the consecutive propagation of a carry bit as 
each pair of addend bits are summed from LSB to MSB. 
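
The carry-save procedure described above can be modeled behaviorally as follows. This is a simplified sketch under stated assumptions: the sum and carry terms are held as full-width integers rather than as n-bit words with the LSB peeled off at each stage, and all function names are hypothetical. It follows the same structure, however: n-1 carry-save stages followed by one ripple-carry adder for carry completion.

```python
def carry_save_add(a, b, c):
    """One carry-save stage: per-bit full adders with no carry chain between bits."""
    partial_sum = a ^ b ^ c                        # modulo-two sum of each bit column
    carry = ((a & b) | (a & c) | (b & c)) << 1     # column carries, shifted by one bit
    return partial_sum, carry


def ripple_carry_add(a, b, width):
    """Carry-completion adder: the carry propagates from LSB to MSB."""
    result, carry = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    return result


def csa_multiply(x, y, n=8):
    """Multiply two n-bit unsigned operands with n-1 carry-save stages."""
    terms = [(x if (y >> i) & 1 else 0) << i for i in range(n)]
    s, c = terms[0], 0                             # the first stage sees only two addends
    for term in terms[1:]:                         # n-1 carry-save stages in total
        s, c = carry_save_add(s, c, term)
    return ripple_carry_add(s, c, 2 * n)           # final sum and carry terms are merged


print(csa_multiply(200, 123))
```

Each carry-save stage preserves the running total (sum term plus carry term) without waiting for any carry to ripple, which is why only the final adder has a long carry path.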

In the context of this project, the implementation of 
carry-save adders and carry completion adders allows 
convenient grouping of pipeline stages. This is 
particularly applicable to the final stage of the design 
process undertaken in this project. Chapter V provides 
further details on the implementation of a pipelined 8x8-bit 
combinational multiplier, as introduced in the preceding 
paragraphs. 

D. BJT/HBT LOGIC 

1. BJT/HBT Principles and Characteristics 

a) Device Structure 

A bipolar junction transistor (BJT) is a sandwich 
structure of three separately doped regions of silicon (or 
other suitable semiconductor), such that one of two 
configurations exists. One configuration is the pnp 
transistor where a negatively doped region is bounded on 
either end by positively doped regions (p-type transistor). 
The other configuration is the npn transistor where a 
positively doped region is bounded on either end by 
negatively doped regions (n-type transistor). Figure (2-5) 
provides a simplified illustration and further identifies 
the proper names for the regions: collector, base, and 
emitter. 

Figure 2-5. Structure of a Bipolar Junction Transistor 
(After Pierret). 

Until recent years, BJTs were generally fabricated 
from a single semiconductor material. However, device-level 
physics has demonstrated that faster junction 
transistors can be constructed from dissimilar semiconductor 
materials with complementary properties. Such devices are 
known as heterojunction bipolar transistors (HBTs). 
Conveniently enough, their operational behavior is 
essentially governed by the same functional principles as 
BJTs (Pierret, 1996). Therefore, it is assumed that 
wherever BJT behavior is referenced, a direct correspondence 
to HBT behavior exists. The following sections will provide 
a fundamental understanding of that behavior. 

b) Device Function 

The significance of the BJT lies in its potential 
to behave as a current-controlled current source when the 
proper DC bias is applied to the three regions or terminals. 
The controlling terminal is the base. Applying the proper 
DC bias to an npn transistor, a small current flowing into 
the base will produce a proportionately larger current being 
drawn into the collector, across the base region, and out of 
the emitter (Figure 2-6). The converse is true for a 
properly biased pnp transistor. A small current drawn out of 
the base will produce a proportionately larger current being 
drawn into the emitter, across the base region, and out of 
the collector. From this point forward, it will be helpful 
to limit the discussion to npn transistors, because the pnp 
transistors operate in a very similar manner (with reversed 
polarity) and npn transistors are the only type encountered 
in the chapters ahead. 

Figure 2-6. A functional illustration of an (a) npn and 
a (b) pnp bipolar junction transistor (After Sedra). 

As stipulated in the preceding discussion, proper 
DC bias conditions must exist in order to achieve the 
desired performance. Depending upon the DC bias, the 
transistor will operate in one of the following modes of 
operation: cutoff, active, or saturation. In the first 
case, the emitter-base junction is reverse biased, which 
means $V_{BE} < V_{BE(on)}$ for the pn junction (0.75 V). This also 
implies that $V_{BC} < V_{BC(on)}$ for the collector-base junction. 
Therefore, the collector-base junction is also reverse 
biased. This condition is known as the "cutoff" mode since 
effectively no current flows through the transistor. 

In the two remaining modes, the emitter-base 
junction is forward biased, and the transistor conducts 
current. The mode of operation is distinguished by the 
condition of the collector-base junction, using the 
emitter as a common reference for both the collector and 
base. If $V_{CE} < V_{CE(sat)}$, then the base-collector junction is 
forward biased and the transistor is saturated; the flow of 
current from collector to emitter is not linearly dependent 
on $I_B$. Conversely, when $V_{CE} > V_{CE(sat)}$, the 
base-collector junction is reverse biased 
and current is swept from the collector, across the base, 
and out of the emitter in linear proportion to the amount of 
base current applied. This is known as the active region. 

Table (2-1) summarizes the relationships which 
govern the three regions of operation. Furthermore, Figure 
(2-7) is an i-v curve for the Hughes InP HBT (1x1 micron). 
It serves to illustrate the active and saturation modes of 
BJT operation while also providing necessary design 
information that relates the base-emitter voltage drop ($V_{BE}$) 
to collector current levels ($I_C$). 

The linearly proportionate increase in collector 
current relative to base current is referred to as the 
common-emitter current gain, Beta ($\beta$), as shown in Equation 
(2-5). (Sedra, 1998) 

(2-5) $\beta = I_C / I_B$


Mode of 
Operation 


Base-Emitter 

Junction 


Collector-Emitter 

Junction 



Bias 

Relationship 

Bias 

Relationship 

Cutoff 

Reverse 

V < V 

v BE v BE(on) 

Reverse 



Saturation 

Forward 

^BE ^ ^BE(on) 

Forward 

v CE 

^ ^CE(sat) 

Active 

Forward 

"^BE ^ ^BE(on) 

Forward 

v CE 

^ ^CE(sat) 


Table 2-1. Relationships governing the operational regions 
of the BJT transistors (After Sedra). 



Figure 2-7. I-V Curve for the InP HBT 



























Figure 2-8. Variation of Beta for the InP HBT with
respect to V_BE and V_CE.

Beta is a device parameter for BJTs — a function of the 
device physics and dimensions. Figure (2-8) illustrates how 
Beta varies according to the values of base-emitter voltage 
and collector-emitter voltage. 

Finally, a simple application of Kirchhoff's
Current Law produces Equation (2-6), an important
relationship for current through the transistor.

(2-6)    I_E = I_B + I_C



c) DC Analysis of a BJT Circuit 

In order to illustrate the basic concepts of BJT
operation as presented in the previous section, the
transistor circuit in Figure (2-9) is now examined. Given
the reference voltages, the turn-on voltage for the
emitter-base junction (0.75v), and Beta for the transistor,
it is readily determined that V_BE > V_BE(on), and therefore
the emitter-base junction is forward biased. DC analysis
reveals the values of V_B and I_B. Applying the equations
from the previous section, I_C, I_E, and V_C are determined,
and it is concluded that the transistor is operating in the
active region.


Circuit values:  V_CC = +10v   V_BB = +5v   R_C = 1kΩ   R_B = 100kΩ

DC ANALYSIS:

V_E = 0v

V_B = V_E + V_BE(on) = 0.7v

I_B = (V_BB - V_B) / R_B = (5v - 0.7v) / 100kΩ = 43µA

I_C = β x I_B = 4.3mA

I_E = I_C + I_B = 4.343mA

V_C = V_CC - I_C R_C = 10v - (4.3mA)(1kΩ) = 5.7v


Figure 2-9. DC Analysis of a simple BJT circuit. 
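
The DC analysis of Figure 2-9 can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original design flow: the function name `bjt_dc_analysis` is hypothetical, and Beta = 100 is an assumption inferred from the ratio of the 4.3mA collector current to the 43µA base current. The active-mode check simply confirms that the collector sits above the base, so the collector-base junction is reverse biased.

```python
def bjt_dc_analysis(v_cc, v_bb, r_c, r_b, beta, v_be_on=0.7):
    """DC operating point of a common-emitter stage with a grounded emitter.

    Assumes the emitter-base junction is forward biased (v_bb > v_be_on).
    """
    v_e = 0.0                   # emitter tied to ground
    v_b = v_e + v_be_on         # base sits one turn-on drop above the emitter
    i_b = (v_bb - v_b) / r_b    # base current through R_B
    i_c = beta * i_b            # Equation (2-5)
    i_e = i_c + i_b             # Equation (2-6)
    v_c = v_cc - i_c * r_c      # drop across the collector resistor
    return {
        "v_b": v_b, "i_b": i_b, "i_c": i_c, "i_e": i_e, "v_c": v_c,
        "active": v_c > v_b,    # collector-base junction reverse biased
    }


# Beta = 100 is inferred from the figure's numbers, not stated in the text.
op = bjt_dc_analysis(v_cc=10.0, v_bb=5.0, r_c=1e3, r_b=100e3, beta=100.0)
print(f"I_B = {op['i_b'] * 1e6:.1f} uA")
print(f"I_C = {op['i_c'] * 1e3:.2f} mA")
print(f"I_E = {op['i_e'] * 1e3:.3f} mA")
print(f"V_C = {op['v_c']:.2f} v, active: {op['active']}")
```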



In anticipation of logic applications, consider
the base voltage as a logical input which is either high
(above V_BE(on)) or low (below V_BE(on)). For a logic high
input the transistor operates in the active mode, causing
the voltage at the collector to drop below V_CC by an amount
equal to I_C R_C. Alternatively, for a logic low input the
transistor operates in the cutoff mode, drawing effectively
no current through the collector and leaving V_C
approximately equal to V_CC. The functionality of this
circuit is essentially that of a basic BJT inverter.

d) BJT Differential Pair 

Before committing to the discussion of transistor
logic circuits, it is necessary to introduce a configuration
that maximizes the switching speed of the BJT transistor:
the differential pair. A differential pair is constructed
from two matched transistors (Q1 and Q2) with their emitters
attached to a common current source and their collectors
independently biased via separate pull-up resistors to a
common voltage source, as shown in Figure (2-10). The base
terminals are attached to separate voltage sources of equal
value. Assuming the transistors have been given the proper
DC bias for operation in the active mode, the relationship
in Equation (2-7) is readily determined.

(2-7)    I_E1 = I_E2 = I_bias / 2


Figure 2-10. Example of a BJT Differential
Pair configuration.

Now, consider the scenario where V_B2 is constant
and V_B1 is allowed to vary between two extremes: one above
and one below V_B2. When V_B1 reaches a voltage sufficiently
larger than V_B2, all of the current from I_bias is steered
through Q1 such that Q2 is cut off. Conversely, when V_B1
drops sufficiently below V_B2, Q2 is on and Q1 is cut off.
As noted in the DC analysis of the previous BJT circuit, the
collector voltage of Q1 exhibits the behavior of a logic
inverter with respect to V_B1, while the opposite collector
voltage (Q2) functions as a non-inverting buffer.



While the availability of complementary output
voltages is certainly convenient, the most important
observation of the differential pair is its switching speed.
A relatively small voltage difference between V_B1 and V_B2
is required to switch the current almost entirely to the
opposite path. More specifically, for a differential pair
implemented with the Hughes InP HBT, it is shown in Figure
(2-11) that a difference of only 75mV is sufficient to
switch 90% of the current.



Figure 2-11. Current Switching Characteristic of the InP
HBT Differential Pair (horizontal axis: Input Voltage, V).
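
The current-steering behavior can be approximated with the textbook exponential model of an ideal, matched differential pair. The sketch below is an assumption-laden illustration, not the device simulation behind Figure 2-11: it takes an ideality factor of one, no emitter series resistance, and a room-temperature thermal voltage of about 25.85mV. The function name `steering_fraction` is hypothetical.

```python
import math


def steering_fraction(v_diff, v_t=0.02585):
    """Fraction of I_bias carried by Q1 for v_diff = V_B1 - V_B2 (volts).

    Ideal exponential model of a matched pair; v_t is the assumed
    thermal voltage at room temperature.
    """
    return 1.0 / (1.0 + math.exp(-v_diff / v_t))


for millivolts in (0, 25, 50, 75, 100):
    share = steering_fraction(millivolts / 1000.0)
    print(f"{millivolts:4d} mV -> {100.0 * share:5.1f}% of I_bias in Q1")
```

Under these idealizations a 75mV difference steers roughly 95% of the bias current, which is consistent with the statement that 75mV is sufficient to switch 90% of the current. A real device with parasitic emitter resistance would need somewhat more differential voltage than the ideal model suggests.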







Furthermore, since Q1 and Q2 are biased to operate
in the active mode, the switching occurs faster than in
configurations which place the transistors in saturation
mode. This is because a saturated transistor stores charge
in its base. That charge must be dissipated before
switching can occur.

It is the current-steering property of the 
differential pair configuration which ultimately provides a 
foundation for the development of current mode logic, as 
will be discussed later in this chapter. However, before 
reaching that discussion, a brief overview of the dominant 
BJT logic families will serve to accentuate the advantage of 
current mode logic. 

2. BJT/HBT Logic Families 

This discussion is not intended to address all BJT/HBT
logic families. Rather, the purpose here is to summarize the
principles of the two most popular and relevant BJT/HBT
logic families. These are transistor-transistor logic and
current-mode logic. Ultimately, this discussion culminates
with a comparison of the two logic families in order to
justify the implementation of current-mode logic for
high-speed applications.

a) Transistor-Transistor Logic (TTL) 

Transistor-transistor logic evolved directly from
diode-transistor logic (DTL) in a successful effort to
eliminate the drawbacks of DTL. (Richards, 1967) While
there were several stages in this evolution, the end product
is a TTL family which resembles the inverter shown in Figure
(2-12). The enhanced performance of TTL is predominantly
achieved through two fundamental design features.

The first improvement is the use of a second
transistor in place of the diodes of a DTL circuit. For a
low input voltage, Q1 is turned on, rapidly drawing
current from the base of Q2 and dissipating the excess
charge to achieve a faster transition. In the opposite
case, when the input is high and Q1 is cut off, Q1 is
specifically engineered to have a low reverse Beta such that
a small yet sufficient current flows out through the
collector and is applied to the base of Q2.

The second improvement is the use of an optimum
output stage, commonly referred to as the "totem-pole"
output stage (not shown in Figure 2-12). It combines
the rapid high-to-low transition capability of the
common-emitter output stage with the rapid low-to-high
transition capability of the emitter-follower output stage.

Based upon these two features in conjunction with 
other minor modifications, TTL logic achieved a level of 
popularity which made it the dominant design for SSI, MSI, 
and LSI circuits throughout two decades. Despite this 
success, standard TTL circuit speeds are still limited by 
two design issues. First, transistors operate in saturation 
mode which increases junction capacitance and its associated 
switching delay. Second, the resistance along the 
dissipation path for junction capacitance further increases 
this delay. 

b) Current-Mode Logic (CML) 

Current-mode logic is distinct from the design of 
other BJT/HBT logic families. The term "current-mode" 
refers to the channeling of a constant current along 
alternate paths to achieve logic functionality in circuits. 
Since it is the presence or absence of current that 
determines the logical output, the maximum voltage swing can 
be relatively small in contrast to voltage-mode circuits, 
such as TTL. 

The distinguishing design feature of current-mode
logic circuits is the BJT differential pair. It is the
backbone of all CML circuits and the source of critical
advantages and disadvantages. The benefit of smaller logic
swings has already been mentioned. Also, the discussion of
the BJT differential pair earlier in this chapter explained
how the collector voltage swings (inverts) rapidly in
response to reversing the polarity/magnitude of the
differential inputs by a narrow margin of approximately
75mV. This translates into a switching speed for CML which
is unsurpassed by its predecessors. Contributing to this
remarkable speed is the fact that the transistors of the
differential pair can be operated in the active region and,
therefore, do not suffer from the effects of excess charge
stored at the transistor base. Unfortunately, the constant
flow of current which enables these switching speeds also
consumes a considerable amount of power.

For an illustration of how a CML circuit
functions, consider the inverter in Figure (2-13). Let
input B have a constant value, a reference voltage. When
input A is high (greater than the reference voltage by at
least 75mV), then Q1 is turned on and Q2 is cut off. The
current being drawn through R1 produces a logic low
(V_CC - I_bias R1) at V_out1. Notably, the complement of
this output, a logic high (V_CC), is simultaneously
available at V_out2. The presence of complementary outputs
is yet another benefit of CML circuits. When input A is
switched from high to low, the conditions for Q1 and Q2
reverse. Q2 turns on and Q1 is cut off. V_out2 is pulled
low while V_out1 is pulled high.

Figure 2-13. CML Inverter.
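
The behavior of the CML inverter can be summarized in an idealized Python sketch. It assumes that the full bias current is steered to one side once the inputs differ by at least 75mV and that base currents are negligible. The function name `cml_inverter_outputs` is hypothetical, and the numerical values (a 1.7v supply, a 1.45v reference, 1mA of bias current, and a 500 ohm load resistor) are borrowed from the non-buffered inverter of Chapter III purely for illustration.

```python
def cml_inverter_outputs(v_a, v_ref, v_cc, i_bias, r_load, margin=0.075):
    """Return (v_out1, v_out2) for an idealized CML inverter.

    v_out1 is the inverted output (collector of Q1); v_out2 is the
    non-inverted output (collector of Q2). Inputs closer to the
    reference than `margin` are treated as undefined.
    """
    v_low = v_cc - i_bias * r_load    # collector carrying the bias current
    if v_a - v_ref >= margin:         # Q1 on, Q2 cut off
        return v_low, v_cc
    if v_ref - v_a >= margin:         # Q2 on, Q1 cut off
        return v_cc, v_low
    raise ValueError("input is too close to the reference voltage")


# Illustrative values only (volts, amperes, ohms).
high_in = cml_inverter_outputs(1.7, 1.45, v_cc=1.7, i_bias=1e-3, r_load=500.0)
low_in = cml_inverter_outputs(1.2, 1.45, v_cc=1.7, i_bias=1e-3, r_load=500.0)
print("A high ->", high_in)
print("A low  ->", low_in)
```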

c) Advantages and Disadvantages 

For high-speed applications, the selection of a 
BJT logic design is reduced to a quantitative comparison of 
TTL and CML. The predecessors of these two logic families 
are far inferior in their capability to dissipate the 
accumulated charge at the transistor base upon switching. 

If the only two criteria were maximizing speed
while minimizing power consumption, then there could
possibly be a toss-up between TTL and CML, ultimately to
be determined by the design which achieves the lowest
power-delay product or by weighting one specification over
the other (high-speed or low-power). Clearly, TTL is the
low-power contender, while CML is the high-speed champion.
However, before addressing the issue in the context of this
design project, consider the following summary of advantages
and disadvantages.

In addition to being faster, CML requires a
smaller voltage swing than TTL and is less susceptible to
noise due to the nature of the BJT differential pair. As
another benefit of that nature, CML generates complementary
outputs. The fact that both output signals are referenced
to V_CC provides for exceptional stability when V_CC is
referenced to ground and a negative supply voltage is used.
Unfortunately for TTL, its strong point of consuming less
power has a downside: the short pulses of current which
must be generated for switching logic levels also create
spikes in the supply voltage. The constant current drawn by
CML circuits avoids this potential source of noise.

To conclude this comparison, a logic designer
presented with the choice of CML or TTL would only choose
TTL in the event that power consumption made CML
impractical. In real world applications, this is typically
the case. However, since it is the purpose of this design
project to explore the impact of high-speed logic on digital
system architecture, priority has been given to the superior
speed and extensive design benefits of CML.

Having concluded that current-mode logic is the
best approach to HBT high-speed logic design, it is
necessary to design a sufficient set of logic gates to
implement the desired test circuit, an 8x8 bit pipelined
multiplier. Chapter III presents the discussion of logic
circuit design which includes design of the following: an
inverter/buffer gate, a NOR/OR gate, full adders, and a
practical current source.





III. HBT CML LOGIC CIRCUIT DESIGN


A. DESIGN OVERVIEW 

In this chapter, CML logic circuits are designed which 
will serve as the building blocks for construction of the 
multiplier logic. The design process is presented in the 
context of a single logic circuit, beginning with the most 
fundamental functions and progressing toward the more 
complex. Of note are the following general design goals 
which served as guidance for decision-making in the early 
stages of logic circuit design: 

• Minimize the rail voltages (i.e. supply voltage) 

• Achieve proper DC bias conditions with reliable 
noise margins and fanout 

• Optimize transient performance for speed and power 
consumption 

B. INVERTER DESIGN 

1. Circuit Topology 

Based upon the introduction to CML design in the
previous chapter, Figure (3-1) illustrates the circuit
topology of a CML inverter. A detailed description of its
function is presented in the previous chapter and will not
be repeated here. However, there is one subtle constraint
in this design. One of the differential inputs is tied to a
reference voltage. While this is not essential for the
design of an inverter, it will prove significant in the
implementation of multiple-input logic gates. A common
reference voltage eliminates the need to provide
complementary logic signals for each input and, furthermore,
it avoids the increase in supply voltage associated with
multiple complementary inputs in a stacked series of
differential input pairs.

Figure 3-1. CML Inverter.

Figure (3-2) illustrates the same inverter design as
Figure (3-1); however, it also includes an emitter-follower
stage at each collector output of the differential pair.
The purpose of this stage is twofold. First, it provides a
buffer between the input differential pair and the
capacitive load of subsequent driven logic gates. Second,
it produces a downward DC shift equal to the base-emitter
turn-on voltage. Ideally, the gain of the emitter-follower
is one; however, in practice the gain is slightly less than
one. The result is a slightly diminished voltage swing at
the output of the emitter-follower when compared to the
voltage swing at the collector of the differential pair.

Whether or not to include the buffer stage represents a
fundamental design issue for CML logic circuit design. At a
glance, performance arguments can be made both for and
against it. On the one hand, it would appear to increase
fanout performance, yet on the other, it would appear to
decrease switching performance with the additional switching
delay of a second transistor stage. Additionally, the
non-buffered output topology would consume less power for a
given bias current. However, without performance data to
substantiate one option over the other, both will be
developed and evaluated until objective design
considerations can identify a clear preference.

2. Initial Conditions and Design Parameters 

a) Voltage Parameters 

Having introduced the topology of the CML 
inverter, it is necessary to establish initial conditions 
for operation. The first is the supply voltage, which is 
bound by two primary considerations. It must be large 
enough to support the proper function of the circuit, i.e. 
provide proper transistor bias conditions and the desired 
voltage range between high and low logic levels. 
Conversely, it should be kept as small as possible, because 
the power consumed by the circuit is directly proportional 
to the magnitude of the supply voltage. 

Clearly, foresight must be exercised in order to
determine the minimum supply voltage necessary to achieve
proper DC bias conditions for all transistors in all
circuits of the design. In the context of this project, the
D-type latch design (presented in Chapter IV) imposes the
greatest demand on the supply voltage level by operating
three transistors in series between the voltage supply
rails. For optimum, reliable clocking performance of the
latch, the logic reference voltage is determined to be 1.45
volts. This figure is based upon a maximum logic signal
range of 0.5 volts and a maximum logic high voltage of 1.7
volts (reference Chapter IV-A-3a for further details).

Given this information, the minimum required
supply voltage is determined for each inverter topology.
Both require that the voltage at the collector (V_C) be large
enough to avoid saturation of Q1. Furthermore, both require
that the voltage at the collector provide for an output
voltage that matches the range of the input voltage.

For the non-buffered topology, this implies an
inverse match between the voltage at the base of Q1 and the
voltage at its collector. In other words, for a logic input
that is high, V_B(hi), the output voltage at the collector
should be low, such that the following relationship in
Equation (3-1) holds true.

(3-1)    V_C(low) = V_B(hi) - 0.5v

Assuming the collector of Q1 draws approximately 1mA of
current, the collector-emitter saturation voltage,
V_CE(sat), is 0.275 volts and the base-emitter turn-on
voltage is 0.775 volts. Under these conditions, Q1 is on
the boundary of active mode operation. For a signal swing
larger than 0.5 volts, the transistor would saturate.
Conversely, for a logic input (V_B) that is low, the
collector voltage (V_C) must be given by Equation (3-2).

(3-2)    V_C(hi) = V_B(low) + 0.5v



For V_B(low) equal to 1.2 volts, V_C(hi) must be 1.7 volts.
Thus, for the non-buffered topology, the maximum voltage at
the collector is 1.7 volts. No current flows through R_gain
because Q1 is cut off; therefore, the minimum required
supply voltage is also 1.7 volts.

In the case of the buffered topology, the DC
voltage drop across the base-emitter junction of the output
buffer imposes a greater demand. For the output voltage
range to match the input voltage range, the voltage at the
collector (as described in Equation 3-2) must be increased
by an amount of V_BE(on) (as shown in Equation 3-3) in order
to counter the base-emitter voltage drop at the buffered
output.

(3-3)    V_C(hi) = V_B(low) + 0.5v + V_BE(on)

Assuming a current of 1mA or less through the buffer,
V_BE(on) is 0.775 volts. The result is a minimum required
supply voltage of approximately 2.5 volts (1.2v + 0.5v +
0.775v = 2.475v). (Reference Chapter IV-A-3a for a thorough
derivation of these conclusions.)

In summary, different supply voltage levels will 
be utilized for the two inverter topologies. The non- 
buffered output topology will employ a 1.7 volt supply 
voltage, while the buffered output topology will employ a 
2.5 volt supply voltage. 
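
The supply voltage arithmetic of Equations (3-1) through (3-3) can be checked with a short Python sketch. The values are those quoted in the text; the helper name `min_supply` and the closing saturation check are illustrative assumptions. The check uses the fact that, for a high input, the emitter sits at V_B(hi) - V_BE(on) while the collector sits at V_B(hi) - swing, so V_CE = V_BE(on) - swing.

```python
V_B_LOW = 1.2      # logic low input level (volts)
SWING = 0.5        # logic signal range (volts)
V_BE_ON = 0.775    # base-emitter turn-on voltage near 1mA (volts)
V_CE_SAT = 0.275   # collector-emitter saturation voltage near 1mA (volts)


def min_supply(v_b_low, swing, buffer_drop=0.0):
    """Highest collector voltage required, per Equations (3-2) and (3-3).

    With the driving transistor cut off, no current flows in R_gain,
    so this is also the minimum supply voltage.
    """
    return v_b_low + swing + buffer_drop


non_buffered = min_supply(V_B_LOW, SWING)
buffered = min_supply(V_B_LOW, SWING, buffer_drop=V_BE_ON)

# V_CE of Q1 for a logic high input, from Equation (3-1).
v_ce_high_input = V_BE_ON - SWING

print(f"non-buffered supply: {non_buffered:.3f} v")
print(f"buffered supply:     {buffered:.3f} v (rounded up to 2.5 v in the text)")
print(f"V_CE at high input:  {v_ce_high_input:.3f} v versus V_CE(sat) = {V_CE_SAT} v")
```

The last line illustrates why Q1 is described as being on the boundary of active operation: a swing any larger than 0.5 volts would push V_CE below V_CE(sat).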



b) Transistor Area/Size 

In order to optimize switching speeds in BJT/HBT
transistors, it is desirable to keep the device area small,
thereby minimizing parasitic capacitances. Likewise, a
smaller device size requires less current and less current
means less power. The InP HBT device sizes made available
from Hughes Research Laboratories have junction areas of
1x1, 1x3, 1x5, and 2x5 microns. The 1x1 area transistor is,
therefore, the transistor of choice for switching
applications (logic circuits). Note, however, that the
consideration of device size must be re-visited for
applications where switching speed is not a factor, i.e. the
construction of a practical current source (addressed in
Chapter IV).

c) Fanout Requirement 

Fanout is the number of logic gate inputs that a
single gate output can drive, while providing voltage levels
within the correct logic range. Increased fanout is
achieved at the expense of power consumption and loss of
speed. Considering that the CML logic inputs/loads are
current-driven, increased fanout will require a
corresponding increase in switching delay and/or current.
As a result, the fanout parameter should be chosen such that
it sufficiently economizes the number of logic gates and
levels of logic required without needlessly sacrificing
power and speed. In meeting this requirement, a reasonable
fanout parameter has been established based upon the
logic-level design of a three-input adder (reference Chapter
III-D). For implementation using the minimum number of
logic levels, a three-input adder requires a fanout of four.

3. DC Analysis 

a) Overview 

Given the circuit topology for a CML inverter as 
shown previously in Figure (3-2), the first step in circuit 
design is to establish the proper DC bias conditions for 
operation. This can be done for both the buffered and non- 
buffered cases simultaneously. For the non-buffered case, 
simply disregard the presence of the buffer stages. The 
remaining node voltages at the collector outputs on the 
differential pair are the same. 

Figures (3-3a) and (3-3b) show the DC node
voltages for the desired operation of a CML inverter given a
high logic input and a low logic input, respectively. Given
matched transistors, the two sides of the differential pair
could be considered symmetric in their behavior, except that
the input voltages driving the opposite sides of the
differential pair are not symmetric. That is, the reference
voltage drives the differential pair at 1.45 volts whereas
the logic input drives it at 1.7 volts. The result is a
difference of 0.25 volts at the emitter. This is a minor
observation at present, but it explains the non-symmetric
performance that is encountered between the two output
signals (the inverted and the non-inverted signals).

Node voltages annotated in Figure 3-3 (leakage current is
neglected):

(a) HIGH input
    Collector of Q1:   V_CC - (I_B3 + I_C1) R_gain
    Collector of Q2:   V_CC - I_B4 R_gain
    Buffered outputs:  V_CC - (I_B3 + I_C1) R_gain - V_BE(on)
                       V_CC - I_B4 R_gain - V_BE(on)

(b) LOW input
    Collector of Q1:   V_CC - I_B3 R_gain
    Collector of Q2:   V_CC - (I_B4 + I_C2) R_gain
    Common emitter:    V_REF - V_BE(on)
    Buffered outputs:  V_CC - I_B3 R_gain - V_BE(on)
                       V_CC - (I_B4 + I_C2) R_gain - V_BE(on)

Figure 3-3. DC Analysis of a CML Inverter for (a) a HIGH
input logic level and (b) a LOW input logic level.



b) Gain Resistor 

In order to take advantage of the switching speed
of the differential pair, transistors must be biased to
operate in the active mode. Therefore, the value of the
base-emitter voltages (V_BE) for Q1 and Q2 must be such that
V_CE > V_CE(sat). Thus, for a given supply voltage and bias
current, there is a restriction on the magnitude of the
voltage drop across R_gain. If the drop is too large, the
transistor will saturate. Conversely, the voltage drop must
not be too small because it is the product of I_Rgain and
R_gain which determines the magnitude of the signal voltage
swing (assuming active operation). This same voltage range
applies to the output of the buffer stages as well. As
referenced earlier in this chapter, a constant DC shift of
V_BE(on) is the only difference between the nodes V_C and
V_buf.

In summary, the significance of R_gain is two-fold:
it must be small enough to keep Q1 (and Q2) operating in the
active mode, and it must be large enough to provide a
satisfactory voltage swing between logic levels. Figure
(3-4) illustrates the DC transfer characteristic of the
inverter for various values of gain resistance. It
effectively demonstrates the upper and lower limitations of
gain resistance for a value of I_bias equal to 1mA. At
resistances of 500 ohms and less, the desired 0.5 volt
signal swing is not achieved, and at resistances of 600 ohms
and greater, the effect of saturation can be observed by the
upward bend in the curve.

Figure 3-4. Effect of Gain Resistor Variation on
Inverter Output.
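
A first-order estimate of the gain resistance follows directly from the swing relationship: the signal swing is the product of the current through R_gain and R_gain itself. The Python sketch below assumes that the entire bias current flows through R_gain, which slightly overstates the actual collector current, so the simulated value will be somewhat higher than the estimate. This agrees with the observation that 500 ohms falls just short of a 0.5 volt swing at 1mA. The function name `gain_resistance` is hypothetical, and the list of bias currents is the set later reported in Table 3-2a.

```python
def gain_resistance(swing_v, i_bias_a):
    """First-order R_gain (ohms) for a target swing, assuming I_Rgain = I_bias."""
    return swing_v / i_bias_a


TARGET_SWING = 0.5  # volts

for i_bias_ma in (0.1, 0.25, 0.5, 0.75, 1.0, 1.5):
    r_gain = gain_resistance(TARGET_SWING, i_bias_ma * 1e-3)
    print(f"I_bias = {i_bias_ma:4.2f} mA -> R_gain of about {r_gain:6.0f} ohms")
```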

c) Buffer Resistor 

The buffer resistor (R_buf) governs the amount of
current drawn by the emitter of transistors Q3 and Q4. The
magnitude of emitter current is directly proportional to the
base current which is drawn from the collector of the
differential pair. Thus, the base current of the output
buffer represents a small portion of the current passing
through R_gain. In this way, the size of the buffer resistor
effectively produces a small DC offset at the buffered
output while regulating the amount of current drawn through
the buffer stage.

This is significant for two reasons. First, it
facilitates optimization of switching speed versus power
consumption by providing a mechanism for controlling the
amount of current flowing through the buffer stage and,
therefore, available to drive a logic load. Second, the DC
voltage offset at the buffered output is inversely
proportional to R_buf. The ability to control this offset
is especially helpful in matching the output signal swing to
the input. Figure (3-5) represents the variation of output
voltage for a range of resistor values based upon a bias
current of 1mA.

d) Bias Current 

Bias current is directly proportional to the
current (I_C) drawn through the gain resistor (R_gain).
Therefore, bias current drives the magnitude of the voltage
drop produced in the gain resistor, and this voltage drop
corresponds to the maximum signal voltage swing. For this
reason, a proper combination of I_bias and R_gain must be
determined to provide the desired 0.5 volt swing. In order
to select from an infinite set of current-resistor
combinations, a likely set of current-resistor pairs will be
identified to represent the practical range of
possibilities. This is done for both the buffered and
non-buffered inverter topologies. Note, the non-buffered
topology can be allowed to draw a higher bias current
through the differential pair because it does not draw any
additional current through buffer stages.

Figure 3-5. Effect of Buffer Resistor Variation on
Inverter Output.

e) DC Noise Margins 

Once values of resistance and bias current are
established, the circuit topology is completely defined and
a DC transfer curve can be obtained. From this plot the DC
noise margins for a particular design are calculated. Noise
margins provide a measure of the allowable noise which can
be received at the input without affecting the correct logic
output. Since this circuit will be operating with such a
narrow signal voltage swing, noise margins are a critical
interest for establishing reliable DC bias conditions.
Equations (3-4) and (3-5) define the high and low noise
margins in terms of the maximum and minimum, high and low
logic values. (Weste, 1993)

(3-4)    NM_L = | V_ILmax - V_OLmax |

(3-5)    NM_H = | V_OHmin - V_IHmin |

where,
    V_IHmin = minimum HIGH input voltage
    V_ILmax = maximum LOW input voltage
    V_OHmin = minimum HIGH output voltage
    V_OLmax = maximum LOW output voltage

These logic values are extracted from the DC transfer curve.
The two unity gain points (where the slope equals negative
one) of the DC transfer curve have been used to define the
boundaries of these regions.
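
Equations (3-4) and (3-5) translate directly into code. In the Python sketch below, the function name `noise_margins` is hypothetical and the four voltages are made-up values standing in for the unity gain points that would be read from a DC transfer curve. They are not results from this design.

```python
def noise_margins(v_il_max, v_ih_min, v_ol_max, v_oh_min):
    """Return (NM_L, NM_H) in volts per Equations (3-4) and (3-5)."""
    nm_low = abs(v_il_max - v_ol_max)
    nm_high = abs(v_oh_min - v_ih_min)
    return nm_low, nm_high


# Hypothetical unity gain points (volts), chosen only to exercise the formulas.
nm_low, nm_high = noise_margins(
    v_il_max=1.35, v_ih_min=1.55, v_ol_max=1.20, v_oh_min=1.70
)
print(f"NM_L = {nm_low:.2f} v, NM_H = {nm_high:.2f} v")
```

With a logic swing of only 0.5 volts, each margin is a large fraction of the available range, which is why the margins are tracked so closely in the bias optimization.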

f) DC Bias Optimization 

Given a set of practical current values, DC
analysis is employed to identify a set of matching gain
resistances which properly bias the inverter for logic
operations. For each pair of current-resistor values, a DC
transfer characteristic is obtained to determine the noise
margins and the maximum range of the signal swing. The
results are tabulated in Table (3-1). In the absence of a
load, each configuration met the established design
requirements: a matched input and output signal voltage
range of 0.5 volts, centered at a reference voltage of 1.45
volts, with sufficiently balanced noise margins of 0.1 volt
minimum (20% of the signal range).

However, when examined under the maximum fanout 
load (which is four), the performance of the non-buffered 
output topology suffers greatly. The maximum high logic 
voltage is reduced by an amount ranging from 0.09 volt to 
0.23 volt, depending upon the bias configuration. Not only 
does a load reduce the desired 0.5 volt signal range, but it 
also erodes the high-end noise margin. As a result, the non- 
buffered output topology can now be eliminated from further 
consideration in the design process. 

As for the buffered output topology, the noise
margins and voltage range are remarkably consistent,
regardless of the loading. The output buffer effectively
isolates the current drawn by the load from the current in
the differential pair. Thus, each of the bias
configurations for the buffered output topology will be
further tested under transient conditions to identify the
optimum inverter design. It should be noted that the DC
analysis presented here and the transient performance
analysis which follows are both conducted using ideal
current source models.



4. AC/Transient Analysis


a) Delay Measurements 

Transient performance of logic circuits is
generally quantified by measuring the delay associated with
signal propagation. The delay times utilized here are
standard performance parameters. However, for completeness,
their mathematical definitions are provided below in
Equations (3-6) and (3-7). (Weste, 1993)

(3-6)    t_fall = time for a logic signal to traverse
                  from 0.9 V_RANGE to 0.1 V_RANGE

(3-7)    t_rise = time for a logic signal to traverse
                  from 0.1 V_RANGE to 0.9 V_RANGE

where, V_RANGE = the voltage difference between the
                 steady state V_HI and V_LOW
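
The definitions in Equations (3-6) and (3-7) can be applied to sampled simulation output as sketched below in Python. The sketch assumes a piecewise-linear waveform given as lists of times and voltages and uses linear interpolation to locate the 10% and 90% crossings. The function names and the ramp waveforms are hypothetical test data, not results from this design.

```python
def crossing_time(times, volts, level):
    """Return the first time a piecewise-linear waveform crosses `level`."""
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        v0, v1 = volts[i], volts[i + 1]
        if v0 != v1 and min(v0, v1) <= level <= max(v0, v1):
            # Linear interpolation between the two neighbouring samples.
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("waveform never crosses the requested level")


def rise_time(times, volts, v_low, v_high):
    """Time to go from 0.1 V_RANGE to 0.9 V_RANGE above v_low."""
    v_range = v_high - v_low
    t10 = crossing_time(times, volts, v_low + 0.1 * v_range)
    t90 = crossing_time(times, volts, v_low + 0.9 * v_range)
    return t90 - t10


def fall_time(times, volts, v_low, v_high):
    """Time to go from 0.9 V_RANGE down to 0.1 V_RANGE above v_low."""
    v_range = v_high - v_low
    t90 = crossing_time(times, volts, v_low + 0.9 * v_range)
    t10 = crossing_time(times, volts, v_low + 0.1 * v_range)
    return t10 - t90


# Hypothetical waveforms: ideal 50 ps linear ramps between 1.2 v and 1.7 v.
t_ps = [0.0, 10.0, 60.0, 70.0]
rising = [1.2, 1.2, 1.7, 1.7]
falling = [1.7, 1.7, 1.2, 1.2]
print(f"t_rise = {rise_time(t_ps, rising, 1.2, 1.7):.1f} ps")
print(f"t_fall = {fall_time(t_ps, falling, 1.2, 1.7):.1f} ps")
```

Both functions search for the first crossing only, so a waveform containing several transitions should be split into single edges before it is measured.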

b) Performance Parameters 

At this point in the design process, two
performance parameters are of primary concern, power and
speed. Being related to each other, there is often a
trade-off between the two. Optimization of these two
parameters will determine which of the DC bias inverter
configurations will be implemented. A common method of
optimization is to quantify the parameters of power and
speed as a single figure of merit, such as a product or a
ratio. Optimization is then achieved by maximizing or
minimizing the appropriate figure of merit.

Power-delay product is one such figure of merit.
It is simply the power consumed by a logic circuit
multiplied by the propagation delay of the signal from input
to output. As expected, the design that most efficiently
balances the trade-off between speed and power consumption
will yield the lowest power-delay product in transient
testing.

The ratio of speed to power provides a similar 
figure of merit, but speed measurements are not as clearly 
defined as delay measurements. Therefore, in the interest 
of optimizing this design for speed, a definition of maximum 
switching frequency will now be established. The maximum 
reliable frequency is defined as the maximum switching 
frequency of the logic input signal for which a maximally 
loaded output signal consistently traverses 90% of the 0.5 
volt range of logic. 

c) Transient Analysis Procedures 

For an accurate evaluation of logic circuit
performance, it is necessary to provide a realistic input
signal and a worst-case output load. Here, the term load
implies driving four inverters in parallel. To achieve a
realistic test environment, the test circuit of Figure (3-6)
was designed. Specifically, note the location of gates A
and B. Their input and output signals will be measured to
analyze performance with a fanout of one and four,
respectively.



It is expected that the use of a reference voltage
at the differential input of the inverter will cause the
inverted and non-inverted output signals to respond
differently. As a result, two gate topologies are analyzed
for each of the valid DC bias configurations from Table
(3-1). The first gate topology is a single output inverter
from which the inverted output signal is measured. The
second is a complementary output inverter from which the
non-inverted output signal is measured. Conveniently, these
two configurations also represent the alternating signal
pattern which will characterize the adder circuits later in
this chapter.

Initially, the appropriate logic delays are 
measured at gate A and gate B in order to collect data for 
the cases of minimum and maximum loads, respectively. The 
worst-case delay is then multiplied by the average power per 
gate to obtain a power-delay product. This is done for both 
the inverted and the non-inverted output signals, providing 
separate power-delay product terms. Their sum forms a 
composite power-delay product. The composite power-delay 
product is a figure of merit which effectively represents 
the implementation of the two gate topologies in series. 

Finally, the switching period of the input logic 
is decremented for successive tests in order to determine 
the shortest period for which the output signal of a loaded 
gate (gate B) consistently traverses the full range of 
logic (between high and low). The reciprocal of this period 
is the maximum reliable frequency (MRF) defined in the 
previous section. For each configuration, the maximum 
reliable frequency is divided by the average power per gate 
to obtain a speed-power ratio (GHz/mW). The presence of a 
secondary load provides confirmation that consecutive loads 
can be successfully driven when the primary load is driven 
at its maximum reliable frequency. 





d) Summary of Results 

Transient analysis confirms the non-symmetric 
behavior of the inverted and non-inverted output signals. 
Therefore, Tables (3-2a) and (3-2b) provide details of their 
respective delay measurements. Specifically, the high-to- 
low transition of the non-inverted output signal represents 
the worst-case transition. 


Bias      Tprop   Tprop   Current    Power      Maximum
Current   L-H     H-L     per Gate   per Gate   Power-Delay
(mA)      (ps)    (ps)    (mA)       (mW)       Product (mW-ps)

0.1       42      255     0.81       2.03       518
0.25      56      48      0.97       2.42       136
0.5       33      26      1.28       3.20       106
0.75      23      26      1.59       3.99       104
1         17      26      1.88       4.69       122
1.5       13      27      2.38       5.94       160

Table 3-2a. Power-Delay Data for the Inverted Signal. 
Single output topology with practical current sources and a 
fanout load of four. 


Bias      Tprop   Tprop   Current    Power      Maximum
Current   L-H     H-L     per Gate   per Gate   Power-Delay
(mA)      (ps)    (ps)    (mA)       (mW)       Product (mW-ps)

0.1       212     82      1.45       3.63       770
0.25      61      88      1.64       4.10       361
0.5       27      63      2.02       5.04       318
0.75      23      46      2.31       5.78       266
1         19      41      2.63       6.56       269
1.5       18      40      3.09       7.74       309

Table 3-2b. Power-Delay Data for the Non-Inverted Signal. 
Complementary output topology with practical current 
sources and a fanout load of four. 

The overall performance of each DC bias 
configuration is summarized in Table (3-3). The power-delay 
product and speed-power ratio are normalized to simplify 
comparison. Figure (3-7) illustrates the minimization curve 
for the power-delay product, while Figure (3-8) shows the 
maximization curve for the speed-power ratio. 

Clearly, the 0.75mA configuration proves to be the 
optimum design — maximizing the speed-power ratio while 

minimizing the power-delay product. Furthermore, it 
provides for a maximum reliable frequency of 8.7 GHz. This 
is more than suitable to achieve the 5 GHz maximum clock 
frequency desired in Chapter V (for the maximally pipelined 
multiplier implementation). 
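
The figures of merit above can be reproduced from the tabulated data. The following Python sketch multiplies the worst-case delay of each topology by its power per gate, sums the two terms into a composite power-delay product, and normalizes to the 0.75mA configuration. The dictionary and function names are illustrative and not part of the thesis. The use of the complementary-output power per gate as the divisor in the speed-power ratio is an assumption; it is the choice that comes closest to the normalized values in Table (3-3).

```python
# Worst-case delay (ps) and power per gate (mW), transcribed from
# Tables 3-2a and 3-2b. Keys are bias currents in mA.
INVERTED = {       # single output topology
    0.1: (255, 2.03), 0.25: (56, 2.42), 0.5: (33, 3.20),
    0.75: (26, 3.99), 1.0: (26, 4.69), 1.5: (27, 5.94),
}
NON_INVERTED = {   # complementary output topology
    0.1: (212, 3.63), 0.25: (88, 4.10), 0.5: (63, 5.04),
    0.75: (46, 5.78), 1.0: (41, 6.56), 1.5: (40, 7.74),
}
# Maximum reliable frequency (GHz) from Table 3-3; no value is listed at 0.1 mA.
MRF_GHZ = {0.25: 5.30, 0.5: 7.10, 0.75: 8.70, 1.0: 9.09, 1.5: 11.10}


def power_delay(delay_ps, power_mw):
    """Power-delay product in mW-ps."""
    return delay_ps * power_mw


def composite_pdp(bias_ma):
    """Sum of the inverted and non-inverted power-delay products."""
    return power_delay(*INVERTED[bias_ma]) + power_delay(*NON_INVERTED[bias_ma])


def normalized(values, reference):
    """Scale every entry so that the reference entry equals 1.0."""
    ref = values[reference]
    return {key: value / ref for key, value in values.items()}


composite = {bias: composite_pdp(bias) for bias in INVERTED}
norm_pdp = normalized(composite, 0.75)

# ASSUMPTION: the speed-power ratio divides by the complementary-output power.
speed_power = {bias: f / NON_INVERTED[bias][1] for bias, f in MRF_GHZ.items()}
norm_sp = normalized(speed_power, 0.75)

for bias in sorted(norm_pdp):
    print(bias, round(norm_pdp[bias], 2), round(norm_sp.get(bias, float("nan")), 2))
```

With the transcribed data, both the minimum normalized power-delay product and the maximum normalized speed-power ratio fall at the 0.75mA configuration.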


Bias      Maximum       Normalized    Maximum     Normalized
Current   Composite     Composite     Reliable    Speed-Power
(mA)      Power-Delay   Power-Delay   Frequency   Ratio
          Product       Product       (GHz)

0.1       467           3.48          n/a         n/a
0.25      144           1.34          5.30        0.86
0.5       96            1.14          7.10        0.94
0.75      72            1.00          8.70        1.00
1         67            1.06          9.09        0.92
1.5       67            1.27          11.10       0.96

Table 3-3. Summary of Transient Analysis Results. 
Composite Power-Delay Product and Speed-Power Ratio. 








Figure 3-7. Results of Transient Analysis: 
Normalized Speed-Power Ratio of Inverter Configurations. 



Figure 3-8. Results of Transient Analysis: 
Normalized Power-Delay Product of Inverter Configurations. 










5. Final Design Summary: Inverter 

The final design for the CML inverter/buffer circuit is 
illustrated in Figure (3-9). The applicable design and 
performance parameters have been summarized in Table (3-4). 
Here, the data represents performance when the design is 
implemented with the 0.75mA practical current source from 
Chapter III-E. Also note that when complementary output 
signals are not required, the unused output buffer stage can 
be excluded to conserve power and minimize the device count. 


CML Inverter 
Design and Performance Parameters 

Rgain:    750 Ω
Rbuf:     2000 Ω
Ibias:    0.75 mA
NML:      0.13V (26% Vswing)
NMH:      0.14V (28% Vswing)
Power:    5.78 mW (complementary output)
          3.99 mW (single output)

          Inverted Signal           Non-inverted Signal
Delays    Fanout = 1   Fanout = 4   Fanout = 1   Fanout = 4
tp(H-L)   14ps         26ps         39ps         46ps
tp(L-H)   17ps         23ps         18ps         23ps
tfall     19ps         41ps         87ps         90ps
trise     48ps         61ps         45ps         60ps

Table 3-4. CML Inverter Design and Performance Parameters. 












Figure 3-9. Final Design of the CML Inverter. 


C. LOGIC NOR GATE DESIGN 

1. Overview and Analysis 

The circuit topology for a two-input CML NOR gate is 
presented in Figure (3-10) . There is little that differs 
from the inverter, which accurately suggests that the 
analysis here will be extremely similar to the previous 
section. In fact, with regard to both circuit topology and 
performance analysis, the only distinguishing feature is the 
second logic input in parallel with the first. 




Consider the functionality of the two parallel inputs A 
and B. If either of them is a logic high, then the left 
side of the differential pair is on and the NOR output is 
pulled low. Conversely, if both inputs A and B are low, 
then the NOR output is high. On the opposite side of the 
differential pair is the complementary output, the OR 
function. If another input transistor were added in 
parallel to the existing two, it would be a three-input 
OR/NOR gate, and similarly for a fourth input. 


Figure 3-10. Circuit topology for a two-input OR/NOR 
logic gate. 

Despite the drastic change in functionality, the 
presence of several logic inputs in parallel to the original 
logic input induces no fundamental change to the DC bias of 
the circuit. As a result, the DC bias conditions for the 
optimized inverter circuit are directly applied to the final 
design of the NOR circuit. 

2. Final Design Summary: OR/NOR 

With the exception of having multiple parallel 
transistors for multiple logic inputs, the final design for 
the CML OR/NOR logic circuit is identical to that of the 
inverter. As for its performance, the noise margins and 
delay measurements vary only slightly in response to the 
"multiple trigger" effect of simultaneous parallel inputs. 
The design parameters are identical to those of the inverter 
and therefore are not repeated. However, a selection of the 
performance parameters is provided in Table (3-5) in order 
to demonstrate the variation of performance based upon the 
input configuration. 

Conveniently, the NOR gate presents a nearly identical 
capacitive load to that of the inverter, with maximum delay 
differences of less than 1.5ps. It exhibits the same delay 
variations between its OR and NOR signals as the inverter 
does between the inverted and non-inverted signals. 
Finally, as with the inverter, when the complementary 
outputs of the OR/NOR gate are not both required, the unused 
output buffer stage can be excluded to conserve power and 
minimize the device count. 






CML OR/NOR Gate 
Delay Performance Parameters 


2-Input OR/NOR Gate 
Single Input Transition 

                         NOR Signal                OR Signal
                         Fanout = 1   Fanout = 4   Fanout = 1   Fanout = 4
               tp(H-L)   16ps         29ps         40ps         47ps
               tp(L-H)   24ps         29ps         19ps         23ps

3-Input OR/NOR Gate 
Single and Simultaneous Input Transitions 

                         NOR Signal                OR Signal
                         Fanout = 1   Fanout = 4   Fanout = 1   Fanout = 4
Single Input   tp(H-L)   19ps         28ps         41ps         48ps
Transition     tp(L-H)   29ps         34ps         18ps         23ps

Simultaneous   tp(H-L)   17ps         38ps         40ps         47ps
Input          tp(L-H)   43ps         48ps         11ps         16ps
Transition

4-Input OR/NOR Gate 
Single Input Transition 

                         NOR Signal                OR Signal
                         Fanout = 1   Fanout = 4   Fanout = 1   Fanout = 4
Single Input   tp(H-L)   21ps         30ps         41ps         48ps
Transition     tp(L-H)   33ps         39ps         18ps         23ps


Table 3-5. Summary of OR/NOR Gate Delay Performance. 








3. Implementation of the AND Function 

In current-mode logic, the AND function is implemented 
by simply inverting the input signals and reversing the 
polarity designation of the output nodes. In actual 
practice, inverters and OR/NOR gates are sufficient to 
realize any logic function. Thus, for the sake of 

simplicity, AND gates were not constructed as a separate 
logic circuit. Rather, all logic functions were 

deliberately expressed as functions of inverters and OR/NOR 
gates. 
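
The equivalence relied upon here can be checked exhaustively with a small behavioral model. The Python sketch below is purely logical (it ignores voltage levels and delays), and the function names are illustrative assumptions. It models an OR/NOR gate with complementary outputs, then obtains AND/NAND by complementing the inputs and exchanging the output labels, as described above.

```python
from itertools import product


def or_nor(*inputs):
    """Complementary outputs (OR, NOR) of a gate with any number of inputs."""
    if not inputs:
        raise ValueError("at least one input is required")
    or_out = any(inputs)
    return or_out, not or_out


def and_nand(*inputs):
    """AND/NAND from the same gate: complement the inputs, swap output labels."""
    or_out, nor_out = or_nor(*(not x for x in inputs))
    # By De Morgan's theorem, the NOR of the complemented inputs is the AND
    # of the original inputs, and the OR of the complemented inputs is NAND.
    return nor_out, or_out


# Exhaustive check for the two-input case.
for a, b in product([False, True], repeat=2):
    and_out, nand_out = and_nand(a, b)
    assert and_out == (a and b)
    assert nand_out == (not (a and b))
```

Because complementary signals are available at the gate outputs, the input complementing shown in `and_nand` costs no additional inverter stage in this logic family.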

D. ADDER DESIGN 

1. Implementation. 

Two-input and three-input adders are required to 
construct the carry-save adders and carry-completion adders 
of the multiplier (Chapter V). Equipped with a sufficient 
set of logic gates, this is an elementary task. The 
sum-of-minterms expressions for the sum and carry bits of a 
two-input adder are shown in Equations (3-8) and (3-9), 
respectively. 


(3-8)    $Sum|_{2input} = X'Y + XY'$

(3-9)    $Carry|_{2input} = XY$


Employing De Morgan's Theorem, these expressions can be 
manipulated into the equivalent expressions for 
implementation with OR/NOR gates, as shown in Equations 
(3-10) and (3-11). 


(3-10)   $Sum|_{2input} = (X'+Y)' + (X+Y')'$

(3-11)   $Carry|_{2input} = (X'+Y')'$


This adder design requires that the complementary logic 
inputs be provided in order to eliminate the need for 
inverters and a third level of logic delay. Such a 
requirement is trivial because complementary signals are 
potentially available at the output of each CML logic gate. 
Figure (3-11) illustrates the two-input adder. 
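
As a check on the OR/NOR form of the two-input adder, the following Python sketch evaluates the sum as an OR of two NOR terms and the carry as a single NOR term, using complementary inputs, and compares the result with ordinary binary addition. It is a purely logical model; the function names are illustrative assumptions.

```python
from itertools import product


def nor(*inputs):
    """NOR output of an OR/NOR gate."""
    return not any(inputs)


def two_input_adder(x, y):
    """Return (sum, carry) using only NOR and OR operations.

    The complements X' and Y' are assumed to be available from the
    complementary outputs of the preceding gates.
    """
    xn, yn = not x, not y
    term_a = nor(xn, y)       # (X'+Y)'  equals X AND (NOT Y)
    term_b = nor(x, yn)       # (X+Y')'  equals (NOT X) AND Y
    carry = nor(xn, yn)       # (X'+Y')' equals X AND Y
    total = term_a or term_b  # second level of logic: OR of the two terms
    return total, carry


for x, y in product([False, True], repeat=2):
    s, c = two_input_adder(x, y)
    assert int(s) + 2 * int(c) == int(x) + int(y)
```

The sum passes through two levels of logic while the carry passes through one, which is consistent with the sum bit lying on the critical path.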


Figure 3-11. Two-input adder with identification of the 
critical path. 













2. Performance Analysis 

Proper functioning of each adder was verified for all 
possible input combinations. Notice that the critical path 
for each adder is identified in Figures (3-11) and (3-12). 
For the two-input adder, the critical path flows through two 
levels of logic to produce the sum bit. The worst case 
transition is from a (1/0) or a (0/1) input for (X/Y) to a 
(1/1) input. This is owing to the fact that the worst-case 
gate delay is the high-to-low transition of the OR output 
when it has been driven by the high-to-low output transition 
of the preceding NOR gate. Based upon the data from Table 
(3-5), the critical path delay equals 63 picoseconds. This 
provides a good match with a simulation of the critical path 
delay which yields 60 picoseconds. 

Similarly, for the three-input adder the critical path 
delay is calculated to be 67 picoseconds along the path 
illustrated in Figure (3-12). This was validated with a 
simulation measurement of 66 picoseconds. 
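
The calculated critical path delays can be reproduced from Table (3-5). The thesis does not state which table entries were summed. The Python sketch below assumes a lightly loaded NOR stage (fanout of one, high-to-low) followed by a fully loaded OR stage (fanout of four, high-to-low), and further assumes that the three-input adder uses the three-input gate entries. This is one combination that matches the quoted figures, not a confirmed derivation.

```python
# High-to-low propagation delays (ps) for a single input transition,
# transcribed from Table 3-5. Keys are the number of gate inputs.
NOR_HL_FANOUT_1 = {2: 16, 3: 19}
OR_HL_FANOUT_4 = {2: 47, 3: 48}

# Simulated critical path delays (ps) quoted in the text.
SIMULATED_PS = {2: 60, 3: 66}


def critical_path_ps(n_inputs):
    """Two-level path: NOR stage (assumed fanout 1), then OR stage (fanout 4)."""
    return NOR_HL_FANOUT_1[n_inputs] + OR_HL_FANOUT_4[n_inputs]


for n in (2, 3):
    estimate = critical_path_ps(n)
    error = estimate - SIMULATED_PS[n]
    print(f"{n}-input adder: table estimate {estimate} ps, "
          f"simulation {SIMULATED_PS[n]} ps, difference {error} ps")
```

Under these assumptions the table-based estimates are 63 and 67 picoseconds, each within a few picoseconds of the corresponding simulated value.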

E. PRACTICAL CURRENT SOURCE DESIGN 

1. Circuit Topologies 

Up to this point, each logic element has been designed 
using an ideal current source. In order to validate the 
performance of these designs for actual implementation, it 
is necessary to construct a practical current source. There 
are effectively three circuit configurations which provide 
transistor bias conditions for establishing a current 
source. These three topologies are presented in Figure 
(3-13). In each configuration the amount of bias current 
drawn is regulated by and directly proportional to the 
magnitude of the current drawn by the base of Q_SOURCE. 



Figure 3-13. Current Source Topologies. 


2. Performance Analysis 

In order to analyze and compare the performance of each 
current source, three simple 0.75mA current sources are 
designed — one using each topology. Each is then 
implemented as the practical current source for the 
inverter/buffer circuit of Chapter III-B-5. Their relative 
performance is evaluated based upon the following design 
goals: 





• Minimize the operational limitations due to 
frequency response 

• Approximate the performance of an ideal current 
source 

• Minimize the cost of implementation (power and 
device count) 

The performance of each configuration is illustrated in 
Figures (3-14a) and (3-14b). Notice that each inverted 
output signal drops below the desired 1.2 volt voltage low 
level when making the transition from high-to-low. This 
"dip" results from reversing the polarity of the 
differential pair input signals, which induces a brief drop 
in the bias voltage at the positive (POS) terminal of the 
current source. A delayed return to the proper bias voltage 
is then governed by the RC characteristics of the Q_SOURCE 
collector. This delay is particularly observed in the 
transient performance of the topologies in Figures (3-13a) 
and (3-13b). 

3. Final Design: Current Source 

By process of elimination, the current mirror topology 
of Figure (3-13c) is the only design suitable for driving a 
logic device family that is capable of switching frequencies 
above 8 GHz. Unfortunately, the current mirror also incurs 
the largest cost in terms of power and device count. Thus, 
to reduce the amount of current "lost" through the left side 


[Figure: Inverted Output (Voltage) and Non-Inverted Output (Voltage)]











a) 0.75mA Current Source 


The final current source design for a 0.75mA 
current source is shown in Figure (3-15). The DC transfer 
characteristic of this source, Figure (3-16), illustrates 
that the bias current drawn is a function of the collector- 
emitter voltage (V_CE) at Q_SOURCE. More specifically, it is 
seen that V_CE must be greater than 0.3 volts in order to 
ensure that 0.75mA is drawn. This represents a critical 
design parameter for establishing a proper DC bias on the 
current source. 



Figure 3-15. Final Design of a Practical 0.75mA Current 

Source. 

The 0.75mA current source design is validated by a 
direct performance comparison with an ideal current source. 
Figure (3-17) compares the output signals for a maximally 
loaded inverter/buffer circuit when driven by both 
an ideal and a practical current source. It can be seen 
that the transition delay resulting from the practical 
source is consistently ahead of the ideal source for the 
inverted output signal by a margin of five picoseconds. 
Meanwhile, the non-inverted output signal of the practical 
current source maintains the status quo by matching the pair 
delay of the ideal source. In a design that is 
characterized by alternating stages of positive and negative 
logic signals, it is reasonable to expect that the 
implementation of the practical current source would yield a 
slight improvement over the ideal source. 

b) 2.0mA Current Source 

Exercising a little foresight into the conclusions 
of Chapter IV, it is convenient here to present the design 
of the 2mA practical current source. This design is a 
simple modification to the 0.75mA design, implemented by 
decreasing the resistance from 5250 Ω to 2020 Ω. This 
allows an increase of current flow into the base of Q_MIRROR 
and produces the transfer characteristic shown in Figure 
(3-18). Again, a bias voltage at Q_MIRROR must ensure that 
V_CE is greater than or equal to 0.3 volts in order to 
achieve proper functioning of the current source. 

The 2mA current source is also validated by 
testing it against an ideal current source while driving a 
maximally loaded D-type CML Latch. The respective output 
signals, Q and QN, are plotted in Figure (3-19). It can be 
seen that the output signal transition delay resulting from 
the practical source compares favorably with the delay 
associated with the ideal source. However, the ideal-driven 
output signals consistently cross the reference voltage of 
1.45 volts approximately 10 picoseconds ahead of the 
practical-source-driven output signals. Thus, the effective 
margin of error for approximating the practical source with 
an ideal source is 10 picoseconds. In a synchronous 
pipelined architecture, this simply adds between 10 and 20 
picoseconds to the minimum clock period. 


Figure 3-18. Transfer Characteristic of the 2.0mA 
Current Source. 








Figure 3-19. Comparison of Latch Performance, Practical 
Current Source vs. an Ideal Source. 


In summary, a sufficient set of logic circuits is now 
in hand, along with a practical current source with which to 
drive them. Thus, the combinational logic for a multiplier 
can be fully implemented. However, based upon the intent of 
pipelining this multiplier, it is necessary to construct the 
clock-driven devices that will control the flow of data. 
Chapter IV presents this discussion with the design of a D- 
type latch, a D-type flip-flop, and a clock driver. 

















IV. HBT CML LATCH AND REGISTER DESIGN 

A. LATCH DESIGN 

1. Circuit Topology 

a) Two Latch Topologies 

The most common latch design is based upon the 
logic level schematic illustrated in Figure (4-1) . Design 
of this latch simply requires the proper connection of four 
NOR gates with the appropriate clock and logic input 
signals. The cumulative power consumed by the four NOR 
gates constitutes a significant cost (based upon the four 
milliwatt per gate design from Chapter III). 



Figure 4-1. D-type Latch constructed from NOR gates. 





However, the unique characteristics of CML provide an 
alternative design that yields comparable performance at a 
significant savings in power. This CML latch design is 
illustrated in Figure (4-2). Due to the relative 

unfamiliarity of this design, a brief functional description 
follows. 




Figure 4-2. CML D-type Latch Design (After Jalali). 





b) Functional Description of a CML Latch 

Referencing Figure (4-2), the source labeled I_bias 
draws a constant current through the lower (clock-driven) 
differential pair. Complementary clock signals provide the 
differential inputs. Depending upon the phase of the clock 
signal, current is drawn from one of the two cascaded 
differential pairs, i.e. either the track pair or the latch 
pair. Consider the case when the CLK signal is high. 
Current will be drawn from the "track" pair while the 
"latch" pair is simultaneously cut off. In this case the 
latch is considered "open" or "transparent," and the track 
pair behaves like the differential pair configuration of the 
inverter/buffer logic gate. Thus, the logic inputs of the 
track pair are mirrored at the opposite collector. However, 
there is one exception. In the CML latch, complementary 
logic inputs are employed rather than a logic reference 
voltage. For a single logic input, complementary input 
signals enhance noise immunity and provide for symmetric 
waveforms at the complementary output ports. 

Now, consider when the CLK signal transitions from 
high to low. The track pair is cut off as current is 
switched to the latch pair via the right side of the clock- 
driven differential pair. Herein lies the significance of 
the common collector nodes shared by the track pair and 
latch pair. Due to the high impedance nature of the HBT 
collector-base junction, the voltage level at the collector 
is slow to change and lingers long enough to bias the latch 
pair for essentially identical operation and output levels. 
This effectively latches the logic levels from the track 
pair to the latch pair. (Jalali, 1995) 

Regardless of the state of the latch, the logic 
levels at the common collector (of the track and latch 
pairs) are reflected at the latch output ports via the same 
output buffer configuration presented in Chapter III. 
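
The track and latch behavior described above can be summarized with a purely logical model. The Python sketch below ignores all analog effects (collector capacitance, voltage levels, delays) and simply treats the latch as transparent while CLK is high and holding while CLK is low. The class and method names are illustrative assumptions.

```python
class DLatch:
    """Level-sensitive D latch: transparent when clk is high, holds when low."""

    def __init__(self, q=False):
        self.q = bool(q)

    def update(self, d, clk):
        """Apply one input condition and return the outputs (Q, QN)."""
        if clk:
            # Track pair conducts: the output follows the logic input.
            self.q = bool(d)
        # Otherwise the latch pair conducts and the stored level is kept.
        return self.q, not self.q


latch = DLatch()
print(latch.update(d=True, clk=True))    # open: output follows D
print(latch.update(d=False, clk=False))  # latched: previous level is held
print(latch.update(d=False, clk=True))   # open again: output follows D
```

The complementary output QN is returned alongside Q to mirror the complementary output ports of the CML design.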

2. Initial Conditions and Design Parameters 
The CML latch presents the most demanding DC bias 
requirements of any circuit designed for this project. As a 
result, no voltage cap has been placed upon its design. 
Rather, the initial design goal is to determine the minimum 
necessary DC bias conditions for proper operation of the 
latch. The resulting "voltage budget" will define the 
voltage relationships for proper operation of each 
transistor and differential pair. It will further establish 
important specifications for supply voltage and logic signal 
levels. Derivation of the "voltage budget" is presented as 
part of the DC analysis in the following section. 

The minimum available transistor area (1x1 micron) is 
employed for optimum switching speeds, and the fanout 
requirement remains at four. These specifications are 
consistent with the logic circuits designed in the previous 
chapter. 

3. DC Analysis 

a) DC Bias Conditions / The Voltage Budget 

For proper operation of the CML latch, each 
differential pair of transistors must be properly biased. 
Knowing the requirements imposed by proper DC bias 
conditions will reveal the following necessary design 
parameters: 

• Required minimum supply voltage 

• Required minimum voltage level for representing the 
positive (high) phase of the clock 

• Required minimum voltage level for representing a 
logic high state 

• Maximum allowable signal range between high and low 
logic levels 

To facilitate analysis, the CML latch topology is divided 
into three levels of operation, as illustrated in Figure 
(4-3). Level one (the bottom level) is a practical current 
source. Implementing the design from Chapter III-E, the 
current source requires a minimum of V_Ibias volts at node X 
in order to sustain the desired level of bias current. 

(4-1)    $V_X \geq V_{Ibias}$








Figure 4-3. Voltage Budget for the CML Latch 





This requirement imposes the following operational condition 
upon the "driving" base voltage of the Q1/Q2 differential 
pair (i.e. the high CLK voltage). 

(4-2)    $V_{CLK(hi)} \geq V_X + V_{BE(on)}|_{Q1,2}$

A further consideration is the proper biasing of 
the Q1/Q2 collectors for operation in the active region. 
This places the following operational condition upon the 
collector voltages (nodes Y1 and Y2). 

(4-3)    $V_Y \geq V_{CLK(hi)} - V_{BE(on)}|_{Q1,2} + V_{CE(sat)}$

where V_Y represents either V_Y1 or V_Y2. 
Only the tracking differential pair (connected to node Y1) 
will be addressed at this point because it is driven by 
lower voltage levels which impose more restrictive DC bias 
conditions on Y1 than Y2. 

Once again, a minimum voltage requirement at the 
common emitter of the Q3/Q4 differential pair presents a 
constraint on the minimum steady-state driving voltage at 
each base. This driving voltage corresponds to a logic 
high input voltage. Thus, the voltage level selected to 
represent a logic high must satisfy the following 
relationship. 

(4-4)    $V_{LOGIC(hi)} \geq V_{BE(on)}|_{Q3,4} + V_{Y1}$

Finally, three conditions must be satisfied at the 
collectors of the track pair. The first condition is that 
transistors Q3 and Q4 must operate in the active mode. This 
requires the following familiar relationship. 

(4-5)    $V_{C(low)} \geq V_{LOGIC(hi)} - V_{BE(on)}|_{Q3,4} + V_{CE(sat)}$

where V_C represents either V_C1 or V_C2. 

Similarly, the second condition requires that the 
transistors of the latch pair also operate in the active 
mode. This condition differs from the one above because the 
latch pair is driven by the collector voltage levels of the 
track pair. 

(4-6)    $V_{C(low)} \geq V_{C(hi)} - V_{BE(on)}|_{Q5,6} + V_{CE(sat)}$

Defining the voltage range of the logic signal (V_range) as 
the difference between high and low voltage levels, Equation 
(4-6) is manipulated to show the maximum value. 

(4-7)    $V_{range} \leq V_{BE(on)}|_{Q5,6} - V_{CE(sat)}$

Knowing the transistor parameters for V_BE(on) and V_CE(sat) 
from Chapter II, (V_range)max is 0.5 volts. 

The third condition is that the input and output 
logic levels must match. A high logic input at the 
transistor base must drive the collector voltage relatively 
low (V_C(low)) such that it produces a matched low logic 
output at QN. Likewise, the inverse must also be true. The 
following equations express these requirements. 

(4-8)    $V_{LOGIC(hi)} - V_{range} = V_{C(low)} - V_{BE(on)}|_{buffer}$

(4-9)    $V_{LOGIC(low)} + V_{range} = V_{C(hi)} - V_{BE(on)}|_{buffer}$

Based upon these relationships the maximum collector voltage 
is determined, which further dictates the minimum required 
supply voltage for proper DC operating conditions. 
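
Part of the voltage budget can be worked numerically. The Python sketch below applies Equation (4-7), the active-region condition at node Y1, and Equation (4-4), using the V_BE(on) and V_CE(sat) values listed for each bias current in Table (4-1) together with the 1.2 volt clock high level and 1.7 volt logic high level given there. The function and key names are illustrative, and the use of a single V_BE(on) value for every transistor pair in a configuration is a simplifying assumption.

```python
def latch_voltage_budget(v_be_on, v_ce_sat, v_clk_hi=1.2, v_logic_hi=1.7,
                         v_range=0.5):
    """Evaluate selected voltage-budget relationships (all values in volts)."""
    # Eq. (4-7): largest allowable logic swing.
    v_range_max = v_be_on - v_ce_sat
    # Active-region condition for the clock-driven pair, evaluated at node Y1.
    v_y1_min = v_clk_hi - v_be_on + v_ce_sat
    # Eq. (4-4): minimum logic high level needed to drive the track pair.
    v_logic_hi_min = v_y1_min + v_be_on
    return {
        "v_range_max": v_range_max,
        "range_margin": v_range_max - v_range,
        "v_y1_min": v_y1_min,
        "v_logic_hi_min": v_logic_hi_min,
        "logic_margin": v_logic_hi - v_logic_hi_min,
    }


# (V_BE(on), V_CE(sat)) per bias current in mA, transcribed from Table 4-1.
TABLE_4_1 = {
    1.0: (0.775, 0.26),
    1.5: (0.80, 0.30),
    2.0: (0.82, 0.31),
    3.0: (0.857, 0.35),
}

budgets = {bias: latch_voltage_budget(*params)
           for bias, params in TABLE_4_1.items()}

for bias, result in budgets.items():
    print(bias, {name: round(value, 3) for name, value in result.items()})
```

For the 1mA configuration this gives a maximum range of 0.515 volts, a node Y1 level of 0.685 volts, and a minimum logic high level of 1.46 volts.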

The voltage budget relationships are summarized in 
Figure (4-3) . Actual values have been determined for four 
latch configurations as listed in Table (4-1). The 
essential difference is the magnitude of the bias current. 
An economical margin of safety has been built into these 
values. 

Notice that these margins have been allowed to 
vary slightly between configurations in order to maintain 
uniform values for clock and logic signal values. This 
greatly simplifies the comparative testing of the four 
configurations. The design margins are highlighted to 
illustrate the negligible deviation incurred. All four 
configurations meet and exceed the required DC bias 
conditions. In the event that uniform design margins had 
been used such that the supply voltages were optimized, the 
difference would have been trivial — within plus or minus 
0.1 volt or 4% of the 2.5 volt supply voltage. 

b) DC Bias Optimization 

At this point the gain resistance, buffer 
resistance, and the bias current are the only undetermined 
parameters. The same procedures described in the design of 
the inverter/buffer circuit are employed to design four 
different latch configurations based upon the specifications 
determined in Table (4-1). 


CML Latch Voltage Budget 
for Multiple Bias Current Configurations 

                                1mA      1.5mA    2mA      3mA

Known/Measured Parameters:
V_BE(on)                        0.775    0.80     0.82     0.857
V_CE(sat)                       0.26     0.30     0.31     0.35
V_I-bias                        0.3      0.3      0.3      0.3

Determined Parameters:
[V_RANGE]max                    0.515    0.5      0.51     0.507
Margin for Range of             0.015    0.0      0.01     0.007
  Logic Signal Voltage
[V_RANGE]actual                 0.5      0.5      0.5      0.5
Vcc                             2.5      2.5      2.5      2.5
Margin to nearest [...]         0.075    0.025    0.025    0.0
V_C(hi)                         2.425    2.475    2.475    2.5
[V_LOGIC(hi)]actual             1.7      1.7      1.7      1.7
Margin for Differential         0.24     0.2      0.19     0.15
  Logic Signal Switching
[V_LOGIC(hi)]min                1.46     1.5      1.51     1.55
V_Y1                            0.685    0.7      0.69     0.693
V_CLK(hi)                       1.2      1.2      1.2      1.2
V_X                             0.42     0.4      0.39     0.358
Margin for Differential         0.12     0.1      0.09     0.058
  Clock Signal Switching
V_I-bias                        0.3      0.3      0.3      0.3

Based upon a 0.5 volt signal swing for both logic and clock signals:
V_LOGIC(low)                    1.2      1.2      1.2      1.2
V_CLK(low)                      0.7      0.7      0.7      0.7


Table 4-1. Voltage Budget for the CML D-type Latch. 

Noise Margins are obtained from the DC transfer 
characteristic of each. These results are included in Table 
(4-2). With maximum fanout loads on both output ports, all 
four CML latch designs meet the requirements of a 0.5 volt 
output signal range and 0.1 volt (20%) balanced noise 
margins. Therefore, all four CML designs are considered in 
transient analysis. 


Bias      Gain       Buffer     No Load / Loaded   No Load / Loaded   Logic
Current   Resistor   Resistor   High Noise         Low Noise          Signal
(mA)      (Ohms)     (Ohms)     Margin (Volts)     Margin (Volts)     Range (Volts)

1         600        2000       0.14 / 0.13        0.13 / 0.13        0.49
1.5       410        2000       0.13 / 0.13        0.13 / 0.13        0.51
2         310        2000       0.12 / 0.12        0.12 / 0.12        0.51
3         210        2000       0.11 / 0.11        0.11 / 0.11        0.52


Table 4-2. Results of DC Analysis. 


4. AC/Transient Analysis 

a) Performance Parameters 

Three parameters are of primary interest in 
evaluating the transient performance of a latch: setup 
time, hold time, and logic propagation delay. Figure (4-4) 
illustrates how each of these relates to the events on a 
transient plot. In the absence of a reference voltage, 
differential signal references are taken as the point where 
the complementary signals cross. 


Figure 4-4. Illustration of setup time, hold time, and 
propagation delay. 

As a figure of merit for optimizing the trade-off 
between speed and power, a power-delay product is calculated 
using the values defined here. The figure for power 

represents the average power, and the figure for delay 
represents the sum of the setup time and the worst-case 
propagation delay time. 
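
As a worked instance of this figure of merit, the Python sketch below evaluates the latch power-delay product from the final-design values reported later in Table (4-4) for a fanout of four (9.0 mW, 35 ps setup time, 34 ps and 3 ps propagation delays). The function name is an illustrative assumption.

```python
def latch_power_delay(power_mw, setup_ps, tprop_hl_ps, tprop_lh_ps):
    """Average power times (setup time + worst-case propagation delay), in mW-ps."""
    worst_case_prop_ps = max(tprop_hl_ps, tprop_lh_ps)
    total_delay_ps = setup_ps + worst_case_prop_ps
    return power_mw * total_delay_ps


# 2 mA latch, fanout of four, values transcribed from Table 4-4.
pdp = latch_power_delay(power_mw=9.0, setup_ps=35, tprop_hl_ps=34, tprop_lh_ps=3)
print(pdp)  # total delay of 69 ps times 9.0 mW
```

The total delay of 69 picoseconds used here agrees with the Total Delay column of Table (4-4) for the same loading.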

b) Analysis Procedures 

For an accurate evaluation of latch performance, 
it is necessary to provide realistic logic and clock input 
signals as well as realistic worst-case fanout loads. 







Furthermore, to ensure and demonstrate the proper DC bias 
design of the CML latch, practical current sources are 
implemented in testing. 

In addition to the four CML latch designs, the 
traditional logic latch is also tested. Each design is 
substituted into the test circuit to determine the 
performance parameters described in the previous section. 

c) Summary of Results 

The results of transient analysis are summarized 
in Table (4-3). The 1.5mA configuration achieves the 
minimum power-delay product as illustrated in Figure (4-5). 
Note, however, that the 2mA configuration performs at a 





Figure 4-5. Results of Transient Analysis: 
Normalized Power-Delay Product of Latch Configurations. 










































indicates a capacitive spike at the mutual collector nodes 
of the latch and track differential pairs. This results 
each time the clock-driven pair switches current to the 
opposite side. It is not expected that this noise will 
adversely affect the ability of the CML latch to drive 
reliable logic levels. However, in the event that the CML 
latch is overcome by noise, the NOR latch configuration is a 
viable alternative because it does not experience this 
problem. 

Finally, the switching activity of the 
differential pair also induces variations in the current 
drawn from the supply voltage. Figure (4-7) illustrates 
these power rail transients for a single CML latch. The 
abrupt, periodic reduction in supply current coincides with 
the brief transition of current from one side of the 
differential pair to the other, driven by the switching of 
the clock signal. In the worst case, this downward 
transient spike reaches a current level that is 18% below 
the average. It is also evident that slightly more current 
is drawn when the latch is latched because the latch pair is 
driven by a higher input voltage than the track pair. This 
results in a higher voltage and thus more current being 
drawn at the practical current source. 


Figure 4-7. Power Rail transients due to the switching 
activity of a single CML Latch. 

5. Special Latch Implementations 

In the course of this design project, two special 
implementations of the CML latch have been designed. The 
first implements a logic reference voltage at one of the 
logic inputs of the latch. The purpose here is to eliminate 
the requirement for complementary logic signals at the 
multiplier input. 

The second special implementation also uses a reference 
voltage; however, it does so with the purpose of conducting 
a logic function at the input to the latch. Although this 
circuit functions well, it actually results in slightly 
greater delays due to the increased collector capacitance at 
the tracking pair. As a result, it is not utilized in the 
multiplier circuit. 





6. Final Design Summary: D-Latch 

The final design for the CML latch is implemented with 
the parameters listed in Table (4-4) using the topology 
presented previously in Figure (4-2). Also listed are the 
transient performance parameters for operation at each level 
of fanout loading. These figures represent the performance 
of the latch when it is implemented with a practical current 
source and driven by a maximally loaded clock driver. 


Latch Design and Performance Summary

Rgain: 310 Ω        NM_L:  0.12 V
Rbuf:  2000 Ω       NM_H:  0.12 V
Ibias: 2 mA         Power: 9.0 mW

Fanout      Setup   Hold    tprop   tprop   Max Total
Load        Time    Time    H-L     L-H     Delay
(# gates)   (ps)    (ps)    (ps)    (ps)    (ps)

1           33      9       27      0       60
2           33      10      28      1       61
3           34      10      31      2       65
4           35      10      34      3       69

Table 4-4. Final Design Summary of the D-type CML Latch.








B. FLIP-FLOP DESIGN (D-TYPE) 

1. Overview and Analysis 

The D-type flip-flop is constructed from two D-type CML
latches. The two latches are connected in a master-slave
configuration such that they are latched by opposite phases
of the clock. This simple design is illustrated in Figure
(4-7).

[Block diagram: two D-LATCH blocks in series, each with D
and DN inputs, Q and QN outputs, and OPEN and LATCH clock
inputs; the two latches receive CLOCK and INVERTED CLOCK in
opposite phase.]

Figure 4-7. D-type Flip-Flop.

The flip-flop design is tested under the same
conditions of loading and input signals as discussed
previously for the latch. This testing verifies proper
function of the flip-flop design and confirms that the
flip-flop performance parameters of setup time and hold time
mirror those of the CML latch. However, due to the presence
of a second latch in the flip-flop, the propagation delays
are greater.







2. Final Design Summary 

The final design for the CML D-type flip-flop is
essentially the master-slave configuration of two CML
latches, as illustrated in Figure (4-7). The design
parameters of the master and slave latches remain the same
as shown in Table (4-4). The applicable performance
parameters of the flip-flop have been summarized in Table
(4-5).


Flip-Flop Design and Performance Summary

Reference Latch Design Parameters
Power: 18 mW

Fanout      Setup   Hold    tprop   tprop   Max Total
Load        Time    Time    H-L     L-H     Delay
(# gates)   (ps)    (ps)    (ps)    (ps)    (ps)

1           33      9       49      35      82
2           33      9       53      47      86
3           34      9       52      45      86
4           35      10      54      43      89

Table 4-5. Design and Performance Summary of the
D-type Flip-Flop.






C. CLOCK DRIVER DESIGN 

1. Overview 

The topology of the clock driver closely resembles that 
of the inverter/buffer circuit. In fact, the only necessary 
modification to the inverter/buffer design is a reduction of 
the output voltage range at the output buffer. This is 
accomplished by a simple voltage divider that effectively 
steps the voltage down to the desired voltage range between 
0.7 and 1.2 volts (Figure 4-8). This voltage range is 
dictated by the CML latch design. 

Two performance parameters are of particular interest
in the clock driver design: fanout capability and the
symmetry of complementary output signals. Increased fanout
is desirable to reduce the number of clock drivers required.
Meanwhile, output symmetry is important to reduce clock skew
between parallel clock paths. The absence of symmetry
between the complementary output signals of the logic
circuits (in Chapter III) results from the corresponding
lack of symmetry between the input signals, i.e., the use of
a reference voltage. Therefore, the clock driver is driven
by the differential clock signals CLK and CLK-N.

2. Analysis and Results 

Fanout capability is maximized by the increase of 
current through the output buffer. Two further 
modifications to the inverter/buffer circuit make this 
possible. The first is to increase the bias current. For a 
supply voltage of 2.5 volts, a practical current source of
2 mA is the largest that is operable without adversely
biasing the circuit. Second, reducing the total resistance
in the output buffer draws a larger base current, and
ultimately more current is available to the output load.

For evaluation, the performance of two clock
driver configurations is measured based upon the power
consumed per load driven. The 1 mA clock driver draws 5.5 mA
and consumes 13.8 mW while driving a maximum of two latches.
Meanwhile, the 2 mA clock driver draws 6.5 mA and consumes
16.3 mW while driving four latches. Clearly, the 2 mA clock
driver is the desired implementation.
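The comparison above can be made explicit by dividing each driver's power by the number of latches it drives. The following is a minimal sketch using only the figures quoted in the text; the dictionary layout and the function name `power_per_latch` are illustrative choices, not part of the original design flow.

```python
# Power and fanout figures quoted in the text for the two clock drivers.
drivers = {
    "1 mA bias": {"power_mw": 13.8, "max_latches": 2},
    "2 mA bias": {"power_mw": 16.3, "max_latches": 4},
}

def power_per_latch(power_mw, max_latches):
    """Power consumed per latch driven, in milliwatts."""
    return power_mw / max_latches

for name, d in drivers.items():
    print(f"{name}: {power_per_latch(**d):.2f} mW per latch")
```

On this metric the 2 mA driver spends noticeably less power per latch, which is the basis for preferring it.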

The synchronous switching behavior of the clock driver
coupled with its high current consumption warrants an
investigation of its power rail transient characteristic
(Figure 4-9). It is not surprising that it follows the same
periodic trend as discussed in the case of the CML latch.
In the worst case, the downward transient current spike
deviates by 14.6% from the average current level. Also of
interest is the noise induced on the clocking signal by
strong, simultaneous logic transitions at the latch input.
As a result, a clock driver must be capable of driving a
maximum fanout load of latches when every latch input
transitions simultaneously in the same direction.

[Plot: vertical axis ticks from 5.7 to 6.9; horizontal axis
Time (ns) from 0.0 to 1.5; annotations: Latched, Tracking
Input Signal, OPENING, LATCHING.]

Figure 4-9. Power Rail transients induced by the
switching activity of a single Clock Driver.

3. Final Design Summary: Clock Driver 

The final design for the clock driver is implemented 
with the parameters listed in Table (4-6) using the topology 
presented previously in Figure (4-8). 


Clock Driver Design and Performance Summary

Rgain:   400 Ω
R1buf:   110 Ω
R2buf:   450 Ω
Ibias:   2 mA
NM_L:    0.08 V
NM_H:    0.10 V
Power:   16.3 mW
Fanout:  4 Latches

Table 4-6. Design and Performance Summary of the
Clock Driver Circuit.


At this point, the set of building blocks is complete.
The logic circuits of Chapter III and the clock-driven
devices of Chapter IV are brought together in Chapter V to
implement several pipelined multiplier configurations.




V. HBT CML PIPELINED MULTIPLIER DESIGN


A. LOGIC STAGE DESIGN 
1. Overview 

As introduced in Chapter II-C, the multiplier logic for
this project is implemented with the three functional
processes illustrated in Figure (5-1): partial product
generation, carry-save addition, and carry completion
addition. In the case of the 8x8 bit multiplier which is
implemented in this chapter, the process of carry-save
addition is actually accomplished with successive stages of
carry-save adders. More specifically, the use of
three-to-two carry-save adders produces the logic
implementation illustrated in Figure (5-2). The detailed
process of carry-save addition is addressed in the following
section; however, this block diagram accurately represents
the functional design of the multiplier and establishes a
graphic reference for the follow-on discussion.

[Block diagram: inputs Multiplier and Multiplicand; output
Product.]

Figure 5-1. Generalized Block Diagram of an 8x8 bit
Multiplier.

2. Carry-Save Adders 

Each three-to-two carry-save adder takes three operands 
and produces two outputs, a sum and a carry. However, the 
carry-save adder implementations are not identical, due to a 
slightly different input configuration that exists for the 
first carry-save adder stage than for the follow-on stages. 
Referencing Figure (5-3), the first carry-save adder 
receives three non-aligned n-bit partial products. As a 
result, it generates n+2 sum bits and n carry bits. 
Meanwhile, the follow-on stages each receive an aligned 
input pair comprised of the carry and sum terms generated by 
the preceding stage. The third input is the next partial 
product term, and it is shifted by one bit. Thus, the sum 
is only n+1 bits and the carry is still n bits. 
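As a bit-level illustration of the three-to-two reduction described above, the following sketch models a carry-save adder on whole integers and chains six of them ahead of a final addition. It is a behavioral model only: it does not reproduce the n-bit alignment or the local partial-product gating of the circuit, and the names `carry_save_add` and `multiply_8x8` are hypothetical.

```python
def carry_save_add(a, b, c):
    """Three-to-two carry-save adder on non-negative integers.

    Returns (sum_word, carry_word) with a + b + c == sum_word + carry_word.
    """
    sum_word = a ^ b ^ c                               # per-bit sum, carries ignored
    carry_word = ((a & b) | (a & c) | (b & c)) << 1    # per-bit majority, shifted left
    return sum_word, carry_word


def multiply_8x8(multiplicand, multiplier):
    """Multiply two 8-bit values using six carry-save stages and one final add."""
    partial = [(multiplicand << i) if (multiplier >> i) & 1 else 0
               for i in range(8)]
    # First stage: three partial products in, sum and carry out.
    s, c = carry_save_add(partial[0], partial[1], partial[2])
    # Five follow-on stages: previous sum and carry plus the next partial product.
    for p in partial[3:]:
        s, c = carry_save_add(s, c, p)
    # Carry-completion addition resolves the remaining sum and carry terms.
    return s + c
```

No carry propagates across bit positions inside a carry-save stage, which is why each stage has a short, uniform delay; only the final addition ripples.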

In the case of either carry-save adder, only the most
significant n bits of the sum term are passed on to the next
adder stage. The remaining least significant bit(s)
represent the next most significant bit(s) of the final
product and are passed directly to the multiplier output.
These bits are highlighted with a circle in Figure (5-3).
The final designs of the two carry-save-adder configurations
are provided in Figures (5-4) and (5-5). Note the presence
of more than simple adder circuits. A fanout limitation of
four prevents a single signal from driving the eight input
requirements for the current multiplier bit at each
carry-save-adder stage. Thus, the arriving multiplier bits
pass through an inverting buffer stage.

[Block diagram: Partial Product #8 through Partial Product
#1 feed the carry-save adder stages; outputs P[15:7], P[6],
P[5], P[4], P[3], P[2], P[1], P[0].]

Figure 5-2. Logic Implementation of an 8x8 bit Multiplier
using six stages of Carry-Save-Adders and a Carry-Completion
Adder.

[Panel title: Carry-Save Adder #1]

Figure 5-3. Functional Illustration of the two Carry-
Save-Adder Implementations.

Input Signals
  Multiplier-Bit   Current Multiplier Bit for this stage
  BNin[7:0]        8-bit Multiplicand (Negated)

Output Signals
  P[0]             Next Product Bit
  C[7:0]           8-bit Carry Term
  S[8:1]           8-bit Sum Term

Figure 5-5. Logic Schematic of Carry-Save-Adder #2

Furthermore, the OR/NOR gates are used to generate the
partial product terms within each carry-save-adder stage,
rather than at the multiplier input. Taking advantage of
the complementary output signals available from the
preceding register, the NOR gates perform a logical AND of
each multiplicand bit with the appropriate multiplier bit.
Local generation of the partial product terms avoids the
extensive requirement for intermediate registers that would
be necessary to pass all partial product terms from one
pipeline stage to the next (that is, in a scenario where all
partial products are generated before the first carry-save
adder).
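The use of NOR gates to form an AND follows from De Morgan's law: the NOR of two complemented signals equals the AND of the true signals. A minimal check is sketched below; the function names are illustrative and single bits are modeled as the integers 0 and 1.

```python
def nor(x, y):
    """Two-input NOR on single bits (0 or 1)."""
    return 1 - (x | y)

def partial_product_bit(multiplicand_bit_n, multiplier_bit_n):
    """AND of the true signals, computed as the NOR of their complements."""
    return nor(multiplicand_bit_n, multiplier_bit_n)

# Exhaustive check over the four input combinations.
for a in (0, 1):
    for b in (0, 1):
        assert partial_product_bit(1 - a, 1 - b) == (a & b)
```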

3. Carry-Completion Adders 

The carry-completion adder implements ripple-carry 
addition. This elementary design is preferred over carry- 
look-ahead addition because it facilitates a variety of 
simple pipeline implementations. Figure (5-6) illustrates 
the full carry-completion adder which can be conveniently 
segmented into as many as eight pipeline stages by 
separating the successive two and three-input adders. 
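To illustrate why ripple-carry addition segments so easily, the sketch below adds two words bit by bit and then repeats the addition in slices, passing only the carry from one slice to the next. In a pipelined version each slice boundary would hold a register. The function names and the slice widths used in the test are illustrative assumptions; the model ignores the mix of two-input and three-input adders in the actual circuit.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return s, carry_out

def ripple_carry_add(x, y, width=8, carry_in=0):
    """Add two width-bit words one bit at a time; returns (result, carry_out)."""
    result, carry = 0, carry_in
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry

def segmented_add(x, y, segment_widths):
    """Same addition performed in slices; only the carry crosses a boundary."""
    result, carry, offset = 0, 0, 0
    for w in segment_widths:
        mask = (1 << w) - 1
        part, carry = ripple_carry_add((x >> offset) & mask,
                                       (y >> offset) & mask, w, carry)
        result |= part << offset
        offset += w
    return result, carry
```

Because the only signal crossing a slice boundary is a single carry bit (plus the operand bits not yet consumed), the adder can be cut at any bit position without changing the logic.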





Figure 5-6. An 8-bit Ripple-Carry Adder to perform 

Carry-Completion. 





















B. REGISTER STAGE DESIGN 

Regardless of the number of pipeline stages, each 
multiplier implementation requires two eight-bit input 
registers and a sixteen-bit output register. For pipeline 
implementations with more than one stage, intermediate 
registers are also required. The size of these registers 
varies depending upon where the register is inserted in the 
flow of logic. All intermediate and output registers 
require complementary input signals. However, the input 
registers are distinctly designed to accept a single logic 
input signal for each bit, vice requiring complementary 
logic input signals. In order to accomplish this, the D- 
type flip-flops utilized in the input register must employ a 
special latch implementation which does not require 
differential input signals for the master latch of the 
master-slave flip-flop pair. The details of this latch 
implementation are presented in Chapter IV-A-5. 

C. CLOCK DISTRIBUTION 

The purpose of the clock distribution scheme is to
provide a local clock signal for clock-driven devices,
namely the latches that comprise the registers described in
the previous section. However, each clock driver can only
sustain a maximum load of four latches, i.e., two
flip-flops. Therefore, due to the number of clock-driven
devices and the limited fanout capability of the clock
drivers, the clock signal must propagate through an
extensive, multi-level distribution tree. As the number of
clock-driven devices increases, the number of levels in this
distribution tree must eventually increase as well. Thus,
the more heavily pipelined multiplier implementations must
make a larger investment of devices and power in clock
distribution.
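The growth of the distribution tree can be sketched with a simple counting model. Two assumptions are made here that the text does not state: each register receives its own leaf drivers, and a clock driver can also drive up to four other clock drivers. The register widths in the usage lines are the ones given in the text and block diagrams for the one-stage and ten-stage multipliers; the function name `clock_tree_levels` is hypothetical.

```python
import math

LATCHES_PER_FLIP_FLOP = 2   # master-slave pair
DRIVER_FANOUT = 4           # stated limit: four latches per clock driver

def clock_tree_levels(register_widths, fanout=DRIVER_FANOUT):
    """Return the number of clock drivers at each tree level, leaves first.

    Assumptions (not stated in the text): each register gets its own leaf
    drivers, and a driver can also drive up to `fanout` other drivers.
    """
    leaves = sum(math.ceil(bits * LATCHES_PER_FLIP_FLOP / fanout)
                 for bits in register_widths)
    levels = [leaves]
    while levels[-1] > 1:
        levels.append(math.ceil(levels[-1] / fanout))
    return levels

# One-stage: two 8-bit input registers and a 16-bit output register.
one_stage = clock_tree_levels([8, 8, 16])
# Ten-stage: input, nine intermediate, and output registers.
ten_stage = clock_tree_levels([16, 31, 31, 31, 31, 31, 23, 22, 20, 18, 16])
```

Under these assumptions the one-stage design needs 21 drivers in three levels and the ten-stage design needs 186 drivers in five levels. If each driver is further assumed to contain six transistors, those totals agree with the clock transistor counts of 126 and 1116 reported in Tables 5-1 and 5-5, which lends the model some support without confirming it.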

D. MULTIPLIER IMPLEMENTATIONS 

Five pipelined multiplier implementations have been 
designed for testing via Tanner SPICE simulation tools. 
These implementations include a one-stage pipeline, a two- 
stage pipeline, a four-stage pipeline, a six-stage pipeline, 
and a ten stage pipeline. The arithmetic logic is identical 
for each; however, the increased number of registers present 
in the more heavily pipelined implementations also implies a 
more extensive clock distribution tree. A block diagram of 
each implementation is presented in the following section. 

E. PERFORMANCE EVALUATION 

1. Evaluation Procedures 

Prior to evaluation of the individual multiplier
implementations, the multiplier logic is successfully tested
with several operands in order to verify that it produces an
accurate product. Following this verification, it is the
goal of this performance evaluation to identify the maximum
operating clock frequency for each pipeline implementation.
However, this can only be done once the critical path, i.e.,
the critical pipeline stage, is determined for each
multiplier.

a) Critical Path Identification 

The most direct and absolute means of identifying 
the critical path is to conduct full-length simulations of 
each multiplier for every possible combination and sequence 
of two 8-bit input operands. Conducting these nearly 4.3 
billion simulations on each of the five multiplier designs 
is obviously prohibitive. Thus, the opposite extreme 
suggests that the worst-case transition delay be assumed for 
every logic circuit in every stage of the pipeline. While 
this successfully identifies an upper bound on the delay 
associated with the critical path, it is likely that the 
upper bound case does not exist as a result of two input 
operands. Furthermore, without knowledge of the input 
operands, simulations cannot be conducted for verification.
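One reading of the figure of nearly 4.3 billion, offered here as an interpretation rather than a statement from the source: a transition is defined by a starting operand pair followed by an ending operand pair, and each pair of 8-bit operands has 2^16 possible values.

```latex
\left(2^{8}\cdot 2^{8}\right)^{2} = 2^{32} = 4{,}294{,}967{,}296 \approx 4.3\times 10^{9}
```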

Unfortunately, the logic behavior of the
carry-save-adders makes an intuitive approach extremely
difficult. Thus, a computer program designed by Kirk
Shawhan, a research associate, has been utilized to identify
the worst-case input combinations (Shawhan, 2000). The
program effectively identifies a unique upper bound delay
for each set of input operands. Those input combinations
with the worst-case upper-bound delays are then simulated to
identify a single worst-case pair of operands and the
critical stage where the most-delayed transition occurs.
While it is not proven that this approach will identify the
absolute critical path, it provides a reasonable and timely
estimate for the purposes of this research.

b) Maximum Throughput/Clocking Frequency

Having determined the critical path, it is simply 
a matter of simulation time to identify the maximum clock 
frequency. For each pipeline implementation, a simulation 
is conducted which brackets the breakpoint of the 
multiplier. Furthermore, examination of the margin by which 
the setup time is met or missed provides a determination of 
the minimum clock period that is accurate within five 
picoseconds. 
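The bracketing idea can be sketched as a plain bisection over the clock period. This is a swapped-in simplification: the procedure described above also uses the setup-time margin to refine the estimate, which is not modeled here. The callable `passes` is a hypothetical stand-in for one transient simulation, and the 176 ps breakpoint in the usage line is a made-up value for illustration.

```python
def bracket_min_period(passes, fail_ps, pass_ps, resolution_ps=5.0):
    """Narrow a [fail, pass] clock-period bracket to within `resolution_ps`.

    `passes(period_ps)` is a hypothetical stand-in for one transient
    simulation; it returns True when the critical-path transition succeeds.
    """
    assert not passes(fail_ps) and passes(pass_ps)
    while pass_ps - fail_ps > resolution_ps:
        mid = (fail_ps + pass_ps) / 2
        if passes(mid):
            pass_ps = mid      # circuit still works: tighten from above
        else:
            fail_ps = mid      # circuit fails: tighten from below
    return fail_ps, pass_ps

# Toy model: pretend the true breakpoint is 176 ps.
lo, hi = bracket_min_period(lambda t: t >= 176.0, fail_ps=100.0, pass_ps=300.0)
```

Each call to `passes` corresponds to a full simulation run, so the number of iterations matters; halving the bracket keeps that number small.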

The increased number of devices in the more 
heavily pipelined designs made full-circuit simulation times 
extremely long. As a result, the breakpoints for the four- 
stage, the six-stage, and the ten-stage multipliers were 
determined from partial simulations. Only the critical 
stage and those stages immediately before and after it were 
simulated. 

2. Performance Results of Each Implementation 

The following ten pages provide a two-page design and
performance summary for each of the five pipelined
multiplier implementations. Figure (5-7) illustrates the
design and critical path of the one-stage multiplier on a
block diagram. Table (5-1) provides a summary of data which
quantifies circuit complexity, power consumption, data
throughput rate and data latency of the one-stage pipelined
multiplier. Finally, Figure (5-8) illustrates the success
and failure of P14, the critical path, at clock frequencies
below and above the breakpoint of the circuit.

Similarly, Figures (5-9) through (5-16) and Tables (5-2)
through (5-5) provide the same performance results for
the two, four, six, and ten-stage pipelined multipliers,
respectively. A comparative analysis is conducted as a
performance summary in the following section.

As a final note, all full multiplier simulations are 
conducted using ideal current sources. This decision saves 
numerous simulation hours without sacrificing valid 
transient performance data. A close correspondence has been 
demonstrated between the transient performance of the 
practical and ideal current sources for both the logic and 
the latch designs. Use of the ideal source, however, does 
produce overly optimistic power-consumption data due to the 
absence of power dissipation from the transistors in the 
practical current source. Therefore, the simulation data 
for current consumption is scaled to accurately represent 
the power consumed in practical implementation. 




[Block diagram annotations: A = 1111 0111, B = 1100 0111.
Critical Path Initiates with the two operands A=F7h, B=C7h.
Critical Path Terminates with the LOW-to-HIGH transition of
P14. P = 1100 0000 0000 0001.]

Figure 5-7. One-stage pipelined multiplier
implementation with an illustration of the
critical path.


             Number of     Number of    Current      Power
             Transistors   Resistors    (Amperes)    (Watts)

Logic        3952          2352         1.28         3.20
Registers    384           320          0.31         0.77
Clock        126           105          0.19         0.48
TOTAL        4462          2777         1.78         4.44

Maximum Throughput: 1.33 GHz
Latency: 0.75 Nano-second

Table 5-1. Performance summary for the one-stage
pipelined multiplier.

[Plot: vertical axis Voltage (V).]

Figure 5-8. Performance bracket of the minimum period for
the one-stage pipeline multiplier.
















[Block diagram annotations: A = 1111 0111, B = 1100 0111.
Critical Path Initiates with the two operands A=F7h, B=C7h.
Blocks, in order:
  16-Bit Input Register
  Carry Save Adder #1 (1)
  Carry Save Adder #2 (2)
  Carry Save Adder #2 (3)
  Carry Save Adder #2 (4)
  Carry Save Adder #2 (5)
  Carry Save Adder #2 (6)
  23-Bit Intermediate Register
Critical Path Terminates with the LOW-to-HIGH transition of
P14.]

Figure 5-9. Two-stage pipelined multiplier
implementation with an illustration of the
critical path.


             Number of     Number of    Current      Power
             Transistors   Resistors    (Amperes)    (Watts)

Logic        3952          2352         1.28         3.20
Registers    660           550          0.52         1.31
Clock        228           190          0.36         0.90
TOTAL        4840          3092         2.17         5.41

Maximum Throughput: 2.0 GHz
Latency: 1.0 Nano-second

Table 5-2. Performance summary for the two-stage
pipelined multiplier.




Figure 5-10. Performance bracket of the minimum period for 
the two-stage pipeline multiplier. 














[Block diagram annotations: A = 1111 1111, B = 1000 0001.
Critical Path Initiates with the two operands A=FFh, B=81h.
Blocks, in order:
  16-Bit Input Register
  Carry Save Adder #1 (1)
  Carry Save Adder #2 (2)
  Carry Save Adder #2 (3)
  31-Bit Intermediate Register
  Carry Save Adder #2 (4)
  Carry Save Adder #2 (5)
  Carry Save Adder #2 (6)
  23-Bit Intermediate Register
  Carry Completion Adder (4 Bits)
  20-Bit Intermediate Register
  Carry Completion Adder
  16-Bit Output Register
Critical Path Terminates with the LOW-to-HIGH transition of
P15. P = 1000 0000 0111 1111.]

Figure 5-11. Four-stage pipelined multiplier
implementation with an illustration of the
critical path.


             Number of     Number of    Current      Power
             Transistors   Resistors    (Amperes)    (Watts)

Logic        3952          2352         1.28         3.20
Registers    1272          1060         1.01         2.52
Clock        438           365          0.68         1.71
TOTAL        5662          3777         2.97         7.43

Maximum Throughput: 3.45 GHz
Latency: 1.16 Nano-seconds

Table 5-3. Performance summary for the four-stage
pipelined multiplier.




Figure 5-12. Performance bracket of the minimum period 
for the four-stage pipeline multiplier. 












[Block diagram annotations: A = 1111 1001, B = 0010 0001.
Critical Path Initiates with the two operands A=F9h, B=21h.
The Critical Path is Limited by the LOW-to-HIGH transition
of the Carry Bit out of Stage 5.
Blocks, in order:
  16-Bit Input Register
  Carry Save Adder #1 (1)
  Carry Save Adder #2 (2)
  31-Bit Intermediate Register
  Carry Save Adder #2 (3)
  Carry Save Adder #2 (4)
  31-Bit Intermediate Register
  Carry Save Adder #2 (5)
  Carry Save Adder #2 (6)
  23-Bit Intermediate Register
  Carry Completion Adder (3 Bits)
  21-Bit Intermediate Register
  Carry Completion Adder (3 Bits)
  18-Bit Intermediate Register
  Carry Completion Adder (2 Bits)
  16-Bit Output Register
P = 0010 0000 0001 1001.]

Figure 5-13. Six-stage pipelined multiplier
implementation with an illustration of the
critical path.


             Number of     Number of    Current      Power
             Transistors   Resistors    (Amperes)    (Watts)

Logic        3952          2352         1.28         3.20
Registers    1872          1560         1.49         3.72
Clock        648           540          1.03         2.57
TOTAL        6472          4452         3.80         9.49

Maximum Throughput: 4.35 GHz
Latency: 1.38 Nano-seconds

Table 5-4. Performance summary for the six-stage
pipelined multiplier.








[Block diagram annotations: A = 1111 1001, B = 0010 0001.
Critical Path Initiates with the two operands A=F9h, B=21h.
The Critical Path is Limited by the LOW-to-HIGH transition
of the Carry Bit out of Stage 9.
Blocks, in order:
  16-Bit Input Register
  Carry Save Adder #1 (1)
  31-Bit Intermediate Register
  Carry Save Adder #2 (2)
  31-Bit Intermediate Register
  Carry Save Adder #2 (3)
  31-Bit Intermediate Register
  Carry Save Adder #2 (4)
  31-Bit Intermediate Register
  Carry Save Adder #2 (5)
  31-Bit Intermediate Register
  Carry Save Adder #2 (6)
  23-Bit Intermediate Register
  Carry Completion Adder (2 Bits)
  22-Bit Intermediate Register
  Carry Completion Adder (2 Bits)
  20-Bit Intermediate Register
  Carry Completion Adder (2 Bits)
  18-Bit Intermediate Register
  Carry Completion Adder (2 Bits)
  16-Bit Output Register
P = 0010 0000 0001 1001.]

Figure 5-15. Ten-stage pipelined multiplier
implementation with an illustration of the
critical path.


             Number of     Number of    Current      Power
             Transistors   Resistors    (Amperes)    (Watts)

Logic        3912          2320         1.28         3.20
Registers    3240          2700         2.57         6.44
Clock        1116          930          1.74         4.36
TOTAL        8268          5950         5.60         13.99

Maximum Throughput: 5.56 GHz
Latency: 1.80 Nano-seconds

Table 5-5. Performance summary for the ten-stage
pipelined multiplier.

[Plot: vertical axis Voltage (V).]


T = 180ps, Critical Path Transition SUCCEEDS T = 170ps, Critical Path Transition FAILS 




Figure 5-16. Performance bracket of the minimum period 
for the ten-stage pipeline multiplier. 











3. Comparative Analysis 

A summary of the performance results for each of the 
five pipelined multiplier implementations is presented in 
Table (5-6). A comparative analysis of these results 
quantifies and confirms the major trade-offs of pipelining 
as they were addressed in Chapter II-B. Figure (5-17) 
illustrates the increase in data throughput as compared to 
the increase in product latency. However, latency is 
generally an acceptable trade-off relative to the primary 
cost drivers of device count and power consumption. 



                      1        2        4        6        10
                      STAGE    STAGE    STAGE    STAGE    STAGE

Device Count          7239     7932     9439     10924    14218
Power (Watts)         4.44     5.41     7.43     9.49     13.99
Latency (ns)          0.75     1.00     1.20     1.38     1.80
Maximum Throughput
(GHz)                 1.33     2.00     3.33     4.35     5.56
Speed-Power Ratio
(GHz/Watt)            0.300    0.370    0.449    0.458    0.397
Normalized
Speed-Power Ratio     0.66     0.81     0.98     1.00     0.87

Table 5-6. Comparative Summary of Performance.


[Plot: vertical axis from 0.00 to 6.00; horizontal axis
Number of Pipeline Stages (1, 2, 4, 6, 10).]

Figure 5-17. Throughput and Latency as a function of the
number of pipeline stages.

Device count and power consumption are quantified in
Figures (5-18) and (5-19), respectively. As the number of
pipeline stages increases, the cost rises sharply, driven
by the need for intermediate registers and an extensive
clock distribution network. In the one-stage pipeline, the
registers and clock tree represent only 13% of the total
device count and consume 28% of the total power. On the
other end of the spectrum, registers and clock distribution
in the ten-stage pipeline represent 56% of the total device
count and consume 77% of the total power.
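These percentages follow directly from the per-category figures in Tables (5-1) and (5-5), taking the device count as transistors plus resistors. A short check is sketched below; the helper name `overhead_share` is illustrative.

```python
def overhead_share(logic, registers, clock):
    """Fraction of a total that is attributable to registers and clock."""
    return (registers + clock) / (logic + registers + clock)

# Device count = transistors + resistors, from Tables 5-1 and 5-5.
one_stage_devices = overhead_share(3952 + 2352, 384 + 320, 126 + 105)
ten_stage_devices = overhead_share(3912 + 2320, 3240 + 2700, 1116 + 930)

# Power in watts, from the same tables.
one_stage_power = overhead_share(3.20, 0.77, 0.48)
ten_stage_power = overhead_share(3.20, 6.44, 4.36)

for label, value in [("one-stage devices", one_stage_devices),
                     ("one-stage power", one_stage_power),
                     ("ten-stage devices", ten_stage_devices),
                     ("ten-stage power", ten_stage_power)]:
    print(f"{label}: {100 * value:.0f}%")
```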



[Stacked bar chart: vertical axis Number of Devices, from
2000 to 16000; horizontal axis Number of Pipeline Stages;
legend: CLOCK, REGISTER, LOGIC.]

Figure 5-18. Distribution of the Device Count.

[Stacked bar chart: vertical axis Watts, from 4.00 to 14.00;
horizontal axis Number of Pipeline Stages; legend: CLOCK,
REGISTER, LOGIC.]

Figure 5-19. Distribution of Power Consumption.








Somewhere between these two extremes there exists an
optimum pipelined implementation. Dividing the maximum
throughput of each configuration by the total power that it
consumes, a figure of merit is calculated which is referred
to here as a speed-power ratio (for consistency with
optimization procedures in previous chapters). Figure
(5-20) plots the speed-power ratio as a function of the
number of pipeline stages. The maximum point on the curve
indicates that the optimal pipelined multiplier
implementation employs five or six stages.
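The figure of merit can be recomputed from the throughput and power rows of Table (5-6). The sketch below does so and normalizes to the best value. Only the five simulated stage counts are available, so it can identify the best sampled design but cannot by itself distinguish five stages from six. Variable names are illustrative.

```python
# Rows taken from Table 5-6.
stages = [1, 2, 4, 6, 10]
throughput_ghz = [1.33, 2.00, 3.33, 4.35, 5.56]
power_w = [4.44, 5.41, 7.43, 9.49, 13.99]

# Speed-power ratio in GHz per watt for each implementation.
ratio = [f / p for f, p in zip(throughput_ghz, power_w)]

# Normalize to the best ratio and find which sampled design achieves it.
best = max(ratio)
normalized = [r / best for r in ratio]
best_stage = stages[ratio.index(best)]

for n, r, z in zip(stages, ratio, normalized):
    print(f"{n:2d} stages: {r:.3f} GHz/W  (normalized {z:.2f})")
```

Recomputed values may differ from the tabulated ones in the last digit because the table entries are themselves rounded.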



Figure 5-20. Comparison of Speed-Power Ratio. 


121 

















Thus, having concluded an evaluation of the various 
pipelined multiplier implementations, it remains to consider 
the impact that clock skew has upon these high-speed 
circuits. Chapter VI undertakes this discussion in the 
pages that follow. 





VI. ANALYSIS OF CLOCK SKEW


A. QUANTIFYING CLOCK SKEW 

Clock skew appears naturally in practical circuits due 
to a variety of physical factors as described in Chapter 
II-A. However, in a typical SPICE simulation, transmission 
delays are not inherent to the process and circuit elements 
are evaluated under ideal, homogeneous operating conditions. 
The effective result is the near elimination of clock skew 
from the simulation environment. 

Clock skew could be introduced artificially; however,
introducing a known amount of clock skew would have very
predictable results, such that they can be determined
without simulation. Thus, based upon the results of Chapter
V, a simple numerical analysis is conducted in this chapter
which provides an illustration of how clock skew impacts
pipelined architectures and serves as a set of reference
data against which follow-on research into alternative
control techniques can measure performance.

B. ANALYSIS PROCEDURES 

Based upon the definition of skew from Chapter II-A,
let $S_{DEVICE}$ represent the maximum delay between two
clock signals after propagation through a single level of
clock drivers. As illustrated in Figure (6-1), the effect
of $S_{DEVICE}$ on the clock signal as it propagates through
the clock distribution tree is that the clock signal
potentially accumulates $S_{DEVICE}$ picoseconds of skew at
each level. Furthermore, any loading differences at the
final level of the clock distribution will introduce another
skew term, $S_{LOAD}$. Thus, the simplified expression to be
used for analyzing and calculating skew is given in Equation
(6-1).

$$S_{TOTAL} = n \times S_{DEVICE} + S_{LOAD} \qquad (6\text{-}1)$$

where n = maximum number of levels in the
clock distribution scheme


Figure 6-1. Illustration of Clock Skew as it results from 
propagation path delays and loading. 













An expression for n is derived in Equation (6-2), based upon 
the pipeline implementations from Chapter V. 



For synchronous logic, the timing inequality from
Chapter II-A is repeated as Equation (6-3). This
relationship requires that the minimum clock period be
expanded to account for the increase in skew.

$$T \geq T_{min} = T_{skew} + T_{logic} + T_{Flip\text{-}Flop} \qquad (6\text{-}3)$$

The procedure for analysis of clock skew is simply to
apply a range of values for $S_{DEVICE}$ to the clock
distribution schemes from Chapter V, using Equation (6-2).
Based upon simulation results, the worst-case value for
$S_{LOAD}$ is determined to be 6.5 picoseconds. Thus, it is
possible to calculate a worst-case skew value for each
incremental value of $S_{DEVICE}$ as it applies to the clock
distribution scheme of each multiplier implementation.
Applying the worst-case skew values to Equation (6-3), a new
minimum period is determined for each multiplier
implementation. This is repeated for values of $S_{DEVICE}$
ranging from two to twenty picoseconds. A comparative
analysis of the results should confirm the expectation of an
increasingly negative impact on the more heavily pipelined
architectures.
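The procedure can be sketched numerically as follows. The skew expression is Equation (6-1) and the 6.5 ps load skew is the value stated above; the new minimum period is taken as the skew-free minimum period plus the total skew, which is how the timing inequality is read here. The number of clock-tree levels used in the example (five) is an assumption, since the expression for n is not reproduced in the surviving text, and the function names are hypothetical.

```python
def total_skew_ps(levels, s_device_ps, s_load_ps=6.5):
    """Equation (6-1): skew accumulated over `levels` driver levels plus load skew."""
    return levels * s_device_ps + s_load_ps

def throughput_with_skew_ghz(t_min_ps, levels, s_device_ps, s_load_ps=6.5):
    """Maximum clock rate after the minimum period is stretched by the total skew."""
    period_ps = t_min_ps + total_skew_ps(levels, s_device_ps, s_load_ps)
    return 1000.0 / period_ps   # a period in picoseconds gives a rate in GHz

# Ten-stage multiplier: 180 ps minimum period without skew (Figure 5-16).
# Five clock-tree levels is an assumed value for illustration only.
example = throughput_with_skew_ghz(t_min_ps=180.0, levels=5, s_device_ps=4.5)
print(f"{example:.2f} GHz")
```

Because the skew term is added to every design's period in absolute picoseconds, it consumes a larger fraction of a short period than of a long one, which is why the heavily pipelined designs are expected to lose the most.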


Finally, within the stated range of $S_{DEVICE}$ values,
a reasonable figure for $S_{DEVICE}$ is determined as it
might actually occur due to device non-idealities in the
fabrication process. The approximation of device-induced
skew ($S_{DEVICE}$) is defined as 20% of the worst-case
propagation delay for the clock driver circuit and is
determined to be 4.5 picoseconds. This set of data is
referenced in the figures that follow as "typical skew".

C. RESULTS 

Figure (6-2) provides a plot of the results. The
values for skew which are referenced in the figures
represent the values for $S_{DEVICE}$. The data clearly
confirms that the multipliers with throughput rates which
are obtained as a function of higher clock rates will
experience the most drastic performance reductions in the
presence of clock skew. Furthermore, when weighed against
the cost of power consumption, a set of new speed-power
ratio curves is obtained, as shown in Figure (6-3). Thus,
the contemporary appeal of synchronous pipelined
architectures demonstrates a severe backlash at high clock
rates.


[Plot: vertical axis Throughput (GHz), from 0.00 to 6.00;
horizontal axis Number of Pipeline Stages, from 0 to 12;
curves: No Skew, 2ps Skew, 5ps Skew, 10ps Skew, 20ps Skew,
Typical Skew.]

Figure 6-2. Effect of Skew on Pipeline Throughput Rates.


[Plot: vertical axis Speed-Power Ratio (GHz/W), from 0.00 to
0.50; horizontal axis Number of Pipeline Stages, from 0 to
12; curves: No Skew, Skew=2ps, Skew=5ps, Skew=10ps,
Skew=20ps, Typical Skew.]

Figure 6-3. Effect of Skew on Pipeline Efficiency.





VII. CONCLUSIONS


The fundamentals of circuit analysis and the principles 
of junction transistor behavior have been applied to design 
an optimal family of current-mode logic devices from InP HBT 
SPICE transistor models. From these building blocks of 
digital logic, an array multiplier has been constructed and 
pipelined into five distinct implementations. Each 
multiplier implementation has been simulated extensively via 
Tanner SPICE in order to identify the respective performance 
characteristics of power consumption and maximum operating 
frequency. 

A comparative analysis of multiplier performance has 
effectively demonstrated the trade-offs of pipelining with 
predictable yet interesting results. The cost of increasing 
throughput by increasing the number of pipeline stages has 
been quantified in terms of device count and power 
consumption. By maximizing data throughput at the most 
efficient cost in terms of power, the optimal 8x8 bit 
synchronous pipelined multiplier design has been determined 
to be the six-stage implementation, as shown on page 121. 
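
The selection criterion described above, the highest throughput per watt, can be sketched as follows. The throughput and power figures are placeholders chosen only to exercise the code; they are not the simulated values for the five multiplier implementations, and the function names are hypothetical.

```python
# Sketch of choosing a design by speed-power ratio (GHz/W).
# All numbers below are placeholders, NOT results from this thesis.

def speed_power_ratio(throughput_ghz: float, power_w: float) -> float:
    """Return throughput per unit power in GHz/W."""
    return throughput_ghz / power_w


def most_efficient(designs: dict) -> int:
    """Return the stage count whose (throughput, power) pair has the best ratio.

    `designs` maps a pipeline stage count to (throughput_ghz, power_w).
    """
    return max(designs, key=lambda stages: speed_power_ratio(*designs[stages]))


if __name__ == "__main__":
    placeholder_designs = {
        1: (1.0, 4.0),   # hypothetical: 0.25 GHz/W
        2: (2.0, 5.0),   # hypothetical: 0.40 GHz/W
        4: (3.0, 10.0),  # hypothetical: 0.30 GHz/W
    }
    print(most_efficient(placeholder_designs))
```

With these placeholder inputs the deepest pipeline has the highest throughput but not the best ratio, which is the kind of trade-off the comparison is meant to expose.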

Finally, in the presence of clock skew, it has been 
demonstrated that the efficiency of synchronous pipelined 
architectures operating at high clock rates is significantly 
reduced. Thus, as device switching frequencies continue to 
pave the way to faster logic circuits, the rate of data 
throughput will be left behind unless the synchronous logic 
design constraint of clock skew can be overcome. The impact 
of clock skew has been quantified and summarized such that 
it provides a reference point for further research into 
alternative clocking/control techniques. 

Specifically, it is intended that future research use 
the CML HBT logic family designed in this thesis to 
implement the same array multiplier circuit using 
asynchronous control techniques. One such effort is 
already in progress: LtCol. Kirk Shawhan, USMC, is 
investigating local completion signals that employ 
request/acknowledge handshaking to control the flow of 
data in place of a global clock signal (Shawhan, 2000). 
Perhaps in time such asynchronous schemes will mature into 
a design methodology that overcomes the obstacle of clock 
skew, which now threatens to limit synchronous design. 





LIST OF REFERENCES 


Foley, J. B.; Bannister, J. A. R., "Analysing ECL's Noise 
Margins," IEEE Circuits and Devices, Volume 10, 1994, pp. 
32-37. 

Harris, D., "Timing Analysis Including Clock Skew," IEEE 
Transactions on Computer-Aided Design of Integrated 
Circuits and Systems, Volume 18, 1999, pp. 1608-1618. 

Jalali, B.; Pearton, S. J., InP HBTs: Growth, Processing, 
and Applications, Artech House, Inc., Massachusetts, 1995. 

Loomis, Herschel H. Jr., "Class Notes," EC 4830 Spring 
Quarter at the Naval Postgraduate School, 2000. 

Moore, G., "Moore's Law Extended: The Return of 
Cleverness," Solid State Technology, Volume 40, 1997, pp. 
359-364. 

Pierret, Robert F., Semiconductor Device Fundamentals, 
Addison-Wesley Publishing Company, Massachusetts, 1996. 

Pollard, L. Howard, Computer Design and Architecture, 
Prentice-Hall, New Jersey, 1990. 

Richards, R. K., Electronic Digital Components and Circuits, 
D. Van Nostrand Company, New Jersey, 1967. 

Sedra, Adel S.; Smith, Kenneth C., Microelectronic Circuits, 
Oxford University Press, New York, 1998. 

Shawhan, Kirk A., Design and Analysis of an Asynchronous 
Pipelined Multiplier with Comparison to Synchronous 
Implementation, Master's Thesis, Naval Postgraduate School, 
Monterey, CA, Dec 2000. 

Sutherland, Ivan E., "Micropipelines," Communications of the 
ACM, Volume 32, 1989, pp. 720-738. 

Wakerly, John F., Digital Design Principles and Practices, 
Prentice Hall, New Jersey, 2000. 

Weste, Niel H. E.; Eshraghian, Kamran, Principles of CMOS 
VLSI Design: A Systems Perspective, Addison Wesley Longman, 
Inc., 1993. 





INITIAL DISTRIBUTION LIST 


1. Defense Technical Information Center ..... 2 
   8725 John J. Kingman Road, Ste 0944 
   Fort Belvoir, VA 22060-6218 

2. Dudley Knox Library ..... 2 
   Naval Postgraduate School 
   411 Dyer Road 
   Monterey, California 93943-5101 

3. Director, Training and Education ..... 1 
   MCCDC, Code C46 
   1019 Elliot Rd. 
   Quantico, Virginia 22134-5027 

4. Director, Marine Corps Research Center ..... 2 
   MCCDC, Code C40RC 
   2040 Broadway Street 
   Quantico, Virginia 22134-5107 

5. Marine Corps Tactical System Support Activity ..... 1 
   Technical Advisory Branch 
   Attn: Librarian 
   Box 555171 
   Camp Pendleton, CA 92055-5080 

6. Marine Corps Representative ..... 1 
   Naval Postgraduate School 
   Code 037, Bldg. 330, Ingersoll Hall, Room 116 
   555 Dyer Road 
   Monterey, CA 93943 

7. Engineering and Technology Curricular Office, Code 34 ..... 1 
   Naval Postgraduate School 
   Monterey, California 93943-5109 

8. Chairman, Code EC ..... 1 
   Department of Electrical and Computer Engineering 
   Naval Postgraduate School 
   Monterey, California 93943-5121 

9. Professor Douglas Fouts, Code EC/FS ..... 1 
   Department of Electrical and Computer Engineering 
   Naval Postgraduate School 
   Monterey, California 93943-5121 

10. Professor Herschel Loomis, Code EC/LM ..... 1 
    Department of Electrical and Computer Engineering 
    Naval Postgraduate School 
    Monterey, California 93943-5121 

11. LtCol. Kirk Shawhan (USMC) ..... 1 
    P.O. Box 749 
    Quantico, VA 22134-0749 

12. Maj. John R. Calvert, Jr. (USMC) ..... 4 
    1422 Woodway Drive 
    Ooltewah, TN 37363 