NAVAL POSTGRADUATE SCHOOL
Monterey, California
THESIS
DESIGN OF A SYNCHRONOUS PIPELINED MULTIPLIER
AND ANALYSIS OF CLOCK SKEW IN HIGH-SPEED
DIGITAL SYSTEMS
by
John R. Calvert, Jr.
December 2000
Thesis Advisor: Douglas J. Fouts
Thesis Co-Advisor: Herschel H. Loomis, Jr.
Approved for public release; distribution is unlimited.
REPORT DOCUMENTATION PAGE
Form Approved
OMB No. 0704-0188
Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction,
searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send
comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to
Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA
22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington DC 20503.
1. AGENCY USE ONLY (Leave blank) 2. REPORT DATE 3. REPORT TYPE AND DATES COVERED
December 2000 Master’s Thesis
4. TITLE AND SUBTITLE :
Design of a Synchronous Pipelined Multiplier and Analysis of Clock Skew in
High-Speed Digital Systems
5. FUNDING NUMBERS
6. AUTHOR(S)
Calvert, John R. Jr.
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)
Naval Postgraduate School
Monterey, CA 93943-5000
8. PERFORMING
ORGANIZATION REPORT
NUMBER
9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES)
10. SPONSORING
/MONITORING
AGENCY REPORT NUMBER
11. SUPPLEMENTARY NOTES
The views expressed in this thesis are those of the author and do not reflect the official policy or position of the
Department of Defense or the U.S. Government.
12a. DISTRIBUTION/AVAILABILITY STATEMENT
Approved for public release; distribution is unlimited.
12b. DISTRIBUTION CODE
13. ABSTRACT (maximum 200 words)
Digital systems implemented with high-speed transistor technologies face a variety of design challenges in an
effort to keep pace with the accelerating demand for performance. As device switching frequencies climb comfortably
into the gigahertz range, clock skew in digital systems threatens to limit the advantages of synchronous pipelined
designs. This research investigates the limitations of clock skew on high-speed digital systems by designing and
simulating an 8x8 bit synchronous, pipelined multiplier using indium phosphide (InP) heterojunction bipolar
transistor (HBT) technology. Fundamentals of circuit analysis and the principles of junction transistor behavior are
applied to design an optimal family of logic devices using current-mode logic. All testing and simulation data is based
upon results obtained from Tanner SPICE design tools. Using the building blocks of this logic family, an array
multiplier is constructed and further configured into five distinct pipeline implementations. By employing a different
number of pipeline stages in each implementation, the trade-offs of pipelining are illustrated and clock skew is
analyzed at a variety of throughput rates. Finally, the impact of clock skew on throughput performance is quantified
and summarized as a reference point for further research into asynchronous control techniques.
14. SUBJECT TERMS
Clock Skew, Pipelined Logic, Current-Mode Logic, Indium-phosphide Heterojunction
Bipolar Transistors, High-Speed Logic
15. NUMBER
OF PAGES
152
16. PRICE
CODE
17. SECURITY
CLASSIFICATION OF REPORT
Unclassified
18. SECURITY CLASSIFICATION
OF THIS PAGE
Unclassified
19. SECURITY
CLASSIFICATION OF
ABSTRACT
Unclassified
20.
LIMITATION
OF ABSTRACT
UL
NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. 239-18
Approved for public release; distribution is unlimited.
DESIGN OF A SYNCHRONOUS PIPELINED MULTIPLIER AND ANALYSIS
OF CLOCK SKEW IN HIGH-SPEED DIGITAL SYSTEMS
John R. Calvert, Jr.
Major, United States Marine Corps
B.S., United States Naval Academy, 1990
Submitted in partial fulfillment of the
requirements for the degree of
MASTER OF SCIENCE IN ELECTRICAL ENGINEERING
from the
NAVAL POSTGRADUATE SCHOOL
December 2000
Author:
Approved by:
Department of Electrical and Computer Engineering
ABSTRACT
Digital systems implemented with high-speed transistor
technologies face a variety of design challenges in an
effort to keep pace with the accelerating demand for
performance. As device switching frequencies climb
comfortably into the gigahertz range, clock skew in digital
systems threatens to limit the advantages of synchronous
pipelined designs. This research investigates the
limitations of clock skew on high-speed digital systems by
designing and simulating an 8x8 bit synchronous, pipelined
multiplier using indium phosphide (InP) heterojunction
bipolar transistor (HBT) technology. Fundamentals
of circuit analysis and the principles of junction
transistor behavior are applied to design an optimal family
of logic devices using current-mode logic. All testing and
simulation data is based upon results obtained from Tanner
SPICE design tools. Using the building blocks of this logic
family, an array multiplier is constructed and further
configured into five distinct pipeline implementations. By
employing a different number of pipeline stages in each
implementation, the trade-offs of pipelining are illustrated
and clock skew is analyzed at a variety of throughput rates.
Finally, the impact of clock skew on throughput performance
is quantified and summarized as a reference point for
further research into asynchronous control techniques.
TABLE OF CONTENTS
I. INTRODUCTION
  A. THE RELEVANCE OF HIGH-SPEED LOGIC
  B. THE PROBLEM OF CLOCK SKEW
  C. THE DESIGN OF A TEST CIRCUIT
  D. THESIS OUTLINE
II. BACKGROUND
  A. CLOCK SKEW
  B. PRINCIPLES OF PIPELINING
  C. LOGIC DESIGN OF A COMBINATIONAL MULTIPLIER
  D. BJT/HBT LOGIC
    1. BJT/HBT Principles and Characteristics
    2. BJT/HBT Logic Families
III. HBT CML LOGIC CIRCUIT DESIGN
  A. DESIGN OVERVIEW
  B. INVERTER DESIGN
    1. Circuit Topology
    2. Initial Conditions and Design Parameters
    3. DC Analysis
    4. AC/Transient Analysis
    5. Final Design Summary: Inverter
  C. LOGIC NOR GATE DESIGN
    1. Overview and Analysis
    2. Final Design Summary: OR/NOR
    3. Implementation of the AND Function
  D. ADDER DESIGN
    1. Implementation
    2. Performance Analysis
  E. PRACTICAL CURRENT SOURCE DESIGN
    1. Circuit Topologies
    2. Performance Analysis
IV. HBT CML LATCH AND REGISTER DESIGN
  A. LATCH DESIGN
    1. Circuit Topology
    2. Initial Conditions and Design Parameters
    3. DC Analysis
    4. AC/Transient Analysis
    5. Special Latch Implementations
    6. Final Design Summary: D-Latch
  B. FLIP-FLOP DESIGN (D-TYPE)
    1. Overview and Analysis
    2. Final Design Summary
  C. CLOCK DRIVER DESIGN
    1. Overview
    2. Analysis and Results
    3. Final Design Summary: Clock Driver
V. HBT CML PIPELINED MULTIPLIER DESIGN
  A. LOGIC STAGE DESIGN
    1. Overview
    2. Carry-Save Adders
    3. Carry-Completion Adders
  B. REGISTER STAGE DESIGN
  C. CLOCK DISTRIBUTION
  D. MULTIPLIER IMPLEMENTATIONS
  E. PERFORMANCE EVALUATION
    1. Evaluation Procedures
    2. Performance Results of Each Implementation
    3. Comparative Analysis
VI. ANALYSIS OF CLOCK SKEW
  A. QUANTIFYING CLOCK SKEW
  B. ANALYSIS PROCEDURES
  C. RESULTS
VII. CONCLUSIONS
LIST OF REFERENCES
INITIAL DISTRIBUTION LIST
EXECUTIVE SUMMARY
The electronic subsystems of future overhead collection
platforms will require extremely high performance digital
logic for performing such tasks as data
compression/decompression, data encryption, spread spectrum
modulation, etc. To accomplish this, bit rates must reach
into the gigabits per second range. Such speed obviously
requires digital logic which will function correctly at
clock rates of tens of gigahertz. The need for such high
performance has led to the implementation of logic systems
using indium phosphide (InP) heterojunction bipolar
transistor (HBT) technology. However, clock frequency and
pipeline throughput in digital systems implemented with InP
HBT technology are significantly limited by clock, control
signal, and data skew, which is a much larger percentage of
the clock period than it is in lower-speed digital systems
implemented with complementary metal oxide semiconductor
(CMOS) technology. Therefore, the presence of clock skew in
high-speed digital systems defines a limitation for the
advantages of synchronous pipelined architectures.
It is the purpose of this thesis to design a
synchronous 8x8 bit pipelined multiplier as a high-speed
digital test circuit using InP HBT technology and
furthermore, to quantify the impact of clock skew on
throughput. This work represents the initial phase of a
larger research project to determine if asynchronous
pipeline control will yield greater overall pipeline
throughput in high-performance InP HBT digital integrated
circuits and if the resulting elimination of the clock
distribution tree will reduce power consumption, device
count and layout area. All simulation data is based upon
the results obtained from Tanner SPICE design tools.
Having received InP HBT device specifications from
Hughes Research Laboratories, this project commenced with
the design of an HBT logic family utilizing current-mode
logic. Each circuit was designed and optimized for a
minimum power-delay product while driving a maximal fanout
load of four logic gates. This design effort produced the
four essential circuit functions necessary for the practical
implementation of any synchronous logic circuit: an
inverter/buffer gate, an OR/NOR gate, a D-type latch, and a
practical current source.
Using the building blocks of this logic family, an
array multiplier was constructed and further configured into
five distinct pipeline implementations. These included a
one, two, four, six, and ten-stage pipeline, respectively.
A comparative analysis of their performance effectively
illustrated the trade-offs of pipelining, i.e., the cost of
the additional registers was shown to outpace the increase
in throughput beyond a six-stage implementation. At a
maximum throughput of 4.35 gigahertz, the six-stage
pipelined multiplier was the most efficient design (in the
absence of clock skew). The highest throughput achieved was
5.56 gigahertz by the costly ten-stage implementation.
Power consumption ranged from 4.4 to 14 watts.
In the final analysis, clock skew was not simulated
because SPICE simulations effectively eliminate skew from
their calculations. Rather, the impact of clock skew was
determined by applying numerical analysis to the no-skew
simulation results. A range of possible skew values was
considered in order to demonstrate a performance trend. The
results confirmed that digital system throughput rates which
are obtained as a function of higher clock rates will
experience the most drastic performance reductions in the
presence of clock skew. Also, it was shown for a typical
value of skew in this circuit that the efficiency curve
shifts to indicate that the four-stage pipeline is the most
efficient implementation, vice the six-stage pipeline.
The design products and test results from this thesis
provide a reference point for further research into
alternative clocking/control techniques. Specifically, it
is intended that future research use the CML HBT logic
family designed in this thesis in order to implement the
same array multiplier circuit using asynchronous control
techniques. One such endeavor is already in progress as
LtCol. Kirk Shawhan, USMC, investigates the use of local
completion signals which employ request/acknowledge
handshake signals to control the flow of data vice the use
of a global clock signal.
ACKNOWLEDGMENT
My most immediate expression of gratitude is to the One
that made me ... to my Creator, my Lord, and my Savior —
Jesus Christ. I am convinced that the meticulous detail and
extensive effort required to design even the simplest of
electronic devices bears unmistakable testimony to the
Divine and Intelligent design of our universe.
But remember the Lord your God, for it is he who gives you
the ability ... Deuteronomy 8:18
...whatever you do, do it all for the glory of God.
I Corinthians 10:31
This thesis is dedicated to my wife, Laura. Though I
have no expectation that she will ever read beyond this
page, certainly it would never have been written without her
encouragement, support, and patience. While I have
considered myself fortunate to be learning about the
mysteries of microelectronics, she has been making the most
thrilling discoveries and profound impact each day as the
mother of our beloved daughter, Victoria, who is 22 months
old today. I love them both dearly.
I wish to thank my advisors, Professor Douglas Fouts
and Professor Herschel Loomis for sharing their time, their
knowledge, and their passion for this field of study. Their
guidance was essential and their enthusiasm was contagious.
I am also thankful to have shared this academic journey
with such a quality core of fellow students and service
members. Specifically, it is with great admiration that I
thank LtCol. Kirk Shawhan, USMC — my perpetual study
partner and project mentor. He has patiently endured my
questions and generously shared his ingenious grasp of the
most challenging concepts. My graduate school experience
would have been a much different story without his
assistance and more importantly, his friendship.
I. INTRODUCTION
A. THE RELEVANCE OF HIGH-SPEED LOGIC
The demand for increased processing speeds in digital
electronics has driven the clock period of logic circuits
from a scale of microseconds to one of picoseconds over the
past twenty years. This remarkable trend is the synergistic
result of technological advancements and innovations in
device physics, very-large-scale integrated (VLSI) circuit
fabrication, and digital systems architecture. Moore's Law
accurately predicted this trend of improvement 35 years ago,
and current expectations are that the trend will continue
(Moore, 1997). Consider the anticipation of such
technologies as real-time multimedia satellite
communications and broadband networks. These applications
will require extremely high performance digital logic that
can function reliably at clock rates of tens of gigahertz.
B. THE PROBLEM OF CLOCK SKEW
There are a variety of technological hurdles to clear
before achieving such clock speeds, and it is the purpose of
this thesis to explore one particular hurdle in the course
of digital systems architecture: the problem of clock skew
in high-speed logic. Clock skew is the difference between
arrival times of the clock signal at different synchronous
clocked devices (Harris, 1999). As clock frequencies reach
into the multi-gigahertz range, clock skew is an increasing
concern for high-speed circuit designers because it accounts
for an increasing portion of the clock period — leaving
less of the clock period to be budgeted for logic and
latching delays. What was once a near negligible quantity
has now become a significant design constraint (Wakerly,
2000).
C. THE DESIGN OF A TEST CIRCUIT
This thesis presents the design of a high-speed logic
test circuit and the simulation of its performance in order
to identify and quantify the effects of clock skew. It
should be noted that these results are intended to serve as
a reference for future research involving potential
solutions for the reduction of clock skew. The following
paragraphs develop the necessary specifications of the test
circuit.
To ensure valid results, it is important that the
problem be simulated in an accurate context. Therefore, it
is necessary to select a logic family based upon a
transistor model that is capable of realizing multi-gigahertz
clock speeds. Although complementary metal-oxide-semiconductor
(CMOS) technologies dominate VLSI
applications, for comparable fabrication technologies, a
bipolar circuit is approximately 2.5 times faster than a
functionally similar CMOS circuit (Foley, 1994). Typically,
such high-speed bipolar circuits employ emitter coupled
logic (ECL) or current mode logic (CML). Notably, these
logic families consume significantly more power than field
effect transistor (FET) logic families; however, the trade-off
is accepted here for the purpose of achieving sufficient
clock speeds. For these reasons, current mode logic is
employed to design a family of logic gates based upon the
transistor specifications for an indium phosphide (InP)
heterojunction bipolar transistor (HBT), courtesy of Hughes
Research Laboratories.
Additionally, it is important that the architecture and
functionality of the test circuit provide a relevant context
for evaluation. It should be noted here that the shorter
clock periods discussed above are not exclusively the result
of faster gate delays (i.e. faster transistors) but are also
the result of pipelined architectures which require fewer
gate delays per clock cycle. In keeping with this
characteristic of high-speed logic circuits, the test
circuit implements a pipelined architecture. As for circuit
functionality, an 8x8 bit multiplier was chosen to provide
sufficient complexity for pipeline implementation.
D. THESIS OUTLINE
The purpose of this thesis is to design, simulate, and
evaluate the performance of a high-speed (InP HBT) 8x8-bit
pipelined multiplier in the presence of clock skew. The
discussion begins with the review and development of several
fundamental topics in Chapter II: clock skew, pipelining
principles, logic-level design of a multiplier, and
transistor-level design of BJT/HBT logic. Based upon that
foundation, Chapters III through V present the hierarchical
design of the pipelined multiplier from the bottom up.
Respectively, these chapters address logic circuit design,
clock-driven circuit design, and pipeline design. Each of
the design chapters presents a complete discussion of
pertinent design issues, low-level simulation, performance
optimization, and final design specifications. Finally,
Chapter VI records the analysis of clock skew and
Chapter VII summarizes the conclusions of the entire work.
II. BACKGROUND
A. CLOCK SKEW
Clock skew is the difference between the arrival times
of the clock signal at two different clock-driven devices,
as illustrated in Figure (2-1). This difference is
dependent upon multiple issues including normal component
variations, wire propagation delay, RC delays, propagation
distance, environmental variations (such as operating
temperature), and clock loading. Notably, all of these
contributing factors have been increasing relative to gate
delays. (Harris, 1999)
Figure 2-1. Clock Skew (After Wakerly).
In traditional logic designs which employ flip-flops
and operate at extremely high clock frequencies, clock skew
has become a significant portion of the total clock period.
For a fixed-length clock period, this effectively reduces
the amount of time available for computation. Equation
(2-1) quantifies the terms which contribute to the minimum
clock period ($T_{min}$) of a traditional synchronous logic
circuit.
(2-1) $T_{min} = t_{skew} + t_{logic} + t_{\text{Flip-Flop}}$
where
$t_{\text{Flip-Flop}} = t_{setup} + (t_{prop})_{max}$
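As a rough numerical illustration of Equation (2-1), the following Python sketch evaluates the minimum clock period and the corresponding maximum clock frequency for several skew values. The delay values are hypothetical placeholders chosen only to show the trend; they are not results from this thesis.

```python
def min_clock_period(t_skew, t_logic, t_setup, t_prop_max):
    """Equation (2-1): T_min = t_skew + t_logic + t_FlipFlop,
    where t_FlipFlop = t_setup + (t_prop)_max. All times in picoseconds."""
    t_flip_flop = t_setup + t_prop_max
    return t_skew + t_logic + t_flip_flop


if __name__ == "__main__":
    # Hypothetical delays (ps), for illustration only.
    t_logic, t_setup, t_prop_max = 150.0, 20.0, 30.0
    for t_skew in (0.0, 10.0, 25.0, 50.0):
        t_min = min_clock_period(t_skew, t_logic, t_setup, t_prop_max)
        f_max_ghz = 1000.0 / t_min  # a period in ps converts to GHz as 1000 / T
        share = 100.0 * t_skew / t_min
        print(f"skew={t_skew:5.1f} ps  T_min={t_min:6.1f} ps  "
              f"f_max={f_max_ghz:5.2f} GHz  skew share={share:4.1f}%")
```

Because the skew term is added directly to the period, a fixed amount of skew consumes a larger share of the period as the logic and flip-flop delays shrink.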
The simplest and most direct technique for minimizing
clock skew would seem to be the implementation of a uniform
clock distribution hierarchy which provides a local clock
signal to a smaller portion of the entire circuit, i.e., a
subcircuit. For signals that remain within the subcircuit,
clock skew is reduced. The maximum propagation delay from
the local clock source to the farthest clock input of the
subcircuit can be kept within a desirable tolerance. But
inevitably, signals must travel between subcircuits. This
is an increasingly common occurrence when the maximum size
of the subcircuit is restricted by practical limitations for
fanout and power consumption — especially true in the case
of current-driven logic.
The local clock signals are not without skew relative
to each other. Although the delay paths for each branch of
the clock distribution tree may contain the same number of
gate delays, the switching behavior along each path varies
within a narrow range. Thus, when a signal from one
subcircuit must drive logic in another subcircuit, the
worst-case value of relative clock skew must be assumed.
An extensive clock distribution tree is employed in
this thesis to provide local clock signals for circuit
elements of a pipelined multiplier. Ultimately, the purpose
is to quantify the clock skew experienced in a high-speed
logic circuit and explore the impact of clock skew as the
clock period is reduced.
B. PRINCIPLES OF PIPELINING
As referenced in the previous section, the minimum
clock period is governed by the relationship presented in
Equation (2-1). For a given block of combinational logic
with an associated propagation time of $t_{logic}$, the minimum
clock period is required to be even greater. In the face of
a large, complex combinational circuit (Figure 2-2a) this
could impose undesirable restrictions on clock speed.
However, a pipelined approach suggests that the
combinational logic can be broken down into discrete levels
of operation, known as pipeline levels (Figure 2-2b). Each
pipeline level will contain fewer levels of logic than the
original combinational circuit, and ideally, each pipeline
level will contain the same number of logic levels in order
to achieve near-equal propagation delays. Then, by adding
appropriately sized registers between these levels (Figure
2-2c), the function of the original combinational logic can
be achieved by sequentially sending operands through the
series of pipeline levels.
Furthermore, this can be done at a higher clock rate
since the period is now governed by Equation (2-2), where
$t_{logic}$ has now become $t_{\text{pipe-level}}$.
(2-2) $T_{min} = t_{skew} + t_{\text{pipe-level}} + t_{\text{Flip-Flop}}$
The improvement in clock speed is quantified as the
speedup, Equation (2-3) (Pollard, 1990).
(2-3) $\text{Speedup} = \dfrac{\text{Time for } M \text{ operations WITHOUT pipelining}}{\text{Time for } M \text{ operations WITH pipelining}}$
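The speedup ratio of Equation (2-3) can be sketched numerically under a common textbook timing model, which is an assumption here and not a formula taken from this thesis: without pipelining, each of the M operations takes one full unpipelined clock period; with an m-stage pipeline, the first result appears after m pipelined clock periods and one more result follows on every later clock. The clock periods used below are hypothetical.

```python
def pipeline_speedup(num_ops, stages, t_unpipelined, t_pipelined):
    """Speedup per Equation (2-3) under the assumed timing model:
    time without pipelining = M * T_unpipelined
    time with pipelining    = (m + M - 1) * T_pipelined
    """
    time_without = num_ops * t_unpipelined
    time_with = (stages + num_ops - 1) * t_pipelined
    return time_without / time_with


if __name__ == "__main__":
    # Hypothetical clock periods (ps): one long stage vs. four shorter stages.
    t_unpipelined, t_pipelined, stages = 800.0, 250.0, 4
    for num_ops in (1, 10, 100, 1000):
        s = pipeline_speedup(num_ops, stages, t_unpipelined, t_pipelined)
        print(f"M={num_ops:5d}  speedup={s:.3f}")
```

Under this model a single operation is actually slower in the pipeline, while for large M the speedup approaches the ratio of the two clock periods.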
Of course, this benefit is not without cost. There are
several trade-offs involved such as increases in the number
of components, power consumption, control complexity, chip
area, and a variety of associated costs for design and
fabrication. Additionally, the propagation latency for a
single set of signals traveling through the pipeline is
increased due to the additional delays contributed by the
intermediate register(s) in the pipeline. Equation (2-4)
expresses this increase in latency as a function of the
number of pipeline stages (m) and the total register delay
(Loomis, 2000).
(2-4) $\text{Latency Increase} = (m-1)\, t_{\text{Flip-Flop}}$
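A minimal sketch of Equation (2-4) follows. It tabulates the added latency for pipeline depths of one, two, four, six, and ten stages, which are the depths implemented in this thesis; the flip-flop delay of 50 ps is a hypothetical value used only for illustration.

```python
def latency_increase(stages, t_flip_flop):
    """Equation (2-4): added latency = (m - 1) * t_FlipFlop."""
    return (stages - 1) * t_flip_flop


if __name__ == "__main__":
    t_flip_flop = 50.0  # hypothetical register delay in ps
    for m in (1, 2, 4, 6, 10):
        print(f"m={m:2d} stages  latency increase={latency_increase(m, t_flip_flop):6.1f} ps")
```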
Figure 2-2. Example of Pipelining (After Loomis).
Though the significant increase in delay for a single
operation may seem to be a tragic loss, it is the remarkable
increase in data throughput which accompanies the increase
in clock speed that ultimately motivates the designer to
adopt a pipelined architecture.
In the context of this project, a pipelined
architecture will facilitate the achievement of high clock
speeds in the implementation of a relatively large, complex
combinational circuit — a combinational multiplier.
C. LOGIC DESIGN OF A COMBINATIONAL MULTIPLIER
A combinational multiplier takes two n-bit operands and
performs n shift and n add operations to generate a 2n-bit
product. Most algorithms are implemented based upon the
paper-and-pencil-like procedure of shifted product
components as shown in Figure (2-3). Each individual bit of
the multiplier ($y_0$ through $y_{n-1}$) is successively multiplied
by the entire n-bit multiplicand. With each subsequent
multiplier bit, the resulting product component is shifted
by one additional bit position, starting with an initial shift of zero
and concluding with n-1. (Wakerly, 2000)
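The shifted-product procedure described above can be sketched behaviorally in Python. The function names are illustrative, and the code models only the arithmetic, not the gate-level array used later in this thesis.

```python
def partial_products(multiplicand, multiplier, n=8):
    """Return the n shifted product components of an n x n multiplication.

    Row i is the multiplicand ANDed with multiplier bit y_i, shifted left by i.
    """
    mask = (1 << n) - 1
    x = multiplicand & mask
    rows = []
    for i in range(n):
        y_i = (multiplier >> i) & 1
        rows.append((x if y_i else 0) << i)
    return rows


def multiply(multiplicand, multiplier, n=8):
    """Sum the shifted product components to form the 2n-bit product."""
    return sum(partial_products(multiplicand, multiplier, n))


if __name__ == "__main__":
    for row in partial_products(0b1101, 0b1011, n=4):
        print(f"{row:08b}")
    print(multiply(0b1101, 0b1011, n=4))
```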
The worst-case delay for this type of multiplication is
governed by the carry propagation out of the most
significant bit position and into the follow-on stage of
addition. By utilizing carry-save addition (Figure 2-4),
this propagation delay is eliminated for the initial n-1
stages of addition; however, an extra stage is required to
complete the addition of the final two resulting terms, as
will be explained shortly.
Figure 2-3. Multiplication as a sum of partial product
terms (From Wakerly).
Figure 2-4. An 8x8 bit multiplier implemented with seven
carry-save adder stages and one ripple-carry adder for
carry completion (From Wakerly).
The first carry-save addition stage takes two binary
addends and generates an n-bit modulo-two sum and a shifted
n-bit carry term (shifted by one bit). Subsequent carry-
save addition stages take three binary addends: the
previous partial sum, the shifted carry term, and the next
subsequent product term. These are also added to produce an
n-bit modulo-two sum and a shifted n-bit carry term. As
each carry-save addition occurs, the least significant bit
(LSB) of each partial sum becomes the next more
significant bit of the final product. This is
repeated until the nth product term has been added, and all
that remains are a sum term and a shifted carry term. At
this point, a carry-completion adder computes the most
significant n+1 bits of the product. This procedure
accounts for the consecutive propagation of a carry bit as
each pair of addend bits is summed from LSB to MSB.
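The carry-save procedure can be sketched behaviorally as follows. This is an illustrative model that operates on whole Python integers instead of n-bit slices, so it shows the arithmetic of carry-save reduction and carry completion without reproducing the gate-level structure of Figure 2-4. The function names are assumptions made for this example.

```python
def carry_save_add(a, b, c):
    """One carry-save stage: reduce three addends to a sum term and a
    carry term shifted left by one bit, so that a + b + c == s + carry."""
    s = a ^ b ^ c                                # bitwise modulo-two sum
    carry = ((a & b) | (a & c) | (b & c)) << 1   # majority bits, shifted by one
    return s, carry


def carry_save_multiply(x, y, n=8):
    """Multiply two n-bit operands (n >= 2) using n-1 carry-save stages
    followed by a single carry-completion addition."""
    mask = (1 << n) - 1
    x &= mask
    rows = [(x if (y >> i) & 1 else 0) << i for i in range(n)]
    # First stage: only two addends (the first two product terms).
    s, carry = carry_save_add(rows[0], rows[1], 0)
    # Remaining stages: previous sum, shifted carry, and next product term.
    for row in rows[2:]:
        s, carry = carry_save_add(s, carry, row)
    # Carry completion: the only step in which carries propagate.
    return s + carry


if __name__ == "__main__":
    print(carry_save_multiply(200, 123))
```

For n = 8 the loop performs seven carry-save stages in total, and carries ripple only in the final addition.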
In the context of this project, the implementation of
carry-save adders and carry completion adders allows
convenient grouping of pipeline stages. This is
particularly applicable to the final stage of the design
process undertaken in this project. Chapter V provides
further details on the implementation of a pipelined 8x8-bit
combinational multiplier, as introduced in the preceding
paragraphs.
D. BJT/HBT LOGIC
1. BJT/HBT Principles and Characteristics
a) Device Structure
A bipolar junction transistor (BJT) is a sandwich
structure of three separately doped regions of silicon (or
other suitable semiconductor), such that one of two
configurations exists. One configuration is the pnp
transistor where a negatively doped region is bounded on
either end by positively doped regions (p-type transistor).
The other configuration is the npn transistor where a
positively doped region is bounded on either end by
negatively doped regions (n-type transistor). Figure (2-5)
provides a simplified illustration and further identifies
the proper names for the regions: collector, base, and
emitter.
Figure 2-5. Structure of a Bipolar Junction Transistor
(After Pierret).
Until recent years, BJTs were generally fabricated
from a single semiconductor material. However, device-level
physics has demonstrated that faster junction
transistors can be constructed from dissimilar semiconductor
materials with complementary properties. Such devices are
known as heterojunction bipolar transistors (HBTs).
Conveniently enough, their operational behavior is
essentially governed by the same functional principles as
BJTs (Pierret, 1996). Therefore, it is assumed that
wherever BJT behavior is referenced, a direct correspondence
to HBT behavior exists. The following sections will provide
a fundamental understanding of that behavior.
b) Device Function
The significance of the BJT lies in its potential
to behave as a current-controlled current source when the
proper DC bias is applied to the three regions or terminals.
The controlling terminal is the base. Applying the proper
DC bias to an npn transistor, a small current flowing into
the base will produce a proportionately larger current being
drawn into the collector, across the base region, and out of
the emitter (Figure 2-6). The converse is true for a
properly biased pnp transistor. A small current drawn out of
the base will produce a proportionately larger current being
drawn into the emitter, across the base region, and out of
the collector. From this point forward, it will be helpful
to limit the discussion to npn transistors, because the pnp
transistors operate in a very similar manner (with reversed
polarity) and npn transistors are the only type encountered
in the chapters ahead.
Figure 2-6. A functional illustration of an (a) npn and
a (b) pnp bipolar junction transistor (After Sedra).
As stipulated in the preceding discussion, proper
DC bias conditions must exist in order to achieve the
desired performance. Depending upon the DC bias, the
transistor will operate in one of the following modes of
operation: cutoff, active, or saturation. In the first
case, the emitter-base junction is reverse biased which
means $V_{BE} < V_{BE(on)}$ for the pn junction (0.75v). This also
implies that $V_{BC} < V_{BC(on)}$ for the collector-base junction.
Therefore, the collector-base junction is also reverse
biased. This condition is known as the "cutoff" mode since
effectively no current flows through the transistor.
In the two remaining modes, the emitter-base
junction is forward biased, and the transistor conducts
current. The mode of operation is distinguished by the
condition of the collector-base junction — using the
emitter as a common reference for both the collector and
base. If $V_{CE} < V_{CE(sat)}$, then the base-collector junction is
forward biased and the transistor is saturated; the flow of current from collector to emitter
is not linearly dependent on $I_B$. Conversely, when $V_{CE} > V_{CE(sat)}$,
the base-collector junction is reverse biased
and current is swept from the collector, across the base,
and out of the emitter in linear proportion to the amount of
base current applied. This is known as the active region.
Table (2-1) summarizes the relationships which
govern the three regions of operation. Furthermore, Figure
(2-7) is an i-v curve for the Hughes InP HBT (1x1 micron).
It serves to illustrate the active and saturation modes of
BJT operation while also providing necessary design
information that relates the base-emitter voltage drop ($V_{BE}$)
to collector current levels ($I_C$).
The linearly proportionate increase in collector
current relative to base current is referred to as the
common-emitter current gain, Beta ($\beta$), as shown in Equation
(2-5). (Sedra, 1998)
(2-5) $\beta = I_C / I_B$
Mode of      Base-Emitter Junction             Collector-Emitter Junction
Operation    Bias      Relationship            Bias      Relationship
Cutoff       Reverse   $V_{BE} < V_{BE(on)}$   Reverse
Saturation   Forward   $V_{BE} > V_{BE(on)}$   Forward   $V_{CE} < V_{CE(sat)}$
Active       Forward   $V_{BE} > V_{BE(on)}$   Reverse   $V_{CE} > V_{CE(sat)}$
Table 2-1. Relationships governing the operational regions
of the BJT transistors (After Sedra).
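The mode relationships discussed above can be expressed as a small decision function. This is a simplified sketch for an npn device: the turn-on voltage of 0.75 V comes from the text, while the saturation voltage of 0.2 V is a hypothetical placeholder, since no value for $V_{CE(sat)}$ is stated in this section.

```python
def bjt_mode(v_be, v_ce, v_be_on=0.75, v_ce_sat=0.2):
    """Classify the operating mode of an npn BJT from its terminal voltages.

    v_be_on  : emitter-base turn-on voltage (0.75 V in the text)
    v_ce_sat : saturation voltage (hypothetical 0.2 V, assumed for this sketch)
    """
    if v_be < v_be_on:
        return "cutoff"        # emitter-base junction reverse biased
    if v_ce < v_ce_sat:
        return "saturation"    # both junctions forward biased
    return "active"            # collector current proportional to base current


if __name__ == "__main__":
    for v_be, v_ce in ((0.3, 5.0), (0.8, 0.1), (0.8, 3.0)):
        print(f"V_BE={v_be:.2f} V, V_CE={v_ce:.2f} V -> {bjt_mode(v_be, v_ce)}")
```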
Figure 2-7. I-V Curve for the InP HBT.
Figure 2-8. Variation of Beta for the InP HBT with
respect to $V_{BE}$ and $V_{CE}$.
Beta is a device parameter for BJTs — a function of the
device physics and dimensions. Figure (2-8) illustrates how
Beta varies according to the values of base-emitter voltage
and collector-emitter voltage.
Finally, a simple application of Kirchhoff's
Current Law produces Equation (2-6), an important
relationship for current through the transistor.
(2-6) $I_E = I_B + I_C$
c) DC Analysis of a BJT Circuit
In order to illustrate the basic concepts of BJT
operation as presented in the previous section, the
transistor circuit in Figure (2-9) is now examined. Given
the reference voltages, the turn-on voltage for the emitter-
base junction (0.75v), and Beta for the transistor, it is
readily determined that $V_{BE} > V_{BE(on)}$, and therefore the
emitter-base junction is forward biased. DC analysis
reveals the value of $V_B$ and $I_B$. Applying the equations from
the previous section, $I_C$, $I_E$, and $V_C$ are determined, and it is
concluded that the transistor is operating in the active
region.
V_CC = +10v
V_BB = +5v
R_C = 1kΩ
R_B = 100kΩ
DC ANALYSIS:
V_E = 0v
V_B = V_E + V_BE(on) = 0.7v
I_B = (V_BB - V_B) / R_B = (5v - 0.7v) / 100kΩ = 43µA
I_C = β x I_B = 4.3mA
I_E = I_C + I_B = 4.343mA
V_C = V_CC - I_C R_C = 10v - (4.3mA)(1kΩ) = 5.7v
Figure 2-9. DC Analysis of a simple BJT circuit.
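The DC analysis of Figure (2-9) can be retraced numerically. The following is a minimal sketch, not part of the original design flow. The function name bjt_dc_analysis and its argument names are hypothetical, Beta = 100 is inferred from the ratio of I_C to I_B in the figure, the turn-on voltage is the 0.7v value used in the figure, and the 0.3v saturation threshold is only an assumed placeholder for the active-mode check.

```python
def bjt_dc_analysis(v_cc, v_bb, r_b, r_c, beta, v_be_on, v_e=0.0, v_ce_sat=0.3):
    """Active-mode DC operating point of a common-emitter stage.

    v_ce_sat is an assumed placeholder threshold, not a value from the figure.
    """
    v_b = v_e + v_be_on          # base sits one junction drop above the emitter
    i_b = (v_bb - v_b) / r_b     # base current through R_B
    i_c = beta * i_b             # Equation (2-5): I_C = Beta x I_B
    i_e = i_c + i_b              # Equation (2-6): Kirchhoff's Current Law
    v_c = v_cc - i_c * r_c       # collector voltage after the drop across R_C
    active = (v_c - v_e) > v_ce_sat
    return {"v_b": v_b, "i_b": i_b, "i_c": i_c, "i_e": i_e,
            "v_c": v_c, "active": active}


# Values read from Figure (2-9); Beta = 100 is inferred, not stated.
op = bjt_dc_analysis(v_cc=10.0, v_bb=5.0, r_b=100e3, r_c=1e3,
                     beta=100, v_be_on=0.7)
print(f"I_B = {op['i_b'] * 1e6:.1f} uA")
print(f"I_C = {op['i_c'] * 1e3:.3f} mA")
print(f"I_E = {op['i_e'] * 1e3:.3f} mA")
print(f"V_C = {op['v_c']:.2f} V, active mode: {op['active']}")
```

With these inputs the collector voltage evaluates to V_CC - I_C R_C = 10v - 4.3v = 5.7v, which is well above any plausible saturation voltage and so agrees with the conclusion that the transistor operates in the active region.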
In anticipation of logic applications, consider
the base voltage as a logical input which is either high
(above V_BE(on)) or low (below V_BE(on)). For a logic high input
the transistor operates in the active mode, causing the
voltage at the collector to drop below V_CC by an amount equal to
I_C R_C. Alternately, for a logic low input the transistor
operates in the cutoff mode, drawing effectively no current
through the collector and leaving V_C approximately equal to
V_CC. The functionality of this circuit is essentially that
of a basic BJT inverter.
d) BJT Differential Pair
Before committing to the discussion of transistor
logic circuits, it is necessary to introduce a configuration
that maximizes the switching speed of the BJT transistor:
the differential pair. A differential pair is constructed
from two matched transistors (Q_1 and Q_2) with their emitters
attached to a common current source and their collectors
independently biased via separate pull-up resistors to a
common voltage source, as shown in Figure (2-10). The base
terminals are attached to separate voltage sources of equal
value. Assuming the transistors have been given the proper
DC bias for operation in the active mode, the relationship
in Equation (2-7) is readily determined.
(2-7) I_E1 = I_E2 = I_bias / 2
Figure 2-10. Example of a BJT Differential
Pair configuration.
Now, consider the scenario where V_B2 is constant
and V_B1 is allowed to vary between two extremes: one above
and one below V_B2. When V_B1 reaches a voltage sufficiently
larger than V_B2, all of the current from I_bias is steered
through Q_1 such that Q_2 is cutoff. Conversely, when V_B1 drops
sufficiently below V_B2, Q_2 is on and Q_1 is cutoff. As noted
in the DC analysis of the previous BJT circuit, the
collector voltage of Q_1 exhibits the behavior of a logic
inverter with respect to V_B1, while the opposite collector
voltage (Q_2) functions as a non-inverting buffer.
While the availability of complementary output
voltages is certainly convenient, the most important
observation of the differential pair is its switching speed.
A relatively small voltage difference between V_B1 and V_B2 is
required to switch the current almost entirely to the
opposite path. More specifically, for a differential pair
implemented with the Hughes InP HBT, it is shown in Figure
(2-11) that a difference of only 75mV is sufficient to
switch 90% of the current.
Figure 2-11. Current Switching Characteristic of the InP
HBT Differential Pair.
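The current-steering behavior can be approximated with the ideal exponential junction law found in standard textbooks. The sketch below is an idealization, not the Hughes InP HBT device model behind Figure (2-11). The function name steering_fraction is hypothetical, and the thermal voltage of 25.85mV (roughly room temperature) is an assumption.

```python
import math


def steering_fraction(v_diff, v_t=0.02585):
    """Ideal fraction of I_bias carried by Q_1.

    v_diff is V_B1 - V_B2 in volts; v_t is the assumed thermal voltage.
    Emitter resistance and other non-ideal effects are ignored.
    """
    return 1.0 / (1.0 + math.exp(-v_diff / v_t))


for mv in (0, 25, 50, 75, 100):
    frac = steering_fraction(mv / 1000.0)
    print(f"{mv:4d} mV -> {100 * frac:5.1f}% of I_bias in Q_1")
```

Under these assumptions the ideal model steers roughly 95% of the bias current at a 75mV difference, so it sits on the optimistic side of the 90% figure read from the measured device characteristic. The qualitative point is the same: a swing of a few tens of millivolts is enough to redirect nearly all of the current.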
Furthermore, since Q_1 and Q_2 are biased to operate
in the active mode, the switching occurs faster than
scenarios which may place the transistors in saturation
mode. This is because a saturated transistor stores charge
in its base. That charge must be dissipated before
switching can occur.
It is the current-steering property of the
differential pair configuration which ultimately provides a
foundation for the development of current mode logic, as
will be discussed later in this chapter. However, before
reaching that discussion, a brief overview of the dominant
BJT logic families will serve to accentuate the advantage of
current mode logic.
2. BJT/HBT Logic Families
This discussion is not intended to address all BJT/HBT
logic families. Rather, the purpose here is to summarize the
principles of the two most popular and relevant BJT/HBT
logic families. These are transistor-transistor logic and
current-mode logic. Ultimately, this discussion culminates
with a comparison of the two logic families in order to
justify the implementation of current-mode logic for
high-speed applications.
a) Transistor-Transistor Logic (TTL)
Transistor-transistor logic evolved directly from
diode-transistor logic (DTL) in a successful effort to
eliminate the drawbacks of DTL. (Richards, 1967) While
there were several stages in this evolution, the end product
is a TTL family which resembles the inverter shown in Figure
(2-12). The enhanced performance of TTL is predominately
achieved through two fundamental design features.
The first improvement is the use of a second
transistor in place of the diodes of a DTL circuit. For a
low input voltage, Q_1 is turned on, rapidly drawing
current from the base of Q_2 and dissipating the excess
charge to achieve a faster transition. In the opposite
case, when the input is high and Q_1 is cutoff, Q_1 is
specifically engineered to have a low reverse Beta such that
a small yet sufficient current flows out through the
collector and is applied to the base of Q_2.
The second improvement is the use of an optimum
output stage, commonly referred to as the "totem-pole"
output stage (not shown in Figure 2-12). It combines
the rapid high-to-low transition capability of the
common-emitter output stage with the rapid low-to-high transition
capability of the emitter-follower output stage.
Based upon these two features in conjunction with
other minor modifications, TTL logic achieved a level of
popularity which made it the dominant design for SSI, MSI,
and LSI circuits throughout two decades. Despite this
success, standard TTL circuit speeds are still limited by
two design issues. First, transistors operate in saturation
mode which increases junction capacitance and its associated
switching delay. Second, the resistance along the
dissipation path for junction capacitance further increases
this delay.
b) Current-Mode Logic (CML)
Current-mode logic is distinct from the design of
other BJT/HBT logic families. The term "current-mode"
refers to the channeling of a constant current along
alternate paths to achieve logic functionality in circuits.
Since it is the presence or absence of current that
determines the logical output, the maximum voltage swing can
be relatively small in contrast to voltage-mode circuits,
such as TTL.
The distinguishing design feature of current-mode
logic circuits is the BJT differential pair. It is the
backbone of all CML circuits and the source of critical
advantages and disadvantages. The benefit of smaller logic
swings has already been mentioned. Also, the discussion of
the BJT differential pair earlier in this chapter explained
how the collector voltage swings (inverts) rapidly in
response to reversing the polarity/magnitude of the
differential inputs by a narrow margin of approximately
75mV. This translates into a switching speed for CML which
is unsurpassed by its predecessors. Contributing to this
remarkable speed is the fact that the transistors of the
differential pair can be operated in the active region and,
therefore, do not suffer from the effects of excess charge
stored at the transistor base. Unfortunately, the constant
flow of current which enables these remarkable switching
speeds also consumes a remarkable amount of power.
For an illustration of how a CML circuit
functions, consider the inverter in Figure (2-13). Let
input B have a constant value, a reference voltage. When
input A is high (greater than the reference voltage by at
least 75mV), then Q_1 is turned on and Q_2 is cut off. The
current being drawn through R_1 produces a logic low
(V_CC - I_bias R_1) at V_out1. Notably, the complement of this
output, a logic high (V_CC), is simultaneously available at
V_out2. The presence of complementary outputs is yet another
benefit of CML circuits. When input A is switched from high
to low, the conditions for Q_1 and Q_2 reverse. Q_2 turns on
and Q_1 is cut off. V_out2 is pulled low while V_out1 is
pulled high.
Figure 2-13. CML Inverter.
c) Advantages and Disadvantages
For high-speed applications, the selection of a
BJT logic design is reduced to a quantitative comparison of
TTL and CML. The predecessors of these two logic families
are far inferior in their capability to dissipate the
accumulated charge at the transistor base upon switching.
If the only two criteria were maximizing speed
while minimizing power consumption, then there could
possibly be a toss-up between TTL and CML, ultimately to
be determined by the design which achieves the lowest
power-delay product or by weighting one specification over the
other (high-speed or low-power). Clearly, TTL is the
low-power contender, while CML is the high-speed champion.
However, before addressing the issue in the context of this
design project, consider the following summary of advantages
and disadvantages.
In addition to being faster, CML requires a
smaller voltage swing than TTL and is less susceptible to
noise due to the nature of the BJT differential pair. As
another benefit of that nature, CML generates complementary
outputs. The fact that both output signals are referenced
to V_CC provides for exceptional stability when V_CC is
referenced to ground and a negative supply voltage is used.
Unfortunately for TTL, its strong point of consuming less
power has a down side: the short pulses of current which
must be generated for switching logic levels also create
spikes in the supply voltage. The constant current drawn by
CML circuits avoids this potential source of noise.
In conclusion to this comparison, a logic designer
presented with the choice of CML or TTL would only choose
TTL in the event that power consumption made CML
impractical. In real world applications, this is typically
true. However, since it is the purpose of this design
project to explore the impact of high-speed logic on digital
system architecture, priority has been given to the superior
speed and extensive design benefits of CML.
Having concluded that current-mode logic is the
best approach to HBT high-speed logic design, it is
necessary to design a sufficient set of logic gates to
implement the desired test circuit, an 8x8 bit pipelined
multiplier. Chapter III presents the discussion of logic
circuit design which includes design of the following: an
inverter/buffer gate, a NOR/OR gate, full adders, and a
practical current source.
III. HBT CML LOGIC CIRCUIT DESIGN
A. DESIGN OVERVIEW
In this chapter, CML logic circuits are designed which
will serve as the building blocks for construction of the
multiplier logic. The design process is presented in the
context of a single logic circuit, beginning with the most
fundamental functions and progressing toward the more
complex. Of note are the following general design goals
which served as guidance for decision-making in the early
stages of logic circuit design:
• Minimize the rail voltages (i.e. supply voltage)
• Achieve proper DC bias conditions with reliable
noise margins and fanout
• Optimize transient performance for speed and power
consumption
B. INVERTER DESIGN
1. Circuit Topology
Based upon the introduction to CML design in the
previous chapter, Figure (3-1) illustrates the circuit
topology of a CML inverter. A detailed description of its
function is presented in the previous chapter and will not
be repeated here. However, there is one subtle constraint
in this design. One of the differential inputs is tied to a
reference voltage. While this is not essential for the
design of an inverter, it will prove significant in the
implementation of multiple-input logic gates. A common
reference voltage eliminates the need to provide
complementary logic signals for each input and, furthermore,
it avoids the increase in supply voltage associated with
multiple complementary inputs in a stacked series of
differential input pairs.
Figure 3-1. CML Inverter.
Figure (3-2) illustrates the same inverter design as
Figure (3-1); however, it also includes an emitter-follower
stage at each collector output of the differential pair.
The purpose of this stage is twofold. First, it provides a
buffer between the input differential pair and the
capacitive load of subsequent driven logic gates. Second,
it produces a downward DC shift equal to the base-emitter
turn-on voltage. Ideally, the gain of the emitter-follower
is one; however, in practice the gain is slightly less than
one. The result is a slightly diminished voltage swing at
the output of the emitter-follower when compared to the
voltage swing at the collector of the differential pair.
Whether or not to include the buffer stage represents a
fundamental design issue for CML logic circuit design. At a
glance, performance arguments can be made both for and
against it. On the one hand, it would appear to increase
fanout performance, yet on the other, it would appear to
decrease switching performance with the additional switching
delay of a second transistor stage. Additionally, the
non-buffered output topology would consume less power for a
given bias current. However, without performance data to
substantiate one option over the other, both will be
developed and evaluated until objective design
considerations can identify a clear preference.
2. Initial Conditions and Design Parameters
a) Voltage Parameters
Having introduced the topology of the CML
inverter, it is necessary to establish initial conditions
for operation. The first is the supply voltage, which is
bound by two primary considerations. It must be large
enough to support the proper function of the circuit, i.e.
provide proper transistor bias conditions and the desired
voltage range between high and low logic levels.
Conversely, it should be kept as small as possible, because
the power consumed by the circuit is directly proportional
to the magnitude of the supply voltage.
Clearly, foresight must be exercised in order to
determine the minimum supply voltage necessary to achieve
proper DC bias conditions for all transistors in all
circuits of the design. In the context of this project, the
D-type latch design (presented in Chapter IV) imposes the
greatest demand on the supply voltage level by operating
three transistors in series between the voltage supply
rails. For optimum, reliable clocking performance of the
latch, the logic reference voltage is determined to be 1.45
volts. This figure is based upon a maximum logic signal
range of 0.5 volts and a maximum logic high voltage of 1.7
volts (reference Chapter IV-A-3a for further details).
Given this information, the minimum required
supply voltage is determined for each inverter topology.
Both require that the voltage at the collector (V_C) be large
enough to avoid saturation of Q_1. Furthermore, both require
that the voltage at the collector provide for an output
voltage that matches the range of the input voltage.
For the non-buffered topology, this implies an
inverse match between the voltage at the base of Q_1 and the
voltage at its collector. In other words, for a logic input
that is high, V_B(hi), the output voltage at the collector
should be low, such that the following relationship in
Equation (3-1) holds true.
(3-1) V_C(low) = V_B(hi) - 0.5v
Assuming the collector of Q_1 draws approximately 1mA of
current, the collector-emitter saturation voltage, V_CE(sat), is
0.275 volts and the base-emitter turn-on voltage is 0.775
volts. Under these conditions, Q_1 is on the boundary of
active mode operation. For a signal swing larger than 0.5
volts, the transistor would saturate. Conversely, for a
logic input (V_B) that is low, the collector voltage (V_C) must
be given by Equation (3-2).
(3-2) V_C(hi) = V_B(low) + 0.5v
For V_B(low) equal to 1.2 volts, V_C(hi) must be 1.7 volts. Thus,
for the non-buffered topology, the maximum voltage at the
collector is 1.7 volts. No current flows through R_gain
because Q_1 is cutoff; therefore, the minimum required supply
voltage is also 1.7 volts.
In the case of the buffered topology, the DC
voltage drop across the base-emitter junction of the output
buffer imposes a greater demand. For the output voltage
range to match the input voltage range, the voltage at the
collector (as described in Equation 3-2) must be increased
by an amount of V_BE(on) (as shown in Equation 3-3) in order to
counter the base-emitter voltage drop at the buffered
output.
(3-3) V_C(hi) = V_B(low) + 0.5v + V_BE(on)
Assuming a current of 1mA or less through the buffer, V_BE(on)
is 0.775 volts. The result is a minimum required supply
voltage of 2.5 volts. (Reference Chapter IV-A-3a for a
thorough derivation of these conclusions.)
In summary, different supply voltage levels will
be utilized for the two inverter topologies. The non-
buffered output topology will employ a 1.7 volt supply
voltage, while the buffered output topology will employ a
2.5 volt supply voltage.
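The supply-voltage reasoning of Equations (3-1) through (3-3) reduces to a few lines of arithmetic. The sketch below only restates the numbers quoted in this section. The constant and function names (min_supply, q1_v_ce_at_low_output) are hypothetical.

```python
V_BE_ON = 0.775   # base-emitter turn-on voltage at about 1 mA (volts)
V_CE_SAT = 0.275  # collector-emitter saturation voltage at about 1 mA (volts)
V_SWING = 0.5     # logic signal range (volts)
V_B_HI = 1.7      # maximum logic high voltage (volts)
V_B_LOW = V_B_HI - V_SWING  # logic low voltage, 1.2 volts


def min_supply(buffered):
    """Minimum supply voltage: the collector must reach V_C(hi) with Q_1 cutoff."""
    v_c_hi = V_B_LOW + V_SWING      # Equation (3-2)
    if buffered:
        v_c_hi += V_BE_ON           # Equation (3-3): offset the buffer's V_BE drop
    return v_c_hi


def q1_v_ce_at_low_output():
    """V_CE of Q_1 when its collector is at V_C(low), per Equation (3-1)."""
    v_c_low = V_B_HI - V_SWING      # Equation (3-1)
    v_e = V_B_HI - V_BE_ON          # emitter follows the high input
    return v_c_low - v_e


print(f"non-buffered supply: {min_supply(False):.3f} V")
print(f"buffered supply:     {min_supply(True):.3f} V")
print(f"V_CE of Q_1 at low output: {q1_v_ce_at_low_output():.3f} V")
```

The buffered case evaluates to 2.475 volts, which the text rounds up to a 2.5 volt supply. The computed V_CE of 0.275 volts equals V_CE(sat), which is why a swing larger than 0.5 volts would push Q_1 into saturation.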
b) Transistor Area/Size
In order to optimize switching speeds in BJT/HBT
transistors, it is desirable to keep the device area small,
thereby minimizing parasitic capacitances. Likewise, a
smaller device size requires less current and less current
means less power. The InP HBT device sizes made available
from Hughes Research Laboratories have junction areas of
1x1, 1x3, 1x5, and 2x5 microns. The 1x1 area transistor is,
therefore, the transistor of choice for switching
applications (logic circuits). Note, however, that the
consideration of device size must be re-visited for
applications where switching speed is not a factor, i.e. the
construction of a practical current source (addressed in
Chapter IV).
c) Fanout Requirement
Fanout is the number of logic gate inputs that a
single gate output can drive, while providing voltage levels
within the correct logic range. Increased fanout is
achieved at the expense of power consumption and loss of
speed. Considering that the CML logic inputs/loads are
current-driven, increased fanout will require a
corresponding increase in switching delay and/or current.
As a result, the fanout parameter should be chosen such that
it sufficiently economizes the number of logic gates and
levels of logic required without needlessly sacrificing
power and speed. In meeting this requirement, a reasonable
fanout parameter has been established based upon the
logic-level design of a three-input adder (reference Chapter
III-D). For implementation using the minimum number of
logic levels, a three-input adder requires a fanout of four.
3. DC Analysis
a) Overview
Given the circuit topology for a CML inverter as
shown previously in Figure (3-2), the first step in circuit
design is to establish the proper DC bias conditions for
operation. This can be done for both the buffered and non-
buffered cases simultaneously. For the non-buffered case,
simply disregard the presence of the buffer stages. The
remaining node voltages at the collector outputs on the
differential pair are the same.
Figures (3-3a) and (3-3b) show the DC node
voltages for the desired operation of a CML inverter given a
high logic input and a low logic input, respectively. Given
matched transistors, the two sides of the differential pair
could be considered symmetric in their behavior, except that
the input voltages driving the opposite sides of the
differential pair are not symmetric. That is, the reference
voltage drives the differential pair at 1.45 volts whereas
the logic input drives it at 1.7 volts. The result is a
difference of 0.25 volts at the emitter. This is a minor
observation at present, but it explains the non-symmetric
performance that is encountered between the two output
signals (the inverted and the non-inverted signals).
Node voltages annotated in Figure 3-3 (leakage current is neglected):
(a) HIGH input: collector nodes at V_CC - (I_B3 + I_C1) R_gain
and V_CC - I_B4 R_gain; buffered outputs at
V_CC - (I_B3 + I_C1) R_gain - V_BE(on) and
V_CC - I_B4 R_gain - V_BE(on).
(b) LOW input: collector nodes at V_CC - I_B3 R_gain and
V_CC - (I_B4 + I_C2) R_gain; common emitter at V_REF - V_BE(on);
buffered outputs at V_CC - I_B3 R_gain - V_BE(on) and
V_CC - (I_B4 + I_C2) R_gain - V_BE(on).
Figure 3-3. DC Analysis of a CML Inverter for (a) a HIGH
input logic level and (b) a LOW input logic level.
b) Gain Resistor
In order to take advantage of the switching speed
of the differential pair, transistors must be biased to
operate in the active mode. Therefore, the value of the
base-emitter voltages (V_BE) for Q_1 and Q_2 must be such that
V_CE > V_CE(sat). Thus, for a given supply voltage and bias
current, there is a restriction on the magnitude of the
voltage drop across R_gain. If the drop is too large, the
transistor will saturate. Conversely, the voltage drop must
not be too small because it is the product of I_Rgain and R_gain
which determines the magnitude of the signal voltage swing
(assuming active operation). This same voltage range
applies to the output of the buffer stages as well. As
referenced earlier in this chapter, a constant DC shift of
V_BE(on) is the only difference between the nodes V_C and V_buf.
In summary, the significance of R_gain is two-fold:
it must be small enough to keep Q_1 (and Q_2) operating in the
active mode, and it must be large enough to provide a
satisfactory voltage swing between logic levels. Figure
(3-4) illustrates the DC transfer characteristic of the
inverter for various values of gain resistance. It
effectively demonstrates the upper and lower limitations of
gain resistance for a value of I_bias equal to 1mA. At
resistances of 500 ohms and less, the desired 0.5 volt
signal swing is not achieved, and at resistances of 600 ohms
and greater, the effect of saturation can be observed by the
upward bend in the curve.
Figure 3-4. Effect of Gain Resistor Variation on
Inverter Output.
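The two-sided constraint on R_gain can be illustrated with a first-order hand calculation. The sketch below is a deliberate simplification and is not the circuit simulation behind Figure (3-4). It assumes the non-buffered 1.7 volt supply, assumes that all of I_bias flows through the conducting collector (buffer base current and finite Beta are neglected), and treats saturation as a hard threshold. The function name gain_resistor_check is hypothetical.

```python
def gain_resistor_check(r_gain, i_bias, v_cc=1.7, v_b_hi=1.7,
                        v_be_on=0.775, v_ce_sat=0.275, v_swing=0.5):
    """First-order check of swing and active-mode operation for one R_gain."""
    drop = i_bias * r_gain        # voltage drop across R_gain = signal swing
    v_c_low = v_cc - drop         # collector voltage of the conducting side
    v_e = v_b_hi - v_be_on        # emitter follows the high logic input
    v_ce = v_c_low - v_e
    return {"swing": drop,
            "v_ce": v_ce,
            "swing_ok": drop >= v_swing,
            "active": v_ce >= v_ce_sat}


for ohms in (400, 600):
    r = gain_resistor_check(r_gain=ohms, i_bias=1e-3)
    print(f"{ohms} ohms: swing = {r['swing']:.2f} V, V_CE = {r['v_ce']:.3f} V, "
          f"swing_ok = {r['swing_ok']}, active = {r['active']}")
```

At 1mA the sketch flags 400 ohms as too small for the 0.5 volt swing and 600 ohms as saturating, which matches the trend described above. The exact limits of the usable window depend on effects the sketch ignores, so the simulated transfer curves remain the reference.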
c) Buffer Resistor
The buffer resistor (R_buf) governs the amount of
current drawn by the emitter of transistors Q_3 and Q_4. The
magnitude of emitter current is directly proportional to the
base current which is drawn from the collector of the
differential pair. Thus, the base current of the output
buffer represents a small portion of the current passing
through R_gain. In this way, the size of the buffer resistor
effectively produces a small DC offset at the buffered
output while regulating the amount of current drawn through
the buffer stage.
This is significant for two reasons. First, it
facilitates optimization of switching speed versus power
consumption by providing a mechanism for controlling the
amount of current flowing through the buffer stage and,
therefore, available to drive a logic load. Second, R_buf is
inversely proportional to a DC voltage offset at the
buffered output. The ability to control this offset is
especially helpful in matching the output signal swing to
the input. Figure (3-5) represents the variation of output
voltage for a range of resistor values based upon a bias
current of 1mA.
d) Bias Current
Bias current is directly proportional to the
current (I_C) drawn through the gain resistor (R_gain).
Therefore, bias current drives the magnitude of the voltage
drop produced in the gain resistor, and this voltage drop
corresponds to the maximum signal voltage swing. For this
reason, a proper combination of I_bias and R_gain must be
determined to provide the desired 0.5 volt swing. In order
to select from an infinite set of current-resistor
combinations, a likely set of current-resistor pairs will be
identified to represent the practical range of
possibilities. This is done for both the buffered and
non-buffered inverter topologies. Note, the non-buffered
topology can be allowed to draw a higher bias current
through the differential pair because it does not draw any
additional current through buffer stages.
Figure 3-5. Effect of Buffer Resistor Variation on
Inverter Output.
e) DC Noise Margins
Once values of resistance and bias current are
established, the circuit topology is completely defined and
a DC transfer curve can be obtained. From this plot the DC
noise margins for a particular design are calculated. Noise
margins provide a measure of the allowable noise which can
be received at the input without affecting the correct logic
output. Since this circuit will be operating with such a
narrow signal voltage swing, noise margins are a critical
interest for establishing reliable DC bias conditions.
Equations (3-4) and (3-5) define the low and high noise
margins in terms of the maximum and minimum, high and low
logic values. (Weste, 1993)
(3-4) NM_L = |V_ILmax - V_OLmax|
(3-5) NM_H = |V_OHmin - V_IHmin|
where,
V_IHmin = minimum HIGH input voltage
V_ILmax = maximum LOW input voltage
V_OHmin = minimum HIGH output voltage
V_OLmax = maximum LOW output voltage
These logic values are extracted from the DC transfer curve.
The two unity gain points (where the slope equals negative
one) of the DC transfer curve have been used to define the
boundaries of these regions.
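The extraction of noise margins from a DC transfer curve can be sketched as follows. This is an illustration of Equations (3-4) and (3-5) and of the unity gain criterion only. The tanh-shaped transfer curve is hypothetical and is not thesis data, and the function names noise_margins and unity_gain_points are assumptions.

```python
import math


def noise_margins(v_il_max, v_ol_max, v_oh_min, v_ih_min):
    """Equations (3-4) and (3-5): return (NM_L, NM_H) in volts."""
    nm_l = abs(v_il_max - v_ol_max)
    nm_h = abs(v_oh_min - v_ih_min)
    return nm_l, nm_h


def unity_gain_points(v_in, v_out):
    """Locate the two slope = -1 points of a sampled inverting transfer curve.

    Returns (V_ILmax, V_OHmin, V_IHmin, V_OLmax), accurate to about one grid step.
    """
    steep = [k for k in range(len(v_in) - 1)
             if (v_out[k + 1] - v_out[k]) / (v_in[k + 1] - v_in[k]) <= -1.0]
    if not steep:
        raise ValueError("transfer curve never reaches a slope of -1")
    first, last = steep[0], steep[-1] + 1
    return v_in[first], v_out[first], v_in[last], v_out[last]


# Hypothetical transfer curve (NOT thesis data): 0.5 V swing centered on 1.45 V.
v_in = [1.2 + 0.5 * k / 2000 for k in range(2001)]
v_out = [1.45 - 0.25 * math.tanh((v - 1.45) / 0.05) for v in v_in]

v_il, v_oh, v_ih, v_ol = unity_gain_points(v_in, v_out)
nm_l, nm_h = noise_margins(v_il, v_ol, v_oh, v_ih)
print(f"V_ILmax = {v_il:.3f} V, V_IHmin = {v_ih:.3f} V")
print(f"NM_L = {nm_l:.3f} V, NM_H = {nm_h:.3f} V")
```

Because the illustrative curve is symmetric about 1.45 volts, the two margins come out equal at roughly 0.15 volts. A real inverter curve would generally yield unequal margins.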
f) DC Bias Optimization
Given a set of practical current values, DC
analysis is employed to identify a set of matching gain
resistances which properly bias the inverter for logic
operations. For each pair of current-resistor values, a DC
transfer characteristic is obtained to determine the noise
margins and the maximum range of the signal swing. The
results are tabulated in Table (3-1). In the absence of a
load, each configuration met the established design
requirements: a matched input and output signal
voltage range of 0.5 volts, centered at a reference voltage
of 1.45 volts with sufficiently balanced noise margins of
0.1 volt minimum (20% of the signal range).
However, when examined under the maximum fanout
load (which is four), the performance of the non-buffered
output topology suffers greatly. The maximum high logic
voltage is reduced by an amount ranging from 0.09 volt to
0.23 volt, depending upon the bias configuration. Not only
does a load reduce the desired 0.5 volt signal range, but it
also erodes the high-end noise margin. As a result, the non-
buffered output topology can now be eliminated from further
consideration in the design process.
As for the buffered output topology, the noise
margins and voltage range are remarkably consistent —
regardless of the loading. The output buffer effectively
isolates the current drawn by the load from the current in
the differential pair. Thus, each of the bias
configurations for the buffered output topology will be
further tested under transient conditions to identify the
optimum inverter design. It should be noted that the DC
analysis presented here and the transient performance
analysis which follows are both conducted using ideal
current source models.
4. AC/Transient Analysis
a) Delay Measurements
Transient performance of logic circuits is
generally quantified by measuring the delay associated with
signal propagation. The delay times utilized here are
standard performance parameters. However, for completeness,
their mathematical definitions are provided below in
Equations (3-6) and (3-7). (Weste, 1993)
(3-6) t_fall = time for a logic signal to traverse
from 0.9 V_RANGE to 0.1 V_RANGE
(3-7) t_rise = time for a logic signal to traverse
from 0.1 V_RANGE to 0.9 V_RANGE
where V_RANGE = the voltage difference between the
steady state V_HI and V_LOW
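The definitions in Equations (3-6) and (3-7) can be applied to sampled waveform data as sketched below. The exponential test waveforms and the 20ps time constant are hypothetical stand-ins for simulator output, and the function names are assumptions. Times are in picoseconds.

```python
import math


def crossing_time(t, v, level, rising):
    """Linearly interpolated time of the first crossing of `level`."""
    for k in range(len(t) - 1):
        a, b = v[k], v[k + 1]
        crossed = (a < level <= b) if rising else (a > level >= b)
        if crossed:
            return t[k] + (level - a) * (t[k + 1] - t[k]) / (b - a)
    raise ValueError("waveform never crosses the requested level")


def rise_time(t, v, v_low, v_high):
    """Equation (3-7): time to traverse from 0.1 V_RANGE to 0.9 V_RANGE."""
    v_range = v_high - v_low
    t10 = crossing_time(t, v, v_low + 0.1 * v_range, rising=True)
    t90 = crossing_time(t, v, v_low + 0.9 * v_range, rising=True)
    return t90 - t10


def fall_time(t, v, v_low, v_high):
    """Equation (3-6): time to traverse from 0.9 V_RANGE to 0.1 V_RANGE."""
    v_range = v_high - v_low
    t90 = crossing_time(t, v, v_low + 0.9 * v_range, rising=False)
    t10 = crossing_time(t, v, v_low + 0.1 * v_range, rising=False)
    return t10 - t90


# Hypothetical single-pole edges between 1.2 V and 1.7 V (not simulator data).
TAU = 20.0                                   # assumed time constant in ps
t = [0.1 * k for k in range(2001)]           # 0 to 200 ps in 0.1 ps steps
v_rise = [1.2 + 0.5 * (1.0 - math.exp(-x / TAU)) for x in t]
v_fall = [1.2 + 0.5 * math.exp(-x / TAU) for x in t]

print(f"t_rise = {rise_time(t, v_rise, 1.2, 1.7):.2f} ps")
print(f"t_fall = {fall_time(t, v_fall, 1.2, 1.7):.2f} ps")
```

For a single-pole edge the 10% to 90% time equals the time constant multiplied by ln(9), which gives a convenient check on the measurement routine.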
b) Performance Parameters
At this point in the design process, two
performance parameters are of primary concern: power and
speed. Being related to each other, there is often a
trade-off between the two. Optimization of these two parameters
will determine which of the DC bias inverter configurations
will be implemented. A common method of optimization is to
quantify the parameters of power and speed as a single
figure of merit, such as a product or a ratio. Optimization
is then achieved by maximizing or minimizing the appropriate
figure of merit.
Power-delay product is one such figure of merit.
It is simply the power consumed by a logic
circuit multiplied by the propagation delay of the signal
from input to output. As expected, the design that most
efficiently balances the trade-off between speed and power
consumption will yield the lowest power-delay product in
transient testing.
The ratio of speed to power provides a similar
figure of merit, but speed measurements are not as clearly
defined as delay measurements. Therefore, in the interest
of optimizing this design for speed, a definition of maximum
switching frequency will now be established. The maximum
reliable frequency is defined as the maximum switching
frequency of the logic input signal for which a maximally
loaded output signal consistently traverses 90% of the 0.5
volt range of logic.
c) Transient Analysis Procedures
For an accurate evaluation of logic circuit
performance, it is necessary to provide a realistic input
signal and a worst-case output load. Here, the term load
implies driving four inverters in parallel. To achieve a
realistic test environment, the test circuit of Figure (3-6)
was designed. Specifically, note the location of gates A
and B. Their input and output signals will be measured to
analyze performance with a fanout of one and four,
respectively.
It is expected that the use of a reference voltage
at the differential input of the inverter will cause the
inverted and non-inverted output signals to respond
differently. As a result, two gate topologies are analyzed
for each of the valid DC bias configurations from Table
(3-1). The first gate topology is a single output inverter
from which the inverted output signal is measured. The
second is a complementary output inverter from which the
non-inverted output signal is measured. Conveniently, these
two configurations also represent the alternating signal
pattern which will characterize the adder circuits later in
this chapter.
Initially, the appropriate logic delays are
measured at gate A and gate B in order to collect data for
the cases of minimum and maximum loads, respectively. The
worst-case delay is then multiplied by the average power per
gate to obtain a power-delay product. This is done for both
the inverted and the non-inverted output signals,
providing separate power-delay product terms. Their sum
forms a composite power-delay product. The composite
power-delay product is a figure of merit which effectively
represents the implementation of the two gate topologies in
series.
Finally, the switching period of the input logic
is decremented for successive tests in order to determine
the shortest period for which the output signal of a loaded
gate (gate B) would consistently traverse the full range of
logic (between high and low). This quantity has been
defined in the previous section as the maximum reliable
frequency (MRF). For each configuration, the maximum
reliable frequency is divided by the average power per gate
to obtain a speed-power ratio (GHz/mW). The presence of a
secondary load provides confirmation that consecutive loads
can be successfully driven when the primary load is driven
at its maximum reliable frequency.
d) Summary of Results
Transient analysis confirms the non-symmetric
behavior of the inverted and non-inverted output signals.
Therefore, Tables (3-2a) and (3-2b) provide details of their
respective delay measurements. Specifically, the
high-to-low transition of the non-inverted output signal represents
the worst-case transition.

Bias      Tprop   Tprop   Current    Power      Maximum
Current   L-H     H-L     per Gate   per Gate   Power-Delay
(mA)      (ps)    (ps)    (mA)       (mW)       Product (mW-ps)
0.1       42      255     0.81       2.03       518
0.25      56      48      0.97       2.42       136
0.5       33      26      1.28       3.20       106
0.75      23      26      1.59       3.99       104
1         17      26      1.88       4.69       122
1.5       13      27      2.38       5.94       160
Table 3-2a. Power-Delay Data for the Inverted Signal.
Single output topology with practical current sources and a
fanout load of four.

Bias      Tprop   Tprop   Current    Power      Maximum
Current   L-H     H-L     per Gate   per Gate   Power-Delay
(mA)      (ps)    (ps)    (mA)       (mW)       Product (mW-ps)
0.1       212     82      1.45       3.63       770
0.25      61      88      1.64       4.10       361
0.5       27      63      2.02       5.04       318
0.75      23      46      2.31       5.78       266
1         19      41      2.63       6.56       269
1.5       18      40      3.09       7.74       309
Table 3-2b. Power-Delay Data for the Non-Inverted Signal.
Complementary output topology with practical current
sources and a fanout load of four.
The overall performance of each DC bias
configuration is summarized in Table (3-3). The power-delay
product and speed-power ratio are normalized to simplify
comparison. Figure (3-7) illustrates the maximization curve
for the speed-power ratio, while Figure (3-8) shows the
minimization curve for the power-delay product.
Clearly, the 0.75mA configuration proves to be the
optimum design — maximizing the speed-power ratio while
minimizing the power-delay product. Furthermore, it
provides for a maximum reliable frequency of 8.7 GHz. This
is more than suitable to achieve the 5 GHz maximum clock
frequency desired in Chapter V (for the maximally pipelined
multiplier implementation).
Bias      Maximum       Normalized    Maximum     Normalized
Current   Composite     Composite     Reliable    Speed-Power
(mA)      Power-Delay   Power-Delay   Frequency   Ratio
          Product       Product       (GHz)
0.1       467           3.48          n/o         n/a
0.25      144           1.34          5.30        0.86
0.5       96            1.14          7.10        0.94
0.75      72            1.00          8.70        1.00
1         67            1.06          9.09        0.92
1.5       67            1.27          11.10       0.96
Table 3-3. Summary of Transient Analysis Results.
Composite Power-Delay Product and Speed-Power Ratio.
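
The normalized columns of Table (3-3) can be approximated from
the data in Tables (3-2a) and (3-2b). The sketch below rests on
two assumptions that are inferred from the numbers and are not
stated in the text: the composite power-delay product is taken
as the sum of the maximum power-delay products of the inverted
and non-inverted signals, and the speed-power ratio is taken as
the maximum reliable frequency divided by the power per gate of
the complementary output topology. Both are normalized to the
0.75mA configuration. The 0.1mA row is left out because no
maximum reliable frequency is listed for it.

```python
# bias mA -> (max PDP inverted, max PDP non-inverted,
#             complementary-output power mW, max reliable frequency GHz)
# Values copied from Tables 3-2a, 3-2b and 3-3.
DATA = {
    0.25: (136, 361, 4.10, 5.30),
    0.5: (106, 318, 5.04, 7.10),
    0.75: (104, 266, 5.78, 8.70),
    1.0: (122, 269, 6.56, 9.09),
    1.5: (160, 309, 7.74, 11.10),
}
REFERENCE_BIAS = 0.75  # configuration used as the normalization reference


def normalized_metrics(data, ref):
    """Return (normalized composite PDP, normalized speed-power ratio)."""
    # Assumed definition: composite PDP = inverted PDP + non-inverted PDP.
    composite = {b: inv + non for b, (inv, non, _, _) in data.items()}
    # Assumed definition: speed-power ratio = MRF / complementary power.
    speed_power = {b: mrf / power for b, (_, _, power, mrf) in data.items()}
    norm_pdp = {b: v / composite[ref] for b, v in composite.items()}
    norm_spr = {b: v / speed_power[ref] for b, v in speed_power.items()}
    return norm_pdp, norm_spr


norm_pdp, norm_spr = normalized_metrics(DATA, REFERENCE_BIAS)

if __name__ == "__main__":
    for bias in DATA:
        print(f"{bias} mA: PDP {norm_pdp[bias]:.2f}, speed-power {norm_spr[bias]:.2f}")
```

With these assumptions the computed values agree with the
tabulated normalized columns to within about 0.01, which is
consistent with rounding of the underlying measurements.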
Figure 3-7. Results of Transient Analysis:
Normalized Speed-Power Ratio of Inverter Configurations.

Figure 3-8. Results of Transient Analysis:
Normalized Power-Delay Product of Inverter Configurations.
5. Final Design Summary: Inverter
The final design for the CML inverter/buffer circuit is
illustrated in Figure (3-9). The applicable design and
performance parameters have been summarized in Table (3-4).
Here, the data represents performance when the design is
implemented with the 0.75mA practical current source from
Chapter III-E. Also note that when complementary output
signals are not required, the unused output buffer stage can
be excluded to conserve power and minimize the device count.
CML Inverter
Design and Performance Parameters

Rgain:   750 Ω
Rbuf:    2000 Ω
Ibias:   0.75 mA
NML:     0.13V (26% Vswing)
NMH:     0.14V (28% Vswing)
Power:   5.78 mW (complementary output)
         3.99 mW (single output)

           Inverted Signal            Non-inverted Signal
Delays     Fanout = 1   Fanout = 4    Fanout = 1   Fanout = 4
tp(H-L)    14ps         26ps          39ps         46ps
tp(L-H)    17ps         23ps          18ps         23ps
tfall      19ps         41ps          87ps         90ps
trise      48ps         61ps          45ps         60ps
Table 3-4. CML Inverter Design and Performance Parameters.
Figure 3-9. Final Design of the CML Inverter.
C. LOGIC NOR GATE DESIGN
1. Overview and Analysis
The circuit topology for a two-input CML NOR gate is
presented in Figure (3-10). There is little that differs
from the inverter, which accurately suggests that the
analysis here will be extremely similar to the previous
section. In fact, with regard to both circuit topology and
performance analysis, the only distinguishing feature is the
second logic input in parallel with the first.
Consider the functionality of the two parallel inputs A
and B. If either of them is a logic high, then the left
side of the differential pair is on and the NOR output is
pulled low. Conversely, if both inputs A and B are low,
then the NOR output is high. On the opposite side of the
differential pair is the complementary output, the OR
function. If another input transistor were added in
parallel to the existing two, it would be a three-input
OR/NOR gate, and similarly for a fourth input.

Figure 3-10. Circuit topology for a two-input OR/NOR
logic gate.
Despite the drastic change in functionality, the
presence of several logic inputs in parallel to the original
logic input induces no fundamental change to the DC bias of
the circuit. As a result, the DC bias conditions for the
optimized inverter circuit are directly applied to the final
design of the NOR circuit.
2. Final Design Summary: OR/NOR
With the exception of having multiple parallel
transistors for multiple logic inputs, the final design for
the CML OR/NOR logic circuit is identical to that of the
inverter. As for its performance, the noise margins and
delay measurements vary only slightly in response to the
"multiple trigger" effect of simultaneous parallel inputs.
The design parameters are identical to the inverter and
therefore are not repeated. However, a selection of the
performance parameters has been provided in Table (3-5) in
order to demonstrate the variation of performance based upon
the input configuration.
Conveniently, the NOR gate constitutes a nearly identical
capacitive load to the inverter, with maximum delay
differences of less than 1.5ps. It exhibits the same delay
variations between its OR and NOR signals as the inverter
does between the inverted and non-inverted signals. And
finally, as with the inverter, when both of the
complementary outputs of the OR/NOR gate are not required,
the unused output buffer stage is omitted to conserve
power and minimize the device count.
CML OR/NOR Gate
Delay Performance Parameters

2-Input OR/NOR Gate
Single Input Transition
                        NOR Signal                OR Signal
                        Fanout = 1   Fanout = 4   Fanout = 1   Fanout = 4
              tp(H-L)   16ps         29ps         40ps         47ps
              tp(L-H)   24ps         29ps         19ps         23ps

3-Input OR/NOR Gate
Single and Simultaneous Input Transitions
                        NOR Signal                OR Signal
                        Fanout = 1   Fanout = 4   Fanout = 1   Fanout = 4
Single Input  tp(H-L)   19ps         28ps         41ps         48ps
Transition    tp(L-H)   29ps         34ps         18ps         23ps
Simultaneous  tp(H-L)   17ps         3?ps         40ps         47ps
Input         tp(L-H)   43ps         48ps         11ps         16ps
Transition

4-Input OR/NOR Gate
Single Input Transition
                        NOR Signal                OR Signal
                        Fanout = 1   Fanout = 4   Fanout = 1   Fanout = 4
Single Input  tp(H-L)   21ps         3?ps         41ps         48ps
Transition    tp(L-H)   33ps         39ps         18ps         23ps
Table 3-5. Summary of OR/NOR Gate Delay Performance.
3. Implementation of the AND Function
In current-mode logic, the AND function is implemented
by simply inverting the input signals and reversing the
polarity designation of the output nodes. In actual
practice, inverters and OR/NOR gates are sufficient to
realize any logic function. Thus, for the sake of
simplicity, AND gates were not constructed as a separate
logic circuit. Rather, all logic functions were
deliberately expressed as functions of inverters and OR/NOR
gates.
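
The logic-level behavior described above can be checked with a
small truth-table model. This is a minimal sketch of the logic
functions only; it says nothing about the circuit-level
behavior, and the function names are illustrative. It assumes
that "inverting the input signals and reversing the polarity
designation of the output nodes" amounts to taking the NOR
output of a gate that is fed the complemented inputs.

```python
from itertools import product


def or_nor(*inputs):
    """Return the complementary (OR, NOR) outputs of an n-input gate."""
    or_out = any(inputs)
    return or_out, not or_out


def and_via_nor(a, b):
    """AND realized as the NOR output of a gate driven by A' and B'."""
    _, nor_out = or_nor(not a, not b)
    return nor_out


def check_and_equivalence():
    """Verify the De Morgan identity A*B = (A' + B')' for all inputs."""
    return all(
        and_via_nor(a, b) == (a and b)
        for a, b in product([False, True], repeat=2)
    )


if __name__ == "__main__":
    print("AND via NOR holds for all inputs:", check_and_equivalence())
```

The OR output of the same gate then carries the complement of
the AND result, so no separate AND circuit is needed.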
D. ADDER DESIGN
1. Implementation
Two-input and three-input adders are required to
construct the carry-save adders and carry-completion adders
of the multiplier (Chapter V). Equipped with a sufficient
set of logic gates, this is an elementary task. The
sum-of-minterms expressions for the sum and carry bits of a
two-input adder are shown in Equations (3-8) and (3-9),
respectively.
(3-8)   $\mathrm{Sum}|_{2input} = X'Y + XY'$
(3-9)   $\mathrm{Carry}|_{2input} = XY$
Employing De Morgan's Theorem, these expressions can be
manipulated into the equivalent expressions for
implementation with OR/NOR gates, as shown in Equations
(3-10) and (3-11).
(3-10)  $\mathrm{Sum}|_{2input} = (X'+Y)' + (X+Y')'$
(3-11)  $\mathrm{Carry}|_{2input} = (X'+Y')'$
This adder design requires the complementary logic inputs be
provided in order to eliminate the need for inverters and a
third level of logic delay. Such a requirement is trivial
because complementary signals are potentially available at
the output of each CML logic gate. Figure (3-11)
illustrates the two-input adder.
Figure 3-11. Two-input adder with identification of the
critical path.
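
The OR/NOR form of the two-input adder can be verified against
ordinary binary addition. The sketch below is a logic-level
model only. It assumes that the sum is formed by two first-level
NOR gates followed by the OR output of a second-level gate, and
that the carry is a single NOR of the complemented inputs; the
function names are illustrative.

```python
def nor(*inputs):
    """Logic-level NOR of any number of 0/1 inputs."""
    return int(not any(inputs))


def two_input_adder(x, y):
    """Return (sum, carry) using only OR/NOR operations.

    The complemented inputs are assumed to be available, as they
    would be from the complementary outputs of a preceding gate.
    """
    xn, yn = 1 - x, 1 - y
    term_a = nor(xn, y)            # (X'+Y)'  equals X AND Y'
    term_b = nor(x, yn)            # (X+Y')'  equals X' AND Y
    total = int(term_a or term_b)  # second level: OR output
    carry = nor(xn, yn)            # (X'+Y')' equals X AND Y
    return total, carry


def adder_matches_arithmetic():
    """Compare the gate model with integer addition for all inputs."""
    return all(
        two_input_adder(x, y) == ((x + y) % 2, (x + y) // 2)
        for x in (0, 1)
        for y in (0, 1)
    )


if __name__ == "__main__":
    for x in (0, 1):
        for y in (0, 1):
            print(x, y, two_input_adder(x, y))
```

The sum bit passes through two levels of logic while the carry
bit passes through one, which is why the sum path is the longer
of the two in this model.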
2. Performance Analysis
Proper functioning of each adder was verified for all
possible input combinations. Notice that the critical path
for each adder is identified in Figures (3-11) and (3-12).
For the two-input adder, the critical path flows through two
levels of logic to produce the sum bit. The worst case
transition is from a (1/0) or a (0/1) input for (X/Y) to a
(1/1) input. This is owing to the fact that the worst-case
gate delay is the high-to-low transition of the OR output
when it has been driven by the high-to-low output transition
of the preceding NOR gate. Based upon the data from Table
(3-5), the critical path delay equals 63 picoseconds. This
provides a good match with a simulation of the critical path
delay which yields 60 picoseconds.
Similarly, for the three-input adder the critical path
delay is calculated to be 67 picoseconds along the path
illustrated in Figure (3-12). This was validated with a
simulation measurement of 66 picoseconds.
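
The calculated critical path delays can be reproduced by adding
two entries from Table (3-5). Which entries were summed is not
stated, so the pairing below is an inference: the high-to-low
delay of a NOR output driving a single load, followed by the
high-to-low delay of an OR output driving a fanout of four.
The dictionary layout and names are illustrative.

```python
# Delays in picoseconds copied from Table 3-5.
# adder -> (NOR tp(H-L) at fanout 1, OR tp(H-L) at fanout 4)
# Assumption: these are the two gate delays on each critical path.
CRITICAL_PATH_GATES = {
    "two-input": (16, 47),
    "three-input": (19, 48),
}

# Simulated critical path delays quoted in the text (ps).
SIMULATED_PS = {"two-input": 60, "three-input": 66}


def critical_path_ps(adder):
    """Sum the gate delays along the assumed critical path."""
    return sum(CRITICAL_PATH_GATES[adder])


if __name__ == "__main__":
    for adder in CRITICAL_PATH_GATES:
        calc = critical_path_ps(adder)
        sim = SIMULATED_PS[adder]
        print(f"{adder}: calculated {calc} ps, simulated {sim} ps")
```

Under this pairing the sums match the calculated figures quoted
above, and each lies within a few picoseconds of the simulated
value.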
E. PRACTICAL CURRENT SOURCE DESIGN
1. Circuit Topologies
Up to this point, each logic element has been designed
using an ideal current source. In order to validate the
performance of these designs for actual implementation, it
is necessary to construct a practical current source. There
are effectively three circuit configurations which provide
transistor bias conditions for establishing a current
source. These three topologies are presented in Figure
(3-13). In each configuration the amount of bias current
drawn is regulated by and directly proportional to the
magnitude of the current drawn by the base of Q_SOURCE.
Figure 3-13. Current Source Topologies.
2. Performance Analysis
In order to analyze and compare the performance of each
current source, three simple 0.75mA current sources are
designed — one using each topology. Each is then
implemented as the practical current source for the
inverter/buffer circuit of Chapter III-B-5. Their relative
performance is evaluated based upon the following design
goals:
• Minimize the operational limitations due to
frequency response
• Approximate the performance of an ideal current
source
• Minimize the cost of implementation (power and
device count)
The performance of each configuration is illustrated in
Figures (3-14a) and (3-14b). Notice that each inverted
output signal drops below the desired 1.2 volt voltage low
level when making the transition from high-to-low. This
"dip" results from reversing the polarity of the
differential pair input signals, which induces a brief drop
in the bias voltage at the positive (POS) terminal of the
current source. A delayed return to the proper bias voltage
is then governed by the RC characteristics of the Q_SOURCE
collector. This delay is particularly observed in the
transient performance of the topologies in Figures (3-13a)
and (3-13b).
3. Final Design: Current Source
By process of elimination, the current mirror topology
of Figure (3-13c) is the only design suitable for driving a
logic device family that is capable of switching frequencies
above 8 GHz. Unfortunately, the current mirror also incurs
the largest cost in terms of power and device count. Thus,
to reduce the amount of current "lost" through the left side
[Figure: Inverted Output (Voltage); Non-Inverted Output (Voltage)]
a) 0.75mA Current Source
The final current source design for a 0.75mA
current source is shown in Figure (3-15). The DC transfer
characteristic of this source, Figure (3-16), illustrates
that the bias current drawn is a function of the
collector-emitter voltage (V_CE) at Q_SOURCE. More
specifically, it is seen that V_CE must be greater than
0.3 volts in order to ensure that 0.75mA is drawn. This
represents a critical design parameter for establishing a
proper DC bias on the current source.
Figure 3-15. Final Design of a Practical 0.75mA Current
Source.
The 0.75mA current source design is validated by a
direct performance comparison with an ideal current source.
Figure (3-17) compares the output signals for a maximally
loaded inverter/buffer circuit when driven by both
an ideal and a practical current source. It can be seen
that the transition delay resulting from the practical
source is consistently ahead of the ideal source for the
inverted output signal by a margin of five picoseconds.
Meanwhile, the non-inverted output signal of the practical
current source maintains the status quo by matching the pair
delay of the ideal source. In a design that is
characterized by alternating stages of positive and negative
logic signals, it is reasonable to expect that the
implementation of the practical current source would yield a
slight improvement over the ideal source.
b) 2.0mA Current Source
Exercising a little foresight into the conclusions
of Chapter IV, it is convenient here to present the design
of the 2mA practical current source. This design is a
simple modification to the 0.75mA design, implemented by
decreasing the resistance from 5250 Ω to 2020 Ω. This
allows an increase of current flow into the base of Q_MIRROR
and produces the transfer characteristic shown in Figure
(3-18). Again, a bias voltage at Q_MIRROR must ensure that
V_CE is greater than or equal to 0.3 volts in order to
achieve proper functioning of the current source.
The 2mA current source is also validated by
testing it against an ideal current source while driving a
maximally loaded D-type CML Latch. The respective output
signals, Q and QN, are plotted in Figure (3-19). It can be
seen that the output signal transition delay resulting from
the practical source compares favorably with the delay
associated with the ideal source. However, the ideal-driven
output signals consistently cross the reference voltage of
1.45 volts approximately 10 picoseconds ahead of the
practical-source-driven output signals. Thus, the effective
margin of error for approximating the practical source with
an ideal source is 10 picoseconds. In a synchronous
pipelined architecture, this simply adds between 10 and 20
picoseconds to the minimum clock period.

Figure 3-18. Transfer Characteristic of the 2.0mA
Current Source.
Figure 3-19. Comparison of Latch Performance, Practical
Current Source vs. an Ideal Source.
In summary, a sufficient set of logic circuits is now
in hand, along with a practical current source with which to
drive them. Thus, the combinational logic for a multiplier
can be fully implemented. However, based upon the intent of
pipelining this multiplier, it is necessary to construct the
clock-driven devices that will control the flow of data.
Chapter IV presents this discussion with the design of a D-
type latch, a D-type flip-flop, and a clock driver.
IV. HBT CML LATCH AND REGISTER DESIGN
A. LATCH DESIGN
1. Circuit Topology
a) Two Latch Topologies
The most common latch design is based upon the
logic level schematic illustrated in Figure (4-1) . Design
of this latch simply requires the proper connection of four
NOR gates with the appropriate clock and logic input
signals. The cumulative power consumed by the four NOR
gates constitutes a significant cost (based upon the four
milliwatt per gate design from Chapter III).
Figure 4-1. D-type Latch constructed from NOR gates.
However, the unique characteristics of CML provide an
alternative design that yields comparable performance at a
significant savings in power. This CML latch design is
illustrated in Figure (4-2). Due to the relative
unfamiliarity of this design, a brief functional description
follows.
Figure 4-2. CML D-type Latch Design (After Jalali).
b) Functional Description of a CML Latch
Referencing Figure (4-2), the source labeled I_bias
draws a constant current through the lower (clock-driven)
differential pair. Complementary clock signals provide the
differential inputs. Depending upon the phase of the clock
signal, current is drawn from one of the two cascaded
differential pairs, i.e. either the track pair or the latch
pair. Consider the case when the CLK signal is high.
Current will be drawn from the "track" pair while the
"latch" pair is simultaneously cut off. In this case the
latch is considered "open" or "transparent," and the track
pair behaves like the differential pair configuration of the
inverter/buffer logic gate. Thus, the logic inputs of the
track pair are mirrored at the opposite collector. However,
there is one exception. In the CML latch, complementary
logic inputs are employed rather than a logic reference
voltage. For a single logic input, complementary input
signals enhance noise immunity and provide for symmetric
waveforms at the complementary output ports.
Now, consider when the CLK signal transitions from
high to low. The track pair is cut off as current is
switched to the latch pair via the right side of the
clock-driven differential pair. Herein lies the significance
of the common collector nodes shared by the track pair and
latch pair. Due to the high impedance nature of the HBT
collector-base junction, the voltage level at the collector
is slow to change and lingers long enough to bias the latch
pair for essentially identical operation and output levels.
This effectively latches the logic levels from the track
pair to the latch pair. (Jalali, 1995)
Regardless of the state of the latch, the logic
levels at the common collector (of the track and latch
pairs) are reflected at the latch output ports via the same
output buffer configuration presented in Chapter III.
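
The track and latch behavior described above can be summarized
with a behavioral model. This is a sketch of the logic-level
behavior only, with no timing, noise, or circuit detail; the
class and method names are illustrative and are not taken from
the design.

```python
class DLatch:
    """Level-sensitive D latch: open while CLK is high, holding while low."""

    def __init__(self, q=0):
        self.q = q  # stored logic level, standing in for the shared collector nodes

    def step(self, d, clk):
        """Apply one input sample and return the outputs (Q, QN)."""
        if clk:
            # Track pair conducts: the output follows the data input.
            self.q = d
        # Otherwise the latch pair conducts and the stored level is held.
        return self.q, 1 - self.q


def run(latch, samples):
    """Drive the latch with (d, clk) samples and collect the Q output."""
    return [latch.step(d, clk)[0] for d, clk in samples]


if __name__ == "__main__":
    # D changes while the clock is low, but Q holds the last tracked value.
    print(run(DLatch(), [(1, 1), (0, 0), (1, 0), (0, 1)]))
```

In this model the output changes only while the clock is high,
which corresponds to the "open" or "transparent" state of the
latch.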
2. Initial Conditions and Design Parameters
The CML latch presents the most demanding DC bias
requirements of any circuit designed for this project. As a
result, no voltage cap has been placed upon its design.
Rather, the initial design goal is to determine the minimum
necessary DC bias conditions for proper operation of the
latch. The resulting "voltage budget" will define the
voltage relationships for proper operation of each
transistor and differential pair. It will further establish
important specifications for supply voltage and logic signal
levels. Derivation of the "voltage budget" is presented as
part of the DC analysis in the following section.
The minimum available transistor area (1x1 micron) is
employed for optimum switching speeds, and the fanout
requirement remains at four. These specifications are
consistent with the logic circuits designed in the previous
chapter.
3. DC Analysis
a) DC Bias Conditions / The Voltage Budget
For proper operation of the CML latch, each
differential pair of transistors must be properly biased.
Knowing the requirements imposed by proper DC bias
conditions will reveal the following necessary design
parameters:
• Required minimum supply voltage
• Required minimum voltage level for
representing the positive (high) phase of
the clock
• Required minimum voltage level for
representing a logic high state
• Maximum allowable signal range between
high and low logic levels
To facilitate analysis, the CML latch topology is divided
into three levels of operation, as illustrated in Figure
(4-3). Level one (the bottom level) is a practical current
source. Implementing the design from Chapter III-E, the
current source requires a minimum of V_Ibias volts at node X
in order to sustain the desired level of bias current.
(4-1)   $V_X \geq V_{Ibias}$
Figure 4-3. Voltage Budget for the CML Latch
This requirement imposes the following operational condition
upon the "driving" base voltage of the Q1/Q2 differential
pair (i.e. the high CLK voltage).
(4-2)   $V_{CLK(hi)} \geq V_X + V_{BE(on)}|_{Q12}$
A further consideration is the proper biasing of
the Q1/Q2 collectors for operation in the active region.
This places the following operational condition upon the
collector voltages (nodes Y1 and Y2).
(4-3)   $V_Y \geq V_{CLK(hi)} - V_{BE(on)}|_{Q12} + V_{CE(sat)}$
where V_Y represents either V_Y1 or V_Y2.
Only the tracking differential pair (connected to node Y1)
will be addressed at this point because it is driven by
lower voltage levels which impose more restrictive DC bias
conditions on Y1 than Y2.
Once again, a minimum voltage requirement at the
common emitter of the Q3/Q4 differential pair presents a
constraint on the minimum steady-state driving voltage at
each base. This driving voltage corresponds to a logic
high input voltage. Thus, the voltage level selected to
represent a logic high must satisfy the following
relationship.
(4-4)   $V_{LOGIC(hi)} - V_{BE(on)}|_{Q34} \geq V_{Y1}$
Finally, three conditions must be satisfied at the
collectors of the track pair. The first condition is that
transistors Q3 and Q4 must operate in the active mode. This
requires the following familiar relationship.
(4-5)   $V_{C(low)} \geq V_{LOGIC(hi)} - V_{BE(on)}|_{Q34} + V_{CE(sat)}$
where V_C represents either V_C1 or V_C2.
Similarly, the second condition requires that the
transistors of the latch pair also operate in the active
mode. This condition differs from the one above because the
latch pair is driven by the collector voltage levels of the
track pair.
(4-6)   $V_{C(low)} \geq V_{C(hi)} - V_{BE(on)}|_{Q56} + V_{CE(sat)}$
Defining the voltage range of the logic signal (V_range) as
the difference between high and low voltage levels, Equation
(4-6) is manipulated to show the maximum value.
(4-7)   $[V_{range}]_{max} = V_{BE(on)}|_{Q56} - V_{CE(sat)}$
Knowing the transistor parameters for V_BE(on) and V_CE(sat)
from Chapter II, [V_range]max is 0.5 volts.
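
The step from the latch-pair condition to the maximum signal
range can be written out explicitly. The following is a worked
sketch, assuming the latch-pair condition reads
V_C(low) >= V_C(hi) - V_BE(on) + V_CE(sat) and taking the 1.5mA
column of Table (4-1) (V_BE(on) = 0.80 volts, V_CE(sat) = 0.30
volts) as sample values.

```latex
\begin{align*}
V_{C(low)} &\geq V_{C(hi)} - V_{BE(on)}\big|_{Q56} + V_{CE(sat)} \\
V_{C(hi)} - V_{C(low)} &\leq V_{BE(on)}\big|_{Q56} - V_{CE(sat)} \\
[V_{range}]_{max} &= V_{BE(on)}\big|_{Q56} - V_{CE(sat)}
  = 0.80\,\mathrm{V} - 0.30\,\mathrm{V} = 0.50\,\mathrm{V}
\end{align*}
```

The other bias configurations in Table (4-1) give slightly
different limits because their measured V_BE(on) and V_CE(sat)
values differ.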
The third condition is that the input and output
logic levels must match. A high logic input at the
transistor base must drive the collector voltage relatively
low (V_C(low)) such that it produces a matched low logic
output at QN. Likewise, the inverse must also be true. The
following equations express these requirements.
(4-8)   $V_{LOGIC(hi)} - V_{range} = V_{C(low)} - V_{BE(on)}|_{buffer}$
(4-9)   $V_{LOGIC(low)} + V_{range} = V_{C(hi)} - V_{BE(on)}|_{buffer}$
Based upon these relationships the maximum collector voltage
is determined, which further dictates the minimum required
supply voltage for proper DC operating conditions.
The voltage budget relationships are summarized in
Figure (4-3) . Actual values have been determined for four
latch configurations as listed in Table (4-1). The
essential difference is the magnitude of the bias current.
An economical margin of safety has been built into these
values.
Notice that these margins have been allowed to
vary slightly between configurations in order to maintain
uniform values for clock and logic signal values. This
greatly simplifies the comparative testing of the four
configurations. The design margins are highlighted to
illustrate the negligible deviation incurred. All four
configurations meet and exceed the required DC bias
conditions. In the event that uniform design margins had
been used such that the supply voltages were optimized, the
difference would have been trivial — within plus or minus
0.1 volt or 4% of the 2.5 volt supply voltage.
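
Several rows of Table (4-1) follow from the voltage budget
relationships by simple arithmetic. The sketch below assumes
the following readings of those relationships, taken at
equality: V_Y1 = V_CLK(hi) - V_BE(on) + V_CE(sat), the minimum
logic high voltage is V_Y1 + V_BE(on), and the maximum signal
range is V_BE(on) - V_CE(sat). The input values are copied from
Table (4-1); the function and key names are illustrative.

```python
V_CLK_HI = 1.2    # volts, high clock level (Table 4-1)
V_LOGIC_HI = 1.7  # volts, actual logic high level (Table 4-1)
V_IBIAS = 0.3     # volts, minimum voltage across the current source
V_RANGE = 0.5     # volts, actual logic signal range

# bias current mA -> (V_BE(on), V_CE(sat), V_X), all in volts
CONFIGS = {
    1.0: (0.775, 0.26, 0.42),
    1.5: (0.80, 0.30, 0.40),
    2.0: (0.82, 0.31, 0.39),
    3.0: (0.857, 0.35, 0.358),
}


def voltage_budget(v_be, v_ce_sat, v_x):
    """Evaluate the assumed budget relationships for one configuration."""
    range_max = v_be - v_ce_sat          # maximum logic signal range
    v_y1 = V_CLK_HI - v_be + v_ce_sat    # minimum collector voltage at node Y1
    logic_hi_min = v_y1 + v_be           # minimum logic high input voltage
    return {
        "range_max": range_max,
        "range_margin": range_max - V_RANGE,
        "v_y1": v_y1,
        "logic_hi_min": logic_hi_min,
        "logic_margin": V_LOGIC_HI - logic_hi_min,
        "clock_margin": v_x - V_IBIAS,
    }


budget = {bias: voltage_budget(*params) for bias, params in CONFIGS.items()}

if __name__ == "__main__":
    for bias, row in budget.items():
        print(bias, {name: round(value, 3) for name, value in row.items()})
```

With these readings the computed V_Y1 and minimum logic high
values agree with the corresponding rows of Table (4-1), and
every margin is non-negative for all four configurations.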
b) DC Bias Optimization
At this point the gain resistance, buffer
resistance, and the bias current are the only undetermined
parameters. The same procedures described in the design of
the inverter/buffer circuit are employed to design four
different latch configurations based upon the specifications
determined in Table (4-1).

CML Latch Voltage Budget
for Multiple Bias Current Configurations

                                     1mA      1.5mA    2mA      3mA
Known/Measured Parameters:
  V_BE(on)                           0.775    0.80     0.82     0.857
  V_CE(sat)                          0.26     0.30     0.31     0.35
  V_I-bias                           0.3      0.3      0.3      0.3
Determined Parameters:
  [V_RANGE]max                       0.515    0.5      0.51     0.507
  Margin for Range of
    Logic Signal Voltage             0.015    0.0      0.01     0.007
  [V_RANGE]actual                    0.5      0.5      0.5      0.5
  V_CC                               2.5      2.5      2.5      2.5
  Margin between V_C(hi) and V_CC    0.075    0.025    0.025    0.0
  V_C(hi)                            2.425    2.475    2.475    2.5
  [V_LOGIC(hi)]actual                1.7      1.7      1.7      1.7
  Margin for Differential
    Logic Signal Switching           0.24     0.2      0.19     0.15
  [V_LOGIC(hi)]min                   1.46     1.5      1.51     1.55
  V_Y1                               0.685    0.7      0.69     0.693
  V_CLK(hi)                          1.2      1.2      1.2      1.2
  V_X                                0.42     0.4      0.39     0.358
  Margin for Differential
    Clock Signal Switching           0.12     0.10     0.09     0.058
  V_I-bias                           0.3      0.3      0.3      0.3
Based upon a 0.5 volt signal swing for both logic and clock signals:
  V_LOGIC(low)                       1.2      1.2      1.2      1.2
  V_CLK(low)                         0.7      0.7      0.7      0.7
Table 4-1. Voltage Budget for the CML D-type Latch.
Noise margins are obtained from the DC transfer
characteristic of each. These results are included in Table
(4-2). With maximum fanout loads on both output ports, all
four CML latch designs meet the requirements of a 0.5 volt
output signal range and 0.1 volt (20%) balanced noise
margins. Therefore, all four CML designs are considered in
transient analysis.
Bias      Gain       Buffer     High Noise Margin   Low Noise Margin    Logic
Current   Resistor   Resistor   No Load / Loaded    No Load / Loaded    Signal Range
(mA)      (Ohms)     (Ohms)     (Volts)             (Volts)             (Volts)
1         600        2000       0.14 / 0.13         0.13 / 0.13         0.49
1.5       410        2000       0.13 / 0.13         0.13 / 0.13         0.51
2         310        2000       0.12 / 0.12         0.12 / 0.12         0.51
3         210        2000       0.11 / 0.11         0.11 / 0.11         0.52
Table 4-2. Results of DC Analysis.
4. AC/Transient Analysis
a) Performance Parameters
Three parameters are of primary interest in
evaluating the transient performance of a latch: setup
time, hold time, and logic propagation delay. Figure (4-4)
illustrates how each of these relates to the events on a
transient plot. In the absence of a reference voltage,
differential signal references are taken as the point where
the complementary signals cross.

Figure 4-4. Illustration of setup time, hold time, and
propagation delay.
As a figure of merit for optimizing the trade-off
between speed and power, a power-delay product is calculated
using the values defined here. The figure for power
represents the average power, and the figure for delay
represents the sum of the setup time and the worst-case
propagation delay time.
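
The delay figure defined above can be illustrated with the
latch data that appears later in Table (4-4). This is a minimal
sketch, assuming the total delay is the setup time plus the
larger of the two propagation delays and that the power-delay
product is the average power multiplied by that delay; the
names are illustrative.

```python
LATCH_POWER_MW = 9.0  # average power of the final latch design (Table 4-4)

# fanout load -> (setup time ps, tprop H-L ps, tprop L-H ps), from Table 4-4
LATCH_TIMING = {
    1: (33, 27, 0),
    2: (33, 28, 1),
    3: (34, 31, 2),
    4: (35, 34, 3),
}


def total_delay_ps(setup, t_hl, t_lh):
    """Setup time plus the worst-case propagation delay."""
    return setup + max(t_hl, t_lh)


def power_delay_product(power_mw, delay_ps):
    """Figure of merit in mW-ps; smaller is better."""
    return power_mw * delay_ps


delays = {fanout: total_delay_ps(*row) for fanout, row in LATCH_TIMING.items()}

if __name__ == "__main__":
    for fanout, delay in delays.items():
        pdp = power_delay_product(LATCH_POWER_MW, delay)
        print(f"fanout {fanout}: {delay} ps, {pdp:.0f} mW-ps")
```

The computed delays match the "Max Total Delay" column of
Table (4-4), which supports this reading of the definition.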
b) Analysis Procedures
For an accurate evaluation of latch performance,
it is necessary to provide realistic logic and clock input
signals as well as realistic worst-case fanout loads.
Furthermore, to ensure and demonstrate the proper DC bias
design of the CML latch, practical current sources are
implemented in testing.
In addition to the four CML latch designs, the
traditional logic latch is also tested. Each design is
substituted into the test circuit to determine the
performance parameters described in the previous section.
c) Summary of Results
The results of transient analysis are summarized
in Table (4-3). The 1.5mA configuration achieves the
minimum power-delay product as illustrated in Figure (4-5).
Note, however, that the 2mA configuration performs at a
Figure 4-5. Results of Transient Analysis:
Normalized Power-Delay Product of Latch Configurations.
indicates a capacitive spike at the mutual collector nodes
of the latch and track differential pairs. This results
each time the clock-driven pair switches current to the
opposite side. It is not expected that this noise will
adversely affect the ability of the CML latch to drive
reliable logic levels. However, in the event that the CML
latch is overcome by noise, the NOR latch configuration is a
viable alternative because it does not experience this
problem.
Finally, the switching activity of the
differential pair also induces variations in the current
drawn from the supply voltage. Figure (4-7) illustrates
these power rail transients for a single CML latch. The
abrupt, periodic reduction in supply current coincides with
the brief transition of current from one side of the
differential pair to the other, driven by the switching of
the clock signal. In the worst case, this downward
transient spike reaches a current level that is 18% below
the average. It is also evident that slightly more current
is drawn when the latch is latched because the latch pair is
driven by a higher input voltage than the track pair. This
results in a higher voltage and thus more current being
drawn at the practical current source.

Figure 4-7. Power Rail transients due to the switching
activity of a single CML Latch.
5. Special Latch Implementations
In the course of this design project, two special
implementations of the CML latch have been designed. The
first implements a logic reference voltage at one of the
logic inputs of the latch. The purpose here is to eliminate
the requirement for complementary logic signals at the
multiplier input.
The second special implementation also uses a reference
voltage; however, it does so with the purpose of conducting
a logic function at the input to the latch. Although this
circuit functions well, it actually results in slightly
greater delays due to the increased collector capacitance at
the tracking pair. As a result, it is not utilized in the
multiplier circuit.
6. Final Design Summary: D-Latch
The final design for the CML latch is implemented with
the parameters listed in Table (4-4) using the topology
presented previously in Figure (4-2). Also listed are the
transient performance parameters for operation at each level
of fanout loading. These figures represent the performance
of the latch when it is implemented with a practical current
source and driven by a maximally loaded clock driver.
Latch
Design and Performance Summary

Rgain:   310 Ω
Rbuf:    2000 Ω
Ibias:   2 mA
NML:     0.12V
NMH:     0.12V
Power:   9.0 mW

Fanout      Setup   Hold    tprop   tprop   Max Total
Load        Time    Time    H-L     L-H     Delay
(# gates)   (ps)    (ps)    (ps)    (ps)    (ps)
1           33      9       27      0       60
2           33      10      28      1       61
3           34      10      31      2       65
4           35      10      34      3       69
Table 4-4. Final Design Summary of the D-type
CML Latch.
B. FLIP-FLOP DESIGN (D-TYPE)
1. Overview and Analysis
The D-type flip-flop is constructed from two D-type CML
latches. The two latches are connected in a master-slave
configuration such that they are latched by opposite phases
of the clock. This simple design is illustrated in Figure
(4-7) .
Figure 4-7. D-type Flip-Flop.
The flip-flop design is tested under the same
conditions of loading and input signals as discussed
previously for the latch. This testing verifies proper
function of the flip-flop design and confirms that the flip-
flop performance parameters of setup time and hold time
mirror those of the CML latch. However, due to the presence
of a second latch in the flip-flop, the propagation delays
are greater.
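
The master-slave arrangement can be sketched with the same kind
of behavioral model used for a single latch. The sketch assumes
that the master latch is open while the clock is high and the
slave latch is open while the clock is low, so that the output
updates when the clock falls. This is a logic-level
illustration only, and the class names are illustrative.

```python
class DLatch:
    """Level-sensitive D latch: open while its clock input is high."""

    def __init__(self, q=0):
        self.q = q

    def step(self, d, clk):
        if clk:
            self.q = d  # transparent: output follows the input
        return self.q, 1 - self.q


class DFlipFlop:
    """Two latches clocked on opposite phases (master-slave)."""

    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()

    def step(self, d, clk):
        """Apply one (d, clk) sample and return the outputs (Q, QN)."""
        # Assumption: the master is open while CLK is high.
        m_q, _ = self.master.step(d, clk)
        # The slave sees the inverted clock, so it is open while CLK is low.
        return self.slave.step(m_q, 1 - clk)


if __name__ == "__main__":
    ff = DFlipFlop()
    for d, clk in [(1, 1), (0, 0), (0, 1), (1, 0)]:
        print(d, clk, ff.step(d, clk))
```

Because data must pass through both latches before reaching
the output, the model also makes it plain why the propagation
delay of the flip-flop exceeds that of a single latch.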
2. Final Design Summary
The final design for the CML D-type flip-flop is
essentially the master-slave configuration of two CML
latches, as illustrated in Figure (4-7). The design
parameters of the master and slave latches remain the same
as shown in Table (4-4). The applicable performance
parameters of the flip-flop have been summarized in Table
(4-5).
Flip-Flop
Design and Performance Summary
Reference Latch Design Parameters
Power: 18 mW

Fanout      Setup   Hold    tprop   tprop   Max Total
Load        Time    Time    H-L     L-H     Delay
(# gates)   (ps)    (ps)    (ps)    (ps)    (ps)
1           33      9       49      35      82
2           33      9       53      47      86
3           34      9       52      45      86
4           35      10      54      43      89
Table 4-5. Design and Performance Summary of the
D-type Flip-Flop.
C. CLOCK DRIVER DESIGN
1. Overview
The topology of the clock driver closely resembles that
of the inverter/buffer circuit. In fact, the only necessary
modification to the inverter/buffer design is a reduction of
the output voltage range at the output buffer. This is
accomplished by a simple voltage divider that effectively
steps the voltage down to the desired voltage range between
0.7 and 1.2 volts (Figure 4-8). This voltage range is
dictated by the CML latch design.
Two performance parameters are of particular interest
in the clock driver design: fanout capability and the
symmetry of complementary output signals. Increased fanout
is desirable to reduce the number of clock drivers required.
Meanwhile, output symmetry is important to reduce clock skew
between parallel clock paths. The absence of symmetry
between the complementary output signals of the logic
circuits (in Chapter III) results from the corresponding
lack of symmetry between the input signals, i.e. the use of
a reference voltage. Therefore, the clock driver is driven
by the differential clock signals CLK and CLK-N.
2. Analysis and Results
Fanout capability is maximized by the increase of
current through the output buffer. Two further
modifications to the inverter/buffer circuit make this
possible. The first is to increase the bias current. For a
supply voltage of 2.5 volts, a practical current source of
2mA is the largest that is operable without adversely
biasing the circuit. Second, reducing the total resistance
in the output buffer draws a larger base current and
ultimately, more current is available to the output load.
For evaluation, the performance of two clock
driver configurations is measured based upon the power
consumed per load driven. The 1mA clock driver draws 5.5mA
and consumes 13.8mW while driving a maximum of two latches.
Meanwhile, the 2mA clock driver draws 6.5mA and consumes
16.3mW while driving four latches. Clearly, the 2mA clock
driver is the desired implementation.
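
The comparison of the two clock drivers reduces to the power
consumed per latch driven. A minimal sketch of that arithmetic
follows, assuming the power is the supply current multiplied by
the 2.5 volt supply; the names are illustrative.

```python
V_SUPPLY = 2.5  # volts

# bias current mA -> (total supply current mA, maximum latches driven)
CLOCK_DRIVERS = {
    1.0: (5.5, 2),
    2.0: (6.5, 4),
}


def power_mw(current_ma):
    """Total power drawn from the supply, in milliwatts."""
    return current_ma * V_SUPPLY


def power_per_load_mw(current_ma, loads):
    """Power consumed per latch driven, in milliwatts."""
    return power_mw(current_ma) / loads


per_load = {
    bias: power_per_load_mw(current, loads)
    for bias, (current, loads) in CLOCK_DRIVERS.items()
}

if __name__ == "__main__":
    for bias, value in per_load.items():
        print(f"{bias} mA driver: {value:.2f} mW per latch")
```

On this measure the 2mA driver uses less power per latch than
the 1mA driver, which is consistent with the choice made above.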
The synchronous switching behavior of the clock driver
coupled with its high current consumption warrants an
investigation of its power rail transient characteristic
(Figure 4-9). It is not surprising that it follows the same
periodic trend as discussed in the case of the CML latch.
In the worst case, the downward transient current spike
deviates by 14.6% from the average current level. Also of
interest is the noise induced on the clocking signal by
strong, simultaneous logic transitions at the latch input.
As a result, a clock driver must be capable of driving a
maximum fanout load of latches when every latch input
transitions simultaneously in the same direction.

Figure 4-9. Power Rail transients induced by the
switching activity of a single Clock Driver.
3. Final Design Summary: Clock Driver
The final design for the clock driver is implemented
with the parameters listed in Table (4-6) using the topology
presented previously in Figure (4-8).
Clock Driver
Design and Performance Summary
Rgain:     400 Ω
R1buf:     110 Ω
R2buf:     450 Ω
Ibias:     2 mA
NM L:      0.08 V
NM H:      0.10 V
Power:     16.3 mW
Fanout:    4 Latches
Table 4-6. Design and Performance Summary of the
Clock Driver Circuit.
At this point, the set of building blocks is complete.
The logic circuits of Chapter III and the clock-driven
devices of Chapter IV are brought together in Chapter V to
implement several pipelined multiplier configurations.
V. HBT CML PIPELINED MULTIPLIER DESIGN
A. LOGIC STAGE DESIGN
1. Overview
As introduced in Chapter II-C, the multiplier logic for
this project is implemented with the three functional
processes illustrated in Figure (5-1): partial product
generation, carry-save addition, and carry-completion
addition. In the case of the 8x8 bit multiplier which is
implemented in this chapter, the process of carry-save
addition is actually accomplished with successive stages of
carry-save adders. More specifically, the use of
three-to-two carry-save adders produces the logic
implementation illustrated in Figure (5-2). The detailed
process of carry-save addition is addressed in the following
section; however, this block diagram accurately represents
the functional design of the multiplier and establishes a
graphic reference for the follow-on discussion.
Figure 5-1. Generalized Block Diagram of an 8x8 bit
Multiplier. (Inputs: Multiplier, Multiplicand; output:
Product.)
2. Carry-Save Adders
Each three-to-two carry-save adder takes three operands
and produces two outputs, a sum and a carry. However, the
carry-save adder implementations are not identical, due to a
slightly different input configuration that exists for the
first carry-save adder stage than for the follow-on stages.
Referencing Figure (5-3), the first carry-save adder
receives three non-aligned n-bit partial products. As a
result, it generates n+2 sum bits and n carry bits.
Meanwhile, the follow-on stages each receive an aligned
input pair comprised of the carry and sum terms generated by
the preceding stage. The third input is the next partial
product term, and it is shifted by one bit. Thus, the sum
is only n+1 bits and the carry is still n bits.
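The reduction described above can be illustrated with a short behavioral model. The Python sketch below is an assumption-laden illustration, not the gate-level CML design: it operates on whole integers, so it does not model the bit trimming between stages, and the function names carry_save_add and multiply_8x8 are introduced here. It applies one first-stage adder to three partial products and five follow-on adders to the remaining partial products, then completes the carry with an ordinary addition.

```python
# Behavioral sketch of 3-to-2 carry-save addition on Python integers.
# It models the arithmetic only, not the CML circuit implementation.


def carry_save_add(x, y, z):
    """Reduce three operands to a (sum, carry) pair with x + y + z == sum + carry."""
    s = x ^ y ^ z                           # bitwise sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1  # carries, shifted to their bit weight
    return s, c


def multiply_8x8(a, b):
    """Multiply two 8-bit operands with six carry-save stages and one final addition."""
    # Partial product i is the multiplicand ANDed with multiplier bit i, shifted by i.
    partial = [(a << i) if (b >> i) & 1 else 0 for i in range(8)]
    s, c = carry_save_add(partial[0], partial[1], partial[2])  # first stage
    for pp in partial[3:]:                                     # five follow-on stages
        s, c = carry_save_add(s, c, pp)
    return (s + c) & 0xFFFF                                    # carry completion


print(format(multiply_8x8(0xF7, 0xC7), "016b"))  # prints 1100000000000001
```

The operands F7h and C7h are the pair used later in the critical path illustrations, and the printed product matches the value shown there.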
In the case of either carry-save adder, only the most
significant n bits of the sum term are passed on to the next
adder stage. The remaining least significant bit(s)
represent the next most significant bit(s) of the final
product and are passed directly to the multiplier output.
These bits are highlighted with a circle in Figure (5-3).
The final designs of the two carry-save-adder configurations
are provided in Figures (5-4) and (5-5). Note the presence
of more than simple adder circuits. A fanout limitation of
four prevents a single signal from driving the eight input
requirements for the current multiplier bit at each
carry-save-adder stage. Thus, the arriving multiplier bits
pass through an inverting buffer stage.
Figure 5-2. Logic Implementation of an 8x8 bit Multiplier
using six stages of Carry-Save-Adders and a Carry-Completion
Adder. (Inputs: Partial Products #1 through #8.)
Figure 5-3. Functional Illustration of the two Carry-
Save-Adder Implementations.
Figure 5-5. Logic Schematic of Carry-Save-Adder #2.
Input Signals:
  Multiplier-Bit   Current multiplier bit for this stage
  BNin[7:0]        8-bit Multiplicand (Negated)
Output Signals:
  P[0]             Next Product Bit
  C[7:0]           8-bit Carry Term
  S[8:1]           8-bit Sum Term
Furthermore, the OR/NOR gates are used to generate the
partial product terms within each carry-save-adder stage,
rather than at the multiplier input. Taking advantage of
the complementary output signals available from the
preceding register, the NOR gates perform a logical AND of
each multiplicand bit with the appropriate multiplier bit.
Local generation of the partial product terms avoids the
extensive requirement for intermediate registers that would
be necessary to pass all partial product terms from one
pipeline stage to the next (that is, referencing a scenario
where all partial products are generated before the first
carry-save adder).
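The use of NOR gates to form an AND follows from De Morgan's theorem: the NOR of two complemented signals equals the AND of the true signals. A minimal Python check is sketched below; the helper names nor and partial_product_bit are assumptions introduced for illustration.

```python
# Check that NOR applied to complemented inputs equals AND of the true inputs.


def nor(x, y):
    """Two-input NOR on single bits (0 or 1)."""
    return int(not (x or y))


def partial_product_bit(multiplicand_bit_n, multiplier_bit_n):
    """Both arguments are the complemented (negated) signals."""
    return nor(multiplicand_bit_n, multiplier_bit_n)


for a in (0, 1):
    for b in (0, 1):
        # Complement each true bit before it reaches the NOR gate.
        assert partial_product_bit(1 - a, 1 - b) == (a & b)
print("NOR of complemented inputs matches AND for all four input cases")
```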
3. Carry-Completion Adders
The carry-completion adder implements ripple-carry
addition. This elementary design is preferred over carry-
look-ahead addition because it facilitates a variety of
simple pipeline implementations. Figure (5-6) illustrates
the full carry-completion adder which can be conveniently
segmented into as many as eight pipeline stages by
separating the successive two and three-input adders.
Figure 5-6. An 8-bit Ripple-Carry Adder to perform
Carry-Completion.
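A behavioral sketch of ripple-carry addition helps show why the design segments so easily. The Python model below is illustrative only; the names full_adder and ripple_carry_add are assumptions introduced here, and the code does not represent the CML circuit.

```python
# Behavioral sketch of an 8-bit ripple-carry adder built from full adders.


def full_adder(a, b, cin):
    """Add three single bits and return (sum bit, carry-out bit)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, width=8):
    """Add two width-bit operands bit by bit; return (result, final carry)."""
    carry = 0
    result = 0
    for i in range(width):
        # Each bit position needs only the carry from the position below it.
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


print(ripple_carry_add(0x7F, 0x01))
```

Because each bit position depends only on the carry from the position below it, a register can be placed between any two positions, which is consistent with the segmentation into pipeline stages described above.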
B. REGISTER STAGE DESIGN
Regardless of the number of pipeline stages, each
multiplier implementation requires two eight-bit input
registers and a sixteen-bit output register. For pipeline
implementations with more than one stage, intermediate
registers are also required. The size of these registers
varies depending upon where the register is inserted in the
flow of logic. All intermediate and output registers
require complementary input signals. However, the input
registers are distinctly designed to accept a single logic
input signal for each bit, vice requiring complementary
logic input signals. In order to accomplish this, the D-
type flip-flops utilized in the input register must employ a
special latch implementation which does not require
differential input signals for the master latch of the
master-slave flip-flop pair. The details of this latch
implementation are presented in Chapter IV-A-5.
C. CLOCK DISTRIBUTION
The purpose of the clock distribution scheme is to
provide a local clock signal for clock-driven devices,
namely the latches that comprise the registers described in
the previous section. However, each clock driver can only
sustain a maximum load of four latches, i.e., two flip-
flops. Therefore, due to the number of clock-driven devices
and the limited fanout capability of the clock drivers, the
clock signal must propagate through an extensive,
multi-level distribution tree. As the number of clock-driven
devices increases, the number of levels in this distribution
tree must eventually increase as well. Thus, the more
heavily pipelined multiplier implementations must make a
larger investment of devices and power in clock
distribution.
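The growth of the distribution tree can be sketched numerically. The Python example below is a simplified model under stated assumptions: the fanout of four latches per driver comes from the text, but the code also assumes that a clock driver can feed at most four other clock drivers and that all latches share one tree. The function name clock_tree is introduced here for illustration.

```python
import math

FANOUT = 4  # latches per clock driver (from the text); also assumed for driver loads


def clock_tree(num_latches, fanout=FANOUT):
    """Return the number of drivers at each level, leaf level first."""
    levels = [math.ceil(num_latches / fanout)]
    while levels[-1] > 1:
        # Each higher level needs enough drivers to feed the level below it.
        levels.append(math.ceil(levels[-1] / fanout))
    return levels


# One-stage pipeline: two 8-bit input registers and one 16-bit output register,
# with two latches per master-slave flip-flop.
latches = (8 + 8 + 16) * 2
levels = clock_tree(latches)
print(levels, "levels:", len(levels), "drivers:", sum(levels))
```

Under these assumptions the 64 latches of the one-stage pipeline need three levels of drivers, and adding intermediate registers eventually forces a fourth level.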
D. MULTIPLIER IMPLEMENTATIONS
Five pipelined multiplier implementations have been
designed for testing via Tanner SPICE simulation tools.
These implementations include a one-stage pipeline, a two-
stage pipeline, a four-stage pipeline, a six-stage pipeline,
and a ten-stage pipeline. The arithmetic logic is identical
for each; however, the increased number of registers present
in the more heavily pipelined implementations also implies a
more extensive clock distribution tree. A block diagram of
each implementation is presented in the following section.
E. PERFORMANCE EVALUATION
1. Evaluation Procedures
Prior to evaluation of the individual multiplier
implementations, the multiplier logic is successfully tested
with several operands in order to verify that it produces an
accurate product. Following this verification, it is the
goal of this performance evaluation to identify the maximum
operating clock frequency for each pipeline implementation.
However, this can only be done once the critical path, i.e.,
the critical pipeline stage, is determined for each
multiplier.
a) Critical Path Identification
The most direct and absolute means of identifying
the critical path is to conduct full-length simulations of
each multiplier for every possible combination and sequence
of two 8-bit input operands. Conducting these nearly 4.3
billion simulations on each of the five multiplier designs
is obviously prohibitive. Thus, the opposite extreme
suggests that the worst-case transition delay be assumed for
every logic circuit in every stage of the pipeline. While
this successfully identifies an upper bound on the delay
associated with the critical path, it is likely that the
upper bound case does not exist as a result of two input
operands. Furthermore, without knowledge of the input
operands, simulations can not be conducted for verification.
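The figure of nearly 4.3 billion simulations can be reproduced with a short calculation. The Python sketch below assumes one reading of "every possible combination and sequence": each of the possible operand pairs is followed by every possible operand pair, so that every input transition is exercised. This interpretation is an assumption, although it matches the quoted figure.

```python
# Count the exhaustive simulations for two 8-bit operands.
OPERAND_BITS = 8

operand_pairs = 2 ** (2 * OPERAND_BITS)  # distinct (A, B) input pairs: 65,536
pair_sequences = operand_pairs ** 2      # each pair followed by any pair

print(f"{operand_pairs:,} operand pairs")
print(f"{pair_sequences:,} ordered sequences of two pairs")
```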
Unfortunately, the logic behavior of the carry-
save-adders makes an intuitive approach extremely difficult.
Thus, a computer program designed by Kirk Shawhan, a
research associate, has been utilized to identify the
worst-case input combinations (Shawhan, 2000). The program
effectively identifies a unique upper bound delay for each
set of input operands. Those input combinations with the
worst-case upper-bound delays are then simulated to identify
a single worst-case pair of operands and the critical stage
where the most-delayed transition occurs. While it is not
proven that this approach will identify the absolute
critical path, it provides a reasonable and timely estimate
for the purposes of this research.
b) Maximum Throughput/Clocking Frequency
Having determined the critical path, it is simply
a matter of simulation time to identify the maximum clock
frequency. For each pipeline implementation, a simulation
is conducted which brackets the breakpoint of the
multiplier. Furthermore, examination of the margin by which
the setup time is met or missed provides a determination of
the minimum clock period that is accurate within five
picoseconds.
The increased number of devices in the more
heavily pipelined designs made full-circuit simulation times
extremely long. As a result, the breakpoints for the four-
stage, the six-stage, and the ten-stage multipliers were
determined from partial simulations. Only the critical
stage and those stages immediately before and after it were
simulated.
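The idea of bracketing the breakpoint can be sketched in code. The thesis narrows the bracket by examining the setup-time margin of each simulation; the Python example below substitutes a plain bisection search, which is a different and simpler technique. The predicate passes stands in for a SPICE run, and the 178 ps threshold of the demonstration circuit is hypothetical.

```python
def find_min_period_ps(passes, fail_ps, pass_ps, tol_ps=5.0):
    """Narrow a (failing, passing) clock-period bracket to within tol_ps."""
    assert not passes(fail_ps) and passes(pass_ps)
    while pass_ps - fail_ps > tol_ps:
        mid = (fail_ps + pass_ps) / 2.0
        if passes(mid):
            pass_ps = mid   # circuit still works: try a shorter period
        else:
            fail_ps = mid   # circuit fails: the breakpoint is above mid
    return fail_ps, pass_ps


def demo_circuit(period_ps):
    """Hypothetical stand-in for a simulation; works at 178 ps or slower."""
    return period_ps >= 178.0


low, high = find_min_period_ps(demo_circuit, 100.0, 300.0)
print(f"breakpoint lies between {low} ps and {high} ps")
```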
2. Performance Results of Each Implementation
The following ten pages provide a two-page design and
performance summary for each of the five pipelined
multiplier implementations. Figure (5-7) illustrates the
design and critical path of the one-stage multiplier on a
block diagram. Table (5-1) provides a summary of data which
quantifies circuit complexity, power consumption, data
throughput rate and data latency of the one-stage pipelined
multiplier. Finally, Figure (5-8) illustrates the success
and failure of P14, the critical path, at clock frequencies
below and above the breakpoint of the circuit.
Similarly, Figures (5-9) through (5-16) and Tables (5-
2) through (5-5) provide the same performance results for
the two, four, six, and ten-stage pipelined multipliers,
respectively. A comparative analysis is conducted as a
performance summary in the following section.
As a final note, all full multiplier simulations are
conducted using ideal current sources. This decision saves
numerous simulation hours without sacrificing valid
transient performance data. A close correspondence has been
demonstrated between the transient performance of the
practical and ideal current sources for both the logic and
the latch designs. Use of the ideal source, however, does
produce overly optimistic power-consumption data due to the
absence of power dissipation from the transistors in the
practical current source. Therefore, the simulation data
for current consumption is scaled to accurately represent
the power consumed in practical implementation.
A = 1111 0111
B = 1100 0111
Critical Path Initiates with the two operands A=F7h, B=C7h
Critical Path Terminates with the LOW-to-HIGH transition
of P14
P = 1100 0000 0000 0001
Figure 5-7. One-stage pipelined multiplier
implementation with an illustration of the
critical path.
             Number of     Number of    Current     Power
             Transistors   Resistors    (Amperes)   (Watts)
Logic        3952          2352         1.28        3.20
Registers    384           320          0.31        0.77
Clock        126           105          0.19        0.48
TOTAL        4462          2777         1.78        4.44
Maximum Throughput: 1.33 GHz
Latency: 0.75 Nano-second
Table 5-1. Performance summary for the one-stage
pipelined multiplier.
Figure 5-8. Performance bracket of the minimum period for
the one-stage pipeline multiplier.
A = 1111 0111
B = 1100 0111
Critical Path Initiates with the two operands A=F7h, B=C7h
Pipeline blocks, in order:
  16-Bit Input Register
  Carry Save Adder #1 (1)
  Carry Save Adder #2 (2)
  Carry Save Adder #2 (3)
  Carry Save Adder #2 (4)
  Carry Save Adder #2 (5)
  Carry Save Adder #2 (6)
  23-Bit Intermediate Register
Critical Path Terminates with the LOW-to-HIGH transition
of P14
Figure 5-9. Two-stage pipelined multiplier
implementation with an illustration of the
critical path.
             Number of     Number of    Current     Power
             Transistors   Resistors    (Amperes)   (Watts)
Logic        3952          2352         1.28        3.20
Registers    660           550          0.52        1.31
Clock        228           190          0.36        0.90
TOTAL        4840          3092         2.17        5.41
Maximum Throughput: 2.0 GHz
Latency: 1.0 Nano-second
Table 5-2. Performance summary for the two-stage
pipelined multiplier.
Figure 5-10. Performance bracket of the minimum period for
the two-stage pipeline multiplier.
A = 1111 1111
B = 1000 0001
Critical Path Initiates with the two operands A=FFh, B=81h
Pipeline blocks, in order:
  16-Bit Input Register
  Carry Save Adder #1 (1)
  Carry Save Adder #2 (2)
  Carry Save Adder #2 (3)
  31-Bit Intermediate Register
  Carry Save Adder #2 (4)
  Carry Save Adder #2 (5)
  Carry Save Adder #2 (6)
  23-Bit Intermediate Register
  Carry Completion Adder (4 Bits)
  20-Bit Intermediate Register
  Carry Completion Adder
  16-Bit Output Register
Critical Path Terminates with the LOW-to-HIGH transition
of P15
P = 1000 0000 0111 1111
Figure 5-11. Four-stage pipelined multiplier
implementation with an illustration of the
critical path.
             Number of     Number of    Current     Power
             Transistors   Resistors    (Amperes)   (Watts)
Logic        3952          2352         1.28        3.20
Registers    1272          1060         1.01        2.52
Clock        438           365          0.68        1.71
TOTAL        5662          3777         2.97        7.43
Maximum Throughput: 3.45 GHz
Latency: 1.16 Nano-seconds
Table 5-3. Performance summary for the four-stage
pipelined multiplier.
Figure 5-12. Performance bracket of the minimum period
for the four-stage pipeline multiplier.
A = 1111 1001
B = 0010 0001
Critical Path Initiates with the two operands A=F9h, B=21h
The Critical Path is Limited by the LOW-to-HIGH transition
of the Carry Bit out of Stage 5.
Pipeline blocks, in order:
  16-Bit Input Register
  Carry Save Adder #1 (1)
  Carry Save Adder #2 (2)
  31-Bit Intermediate Register
  Carry Save Adder #2 (3)
  Carry Save Adder #2 (4)
  31-Bit Intermediate Register
  Carry Save Adder #2 (5)
  Carry Save Adder #2 (6)
  23-Bit Intermediate Register
  Carry Completion Adder (3 Bits)
  21-Bit Intermediate Register
  Carry Completion Adder (3 Bits)
  18-Bit Intermediate Register
  Carry Completion Adder (2 Bits)
  16-Bit Output Register
P = 0010 0000 0001 1001
Figure 5-13. Six-stage pipelined multiplier
implementation with an illustration of the
critical path.
             Number of     Number of    Current     Power
             Transistors   Resistors    (Amperes)   (Watts)
Logic        3952          2352         1.28        3.20
Registers    1872          1560         1.49        3.72
Clock        648           540          1.03        2.57
TOTAL        6472          4452         3.80        9.49
Maximum Throughput: 4.35 GHz
Latency: 1.38 Nano-seconds
Table 5-4. Performance summary for the six-stage
pipelined multiplier.
A = 1111 1001
B = 0010 0001
Critical Path Initiates with the two operands A=F9h, B=21h
The Critical Path is Limited by the LOW-to-HIGH transition
of the Carry Bit out of Stage 9.
Pipeline blocks, in order:
  16-Bit Input Register
  Carry Save Adder #1 (1)
  31-Bit Intermediate Register
  Carry Save Adder #2 (2)
  31-Bit Intermediate Register
  Carry Save Adder #2 (3)
  31-Bit Intermediate Register
  Carry Save Adder #2 (4)
  31-Bit Intermediate Register
  Carry Save Adder #2 (5)
  31-Bit Intermediate Register
  Carry Save Adder #2 (6)
  23-Bit Intermediate Register
  Carry Completion Adder (2 Bits)
  22-Bit Intermediate Register
  Carry Completion Adder (2 Bits)
  20-Bit Intermediate Register
  Carry Completion Adder (2 Bits)
  18-Bit Intermediate Register
  Carry Completion Adder (2 Bits)
  16-Bit Output Register
P = 0010 0000 0001 1001
Figure 5-15. Ten-stage pipelined multiplier
implementation with an illustration of the
critical path.
             Number of     Number of    Current     Power
             Transistors   Resistors    (Amperes)   (Watts)
Logic        3912          2320         1.28        3.20
Registers    3240          2700         2.57        6.44
Clock        1116          930          1.74        4.36
TOTAL        8268          5950         5.60        13.99
Maximum Throughput: 5.56 GHz
Latency: 1.80 Nano-seconds
Table 5-5. Performance summary for the ten-stage
pipelined multiplier.
T = 180ps, Critical Path Transition SUCCEEDS
T = 170ps, Critical Path Transition FAILS
Figure 5-16. Performance bracket of the minimum period
for the ten-stage pipeline multiplier.
3. Comparative Analysis
A summary of the performance results for each of the
five pipelined multiplier implementations is presented in
Table (5-6). A comparative analysis of these results
quantifies and confirms the major trade-offs of pipelining
as they were addressed in Chapter II-B. Figure (5-17)
illustrates the increase in data throughput as compared to
the increase in product latency. However, latency is
generally an acceptable trade-off relative to the primary
cost drivers of device count and power consumption.
                               1 STAGE  2 STAGE  4 STAGE  6 STAGE  10 STAGE
Device Count                   7239     7932     9439     10924    14218
Power (Watts)                  4.44     5.41     7.43     9.49     13.99
Latency (ns)                   0.75     1.00     1.20     1.38     1.80
Maximum Throughput (GHz)       1.33     2.00     3.33     4.35     5.56
Speed-Power Ratio (GHz/Watt)   0.300    0.370    0.449    0.458    0.397
Normalized Speed-Power Ratio   0.66     0.81     0.98     1.00     0.87
Table 5-6. Comparative Summary of Performance.
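The latency figures in Table (5-6) are consistent with an n-stage pipeline delivering a product n clock periods after the operands are registered. This relation is an observation about the tabulated numbers rather than a definition quoted from the text, and the short Python check below is offered only as an illustration of it.

```python
# Latency as (number of stages) x (minimum clock period), using Table (5-6) values.
stages = [1, 2, 4, 6, 10]
throughput_ghz = [1.33, 2.00, 3.33, 4.35, 5.56]

# A clock of f GHz has a period of 1/f ns, so n stages take n/f ns.
latency_ns = [n / f for n, f in zip(stages, throughput_ghz)]

for n, t in zip(stages, latency_ns):
    print(f"{n:>2} stages: {t:.2f} ns")
```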
Figure 5-17. Throughput and Latency as a function of the
number of pipeline stages.
(Horizontal axis: Number of Pipeline Stages, at 1, 2, 4, 6,
and 10; vertical scale 0.00 to 6.00.)
Device count and power consumption are quantified in
Figures (5-18) and (5-19), respectively. As the number of
pipeline stages increases, the cost rises sharply, driven
by the need for intermediate registers and an extensive
clock distribution network. In the one-stage pipeline, the
registers and clock tree represent only 13% of the total
device count and consume 28% of the total power. On the
other end of the spectrum, registers and clock distribution
in the ten-stage pipeline represent 56% of the total device
count and consume 77% of the total power.
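The percentages quoted above follow directly from the per-category data in Tables (5-1) and (5-5). The Python sketch below recomputes them; the dictionary layout and the function name overhead_share are assumptions introduced for illustration.

```python
# (transistors, resistors, watts) per category, from Tables (5-1) and (5-5).
one_stage = {
    "logic": (3952, 2352, 3.20),
    "registers": (384, 320, 0.77),
    "clock": (126, 105, 0.48),
}
ten_stage = {
    "logic": (3912, 2320, 3.20),
    "registers": (3240, 2700, 6.44),
    "clock": (1116, 930, 4.36),
}


def overhead_share(design):
    """Return the register-plus-clock share of (device count, power)."""
    devices = {k: t + r for k, (t, r, _) in design.items()}
    power = {k: w for k, (_, _, w) in design.items()}
    dev = (devices["registers"] + devices["clock"]) / sum(devices.values())
    pwr = (power["registers"] + power["clock"]) / sum(power.values())
    return dev, pwr


for name, design in (("one-stage", one_stage), ("ten-stage", ten_stage)):
    dev, pwr = overhead_share(design)
    print(f"{name}: {dev:.0%} of devices, {pwr:.0%} of power")
```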
Figure 5-18. Distribution of the Device Count.
(Vertical axis: Number of Devices; horizontal axis: Number of
Pipeline Stages.)
Figure 5-19. Distribution of Power Consumption.
(Vertical axis: Watts; horizontal axis: Number of Pipeline
Stages; legend: CLOCK, REGISTER, LOGIC.)
Somewhere between these two extremes there exists an
optimum pipelined implementation. Dividing the maximum
throughput of each configuration by the total power that it
consumes, a figure of merit is calculated which is referred
to here as a speed-power ratio (for consistency with
optimization procedures in previous chapters). Figure
(5-20) plots the speed-power ratio as a function of the
number of pipeline stages. The maximum point on the curve
indicates that the optimal pipelined multiplier
implementation employs five or six stages.
Figure 5-20. Comparison of Speed-Power Ratio.
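The figure of merit can be recomputed from the throughput and power rows of Table (5-6). The following Python sketch is illustrative; the variable names are introduced here, and the computed ratios agree with the tabulated ones only to within rounding of the inputs.

```python
# Speed-power ratio = maximum throughput / total power, using Table (5-6) values.
stages = [1, 2, 4, 6, 10]
throughput_ghz = [1.33, 2.00, 3.33, 4.35, 5.56]
power_w = [4.44, 5.41, 7.43, 9.49, 13.99]

ratio = [f / p for f, p in zip(throughput_ghz, power_w)]
best = max(ratio)
best_stages = stages[ratio.index(best)]

for n, r in zip(stages, ratio):
    print(f"{n:>2} stages: {r:.3f} GHz/W, normalized {r / best:.2f}")
print("highest ratio among the simulated designs:", best_stages, "stages")
```

Only five pipeline depths were simulated, so the calculation can identify the best of those five; the statement that the optimum lies at five or six stages rests on the shape of the plotted curve between the sampled points.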
Thus, having concluded an evaluation of the various
pipelined multiplier implementations, it remains to consider
the impact that clock skew has upon these high-speed
circuits. Chapter VI undertakes this discussion in the
pages that follow.
VI. ANALYSIS OF CLOCK SKEW
A. QUANTIFYING CLOCK SKEW
Clock skew appears naturally in practical circuits due
to a variety of physical factors as described in Chapter
II-A. However, in a typical SPICE simulation, transmission
delays are not inherent to the process and circuit elements
are evaluated under ideal, homogeneous operating conditions.
The effective result is the near elimination of clock skew
from the simulation environment.
Clock skew could be introduced artificially; however,
introducing a known amount of clock skew would have very
predictable results, such that they can be determined without
simulation. Thus, based upon the results of Chapter V, a
simple numerical analysis is conducted in this chapter which
provides an illustration of how clock skew impacts pipelined
architectures and serves as a set of reference data against
which follow-on research into alternative control techniques
can measure performance.
B. ANALYSIS PROCEDURES
Based upon the definition of skew from Chapter II-A,
let $S_{DEVICE}$ represent the maximum delay between two clock
signals after propagation through a single level of clock
drivers. As illustrated in Figure (6-1), the effect of
$S_{DEVICE}$ on the clock signal as it propagates through the
clock distribution tree is that the clock signal potentially
accumulates $S_{DEVICE}$ picoseconds of skew at each level.
Furthermore, any loading differences at the final level of
the clock distribution will introduce another skew term,
$S_{LOAD}$. Thus, the simplified expression to be used for
analyzing and calculating skew is given in Equation (6-1).
$$S_{TOTAL} = n \times S_{DEVICE} + S_{LOAD} \tag{6-1}$$
where n = maximum number of levels in the
clock distribution scheme
Figure 6-1. Illustration of Clock Skew as it results from
propagation path delays and loading.
An expression for n is derived in Equation (6-2), based upon
the pipeline implementations from Chapter V.
For synchronous logic, the timing inequality from Chapter
II-A is repeated as Equation (6-3). This relationship
requires that the minimum clock period be expanded to
account for the increase in skew.
$$T \geq T_{min} = T_{skew} + T_{logic} + T_{\text{Flip-Flop}} \tag{6-3}$$
The procedure for analysis of clock skew is simply to
apply a range of values for $S_{DEVICE}$ to the clock
distribution schemes from Chapter V, using Equation (6-2).
Based upon simulation results, the worst-case value for
$S_{LOAD}$ is determined to be 6.5 picoseconds. Thus, it is
possible to calculate a worst-case skew value for each
incremental value of $S_{DEVICE}$ as it applies to the clock
distribution scheme of each multiplier implementation.
Applying the worst-case skew values to Equation (6-3), a new
minimum period is determined for each multiplier
implementation. This is repeated for values of $S_{DEVICE}$
ranging from two to twenty picoseconds. A comparative
analysis of the results should identify/confirm the
expectation of an increasingly negative impact on the more
heavily pipelined architectures.
Finally, within the stated range of $S_{DEVICE}$ values, a
reasonable figure for $S_{DEVICE}$ is determined as it might
actually occur due to device non-idealities in the
fabrication process. The approximation of device-induced
skew ($S_{DEVICE}$) is defined as 20% of the worst-case
propagation delay for the clock driver circuit and is
determined to be
4.5 picoseconds. This set of data is referenced in the
figures that follow as "typical skew".
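The procedure above can be sketched numerically. The Python example below applies Equation (6-1) and then lengthens the simulated minimum period by the resulting skew, as the text describes. The loading skew of 6.5 ps, the typical device skew of 4.5 ps, and the 180 ps period of the ten-stage multiplier come from the thesis; the value of five tree levels is a placeholder assumed for illustration, since the expression for n is not reproduced in this text, and the function names are introduced here.

```python
S_LOAD_PS = 6.5  # worst-case loading skew from the text, in picoseconds


def total_skew_ps(levels, s_device_ps, s_load_ps=S_LOAD_PS):
    """Equation (6-1): skew accumulated over `levels` clock-driver levels."""
    return levels * s_device_ps + s_load_ps


def skewed_throughput_ghz(min_period_ps, levels, s_device_ps):
    """Throughput after the minimum period is lengthened by the total skew."""
    period_ps = min_period_ps + total_skew_ps(levels, s_device_ps)
    return 1000.0 / period_ps  # a period in ps corresponds to 1000/period GHz


# Ten-stage multiplier: 180 ps minimum period without skew.
# LEVELS is an assumed placeholder, not a value taken from the thesis.
LEVELS = 5
for s_device in (2.0, 4.5, 5.0, 10.0, 20.0):
    rate = skewed_throughput_ghz(180.0, LEVELS, s_device)
    print(f"S_DEVICE = {s_device:4.1f} ps -> {rate:.2f} GHz")
```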
C. RESULTS
Figure (6-2) provides a plot of the results. The
values for skew which are referenced in the figures
represent the values for $S_{DEVICE}$. The data clearly confirms
that the multipliers with throughput rates which are
obtained as a function of higher clock rates will experience
the most drastic performance reductions in the presence of
clock skew. Furthermore, when weighed against the cost of
power consumption, a set of new speed-power ratio curves is
obtained, as shown in Figure (6-3). Thus, the contemporary
appeal of synchronous pipelined architectures demonstrates a
severe backlash at high clock rates.
Figure 6-2. Effect of Skew on Pipeline Throughput Rates.
(Vertical axis: Throughput (GHz); horizontal axis: Number of
Pipeline Stages; curves: No Skew, 2ps Skew, 5ps Skew, 10ps
Skew, 20ps Skew, Typical Skew.)
Figure 6-3. Effect of Skew on Pipeline Efficiency.
(Vertical axis: Speed-Power Ratio (GHz/W); horizontal axis:
Number of Pipeline Stages; curves: No Skew, Skew=2ps,
Skew=5ps, Skew=10ps, Skew=20ps, Typical Skew.)
VII. CONCLUSIONS
The fundamentals of circuit analysis and the principles
of junction transistor behavior have been applied to design
an optimal family of current-mode logic devices from InP HBT
SPICE transistor models. From these building blocks of
digital logic, an array multiplier has been constructed and
pipelined into five distinct implementations. Each
multiplier implementation has been simulated extensively via
Tanner SPICE in order to identify the respective performance
characteristics of power consumption and maximum operating
frequency.
A comparative analysis of multiplier performance has
effectively demonstrated the trade-offs of pipelining with
predictable yet interesting results. The cost of increasing
throughput by increasing the number of pipeline stages has
been quantified in terms of device count and power
consumption. By maximizing data throughput at the most
efficient cost in terms of power, the optimal 8x8 bit
synchronous pipelined multiplier design has been determined
to be the six-stage implementation, as shown in Figure (5-20).
Finally, in the presence of clock skew, it has been
demonstrated that the efficiency of synchronous pipelined
architectures operating at high clock rates is significantly
reduced. Thus, as device switching frequencies continue to
pave the way to faster logic circuits, the rate of data
throughput will be left behind unless the synchronous logic
design constraint of clock skew can be overcome. The impact
of clock skew has been quantified and summarized such that
it provides a reference point for further research into
alternative clocking/control techniques.
Specifically, it is intended that future research use
the CML HBT logic family designed in this thesis in order to
implement the same array multiplier circuit using
asynchronous control techniques. One such endeavor is
already in progress as LtCol. Kirk Shawhan, USMC,
investigates the use of local completion signals which
employ request/acknowledge handshake signals to control the
flow of data vice the use of a global clock signal (Shawhan,
2000). Perhaps in time such asynchronous schemes will
mature into a design methodology that overcomes the obstacle
of clock skew which now threatens to limit synchronous
design methodology.
LIST OF REFERENCES
Foley, J. B.; Bannister, J. A. R., "Analysing ECL's Noise
Margins," IEEE Circuits and Devices, Volume 10, 1994, pp.
32-37.
Harris, D., "Timing Analysis Including Clock Skew," IEEE
Transactions on Computer-Aided Design of Integrated
Circuits and Systems, Volume 18, 1999, pp. 1608-1618.
Jalali, B.; Pearton, S. J., InP HBTs: Growth, Processing,
and Applications, Artech House, Inc., Massachusetts, 1995.
Loomis, Herschel H. Jr., "Class Notes," EC 4830 Spring
Quarter at the Naval Postgraduate School, 2000.
Moore, G., "Moore's Law Extended: The Return of
Cleverness," Solid State Technology, Volume 40, 1997, pp.
359-364.
Pierret, Robert F., Semiconductor Device Fundamentals,
Addison-Wesley Publishing Company, Massachusetts, 1996.
Pollard, L. Howard, Computer Design and Architecture,
Prentice-Hall, New Jersey, 1990.
Richards, R. K., Electronic Digital Components and Circuits,
D. Van Nostrand Company, New Jersey, 1967.
Sedra, Adel. S; Smith, Kenneth C., Microelectronic Circuits,
Oxford University Press, New York, 1998.
Shawhan, Kirk A., Design and Analysis of an Asynchronous
Pipelined Multiplier with Comparison to Synchronous
Implementation, Master's Thesis, Naval Postgraduate School,
Monterey, CA, Dec 2000.
Sutherland, Ivan E., "Micropipelines," Communications of the
ACM, Volume 32, 1989, pp. 720-738.
Wakerly, John F., Digital Design Principles and Practices,
Prentice Hall, New Jersey, 2000.
Weste, Niel H. E.; Eshraghian, Kamran, Principles of CMOS
VLSI Design: A Systems Perspective, Addison Wesley Longman,
Inc., 1993.
INITIAL DISTRIBUTION LIST
1. Defense Technical Information Center ..... 2
8725 John J. Kingman Road, Ste 0944
Fort Belvoir, VA 22060-6218
2. Dudley Knox Library ..... 2
Naval Postgraduate School
411 Dyer Road
Monterey, California 93943-5101
3. Director, Training and Education ..... 1
MCCDC, Code C46
1019 Elliot Rd.
Quantico, Virginia 22134-5027
4. Director, Marine Corps Research Center ..... 2
MCCDC, Code C40RC
2040 Broadway Street
Quantico, Virginia 22134-5107
5. Marine Corps Tactical System Support Activity ..... 1
Technical Advisory Branch
Attn: Librarian
Box 555171
Camp Pendleton, CA 92055-5080
6. Marine Corps Representative ..... 1
Naval Postgraduate School
Code 037, Bldg. 330, Ingersoll Hall, Room 116
555 Dyer Road
Monterey, CA 93943
7. Engineering and Technology Curricular Office, Code 34 ..... 1
Naval Postgraduate School
Monterey, California 93943-5109
8. Chairman, Code EC ..... 1
Department of Electrical and Computer Engineering
Naval Postgraduate School
Monterey, California 93943-5121
9. Professor Douglas Fouts, Code EC/FS ..... 1
Department of Electrical and Computer Engineering
Naval Postgraduate School
Monterey, California 93943-5121
10. Professor Herschel Loomis, Code EC/LM ..... 1
Department of Electrical and Computer Engineering
Naval Postgraduate School
Monterey, California 93943-5121
11. LtCol. Kirk Shawhan (USMC) ..... 1
P.O. Box 749
Quantico, VA 22134-0749
12. Maj. John R. Calvert, Jr. (USMC) ..... 4
1422 Woodway Drive
Ooltewah, TN 37363