# Full text of "DTIC ADA388019: Design of a Synchronous Pipelined Multiplier and Analysis of Clock Skew in High-Speed Digital Systems"

NAVAL POSTGRADUATE SCHOOL
Monterey, California

THESIS

DESIGN OF A SYNCHRONOUS PIPELINED MULTIPLIER AND ANALYSIS OF CLOCK SKEW IN HIGH-SPEED DIGITAL SYSTEMS

by John R. Calvert, Jr.

December 2000

Thesis Advisor: Douglas J. Fouts
Thesis Co-Advisor: Herschel H. Loomis, Jr.

Approved for public release; distribution is unlimited.

REPORT DOCUMENTATION PAGE (Form Approved, OMB No. 0704-0188)

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington DC 20503.

1. AGENCY USE ONLY: (Leave blank)
2. REPORT DATE: December 2000
3. REPORT TYPE AND DATES COVERED: Master's Thesis
4. TITLE AND SUBTITLE: Design of a Synchronous Pipelined Multiplier and Analysis of Clock Skew in High-Speed Digital Systems
5. FUNDING NUMBERS:
6. AUTHOR(S): Calvert, John R. Jr.
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Naval Postgraduate School, Monterey, CA 93943-5000
8. PERFORMING ORGANIZATION REPORT NUMBER:
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES):
10. SPONSORING/MONITORING AGENCY REPORT NUMBER:
11. SUPPLEMENTARY NOTES: The views expressed in this thesis are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
12a. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release; distribution is unlimited.
12b. DISTRIBUTION CODE:
13. ABSTRACT (maximum 200 words): Digital systems implemented with high-speed transistor technologies face a variety of design challenges in an effort to keep pace with the accelerating demand for performance. As device switching frequencies climb comfortably into the gigahertz range, clock skew in digital systems threatens to limit the advantages of synchronous pipelined designs. This research investigates the limitations of clock skew on high-speed digital systems by designing and simulating an 8x8 bit synchronous, pipelined multiplier using indium phosphide (InP) heterojunction bipolar transistor (HBT) technology. Fundamentals of circuit analysis and the principles of junction transistor behavior are applied to design an optimal family of logic devices using current-mode logic. All testing and simulation data is based upon results obtained from Tanner SPICE design tools. Using the building blocks of this logic family, an array multiplier is constructed and further configured into five distinct pipeline implementations. By employing a different number of pipeline stages in each implementation, the trade-offs of pipelining are illustrated and clock skew is analyzed at a variety of throughput rates. Finally, the impact of clock skew on throughput performance is quantified and summarized as a reference point for further research into asynchronous control techniques.
14. SUBJECT TERMS: Clock Skew, Pipelined Logic, Current-Mode Logic, Indium-phosphide Heterojunction Bipolar Transistors, High-Speed Logic
15. NUMBER OF PAGES: 152
16. PRICE CODE:
17. SECURITY CLASSIFICATION OF REPORT: Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified
20. LIMITATION OF ABSTRACT: UL

NSN 7540-01-280-5500. Standard Form 298 (Rev. 2-89), prescribed by ANSI Std. 239-18.
John R. Calvert, Jr.
Major, United States Marine Corps
B.S., United States Naval Academy, 1990

Submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE IN ELECTRICAL ENGINEERING from the NAVAL POSTGRADUATE SCHOOL, December 2000

Author:
Approved by:
Department of Electrical and Computer Engineering

TABLE OF CONTENTS

I. INTRODUCTION
   A. THE RELEVANCE OF HIGH-SPEED LOGIC
   B. THE PROBLEM OF CLOCK SKEW
   C. THE DESIGN OF A TEST CIRCUIT
   D. THESIS OUTLINE
II. BACKGROUND
   A. CLOCK SKEW
   B. PRINCIPLES OF PIPELINING
   C. LOGIC DESIGN OF A COMBINATIONAL MULTIPLIER
   D. BJT/HBT LOGIC
      1. BJT/HBT Principles and Characteristics
      2. BJT/HBT Logic Families
III. HBT CML LOGIC CIRCUIT DESIGN
   A. DESIGN OVERVIEW
   B. INVERTER DESIGN
      1. Circuit Topology
      2. Initial Conditions and Design Parameters
      3. DC Analysis
      4. AC/Transient Analysis
      5. Final Design Summary: Inverter
   C. LOGIC NOR GATE DESIGN
      1. Overview and Analysis
      2. Final Design Summary: OR/NOR
      3. Implementation of the AND Function
   D. ADDER DESIGN
      1. Implementation
      2. Performance Analysis
   E. PRACTICAL CURRENT SOURCE DESIGN
      1. Circuit Topologies
      2. Performance Analysis
IV. HBT CML LATCH AND REGISTER DESIGN
   A. LATCH DESIGN
      1. Circuit Topology
      2. Initial Conditions and Design Parameters
      3. DC Analysis
      4. AC/Transient Analysis
      5. Special Latch Implementations
      6. Final Design Summary: D-Latch
   B. FLIP-FLOP DESIGN (D-TYPE)
      1. Overview and Analysis
      2. Final Design Summary
   C. CLOCK DRIVER DESIGN
      1. Overview
      2. Analysis and Results
      3. Final Design Summary: Clock Driver
V. HBT CML PIPELINED MULTIPLIER DESIGN
   A. LOGIC STAGE DESIGN
      1. Overview
      2. Carry-Save Adders
      3. Carry-Completion Adders
   B. REGISTER STAGE DESIGN
   C. CLOCK DISTRIBUTION
   D. MULTIPLIER IMPLEMENTATIONS
   E. PERFORMANCE EVALUATION
      1. Evaluation Procedures
      2. Performance Results of Each Implementation
      3. Comparative Analysis
VI. ANALYSIS OF CLOCK SKEW
   A. QUANTIFYING CLOCK SKEW
   B. ANALYSIS PROCEDURES
   C. RESULTS
VII. CONCLUSIONS
LIST OF REFERENCES
INITIAL DISTRIBUTION LIST

EXECUTIVE SUMMARY

The electronic subsystems of future overhead collection platforms will require extremely high performance digital logic for performing such tasks as data compression/decompression, data encryption, spread spectrum modulation, etc. To accomplish this, bit rates must reach into the gigabits per second range. Such speed obviously requires digital logic which will function correctly at clock rates of tens of gigahertz. The need for such high performance has led to the implementation of logic systems using indium phosphide (InP) heterojunction bipolar transistor (HBT) technology. However, clock frequency and pipeline throughput in digital systems implemented with InP HBT technology are significantly limited by clock, control signal, and data skew, which is a much larger percentage of the clock period than it is in lower-speed digital systems implemented with complementary metal oxide semiconductor (CMOS) technology. Therefore, the presence of clock skew in high-speed digital systems limits the advantages of synchronous pipelined architectures.

It is the purpose of this thesis to design a synchronous 8x8 bit pipelined multiplier as a high-speed digital test circuit using InP HBT technology and, furthermore, to quantify the impact of clock skew on throughput. This work represents the initial phase of a larger research project to determine if asynchronous pipeline control will yield greater overall pipeline throughput in high-performance InP HBT digital integrated circuits and if the resulting elimination of the clock distribution tree will reduce power consumption, device count, and layout area. All simulation data is based upon the results obtained from Tanner SPICE design tools.

With InP HBT device specifications received from Hughes Research Laboratories, this project commenced with the design of an HBT logic family utilizing current-mode logic.
Each circuit was designed and optimized for a minimum power-delay product while driving a maximal fanout load of four logic gates. This design effort produced the four essential circuit functions necessary for the practical implementation of any synchronous logic circuit: an inverter/buffer gate, an OR/NOR gate, a D-type latch, and a practical current source.

Using the building blocks of this logic family, an array multiplier was constructed and further configured into five distinct pipeline implementations: one-, two-, four-, six-, and ten-stage pipelines. A comparative analysis of their performance effectively illustrated the trade-offs of pipelining, i.e., the cost of the additional registers was shown to outpace the increase in throughput beyond a six-stage implementation. At a maximum throughput of 4.35 gigahertz, the six-stage pipelined multiplier was the most efficient design (in the absence of clock skew). The highest throughput achieved was 5.56 gigahertz by the costly ten-stage implementation. Power consumption ranged from 4.4 to 14 watts.

In the final analysis, clock skew was not simulated because SPICE simulations effectively eliminate skew from their calculations. Rather, the impact of clock skew was determined by applying numerical analysis to the no-skew simulation results. A range of possible skew values was considered in order to demonstrate a performance trend. The results confirmed that digital system throughput rates which are obtained as a function of higher clock rates will experience the most drastic performance reductions in the presence of clock skew. Also, it was shown for a typical value of skew in this circuit that the efficiency curve shifts to indicate that the four-stage pipeline is the most efficient implementation, vice the six-stage pipeline.

The design products and test results from this thesis provide a reference point for further research into alternative clocking/control techniques.
Specifically, it is intended that future research use the CML HBT logic family designed in this thesis in order to implement the same array multiplier circuit using asynchronous control techniques. One such endeavor is already in progress as LtCol. Kirk Shawhan, USMC, investigates the use of local completion signals which employ request/acknowledge handshake signals to control the flow of data vice the use of a global clock signal.

ACKNOWLEDGMENT

My most immediate expression of gratitude is to the One that made me ... to my Creator, my Lord, and my Savior, Jesus Christ. I am convinced that the meticulous detail and extensive effort required to design even the simplest of electronic devices bears unmistakable testimony to the Divine and Intelligent design of our universe.

But remember the Lord your God, for it is he who gives you the ability ... (Deuteronomy 8:18)

...whatever you do, do it all for the glory of God. (I Corinthians 10:31)

This thesis is dedicated to my wife, Laura. Though I have no expectation that she will ever read beyond this page, certainly it would never have been written without her encouragement, support, and patience. While I have considered myself fortunate to be learning about the mysteries of microelectronics, she has been making the most thrilling discoveries and profound impact each day as the mother of our beloved daughter, Victoria, who is 22 months old today. I love them both dearly.

I wish to thank my advisors, Professor Douglas Fouts and Professor Herschel Loomis, for sharing their time, their knowledge, and their passion for this field of study. Their guidance was essential and their enthusiasm was contagious. I am also thankful to have shared this academic journey with such a quality core of fellow students and service members. Specifically, it is with great admiration that I thank LtCol. Kirk Shawhan, USMC, my perpetual study partner and project mentor.
He has patiently endured my questions and generously shared his ingenious grasp of the most challenging concepts. My graduate school experience would have been a much different story without his assistance and, more importantly, his friendship.

I. INTRODUCTION

A. THE RELEVANCE OF HIGH-SPEED LOGIC

The demand for increased processing speeds in digital electronics has driven the clock period of logic circuits from a scale of microseconds to one of picoseconds over the past twenty years. This remarkable trend is the synergistic result of technological advancements and innovations in device physics, very-large-scale integrated (VLSI) circuit fabrication, and digital systems architecture. Moore's Law accurately predicted this trend of improvement 35 years ago, and current expectations are that the trend will continue (Moore, 1997). Consider the anticipation of such technologies as real-time multimedia satellite communications and broadband networks. These applications will require extremely high performance digital logic that can function reliably at clock rates of tens of gigahertz.

B. THE PROBLEM OF CLOCK SKEW

There are a variety of technological hurdles to clear before achieving such clock speeds, and it is the purpose of this thesis to explore one particular hurdle in the course of digital systems architecture: the problem of clock skew in high-speed logic. Clock skew is the difference between arrival times of the clock signal at different synchronous clocked devices (Harris, 1999). As clock frequencies reach into the multi-gigahertz range, clock skew is an increasing concern for high-speed circuit designers because it accounts for an increasing portion of the clock period, leaving less of the clock period to be budgeted for logic and latching delays. What was once a near negligible quantity has now become a significant design constraint. (Wakerly, 2000)

C. THE DESIGN OF A TEST CIRCUIT

This thesis presents the design of a high-speed logic test circuit and the simulation of its performance in order to identify and quantify the effects of clock skew. It should be noted that these results are intended to serve as a reference for future research involving potential solutions for the reduction of clock skew. The following paragraphs develop the necessary specifications of the test circuit.

To ensure valid results, it is important that the problem be simulated in an accurate context. Therefore, it is necessary to select a logic family based upon a transistor model that is capable of realizing multi-gigahertz clock speeds. Although complementary metal-oxide-semiconductor (CMOS) technologies dominate VLSI applications, for comparable fabrication technologies, a bipolar circuit is approximately 2.5 times faster than a functionally similar CMOS circuit (Foley, 1994). Typically, such high-speed bipolar circuits employ emitter coupled logic (ECL) or current mode logic (CML). Notably, these logic families consume significantly more power than field effect transistor (FET) logic families; however, the trade-off is accepted here for the purpose of achieving sufficient clock speeds. For these reasons, current mode logic is employed to design a family of logic gates based upon the transistor specifications for an indium phosphide (InP) heterojunction bipolar transistor (HBT), courtesy of Hughes Research Laboratories.

Additionally, it is important that the architecture and functionality of the test circuit provide a relevant context for evaluation. It should be noted here that the shorter clock periods discussed above are not exclusively the result of faster gate delays (i.e., faster transistors) but are also the result of pipelined architectures which require fewer gate delays per clock cycle. In keeping with this characteristic of high-speed logic circuits, the test circuit implements a pipelined architecture.
As for circuit functionality, an 8x8 bit multiplier was chosen to provide sufficient complexity for pipeline implementation.

D. THESIS OUTLINE

The purpose of this thesis is to design, simulate, and evaluate the performance of a high-speed (InP HBT) 8x8-bit pipelined multiplier in the presence of clock skew. The discussion begins with the review and development of several fundamental topics in Chapter II: clock skew, pipelining principles, logic-level design of a multiplier, and transistor-level design of BJT/HBT logic. Based upon that foundation, Chapters III through V present the hierarchical design of the pipelined multiplier from the bottom up. Respectively, these chapters address logic circuit design, clock-driven circuit design, and pipeline design. Each of the design chapters presents a complete discussion of pertinent design issues, low-level simulation, performance optimization, and final design specifications. Finally, Chapter VI records the analysis of clock skew and Chapter VII summarizes the conclusions of the entire work.

II. BACKGROUND

A. CLOCK SKEW

Clock skew is the difference between the arrival times of the clock signal at two different clock-driven devices, as illustrated in Figure (2-1). This difference depends upon multiple factors including normal component variations, wire propagation delay, RC delays, propagation distance, environmental variations (such as operating temperature), and clock loading. Notably, all of these contributing factors have been increasing relative to gate delays. (Harris, 1999)

Figure 2-1. Clock Skew (After Wakerly).

In traditional logic designs which employ flip-flops and operate at extremely high clock frequencies, clock skew has become a significant portion of the total clock period. For a fixed-length clock period, this effectively reduces the amount of time available for computation.
Equation (2-1) quantifies the terms which contribute to the minimum clock period ($T_{min}$) of a traditional synchronous logic circuit.

$$T_{min} = t_{skew} + t_{logic} + t_{Flip\text{-}Flop} \tag{2-1}$$

where $t_{Flip\text{-}Flop} = t_{setup} + (t_{prop})_{max}$.

The simplest and most direct technique for minimizing clock skew would seem to be the implementation of a uniform clock distribution hierarchy which provides a local clock signal to a smaller portion of the entire circuit, i.e., a subcircuit. For signals that remain within the subcircuit, clock skew is reduced. The maximum propagation delay from the local clock source to the farthest clock input of the subcircuit can be kept within a desirable tolerance. But inevitably, signals must travel between subcircuits. This is an increasingly common occurrence when the maximum size of the subcircuit is restricted by practical limitations for fanout and power consumption, which is especially true in the case of current-driven logic. The local clock signals are not without skew relative to each other. Although the delay paths for each branch of the clock distribution tree may contain the same number of gate delays, the switching behavior along each path varies within a narrow range. Thus, when a signal from one subcircuit must drive logic in another subcircuit, the worst-case value of relative clock skew must be assumed.

An extensive clock distribution tree is employed in this thesis to provide local clock signals for circuit elements of a pipelined multiplier. Ultimately, the purpose is to quantify the clock skew experienced in a high-speed logic circuit and explore the impact of clock skew as the clock period is reduced.

B. PRINCIPLES OF PIPELINING

As referenced in the previous section, the minimum clock period is governed by the relationship presented in Equation (2-1). For a given block of combinational logic with an associated propagation time of $t_{logic}$, the minimum clock period is required to be even greater.
In the face of a large, complex combinational circuit (Figure 2-2a) this could impose undesirable restrictions on clock speed. However, a pipelined approach suggests that the combinational logic can be broken down into discrete levels of operation, known as pipeline levels (Figure 2-2b). Each pipeline level will contain fewer levels of logic than the original combinational circuit, and ideally, each pipeline level will contain the same number of logic levels in order to achieve near-equal propagation delays. Then, by adding appropriately sized registers between these levels (Figure 2-2c), the function of the original combinational logic can be achieved by sequentially sending operands through the series of pipeline levels. Furthermore, this can be done at a higher clock rate since the period is now governed by Equation (2-2), where $t_{logic}$ has now become $t_{pipe\text{-}level}$.

$$T_{min} = t_{skew} + t_{pipe\text{-}level} + t_{Flip\text{-}Flop} \tag{2-2}$$

The improvement in clock speed is quantified as the percentage of speedup, Equation (2-3). (Pollard, 1990)

$$\text{Speedup} = \frac{\text{Time for } M \text{ operations WITHOUT pipelining}}{\text{Time for } M \text{ operations WITH pipelining}} \tag{2-3}$$

Of course, this benefit is not without cost. There are several trade-offs involved such as increases in the number of components, power consumption, control complexity, chip area, and a variety of associated costs for design and fabrication. Additionally, the propagation latency for a single set of signals traveling through the pipeline is increased due to the additional delays contributed by the intermediate register(s) in the pipeline. Equation (2-4) expresses this increase in latency as a function of the number of pipeline stages ($m$) and the total register delay (Loomis, 2000).

$$\text{Latency Increase} = (m-1)\, t_{Flip\text{-}Flop} \tag{2-4}$$

Figure 2-2. Example of Pipelining (After Loomis).
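As a minimal sketch of how Equations (2-1) through (2-4) interact, the following Python fragment compares an unpipelined circuit with an evenly divided m-stage pipeline. The function names, the assumption that the logic divides evenly among the stages, the (m + M - 1) fill-time model for the pipelined case, and any numeric values used with it are illustrative assumptions and are not results from this thesis.

```python
def min_clock_period(t_skew, t_logic, t_setup, t_prop_max):
    """Equations (2-1)/(2-2): T_min = t_skew + t_logic + t_FlipFlop."""
    t_flip_flop = t_setup + t_prop_max
    return t_skew + t_logic + t_flip_flop


def pipeline_metrics(t_logic_total, stages, t_setup, t_prop_max, t_skew, operations):
    """Compare an unpipelined circuit with an m-stage pipeline.

    Assumption: the combinational logic splits evenly, so each
    pipeline level has delay t_logic_total / stages.
    """
    t_flip_flop = t_setup + t_prop_max
    t_unpiped = min_clock_period(t_skew, t_logic_total, t_setup, t_prop_max)
    t_piped = min_clock_period(t_skew, t_logic_total / stages, t_setup, t_prop_max)

    # Without pipelining, each operation occupies one full clock period.
    time_without = operations * t_unpiped
    # With pipelining (assumed model), the first result needs `stages`
    # periods to emerge, then one result completes per period.
    time_with = (stages + operations - 1) * t_piped

    return {
        "period": t_piped,
        "throughput": 1.0 / t_piped,
        "speedup": time_without / time_with,            # Equation (2-3)
        "latency_increase": (stages - 1) * t_flip_flop,  # Equation (2-4)
    }
```

With arbitrary values in picoseconds, `pipeline_metrics(800.0, 4, 20.0, 30.0, 0.0, 1000)` yields a 250 ps period; repeating the call with `t_skew=50.0` lengthens the period to 300 ps, which illustrates why a fixed amount of skew costs proportionally more as the pipeline level delay shrinks.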
Though the significant increase in delay for a single operation may seem to be a tragic loss, it is the remarkable increase in data throughput which accompanies the increase in clock speed that ultimately motivates the designer to adopt a pipelined architecture. In the context of this project, a pipelined architecture will facilitate the achievement of high clock speeds in the implementation of a relatively large, complex combinational circuit: a combinational multiplier.

C. LOGIC DESIGN OF A COMBINATIONAL MULTIPLIER

A combinational multiplier takes two n-bit operands and performs n shift and n add operations to generate a 2n-bit product. Most algorithms are implemented based upon the paper-and-pencil-like procedure of shifted product components as shown in Figure (2-3). Each individual bit of the multiplier ($y_0$ through $y_{n-1}$) is successively multiplied by the entire n-bit multiplicand. With each subsequent multiplier bit, the resulting product component is shifted by one bit position, starting with an initial shift of zero and concluding with n-1. (Wakerly, 2000)

Figure 2-3. Multiplication as a sum of partial product terms (From Wakerly).

The worst-case delay for this type of multiplication is governed by the carry propagation out of the most significant bit position and into the follow-on stage of addition. By utilizing carry-save addition (Figure 2-4), this propagation delay is eliminated for the initial n-1 stages of addition; however, an extra stage is required to complete the addition of the final two resulting terms, as will be explained shortly.

Figure 2-4. An 8x8 bit multiplier implemented with seven carry-save adder stages and one ripple-carry adder for carry completion (From Wakerly).

The first carry-save addition stage takes two binary addends and generates an n-bit modulo-two sum and a shifted n-bit carry term (shifted by one bit).
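The shifted-partial-product scheme and the carry-save addition described in this section can be sketched in a few lines of Python. This is an illustrative word-level model, not the gate-level array of Figure (2-4): the function name is hypothetical, each term is held as a full-width integer rather than an n-bit row, and the final `+` stands in for the carry-completion adder.

```python
def carry_save_multiply(x, y, n=8):
    """Multiply two unsigned n-bit integers with n-1 carry-save stages
    followed by a single carry-propagating (carry-completion) addition."""
    if not (0 <= x < (1 << n) and 0 <= y < (1 << n)):
        raise ValueError("operands must be unsigned n-bit values")

    # Product components: the multiplicand gated by each multiplier bit
    # and shifted left by that bit's position.
    partials = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]

    sum_term, carry_term = partials[0], 0
    for p in partials[1:]:
        # Modulo-two sum of the three addends, bit by bit.
        new_sum = sum_term ^ carry_term ^ p
        # Majority function gives the carry out of each bit position;
        # it is shifted by one bit instead of being propagated.
        new_carry = ((sum_term & carry_term) | (sum_term & p) | (carry_term & p)) << 1
        sum_term, carry_term = new_sum, new_carry

    # Carry completion: the only step in which carries ripple.
    return (sum_term + carry_term) & ((1 << (2 * n)) - 1)
```

For n = 8 the loop runs seven times, matching the seven carry-save stages and single completion stage named in the caption of Figure (2-4). In the first pass the carry term is zero, so that stage effectively adds two addends, as the text states.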
Subsequent carry-save addition stages take three binary addends: the previous partial sum, the shifted carry term, and the next subsequent product term. These are also added to produce an n-bit modulo-two sum and a shifted n-bit carry term. As each carry-save addition occurs, the least significant bit (LSB) of each partial sum represents the next most significant bit (MSB) in the final product. This is repeated until the nth product term has been added, and all that remains are a sum term and a shifted carry term. At this point, a carry-completion adder computes the most significant n+1 bits of the product. This procedure accounts for the consecutive propagation of a carry bit as each pair of addend bits is summed from LSB to MSB.

In the context of this project, the implementation of carry-save adders and carry-completion adders allows convenient grouping of pipeline stages. This is particularly applicable to the final stage of the design process undertaken in this project. Chapter V provides further details on the implementation of a pipelined 8x8-bit combinational multiplier, as introduced in the preceding paragraphs.

D. BJT/HBT LOGIC

1. BJT/HBT Principles and Characteristics

a) Device Structure

A bipolar junction transistor (BJT) is a sandwich structure of three separately doped regions of silicon (or other suitable semiconductor), such that one of two configurations exists. One configuration is the pnp transistor, where a negatively doped region is bounded on either end by positively doped regions (p-type transistor). The other configuration is the npn transistor, where a positively doped region is bounded on either end by negatively doped regions (n-type transistor). Figure (2-5) provides a simplified illustration and further identifies the proper names for the regions: collector, base, and emitter.

Figure 2-5. Structure of a Bipolar Junction Transistor (After Pierret).

Until recent years, BJTs were generally fabricated from a single semiconductor material. However, device-level physics has demonstrated that faster junction transistors can be constructed from dissimilar semiconductor materials with complementary properties. Such devices are known as heterojunction bipolar transistors (HBTs). Conveniently enough, their operational behavior is essentially governed by the same functional principles as BJTs (Pierret, 1996). Therefore, it is assumed that wherever BJT behavior is referenced, a direct correspondence to HBT behavior exists. The following sections will provide a fundamental understanding of that behavior.

b) Device Function

The significance of the BJT lies in its potential to behave as a current-controlled current source when the proper DC bias is applied to the three regions or terminals. The controlling terminal is the base. With the proper DC bias applied to an npn transistor, a small current flowing into the base will produce a proportionately larger current being drawn into the collector, across the base region, and out of the emitter (Figure 2-6). The converse is true for a properly biased pnp transistor. A small current drawn out of the base will produce a proportionately larger current being drawn into the emitter, across the base region, and out of the collector. From this point forward, it will be helpful to limit the discussion to npn transistors, because the pnp transistors operate in a very similar manner (with reversed polarity) and npn transistors are the only type encountered in the chapters ahead.

Figure 2-6. A functional illustration of an (a) npn and a (b) pnp bipolar junction transistor (After Sedra).

As stipulated in the preceding discussion, proper DC bias conditions must exist in order to achieve the desired performance. Depending upon the DC bias, the transistor will operate in one of the following modes of operation: cutoff, active, or saturation.
In the first case, the emitter-base junction is reverse biased, which means $V_{BE} < V_{BE(on)}$ for the pn junction (0.75 V). This also implies that $V_{BC} < V_{BC(on)}$ for the collector-base junction. Therefore, the collector-base junction is also reverse biased. This condition is known as the "cutoff" mode since effectively no current flows through the transistor.

In the two remaining modes, the emitter-base junction is forward biased, and the transistor conducts current. The mode of operation is distinguished by the condition of the collector-base junction, using the emitter as a common reference for both the collector and base. If $V_{CE} < V_{CE(sat)}$, then the transistor is saturated, and the flow of current from collector to emitter is not linearly dependent on $I_B$. Conversely, when $V_{CE} > V_{CE(sat)}$, the base-collector junction is reverse biased and current is swept from the collector, across the base, and out of the emitter in linear proportion to the amount of base current applied. This is known as the active region. Table (2-1) summarizes the relationships which govern the three regions of operation. Furthermore, Figure (2-7) is an i-v curve for the Hughes InP HBT (1x1 micron). It serves to illustrate the active and saturation modes of BJT operation while also providing necessary design information that relates the base-emitter voltage drop ($V_{BE}$) to collector current levels ($I_C$). The linearly proportionate increase in collector current relative to base current is referred to as the common-emitter current gain, Beta ($\beta$), as shown in Equation (2-5). (Sedra, 1998)

$$\beta = \frac{I_C}{I_B} \tag{2-5}$$

| Mode of Operation | Base-Emitter Junction Bias | Base-Emitter Relationship | Collector-Emitter Junction Bias | Collector-Emitter Relationship |
|---|---|---|---|---|
| Cutoff | Reverse | $V_{BE} < V_{BE(on)}$ | Reverse | |
| Saturation | Forward | $V_{BE} \geq V_{BE(on)}$ | Forward | $V_{CE} < V_{CE(sat)}$ |
| Active | Forward | $V_{BE} \geq V_{BE(on)}$ | Forward | $V_{CE} > V_{CE(sat)}$ |

Table 2-1. Relationships governing the operational regions of the BJT transistors (After Sedra).

Figure 2-7. I-V Curve for the InP HBT (vertical axis: collector current, mA).

Figure 2-8. Variation of Beta for the InP HBT with respect to $V_{BE}$ and $V_{CE}$.

Beta is a device parameter for BJTs, a function of the device physics and dimensions. Figure (2-8) illustrates how Beta varies according to the values of base-emitter voltage and collector-emitter voltage. Finally, a simple application of Kirchhoff's Current Law produces Equation (2-6), an important relationship for current through the transistor.

$$I_E = I_B + I_C \tag{2-6}$$

c) DC Analysis of a BJT Circuit

In order to illustrate the basic concepts of BJT operation as presented in the previous section, the transistor circuit in Figure (2-9) is now examined. Given the reference voltages, the turn-on voltage for the emitter-base junction (0.75 V), and Beta for the transistor, it is readily determined that $V_{BE} > V_{BE(on)}$, and therefore the emitter-base junction is forward biased. DC analysis reveals the values of $V_B$ and $I_B$. Applying the equations from the previous section, $I_C$, $I_E$, and $V_C$ are determined, and it is concluded that the transistor is operating in the active region.

Circuit values: $V_{CC} = +10$ V, $V_{BB} = +5$ V, $R_C = 1\ \text{k}\Omega$, $R_B = 100\ \text{k}\Omega$.

$$\begin{aligned}
V_E &= 0\ \text{V} \\
V_B &= V_E + V_{BE(on)} = 0.7\ \text{V} \\
I_B &= \frac{V_{BB} - V_B}{R_B} = \frac{5\ \text{V} - 0.7\ \text{V}}{100\ \text{k}\Omega} = 43\ \mu\text{A} \\
I_C &= \beta \times I_B = 4.3\ \text{mA} \\
I_E &= I_C + I_B = 4.343\ \text{mA} \\
V_C &= V_{CC} - I_C R_C = 10\ \text{V} - (4.3\ \text{mA})(1\ \text{k}\Omega) = 5.7\ \text{V}
\end{aligned}$$

Figure 2-9. DC Analysis of a simple BJT circuit.

In anticipation of logic applications, consider the base voltage as a logical input which is either high (above $V_{BE(on)}$) or low (below $V_{BE(on)}$). For a logic high input the transistor operates in the active mode, causing the voltage at the collector to drop below $V_{CC}$ by an amount equal to $I_C R_C$. Alternately, for a logic low input the transistor operates in the cutoff mode, drawing effectively no current through the collector and leaving $V_C$ approximately equal to $V_{CC}$. The functionality of this circuit is essentially that of a basic BJT inverter.
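The analysis of Figure (2-9) can be reproduced numerically. The sketch below assumes a grounded emitter, a fixed base-emitter drop of 0.7 V, a Beta of 100 (inferred from the 43 microamp base current and 4.3 mA collector current in the figure), and a hypothetical saturation voltage of 0.2 V that is not taken from the Hughes device data. The function name and return format are likewise illustrative.

```python
def bjt_dc_analysis(v_bb, v_cc, r_b, r_c, beta, v_be_on=0.7, v_ce_sat=0.2):
    """DC operating point of a grounded-emitter npn stage.

    v_bb drives the base through r_b; v_cc feeds the collector through r_c.
    v_ce_sat is an assumed, generic value (not from the source).
    """
    if v_bb <= v_be_on:
        # Emitter-base junction not forward biased: cutoff, no current.
        return {"mode": "cutoff", "i_b": 0.0, "i_c": 0.0, "i_e": 0.0, "v_c": v_cc}

    i_b = (v_bb - v_be_on) / r_b   # base current set by the base resistor
    i_c = beta * i_b               # Equation (2-5), valid in the active mode
    v_c = v_cc - i_c * r_c         # emitter is grounded, so V_CE equals V_C

    if v_c < v_ce_sat:
        # Saturation: the collector resistor, not Beta, limits the current.
        mode = "saturation"
        v_c = v_ce_sat
        i_c = (v_cc - v_ce_sat) / r_c
    else:
        mode = "active"

    return {"mode": mode, "i_b": i_b, "i_c": i_c, "i_e": i_b + i_c, "v_c": v_c}
```

Calling `bjt_dc_analysis(5.0, 10.0, 100e3, 1e3, 100)` reproduces the active-mode operating point of the figure, while a base drive below 0.7 V returns the cutoff case with the collector at the supply voltage, which is the inverter behavior described above.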
d) BJT Differential Pair

Before committing to the discussion of transistor logic circuits, it is necessary to introduce a configuration that maximizes the switching speed of the BJT transistor: the differential pair. A differential pair is constructed from two matched transistors ($Q_1$ and $Q_2$) with their emitters attached to a common current source and their collectors independently biased via separate pull-up resistors to a common voltage source, as shown in Figure (2-10). The base terminals are attached to separate voltage sources of equal value. Assuming the transistors have been given the proper DC bias for operation in the active mode, the relationship in Equation (2-7) is readily determined.

(2-7) $I_{E1} = I_{E2} = \dfrac{I_{bias}}{2}$

Figure 2-10. Example of a BJT Differential Pair configuration.

Now, consider the scenario where $V_{B2}$ is constant and $V_{B1}$ is allowed to vary between two extremes: one above and one below $V_{B2}$. When $V_{B1}$ reaches a voltage sufficiently larger than $V_{B2}$, all of the current from $I_{bias}$ is steered through $Q_1$ such that $Q_2$ is cutoff. Conversely, when $V_{B1}$ drops sufficiently below $V_{B2}$, $Q_2$ is on and $Q_1$ is cutoff. As noted in the DC analysis of the previous BJT circuit, the collector voltage of $Q_1$ exhibits the behavior of a logic inverter with respect to $V_{B1}$, while the opposite collector voltage ($Q_2$) functions as a non-inverting buffer.

While the availability of complementary output voltages is certainly convenient, the most important observation of the differential pair is its switching speed. A relatively small voltage difference between $V_{B1}$ and $V_{B2}$ is required to switch the current almost entirely to the opposite path. More specifically, for a differential pair implemented with the Hughes InP HBT, it is shown in Figure (2-11) that a difference of only 75 mV is sufficient to switch 90% of the current.

Figure 2-11. Current Switching Characteristic of the InP HBT Differential Pair.
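The current-steering behaviour can be approximated with the textbook relation for an ideal, matched bipolar differential pair. The sketch below is not the simulated Hughes InP HBT characteristic of Figure (2-11): it assumes ideal exponential junctions and a room-temperature thermal voltage of about 25.85 mV, and the function name `steering_fraction` is hypothetical. Under these assumptions the model predicts that roughly 95% of the bias current is steered at a 75 mV difference, somewhat more optimistic than the 90% read from the device curve.

```python
import math


def steering_fraction(v_diff, v_t=0.02585):
    """Fraction of I_bias flowing through Q1 in an ideal differential pair.

    v_diff : V_B1 - V_B2 in volts
    v_t    : thermal voltage in volts (assumed ~25.85 mV near 300 K)
    """
    return 1.0 / (1.0 + math.exp(-v_diff / v_t))


if __name__ == "__main__":
    for mv in (-75, -25, 0, 25, 75):
        frac = steering_fraction(mv / 1000.0)
        print(f"V_B1 - V_B2 = {mv:+4d} mV -> {100 * frac:5.1f}% of I_bias in Q1")
```

At zero differential input the current splits evenly, which is the condition expressed by Equation (2-7).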
Furthermore, since $Q_1$ and $Q_2$ are biased to operate in the active mode, the switching occurs faster than in scenarios which may place the transistors in saturation mode. This is because a saturated transistor stores charge in its base. That charge must be dissipated before switching can occur. It is the current-steering property of the differential pair configuration which ultimately provides a foundation for the development of current mode logic, as will be discussed later in this chapter. However, before reaching that discussion, a brief overview of the dominant BJT logic families will serve to accentuate the advantage of current mode logic.

2. BJT/HBT Logic Families

This discussion is not intended to address all BJT/HBT logic families. Rather, the purpose here is to summarize the principles of the two most popular and relevant BJT/HBT logic families. These are transistor-transistor logic and current-mode logic. Ultimately, this discussion culminates with a comparison of the two logic families in order to justify the implementation of current-mode logic for high-speed applications.

a) Transistor-Transistor Logic (TTL)

Transistor-transistor logic evolved directly from diode-transistor logic (DTL) in a successful effort to eliminate the drawbacks of DTL. (Richards, 1967) While there were several stages in this evolution, the end product is a TTL family which resembles the inverter shown in Figure (2-12). The enhanced performance of TTL is predominately achieved through two fundamental design features. The first improvement is the use of a second transistor in place of the diodes of a DTL circuit. For a low input voltage, $Q_1$ is turned on, rapidly drawing current from the base of $Q_2$ and dissipating the excess charge to achieve a faster transition.
In the opposite case, when the input is high and $Q_1$ is cutoff, $Q_1$ is specifically engineered to have a low reverse Beta such that a small yet sufficient current flows out through the collector and is applied to the base of $Q_2$. The second improvement is the use of an optimum output stage, commonly referred to as the "totem-pole" output stage (not shown in Figure 2-12). It combines the rapid high-to-low transition capability of the common-emitter output stage with the rapid low-to-high transition capability of the emitter-follower output stage. Based upon these two features in conjunction with other minor modifications, TTL logic achieved a level of popularity which made it the dominant design for SSI, MSI, and LSI circuits throughout two decades.

Despite this success, standard TTL circuit speeds are still limited by two design issues. First, transistors operate in saturation mode, which increases junction capacitance and its associated switching delay. Second, the resistance along the dissipation path for junction capacitance further increases this delay.

b) Current-Mode Logic (CML)

Current-mode logic is distinct from the design of other BJT/HBT logic families. The term "current-mode" refers to the channeling of a constant current along alternate paths to achieve logic functionality in circuits. Since it is the presence or absence of current that determines the logical output, the maximum voltage swing can be relatively small in contrast to voltage-mode circuits, such as TTL.

The distinguishing design feature of current-mode logic circuits is the BJT differential pair. It is the backbone of all CML circuits and the source of critical advantages and disadvantages. The benefit of smaller logic swings has already been mentioned.
Also, the discussion of the BJT differential pair earlier in this chapter explained how the collector voltage swings (inverts) rapidly in response to reversing the polarity/magnitude of the differential inputs by a narrow margin of approximately 75 mV. This translates into a switching speed for CML which is unsurpassed by its predecessors. Contributing to this remarkable speed is the fact that the transistors of the differential pair can be operated in the active region and, therefore, do not suffer from the effects of excess charge stored at the transistor base. Unfortunately, the constant flow of current which enables these remarkable switching speeds also consumes a remarkable amount of power.

For an illustration of how a CML circuit functions, consider the inverter in Figure (2-13). Let input B have a constant value: a reference voltage. When input A is high (greater than the reference voltage by at least 75 mV), then $Q_1$ is turned on and $Q_2$ is cut off. The current being drawn through $R_1$ produces a logic low ($V_{CC} - I_{bias}R_1$) at $V_{out1}$. Notably, the complement of this output, a logic high ($V_{CC}$), is simultaneously available at $V_{out2}$. The presence of complementary outputs is yet another benefit of CML circuits. When input A is switched from high to low, the conditions for $Q_1$ and $Q_2$ reverse. $Q_2$ turns on and $Q_1$ is cut off. $V_{out2}$ is pulled low while $V_{out1}$ is pulled high.

Figure 2-13. CML Inverter.

c) Advantages and Disadvantages

For high-speed applications, the selection of a BJT logic design is reduced to a quantitative comparison of TTL and CML. The predecessors of these two logic families are far inferior in their capability to dissipate the accumulated charge at the transistor base upon switching.
If the only two criteria were maximizing speed while minimizing power consumption, then there could possibly be a toss-up between TTL and CML, ultimately to be determined by the design which achieves the lowest power-delay product or by weighting one specification over the other (high-speed or low-power). Clearly, TTL is the low-power contender, while CML is the high-speed champion. However, before addressing the issue in the context of this design project, consider the following summary of advantages and disadvantages.

In addition to being faster, CML requires a smaller voltage swing than TTL and is less susceptible to noise due to the nature of the BJT differential pair. As another benefit of that nature, CML generates complementary outputs. The fact that both output signals are referenced to $V_{CC}$ provides for exceptional stability when $V_{CC}$ is referenced to ground and a negative supply voltage is used. Unfortunately for TTL, its strong point of consuming less power has a down side: the short pulses of current which must be generated for switching logic levels also create spikes in the supply voltage. The constant current drawn by CML circuits avoids this potential source of noise.

In conclusion to this comparison, a logic designer presented with the choice of CML or TTL would only choose TTL in the event that power consumption made CML impractical. In real world applications, this is typically true. However, since it is the purpose of this design project to explore the impact of high-speed logic on digital system architecture, priority has been given to the superior speed and extensive design benefits of CML.

Having concluded that current-mode logic is the best approach to HBT high-speed logic design, it is necessary to design a sufficient set of logic gates to implement the desired test circuit, an 8x8 bit pipelined multiplier.
Chapter III presents the discussion of logic circuit design which includes design of the following: an inverter/buffer gate, a NOR/OR gate, full adders, and a practical current source.

III. HBT CML LOGIC CIRCUIT DESIGN

A. DESIGN OVERVIEW

In this chapter, CML logic circuits are designed which will serve as the building blocks for construction of the multiplier logic. The design process is presented in the context of a single logic circuit, beginning with the most fundamental functions and progressing toward the more complex. Of note are the following general design goals which served as guidance for decision-making in the early stages of logic circuit design:

• Minimize the rail voltages (i.e. supply voltage)
• Achieve proper DC bias conditions with reliable noise margins and fanout
• Optimize transient performance for speed and power consumption

B. INVERTER DESIGN

1. Circuit Topology

Based upon the introduction to CML design in the previous chapter, Figure (3-1) illustrates the circuit topology of a CML inverter. A detailed description of its function is presented in the previous chapter and will not be repeated here. However, there is one subtle constraint in this design. One of the differential inputs is tied to a reference voltage. While this is not essential for the design of an inverter, it will prove significant in the implementation of multiple-input logic gates. A common reference voltage eliminates the need to provide complementary logic signals for each input and, furthermore, it avoids the increase in supply voltage associated with multiple complementary inputs in a stacked series of differential input pairs.

Figure 3-1. CML Inverter.

Figure (3-2) illustrates the same inverter design as Figure (3-1); however, it also includes an emitter-follower stage at each collector output of the differential pair. The purpose of this stage is twofold.
First, it provides a buffer between the input differential pair and the capacitive load of subsequent driven logic gates. Second, it produces a downward DC shift equal to the base-emitter turn-on voltage. Ideally, the gain of the emitter-follower is one; however, in practice the gain is slightly less than one. The result is a slightly diminished voltage swing at the output of the emitter-follower when compared to the voltage swing at the collector of the differential pair.

Whether or not to include the buffer stage represents a fundamental design issue for CML logic circuit design. At a glance, performance arguments can be made both for and against it. On the one hand, it would appear to increase fanout performance, yet on the other, it would appear to decrease switching performance with the additional switching delay of a second transistor stage. Additionally, the non-buffered output topology would consume less power for a given bias current. However, without performance data to substantiate one option over the other, both will be developed and evaluated until objective design considerations can identify a clear preference.

2. Initial Conditions and Design Parameters

a) Voltage Parameters

Having introduced the topology of the CML inverter, it is necessary to establish initial conditions for operation. The first is the supply voltage, which is bound by two primary considerations. It must be large enough to support the proper function of the circuit, i.e. provide proper transistor bias conditions and the desired voltage range between high and low logic levels. Conversely, it should be kept as small as possible, because the power consumed by the circuit is directly proportional to the magnitude of the supply voltage. Clearly, foresight must be exercised in order to determine the minimum supply voltage necessary to achieve proper DC bias conditions for all transistors in all circuits of the design.
In the context of this project, the D-type latch design (presented in Chapter IV) imposes the greatest demand on the supply voltage level by operating three transistors in series between the voltage supply rails. For optimum, reliable clocking performance of the latch, the logic reference voltage is determined to be 1.45 volts. This figure is based upon a maximum logic signal range of 0.5 volts and a maximum logic high voltage of 1.7 volts (reference Chapter IV-A-3a for further details).

Given this information, the minimum required supply voltage is determined for each inverter topology. Both require that the voltage at the collector ($V_C$) be large enough to avoid saturation of $Q_1$. Furthermore, both require that the voltage at the collector provide for an output voltage that matches the range of the input voltage. For the non-buffered topology, this implies an inverse match between the voltage at the base of $Q_1$ and the voltage at its collector. In other words, for a logic input that is high, $V_{B(hi)}$, the output voltage at the collector should be low, such that the relationship in Equation (3-1) holds true.

(3-1) $V_{C(low)} = V_{B(hi)} - 0.5\,\text{V}$

Assuming the collector of $Q_1$ draws approximately 1 mA of current, the collector-emitter saturation voltage, $V_{CE(sat)}$, is 0.275 volts and the base-emitter turn-on voltage is 0.775 volts. Under these conditions, $Q_1$ is on the boundary of active mode operation. For a signal swing larger than 0.5 volts, the transistor would saturate. Conversely, for a logic input ($V_B$) that is low, the collector voltage ($V_C$) must be given by Equation (3-2).

(3-2) $V_{C(hi)} = V_{B(low)} + 0.5\,\text{V}$

For $V_{B(low)}$ equal to 1.2 volts, $V_{C(hi)}$ must be 1.7 volts. Thus, for the non-buffered topology, the maximum voltage at the collector is 1.7 volts. No current flows through $R_{gain}$ because $Q_1$ is cutoff; therefore, the minimum required supply voltage is also 1.7 volts.
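The arithmetic behind Equations (3-1) and (3-2) can be checked with a few lines of code. This is a sketch only: the function name `non_buffered_levels` is hypothetical, and it assumes the logic levels are placed symmetrically about the 1.45 volt reference (1.7 volts high, 1.2 volts low), with the 0.775 volt turn-on voltage quoted above for roughly 1 mA of collector current.

```python
def non_buffered_levels(v_ref=1.45, swing=0.5, v_be_on=0.775):
    """Node voltages for the non-buffered inverter (all values in volts)."""
    v_b_hi = v_ref + swing / 2.0     # logic-high input level
    v_b_low = v_ref - swing / 2.0    # logic-low input level
    v_c_low = v_b_hi - swing         # Equation (3-1): collector when Q1 conducts
    v_c_hi = v_b_low + swing         # Equation (3-2): collector when Q1 is cutoff
    v_e_on = v_b_hi - v_be_on        # emitter voltage while Q1 conducts
    v_ce_on = v_c_low - v_e_on       # collector-emitter voltage while Q1 conducts
    return {
        "v_b_hi": v_b_hi, "v_b_low": v_b_low,
        "v_c_low": v_c_low, "v_c_hi": v_c_hi,
        "v_ce_on": v_ce_on,
        "v_supply_min": v_c_hi,      # no drop across R_gain when Q1 is cutoff
    }


if __name__ == "__main__":
    for name, value in non_buffered_levels().items():
        print(f"{name:>12s} = {value:.3f} V")
```

With these inputs the conducting transistor sees a collector-emitter voltage equal to the quoted 0.275 volt saturation voltage, which is why a swing larger than 0.5 volts would push $Q_1$ into saturation.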
In the case of the buffered topology, the DC voltage drop across the base-emitter junction of the output buffer imposes a greater demand. For the output voltage range to match the input voltage range, the voltage at the collector (as described in Equation 3-2) must be increased by an amount of $V_{BE(on)}$ (as shown in Equation 3-3) in order to counter the base-emitter voltage drop at the buffered output.

(3-3) $V_{C(hi)} = V_{B(low)} + 0.5\,\text{V} + V_{BE(on)}$

Assuming a current of 1 mA or less through the buffer, $V_{BE(on)}$ is 0.775 volts. The result is a minimum required supply voltage of 2.5 volts. (Reference Chapter IV-A-3a for a thorough derivation of these conclusions.)

In summary, different supply voltage levels will be utilized for the two inverter topologies. The non-buffered output topology will employ a 1.7 volt supply voltage, while the buffered output topology will employ a 2.5 volt supply voltage.

b) Transistor Area/Size

In order to optimize switching speeds in BJT/HBT transistors, it is desirable to keep the device area small, thereby minimizing parasitic capacitances. Likewise, a smaller device size requires less current, and less current means less power. The InP HBT device sizes made available from Hughes Research Laboratories have junction areas of 1x1, 1x3, 1x5, and 2x5 microns. The 1x1 area transistor is, therefore, the transistor of choice for switching applications (logic circuits). Note, however, that the consideration of device size must be re-visited for applications where switching speed is not a factor, i.e. the construction of a practical current source (addressed in Chapter IV).

c) Fanout Requirement

Fanout is the number of logic gate inputs that a single gate output can drive, while providing voltage levels within the correct logic range. Increased fanout is achieved at the expense of power consumption and loss of speed.
Considering that the CML logic inputs/loads are current-driven, increased fanout will require a corresponding increase in switching delay and/or current. As a result, the fanout parameter should be chosen such that it sufficiently economizes the number of logic gates and levels of logic required without needlessly sacrificing power and speed. In meeting this requirement, a reasonable fanout parameter has been established based upon the logic-level design of a three-input adder (reference Chapter III-D). For implementation using the minimum number of logic levels, a three-input adder requires a fanout of four.

3. DC Analysis

a) Overview

Given the circuit topology for a CML inverter as shown previously in Figure (3-2), the first step in circuit design is to establish the proper DC bias conditions for operation. This can be done for both the buffered and non-buffered cases simultaneously. For the non-buffered case, simply disregard the presence of the buffer stages. The remaining node voltages at the collector outputs on the differential pair are the same. Figures (3-3a) and (3-3b) show the DC node voltages for the desired operation of a CML inverter given a high logic input and a low logic input, respectively.

Given matched transistors, the two sides of the differential pair could be considered symmetric in their behavior, except that the input voltages driving the opposite sides of the differential pair are not symmetric. That is, the reference voltage drives the differential pair at 1.45 volts whereas the logic input drives it at 1.7 volts. The result is a difference of 0.25 volts at the emitter. This is a minor observation at present, but it explains the non-symmetric performance that is encountered between the two output signals (the inverted and the non-inverted signals).

[Figure 3-3 schematics: inverter annotated with DC node voltages at the collectors and buffered outputs for each input level. NOTE: Leakage current is neglected.]

Figure 3-3. DC Analysis of a CML Inverter for (a) a HIGH input logic level and (b) a LOW input logic level.

b) Gain Resistor

In order to take advantage of the switching speed of the differential pair, transistors must be biased to operate in the active mode. Therefore, the value of the base-emitter voltages ($V_{BE}$) for $Q_1$ and $Q_2$ must be such that $V_{CE} > V_{CE(sat)}$. Thus, for a given supply voltage and bias current, there is a restriction on the magnitude of the voltage drop across $R_{gain}$. If the drop is too large, the transistor will saturate. Conversely, the voltage drop must not be too small because it is the product of $I_{R_{gain}}$ and $R_{gain}$ which determines the magnitude of the signal voltage swing (assuming active operation). This same voltage range applies to the output of the buffer stages as well. As referenced earlier in this chapter, a constant DC shift of $V_{BE(on)}$ is the only difference between the nodes $V_C$ and $V_{buf}$.

In summary, the significance of $R_{gain}$ is two-fold: it must be small enough to keep $Q_1$ (and $Q_2$) operating in the active mode, and it must be large enough to provide a satisfactory voltage swing between logic levels. Figure (3-4) illustrates the DC transfer characteristic of the inverter for various values of gain resistance. It effectively demonstrates the upper and lower limitations of gain resistance for a value of $I_{bias}$ equal to 1 mA. At resistances of 500 ohms and less, the desired 0.5 volt signal swing is not achieved, and at resistances of 600 ohms and greater, the effect of saturation can be observed by the upward bend in the curve.

Figure 3-4. Effect of Gain Resistor Variation on Inverter Output.

c) Buffer Resistor

The buffer resistor ($R_{buf}$) governs the amount of current drawn by the emitter of transistors $Q_3$ and $Q_4$.
The magnitude of emitter current is directly proportional to the base current which is drawn from the collector of the differential pair. Thus, the base current of the output buffer represents a small portion of the current passing through $R_{gain}$. In this way, the size of the buffer resistor effectively produces a small DC offset at the buffered output while regulating the amount of current drawn through the buffer stage. This is significant for two reasons. First, it facilitates optimization of switching speed versus power consumption by providing a mechanism for controlling the amount of current flowing through the buffer stage and, therefore, available to drive a logic load. Second, $R_{buf}$ is inversely proportional to a DC voltage offset at the buffered output. The ability to control this offset is especially helpful in matching the output signal swing to the input. Figure (3-5) represents the variation of output voltage for a range of resistor values based upon a bias current of 1 mA.

Figure 3-5. Effect of Buffer Resistor Variation on Inverter Output.

d) Bias Current

Bias current is directly proportional to the current ($I_C$) drawn through the gain resistor ($R_{gain}$). Therefore, bias current drives the magnitude of the voltage drop produced in the gain resistor, and this voltage drop corresponds to the maximum signal voltage swing. For this reason, a proper combination of $I_{bias}$ and $R_{gain}$ must be determined to provide the desired 0.5 volt swing.

In order to select from an infinite set of current-resistor combinations, a likely set of current-resistor pairs will be identified to represent the practical range of possibilities. This is done for both the buffered and non-buffered inverter topologies. Note, the non-buffered topology can be allowed to draw a higher bias current through the differential pair because it does not draw any additional current through buffer stages.
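As a starting point for pairing bias currents with gain resistors, a first-order estimate follows from Ohm's law: the swing is roughly the bias current multiplied by the gain resistance. The sketch below is an idealization and not the thesis's DC simulation. It assumes the entire bias current is steered through one gain resistor and neglects the buffer base current and finite Beta; the function name `first_order_r_gain` is hypothetical, and the bias currents are the ones tabulated later in this chapter. The simulated values differ from this estimate: Figure (3-4) indicates that 500 ohms at 1 mA falls short of the full swing, and the final design pairs 0.75 mA with 750 ohms.

```python
def first_order_r_gain(i_bias, swing=0.5):
    """Ideal gain resistance (ohms) giving `swing` volts at `i_bias` amperes.

    Assumes all of I_bias flows through one gain resistor (idealization).
    """
    return swing / i_bias


if __name__ == "__main__":
    for i_ma in (0.1, 0.25, 0.5, 0.75, 1.0, 1.5):
        r = first_order_r_gain(i_ma * 1e-3)
        print(f"I_bias = {i_ma:4.2f} mA -> R_gain ~ {r:7.1f} ohms (first-order)")
```

The estimate is useful mainly for choosing the range over which to sweep $R_{gain}$ in DC analysis, since it brackets the region between an insufficient swing and the onset of saturation.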
e) DC Noise Margins

Once values of resistance and bias current are established, the circuit topology is completely defined and a DC transfer curve can be obtained. From this plot the DC noise margins for a particular design are calculated. Noise margins provide a measure of the allowable noise which can be received at the input without affecting the correct logic output. Since this circuit will be operating with such a narrow signal voltage swing, noise margins are a critical interest for establishing reliable DC bias conditions. Equations (3-4) and (3-5) define the high and low noise margins in terms of the maximum and minimum, high and low logic values. (Weste, 1993)

(3-4) $NM_L = |V_{ILmax} - V_{OLmax}|$

(3-5) $NM_H = |V_{OHmin} - V_{IHmin}|$

where,

$V_{IHmin}$ = minimum HIGH input voltage
$V_{ILmax}$ = maximum LOW input voltage
$V_{OHmin}$ = minimum HIGH output voltage
$V_{OLmax}$ = maximum LOW output voltage

These logic values are extracted from the DC transfer curve. The two unity gain points (where the slope equals negative one) of the DC transfer curve have been used to define the boundaries of these regions.

f) DC Bias Optimization

Given a set of practical current values, DC analysis is employed to identify a set of matching gain resistances which properly bias the inverter for logic operations. For each pair of current-resistor values, a DC transfer characteristic is obtained to determine the noise margins and the maximum range of the signal swing. The results are tabulated in Table (3-1). In the absence of a load, each configuration met the established design requirements: a matched input and output signal voltage range of 0.5 volts, centered at a reference voltage of 1.45 volts with sufficiently balanced noise margins of 0.1 volt minimum (20% of the signal range).

However, when examined under the maximum fanout load (which is four), the performance of the non-buffered output topology suffers greatly.
The maximum high logic voltage is reduced by an amount ranging from 0.09 volt to 0.23 volt, depending upon the bias configuration. Not only does a load reduce the desired 0.5 volt signal range, but it also erodes the high-end noise margin. As a result, the non-buffered output topology can now be eliminated from further consideration in the design process.

As for the buffered output topology, the noise margins and voltage range are remarkably consistent, regardless of the loading. The output buffer effectively isolates the current drawn by the load from the current in the differential pair. Thus, each of the bias configurations for the buffered output topology will be further tested under transient conditions to identify the optimum inverter design. It should be noted that the DC analysis presented here and the transient performance analysis which follows are both conducted using ideal current source models.

4. AC/Transient Analysis

a) Delay Measurements

Transient performance of logic circuits is generally quantified by measuring the delay associated with signal propagation. The delay times utilized here are standard performance parameters. However, for completeness, their mathematical definitions are provided below in Equations (3-6) and (3-7). (Weste, 1993)

(3-6) $t_{fall}$ = time for a logic signal to traverse from $0.9\,V_{RANGE}$ to $0.1\,V_{RANGE}$

(3-7) $t_{rise}$ = time for a logic signal to traverse from $0.1\,V_{RANGE}$ to $0.9\,V_{RANGE}$

where, $V_{RANGE}$ = the voltage difference between the steady state $V_{HI}$ and $V_{LOW}$

b) Performance Parameters

At this point in the design process, two performance parameters are of primary concern: power and speed. Being related to each other, there is often a trade-off between the two. Optimization of these two parameters will determine which of the DC bias inverter configurations will be implemented. A common method of optimization is to quantify the parameters of power and speed as a single figure of merit, such as a product or a ratio.
Optimization is then achieved by maximizing or minimizing the appropriate figure of merit. Power-delay product is one such figure of merit. It is simply the power consumed by a logic circuit multiplied by the propagation delay of the signal from input to output. Expectedly, the design that most efficiently balances the trade-off between speed and power consumption will yield the lowest power-delay product in transient testing.

The ratio of speed to power provides a similar figure of merit, but speed measurements are not as clearly defined as delay measurements. Therefore, in the interest of optimizing this design for speed, a definition of maximum switching frequency will now be established. The maximum reliable frequency is defined as the maximum switching frequency of the logic input signal for which a maximally loaded output signal consistently traverses 90% of the 0.5 volt range of logic.

c) Transient Analysis Procedures

For an accurate evaluation of logic circuit performance, it is necessary to provide a realistic input signal and a worst-case output load. Here, the term load implies driving four inverters in parallel. To achieve a realistic test environment, the test circuit of Figure (3-6) was designed. Specifically, note the location of gates A and B. Their input and output signals will be measured to analyze performance with a fanout of one and four, respectively.

It is expected that the use of a reference voltage at the differential input of the inverter will cause the inverted and non-inverted output signals to respond differently. As a result, two gate topologies are analyzed for each of the valid DC bias configurations from Table (3-1). The first gate topology is a single output inverter from which the inverted output signal is measured. The second is a complementary output inverter from which the non-inverted output signal is measured.
Conveniently, these two configurations also represent the alternating signal pattern which will characterize the adder circuits later in this chapter.

Initially, the appropriate logic delays are measured at gate A and gate B in order to collect data for the cases of minimum and maximum loads, respectively. The worst-case delay is then multiplied by the average power per gate to obtain a power-delay product. This is done for both the inverted and the non-inverted output signals, providing separate power-delay product terms. Their sum forms a composite power-delay product. The composite power-delay product is a figure of merit which effectively represents the implementation of the two gate topologies in series.

Finally, the switching period of the input logic is decremented for successive tests in order to determine the shortest period for which the output signal of a loaded gate (gate B) would consistently traverse the full range of logic (between high and low). This quantity has been defined in the previous section as the maximum reliable frequency (MRF). For each configuration, the maximum reliable frequency is divided by the average power per gate to obtain a speed-power ratio (GHz/mW). The presence of a secondary load provides confirmation that consecutive loads can be successfully driven when the primary load is driven at its maximum reliable frequency.

d) Summary of Results

Transient analysis confirms the non-symmetric behavior of the inverted and non-inverted output signals. Therefore, Tables (3-2a) and (3-2b) provide details of their respective delay measurements.

| Bias Current (mA) | Tprop L-H (ps) | Tprop H-L (ps) | Current per Gate (mA) | Power per Gate (mW) | Maximum Power-Delay Product (mW-ps) |
|---|---|---|---|---|---|
| 0.1 | 42 | 255 | 0.81 | 2.03 | 518 |
| 0.25 | 56 | 48 | 0.97 | 2.42 | 136 |
| 0.5 | 33 | 26 | 1.28 | 3.20 | 106 |
| 0.75 | 23 | 26 | 1.59 | 3.99 | 104 |
| 1 | 17 | 26 | 1.88 | 4.69 | 122 |
| 1.5 | 13 | 27 | 2.38 | 5.94 | 160 |

Table 3-2a. Power-Delay Data for the Inverted Signal. Single output topology with practical current sources and a fanout load of four.
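The last column of Table (3-2a) can be reproduced from the other columns by multiplying the power per gate by the worse of the two propagation delays. The following sketch uses only the values printed in the table; the names `TABLE_3_2A` and `power_delay_product` are hypothetical labels chosen for this illustration.

```python
# Rows of Table (3-2a): (bias current mA, Tprop L-H ps, Tprop H-L ps, power per gate mW)
TABLE_3_2A = [
    (0.10, 42, 255, 2.03),
    (0.25, 56, 48, 2.42),
    (0.50, 33, 26, 3.20),
    (0.75, 23, 26, 3.99),
    (1.00, 17, 26, 4.69),
    (1.50, 13, 27, 5.94),
]


def power_delay_product(power_mw, t_lh_ps, t_hl_ps):
    """Worst-case power-delay product in mW-ps."""
    return power_mw * max(t_lh_ps, t_hl_ps)


def best_bias(rows):
    """Return the bias current (mA) with the lowest power-delay product."""
    return min(rows, key=lambda r: power_delay_product(r[3], r[1], r[2]))[0]


if __name__ == "__main__":
    for i_bias, t_lh, t_hl, p in TABLE_3_2A:
        pdp = power_delay_product(p, t_lh, t_hl)
        print(f"I_bias = {i_bias:4.2f} mA -> PDP = {pdp:6.1f} mW-ps")
    print(f"Lowest PDP for the inverted signal at {best_bias(TABLE_3_2A)} mA")
```

For the inverted signal alone, the minimum falls at the 0.75 mA configuration, with the 0.5 mA configuration close behind.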
| Bias Current (mA) | Tprop L-H (ps) | Tprop H-L (ps) | Current per Gate (mA) | Power per Gate (mW) | Maximum Power-Delay Product (mW-ps) |
|---|---|---|---|---|---|
| 0.1 | 212 | 82 | 1.45 | 3.63 | 770 |
| 0.25 | 61 | 88 | 1.64 | 4.10 | 361 |
| 0.5 | 27 | 63 | 2.02 | 5.04 | 318 |
| 0.75 | 23 | 46 | 2.31 | 5.78 | 266 |
| 1 | 19 | 41 | 2.63 | 6.56 | 269 |
| 1.5 | 18 | 40 | 3.09 | 7.74 | 309 |

Table 3-2b. Power-Delay Data for the Non-Inverted Signal. Complementary output topology with practical current sources and a fanout load of four.

Specifically, these delay measurements show that the high-to-low transition of the non-inverted output signal represents the worst-case transition.

The overall performance of each DC bias configuration is summarized in Table (3-3). The power-delay product and speed-power ratio are normalized to simplify comparison. Figure (3-7) illustrates the minimization curve for the power-delay product, while Figure (3-8) shows the maximization curve for the speed-power ratio. Clearly, the 0.75 mA configuration proves to be the optimum design, maximizing the speed-power ratio while minimizing the power-delay product. Furthermore, it provides for a maximum reliable frequency of 8.7 GHz. This is more than suitable to achieve the 5 GHz maximum clock frequency desired in Chapter V (for the maximally pipelined multiplier implementation).

| Bias Current (mA) | Maximum Composite Power-Delay Product | Normalized Composite Power-Delay Product | Maximum Reliable Frequency (GHz) | Normalized Speed-Power Ratio |
|---|---|---|---|---|
| 0.1 | 467 | 3.48 | n/a | n/a |
| 0.25 | 144 | 1.34 | 5.30 | 0.86 |
| 0.5 | 96 | 1.14 | 7.10 | 0.94 |
| 0.75 | 72 | 1.00 | 8.70 | 1.00 |
| 1 | 67 | 1.06 | 9.09 | 0.92 |
| 1.5 | 67 | 1.27 | 11.10 | 0.96 |

Table 3-3. Summary of Transient Analysis Results. Composite Power-Delay Product and Speed-Power Ratio.

Figure 3-7. Results of Transient Analysis: Normalized Speed-Power Ratio of Inverter Configurations.

Figure 3-8. Results of Transient Analysis: Normalized Power-Delay Product of Inverter Configurations.

5. Final Design Summary: Inverter

The final design for the CML inverter/buffer circuit is illustrated in Figure (3-9).
The applicable design and performance parameters have been summarized in Table (3-4). Here, the data represent performance when the design is implemented with the 0.75mA practical current source from Chapter III-E. Also note that when complementary output signals are not required, the unused output buffer stage can be excluded to conserve power and minimize the device count.

CML Inverter Design and Performance Parameters

Rgain:  750 Ω
Rbuf:   2000 Ω
Ibias:  0.75 mA
NML:    0.13 V (26% Vswing)
NMH:    0.14 V (28% Vswing)
Power:  5.78 mW (complementary output)
        3.99 mW (single output)

            Inverted Signal            Non-inverted Signal
Delays      Fanout = 1   Fanout = 4    Fanout = 1   Fanout = 4
tp(H-L)     14ps         26ps          39ps         46ps
tp(L-H)     17ps         23ps          18ps         23ps
tfall       19ps         41ps          87ps         90ps
trise       48ps         61ps          45ps         60ps

Table 3-4. CML Inverter Design and Performance Parameters.

Figure 3-9. Final Design of the CML Inverter.

C. LOGIC NOR GATE DESIGN

1. Overview and Analysis

The circuit topology for a two-input CML NOR gate is presented in Figure (3-10). It differs little from the inverter, so the analysis here closely follows that of the previous section. In fact, with regard to both circuit topology and performance analysis, the only distinguishing feature is the second logic input in parallel with the first.

Consider the functionality of the two parallel inputs A and B. If either of them is a logic high, then the left side of the differential pair is on and the NOR output is pulled low. Conversely, if both inputs A and B are low, then the NOR output is high. On the opposite side of the differential pair is the complementary output, the OR function. If another input transistor were added in parallel to the existing two, it would be a three-input OR/NOR gate, and similarly for a fourth input.

Figure 3-10. Circuit topology for a two-input OR/NOR logic gate.
Despite the drastic change in functionality, the presence of several logic inputs in parallel to the original logic input induces no fundamental change to the DC bias of the circuit. As a result, the DC bias conditions for the optimized inverter circuit are directly applied to the final design of the NOR circuit.

2. Final Design Summary: OR/NOR

With the exception of having multiple parallel transistors for multiple logic inputs, the final design for the CML OR/NOR logic circuit is identical to that of the inverter. As for its performance, the noise margins and delay measurements vary only slightly in response to the "multiple trigger" effect of simultaneous parallel inputs. The design parameters are identical to the inverter and therefore are not repeated. However, a selection of the performance parameters has been provided in Table (3-5) in order to demonstrate the variation of performance based upon the input configuration.

Conveniently, the NOR gate presents a nearly identical capacitive load to that of the inverter, with maximum delay differences of less than 1.5ps. It exhibits the same delay variations between its OR and NOR signals as the inverter does between the inverted and non-inverted signals. And finally, as with the inverter, when both of the complementary outputs of the OR/NOR gate are not required, the unused output buffer stage is omitted to conserve power and minimize the device count.
CML OR/NOR Gate Delay Performance Parameters

2-Input OR/NOR Gate, Single Input Transition
                                   NOR Signal                OR Signal
                                   Fanout = 1   Fanout = 4   Fanout = 1   Fanout = 4
                         tp(H-L)   16ps         29ps         40ps         47ps
                         tp(L-H)   24ps         29ps         19ps         23ps

3-Input OR/NOR Gate, Single and Simultaneous Input Transitions
                                   NOR Signal                OR Signal
                                   Fanout = 1   Fanout = 4   Fanout = 1   Fanout = 4
Single Input             tp(H-L)   19ps         28ps         41ps         48ps
Transition               tp(L-H)   29ps         34ps         18ps         23ps
Simultaneous Input       tp(H-L)   17ps         38ps (?)     40ps         47ps
Transition               tp(L-H)   43ps         48ps         11ps         16ps

4-Input OR/NOR Gate, Single Input Transition
                                   NOR Signal                OR Signal
                                   Fanout = 1   Fanout = 4   Fanout = 1   Fanout = 4
Single Input             tp(H-L)   21ps         30ps (?)     41ps         48ps
Transition               tp(L-H)   33ps         39ps         18ps         23ps

Table 3-5. Summary of OR/NOR Gate Delay Performance.

3. Implementation of the AND Function

In current-mode logic, the AND function is implemented by simply inverting the input signals and reversing the polarity designation of the output nodes. In practice, inverters and OR/NOR gates are sufficient to realize any logic function. Thus, for the sake of simplicity, AND gates were not constructed as a separate logic circuit. Rather, all logic functions were deliberately expressed as functions of inverters and OR/NOR gates.

D. ADDER DESIGN

1. Implementation

Two-input and three-input adders are required to construct the carry-save adders and carry-completion adders of the multiplier (Chapter V). Equipped with a sufficient set of logic gates, this is an elementary task. The sum-of-minterms expressions for the sum and carry bits of a two-input adder are shown in Equations (3-8) and (3-9), respectively.

(3-8)   $\mathrm{Sum}|_{2input} = X'Y + XY'$

(3-9)   $\mathrm{Carry}|_{2input} = XY$

Employing De Morgan's Theorem, these expressions can be manipulated into the equivalent expressions for implementation with OR/NOR gates, as shown in Equations (3-10) and (3-11).
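The equivalence between the minterm forms and their OR/NOR-only counterparts, Sum = (X'+Y)' + (X+Y')' and Carry = (X'+Y')', can be checked exhaustively, along with the AND-from-NOR construction described in the previous section. The following is a minimal behavioral sketch only; the function names are hypothetical and the gates are modeled as ideal Boolean functions with no timing.

```python
from itertools import product


def nor(*inputs):
    """Multi-input NOR: 1 only when every input is 0."""
    return int(not any(inputs))


def and_via_nor(a, b):
    """AND realized as a NOR of the inverted inputs (De Morgan)."""
    return nor(1 - a, 1 - b)


def half_adder_minterms(x, y):
    """Two-input adder in sum-of-minterms form: X'Y + XY' and XY."""
    total = ((1 - x) & y) | (x & (1 - y))
    carry = x & y
    return total, carry


def half_adder_nor(x, xn, y, yn):
    """Two-input adder built from NOR terms.

    xn and yn are the complementary inputs, assumed to be supplied by the
    preceding gate so that no inverter is needed.
    """
    total = nor(xn, y) | nor(x, yn)   # (X'+Y)' + (X+Y')'
    carry = nor(xn, yn)               # (X'+Y')'
    return total, carry


if __name__ == "__main__":
    for x, y in product((0, 1), repeat=2):
        print(x, y, half_adder_minterms(x, y), half_adder_nor(x, 1 - x, y, 1 - y))
```

Passing the complements in explicitly mirrors the requirement that complementary logic inputs be available to the adder.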
(3-10)   $\mathrm{Sum}|_{2input} = (X'+Y)' + (X+Y')'$

(3-11)   $\mathrm{Carry}|_{2input} = (X'+Y')'$

This adder design requires that the complementary logic inputs be provided in order to eliminate the need for inverters and a third level of logic delay. Such a requirement is trivial because complementary signals are potentially available at the output of each CML logic gate. Figure (3-11) illustrates the two-input adder.

Figure 3-11. Two-input adder with identification of the critical path.

2. Performance Analysis

Proper functioning of each adder was verified for all possible input combinations. Notice that the critical path for each adder is identified in Figures (3-11) and (3-12). For the two-input adder, the critical path flows through two levels of logic to produce the sum bit. The worst-case transition is from a (1/0) or a (0/1) input for (X/Y) to a (1/1) input. This is because the worst-case gate delay is the high-to-low transition of the OR output when it has been driven by the high-to-low output transition of the preceding NOR gate. Based upon the data from Table (3-5), the critical path delay equals 63 picoseconds. This agrees well with a simulation of the critical path, which yields 60 picoseconds. Similarly, for the three-input adder the critical path delay is calculated to be 67 picoseconds along the path illustrated in Figure (3-12). This was validated with a simulation measurement of 66 picoseconds.

E. PRACTICAL CURRENT SOURCE DESIGN

1. Circuit Topologies

Up to this point, each logic element has been designed using an ideal current source. In order to validate the performance of these designs for actual implementation, it is necessary to construct a practical current source. There are effectively three circuit configurations which provide transistor bias conditions for establishing a current source. These three topologies are presented in Figure (3-13).
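The critical-path figures quoted for the adders in the performance analysis above (63 and 67 picoseconds) can be reproduced from the tp(H-L) entries of Table (3-5). The sketch below rests on an assumption that the thesis does not state explicitly: the first-level NOR gate is taken at a fanout of one and the second-level OR output at a fanout of four. That pairing happens to match both quoted values; the names are hypothetical.

```python
# tp(H-L) in picoseconds from Table (3-5), single input transition,
# keyed by (number of gate inputs, output signal, fanout).
TP_HL_PS = {
    (2, "NOR", 1): 16, (2, "NOR", 4): 29, (2, "OR", 1): 40, (2, "OR", 4): 47,
    (3, "NOR", 1): 19, (3, "NOR", 4): 28, (3, "OR", 1): 41, (3, "OR", 4): 48,
}

# Simulated critical-path delays reported in the text (ps).
SIMULATED_PS = {2: 60, 3: 66}


def two_level_delay(n_inputs, first_fanout=1, second_fanout=4):
    """NOR high-to-low delay followed by OR high-to-low delay.

    The default fanouts are an assumption chosen to match the text.
    """
    first = TP_HL_PS[(n_inputs, "NOR", first_fanout)]
    second = TP_HL_PS[(n_inputs, "OR", second_fanout)]
    return first + second


if __name__ == "__main__":
    for n in (2, 3):
        estimate = two_level_delay(n)
        print(f"{n}-input adder: estimate {estimate} ps, "
              f"simulation {SIMULATED_PS[n]} ps, "
              f"difference {estimate - SIMULATED_PS[n]} ps")
```

With these inputs the table-based estimate exceeds the simulated value by 3 ps and 1 ps, respectively.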
In each configuration the amount of bias current drawn is regulated by, and directly proportional to, the magnitude of the current drawn by the base of $Q_{SOURCE}$.

Figure 3-13. Current Source Topologies.

2. Performance Analysis

In order to analyze and compare the performance of each current source, three simple 0.75mA current sources are designed, one using each topology. Each is then implemented as the practical current source for the inverter/buffer circuit of Chapter III-B-5. Their relative performance is evaluated based upon the following design goals:

• Minimize the operational limitations due to frequency response
• Approximate the performance of an ideal current source
• Minimize the cost of implementation (power and device count)

The performance of each configuration is illustrated in Figure (3-14a) and (3-14b). Notice that each inverted output signal drops below the desired 1.2 volt low level when making the transition from high to low. This "dip" results from reversing the polarity of the differential pair input signals, which induces a brief drop in the bias voltage at the positive (POS) terminal of the current source. A delayed return to the proper bias voltage is then governed by the RC characteristics of the $Q_{SOURCE}$ collector. This delay is particularly evident in the transient performance of the topologies in Figure (3-13a) and (3-13b).

3. Final Design: Current Source

By process of elimination, the current mirror topology of Figure (3-13c) is the only design suitable for driving a logic device family that is capable of switching frequencies above 8 GHz. Unfortunately, the current mirror also incurs the largest cost in terms of power and device count. Thus, to reduce the amount of current "lost" through the left side [...]

[Figure (3-14a) and (3-14b): Inverted Output (Voltage) and Non-Inverted Output (Voltage).]

a) 0.75mA Current Source

The final current source design for a 0.75mA current source is shown in Figure (3-15). The DC transfer characteristic of this source, Figure (3-16), illustrates that the bias current drawn is a function of the collector-emitter voltage ($V_{CE}$) at $Q_{SOURCE}$. More specifically, it is seen that $V_{CE}$ must be greater than 0.3 volts in order to ensure that 0.75mA is drawn. This represents a critical design parameter for establishing a proper DC bias on the current source.

Figure 3-15. Final Design of a Practical 0.75mA Current Source.

The 0.75mA current source design is validated by a direct performance comparison with an ideal current source. Figure (3-17) compares the output signals for a maximally loaded inverter/buffer circuit when driven by both an ideal and a practical current source. It can be seen that, for the inverted output signal, the transition produced with the practical source is consistently ahead of that produced with the ideal source by a margin of five picoseconds. Meanwhile, the non-inverted output signal of the practical current source matches the delay of the ideal source. In a design that is characterized by alternating stages of positive and negative logic signals, it is reasonable to expect that the implementation of the practical current source would yield a slight improvement over the ideal source.

b) 2.0mA Current Source

Anticipating the conclusions of Chapter IV, it is convenient here to present the design of the 2mA practical current source. This design is a simple modification to the 0.75mA design, implemented by decreasing the resistance from 5250 Ω to 2020 Ω. This allows an increase of current flow into the base of $Q_{MIRROR}$ and produces the transfer characteristic shown in Figure (3-18). Again, a bias voltage at $Q_{MIRROR}$ must ensure that $V_{CE}$ is greater than or equal to 0.3 volts in order to achieve proper functioning of the current source.

Figure 3-18. Transfer Characteristic of the 2.0mA Current Source.

The 2mA current source is also validated by testing it against an ideal current source while driving a maximally loaded D-type CML latch. The respective output signals, Q and QN, are plotted in Figure (3-19). It can be seen that the output signal transition delay resulting from the practical source compares favorably with the delay associated with the ideal source. However, the ideal-driven output signals consistently cross the reference voltage of 1.45 volts approximately 10 picoseconds ahead of the practical-source-driven output signals. Thus, the effective margin of error for approximating the practical source with an ideal source is 10 picoseconds. In a synchronous pipelined architecture, this simply adds between 10 and 20 picoseconds to the minimum clock period.

Figure 3-19. Comparison of Latch Performance, Practical Current Source vs. an Ideal Source.

In summary, a sufficient set of logic circuits is now in hand, along with a practical current source with which to drive them. Thus, the combinational logic for a multiplier can be fully implemented. However, because this multiplier is to be pipelined, it is necessary to construct the clock-driven devices that will control the flow of data. Chapter IV presents this discussion with the design of a D-type latch, a D-type flip-flop, and a clock driver.

IV. HBT CML LATCH AND REGISTER DESIGN

A. LATCH DESIGN

1. Circuit Topology

a) Two Latch Topologies

The most common latch design is based upon the logic-level schematic illustrated in Figure (4-1). Design of this latch simply requires the proper connection of four NOR gates with the appropriate clock and logic input signals. The cumulative power consumed by the four NOR gates constitutes a significant cost (based upon the four milliwatt per gate design from Chapter III).

Figure 4-1. D-type Latch constructed from NOR gates.

However, the unique characteristics of CML provide an alternative design that yields comparable performance at a significant savings in power. This CML latch design is illustrated in Figure (4-2).
Due to the relative unfamiliarity of this design, a brief functional description follows.

Figure 4-2. CML D-type Latch Design (After Jalali).

b) Functional Description of a CML Latch

Referencing Figure (4-2), the source labeled $I_{bias}$ draws a constant current through the lower (clock-driven) differential pair. Complementary clock signals provide the differential inputs. Depending upon the phase of the clock signal, current is drawn from one of the two cascaded differential pairs, i.e. either the track pair or the latch pair.

Consider the case when the CLK signal is high. Current will be drawn from the "track" pair while the "latch" pair is simultaneously cut off. In this case the latch is considered "open" or "transparent," and the track pair behaves like the differential pair configuration of the inverter/buffer logic gate. Thus, the logic inputs of the track pair are mirrored at the opposite collector. However, there is one exception. In the CML latch, complementary logic inputs are employed rather than a logic reference voltage. For a single logic input, complementary input signals enhance noise immunity and provide for symmetric waveforms at the complementary output ports.

Now, consider when the CLK signal transitions from high to low. The track pair is cut off as current is switched to the latch pair via the right side of the clock-driven differential pair. Herein lies the significance of the common collector nodes shared by the track pair and latch pair. Due to the high impedance nature of the HBT collector-base junction, the voltage level at the collector is slow to change and lingers long enough to bias the latch pair for essentially identical operation and output levels. This effectively latches the logic levels from the track pair to the latch pair.
(Jalali, 1995) Regardless of the state of the latch, the logic levels at the common collector (of the track and latch pairs) are reflected at the latch output ports via the same output buffer configuration presented in Chapter III.

2. Initial Conditions and Design Parameters

The CML latch presents the most demanding DC bias requirements of any circuit designed for this project. As a result, no voltage cap has been placed upon its design. Rather, the initial design goal is to determine the minimum necessary DC bias conditions for proper operation of the latch. The resulting "voltage budget" will define the voltage relationships for proper operation of each transistor and differential pair. It will further establish important specifications for supply voltage and logic signal levels. Derivation of the "voltage budget" is presented as part of the DC analysis in the following section.

The minimum available transistor area (1x1 micron) is employed for optimum switching speeds, and the fanout requirement remains at four. These specifications are consistent with the logic circuits designed in the previous chapter.

3. DC Analysis

a) DC Bias Conditions / The Voltage Budget

For proper operation of the CML latch, each differential pair of transistors must be properly biased. Knowing the requirements imposed by proper DC bias conditions will reveal the following necessary design parameters:

• Required minimum supply voltage
• Required minimum voltage level for representing the positive (high) phase of the clock
• Required minimum voltage level for representing a logic high state
• Maximum allowable signal range between high and low logic levels

To facilitate analysis, the CML latch topology is divided into three levels of operation, as illustrated in Figure (4-3). Level one (the bottom level) is a practical current source.
Implementing the design from Chapter III-E, the current source requires a minimum of $V_{Ibias}$ volts at node X in order to sustain the desired level of bias current.

(4-1)   $V_X \geq V_{Ibias}$

Figure 4-3. Voltage Budget for the CML Latch.

This requirement imposes the following operational condition upon the "driving" base voltage of the $Q_1/Q_2$ differential pair (i.e. the high CLK voltage).

(4-2)   $V_{CLK(hi)} \geq V_X + V_{BE(on)}|_{Q12}$

A further consideration is the proper biasing of the $Q_1/Q_2$ collectors for operation in the active region. This places the following operational condition upon the collector voltages (nodes Y1 and Y2).

(4-3)   $V_Y \geq V_{CLK(hi)} - V_{BE(on)}|_{Q12} + V_{CE(sat)}$

where $V_Y$ represents either $V_{Y1}$ or $V_{Y2}$.

Only the tracking differential pair (connected to node Y1) will be addressed at this point because it is driven by lower voltage levels, which impose more restrictive DC bias conditions on Y1 than Y2. Once again, a minimum voltage requirement at the common emitter of the $Q_3/Q_4$ differential pair presents a constraint on the minimum steady-state driving voltage at each base. This driving voltage corresponds to a logic high input voltage. Thus, the voltage level selected to represent a logic high must satisfy the following relationship.

(4-4)   $V_{LOGIC(hi)} \geq V_{BE(on)}|_{Q34} + V_{Y1}$

Finally, three conditions must be satisfied at the collectors of the track pair. The first condition is that transistors $Q_3$ and $Q_4$ must operate in the active mode. This requires the following familiar relationship.

(4-5)   $V_{C(low)} \geq V_{LOGIC(hi)} - V_{BE(on)}|_{Q34} + V_{CE(sat)}$

where $V_C$ represents either $V_{C1}$ or $V_{C2}$.

Similarly, the second condition requires that the transistors of the latch pair also operate in the active mode. This condition differs from the one above because the latch pair is driven by the collector voltage levels of the track pair.
(4-6)   $V_{C(low)} \geq V_{C(hi)} - V_{BE(on)}|_{Q56} + V_{CE(sat)}$

Defining the voltage range of the logic signal ($V_{range}$) as the difference between high and low voltage levels, Equation (4-6) is manipulated to show the maximum value.

(4-7)   $V_{range} \leq V_{BE(on)}|_{Q56} - V_{CE(sat)}$

Knowing the transistor parameters for $V_{BE(on)}$ and $V_{CE(sat)}$ from Chapter II, $(V_{range})_{max}$ is 0.5 volts.

The third condition is that the input and output logic levels must match. A high logic input at the transistor base must drive the collector voltage relatively low ($V_{C(low)}$) such that it produces a matched low logic output at QN. Likewise, the inverse must also be true. The following equations express these requirements.

(4-8)   $V_{LOGIC(hi)} - V_{range} = V_{C(low)} - V_{BE(on)}|_{buffer}$

(4-9)   $V_{LOGIC(low)} + V_{range} = V_{C(hi)} - V_{BE(on)}|_{buffer}$

Based upon these relationships the maximum collector voltage is determined, which further dictates the minimum required supply voltage for proper DC operating conditions.

The voltage budget relationships are summarized in Figure (4-3). Actual values have been determined for four latch configurations as listed in Table (4-1). The essential difference is the magnitude of the bias current. An economical margin of safety has been built into these values. Notice that these margins have been allowed to vary slightly between configurations in order to maintain uniform values for clock and logic signal levels. This greatly simplifies the comparative testing of the four configurations. The design margins are highlighted to illustrate the negligible deviation incurred. All four configurations meet or exceed the required DC bias conditions. Had uniform design margins been used such that the supply voltages were optimized, the difference would have been trivial: within plus or minus 0.1 volt, or 4% of the 2.5 volt supply voltage.

b) DC Bias Optimization

At this point the gain resistance, buffer resistance, and the bias current are the only undetermined parameters.
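The voltage budget can be worked through numerically for the four bias configurations of Table (4-1). The sketch below is an illustration under stated assumptions: the class and method names are hypothetical, a single $V_{BE(on)}$ value per configuration is used for every differential pair (the table lists only one), and the clock-high and logic-high levels are the 1.2 and 1.7 volt values given in the table. It applies the signal-range limit ($V_{BE(on)} - V_{CE(sat)}$), the collector condition at node Y, and the minimum logic-high condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatchBias:
    """Measured transistor parameters and chosen signal levels (volts)."""
    v_be_on: float
    v_ce_sat: float
    v_clk_hi: float = 1.2     # clock high level from Table (4-1)
    v_logic_hi: float = 1.7   # logic high level from Table (4-1)

    def v_range_max(self):
        """Maximum logic swing: V_BE(on) - V_CE(sat), as in Equation (4-7)."""
        return self.v_be_on - self.v_ce_sat

    def v_y_min(self):
        """Minimum voltage at node Y keeping Q1/Q2 in the active region."""
        return self.v_clk_hi - self.v_be_on + self.v_ce_sat

    def v_logic_hi_min(self):
        """Minimum logic-high level needed to drive the track pair."""
        return self.v_y_min() + self.v_be_on

    def logic_margin(self):
        """Margin between the chosen and the minimum logic-high level."""
        return self.v_logic_hi - self.v_logic_hi_min()


# V_BE(on) and V_CE(sat) per bias current, transcribed from Table (4-1).
CONFIGS = {
    "1mA": LatchBias(0.775, 0.26),
    "1.5mA": LatchBias(0.80, 0.30),
    "2mA": LatchBias(0.82, 0.31),
    "3mA": LatchBias(0.857, 0.35),
}

if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(f"{name:>6}: range_max={cfg.v_range_max():.3f}  "
              f"Vy1={cfg.v_y_min():.3f}  "
              f"Vlogic(hi)min={cfg.v_logic_hi_min():.2f}  "
              f"margin={cfg.logic_margin():.2f}")
```

Every configuration allows at least the 0.5 volt swing adopted for the design, and the chosen 1.7 volt logic-high level exceeds the computed minimum in all four cases.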
The same procedures described in the design of the inverter/buffer circuit are employed to design four different latch configurations based upon the specifications determined in Table (4-1). Noise margins are obtained from the DC transfer characteristic of each. These results are included in Table (4-2). With maximum fanout loads on both output ports, all four CML latch designs meet the requirements of a 0.5 volt output signal range and 0.1 volt (20%) balanced noise margins. Therefore, all four CML designs are considered in transient analysis.

CML Latch Voltage Budget for Multiple Bias Current Configurations

                                          1mA      1.5mA    2mA      3mA
Known/Measured Parameters:
  VBE(on)                                 0.775    0.80     0.82     0.857
  VCE(sat)                                0.26     0.30     0.31     0.35
  VI-bias                                 0.3      0.3      0.3      0.3
Determined Parameters:
  [VRANGE]max                             0.515    0.5      0.51     0.507
  Margin for Range of
    Logic Signal Voltage                  0.015    0.0      0.01     0.007
  [VRANGE]actual                          0.5      0.5      0.5      0.5
  Vcc                                     2.5      2.5      2.5      2.5
  Margin to nearest [...]                 0.075    0.025    0.025    0.0
  VC(hi)                                  2.425    2.475    2.475    2.5
  [VLOGIC(hi)]actual                      1.7      1.7      1.7      1.7
  Margin for Differential
    Logic Signal Switching                0.24     0.2      0.19     0.15
  [VLOGIC(hi)]min                         1.46     1.5      1.51     1.55
  VY1                                     0.685    0.7      0.69     0.693
  VCLK(hi)                                1.2      1.2      1.2      1.2
  VX                                      0.42     0.4      0.39     0.358
  Margin for Differential
    Clock Signal Switching                0.12     0.10     0.09     0.058
  VI-bias                                 0.3      0.3      0.3      0.3
Based upon a 0.5 volt signal swing for both logic and clock signals:
  VLOGIC(low)                             1.2      1.2      1.2      1.2
  VCLK(low)                               0.7      0.7      0.7      0.7

Table 4-1. Voltage Budget for the CML D-type Latch.

Bias Current   Gain Resistor   Buffer Resistor   High Noise Margin     Low Noise Margin      Logic Signal
(mA)           (Ohms)          (Ohms)            No Load / Loaded (V)  No Load / Loaded (V)  Range (V)
1              600             2000              0.14 / 0.13           0.13 / 0.13           0.49
1.5            410             2000              0.13 / 0.13           0.13 / 0.13           0.51
2              310             2000              0.12 / 0.12           0.12 / 0.12           0.51
3              210             2000              0.11 / 0.11           0.11 / 0.11           0.52

Table 4-2. Results of DC Analysis.

4. AC/Transient Analysis

a) Performance Parameters

Three parameters are of primary interest in evaluating the transient performance of a latch: setup time, hold time, and logic propagation delay. Figure (4-4) illustrates how each of these relates to the events on a transient plot. In the absence of a reference voltage, differential signal references are taken as the point where the complementary signals cross.

Figure 4-4. Illustration of setup time, hold time, and propagation delay.

As a figure of merit for optimizing the trade-off between speed and power, a power-delay product is calculated using the values defined here. The figure for power represents the average power, and the figure for delay represents the sum of the setup time and the worst-case propagation delay time.

b) Analysis Procedures

For an accurate evaluation of latch performance, it is necessary to provide realistic logic and clock input signals as well as realistic worst-case fanout loads. Furthermore, to ensure and demonstrate the proper DC bias design of the CML latch, practical current sources are implemented in testing. In addition to the four CML latch designs, the traditional logic latch is also tested. Each design is substituted into the test circuit to determine the performance parameters described in the previous section.

c) Summary of Results

The results of transient analysis are summarized in Table (4-3). The 1.5mA configuration achieves the minimum power-delay product as illustrated in Figure (4-5). Note, however, that the 2mA configuration performs at a [...]

Figure 4-5. Results of Transient Analysis: Normalized Power-Delay Product of Latch Configurations.

[...] indicates a capacitive spike at the mutual collector nodes of the latch and track differential pairs. This results each time the clock-driven pair switches current to the opposite side.
It is not expected that this noise will adversely affect the ability of the CML latch to drive reliable logic levels. However, in the event that the CML latch is overcome by noise, the NOR latch configuration is a viable alternative because it does not experience this problem.

Finally, the switching activity of the differential pair also induces variations in the current drawn from the supply voltage. Figure (4-7) illustrates these power rail transients for a single CML latch. The abrupt, periodic reduction in supply current coincides with the brief transition of current from one side of the differential pair to the other, driven by the switching of the clock signal. In the worst case, this downward transient spike reaches a current level that is 18% below the average. It is also evident that slightly more current is drawn when the latch is latched because the latch pair is driven by a higher input voltage than the track pair. This results in a higher voltage, and thus more current being drawn, at the practical current source.

Figure 4-7. Power Rail transients due to the switching activity of a single CML Latch.

5. Special Latch Implementations

In the course of this design project, two special implementations of the CML latch have been designed. The first implements a logic reference voltage at one of the logic inputs of the latch. The purpose here is to eliminate the requirement for complementary logic signals at the multiplier input. The second special implementation also uses a reference voltage; however, it does so in order to perform a logic function at the input to the latch. Although this circuit functions well, it actually results in slightly greater delays due to the increased collector capacitance at the tracking pair. As a result, it is not utilized in the multiplier circuit.

6. Final Design Summary: D-Latch

The final design for the CML latch is implemented with the parameters listed in Table (4-4) using the topology presented previously in Figure (4-2). Also listed are the transient performance parameters for operation at each level of fanout loading. These figures represent the performance of the latch when it is implemented with a practical current source and driven by a maximally loaded clock driver. In each row, the total delay is the setup time plus the larger of the two propagation delays.

Latch Design and Performance Summary

Rgain:  310 Ω
Rbuf:   2000 Ω
Ibias:  2 mA
NML:    0.12 V
NMH:    0.12 V
Power:  9.0 mW

Fanout Load   Setup Time   Hold Time   tprop H-L   tprop L-H   Max Total Delay
(# gates)     (ps)         (ps)        (ps)        (ps)        (ps)
1             33           9           27          0           60
2             33           10          28          1           61
3             34           10          31          2           65
4             35           10          34          3           69

Table 4-4. Final Design Summary of the D-type CML Latch.

B. FLIP-FLOP DESIGN (D-TYPE)

1. Overview and Analysis

The D-type flip-flop is constructed from two D-type CML latches. The two latches are connected in a master-slave configuration such that they are latched by opposite phases of the clock. This simple design is illustrated in Figure (4-7).

Figure 4-7. D-type Flip-Flop.

The flip-flop design is tested under the same conditions of loading and input signals as discussed previously for the latch. This testing verifies proper function of the flip-flop design and confirms that the flip-flop performance parameters of setup time and hold time mirror those of the CML latch. However, due to the presence of a second latch in the flip-flop, the propagation delays are greater.

2. Final Design Summary

The final design for the CML D-type flip-flop is essentially the master-slave configuration of two CML latches, as illustrated in Figure (4-7). The design parameters of the master and slave latches remain the same as shown in Table (4-4). The applicable performance parameters of the flip-flop have been summarized in Table (4-5).
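The master-slave arrangement can be illustrated with a zero-delay behavioral model. This is a sketch only: the class names are hypothetical, the latch is taken to be transparent while its clock input is high (as in the functional description of the CML latch), and the clock polarity assigned to the master and slave is an assumption, since the text states only that the two latches are latched by opposite phases.

```python
class DLatch:
    """Level-sensitive D latch: transparent while clk is 1, holds while 0."""

    def __init__(self, q=0):
        self.q = q

    def update(self, d, clk):
        if clk:          # open (tracking): output follows the input
            self.q = d
        return self.q    # latched: previous value is held

    @property
    def qn(self):
        """Complementary output."""
        return 1 - self.q


class MasterSlaveFlipFlop:
    """Two D latches clocked on opposite phases.

    Assumed polarity: the master is open while clk is 0 and the slave is
    open while clk is 1, so Q changes on the rising clock edge.
    """

    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, d, clk):
        held = self.master.update(d, 1 - clk)
        return self.slave.update(held, clk)


if __name__ == "__main__":
    ff = MasterSlaveFlipFlop()
    stimulus = [(1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]  # (D, CLK) pairs
    for d, clk in stimulus:
        print(f"D={d} CLK={clk} -> Q={ff.update(d, clk)}")
```

In this model a change at D while the clock is high does not reach Q until the next rising edge, which is the behavior that separates the flip-flop from a single transparent latch.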
Flip-Flop Design and Performance Summary

Design parameters: reference latch design parameters
Power: 18 mW

Fanout Load   Setup Time   Hold Time   tprop H-L   tprop L-H   Max Total Delay
(# gates)     (ps)         (ps)        (ps)        (ps)        (ps)
1             33           9           49          35          82
2             33           9           53          47          86
3             34           9           52          45          86
4             35           10          54          43          89

Table 4-5. Design and Performance Summary of the D-type Flip-Flop.

C. CLOCK DRIVER DESIGN

1. Overview

The topology of the clock driver closely resembles that of the inverter/buffer circuit. In fact, the only necessary modification to the inverter/buffer design is a reduction of the output voltage range at the output buffer. This is accomplished by a simple voltage divider that effectively steps the voltage down to the desired voltage range between 0.7 and 1.2 volts (Figure 4-8). This voltage range is dictated by the CML latch design.

Two performance parameters are of particular interest in the clock driver design: fanout capability and the symmetry of complementary output signals. Increased fanout is desirable to reduce the number of clock drivers required. Meanwhile, output symmetry is important to reduce clock skew between parallel clock paths. The absence of symmetry between the complementary output signals of the logic circuits (in Chapter III) results from the corresponding lack of symmetry between the input signals, i.e. the use of a reference voltage. Therefore, the clock driver is driven by the differential clock signals CLK and CLK-N.

2. Analysis and Results

Fanout capability is maximized by increasing the current through the output buffer. Two further modifications to the inverter/buffer circuit make this possible. The first is to increase the bias current. For a supply voltage of 2.5 volts, a practical current source of 2mA is the largest that is operable without adversely biasing the circuit. Second, reducing the total resistance in the output buffer draws a larger base current, and ultimately more current is available to the output load.
For evaluation, the performance of two clock driver configurations is measured based upon the power consumed per load driven. The 1mA clock driver draws 5.5mA and consumes 13.8mW while driving a maximum of two latches. Meanwhile, the 2mA clock driver draws 6.5mA and consumes 92 16.3mW while driving four latches. Clearly, the 2mA clock driver is the desired implementation. The synchronous switching behavior of the clock driver coupled with its high current consumption warrant an investigation of its power rail transient characteristic (Figure 4-9) . It is not surprising that it follows the same periodic trend as discussed in the case of the CML latch. In the worst-case, the downward transient current spike deviates by 14.6% from the average current level. Also of 6 . 9 :— 6 . 8 :. 6 . 7 : 6 . 6 : 6 . 5 : _ 6 . 4 : s’ 6 <3 : I < 6. It 6 . 0 : 5 . 9 : . 5 . 8 : 5 . 7 : Latched Tracking Input Signal J5. t .x OPENING \ LATCHING 0.0 — r -r- 0.5 \ 1.0 1.5 Time (ns) Figure 4-9. Power Rail transients induced by the switching activity of a single Clock Driver. 93 interest is the noise induced on the clocking signal by strong, simultaneous logic transitions at the latch input. As a result, a clock driver must be capable of driving a maximum fanout load of latches when the every latch input transitions simultaneously in the same direction. 3. Final Design Summary: Clock Driver The final design for the clock driver is implemented with the parameters listed in Table (4-6) using the topology presented previously in Figure (4-8). Clock Driver Design and Performance Summary Rgain- 400 Q R1 but* 110 Q R2but: 450 ft Ibias- 2 mA NM L : 0.08v NM h : 0.10 v Power: 1 6.3 mW Fanout: 4 Latches Table 4-6. Design and Performance Summary of the Clock Driver Circuit. At this point, the set of building blocks is complete. The logic circuits of Chapter III and the clock-driven devices of Chapter IV are brought together in Chapter V to implement several pipelined multiplier configurations. V. 
HBT CML PIPELINED MULTIPLIER DESIGN

A. LOGIC STAGE DESIGN

1. Overview

As introduced in Chapter II-C, the multiplier logic for this project is implemented with the three functional processes illustrated in Figure (5-1): partial product generation, carry-save addition, and carry-completion addition.

[Block diagram labels: Multiplier, Multiplicand, Product.]
Figure 5-1. Generalized Block Diagram of an 8x8 bit Multiplier.

In the case of the 8x8 bit multiplier which is implemented in this chapter, the process of carry-save addition is actually accomplished with successive stages of carry-save adders. More specifically, the use of three-to-two carry-save adders produces the logic implementation illustrated in Figure (5-2). The detailed process of carry-save addition is addressed in the following section; however, this block diagram accurately represents the functional design of the multiplier and establishes a graphic reference for the follow-on discussion.

2. Carry-Save Adders

Each three-to-two carry-save adder takes three operands and produces two outputs, a sum and a carry. However, the carry-save adder implementations are not identical, because the input configuration of the first carry-save adder stage differs slightly from that of the follow-on stages. Referencing Figure (5-3), the first carry-save adder receives three non-aligned n-bit partial products. As a result, it generates n+2 sum bits and n carry bits. Meanwhile, the follow-on stages each receive an aligned input pair comprised of the carry and sum terms generated by the preceding stage. The third input is the next partial product term, and it is shifted by one bit. Thus, the sum is only n+1 bits and the carry is still n bits. In the case of either carry-save adder, only the most significant n bits of the sum term are passed on to the next adder stage.
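The carry-save reduction described above can be sketched at the word level. This is a minimal behavioral model, not the thesis circuit: the function names `csa` and `multiply_8x8` are hypothetical, Python integers stand in for the bit-sliced adders, and an ordinary integer addition stands in for the carry-completion adder. It only illustrates that one first-stage adder followed by five follow-on adders (six stages in all) reduces eight partial products to a sum term and a carry term.

```python
def csa(a, b, c):
    """Three-to-two carry-save adder: returns (sum, carry) with sum + carry == a + b + c."""
    s = a ^ b ^ c                                # bitwise sum without carry propagation
    carry = ((a & b) | (a & c) | (b & c)) << 1   # majority bits, shifted to the next weight
    return s, carry


def multiply_8x8(a, b):
    """Word-level sketch of an 8x8 bit carry-save array multiplier."""
    # Partial product i is the multiplicand ANDed with multiplier bit i, shifted by i.
    partial_products = [(a << i) if (b >> i) & 1 else 0 for i in range(8)]
    # The first stage consumes three partial products.
    s, c = csa(partial_products[0], partial_products[1], partial_products[2])
    # Each follow-on stage consumes the previous sum, the previous carry, and one new partial product.
    for pp in partial_products[3:]:
        s, c = csa(s, c, pp)
    # Carry completion: modeled here with a plain addition (an assumption of this sketch).
    return (s + c) & 0xFFFF


product = multiply_8x8(0xF7, 0xC7)
print(f"{product:016b}")
```

The operands 0xF7 and 0xC7 are the pair quoted later in the chapter for the one-stage critical path; any other 8-bit pair can be substituted.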
The remaining least significant bit(s) represent the next most significant bit(s) of the final product and are passed directly to the multiplier output. These bits are highlighted with a circle in Figure (5-3). The final designs of the two carry-save-adder configurations are provided in Figures (5-4) and (5-5).

[Block diagram labels: Partial Product #1 through Partial Product #8; product outputs P[0] through P[5] and P[15:7].]
Figure 5-2. Logic Implementation of an 8x8 bit Multiplier using six stages of Carry-Save-Adders and a Carry-Completion Adder.

[Diagram label: Carry-Save Adder #1.]
Figure 5-3. Functional Illustration of the two Carry-Save-Adder Implementations.

[Schematic annotations. Input Signals: Multiplier-Bit (current multiplier bit for this stage); BNin[7:0] (8-bit multiplicand, negated). Output Signals: P[0] (next product bit); C[7:0] (8-bit carry term); S[8:1] (8-bit sum term).]
Figure 5-5. Logic Schematic of Carry-Save-Adder #2.

Note the presence of more than simple adder circuits. A fanout limitation of four prevents a single signal from driving the eight inputs that require the current multiplier bit at each carry-save-adder stage. Thus, the arriving multiplier bits pass through an inverting buffer stage. Furthermore, the OR/NOR gates are used to generate the partial product terms within each carry-save-adder stage, rather than at the multiplier input. Taking advantage of the complementary output signals available from the preceding register, the NOR gates perform a logical AND of each multiplicand bit with the appropriate multiplier bit. Local generation of the partial product terms avoids the extensive requirement for intermediate registers that would be necessary to pass all partial product terms from one pipeline stage to the next (that is, referencing a scenario where all partial products are generated before the first carry-save adder).

3.
Carry-Completion Adders

The carry-completion adder implements ripple-carry addition. This elementary design is preferred over carry-look-ahead addition because it facilitates a variety of simple pipeline implementations. Figure (5-6) illustrates the full carry-completion adder, which can be conveniently segmented into as many as eight pipeline stages by separating the successive two- and three-input adders.

Figure 5-6. An 8-bit Ripple-Carry Adder to perform Carry-Completion.

B. REGISTER STAGE DESIGN

Regardless of the number of pipeline stages, each multiplier implementation requires two eight-bit input registers and a sixteen-bit output register. For pipeline implementations with more than one stage, intermediate registers are also required. The size of these registers varies depending upon where the register is inserted in the flow of logic. All intermediate and output registers require complementary input signals. However, the input registers are distinctly designed to accept a single logic input signal for each bit, vice requiring complementary logic input signals. In order to accomplish this, the D-type flip-flops utilized in the input register must employ a special latch implementation which does not require differential input signals for the master latch of the master-slave flip-flop pair. The details of this latch implementation are presented in Chapter IV-A-5.

C. CLOCK DISTRIBUTION

The purpose of the clock distribution scheme is to provide a local clock signal for clock-driven devices, namely the latches that comprise the registers described in the previous section. However, each clock driver can only sustain a maximum load of four latches, i.e., two flip-flops. Therefore, due to the number of clock-driven devices and the limited fanout capability of the clock drivers, the clock signal must propagate through an extensive, multi-level distribution tree.
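How quickly the distribution tree deepens can be estimated with a short sketch. The text above states only that one clock driver can sustain four latches; the assumption that a driver can likewise feed up to four downstream clock drivers, and the helper name `clock_tree_levels`, are hypothetical and introduced here purely for illustration.

```python
import math


def clock_tree_levels(num_flip_flops, fanout=4, latches_per_flip_flop=2):
    """Return the number of drivers at each tree level, leaf level first.

    Assumes (hypothetically) that a clock driver can feed `fanout`
    downstream drivers, just as it can feed `fanout` latches.
    """
    loads = num_flip_flops * latches_per_flip_flop
    levels = []
    while True:
        drivers = math.ceil(loads / fanout)
        levels.append(drivers)
        if drivers == 1:          # a single root driver has been reached
            return levels
        loads = drivers           # the next level up must drive these drivers


# One-stage pipeline: two 8-bit input registers plus a 16-bit output register.
one_stage = clock_tree_levels(8 + 8 + 16)
print(one_stage, "levels:", len(one_stage))
```

Under these assumptions the depth grows roughly with the base-4 logarithm of the latch count, which is why heavier pipelining eventually adds levels to the tree.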
As the number of clock-driven devices increases, the number of levels in this distribution tree must eventually increase as well. Thus, the more heavily pipelined multiplier implementations must make a larger investment of devices and power in clock distribution.

D. MULTIPLIER IMPLEMENTATIONS

Five pipelined multiplier implementations have been designed for testing via Tanner SPICE simulation tools. These implementations include a one-stage pipeline, a two-stage pipeline, a four-stage pipeline, a six-stage pipeline, and a ten-stage pipeline. The arithmetic logic is identical for each; however, the increased number of registers present in the more heavily pipelined implementations also implies a more extensive clock distribution tree. A block diagram of each implementation is presented in the following section.

E. PERFORMANCE EVALUATION

1. Evaluation Procedures

Prior to evaluation of the individual multiplier implementations, the multiplier logic is successfully tested with several operands in order to verify that it produces an accurate product. Following this verification, it is the goal of this performance evaluation to identify the maximum operating clock frequency for each pipeline implementation. However, this can only be done once the critical path, i.e., the critical pipeline stage, is determined for each multiplier.

a) Critical Path Identification

The most direct and absolute means of identifying the critical path is to conduct full-length simulations of each multiplier for every possible combination and sequence of two 8-bit input operands. Conducting these nearly 4.3 billion simulations on each of the five multiplier designs is obviously prohibitive. Thus, the opposite extreme suggests that the worst-case transition delay be assumed for every logic circuit in every stage of the pipeline.
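The figure of nearly 4.3 billion simulations can be reproduced with a quick count. The interpretation below is an assumption: each simulation is taken to be an ordered transition from one pair of 8-bit operands to another, since a switching delay depends on both the previous and the new inputs.

```python
OPERAND_BITS = 8

# Distinct (A, B) operand pairs for two 8-bit inputs.
operand_pairs = 2 ** (2 * OPERAND_BITS)

# Ordered sequences of two operand pairs (previous inputs, then new inputs).
transitions = operand_pairs ** 2

print(f"{operand_pairs:,} operand pairs, {transitions:,} transitions")
```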
While this successfully identifies an upper bound on the delay associated with the critical path, it is likely that no pair of input operands actually produces the upper-bound case. Furthermore, without knowledge of the input operands, simulations cannot be conducted for verification.

Unfortunately, the logic behavior of the carry-save adders makes an intuitive approach extremely difficult. Thus, a computer program designed by Kirk Shawhan, a research associate, has been utilized to identify the worst-case input combinations (Shawhan, 2000). The program effectively identifies a unique upper-bound delay for each set of input operands. Those input combinations with the worst-case upper-bound delays are then simulated to identify a single worst-case pair of operands and the critical stage where the most-delayed transition occurs. While it is not proven that this approach will identify the absolute critical path, it provides a reasonable and timely estimate for the purposes of this research.

b) Maximum Throughput/Clocking Frequency

Having determined the critical path, it is simply a matter of simulation time to identify the maximum clock frequency. For each pipeline implementation, a simulation is conducted which brackets the breakpoint of the multiplier. Furthermore, examination of the margin by which the setup time is met or missed provides a determination of the minimum clock period that is accurate within five picoseconds.

The increased number of devices in the more heavily pipelined designs made full-circuit simulation times extremely long. As a result, the breakpoints for the four-stage, the six-stage, and the ten-stage multipliers were determined from partial simulations. Only the critical stage and those stages immediately before and after it were simulated.

2. Performance Results of Each Implementation

The following ten pages provide a two-page design and performance summary for each of the five pipelined multiplier implementations.
Figure (5-7) illustrates the design and critical path of the one-stage multiplier on a block diagram. Table (5-1) provides a summary of data which quantifies circuit complexity, power consumption, data throughput rate, and data latency of the one-stage pipelined multiplier. Finally, Figure (5-8) illustrates the success and failure of P14, the critical path, at clock frequencies below and above the breakpoint of the circuit.

Similarly, Figures (5-9) through (5-16) and Tables (5-2) through (5-5) provide the same performance results for the two, four, six, and ten-stage pipelined multipliers, respectively. A comparative analysis is conducted as a performance summary in the following section.

As a final note, all full multiplier simulations are conducted using ideal current sources. This decision saves numerous simulation hours without sacrificing valid transient performance data. A close correspondence has been demonstrated between the transient performance of the practical and ideal current sources for both the logic and the latch designs. Use of the ideal source, however, does produce overly optimistic power-consumption data due to the absence of power dissipation from the transistors in the practical current source. Therefore, the simulation data for current consumption is scaled to accurately represent the power consumed in a practical implementation.

[Block diagram annotations: A = 1111 0111, B = 1100 0111. The critical path initiates with the two operands A=F7h, B=C7h and terminates with the LOW-to-HIGH transition of P14. P = 1100 0000 0000 0001.]
Figure 5-7. One-stage pipelined multiplier implementation with an illustration of the critical path.

| STAGE 1 | Number of Transistors | Number of Resistors | Current (Amperes) | Power (Watts) |
|---|---|---|---|---|
| Logic | 3952 | 2352 | 1.28 | 3.20 |
| Registers | 384 | 320 | 0.31 | 0.77 |
| Clock | 126 | 105 | 0.19 | 0.48 |
| TOTAL | 4462 | 2777 | 1.78 | 4.44 |

Maximum Throughput: 1.33 GHz
Latency: 0.75 nanoseconds

Table 5-1. Performance summary for the one-stage pipelined multiplier.
Figure 5-8. Performance bracket of the minimum period for the one-stage pipeline multiplier.

[Block diagram annotations: A = 1111 0111, B = 1100 0111. The critical path initiates with the two operands A=F7h, B=C7h and terminates with the LOW-to-HIGH transition of P14. Blocks shown: 16-Bit Input Register; Carry Save Adder #1 (1); Carry Save Adder #2 (2) through (6); 23-Bit Intermediate Register.]
Figure 5-9. Two-stage pipelined multiplier implementation with an illustration of the critical path.

| | Number of Transistors | Number of Resistors | Current (Amperes) | Power (Watts) |
|---|---|---|---|---|
| Logic | 3952 | 2352 | 1.28 | 3.20 |
| Registers | 660 | 550 | 0.52 | 1.31 |
| Clock | 228 | 190 | 0.36 | 0.90 |
| TOTAL | 4840 | 3092 | 2.17 | 5.41 |

Maximum Throughput: 2.0 GHz
Latency: 1.0 nanoseconds

Table 5-2. Performance summary for the two-stage pipelined multiplier.

Figure 5-10. Performance bracket of the minimum period for the two-stage pipeline multiplier.

[Block diagram annotations: A = 1111 1111, B = 1000 0001. The critical path initiates with the two operands A=FFh, B=81h and terminates with the LOW-to-HIGH transition of P15. Blocks shown: 16-Bit Input Register; Carry Save Adder #1 (1); Carry Save Adder #2 (2) and (3); 31-Bit Intermediate Register; Carry Save Adder #2 (4) through (6); 23-Bit Intermediate Register; Carry Completion Adder (4 Bits); 20-Bit Intermediate Register; Carry Completion Adder; 16-Bit Output Register. P = 1000 0000 0111 1111.]
Figure 5-11. Four-stage pipelined multiplier implementation with an illustration of the critical path.

| | Number of Transistors | Number of Resistors | Current (Amperes) | Power (Watts) |
|---|---|---|---|---|
| Logic | 3952 | 2352 | 1.28 | 3.20 |
| Registers | 1272 | 1060 | 1.01 | 2.52 |
| Clock | 438 | 365 | 0.68 | 1.71 |
| TOTAL | 5662 | 3777 | 2.97 | 7.43 |

Maximum Throughput: 3.45 GHz
Latency: 1.16 nanoseconds

Table 5-3. Performance summary for the four-stage pipelined multiplier.

Figure 5-12.
Performance bracket of the minimum period for the four-stage pipeline multiplier.

[Block diagram annotations: A = 1111 1001, B = 0010 0001. The critical path initiates with the two operands A=F9h, B=21h and is limited by the LOW-to-HIGH transition of the carry bit out of Stage 5. Blocks shown: 16-Bit Input Register; Carry Save Adder #1 (1); Carry Save Adder #2 (2); 31-Bit Intermediate Register; Carry Save Adder #2 (3) and (4); 31-Bit Intermediate Register; Carry Save Adder #2 (5) and (6); 23-Bit Intermediate Register; Carry Completion Adder (3 Bits); 21-Bit Intermediate Register; Carry Completion Adder (3 Bits); 18-Bit Intermediate Register; Carry Completion Adder (2 Bits); 16-Bit Output Register. P = 0010 0000 0001 1001.]
Figure 5-13. Six-stage pipelined multiplier implementation with an illustration of the critical path.

| | Number of Transistors | Number of Resistors | Current (Amperes) | Power (Watts) |
|---|---|---|---|---|
| Logic | 3952 | 2352 | 1.28 | 3.20 |
| Registers | 1872 | 1560 | 1.49 | 3.72 |
| Clock | 648 | 540 | 1.03 | 2.57 |
| TOTAL | 6472 | 4452 | 3.80 | 9.49 |

Maximum Throughput: 4.35 GHz
Latency: 1.38 nanoseconds

[Block diagram annotations: A = 1111 1001, B = 0010 0001. The critical path initiates with the two operands A=F9h, B=21h and is limited by the LOW-to-HIGH transition of the carry bit out of Stage 9. Blocks shown: 16-Bit Input Register; Carry Save Adder #1 (1); 31-Bit Intermediate Register; Carry Save Adder #2 (2); 31-Bit Intermediate Register; Carry Save Adder #2 (3); 31-Bit Intermediate Register; Carry Save Adder #2 (4); 31-Bit Intermediate Register; Carry Save Adder #2 (5); 31-Bit Intermediate Register; Carry Save Adder #2 (6); 23-Bit Intermediate Register; Carry Completion Adder (2 Bits); 22-Bit Intermediate Register; Carry Completion Adder (2 Bits); 20-Bit Intermediate Register; Carry Completion Adder (2 Bits); 18-Bit Intermediate Register; Carry Completion Adder (2 Bits); 16-Bit Output Register. P = 0010 0000 0001 1001.]
Figure 5-15. Ten-stage pipelined multiplier implementation with an illustration of the critical path.
| | Number of Transistors | Number of Resistors | Current (Amperes) | Power (Watts) |
|---|---|---|---|---|
| Logic | 3912 | 2320 | 1.28 | 3.20 |
| Registers | 3240 | 2700 | 2.57 | 6.44 |
| Clock | 1116 | 930 | 1.74 | 4.36 |
| TOTAL | 8268 | 5950 | 5.60 | 13.99 |

Maximum Throughput: 5.56 GHz
Latency: 1.80 nanoseconds

Table 5-5. Performance summary for the ten-stage pipelined multiplier.

[Waveform annotations: T = 180 ps, critical path transition SUCCEEDS; T = 170 ps, critical path transition FAILS.]
Figure 5-16. Performance bracket of the minimum period for the ten-stage pipeline multiplier.

3. Comparative Analysis

A summary of the performance results for each of the five pipelined multiplier implementations is presented in Table (5-6). A comparative analysis of these results quantifies and confirms the major trade-offs of pipelining as they were addressed in Chapter II-B. Figure (5-17) illustrates the increase in data throughput as compared to the increase in product latency. However, latency is generally an acceptable trade-off relative to the primary cost drivers of device count and power consumption.

| | 1 STAGE | 2 STAGE | 4 STAGE | 6 STAGE | 10 STAGE |
|---|---|---|---|---|---|
| Device Count | 7239 | 7932 | 9439 | 10924 | 14218 |
| Power (Watts) | 4.44 | 5.41 | 7.43 | 9.49 | 13.99 |
| Latency (ns) | 0.75 | 1.00 | 1.20 | 1.38 | 1.80 |
| Maximum Throughput (GHz) | 1.33 | 2.00 | 3.33 | 4.35 | 5.56 |
| Speed-Power Ratio (GHz/Watt) | 0.300 | 0.370 | 0.449 | 0.458 | 0.397 |
| Normalized Speed-Power Ratio | 0.66 | 0.81 | 0.98 | 1.00 | 0.87 |

Table 5-6. Comparative Summary of Performance.

[Plot. Horizontal axis: Number of Pipeline Stages.]
Figure 5-17. Throughput and Latency as a function of the number of pipeline stages.

Device count and power consumption are quantified in Figures (5-18) and (5-19), respectively. As the number of pipeline stages increases, the cost rises sharply, driven by the need for intermediate registers and an extensive clock distribution network. In the one-stage pipeline, the registers and clock tree represent only 13% of the total device count and consume 28% of the total power.
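The derived rows of Table (5-6) and the overhead percentages quoted here follow from simple ratios, which the sketch below recomputes. The input numbers are copied from Tables (5-1) and (5-6); the function and variable names are hypothetical, and the result is only as precise as the rounded table entries.

```python
# Stages -> (maximum throughput in GHz, total power in watts), copied from Table 5-6.
TABLE_5_6 = {
    1: (1.33, 4.44),
    2: (2.00, 5.41),
    4: (3.33, 7.43),
    6: (4.35, 9.49),
    10: (5.56, 13.99),
}


def speed_power_ratios(table):
    """Throughput divided by power (GHz per watt) for each implementation."""
    return {stages: ghz / watts for stages, (ghz, watts) in table.items()}


def normalized(ratios):
    """Scale every ratio by the best one, so the best implementation reads 1.0."""
    best = max(ratios.values())
    return {stages: value / best for stages, value in ratios.items()}


def overhead_share(logic, registers, clock):
    """Fraction of a total that is spent on registers and clock distribution."""
    return (registers + clock) / (logic + registers + clock)


ratios = speed_power_ratios(TABLE_5_6)
relative = normalized(ratios)
best_stage = max(ratios, key=ratios.get)

# One-stage pipeline, from Table 5-1: device counts are transistors plus resistors.
one_stage_device_share = overhead_share(3952 + 2352, 384 + 320, 126 + 105)
one_stage_power_share = overhead_share(3.20, 0.77, 0.48)

for stages in TABLE_5_6:
    print(stages, round(ratios[stages], 3), round(relative[stages], 2))
print("best tabulated implementation:", best_stage, "stages")
print(round(one_stage_device_share, 2), round(one_stage_power_share, 2))
```

Because only five pipeline depths were simulated, `best_stage` identifies the best tabulated point rather than a true optimum between the sampled depths.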
On the other end of the spectrum, registers and clock distribution in the ten-stage pipeline represent 56% of the total device count and consume 77% of the total power.

[Stacked bar chart. Vertical axis: Number of Devices. Horizontal axis: Number of Pipeline Stages.]
Figure 5-18. Distribution of the Device Count.

[Stacked bar chart. Vertical axis: Watts. Horizontal axis: Number of Pipeline Stages. Legend: CLOCK, REGISTER, LOGIC.]
Figure 5-19. Distribution of Power Consumption.

Somewhere between these two extremes there exists an optimum pipelined implementation. Dividing the maximum throughput of each configuration by the total power that it consumes, a figure of merit is calculated which is referred to here as a speed-power ratio (for consistency with optimization procedures in previous chapters). Figure (5-20) plots the speed-power ratio as a function of the number of pipeline stages. The maximum point on the curve indicates that the optimal pipelined multiplier implementation employs five or six stages.

Figure 5-20. Comparison of Speed-Power Ratio.

Thus, having concluded an evaluation of the various pipelined multiplier implementations, it remains to consider the impact that clock skew has upon these high-speed circuits. Chapter VI undertakes this discussion in the pages that follow.

VI. ANALYSIS OF CLOCK SKEW

A. QUANTIFYING CLOCK SKEW

Clock skew appears naturally in practical circuits due to a variety of physical factors as described in Chapter II-A. However, in a typical SPICE simulation, transmission delays are not inherent to the process and circuit elements are evaluated under ideal, homogeneous operating conditions. The effective result is the near elimination of clock skew from the simulation environment. Clock skew could be introduced artificially; however, introducing a known amount of clock skew would have very predictable results, such that its effect can be determined without simulation.
Thus, based upon the results of Chapter V, a simple numerical analysis is conducted in this chapter which provides an illustration of how clock skew impacts pipelined architectures and serves as a set of reference data from which follow-on research into alternative control techniques can measure performance.

B. ANALYSIS PROCEDURES

Based upon the definition of skew from Chapter II-A, let $S_{DEVICE}$ represent the maximum delay between two clock signals after propagation through a single level of clock drivers. As illustrated in Figure (6-1), the effect of $S_{DEVICE}$ on the clock signal as it propagates through the clock distribution tree is that the clock signal potentially accumulates $S_{DEVICE}$ picoseconds of skew at each level. Furthermore, any loading differences at the final level of the clock distribution will introduce another skew term, $S_{LOAD}$. Thus, the simplified expression to be used for analyzing and calculating skew is given in Equation (6-1).

$$S_{TOTAL} = n \times S_{DEVICE} + S_{LOAD} \qquad (6\text{-}1)$$

where n = maximum number of levels in the clock distribution scheme.

Figure 6-1. Illustration of Clock Skew as it results from propagation path delays and loading.

An expression for n is derived in Equation (6-2), based upon the pipeline implementations from Chapter V. For synchronous logic, the timing inequality from Chapter II-A is repeated as Equation (6-3). This relationship requires that the minimum clock period be expanded to account for the increase in skew.

$$T \geq T_{min} = T_{skew} + T_{logic} + T_{Flip\text{-}Flop} \qquad (6\text{-}3)$$

The procedure for analysis of clock skew is simply to apply a range of values for $S_{DEVICE}$ to the clock distribution schemes from Chapter V, using Equation (6-2). Based upon simulation results, the worst-case value for $S_{LOAD}$ is determined to be 6.5 picoseconds. Thus, it is possible to calculate a worst-case skew value for each incremental value of $S_{DEVICE}$ as it applies to the clock distribution scheme of each multiplier implementation.
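The skew bookkeeping of Equations (6-1) and (6-3) can be sketched numerically. Several inputs below are assumptions rather than thesis data: the number of clock tree levels is a placeholder, because the expression for n in Equation (6-2) is not reproduced in this text, and the skew-free minimum period is taken to stand for the logic and flip-flop terms of Equation (6-3) combined. The function names are hypothetical.

```python
S_LOAD_PS = 6.5  # worst-case loading skew in picoseconds, as stated in the text


def total_skew_ps(levels, s_device_ps, s_load_ps=S_LOAD_PS):
    """Equation (6-1): skew accumulated over `levels` driver levels plus the loading term."""
    return levels * s_device_ps + s_load_ps


def throughput_with_skew_ghz(skew_free_period_ps, levels, s_device_ps):
    """Equation (6-3): widen the minimum period by the total skew, then invert it."""
    period_ps = skew_free_period_ps + total_skew_ps(levels, s_device_ps)
    return 1000.0 / period_ps  # a period in ps corresponds to 1000 / period GHz


# Hypothetical example: a 180 ps skew-free period and an ASSUMED five-level clock tree.
ASSUMED_LEVELS = 5
for s_device in (2.0, 4.5, 10.0, 20.0):
    skew = total_skew_ps(ASSUMED_LEVELS, s_device)
    rate = throughput_with_skew_ghz(180.0, ASSUMED_LEVELS, s_device)
    print(f"S_DEVICE = {s_device:4.1f} ps  skew = {skew:5.1f} ps  throughput = {rate:.2f} GHz")
```

Because the skew term is added to every period regardless of how short the logic delay is, the same skew removes a larger fraction of the throughput from designs that start with a shorter period.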
Applying the worst-case skew values to Equation (6-3), a new minimum period is determined for each multiplier implementation. This is repeated for values of $S_{DEVICE}$ ranging from two to twenty picoseconds. A comparative analysis of the results should confirm the expectation of an increasingly negative impact on the more heavily pipelined architectures.

Finally, within the stated range of $S_{DEVICE}$ values, a reasonable figure for $S_{DEVICE}$ is determined as it might actually occur due to device non-idealities in the fabrication process. The approximation of device-induced skew ($S_{DEVICE}$) is defined as 20% of the worst-case propagation delay for the clock driver circuit and is determined to be 4.5 picoseconds. This set of data is referenced in the figures that follow as "typical skew".

C. RESULTS

Figure (6-2) provides a plot of the results. The values for skew which are referenced in the figures represent the values for $S_{DEVICE}$. The data clearly confirms that the multipliers whose throughput rates are obtained from higher clock rates experience the most drastic performance reductions in the presence of clock skew. Furthermore, when throughput is weighed against the cost of power consumption, a set of new speed-power ratio curves is obtained, as shown in Figure (6-3). Thus, the contemporary appeal of synchronous pipelined architectures demonstrates a severe backlash at high clock rates.

[Plot. Vertical axis: Throughput (GHz). Horizontal axis: Number of Pipeline Stages. Legend: No Skew, 2ps Skew, 5ps Skew, 10ps Skew, 20ps Skew, Typical Skew.]
Figure 6-2. Effect of Skew on Pipeline Throughput Rates.

[Plot. Vertical axis: Speed-Power Ratio (GHz/W). Horizontal axis: Number of Pipeline Stages.]
Figure 6-3. Effect of Skew on Pipeline Efficiency.
[Figure 6-3 legend: No Skew, Skew=2ps, Skew=5ps, Skew=10ps, Skew=20ps, Typical Skew.]

VII. CONCLUSIONS

The fundamentals of circuit analysis and the principles of junction transistor behavior have been applied to design an optimal family of current-mode logic devices from InP HBT SPICE transistor models. From these building blocks of digital logic, an array multiplier has been constructed and pipelined into five distinct implementations. Each multiplier implementation has been simulated extensively via Tanner SPICE in order to identify the respective performance characteristics of power consumption and maximum operating frequency.

A comparative analysis of multiplier performance has effectively demonstrated the trade-offs of pipelining with predictable yet interesting results. The cost of increasing throughput by increasing the number of pipeline stages has been quantified in terms of device count and power consumption. By maximizing data throughput at the most efficient cost in terms of power, the optimal 8x8 bit synchronous pipelined multiplier design has been determined to be the six-stage implementation, as shown in Figure (5-20).

Finally, in the presence of clock skew, it has been demonstrated that the efficiency of synchronous pipelined architectures operating at high clock rates is significantly reduced. Thus, as device switching frequencies continue to pave the way to faster logic circuits, the rate of data throughput will be left behind unless the synchronous logic design constraint of clock skew can be overcome.

The impact of clock skew has been quantified and summarized such that it provides a reference point for further research into alternative clocking/control techniques. Specifically, it is intended that future research use the CML HBT logic family designed in this thesis in order to implement the same array multiplier circuit using asynchronous control techniques. One such endeavor is already in progress as LtCol.
Kirk Shawhan, USMC, investigates the use of local completion signals which employ request/acknowledge handshake signals to control the flow of data vice the use of a global clock signal (Shawhan, 2000). Perhaps in time such asynchronous schemes will mature into a design methodology that overcomes the obstacle of clock skew which now threatens to limit synchronous design methodology.

LIST OF REFERENCES

Foley, J. B.; Bannister, J. A. R., "Analysing ECL's Noise Margins," IEEE Circuits and Devices, Volume 10, 1994, pp. 32-37.

Harris, D., "Timing Analysis Including Clock Skew," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 18, 1999, pp. 1608-1618.

Jalali, B.; Pearton, S. J., InP HBTs: Growth, Processing, and Applications, Artech House, Inc., Massachusetts, 1995.

Loomis, Herschel H. Jr., "Class Notes," EC 4830 Spring Quarter at the Naval Postgraduate School, 2000.

Moore, G., "Moore's Law Extended: The Return of Cleverness," Solid State Technology, Volume 40, 1997, pp. 359-364.

Pierret, Robert F., Semiconductor Device Fundamentals, Addison-Wesley Publishing Company, Massachusetts, 1996.

Pollard, L. Howard, Computer Design and Architecture, Prentice-Hall, New Jersey, 1990.

Richards, R. K., Electronic Digital Components and Circuits, D. Van Nostrand Company, New Jersey, 1967.

Sedra, Adel S.; Smith, Kenneth C., Microelectronic Circuits, Oxford University Press, New York, 1998.

Shawhan, Kirk A., Design and Analysis of an Asynchronous Pipelined Multiplier with Comparison to Synchronous Implementation, Master's Thesis, Naval Postgraduate School, Monterey, CA, Dec 2000.

Sutherland, Ivan E., "Micropipelines," Communications of the ACM, Volume 32, 1989, pp. 720-738.

Wakerly, John F., Digital Design Principles and Practices, Prentice Hall, New Jersey, 2000.

Weste, Neil H. E.; Eshraghian, Kamran, Principles of CMOS VLSI Design: A Systems Perspective, Addison Wesley Longman, Inc., 1993.
INITIAL DISTRIBUTION LIST

1. Defense Technical Information Center ... 2
   8725 John J. Kingman Road, Ste 0944, Fort Belvoir, VA 22060-6218
2. Dudley Knox Library ... 2
   Naval Postgraduate School, 411 Dyer Road, Monterey, California 93943-5101
3. Director, Training and Education ... 1
   MCCDC, Code C46, 1019 Elliot Rd., Quantico, Virginia 22134-5027
4. Director, Marine Corps Research Center ... 2
   MCCDC, Code C40RC, 2040 Broadway Street, Quantico, Virginia 22134-5107
5. Marine Corps Tactical System Support Activity ... 1
   Technical Advisory Branch, Attn: Librarian, Box 555171, Camp Pendleton, CA 92055-5080
6. Marine Corps Representative ... 1
   Naval Postgraduate School, Code 037, Bldg. 330, Ingersoll Hall, Room 116, 555 Dyer Road, Monterey, CA 93943
7. Engineering and Technology Curricular Office, Code 34 ... 1
   Naval Postgraduate School, Monterey, California 93943-5109
8. Chairman, Code EC ... 1
   Department of Electrical and Computer Engineering, Naval Postgraduate School, Monterey, California 93943-5121
9. Professor Douglas Fouts, Code EC/FS ... 1
   Department of Electrical and Computer Engineering, Naval Postgraduate School, Monterey, California 93943-5121
10. Professor Herschel Loomis, Code EC/LM ... 1
    Department of Electrical and Computer Engineering, Naval Postgraduate School, Monterey, California 93943-5121
11. LtCol. Kirk Shawhan (USMC) ... 1
    P.O. Box 749, Quantico, VA 22134-0749
12. Maj. John R. Calvert, Jr. (USMC) ... 4
    1422 Woodway Drive, Ooltewah, TN 37363