# Full text of "Fundamental distributed algorithms for management of computer networks with changing topologies"

## See other formats

FUNDAMENTAL DISTRmUTED ALGORITHMS FOR MANAGEMENT OF COMPUTER NETWORKS WITH CHANGING TOPOLOGIES BY KENNETH CHIH-KEN LUO A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 1991 ACKNOWLEDGEMENTS I would like to take this opponunity to express my sincere gratitude to my dissertation advisor. Dr. Yuan-Chieh Chow, for his guidance and encouragement of this research, which was supponed under the SDI/IST program, administered through U.S. Navy and U.S. Army. I also want to thank Dr. Richard Newman-Wolfe for numerous fruitful discussions which eventually led to some significant results of this research. Other dissertation committee members also contributed to the work: Dr. Rick Smith made helpful suggestions on my research; 1 learned so much from his excellent course on analysis of algorithms. Dr. Yann-Hang Lee provided insight on some topics of the research. I am very grateful to both of them. Thanks are also due to Dr. Mark Yang of the statistics department for graciously consenting to be on my com- mittee. Finally, I would like to thank my wife, Sze-Mei, who has patiently watched this endeavor and given me the greatest encouragement. ii TABLE OF CONTENTS ACKNOWLEDGEMENTS ii ABSTRACT v CHAPTERS I INTRODUCTION 1 LI Scope and Objectives 1 L2 Mobile Network Reconfiguration 2 1.3 Reliable Message Broadcast. 2 1.4 Global Network Survey 3 n MOBILE NETWORK RECONHGURATION 5 2.1 Background 5 2.2 A Satellite Network Model 7 2.2.1 Assumptions 7 2.2.2 Partitioning Space into Regions 8 2.2.3 Reconfiguring Networks 9 2.2.4 Scheduling Network Reconfiguration 10 2.3 The Hierarchical Approach for Reconfiguration 11 2.3.1 Topologies for tiie Primitive Case 12 2.3.2 Topologies for the Recursive Case 13 2.3.3 Topologies for Inter-Region Links 15 2.3.4 The Link Assignment Algorithm and Its Analysis 16 2.3.5 Applying the Algorithm 18 2.4 Proofs on Network Connectivity 19 2.5 Failiu-es and Recoveries 27 2.5.1 Link Failures 28 2.5.2 Node Failures 28 m RELL\BLE MESSAGE BROADCAST 29 3.1 Background 29 3.2 Network Model 31 3.2.1 Preliminary Definitions 31 3.2.2 Assumptions 32 3.3 The Two-Phase Broadcast Protocol 33 3.3. 1 Description of the Protocol 33 3.3.1.1 Broadcast phase 34 iii 3.3.1.2 Reply phase 35 3.3.2 Message Buffer Management 45 3.3.3 The Node Algorithm 46 3.4 Performance 50 3.4.1 Minimum Broadcast Delay 50 3.4.2 Communication Cost 51 3.4.3 Speedup in Reply Phase 54 IV GLOBAL NETWORK SURVEY 57 4.1 Background 57 4.2 The Model and Motivation 58 4.3 Lower Bounds 60 4.4 Optimizing Message Complexity 68 4.4.1 The Message Structure 70 4.4.2 Alternative Path Search 71 4.4.2.1 Low-tree algorithm 71 4.4.2.2 High-tree algorithm 76 4.4.3 Validation of the Protocol 78 4.4.4 Computational Complexity 81 V CONCLUSION 84 5.1 Major Results 84 5.2 Future Research Directions 86 REFERENCES 87 BIOGRAPfflCAL SKETCH 90 iv Abstract of Dissenation Presented to the Graduate School of the University of Florida in Panial Fulfillment of the Requirements for the Degree of Doctor of Philosophy FUNDAMENT.\L DISTRIBUTED ALGORITHMS FOR MANAGEMENT OF COMPUTER NETWORKS WITH CHANGING TOPOLOGIES By KENNETH CHIH-KEN LUO December 1991 Chairman: Dr. Yuan-Chieh Chow Major Depanment: Computer and Informauon Sciences Distributed computing systems hosted in a network with changing topologies often encounter management problems which require the application of various control algorithms. Typical networks with such property include packet-radio networks, point-to-point satellite communication networks and networks with a high degree of link failures. In such networks, due to the instability of network topology, maintaining a reliable distributed computing environment is a complicated issue. There are a number of management problems inherent in these network systems. This research focuses on issues that are considered to be more fundamental. In the first pan of this research, we consider a network reconfiguration problem in network topology designs. The problem can be briefly described as follows: given a number of nodes in the network, each node has a fixed number of communication pons and is mobile. How can an algorithm be designed that reconfigures the network periodically such that the network always has maximum connectivity? We present a partitioning method to achieve this goal. The algorithm performs better than the best of the previous results in terms of time complexity. The second issue addressed is a message broadcast problem. We consider a gen- eral model of a network with changing topologies, the "eventually connected" network model, i.e., networks that may be undergoing arbitrary topological changes but are not permanently disconnected. We focus on the issue of message broadcasts on such net- works. A message broadcast protocol is a distributed algorithm that allows a message to be sent to every node in the network via the communication links connecting the nodes. It has been shown that in this kind of network, no conventional broadcast pro- tocol is reliable. It has also been shown that a reliable broadcast protocol can be designed with unbounded message buffers. We improve this result by presenting an efficient reliable broadcast protocol using only bounded message buffers. In the third part of this research, we consider a closely related problem of the message broadcast, the so-called global survey problem. That is, given a dynamic net- work which is subject to unpredictable link failures but remains connected and in which each node is unaware of the network topology, can a node broadcast a message to every node and then get a reply from every one? We provide a message-complexity lower bound as well as a time-complexity lower bound. An optimal protocol is also given. In summary, this dissertation is devoted to a detailed and in-depth study of some fundamental issues involved in networks with changing topologies. Significant results have been produced that may have profound impacts on research areas such as net- work topology designs, network protocol designs, distributed databases and the theory of distributed computings. vi CHAPTER I INTRODUCTION 1.1 Scope and Objectives A distributed system hosted in a network with changing topologies often encounters management problems which require the application of various control algorithms. Typical networks with such properties include packet-radio networks and point-to-point satellite communication networks which are among the main research areas of the Strategic Defense Initiative (SDI). In these networks, due to the instabil- ity of network topology, it is much more complicated to maintain a network system so as to provide a reliable distributed computing environment. There are a number of management problems inherent in these network systems. This research focus on several of them which are considered to be fundamental: network reconfiguration, message broadcast, and network surveys. The solutions for these issues rely heavily on the design of effective and efficient distributed algorithms. Unlike those centralized algorithms that are executed as a single program by a single processor or by a number of processors but one at a time, a distributed algorithm is usually implemented as a node algorithm, i.e., each node in a network is installed with one copy of the algorithm. The algorithm works in an event-driven fashion; each node responds to a particular event by running a par- ticular procedure. If no event occurs, the algorithm is idle. An important feature of distributed algorithms is that the communication between nodes running the same algorithm is done by exchanging messages. There- fore, in addition to being correct and efficient, a distributed algorithm often needs to be designed in such a way that the total number of messages exchanged in a network 1 2 is minimized, which further complicates the task of designing a good distributed algorithm. Some of the problems considered in this proposal require substantial modifications and extensions of the existing distributed algorithms which are usually intended for networks with fixed topologies; others are new problems for which no algorithm is available, let alone an efficient one. These problems are briefly discussed as follows. 1.2 Mobile Network Reconfiguration A network with changing topologies needs to reconfigure its topology for the following two reasons: 1) Communication links may be broken due to the limited broadcasting power of each node, as shown in packet-radio networks, or due to the blocking of transmission by the Eanh, as in the case of satellite networks in which a communication link between a pair of nodes is established by targeting laser trans- ceivers toward each other. 2) The routing scheme of the network becomes inefficient. As topology keeps changing, so does the propagation delay of each link. Eventually, message routing becomes inefficient because previously computed routes are no longer optimal. One of the most widely adopted criteria for measuring the performance of a reconfiguration algorithm is the connectivity of the network obtained. In this research, we present a new algorithm for this problem. The algorithm is based on network par- titioning. Our results show that the algorithm yields optimal connectivity and is supe- rior to existing algorithms in terms of time complexity [22]. 1.3 Reliable Message Broadcast For various reasons, a node in a network may need to broadcast messages to the other nodes. In networks with fixed topologies, messages are usually delivered through some broadcasting tree for the sake of reducing the number of messages exchanged. In unreliable networks, the task is fairly complicated. 3 Some sophisticated broadcast techniques have been developed to achieve reli- able broadcast in networks with slowly changing topologies. This research focuses on developing a reliable and efficient broadcast protocol in networks where nodes are mobile with varying communication delay across each link, and where links may fail or , recover in an unpredictable fashion. Links may fail or recover for two reasons: (1) due to an unplanned occurrence, i.e., a link breakdown, or (2) as part of a net- work reconfiguration scheme to update topology. In this problem, a protocol is measured by three standard crlieria: reliability, message delay, and communication cost. We present a reliable protocol which is simple, efficient, and robust. This protocol is a distributed algorithm which responds to the arrival of a message by performing special functions. Our protocol can achieve a minimum message delay with low communication cost [23]. In addition, we address the problem of message buffer utilization, which has never been successfully addressed in unreliable networks. 1.4 Global Network Survey The network survey problem considered in the research can be briefly defined as follows. There exists a distinguished node in the network and it wants to glean some global information from other nodes in the network. The task of the node is to send a short message to every node in the network and get a response from each node in finite time. We call this a "network survey" which is conducted in an unreliable environment; in a network with changing topologies. The network environment discussed in the topics is slighdy different from the problem of network broadcast, in that the network, despite link failures, remains con- nected throughout the whole survey process. The significance of the problem is its close relationship with other fundamental problems such as network votings and dis- tributed consensus. 4 Significant results have been produced on this problem. We have shown some lower bounds on the time and message complexities of this problem. Furthermore, a message-optimal algorithm has been developed during this research [24]. CHAPTER II MOBILE NETWORK RECONFIGURATION 2.1 Background The problem of network reconfiguration is considered a major issue in the back- bone design (or topological design) of networks with rapidly changing topologies. For satellite networks connected through the use of laser transceivers, this problem is very critical for overall network performance. However, it is a complicated task to find an efficient and optimal algorithm for reconfiguration due to the following inher- ited properties: 1) variable propagation delay due to the changing lengths of the links, 2) visibility constraints due to the blocking of the Earth, i.e., not every pair of satel- lites can see each other, and 3) periodic link reconfiguration as the line of sight between two connected satellites may be blocked by the Earth over time. Since net- work reconfiguration in the laser-based point-to-point satellite networks involves link reassignment, we will use the two terms link assignment and reconfiguration inter- changeably throughout this thesis. A major issue in network reconfiguration is that of maximizing connectivity of a network. This problem has been studied extensively in point-to-point store-and- forward networks [6,19,21,34]. It can also be transformed into that of finding the maximum number of node-disjoint paths between every pair of nodes in a graph [13,21,31]. In particular, Schumacher [31] has given an 0{n}) algorithm (where n is the number of nodes in the graph) for constructing /: -regular graphs with maximum connectivity. The algorithm presented by Schumacher is the fastest algorithm as far as the time complexity is concerned. However, Schumatcher's approach is a pure - graph algorithm in which every pair of nodes is connectable. Thus the algorithm is 5 6 inapplicable (at least directly) to satellite network reconfigurations. The visibility con- straint on the satellite computer networks makes it more difficult to reconfigure a net- work with respect to maximizing network connectivity because each node can see only a subset of the nodes. Relatively few works have been done on maximizing net- work connectivities for point-to-point satellite networks. McLochlin [25] and Ward [36] proposed methods to handle a special case of this link reconfiguration problem by imposing strict constraints on satellite orbits so that satellites circulate on several equally spaced planes and within each plane satellites are approximately equidistant from their neighbors. Another issue in point-to-point satellite networks is to determine how and when reconfiguration is invoked. Research on other mobile networks also deals with the problem of network reconfiguring such as those for packet-radio networks [17,18,30]. Most of these works adopt the failure-driven reconfiguring approach, i.e., as long as the node or link is properly functioning, no reconfiguration is needed [7]. One cri- terion often adopted in packet-radio network reconfigurations is "minimum distur- bance" [18] in the sense that reconfiguration normally requires updating or swapping links locally and thus the unaffected links should be preserved. In point-to-point satellite networks, this "minimum disturbance" is very difficult to obtain because most of the existing links will be blocked by the Earth sooner or later. Considering the periodic orbiting of satellites, the link reconfiguration needs to be carried out glo- bally for there is no way to restrict the reconfiguration to a local level. Instead, we choose to do over-all network reconfiguration on a periodic basis to ensure that a net- work is highly connected all the time except at the moments of reconfiguration. 7 2.2 A Satellite Network Model 2.2.1 Assumptions For systems as complex as satellite networks, it is necessary to make some basic assumptions so that solutions are feasible. Here we assume the following things hold: (1) Global Knowledge - It is assumed that information about exact locations, direc- tions of motion and other related data about the satellites can al ,vays be derived through mathematical computations based on orbital mechanics or alternatively from broadcasting of ground-based tracking stations. (2) Fixed Number of Links at Each Satellite - Ward [36] has shown that it is most cost-effective if each satellite is installed with 4 transceivers. In our model we also assume that each satellite has exactly 4 transceivers. So, except in some unusual circumstances, the network can be regarded as a 4-regular graph, i.e., each node has exactly 4 edges. (3) Global Time Scheme - Each satellite will need a clock synchronized to all the clocks on the other satellites. Since communication links between satellites are established by targeting laser transceivers to each other, some kind of synchroni- zation is necessary. In this case we believe that using a global clock to drive the link assignment process is the simplest way to ensure reliable retargeting. (4) Periodic Reconfiguration - The reconfiguration is done by retargeting tran- sceivers among satellites to establish a new set of links on a periodic basis. Each satellite synchronizes its reconfiguring process to the others at regular periods. The reconfiguration is synchronized by having the satellites run their reconfiguring process simultaneously. During this short period of network reconfiguration, no packets can be exchanged between satellites. The advantage of periodic reconfiguration is that it allows satellites to perform more smoothly 8 by taking reconfiguration into account. For example, by knowing the next reconfiguration is approaching, a node may decide to disconnect the virtual cir- cuits built on the present topology and/or set up some new ones based on the incoming topology. 2.2.2 Partitioning Space into Regions A good solution to the link assignment problem should satisfy two goals - max- imize connectivity and be computationally efficient. The approach we propose satisfies these requirements in handling large networks. In this method, a network is partitioned into small subnetworks. Each subnetwork independendy generates a locally optimized topology in terms of connectivity. Then these subnetworks are con- nected together to form a complete network with maximum connectivity. Applying this method to satellite networks, the space in which satellites orbit is partitioned into a number of regions of approximately the same size. Each region contains a subnetwork. For the sake of easy demonstration, the ensuing discussions assume that all satellites orbit at the same altitude. The multiple-altitude cases can be handled essentially in the same way. The size of a region is determined based on the principle that two satellites between a link are visible to each other all the time before the next reconfiguration occurs. The optimization of the region size will be discussed in detail in section 5. To specify regions and satellites formally, a coordinate system (y, 6, \|/), where y represents the altitude, 9 the latitude, and \\f the longitude, is adopted. Basically, the surface where satellites orbit is partitioned in such a way that the latitude is parti- tioned into twice as many intervals as is the longitude. For instance, if the longitude has X intervals, [\^q, \|/2], [\|/2, Vsl. - > [Vx-i. ¥x]. then the latitude has 2 x intervals, [%, Gi], [61, 62], [62, 63], ... , [e2^_i, Q2xl Note that, Vq = and Bq = because the surface is round. So the network is partitioned into 2x^ regions. 9 Each region is represented by ([0,-, 8,+i], \J/y+i]). Since we intend to make the area of each region approximately the same, these intervals are not of equal width. For example, the widths of 6-intervals near two poles, as their two \|/-borders merge into poles, should be greater than those of intervals in other areas. Let Vi, V2, ... , v„ denote all satellites in the network, and Ri, R2, ••• , Rm represent all regions partitioned from the spherical surface. A node v, at (0, , \|/,- , y,- ) is in region R if 0,- e [0„(/?,),0^ (/?,)] and V)/,- e ),\|/^ (/?, )] where 0„(^,), (/?, ), 0j(i?, ) and Vvv(^i) the four borders of /?,-. Ties are broken arbitrarily. Note that not every region has a northern or southern border. For instance, a region near a pole may lack a northern or southern border (in fact, the border reduces to a point), and the shape of a region in this area is much like a triangle on a spherical surface as illustrated in Figure 2.2.1. The shapes of the other regions are rectangular or trapezoidal, depending on their latitudes. Figure 2.2. 1 Regions and their connections near a pole. 2.2.3 Reconfiguring Networks Since a global timing scheme is used, each satellite will be running the same link assignment algorithm simultaneously to reconfigure the network. The synchroni- zation of link establishment for the network is realizable as long as each node keeps a copy of the whole network graph. When every node is executing the same algo- 10 rithm, a consensus can be reached as each node knows which edges should be dis- carded and which edges should be set up. The links formed by a satellite are determined by the region it is in and by its location within the region at the time of reconfiguration. Between reconfigurations (link assignments), it is permissible for a satellite to cross the borders into the other regions. It is assumed that transceivers on each link will keep pointing/tracking each other once the link is set up, regardless of where the other satellite is. If the region size and the reconfiguration period are properly determined, links will not be broken when satellites cross the borders. 2.2.4 Scheduling Network Reconfiguration As has been pointed out, the reconfiguration of the network is done periodically. The next question is how often should a network be reconfigured. Since each reconfiguration requires a temporary halting of message exchanges, to maximize net- work utilization the interval between two consecutive reconfigurations should be as long as possible under the constraint that no links will be broken in between. Thus, the maximum reconfiguration period is the minimum of the longest time interval any two linked satellites can orbit and still maintain their common link. This can be determined after the region size has been decided. Figure 2.2.2 illustrates how the period is calculated. Suppose that the following regions Ri, R2 each have only one node and at the time of a reconfiguration their positions are Xi and yi. Let X2 and y2 be two points in the space such that [x2, J2} ^ [ {x, y] : \ x - y \ = the maximum distance that 2 satellites are visible to each other ). Then the time for a satellite to travel firom x^ to ^2 or from y^ to 3^2 is the longest period of reconfiguration for which no links will be broken. This can be computed as soon as the region size and the orbits of the satellites are determined. *■ 11 11, ^2 Figure 2.2.2 The longest distance between two satellites in adjacent regions 2.3. The Hierarchical Approach for Reconfiguration Before we describe the hierarchical approach, let's introduce some terms. As the network is partitioned into regions, some of the nodes have to set up links across the regions so that packets can go through these links to other regions. Those nodes having border-crossing links are called backbone nodes and the others are called internal nodes or non-backbone nodes. Similarly, those border-crossing links are named as backbone links and the others are internal links. The role each node plays is not fixed. It depends on the location of the node. In general, each region will select nodes closest to the 4 borders (eastern, southern, western, and northern bord- ers) to be its backbone nodes. Backbone nodes play more important roles in routing packets because inter-region packets will be forwarded via at least one of these nodes. In each region, except for the one-node case as shown in Fig 2.3.1. a, there are 6 backbone links; 4 links cross the northern and the southern borders; the other 2 links cross the eastern and the western borders. The distribution of links over borders is not unique. For instance, another alternative would be to put 4 links on the eastern and western borders and 2 links on the others. The basic principle ("single-hop" prin- ciple) is to allow the by-passing messages to be forwarded in one hop vertically as well as horizontally. The 6-backbone-links arrangement is derived partly from the 4- regular graph constraint and partly from empirical results of constructing primitive 12 topologies for very small networks. There are possibly other arrangements with different numbers of backbone links in a region available to serve the purposes of obtaining regional maximum connectivity and of observing the single-hop principle as mentioned above. The hierarchical link assignment algorithm can be divided into two levels, the first-level task is to determine a regional topology which includes establishing inter- nal links and selecting backbone nodes. The second-level task is to combine regions into a connected network by setting up backbone links among backbone nodes of adjacent regions. The algorithm is designed to maximize regional connectivities as well as the whole network connectivity. In the first-level task, regions are handled in two different ways: the primitive case and the recursive case. If a region contains 7 or fewer nodes, it is handled as the primitive case. Otherwise, it is regarded as the recursive case. The primitive case consists of 7 topologies. In the primitive case, the regional topology is predetermined and links are set up accordingly. In the recursive case, the link assignment algorithm works by "peeling off layers of nodes, 4 nodes at a time, until it ends as one of 4 terminal cases; each layer contains the 4 nodes nearest to the 4 borders. The four terminal cases are shown in Figure 2.3.3. 2.3.1 Topologies for the Primitive Case Seven primitive topologies are provided for the construction of regional subnet- works with fewer than 8 nodes. Figure 2.3.1 illustrates all of them. Note that these topologies are chosen to maximize the regional connectivity. By maximum regional connectivity, we actually mean that a node can construct 4 node-disjoint paths to any other node in the same region via the nodes in the same regions or in neighboring regions. The primitive topologies are among the few topologies which meet the max- 13 imum connectivity requirement and the single-hop principle. The topologies have the following properties: (1) no links are redundant in the sense that every transceiver is used and no more than one link is assigned to any pair of nodes; (2) messages generated from a host inside a region can be forwarded efficiently out of this region via fewer than 3 intermediate nodes; (3) "straight-line hops," east-west or north-south message deliveries, are favored. There are backbone nodes in each case (e.g., Vj and V2 in Figure 2.3. Lb, or and V3 in Figure 2.3.1.c-g) that have 2 backbone links crossing east- west and north-south borders. Therefore, one-hop transmitting is allowed through this region. However, with the 4-link constraint, no similar "convenience" is offered if a by-passing message wants to "make a turn" through this region. 2.3.2 Topologies for the Recursive Case If the number of nodes in a region is greater than 7, the strategy is to construct a topology recursively. The link assignment algorithm works in the following manner: First, 4 nodes, each of them the nearest node to a border, are chosen to form the first layer of the region. Note that the nodes in the first layer will also be the 14 backbone nodes of the region and will connect to backbone nodes in the neighboring regions. Second, the algorithm continues to peel off a set of 4-nearest-to-border nodes to form intermediate layers until one of the terminal cases is reached for which topologies exist. Figure 2.3.3 shows the topologies of all four permissible terminal cases. Finally, any two adjacent layers, i and i+1, arc connected by their correspond- ing 4 nodes (e.g., vP, vP, etc.) as shown in Figure 2.3.2.b where the v/'^'s are of one layer above the Vj^^'*'^^'s. Note that the property (3) in the primitive case is also preserved. In general, outgoing messages originating from a node of a particular layer are forwarded via the corresponding nodes of its upper layers all the way up to the corresponding backbone node and then transmitted to other regions. layer/ and below (a) (b) Figure 2.3.2 Backbone Layer and Intermediate Layer Topologies 15 V V (c) (d) Figure 2.3.3 The Terminal Cases (Final Layer Topologies) 2.3.3 Topologies for Inter-region Links The inter-region backbone links are established in the second-level task. Only backbone nodes are involved in this task level. Generally, each region will have 4 backbone nodes. To illustrate how inter-region links are established, consider a region R, denote the backbone node closest to the west border by Vj, the backbone node closest to the south border by V2, the backbone node closest to the east border by V3, and the backbone node closest to the north border by V4. Let /?,-, i e {1, 2, 3, 4} be the neighboring region of R, e.g., let /?i be the west neighboring region and /? 2 be the south neighboring region, . . . etc.; and the backbone node closest to /?, 's west border is represented as v,- j, . . . etc. See examples in Figure 2.3.3. The inter-region links are set up based on the following rules: (1) Vj is connected to Vj j and V31, i.e., "horizontal links" are established (since the positions of the nodes change over time, the links are not strictly horizontal, but are so called for the sake of convenience). (2) V3 is connected to Vj^s and V4 3, i.e., "vertical links" are established. 16 (3) V2 is linked to V2,4 and V4 is linked to V4 2, (4) If the number of backbone nodes in R is less than 4, then some backbone nodes have to play the roles of more than one node. For example, in Figure 2.3.3.a R contains only 2 nodes, while in Figure 2.3.3.b R contains at least 4 nodes. Therefore, and V2 in Figure 2.3.3.a are performing the roles of Vj and V2, and V3 and V4, respectively, in Figure 2.3.3.b. "4.8 "4.2 / "1,1 "1 \ ■*•••. • "2 ■"3,1 "2,4 "2.^ "4.3 "4.2 ■■■"3,1 "1,1 "2 "2,4 "2.S (a) (b) Figure 2.3.3 Examples of Inter-Region Topologies 2.3.4 The Link Assignment Algorithm and Its Analysis The link assignment algorithm consists of procedures REGION_LINK and NETWORK_LINK. REGIONLINK handles the first-level task and NETWORK_LINK deals with the second level task. It is assumed that the space is partitioned and thus the region borders are determined before the network is initial- ized. Each node is associated with an ID. To execute the algorithm, each node should have the positional data, represented as (ID, 0, of all the nodes. The data are obtained through calculations of orbits. In fact, the data can be precomputed, stored in each node and accessed during reconfiguration. With this data, the member- ship of the nodes (i.e., to which region a node belongs) can be determined during the 17 reconfiguration, and they remain unchanged until the next reconfiguration. If a node is detected dead or missing, the information will be broadcast across the network and the node will be ignored for the reconfiguration. The procedures are described as follows: Procedure REGION LINK; Input: a set of nodes of a region of the form- {ID, (9, \^)] Steps: 1. i<-l; If # of nodes < 8, then construct the topology according to the arrangement of Figure 2.3.1; return. /* primitive cases */ 2. ConsOuct 2 bi-directional queues from the nodes, and Qq, j2y is ordered non-decreasingly according to the distance between a node and the eastem border, Qq 'is ordered non-decreasingly according to the distance between a node and the northern border; ties are broken arbitrarily. /* recursive cases */ 3. Remove the first and the last "unremoved" nodes from and Qq separately. These 4 nodes, mariced "removed", form layer i. If i = 1, then construct links as specified in Figure 2.3.2.a (the backbone layer intra-region topology), else construct links as specified in Figure 2.3.2.b (the intermediate layer topology), i i + 1. 4. If # of "uruemoved" nodes in each queue > 7, then go to Step 3. 5. Construct final layer topology as specified in Figure 2.3.3. end REGION LINK. Let R be the region in which the node executing the algorithm is located. Procedure NETWORK LINK; Input: the set of nodes of the netwoik of the form- {ID, (9, \|/)} Steps: 1. Call REGION_LINK 5 times to derive the regional topologies of R, R J, /?2i /?3. ^4. 2. Construct inter-region topology according to the 4 simple rules specified in section 3.3 (examples are shown in Figure 2.3.3). end NETWORK LINK. The analysis of the time complexity of the link assignment algorithm is straight- forward. Let k be the number of nodes in a region. In REGION_LINK, step 1 and 5 take 0(1) (note that the process of checking whether a node is "removed" or "unremoved" in a queue takes 0(1) as it can be done by indexing a binary array REMOVED[k]). Step 2, mainly the sorting procedures, take Oiklogk). Steps 3 and 4 serve as a loop. Since 4 nodes are handled each time, the number of loopings is 18 approximately klA. Hence, steps 3 and 4 jointly take 0{k). So, the total time com- plexity of REGION_LINK is 0{k\ogk). In NETWORK_LINK, as each node is only concerned with the topologies of its 4 neighboring regions and its own, the time complexity is still O {k logk ). Now, let n be the total number of nodes in the network and m be the number of partitioned regions. It takes time O (n ) to determine the memberships of the nodes. By assuming that the number of nodes in each region is approximately n/m, it can be seen that NETWORK_LINK takes time O ({n /m)\ogin /m)) which is bounded by Oinlogn). So, the algorithm is completed in O (n logn ) time. 2.3.5 Applying the Algorithm Given that there are 4 transceivers in each node, the maximum node- connectivity of a network is no more than 4. However, consider the case where there exists a region with only one node. In order to maintain a 4-connected network, all the other regions of the same longitude must contain exactly one node. This is simply because there is only one backbone link crossing each of the northern and southern borders for the one-node regions, while there are two such links for the other cases (see Figure 2.3.1). Consequently, a good decision on the size of a region should manage to prevent the one-node-in-a-region case when a network starts its reconfiguration. In fact, we will show that the link assignment algorithm maximizes the node-connectivity of a network if every region of the network contains at least 2 nodes during each moment of reconfiguration. A natural way to achieve this is to increase the region size (i.e., to reduce the total number of regions) such that every region contains at least 2 nodes all the time, which can be done by calculating satel- lites' orbits and adjusting region sizes accordingly. The algorithm is most applicable when dealing with a large number of satellites and not very suitable for small satellite networks for two reasons. First, if a satellite 19 network is small, its optimal link assignment can be found by a direct exhaustive search within an acceptable time limit. Second, in order to handle networks in which the number of satellites is small, large region sizes are needed. But, if a region is too large, it will make two nodes within the region invisible to each other, this will make the algorithm fail. In dealing with large satellite networks, the algorithm can handle the recursive case fairly efficiently. It can take advantage of the locality of nodes and come up with relatively short links (in a dynamic sense, of course, as the length of a link varies) except perhaps for the backbone links. With these short hnks in a regional network, it is expected that the mean propagation delay for local messages can be significantly reduced. Considering that propagation delay is a substantial part of the end-to-end delay in satellite communications, the algorithm will help speed up the routing in addition to providing high network connectivity. 2.4. Proofs on Network Connectivity Connectivity is one of the important criteria in network topology design. A net- work should be able to remain connected even if some links or nodes fail. Before discussing the connectivity of a network, we formally define some terms. Definition 2.4.1 Let G=(V, E) be an undirected graph representing a network N. V is the set of nodes in N, and E is the set of edges/links in N. A path from to y is a sequence of edges of the form <Vo, Vj, V2, ... , Vj> where (v,_i, v, ) is an edge for l^^y, and x = Vq, y = Vj. Two paths from v,- to are node-disjoint if they contain no common nodes except the two end nodes v,- and Vj. The node- connectivity of a pair of nodes is the minimum number of nodes whose removals will disconnect the two nodes. Equivalentiy, it can be defined as the maximum number of node-disjoint paths between them. The node connectivity of a network is defined as the minimum node-connectivity of all pairs of nodes in V. There is another type of 20 connectivity for a graph, called edge-connectivity which can be defined as the minimum number of edges whose removals will disconnect a graph. Since node- connectivity is a stronger condition than edge-connectivity in terms of reliability, node-connectivity is adopted here; throughout this chapter connectivity will mean node-connectivity. Given that each node has at most 4 links in the network, we will show that the link assignment algorithm NETWORK_LINK yields a surprisingly good connectivity in satellite network topology designs. Definition 2.4.2 Two regions /?,• and Rj are adjacent if is contiguous to Rj , or if they intersect in a pole and are in radially symmetric positions with respect to the pole. A region path between two regions is a sequence of regions of the form </?i, R2, R2, R4, — , Ri-i, Ri> where Rj_i is adjacent to Rj, 2 < j < i. Two region paths from Ri to Rj are region-disjoint if they have no regions in common except /?,• and Rj. Lemma 2.4.1 Let R = [R^, R2, ... , } be the set of the partitioned regions. Let Ri j denote the region [i, i+1] x [j, j+1], and /?, j € R. For any pair of regions, there exist 4 region-disjoint paths between them. Proof. The proof is straightforward. We can show that by explicitly setting up 4 region-disjoint paths between any pair of regions. There are two possible cases to be discussed as follows: Case 1. Ri J is adjacent to Rj j. It is trivially true (see Figure 2.4. La for exam- ple). Case 2. is nonadjacent to Rj j. Without loss of generality, suppose that i < j as shown in Figure 2.4.1.b. Construct a path as <^/,,, /?j+2./, •.• . Rj,i, ^y,i+i» ^j,i+2> — ' ^jj-^- Construct the second path P2 as </?,•,,-, 21 - ' - . ^;J>- Construct the third path as </?.-,•, ^,•,+1. ^f,.+2> - ' ^ij. - . ^;.;>- Construct 7^4 as </?.-,-, Pj, ^2' ^3' ^4 ^ The other cases such as i > j can be demonstrated similarly and thus are omitted here. □ Each time REGION_LINK is completed, an intra-region topology is constructed with layers of nodes. Except for the final layer, which has 4 to 7 nodes, each layer consists of 4 nodes of the form {vf \ V;^'\ vj'^ v^'^ ). For convenience we may call {vf'\ v^'\ vP, } layer i. In particular, layer 1 is the backbone layer, and also denoted as {vi, Vj, V3, V4}. Notice that the links, (v,-^\ v,*^'"^^)), (v^), vp"^^)) and (v|^) , v^'+i)) for all 1< i < 4 and j>l, will be assigned in REGION_LINK. Lemma 2.4.2 There exist disjoint paths from any node x to the other 3 nodes in the same layer i such that these 3 paths consist of nodes only in layer i, layer (i-1), or outside the region. Proof . There are three cases to be discussed. Case 1. Here i is the backbone layer. Without loss of generality, suppose x = v^. Let Vj^f denote the node v,- in region X(R), where i € {1, 2, 3, 4} and X e {W, NW, N, NE, E, SE, S, SW}. Then the 3 disjoint paths are: <Vi, V4>, <Vi, v^,y/^J^^^, V4,W'(/?). V2^(/?), V4jv,y(;f), VijvW(;j), Vi^(^), V4jv(/?), "^l^if^Ry '^3J^(Ry ^3>^ <Vi, ^i,EiR> ^4,£(/f). ^ZEiRy ^4.SE(R)y "^ijSEiRy ^uc/?)' ^4^(.Ry (see Figure 2.4.2.a). Similarly, we can show that the lemma holds for x = V2, V3, or V4. Case 2. In this instance, i is an intermediate layer. Again, assume that x = v{'^ Since (v{'\ v^'^) and {v{'\ v^'^) are 2 edges in layer i, 2 paths, <v|'\ V2^'^> and <vl'\ v|'^>, exist. The third path is from vf'^ to vj'\ which can be constructed 22 one layer above (i-1). If (i-1) is the backbone layer, i.e., i=2, then the path = <v{'\ v{^\ vP, v^, vP, v;j'^> (see Figure 2.4.2.b). If (i-1) is an intermediate layer, the path can be <v}''), v}'"!), v^-'^K v^'-'^\vP> (see Figure 2.4.2.C). Case 3. In the third case, i is the final layer. There are 4 topologies for the final layer. In the 4-node topology as shown in Figure 2.3.3.a, there is a complete graph. So, there are 3 disjoint paths from any node to the other 3 nodes. For the 5-node, 6-node, and 7-node cases, it can be checked easily that 3 disjoint paths exist from any node to the other 3, based on the similar enumerative method as in the inter- mediate case. □ Lemma 2.4.3 There are disjoint paths from any node x to the other 3 nodes in the same layer i, where layer i an intermediate layer, such that these 3 paths consist of nodes only in layer i or i+1. Proof. Let x be v}'^ Two obvious disjoint paths are <v}'\ v^'^> and <v}'\ v|')>. The third path, from vf ^ to vj'\ can be set up via nodes in layer i+1, i.e., <v{'\ v{''*'^\ v|'"''^\ v|'^^\ vj'^>. Since the link assignment in the intermediate layer is symmetric, it holds for the cases where J e {vj'), }• □ Lemma 2.4.4 There are 4 disjoint paths between every pair of nodes [x,y} where X and y are in different layers. Proof. Without loss of generality, suppose that x is in layer i and y is in layer j, i < j. From Lemma 4.2 we know that there exist 3 disjoint paths from x to the other three nodes in layer i by using nodes only in layer i or above (possibly nodes in neighboring regions). Now, since there is an edge (v/'\ v^^'"^^^) for all k e {1, 2, 3, 4} and i is either a backbone layer or an intermediate layer, 4 paths, <v^^\ v/'"^^\ ... , Vi^~^\ Vi^h>, where k = 1, 2, 3, 4, can be constructed. Hence, we are able to build up 4 disjoint paths from x to v^^ for k = 1,2, 3, 4. 23 If layer j is an intermediate layer, then by Lemma 4.3 there must exist 3 disjoint paths from y to the other 3 nodes in {vp^ v^M that do not contain nodes above layer j. Therefore, there are 4 disjoint paths from x io y (see Figure 2.4.4.a). If layer j is the final layer (i.e. layer j is equal to layer L in Figure 2.3.3), there are 2 cases. If 3^ e [v\'\ v^, v^'M = {v{^), v^^^, v^^^, vf >}, from the final layer topologies as shown in Figure 2.3.3, it can be seen that each node in [v[^\v^\ v^^] has disjoint paths to the other 3 nodes in the set. So 4 dis- joint paths can be constructed from x to y. If y e {vj^\ v^^\ vj^M, since each node in {v^^\ v^^), >} has 4 edges to all of the nodes in {v[^\ v^^\ vf \ v^^} as shown in topologies for terminal cases (see Figure 2.3.3 and Figure 2.4.4.b). □ Lemma 2.4.5 The connectivity of every pair of nodes in the same layer i is 4. Proof. It suffices to prove that, for each pair of nodes in the same layer, there exist 4 disjoint paths between them. This can be done by path enumeration. Since it will be a lengthy proof if we enumerate all of the 4 disjoint paths for every pair of nodes, some typical cases are illustrated to show that the disjoint paths can be constructed explicitly. The rest of the cases can be done similarly. Case 1. Here i is the backbone layer (i = 1). Let us show that, for {y^\ v^^), we can explicitly set up 4 disjoint paths connecting them as follows: The first path is <vp\ v2\ v^^>. Now, via its west, northwest and north neighboring regions, then going through V3, a second path can be established. Third, via east, southeast and south neighboring regions, the third path can be set up. Finally, the last path can visit vp\ , and then V;^^^ (see Figure 4.5.a). It is easy to verify that there exist 4 node-disjoint paths between any other pair in the backbone layer. 24 Case 2. In this instance, i is the final layer (i = L, see Figure 2.3.3). If the number of nodes in the final layer is 4 (as shown in Figure 2.3.3.a), every pair of nodes is directly connected, so the node-connectivity is a maximum. If the number of nodes in the final layer is 5 (as shown in Figure 2.3.3.b), then, for {v{^\ M in Figure 2.3.3.b, the 3 disjoint paths between v}^) and v^^^ arc <v{^\ v^^\ v^^>, <v^\ v|^^ vj^^>, and <v{^\ v^^\ v|'^>.The fourth path will pass through v[^~^\ v^^~^\ and v^^~^^ to v^^K Following the same enumerative process, we can verify that it also holds for the 6-node case (as shown in Figure 2.3.3.C) and the 7-node case (as shown in Figure 2.3.2.d). Case 3. In the third case, i is an intermediate layer. As Figure 2.3.2.b shows, we can choose any pair, say [v[^\v^^^, and construct the following 4 disjoint paths between them: <Vi('), v^'"), vP>, <vP, vf, >, <Vi('), vf'-D, v^'-^), vt^\ vP>, and <vP, vp^\ v^'+^\ v^'+^), vP> (see Figure 2.4.5.b). The other cases for an intermediate layer are similar. □ Let = (V, E) be the network topology derived from applying the NETWORK_LINK algorithm; and let {Vj, Vj- - ^^m) ^ ^he set of the nodes of the regions where each V,- resides in region . Theorem 2.4.1 If I V,- I > 2 for all V,-, then the network N has connectivity = 4. Proof. Let v,- and Vj be two nodes in V. Case 1. Here v,- and Vj belong to the same region /?,-. If/?,- contains fewer than 8 nodes, i.e., a primitive case (Figure 2.3. l.b to Figure 2.3. l.f), we can enumerate all the 4 disjoint paths between any two nodes. These paths will visit 4 adjacent regions of Ri (i.e., N(/?,), S(/?,), WiRi), and E(/?,)), and 4 non-adjacent regions (i.e., NW(/?,), NE(Ri), SE(/?,), and SW(/?,)). For the sake of briefness, they are not enumerated here (a similar example is shown in Figure 2.4.5.a). If /?,• contains more 25 than 7 nodes, it follows immediately from Lemma 2.4.4 and Lemma 2.4.5 that 4 dis- joint paths exist for each pair. Case 2. In the second case, v,- and Vj are not in the same region, (e.g., v,- is in Ri and Vy is in Rj). If /?,• is reconfigured as a primitive topology, it is easy to verify that there exist 4 disjoint paths from any node in any of the 7 primitive cases to its 4 adjacent neighboring regions. If 7?, contains more than 7 nodes, then as we have shown in the proof of Lemma 2.4.2 and Lemma 2.4.3, there are 4 disjoint paths from V,- to the four backbone nodes of i?, ; and each backbone node has an edge to its neighboring regions. Thus, regardless of the number of nodes /?,■ gets, there are always 4 disjoint paths from v,- to its 4 neighboring regions. Similarly, there are always 4 disjoint paths from Vj in Rj to its 4 neighboring regions. Now, by applying Lemma 2.4.1 we can construct 4 disjoint region-paths from v,- to 4 of its neighboring regions. Finally, via these disjoint region-paths 4 disjoint paths from v,- to Vy can be successfully established. □ Hi A I I -Y- (a) (b) Figure 2.4.1 Disjoint Region-paths 26 ^-1 ^ — \ nw _ - -t — ' / / / ^-1 n ; w « / s se (a) (b) .(•-1) (••) .-1) .('■) (0 Figure 2.4.2 Three cases of Lemma 3.4.2 (c) V (.+1) V Figure 2.4.3 Routing in the same layer (a) Figure 2.4.4 Routing in different layer 27 nw w s se (a) (b) Figure 2.4.5 Two examples of external routing 2.5 Failures and Recoveries In a mobile satellite network, a broken link can be detected by the loss of sig- nal, which may be due to the failure of a transceiver on either side. The failure of a node is detected through a consensus reached by its adjacent nodes. With 4- connectivity, a network can survive without being partitioned as long as no more than 4 nodes or links fail. Yet, there are backbone Unks and nodes whose failures have more serious impacts on the overall performance of a network than those of non-backbone links and nodes. When a backbone node/link fails, a local topological reconfiguration is needed to maintain sufficient backbone links and nodes; non- backbone link/node failures may be handled in the routing level (e.g., by reconstruct- ing virtual circuits) instead of in the link assignment level. As failure events occur asynchronously, the recovery should be done dynamically. The recovery will require each party involved to share the same information about the local reconfiguration and will need a centralized algorithm distributively run by the affected nodes. This cen- tralized recovery algorithm will do link-switching if a backbone link fails. It will also do backbone-node-selecting and link-switching if a backbone node fails. It is assumed that all the failure events are single-failure events; a multiple- 28 '1 failure event rarely occurs. When it does occur, its failures are treated just like a sequence of single failures. One thing we would like to point out is that, if any failure occurs at a time very close to the next reconfiguration, the recovery algorithm may ignore it because it will soon be taken care of by the global reconfiguration. 2.5.1 Link Failures Failure on a non-backbone link will be ignored as the network has a high degree of connectivity. A failure on a backbone link will not partition the network, but it will have a significant impact on the routing. All packets going to other regions are supposed to be forwarded through one of the backbone links. A failed backbone link will cause packets to be routed via longer paths and may create traffic jams for other backbone links. Hence, link failure of this kind should be handled properly. This can be done through link- switching. The backbone node which has a malfunctioning trans-ceiver will consult the other three nodes connected to it and decide which of the three links will be used to replace the broken backbone link. In this way the network can cushion the impact of a broken link to the minimum. 2.5.2 Node Failures A node may be detected to have failed ( e.g., through the failures of all the links incident to it). The failure of a non-backbone node will not result in a local reconfiguration. But a backbone-node failure needs to be handled properly, consider- ing its more important role. The recovery process works as follows: The region con- taining the failed node will select a node to replace it. The principle for the replace- ment is "choosing-the-nearest-node," i.e., the node that is closest to the failed node is chosen as the new backbone. As each node has information on the locations of the other nodes, this can be done in a straightforward manner. By doing this, those nodes originally linked to the failed node will be able to connect to the new back- bone node. CHAPTER III RELIABLE MESSAGE BROADCAST 3.1 Background For various reasons, a node in a network needs to broadcast messages to all other nodes. A broadcast protocol is reliable if every node can receive the broadcast messages in finite time. In networks with fixed topologies, broadcasting is usually done by delivering messages through some broadcasting trees (e.g., a minimum- spanning tree) [7,10,11,14,15,26-28] for the sake of reducing the number of messages exchanged. For networks where the topology keeps changing, the task is more com- plicated because a pre-constructed routing path may be disconnected. Even if no paths are disconnected, with varying propagation delays over communication links, the routing of the broadcast messages may not be optimal if pre-computed broadcast- ing paths are used in the protocol. It has been shown by Awerbuch and Even [3] that a reliable broadcast in a changing-topology network is possible if the network is "eventually connected" (no nodes are permanendy isolated from the others). Some sophisticated broadcasting algorithms have been developed to achieve reliable broadcast in eventually connected networks with slowly changing topologies [3] [32] [35]. However, the eventually con- nected networks for which these broadcast algorithms were designed are mainly sta- tionary networks undergoing arbitrary link failures and recoveries and thus the significance of these algorithms are very restrictive. In addition, the algorithms are not guaranteed to achieve minimum broadcast delay in networks with rapidly chang- ing topologies such as mobile communication networks described below. In this research, the concept of eventually connected networks is further 29 30 extended to include mobile networks subject to frequent link reconfigurations [30]. Typical networks of this kind include large-scale mobile communication networks such as packet radio networks and point-to-point satellite networks. There are three fundamental properties in the topologies of the networks: (i) each node is associated with a number of communication ports with which communication links can be esta- blished, (ii) the nodes are mobile and communication delay across each link is vary- ing, (iii) link reconfiguration is necessary in order to preserve the connectivity of the network [7][17-18][22]; this is due to restrictions such as limited traniiUiission powers in the transceivers of packet radio networks when nodes are too far apart, or the blocking of the earth in satellite networks connected by laser transceiver. It has been shown [26] that in changing-topology networks, if a broadcast mes- sage is not buffered, it is likely that some node may never receive that message. For example, suppose that / is the only node connected to j and a message M arrives at / immediately after the link between / and j is broken, if j later reconnects to / or connects to any node which has already received M and discarded it, j will not receive M . The questions which arise are: 1) how long a message should be buffered and 2) how should the message buffers be managed. These issues have never been success- fully addressed in networks with changing topologies. In [3], the algorithms work by assuming infinite message buffer space in each node. However, in any network each node has only a finite memory. In this research, we show that bounded message buffers are feasible for a reliable broadcast in eventually connected networks. Toward this end, we take a rather unconventional approach to the design of reli- able broadcast protocols. By using message reply vectors, we design a reliable and efficient broadcast protocol which requires bounded message buffers for broadcast messages in each node. The protocol also provides a mechanism which enables obsolete messages to be dynamically flushed out. This will substantially increase the 31 utilization of limited buffer space. In particular, the algorithm is proven to be optimal in terms of broadcast delay in eventually connected networks with arbitrarily chang- ing topologies. 3.2 Network Model Let a mobile communication network be represented as G = (V, E), where V represents the nodes in the network, and E represents the operating communication links in the network. \V\ is fixed while \E I is bounded and time-varying. 3.2.1 Preliminary Definitions In addition to all the basic notions of graph theory, we define some related terms as follows. A directed tree in a network is a tree in which each node, except for the root of the tree, has a neighboring node (nodes connected by a link are neigh- bors ) as its parent node, and each parent has one or more child node s or children . A link is connected if it provides at least a minimum operating interval x during which a message can be successfully transmitted, otherwise it is considered discon- nected. Two nodes that are connected by a link but that do not have a parent-child relationship are siblings. Two trees are called adjacent if they are connected by at least one link. The links of a tree are called parent -child links while the links by which two siblings are connected are sibling links. The sibling links can be further classified as internal and external; an internal sibling link connects two siblings in the same tree (i.e., they have the same root), and an external sibling link is an edge between two siblings in different trees. The issue on how the relationship between neighbors is determined will be discussed in the next section. A path P from x toy is a sequence of edges of the form <Vo, Vj, V2, ... , Vy> where (v,_i, v,) is a link for 1 < i ^ j, and x = VQ,y = Vy . If every (v,_i, v, ) is a parent-child link, then P is called a primary -path ; if some (v, _i, v, ) are siblings link, then P is called an alternative-path. 32 The concept of eventually connected networks is defined as follows. A node u is said to be eventually connected to node 5 in G if, at any time t^, there exists a constant (3 and a node sequence P = <Vo, Vi, V2, ... , Vy_i, Vy> where Vq = m and Vy = S, such that a time sequence T = (rj, - . 0') ^ found with tQ<t^<t2^ ... < tj <tQ + ^, and {v,_i, } is connected by a link at r, and a minimum operating interval [r,-, ti+x] exists for {v,_i, v,- }. If every node in E is eventually connected to node S , then E is said to be eventually connected with respect to node 5 . The node sequence P above does not have to be a physically connected path (P can be called an eventually connected path). Nor is it assumed that a message can always be delivered from m to 5 via P within the interval [tQ, tQ + 3.2.2 Assumptions It is assumed that the following statements hold: (1) Each node has a unique ID and has a complete list of the IDs of the other nodes in the network, but has no information about the complete network topology; the topological knowledge of a node is restricted to that of its neighbors (in a net- work undergoing topological changes, it is impractical to assume that complete topological data is available to every node). (2) Delivery is reliable across a connected link and the time to transmit a message through a link is finite and bounded. (3) The time spent in processing an incoming or outgoing message is negligible, compared to the propagation delay of sending a message through a link. (4) Links may reconfigure for certain reasons (e.g., maintaining network connec- tivity), which are unpredictable to each node; each reconfiguration may involve disconnections (link-downs), connections (link-ups), or switchings of links (reconfigurations that are not due to a link failure or recovery). In addition, if the reconfiguration is an "involuntary" one, i.e., a link failure, it can be detected 33 by the nodes on both ends. It is also assumed that a node will never fail so that every node will respond to an incoming message. (5) The network is eventually connected with (3 bounded; an unbounded P would imply that some nodes may be isolated from the rest of the network for an arbi- trarily long time regardless of what protocol is used, which has little or no prac- tical significance in message broadcasting. Observe that this assumption does not imply that every link reconfiguration should be finished in finite time or that every failed link should be recovered in finite time. All it requires is that the network be maintained in such a way that an eventually connected path exists between any two nodes. 3.3 The Two-Phase Broadcast Protocol 3.3.1 Description of the Protocol The broadcast protocol is a distributed algorithm in the sense that each node reacts to different conditions independentiy. For the purpose of brevity and ease of demonstration, a single-source broadcast environment is assumed. By assuming that broadcast messages do not interfere with each other, and that each node treats the broadcast messages from different sources independently, the protocol can be easily extended to multiple- source cases. Each broadcast message is associated with an acknowledgment message or a reply message that, appended on the end of the broad- cast message, contains one bit position for each node in the network (i.e., an n-bit vector initialized to 0). Throughout this chapter, we use reply messages (or ack- nowledgment messages) and reply vectors ( or acknowledgment vectors) interchange- ably. For the following discussion of the protocol, we shall assume that the broad- cast messages are from a single source node S . The protocol consists of two phases: broadcast phase and reply phase. 34 3.3.1.1 Broadcast phase In the broadcast phase, each node performs the flooding procedure [33] [35], i.e., each node receiving an incoming message M will keep a copy and send it to all the neighboring nodes except the nodes from which M came. Unlike conventional flood- ings, a message received is buffered in memory and will not be discarded until an explicit flush -out message from S arrives. The broadcast phase begins with 5 send- ing a message to each of its neighbors. Let node / be a node that has just received M, and NEIBi be the set of neighboring nodes of node / . When noce / receives M, it sets the position corresponding to / in the reply vector of M to 1. The first node from which M arrives is chosen to be the parent of /, denoted by PRi, and a parent- notifying control message is sent to PRi acknowledging the relationship. The nodes from which node / receives a parent- notifying message form the set of node /'s chil- dren, denoted by C/?,- . The other nodes sending M to j are the siblings of / , denoted by SB I . Again, a control message is sent to each sibling to acknowledge the relation- ship. Notice that 5fl,- consists of either the nodes which exchange the same M with node i due to the propagation delay, or the nodes which deliver M to i slightly later than PRi . Siblings play an important role in the reply phase when a parent-child link is disconnected. During the broadcast phase, if a new link is established between two nodes, both sides will exchange information about broadcast messages. Each node will transmit over the link the messages the other side does not have. As the broadcast messages are spreading across the network, a spanning tree rooted at S is constructed and expanded. The spanning tree consists of parent-child links and will be used to route the reply messages of M . If the network is connected and no links are discon- nected during the broadcast phase, a complete spanning tree can eventually be esta- blished. If links are broken in the meantime, then the spanning tree could be parti- tioned into two or more subtrees, which may cause reply vectors to be detoured. 35 Theorem 3.1 The broadcast protocol is reliable if the network is eventually con- nected. Proof. Let M be a broadcast message from 5. Assume that v never receives M. Since the network is eventually connected and M is broadcast by flooding, the number of nodes that receive M must increase over a finite period of time. If v never receives M, then the number of nodes receiving M must stop increasing after a cer- tain time, and this number is less than \V\. Let Ci be the set of nodes receiving M and C2 = V \ Ci where v e €2- As the protocol ensures that each node will inform its neighbors of M, there must be a permanent edge-cut between and €2- This implies that the network is not eventually connected, a contradiction. Hence, the broadcast protocol is reliable. □ 3.3.1.2 Reply phase A node receiving M initiates the reply phase under the following two condi- tions: 1) it has no neighbors to which M needs to be sent or 2) all the edges to its children are disconnected. In the former case, the broadcast phase ends at this node and we call such node a leaf node having no children. In the latter case, there is no need to wait for the reply messages from its children as it has no way of knowing when or whether the links connected to the children will be restored. In both cases, these nodes are called initiating nodes. If a node is not an initiating node, it accumulates all the vectors acknowledging the acceptance of M from its connected children, logically ORs them together to form a new version of the vector, and forwards the copy of this vector to its own parent node. Notice that, by piecing together the reply vectors from its children, a node can also construct a descendent list of its own. Lemma 3.1 If each parent-child link is not disconnected before the child node 36 finishes replying its vector, then every reply message can be sent to S through a primary-path. Proof . If each parent-child link is not disconnected before the child node finishes replying its vector, then each node can reply to its parent via the parent-child link. Hence, the reply message will be forwarded to S through a path consisting of only parent-child links, which is a primary-path. Q During the reply phase, if a node ; is disconnected from PRj, a sibling in SBj is chosen as the new parent of If it happens that two siblings are disconnected from their parents and thus acknowledge each other as their new parents, the tie is broken arbitrarily so that one is the parent and the other is a child. If node ; has no siblings, a new tree, splitting from the old, emerges with ; being the root. This may result in an alternative-path search if j has not yet replied its vector, because no primary-paths connecting j and S exist. The alternative-path algorithm. The alternative-path algorithm is based on the idea of finding external siblings. When a node, which has lost its parent and has no connected siblings, is unable to send a reply vector, it starts alternative-path search by passing the vector to its children. Receiving the vector from its parent, each child updates its own vector and sends a copy of the vector to each of its external siblings. As an internal sibling is a neighboring node in the same tree, it is useless to exchange reply vectors among internal siblings because they lead to no new paths to other trees. An external sibling can be discerned from an internal one from the infor- mation contained in the reply vector sent by the parent (i.e., the internal siblings will have Is' in the reply vector). In addition to sending the updated reply vector to its external siblings if it has any, a node needs to pass the vector to each of its own children. If a node has no siblings, it simply passes the vector to each of its children. The necessity of sending 37 vectors to children even if external siblings are found is justified by the following lemma. Lemma 3.2 An alternative-path is not guaranteed to be found if not every node in the new tree, generated as a result of link disconnection, is explored. Proof. Without loss of generality, suppose that v in the new tree Tj is a leaf node connected to the root r via a path P = <r, vj, ... , v, , v> and that v has no external siblings. Assume that the alternative-path search will only explore the nodes having external siblings. Therefore, the search stops at some node Vj where \ < j <i and v^- has external siblings. If it happens that none of the extemal siblings of 7, can lead a path to S but a new link is just established between v and a node u not in 7, and u has a path to 5, this path then becomes the only alternative path to 5. Since v is unexplored and thus does not receive a vector from its parent, the alternative-path may never be found even if the network is eventually connected. □ Figure 3.1 shows a simple example. Suppose that an alternative-path search will stop when the first extemal sibUng is found. Let and r2 be roots of two adjacent trees and Vi and V2 be two extemal siblings. Both and r2 start altemative-path search due to the disconnections of the links to their parents. The reply information on both trees may be exchanged over the extemal sibling link connecting Vj and then sent to Ti and and finally back to and V2 again. Both sides exchange reply vectors again and reject each other as their reply vectors are identical. The search on both sides stops at and V2 without discovering that there exists an altemative-path via V3 to S. Figure 3.1 A failed altemative-path search When a sibling node in a neighboring tree receives a reply message via an external sibling link, if the node has akeady done its own replying and the vector contains new information that it does not have, it will forward this vector immedi- ately to its parent. If it has not yet replied with its own vector or the vector contains new information, it will create an update vector by merging this vector with its own, wait for its children's reply vectors if it has not yet replied, and then reply the vector. If the vector contains no additional reply information, it will be rejected by this external sibling. It is possible that an external sibling may receive the same reply vector from different external siblings from the same tree. This "multiple-copy-to-a-single-sibling" situation causes redundancy, but it is risky to be avoided completely. The reason is as follows: Suppose that the algorithm can be designed in such a way that no siblings will be sent with more than one copy of the vector. This can be done by the follow- ing steps: 1) search in parallel all the nodes in the tree (assuming that the search is not interrupted by link disconnections), 2) bring back information about each external sibling from each descendent, 3) let the tree root decide which external sibling should be sent by which node. However, an altemative-path may be missed if the desig- nated sibling is disconnected from its external sibling after the duty is assigned and 39 before it can send the reply vector. Missing a particular alternative-path may cause a serious delay or a permanent blocking of the vector-replying for reasons similar to Lemma 3.2. However, the situation can be safely alleviated by refining the algorithm as fol- lows: A list of external siblings which have been reached by the reply vector is sent, along with the vector, to each children so that each node will not send the vector to the siblings on the list. After each node has finished routing the vector, the external siblings just reached with the vector are added to the list, which is dien delivered to its own children. The alternative-path algorithm propagates all the way down to the leaves of the tree. If none of these descendents of the root have external siblings, the tree is tem- porarily isolated from the rest of the network. As the network is assumed to be eventually connected, in finite time at least one node will connect to the other nodes outside the tree through link reconfiguration. This ensures that the reply vectors can continue to be forwarded toward 5. It is likely that, due to the operation of a net- work reconfiguration algorithm, some link disconnections may result in a partition of the network; a spanning tree becomes a spanning forest Therefore, an alternative- path eventually leading to 5 may consist of a sequence of parent-child links and external sibling links. The sibling-reclassify algorithm. After replying to its parent, a node is still likely to be disconnected from its parent and become a root of a new tree if it has no siblings. This makes the vector reply more complicated. It seems at first that this condition can simply be ignored since the reply vector, which contains reply mes- sages of all the nodes within this tree, has been sent out of this tree. However, ignor- ing this condition could result in a long delay or permanent blocking for replying vectors. Consider the following example. Let x and y be two nodes in tree and x is a child node of y. Suppose that x has no siblings. If x is disconnected from y, ■V 40 then X and y belong to different trees with x being the root of a new tree Ti- Now, the sibling links between nodes in and nodes in T2 change from internal to exter- nal. If these links are not "reactivated" as external links, where it might be the case that the only alternative-path from nodes in Tj to S must go through nodes in Tj, then the alternative-path can not be found. Therefore, sibling links between Ti and T2 need to be reclassified. The sibling-reclassify algorithm works in a way similar to that of the alternative-path algorithm. The primary difference is that the sibling-reclassify algo- rithm requires a specific control message, originating from the root of the new tree, to indicate which nodes are still in the same tree. This is because all the nodes in the tree may have the same information in their reply vectors due to the alternative-path search process invoked before the tree is split into two. Hence, a node has no way of knowing who its external sibUngs by merely referring its reply vector as what have been done in the alternative-path algorithm. The specific message contains the descendent list of the root, which is constructed when the root is doing its first vector reply. The algorithm proceeds as each node sends to its children the message contain- ing the set of all the nodes in the tree. Each node receiving this message can deter- mine which siblings have become external, and sends each of them a message declar- ing the relationship to be external. A node which receives such a declaration message will acknowledge the sending node with its reply vector. This is an important step to prevent the other side from missing the new reply information which arrives when the links have not yet been declared to be external. In summary, when a link is disconnected in the reply phase, the relationship between the nodes on both ends terminates. If a node is disconnected from its parent, it will select a new parent from its siblings if it has any. Otherwise, a new tree is bom. If the node has not replied its vector, the reply of its vector has to be done by 41 invoking the alternative-path algorithm. If the node has akeady finished replying, it reclassifies the siblings so that siblings in neighboring trees can be reactivated and can be used as possible bridges leading to the source. On the other hand, when a new link is established, two nodes on both ends of the link will consider each other as siblings and exchange message status in order to resume message broadcasts and/or to continue vector-replying. Lemma 3.3 Let C be a connected component and S is not in C. If C remains con- nected but isolated from the rest of the network sufficiently long, the reply vectors of nodes in C will converge to an identical form. Proof. We consider the four possible cases. Case 1: Suppose that no links are disconnected or newly connected in this connected component C for a sufficiently long period of time. If there is only one tree in C, then all the reply vectors will reach the root. Since the root can not forward its reply vector, it will send it to all of its descendents which will update their own reply vec- tors. This means every node in this tree will have an identical reply vector. If there are more than one tree, say Tj, T2, ... ,T^, there must exist at least one external sibling link between any two adjacent trees. Let T,- and Tj be two adjacent trees. Since the roots of and Tj will eventually invoke the alternative-path algorithm, there must be some exchanges of reply information between nodes in 7,- and Tj through external sibling links. As each node will route the reply vector from its external sibling in the adjacent tree to its parent, the reply will eventually reach the tree root The root will soon propagate it to all the nodes in the tree as a result of the alternative-path algorithm. Hence, we can see that nodes in J,- and Tj will have the same reply vectors. Similarly, it can be inferred that all the trees in the same con- nected component will have the same reply vector if C is isolated long enough. 42 Case 2: Suppose that some links are disconnected but C remains connected. Without loss of generality, let L, be the link that is disconnected. There are two cases to be discussed: Case 2.1: If L,- is a parent-child link, either the child selects a new parent or it pro- duces a newly-born tree. In the former case, if the new parent is in the same tree, it causes no problem because it is only a switching of a parent-child link. If the new parent is not in the same tree (an external sibling is chosen), the split new tree rooted at the child is now merged into another tree; the total number of trees in this com- ponent remains the same. As in Case 1, the merged tree will eventually share the same reply information as a result of exchanging reply vectors. In the latter case, a new tree is born, which increments the number of trees in this component by 1. The disconnection either invokes the alternative-path algorithm or sibling-reclassify algo- rithm. In both cases, the external siblings between the new tree and the old tree (i.e., its previous parent's tree) are declared and exchange reply information. Similar argu- ments as in Case 1 can be applied for the rest of this case. Case 2.2: If L,- is a sibling link, then it does not affect the convergence. The reason is as follows. If Li is internal, the disconnection only drops out a useless link in the tree. If L,- is external, as C remains connected, even if L,- is the last external sibling link connecting the two trees, the nodes in both trees can exchange reply information via other trees between them. The rest of this case is similar to Case 1. Case 3: Suppose that some links are established. Since a new link will only make nodes on both sides exchange broadcast and reply information; it does not block vector-replying. On the contrary, it may speed up the process of the convergence. 43 Case 4: Suppose that some links are disconnected and some are connected. Similar reasoning can be applied as in Case 2 and 3. □ Lemma 3.4 Each reply vector can be delivered to S through the primary path or an alternative-path in time that is finite and bounded. Proof. Let V be a node sending a reply message to 5 . If there is no link disconnec- tion, from Lemma 3.1 we know that 5 can receive v 's reply message. If some links in the primary path from v to 5 are disconnected, two possible cases are considered: Case 1: assume that v and S are in the same connected component C for sufficiently long. Let 5 be in tree 7, and v be in tree T^. If 7, and are adjacent to each other, then v can be routed to S via a sibling link between the two trees. If and are not adjacent, there must exist some trees Tj, T2, ... Ji, such that is adja- cent to Ti, Ti is adjacent to Tj, ... , and T,- is adjacent to T^. Without loss of gen- erality, we assume that no new tree is bom in the meantime. From the alternative- path algorithm, we know that there exists an alternative-path between 7, and via Tj, ... , Ti. Since each vector can be sent over a link in finite and bounded time, v's reply vector can be delivered to 5 in time that is finite and bounded. » Case 2: assume that v and S are not in the same connected component. Let be the set of nodes which have received v's reply information. As the reply process proceeds, IC^ I is increasing like a step-function. From Lemma 3.3, it can be seen that v's reply information will not stop propagating as long as some nodes in the same connected component have yet to receive it. If ICy I stops increasing tem- porarily, then there must exist an edge-cut between Cy and V \ C^. Since the net- work is eventually connected, there must exist a node u and a node w such that u e Cv and we V \C^ and a new link [u, w] is established in finite and bounded time. Therefore, due to the reply information exchanges between u and w, C^ is now 44 expanded to be Q u w . As the total number of nodes in this network is fixed, S will get v's reply message in time that is finite and bounded. Similarly, we can con- clude that every reply vector will be received by S in finite and bounded time. □ When the source S receives all of the reply vectors for message M (i.e., the final version of the vector has value 1 in every bit in the vector), it attaches a flush- out message in its next broadcast message ready to be sent, which directs each node receiving this message to discard M. A flush-out message is considered as a part of the broadcast message to which it is appended and it will be discarded along with the broadcast message later. It is important to note that the protocol does not imply that each broadcast mes- sage can not be broadcast until all the reply information of its predecessor has been received by S. As indicated earlier, each node can treat each broadcast message independently. If more than one broadcast message from the same source are allowed to coexist in the network, what has to be done is to allow each broadcast has its own "family" in each node. As the network is subject to frequent link reconfigurations and propagation delay in each link is changing, a broadcast message should not take advantage of the direct tree established by its predecessor by broadcasting along the tree links. Flooding is still needed if reliability and minimum broadcast delay are to be achieved. Theorem 3.2 Every broadcast message will be discarded in finite and bounded time. Proof. From Theorem 3.1 and Lemma 3.4, we know that the source S will detect that every broadcast message has received a broadcast message in finite and bounded time. Hence, a flush-out message can be broadcast to clear that message, which implies that the stay of each broadcast message in each node is finite and bounded. □ 45 Corollary 3.1 If the generation rate of broadcast messages is bounded, then the buffer space needed in each node is bounded. □ 3.3.2 Message Buffer Management Message buffer management is one of the major issues in communication proto- col designs. As each node has only limited memory space, this issue becomes criti- cal because an ineffective management policy can result in a great waste of buffer space or, even worse, an unreliable protocol. Theoretically, the broadcast algorithm above can still work as a fast and reliable communication protocol without the reply phase, provided that each node has a sufficiently large memory to buffer broadcast messages and is capable of determining the right time to discard the messages. Without a reply mechanism, there are two possible ways of managing message buffers: (1) Static policy - Each message received will remain in the buffer only a fixed period of time and then be eUminated from the buffer. This fixed interval is determined by an upper bound which is long enough to ensure that a message has been received by every node. The weakness of the method is two-fold: First, it is very difficult to accurately estimate the upper time limit for a broadcast message to stay in the buffer and the limit tends to be over-estimated. Second, by fixing the time a message has to stay, it is possible that too many obsolete messages, i.e., messages already received by every node, stay in the buffer memory for too long. This will cause a significant waste of the buffer space. (2) Forced -out policy - Broadcast messages are discarded only when the buffer is full and messages are discarded in a first-in-first-out order. This method has the same weakness as the static policy in wasting buffer space. In addition, it also causes great complexity in allocating message buffers since broadcast messages are not the only messages a node needs to handle. If a broadcast message buffer 46 is too small, a broadcast message may be eliminated while some nodes are still waiting for that message. This violates the reliability of broadcasting. If too large a broadcast message buffer is allocated, again, memory space is substan- tially wasted. The two-phase broadcast protocol provides a dynamic buffer management scheme. This management scheme has two major advantages: First, the reliability of the protocol is achieved in a more flexible manner. Since each message will be dis- carded only if the corresponding flush-out message is received, a broadcast message is allowed to stay in the buffer as long as it is needed without being accidentally eliminated from the buffer. If few link reconfigurations occur during the broadcast and reply phase, a message can be discarded quickly. If more link disconnections and connections occur, a message is allowed to stay longer. In short, the protocol's dynamic management policy provides a significantly better buffer utilization than the two policies above. Second, Corollary 1 implies that each node can estimate the upper bound of the buffer size needed to store broadcast messages by estimating the maximum time interval a message will stay. With this time bound closely approxi- mated, the buffer allocation problem can be substantially simplified. 3.3.3 The Node Algorithm In this section we present the two-phase broadcast protocol in the form of a node algorithm, consisting of a number of event-driven procedures. A node reacts to each condition/event by running the corresponding procedure. Before we formally describe the node algorithm, some terms and variables used in the algorithm are defined as follows. a. M : a broadcast message from the source S . b. ACK-parent(i): the parent-notifying message from node / to its parent. 47 c. ACK-sibling(/ ): the sibling-notifying message from node i to its sibling. d. DCL-exsib(/): the message from node i to declare external-sibling. e. ACK-vector(M,0 : the current acknowledgment vector of a broadcast message M at node /. It is an n-bit binary vector and ACK-vector(M,/ ) 3 ACK- vector(Mj) if ACK-vector(M,i)[^] > ACK-vector(M,y)[;:] for all k, where 1 < k < n. ACK-vector(M,/)[A:] is 1 if node / "knows" that node k has received M, 0 if otherwise. f. UNKi : neighbors of node / and whose relationships to / are still unknown (e.g. they could be either siblings or children of / ); UNKi is initialized to be NEIBi . g. E-SB, : the set of /'s external siblings. h. DESi : the set of i 's descendents. i. SB -DONE : the set of the external siblings already been reached by the same reply vector from ancestors. j. BUFi : message buffer of node / . k. FLUSH-OUT(M): a message indicating that M can be discarded from the buffer memory. 1. LINK -DOWN j i a message indicating that the link to node j is disconnected, m. LINK -UP j -. a message indicating that the link to node j is established, n. \: difference, a set operation. o. or: a bitwise logical-or operation defined on n-bit ACK-vectors. The Basic Node Algorithm for node / : For a message M from j: a.1) Mark ; RECEIVED for M; a.2) UNKi <- UNKi \ {; ); a.3) if M is in BUFi,SBi <- u ; and send ACK-sibling(i ) to a.4) else begin /* is the first node sending M to i */ a.5) put M into fif/F, ; a.6) ACK-vector(M^ )[j] <- 1; a.7) send ACK-parentO) to j; a.8) PRi<r-j; a.9) UNKi <r- NEIBi \ PRi ; a. 10) for every k g NEIBi \ O ). a.11) sendMto*; a.12) end a. 13) if every k g NEIBi is RECEIVED for M and UNKi = 0, a. 14) call reply_ack; For an ACK-parentO ): b. l) CRi <r- CRi ^ U); b.2) UNKi <^ UNKi \{j]; For an ACK-siblingO ): C.1) Sfi, SB, u {;■); C.2) UNKi i- UNKi \[j]; For a UNK-DOWNj message: dl) NEIBi i- NEIBi \{j]; d.2) if y G UNKi , UNKi <- UNKi \{j}; d.3) itjeSBi,SBi<r-SBi\[j}; A4) if 7 G £ -SBi , E <- E -SB, \ {y ) ; A5) if 7 G CRi, d.6) begin A7) CRi <^ CRi\[j]; d.8) if / is not REPUED and (C/?, = 0 or every k g C/?i is REPUED), d.9) call reply ack; d.lO) end d.11) if =PRi and SB, 0, d.12) begin d.13) choose a new parent k from SB, ; d. 14) send ACK-parent(i ) to i ; d.15) end d.16) else if ; = P/?, and i is REPUED , d.17) begin /* starts sibling-reclassifying */ d.18) create Z)£S,; d.19) sendDES, to every k e CRi ; d. 20) end For a LINK -UP j message: e. l) NEIBi <^ NEIBi u {7); e.2) SB, <- SB, u {7 ); e.3) J exchanges broadcast message status with 7 ; e.4) for every M not received by 7 , e.5) send M to 7; e.6) i exchanges reply message status with 7 ; e. 7) perform steps f.l-f.7 for the reply messages from 7; For an ACK-vector(M,7 ) where 7 g CRi or 7 g SB, : I* reply message for M arrives from a child node or a sibling node of / */ f. l) if ACK-vector(M,0 □ ACK-vector(M,7 ), then ignore ACK-vector(MJ); f.2) else begin f.3) ACK-vector(M ) <- ACK-vector(M,z ) or ACK-vector(M J ); f.4) mark 7 REPLIED for M; f.5) if every k g CRi is REPLIED , f.6) call reply_ack; f. 7) end For an ACK-vector(M,7) where 7 g PB, : I* reply message for M arrives from the parent node of i */ g. l) if £ -SB, is not initialized, g.2) E-SBi <- SB, \ {k I ACK-vector(M,7)[A:] = 1); g.3) ACK-vector(M,/' ) <- ACK-vector(M,/" ) or ACK-vector(M,7 ); g.4) if £ -SB; * 0, then for every k g £-SB, , 49 g.5) send ACK-vector(M,i) to k; g.6) for every / g CRi g. 7) send ACK-vector(M,/) to / ; FoT&DESj message; h. l) for every k e h.2) ifk not 6 E-SBi and k not e DESj, h.3) begin h.4) E-SBi <r- E-SBi u {i); h.5) send DCL-exsib(j )tok; h.6) end h.7) for every / g CRi h. 8) s&nd DESj to /; For a DCL-exsib message from i. l) E-SBi <- E-SBi u i.2) send ACK-vector(M^' ) to ); For a FLUSH-OUT(M) message from j: j.l) eliminate M from fiC/F, ; /* this message came along with a broadcast message */ Procedure reply_ack for i : k. 1) if PRi € NEIBi , send ACK-vector(M,i ) to PRi , k.2) else begin f* altemative-path search */ k.3) if SB -DONE is not initialized, k.4) SB -DONE <r- 0; k.5) if E-SBi ^0, k.6) begin k.7) SB -DONE <- SB -DONE u E -55, ; k.8) for every k e E \ SB -DONE k.9) send ACK-vector(M,j ) and SB -DONE to k ; k.10) delete 5B -DONE from / ; k.11) end k.12) for every k e CRi k. 1 3) send ACK- vector(M,i ) to jfc ; k.14) end Some notes on the basic node algorithm: (a) In e.3 and e.6, when nodes exchange the status of messages, only the headers of the messages are exchanged so that the communication cost can be reduced, (b) A reply message can be rejected by the parent or a sibling if it contains no additional reply information, but a reply vec- tor from a parent will not be rejected by a child (see algorithm g.*). Consider the fol- lowing situation. During an alternative-path search, an external sibling has received a reply vector from its sibling. Vg will forward the vector to its parent, assuming that additional reply information is contained in that vector. Suppose that a child node is allowed to reject a reply vector from its parent. If the vector is eventu- ally routed back from the root node as a result of a link disconnection, will reject 50 this vector because it has already received the same vector. Consequently, some des- cendents of this rejecting node will not receive this reply vector. From Lemma 3.2, we know that the alternative-path may never be found, (c) For the source node S , if the vectors from its children indicate that all the nodes have received a particular broadcast message M (i.e., every bit is 1 in ACK-vector(M ,5 )), S attaches a FLUSH-OUT(M) to its next broadcast message. 3.4 Performance 3.4.1 Minimum Broadcast Delay The broadcast delay of a message M for a single node / is defined to be the time delay from the time S starts to broadcast M to the first reception of M at node i, denoted by £>,(M). The broadcast delay of M is defined to be Max:[Dj(M)\ j e V} Since it is assumed that "zero time" is spent at each node in assumption (3), D, (M) is the sum of the propagation delays of M on each communication link in a path from 5 to / and/or the time delays of waiting for the connections of some links. The delay caused by waiting for a link connection happens when the network is tem- porarily partitioned into several disconnected components. Theorem 3.3 Minimum broadcast delay is achieved by the basic broadcast protocol. Proof . For each node t, t * S,\etP = <S , Vi, ... , v,_i, v,-, t> be the eventually connected path through which message M arrives at t using the basic broadcast pro- tocol, and each node in P is the parent node of the next node (e.g., v, is the parent node of t). We will show by contradiction that the delay for r to receive M is minimized if M travels along P . Suppose that P is not the shortest path to deliver M and P = <S', V 1, ... , v'y, t> is the shortest path. Let (v'^, v'^^.i) be the first pair of nodes such that v ^ is not the parent node of v^^j, and that u is the parent node of 51 v'jt+j. Since the parent node is chosen as the first node from which M arrives, there must be a shorter path from S via m to v ^+1, and then from v to r. So, P is not minimum, a contradiction. Hence, P is the minimum path from S to t for delivering M. □ Notice that, as each link disconnection results in an exchange of received mes- sages on both sides of the link, partitioning of the network does not prevent messages from finding the shortest path to a node. Since the protocol deals with networks hav- ing arbitrarily changing topologies, it is possible that the order in which messages are received at a node is different from the order in which the messages were broadcast. 3.4.2 Communication Cost The communication cost for a broadcast message M is defined to be the number of messages exchanged during the broadcast and reply phases of broadcasting M . Messages in the networks can be classified as broadcast messages and control mes- sages; control messages consist of ACK-parents, ACK-siblings, ACK-vectors, and DCL-exsibs. Lemma 3.5 The communication cost in the broadcast phase is 0(\E I). Proof . Let Af be a message broadcast from 5 . As the flood of M goes forward, each edge is traveled at most twice. The edge between a child and its parent is traveled only once by M because each node receiving the first copy of M from its parent will not send the same message back to its parent. The number of these parent-child edges in a network is (IV 1-1). So, in the worst case, the total number of the copies of M exchanged is 2\E I - (IVl-l). In the best case, each edge is traveled exactly once, which implies that \E I copies of M are exchanged. The cost for control mes- sages is 2\E \-\V\. Therefore, the communication cost is 0(\E I). □ 52 The analysis for the cost of control messages in reply phase is more complicated due to the fact that the network under which the protocol is performed is subject to link reconfigurations. Lemma 3.6 The communication cost incurred by a link disconnection is 6> (!£■ I). Proof . The disconnection of a sibling link causes no message exchanges. A tree link (parent-child link) disconnection causes no problem if the child has a sibling. Other- wise, it causes a tree-splitting and invoke either an alternative-path search or a sibling reclassification. In the latter case, DCL-exsibs are sent to the "new" external siblings which were internal siblings before the link was broken. In the former case, ACK- vectors are sent to the tree's external siblings, which may result in subsequent alter- native path searches on the other trees. In both cases, the total number of reply vec- tors exchanged over tree links is on the order of O (IV I). It remains to be determined how many reply vectors are exchanged over the external sibling links which connect the affected trees. In the normal case, this number should be much smaller than 0{\V\). In the worst case, it should not exceed 0(\E\) (in a highly dense network). Hence, for each link disconnection, the communication cost incurred is 0{\E I). Lemma 3.7 The communication cost incurred by a link connection is 0{\E\). Proof . Each link connection will cause exchanges of broadcast status and reply vec- tor status. If both sides contain the same reply information, no more messages are exchanged. If one side contains new reply information, the other side may invoke vector replying, which in turn may result in an alternative-path search. From Lemma 3.6, we know that the maximum cost induced \% 0{\E I). Lemma 3.8 The communication cost in the reply phase is bounded by 0{\V\\E\). Proof . In the best case, no parent-child links are disconnected during the reply 53 phase; every reply vector can be routed to 5 via a primary path. Thus, the communi- cation cost is IV 1-1. Otherwise, there are 2 cases to be considered: Case 1: communication cost incurred by link disconnections. During the reply phase, every link in the network is labeled either as a parent-child link or a sibling link. As shown in Lemma 3.6, a single link disconnection can cause at most 0{\E\) message exchanges. With the number of parent-child links bounded by IVl-1, the total communication cost induced by link disconnections can not exceed O (IV 1 1£ I). Case 2: communication cost incurred by link connections. We consider the following two cases, (i) Suppose that a link is connected within a tree. If the two nodes have identical reply vectors, it causes no additional cost except for the exchange of reply information. If two nodes have different reply vectors, both of them will send their newly acquired reply vectors to their parents. Since these two nodes are in the same tree, the reply vectors will stop at the first common ancestor of the two nodes. Hence, the cost of this connection is bounded by (9(IV I), (ii) Suppose that a link is connected between different trees. If both sides have different reply vectors, in the worst case it may cause the reply vectors be sent up and then down as a result of alternative-path search which is bounded by 0{\E\). Subsequent sibling link connec- tions between the two trees will induce no cost in vector replies because the nodes in these two trees will have identical reply vectors by Lemma 3.3. Since there are at most \V I different trees, the cost of link connections is at most O {\V I \E I), Summarizing Case 1 and Case 2, it can be seen that the communication of the reply phase is bounded by 0(IV ll£ I). □. Theorem 3.4 The communication cost of broadcasting a message is bounded by OiWWEl). Proof . It follows immediately from Lemma 3.5, 3.6, 3.7 and 3.8. □ 54 It is important to notice that, even at the extreme case where the network topo- logies are changing so rapidly that the number of link disconnections and connections is in the order of 0(l£: I) (i.e., almost every link is reconfigured), the upper bound 0(\V\\E\) still holds (i.e., not 0(l£:l^)) because of the convergence of the reply information. From a practical point of view, in most of the mobile communication networks each node has only a fixed number of communication ports which are used to send and receive data. Therefore, the worst-case communication cost of the proto- col is approximately 0(\V\^). 3.4.3 Speedup in Reply Phase In this section, we discuss a possible way of speeduping the reply phase. As each node tries to send the reply vector of a received broadcast message to S during the reply phase of the algorithm, the performance of the reply process depends on the number of link disconnections in the network. The number of link changes of a net- work in tum depends on the reconfiguration algorithms. For networks with good reconfiguration algorithms, very few links will be disconnected, which ensures that reply vectors can be quickly forwarded to a source node. Furthermore, the reply phase may be shorter than the broadcast phase if fewer links are disconnected during the reply phase because reply vectors are normally short packets compared to broad- cast messages which may be arbitrarily long. For a network that is undergoing rapid topological changes, Lemma 3.8 indi- cates that the reply phase may incur a substantial overhead in communication cost. This suggests that an improvement in the protocol is desirable, particularly in the speeding up of the reply phase. A key observation to such an improvement is that, during the reply phase, the first node which detects an "all-1" reply vector may not be S. Recall that each node which receives the first copy of a broadcast message will put 1 into its own position 55 in the ACK-reply (see line a.l of the node algorithm). Thus, we can modify the algo- rithm by allowing any node (instead of only S) to issue a FLUSH-OUT message to remove a broadcast message M if it detects the fact that the message has been received by all the nodes. If node i is the first node detecting the fact, it discards M from its buffer memory and attaches a FLUSH-OUT(M) message to its own message to be broadcast. If / has no immediate messages of its own to broadcast, it can either pass the FLUSH-OUT(A/ ) independently or pass it to its parent. Its parent will do likewise. When the FLUSH-OUT(A/ ) reaches the root of a tree, it will be broad- cast independently. This modification requires that an ACK-vector is appended to a broadcast message at the beginning during the broadcast phase. When the nodes on both sides of a link exchange a broadcast message (observe that they are candidate parents of each other), they also update their current ACK-vectors on that message. It is difficult to analyze the performance gains by this speedup mechanism. However, in Figure 3.2 we show some network topologies in which significant speed-ups can be achieved by using the modified protocol. Notice that, in these examples, when r's start the reply phase, they immediately detect that all the nodes have received the broadcast message delivered along the flooding paths, so they can issue FLUSH-OUT messages and thus a high degree of speed-up is obtained. The degree of speed-up obtainable for a particular network depends not only on the net- work topology but also on the location of the source node; some nodes are better able to take advantage of the above modified algorithm than others. 56 network edges flooding paths Figure 3.2 Some Examples of Speed-up 1 ST" CHAPTER IV GLOBAL NETWORK SURVEY 4.1 Background The global survey problem (GSP) studied in this paper can be briefly specified as follows: Given a general network G = (V,E) whose topological information is not available to any node in the network, and the network is vulnerable to unexpected link failures but remains connected, a global survey is conducted by a node v e V; during the survey, a question Q, which allows a short answer (e.g., a message of constant length) respond, is broadcast to every one in the networks, how a distributed algorithm can be designed to complete the survey (the completion of a survey requires the answer from every node to be received in finite time) ? A simple instance of such survey that requires binary answers is voting; that is, all nodes in the network vote to decide a particular issue such as whether to allow a new node to join this network. Alternatively, we may consider the problem as that of global informa- tion collecting in the sense that a node in a dynamic network may occasionally wish to collect some specific information from every one. It is also closedly related to other research issues such as distributed consensus. Designing a distributed algorithm for the GSP normally requires two essential operations: broadcast and reply. A survey question needs to be broadcast to all the nodes in the network before the answers to the question can be replied by the nodes. In this sense, the GSP can be considered as an extension of network broadcasting which has been extensively studied for more than a decade [3,10,11,14,25,33,35]. To the best of our knowledge, no previous work has addressed the GSP from the same perspective. A similar work was done by Baratz, Gopal and Segall [5] on fault- 57 58 tolerant queries in networks. But the technique used in [5] is relatively straightfor- ward and no optimization was attempted to reduce the message complexity, which is the primary concern of this work. Like many other disdibuted computing problems, the performance measurement of an algorithm for the GSP involves time-message complexity trade-off. An efficient algorithm generally has higher message complexity while algorithms with low mes- sage complexities tend to take longer time to terminate. In this paper, we present lower bounds for the GSP on both time and message complexities. We show that Q(IVl) is a time-complexity lower bound and Q(inin(IV l,^)IV l+lE I) is a lower bound for message complexity. It is observed that the complexities on obtaining these two bounds are asymmetric. While there exists a straightforward algorithm to achieve a time complexity lower bounds, a message-optimal algorithm is complicated to design. Our main effort is to achieve the latter. Based on depth-first-search and multiple-tree traversal, we present a lower-bound algorithm in terms of its message complexity. The significance of the algorithm is that it achieves a message- complexity lower bound regardless of the number of link failures as long as the net- work remains connected. That is, even if the number of hnk failures during the sur- vey is as large as the message complexity of the algorithm is still bounded by O0V\\ 4.2 The Model and Motivation We consider this problem in networks in which the following things hold: (i) The network is subject to unpredictable topological changes. Consequentiy, it is assumed that for each node, the topological information is restricted to 1- neighborhood (i.e., each node knows only to whom it is connected). This condition is the same as what has generally been assumed in other distributed computing prob- lems such as broadcasts in dynamic networks [3,35], constructing MSTs and leader 59 elections [4,16,29]. (ii) Each node has a unique id. which is available to every one; In this way, a node conducting a survey will be able to tell whether every reply has been received. It is also assumed that there exists a canonical order among these IDs. A straightforward implementation of such ordering would be ranking the nodes according to the lexical order of their IDs. (iii) The time to send a message over a communication link is arbitrary but finite, (iv) During a survey, some links may fail but nodes will not, which ensures tiiat every node in the network will eventually reply. Despite the link failures, the network remains a connected component. A failed link is eliminated from the network and is not usable for the rest of the pro- cess. The rationale behind this is that, considering the communication delays in gen- eral computer networks, the time required to recover a typical link failure is unpro- portionally long. The performance of a distributed GSP algorithm is measured according to two major criteria: time and communication complexities. The time complexity is the total time to finish a survey task and as commonly assumed, the exchange of a mes- sage over a link takes (9(1) time. There are basically two ways to define communi- cation complexity. One is defined in terms of bit complexity. That is, the total net- work overhead is calculated according to the total number of bits transmitted over links in the network as a result of the algorithm. This definition is generally accept- able in many networks. However, from the perspective of a node processor, there are some overheads associated with each message transmitting, which are largely independent of the message length (e.g., channel allocation, routing decision, com- munication process generation etc.) Under these circumstances, the communication complexity of an algorithm may be defined in terms of its message complexity, which is the total number of message exchanges over the links in the network. In some distributed computing problems, it is possible to bound the lengths of messages generated by an algorithm to a constant. Unfortunately, for GSP each node 60 must reply with its ID. as well as its answer, otherwise the conducting node will not be able to tell which answer belong to which node and thus unable to determine whether a survey is complete. This implies that each message requires at least O (log IV I) bits. As each answer replied to a question is short and of constant length, sending these messages on individual basis would flood the network. Therefore, the idea of luny}ing is considered a better alternative, which means concatenating many short messages into a slightly long but reasonable-length one. Through message lumping the message complexity of an algorithm can be significantly reduced. But a straight- forward lumping implies the possibility of creating messages of (9(IVlloglVl)-bit long. In this model, we manage to avoid the creation of such long messages and allow messages to be bounded by O(IVl) bits. We shall prove that no algorithm based on message lumping can be designed without using messages less than 0(IV I) bits. In other words, 0(lyl) is a lower bound on message size. It is the main focus of this research to examine in detail the GSP algorithms based on lumping and their performances in terms of message complexity. 4.3 Lower Bounds In this section, we proceed to show that Q(IVl) and Q(niin(IV l,^)IV I) are lower bounds on time complexity and message complexity for the GSP respectively. For the convenience of discussion, let v be the node that conducts the survey process for a question Q. That is, v is the source of the question Q and every node will have to reply to v with its answer to Q . Theorem 4.2.1 Given k link failures during the survey process where k is arbi- trarily large as long as the network remains connected, f2(IVt) is a lower bound on time complexity for GSP. 61 Proof. Consider the case where the topology of a network is as shown in Fig. 4.1. The source node, v, broadcast Q to Ui, M2, ... , W|vl-i- Suppose that when these nodes receiving Q are ready to reply, (v, U'2), (v, Uj), ... ,(v, fails. All the replies must be forwarded via (v, u{). Thus, given any protocol, it takes 0{\V\-\) time for v to receive all the replies. The theorem thus follows. □ Ui V Fig. 4. 1 An example for lower bound It is interesting to observe that as the time complexity is computed with respect to the original network, an Q(D ) lower bound is not possible, where D is the diameter of the network. This is also shown in Fig. 1 where D = 2 but the time complexity is (9(IVl). The survey for Q can be divided into two stages: broadcast stage and reply stage. As each node is assumed to know only its neighboring nodes, the only reli- able way to broadcast the question Q to every node is message flooding. The flood- ing technique is very efficient and commonly used in sparse networks. Initially, v sends a copy of Q to all its neighbors. Each node, after receiving the first instance of message Q , keeps a copy and sends it to all neighboring nodes except the node/s from which Q arrived. It is easy to verify that the flooding described above is a reliable algorithm to broadcast a message for a connected network. Thus a simple algorithm, the two-way flooding, can be constructed to obtain a time complexity lower bound for GSP (In [5] Baratz etc. use similar but slightly advanced idea to design algorithms for searching 62 distributed resources.) This algorithm requires nodes to send their answers back to v by flooding. Theorem 4.2.2 The two-way flooding algorithm is time-optimal for the GSP. Furth- ermore, it achieves minimum delay for the survey process. Proof. As the flooding ensures that in the broadcast phase each node has its neigh- bors informed of the broadcast message Q , if the network remains connected in spite of link failures, for any node u * v, there must exist a simple path from v to « along which Q can be delivered to u. Since the network has \V\ nodes, the time for M to receive Q is at most \V\-\ units. Similarly, we can show that it takes at most \V\-\ units for v to receive the answer of Q from u if the reply is done by flooding. Therefore, the two-way flooding is time-optimal. Now, we proceed to show that the algorithm achieves minimum delay for GSP. For each node f e V^, let P = <v , Vj, vj, ... , v,_i, v,> be the path through which Q arrives at t using the flooding, where v,- = t and for all \<j<i, Vj is the parent of Vy+j. It can be shown, by contradiction, that the delay for t to receive Q is minim- ized if Q travels along P. Suppose that P' = <v, v\, ... , Vy_i, v'j> is a shorter path than P to deliver Q from v to r, where v) =t. Let it be the smallest index for a link (v V ^^.j) in P' such that v\ is not the parent node of v'^+i, and that w is the parent node of v'^^.i. Since the parent node is chosen as the first node from which Q arrives, there must be a shorter path from v to v'^+j via w, and then from V to t. which implies that P' is not minimum, a contradiction. Hence, P is the shortest path. Since a node sends its answer of Q also by flooding, we can see that the two-way flooding algorithm can achieve minimum delay. □. 63 Clearly, the cost of implementing the efficient two-way flooding is very high in terms of its message complexity, considering that the message complexity of the algorithm can be as high as O(IVP) in edge-dense networks. In many cir- cumstances, the existence of a large amount of messages in the network may result in message congestions, which will degrade the overall system performance. In this case, an algorithm with low message complexity would be more preferable. Nevertheless, the unpredictability of link failures significantly complicate the design of algorithms for such a purpose. Definition 4.1 Let u be a node in G = (V, E) that received Q. The neighboring nodes of w can be classified into three categories: parent, siblings, and children. The parent of u is the node from which Q arrived first (ties are broken arbitrarily). Conversely, if the parent of u is w , then m is a child node of w . The neighboring nodes to which u is their parent form the children of u. The other neighboring nodes of u are its siblings. Definition 4.2 The p-neighborhood of a node u is defined to be the set of nodes that can be reached from u by no more than p-hops, where each hop represents a crossing of a communication link. For each node u that has received Q , we denote the reply to Q by R(u) which contains at least the answer of m to Q. There exists at least one node in the network that is an intended receiver of R (u ). Conversely, if w is an intended receiver of u , then w is an intended sender of w. Definition 4.3. A distributed algorithm for GSP is called a p-neighborhood proto- col, if it satisfies the following axioms: For each node u, (1) its intended receivers are within its p-neighborhood; (2) if m is a receiver, it can either send its reply indi- vidually or send the concatenated message of its reply and the replies of its senders; (3) if all the replies from its intended senders have been received, it must reply in finite time; (4) any message arrives after u has replied to its intended receivers that contains new reply information should be forwarded by u in finite time. 64 Axiom (4) implies that if a node is disconnected from its intended receiver, it will be forced to forward its reply to one of its non-intended neighbors. This neighbor, receiving an unintended message, must forward it independently as its lumping stage has ended. It can be observed that p-neighborhood protocols can be most realistically implemented when p = I. The reason is simple: when p >2, a. message intended for a non-neighboring node should be routed through intermediate nodes. Nevertheless, as each node has only local topological information, to reliably deliver messages to the intended receivers, local-flooding (i.e., flooding messages within p- neighborhood), may be necessary. This will inevitably result in high message com- plexity. One alternative is for each node to construct routing paths through local information exchanges within its p-neighborhood, which again may also require local-flooding. Furthermore, even if these routing paths can be established, link failures will certainly complicate such routings. Consequently, we are most interested in the protocols of 1 -neighborhood lumping. In the next theorem, we show a mes- sage complexity lower bound of p-neighborhood protocols for GSP in networks with unpredictable link failures. Theorem 4.2.3 Given k link failures during the survey process where k is arbi- trarily large but the network remains connected, Q(min(lK l+l£ I) is a lower bound on message complexity of 1 -neighborhood protocols for GSP. Proof . As flooding is considered the only reliable broadcasting protocol during the broadcast stage, the message complexity for broadcasting Q to all the nodes in the network is bounded by OOE I). From the axioms of the 1 -neighborhood protocols, each node has at least one intended receiver in its adjacent nodes to which a reply of Q can be sent. It is easy to show that if a node chooses a child as its intended receiver, it is possible that its reply will never be forwarded to the source (e.g., con- sider the case that the edge connecting the node and its child is a bridge of the 65 network). Therefore, it is clear that for each node, its parent and siblings are the only possible candidates for intended receivers as they are the nodes from which Q arrived. That is, despite no global topological information is available to any one, the flooding of Q implies that there exists at least one path from the parent or a sibling to the source. Thus, in the reply phase, these properties give us the following possi- ble cases to consider. Case 1: (no lumping) a node sends out its answer without waiting for its senders' replies. In Fig. 4.2, we can see that it takes ()\4-l) messages to broadcast Q and in the worst case there are 1 + 2 + ... + (M-l) messages needed to reply the answers. Hence, the number of messages exchanged over links is (IVl-1) + 1 + 2 + ... + (M-l) = (IVl-l)(IVl+2)/2 = O(lKl^). This situation occurs even without a single link failure. Case 2: (lumping) we consider the case where each node may reply after lumping together the replies from its intended senders. Case 2.1: a node chooses its parent as its intended receiver. Consider the following case shown in Fig. 4.3, V|y| is the common parent of Wj, M2, ... , and Uj^. Suppose that all the parent Unks of U2, ... , M^t f^il before the nodes are ready to reply. Clearly, v \y \ will start replying to its parent when it receives the reply from u 1 as u 1 is its remaining connected child now. The replies from a 2' — > have to be delivered via m j as it is the only node leading to v . According to the Axioms, in the worst case where U2 replies before the answer from M3 arrives, Uj replies before it receives M4's answer, and so on, it causes 0(k\V\) messages exchanged. Case 2.2: a node chooses its siblings as its intended receivers. Consider the example shown in Fig. 4.4. As w,- and M;t_,+i are siblings for all 1 < / < ^, these siblings will exchange their replies. However, all the paths leading to v must go through Vj, vj, ... , v\y\. Similar to Case 2.1, in the worst case the number of message exchanges can be (9 (/:1k I). 66 Case 2.3: a node chooses its parent and siblings as its intended receivers. Consider the case shown in Fig. 4.5. and u4i will collect the reply from u3i for all 1 < / ^ k before they reply; will collect all the repUes from Uji, U22, - . "2t before it replies; and M4 will lump together the replies from u^i for all 1 < / < k. Suppose that the links between "3i disconnected before the replies from m, 's can be delivered, they must be routed via M4. Since M2I' - ' "2;fe 1- neighborhood, The worst case is that their replies are routed individually through u^, which results in 0{k\V\) messages being delivered. □ As we have point out the difficulty of implementing an algorithm adopting mes- sage lumping with neighborhood > 2, nevertheless, we can extend the previous theorem to show that widening neighborhood in general does not reduce message complexity. Theorem 4.2.4 Given that k links fail during the survey process where k is arbi- trarily large but the network remains connected, Q(min(IV l-i-l£ I) is a lower bound on message complexity of p-neighborhood protocols for GSP with p being an arbitrarily large constant. Proof . The proof can be extended directly from that of Theorem 4.2.3 by the tech- nique of "node padding". Detailed proof is omitted here. The basic idea is to insert sufficient number of nodes to push some critical nodes beyond some neighborhoods. More specifically, consider the example in Fig. 4.5. By inserting p colums of nodes between and M21 where 1 < / < k, the lumpings of U4 does not include replies from U21, ... , "2* since they are beyond M4 p-neighborhood. This can force these messages being replied individually through M4 in case that the links between U2i, for all 1 < /■ < k, and M2 are disconnected. The arguments of other cases can be simi- larly constructed. □ 68 4.4 Optimizing Message Complexity The observation that enlarging a node's lumping neighborhood does not yield a better lower-bound result leads us to focus on exploring better routing strategies. Without a global picture of the network, each node is forced to make routing deci- sion on local basis. The routing decisions, considered in a broad sense, include not only message forwarding but also selecting a neighborhood, choosing intended receivers, etc. Our main effort, in this section, is to present an algorithm which is reliable and optimal in terms of message complexity for arbitrarily large number of link failures. The underlined strategy of this algorithm is based on the concept of alternative-path search. The algorithm belongs to 1 -neighborhood protocol. Within its 1- neighborhood, each node has a unique intended receiver for its reply message that is its parent node. In the operational sense, the algorithm can be roughly divided into two phases: broadcast phase and reply phase. In the temporal sense, there is no clear distinction; some nodes may be still in their broadcast phases while the others may be already in their reply phases. In the broadcast phase, each node performs the flooding procedure [33][35]. The message Q broadcast from v initially contains only the question. As this message is disseminated across the network, each node receiving an incoming Q will do the fol- lowing things: (1) keep a copy, (2) send it to all the neighboring nodes except those from which Q arrived. According to Definition 4.1, each node establish a relation- ship with each of its neighbors. Considering only the parent-child relationship among nodes in the network, a spanning tree is formed with v, the source of Q, being the root of the tree. The path connecting a node u to v that consists of only tree edges is the primary path of u . The primary path of m is a shortest path along which Q arrives. It may not be the unique shortest path from v to m. Under the assumption of unpredictable propagation delay of a link, it is not unlikely that the 69 primary path of u may change in the next survey. Nevertheless, a primary path does provide a useful hint for replying answers (i.e., with limited topological information, it is reasonable to choose the primary path to send answers back to v in the reply phase). The reply phase is initiated by nodes which satisfy one of the following condi- tions: (1) it has no neighbors to which Q needs to be sent or (2) all the links to its children are disconnected. In both cases the nodes are called leaves. A leaf node will send its answer to its parent immediately. If a node is not a leaf node, it replies with a message that is the concatenation of its own reply to Q and the replies from its children. The message is then forwarded to its parent. Link failures during broadcasting are handled by discarding the link. Failures that occur during the reply phase have different implications. Let / be a communica- tion link that fails. There are two cases to be considered. If / is a sibling link, no action is neeed to be taken. If / is a tree link, proper algorithms are required to ensure the reliability of the protocol, which will be discussed in the next two sec- tions. Definition 4.4 Let V;, and V/ be the nodes connected by tree link / and v^, is the parent of V/. The failure of / makes V/ become the root of a new tree, denoted by 7/. This new tree is said to be a low tree. In contrast, the "old tree" (i.e., the tree that contains V;, ) is called a high tree, denoted by Tf, . The birth of a new tree has significant consequences for routing reply messages back to the source v . For instance, an altemate-path search is needed if V/ has not yet replied with its answer message, because the primary path connecting V; and v no long exists. Definition 4.5 Two nodes are said to be internal siblings to each other if they are in the same tree and connected by a non-tree link; They are external siblings if they are not in the same tree but are directly connected. 70 4.4.1 The Message Structure Before we get into the detail of our protocol, it is necessary to discuss the way each message is constructed. The structure of a broadcast message is quite simple; it contains the question and the source. Thus it takes <9(loglVl) bits. A reply message requires more complex design. The reason is that it will contain not only answers to the question but also some routing information for maintaining good routing proper- ties. The reply message structure contains mainly four parts: i) answers to the survey problem; ii) reply-record which consists of the nodes that provide these answers; iii) descendent-record which, created by the root of a tree, is the set of node within the tree; iv) visited- node-record which is the set of nodes that have been visited; v) visited-tree-record consisting of trees, presented as a set of nodes, that have been par- tially visited, i.e., some of the nodes in the trees may yet to be explored. The pur- pose of reply-record is to keep track of the nodes that completed their replies. A descendent-record is used to determine whether a sibUng is internal or external. The purposes of visited-node-record and visited-tree-record will be apparent in the coming sections. Except for the answers, which are of constant length each , all the three records are represented as bit-vectors. Each node is associated with a position in the records. These positions are arranged in a canonical order so that each node knows which position belongs to which node. A bit 1 is set for a node in its corresponding position at the reply-record if its answer is contained in the message; otherwise it remains 0. In order to enforce a one-to-one correspondence between answers and reply-record, the answers are arranged in the same order as the reply-record. For example, the n th answer is associated with the n th bit that has been set to 1 . For descendent-record bit 1 represents "is-in". Similarly, for the rest of two records, bit 1 denotes "visited". The size of each reply message structured in this fashion is (9 (IV I) bits. 71 4.4.2 Alternative Path Search If none of the failures occurs at tree links, all the replies are ensured to be for- warded to V via tree links. However, this is not always the case. A tree link failure imposes great routing difficulty on parent as well as child. In the next two sections, we show how a failure is properly handled if the failed link is a tree edge. Two different processes will be invoked when a new tree 7/ is splitting from an old tree T^. Each process is initiated by one of the two nodes connected via the tree link whose failure results in a tree splitting. Let us call the algorithm executed by the child a low-tree algorithm and the one undertaken by the parent the high-tree algo- rithm. 4.4.2.1 Low-tree algorithm The low-tree algorithm is the core of the alternative path search. The main task of this algorithm is to route messages from nodes in the low tree to v . Since the link between v, and v^, failed, messages from 7/ to v must be routed via non-tree links. One way to achieve this is by flooding at , which is similar to the algorithm proposed by Baratz, Gopal and Segall [5]. The advantage is its simplicity, but its message complexity is not optimal. Instead of flooding at v^, we provide an altemative-path-search algorithm based on depth-first-search and loop-free routing. The algorithm is initiated at V/ . In this algorithm, V/ will try to find a sibling first. If it succeeds, the reply from V/ is forwarded to the sibling. Otherwise, the message is sent back to one of its chil- dren. A message that is sent from a parent to a child is called a backtracked mes- sage. A child node receiving a backtracked message from its parent attempts to send this message out of the low tree by identifying an external sibling. This is crucial to the low-tree algorithm in terms of optimizing the message complexity since passing a 72 message to an internal sibling does not further the goal of forwarding the backtracked message to the source as the message will remain in the same tree. There is a static way of discerning external siblings from internal siblings. From Definition 4.1, we can see that two siblings are internal if their least-common- ancestor is in the same tree as they are, otherwise they are external siblings. Con- sider an algorithm as follows: During the broadcast phase, each node broadcasts Q that is appended with its node id. By exchanging its broadcast message with a neighboring node, a node can identify whether the neighbor is an external sibling by checking the ancestor-record of the neighbor. The advantage is that it can be done off-line. Nevertheless, implementing such procedure may generate, in the worst case, broadcast messages that are as long as O (IV lloglK I) bits due to the concatenation of node IDs. In our algorithm, external siblings are identified dynamically. The method is to allow the root of a tree to pass the necessary information in the form of a descendent record which contains all the node IDs. of the tree. A node can extract the descendent-record from the backtracked message it received and then the decision is straightforward since a sibling not in the record must be external. The low-tree algo- rithm consists of three major components which interact with each other. A. The depth-first-search. In this procedure, the backtracked message is routed according to the following precedence: external siblings, child nodes, and the parent. That is, a node receiving a backtracked message will always try to find an external sibling. If it succeeds, the message is forwarded to the sibling. If the message is eventually returned by the sibling (i.e., the sibling fails to find an alternative path to v) or the link connecting it to the sibling fails (thus, it is not certain if the routing can succeeds via this sibling), another external sibling is tried. If it has no external sibling or all the external siblings have been tried and no alternative path is found, the backtracked message is sent to one of its children if it has any; if it does not have 73 any child node, the message is sent back to its parent. If that child fails its search and returns the message, it tries another child until all the children have been explored. In this case, the message is sent back to its parent. By doing this, either the message will eventually be sent out of this low tree or all the external siblings needed to be explored have been explored. If it is the latter case, the backtracked message is returned to the external sibling from which it was delivered to this low tree. It is worth mentioning that in order to avoid the repetition of fruitless sibling exploration, each external sibling that has returned a backtracked message is deac- tivated and will not being explored for the routing of subsequent messages unless it is reactivated. B. Loop-free routing and new-tree broadcasting. It is clear that during the search of an alternative path, an external sibling should not be visited by the same backtracked message more than once. That is, the path that a backtracked message traverses should be loop-free. To achieve this, each reply message is associated with a visited- node-record. Each time a reply message is received by a node, the id. of the node is stamped to the visited-node-record. Thus, an external sibling visited before by the message can be excluded from being explored. In addition, we see some practical reasons for the algorithm to keep loop-free the routing of a backtracked message in the "tree level" as well as in the node level. In other words, a message is not to be sent to those trees which have been visited by the same message earlier even though the trees contain nodes that have not been visited by the message. There are two major reasons for loop-free traversal in the tree level: First, for reducing the number of message exchanges as a non-loop-free traversal will inevit- ably cause more nodes to be visited. Second, for good management of routing paths; a non-loop-free traversal is difficult to manage as it may cause problems when 74 messages start backtracking, especially when a traversal path is disconnected by link failures. For such a purpose, a visited-tree-record is also required for each reply mes- sage. When a root of a tree receives a message, it stamps the visited-tree-record with its tree information. Therefore, an external sibling in the visited-tree-record is avoided fmm being explored by the message. There is, however, some inherent complexity in maintaining a loop-free traversal in the alternate-path search on dynamic and faulty networks. The following example illustrates this. Example 1. Consider that tree receives a message from one of its external siblings in its neighboring tree, denoted by n^, and that consists of two parts, Tn and Ti2y connected by / which is the only link connecting T^^ and T^j- An alternative-path search proceeds in such a way that message visits trees T2, T^, ... ,Tj_i, and Tj, one after another as shown in Fig. 4.7. Now, suppose that Tj has no external siblings other than those in T■^ 2■ Since the traversal of the backtracked message should be loop-free at tree level, it will not be sent to 2- Therefore, backtracks to r^.j, Tj_2 and so on. In the meantime, link / fails, which results in the split of T12 from as a new tree. Consequently, when is eventually returned to Ti, T-^ i will return to rig. Hence, 7i 2 is not explored by the alterna- tive path search. If T12 happens to be the only tree via which can be sent to v, the search will fail to find the only path leading to v. This makes the protocol unreliable. Therefore, the situation must be avoided. 75 V Fig. 4.7 An example A reasonable way to prevent this is as follows: Once a new tree is bom (split from the old tree), this information is broadcast to all of its neighboring trees. This ensures that if these neighboring trees have any backtracked messages to forward, the messages can be sent to this new tree. If a backtracked message can be delivered to V via the newly-bom tree, a reactivation message is created. A reactivation message is used to inform those external siblings that the external siblings in the newly-bom tree can be re-explored for further routings. A reactivation message is sent along the path via which a message backtracked. When a node receives a reactivation message from a deactivated extemal sibling, it will reactivate this sibling. The algorithm to broadcast a tree-born message from the newly-split tree is designed by the same principle of the altemative path search of a backtracked mes- sage. That is, the tree-bom message initiated by the root traverses in the tree in a depth-first-search manner. When a child receives the message from its parent, it sends a copy of the message to all of its external siblings which have not yet received the message. To achieve this, each time a message is sent to an extemal sibling, the id. 76 of the sibling is kept in the visited-node-record of the message, which is similar to the alternative path search of a backtracked message. By doing so, the external sibling can be exempted from receiving the same tree-bom message more than once from nodes in the new tree. On the other hand, once a node receives the message of a new tree information, the message is forward to its root. After being informed of this information, the root will broadcast this message to all its descendents via tree links. C. Sibling reclassification. The third part of the low-tree algorithm is to maintain the correct relationship between siblings. When the low tree is disconnected from the high tree, the nodes in the high tree that are the siblings of the low tree must be "re- classified" as external siblings since they are no long in the same tree. The root of the low tree must broadcast a reclassification message to its descendents. This is done by sending such message, which contains the IDs. of the descendents, along the tree links. Each node receiving the message will redefine its sibling relationship accordingly. Those "newly-bom" external sibUngs of the low tree will subsequentiy be explored by the alternative path search. Note that there is no need to send a mes- sage across the links between the low tree and the high tree as the reclassification action will be taken by the high tree as well. 4.4.2.2 High-tree algorithm We now consider the algorithm for V;, in the case where loses one of its chil- dren because of link failure. Tasks need to be handled are included in the following procedures: A. Resumption of alternative-path search. If Vf, is currently conducting an altemative-path search on this disconnected branch, Vf^ will abort the search on the failed branch and resume the search on the other child branches. This may result in two altemative-path search processes concurrently exist, one in the low tree and 77 another in the high tree. The necessity of such action is clear since v^, is unable to know whether the disconnected search in the low tree will succeed or not. B. Sibling reclassification. Similar steps need to be taken by the high tree to readjust the status of its siblings. This is done by sending a message (i.e., a descendent-record which consists of nodes in the low tree) from v^, along tree edges to the root of the high tree. Once the root of the high tree receives such a message, it will broadcast a reclassification message, also containing nodes in the low tree, along the tree links. A node receiving a reclassification message will consider those siblings in the low tree external. C. Exploration of new siblings. As is not the root of Tf, but a leaf of the tree, it is possible that when the root of 7^ receives the message sent by V;, , which indicates a link failure, an altemative-path search (invoked by the root of Tf^ as part of its low-tree algorithm) may be already undergoing in Tf, . If this is the case, these new external siblings, previously excluded because they were internal siblings, should be explored for routing backtracked messages in T/,. It is not unlikely that pathes from the high tree to v may go through these new external siblings of Tf, . Notes: i) During the reply phase and in both of the low-tree and the high-tree algorithms, a message from a child or a sibling may arrive that contains no additional reply information. This message will be rejected, ii) Each backtracked message has its own alternative-path search. Thus, it is not unlikely that backtracked messages in two adjacent trees may explore each other. Consequently, there may exist more than one backtracked message in a tree. In this case, each backtracked message, which can collect as many reply messages along its altemative-path search by merging the bit- vectors of the reply-records encountered into its own, will proceed independently, iii) Also note that the low-tree algorithm and high-tree algorithm are the necessary steps to be performed from a low tree and high tree point of views. These algo- rithms are separately discussed mainly for the convenience of illustrating the whole 78 protocol. In fact, it is possible that in a tree, both algorithms may be invoked con- currently by different nodes. In this case, these algorithms will proceed indepen- dently in the tree. 4.4.3 Validation of the Protocol Now, we shall show that the protocol is reliable. The basic idea of the proof is to show that no external siblings ever needed to be explored are not visited during the alternative-path search. That is to show that the alternative-path search of the low-tree algorithm is exhaustive. Lemma 4.3.1 If each tree link is not disconnected before the child node finishes replying, then every reply message can be sent to v through the primary path. Proof . If a tree link is not disconnected before the child node finishes replying its vector, then each node can reply to its parent via the tree link. It immediately follows that each reply message will be forwarded to v through a path consisting of only tree links, which is the primary path. □ Lemma 4.3.2 The alternative-path search is exhaustive. Proof . Let 7,- be a tree that has a backtracked message to send out of this tree. Observe that there are three groups of external siblings of T, that will not be explored by when an alternative-path search is conducted in 7,- . The first group of external siblings are those contained in the visited-node- record of Af^, since they are akeady been explored. The second group of external siblings are those contained in the visited-tree- record of B but not in the visited-node-record of B . As discussed in the low-tree algorithm, this is to ensure that a loop-free traversal for a backtracked message can be maintained. However, these external siblings will be explored in either of the two i 79 conditions: 1) Af^, is eventually returned to the trees that contain such external siblings since these siblings have not yet been explored by these trees (they are not in the visited-node-record of Mf,). 2) The trees that contain these external siblings separate from their original trees due to link failures. This will force the newly-born tree to broadcast a tree-bom message to 7,-, which will cause 7,- to send to these external siblings. Therefore, eventually these external siblings will be explored by M{, if they ever need to be. The last group of external sibUngs are those "originally internal" siblings of 7,-. These siblings are contained in a tree splitting from 7,- and thus become external siblings of 7,-. And these siblings will be explored by once a reclassification message arrives at the root of 7, as illustrated in the high-tree algorithm. From the above cases discussed, it is clear that no external siblings are ignored by the altemative-path search if they are in the paths leading to the source v . □ Theorem 4.3.1 The protocol guarantees that every reply message will be delivered to V. Proof . Let message be the answer of node u to the question Q . If no link in the primary path from m to v fails, from Lemma 4.3.1, we know that can be sent to V. Suppose that the primary path from a to v is disconnected during the reply phase. Let v ^ be the first node encountered by in the primary path such that v i is disconnected from its parent. According to the low-tree algorithm, v j is the root of a tree 7i and an altemative-path search will be initiated when arrives at Vj. From Lemma 4.3.2, we know that will exhaustively explore all extemal siblings in the neighboring trees that are needed to be explored. Since 7i is not isolated from the rest of the network, there must exist a tree, say T2 rooted at V2 such that can be 80 delivered to T2. If V2 is not v, another alternative-path search process will be ini- tiated upon the arrival of at V2. Consequently, the reply of the message may incur a number of alternative-path search processes in trees. We show, by contradiction, that the alternative-path search of will eventu- ally succeed. Suppose that there exists a path from Vj to v and the alternative-path search of terminates without finding it. Since the network remains connected despite of link failures, there must exist two adjacent trees, 7,- and Tj , such that 7,- has a path to v but unexplored by and that Tj is explored by . The fact that r,- is not explored by from Tj gives us two possible cases to discuss: Case 1: 7,- contains nodes which are in the visited-tree-record but not in the visited- node-record of M^. This implies that 7, must be a newly-bom tree which is split from the tree that have been explored (partially) by M^, otherwise it will be explored. But as shown in the low-tree algorithm, a newly-bom tree will broadcast a message to its neighboring trees and thus invite a backtracked message. This means that Ti will invite a from Tj , a contradiction. Case 2: 7,- contains nodes neither in the visited-node-record nor in the visited-tree- record of Ma , which implies that 7,- is a new tree split from Tj . In response to that, both the low-tree algorithm and the high-tree algorithm will do sibling- reclassification, which ensures that after finite time, siblings between 7,- and Tj will be classified as external. Therefore, must be sent to 7,- ; again, this is a contradic- tion. Therefore, we conclude that if the network is connected, every reply message will be routed to v . □ From Theorem 4.3.1, the following theorem immediately follows. 81 Corollary 4.3.1 The protocol is reliable if no link failure causes partitioning of the network. □ 4.4.4 Computational complexity In this section, we discuss the computational complexity of the algorithm. The algorithm is designed to achieve the message complexity lower bound for GSP. Theorem 4.4.1 The message complexity of the algorithm for GSP is (IVP) with arbitrary link failures as long as the network remains connected. Proof . Without loss of generality, assume that all of the link failures occur in the reply phase. As the survey process proceeds, each tree link failure will increment the number of trees by one. Let Tj, T2, ... , T^^ be all the trees in the network at the end of the process with < IV I. Consider a tree 7, where 1< / < n,. Let A/^ be a reply message that arrives at Ti from a neighboring tree Tj (a backtracked message). Let us count the maximum number of message exchanges in T,- for forwarding to v . The message exchanges occur either at tree links or external sibling links. First, we consider how many messages are possibly exchanged over tree links due to the routing of to v . The worst case occurs when the low-tree algorithm can not deliver to v via any of its external siblings so that is eventually returned to Tj. Since the alternative-path search works in a depth-first-search manner, tiie number of messages exchanged in 7,- caused by the arrival of is bounded by 0(lr,l). There are at most such backtracked messages like that may reach T,-, excluding the source v (recall that a duplicate message from a sibling or a child will be rejected). Hence, during the whole process, the total number of mes- sages exchanged over tree links in T,- is at most O (lrjl(IV l-lr,l-l)). 82 Now, we compute the upper bound on the number of messages exchanged over the external sibling links of 7,-. Suppose that there are e^igj external sibling links for r,-. For each message M^, each time when is returned from an external sibling or when the chosen external sibling link fails, the algorithm has to find another external sibling to retry. This increments the number of messages exchanged over sibling links by one. The alternative-path search ensures that if a reply message is returned from a neighboring tree, the tree will not be explored again unless the external siblings in the tree are reactivated. Let r^,^ _, be the total number of external siblings of 7, that have been reactived. This makes the total number of "retries" at Ti less than e^ig j+r^ig i. Again, no more than M-tT,!-! such backtracked messages like Ma may arrive at T,-. Therefore, during the survey process, the total number of messages exchanged over the sibling links connecting 7,- and its neighboring trees are bounded by O (e^,-^ i +r^ig -tM-lT, 1-1 ). Summing up the two terms for all T^, ... , gives the following total number of messages exchanged during the reply phase, £ iOOTXW 1-17,1-1)) + 0(e^,.^,.+r^,.^,-fVH7,l-l)) = 0(l7il(IVl-l7il-l)) + 0(e^ig^i-iW\-\Ti\-l) + 0(1721(1^1-17^1)) + Oie^ig,2+^V\-\T:^l) + ... + 6>(I7^UVI-I7„,I-1)) +Oie^ig,„+\V\-\T,\-l) + 0(r^,-,.i + ... + r^^..^,) = {0(l7il(IVl-l7il-l)) + 0(l72l(IVl-l72l-l)) + ... +0(l7„,l(IVl-l7„,l-l))} + {0(e^ig,i+\V\-\Til-l) + Oie^ig^2+^V\-\T:^l) + ... +0(e^,^,„ +IV l-l7„,H)} + Oir^igj + ... + r^ig^^) < 0(IVl(l7iW72l+ ... + \r^\)) +0(e^,g^,+e^,g^2 + - + e^^g^) + 0(nM 83 <0(IV|2) + + (9(r^.-g.i + ... + r^ig^) = Oi\V\'^) + 0(r^ig^i + ...+r^ig^) Since each time a new tree is generated, at most (9(IVl) external siblings in the net- work can be reactived, 0(r^/^,i + ... + r^ig,n,) < < (9(lyP). So, 0(IVlVO(r^i,.i + ... + r^ig^,) = Finally, let us count the number of message exchanges due to maintenance pur- poses, i.e., the sibling reclassification and tree-bom message broadcast. It is clear that sibling reclassification involves only tree edges traversals; thus the total number of message exchanges due to this factor is bounded by O(IVP). Similarly, tree-bom message broadcast is done in depth-first-search fashion and involves tree-traversals and message exchanges over extemal sibling links that is bounded by the number of neighboring trees. Therefore, it incurs no more than (9(lyl^) message exchanges. Hence, including the I) messages exchanged during the broadcast phase and the OClVl'^) reactivation messages exchanged as a result of the generation of new trees, we conclude that the algorithm has an upper bound of 0{\V\^) as its worst-case mes- sage complexity. □ Corollary 4.4.1 The algorithm for GSP has optimal message complexity. □ An interesting open question is whether there exists a protocol for this global survey problem which can achieve the lower bounds on both time and message com- plexities. CHAPTER V CONCLUSION In this closing chapter, we summarize what have been achieved during this research and suggest some open problems for future research. 5.1 Major Results In Chapter 2, We present a hierarchical, partitioning approach to the network reconfiguration problem of large-scale satellite networks. By partitioning a satellite network space into regions of approximately equal sizes, the reconfiguration problem of the network can be reduced to a number of regional link assignment problems which, due to their relative simplicity, can be solved recursively by connecting satel- lites in a layer-by-layer manner. The algorithm proves to be superior to other methods in terms of time complexity. It is also shown that, in the large-scale networks where each region contains at least two nodes, the node-connectivity of the network is maximized when each node is restricted to have four transceivers. With this property, packet routing is more flexible, and the possibility of packet congestion can be effectively reduced. In addi- tion, we also provide some fault-tolerant mechanisms which enable us to dynamically handle the unexpected topological changes such as link failures or node failures. In short, our effort on studying satellite network reconfiguration has resulted in a significant method. Unlike the NpNs model, the method provides a more general net- work model which does not impose any restriction on the orbits of the satellites. Consequently, it is free of the vulnerability that the NpNs model suffers. In Chapter 3, we have discussed the issues of message broadcast in networks which are subject to frequent Unk reconfigurations but are eventually connected. The 84 85 protocol presented has been shown to achieve reliability, efficiency, and finite mes- sage buffering (previously only infinite-buffered algorithms are available). We believe that the protocol has significant implications for broadcast protocol designs in communication networks with such characteristics, in particular, packet radio net- works and point-to-point satellite networks. With minor modifications, the protocol can be extended to serve as an information-collecting algorithm in eventually-connected networks. Information col- lecting is necessary in networks where a node may need specific information from every node in the network. For example, the network may need to select a master file server or elect a leader so that some administrative things can be centralized. Hence, the reply phase of the algorithm can be modified to bring back not only the reply vector but also the needed information from each node. The algorithm also can be applied to other issues such as network votings and synchronizations. Based on similar but relatively restrictive model, we discussed another research topic in Chapter 4: network survey (or GSP, Global Survey Problem). The problem assumes that the network suffers link failures which may not be recovered during the survey process and that the network remains connected despite of these failures. We show that Q(IVl) is a lower bound on time complexity and a message-complexity lower bound is Q(min(IV l,^)IV :+\E I) for lumping algorithms. Previously some works have been done on fault-tolerant network resource allo- cation in which a modified flooding algorithm is provided. The algorithm can be con- verted to an algorithm for this problem in a straightforward manner. However, the algorithm has 0(IVll£l) message complexity, which is not optimal, in particular when C>(l£'l) = (9(lyP). In this research we design a loop-free algorithm, which consists of the lower-tree algorithm and the high-tree algorithm. The algorithm is bounded by O(IVP) regardless of the number of link failures, thus an optimal algo- rithm in terms of message complexity. 86 5.2 Future Research Directions Future research on network reconfiguration includes extending the hierarchical partioning approach to handle more general cases (e.g. cases in which the number of links in a node is not fixed). Another interesting direction for future work is to design the hierarchical routing algorithms which can take advantage of the hierarchi- cal topologies. Regarding distributed algorithm designs in networics with changing topologies, we believe that the following two problem are practically significant and worth pur- suing: fault-tolerant leader election and broadcasting tree maintenance. Leader elec- tion problems have been extensively studied for the past ten years. However, there is very little literature addressing this issue under changing-topology environments. The problem is expected to be rather complex as the complexity of a leader election will increase substantially when links or nodes fail. The broadcasting tree maintenance is about designing an algorithm that can dynamically restructure the current spanning tree used for broadcasting purpose. It will be very beneficial to nodes in the network if a message can be broadcast along the edges of a spanning tree. Good algorithms for such a problem that are able to handle both link failures and node failures are yet to be discovered. REFERENCES [I] J. Adams and M. Fishetti. "Star wars SDI: The great experiment," IEEE Spec- trum , 22(9), pp. 34-64, September 1985. [2] A. Aho, J. Hopcroft and J. Ullman. The Design and Analysis of Computer Algorithms. Addison- Wesley, Reading, Massachusetts., 1974. [3] B. Awerbuch and S. Even. "Reliable broadcast protocols in unreliable net- works," Networks, Vol. 16, pp. 381-396, 1986. [4] B. Awerbuch. "Optimal distributed algorithms for minimum weight spanning tree, counting, leader election and related problem," Proc. of the 19th ACM Symp. on Theory of Computing, pp. 230-240, May 1987. [5] A. Baratz, I. Gopal and A. Segall. "Fault tolerant queries in computer net- works," The 2nd International Workshop on Distributed Algorithms. IEEE, pp. 31-40., July 1987. [6] R.R. Boorstyn and H. Frank. "Large scale network topological optimization," IEEE Trans, on Communications, COM-25(l), pp. 29-47, January 1977. [7] C. Cheng, I. Cimet and S. Kumar. "A protocol to maintain a minimum span- ning tree in a dynamic topology," ACM SIGCOMM 88, pp. 330-338, August, 1988. [8] E.G. Coffman and P.J. Denning. Operating system Theory , Prentice-Hall, New Jersey, 1973. [9] E.G. Coffman. Computer and Job Shop Scheduling, John Wiley, New York, 1976. [10] Y. Dalai. "Broadcast protocols in packet switched computer networks," Ph.D. dissertation, Stanford Univ., Stanford, California, Digital System Lab. Tech. Rep. 128, April 1977. [II] Y. Dalai and R. Metcalfe. "Reverse path forwarding of broadcast packets," Commun. Ass. Comput. Mach., Vol. 21, pp. 1040-1048, December 1978. [12] S. Even. "An algorithm for determining whether the connectivity of a graph is at least k," SIAM Journal of Computing, pp. 393-96, September 1975. [13] S. Even. Graph Algorithms, Computer Science Press, Reading, New York, 1979. [14] E. Gafni and D. Bertselcas. "Distributed algorithms for generating loop-free routes in networks with frequentiy changing topologies," IEEE Trans, on 87 88 Communications, COM-29(l), pp. 11-18, January 1981. [15] R. Gallager. "An optimal routing algorithm using distributed computation," IEEE Trans, on Communications, COM-25, pp. 73-85, 1977. [16] R. Gallager, P. Humblet, and P. Spira. "A distributed algorithm for minimum- weight spanning trees," ACM Trans, on Programming Languages and Sys- tems, Vol. 5, No. 1, January, 1983, pp. 66-77. [17] J. Garcia-Luna. "A fail-safe routing algorithm for multihop packet-radio net- works," Proc. of IEEE Infocom 86, pp. 434-443, April 1986. [18] J. Garcia-Luna-Aceves and R. Rom. "On the dynamic control of network con- nectivity," Proc. of IEEE Infocom 87, pp. 207-217. [19] M. Gerla and L. Kleinrock. "On the topological design of distributed computer networks," IEEE Trans, on Communications, COM-25, January 1977. [20] L. Kleinrock. Queueing Systems, Volume 2: Computer Applications, Wiley- Interscience, New York, 1976. [21] D. Kleitman. "Methods for investigating the connectivity of large graphs," IEEE Trans, on Circuit Theory, CT-16, pp. 232-3, May 1969. [22] K. Luo, Y. Chow and R. Newman-Wolfe. "An efficient algorithm for reconfiguration of large-scale point-to-point satellite computer networks with maximum connectivity," The 9th IEEE International Phoenix Conference on Computers and Communications, pp. 194-201, March 1990. [23] K. Luo, Y. Chow and R. Newman-Wolfe. "An efficient broadcast protocol in networks with changing topologies," The 2nd IEEE Workshop on Future Trends of Distributed Computing Systems, pp. 88-93, September 1990. [24] K. Luo and Y. Chow. "A message-optimal protocol for global surveys in faulty networks," IEEE Infocom '91, pp.525-532, April 1991. [25] C. McLochlin, C. Ward, Y. C. Chow, R. Newman-Wolfe, J. N. Wilson and T. B. Hughes. "Optimizing the delay and reliability of low altitude satellite net- work topologies," IEEE Milcom '87, October 19-22 1987. [26] J. McQuillan, I. Richer and E. Rosen. "The new routing algorithm for the APARNET," IEEE Trans, on Communications, COM-28, pp. 711-719, 1980. [27] P. Merlin and A. Segall. "A failsafe distributed routing protocol," IEEE Trans, on Communications, COM-27, pp. 1280-1287, 1979. [28] F. Moss and J. Jaffe. "A responsive distributed routing protocol," IEEE Trans, on Communications, COM-30, pp. 1758-1762, 1982. [29] D. Peleg. "Time-optimal leader election in general networks," Journal of Parallel and Distributed Computing 8, pp. 96-99, 1990. 89 [30] T. Roberttazzi and P. Sarachik. "Self-organizing communication networks," IEEE Communications Magazine, Vol. 24, no. 1, pp. 28-33, January 1986. [31] U. Schumacher. "An algorithm for construction of a ^-connected graph with minimum number of edges and quasiminimal diameter," Networks, Vol. 14, pp. 63-74, 1984. [32] A. Segall. "Distributed network protocols," IEEE Trans, on Inf. Theory, YT- 29, pp. 23-35, 1983. [33] A. Segall and B. Awerbuch. "A reliable broadcast protocol," IEEE Trans, on Communications, COM-31, pp. 895-901, 1983. [34] A. Tanenbaunx Computer Networks . Addison- Wesley, Reading, Massachusetts, 1980. [35] U. Vishkin. "A distributed orientation algorithm," IEEE Trans, on Inf. Theory, Vol. IT-29, pp. 624-629, 1983. [36] C. Ward. "Topologies and link assignment problems in low altitude satellite networks", Ph.D. dissertation. University of Florida, 1987. BIOGRAPHICAL SKETCH Kenneth C.-K. Luo was bom in Hsinchu, Taiwan. He received his B.S. degree in computer science and information engineering from National Taiwan University in 1981. He served as a communications officer in the Taiwan Air Force for two years. In 1983, he joined the Institute for Information Industry where he worlced as a software engineer. He received his M.S. degree in computer science from the Univer- sity of California in 1986. He came to the University of Florida in 1987, where he is working on his Ph.D. degree in computer science. His research interest includes dis- tributed computing, computer networks, fault-tolerant computing, combinatorial optimizations and system performance evaluations. He is married to Sze-Mei Ju and has a four-month-old daughter. 90