Skip to main content

Full text of "Fundamental distributed algorithms for management of computer networks with changing topologies"

See other formats








I would like to take this opponunity to express my sincere gratitude to my 
dissertation advisor. Dr. Yuan-Chieh Chow, for his guidance and encouragement of 
this research, which was supponed under the SDI/IST program, administered through 
U.S. Navy and U.S. Army. I also want to thank Dr. Richard Newman-Wolfe for 
numerous fruitful discussions which eventually led to some significant results of this 

Other dissertation committee members also contributed to the work: Dr. Rick 
Smith made helpful suggestions on my research; 1 learned so much from his excellent 
course on analysis of algorithms. Dr. Yann-Hang Lee provided insight on some 
topics of the research. I am very grateful to both of them. Thanks are also due to Dr. 
Mark Yang of the statistics department for graciously consenting to be on my com- 

Finally, I would like to thank my wife, Sze-Mei, who has patiently watched this 
endeavor and given me the greatest encouragement. 







LI Scope and Objectives 1 

L2 Mobile Network Reconfiguration 2 

1.3 Reliable Message Broadcast. 2 

1.4 Global Network Survey 3 


2.1 Background 5 

2.2 A Satellite Network Model 7 

2.2.1 Assumptions 7 

2.2.2 Partitioning Space into Regions 8 

2.2.3 Reconfiguring Networks 9 

2.2.4 Scheduling Network Reconfiguration 10 

2.3 The Hierarchical Approach for Reconfiguration 11 

2.3.1 Topologies for tiie Primitive Case 12 

2.3.2 Topologies for the Recursive Case 13 

2.3.3 Topologies for Inter-Region Links 15 

2.3.4 The Link Assignment Algorithm and Its Analysis 16 

2.3.5 Applying the Algorithm 18 

2.4 Proofs on Network Connectivity 19 

2.5 Failiu-es and Recoveries 27 

2.5.1 Link Failures 28 

2.5.2 Node Failures 28 


3.1 Background 29 

3.2 Network Model 31 

3.2.1 Preliminary Definitions 31 

3.2.2 Assumptions 32 

3.3 The Two-Phase Broadcast Protocol 33 

3.3. 1 Description of the Protocol 33 Broadcast phase 34 

iii Reply phase 35 

3.3.2 Message Buffer Management 45 

3.3.3 The Node Algorithm 46 

3.4 Performance 50 

3.4.1 Minimum Broadcast Delay 50 

3.4.2 Communication Cost 51 

3.4.3 Speedup in Reply Phase 54 


4.1 Background 57 

4.2 The Model and Motivation 58 

4.3 Lower Bounds 60 

4.4 Optimizing Message Complexity 68 

4.4.1 The Message Structure 70 

4.4.2 Alternative Path Search 71 Low-tree algorithm 71 High-tree algorithm 76 

4.4.3 Validation of the Protocol 78 

4.4.4 Computational Complexity 81 


5.1 Major Results 84 

5.2 Future Research Directions 86 




Abstract of Dissenation Presented to the Graduate School 
of the University of Florida in Panial Fulfillment of the 
Requirements for the Degree of Doctor of Philosophy 



December 1991 

Chairman: Dr. Yuan-Chieh Chow 

Major Depanment: Computer and Informauon Sciences 

Distributed computing systems hosted in a network with changing topologies 
often encounter management problems which require the application of various control 
algorithms. Typical networks with such property include packet-radio networks, 
point-to-point satellite communication networks and networks with a high degree of 
link failures. In such networks, due to the instability of network topology, maintaining 
a reliable distributed computing environment is a complicated issue. There are a 
number of management problems inherent in these network systems. This research 
focuses on issues that are considered to be more fundamental. 

In the first pan of this research, we consider a network reconfiguration problem in 
network topology designs. The problem can be briefly described as follows: given a 
number of nodes in the network, each node has a fixed number of communication 
pons and is mobile. How can an algorithm be designed that reconfigures the network 
periodically such that the network always has maximum connectivity? We present a 
partitioning method to achieve this goal. The algorithm performs better than the best 
of the previous results in terms of time complexity. 

The second issue addressed is a message broadcast problem. We consider a gen- 
eral model of a network with changing topologies, the "eventually connected" network 
model, i.e., networks that may be undergoing arbitrary topological changes but are not 
permanently disconnected. We focus on the issue of message broadcasts on such net- 
works. A message broadcast protocol is a distributed algorithm that allows a message 
to be sent to every node in the network via the communication links connecting the 
nodes. It has been shown that in this kind of network, no conventional broadcast pro- 
tocol is reliable. It has also been shown that a reliable broadcast protocol can be 
designed with unbounded message buffers. We improve this result by presenting an 
efficient reliable broadcast protocol using only bounded message buffers. 

In the third part of this research, we consider a closely related problem of the 
message broadcast, the so-called global survey problem. That is, given a dynamic net- 
work which is subject to unpredictable link failures but remains connected and in 
which each node is unaware of the network topology, can a node broadcast a message 
to every node and then get a reply from every one? We provide a message-complexity 
lower bound as well as a time-complexity lower bound. An optimal protocol is also 

In summary, this dissertation is devoted to a detailed and in-depth study of some 
fundamental issues involved in networks with changing topologies. Significant results 
have been produced that may have profound impacts on research areas such as net- 
work topology designs, network protocol designs, distributed databases and the theory 
of distributed computings. 



1.1 Scope and Objectives 

A distributed system hosted in a network with changing topologies often 
encounters management problems which require the application of various control 
algorithms. Typical networks with such properties include packet-radio networks and 
point-to-point satellite communication networks which are among the main research 
areas of the Strategic Defense Initiative (SDI). In these networks, due to the instabil- 
ity of network topology, it is much more complicated to maintain a network system 
so as to provide a reliable distributed computing environment. There are a number 
of management problems inherent in these network systems. This research focus on 
several of them which are considered to be fundamental: network reconfiguration, 
message broadcast, and network surveys. 

The solutions for these issues rely heavily on the design of effective and 
efficient distributed algorithms. Unlike those centralized algorithms that are executed 
as a single program by a single processor or by a number of processors but one at a 
time, a distributed algorithm is usually implemented as a node algorithm, i.e., each 
node in a network is installed with one copy of the algorithm. The algorithm works 
in an event-driven fashion; each node responds to a particular event by running a par- 
ticular procedure. If no event occurs, the algorithm is idle. 

An important feature of distributed algorithms is that the communication 
between nodes running the same algorithm is done by exchanging messages. There- 
fore, in addition to being correct and efficient, a distributed algorithm often needs to 
be designed in such a way that the total number of messages exchanged in a network 



is minimized, which further complicates the task of designing a good distributed 
algorithm. Some of the problems considered in this proposal require substantial 
modifications and extensions of the existing distributed algorithms which are usually 
intended for networks with fixed topologies; others are new problems for which no 
algorithm is available, let alone an efficient one. These problems are briefly discussed 
as follows. 

1.2 Mobile Network Reconfiguration 
A network with changing topologies needs to reconfigure its topology for the 
following two reasons: 1) Communication links may be broken due to the limited 
broadcasting power of each node, as shown in packet-radio networks, or due to the 
blocking of transmission by the Eanh, as in the case of satellite networks in which a 
communication link between a pair of nodes is established by targeting laser trans- 
ceivers toward each other. 2) The routing scheme of the network becomes inefficient. 
As topology keeps changing, so does the propagation delay of each link. Eventually, 
message routing becomes inefficient because previously computed routes are no 
longer optimal. 

One of the most widely adopted criteria for measuring the performance of a 
reconfiguration algorithm is the connectivity of the network obtained. In this research, 
we present a new algorithm for this problem. The algorithm is based on network par- 
titioning. Our results show that the algorithm yields optimal connectivity and is supe- 
rior to existing algorithms in terms of time complexity [22]. 

1.3 Reliable Message Broadcast 
For various reasons, a node in a network may need to broadcast messages to the 
other nodes. In networks with fixed topologies, messages are usually delivered 
through some broadcasting tree for the sake of reducing the number of messages 
exchanged. In unreliable networks, the task is fairly complicated. 


Some sophisticated broadcast techniques have been developed to achieve reli- 
able broadcast in networks with slowly changing topologies. This research focuses 
on developing a reliable and efficient broadcast protocol in networks where nodes are 
mobile with varying communication delay across each link, and where links may fail 
or , recover in an unpredictable fashion. Links may fail or recover for two reasons: 
(1) due to an unplanned occurrence, i.e., a link breakdown, or (2) as part of a net- 
work reconfiguration scheme to update topology. 

In this problem, a protocol is measured by three standard crlieria: reliability, 
message delay, and communication cost. We present a reliable protocol which is 
simple, efficient, and robust. This protocol is a distributed algorithm which responds 
to the arrival of a message by performing special functions. Our protocol can achieve 
a minimum message delay with low communication cost [23]. In addition, we 
address the problem of message buffer utilization, which has never been successfully 
addressed in unreliable networks. 

1.4 Global Network Survey 

The network survey problem considered in the research can be briefly defined as 
follows. There exists a distinguished node in the network and it wants to glean some 
global information from other nodes in the network. The task of the node is to send a 
short message to every node in the network and get a response from each node in 
finite time. We call this a "network survey" which is conducted in an unreliable 
environment; in a network with changing topologies. 

The network environment discussed in the topics is slighdy different from the 
problem of network broadcast, in that the network, despite link failures, remains con- 
nected throughout the whole survey process. The significance of the problem is its 
close relationship with other fundamental problems such as network votings and dis- 
tributed consensus. 


Significant results have been produced on this problem. We have shown some 
lower bounds on the time and message complexities of this problem. Furthermore, a 
message-optimal algorithm has been developed during this research [24]. 


2.1 Background 

The problem of network reconfiguration is considered a major issue in the back- 
bone design (or topological design) of networks with rapidly changing topologies. 
For satellite networks connected through the use of laser transceivers, this problem is 
very critical for overall network performance. However, it is a complicated task to 
find an efficient and optimal algorithm for reconfiguration due to the following inher- 
ited properties: 1) variable propagation delay due to the changing lengths of the links, 
2) visibility constraints due to the blocking of the Earth, i.e., not every pair of satel- 
lites can see each other, and 3) periodic link reconfiguration as the line of sight 
between two connected satellites may be blocked by the Earth over time. Since net- 
work reconfiguration in the laser-based point-to-point satellite networks involves link 
reassignment, we will use the two terms link assignment and reconfiguration inter- 
changeably throughout this thesis. 

A major issue in network reconfiguration is that of maximizing connectivity of a 
network. This problem has been studied extensively in point-to-point store-and- 
forward networks [6,19,21,34]. It can also be transformed into that of finding the 
maximum number of node-disjoint paths between every pair of nodes in a graph 
[13,21,31]. In particular, Schumacher [31] has given an 0{n}) algorithm (where n is 
the number of nodes in the graph) for constructing /: -regular graphs with maximum 
connectivity. The algorithm presented by Schumacher is the fastest algorithm as far 
as the time complexity is concerned. However, Schumatcher's approach is a pure - 
graph algorithm in which every pair of nodes is connectable. Thus the algorithm is 



inapplicable (at least directly) to satellite network reconfigurations. The visibility con- 
straint on the satellite computer networks makes it more difficult to reconfigure a net- 
work with respect to maximizing network connectivity because each node can see 
only a subset of the nodes. Relatively few works have been done on maximizing net- 
work connectivities for point-to-point satellite networks. McLochlin [25] and Ward 
[36] proposed methods to handle a special case of this link reconfiguration problem 
by imposing strict constraints on satellite orbits so that satellites circulate on several 
equally spaced planes and within each plane satellites are approximately equidistant 
from their neighbors. 

Another issue in point-to-point satellite networks is to determine how and when 
reconfiguration is invoked. Research on other mobile networks also deals with the 
problem of network reconfiguring such as those for packet-radio networks [17,18,30]. 
Most of these works adopt the failure-driven reconfiguring approach, i.e., as long as 
the node or link is properly functioning, no reconfiguration is needed [7]. One cri- 
terion often adopted in packet-radio network reconfigurations is "minimum distur- 
bance" [18] in the sense that reconfiguration normally requires updating or swapping 
links locally and thus the unaffected links should be preserved. In point-to-point 
satellite networks, this "minimum disturbance" is very difficult to obtain because 
most of the existing links will be blocked by the Earth sooner or later. Considering 
the periodic orbiting of satellites, the link reconfiguration needs to be carried out glo- 
bally for there is no way to restrict the reconfiguration to a local level. Instead, we 
choose to do over-all network reconfiguration on a periodic basis to ensure that a net- 
work is highly connected all the time except at the moments of reconfiguration. 


2.2 A Satellite Network Model 

2.2.1 Assumptions 

For systems as complex as satellite networks, it is necessary to make some basic 
assumptions so that solutions are feasible. Here we assume the following things 

(1) Global Knowledge - It is assumed that information about exact locations, direc- 
tions of motion and other related data about the satellites can al ,vays be derived 
through mathematical computations based on orbital mechanics or alternatively 
from broadcasting of ground-based tracking stations. 

(2) Fixed Number of Links at Each Satellite - Ward [36] has shown that it is most 
cost-effective if each satellite is installed with 4 transceivers. In our model we 
also assume that each satellite has exactly 4 transceivers. So, except in some 
unusual circumstances, the network can be regarded as a 4-regular graph, i.e., 
each node has exactly 4 edges. 

(3) Global Time Scheme - Each satellite will need a clock synchronized to all the 
clocks on the other satellites. Since communication links between satellites are 
established by targeting laser transceivers to each other, some kind of synchroni- 
zation is necessary. In this case we believe that using a global clock to drive the 
link assignment process is the simplest way to ensure reliable retargeting. 

(4) Periodic Reconfiguration - The reconfiguration is done by retargeting tran- 
sceivers among satellites to establish a new set of links on a periodic basis. 
Each satellite synchronizes its reconfiguring process to the others at regular 
periods. The reconfiguration is synchronized by having the satellites run their 
reconfiguring process simultaneously. During this short period of network 
reconfiguration, no packets can be exchanged between satellites. The advantage 
of periodic reconfiguration is that it allows satellites to perform more smoothly 


by taking reconfiguration into account. For example, by knowing the next 
reconfiguration is approaching, a node may decide to disconnect the virtual cir- 
cuits built on the present topology and/or set up some new ones based on the 
incoming topology. 

2.2.2 Partitioning Space into Regions 

A good solution to the link assignment problem should satisfy two goals - max- 
imize connectivity and be computationally efficient. The approach we propose 
satisfies these requirements in handling large networks. In this method, a network is 
partitioned into small subnetworks. Each subnetwork independendy generates a 
locally optimized topology in terms of connectivity. Then these subnetworks are con- 
nected together to form a complete network with maximum connectivity. 

Applying this method to satellite networks, the space in which satellites orbit is 
partitioned into a number of regions of approximately the same size. Each region 
contains a subnetwork. For the sake of easy demonstration, the ensuing discussions 
assume that all satellites orbit at the same altitude. The multiple-altitude cases can be 
handled essentially in the same way. The size of a region is determined based on the 
principle that two satellites between a link are visible to each other all the time 
before the next reconfiguration occurs. The optimization of the region size will be 
discussed in detail in section 5. 

To specify regions and satellites formally, a coordinate system (y, 6, \|/), where y 
represents the altitude, 9 the latitude, and \\f the longitude, is adopted. Basically, the 
surface where satellites orbit is partitioned in such a way that the latitude is parti- 
tioned into twice as many intervals as is the longitude. For instance, if the longitude 
has X intervals, [\^q, \|/2], [\|/2, Vsl. - > [Vx-i. ¥x]. then the latitude has 2 x 

intervals, [%, Gi], [61, 62], [62, 63], ... , [e2^_i, Q2xl Note that, Vq = and Bq = 

because the surface is round. So the network is partitioned into 2x^ regions. 


Each region is represented by ([0,-, 8,+i], \J/y+i]). Since we intend to make the 
area of each region approximately the same, these intervals are not of equal width. 
For example, the widths of 6-intervals near two poles, as their two \|/-borders merge 
into poles, should be greater than those of intervals in other areas. 

Let Vi, V2, ... , v„ denote all satellites in the network, and Ri, R2, ••• , Rm 
represent all regions partitioned from the spherical surface. A node v, at (0, , \|/,- , y,- ) 
is in region R if 0,- e [0„(/?,),0^ (/?,)] and V)/,- e ),\|/^ (/?, )] where 0„(^,), 

(/?, ), 0j(i?, ) and Vvv(^i) the four borders of /?,-. Ties are broken arbitrarily. 
Note that not every region has a northern or southern border. For instance, a region 
near a pole may lack a northern or southern border (in fact, the border reduces to a 
point), and the shape of a region in this area is much like a triangle on a spherical 
surface as illustrated in Figure 2.2.1. The shapes of the other regions are rectangular 
or trapezoidal, depending on their latitudes. 

Figure 2.2. 1 Regions and their connections near a pole. 
2.2.3 Reconfiguring Networks 

Since a global timing scheme is used, each satellite will be running the same 
link assignment algorithm simultaneously to reconfigure the network. The synchroni- 
zation of link establishment for the network is realizable as long as each node keeps 
a copy of the whole network graph. When every node is executing the same algo- 


rithm, a consensus can be reached as each node knows which edges should be dis- 
carded and which edges should be set up. 

The links formed by a satellite are determined by the region it is in and by its 
location within the region at the time of reconfiguration. Between reconfigurations 
(link assignments), it is permissible for a satellite to cross the borders into the other 
regions. It is assumed that transceivers on each link will keep pointing/tracking each 
other once the link is set up, regardless of where the other satellite is. If the region 
size and the reconfiguration period are properly determined, links will not be broken 
when satellites cross the borders. 

2.2.4 Scheduling Network Reconfiguration 

As has been pointed out, the reconfiguration of the network is done periodically. 
The next question is how often should a network be reconfigured. Since each 
reconfiguration requires a temporary halting of message exchanges, to maximize net- 
work utilization the interval between two consecutive reconfigurations should be as 
long as possible under the constraint that no links will be broken in between. Thus, 
the maximum reconfiguration period is the minimum of the longest time interval any 
two linked satellites can orbit and still maintain their common link. This can be 
determined after the region size has been decided. Figure 2.2.2 illustrates how the 
period is calculated. Suppose that the following regions Ri, R2 each have only one 
node and at the time of a reconfiguration their positions are Xi and yi. Let X2 and y2 
be two points in the space such that [x2, J2} ^ [ {x, y] : \ x - y \ = the maximum 
distance that 2 satellites are visible to each other ). Then the time for a satellite to 
travel firom x^ to ^2 or from y^ to 3^2 is the longest period of reconfiguration for 
which no links will be broken. This can be computed as soon as the region size and 
the orbits of the satellites are determined. 





Figure 2.2.2 The longest distance between two satellites 
in adjacent regions 

2.3. The Hierarchical Approach for Reconfiguration 
Before we describe the hierarchical approach, let's introduce some terms. As 
the network is partitioned into regions, some of the nodes have to set up links across 
the regions so that packets can go through these links to other regions. Those nodes 
having border-crossing links are called backbone nodes and the others are called 
internal nodes or non-backbone nodes. Similarly, those border-crossing links are 
named as backbone links and the others are internal links. The role each node plays 
is not fixed. It depends on the location of the node. In general, each region will 
select nodes closest to the 4 borders (eastern, southern, western, and northern bord- 
ers) to be its backbone nodes. Backbone nodes play more important roles in routing 
packets because inter-region packets will be forwarded via at least one of these 
nodes. In each region, except for the one-node case as shown in Fig 2.3.1. a, there 
are 6 backbone links; 4 links cross the northern and the southern borders; the other 2 
links cross the eastern and the western borders. The distribution of links over borders 
is not unique. For instance, another alternative would be to put 4 links on the eastern 
and western borders and 2 links on the others. The basic principle ("single-hop" prin- 
ciple) is to allow the by-passing messages to be forwarded in one hop vertically as 
well as horizontally. The 6-backbone-links arrangement is derived partly from the 4- 
regular graph constraint and partly from empirical results of constructing primitive 


topologies for very small networks. There are possibly other arrangements with 
different numbers of backbone links in a region available to serve the purposes of 
obtaining regional maximum connectivity and of observing the single-hop principle 
as mentioned above. 

The hierarchical link assignment algorithm can be divided into two levels, the 
first-level task is to determine a regional topology which includes establishing inter- 
nal links and selecting backbone nodes. The second-level task is to combine regions 
into a connected network by setting up backbone links among backbone nodes of 
adjacent regions. The algorithm is designed to maximize regional connectivities as 
well as the whole network connectivity. 

In the first-level task, regions are handled in two different ways: the primitive 
case and the recursive case. If a region contains 7 or fewer nodes, it is handled as the 
primitive case. Otherwise, it is regarded as the recursive case. The primitive case 
consists of 7 topologies. In the primitive case, the regional topology is predetermined 
and links are set up accordingly. In the recursive case, the link assignment algorithm 
works by "peeling off layers of nodes, 4 nodes at a time, until it ends as one of 4 
terminal cases; each layer contains the 4 nodes nearest to the 4 borders. The four 
terminal cases are shown in Figure 2.3.3. 

2.3.1 Topologies for the Primitive Case 

Seven primitive topologies are provided for the construction of regional subnet- 
works with fewer than 8 nodes. Figure 2.3.1 illustrates all of them. Note that these 
topologies are chosen to maximize the regional connectivity. By maximum regional 
connectivity, we actually mean that a node can construct 4 node-disjoint paths to any 
other node in the same region via the nodes in the same regions or in neighboring 

The primitive topologies are among the few topologies which meet the max- 


imum connectivity requirement and the single-hop principle. The topologies have the 
following properties: 

(1) no links are redundant in the sense that every transceiver is used and no more 
than one link is assigned to any pair of nodes; 

(2) messages generated from a host inside a region can be forwarded efficiently out 
of this region via fewer than 3 intermediate nodes; 

(3) "straight-line hops," east-west or north-south message deliveries, are favored. 
There are backbone nodes in each case (e.g., Vj and V2 in Figure 2.3. Lb, or 
and V3 in Figure 2.3.1.c-g) that have 2 backbone links crossing east- west and 
north-south borders. Therefore, one-hop transmitting is allowed through this 
region. However, with the 4-link constraint, no similar "convenience" is offered 
if a by-passing message wants to "make a turn" through this region. 

2.3.2 Topologies for the Recursive Case 

If the number of nodes in a region is greater than 7, the strategy is to construct 
a topology recursively. The link assignment algorithm works in the following 
manner: First, 4 nodes, each of them the nearest node to a border, are chosen to form 
the first layer of the region. Note that the nodes in the first layer will also be the 


backbone nodes of the region and will connect to backbone nodes in the neighboring 
regions. Second, the algorithm continues to peel off a set of 4-nearest-to-border 
nodes to form intermediate layers until one of the terminal cases is reached for which 
topologies exist. Figure 2.3.3 shows the topologies of all four permissible terminal 
cases. Finally, any two adjacent layers, i and i+1, arc connected by their correspond- 
ing 4 nodes (e.g., vP, vP, etc.) as shown in Figure 2.3.2.b where the v/'^'s are 
of one layer above the Vj^^'*'^^'s. 

Note that the property (3) in the primitive case is also preserved. In general, 
outgoing messages originating from a node of a particular layer are forwarded via the 
corresponding nodes of its upper layers all the way up to the corresponding backbone 
node and then transmitted to other regions. 

and below 



Figure 2.3.2 Backbone Layer and Intermediate Layer Topologies 






Figure 2.3.3 The Terminal Cases (Final Layer Topologies) 

2.3.3 Topologies for Inter-region Links 

The inter-region backbone links are established in the second-level task. Only 
backbone nodes are involved in this task level. Generally, each region will have 4 
backbone nodes. To illustrate how inter-region links are established, consider a 
region R, denote the backbone node closest to the west border by Vj, the backbone 
node closest to the south border by V2, the backbone node closest to the east border 
by V3, and the backbone node closest to the north border by V4. Let /?,-, i e {1, 2, 3, 
4} be the neighboring region of R, e.g., let /?i be the west neighboring region and 
/? 2 be the south neighboring region, . . . etc.; and the backbone node closest to /?, 's 
west border is represented as v,- j, . . . etc. See examples in Figure 2.3.3. 

The inter-region links are set up based on the following rules: 

(1) Vj is connected to Vj j and V31, i.e., "horizontal links" are established (since the 
positions of the nodes change over time, the links are not strictly horizontal, but 
are so called for the sake of convenience). 

(2) V3 is connected to Vj^s and V4 3, i.e., "vertical links" are established. 


(3) V2 is linked to V2,4 and V4 is linked to V4 2, 

(4) If the number of backbone nodes in R is less than 4, then some backbone nodes 
have to play the roles of more than one node. 

For example, in Figure 2.3.3.a R contains only 2 nodes, while in Figure 2.3.3.b R 
contains at least 4 nodes. Therefore, and V2 in Figure 2.3.3.a are performing the 
roles of Vj and V2, and V3 and V4, respectively, in Figure 2.3.3.b. 


"4.2 / 


"1 \ 

• "2 













Figure 2.3.3 Examples of Inter-Region Topologies 
2.3.4 The Link Assignment Algorithm and Its Analysis 

The link assignment algorithm consists of procedures REGION_LINK and 
NETWORK_LINK. REGIONLINK handles the first-level task and 
NETWORK_LINK deals with the second level task. It is assumed that the space is 
partitioned and thus the region borders are determined before the network is initial- 
ized. Each node is associated with an ID. To execute the algorithm, each node 
should have the positional data, represented as (ID, 0, of all the nodes. The data 
are obtained through calculations of orbits. In fact, the data can be precomputed, 
stored in each node and accessed during reconfiguration. With this data, the member- 
ship of the nodes (i.e., to which region a node belongs) can be determined during the 


reconfiguration, and they remain unchanged until the next reconfiguration. If a node 
is detected dead or missing, the information will be broadcast across the network and 
the node will be ignored for the reconfiguration. The procedures are described as 

Procedure REGION LINK; 

Input: a set of nodes of a region of the form- {ID, (9, \^)] 

1. i<-l; 

If # of nodes < 8, then construct the topology 

according to the arrangement of Figure 2.3.1; return. /* primitive cases */ 

2. ConsOuct 2 bi-directional queues from the nodes, and Qq, 
j2y is ordered non-decreasingly according to the distance 
between a node and the eastem border, 

Qq 'is ordered non-decreasingly according to the distance 

between a node and the northern border; 

ties are broken arbitrarily. /* recursive cases */ 

3. Remove the first and the last "unremoved" nodes from and Qq 
separately. These 4 nodes, mariced "removed", form layer i. 

If i = 1, then construct links as specified in Figure 2.3.2.a 

(the backbone layer intra-region topology), 
else construct links as specified in Figure 2.3.2.b 

(the intermediate layer topology), 
i i + 1. 

4. If # of "uruemoved" nodes in each queue > 7, then go to Step 3. 

5. Construct final layer topology as specified in Figure 2.3.3. 

Let R be the region in which the node executing the algorithm is located. 
Procedure NETWORK LINK; 

Input: the set of nodes of the netwoik of the form- {ID, (9, \|/)} 

1. Call REGION_LINK 5 times to derive the regional topologies of 

R, R J, /?2i /?3. ^4. 

2. Construct inter-region topology according to the 4 simple rules 
specified in section 3.3 (examples are shown in Figure 2.3.3). 


The analysis of the time complexity of the link assignment algorithm is straight- 
forward. Let k be the number of nodes in a region. In REGION_LINK, step 1 and 5 
take 0(1) (note that the process of checking whether a node is "removed" or 
"unremoved" in a queue takes 0(1) as it can be done by indexing a binary array 
REMOVED[k]). Step 2, mainly the sorting procedures, take Oiklogk). Steps 3 and 
4 serve as a loop. Since 4 nodes are handled each time, the number of loopings is 


approximately klA. Hence, steps 3 and 4 jointly take 0{k). So, the total time com- 
plexity of REGION_LINK is 0{k\ogk). 

In NETWORK_LINK, as each node is only concerned with the topologies of its 
4 neighboring regions and its own, the time complexity is still O {k logk ). Now, let n 
be the total number of nodes in the network and m be the number of partitioned 
regions. It takes time O (n ) to determine the memberships of the nodes. By assuming 
that the number of nodes in each region is approximately n/m, it can be seen that 
NETWORK_LINK takes time O ({n /m)\ogin /m)) which is bounded by Oinlogn). 
So, the algorithm is completed in O (n logn ) time. 

2.3.5 Applying the Algorithm 

Given that there are 4 transceivers in each node, the maximum node- 
connectivity of a network is no more than 4. However, consider the case where there 
exists a region with only one node. In order to maintain a 4-connected network, all 
the other regions of the same longitude must contain exactly one node. This is simply 
because there is only one backbone link crossing each of the northern and southern 
borders for the one-node regions, while there are two such links for the other cases 
(see Figure 2.3.1). Consequently, a good decision on the size of a region should 
manage to prevent the one-node-in-a-region case when a network starts its 
reconfiguration. In fact, we will show that the link assignment algorithm maximizes 
the node-connectivity of a network if every region of the network contains at least 2 
nodes during each moment of reconfiguration. A natural way to achieve this is to 
increase the region size (i.e., to reduce the total number of regions) such that every 
region contains at least 2 nodes all the time, which can be done by calculating satel- 
lites' orbits and adjusting region sizes accordingly. 

The algorithm is most applicable when dealing with a large number of satellites 
and not very suitable for small satellite networks for two reasons. First, if a satellite 


network is small, its optimal link assignment can be found by a direct exhaustive 
search within an acceptable time limit. Second, in order to handle networks in which 
the number of satellites is small, large region sizes are needed. But, if a region is too 
large, it will make two nodes within the region invisible to each other, this will make 
the algorithm fail. 

In dealing with large satellite networks, the algorithm can handle the recursive 
case fairly efficiently. It can take advantage of the locality of nodes and come up 
with relatively short links (in a dynamic sense, of course, as the length of a link 
varies) except perhaps for the backbone links. With these short hnks in a regional 
network, it is expected that the mean propagation delay for local messages can be 
significantly reduced. Considering that propagation delay is a substantial part of the 
end-to-end delay in satellite communications, the algorithm will help speed up the 
routing in addition to providing high network connectivity. 

2.4. Proofs on Network Connectivity 
Connectivity is one of the important criteria in network topology design. A net- 
work should be able to remain connected even if some links or nodes fail. Before 
discussing the connectivity of a network, we formally define some terms. 

Definition 2.4.1 Let G=(V, E) be an undirected graph representing a network N. V 
is the set of nodes in N, and E is the set of edges/links in N. A path from to y is 
a sequence of edges of the form <Vo, Vj, V2, ... , Vj> where (v,_i, v, ) is an edge 
for l^^y, and x = Vq, y = Vj. Two paths from v,- to are node-disjoint if they 
contain no common nodes except the two end nodes v,- and Vj. The node- 
connectivity of a pair of nodes is the minimum number of nodes whose removals will 
disconnect the two nodes. Equivalentiy, it can be defined as the maximum number of 
node-disjoint paths between them. The node connectivity of a network is defined as 
the minimum node-connectivity of all pairs of nodes in V. There is another type of 


connectivity for a graph, called edge-connectivity which can be defined as the 
minimum number of edges whose removals will disconnect a graph. Since node- 
connectivity is a stronger condition than edge-connectivity in terms of reliability, 
node-connectivity is adopted here; throughout this chapter connectivity will mean 

Given that each node has at most 4 links in the network, we will show that the 
link assignment algorithm NETWORK_LINK yields a surprisingly good connectivity 
in satellite network topology designs. 

Definition 2.4.2 Two regions /?,• and Rj are adjacent if is contiguous to Rj , or if 
they intersect in a pole and are in radially symmetric positions with respect to the 
pole. A region path between two regions is a sequence of regions of the form </?i, 
R2, R2, R4, — , Ri-i, Ri> where Rj_i is adjacent to Rj, 2 < j < i. Two region paths 
from Ri to Rj are region-disjoint if they have no regions in common except /?,• and 


Lemma 2.4.1 Let R = [R^, R2, ... , } be the set of the partitioned regions. Let 
Ri j denote the region [i, i+1] x [j, j+1], and /?, j € R. For any pair of regions, there 
exist 4 region-disjoint paths between them. 

Proof. The proof is straightforward. We can show that by explicitly setting up 4 
region-disjoint paths between any pair of regions. There are two possible cases to be 
discussed as follows: 

Case 1. Ri J is adjacent to Rj j. It is trivially true (see Figure 2.4. La for exam- 

Case 2. is nonadjacent to Rj j. Without loss of generality, suppose that i < 
j as shown in Figure 2.4.1.b. Construct a path as <^/,,, /?j+2./, •.• . Rj,i, 

^y,i+i» ^j,i+2> — ' ^jj-^- Construct the second path P2 as </?,•,,-, 


- ' - . ^;J>- Construct the third path as </?.-,•, 

^,•,+1. ^f,.+2> - ' ^ij. - . ^;.;>- Construct 7^4 as </?.-,-, 

Pj, ^2' ^3' ^4 ^ 

The other cases such as i > j can be demonstrated similarly and thus are omitted 
here. □ 

Each time REGION_LINK is completed, an intra-region topology is constructed 
with layers of nodes. Except for the final layer, which has 4 to 7 nodes, each layer 
consists of 4 nodes of the form {vf \ V;^'\ vj'^ v^'^ ). For convenience we may 
call {vf'\ v^'\ vP, } layer i. In particular, layer 1 is the backbone layer, and 
also denoted as {vi, Vj, V3, V4}. Notice that the links, (v,-^\ v,*^'"^^)), (v^), vp"^^)) 
and (v|^) , v^'+i)) for all 1< i < 4 and j>l, will be assigned in REGION_LINK. 

Lemma 2.4.2 There exist disjoint paths from any node x to the other 3 nodes in the 
same layer i such that these 3 paths consist of nodes only in layer i, layer (i-1), or 
outside the region. 

Proof . There are three cases to be discussed. 

Case 1. Here i is the backbone layer. Without loss of generality, suppose x = 
v^. Let Vj^f denote the node v,- in region X(R), where i € {1, 2, 3, 4} and X e {W, 
NW, N, NE, E, SE, S, SW}. Then the 3 disjoint paths are: <Vi, V4>, <Vi, v^,y/^J^^^, 

V4,W'(/?). V2^(/?), V4jv,y(;f), VijvW(;j), Vi^(^), V4jv(/?), "^l^if^Ry '^3J^(Ry ^3>^ <Vi, 

^i,EiR> ^4,£(/f). ^ZEiRy ^4.SE(R)y "^ijSEiRy ^uc/?)' ^4^(.Ry (see Figure 2.4.2.a). 
Similarly, we can show that the lemma holds for x = V2, V3, or V4. 

Case 2. In this instance, i is an intermediate layer. Again, assume that x = v{'^ 
Since (v{'\ v^'^) and {v{'\ v^'^) are 2 edges in layer i, 2 paths, <v|'\ V2^'^> and 
<vl'\ v|'^>, exist. The third path is from vf'^ to vj'\ which can be constructed 


one layer above (i-1). If (i-1) is the backbone layer, i.e., i=2, then the path = <v{'\ 
v{^\ vP, v^, vP, v;j'^> (see Figure 2.4.2.b). If (i-1) is an intermediate layer, the 
path can be <v}''), v}'"!), v^-'^K v^'-'^\vP> (see Figure 2.4.2.C). 

Case 3. In the third case, i is the final layer. There are 4 topologies for the final 
layer. In the 4-node topology as shown in Figure 2.3.3.a, there is a complete graph. 
So, there are 3 disjoint paths from any node to the other 3 nodes. For the 5-node, 
6-node, and 7-node cases, it can be checked easily that 3 disjoint paths exist from 
any node to the other 3, based on the similar enumerative method as in the inter- 
mediate case. □ 

Lemma 2.4.3 There are disjoint paths from any node x to the other 3 nodes in the 
same layer i, where layer i an intermediate layer, such that these 3 paths consist of 
nodes only in layer i or i+1. 

Proof. Let x be v}'^ Two obvious disjoint paths are <v}'\ v^'^> and <v}'\ v|')>. 
The third path, from vf ^ to vj'\ can be set up via nodes in layer i+1, i.e., <v{'\ 
v{''*'^\ v|'"''^\ v|'^^\ vj'^>. Since the link assignment in the intermediate layer is 
symmetric, it holds for the cases where J e {vj'), }• □ 

Lemma 2.4.4 There are 4 disjoint paths between every pair of nodes [x,y} where 
X and y are in different layers. 

Proof. Without loss of generality, suppose that x is in layer i and y is in layer j, i < 
j. From Lemma 4.2 we know that there exist 3 disjoint paths from x to the other 
three nodes in layer i by using nodes only in layer i or above (possibly nodes in 
neighboring regions). Now, since there is an edge (v/'\ v^^'"^^^) for all k e {1, 2, 3, 
4} and i is either a backbone layer or an intermediate layer, 4 paths, <v^^\ v/'"^^\ ... , 
Vi^~^\ Vi^h>, where k = 1, 2, 3, 4, can be constructed. Hence, we are able to build 
up 4 disjoint paths from x to v^^ for k = 1,2, 3, 4. 


If layer j is an intermediate layer, then by Lemma 4.3 there must exist 3 disjoint 
paths from y to the other 3 nodes in {vp^ v^M that do not contain 

nodes above layer j. Therefore, there are 4 disjoint paths from x io y (see Figure 

If layer j is the final layer (i.e. layer j is equal to layer L in Figure 2.3.3), there 
are 2 cases. If 3^ e [v\'\ v^, v^'M = {v{^), v^^^, v^^^, vf >}, from the 
final layer topologies as shown in Figure 2.3.3, it can be seen that each node in 
[v[^\v^\ v^^] has disjoint paths to the other 3 nodes in the set. So 4 dis- 
joint paths can be constructed from x to y. If y e {vj^\ v^^\ vj^M, since each 
node in {v^^\ v^^), >} has 4 edges to all of the nodes in {v[^\ v^^\ vf \ 
v^^} as shown in topologies for terminal cases (see Figure 2.3.3 and Figure 2.4.4.b). 

Lemma 2.4.5 The connectivity of every pair of nodes in the same layer i is 4. 

Proof. It suffices to prove that, for each pair of nodes in the same layer, there exist 
4 disjoint paths between them. This can be done by path enumeration. Since it will 
be a lengthy proof if we enumerate all of the 4 disjoint paths for every pair of nodes, 
some typical cases are illustrated to show that the disjoint paths can be constructed 
explicitly. The rest of the cases can be done similarly. 

Case 1. Here i is the backbone layer (i = 1). Let us show that, for {y^\ v^^), 
we can explicitly set up 4 disjoint paths connecting them as follows: The first path is 
<vp\ v2\ v^^>. Now, via its west, northwest and north neighboring regions, then 
going through V3, a second path can be established. Third, via east, southeast and 
south neighboring regions, the third path can be set up. Finally, the last path can visit 
vp\ , and then V;^^^ (see Figure 4.5.a). It is easy to verify that there exist 4 
node-disjoint paths between any other pair in the backbone layer. 


Case 2. In this instance, i is the final layer (i = L, see Figure 2.3.3). If the 
number of nodes in the final layer is 4 (as shown in Figure 2.3.3.a), every pair of 
nodes is directly connected, so the node-connectivity is a maximum. If the number 
of nodes in the final layer is 5 (as shown in Figure 2.3.3.b), then, for {v{^\ M in 
Figure 2.3.3.b, the 3 disjoint paths between v}^) and v^^^ arc <v{^\ v^^\ v^^>, 
<v^\ v|^^ vj^^>, and <v{^\ v^^\ v|'^>.The fourth path will pass through 
v[^~^\ v^^~^\ and v^^~^^ to v^^K Following the same enumerative process, we 
can verify that it also holds for the 6-node case (as shown in Figure 2.3.3.C) and the 
7-node case (as shown in Figure 2.3.2.d). 

Case 3. In the third case, i is an intermediate layer. As Figure 2.3.2.b shows, 
we can choose any pair, say [v[^\v^^^, and construct the following 4 disjoint paths 
between them: <Vi('), v^'"), vP>, <vP, vf, >, <Vi('), vf'-D, v^'-^), vt^\ 
vP>, and <vP, vp^\ v^'+^\ v^'+^), vP> (see Figure 2.4.5.b). The other cases 
for an intermediate layer are similar. □ 

Let = (V, E) be the network topology derived from applying the 
NETWORK_LINK algorithm; and let {Vj, Vj- - ^^m) ^ ^he set of the nodes of 
the regions where each V,- resides in region . 

Theorem 2.4.1 If I V,- I > 2 for all V,-, then the network N has connectivity = 4. 

Proof. Let v,- and Vj be two nodes in V. 

Case 1. Here v,- and Vj belong to the same region /?,-. If/?,- contains fewer than 
8 nodes, i.e., a primitive case (Figure 2.3. l.b to Figure 2.3. l.f), we can enumerate all 
the 4 disjoint paths between any two nodes. These paths will visit 4 adjacent regions 
of Ri (i.e., N(/?,), S(/?,), WiRi), and E(/?,)), and 4 non-adjacent regions (i.e., 
NW(/?,), NE(Ri), SE(/?,), and SW(/?,)). For the sake of briefness, they are not 
enumerated here (a similar example is shown in Figure 2.4.5.a). If /?,• contains more 


than 7 nodes, it follows immediately from Lemma 2.4.4 and Lemma 2.4.5 that 4 dis- 
joint paths exist for each pair. 

Case 2. In the second case, v,- and Vj are not in the same region, (e.g., v,- is in 
Ri and Vy is in Rj). If /?,• is reconfigured as a primitive topology, it is easy to verify 
that there exist 4 disjoint paths from any node in any of the 7 primitive cases to its 4 
adjacent neighboring regions. If 7?, contains more than 7 nodes, then as we have 
shown in the proof of Lemma 2.4.2 and Lemma 2.4.3, there are 4 disjoint paths from 
V,- to the four backbone nodes of i?, ; and each backbone node has an edge to its 
neighboring regions. Thus, regardless of the number of nodes /?,■ gets, there are 
always 4 disjoint paths from v,- to its 4 neighboring regions. Similarly, there are 
always 4 disjoint paths from Vj in Rj to its 4 neighboring regions. Now, by applying 
Lemma 2.4.1 we can construct 4 disjoint region-paths from v,- to 4 of its neighboring 
regions. Finally, via these disjoint region-paths 4 disjoint paths from v,- to Vy can be 
successfully established. □ 

Hi A 

I I 




Figure 2.4.1 Disjoint Region-paths 



^ — \ 


_ - -t 
— ' / 

/ ^-1 

n ; 


« / 










Figure 2.4.2 Three cases of Lemma 3.4.2 


V (.+1) V 

Figure 2.4.3 Routing in the same layer 


Figure 2.4.4 Routing in different layer 








Figure 2.4.5 Two examples of external routing 


Failures and Recoveries 

In a mobile satellite network, a broken link can be detected by the loss of sig- 
nal, which may be due to the failure of a transceiver on either side. The failure of a 
node is detected through a consensus reached by its adjacent nodes. With 4- 
connectivity, a network can survive without being partitioned as long as no more 
than 4 nodes or links fail. Yet, there are backbone Unks and nodes whose failures 
have more serious impacts on the overall performance of a network than those of 
non-backbone links and nodes. When a backbone node/link fails, a local topological 
reconfiguration is needed to maintain sufficient backbone links and nodes; non- 
backbone link/node failures may be handled in the routing level (e.g., by reconstruct- 
ing virtual circuits) instead of in the link assignment level. As failure events occur 
asynchronously, the recovery should be done dynamically. The recovery will require 
each party involved to share the same information about the local reconfiguration and 
will need a centralized algorithm distributively run by the affected nodes. This cen- 
tralized recovery algorithm will do link-switching if a backbone link fails. It will also 
do backbone-node-selecting and link-switching if a backbone node fails. 

It is assumed that all the failure events are single-failure events; a multiple- 



failure event rarely occurs. When it does occur, its failures are treated just like a 
sequence of single failures. One thing we would like to point out is that, if any 
failure occurs at a time very close to the next reconfiguration, the recovery algorithm 
may ignore it because it will soon be taken care of by the global reconfiguration. 

2.5.1 Link Failures 

Failure on a non-backbone link will be ignored as the network has a high degree 
of connectivity. A failure on a backbone link will not partition the network, but it 
will have a significant impact on the routing. All packets going to other regions are 
supposed to be forwarded through one of the backbone links. A failed backbone link 
will cause packets to be routed via longer paths and may create traffic jams for other 
backbone links. Hence, link failure of this kind should be handled properly. This can 
be done through link- switching. The backbone node which has a malfunctioning 
trans-ceiver will consult the other three nodes connected to it and decide which of the 
three links will be used to replace the broken backbone link. In this way the network 
can cushion the impact of a broken link to the minimum. 

2.5.2 Node Failures 

A node may be detected to have failed ( e.g., through the failures of all the links 
incident to it). The failure of a non-backbone node will not result in a local 
reconfiguration. But a backbone-node failure needs to be handled properly, consider- 
ing its more important role. The recovery process works as follows: The region con- 
taining the failed node will select a node to replace it. The principle for the replace- 
ment is "choosing-the-nearest-node," i.e., the node that is closest to the failed node is 
chosen as the new backbone. As each node has information on the locations of the 
other nodes, this can be done in a straightforward manner. By doing this, those 
nodes originally linked to the failed node will be able to connect to the new back- 
bone node. 


3.1 Background 

For various reasons, a node in a network needs to broadcast messages to all 
other nodes. A broadcast protocol is reliable if every node can receive the broadcast 
messages in finite time. In networks with fixed topologies, broadcasting is usually 
done by delivering messages through some broadcasting trees (e.g., a minimum- 
spanning tree) [7,10,11,14,15,26-28] for the sake of reducing the number of messages 
exchanged. For networks where the topology keeps changing, the task is more com- 
plicated because a pre-constructed routing path may be disconnected. Even if no 
paths are disconnected, with varying propagation delays over communication links, 
the routing of the broadcast messages may not be optimal if pre-computed broadcast- 
ing paths are used in the protocol. 

It has been shown by Awerbuch and Even [3] that a reliable broadcast in a 
changing-topology network is possible if the network is "eventually connected" (no 
nodes are permanendy isolated from the others). Some sophisticated broadcasting 
algorithms have been developed to achieve reliable broadcast in eventually connected 
networks with slowly changing topologies [3] [32] [35]. However, the eventually con- 
nected networks for which these broadcast algorithms were designed are mainly sta- 
tionary networks undergoing arbitrary link failures and recoveries and thus the 
significance of these algorithms are very restrictive. In addition, the algorithms are 
not guaranteed to achieve minimum broadcast delay in networks with rapidly chang- 
ing topologies such as mobile communication networks described below. 

In this research, the concept of eventually connected networks is further 



extended to include mobile networks subject to frequent link reconfigurations [30]. 
Typical networks of this kind include large-scale mobile communication networks 
such as packet radio networks and point-to-point satellite networks. There are three 
fundamental properties in the topologies of the networks: (i) each node is associated 
with a number of communication ports with which communication links can be esta- 
blished, (ii) the nodes are mobile and communication delay across each link is vary- 
ing, (iii) link reconfiguration is necessary in order to preserve the connectivity of the 
network [7][17-18][22]; this is due to restrictions such as limited traniiUiission powers 
in the transceivers of packet radio networks when nodes are too far apart, or the 
blocking of the earth in satellite networks connected by laser transceiver. 

It has been shown [26] that in changing-topology networks, if a broadcast mes- 
sage is not buffered, it is likely that some node may never receive that message. For 
example, suppose that / is the only node connected to j and a message M arrives at 
/ immediately after the link between / and j is broken, if j later reconnects to / or 
connects to any node which has already received M and discarded it, j will not 
receive M . 

The questions which arise are: 1) how long a message should be buffered and 2) 
how should the message buffers be managed. These issues have never been success- 
fully addressed in networks with changing topologies. In [3], the algorithms work by 
assuming infinite message buffer space in each node. However, in any network each 
node has only a finite memory. In this research, we show that bounded message 
buffers are feasible for a reliable broadcast in eventually connected networks. 

Toward this end, we take a rather unconventional approach to the design of reli- 
able broadcast protocols. By using message reply vectors, we design a reliable and 
efficient broadcast protocol which requires bounded message buffers for broadcast 
messages in each node. The protocol also provides a mechanism which enables 
obsolete messages to be dynamically flushed out. This will substantially increase the 


utilization of limited buffer space. In particular, the algorithm is proven to be optimal 
in terms of broadcast delay in eventually connected networks with arbitrarily chang- 
ing topologies. 

3.2 Network Model 
Let a mobile communication network be represented as G = (V, E), where V 
represents the nodes in the network, and E represents the operating communication 
links in the network. \V\ is fixed while \E I is bounded and time-varying. 

3.2.1 Preliminary Definitions 

In addition to all the basic notions of graph theory, we define some related 
terms as follows. A directed tree in a network is a tree in which each node, except 
for the root of the tree, has a neighboring node (nodes connected by a link are neigh- 
bors ) as its parent node, and each parent has one or more child node s or children . 
A link is connected if it provides at least a minimum operating interval x during 
which a message can be successfully transmitted, otherwise it is considered discon- 
nected. Two nodes that are connected by a link but that do not have a parent-child 
relationship are siblings. Two trees are called adjacent if they are connected by at 
least one link. The links of a tree are called parent -child links while the links by 
which two siblings are connected are sibling links. The sibling links can be further 
classified as internal and external; an internal sibling link connects two siblings in 
the same tree (i.e., they have the same root), and an external sibling link is an edge 
between two siblings in different trees. The issue on how the relationship between 
neighbors is determined will be discussed in the next section. A path P from x toy 
is a sequence of edges of the form <Vo, Vj, V2, ... , Vy> where (v,_i, v,) is a 
link for 1 < i ^ j, and x = VQ,y = Vy . If every (v,_i, v, ) is a parent-child link, then P 
is called a primary -path ; if some (v, _i, v, ) are siblings link, then P is called an 


The concept of eventually connected networks is defined as follows. A node u 
is said to be eventually connected to node 5 in G if, at any time t^, there exists a 
constant (3 and a node sequence P = <Vo, Vi, V2, ... , Vy_i, Vy> where Vq = m and Vy 
= S, such that a time sequence T = (rj, - . 0') ^ found with tQ<t^<t2^ 
... < tj <tQ + ^, and {v,_i, } is connected by a link at r, and a minimum operating 
interval [r,-, ti+x] exists for {v,_i, v,- }. If every node in E is eventually connected to 
node S , then E is said to be eventually connected with respect to node 5 . The node 
sequence P above does not have to be a physically connected path (P can be called 
an eventually connected path). Nor is it assumed that a message can always be 
delivered from m to 5 via P within the interval [tQ, tQ + 

3.2.2 Assumptions 

It is assumed that the following statements hold: 

(1) Each node has a unique ID and has a complete list of the IDs of the other nodes 
in the network, but has no information about the complete network topology; the 
topological knowledge of a node is restricted to that of its neighbors (in a net- 
work undergoing topological changes, it is impractical to assume that complete 
topological data is available to every node). 

(2) Delivery is reliable across a connected link and the time to transmit a message 
through a link is finite and bounded. 

(3) The time spent in processing an incoming or outgoing message is negligible, 
compared to the propagation delay of sending a message through a link. 

(4) Links may reconfigure for certain reasons (e.g., maintaining network connec- 
tivity), which are unpredictable to each node; each reconfiguration may involve 
disconnections (link-downs), connections (link-ups), or switchings of links 
(reconfigurations that are not due to a link failure or recovery). In addition, if 
the reconfiguration is an "involuntary" one, i.e., a link failure, it can be detected 


by the nodes on both ends. It is also assumed that a node will never fail so that 
every node will respond to an incoming message. 
(5) The network is eventually connected with (3 bounded; an unbounded P would 
imply that some nodes may be isolated from the rest of the network for an arbi- 
trarily long time regardless of what protocol is used, which has little or no prac- 
tical significance in message broadcasting. Observe that this assumption does not 
imply that every link reconfiguration should be finished in finite time or that 
every failed link should be recovered in finite time. All it requires is that the 
network be maintained in such a way that an eventually connected path exists 
between any two nodes. 

3.3 The Two-Phase Broadcast Protocol 

3.3.1 Description of the Protocol 

The broadcast protocol is a distributed algorithm in the sense that each node 
reacts to different conditions independentiy. For the purpose of brevity and ease of 
demonstration, a single-source broadcast environment is assumed. By assuming that 
broadcast messages do not interfere with each other, and that each node treats the 
broadcast messages from different sources independently, the protocol can be easily 
extended to multiple- source cases. Each broadcast message is associated with an 
acknowledgment message or a reply message that, appended on the end of the broad- 
cast message, contains one bit position for each node in the network (i.e., an n-bit 
vector initialized to 0). Throughout this chapter, we use reply messages (or ack- 
nowledgment messages) and reply vectors ( or acknowledgment vectors) interchange- 
ably. For the following discussion of the protocol, we shall assume that the broad- 
cast messages are from a single source node S . The protocol consists of two phases: 
broadcast phase and reply phase. 

34 Broadcast phase 

In the broadcast phase, each node performs the flooding procedure [33] [35], i.e., 
each node receiving an incoming message M will keep a copy and send it to all the 
neighboring nodes except the nodes from which M came. Unlike conventional flood- 
ings, a message received is buffered in memory and will not be discarded until an 
explicit flush -out message from S arrives. The broadcast phase begins with 5 send- 
ing a message to each of its neighbors. Let node / be a node that has just received 
M, and NEIBi be the set of neighboring nodes of node / . When noce / receives M, 
it sets the position corresponding to / in the reply vector of M to 1. The first node 
from which M arrives is chosen to be the parent of /, denoted by PRi, and a parent- 
notifying control message is sent to PRi acknowledging the relationship. The nodes 
from which node / receives a parent- notifying message form the set of node /'s chil- 
dren, denoted by C/?,- . The other nodes sending M to j are the siblings of / , denoted 
by SB I . Again, a control message is sent to each sibling to acknowledge the relation- 
ship. Notice that 5fl,- consists of either the nodes which exchange the same M with 
node i due to the propagation delay, or the nodes which deliver M to i slightly later 
than PRi . Siblings play an important role in the reply phase when a parent-child link 
is disconnected. 

During the broadcast phase, if a new link is established between two nodes, 
both sides will exchange information about broadcast messages. Each node will 
transmit over the link the messages the other side does not have. As the broadcast 
messages are spreading across the network, a spanning tree rooted at S is constructed 
and expanded. The spanning tree consists of parent-child links and will be used to 
route the reply messages of M . If the network is connected and no links are discon- 
nected during the broadcast phase, a complete spanning tree can eventually be esta- 
blished. If links are broken in the meantime, then the spanning tree could be parti- 
tioned into two or more subtrees, which may cause reply vectors to be detoured. 


Theorem 3.1 The broadcast protocol is reliable if the network is eventually con- 

Proof. Let M be a broadcast message from 5. Assume that v never receives M. 
Since the network is eventually connected and M is broadcast by flooding, the 
number of nodes that receive M must increase over a finite period of time. If v never 
receives M, then the number of nodes receiving M must stop increasing after a cer- 
tain time, and this number is less than \V\. Let Ci be the set of nodes receiving M 
and C2 = V \ Ci where v e €2- As the protocol ensures that each node will inform 
its neighbors of M, there must be a permanent edge-cut between and €2- This 
implies that the network is not eventually connected, a contradiction. Hence, the 
broadcast protocol is reliable. □ Reply phase 

A node receiving M initiates the reply phase under the following two condi- 
tions: 1) it has no neighbors to which M needs to be sent or 2) all the edges to its 
children are disconnected. In the former case, the broadcast phase ends at this node 
and we call such node a leaf node having no children. In the latter case, there is no 
need to wait for the reply messages from its children as it has no way of knowing 
when or whether the links connected to the children will be restored. In both cases, 
these nodes are called initiating nodes. 

If a node is not an initiating node, it accumulates all the vectors acknowledging 
the acceptance of M from its connected children, logically ORs them together to 
form a new version of the vector, and forwards the copy of this vector to its own 
parent node. Notice that, by piecing together the reply vectors from its children, a 
node can also construct a descendent list of its own. 

Lemma 3.1 If each parent-child link is not disconnected before the child node 


finishes replying its vector, then every reply message can be sent to S through a 

Proof . If each parent-child link is not disconnected before the child node finishes 
replying its vector, then each node can reply to its parent via the parent-child link. 
Hence, the reply message will be forwarded to S through a path consisting of only 
parent-child links, which is a primary-path. Q 

During the reply phase, if a node ; is disconnected from PRj, a sibling in SBj 
is chosen as the new parent of If it happens that two siblings are disconnected 
from their parents and thus acknowledge each other as their new parents, the tie is 
broken arbitrarily so that one is the parent and the other is a child. If node ; has no 
siblings, a new tree, splitting from the old, emerges with ; being the root. This may 
result in an alternative-path search if j has not yet replied its vector, because no 
primary-paths connecting j and S exist. 

The alternative-path algorithm. The alternative-path algorithm is based on the 
idea of finding external siblings. When a node, which has lost its parent and has no 
connected siblings, is unable to send a reply vector, it starts alternative-path search 
by passing the vector to its children. Receiving the vector from its parent, each child 
updates its own vector and sends a copy of the vector to each of its external siblings. 
As an internal sibling is a neighboring node in the same tree, it is useless to 
exchange reply vectors among internal siblings because they lead to no new paths to 
other trees. An external sibling can be discerned from an internal one from the infor- 
mation contained in the reply vector sent by the parent (i.e., the internal siblings will 
have Is' in the reply vector). 

In addition to sending the updated reply vector to its external siblings if it has 
any, a node needs to pass the vector to each of its own children. If a node has no 
siblings, it simply passes the vector to each of its children. The necessity of sending 


vectors to children even if external siblings are found is justified by the following 

Lemma 3.2 An alternative-path is not guaranteed to be found if not every node in 
the new tree, generated as a result of link disconnection, is explored. 

Proof. Without loss of generality, suppose that v in the new tree Tj is a leaf node 
connected to the root r via a path P = <r, vj, ... , v, , v> and that v has no external 
siblings. Assume that the alternative-path search will only explore the nodes having 
external siblings. Therefore, the search stops at some node Vj where \ < j <i and v^- 
has external siblings. If it happens that none of the extemal siblings of 7, can lead a 
path to S but a new link is just established between v and a node u not in 7, and u 
has a path to 5, this path then becomes the only alternative path to 5. Since v is 
unexplored and thus does not receive a vector from its parent, the alternative-path 
may never be found even if the network is eventually connected. □ 

Figure 3.1 shows a simple example. Suppose that an alternative-path search will stop 
when the first extemal sibUng is found. Let and r2 be roots of two adjacent trees 
and Vi and V2 be two extemal siblings. Both and r2 start altemative-path search 
due to the disconnections of the links to their parents. The reply information on both 
trees may be exchanged over the extemal sibling link connecting Vj and then sent 
to Ti and and finally back to and V2 again. Both sides exchange reply vectors 
again and reject each other as their reply vectors are identical. The search on both 
sides stops at and V2 without discovering that there exists an altemative-path via 
V3 to S. 

Figure 3.1 A failed altemative-path search 
When a sibling node in a neighboring tree receives a reply message via an 
external sibling link, if the node has akeady done its own replying and the vector 
contains new information that it does not have, it will forward this vector immedi- 
ately to its parent. If it has not yet replied with its own vector or the vector contains 
new information, it will create an update vector by merging this vector with its own, 
wait for its children's reply vectors if it has not yet replied, and then reply the vector. 
If the vector contains no additional reply information, it will be rejected by this 
external sibling. 

It is possible that an external sibling may receive the same reply vector from 
different external siblings from the same tree. This "multiple-copy-to-a-single-sibling" 
situation causes redundancy, but it is risky to be avoided completely. The reason is 
as follows: Suppose that the algorithm can be designed in such a way that no siblings 
will be sent with more than one copy of the vector. This can be done by the follow- 
ing steps: 1) search in parallel all the nodes in the tree (assuming that the search is 
not interrupted by link disconnections), 2) bring back information about each external 
sibling from each descendent, 3) let the tree root decide which external sibling should 
be sent by which node. However, an altemative-path may be missed if the desig- 
nated sibling is disconnected from its external sibling after the duty is assigned and 


before it can send the reply vector. Missing a particular alternative-path may cause a 
serious delay or a permanent blocking of the vector-replying for reasons similar to 
Lemma 3.2. 

However, the situation can be safely alleviated by refining the algorithm as fol- 
lows: A list of external siblings which have been reached by the reply vector is sent, 
along with the vector, to each children so that each node will not send the vector to 
the siblings on the list. After each node has finished routing the vector, the external 
siblings just reached with the vector are added to the list, which is dien delivered to 
its own children. 

The alternative-path algorithm propagates all the way down to the leaves of the 
tree. If none of these descendents of the root have external siblings, the tree is tem- 
porarily isolated from the rest of the network. As the network is assumed to be 
eventually connected, in finite time at least one node will connect to the other nodes 
outside the tree through link reconfiguration. This ensures that the reply vectors can 
continue to be forwarded toward 5. It is likely that, due to the operation of a net- 
work reconfiguration algorithm, some link disconnections may result in a partition of 
the network; a spanning tree becomes a spanning forest Therefore, an alternative- 
path eventually leading to 5 may consist of a sequence of parent-child links and 
external sibling links. 

The sibling-reclassify algorithm. After replying to its parent, a node is still 
likely to be disconnected from its parent and become a root of a new tree if it has no 
siblings. This makes the vector reply more complicated. It seems at first that this 
condition can simply be ignored since the reply vector, which contains reply mes- 
sages of all the nodes within this tree, has been sent out of this tree. However, ignor- 
ing this condition could result in a long delay or permanent blocking for replying 
vectors. Consider the following example. Let x and y be two nodes in tree and x 
is a child node of y. Suppose that x has no siblings. If x is disconnected from y, 



then X and y belong to different trees with x being the root of a new tree Ti- Now, 
the sibling links between nodes in and nodes in T2 change from internal to exter- 
nal. If these links are not "reactivated" as external links, where it might be the case 
that the only alternative-path from nodes in Tj to S must go through nodes in Tj, 
then the alternative-path can not be found. Therefore, sibling links between Ti and 
T2 need to be reclassified. 

The sibling-reclassify algorithm works in a way similar to that of the 
alternative-path algorithm. The primary difference is that the sibling-reclassify algo- 
rithm requires a specific control message, originating from the root of the new tree, 
to indicate which nodes are still in the same tree. This is because all the nodes in the 
tree may have the same information in their reply vectors due to the alternative-path 
search process invoked before the tree is split into two. Hence, a node has no way 
of knowing who its external sibUngs by merely referring its reply vector as what 
have been done in the alternative-path algorithm. The specific message contains the 
descendent list of the root, which is constructed when the root is doing its first vector 

The algorithm proceeds as each node sends to its children the message contain- 
ing the set of all the nodes in the tree. Each node receiving this message can deter- 
mine which siblings have become external, and sends each of them a message declar- 
ing the relationship to be external. A node which receives such a declaration message 
will acknowledge the sending node with its reply vector. This is an important step to 
prevent the other side from missing the new reply information which arrives when 
the links have not yet been declared to be external. 

In summary, when a link is disconnected in the reply phase, the relationship 
between the nodes on both ends terminates. If a node is disconnected from its parent, 
it will select a new parent from its siblings if it has any. Otherwise, a new tree is 
bom. If the node has not replied its vector, the reply of its vector has to be done by 


invoking the alternative-path algorithm. If the node has akeady finished replying, it 
reclassifies the siblings so that siblings in neighboring trees can be reactivated and 
can be used as possible bridges leading to the source. On the other hand, when a new 
link is established, two nodes on both ends of the link will consider each other as 
siblings and exchange message status in order to resume message broadcasts and/or 
to continue vector-replying. 

Lemma 3.3 Let C be a connected component and S is not in C. If C remains con- 
nected but isolated from the rest of the network sufficiently long, the reply vectors of 
nodes in C will converge to an identical form. 

Proof. We consider the four possible cases. 

Case 1: Suppose that no links are disconnected or newly connected in this connected 
component C for a sufficiently long period of time. If there is only one tree in C, 
then all the reply vectors will reach the root. Since the root can not forward its reply 
vector, it will send it to all of its descendents which will update their own reply vec- 
tors. This means every node in this tree will have an identical reply vector. If there 
are more than one tree, say Tj, T2, ... ,T^, there must exist at least one external 
sibling link between any two adjacent trees. Let T,- and Tj be two adjacent trees. 
Since the roots of and Tj will eventually invoke the alternative-path algorithm, 
there must be some exchanges of reply information between nodes in 7,- and Tj 
through external sibling links. As each node will route the reply vector from its 
external sibling in the adjacent tree to its parent, the reply will eventually reach the 
tree root The root will soon propagate it to all the nodes in the tree as a result of 
the alternative-path algorithm. Hence, we can see that nodes in J,- and Tj will have 
the same reply vectors. Similarly, it can be inferred that all the trees in the same con- 
nected component will have the same reply vector if C is isolated long enough. 


Case 2: Suppose that some links are disconnected but C remains connected. Without 
loss of generality, let L, be the link that is disconnected. There are two cases to be 

Case 2.1: If L,- is a parent-child link, either the child selects a new parent or it pro- 
duces a newly-born tree. In the former case, if the new parent is in the same tree, it 
causes no problem because it is only a switching of a parent-child link. If the new 
parent is not in the same tree (an external sibling is chosen), the split new tree rooted 
at the child is now merged into another tree; the total number of trees in this com- 
ponent remains the same. As in Case 1, the merged tree will eventually share the 
same reply information as a result of exchanging reply vectors. In the latter case, a 
new tree is born, which increments the number of trees in this component by 1. The 
disconnection either invokes the alternative-path algorithm or sibling-reclassify algo- 
rithm. In both cases, the external siblings between the new tree and the old tree (i.e., 
its previous parent's tree) are declared and exchange reply information. Similar argu- 
ments as in Case 1 can be applied for the rest of this case. 

Case 2.2: If L,- is a sibling link, then it does not affect the convergence. The reason 
is as follows. If Li is internal, the disconnection only drops out a useless link in the 
tree. If L,- is external, as C remains connected, even if L,- is the last external sibling 
link connecting the two trees, the nodes in both trees can exchange reply information 
via other trees between them. The rest of this case is similar to Case 1. 

Case 3: Suppose that some links are established. Since a new link will only make 
nodes on both sides exchange broadcast and reply information; it does not block 
vector-replying. On the contrary, it may speed up the process of the convergence. 


Case 4: Suppose that some links are disconnected and some are connected. Similar 
reasoning can be applied as in Case 2 and 3. □ 

Lemma 3.4 Each reply vector can be delivered to S through the primary path or an 
alternative-path in time that is finite and bounded. 

Proof. Let V be a node sending a reply message to 5 . If there is no link disconnec- 
tion, from Lemma 3.1 we know that 5 can receive v 's reply message. If some links 
in the primary path from v to 5 are disconnected, two possible cases are considered: 

Case 1: assume that v and S are in the same connected component C for sufficiently 
long. Let 5 be in tree 7, and v be in tree T^. If 7, and are adjacent to each 
other, then v can be routed to S via a sibling link between the two trees. If and 
are not adjacent, there must exist some trees Tj, T2, ... Ji, such that is adja- 
cent to Ti, Ti is adjacent to Tj, ... , and T,- is adjacent to T^. Without loss of gen- 
erality, we assume that no new tree is bom in the meantime. From the alternative- 
path algorithm, we know that there exists an alternative-path between 7, and via 
Tj, ... , Ti. Since each vector can be sent over a link in finite and bounded time, v's 
reply vector can be delivered to 5 in time that is finite and bounded. 


Case 2: assume that v and S are not in the same connected component. Let be 
the set of nodes which have received v's reply information. As the reply process 
proceeds, IC^ I is increasing like a step-function. From Lemma 3.3, it can be seen 
that v's reply information will not stop propagating as long as some nodes in the 
same connected component have yet to receive it. If ICy I stops increasing tem- 
porarily, then there must exist an edge-cut between Cy and V \ C^. Since the net- 
work is eventually connected, there must exist a node u and a node w such that u e 
Cv and we V \C^ and a new link [u, w] is established in finite and bounded time. 
Therefore, due to the reply information exchanges between u and w, C^ is now 


expanded to be Q u w . As the total number of nodes in this network is fixed, S 
will get v's reply message in time that is finite and bounded. Similarly, we can con- 
clude that every reply vector will be received by S in finite and bounded time. □ 

When the source S receives all of the reply vectors for message M (i.e., the 
final version of the vector has value 1 in every bit in the vector), it attaches a flush- 
out message in its next broadcast message ready to be sent, which directs each node 
receiving this message to discard M. A flush-out message is considered as a part of 
the broadcast message to which it is appended and it will be discarded along with the 
broadcast message later. 

It is important to note that the protocol does not imply that each broadcast mes- 
sage can not be broadcast until all the reply information of its predecessor has been 
received by S. As indicated earlier, each node can treat each broadcast message 
independently. If more than one broadcast message from the same source are allowed 
to coexist in the network, what has to be done is to allow each broadcast has its own 
"family" in each node. As the network is subject to frequent link reconfigurations and 
propagation delay in each link is changing, a broadcast message should not take 
advantage of the direct tree established by its predecessor by broadcasting along the 
tree links. Flooding is still needed if reliability and minimum broadcast delay are to 
be achieved. 

Theorem 3.2 Every broadcast message will be discarded in finite and bounded time. 

Proof. From Theorem 3.1 and Lemma 3.4, we know that the source S will detect 
that every broadcast message has received a broadcast message in finite and bounded 
time. Hence, a flush-out message can be broadcast to clear that message, which 
implies that the stay of each broadcast message in each node is finite and bounded. □ 


Corollary 3.1 If the generation rate of broadcast messages is bounded, then the 
buffer space needed in each node is bounded. □ 

3.3.2 Message Buffer Management 

Message buffer management is one of the major issues in communication proto- 
col designs. As each node has only limited memory space, this issue becomes criti- 
cal because an ineffective management policy can result in a great waste of buffer 
space or, even worse, an unreliable protocol. Theoretically, the broadcast algorithm 
above can still work as a fast and reliable communication protocol without the reply 
phase, provided that each node has a sufficiently large memory to buffer broadcast 
messages and is capable of determining the right time to discard the messages. 
Without a reply mechanism, there are two possible ways of managing message 

(1) Static policy - Each message received will remain in the buffer only a fixed 
period of time and then be eUminated from the buffer. This fixed interval is 
determined by an upper bound which is long enough to ensure that a message 
has been received by every node. The weakness of the method is two-fold: First, 
it is very difficult to accurately estimate the upper time limit for a broadcast 
message to stay in the buffer and the limit tends to be over-estimated. Second, 
by fixing the time a message has to stay, it is possible that too many obsolete 
messages, i.e., messages already received by every node, stay in the buffer 
memory for too long. This will cause a significant waste of the buffer space. 

(2) Forced -out policy - Broadcast messages are discarded only when the buffer is 
full and messages are discarded in a first-in-first-out order. This method has the 
same weakness as the static policy in wasting buffer space. In addition, it also 
causes great complexity in allocating message buffers since broadcast messages 
are not the only messages a node needs to handle. If a broadcast message buffer 


is too small, a broadcast message may be eliminated while some nodes are still 
waiting for that message. This violates the reliability of broadcasting. If too 
large a broadcast message buffer is allocated, again, memory space is substan- 
tially wasted. 

The two-phase broadcast protocol provides a dynamic buffer management 
scheme. This management scheme has two major advantages: First, the reliability of 
the protocol is achieved in a more flexible manner. Since each message will be dis- 
carded only if the corresponding flush-out message is received, a broadcast message 
is allowed to stay in the buffer as long as it is needed without being accidentally 
eliminated from the buffer. If few link reconfigurations occur during the broadcast 
and reply phase, a message can be discarded quickly. If more link disconnections and 
connections occur, a message is allowed to stay longer. In short, the protocol's 
dynamic management policy provides a significantly better buffer utilization than the 
two policies above. Second, Corollary 1 implies that each node can estimate the 
upper bound of the buffer size needed to store broadcast messages by estimating the 
maximum time interval a message will stay. With this time bound closely approxi- 
mated, the buffer allocation problem can be substantially simplified. 

3.3.3 The Node Algorithm 

In this section we present the two-phase broadcast protocol in the form of a 
node algorithm, consisting of a number of event-driven procedures. A node reacts to 
each condition/event by running the corresponding procedure. Before we formally 
describe the node algorithm, some terms and variables used in the algorithm are 
defined as follows. 

a. M : a broadcast message from the source S . 

b. ACK-parent(i): the parent-notifying message from node / to its parent. 


c. ACK-sibling(/ ): the sibling-notifying message from node i to its sibling. 

d. DCL-exsib(/): the message from node i to declare external-sibling. 

e. ACK-vector(M,0 : the current acknowledgment vector of a broadcast message 
M at node /. It is an n-bit binary vector and ACK-vector(M,/ ) 3 ACK- 
vector(Mj) if ACK-vector(M,i)[^] > ACK-vector(M,y)[;:] for all k, where 1 < 
k < n. ACK-vector(M,/)[A:] is 1 if node / "knows" that node k has received M, 
0 if otherwise. 

f. UNKi : neighbors of node / and whose relationships to / are still unknown (e.g. 
they could be either siblings or children of / ); UNKi is initialized to be NEIBi . 

g. E-SB, : the set of /'s external siblings. 

h. DESi : the set of i 's descendents. 

i. SB -DONE : the set of the external siblings already been reached by the same 
reply vector from ancestors. 

j. BUFi : message buffer of node / . 

k. FLUSH-OUT(M): a message indicating that M can be discarded from the buffer 

1. LINK -DOWN j i a message indicating that the link to node j is disconnected, 
m. LINK -UP j -. a message indicating that the link to node j is established, 
n. \: difference, a set operation. 

o. or: a bitwise logical-or operation defined on n-bit ACK-vectors. 

The Basic Node Algorithm for node / : 
For a message M from j: 

a.1) Mark ; RECEIVED for M; 

a.2) UNKi <- UNKi \ {; ); 

a.3) if M is in BUFi,SBi <- u ; and send ACK-sibling(i ) to 

a.4) else begin /* is the first node sending M to i */ 

a.5) put M into fif/F, ; 

a.6) ACK-vector(M^ )[j] <- 1; 

a.7) send ACK-parentO) to j; 

a.8) PRi<r-j; 

a.9) UNKi <r- NEIBi \ PRi ; 

a. 10) for every k g NEIBi \ O ). 

a.11) sendMto*; 

a.12) end 

a. 13) if every k g NEIBi is RECEIVED for M and UNKi = 0, 

a. 14) call reply_ack; 
For an ACK-parentO ): 

b. l) CRi <r- CRi ^ U); 
b.2) UNKi <^ UNKi \{j]; 

For an ACK-siblingO ): 

C.1) Sfi, SB, u {;■); 

C.2) UNKi i- UNKi \[j]; 
For a UNK-DOWNj message: 

dl) NEIBi i- NEIBi \{j]; 

d.2) if y G UNKi , UNKi <- UNKi \{j}; 

d.3) itjeSBi,SBi<r-SBi\[j}; 

A4) if 7 G £ -SBi , E <- E -SB, \ {y ) ; 

A5) if 7 G CRi, 

d.6) begin 

A7) CRi <^ CRi\[j]; 

d.8) if / is not REPUED and (C/?, = 0 or every k g C/?i is REPUED), 

d.9) call reply ack; 

d.lO) end 

d.11) if =PRi and SB, 0, 

d.12) begin 

d.13) choose a new parent k from SB, ; 

d. 14) send ACK-parent(i ) to i ; 

d.15) end 

d.16) else if ; = P/?, and i is REPUED , 

d.17) begin /* starts sibling-reclassifying */ 

d.18) create Z)£S,; 

d.19) sendDES, to every k e CRi ; 

d. 20) end 
For a LINK -UP j message: 

e. l) NEIBi <^ NEIBi u {7); 
e.2) SB, <- SB, u {7 ); 

e.3) J exchanges broadcast message status with 7 ; 

e.4) for every M not received by 7 , 

e.5) send M to 7; 

e.6) i exchanges reply message status with 7 ; 

e. 7) perform steps f.l-f.7 for the reply messages from 7; 
For an ACK-vector(M,7 ) where 7 g CRi or 7 g SB, : 

I* reply message for M arrives from a child node or a sibling node of / */ 

f. l) if ACK-vector(M,0 □ ACK-vector(M,7 ), then ignore ACK-vector(MJ); 
f.2) else begin 

f.3) ACK-vector(M ) <- ACK-vector(M,z ) or ACK-vector(M J ); 

f.4) mark 7 REPLIED for M; 

f.5) if every k g CRi is REPLIED , 

f.6) call reply_ack; 

f. 7) end 

For an ACK-vector(M,7) where 7 g PB, : 

I* reply message for M arrives from the parent node of i */ 

g. l) if £ -SB, is not initialized, 

g.2) E-SBi <- SB, \ {k I ACK-vector(M,7)[A:] = 1); 

g.3) ACK-vector(M,/' ) <- ACK-vector(M,/" ) or ACK-vector(M,7 ); 

g.4) if £ -SB; * 0, then for every k g £-SB, , 


g.5) send ACK-vector(M,i) to k; 

g.6) for every / g CRi 

g. 7) send ACK-vector(M,/) to / ; 
FoT&DESj message; 

h. l) for every k e 

h.2) ifk not 6 E-SBi and k not e DESj, 

h.3) begin 

h.4) E-SBi <r- E-SBi u {i); 

h.5) send DCL-exsib(j )tok; 

h.6) end 

h.7) for every / g CRi 

h. 8) s&nd DESj to /; 
For a DCL-exsib message from 

i. l) E-SBi <- E-SBi u 

i.2) send ACK-vector(M^' ) to ); 
For a FLUSH-OUT(M) message from j: 

j.l) eliminate M from fiC/F, ; /* this message came along with a broadcast message */ 

Procedure reply_ack for i : 

k. 1) if PRi € NEIBi , send ACK-vector(M,i ) to PRi , 

k.2) else begin f* altemative-path search */ 

k.3) if SB -DONE is not initialized, 

k.4) SB -DONE <r- 0; 

k.5) if E-SBi ^0, 

k.6) begin 

k.7) SB -DONE <- SB -DONE u E -55, ; 

k.8) for every k e E \ SB -DONE 

k.9) send ACK-vector(M,j ) and SB -DONE to k ; 

k.10) delete 5B -DONE from / ; 

k.11) end 

k.12) for every k e CRi 

k. 1 3) send ACK- vector(M,i ) to jfc ; 

k.14) end 

Some notes on the basic node algorithm: (a) In e.3 and e.6, when nodes 
exchange the status of messages, only the headers of the messages are exchanged so 
that the communication cost can be reduced, (b) A reply message can be rejected by 
the parent or a sibling if it contains no additional reply information, but a reply vec- 
tor from a parent will not be rejected by a child (see algorithm g.*). Consider the fol- 
lowing situation. During an alternative-path search, an external sibling has 
received a reply vector from its sibling. Vg will forward the vector to its parent, 
assuming that additional reply information is contained in that vector. Suppose that a 
child node is allowed to reject a reply vector from its parent. If the vector is eventu- 
ally routed back from the root node as a result of a link disconnection, will reject 


this vector because it has already received the same vector. Consequently, some des- 
cendents of this rejecting node will not receive this reply vector. From Lemma 3.2, 
we know that the alternative-path may never be found, (c) For the source node S , if 
the vectors from its children indicate that all the nodes have received a particular 
broadcast message M (i.e., every bit is 1 in ACK-vector(M ,5 )), S attaches a 
FLUSH-OUT(M) to its next broadcast message. 

3.4 Performance 

3.4.1 Minimum Broadcast Delay 

The broadcast delay of a message M for a single node / is defined to be the 
time delay from the time S starts to broadcast M to the first reception of M at node 
i, denoted by £>,(M). The broadcast delay of M is defined to be Max:[Dj(M)\ j e 
V} Since it is assumed that "zero time" is spent at each node in assumption (3), 
D, (M) is the sum of the propagation delays of M on each communication link in a 
path from 5 to / and/or the time delays of waiting for the connections of some links. 
The delay caused by waiting for a link connection happens when the network is tem- 
porarily partitioned into several disconnected components. 

Theorem 3.3 Minimum broadcast delay is achieved by the basic broadcast protocol. 

Proof . For each node t, t * S,\etP = <S , Vi, ... , v,_i, v,-, t> be the eventually 
connected path through which message M arrives at t using the basic broadcast pro- 
tocol, and each node in P is the parent node of the next node (e.g., v, is the parent 
node of t). We will show by contradiction that the delay for r to receive M is 
minimized if M travels along P . Suppose that P is not the shortest path to deliver M 
and P = <S', V 1, ... , v'y, t> is the shortest path. Let (v'^, v'^^.i) be the first pair of 
nodes such that v ^ is not the parent node of v^^j, and that u is the parent node of 


v'jt+j. Since the parent node is chosen as the first node from which M arrives, there 
must be a shorter path from S via m to v ^+1, and then from v to r. So, P is not 
minimum, a contradiction. Hence, P is the minimum path from S to t for delivering 
M. □ 

Notice that, as each link disconnection results in an exchange of received mes- 
sages on both sides of the link, partitioning of the network does not prevent messages 
from finding the shortest path to a node. Since the protocol deals with networks hav- 
ing arbitrarily changing topologies, it is possible that the order in which messages are 
received at a node is different from the order in which the messages were broadcast. 

3.4.2 Communication Cost 

The communication cost for a broadcast message M is defined to be the number 
of messages exchanged during the broadcast and reply phases of broadcasting M . 
Messages in the networks can be classified as broadcast messages and control mes- 
sages; control messages consist of ACK-parents, ACK-siblings, ACK-vectors, and 

Lemma 3.5 The communication cost in the broadcast phase is 0(\E I). 

Proof . Let Af be a message broadcast from 5 . As the flood of M goes forward, each 
edge is traveled at most twice. The edge between a child and its parent is traveled 
only once by M because each node receiving the first copy of M from its parent will 
not send the same message back to its parent. The number of these parent-child 
edges in a network is (IV 1-1). So, in the worst case, the total number of the copies 
of M exchanged is 2\E I - (IVl-l). In the best case, each edge is traveled exactly 
once, which implies that \E I copies of M are exchanged. The cost for control mes- 
sages is 2\E \-\V\. Therefore, the communication cost is 0(\E I). □ 


The analysis for the cost of control messages in reply phase is more complicated 
due to the fact that the network under which the protocol is performed is subject to 
link reconfigurations. 

Lemma 3.6 The communication cost incurred by a link disconnection is 6> (!£■ I). 

Proof . The disconnection of a sibling link causes no message exchanges. A tree link 
(parent-child link) disconnection causes no problem if the child has a sibling. Other- 
wise, it causes a tree-splitting and invoke either an alternative-path search or a sibling 
reclassification. In the latter case, DCL-exsibs are sent to the "new" external siblings 
which were internal siblings before the link was broken. In the former case, ACK- 
vectors are sent to the tree's external siblings, which may result in subsequent alter- 
native path searches on the other trees. In both cases, the total number of reply vec- 
tors exchanged over tree links is on the order of O (IV I). It remains to be determined 
how many reply vectors are exchanged over the external sibling links which connect 
the affected trees. In the normal case, this number should be much smaller than 
0{\V\). In the worst case, it should not exceed 0(\E\) (in a highly dense network). 
Hence, for each link disconnection, the communication cost incurred is 0{\E I). 

Lemma 3.7 The communication cost incurred by a link connection is 0{\E\). 

Proof . Each link connection will cause exchanges of broadcast status and reply vec- 
tor status. If both sides contain the same reply information, no more messages are 
exchanged. If one side contains new reply information, the other side may invoke 
vector replying, which in turn may result in an alternative-path search. From Lemma 
3.6, we know that the maximum cost induced \% 0{\E I). 

Lemma 3.8 The communication cost in the reply phase is bounded by 0{\V\\E\). 
Proof . In the best case, no parent-child links are disconnected during the reply 


phase; every reply vector can be routed to 5 via a primary path. Thus, the communi- 
cation cost is IV 1-1. Otherwise, there are 2 cases to be considered: 

Case 1: communication cost incurred by link disconnections. During the reply 
phase, every link in the network is labeled either as a parent-child link or a sibling 
link. As shown in Lemma 3.6, a single link disconnection can cause at most 0{\E\) 
message exchanges. With the number of parent-child links bounded by IVl-1, the 
total communication cost induced by link disconnections can not exceed O (IV 1 1£ I). 

Case 2: communication cost incurred by link connections. We consider the following 
two cases, (i) Suppose that a link is connected within a tree. If the two nodes have 
identical reply vectors, it causes no additional cost except for the exchange of reply 
information. If two nodes have different reply vectors, both of them will send their 
newly acquired reply vectors to their parents. Since these two nodes are in the same 
tree, the reply vectors will stop at the first common ancestor of the two nodes. 
Hence, the cost of this connection is bounded by (9(IV I), (ii) Suppose that a link is 
connected between different trees. If both sides have different reply vectors, in the 
worst case it may cause the reply vectors be sent up and then down as a result of 
alternative-path search which is bounded by 0{\E\). Subsequent sibling link connec- 
tions between the two trees will induce no cost in vector replies because the nodes in 
these two trees will have identical reply vectors by Lemma 3.3. Since there are at 
most \V I different trees, the cost of link connections is at most O {\V I \E I), 

Summarizing Case 1 and Case 2, it can be seen that the communication of the 
reply phase is bounded by 0(IV ll£ I). □. 

Theorem 3.4 The communication cost of broadcasting a message is bounded by 

Proof . It follows immediately from Lemma 3.5, 3.6, 3.7 and 3.8. □ 


It is important to notice that, even at the extreme case where the network topo- 
logies are changing so rapidly that the number of link disconnections and connections 
is in the order of 0(l£: I) (i.e., almost every link is reconfigured), the upper bound 
0(\V\\E\) still holds (i.e., not 0(l£:l^)) because of the convergence of the reply 
information. From a practical point of view, in most of the mobile communication 
networks each node has only a fixed number of communication ports which are used 
to send and receive data. Therefore, the worst-case communication cost of the proto- 
col is approximately 0(\V\^). 

3.4.3 Speedup in Reply Phase 

In this section, we discuss a possible way of speeduping the reply phase. As 
each node tries to send the reply vector of a received broadcast message to S during 
the reply phase of the algorithm, the performance of the reply process depends on the 
number of link disconnections in the network. The number of link changes of a net- 
work in tum depends on the reconfiguration algorithms. For networks with good 
reconfiguration algorithms, very few links will be disconnected, which ensures that 
reply vectors can be quickly forwarded to a source node. Furthermore, the reply 
phase may be shorter than the broadcast phase if fewer links are disconnected during 
the reply phase because reply vectors are normally short packets compared to broad- 
cast messages which may be arbitrarily long. 

For a network that is undergoing rapid topological changes, Lemma 3.8 indi- 
cates that the reply phase may incur a substantial overhead in communication cost. 
This suggests that an improvement in the protocol is desirable, particularly in the 
speeding up of the reply phase. 

A key observation to such an improvement is that, during the reply phase, the 
first node which detects an "all-1" reply vector may not be S. Recall that each node 
which receives the first copy of a broadcast message will put 1 into its own position 


in the ACK-reply (see line a.l of the node algorithm). Thus, we can modify the algo- 
rithm by allowing any node (instead of only S) to issue a FLUSH-OUT message to 
remove a broadcast message M if it detects the fact that the message has been 
received by all the nodes. If node i is the first node detecting the fact, it discards M 
from its buffer memory and attaches a FLUSH-OUT(M) message to its own message 
to be broadcast. If / has no immediate messages of its own to broadcast, it can 
either pass the FLUSH-OUT(A/ ) independently or pass it to its parent. Its parent will 
do likewise. When the FLUSH-OUT(A/ ) reaches the root of a tree, it will be broad- 
cast independently. This modification requires that an ACK-vector is appended to a 
broadcast message at the beginning during the broadcast phase. When the nodes on 
both sides of a link exchange a broadcast message (observe that they are candidate 
parents of each other), they also update their current ACK-vectors on that message. 

It is difficult to analyze the performance gains by this speedup mechanism. 
However, in Figure 3.2 we show some network topologies in which significant 
speed-ups can be achieved by using the modified protocol. Notice that, in these 
examples, when r's start the reply phase, they immediately detect that all the nodes 
have received the broadcast message delivered along the flooding paths, so they can 
issue FLUSH-OUT messages and thus a high degree of speed-up is obtained. The 
degree of speed-up obtainable for a particular network depends not only on the net- 
work topology but also on the location of the source node; some nodes are better 
able to take advantage of the above modified algorithm than others. 


network edges flooding paths 

Figure 3.2 Some Examples of Speed-up 

1 ST" 


4.1 Background 

The global survey problem (GSP) studied in this paper can be briefly specified 
as follows: Given a general network G = (V,E) whose topological information is not 
available to any node in the network, and the network is vulnerable to unexpected 
link failures but remains connected, a global survey is conducted by a node v e V; 
during the survey, a question Q, which allows a short answer (e.g., a message of 
constant length) respond, is broadcast to every one in the networks, how a distributed 
algorithm can be designed to complete the survey (the completion of a survey 
requires the answer from every node to be received in finite time) ? A simple 
instance of such survey that requires binary answers is voting; that is, all nodes in the 
network vote to decide a particular issue such as whether to allow a new node to join 
this network. Alternatively, we may consider the problem as that of global informa- 
tion collecting in the sense that a node in a dynamic network may occasionally wish 
to collect some specific information from every one. It is also closedly related to 
other research issues such as distributed consensus. 

Designing a distributed algorithm for the GSP normally requires two essential 
operations: broadcast and reply. A survey question needs to be broadcast to all the 
nodes in the network before the answers to the question can be replied by the nodes. 
In this sense, the GSP can be considered as an extension of network broadcasting 
which has been extensively studied for more than a decade [3,10,11,14,25,33,35]. To 
the best of our knowledge, no previous work has addressed the GSP from the same 
perspective. A similar work was done by Baratz, Gopal and Segall [5] on fault- 



tolerant queries in networks. But the technique used in [5] is relatively straightfor- 
ward and no optimization was attempted to reduce the message complexity, which is 
the primary concern of this work. 

Like many other disdibuted computing problems, the performance measurement 
of an algorithm for the GSP involves time-message complexity trade-off. An efficient 
algorithm generally has higher message complexity while algorithms with low mes- 
sage complexities tend to take longer time to terminate. In this paper, we present 
lower bounds for the GSP on both time and message complexities. We show that 
Q(IVl) is a time-complexity lower bound and Q(inin(IV l,^)IV l+lE I) is a lower 
bound for message complexity. It is observed that the complexities on obtaining 
these two bounds are asymmetric. While there exists a straightforward algorithm to 
achieve a time complexity lower bounds, a message-optimal algorithm is complicated 
to design. Our main effort is to achieve the latter. Based on depth-first-search and 
multiple-tree traversal, we present a lower-bound algorithm in terms of its message 
complexity. The significance of the algorithm is that it achieves a message- 
complexity lower bound regardless of the number of link failures as long as the net- 
work remains connected. That is, even if the number of hnk failures during the sur- 
vey is as large as the message complexity of the algorithm is still bounded 
by O0V\\ 

4.2 The Model and Motivation 
We consider this problem in networks in which the following things hold: (i) 
The network is subject to unpredictable topological changes. Consequentiy, it is 
assumed that for each node, the topological information is restricted to 1- 
neighborhood (i.e., each node knows only to whom it is connected). This condition 
is the same as what has generally been assumed in other distributed computing prob- 
lems such as broadcasts in dynamic networks [3,35], constructing MSTs and leader 


elections [4,16,29]. (ii) Each node has a unique id. which is available to every one; 
In this way, a node conducting a survey will be able to tell whether every reply has 
been received. It is also assumed that there exists a canonical order among these IDs. 
A straightforward implementation of such ordering would be ranking the nodes 
according to the lexical order of their IDs. (iii) The time to send a message over a 
communication link is arbitrary but finite, (iv) During a survey, some links may fail 
but nodes will not, which ensures tiiat every node in the network will eventually 
reply. Despite the link failures, the network remains a connected component. A 
failed link is eliminated from the network and is not usable for the rest of the pro- 
cess. The rationale behind this is that, considering the communication delays in gen- 
eral computer networks, the time required to recover a typical link failure is unpro- 
portionally long. 

The performance of a distributed GSP algorithm is measured according to two 
major criteria: time and communication complexities. The time complexity is the 
total time to finish a survey task and as commonly assumed, the exchange of a mes- 
sage over a link takes (9(1) time. There are basically two ways to define communi- 
cation complexity. One is defined in terms of bit complexity. That is, the total net- 
work overhead is calculated according to the total number of bits transmitted over 
links in the network as a result of the algorithm. This definition is generally accept- 
able in many networks. However, from the perspective of a node processor, there are 
some overheads associated with each message transmitting, which are largely 
independent of the message length (e.g., channel allocation, routing decision, com- 
munication process generation etc.) Under these circumstances, the communication 
complexity of an algorithm may be defined in terms of its message complexity, which 
is the total number of message exchanges over the links in the network. 

In some distributed computing problems, it is possible to bound the lengths of 
messages generated by an algorithm to a constant. Unfortunately, for GSP each node 


must reply with its ID. as well as its answer, otherwise the conducting node will not 
be able to tell which answer belong to which node and thus unable to determine 
whether a survey is complete. This implies that each message requires at least 
O (log IV I) bits. 

As each answer replied to a question is short and of constant length, sending 
these messages on individual basis would flood the network. Therefore, the idea of 
luny}ing is considered a better alternative, which means concatenating many short 
messages into a slightly long but reasonable-length one. Through message lumping 
the message complexity of an algorithm can be significantly reduced. But a straight- 
forward lumping implies the possibility of creating messages of (9(IVlloglVl)-bit 
long. In this model, we manage to avoid the creation of such long messages and 
allow messages to be bounded by O(IVl) bits. We shall prove that no algorithm 
based on message lumping can be designed without using messages less than 0(IV I) 
bits. In other words, 0(lyl) is a lower bound on message size. It is the main focus 
of this research to examine in detail the GSP algorithms based on lumping and their 
performances in terms of message complexity. 

4.3 Lower Bounds 
In this section, we proceed to show that Q(IVl) and Q(niin(IV l,^)IV I) are 
lower bounds on time complexity and message complexity for the GSP respectively. 
For the convenience of discussion, let v be the node that conducts the survey process 
for a question Q. That is, v is the source of the question Q and every node will 
have to reply to v with its answer to Q . 

Theorem 4.2.1 Given k link failures during the survey process where k is arbi- 
trarily large as long as the network remains connected, f2(IVt) is a lower bound on 
time complexity for GSP. 


Proof. Consider the case where the topology of a network is as shown in Fig. 4.1. 
The source node, v, broadcast Q to Ui, M2, ... , W|vl-i- Suppose that when these 
nodes receiving Q are ready to reply, (v, U'2), (v, Uj), ... ,(v, fails. All the 

replies must be forwarded via (v, u{). Thus, given any protocol, it takes 0{\V\-\) 
time for v to receive all the replies. The theorem thus follows. □ 



Fig. 4. 1 An example for lower bound 

It is interesting to observe that as the time complexity is computed with respect to 
the original network, an Q(D ) lower bound is not possible, where D is the diameter 
of the network. This is also shown in Fig. 1 where D = 2 but the time complexity is 

The survey for Q can be divided into two stages: broadcast stage and reply 
stage. As each node is assumed to know only its neighboring nodes, the only reli- 
able way to broadcast the question Q to every node is message flooding. The flood- 
ing technique is very efficient and commonly used in sparse networks. Initially, v 
sends a copy of Q to all its neighbors. Each node, after receiving the first instance 
of message Q , keeps a copy and sends it to all neighboring nodes except the node/s 
from which Q arrived. 

It is easy to verify that the flooding described above is a reliable algorithm to 
broadcast a message for a connected network. Thus a simple algorithm, the two-way 
flooding, can be constructed to obtain a time complexity lower bound for GSP (In [5] 
Baratz etc. use similar but slightly advanced idea to design algorithms for searching 


distributed resources.) This algorithm requires nodes to send their answers back to v 
by flooding. 

Theorem 4.2.2 The two-way flooding algorithm is time-optimal for the GSP. Furth- 
ermore, it achieves minimum delay for the survey process. 

Proof. As the flooding ensures that in the broadcast phase each node has its neigh- 
bors informed of the broadcast message Q , if the network remains connected in spite 
of link failures, for any node u * v, there must exist a simple path from v to « 
along which Q can be delivered to u. Since the network has \V\ nodes, the time for 
M to receive Q is at most \V\-\ units. Similarly, we can show that it takes at most 
\V\-\ units for v to receive the answer of Q from u if the reply is done by flooding. 
Therefore, the two-way flooding is time-optimal. 

Now, we proceed to show that the algorithm achieves minimum delay for GSP. 
For each node f e V^, let P = <v , Vj, vj, ... , v,_i, v,> be the path through which Q 
arrives at t using the flooding, where v,- = t and for all \<j<i, Vj is the parent of 
Vy+j. It can be shown, by contradiction, that the delay for t to receive Q is minim- 
ized if Q travels along P. Suppose that P' = <v, v\, ... , Vy_i, v'j> is a shorter 
path than P to deliver Q from v to r, where v) =t. Let it be the smallest index for 
a link (v V ^^.j) in P' such that v\ is not the parent node of v'^+i, and that w is 
the parent node of v'^^.i. Since the parent node is chosen as the first node from 
which Q arrives, there must be a shorter path from v to v'^+j via w, and then from 
V to t. which implies that P' is not minimum, a contradiction. Hence, P is the 
shortest path. Since a node sends its answer of Q also by flooding, we can see that 
the two-way flooding algorithm can achieve minimum delay. □. 


Clearly, the cost of implementing the efficient two-way flooding is very high in 
terms of its message complexity, considering that the message complexity of the 
algorithm can be as high as O(IVP) in edge-dense networks. In many cir- 
cumstances, the existence of a large amount of messages in the network may result in 
message congestions, which will degrade the overall system performance. In this 
case, an algorithm with low message complexity would be more preferable. 
Nevertheless, the unpredictability of link failures significantly complicate the design 
of algorithms for such a purpose. 

Definition 4.1 Let u be a node in G = (V, E) that received Q. The neighboring 
nodes of w can be classified into three categories: parent, siblings, and children. 
The parent of u is the node from which Q arrived first (ties are broken arbitrarily). 
Conversely, if the parent of u is w , then m is a child node of w . The neighboring 
nodes to which u is their parent form the children of u. The other neighboring 
nodes of u are its siblings. 

Definition 4.2 The p-neighborhood of a node u is defined to be the set of nodes 
that can be reached from u by no more than p-hops, where each hop represents a 
crossing of a communication link. For each node u that has received Q , we denote 
the reply to Q by R(u) which contains at least the answer of m to Q. There exists at 
least one node in the network that is an intended receiver of R (u ). Conversely, if w 
is an intended receiver of u , then w is an intended sender of w. 
Definition 4.3. A distributed algorithm for GSP is called a p-neighborhood proto- 
col, if it satisfies the following axioms: For each node u, (1) its intended receivers 
are within its p-neighborhood; (2) if m is a receiver, it can either send its reply indi- 
vidually or send the concatenated message of its reply and the replies of its senders; 
(3) if all the replies from its intended senders have been received, it must reply in 
finite time; (4) any message arrives after u has replied to its intended receivers 
that contains new reply information should be forwarded by u in finite time. 


Axiom (4) implies that if a node is disconnected from its intended receiver, it will be 
forced to forward its reply to one of its non-intended neighbors. This neighbor, 
receiving an unintended message, must forward it independently as its lumping stage 
has ended. 

It can be observed that p-neighborhood protocols can be most realistically 
implemented when p = I. The reason is simple: when p >2, a. message intended for 
a non-neighboring node should be routed through intermediate nodes. Nevertheless, 
as each node has only local topological information, to reliably deliver messages to 
the intended receivers, local-flooding (i.e., flooding messages within p- 
neighborhood), may be necessary. This will inevitably result in high message com- 
plexity. One alternative is for each node to construct routing paths through local 
information exchanges within its p-neighborhood, which again may also require 
local-flooding. Furthermore, even if these routing paths can be established, link 
failures will certainly complicate such routings. Consequently, we are most interested 
in the protocols of 1 -neighborhood lumping. In the next theorem, we show a mes- 
sage complexity lower bound of p-neighborhood protocols for GSP in networks with 
unpredictable link failures. 

Theorem 4.2.3 Given k link failures during the survey process where k is arbi- 
trarily large but the network remains connected, Q(min(lK l+l£ I) is a lower 
bound on message complexity of 1 -neighborhood protocols for GSP. 
Proof . As flooding is considered the only reliable broadcasting protocol during the 
broadcast stage, the message complexity for broadcasting Q to all the nodes in the 
network is bounded by OOE I). From the axioms of the 1 -neighborhood protocols, 
each node has at least one intended receiver in its adjacent nodes to which a reply of 
Q can be sent. It is easy to show that if a node chooses a child as its intended 
receiver, it is possible that its reply will never be forwarded to the source (e.g., con- 
sider the case that the edge connecting the node and its child is a bridge of the 


network). Therefore, it is clear that for each node, its parent and siblings are the only 
possible candidates for intended receivers as they are the nodes from which Q 
arrived. That is, despite no global topological information is available to any one, the 
flooding of Q implies that there exists at least one path from the parent or a sibling 
to the source. Thus, in the reply phase, these properties give us the following possi- 
ble cases to consider. 

Case 1: (no lumping) a node sends out its answer without waiting for its senders' 
replies. In Fig. 4.2, we can see that it takes ()\4-l) messages to broadcast Q and in 
the worst case there are 1 + 2 + ... + (M-l) messages needed to reply the answers. 
Hence, the number of messages exchanged over links is (IVl-1) + 1 + 2 + ... + (M-l) 
= (IVl-l)(IVl+2)/2 = O(lKl^). This situation occurs even without a single link 

Case 2: (lumping) we consider the case where each node may reply after lumping 
together the replies from its intended senders. 

Case 2.1: a node chooses its parent as its intended receiver. Consider the following 
case shown in Fig. 4.3, V|y| is the common parent of Wj, M2, ... , and Uj^. Suppose 
that all the parent Unks of U2, ... , M^t f^il before the nodes are ready to reply. 
Clearly, v \y \ will start replying to its parent when it receives the reply from u 1 as u 1 
is its remaining connected child now. The replies from a 2' — > have to be 
delivered via m j as it is the only node leading to v . According to the Axioms, in the 
worst case where U2 replies before the answer from M3 arrives, Uj replies before it 
receives M4's answer, and so on, it causes 0(k\V\) messages exchanged. 
Case 2.2: a node chooses its siblings as its intended receivers. Consider the example 
shown in Fig. 4.4. As w,- and M;t_,+i are siblings for all 1 < / < ^, these siblings will 
exchange their replies. However, all the paths leading to v must go through Vj, vj, 
... , v\y\. Similar to Case 2.1, in the worst case the number of message exchanges 
can be (9 (/:1k I). 


Case 2.3: a node chooses its parent and siblings as its intended receivers. Consider 
the case shown in Fig. 4.5. and u4i will collect the reply from u3i for all 1 < / 
^ k before they reply; will collect all the repUes from Uji, U22, - . "2t before it 
replies; and M4 will lump together the replies from u^i for all 1 < / < k. Suppose 
that the links between "3i disconnected before the replies from m, 's can 

be delivered, they must be routed via M4. Since M2I' - ' "2;fe 1- 
neighborhood, The worst case is that their replies are routed individually through u^, 
which results in 0{k\V\) messages being delivered. □ 

As we have point out the difficulty of implementing an algorithm adopting mes- 
sage lumping with neighborhood > 2, nevertheless, we can extend the previous 
theorem to show that widening neighborhood in general does not reduce message 

Theorem 4.2.4 Given that k links fail during the survey process where k is arbi- 
trarily large but the network remains connected, Q(min(IV l-i-l£ I) is a lower 
bound on message complexity of p-neighborhood protocols for GSP with p being an 
arbitrarily large constant. 

Proof . The proof can be extended directly from that of Theorem 4.2.3 by the tech- 
nique of "node padding". Detailed proof is omitted here. The basic idea is to insert 
sufficient number of nodes to push some critical nodes beyond some neighborhoods. 
More specifically, consider the example in Fig. 4.5. By inserting p colums of nodes 
between and M21 where 1 < / < k, the lumpings of U4 does not include replies 
from U21, ... , "2* since they are beyond M4 p-neighborhood. This can force these 
messages being replied individually through M4 in case that the links between U2i, for 
all 1 < /■ < k, and M2 are disconnected. The arguments of other cases can be simi- 
larly constructed. □ 


4.4 Optimizing Message Complexity 
The observation that enlarging a node's lumping neighborhood does not yield a 
better lower-bound result leads us to focus on exploring better routing strategies. 
Without a global picture of the network, each node is forced to make routing deci- 
sion on local basis. The routing decisions, considered in a broad sense, include not 
only message forwarding but also selecting a neighborhood, choosing intended 
receivers, etc. 

Our main effort, in this section, is to present an algorithm which is reliable and 
optimal in terms of message complexity for arbitrarily large number of link failures. 
The underlined strategy of this algorithm is based on the concept of alternative-path 
search. The algorithm belongs to 1 -neighborhood protocol. Within its 1- 
neighborhood, each node has a unique intended receiver for its reply message that is 
its parent node. In the operational sense, the algorithm can be roughly divided into 
two phases: broadcast phase and reply phase. In the temporal sense, there is no 
clear distinction; some nodes may be still in their broadcast phases while the others 
may be already in their reply phases. 

In the broadcast phase, each node performs the flooding procedure [33][35]. The 
message Q broadcast from v initially contains only the question. As this message is 
disseminated across the network, each node receiving an incoming Q will do the fol- 
lowing things: (1) keep a copy, (2) send it to all the neighboring nodes except those 
from which Q arrived. According to Definition 4.1, each node establish a relation- 
ship with each of its neighbors. Considering only the parent-child relationship 
among nodes in the network, a spanning tree is formed with v, the source of Q, 
being the root of the tree. The path connecting a node u to v that consists of only 
tree edges is the primary path of u . The primary path of m is a shortest path along 
which Q arrives. It may not be the unique shortest path from v to m. Under the 
assumption of unpredictable propagation delay of a link, it is not unlikely that the 


primary path of u may change in the next survey. Nevertheless, a primary path does 
provide a useful hint for replying answers (i.e., with limited topological information, 
it is reasonable to choose the primary path to send answers back to v in the reply 

The reply phase is initiated by nodes which satisfy one of the following condi- 
tions: (1) it has no neighbors to which Q needs to be sent or (2) all the links to its 
children are disconnected. In both cases the nodes are called leaves. A leaf node will 
send its answer to its parent immediately. If a node is not a leaf node, it replies with 
a message that is the concatenation of its own reply to Q and the replies from its 
children. The message is then forwarded to its parent. 

Link failures during broadcasting are handled by discarding the link. Failures 
that occur during the reply phase have different implications. Let / be a communica- 
tion link that fails. There are two cases to be considered. If / is a sibling link, no 
action is neeed to be taken. If / is a tree link, proper algorithms are required to 
ensure the reliability of the protocol, which will be discussed in the next two sec- 

Definition 4.4 Let V;, and V/ be the nodes connected by tree link / and v^, is the 
parent of V/. The failure of / makes V/ become the root of a new tree, denoted by 
7/. This new tree is said to be a low tree. In contrast, the "old tree" (i.e., the tree 
that contains V;, ) is called a high tree, denoted by Tf, . 

The birth of a new tree has significant consequences for routing reply messages 
back to the source v . For instance, an altemate-path search is needed if V/ has not 
yet replied with its answer message, because the primary path connecting V; and v 
no long exists. 

Definition 4.5 Two nodes are said to be internal siblings to each other if they are 
in the same tree and connected by a non-tree link; They are external siblings if they 
are not in the same tree but are directly connected. 


4.4.1 The Message Structure 

Before we get into the detail of our protocol, it is necessary to discuss the way 
each message is constructed. The structure of a broadcast message is quite simple; it 
contains the question and the source. Thus it takes <9(loglVl) bits. A reply message 
requires more complex design. The reason is that it will contain not only answers to 
the question but also some routing information for maintaining good routing proper- 

The reply message structure contains mainly four parts: i) answers to the survey 
problem; ii) reply-record which consists of the nodes that provide these answers; iii) 
descendent-record which, created by the root of a tree, is the set of node within the 
tree; iv) visited- node-record which is the set of nodes that have been visited; v) 
visited-tree-record consisting of trees, presented as a set of nodes, that have been par- 
tially visited, i.e., some of the nodes in the trees may yet to be explored. The pur- 
pose of reply-record is to keep track of the nodes that completed their replies. A 
descendent-record is used to determine whether a sibUng is internal or external. The 
purposes of visited-node-record and visited-tree-record will be apparent in the coming 

Except for the answers, which are of constant length each , all the three records 
are represented as bit-vectors. Each node is associated with a position in the records. 
These positions are arranged in a canonical order so that each node knows which 
position belongs to which node. A bit 1 is set for a node in its corresponding position 
at the reply-record if its answer is contained in the message; otherwise it remains 0. 
In order to enforce a one-to-one correspondence between answers and reply-record, 
the answers are arranged in the same order as the reply-record. For example, the n th 
answer is associated with the n th bit that has been set to 1 . For descendent-record bit 
1 represents "is-in". Similarly, for the rest of two records, bit 1 denotes "visited". 
The size of each reply message structured in this fashion is (9 (IV I) bits. 


4.4.2 Alternative Path Search 

If none of the failures occurs at tree links, all the replies are ensured to be for- 
warded to V via tree links. However, this is not always the case. A tree link failure 
imposes great routing difficulty on parent as well as child. In the next two sections, 
we show how a failure is properly handled if the failed link is a tree edge. Two 
different processes will be invoked when a new tree 7/ is splitting from an old tree 
T^. Each process is initiated by one of the two nodes connected via the tree link 
whose failure results in a tree splitting. Let us call the algorithm executed by the 
child a low-tree algorithm and the one undertaken by the parent the high-tree algo- 
rithm. Low-tree algorithm 

The low-tree algorithm is the core of the alternative path search. The main task 
of this algorithm is to route messages from nodes in the low tree to v . Since the 
link between v, and v^, failed, messages from 7/ to v must be routed via non-tree 
links. One way to achieve this is by flooding at , which is similar to the algorithm 
proposed by Baratz, Gopal and Segall [5]. The advantage is its simplicity, but its 
message complexity is not optimal. Instead of flooding at v^, we provide an 
altemative-path-search algorithm based on depth-first-search and loop-free routing. 
The algorithm is initiated at V/ . 

In this algorithm, V/ will try to find a sibling first. If it succeeds, the reply from 
V/ is forwarded to the sibling. Otherwise, the message is sent back to one of its chil- 
dren. A message that is sent from a parent to a child is called a backtracked mes- 

A child node receiving a backtracked message from its parent attempts to send 
this message out of the low tree by identifying an external sibling. This is crucial to 
the low-tree algorithm in terms of optimizing the message complexity since passing a 


message to an internal sibling does not further the goal of forwarding the backtracked 
message to the source as the message will remain in the same tree. 

There is a static way of discerning external siblings from internal siblings. 
From Definition 4.1, we can see that two siblings are internal if their least-common- 
ancestor is in the same tree as they are, otherwise they are external siblings. Con- 
sider an algorithm as follows: During the broadcast phase, each node broadcasts Q 
that is appended with its node id. By exchanging its broadcast message with a 
neighboring node, a node can identify whether the neighbor is an external sibling by 
checking the ancestor-record of the neighbor. The advantage is that it can be done 
off-line. Nevertheless, implementing such procedure may generate, in the worst case, 
broadcast messages that are as long as O (IV lloglK I) bits due to the concatenation of 
node IDs. 

In our algorithm, external siblings are identified dynamically. The method is to 
allow the root of a tree to pass the necessary information in the form of a descendent 
record which contains all the node IDs. of the tree. A node can extract the 
descendent-record from the backtracked message it received and then the decision is 
straightforward since a sibling not in the record must be external. The low-tree algo- 
rithm consists of three major components which interact with each other. 

A. The depth-first-search. In this procedure, the backtracked message is routed 
according to the following precedence: external siblings, child nodes, and the parent. 
That is, a node receiving a backtracked message will always try to find an external 
sibling. If it succeeds, the message is forwarded to the sibling. If the message is 
eventually returned by the sibling (i.e., the sibling fails to find an alternative path to 
v) or the link connecting it to the sibling fails (thus, it is not certain if the routing 
can succeeds via this sibling), another external sibling is tried. If it has no external 
sibling or all the external siblings have been tried and no alternative path is found, 
the backtracked message is sent to one of its children if it has any; if it does not have 


any child node, the message is sent back to its parent. If that child fails its search 
and returns the message, it tries another child until all the children have been 
explored. In this case, the message is sent back to its parent. By doing this, either the 
message will eventually be sent out of this low tree or all the external siblings 
needed to be explored have been explored. If it is the latter case, the backtracked 
message is returned to the external sibling from which it was delivered to this low 

It is worth mentioning that in order to avoid the repetition of fruitless sibling 
exploration, each external sibling that has returned a backtracked message is deac- 
tivated and will not being explored for the routing of subsequent messages unless it 
is reactivated. 

B. Loop-free routing and new-tree broadcasting. It is clear that during the 
search of an alternative path, an external sibling should not be visited by the same 
backtracked message more than once. That is, the path that a backtracked message 
traverses should be loop-free. To achieve this, each reply message is associated with 
a visited- node-record. Each time a reply message is received by a node, the id. of 
the node is stamped to the visited-node-record. Thus, an external sibling visited 
before by the message can be excluded from being explored. 

In addition, we see some practical reasons for the algorithm to keep loop-free 
the routing of a backtracked message in the "tree level" as well as in the node level. 
In other words, a message is not to be sent to those trees which have been visited by 
the same message earlier even though the trees contain nodes that have not been 
visited by the message. 

There are two major reasons for loop-free traversal in the tree level: First, for 
reducing the number of message exchanges as a non-loop-free traversal will inevit- 
ably cause more nodes to be visited. Second, for good management of routing paths; 
a non-loop-free traversal is difficult to manage as it may cause problems when 


messages start backtracking, especially when a traversal path is disconnected by link 
failures. For such a purpose, a visited-tree-record is also required for each reply mes- 
sage. When a root of a tree receives a message, it stamps the visited-tree-record with 
its tree information. Therefore, an external sibling in the visited-tree-record is avoided 
fmm being explored by the message. There is, however, some inherent complexity in 
maintaining a loop-free traversal in the alternate-path search on dynamic and faulty 
networks. The following example illustrates this. 

Example 1. Consider that tree receives a message from one of its external 
siblings in its neighboring tree, denoted by n^, and that consists of two parts, Tn 
and Ti2y connected by / which is the only link connecting T^^ and T^j- An 
alternative-path search proceeds in such a way that message visits trees T2, T^, 
... ,Tj_i, and Tj, one after another as shown in Fig. 4.7. Now, suppose that Tj has 
no external siblings other than those in T■^ 2■ Since the traversal of the backtracked 
message should be loop-free at tree level, it will not be sent to 2- Therefore, 
backtracks to r^.j, Tj_2 and so on. In the meantime, link / fails, which results 
in the split of T12 from as a new tree. Consequently, when is eventually 
returned to Ti, T-^ i will return to rig. Hence, 7i 2 is not explored by the alterna- 
tive path search. If T12 happens to be the only tree via which can be sent to v, 
the search will fail to find the only path leading to v. This makes the protocol 
unreliable. Therefore, the situation must be avoided. 



Fig. 4.7 An example 

A reasonable way to prevent this is as follows: Once a new tree is bom (split 
from the old tree), this information is broadcast to all of its neighboring trees. This 
ensures that if these neighboring trees have any backtracked messages to forward, the 
messages can be sent to this new tree. If a backtracked message can be delivered to 
V via the newly-bom tree, a reactivation message is created. A reactivation message 
is used to inform those external siblings that the external siblings in the newly-bom 
tree can be re-explored for further routings. A reactivation message is sent along the 
path via which a message backtracked. When a node receives a reactivation message 
from a deactivated extemal sibling, it will reactivate this sibling. 

The algorithm to broadcast a tree-born message from the newly-split tree is 
designed by the same principle of the altemative path search of a backtracked mes- 
sage. That is, the tree-bom message initiated by the root traverses in the tree in a 
depth-first-search manner. When a child receives the message from its parent, it sends 
a copy of the message to all of its external siblings which have not yet received the 
message. To achieve this, each time a message is sent to an extemal sibling, the id. 


of the sibling is kept in the visited-node-record of the message, which is similar to 
the alternative path search of a backtracked message. By doing so, the external 
sibling can be exempted from receiving the same tree-bom message more than once 
from nodes in the new tree. On the other hand, once a node receives the message of 
a new tree information, the message is forward to its root. After being informed of 
this information, the root will broadcast this message to all its descendents via tree 

C. Sibling reclassification. The third part of the low-tree algorithm is to maintain 
the correct relationship between siblings. When the low tree is disconnected from the 
high tree, the nodes in the high tree that are the siblings of the low tree must be "re- 
classified" as external siblings since they are no long in the same tree. The root of 
the low tree must broadcast a reclassification message to its descendents. This is 
done by sending such message, which contains the IDs. of the descendents, along the 
tree links. Each node receiving the message will redefine its sibling relationship 
accordingly. Those "newly-bom" external sibUngs of the low tree will subsequentiy 
be explored by the alternative path search. Note that there is no need to send a mes- 
sage across the links between the low tree and the high tree as the reclassification 
action will be taken by the high tree as well. High-tree algorithm 

We now consider the algorithm for V;, in the case where loses one of its chil- 
dren because of link failure. Tasks need to be handled are included in the following 

A. Resumption of alternative-path search. If Vf, is currently conducting an 
altemative-path search on this disconnected branch, Vf^ will abort the search on the 
failed branch and resume the search on the other child branches. This may result in 
two altemative-path search processes concurrently exist, one in the low tree and 


another in the high tree. The necessity of such action is clear since v^, is unable to 
know whether the disconnected search in the low tree will succeed or not. 

B. Sibling reclassification. Similar steps need to be taken by the high tree to 
readjust the status of its siblings. This is done by sending a message (i.e., a 
descendent-record which consists of nodes in the low tree) from v^, along tree edges 
to the root of the high tree. Once the root of the high tree receives such a message, 
it will broadcast a reclassification message, also containing nodes in the low tree, 
along the tree links. A node receiving a reclassification message will consider those 
siblings in the low tree external. 

C. Exploration of new siblings. As is not the root of Tf, but a leaf of the 
tree, it is possible that when the root of 7^ receives the message sent by V;, , which 
indicates a link failure, an altemative-path search (invoked by the root of Tf^ as part 
of its low-tree algorithm) may be already undergoing in Tf, . If this is the case, these 
new external siblings, previously excluded because they were internal siblings, should 
be explored for routing backtracked messages in T/,. It is not unlikely that pathes 
from the high tree to v may go through these new external siblings of Tf, . 

Notes: i) During the reply phase and in both of the low-tree and the high-tree 
algorithms, a message from a child or a sibling may arrive that contains no additional 
reply information. This message will be rejected, ii) Each backtracked message has 
its own alternative-path search. Thus, it is not unlikely that backtracked messages in 
two adjacent trees may explore each other. Consequently, there may exist more than 
one backtracked message in a tree. In this case, each backtracked message, which can 
collect as many reply messages along its altemative-path search by merging the bit- 
vectors of the reply-records encountered into its own, will proceed independently, 
iii) Also note that the low-tree algorithm and high-tree algorithm are the necessary 
steps to be performed from a low tree and high tree point of views. These algo- 
rithms are separately discussed mainly for the convenience of illustrating the whole 


protocol. In fact, it is possible that in a tree, both algorithms may be invoked con- 
currently by different nodes. In this case, these algorithms will proceed indepen- 
dently in the tree. 

4.4.3 Validation of the Protocol 

Now, we shall show that the protocol is reliable. The basic idea of the proof is 
to show that no external siblings ever needed to be explored are not visited during 
the alternative-path search. That is to show that the alternative-path search of the 
low-tree algorithm is exhaustive. 

Lemma 4.3.1 If each tree link is not disconnected before the child node finishes 
replying, then every reply message can be sent to v through the primary path. 

Proof . If a tree link is not disconnected before the child node finishes replying its 
vector, then each node can reply to its parent via the tree link. It immediately follows 
that each reply message will be forwarded to v through a path consisting of only tree 
links, which is the primary path. □ 

Lemma 4.3.2 The alternative-path search is exhaustive. 

Proof . Let 7,- be a tree that has a backtracked message to send out of this tree. 
Observe that there are three groups of external siblings of T, that will not be 
explored by when an alternative-path search is conducted in 7,- . 

The first group of external siblings are those contained in the visited-node- 
record of Af^, since they are akeady been explored. 

The second group of external siblings are those contained in the visited-tree- 
record of B but not in the visited-node-record of B . As discussed in the low-tree 
algorithm, this is to ensure that a loop-free traversal for a backtracked message can 
be maintained. However, these external siblings will be explored in either of the two 



conditions: 1) Af^, is eventually returned to the trees that contain such external 
siblings since these siblings have not yet been explored by these trees (they are not in 
the visited-node-record of Mf,). 2) The trees that contain these external siblings 
separate from their original trees due to link failures. This will force the newly-born 
tree to broadcast a tree-bom message to 7,-, which will cause 7,- to send to these 
external siblings. Therefore, eventually these external siblings will be explored by 
M{, if they ever need to be. 

The last group of external sibUngs are those "originally internal" siblings of 7,-. 
These siblings are contained in a tree splitting from 7,- and thus become external 
siblings of 7,-. And these siblings will be explored by once a reclassification 
message arrives at the root of 7, as illustrated in the high-tree algorithm. 

From the above cases discussed, it is clear that no external siblings are ignored 
by the altemative-path search if they are in the paths leading to the source v . □ 

Theorem 4.3.1 The protocol guarantees that every reply message will be delivered 
to V. 

Proof . Let message be the answer of node u to the question Q . If no link in 
the primary path from m to v fails, from Lemma 4.3.1, we know that can be sent 
to V. 

Suppose that the primary path from a to v is disconnected during the reply 
phase. Let v ^ be the first node encountered by in the primary path such that v i 
is disconnected from its parent. According to the low-tree algorithm, v j is the root of 
a tree 7i and an altemative-path search will be initiated when arrives at Vj. From 
Lemma 4.3.2, we know that will exhaustively explore all extemal siblings in the 
neighboring trees that are needed to be explored. Since 7i is not isolated from the 
rest of the network, there must exist a tree, say T2 rooted at V2 such that can be 


delivered to T2. If V2 is not v, another alternative-path search process will be ini- 
tiated upon the arrival of at V2. Consequently, the reply of the message may 
incur a number of alternative-path search processes in trees. 

We show, by contradiction, that the alternative-path search of will eventu- 
ally succeed. Suppose that there exists a path from Vj to v and the alternative-path 
search of terminates without finding it. Since the network remains connected 
despite of link failures, there must exist two adjacent trees, 7,- and Tj , such that 7,- 
has a path to v but unexplored by and that Tj is explored by . The fact that 
r,- is not explored by from Tj gives us two possible cases to discuss: 

Case 1: 7,- contains nodes which are in the visited-tree-record but not in the visited- 
node-record of M^. This implies that 7, must be a newly-bom tree which is split 
from the tree that have been explored (partially) by M^, otherwise it will be 
explored. But as shown in the low-tree algorithm, a newly-bom tree will broadcast a 
message to its neighboring trees and thus invite a backtracked message. This means 
that Ti will invite a from Tj , a contradiction. 

Case 2: 7,- contains nodes neither in the visited-node-record nor in the visited-tree- 
record of Ma , which implies that 7,- is a new tree split from Tj . In response to that, 
both the low-tree algorithm and the high-tree algorithm will do sibling- 
reclassification, which ensures that after finite time, siblings between 7,- and Tj will 
be classified as external. Therefore, must be sent to 7,- ; again, this is a contradic- 

Therefore, we conclude that if the network is connected, every reply message 
will be routed to v . □ 

From Theorem 4.3.1, the following theorem immediately follows. 


Corollary 4.3.1 The protocol is reliable if no link failure causes partitioning of the 
network. □ 

4.4.4 Computational complexity 

In this section, we discuss the computational complexity of the algorithm. The 
algorithm is designed to achieve the message complexity lower bound for GSP. 

Theorem 4.4.1 The message complexity of the algorithm for GSP is (IVP) with 
arbitrary link failures as long as the network remains connected. 

Proof . Without loss of generality, assume that all of the link failures occur in the 
reply phase. As the survey process proceeds, each tree link failure will increment the 
number of trees by one. Let Tj, T2, ... , T^^ be all the trees in the network at the end 
of the process with < IV I. 

Consider a tree 7, where 1< / < n,. Let A/^ be a reply message that arrives at 
Ti from a neighboring tree Tj (a backtracked message). Let us count the maximum 
number of message exchanges in T,- for forwarding to v . The message exchanges 
occur either at tree links or external sibling links. 

First, we consider how many messages are possibly exchanged over tree links 
due to the routing of to v . The worst case occurs when the low-tree algorithm 
can not deliver to v via any of its external siblings so that is eventually 
returned to Tj. Since the alternative-path search works in a depth-first-search manner, 
tiie number of messages exchanged in 7,- caused by the arrival of is bounded by 
0(lr,l). There are at most such backtracked messages like that may 

reach T,-, excluding the source v (recall that a duplicate message from a sibling or a 
child will be rejected). Hence, during the whole process, the total number of mes- 
sages exchanged over tree links in T,- is at most O (lrjl(IV l-lr,l-l)). 


Now, we compute the upper bound on the number of messages exchanged over 
the external sibling links of 7,-. Suppose that there are e^igj external sibling links 
for r,-. For each message M^, each time when is returned from an external 
sibling or when the chosen external sibling link fails, the algorithm has to find 
another external sibling to retry. This increments the number of messages exchanged 
over sibling links by one. The alternative-path search ensures that if a reply message 
is returned from a neighboring tree, the tree will not be explored again unless the 
external siblings in the tree are reactivated. Let r^,^ _, be the total number of external 
siblings of 7, that have been reactived. This makes the total number of "retries" at 
Ti less than e^ig j+r^ig i. Again, no more than M-tT,!-! such backtracked messages 
like Ma may arrive at T,-. Therefore, during the survey process, the total number of 
messages exchanged over the sibling links connecting 7,- and its neighboring trees are 
bounded by O (e^,-^ i +r^ig -tM-lT, 1-1 ). 

Summing up the two terms for all T^, ... , gives the following total number 
of messages exchanged during the reply phase, 

£ iOOTXW 1-17,1-1)) + 0(e^,.^,.+r^,.^,-fVH7,l-l)) 

= 0(l7il(IVl-l7il-l)) + 0(e^ig^i-iW\-\Ti\-l) + 0(1721(1^1-17^1)) + 

+ ... + 6>(I7^UVI-I7„,I-1)) +Oie^ig,„+\V\-\T,\-l) + 0(r^,-,.i + ... + r^^..^,) 

= {0(l7il(IVl-l7il-l)) + 0(l72l(IVl-l72l-l)) + ... +0(l7„,l(IVl-l7„,l-l))} 
+ {0(e^ig,i+\V\-\Til-l) + Oie^ig^2+^V\-\T:^l) + ... +0(e^,^,„ +IV l-l7„,H)} 
+ Oir^igj + ... + r^ig^^) 

< 0(IVl(l7iW72l+ ... + \r^\)) +0(e^,g^,+e^,g^2 + - + e^^g^) + 0(nM 


<0(IV|2) + + (9(r^.-g.i + ... + r^ig^) 

= Oi\V\'^) + 0(r^ig^i + ...+r^ig^) 

Since each time a new tree is generated, at most (9(IVl) external siblings in the net- 
work can be reactived, 0(r^/^,i + ... + r^ig,n,) < < (9(lyP). So, 
0(IVlVO(r^i,.i + ... + r^ig^,) = 

Finally, let us count the number of message exchanges due to maintenance pur- 
poses, i.e., the sibling reclassification and tree-bom message broadcast. It is clear 
that sibling reclassification involves only tree edges traversals; thus the total number 
of message exchanges due to this factor is bounded by O(IVP). Similarly, tree-bom 
message broadcast is done in depth-first-search fashion and involves tree-traversals 
and message exchanges over extemal sibling links that is bounded by the number of 
neighboring trees. Therefore, it incurs no more than (9(lyl^) message exchanges. 
Hence, including the I) messages exchanged during the broadcast phase and the 
OClVl'^) reactivation messages exchanged as a result of the generation of new trees, 
we conclude that the algorithm has an upper bound of 0{\V\^) as its worst-case mes- 
sage complexity. □ 

Corollary 4.4.1 The algorithm for GSP has optimal message complexity. □ 

An interesting open question is whether there exists a protocol for this global 
survey problem which can achieve the lower bounds on both time and message com- 


In this closing chapter, we summarize what have been achieved during this 
research and suggest some open problems for future research. 

5.1 Major Results 

In Chapter 2, We present a hierarchical, partitioning approach to the network 
reconfiguration problem of large-scale satellite networks. By partitioning a satellite 
network space into regions of approximately equal sizes, the reconfiguration problem 
of the network can be reduced to a number of regional link assignment problems 
which, due to their relative simplicity, can be solved recursively by connecting satel- 
lites in a layer-by-layer manner. The algorithm proves to be superior to other 
methods in terms of time complexity. 

It is also shown that, in the large-scale networks where each region contains at 
least two nodes, the node-connectivity of the network is maximized when each node 
is restricted to have four transceivers. With this property, packet routing is more 
flexible, and the possibility of packet congestion can be effectively reduced. In addi- 
tion, we also provide some fault-tolerant mechanisms which enable us to dynamically 
handle the unexpected topological changes such as link failures or node failures. 

In short, our effort on studying satellite network reconfiguration has resulted in a 
significant method. Unlike the NpNs model, the method provides a more general net- 
work model which does not impose any restriction on the orbits of the satellites. 
Consequently, it is free of the vulnerability that the NpNs model suffers. 

In Chapter 3, we have discussed the issues of message broadcast in networks 
which are subject to frequent Unk reconfigurations but are eventually connected. The 



protocol presented has been shown to achieve reliability, efficiency, and finite mes- 
sage buffering (previously only infinite-buffered algorithms are available). We 
believe that the protocol has significant implications for broadcast protocol designs in 
communication networks with such characteristics, in particular, packet radio net- 
works and point-to-point satellite networks. 

With minor modifications, the protocol can be extended to serve as an 
information-collecting algorithm in eventually-connected networks. Information col- 
lecting is necessary in networks where a node may need specific information from 
every node in the network. For example, the network may need to select a master 
file server or elect a leader so that some administrative things can be centralized. 
Hence, the reply phase of the algorithm can be modified to bring back not only the 
reply vector but also the needed information from each node. The algorithm also can 
be applied to other issues such as network votings and synchronizations. 

Based on similar but relatively restrictive model, we discussed another research 
topic in Chapter 4: network survey (or GSP, Global Survey Problem). The problem 
assumes that the network suffers link failures which may not be recovered during the 
survey process and that the network remains connected despite of these failures. We 
show that Q(IVl) is a lower bound on time complexity and a message-complexity 
lower bound is Q(min(IV l,^)IV :+\E I) for lumping algorithms. 

Previously some works have been done on fault-tolerant network resource allo- 
cation in which a modified flooding algorithm is provided. The algorithm can be con- 
verted to an algorithm for this problem in a straightforward manner. However, the 
algorithm has 0(IVll£l) message complexity, which is not optimal, in particular 
when C>(l£'l) = (9(lyP). In this research we design a loop-free algorithm, which 
consists of the lower-tree algorithm and the high-tree algorithm. The algorithm is 
bounded by O(IVP) regardless of the number of link failures, thus an optimal algo- 
rithm in terms of message complexity. 


5.2 Future Research Directions 
Future research on network reconfiguration includes extending the hierarchical 
partioning approach to handle more general cases (e.g. cases in which the number 
of links in a node is not fixed). Another interesting direction for future work is to 
design the hierarchical routing algorithms which can take advantage of the hierarchi- 
cal topologies. 

Regarding distributed algorithm designs in networics with changing topologies, 
we believe that the following two problem are practically significant and worth pur- 
suing: fault-tolerant leader election and broadcasting tree maintenance. Leader elec- 
tion problems have been extensively studied for the past ten years. However, there is 
very little literature addressing this issue under changing-topology environments. The 
problem is expected to be rather complex as the complexity of a leader election will 
increase substantially when links or nodes fail. 

The broadcasting tree maintenance is about designing an algorithm that can 
dynamically restructure the current spanning tree used for broadcasting purpose. It 
will be very beneficial to nodes in the network if a message can be broadcast along 
the edges of a spanning tree. Good algorithms for such a problem that are able to 
handle both link failures and node failures are yet to be discovered. 


[I] J. Adams and M. Fishetti. "Star wars SDI: The great experiment," IEEE Spec- 
trum , 22(9), pp. 34-64, September 1985. 

[2] A. Aho, J. Hopcroft and J. Ullman. The Design and Analysis of Computer 
Algorithms. Addison- Wesley, Reading, Massachusetts., 1974. 

[3] B. Awerbuch and S. Even. "Reliable broadcast protocols in unreliable net- 
works," Networks, Vol. 16, pp. 381-396, 1986. 

[4] B. Awerbuch. "Optimal distributed algorithms for minimum weight spanning 
tree, counting, leader election and related problem," Proc. of the 19th ACM 
Symp. on Theory of Computing, pp. 230-240, May 1987. 

[5] A. Baratz, I. Gopal and A. Segall. "Fault tolerant queries in computer net- 
works," The 2nd International Workshop on Distributed Algorithms. IEEE, pp. 
31-40., July 1987. 

[6] R.R. Boorstyn and H. Frank. "Large scale network topological optimization," 
IEEE Trans, on Communications, COM-25(l), pp. 29-47, January 1977. 

[7] C. Cheng, I. Cimet and S. Kumar. "A protocol to maintain a minimum span- 
ning tree in a dynamic topology," ACM SIGCOMM 88, pp. 330-338, August, 

[8] E.G. Coffman and P.J. Denning. Operating system Theory , Prentice-Hall, New 
Jersey, 1973. 

[9] E.G. Coffman. Computer and Job Shop Scheduling, John Wiley, New York, 

[10] Y. Dalai. "Broadcast protocols in packet switched computer networks," Ph.D. 
dissertation, Stanford Univ., Stanford, California, Digital System Lab. Tech. 
Rep. 128, April 1977. 

[II] Y. Dalai and R. Metcalfe. "Reverse path forwarding of broadcast packets," 
Commun. Ass. Comput. Mach., Vol. 21, pp. 1040-1048, December 1978. 

[12] S. Even. "An algorithm for determining whether the connectivity of a graph is 
at least k," SIAM Journal of Computing, pp. 393-96, September 1975. 

[13] S. Even. Graph Algorithms, Computer Science Press, Reading, New York, 

[14] E. Gafni and D. Bertselcas. "Distributed algorithms for generating loop-free 
routes in networks with frequentiy changing topologies," IEEE Trans, on 



Communications, COM-29(l), pp. 11-18, January 1981. 

[15] R. Gallager. "An optimal routing algorithm using distributed computation," 
IEEE Trans, on Communications, COM-25, pp. 73-85, 1977. 

[16] R. Gallager, P. Humblet, and P. Spira. "A distributed algorithm for minimum- 
weight spanning trees," ACM Trans, on Programming Languages and Sys- 
tems, Vol. 5, No. 1, January, 1983, pp. 66-77. 

[17] J. Garcia-Luna. "A fail-safe routing algorithm for multihop packet-radio net- 
works," Proc. of IEEE Infocom 86, pp. 434-443, April 1986. 

[18] J. Garcia-Luna-Aceves and R. Rom. "On the dynamic control of network con- 
nectivity," Proc. of IEEE Infocom 87, pp. 207-217. 

[19] M. Gerla and L. Kleinrock. "On the topological design of distributed computer 
networks," IEEE Trans, on Communications, COM-25, January 1977. 

[20] L. Kleinrock. Queueing Systems, Volume 2: Computer Applications, Wiley- 
Interscience, New York, 1976. 

[21] D. Kleitman. "Methods for investigating the connectivity of large graphs," 
IEEE Trans, on Circuit Theory, CT-16, pp. 232-3, May 1969. 

[22] K. Luo, Y. Chow and R. Newman-Wolfe. "An efficient algorithm for 
reconfiguration of large-scale point-to-point satellite computer networks with 
maximum connectivity," The 9th IEEE International Phoenix Conference on 
Computers and Communications, pp. 194-201, March 1990. 

[23] K. Luo, Y. Chow and R. Newman-Wolfe. "An efficient broadcast protocol in 
networks with changing topologies," The 2nd IEEE Workshop on Future 
Trends of Distributed Computing Systems, pp. 88-93, September 1990. 

[24] K. Luo and Y. Chow. "A message-optimal protocol for global surveys in faulty 
networks," IEEE Infocom '91, pp.525-532, April 1991. 

[25] C. McLochlin, C. Ward, Y. C. Chow, R. Newman-Wolfe, J. N. Wilson and T. 
B. Hughes. "Optimizing the delay and reliability of low altitude satellite net- 
work topologies," IEEE Milcom '87, October 19-22 1987. 

[26] J. McQuillan, I. Richer and E. Rosen. "The new routing algorithm for the 
APARNET," IEEE Trans, on Communications, COM-28, pp. 711-719, 1980. 

[27] P. Merlin and A. Segall. "A failsafe distributed routing protocol," IEEE Trans, 
on Communications, COM-27, pp. 1280-1287, 1979. 

[28] F. Moss and J. Jaffe. "A responsive distributed routing protocol," IEEE Trans, 
on Communications, COM-30, pp. 1758-1762, 1982. 

[29] D. Peleg. "Time-optimal leader election in general networks," Journal of 
Parallel and Distributed Computing 8, pp. 96-99, 1990. 


[30] T. Roberttazzi and P. Sarachik. "Self-organizing communication networks," 
IEEE Communications Magazine, Vol. 24, no. 1, pp. 28-33, January 1986. 

[31] U. Schumacher. "An algorithm for construction of a ^-connected graph with 
minimum number of edges and quasiminimal diameter," Networks, Vol. 14, pp. 
63-74, 1984. 

[32] A. Segall. "Distributed network protocols," IEEE Trans, on Inf. Theory, YT- 
29, pp. 23-35, 1983. 

[33] A. Segall and B. Awerbuch. "A reliable broadcast protocol," IEEE Trans, on 
Communications, COM-31, pp. 895-901, 1983. 

[34] A. Tanenbaunx Computer Networks . Addison- Wesley, Reading, Massachusetts, 

[35] U. Vishkin. "A distributed orientation algorithm," IEEE Trans, on Inf. 
Theory, Vol. IT-29, pp. 624-629, 1983. 

[36] C. Ward. "Topologies and link assignment problems in low altitude satellite 
networks", Ph.D. dissertation. University of Florida, 1987. 


Kenneth C.-K. Luo was bom in Hsinchu, Taiwan. He received his B.S. degree 
in computer science and information engineering from National Taiwan University in 
1981. He served as a communications officer in the Taiwan Air Force for two years. 
In 1983, he joined the Institute for Information Industry where he worlced as a 
software engineer. He received his M.S. degree in computer science from the Univer- 
sity of California in 1986. He came to the University of Florida in 1987, where he is 
working on his Ph.D. degree in computer science. His research interest includes dis- 
tributed computing, computer networks, fault-tolerant computing, combinatorial 
optimizations and system performance evaluations. He is married to Sze-Mei Ju and 
has a four-month-old daughter.