# Shaw, Robert - "The Dripping Faucet as a Model Chaotic System" (1984)


The Science Frontier Express Series
The Dripping Faucet as a Model Chaotic System
by Robert Shaw
ISBN 0-942344-05-7

Copyright Aerial Press, Inc., 1984. Reproduction of this volume in any form is prohibited (except for purposes of review) without the prior written permission of either Aerial Press, Inc. or the author. Aerial Press, Inc., P.O. Box 1360, Santa Cruz, CA 95061. (408) 425-8619.

## Abstract

Water drops falling from an orifice present a system which is both easily accessible to experiment and common in everyday life. As the flow rate is varied, many features of the phenomenology of nonlinear systems can be seen, including chaotic transitions, familiar and unfamiliar bifurcation sequences, hysteresis, and multiple basins of attraction. Observation of a physical system in a chaotic regime raises general questions concerning the modeling process. Given a stream of data from an experiment, how does one construct a representation of the deterministic aspects of the system? Elementary information theory provides a basis for quantifying the predictability of noisy dynamical systems. Examples are given from the experimental data of computations of two dynamical invariants: a) the information stored in a system, and b) the entropy, or rate of loss of this information.

## Contents

1. Introduction
2. The experiment
3. Modeling strategy, some expectations about the data
4. Some phenomenology
5. A naive analog model
6. Information theory and dynamics
7. The minimum information distribution P̄(x)
8. Quantification of predictability
9. Entropy of purely deterministic systems
10. The noise problem
11. Information stored in a system
12. Examples from the data
13. Rate of loss of stored information
14. Example from the data
15. The entropy in the limit of "pure determinism"
16. "Observational noise"
    a. Measuring instrument as channel
    b. Observation rate
    c. Complete and incomplete measurements
17. Symbolic dynamics
18. Example from the data
19. Limitations of this modeling procedure
20. More general models, how "good" is a model?
21. The undecidability of optimum modeling
22. Conclusion, "ideas" as models

Appendix 1 - Continuous time calculation
Appendix 2 - The dimension question

In a lecture entitled "Is the end in sight for theoretical physics?", Stephen Hawking considers the possibility that all fundamental physical laws may soon be known[1]. He comments: "It is a tribute to how far we have come already in theoretical physics that it now takes enormous machines and a great deal of money to perform an experiment whose results we cannot predict." If a description of Nature is to consist solely of elementary particles and the forces between them, this statement might be true. However, systems made up of a number, even a small number, of subunits can behave in ways which completely transcend our present understanding. The motion of fluids provides a good example of how a vast phenomenology can arise which is to a considerable degree independent of microscopic physics. We live in a whirl of moving structures, swept by social, economic, and personal currents whose dominant theme is one of unpredictability. Yet laws, constraints of some sort, seem to be operating, as evinced by our ability to function. The central issue of physics, that of predictability, is in fact addressed as a practical matter by each newborn infant: How do we construct a model from a stream of experimental data which we have not seen before? How do we use the model to make predictions? What are the limits of our predictive ability? Simple experiments, as well as the experience of daily living, still have much to teach us. Here, as a case in point, is an experimental study of a dripping faucet.

## 1. Introduction

A common complaint of insomniacs is a leaking faucet. No matter how severely the tap is wrenched shut, water squeezes through; the steady, clocklike sound of the falling drops often seems just loud enough to preclude sleep.
If the leak happens to be worse, the patter of the drops can be more rapid, and irregular. A dripping faucet is an example of a system capable of a chaotic transition: the same system can change from a periodic and predictable to an aperiodic, quasi-random pattern of behavior as a single parameter (in this case, the flow rate) is varied. Such a transition can readily be seen by eye in many faucets, and is an experiment well worth performing in the privacy of one's own kitchen. If you slowly turn up the flow rate, you can often find a regime where the drops, while still separate and distinct, fall in an irregular, never-repeating pattern. The pipe is fixed, the pressure is fixed; what is the source of the irregularity?

Only recently has it been generally realized that simple mechanical oscillators can undergo a transition from predictable to unpredictable behavior analogous to the transition from laminar to turbulent flow in a fluid. A random component in a physical variable is not necessarily due to the interaction of that variable with a great many degrees of freedom, but can be due to "chaotic dynamics" among just a few. A system in such a regime is characterized by short-term predictability, but long-term unpredictability; the system state at one instant of time is causally disconnected from its state far enough into the future. The study of simple nonlinear models capable of chaotic behavior has become an active area of physics and mathematics research; see references [2] for an introduction.

The existence of chaotic dynamical behavior places severe limits on our ability to predict the future of even completely deterministic systems. The question soon arises: if a system is unpredictable, how unpredictable is it? Can we quantify a degree of predictability? Also, can we tell the difference between randomness arising from the interaction of many degrees of freedom, and randomness due to the chaotic dynamics of only a few?
Limits on predictability imply a loss of information, and this motivates a look at the formalism and concepts of information theory. Information theory, as developed largely by Shannon, often provides a natural and satisfying framework for the study of predictability. A main goal of this paper is to demonstrate, by example, how one can take a stream of data from a physical experiment and compute quantities such as a) the amount of information a system is capable of storing, or transmitting from one instant of time to the next, and b) the rate of loss of this information. The paper will conclude with a more general discussion of the modeling process.

## 2. The experiment

The dripping faucet has been used as an example of an everyday chaotic system in lectures by Rössler[3], Ruelle, myself, and others, but to my knowledge, this is the first experimental study. Experimental work is being performed at UCSC, in collaboration with Peter Scott. Figure 1 illustrates the simple apparatus.

Water from a large tank is metered through a valve to a brass nozzle with about a 1 mm orifice. Drops falling from the orifice break a light beam, producing pulses in a photocell signal. The pulses operate a timer, yielding the basic data of this experiment, time intervals between successive drops. A Z80-based microcomputer, built by Jim Crutchfield, controls the valve via a stepper motor, reads and resets the timer, and collects data. Timer resolution is 4 microseconds. Chief sources of external noise are vibration and air currents, to which the system is quite sensitive, and drifts in the flow rate, probably due to the strong temperature dependence of both viscosity and surface tension. However, the system is easy to operate, with a little care, and produced interesting data almost immediately. Drop rates of interest were in the range of 1 to 10 per second, corresponding to flow rates of about 30 to 300 gpf [4].

## 3. Modeling strategy, some expectations about the data

A traditional physics approach to modeling the dripping faucet might be to try to write down the equations of motion, coupled differential equations describing fluid flow and the forces of surface tension, and solve them with appropriate boundary conditions. But this is an exceedingly difficult problem. The calculation of the shape of even a static hanging drop is at present a state-of-the-art computer calculation[5]. A numerical foray into the largely unexplored world of three-dimensional coupled nonlinear PDEs is probably overambitious.

Another, more tractable, approach is to try to build a model directly from the data, without attempting a model from Newton's equations and first principles. Here we forget the physics entirely, and view the system as a black box producing a stream of numbers T_1, T_2, T_3, ..., in this case the drop intervals. All our questions about predictability are posed abstractly, in terms of these numbers. Is there a causal relation between T_n and T_{n+1}, between one drop interval and the next?

A common method for testing a stream of numbers for determinism is to construct a "return map": the numbers are taken in pairs, and a point (T_n, T_{n+1}) is plotted for each n. Figure 2 shows how such a plot would appear for two extreme types of data. a) If the drops are absolutely periodic, then T_1 = T_2 = T_3, etc., and each pair of numbers plots the same point. b) Suppose each number was selected at random from some distribution of possible intervals P(T), with no correlation between one number and the next. In the latter case, the plotted pairs would produce a scatter plot, reflecting the absence of any "history" in the number stream. The distribution of points on the scatter plot would be uniquely determined by the rule for statistical independence, P(T_n, T_{n+1}) = P(T_n) P(T_{n+1}). This provides the definition of a "purely stochastic", or Markov, number stream.
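The two extreme cases can be sketched numerically. The interval values below are illustrative choices, not taken from the experiment:

```python
import numpy as np

def return_map(intervals):
    """Pair successive intervals (T_n, T_{n+1}) for a return-map plot."""
    t = np.asarray(intervals, dtype=float)
    return t[:-1], t[1:]

# a) A strictly periodic drip: every pair lands on the same point.
periodic = np.full(1000, 0.185)              # 185 ms spacing (illustrative)
x, y = return_map(periodic)
assert np.allclose(x, y)                     # all pairs coincide

# b) An uncorrelated random stream: the pairs fill out a structureless
# scatter, with P(T_n, T_{n+1}) = P(T_n) P(T_{n+1}).
rng = np.random.default_rng(0)
x, y = return_map(rng.uniform(0.15, 0.20, 1000))
assert abs(np.corrcoef(x, y)[0, 1]) < 0.15   # successive intervals uncorrelated
```

Real faucet data falls between these extremes, which is what the return maps in the following section display.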
In general we expect, on physical grounds, something in between. Systems are usually causal, with the information stored in the dynamical variables providing continuity through time, but we also know that every physical system is subject to noise, which limits the total amount of information it can carry.

Of course, we cannot expect to construct a model of the entire drop dynamics from a series of drop intervals. First, the discretization of the continuous system effected by looking only at drop intervals performs, roughly, a "Poincaré cross-section" of the flow in the system state space. For example, a regular oscillatory dripping motion becomes a single point in the return map; one could imagine many different oscillations with the same period. Even worse, we are projecting the presumably infinite-dimensional dynamics onto a single time variable. The fact that this procedure can yield any structure at all makes a statement about the ability of continuous systems to behave in a "low-dimensional" fashion.

Nevertheless we can hope, if the system is "low-dimensional" enough, to compute the same measures of predictability that we would obtain if we had available much more complete data, for example the position of the whole surface of the drop as a function of time. In order to be useful, quantities such as the "information stored in a system" must be properties of the system, and not of the particular type of measurement. This implies an invariance under coordinate transformations, a property which appropriately defined measures of information possess. Thus we can hope that, in some cases, a "good enough" measurement of any dynamical variable will serve to characterize the predictability.

## 4. Some phenomenology

The behavior of our dripping faucet system exhibits a rich phenomenology as the flow rate is varied, one which we have not yet systematically investigated. The qualitative type of behavior at a given flow rate, e.g.
periodic or chaotic, can depend on initial conditions, indicating multiple "basins of attraction", and changes in behavior as a function of flow rate are often sudden and hysteretic. The drop rate is not a monotone function of the flow rate. An overview of the faucet behavior will appear[6], but at the moment only preliminary results will be presented, consisting of examples of the different types of behavior which can occur.

At low flow rates, the system is in its "water clock" regime; quite strict periodicity is typical. At somewhat increased flow rates, corresponding to about 5 drops per second with our nozzle, variations in the drop intervals on the order of 5-10% start to appear. These can be difficult to detect by eye, the drops still appearing periodic, but the time vs. time maps show that variations are present, and quite structured. Figure 3 illustrates one type of behavior: every other drop interval is somewhat longer than its predecessor, corresponding physically to a pairing up of the drops, as schematically illustrated to the right of the figure. This is an example of a "period doubling" bifurcation; two drops have to fall before the system behavior repeats itself, and successive points on the time vs. time map alternate between two locations.

Another flow rate might yield a picture such as Fig. 4a. Here no periodicity is apparent, but the system clearly is not totally random either; a considerable degree of determinism connects one drop interval with the next. The time vs. time map could be approximated by a parabola with some "pure stochastic" noise added. In fact, considerable study has been devoted lately to simple difference equations of the form X_{n+1} = f(X_n) + ξ_n, where f(X_n) is a parabola and ξ_n is a noise term, hopefully modeling a system with mixed deterministic and random elements[7]. For comparison, the plot resulting from iteration of such a one-dimensional map with added noise is shown in Fig. 4b.
Fig. 4 - Data modeled by the 1-d map with noise: x' = 3.6x(1-x) + 0.02ξ

Still another type of behavior is shown in Fig. 5a. This is a very familiar picture to those of us with experience with driven nonlinear oscillators; it is a Poincaré cross-section of a "two-band" attractor, which occurs in the driven Van der Pol oscillator, the driven Duffing oscillator, and many other systems. The two-band attractor is part of the complete period-doubling bifurcation sequence, as described originally by Grossmann and Thomae in the context of one-dimensional maps[8], and Shimada and Nagashima[9], Lorenz[10], and Crutchfield et al.[11] in the context of continuous flows. One should imagine trajectories moving along a ribbon which loops through the cutting plane twice, the ribbon folding into itself at some point along its length. This is an example of a system with a mixture of periodic and chaotic aspects; points strictly alternate between the two islands in the time vs. time map, but move chaotically within each island. This situation has been termed "noisy periodicity" by May[12], and "semiperiodicity" by Lorenz[10]. Similar pictures occur in iterated maps of the plane such as the Hénon mapping[13], as illustrated in Fig. 5b.

Fig. 5 - Data modeled by the Hénon map: x' = y + 1 - 2.07x^2, y' = -0.33x

At a certain flow rate, one starts to see pictures like Fig. 6. The data still has a one-dimensional character, but the map is tangled; one drop interval no longer uniquely determines the next. There apparently still exists a single coordinate which more or less specifies the state of the system, but this coordinate is no longer projected in a one-to-one fashion onto the drop interval. One could resolve the ambiguities in the time vs. time map with various conditional probability schemes, but a more entertaining method is to add another coordinate, namely T_{n+2}, to the display, and view the resulting time vs. time vs.
time map as a stereo plot. In this fashion, we let the visual cortex do the work, and those able to view stereo pictures with crossed eyes can verify that the strings which appear to cross in the two-dimensional figure are in fact distinct, and one could in principle lay more reasonable system coordinates along them.

Fig. 6

All of the examples presented so far display a nearly one-dimensional character; that is, the value of a single coordinate, often just the drop interval, determines to within fairly narrow limits the value of the next interval. As far as the data is concerned, the state of the system can be specified by a value along a single coordinate. Furthermore, the behavior is familiar from the phenomenology of simple low-dimensional nonlinear systems, despite the fact that the dripping faucet is a continuous fluid system, with, in theory, an infinite number of degrees of freedom. We can only expect that eventually more of those "degrees of freedom" will assert themselves.

Figure 7 illustrates a type of behavior which appears at somewhat increased flow rates. Much of the data falls along one-dimensional curves, but there are regions of the time vs. time map which appear to be higher dimensional. Appendix 2 discusses the interesting possibility of the common occurrence of attractors of what might be called "mixed dimensionality"; in some regions of the state space, the state of the system can be optimally specified by the value of a single coordinate, but in other regions, more coordinates are required.

Fig. 7

At still higher flow rates, higher dimensional behavior becomes more and more evident. The variations in drop interval increase to 20% and more, and one obtains pictures like Fig. 8. There is still plenty of structure present, but, as the phenomenology of higher dimensional nonlinear systems is nearly completely unknown, studies are postponed to the future.
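The two comparison maps quoted in the captions of Figs. 4b and 5b can be iterated directly. This sketch uses the parameter values as printed; the clamping of the noisy iterate and the divergence guard on the Hénon-type map are our own additions, since initial conditions far from the attractor can escape:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_logistic(n, x=0.5, a=3.6, noise=0.02):
    """Iterate x' = a x (1 - x) + noise * xi, as in Fig. 4b."""
    out = []
    for _ in range(n):
        x = a * x * (1.0 - x) + noise * rng.standard_normal()
        x = min(max(x, 0.0), 1.0)    # clamp: keep the noisy iterate in [0, 1]
        out.append(x)
    return out

def henon(n, x=0.1, y=0.1, a=2.07, b=-0.33):
    """Iterate the Henon-type map of Fig. 5b: x' = y + 1 - a x^2, y' = b x."""
    pts = []
    for _ in range(n):
        x, y = y + 1.0 - a * x * x, b * x
        if abs(x) > 10.0:            # off-attractor starts can diverge
            break
        pts.append((x, y))
    return pts
```

Plotting successive pairs from `noisy_logistic` reproduces the parabola-plus-scatter character of Fig. 4b.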
Further work will be directed toward illuminating the new nonlinear phenomenology which is lurking in this system. A comforting feature of the nonlinear jungle is that certain landscapes tend to reappear; whereas dynamical detail can vary endlessly from system to system, certain themes become familiar. For example, in the dripping faucet system we have recognized forms and bifurcation sequences common to low-dimensional models. Perhaps by studying faucet phenomena that are less familiar, we can build metaphors which will be of use in comprehending more complex behavior in other systems.

## 5. A naive analog model

The term "model" is commonly used in physics to describe a representation of a physical system not involving all of its relevant variables, but rather a much simpler set of variables and a dynamic between them which still manages to preserve some aspect of the qualitative behavior of the complete system. Clearly some degree of predictability will usually be lost in the simplification, but a simple model often has value in its conciseness and explanative power, as well as in serving as a starting point for a more complete description.

Several years ago, I constructed a model of this type for a dripping faucet, and implemented it on an analog computer[14]. This simple, one-dimensional model is so crude as to have perhaps little or no predictive power, but it may be useful in describing to first order why the dripping faucet behaves as it does. When the water drop data collection system was constructed, it became a matter of curiosity to plug the analog computer in place of the physical system, and see if any comparisons could be made.

The model is sketched in Fig. 9. A mass, representing the drop, grows linearly in time, stretching a spring, representing the force of surface tension. When the spring reaches a certain length the mass is suddenly reduced, representing a drop detaching, by an amount dependent on the speed of the mass when it reaches the critical distance. We thus have a driven nonlinear oscillator, the nonlinearity arising from the sudden change in mass, and with position, velocity, and mass providing the three variables required for the occurrence of chaotic behavior in a system evolving in continuous time.

Fig. 9 - Analog model for water drop at low flow rates. A drop detaches at a distance x_0 indicated by the dotted line.

Sure enough, as a parameter is increased, the analog model enters into periodic motion, then period-doubles, then goes chaotic. The time between "drops", or the sudden reductions in mass of the analog simulation, is the discrete model variable corresponding to the time intervals observed in the physical system. Physical faucet data can be found which closely resembles time vs. time maps obtained from the analog simulation; see for example Fig. 10.

Fig. 10 - Comparison of experiment and "theory".

The analog simulation makes available continuous variables, which are not obtained from the physical experiment as it is presently conducted. By plotting the variables of the analog simulation, we can get an idea of the geometry of the "attractor" describing the motion of the fluid system. Fig. 11 plots two of the three continuous model variables, for the same parameter values yielding the time vs. time map of Fig. 10. This structure can be recognized as a Rössler attractor in its "screw type" or "funnel" parameter regime[15]. The close correspondence of model and experiment in Fig. 10 argues that such a structure is embedded in the infinite-dimensional state space of the fluid system. Fig. 12 plots the position of the model drop as a function of time.
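The mass-on-a-spring model just described can be sketched numerically. Everything below is a rough illustration: the parameter values, the linear spring, and the detachment rule (mass loss proportional to speed, followed by a crude positional rebound) are our own guesses in the spirit of Fig. 9, not values from the analog computer:

```python
def drip_times(Q=1.0, k=10.0, g=9.8, x0=1.0, alpha=0.3,
               dt=1e-4, t_max=50.0):
    """Integrate the mass-on-a-spring drop model with symplectic Euler."""
    m, x, v = 0.3, 0.0, 0.0          # initial mass, position, velocity
    t, last_drop, intervals = 0.0, 0.0, []
    while t < t_max:
        a = g - (k / m) * x          # gravity stretches, "surface tension" restores
        v += a * dt
        x += v * dt
        m += Q * dt                  # the pendant drop gains mass linearly in time
        if x >= x0:                  # critical length reached: a drop detaches
            m -= min(alpha * m * abs(v), 0.8 * m)  # speed-dependent mass loss
            x = 0.2 * x0             # crude "rebound" of the remaining fluid
            intervals.append(t - last_drop)
            last_drop = t
        t += dt
    return intervals
```

The returned list of intervals plays the role of the measured drop times; plotting successive pairs gives a model time vs. time map to set against the data.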
The interaction of the phase of the natural drop oscillations with the discrete events of the drops detaching gives rise to varying degrees of "rebound" of the remaining mass, and the resulting motion can be nonperiodic.

Fig. 11 - Attractor reconstructed from the analog model. Note the rapid rebound when a drop detaches at x_0.

Fig. 12 - Analog model in the nonperiodic regime corresponding to Fig. 11.

No strong claims are made about the accuracy of this particular model, as nearly any driven nonlinear oscillator will behave this way in some parameter range. Nevertheless, a few points are well illustrated by this example.

a) A model of only a few dimensions can sometimes adequately describe the chaotic behavior of a continuum system; the high dimensionality of the fluid system is not required. This is in line with the original discussions of Lorenz[16], Ruelle and Takens[17], and Gollub and Swinney[18].

b) There exist fundamental geometrical structures which reappear in many different nonlinear systems. The equations of the original Rössler system are quite different from those of this model, yet the same attractors appear. Furthermore, Rössler's original work, as well as work in this laboratory and elsewhere, indicates that the changes in topology of the various Rössler attractors as system parameters are varied are much more general than any one set of equations. Yet the parameter space of even the Rössler system, which produces the simplest chaotic attractor, is largely unexplored. Much work remains to be done to determine "typical" paths through parameter space, and the resulting changes in the behavior of a physical system.
c) Analog and special purpose hardware could play a leading role in the development of our understanding of nonlinear systems. General purpose digital machines are admirably suited for accurate calculations in particular cases, but often do not serve well for rapid scans of large classes of equations where qualitative understanding is totally lacking. The model-data correspondence of Fig. 10 was obtained by a search through a three-dimensional parameter space which took only a few moments on an analog machine; parameters are varied simply by turning knobs, and feedback is immediate. The time and expense required for the same job on a standard digital machine is horrendous to contemplate. The radical reduction in price and increase in capability of integrated digital circuitry in the past decade enable the construction at low cost of special purpose and hybrid devices specifically designed to rapidly increase our general knowledge of dynamical systems theory. Alas, we have found this viewpoint difficult to communicate to scientific officialdom[19].

## 6. Information theory and dynamics

We now turn to the question of describing and quantifying the predictability of a stream of numbers obtained from repeated measurements of a physical system. Before treating the water drop data, a brief general discussion of the relevance of information theory to studies of dynamical systems is perhaps in order. Although only well-known results in information theory will be described, a general appreciation has been lacking of their usefulness in describing the predictability of physical systems, particularly in the presence of noise. The viewpoint and vocabulary used have been developed over the past few years at Santa Cruz in collaboration with Jim Crutchfield, Doyne Farmer, and Norman Packard.

The concept of "information" as a physical entity, with a p log p type measure, was perhaps first appreciated by Boltzmann[20].
Gibbs, Szilard, and others addressed the concept with varying degrees of directness, but it was not until Shannon's work "The Mathematical Theory of Communication" that a clear and explicit discussion of information and its properties appeared[21]. Shannon carefully restricted his discussion to the transmission of information from a "transmitter" to a "receiver" through a "channel", each with known statistical properties. Some version of the following figure appears in the early pages of most books on information theory:

Fig. 13 - Transmitter P(x), channel P(y|x), receiver P(y).

P(x) describes the distribution of possible transmitted messages x, P(y) the distribution of possible received messages y, and the properties of the "channel" connecting the two distributions are contained in the conditional distribution P(y|x), the probability of receiving message y given that message x was transmitted.

The sets of possible transmitted and received messages labeled by x and y may assume a number of forms. If the messages are selected from a finite set, for example the letters of the alphabet, then we can label them with an integer, and the conditional distribution becomes a finite matrix P_ij = P(y_j|x_i), connecting {x_i} with {y_j}. The sets of x and y might be continuous, describing for example voltage levels, or they may be multidimensional.

If the distributions P(x), P(y), and P(y|x) are available, quantities such as the "information" passing through the channel are well-defined and easily computable. In the case of communication systems, these distributions are, as a practical matter, readily available: transmitter, channel, and receiver hardware are man-made devices with known properties, and the first-order statistical properties of the English or other language used (which partially determine the "transmitter" distribution) can be estimated by constructing histograms from a large amount of text, as demonstrated by Shannon.
From an economic point of view, Shannon's analysis was required at that point in history, to enable the development of modern communication networks.

The mapping of Shannon's communication metaphor onto the problem of predictability in dynamical systems theory is a small conceptual step. A dynamical system "communicates" some, but not necessarily all, information about its past state into the future:

Fig. 14 - Past P(x), dynamics P(x'|x), future P(x').

Observable past and future system variables are described by x and x'; all the other time-independent system properties are lumped into the conditional distribution P(x'|x), which describes the causal connection between past and future given by the system dynamics. Present theory usually treats only autonomous, non-evolving systems, hence x and x' describe the same set of coordinates at two different times, x = x_1(t_1), x_2(t_1), ... and x' = x_1(t_2), x_2(t_2), ..., etc. This provides a simplification over the more general case in communication theory, where the transmitted and received symbols may be different. Again, the symbols x and x' should be thought of as general labels for the coordinates, which might be discrete, continuous, or multidimensional.

Knowledge of the past and future system states is represented by probability distributions P(x) and P(x') over the variables. If this correspondence can be carried out, the problem of prediction, formally at least, becomes simple: given P(x) and P(x'|x), compute P(x'):

$$P(x') = \int P(x'|x)\, P(x)\, dx$$

In this picture, the "system dynamics" P(x'|x) defines a map F from distributions to distributions:

Fig. 15 - Distributions to distributions.

To study the behavior of the system farther into the future than t_2, we simply reapply the mapping F to the distribution P(x'), producing a new predicted distribution at t_3. Repeating this process models changes in knowledge of the system variables as it moves through time. Much of the modern work in dynamics involves studies of these sorts of iterated mappings.
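With the variables discretized into a finite set of states, the prediction integral becomes a matrix-vector product. A minimal sketch, using an invented 3-state dynamics rather than anything from the data:

```python
import numpy as np

# Invented 3-state transition matrix: rows are x, columns are x',
# and each row of P(x'|x) sums to one.
P_cond = np.array([[0.1, 0.6, 0.3],
                   [0.8, 0.1, 0.1],
                   [0.2, 0.2, 0.6]])

def predict(p, steps=1):
    """Apply the map F from distributions to distributions `steps` times."""
    p = np.asarray(p, dtype=float)
    for _ in range(steps):
        p = p @ P_cond               # P(x') = sum over x of P(x'|x) P(x)
    return p

p0 = [1.0, 0.0, 0.0]                 # sharply peaked initial knowledge
assert np.isclose(predict(p0).sum(), 1.0)     # probability is conserved

# Repeated application broadens the distribution toward a fixed point,
# the invariant measure discussed in the next section.
p_inf = predict(p0, steps=200)
assert np.allclose(p_inf, predict(p_inf))
```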
Classically, we can discuss systems evolving in continuous time by considering mappings which move the system forward in time by an infinitesimal amount. In the case of the water drop system, the measured system variable is the drop interval T_n, and the dynamics are given by the conditional distributions P(T_{n+1}|T_n) which relate one drop interval to the next. The time variable is discrete, and is labeled by the drop number n.

Intuitively, we expect highly peaked probability distributions to represent fairly definite knowledge about the variables, and broad distributions relative ignorance. Thus, widening of a distribution under the dynamics will indicate a loss of knowledge, or predictive ability. Our task will be to quantify this notion.

## 7. The minimum information distribution P̄(x)

Of all the possible distributions P(x) over the variables x representing various guesses about their positions, there is one special distribution P̄(x) representing a minimum knowledge of the system state. This is our best guess about the variable positions assuming that we have total knowledge of the time-independent part of the system, the "equations of motion" P(x'|x), but zero additional knowledge about the variables.

Usually, the best representation of minimum knowledge of the position of some variable is a flat distribution over the range of that variable, but if the variable is part of a known system, our expectation may be otherwise. For example, consider an observation of the position of a harmonic oscillator with known energy but unknown phase. Our expectation is that we are more likely to find the oscillator near the ends of its travel, as it moves slower there. The probability of finding the oscillator at some position is proportional to the inverse of the velocity at that position.

Fig. 16 - Minimum information distribution for the simple harmonic oscillator.
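The oscillator example can be checked numerically: sampling the position at a uniformly random phase reproduces the density proportional to 1/|v(x)|, which for amplitude A is 1/(π√(A² - x²)). The amplitude and bin counts below are arbitrary choices:

```python
import numpy as np

A = 1.0                                        # known oscillation amplitude
rng = np.random.default_rng(0)
samples = A * np.sin(rng.uniform(0.0, 2.0 * np.pi, 200_000))

hist, edges = np.histogram(samples, bins=20, range=(-A, A), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
analytic = 1.0 / (np.pi * np.sqrt(A**2 - centers**2))  # 1 / (pi sqrt(A^2 - x^2))

# Away from the endpoints, where the density diverges, the histogram
# tracks the analytic minimum-information distribution closely.
assert np.allclose(hist[2:-2], analytic[2:-2], rtol=0.1)
```

The divergences at x = ±A are integrable, so the distribution is normalized even though its peaks are infinite, matching the shape sketched in Fig. 16.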
The minimum information distribution P̄(x) is thus a property of the system P(x'|x), or of a particular "basin" of the system, if the system state space can be decomposed into independent dynamical regions. The distribution P̄(x) is the baseline against which all more sharply peaked distributions will be measured. As we shall see, if a system has any tendency to lose information, all input distributions P(x) will move toward P̄(x) in a quantifiable sense. If the system is incapable of transmitting information into the indefinite future, then any initial distribution P(x) will relax in time to the asymptotic "equilibrium distribution" P̄(x). The final minimum state of knowledge will be independent of any initial information. This is the original and intuitively correct concept Boltzmann described by the word "ergodic". The harmonic oscillator is clearly not ergodic in this sense, as it can preserve phase information indefinitely.

P̄(x) also has the property that it is unchanged by the system dynamics; that is, it is invariant under the map F. Clearly F(P̄(x)) must be the same distribution as P̄(x); otherwise, a prediction could be made about a change of state of the system at a particular time, and our knowledge would not be at a minimum. Hence the term for P̄(x) found in the mathematics literature, "invariant measure".

If a system can assume only a finite number of states x_i, the dynamical rule connecting states becomes a finite transition matrix P_ij. In this context, the unique asymptotic state vector is described by a classical theorem of Frobenius: a matrix consisting of all positive elements has a unique largest positive eigenvalue, whose eigenvector has all positive components; furthermore, any positive vector will approach the direction of this unique positive eigenvector under repeated application of the matrix P_ij.
This fact allows an easy computation of the invariant distribution of some map: one simply starts with any distribution P(x) and repeatedly applies the map until the distribution stops changing. This is the "power method" for finding the largest eigenvalue of a matrix, and is discussed in the literature under the label "Frobenius-Perron operator". Its use will be demonstrated later in this paper, and it was explicitly applied in the dynamical systems context in Ref. [22].

The division of seamless reality into "dynamics" P(x'|x) and "variables" x is to some extent arbitrary; no physical system is truly independent of time. But once such a model has been constructed, a single P̄(x) is uniquely defined for any time-invariant system by the requirement that it describe a minimum knowledge of the system state.

8. Quantification of predictability

The appropriate measure to quantify an amount of knowledge contained in a distribution P(x) is the Boltzmann integral

$$I = \int P(x) \log P(x)\, dx$$

This measure yields zero when applied to a flat distribution P(x) = 1, and if the logarithm is taken to the base two, the units of this quantity are "bits".

The number resulting from this integral quantifies the amount of information gained when we learn, for example by experiment, that a variable is distributed as P(x), relative to a flat distribution representing no particular knowledge of x before the experiment. It is important to realize that "information" in this sense is a relative concept; the information in a distribution P(x) is always measured relative to an a priori expectation P̄(x).
The information measure is thus a function from a pair of distributions to the real numbers. If we know something about the distribution of x even before we learn P(x), our a priori expectation P̄(x) is not flat, and the appropriate formula is

$$I(P, \bar P) = \int P(x) \log \frac{P(x)}{\bar P(x)}\, dx$$

This is the "information gain", a sort of distance from the expectation P̄(x) to the new information P(x), which is independent of the particular coordinates used to describe the variables x. The information gain is thus invariant under smooth coordinate changes on x.

The information gain is a concept familiar in the context of statistical mechanics[23]. Here the distribution P̄(x) characterizing minimum information, or "thermal equilibrium", is the Boltzmann distribution. If the states of a system have an energy distribution differing from the Boltzmann, then there is a "free energy" which in principle can be extracted from the system without lowering its temperature. The knowledge of the system which allows us to extract "work" from it is the distance from the actual distribution of energies to the Boltzmann, measured by the information gain. The available free energy is simply the information gain multiplied by kT, $\Delta f = I(P,\bar P)\, kT \ln 2$.

The informational description of dynamical systems is somewhat less straightforward than the static situation of statistical mechanics. For example, the Boltzmann and other familiar distributions characterizing static equilibria are obtained directly by minimizing the information integral $\int P \log P$ subject to appropriate constraints[21]. But in the dynamical case, the "constraints" are given by equations of motion, and P̄(x) is not so easy to determine. Typically, P̄(x) is experimentally measured, or computed from equations of motion via an iterative procedure; examples will be given in a later section.
Any actual measurement of a physical variable x, corresponding to a sample from the distribution P(x), must be to finite precision; typically we know only that x is within some interval x_0 to x_0 + dx. This motivates modeling the measurement process by creating a "partition", that is, breaking up a continuous variable x into a finite number of subintervals x_i. Repeated measurements, or samples from P(x), give a string of whatever symbols we are using to label the {x_i}, with probabilities determined by how much of the probability density P(x) falls in each little bin x_i. The degree of unpredictability, or entropy, of the sequence of symbols is likewise quantifiable:

$$H = -\sum_i p_i \log p_i$$

This measure increases with the degree of randomness; if the sequence simply repeats a single symbol x_i, then p_i = 1 and all other probabilities are zero, the sequence is totally predictable, and the entropy per symbol is zero. The maximum value is obtained when all possible symbols are equally likely to occur.

The entropy, as simply defined above, is clearly not a coordinate invariant quantity. It varies as the partition changes, and generally will increase as the partition is made finer, creating more possible symbols. To see the relation between the information I(P,P̄) of P(x) with respect to P̄(x), and the entropy of a finite partition {x_i} of the continuous variable x, consider the special partition which cuts P̄(x) into N equiprobable elements. The entropy of P̄(x) with respect to this partition is simply log(N). Now the entropy of the narrower distribution P(x) with respect to the same partition is approximately H = log(N) - I(P,P̄). That is, the information provides a constraint on the entropy, lowering it from the maximum value it could have with a given number of elements in the partition. This relation becomes more exact as the number of elements grows.
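This relation is easy to check numerically. A sketch, in which the background expectation is taken flat on [0,1] (so the equiprobable partition is just N equal bins) and P(x) is a gaussian bump — both choices invented for illustration:

```python
import numpy as np

N = 50                               # partition elements
x = np.linspace(0.0, 1.0, 100_000)   # fine grid on [0, 1]
dx = x[1] - x[0]

# A distribution P(x) much narrower than the flat expectation P̄(x) = 1.
P = np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)
P /= P.sum() * dx                    # normalize so that ∫ P dx = 1

# Information gain I(P, P̄) = ∫ P log2(P / P̄) dx, in bits (P̄ = 1 here).
I = np.sum(P * np.log2(P)) * dx

# Entropy of P over the N-element equiprobable partition of P̄.
p_i = P.reshape(N, -1).sum(axis=1) * dx   # probability captured by each bin
H = -np.sum(p_i * np.log2(p_i))

print(H, np.log2(N) - I)   # the two sides nearly agree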
Here the entropy measures the degree of unpredictability of a series of discrete symbols, and information the degree of predictability of events arising from a continuous distribution, with respect to a background expectation. Entropy and information defined this way are both positive quantities.

9. Entropy of purely deterministic systems

The most common treatment of classical mechanics has been purely deterministic. One considers the past state of a system as a "point" x, and computes its future position, another "point" x', using a dynamical rule x' = F(x,t), usually obtained from Newton's equations. Physical systems governed by linear forces, central forces, and a few others respond well to this treatment; eclipses are predictable far into the future, and the world is full of clocks and gyroscopes. But generalizing from this experience contributes to a false world view. Poincaré realized that even in celestial mechanics determinism did not imply unlimited predictability, and the lesson of recent work in nonlinear dynamics is that in even the simplest of "purely deterministic" dynamical systems our predictability may be, as a practical matter, extremely limited.

Difficulties in predictability arise because, in many systems, nearby "points" are spread apart exponentially fast by the deterministic dynamics; "errors" grow uncontrollably. The motion of the macroscopic variables becomes dominated by microscopic fluctuations, supplied by the heat bath for any physical system, and the system displays "chaotic behavior". The average rate of spreading becomes an important parameter of a given system, its "entropy", governing its degree of unpredictability. The existence of this quantity, and the utility of Shannon's information theory in dynamics, was shown by Kolmogorov in a 1959 article, "Entropy per unit time as a metric invariant of automorphisms"[25].
Although the variables of a purely deterministic description are usually considered to be continuous, attempts to compute measures of predictability by performing continuous integrations quickly run into a practical problem. Here the function governing the dynamics takes "points" to "points":

Fig. 19 - Points to points.

The conditional distributions P(x'|x) are delta functions, and terms such as log P(x'|x) appearing in the integrals diverge. This corresponds to the fact that a "point" represents an infinite amount of information; to specify its position completely would require an infinite number of digits. The concept of "point", while useful, is thoroughly unphysical, being a convenient name for "arbitrarily sharp probability distribution".

Kolmogorov and Sinai found a way around this difficulty, while partially preserving the fiction of "pure determinism". They applied a "partition" to the domain of the continuous variables x and x', breaking it up into small elements. The deterministic dynamics can now map an image of each little block, conserving the underlying probability:

Fig. 20

As the above familiar picture shows, the image of a little block in x may fall over several blocks in x'. With a certain probability a "point" starting in the block in x may arrive at any of these. By labeling the little blocks {x_i} and {x'_j}, one can construct a finite matrix of transition probabilities p_{ij}. To compute the entropy, or degree of unpredictability, of a given system with respect to this partition, one now uses directly Shannon's formula for the information received from a discrete channel when the input is known, H(x'|x). Anything coming out of the channel that could be predicted from the input is not "information" in this sense; there is no "surprise" associated with it. The quantity H(x'|x) correctly measures the unpredictability.
To compute H(x"*|x) from the finite matrix one first considers the entropy of the possi¬ ble symbols x^^ received given that a particular symbol x Q has been transmitted, as given by the matrix entries The average information produced by the system is obtained by weighting this sum by the probability P(x^) that the various possible actually occur: H(*'bO The number so computed will depend on the partition chosen, but one can now, conceptually at least, pass to the limit of an infinitely fine partition, or the "refinement ft of the partition obtained by considering higher and higher interates of the map. Kolmogorov, Sinai, and others were able to show that, if one considers the set of all possible partitions, and selects the one which gives the largest numerical value for the entropy, the number one obtains is a topological invariant of the system, independent of the coordinate system used to describe the original continuous variables x* It should not be too surpris¬ ing that the optimum partition is one whose number density of microscopic parti¬ tion elements is proportional to the minimum information probability distribu¬ tion P(x). A very simple deterministic map which exhibits the main features of a deterministic system with a positive entropy is graphed below (for the n time): When viewed from a "point to point" perspective, the map is purely determinis¬ tic, each "point" x yields a unique image "point" x" * F(x). However, if one partitions the interval into small subintervals, representing the uncertainty which must exist in any measurement of x and x", one finds that, because the slope of the map is two, any element in x will shadow two on the average in y, as indicated on the right-hand part of the figure. Because the uncertainty is doubled, there is an unpredictable element present in x' of log(2)» 1 bit, regardless of how accurately we might know x. Thus the "purely deterministic" map acts as an information source [ 22 ]. 
It should be noted that the partitioning, no matter how fine, is a conceptual necessity.

The positive entropy of this map can also be seen by considering a point in x' and trying to determine its preimage point in x. Because the map is two-onto-one, the inverse is non-unique; to go backwards in time, we must replace at each step the one bit which the system produced in the forward time direction. Entropies for this sort of deterministic "one-dimensional map" can be computed this way by looking backwards in time; see work by Oono[26]. Considerable computer studies of the properties of one-dimensional maps have been performed at Santa Cruz[27].

Entropy measures are nothing more than the result of a simple counting procedure; in this case we are determining the ratio of "paths" to "states":

Fig. 22

As the above figure, adapted from Shannon, illustrates, a partition divides the variable domain into a number of "states", and the deterministic dynamics connects them with a certain number of "paths". The entropy is simply the log of the ratio of "paths" to "states", where both are understood to be appropriately weighted by their probability of occurrence. It is a property of the "purely deterministic" approximation that this ratio converges, regardless of how many "states" are considered.

A more physical interpretation of the entropy was suggested in ref.[22]. Any physical system will be imbedded in a "heat bath", producing random microscopic variations of the variables describing the system. If the deterministic approximation to the system dynamics has a positive entropy, these perturbations will be systematically amplified. The entropy describes the rate at which information flows from the microscopic variables up to the macroscopic. From this point of view, "chaotic" motion of macroscopic variables is not surprising, as it reflects directly the chaotic motion of the heat bath.
10. The noise problem

Difficulties arise in the definition of entropy discussed above when "noise", or a stochastic element, is allowed to enter a purely deterministic description of a system. Typically the noise smears out the dynamics at some length scale. The function governing the dynamics can now be described as taking "points" to "distributions":

Fig. 23 - Points to distributions.

If we consider a sequence of input functions P(x) of increasing narrowness centered around some value of x, we will reach a size below which the output distribution P(x') won't change; we are "in the noise". We can, if we like, then pass to the limit where P(x) is a delta function, or "point", with no further change in P(x'). It might be noted that this reasonable physical assumption implies that the map F describing a system with noise is a linear functional. The image of any P(x) must be the same as that obtained by breaking up P(x) into weighted "points" x_i, and superposing the resulting weighted distributions P(x'|x_i).

If we try to apply the "partition" definition of entropy in this situation, we will have trouble. As, during the limiting process, the partition becomes finer than the length scale defined by the noise, the number of elements in {x'_j} accessible to a given x_i increases without bound:

Fig. 24 - A partition finer than the noise level.

The ratio of "paths" to "states" does not converge, and the entropy diverges with the log of the number of elements in the partition.

This in fact makes physical sense. If an experimenter with a good amplifier turns the gain way up, he will see plenty of "unpredictable information", in the form of thermal noise, no matter how well-behaved the input signal is. Further increasing the gain will not help him to resolve the input signal better, nor will it help him characterize the amplifier.
Apparently the quantity H(x'|x) is not an appropriate measure of the predictability of a system with a noise component, as it actually measures the unpredictability, which will diverge as resolution is increased. The correct measure is obtained by an even more direct application of Shannon's communication channel image.

11. Information stored in a system

A "system" might be considered a body of information propagating through time. If there is any causal connection between past and future system states, then information can be said to be "stored" in the state variables, or communicated to the immediate future. This stored information places constraints on future states, and enables a degree of predictability. Again, if the system dynamics can be represented by a set of transition probabilities P(x'|x), this amount of predictability is directly quantifiable, and corresponds, in Shannon's vocabulary, to a particular rate of transmission of information through the "channel" defined by P(x'|x).

Imagine two observers with access to the same system, at two different instants of time. Suppose they both have complete knowledge of the system dynamics P(x'|x), but that the later observer knows nothing of the system state at the earlier time. The earlier observer could attempt to "communicate" to the later by putting the system in a particular one of its allowed configurations x. The dynamics would then carry the system through time to the later observer, who would read off the resulting system state x' and, perhaps with the help of a code book, discover the intended message.

Figure 25 illustrates two possibilities. Suppose the earlier observer attempts to communicate by restricting the system state to a narrow interval in x, described by P(x), but that, at the later time, the distribution has relaxed back to the minimum knowledge state P̄(x').
If this is true regardless of the input distribution then no communication is possible, and no information is stored in the system. If, on the other hand, the output distribution varies with the input, some information is passing into the future.

Fig. 25 - (left) No stored information; (right) some information stored.

Notice that, under the rules of this game, an unchanging and completely known "variable" communicates no information into the future. Although a static structure is certainly a predictable element, establishing continuity through time, it becomes a part of the fixed time-independent system description. The "information" measures described here are a property of the system dynamics. In mechanics, as in the human sphere, the transmission of information requires the possibility of change.

To quantify the average amount of information passing from past to future, first consider the earlier observer transmitting a particular x_0 and the later receiving some x' from the distribution P(x'|x_0). The information, or "surprise", associated with this event to the later observer will be measured only with respect to the asymptotic distribution P̄(x'), as he does not know x_0. The mean over all x' emanating from some x_0 is

$$I(x'|x_0) = \int P(x'|x_0) \log \frac{P(x'|x_0)}{\bar P(x')}\, dx'$$

and, finally, the average information transmitted using an ensemble of input messages P(x) is:

$$I(x'|x) = \int P(x) \int P(x'|x) \log \frac{P(x'|x)}{\bar P(x')}\, dx'\, dx$$

This quantity Shannon calls the rate of transmission of information for a continuous channel. He comments that this rate can vary up or down depending on the statistics of the input messages P(x); some will be more suitable than others to the particular properties of the channel. The channel capacity he defines as the maximum rate one can find by varying P(x) over all possible input ensembles, and using the optimum.

In the dynamical systems context, a particular input ensemble is selected by the properties of the system itself, namely the minimum information distribution P̄(x).
As was mentioned earlier, this distribution has the property that an input P̄(x) produces an identical output distribution P̄(x'), and it will automatically be generated by an autonomous dynamical system operating recursively. We can thus define the information stored in a system as the Shannon channel rate for an input (and output) distribution given by the equilibrium distribution for the system:

$$I(x'|x) = \int \bar P(x) \int P(x'|x) \log \frac{P(x'|x)}{\bar P(x')}\, dx'\, dx$$

This formula quantifies the average increase in our ability to predict the future of a system when we learn its past. It appears in a number of guises and interpretations; written with the joint probability distribution $P(x,x') = \bar P(x) P(x'|x)$ it assumes the symmetric form

$$I(x';x) = \iint P(x,x') \log \frac{P(x,x')}{\bar P(x)\, \bar P(x')}\, dx\, dx'$$

In the more general case of a communication channel described by a map P(y|x), where the output symbols y need not be the same as the input symbols x, the channel rate retains its symmetry, I(y|x) = I(x|y). This is not true of the conditional entropies: H(y|x) ≠ H(x|y) if H(x) ≠ H(y).

If the dynamics P(x'|x) is specified in the form of a finite-dimensional transition matrix p_{ij}, the stored information appears as

$$I(x'|x) = \sum_{ij} \bar P_i\, p_{ij} \log \frac{p_{ij}}{\bar P_j}$$

and in terms of entropies

$$I(x'|x) = H(x') - H(x'|x)$$

The latter form makes it clear that the stored information is the difference in the expected randomness of a system without and with knowledge of its past. Note that both H(x') and H(x'|x) will diverge as we proceed to the continuum limit with a noise element present, but that their difference remains finite.

The stored information is a function of the dynamics P(x'|x) and the invariant distribution P̄(x), which is itself determined by the dynamics. Thus the stored information is a map which takes a dynamical system, or a particular basin of a dynamical system, to a positive real number. It is, as Shannon notes, another invariant quantity, independent of the particular coordinate system used to describe the dynamical variables.
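For a finite transition matrix, both the matrix form and the entropy-difference form are a few lines each; a sketch with an invented two-state dynamics:

```python
import numpy as np

# Hypothetical two-state dynamics: rows of p are P(x'_j | x_i).
p = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Invariant distribution P̄, found by iterating the dynamics (power method).
Pbar = np.array([0.5, 0.5])
for _ in range(1000):
    Pbar = Pbar @ p

# Stored information: I(x'|x) = sum_ij P̄_i p_ij log2(p_ij / P̄_j).
I = np.sum(Pbar[:, None] * p * np.log2(p / Pbar[None, :]))

# Equivalent entropy form: I(x'|x) = H(x') - H(x'|x).
H_out = -np.sum(Pbar * np.log2(Pbar))                 # H(x')
H_cond = -np.sum(Pbar[:, None] * p * np.log2(p))      # H(x'|x)

print(I, H_out - H_cond)   # the two forms agree
```

Both forms give the same positive number of bits, the average advantage in predicting x' conferred by knowing x.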
It correctly goes to zero when P(x'|x) = P̄(x'), meaning the future and past are statistically independent; also notice that it diverges in the case of "pure determinism". A perfectly deterministic system, if such a thing existed, could store an infinite amount of information, even if it had positive entropy, because it would be able to propagate a "point" from the past into the future. If we rewrite the relation between stored information and the entropies:

$$H(x'|x) = H(x') - I(x'|x)$$

we see how Kolmogorov was able to define a finite conditional entropy in the noise-free continuum limit. Both terms on the right diverge as the partition is taken to be indefinitely fine, but their difference H(x'|x) remains finite. However, the measured information storage capacity I(x'|x) of any real system will be finite.

12. Examples from the data

At last we are able to return to the dripping faucet, and attempt some data analysis. Figs. 26 and 27 display two sets of data which represent, to coarse resolution, periodic drops. The upper right-hand panels show the data on a greatly expanded scale. We can see that the drops are in fact not strictly periodic; there is what appears to be stochastic noise admixed. It might be of interest to know whether there are correlations in the fluctuations from drop to drop or whether the noise is "purely stochastic". To determine this quantitatively we compute the drop-to-drop stored information.

Fig. 27 - Same display as Fig. 26, for a data set whose fluctuations show internal structure.

An approximate model of the dynamics of the observed variable is constructed by accumulating the transition matrix P(T_{n+1}|T_n) as a histogram. The observed variation in the drop interval is split into fifteen bins, so each pair of intervals (T_n, T_{n+1}) will increment one of the 225 matrix entries. The only other quantity required to compute the stored information is the equilibrium probability density P̄(T), which is also accumulated as a histogram.
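The accumulation just described might be sketched as follows; the interval series here is synthetic (a noisy alternation standing in for measured drop data), and the fifteen-bin split follows the text:

```python
import numpy as np

def stored_information(T, nbins=15):
    """Drop-to-drop stored information from an interval series T_n."""
    # Partition the observed range of T into nbins bins.
    edges = np.linspace(T.min(), T.max() + 1e-9, nbins + 1)
    idx = np.digitize(T, edges) - 1

    # Accumulate the joint histogram P(T_n, T_{n+1}); its marginals give
    # the equilibrium density P̄(T).
    counts = np.zeros((nbins, nbins))
    for a, b in zip(idx[:-1], idx[1:]):
        counts[a, b] += 1.0
    joint = counts / counts.sum()
    Pn, Pn1 = joint.sum(axis=1), joint.sum(axis=0)

    mask = joint > 0
    return np.sum(joint[mask] *
                  np.log2(joint[mask] / np.outer(Pn, Pn1)[mask]))

# A noisy "period two" series: long and short intervals alternate.
rng = np.random.default_rng(0)
T = 170.0 + 5.0 * (-1.0) ** np.arange(4000) + rng.normal(0.0, 0.5, 4000)
print(stored_information(T))   # ≈ 1 bit: the past tells us long vs. short
```

Fed a series with independent fluctuations instead, the same function returns a value near zero, the "purely stochastic" case of Fig. 26.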
The first "periodic” regime of Fig. 26 is seen to store virtually zero information, fluctuations in the drop intervals are indeed "purely stochastic". The forward image of one bin in the histogram reproduces the shape of the entire equilibrium distribution, and this is true for every bin. The second string of data, shown in Fig. 27, stores a small but significant amount of information from drop to drop, interval variations are not completely independent. This can be seen qualitatively by the pattern of points in the time vs, time map, the system might be said to be in a "period two" regime obscured by noise. A third example, illustrated in Fig. 28, is supplied by the system in a clear period-two regime. Here long and short drop intervals strictly alternate, but an observer who had just started to look at the system would not be able to tell which would be next. Thus, knowledge of the system past is worth one bit, allowing him to correctly predict whether a long or short drop interval will be next. A check for further structure in the pattern can be performed by accumu¬ lating histograms as above, the results are negative. The system stores only a single bit, the rest of the variation is "stochastic". However, this system differs from the one described in Fig. 27 above in that the information stored perseveres indefinitely , allowing predictions to be made about times arbitrarily far into the future. The period-two drop regime is thus not ergodic, a localized probability distribution does not relax to the distribution P(T) representing minimum knowledge of the system state. A single observation determines the long-short sequence for all time. T n (msec) 0 Fig. 28 - 200 171.2 T n (msec) two" orbit. Here 46 A careful reader may have objected to the statement that the nearly periodic example of Fig. 26 contains no stored information. What about the phase? The system is a rather good clock, with scatter in the drop intervals of well under 1%. 
An early observer could clearly communicate to a later one via the system by adjusting the phase degree of freedom, which the later could read off by comparison with a reference clock. This situation is an artifact of our description of the dripping faucet with a single state variable, the drop interval, in discrete time, parameterized by the drop number. This description correctly describes the predictive value of the knowledge of one drop interval to the determination of the next, but neglects the imbedding of the system in continuous time. Within the discrete description the "purely deterministic" and "purely stochastic" examples of Fig. 2a,b are identical in that they store zero information.

A complete description of the state of the dripping faucet at an "instant" of time requires (at least) two state variables: the current drop interval T, and a phase variable t, which could be, for example, the time since the preceding drop. States of knowledge will be represented by two-dimensional distributions P(T,t).

Computations of the stored information are still performed with the same formula, though, using the higher dimensional distributions. The conditional distributions involving both T and t can, in this case, be computed from distributions involving only T; the details are relegated to an appendix. The situation, though more complicated, is straightforward. So, for clarity of presentation, much of the discussion of the next sections will consider only the predictability of one drop interval from the previous, that is, a one-dimensional system in discrete time.
The total stored information of the nearly-periodic example in continuous time can be estimated simply by considering the ratio of the scatter in drop intervals to the average interval length. The clock can be "set" to about this accuracy, and the informational value of this knowledge is about the base-two logarithm of the inverse of this ratio, in bits. This number is more in line with estimates of the "signal-to-noise ratio" one makes from, say, the widths of lines in Fig. 10a. The dripping faucet is a very "clean" system, with a high degree of determinism. The total stored information varies considerably and not monotonically, however, with the flow rate. This reflects presumably changing relationships between the few measured variables and the background continuum. Studies of these interesting questions remain for the future.

13. Rate of loss of stored information

We have been describing the time evolution of the variables of an autonomous system by a map which takes distributions over the variables at time t_1 to time t_2. We have seen that, under a few reasonable assumptions, associated with each such map is an invariant which can be identified physically as the average amount of information stored in the dynamical variables. This number measures the degree to which an observer at t_1 can predict events at t_2.

Predictions farther in the future than t_2, without intervening observation, can be studied by reapplying the same map to the distributions at t_2, yielding the distributions at time t_3. Associated with this compound map is the invariant information stored between t_1 and t_3. Further iteration of the map produces a sequence of numbers I_1, I_2, I_3, ... describing the ability to predict farther and farther into the future.
We expect that often our ability to say anything about the future system state from an observation at t_1 will decline as we consider the more distant future, and that, if the map is ergodic, the sequence I_n will converge to zero, as all distributions eventually map to a single minimum information distribution P̄(x). The effect of having made a particular observation at t_1 is lost.

Now we will examine properties of the rate of loss of the stored information, that is, the difference between succeeding terms in the series I_1, I_2, .... This quantity h_n = I_n - I_{n+1} will be seen to correspond to the "entropy" of a deterministic map, although there are difficulties in passing to the deterministic limit.

The expected general form of the curve I(t) describing the stored information remaining at time t from a measurement at t = 0 is schematically illustrated in Fig. 29.

Fig. 29 - Small squares represent possible measurements.

Two properties of this curve can be shown from "convexity" properties of logarithmic measures of information:

a) I(t) monotonically decreases, I_{n+1} < I_n. This property is intuitively obvious; our information about a system cannot increase in the absence of observation.

b) The curve I(t) is concave, h_{n+1} < h_n. This is also intuitively clear; sharper distributions representing greater stored information will spread faster than broad distributions. The maximum entropy, or loss of predictability, will thus occur when the stored information is at a maximum.

In the case of a discrete map, taking the variables over an interval Δt, a reasonable definition of the entropy is then the loss of information between the first and second iterations of the map. Under this definition both the stored information and entropy of a "purely stochastic" map will be zero. It is tempting to extend this definition to the continuous case by considering the slope of the curve I(t) at t = 0.
But we should remember that an "instant" of time is just as unphysical a concept as a "point" in space. As the intervals between measurements become shorter, the rate of information obtained by observation from the system is increased. In general we can expect that the rate of loss of predictability may depend on the measurement process; the more "severe" the measurement, the greater the perturbation of the original state. Thus the limit process necessary to define a slope cannot be performed. A more complete discussion awaits a quantum mechanical treatment.

14. Example from the data

Figure 30 displays the time vs. time map for some water drop data in a "fuzzy hump" regime, together with an approximation to the minimum knowledge distribution P̄(T) constructed by distributing 4000 drop intervals into 50 histogram bins. A 50 x 50 transition matrix P(T_{n+1}|T_n) is likewise assembled. Fig. 31 shows what happens when we place all the "probability fluid" initially in the bin whose location is marked with the arrow in panel (a), and then apply the matrix P(T_{n+1}|T_n) several times. After the first iteration, the initial "delta function" assumes a width characteristic of the stochastic noise level of the data.

We can now compute the stored information associated with the map. Its value for this particular initial bin is 2.3 bits, and the average over all bins, 1.8 bits, approximates the invariant I_1. Under only a few applications of the matrix, nearly all the stored information is lost, and the equilibrium distribution closely approaches the experimentally observed distribution of Fig. 30b. The entropy of this map is the maximum loss of stored information per iteration, which occurs between iterations one and two, about .83 bits per drop.

To summarize, we collect our experience with the water drop data as a transition matrix, which becomes our model for its dynamics. This model of course gives good predictions for additional data, if the system is autonomous.
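The whole procedure — iterate the transition matrix, compute the stored information of each compound map, and read off the rate of loss — can be sketched end to end; a small invented three-state matrix stands in for the 50 x 50 histogram:

```python
import numpy as np

def stored_info(Pbar, M):
    """I(x'|x) for the map M with input (and output) ensemble P̄."""
    out = Pbar @ M
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = Pbar[:, None] * M * np.log2(M / out[None, :])
    return np.sum(np.where(M > 0, terms, 0.0))

# Invented 3-state transition matrix (rows sum to one).
P = np.array([[0.7, 0.3, 0.0],
              [0.1, 0.6, 0.3],
              [0.3, 0.0, 0.7]])

# Equilibrium distribution P̄ by the power method.
Pbar = np.ones(3) / 3
for _ in range(2000):
    Pbar = Pbar @ P

# I_n: information stored by the n-step compound map P^n.
I, Pn = [], P.copy()
for n in range(8):
    I.append(stored_info(Pbar, Pn))
    Pn = Pn @ P

h = [a - b for a, b in zip(I, I[1:])]   # h_n = I_n - I_{n+1}
print(I)   # monotonically decreasing toward zero
print(h)   # rate of loss of stored information per iteration
```

With a real data matrix in place of the invented one, the first entry of I approximates I_1 and the largest entry of h is the entropy of the map.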
The invariant stored information associated with this matrix and its iterates is then computed; this measures the predictability of the system. As discussed earlier, this computation of the stored information considers only the predictability of the length of one drop interval given knowledge of the length of the preceding interval, without considering that the actual state space is much larger, as the intervals are imbedded in continuous time.

Fig. 31 - Evolution of the distribution under iteration of the transition matrix; the average stored information is listed in the panels.

When the total stored information is computed, as described in the appendix, the number obtained is much larger, some 9.7 bits. As Fig. 32 illustrates, the first few bits are lost rapidly, but the remaining phase information persists longer.

Fig. 32 - Stored information as a function of continuous time for the data set of Fig. 30a. Computation was performed for multiples of the average period, as indicated by the small squares.

In this connection Fig. 33 is instructive. Plotted is the probability of observing a drop fall at some "instant" as a function of time. In the absence of any observation, this distribution is flat, as indicated by the dotted line. But now let us suppose that we know that a drop has fallen at t = 0. Our probability of a drop falling at time t, in the absence of any further knowledge, is shown, for the data of Figs. 30-32. Because the variation of drop intervals is only about 5% of the average interval, considerable phase information is preserved. Successive distributions spread roughly only linearly, as would be expected from repeated convolutions of a narrow gaussian distribution, and the stored information declines at a slow logarithmic rate[28].

Fig. 33 - (top) Time vs. time plot for a data set whose variation is a small fraction of the average period. (center) Relative probability of observing a drop given that one has occurred at t = 0, for drops 1-3.
(bottom) Same, for drops 52-54.

This statistical effect, resulting in the long-term preservation of information in a chaotic system, has been termed "phase coherence"; for more discussion, see refs. [10,29]. For comparison, consider the same plot for data at a much higher flow rate, where interval variations are comparable to the average interval. Here the knowledge that there was a drop at t = 0 is of predictive value for only a short time into the future.

Fig. 34 - (top) Time vs. time map for a data set with variations a large fraction of the average period. (bottom) Relative probability of observing a drop at time t given that one occurred at t = 0.

15. The entropy in the limit of "pure determinism"

In section 9, the entropy of a purely deterministic system was discussed in terms of a limit process involving the partitioning of the domain into infinitesimal elements. In the preceding section, the entropy of a map with a noise component was defined as a maximum rate of loss of stored information. Here it will be shown that the singular noise-free case cannot in general be approached smoothly as the limit of a mapping with a stochastic element, as the stochastic element tends to zero. The property corresponding to the deterministic entropy of a mapping with a very small stochastic element will be made clear. The example we will study is the parabolic map with noise added, x' = rx(1-x) + ξ, which can generate a string of numbers resembling faucet data, as illustrated in Fig. 4. Again, ξ is a random variable of some small width, modeling, for example, thermal noise. Figure 35 shows the invariant distribution P(x) of this map at r = 3.7 for ξ = 0, the no-noise case, and for ξ ≈ .1%.

Fig. 35 - Invariant distribution of the parabolic map with and without added noise.
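A minimal numerical sketch of Fig. 35 follows. The parameter values come from the text; uniform noise is used here rather than the truncation of the original computation, relying on the text's observation that only the effective width of ξ matters.

```python
import numpy as np

rng = np.random.default_rng(1)

def invariant_hist(r=3.7, noise=0.0, n=100_000, bins=100):
    """Histogram estimate of the invariant distribution P(x) of
    x' = r*x*(1-x) + xi, with xi uniform in [-noise, +noise]."""
    x, h = 0.3, np.zeros(bins)
    for _ in range(n):
        x = r * x * (1 - x)
        if noise:
            x += rng.uniform(-noise, noise)
        x = min(max(x, 0.0), 1.0 - 1e-12)  # keep the orbit in [0, 1)
        h[int(x * bins)] += 1
    return h / n

clean = invariant_hist(noise=0.0)     # xi = 0, the no-noise case
noisy = invariant_hist(noise=0.001)   # xi of width ~.1%, as in Fig. 35
print(np.count_nonzero(clean), np.count_nonzero(noisy))
```

The orbit length, bin count, and seed are arbitrary; the noise smears the finer features of P(x) while leaving its overall shape intact.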
Here the "noise" uncertainty is modeled for convenience by a simple truncation of the accuracy of the computation. (The experience of Crutchfield and Packard and others has been that the outcome of numerical experiments such as this is usually unaffected whether one takes ξ to be of gaussian, uniform, or other form[27]. Only the effective width of ξ is important.) Note that this is a rather small noise level, much smaller than that illustrated in Fig. 4b. The noise would be invisible on the scale of that figure, although finer features of the P(x) distribution do become smeared out. In the deterministic case, ξ = 0, the information storage capacity of the map diverges, as mentioned earlier. But it becomes finite with the addition of noise, and the series I_1, I_2, ... of the information stored by higher iterates of the map has the concave behavior discussed in section 13. This series is graphed for three small noise levels in Fig. 36. As expected, the initial value of the stored information becomes higher as the noise is reduced. The rate of loss of stored information, h_n = I_n - I_{n+1}, is graphed for the three cases in Fig. 37. The rate of loss starts off in each case at the same value, then falls off to a plateau. The length of the plateau becomes longer as the noise level decreases. The value of the rate of loss of information in the plateau region is near the value of the entropy of the map in the no-noise case, which is indicated by the dotted horizontal line. The fact that the entropy of a map with even a low noise level is initially higher than the corresponding deterministic map is due to partition effects, and to regions of the mapping where intervals are contracted, that is, where the slope is less than one. In a purely deterministic description, a small interval representing an uncertainty can be shrunk and then expanded back to the same level without any further loss of information, as schematically illustrated in Fig. 38a.
But if there is noise present in the mapping, no interval can be contracted to below the noise level. An attempt by the deterministic part of the dynamics to contract and then reexpand a small uncertainty interval will result in a larger uncertainty, and loss of information, as indicated in Fig. 38b. As the noise level is decreased, the stored information will increase without bound, but the initial part of the entropy curve will remain the same, even at arbitrarily low noise levels. Thus the entropy of the deterministic approximation to a physical system will be relevant only to observations conducted at intermediate length scales, that is, coarse enough not to resolve the noise level, but not so coarse as to smear out large-scale structures. Although more careful studies might be needed to clarify the effect of partitioning on the entropy calculation, it seems that the initial values of the entropy can be estimated from the deterministic map simply by ignoring the contracting parts of the map. In summary, this definition of entropy, in terms of a rate of loss of stored information, does not suffer the difficulties which the standard definition in terms of partitions finds in the presence of noise. However, the two definitions are equivalent only in a particular intermediate set of length scales.

16. "Observational noise"

In the discussion up to this point, we have assumed that the variables x and x', representing the results of measurements of a system before and after a time interval Δt, are the "true" variables, that is, knowledge of these variables will give the best possible account of the predictability of the system. We have relied on the coordinate invariance of the various measures of information to justify the use of an arbitrary measurement scheme; we hope that a "good enough" measurement of any set of dynamical variables will equivalently characterize the predictability.
More generally, though, we might expect that a measurement procedure often will not reveal the complete state of the system. The measuring instrument may suffer from a lack of resolution, or there may be "noise" in the measuring instrument independent of any noise in the system itself. Or perhaps the system possesses more "degrees of freedom" than the measurement apparatus, and ambiguities are introduced in the projection onto the measured variables. A more general situation is sketched below:

Fig. 39

The "true" before-and-after variables x and x' and their causal connection P(x'|x) are not observed directly. Instead, the states y and y' of the measuring instrument are recorded, and only the causal relation P(y'|y) between these variables is directly accessible. The properties of the measuring instrument are specified by conditional distributions connecting the "true" and observed variables. The vertical data paths of Fig. 39 illustrate two interpretations of the effect of a measurement. The distribution P(x|y) describes a state of knowledge of the "true" variables P(x) given an observed distribution P(y). This is a key step: P(x|y) is a "best guess", which, acting on P(y), yields P(x). The inverse of this conditional distribution, P(y'|x'), can be interpreted as the projection onto the observed variables P(y') of the "true" distribution P(x'). The process whereby the complete system dynamics P(x'|x) generates the observed causal connection P(y'|y) might be described as follows: An observed distribution P(y) constrains the possible values of the "true" variables x to a distribution P(x). This distribution is acted upon by the "true" dynamics P(x'|x), yielding the distribution P(x'), which can be interpreted as a state of knowledge of the "true" variables after a time Δt, given an earlier measurement of the accessible variables y.
Finally, this state of knowledge is projected onto the observable variables by a second measurement, yielding the distribution P(y'). If the "true" system dynamics P(x'|x) happen to be available, and the properties of the measuring instrument are known, the resulting observed dynamics can easily be computed by composition of the conditional distributions:

P(y'|y) = ∫∫ P(y'|x') P(x'|x) P(x|y) dx dx'

This expresses the equivalence of the two paths between y and y' in Fig. 39. This description of the measurement process might be given by an omniscient observer, possessing a total overview, including the true dynamics P(x'|x). However, the usual case is that we know only the observed dynamics P(y'|y), and perhaps something about the properties of the measuring instrument. The static image of Fig. 39 cannot describe the mysterious process whereby we generate successively better models of a presumably unique true dynamics from limited sense data. Yet hopefully communication theory can be useful, providing both a framework for describing imperfect measurements, and a formalism for quantifying the "degree of imperfection". The communication metaphor suggests a few basic concepts which may survive the future development of our understanding of the modeling process.

a) Measuring instrument as channel:

Objective information about the state of a system is obtained via measurement. The measuring instrument is thus a channel between the system and the observer, and if its properties can be represented by conditional distributions P(x|y) and P(y'|x'), the channel can transmit information at a particular rate, given by I(x|y) and I(y'|x'). This rate places limitations on what can be known about the system. We are assuming that both the system and the measurement process are time-independent, and that the same instrument is used in both the before and after measurements.
Thus there will be identical equilibrium distributions P(y) and P(y') defined on the measured variables, and the transmission rate both to and from the primed variables will be the same:

I(x|y) = I(y|x) = I(y'|x') = I(x'|y')

The domain of the measured and "true" variables may well be different, and the measurement may not be being conducted in an optimum fashion; thus this number is a channel rate, and not a channel capacity. Its units are "bits per observation". If the measuring instrument is "perfect", there will be a one-to-one mapping between the measured and "true" variables, and clearly the distributions P(x'|x) and P(y'|y) will have the same stored information and entropy. Here the rate of transmission through the channel is infinite, if the variables are considered continuous. For any physical measurement, though, the rate is finite, which implies that the stored information associated with the observed model is necessarily less than that of the true dynamics:

I(y'|y) < I(x'|x)

The observed mapping P(y'|y) is in this picture a composite of the maps associated with the "true" dynamic and the measurements, and it can be shown mathematically (exercise) that the rate associated with any composite mapping is less than or at best equal to the capacity or rate of the smallest of its parts, equality holding only in the singular case where everything else is one-to-one. This is clear physically: the information flow through a series of devices is restricted by the slowest among them. The measurement instrument acts as a "filter" between the observer and the system, and both the stored information and the entropy associated with the observed dynamic P(y'|y) will be less than or equal to the stored information and entropy of the "true" dynamic P(x'|x), for predictions any distance into the future.
Thus both the absolute value and the slope of the curve I(t) will be less for the observed variables, as sketched below:

Fig. 40 - Stored information and entropy reduced by passage through measurement channel.

Although in general the true dynamics P(x'|x) cannot be fully known, usually we have a greater or lesser confidence in the measurements. For example, the water drop dynamics takes place on a time scale of tenths of seconds, whereas measurements are performed with a crystal clock, with a resolution of a few microseconds. Thus we expect that, if the dynamics is indeed low dimensional, we will be able to construct a good model of the true dynamics. This intuitive feeling can be numerically justified, for the "fuzzy hump" data of Fig. 30, by comparing the information the system is capable of storing, about 10 bits, with the measurement channel rate. The latter can be estimated by considering that the timer can resolve the true drop time to 4 microseconds out of an a priori expectation that is roughly flat over the average drop interval, about .1 second. Thus about 14 bits are available for measurement. The fact that the measurement precision is much greater than the measured system storage capacity provides a good basis for the belief that the system is well characterized by the measuring instrument, but whether this is actually the case or not is an undecidable question. This point will be discussed later.

b) Observation rate:

A single observation of a physical system will yield a certain amount of information about the state of the system; the amount is, as we have seen, quantifiable in bits and limited by the resolution of the measuring instrument. By considering a series of observations, repeated at a given rate, we can define an average "observation rate", the product of the average bits per observation and observations per second. The units of this quantity are thus "bits per second", the same as the entropy.
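The channel-rate estimate above, and the observation rate built from it, amount to a one-line calculation (the ten-drops-per-second figure is inferred from the .1-second average interval):

```python
import math

resolution = 4e-6   # timer resolution, seconds
window = 0.1        # a-priori-flat window: the average drop interval, seconds

# Bits available per measurement: log2 of the number of
# distinguishable timer readings within the window.
bits_per_obs = math.log2(window / resolution)
print(round(bits_per_obs, 1))   # about 14.6 bits per observation

# Observation rate: one measurement per drop, ~10 drops per second.
obs_rate = bits_per_obs * (1 / window)
print(round(obs_rate), "bits per second")
```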
One task of an experimenter studying a new system is to characterize the determinism of the system, which can be described as the information stored through time in the dynamical variables. If the system has a positive entropy, some or all of this causal connection will be lost in the passage of time. If succeeding measurements are too far apart, they will not be able to resolve the determinism, even if they are of high precision; this situation is sketched below:

Fig. 41 - Causally disconnected measurements.

Here the observer is only taking random samples out of the minimum information distribution P(x). In order to resolve a given degree of determinism, or stored information, the experimenter must observe at a rate greater than the entropy at that level of stored information. If we assume that the average rate of loss of information is roughly independent of the amount of information stored, as in the figure, then a given degree of determinism can be resolved with either high precision measurements at a low rate, or coarse measurements at a high rate[30].

c) Complete and incomplete measurements:

An autonomous system can be described as a dynamical rule acting on some space of states {X}, producing a sequence of states X_n, X_{n+1}, X_{n+2}, ... The dynamical rule which we have been writing P(x'|x) thus will also appear below as P(X_{n+1}|X_n). If a system is completely specified by its "state" X_n, then there will be no increase in predictive ability upon learning its history X_{n-1}, X_{n-2}, etc. Thus we can define a "complete measurement" as one which allows the maximum possible predictability with a single measurement. With a "complete" set of variables, the dynamics becomes a first-order Markov process; the evolution of the system so described is independent of its history, that is, P(X_{n+1}|X_n) = P(X_{n+1}|X_n, X_{n-1}, ...). This is like a "nearest-neighbor interaction" between successive observations X_n, X_{n+1}.
The "true" variables are by definition "complete", but this property can be lost in an imperfect projection down to the measured variables. Some unmeasured aspect of the system may persevere through time to be recorded by a later observation. Then a better prediction can be made by considering the past history of measurements; we have, in the language of communication theory, the more complicated situation of a "channel with memory". There are at least two general ways in which this situation can arise. First, the "true" dynamic might be capable of storing considerably more information than the measurement channel can transmit. Examples of this case will be presented in the next section. Second, the dimensionality of the "true" dynamic might be higher than that of the measured variables, or the two sets of variables might not be mapped in a one-to-one fashion. This case is exemplified by Fig. 6; even simple functions can become multivalued. Note that in the second case, the measured variables cannot be made complete by just increasing the accuracy of the measurements; an actual change in dimensionality or mapping of the observation channel is required. A complete set of variables can of course be constructed by assembling a larger dimensional variable consisting of the present measurement plus as many past measurements as increase predictability. The process of reconstructing a higher dimensional state space from lower dimensional observations has been termed "Geometry from a time series" by Packard et al [31].

17. Symbolic dynamics

The result of a physical observation is typically recorded in the form of a number of finite length. Thus, from a measurement point of view, an experiment can have only a finite number of outcomes. Each outcome can be labeled by a symbol, and the recorded time behavior of the system then consists of transitions between these symbols. Measurement, in principle, reduces a continuous system to a discrete one.
If the underlying geometry of the dynamics is simple, a rather coarse measurement with a small number of symbols may extract a considerable portion of the determinism of the system. The description of a continuous system by a small number of symbols is often referred to in the mathematics literature as "symbolic dynamics" [32]. Repeated coarse measurements of a chaotic dynamical system typically will produce a partially correlated string of numbers. We will begin by discussing the predictability and entropy of such a data stream. Then we will consider the case where a measurement can have one of only two outcomes. This provides a particularly simple arena to examine, in a preliminary way, a few rather general questions. Given a data stream, how much can we infer about the system that produced it? What sort of measurements should we make to optimize the predictability of a system? Do we use the same criteria to best determine the past states of a system? This section builds on the recent work of Crutchfield and Packard [33]. First a bit of notation: a string of symbols generated by repeated measurement of some system will be written s_1, s_2, s_3, ..., and a history of n such symbols s_1, s_2, ..., s_n will be denoted S^n. Now because the measurement process has produced a discrete set of symbols, we can directly apply the definition of entropy; for a single symbol we have

H(s) = -Σ P(s) log P(s)

where the sum is taken over the possible values each symbol s can take on. The entropy of a string of length n is given by

H(S^n) = -Σ P(S^n) log P(S^n)

where the sum is taken over the much larger set of all possible patterns of symbols of length n. Already we can see that practical problems will arise in computing directly the entropy of long strings of symbols.
If successive symbols are totally uncorrelated, as are for example the outcomes of a repeated coin toss, then the entropy of a string of n events is simply n times the entropy or "surprise" of the single event:

H(S^n) = nH(s)

But if the events are correlated, the "surprise" of later events is reduced by expectations built up by the earlier history. The classic example, due to Shannon, is written language. The longer a string of text, the easier it is to guess the next letter. The expected general form of the total entropy as a function of the number of events or symbols is illustrated in Fig. 42 below. The total entropy increases monotonically (we never get "unsurprised"), but the rate slows as expectations are built up, leading to the convex curve shown.

Fig. 42 - Total entropy of a string of symbols as a function of its length. The asymptotic slope of this curve is the entropy per symbol, and the y intercept of this asymptote, indicated by the dotted line, is the stored information.

In a real chaotic system with noise present, initial data has predictive value for only a limited time; events which are too far apart are causally disconnected, as argued in Fig. 41. Correlations extend over only a finite length of the string of symbols; old experience ceases to have predictive value. The effect of this on the graph of Fig. 42 is that the curve approaches a constant slope, that is,

H(S^{n+1}) - H(S^n) = H(s|S^n) → const. ≡ h

for large enough n. Shannon refers to the asymptotic value h as the "entropy per symbol" of a discrete signal source. Now how does one compute the stored information associated with this string of symbols? We know that this number will be finite, if correlations extend only over finite lengths of the string. We also know that, if the curve of Fig. 42 is initially convex, the measurements which produced the symbol string were "incomplete", in the sense discussed in the preceding section.
Predictability of the next symbol is increased, up to a point, by considering longer and longer past symbol histories. A reasonable way to approach this computation is to consider the total predictability of the future symbol string given the past symbol history, I(s_{n+1}, s_{n+2}, ... | s_n, s_{n-1}, ...), or I(S^n|S^m) if we indicate past and future strings by S^m and S^n. We can now argue that this number will be independent of m and n, if both are larger than the correlation length. A few manipulations of the definitions of entropies and stored information given earlier yield:

I(S^n|S^m) = H(S^m) + H(S^n) - H(S^{m+n})

This can be seen to be the y intercept of the tangent to the flat part of the curve of Fig. 42, if m and n are large enough. This nice geometric picture tells us, for example, that

I = H(S^n) - nh

for large n, where h is the asymptotic entropy H(S^{n+1}) - H(S^n). It is clear as well that the y intercept of the tangent to any portion of the curve gives a lower bound to the stored information, and that a criterion for the maximum correlation length is that the y intercept no longer increase with increasing string length n. If the resolution of the measurement process which produces the symbol stream is increased, the number of values each symbol s can assume increases, and the number of possible patterns S^n increases exponentially. As the resolution approaches the noise level, the asymptotic slope of Fig. 42 will increase without bound. But the y intercept indicated will remain bounded, presumably approaching the stored information of the underlying dynamical system. Please note here the distinction between the entropy of a particular number stream obtained by measurement, and the entropy of a map or system, defined in section 13 to be the rate of decrease of stored information or predictability as future times are considered. The latter "entropy" remains bounded as resolution is increased.
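The block-entropy curve of Fig. 42 and the estimate I = H(S^n) - nh are easy to reproduce numerically. The source below is a hypothetical correlated two-symbol Markov source (it tends to repeat its last symbol), standing in for a measured symbol stream:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)

# Correlated two-symbol source: repeat the previous symbol with
# probability 0.9 (a hypothetical stand-in for a measured symbol stream).
s, sym = 0, []
for _ in range(100_000):
    sym.append(s)
    if rng.random() >= 0.9:
        s = 1 - s

def block_entropy(n):
    """H(S^n): entropy in bits of the length-n blocks of the string."""
    counts = Counter(tuple(sym[i:i + n]) for i in range(len(sym) - n + 1))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-np.sum(p * np.log2(p)))

H = [block_entropy(n) for n in range(1, 6)]
h = H[-1] - H[-2]      # asymptotic slope: the entropy per symbol
I = H[-1] - 5 * h      # y intercept of the tangent: the stored information
print([round(v, 3) for v in H], round(h, 3), round(I, 3))
```

For this source, H(s) is close to 1 bit, h approaches the binary entropy of the repeat probability (about .47 bits per symbol), and the stored information is the difference, roughly half a bit.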
Now we will take a look at the effect of coarse measurements on the familiar parabolic one-dimensional map. This example, the writer hopes, will provide a useful and not overly technical illustration of the ideas suggested above. The single dynamical variable in a one-dimensional map is a value along the real line between (usually) zero and one. A simple way to model a coarse measurement of this variable is to divide the interval in two, and to report as the measured variable only whether the value of the "true" dynamical variable falls on the left or right half of the interval. This measurement scheme is illustrated for the map X_{n+1} = 3.7 X_n(1 - X_n) in Fig. 43a, where the dividing line is placed at X = .5. Fig. 43b shows the minimum information distribution P(X) for this map.

Fig. 43 - Graph and invariant distribution for the map X_{n+1} = 3.7 X_n(1 - X_n).

The results of a sequence of repeated measurements of the system are no longer a string of high-precision real numbers between zero and one, but rather a string of the two symbols "0" and "1", labeling the left and right side of the interval. If we denote the string of symbols so generated as s_1, s_2, s_3, ..., then the transition matrix of measured quantities P(y'|y) can equivalently be written P(s_{n+1}|s_n). The information stored in the continuous variable over an iteration of the map is arbitrarily large, as discussed in section 11. But under the dynamic induced by the single symbol measurement, clearly no more than one bit can be stored over a single iteration, as only one bit is being observed. We will now, in Fig. 44, trace the information stored around the schematic data paths of Fig. 39, where P(x'|x) is the "purely deterministic" dynamics defined by the one-dimensional map, and P(y'|y) is the dynamic induced by the coarse measurement.
As mentioned above, I(x'|x) is infinite, as are also H(x) and H(x'), but I(x|y) and I(y'|x') can be no more than one bit, the maximum capacity of the measurement channel. I(x|y) is in fact somewhat less, as the partition at X = .5 does not divide P(x) evenly; the iterates of the map fall on the right about 78% of the time. Thus the symbol "1" occurs more often; this constitutes prior knowledge which reduces the average new information obtained in each single measurement to .766 bit. A "best guess" of the state of knowledge of the "true" variable x given that the symbol "1" has been observed is illustrated in the top-left panel of Fig. 44; it is simply the right side of the minimum distribution P(x). Given that the channel is noise-free, the knowledge we gain about the system is simply equal to the entropy or "surprise" of the measurement, that is,

I(x|y) = H(y)

This is the informational value of the inference we make about the "true" system given the limited measurement. This state of knowledge P(x) is now iterated under the "true" dynamics P(x'|x), yielding the distribution P(x') illustrated in the top-right panel of Fig. 44. This distribution represents our knowledge of the variable after an iteration of the map, given that we knew only that the variable was to the right of .5 in the interval before the iteration. The information stored to this point (averaged over both possible initial measured symbols "0" and "1") has been reduced to .254 bit. The difference between I(x|y) and I(x'|y) is, for this partition position, H(x'|x), the entropy of the deterministic map, as computed for example in Ref. [22]. Finally, the distribution P(x') is projected down onto the measured variables "0" and "1", as shown in the lower-right panel, further reducing the stored information to .094 bit. Now we have completed our trip around the diagram, and this last number is the stored information associated with the induced dynamic between symbols, P(y'|y).
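This trip around the diagram can be sketched numerically by discretizing the continuous variable on a fine grid and composing the three conditional distributions, as in the integral given in section 16. The sketch below checks that the composed P(y'|y) agrees with the symbol transition matrix counted directly from the orbit; the grid size and orbit length are arbitrary choices, not from the text.

```python
import numpy as np

# Orbit of the noise-free map x' = 3.7 x (1 - x)
r, bins = 3.7, 400
x, orbit = 0.31, np.empty(500_000)
for i in range(orbit.size):
    x = r * x * (1 - x)
    orbit[i] = x

ix = np.minimum((orbit * bins).astype(int), bins - 1)  # fine cells for x
sy = (orbit >= 0.5).astype(int)                        # coarse symbols y

Px = np.bincount(ix, minlength=bins) / orbit.size      # P(x)
Py = np.bincount(sy, minlength=2) / orbit.size         # P(y)

# Fine-grid dynamics P(x'|x), column-normalized from orbit pairs
Mxx = np.zeros((bins, bins))
np.add.at(Mxx, (ix[1:], ix[:-1]), 1)
col = Mxx.sum(axis=0)
Mxx[:, col > 0] /= col[col > 0]

# P(x|y): the "best guess", P(x) restricted to one side of the partition
half = bins // 2
Pxy = np.zeros((bins, 2))
Pxy[:half, 0] = Px[:half] / Px[:half].sum()
Pxy[half:, 1] = Px[half:] / Px[half:].sum()

# P(y'|x'): projection of the fine cells onto the two symbols
Pyx = np.zeros((2, bins))
Pyx[0, :half] = 1.0
Pyx[1, half:] = 1.0

# Composition P(y'|y) = sum over x, x' of P(y'|x') P(x'|x) P(x|y)
T_comp = Pyx @ Mxx @ Pxy

# Direct count of the induced symbol dynamic P(s_{n+1}|s_n)
T = np.zeros((2, 2))
np.add.at(T, (sy[1:], sy[:-1]), 1)
T /= T.sum(axis=0)

print(np.round(T_comp, 3))     # the two routes around Fig. 39 agree
print(np.round(T, 3))
print(round(float(Py[1]), 2))  # iterates fall on the right most of the time
```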
Thus by transforming to and from the "true" variables, we can compute the transition matrix between the symbols "0" and "1", and the stored information associated with this matrix. However, in writing the induced dynamic P(y'|y) as a two-by-two matrix P(s_{n+1}|s_n) we assume that the observationally accessible state of the system is completely specified by a single measurement of the primed variables. That is, the single symbol "0" or "1" describes all we can know about the system. But in this case we can make a better prediction as to the next measured symbol than is possible with just the transition matrix by considering the history of observed symbols, not just the single prior symbol. A single symbol measurement is thus not "complete"; that is,

P(s_{n+1}|s_n, s_{n-1}, ...) ≠ P(s_{n+1}|s_n)

The stored information associated with the single-symbol matrix P(y'|y), I(y'|y) = .094 bit, can be improved upon by constructing higher-order transition matrices, considering the previous two, three, or more measured symbols to predict the next. The more of the past which is considered, the better the prediction of the next symbol; the stored information as a function of the number of past symbols considered is graphed below in Fig. 45 for the map of Fig. 43.

Fig. 45 - Predictability of the next single symbol given a history of length n.

This curve is simply related to the slope of the total entropy curve through the relation:

I(s|S^n) = H(s) - H(s|S^n)

Note the extremely slow convergence of this curve; even after we take a history of twenty symbols, there is more predictability to be gained by looking back to the twenty-first. This is because our model is noise-free, with correlations extending over indefinite lengths. The tangent to the asymptotic curve becomes, at least, difficult to compute. As we will see in the next section, real data behaves more reasonably.
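The quantity graphed in Fig. 45 can be estimated from a symbol string via the relation just given, I(s|S^n) = H(s) - H(s|S^n), with H(s|S^n) obtained as a difference of block entropies. The history lengths and orbit length below are arbitrary; the slow convergence is the point.

```python
import numpy as np
from collections import Counter

# Binary symbol string from the noise-free map with the partition at .5
r, N = 3.7, 200_000
x, s = 0.37, []
for _ in range(N):
    x = r * x * (1 - x)
    s.append(int(x >= 0.5))

def block_H(n):
    """Block entropy H(S^n) in bits."""
    c = Counter(tuple(s[i:i + n]) for i in range(N - n + 1))
    p = np.array(list(c.values()), dtype=float)
    p /= p.sum()
    return float(-np.sum(p * np.log2(p)))

Hs = [block_H(n) for n in range(1, 8)]
H1 = Hs[0]
# I(s|S^n) = H(s) - H(s|S^n), with H(s|S^n) = H(S^{n+1}) - H(S^n)
gains = [H1 - (Hs[n] - Hs[n - 1]) for n in range(1, 7)]
print([round(g, 3) for g in gains])   # still rising at a history of six
```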
An upper limit to the predictability of s_{n+1} obtainable with any amount of knowledge of the past is supplied by the value I(y'|x), or I(s_{n+1}|X_n). This is the predictability of the next single symbol s_{n+1} given knowledge of the complete continuous variable X_n; clearly the knowledge resulting from a set of past measured symbols can only approach this value. Knowledge of X_n reduces the next symbol s_{n+1} to a certainty, that is, H(y'|x) = 0, and I(y'|x) = I(y'|x'), because I(x'|x) is infinite, and P(y'|x') is a projection. However, the entropy of the deterministic map itself remains finite, here H(x'|x) = .512 bit. A second and stricter limit on the predictability of a series of coarse measurements is obtained by considering the value I(x'|y), or I(X_{n+1}|s_n), indicated by the dotted line in Fig. 45. This is the amount of information an observer with full ability to observe the complete continuous variable gains when he learns which side of the interval the variable originated from, corresponding to the upper-right of Fig. 44. This quantity limits the increase in our ability to predict the future symbols s_{n+1}, s_{n+2}, ... given the single symbol s_n, that is, the average increase in our ability to predict the future we obtain with each new measurement. Note that although I(x'|y) and I(y'|x) describe the information stored along symmetric paths in Fig. 39, they are not equal. The measurement process has introduced a time asymmetry into an autonomous system. In this case, a single measurement gives us more information about the past system state than the future. Because the system is presumed to be autonomous, any number stream derived from it by measurement must be equally predictable forward and backward in time, that is, I(y'|y) must equal I(y|y'), and likewise for higher order combinations of symbols.
But once we assume a model P(x~|x), in this case expli¬ citly nopinvertable, a single measurement may allow us to infer more about the state of the model in one time direction. For the case of the noise-free one-dimensional map of Fig. 43a, the com¬ plete state of the continuous variable can be determined by considering the sequence of the coarsely measured variable s^, e rt ^i* etc * it: generates. This however is a very special case, and is only possible because i) the system is "perfectly deterministic”, with an infinite storage capacity, thus the present state of the variable X determines all future states of the system, and ii) the partition dividing the interval is in exactly the right place to allow a one- to-one mapping between points on the continuous variable X and symbol histories. We will now study the effect of moving the partition from x - .5, changing the projection of the system onto the observed variables y and y'. Figure 46 demonstrates that the effect depends on which informational measure is being considered. H(s), the simple one-symbol entropy ignoring symbol to symbol correlations, is a maximum at the partition position which makes the two possi¬ ble symbol values equiprobable. H(s|s n ), the "surprise” of the next symbol given a symbol history, is a maximum at x« .5, the peak of the map. Its value, in a noise-free model, is limited by the deterministic entropy of the underlying map. I(s|S n ), the predictability of the next symbol given a symbol history, has 78 position Figo 46 -The top left panel shows the attracting part of the graph for the map x" = 3.7x(l - x). The identity point of the map is the point in x where the identity line crosses the graph. In the other panels appear the single symbol entropy, the entropy of a symbol given its history, and the predictability of a symbol given its history, all as a function of partition position. A history of 15 symbols was used in these calculations. 
a sharp maximum at the identity point of the map, x = x'; here simple alternating sequences like 010101 are most likely. The lesson of Fig. 46 seems to be that, if we have to perform a measurement with constraints, the "best" measurement of a system will depend on the use to which the system is put. If we want to use the system as a recording device, and determine the past states, we may want to optimize the entropy H(y'|y) coming from the system. If we want to send signals into the future, and exercise control over future measurements, we will maximize I(y'|y). An extreme example of the difference between these two measures is given by the full two-onto-one parabolic map x' = 4x(1-x), or by the binary shift map of Fig. 21, with the measurement partition at x = .5 in both cases. Here the entropy generated by the map, H(x'|x), becomes equal to one bit per iteration, which is the maximum measurement channel rate I(y'|x'). Thus no predictability of the future is possible, yet an initial condition in the continuous variable x can be read off to arbitrary accuracy if we wait long enough. The considerations guiding optimal partitions in one dimension seem straightforward, but even in two dimensions the situation is unclear [34]. Also, the problem of fuzzy partition boundaries, or noise in the measurement channel, seems open. The whole question of the matching of a measurement channel to a particular system seems largely unexamined in a physics context. "Here be dragons", and a lot of interesting geometry.

18. Example from the data

The effect of a coarse two-symbol measurement on the stored information and entropy will now be illustrated, for the "canonical" data of Fig. 30. As must be clear by now, this data set was chosen as an example due to its similarity to the one-dimensional map with noise added of Fig. 4. The stored information associated with the "true" dynamic P(T_{n+1}|T_n) is 1.80 bits, as was shown in section 14.
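As an aside, the read-off property of the noise-free binary shift map mentioned in the previous section can be demonstrated directly: with the partition at x = .5, the emitted symbols are exactly the binary digits of the initial condition. A minimal sketch using exact rational arithmetic to avoid roundoff; the particular initial condition is an arbitrary choice.

```python
from fractions import Fraction

# Binary shift map x' = 2x mod 1, observed with a partition at x = .5:
# each symbol is one binary digit of the initial condition, so the
# initial condition can be read off to arbitrary accuracy by waiting.
def shift_symbols(x0, n):
    x, bits = Fraction(x0), []
    for _ in range(n):
        bits.append(1 if x >= Fraction(1, 2) else 0)
        x = (2 * x) % 1
    return bits

def reconstruct(bits):
    # x_0 recovered from the symbol string: sum of b_k * 2^-(k+1)
    return sum(Fraction(b, 2 ** (k + 1)) for k, b in enumerate(bits))

x0 = Fraction(355, 1024)
bits = shift_symbols(x0, 10)
print(bits, reconstruct(bits) == x0)
```

Despite this perfect retrodiction, the same partition yields one full bit of surprise per iteration, so prediction is impossible.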
The maximum capacity of the observation channel is only one bit, leaving us to suspect that the coarse measurement is not "complete", that is, the quantity I(y'|y) can be improved on by considering the past history of the primed variables. Taking a cue from Fig. 43a, we define the two symbols of the coarse measurement by placing the partition at the peak of the distribution. Fig. 47 below shows that there is indeed a slight convexity in the total entropy curve as the number of past symbols is increased. Here, though, the curve reaches an asymptotic slope after only a few past symbols are taken into account. This is understandable in view of Fig. 31; the stored information associated with the "true" dynamic P(x'|x) falls to near zero after only 4 iterations. There is no point in looking farther back into the past than this, as the system state at that point is causally disconnected from the present state.

Fig. 47 - Entropy H(S_n) (bits) as a function of number of symbols of the "fuzzy hump" data set, for a partition taken through the peak of the hump. The asymptotic slope is maximal for this partition, but not the y intercept.

Fig. 48 - Figure corresponding to Fig. 46, for real data. Note the general correspondence in form of the various informational measures. A history of 7 symbols was used, longer than the correlation length of about 4 symbols.

The effect of varying the partition position is shown in Fig. 48. The predictability and surprise of the next symbol given a past symbol history vary pretty much as in the deterministic parabolic map of Fig. 46. Apparently, some of the geometrical considerations governing the selection of optimum partitions carry over from the deterministic to the noisy case.
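The deterministic side of this comparison, the Fig. 46 construction, can be sketched numerically from symbol block entropies. The orbit length and the history of 8 symbols (shorter than the 15 used for Fig. 46) are compromises for speed, not choices from the original calculation.

```python
import math
from collections import Counter

# H(s), H(s|history), and I(s|history) as a function of partition
# position, for symbol sequences generated by the map x' = 3.7 x (1 - x).
def orbit(r=3.7, x0=0.3, n=20000, transient=1000):
    x = x0
    for _ in range(transient):
        x = r * x * (1 - x)
    xs = []
    for _ in range(n):
        xs.append(x)
        x = r * x * (1 - x)
    return xs

def block_entropy(symbols, m):
    # entropy of m-symbol blocks, estimated from histogram counts
    counts = Counter(tuple(symbols[i:i + m]) for i in range(len(symbols) - m))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def measures(xs, partition, m=8):
    s = [1 if x >= partition else 0 for x in xs]
    h1 = block_entropy(s, 1)                                # H(s)
    h_cond = block_entropy(s, m + 1) - block_entropy(s, m)  # H(s|history)
    return h1, h_cond, h1 - h_cond                          # I(s|history)

xs = orbit()
for p in (0.4, 0.5, 0.6, 0.73):   # 0.73 is near the identity point of the map
    h1, hc, i = measures(xs, p)
    print(f"partition {p}: H(s)={h1:.3f}  H(s|hist)={hc:.3f}  I(s|hist)={i:.3f}")
```

The surprise peaks near the top of the map and the predictability near the identity point, in line with the panels of Fig. 46.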
The reduction in predictability resulting from the coarse measurement is quantified by comparing the stored information of our best idea of the "true" variables, I(T_{n+1}|T_n), with the total predictability of the future symbol string given the past symbol history, I(s_{n+1}, s_{n+2}, ... | s_n, s_{n-1}, ...). Because correlations extend only over a few measurements, a calculation of the latter quantity is feasible, and is shown as a function of partition position in Fig. 49. The optimum position for best total predictability is, as for the best prediction of the single next symbol, somewhere near the identity. This best value is about .7 bits, thus the reduction due to the coarse resolution of the measurement is at least 1.8 - .7 = 1.1 bits.

Fig. 49 - Total stored information of the data string as a function of partition position. This figure corresponds to the movement of the y intercept in Fig. 47 as the partition is varied.

We have demonstrated, at perhaps excessive length, methods of characterizing data originating from a chaotic system, with a stochastic noise element, passing through a limited measurement channel. The next section mentions some of the difficulties in applying the methods presented here.

19. Limitations of this modeling procedure

The treatment of experimental data discussed in this paper can be summarized as follows. First, an appropriate set of variables x are selected, and their values discretized into a finite number of possibilities. Next, the transition matrix P(x'|x) is constructed by taking a large quantity of data and accumulating the individual matrix elements as histograms. Finally, numerical measures such as the stored information are computed, to characterize the predictability which this empirically constructed model enables. An immediate practical difficulty with this procedure is: how do we know that we have selected an "appropriate" set of variables to describe the state of the system?
For example, the number of observed variables might be less than the actual "dimensionality" of the system. Measurements will then be "incomplete" in the sense discussed in section 16, and a reconstruction process will be required to generate a more complete characterization of the system state. Even the question of a practical definition of the "dimension" of a dynamical system is a research problem; this question is briefly discussed in appendix 2. Low-dimensional model systems, as well as the experimental data presented here, can have complicated geometrical structure, with what corresponds to an intuitive notion of "dimension" varying locally. The selection of variables in this situation clearly requires the judgement of the observer, implying the prior existence of a model of a more general sort in his mind. The computation of the stored information associated with a model P(x'|x) amounts to a simple counting procedure; all the dimensionality and topology of the system is projected down to a single number. Much more than this will be required to describe even the gross qualitative geometric features of a system dynamic. In fact, the predictability of the immediate future may vary radically from point to point on the system attractor; this property would not be reflected in the average quantities we have discussed here [35]. Nevertheless, the stored information will be very useful as it does provide an objective measure of the improvement of an observation; as the model or the measuring instrument is improved, the numerical value of the stored information will increase. Another serious practical problem is the amount of data required to construct a good model. The dynamic transition probabilities P(x'|x) are a physical property of the autonomous system which can only be approximated by a data string of finite length.
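The histogram construction, and the spurious stored information contributed by a record of finite length, can be illustrated with a toy sketch. The noisy one-dimensional map, the 16-bin grid, and the record lengths below are arbitrary stand-ins, not the drop data: estimates from short pieces of a record typically come out somewhat higher than the estimate from the full record, the excess being false structure from statistical fluctuations.

```python
import math
import random
from collections import Counter

# Histogram estimate of the stored information I(x'|x): discretize the
# stream into bins, accumulate the transition matrix as a histogram,
# and sum p * log2(p / (p1 * p2)) over the occupied cells.
random.seed(2)

def stored_information(data, nbins=16):
    lo, hi = min(data), max(data)
    binof = lambda v: min(int((v - lo) / (hi - lo) * nbins), nbins - 1)
    pairs = Counter((binof(a), binof(b)) for a, b in zip(data, data[1:]))
    n = sum(pairs.values())
    px, py = Counter(), Counter()
    for (i, j), c in pairs.items():
        px[i] += c
        py[j] += c
    return sum((c / n) * math.log2((c / n) / (px[i] / n * py[j] / n))
               for (i, j), c in pairs.items())

def noisy_map_data(n, sigma=0.05):
    # noisy one-dimensional map as a stand-in data stream
    x, out = 0.3, []
    for _ in range(n):
        x = min(max(3.7 * x * (1 - x) + random.gauss(0, sigma), 0.0), 1.0)
        out.append(x)
    return out

record = noisy_map_data(20000)
pieces = [record[i:i + 500] for i in range(0, len(record), 500)]
mean_short = sum(stored_information(p) for p in pieces) / len(pieces)
print(f"I(x'|x) from the full record:    {stored_information(record):.2f} bits")
print(f"mean I(x'|x) from 500-pt pieces: {mean_short:.2f} bits")
```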
If there is not enough data, statistical fluctuations in the histogram approximation to P(x'|x) will add false structure to the model, resulting in an artificially high value of the stored information. The approach to the data taken in this paper was to use histogram bins wide enough to get good statistics, but narrow enough to be below the noise level, and thus capture all the deterministic dynamics. The data records taken so far, typically of lengths of four to eight thousand drop intervals, are in fact only marginally able to satisfy both these conditions simultaneously, even for the simple attractor used as the main example. Also, the noise level below which no causal structure exists is to an extent an assumption, a judgement on the part of the experimenter. The accumulation of histograms is actually often a poor way to measure a physical probability. Statistical convergence is slow, proceeding as the square root of the number of samples, and the sheer size of the transition matrix for a high-dimensional system renders the histogram method completely infeasible. Yet there must exist more efficient methods for constructing models, or representing expectations, as demonstrated by the simple fact of our ability to function successfully in a multidimensional world. Bayesian methods may aid in treating some of the problems of insufficient statistics. Here typically the physical probability distribution is estimated on the basis of a limited amount of data, using certain assumptions about some properties of the distribution to be measured. The consistency of the observed data with these assumptions is checked, and the assumed prior properties of the distribution can then be modified to provide a better fit with succeeding data. The problem of estimating the size of a space of N states from limited data provides a simple example of this approach.
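A minimal numerical sketch of this state-counting example, assuming uniform random sampling of a hypothetical space of N states; the constant sqrt(pi*N/2) for the mean first-repeat time is the standard birthday-problem result.

```python
import random

# Collision ("birthday") estimate of the number of states N: uniform
# draws from N states start to repeat after about sqrt(pi*N/2) draws,
# long before all N states have been seen.
random.seed(3)

def first_repeat_time(n_states):
    seen, t = set(), 0
    while True:
        t += 1
        s = random.randrange(n_states)
        if s in seen:
            return t
        seen.add(s)

def estimate_n(n_states, trials=2000):
    mean_t = sum(first_repeat_time(n_states) for _ in range(trials)) / trials
    # invert E[t] ~ sqrt(pi*N/2)
    return mean_t ** 2 * 2 / 3.141592653589793

true_n = 10000
print(f"true N = {true_n}, estimated N ~ {estimate_n(true_n):.0f}")
```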
If a random process selects out the states at a rate of one per time unit, we would have to wait a time on the order of N log N to see all the states. But duplications in observed states will start to occur at times on the order of the square root of N, and this can be the basis of an estimate of the size of N. This situation is analogous to the "birthday problem": given a room full of people, how many have the same birthday? See work by Erber[36], Ma[37], and others. To improve our ability to construct models with limited data, further research is indicated into describing the expected properties of the probability distributions of various sorts of systems, and the effects of insufficient statistics on entropy-like measures. Some studies of this sort are underway[38].

20. More general models, how "good" is a model?

There is a large conceptual difference between the simple physical model described in section 5 and the histograms accumulated from the data. The latter are just a representation of the data; every transition is explicitly described (although as mentioned above some thought enters into the very act of setting up the measurement channel). The data histograms in a sense can't be wrong, but despite their exhaustive description of the data they in themselves have no explanative value. Physical models such as the mass-spring example are not constructed directly from the data, but rather by analogy. They consist of algorithms to predict transition matrix entries, in the absence of the data. There are two considerable advantages of this sort of model over the complete transition matrix. First, it is much more concise; only desired matrix elements need be actually generated. Second, because it was generated by an analogy, it often suggests generalizations beyond the particular data, allowing for example predictions about the effect of varying system parameters.
The raw histograms, despite their limitations, have the nice feature that there exists a measure of quality, the stored information, which is independent of external auxiliary measures such as "utility functions". This measure is a property of any string of numbers derived from a time-invariant process. But once a model of the second type has been constructed by the observer, there are two streams of numbers produced, one by the model and one by the reality, and a measure of the "goodness" of the model will be some sort of "distance" between the two number strings (Fig. 50). The value of a model in the real world depends upon its use, and thus in general rests on an external evaluation function. Shannon, in the context of communication theory, introduces a "fidelity criterion" of fairly general form to measure a distance between an input and output number string. This measure, dependent on the value to the user of a particular degree of accuracy in channel transmission, determines an effective rate of transmission through a noisy channel. In the biological world, the effective "evaluation function" judging the utility of a model is often severely binary. A fox and a hare carry with them a rather detailed model of each other's behavior, yet in their interaction all the subtlety is collapsed to two outcomes: the hare is eaten or the fox goes hungry. If one is willing to simply count distinguishable states and transitions between them, without judging the value of particular items of information, and if a large amount of data is available, a "distance" between model and observed data strings can be defined, independent of an evaluation function. This can be done by computing the predictability of one string given the other. Typically a physically observed initial condition is supplied to the model, and the subsequent physical and model data streams compared.
A certain number of digits will correspond between the two number streams, and a certain number will differ; see Fig. 51. The difference between the two number streams is, to the observer, an unpredictable element, and thus represents a source of information. For a noisy system (or model), the amount of information so generated will diverge for measurements of arbitrary precision, as discussed earlier. However, the average number of matching digits, and the average rate of loss of matching digits as predictions farther into the future are attempted, can be recorded. It is unclear to the writer whether such measures are nicely coordinate independent, but they seem operationally well-defined in a given setting, and capable of measuring improvement of a model. An improved model will result in a lower rate of divergence between the two number strings, less "surprise" per unit time to an observer making continuous measurements at fixed resolution. These ideas have already appeared in practical use. In a recent talk[40], Cecil Leith discussed a large computer model used for weather prediction. An observed weather pattern is given to the model as an initial condition, and the evolution of the model and the actual weather system compared. The exponential rate of divergence of the two paths through the space of variables is computed, and used to indicate improvement in the model. It should be noted that predictive errors of fixed constants, or fixed drift rates, do not reduce the "goodness" of a model under this sort of measure. Eclipse predictions which are always late by exactly twenty-four hours are of equal value as correct predictions, once the fixed delay has been recognized. In principle, a "table of corrections" can be prepared over a short time to compensate for fixed errors, and linear and even polynomial tendencies for the two strings to drift apart.
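The matching-digit bookkeeping can be sketched with two orbits of the map x' = 3.7x(1 - x) standing in for the "model" and "observed" streams, the model being started from a slightly mismeasured initial condition; the 1e-9 discrepancy is an arbitrary choice.

```python
import math

# Count the leading binary digits on which "model" and "reality" agree,
# as a function of prediction time. For a chaotic map the count falls
# off at roughly the entropy rate of the map.
def f(x, r=3.7):
    return r * x * (1 - x)

def matching_bits(a, b):
    # number of agreeing leading binary digits, for values in [0, 1]
    d = abs(a - b)
    return 60 if d == 0 else max(0, int(-math.log2(d)))

real, model = 0.3, 0.3 + 1e-9
for t in range(0, 31, 5):
    print(f"t={t:2d}  matching leading bits ~ {matching_bits(real, model)}")
    for _ in range(5):
        real, model = f(real), f(model)
```

An improved model corresponds to a smaller initial discrepancy, which buys a proportionally longer run of matching digits, but the rate of loss is set by the system.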
Only exponential rates of separation will give finite contributions to the logarithmic measures of information.

Fig. 51 - A "model" digit stream and an "observed data" digit stream compared over time.

A second sort of measure of the "goodness" of a model arises from considerations of the conciseness of a model. To actually generate a prediction from a model requires a computation, and one can give a length in bits for a computer program to implement the calculation. Or one can measure the length of the message which communicates the model to another person. If the same "distance" under some measure between model and reality can be achieved with a shorter model, we usually feel that the shorter model is "better". These ideas have been developed in the field known as "algorithmic information theory", by Kolmogorov, Chaitin, and others. A very readable introduction is Chaitin's Scientific American article[41], as well as Joe Ford's article, "How Random is a Coin Toss?"[42]. Chaitin and Kolmogorov define the "randomness" of a number string to be the length of the computer program required to generate it. A "completely random" number can only be specified by explicitly listing each digit, thus the corresponding computer program must be of at least equal length. The asymptotic ratio of program length to output length will be one "bit per bit" for a completely random number, and zero for a completely determined number string, e.g. 1010101010... . A highly concise model is useless if it says nothing about the real data string it is supposed to predict. Thus a reasonable measure of the "goodness" of a model should compare the length of the model to the predictability it enables.
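The program-length notion of randomness can be crudely illustrated with a general-purpose compressor standing in (very imperfectly) for the shortest generating program: a periodic string like 1010... compresses to almost nothing, while a random string barely compresses at all. This is only an analogy to the Chaitin-Kolmogorov definition, not an implementation of it.

```python
import random
import zlib

random.seed(4)

periodic = ("10" * 50000).encode()                          # completely determined
rand = bytes(random.getrandbits(8) for _ in range(100000))  # "random" bytes

for name, s in (("periodic 1010...", periodic), ("random bytes", rand)):
    ratio = len(zlib.compress(s, 9)) / len(s)
    print(f"{name}: compressed to {ratio:.4f} of original length")
```

The compressed-to-original ratio plays the role of the "bit per bit" ratio in the text: near zero for the determined string, near one for the random one.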
For example, one might measure the total size of the body of information comprising the model, including the length of the algorithm, the size of any "table of corrections", and the amount of initial data used, and compare it to the total predictability it enables, that is, the total number of nonredundant "matching digits" integrated over the future. The chief limitation of the modeling schemes described in this paper, and indeed of all of information theory as it presently exists, is the limitation to autonomous systems. In any hydrodynamic system of reasonable size, as well as in daily life, the space of possibilities is so huge that all that can be observed are unique events. Only a tiny fraction of the possible distinguishable states are visited in a reasonable time, and the accumulation of average transition rates between individual states is out of the question. But we know that our brains can organize experience, and represent expectations about reality in a way that allows considerable prediction. Despite our constant use of this facility, we have virtually no exterior understanding of it, as demonstrated by our inability to incorporate even a rudimentary form of this process in any man-made machine. A "good model" in physics participates in this unknown interior organizing process, and is said to be explanative. At this point a "good explanation" can only be defined in terms of the subjective feeling of satisfaction it produces, but before subjective sensation is deemed out of place in a discussion of the construction of models, it should be remembered that such feelings or instincts guide all creative work, scientific or otherwise. A model can have "explanative value" even when it has no predictive power, the outstanding example being the theory of evolution.

21. The undecidability of optimum modeling

As a model is improved, the discrepancy between its predictions and data from the system under study becomes less and less.
Under an older view of things, this process could be continued indefinitely; successive improvements to the model could bring the two number strings into coincidence to as great an accuracy as desired. However, if the given system is chaotic, a minimum rate of divergence between the two number strings is guaranteed. Even if the model were an exact physical duplicate of the original system, and was set to the same initial condition, the two would diverge at a rate given by the entropy of the system. In this situation it becomes reasonable to ask, how are we to know when we have reached this intrinsic limit to predictability? How are we to know when we have constructed the optimum model? A similar question arises in an observational context. It is a common experience for those of us who conduct physical experiments that data improves over time, often for no apparent reason. For example, the faucet which produced the data of Fig. 52a below might, several weeks later, yield the data of Fig. 52b at the same flow rate. The improvement in the data is quantifiable as a measured increase in the information storage capacity of the system, computed as described in this paper. How are we to know that further improvements in equipment or technique will not result in even better data, and increased predictability of the system?

Fig. 52

The purpose of this section is to make the fairly obvious point that we don't. Furthermore, it will be argued, the question of whether a model is optimum is undecidable in the sense of Gödel. From a classical perspective, it is clear that an experimenter's claim that he is "in the noise", and that there is no point in attempting to increase resolution, is an assumption. If the observation rate is less than the entropy of the system at some resolution level, the system will appear stochastic, even if it were "completely deterministic". Boltzmann recognized the logical necessity for this sort of assumption.
It appears as an "assumption of molecular chaos" in his treatment of the kinetic theory of gases. In the more modern view this assumption becomes labeled a fact of Nature, and the experimenter will point to "thermal kT noise", or even "basic quantum mechanical indeterminacy" as justification for not expending more effort on resolution. Still, the question of ultimate resolution remains a practical one for chaotic systems with measured noise levels far above the kT level, as is probably the case for the water drop system, and certainly for weather systems. The concept of "undecidability" has been developed in the context of discrete state logical systems and computing machines, and has not, to my knowledge, been generalized to continuous systems or systems with a noise element. Nevertheless let us usurp some vocabulary and results for use in dynamical systems theory, noting that a large discrete state system can accurately mimic the behavior of a continuous system, and that the addition of noise to a system is unlikely to decrease any "undecidability" it might have. The discrepancy between the predictions of a model and observations of a physical system is, to the observer, a random element. If the model embodies the observer's complete knowledge of a system (including perhaps the addition of a noise element), he will be unable to distinguish the two resulting number streams, should he get them mixed up. A "proof" that the model was the best possible would involve a proof that the random element, or discrepancy, was intrinsic to the system, or "truly random", not susceptible to any further deterministic description. However, as Chaitin and others argue in the context of the theory of algorithms, one cannot "prove" a number to be random; to possess such a proof would imply a logical contradiction.
Thus, given a set of seemingly random numbers, there is no way in general to show that there does not exist an algorithm shorter than the list itself to generate the numbers, and thus reduce the randomness. This result makes a certain amount of intuitive sense. A "proof" is a manipulation of the fixed axioms of a closed logical system, whereas "random numbers" are exactly those entities which come from outside the logical system, that is, cannot be generated within the system. To "prove" the number random, the system would have to recognize the number as coming from outside. As there is no representation of the number inside the system, the comparison required by a recognition process cannot occur. The process of divining a deterministic aspect to a string of data, thus reducing its randomness, might be likened to "breaking a code". Man-made codes, where the apparent randomness is deliberately maximized, illustrate difficulties which in principle can arise in attempts to model natural phenomena. For example, a type of cipher called an "open key" code has been studied recently[43]. Here, even when the algorithm for generating encrypted text is known, the text cannot be decoded by a simple algorithm (it is believed). Only an exhaustive search of impractical length can find the determinism and reduce the randomness of the text string. So we can at least imagine a situation where even if we had guessed a correct model for a physical process, we could not demonstrate that the observed data came from that process. It is on this basis that one might argue that a description of theoretical physics in terms of particles and interactions between them can only be partial. Even if one finds the elemental rules, determining their qualitative consequences on a macroscopic scale might require a computation of indefinite length, with a non-unique outcome. Oh, well. Maybe Gödel's theorem has nothing to do with chaos, and there is only a desert where these two questions meet.
Then again, study here might address one of the larger issues: what is the limit to man's ability to predict and thus control his own evolution?

22. Conclusion, "ideas" as models

This paper has presented, by means of a physical example, preliminary work on the problem of predictability in the presence of noise. Computation of the stored information and rate of loss of stored information for data generated by the water drop system provided examples of the usefulness of these statistics to describe a degree of predictability, and to distinguish a difference between unpredictability arising from deterministic and stochastic sources. A more general discussion of the problem of modeling was attempted. The limitations of the procedures illustrated above point out the nearly total lack of understanding of the neural modeling system bequeathed to each of us as our evolutionary birthright. Even an ant possesses capabilities which we are unable to incorporate into our machines. Any general theory of modeling must explain our ability to move easily through a high dimensional world, and, above all, to generate new ideas. Incremental learning, which occurs for example in acquiring a skill, might be described by some sort of conditional distribution which is continually updated in a Bayesian fashion, but an idea, as Drake notes[44], seems a more abrupt occasion. An "idea" can be described as the sudden construction of a model for sense data. Some kind of connection is realized, often increasing the predictability of the perceived world. The occasion of "having an idea" seems a curious combination of the active and the passive. One can actively accumulate knowledge, and actively exclude from the local environment influences which deaden creative thought, but one cannot actively make an idea. Instead, one has an idea, a passive process which often involves some amount of waiting around.
The delay is probably due to a process more akin to a random search than a computation, as the brain's ordinary recall is usually rapid, taking only a few seconds. This suggests that there are parts of the brain optimized for making rapid searches through a very large space of possibilities, together with a signal to the "operator" when a connection has been made. This signal takes the form of a pleasant subjective sensation, often described in the psychological literature as the "aha" experience. It can even happen that one experiences the sensation of an idea about to occur, before the logical content of the idea becomes explicit. To the extent that we are materialists, we must believe that underlying such events is a physical process. How is it that we can accumulate, and even, to use Jim Crutchfield's phrase, anticipate knowledge? What are the operating principles of the dizzying jewel from which the natural world turns back to gaze upon itself? Perhaps more can be learned from a study of this mechanism from an interior, as well as an exterior perspective. Perhaps there is experimental physics still to be done without even getting out of bed. Ludwig Boltzmann wrote the entry entitled "model" for the 1902 edition of the Encyclopaedia Britannica. Here he states: "...Long ago philosophy perceived the essence of our process of thought to lie in the fact that we attach to the various real objects around us particular physical attributes - our concepts - and by means of these try to represent the objects to our minds... On this view our thoughts stand to things in the same relation as models to the objects they represent. The essence of the process is the attachment of one concept having a definite content to each thing, but without implying complete similarity between thing and thought; for naturally we can know but little of the resemblance of our thoughts to the things to which we attach them.
What resemblance there is lies principally in the nature of the connexion."

Boltzmann lived too early to learn the physical and topological "nature of the connexion". We may be more lucky. I hope the reader can forgive the loose tone of some of this discussion, and that if it is not science, it is at least entertainment. Perhaps, though, contemplation of a dripping faucet can supply a few rungs in the ladder of metaphor which must be built to understand modeling in a larger sense. When the falling drop is perceived, it falls in the brain as well, outlined in sparkling surreal intensity. The world swarms with life forms, many bearing an individual neural net. Each thin web is capable of directing a complex entity, and in each is mirrored the passion of the world. Yet this instrument sprang from the slime! Finally, I hope this work suggests that one does not need a particle accelerator to step to the frontiers of physics. The tapestry of reality flutters with moving causal structures, both preserving and generating information at all length scales. This brew of freedom and constraint generates a complexity that renders a simple faucet an enigma, let alone consciousness and evolution. Thus even if the particle physicists were to succeed, and all the laws governing the fundamental building blocks of matter were known, we would still walk down the street without knowing how or why. The mystery remains, as always, as close as our forehead.

Acknowledgements

This work was made possible by the support and understanding of the Research Committee and others at the University of California, Santa Cruz. Special thanks are due to Jim Warner for hardware and software advice.

Appendix 1 - Continuous time calculation

Section 12 describes the computation of the predictability of a time interval T_{n+1} obtained from a measured time interval T_n, given the conditional distribution P(T_{n+1}|T_n) and the equilibrium distribution of intervals P(T).
In fact, although the water drop system produces data in the form of discrete time intervals, these intervals are imbedded in continuous time. A complete description of the measurable state of the system would include the time elapsed since the preceding drop. The additional phase degree of freedom increases the storage capacity of the system. This larger amount of stored information can be computed from the same data set used to compute the interval-to-interval stored information.

[Figure: a sequence of drop intervals in continuous time; a dotted line marks the instant of observation, t_1 the time since the preceding drop, t_2 the time until the next.]

The figure above schematically represents the situation at a particular instant of time, indicated by the dotted line, during a sequence of drop intervals. The time elapsed since the preceding drop is t_1, the time until the next is t_2. Knowledge of the state of the system requires knowledge of the length of some drop interval, and its phase. For an observer at the dotted line to be informed of the system state, he must be told the numbers T_1 and t_1. In the absence of prior observation, he will have to wait until the end of the interval T_2 before a system state is determined. His a priori expectation of observing a pair of numbers t_2 and T_3 is described by a distribution P(t_2,T_3) which, by time symmetry, is the same as the distribution of prior values P(t_1,T_1). The stored information is defined in this paper to be the value of the knowledge of the immediate past state in determining the immediate future state.
Thus, we must compute the value of the knowledge of the numbers T_1 and t_1 in predicting t_2 and T_3. These four numbers are connected by a conditional distribution P(t_2,T_3|t_1,T_1), and the full stored information is given by the expression

I(t_2,T_3|t_1,T_1) = Integral P(t_1,T_1) P(t_2,T_3|t_1,T_1) log[ P(t_2,T_3|t_1,T_1) / P(t_2,T_3) ] dt_1 dT_1 dt_2 dT_3

Now in this case, we do not need to compute the full four-dimensional distribution, because

I(t_2,T_3|t_1,T_1) = I(t_2|t_1,T_1)

This can be shown algebraically, using P(T_3|t_2,t_1,T_1) = P(T_3|t_2,t_1), or appreciated by noting that the predictability of T_3 is completely determined by T_2 = t_1 + t_2. Thus there is no additional stored information associated with higher order distributions involving T_3. So the full stored information simplifies to

I(t_2|t_1,T_1) = Integral P(t_1,T_1) P(t_2|t_1,T_1) log[ P(t_2|t_1,T_1) / P(t_2) ] dt_2 dt_1 dT_1

The problem then is to express P(t_1,T_1), P(t_2), and P(t_2|t_1,T_1) in terms of the measured distributions P(T) and P_o(T_{n+1}|T_n). The subscript has been added to P(T_{n+1}|T_n) to distinguish this particular measured distribution in what follows. First, we will determine the equilibrium distribution P(t_1,T_1). Now

P(t_1,T_1) = P(t_1|T_1) P(T_1)

with

P(t_1|T_1) = [ Integral_{t_1}^infinity P_o(T|T_1) dT ] / [ Integral T P_o(T|T_1) dT ]

This last step is made clear by examining the figure below.

[Figure: a subinterval t_1 of a full interval T.]

If we divide T at random, all subintervals from zero up to the full length of T are equiprobable. By a similar argument,

P(t_2) = [ Integral_{t_2}^infinity P(T) dT ] / [ Integral T P(T) dT ]

The remaining step is to observe that

P(t_2|t_1,T_1) = P_o(t_1+t_2|T_1)

Thus all parts of the full continuous information integral can be reduced to sums over the measured distributions, as claimed. This integral is approximated by a discrete sum via a binning procedure as before. In the case where the variation in drop intervals is only a small fraction of the average drop interval, as is the case in the "fuzzy hump" data set, the computation of the stored information simplifies further.

[Figure: P(T) is narrowly peaked about the mean interval, so the phase distribution P(t) is nearly flat.]

As the above figures suggest, P(t) is well approximated by a constant. Also,

P(t,T) ~ P(T)

again because of the small relative variation in T.
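The waiting-time relation used above can be checked numerically. A sketch with an arbitrary two-valued interval distribution (half 1s, half 3s, so the mean interval is 2) standing in for the measured P(T): an observer placed at a random instant of continuous time sees residual times t_2 distributed as the integral of P(T) over T > t_2, divided by the mean interval.

```python
import random

# Check of P(t_2) = (1/<T>) * Integral_{t_2}^infinity P(T) dT for the
# residual time t_2 until the next drop, seen at a random instant.
random.seed(5)

def sample_t2():
    # a uniformly random instant lands in an interval with probability
    # proportional to its length, and falls uniformly within it
    T = random.choices((1.0, 3.0), weights=(1.0, 3.0))[0]
    return T * random.random()   # time remaining until the interval ends

samples = [sample_t2() for _ in range(100000)]
empirical = sum(1 for t in samples if t < 1.0) / len(samples)
# formula: (1/2) * Integral_0^1 P(T > t) dt = 1/2, since P(T > t) = 1 on (0, 1)
print(f"empirical P(t_2 < 1) = {empirical:.3f}  (formula predicts 0.500)")
```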
The two approximations were used in the full stored information calculation illustrated in Fig. 32.

Appendix 2 - The dimension question

The central idea behind the Lorenz, Ruelle-Takens picture of turbulence is that the random element of a turbulent flow can be described, at least in some limited situations, by a model with only a few "dimensions", or "degrees of freedom". It is clear that a large amount of information is required to describe the state of a fluid system in "fully developed" turbulence, but the experimental evidence indicates that near the transition to aperiodic behavior, a "low-dimensional" model often adequately describes the behavior. But how does one define the "dimensionality", "number of modes", or "degrees of freedom" of a physical system? Although the concept of "dimension" is intuitive, operational definitions are less clear. The dimension of a mathematical model is defined in its construction, but in a physical system, the "dimension" becomes a quantity to be determined by observation.

One approach, employed by a number of workers, is to examine the dimensionality of an attractor in a state space reconstructed from the time series data of a physical experiment. For example, if some system is undergoing a stable oscillation, its motion can be represented by a limit cycle in an appropriate state space, a one-dimensional closed curve. There has been considerable recent work in an effort to extract statistical measures from experimental data, and relate them to formally defined measures of dimension. A review cannot be attempted here; suggested references are [45], especially the article of Farmer, Ott, and Yorke. Here will be given only a beginner's dimension calculation for some water drop data, a short general discussion, and a few criticisms and suggestions for future experiments. The writer has benefited from discussions with some of the principals, including Doyne Farmer, John Guckenheimer, and Jim Yorke.
The dimension of an object should describe how its volume scales with increasing linear size. Proceeding informally, let's think of an attractor as a compact object with a "volume" of N equiprobable distinguishable states, localized in state space within a region of radius R. If the "radius" of a single distinguishable state is r, then the dimension of the region containing N states is given by:

$$N = (R/r)^d .$$

Taking the log of this relation, we have:

$$\log N = d \log(R/r).$$

The left-hand side is simply the stored information, or number of bits resolvable on the attractor, while log(R/r) is the resolution obtainable from a measurement along a single linear attractor coordinate. Thus the dimension of the attractor, roughly speaking, is the minimum number of independent measurements required to localize a single system state. If our view of the attractor is limited by the measurement, that is, the stored information intrinsic to the attractor is much larger than the measurement channel rate, or in the notation of this paper:

$$I(x'|x) \gg I(x|y)$$

then we can expect an improved total log N as measurements are improved. The dimension d could then be independent of resolution over some range, and its value would indicate the ratio of increased predictability to increased resolution (as pointed out by Farmer et al).

The number of states "nearby" a particular state should be a function of the dimension of an attractor, and this provides the basis for most of the dimension calculations. Typically, the number of states within some length L is determined as a function of L,

$$N(L) \sim L^d$$

where d is the dimension. This approach will now be used to attempt a "dimension" measurement for three sets of water drop data. The algorithm below was chosen on the basis of simplicity; it is, I believe, closest to the "correlation exponent" of Grassberger and Procaccia, and the "pointwise dimension" of Farmer et al, and Guckenheimer [46].
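The scaling relation N(L) ~ L^d can first be checked on point sets of known dimension. A minimal sketch (Python with numpy; names and parameter values are illustrative): counting the mean number of neighbors within a distance L of randomly chosen reference points, and fitting the slope of log N against log L, should recover d near 1 for points on a line and near 2 for points filling a square.

```python
import numpy as np

def scaling_exponent(points, n_ref=100, seed=1):
    """Fit d in N(L) ~ L^d, where N(L) is the mean number of neighbors
    within distance L of randomly chosen reference points."""
    rng = np.random.default_rng(seed)
    refs = points[rng.choice(len(points), n_ref, replace=False)]
    dist = np.linalg.norm(points[None, :, :] - refs[:, None, :], axis=-1)
    L = np.logspace(-1.7, -0.7, 10)                       # length scales 0.02 .. 0.2
    N = [((dist < l).sum(axis=1) - 1).mean() for l in L]  # exclude the point itself
    return np.polyfit(np.log(L), np.log(N), 1)[0]

rng = np.random.default_rng(1)
line = np.column_stack((rng.random(4000), np.zeros(4000)))  # a 1-d set in the plane
square = rng.random((4000, 2))                              # a 2-d set
# slopes come out near 1 and 2 respectively, pulled slightly low by edge effects
```

Even in this clean setting the fitted slope falls somewhat below the true dimension at the larger length scales, a foretaste of the interpretation problems discussed below.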
First, a point is picked at random out of the data and its "distance" from all other data points along the one-dimensional drop interval coordinate determined. The resulting list of distances is then sorted in order of distance to nearest neighbor, distance to next-nearest neighbor, etc. This process is repeated for a number of other randomly selected data points, then a mean distance to nearest neighbor, next-nearest neighbor, etc. is computed. This distribution of distances is inverted, and plotted as log(N) vs. log(L), where N is the number of data points within a distance L [47].

Next the data is taken in pairs of successive points, and distances to all successive pairs in the resulting two-dimensional state space computed using, for example, a standard Euclidean metric. Mean distances to nearest neighbor, next-nearest neighbor and so forth are computed as before, and log(N) vs. log(L) plotted. Successive curves are generated by taking the data in successive triplets, 4's, 5's, and 6's. Each step corresponds to increasing the dimension of the state space by one. In principle the observed dimensionality of the attractor generated by the data should also increase, until the dimensionality of the imbedding space is larger than the actual dimensionality of the attractor. Once there is room enough in the imbedding space for the full attractor, the observed attractor dimension should arrive at its "true" value. Thus, in theory, all one has to do is measure the slope of successive curves; when it stops changing as a function of imbedding dimension, one has determined the dimension of the physical system.

In practice, things are not so simple. Figures A1 - A3 below display the results of applying the above program to three data sets. For the first data set, the dimension calculation seems rather successful. The attractor of this water drop regime seems to be a gauzy assemblage of two-dimensional ribbons and sheets, if the data is viewed as a stereo plot.
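The sorted-distance procedure just described reduces to a short program. The sketch below (Python with numpy; the parameter choices and names are illustrative, not those used to produce the figures) returns one log N vs. log L curve per imbedding dimension m:

```python
import numpy as np

def dimension_curve(x, m, n_ref=200, seed=0):
    """log N vs log L for imbedding dimension m: imbed the series x in
    overlapping m-tuples, compute the mean distance to the k-th nearest
    neighbor over randomly chosen reference points, then invert so that
    N is the number of points within the (mean) distance L."""
    pts = np.lib.stride_tricks.sliding_window_view(np.asarray(x), m)
    rng = np.random.default_rng(seed)
    refs = pts[rng.choice(len(pts), n_ref, replace=False)]
    dist = np.linalg.norm(pts[None] - refs[:, None], axis=-1)  # Euclidean metric
    dist = np.sort(dist, axis=1)[:, 1:]   # drop each reference point's zero self-distance
    L = dist.mean(axis=0)                 # mean distance to k-th nearest neighbor
    N = np.arange(1, dist.shape[1] + 1)   # invert: N points within distance L
    return np.log(L), np.log(N)

# one curve per imbedding dimension, as in Figs. A1 - A3:
# for m in range(1, 7): plot(*dimension_curve(data, m))
```

On featureless data the slope of such a curve tracks the imbedding dimension itself; on data from a low-dimensional attractor the slopes should saturate, as described above.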
The calculated curves become more or less parallel after an imbedding dimension of two is reached, and a fit to the slope of the later curves near their center yields a value surprisingly (to the writer) close to two. However, the second data set demonstrates the problems which quickly arise if the probability distribution on the attractor is not particularly uniform. This set is generated by a somewhat noisy period-two drop regime. The purely stochastic part of the data generates dimension curves whose slopes continuously increase with imbedding dimension, as expected. But there is a radical break in the curves at length scales corresponding to the distance between the two periodic points. One might learn to usefully interpret a set of curves such as these, but the more usual situation seems to be exemplified by the third data set. To the eye, the attractor seems simple enough; it's a bent one-dimensional string with a little noise on it. The calculated dimension curves, however, seem rather uninformative; there is no clear place to fit a slope, and no apparent way to see that these curves were generated by a quasi one-dimensional object.

The good news is that sets of dimension curves, even if not particularly useful, are apparently coordinate-independent. The writer tried three different metrics in computing "distances" in the imbedding spaces: Euclidean, sup norm, and the "Manhattan" metric (sum of absolute values of coordinate differences). The dimension curves were translated horizontally in the figures by varying amounts depending on metric, but their general form, even in detail, was well-preserved.

Armed with this limited experience, we will now present a few criticisms of the dimension industry, as it currently operates. Implicit in most of the recent work on dimension is the assumption that there in fact exists a single number which characterizes the dimension of a chaotic attractor.
Discussions often begin by considering Cantor sets, and other self-similar "fractal" objects, where the "dimension" describes scaling relationships as finer and finer length scales are considered. In a purely deterministic treatment, chaotic attractors possess such a Cantor set structure; as the information storage capacity is infinite, the complete past and future history of the dynamics is determined by the infinitesimal structure to be found at any part of the attractor. Thus a dimension calculation using any part of the attractor would yield the same number. But the base length scale which all physical systems, including digital computers, possess limits the degree of structure the underlying attractor can have, and likewise limits the history it can embody to a finite and computable length of time. So it is to be expected that the "dimension" will usually vary across the attractor, if by "dimension" we mean the number of coordinates required to maximally localize a state. As a simple example, consider a system which in its Poincaré section periodically changes from a fixed point to a limit cycle and back again:

[Figure: a Poincaré section alternating in time between a fixed point (upper part) and a limit cycle (lower part).]

Each time the limit cycle shrinks to a "point" beneath the noise length scale, its phase is lost, and each time it reemerges, its phase is a random variable, determined by the noise. Thus two coordinates are required to achieve maximum predictability in the lower part of the figure, and only one in the upper. Objects such as this, whose actual structure is supported by noise, form an interesting new class; see recent work by Farmer [35] and Deissler [48].

The usefulness of a "dimension" number as a scaling exponent depends on the number of orders of magnitude between the smallest and largest length scales. This range is in fact severely limited for most physical data.
The scaling structures which have received so much recent theoretical attention are notably absent in most of the fluid systems so far examined, due to the limited dynamical range, and high dissipation. The experience in this laboratory is that the more low-dimensional structure there is in the reconstructed attractor, the less well-defined is a single "dimension" number. This is not to say that the set of dimension curves may not have a use as indicating an overall character of a data set, much as a power spectrum curve does. But there seems little reason at this point to prefer one dimension algorithm over another as "more fundamental".

The most serious limitation of existing "dimension" measurements of physical systems is their total lack of any consideration of the effect of spatial variation in the system. In the work described here, and in all other work of which the writer is aware, data is supplied by only a single probe of the fluid, and an attractor supposedly characterizing the entire system is obtained by the time delay reconstruction method. Even in the best of circumstances, such a reconstruction necessarily takes place over an interval of time, and for a system with a positive entropy, the causal connection between successive measurements is reduced. In this case, two measurements at different points in the fluid will always yield greater predictability than one. This is intuitively obvious, and corresponds to an increase in the "rate of measurement".

Any discussion of the "dimension" or "degrees of freedom" of a fluid flow must address the flow of causality in a spatial sense. Fluid motion is constrained on the smallest scales by viscosity, and at the largest, by boundary conditions. A fluid moving nonperiodically has managed to gain a degree of independence from these constraints. Experiments have indicated that this can happen in several different ways.
If boundaries are close, compared to the typical roll or wavelength scale, the fluid motion tends to be collective. The same attractor is obtained by reconstruction of measurements made at any point in the fluid, subject to the resolution limitations discussed above. If the boundaries are far apart, dislocations or defects in regular flow patterns tend to occur (Donnelly's "turbators"), and the transition to turbulence is more akin to the melting of a crystal. At the onset of nonperiodicity in the latter case, the space of possible fluid flow patterns is apparently impossibly large, with a complicated connectivity, and the path of the fluid system through this state space tortuous, even if the entropy, or rate of divergence from any one path, is small [49].

It should be possible to distinguish collective coherent nonperiodic fluid motion from non-coherent motion using the same stored information statistic discussed earlier in this paper. Two measurements taken simultaneously at points as close as possible in space will give the same stored information (the same degree of causal connection) as two measurements close in time taken from a single probe.
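The statistic in question is just the mutual information of the two probe signals, and it can be estimated with the same binning procedure used earlier for the stored information. A minimal sketch (Python with numpy; names and the bin count are illustrative):

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """I(x;y) in bits, from a joint histogram of two probe signals."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of the first probe
    py = pxy.sum(axis=0, keepdims=True)   # marginal of the second probe
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px * py)[nz])).sum())

# mapped as a function of probe separation, I(x;y) would measure how the
# causal connection between two points in the fluid falls off with distance
```

Identical signals give the full single-probe information; independent signals give (up to finite-sample bias) zero.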
Thus the loss of information as it flows through the fluid can be measured, at least in an average sense, by mapping the information common to both probes as a function of distance. For a collective fluid motion, we would expect a rather slow drop-off of the mutual information as two probes are separated, with significant correlations remaining throughout the fluid. A quicker drop-off might well be associated with certain fluid flow features passing between the two probes, and might indicate that information transmission through such features is reduced. It may also be possible to quantitatively characterize more fully developed turbulence, measuring the size of typical coherent volumes, as well as the "entropy per unit volume", or "degrees of freedom per unit volume", or "number of positive characteristic exponents per unit volume" [50]. Lack of funding has precluded the early experimental test of these ideas.

References

1) S. Hawking, "Is the End in Sight for Theoretical Physics?", Cambridge University Press, 1980.

2) R. Abraham and C.D. Shaw, "Dynamics: The Geometry of Behavior", Vol. 1: Periodic Behavior, Vol. 2: Chaotic Behavior, Aerial Press, Box 1360, Santa Cruz, Cal. 95061.
Annals of NY Academy of Sciences, Vol. 316 (1978) and Vol. 357 (1980).
Physica D, Vol. 7 (1983).
A.J. Lichtenberg and M.A. Lieberman, "Regular and Stochastic Motion", Springer-Verlag, 1983.
J. Guckenheimer and P. Holmes, "Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields", Springer-Verlag, 1983.
"Synergetics, A Workshop", ed. H. Haken, Springer-Verlag, 1977.
See also Refs. 23 and 30.

3) O. Rössler, in "Synergetics: A Workshop", H. Haken, Ed., Springer-Verlag, 1977.

4) (gallons per fortnight)

5) S. Hartland and R.W. Hartley, "Axisymmetric Fluid-Liquid Interfaces", Elsevier, 1976.
M. Hozawa et al, J. Chem. Eng. Japan 14, 358 (1981).

6) P. Martien, S. Pope, P. Scott, and R. Shaw, in preparation, to appear, and to be published, 1984.
P.
Martien, Senior Thesis, Santa Cruz, 1982.
S. Pope, Senior Thesis, Santa Cruz, 1984.

7) J.P. Crutchfield and B.A. Huberman, Phys. Lett. 77A, 407 (1980).
G. Mayer-Kress and H. Haken, J. Stat. Phys. 26, 149 (1981).

8) S. Grossman and S. Thomae, Z. Naturforsch. 32a, 1353 (1977).

9) I. Shimada and T. Nagashima, Prog. Theo. Phys. 61, 1605 (1979).

10) E.N. Lorenz, Ann. N.Y. Acad. Sci. 357, 282 (1980).

11) J.P. Crutchfield et al, Phys. Lett. 76A, 1 (1980).

12) R.M. May, Nature 261, 459 (1976).

13) M. Hénon, Comm. Math. Phys. 50, 69 (1976).

14) "Strange Attractors and their Physical Significance", J.P. Crutchfield and R. Shaw, videotape, 1978.

15) O. Rössler, Ann. N.Y. Acad. Sci. 316, 376 (1978).

16) E.N. Lorenz, J. Atmos. Sci. 20, 130 (1963).

17) D. Ruelle and F. Takens, Comm. Math. Phys. 20, 167 (1971).
D. Ruelle, Ann. N.Y. Acad. Sci. 316, 408 (1978).

18) J.P. Gollub and H.L. Swinney, Phys. Rev. Lett. 35, 927 (1975).

19) "Strange Attractors and their Possible Relation to Turbulence", NSF Grant Proposal (Material Science Division), 1980, 1981, 1982, 1983.
"Machine for Simulation of 2-d Partial Differential Equations", NSF Grant Proposal (Computer Science Division), 1981.
(copies available on request)

20) L. Boltzmann, personal communication.

21) C.E. Shannon and W. Weaver, "The Mathematical Theory of Communication", University of Illinois Press, 1962.

22) R.S. Shaw, "Strange Attractors, Chaotic Behavior, and Information Flow", Santa Cruz, 1978; published in Z. Naturforsch. 36a, 80 (1981).

23) E.T. Jaynes, in "Delaware Seminar in the Foundations of Physics", Vol. 1, Springer-Verlag, 1967.

24) L. Boltzmann, 1872. See also Shannon, Ref. 21.

25) A.N. Kolmogorov, Dokl. Akad. Nauk. SSSR 124, 754 (1959).
Ya. Sinai, Dokl. Akad. Nauk. SSSR 124, 768 (1959).
A.N. Kolmogorov, Problems of Information Transmission 1, 3 (1965) and 5, 3 (1969).

26) Y. Oono and M. Osikawa, Prog. Theo. Phys. 64, 54 (1980).

27) J.P. Crutchfield and N.H. Packard, Int. J. of Theo. Phys. 21, 433 (1982).

28) J.D.
Farmer, PhD thesis, Santa Cruz (1981).

29) J.D. Farmer, Ann. N.Y. Acad. Sci. 357, 453 (1980).

30) R. Shaw, "Modeling Chaotic Systems", in "Chaos and Order in Nature", ed. H. Haken, Springer-Verlag, 1981.

31) N.H. Packard et al, Phys. Rev. Lett. 45, 712 (1980).

32) R. Bowen, "Equilibrium States and the Ergodic Theory of Anosov Diffeomorphisms", Lect. Notes in Math. 470, Springer-Verlag, 1975.
J. Guckenheimer, Inv. Math. 39, 165 (1977).
N.H. Packard, PhD Thesis, Santa Cruz (1982).

33) J.P. Crutchfield and N.H. Packard, "Symbolic Dynamics of Noisy Chaos", in "Order and Chaos" proceedings, North-Holland, Amsterdam, 1983.

34) J. Curry, J. Stat. Phys. 26, 683 (1981).

35) J.D. Farmer, "Sensitive Dependence to Noise Without Sensitive Dependence to Initial Conditions", J. Unpub. Results 1, 1 (1983).

36) T. Erber et al, J. Comp. Phys. 49, 394 (1983).

37) S. Ma, J. Stat. Phys. 26, 221 (1981).

38) E. Jen, J.D. Farmer, unpublished.

39) L. Wittgenstein, "On Certainty", Harper & Row, 1972.

40) C. Leith, "Order in Chaos" Conference, Los Alamos, 1982.

41) G.J. Chaitin, Scientific American, May 1975.

42) J. Ford, Physics Today, April 1983.

43) J. Smith, Byte Magazine, January 1983.

44) A.W. Drake, "Fundamentals of Applied Probability Theory", McGraw-Hill, 1967.

45) J.D. Farmer, E. Ott, and J.A. Yorke, Physica 7D, 153 (1983).
J.D. Farmer, Z. Naturforsch. 37a, 1304 (1982).
P. Grassberger, Phys. Lett. 97A, 227 (1983).

46) P. Grassberger and I. Procaccia, Phys. Rev. Lett. 50, 346 (1983).
J. Guckenheimer and G. Buzyna, Phys. Rev. Lett. 51, 1438 (1983).
A. Brandstäter et al, Phys. Rev. Lett. 51, 1442 (1983).

47) If the logarithms are taken before the mean, the measure is closest to the "pointwise dimension"; otherwise, the measure is the "correlation exponent" of Grassberger and Procaccia. This change made no difference in the qualitative features described here, although numerical values were changed by amounts on the order of ten or twenty percent.

48) R.J. Deissler, Phys. Lett.
100A, 451 (1984).

49) R.J. Donnelly et al, Phys. Rev. Lett. 44, 987 (1980).
G. Ahlers and R.W. Walden, Phys. Rev. Lett. 44, 445 (1980).
P. Bergé, in "Chaos and Order in Nature", ed. H. Haken, Springer-Verlag, 1981.

50) D. Ruelle, Phys. Lett. 72A, 81 (1979).