This appendix gives brief descriptions of the algorithms for script and m-script operations which are the basis of language processing and learning in the theory. You may wish to run through these algorithms `by hand' with a pencil and paper for some simple examples, to get a feel for what the key operations do - particularly for m-intersection, which is the core operation of the learning theory.
All these algorithms have been implemented in a Prolog program which does language processing and learning for a small subset of English, and which can check the implementation of the operations by checking script algebra relations. The program is described in Appendix D.
There are three key operations: script inclusion (the inverse of subsumption), unification and intersection (generalisation). They are closely equivalent to the corresponding operations for feature structures, as studied in computational linguistics (Scheiber 1986). They are all used as building blocks for the m-script operations described in section A.2. There is a fourth operation, which I call script subtraction, which is required as a sub-operation for m-unification.
For each of the three main operations, we need to match two tree-like script structures together node by node. This is done from the root downwards (as scripts are drawn with the root node at the top). Since there are choices as to how the two scripts are matched up, all three operations involve an element of search (over possible matchings) to find a satisfactory matching (for inclusion) or the best matching (for unification and intersection). In the Prolog implementation, this search is done using the Prolog failure-driven backtracking mechanism.
For the descriptions which follow, label the different nodes of a script by indices i,j etc.; then node i of script A is denoted by A(i), and the subtree of script A which is rooted at node i is also a script, which will be denoted by A[i].
Variable slot values are called `aces' and are denoted by capital letters preceded by a question mark, as in `?A', `?B', etc.
Script Inclusion: To check that script A includes script B (i.e. to check that A >s B) do the following steps:
In practice, the testing of time-order arrows is made easier if, before comparing A and B, you take the transitive closure of all time-order arrows in A, and separately for B. Then in a pairing set {(j,y), (k,z),...}, if there is a time-order arrow from B(y) to B(z), there must be one directly from A(j) to A(k). For the other script operations of unification and intersection, assume the transitive closure of time-order arrows is also done in advance.
Script unification: To form the unification C = A Us B of two scripts A and B, do the following steps:
Therefore script unification may fail. If it does, this means that the scope sets s(A) and s(B) have no overlap.
Script unification can always be checked from its definition in terms of script inclusion. The result C must obey C >s A and C >s B; and it should not be possible to take any information away from C (to remove any nodes or slots), keeping both these relations true.
Script intersection: To form the intersection D = A ^s B of two scripts A and B, do the following steps:
Script intersection always gives a result.
The result of a script intersection can be checked from its definition in terms of script inclusion. The result D must obey D <s A and D <s B; and it should not be possible to add any information to D (to add any nodes, slots or time order arrows), keeping both these relations true.
The script subtraction E = B- A is defined as the script E with smallest information content which obeys E Us A = B. Thus E only exists if B >s A. An algorithm to find E is:
The result of a script subtraction can be checked from its definition E Us A = B.
An m-script M represents a set of scripts, which is called its scope, sigma(M). M-scripts look like scripts, in which some nodes are trump nodes (denoted by labels !A, !B etc. on the node), and there may be a trump link between any pair of trump nodes.
In the descriptions which follow, the script which is derived from an m-script M by removing all its trump nodes and trump links will be denoted by Ms. This is always within the scope of M, Ms e sigma(M). (Here, the letter e denotes set membership; it is an attempt at epsilon)
We shall consider only m-scripts where each trump node has at most one trump link attached to it; so every trump node is either a lone trump node, or one of a pair attached by a trump link. These m-scripts are apparently sufficient for language. For m-scripts not obeying this constraint, the algorithms are more complex to state.
To define precisely the scope of any m-script, we need to define truncated subtrees of the m-script. For every node of a type which may be a trump node, there can be certain slots on nodes of that type. We divide these slots into `up' slots (which are considered to be above the trump node, and this unaffected by the presence of that trump node and its links) and `down' slots, which are below the trump node. Most slots are `down' slots, but it is useful for language processing to make one or two slots `up' slots.
Then for an m-script with N trump links, there are (N+1) truncated subtrees. The first of these, M[1] is rooted at the root node (conventionally node 1) of M, and is truncated at any trump nodes; it contains only the `up' slots on those nodes, and no child nodes below them. For every trump node i, there is a truncated subtree M[i] whose root is the trump node, and which is truncated at any trump node below i. It contains only the `down' slots of trump node i, and only the `up' slots of lower trump nodes where M[i] is truncated.
Thus the information in M is partitioned between its truncated subtrees M[i]; every slot from M appears on precisely one of the M[i]. Although any trump node appears on two of the M[i] (it is the root node of one of them, and a leaf of another) its slots are partitioned between the two.
An m-script M is well-formed if, wherever it has a trump link from node i to node j, M[j] >s M[i]. Assume the input m-scripts in all of the algorithms below have been checked to be well-formed.
The definition of the scope of an m-script is implicitly given in the algorithm to determine whether some script, S, is in the scope of M:
Scope test: To check that script S is in the scope of M, or that S e sigma(M), do the following steps:
N m-includes M, or N >m M, if any script S in the scope of N is also in the scope of M.
M-inclusion test: To check that N >m M, do the following steps:
If any of these tests fail, then N >m M is not true.
(In what follows, I am using the notation Ps for the script part of P - i.e P without any trump nodes or links)
M-unification: P = M Um N, is the P with largest possible scope obeying P >m M and P >m N; therefore sigma(P) < sigma(M) and sigma(P) < sigma(N). P may not exist, if sigma(M) and sigma(N) do not overlap. To find P, do the following steps:
The resulting m-script P can be checked from the definition of m-unification. P should obey both P >m M and P >m N; but it should not be possible to extend the scope of P (remove nodes or slots; add trump nodes; or remove trump links) without making this untrue.
M-unification is the key operation for language generation and understanding; for both of these, there is one m-unification per word m-script. The initial unification in step (1) tests the applicability of the word (syntactic and semantic constraints). In step (2), the m-unification requires one unification for each trump link in the word m-script:
M-intersection: Q = M ^m N, is the Q with smallest possible scope obeying M >m Q and N >m Q; it always exists. To find Q, do the following steps:
(step 3 must be replaced by a rather more complex procedure in some cases with complex trump configurations)
M-intersection is just script intersection, with extra steps (2) and (3) to discover trump nodes and links. Step (3) discovers a trump link wherever both the input m-scripts M and N obey the trump link relation.
M-intersection is the key operation used to discover word m-scripts in language learning. However, for this purpose there is a key difference from the definition given above, in the discovery of trump links.
When forming `learning example' scripts by partially understanding sentences and observing the meaning they refer to, the child may observe not just the true meaning script X, but also some extra information; she infers a script Y which includes X, Y >s X. Therefore the trump link relation M[j] = M[i] Us Q[j] becomes just M[j] >s M[i] because of extra information on the `downstream' end M[j] of the trump link. (As the `upstream' end M[i] comes from previously learnt words, it has no extra information.)
So if we did m-intersections of these example scripts as above, we would not discover the trump links. We need in stead to use `broad m-intersection', which differs from plain m-intersection, replacing (3) by (3') :
3' For any two trump nodes i and j on Q, add a trump link from i to j if both M[j] >s M[i] and N[j] >s N[i], and if neither M nor N has any trump nodes without the same trump link on node i or j.
In practice a word is learnt by m-intersection of several learning examples, which because of the commutative and associative property of m-intersection (an m-script algebra identity) can be done in any order.
This appendix contains a proof that, within a certain mathematical model of learning, a Bayesian learning procedure gives the best possible fitness; so that natural selection will tend (other things being equal) to produce animals with Bayesian learning in preference to any other.
The mathematical model of learning uses some abstractions and approximations (such as discrete time steps) but is nevertheless quite a good approximation to the problem of learning social causal regularities in a primate group. Primates are under intense selection pressure to compete socially, and need to learn these regularities as fast as possible. If Bayesian learning is optimal, this implies that primate selection by social competition leads to near-Bayesian learning of social regularities. In this theory, social learning is the forerunner of language learning, which will therefore inherit its Bayesian character.
We characterise animal learning as follows:
To make this abstract framework a little more concrete in the context of primate social learning: There may be three distinct causal regularities in some primate social group: R1 : `When Brutus bites Cassius, Cassius becomes frightened', R2 : `If you threaten Cassius when he is frightened, he will give you food, and R3 : `If you threaten Cassius when he is not frightened, he will bite you'. Suppose R2 and R3 are well known, and the task of learning is to discover, by observation, whether R1 holds.
The prior probability that R1 holds is quite small, p(R1) = 0.1; and even if it does hold, there is no a priori knowledge about its cause-effect probability s1(R1); so r(s1) = 0.1 in the range 0<s<1.
Suppose a monkey observes 49 events E(1) .. E(49); in 12 of these, Brutus bites Cassius, and on 10/12 occasions, Cassius is then frightened. Normally, Cassius is only frightened 20% of the time. Now, in time step 50, Brutus again bites Cassius. Should our monkey go and threaten Cassius ? If the rule does hold (and has high rule probability s) then Cassius will be frightened and our monkey will get some food (fitness increment +0.01 on some scale); but if the rule does not hold (those 10/12 occasions were just flukes; Cassius was frightened for some other reason) then our monkey will get a bite (fitness increment -0.1).
The proof that Bayesian learning gives the optimum decision rule is now quite simple:
The way an animal chooses an action Ak, given sense data Dm , is summarised in a decision function Fk(Dm ). Fk gives a value for each action index k and sense data Dm . The animal acts as if, when having sense data Dm , it calculates all the decision functions Fk for each possible k, and chooses the action Ak with the largest Fk .
Decision functions can be used to describe any relation between sense data and actions ; that is, to describe the input-output relation of any cognitive system. So they give an external, functional specification of any possible brain. Our problem is: what is the best possible set of decision functions Fk ?
Figure 2 shows the relations between these concepts.
Figure 2: Flow of causation for the mathematical model of animal cognition. Time ordering is from left to right, and arrows flow from causes to effects. The probabilities P of states of affairs S, of sense data D and of outcomes O are discussed in the text. The horizontal dashed line denotes the interface between the animal and its environment.
The requirement equation defines the best possible decision functions Fk(Dm) - those which give best average outcome V, and which therefore give greatest possible fitness.
The expected value from taking action Ak in the state of affairs Si is the average value of all possible outcomes, weighted by their probabilities :
Using the definition:
we can calculate the expected value from one state of affairs Si , which gives sense data Dm , if the animal uses decision functions Fk to choose an action:
The theta function picks out one term from the sum over actions Ak - the term which has the largest Fk. So it embodies the decision rule to pick out one action by calculating the Fk and picking the largest. U is the average value resulting from that choice.
The average value for a single encounter is got by summing over all states of affairs and sense data sets, weighted by their probabilities:
where :
Gk depends only on things outside the animal's control Ñ on probabilities of states, probabilities of sense data, probabilities of outcomes and values of outcomes. Gk(Dm) is effectively the average value of doing action Ak when having sense data Dm .
In the design of a cognitive system, Gk cannot be varied, but the decision functions Fk can.
(2.4) is a sum over all Dm, where (because of the q function) for each Dm only one Gk enters the sum. The maximum Vavg is got by setting
since this choice means that for every Dm , the q function picks out the largest of the Gk and adds it to the sum. Any other Fk gives a worse average outcome, by sometimes picking a smaller Gk .
Therefore the optimum decision rule is given by
Thus equation (2.7) is the requirement for animal cognition Ñ the optimum form of cognition in all circumstances. I shall call it the Requirement Equation.
We can multiply all the Fk by any positive common factor, and it will not alter the choice of action. This enables us to rewrite (2.7) in two parts :
(2.8) is recognisable as Bayes' theorem, applied to find the most likely state of affairs Si in the light of the sense data Dm; then (2.9) chooses the best possible action, in the light of the likely states Si. For many problems, finding the state by (2.8) is the hard part, and then choosing an action by (2.9) is comparatively easy.