Appendix A: Script and M-Script Algorithms

This appendix gives brief descriptions of the algorithms for script and m-script operations which are the basis of language processing and learning in the theory. You may wish to run through these algorithms `by hand' with a pencil and paper for some simple examples, to get a feel for what the key operations do - particularly for m-intersection, which is the core operation of the learning theory.

All these algorithms have been implemented in a Prolog program which does language processing and learning for a small subset of English, and which can check the implementation of the operations by checking script algebra relations. The program is described in Appendix D.

A.1 Script Operations

There are three key operations: script inclusion (the inverse of subsumption), unification and intersection (generalisation). They are closely equivalent to the corresponding operations for feature structures, as studied in computational linguistics (Scheiber 1986). They are all used as building blocks for the m-script operations described in section A.2. There is a fourth operation, which I call script subtraction, which is required as a sub-operation for m-unification.

For each of the three main operations, we need to match two tree-like script structures together node by node. This is done from the root downwards (as scripts are drawn with the root node at the top). Since there are choices as to how the two scripts are matched up, all three operations involve an element of search (over possible matchings) to find a satisfactory matching (for inclusion) or the best matching (for unification and intersection). In the Prolog implementation, this search is done using the Prolog failure-driven backtracking mechanism.

For the descriptions which follow, label the different nodes of a script by indices i,j etc.; then node i of script A is denoted by A(i), and the subtree of script A which is rooted at node i is also a script, which will be denoted by A[i].

Variable slot values are called `aces' and are denoted by capital letters preceded by a question mark, as in `?A', `?B', etc.

Script Inclusion: To check that script A includes script B (i.e. to check that A >s B) do the following steps:

  1. Match the root node of A with the root node of B.
  2. For the matched pair of nodes A(i) and B(x), if B(x) has any slot s with a defined value v, check that A(i) also has the same slot s with the value v or a more specific value w. If v is a plain constant value, w must be the same value; if v is a class-valued constant, w must be the same class or a subclass of it; if v is an ace value, w may be an ace value or a constant.
  3. If there is any slot on B(x) not matched by a slot in A(i) as above, fail and try another node matching, if there are any left.
  4. If two nodes A(i) and B(x) are successfully matched as in (2) above, try out all possible pairing sets of their child nodes. A pairing set is a set {(j,y), (k,z) ...}such that every node B(y) amongst the child nodes of B(x) is paired with some A(j) from the child nodes of A(i). If (j,y) are paired, then nodes j and y must be of the same node type (node types are holder, script, scene, entity and property, abbreviated as ho, sr, se, en, pr).
  5. Pairing sets must respect time-order constraints: if there is a time-order arrow from node B(y) to B(z), then for a pairing set {(j,y), (k,z),...} there must be a time-order arrow from A(j) to A(k), either directly of via some intermediate A(l), A(m),....
  6. If there is no such pairing set, fail and try a different matching at the parent level, if there are any left.
  7. For each pairing (j,y) of the pairing set, do step (2) to check that the slot values in A(j) include all slot values in B(k).
  8. Having traversed the tree structures, all nodes and slot values in B must be matched by some node and slot values in A. If some ace value ?V occurs on more than one slot in B, then it must everywhere be matched by the same value w (either the same constant, or the same ace value ?W) in A. Otherwise, the inclusion test fails.

In practice, the testing of time-order arrows is made easier if, before comparing A and B, you take the transitive closure of all time-order arrows in A, and separately for B. Then in a pairing set {(j,y), (k,z),...}, if there is a time-order arrow from B(y) to B(z), there must be one directly from A(j) to A(k). For the other script operations of unification and intersection, assume the transitive closure of time-order arrows is also done in advance.

Script unification: To form the unification C = A Us B of two scripts A and B, do the following steps:

  1. Match the root node of A with the root node of B.
  2. For a matched pair of nodes A(i) and B(j), form a node C(k) of the result. For each slot s: if neither node has a value, put nothing on C(k); if only one of the nodes A(i) and B(j) has a value v, then put the slot-value s(v) on node C(k); if both nodes have the same value v, put s(v) on C(k); if A(i) has a class value v and B(j) has a class value w, where w is a subclass of v (or vice versa) put the subclass value s(w) on C(k); if one node has a constant value, and the other has an ace value, put the constant value on C(k); if both nodes have an ace value, put an ace value on C(k) ; if the two nodes have inconsistent values v and w, fail and try another node pairing.
  3. If two nodes A(i) and B(x) are successfully matched as in (2) above, try out all possible pairing sets of their child nodes. A pairing set is a set {(j,y), (k,z) ...}such that at least one node B(y) amongst the child nodes of B(x) is paired with some A(j) from the child nodes of A(i). If (j,y) are paired, then nodes j and y must be of the same node type.
  4. For each node pair (j,y), do step (2) to check that the nodes A(j) and B(y) are consistent, and to form the result node C(l).
  5. If a node from A(i) or B(x) is not paired with some node from the other input script, then that node and all its descendants goes into the result C unchanged. Thus every node in A or B must appear in C. Attempt to maximise the number of paired nodes, to minimise the total information content of the result C.
  6. Pairing sets must respect time-order constraints: if there is a time-order arrow from node B(y) to B(z), then for a pairing set {(j,y), (k,z),...} there must not be a `reverse' time-order arrow from A(k) to A(j). (Time-order arrows on both A and B have been transitively closed). The time-order arrows on the result C include all time-order arrows from A and B.
  7. Variable slot values must be unified consistently. If a variable value ?V appears more than once in A, and is unified with constant values v, w, .. from B, then these constant values must be consistent (eg v = w, or v is a subclass of w). If ?A is anywhere unified with ?B, then they are then treated as equal; they must nowhere be unified with mutually inconsistent constants. If any inconsistency occurs, fail and try other node pairings, if there are any.

Therefore script unification may fail. If it does, this means that the scope sets s(A) and s(B) have no overlap.

Script unification can always be checked from its definition in terms of script inclusion. The result C must obey C >s A and C >s B; and it should not be possible to take any information away from C (to remove any nodes or slots), keeping both these relations true.

Script intersection: To form the intersection D = A ^s B of two scripts A and B, do the following steps:

  1. Match the root node of A with the root node of B.
  2. For a matched pair of nodes A(i) and B(j), form a node D(k) of the result. For each slot s: if neither node has a value, put nothing on D(k); if only one of the nodes A(i) and B(j) has a value, put nothing on node D(k); if both nodes have the same value v, put s(v) on D(k); if A(i) has a class value v and B(j) has a class value w, where w is a subclass of v (or vice versa) put the superclass value s(v) on D(k); if one node has a constant value, and the other has an ace value, put the ace value on D(k); if both nodes have an ace value, put an ace value on D(k) ; if the two nodes have inconsistent values v and w, put an ace value on D(k).
  3. If two nodes A(i) and B(x) are matched as in (2) above, try out all possible pairing sets of their child nodes. A pairing set is a set {(j,y), (k,z) ...}of nodes A(j) and B(x); it may be empty. If (j,y) are paired, then nodes j and y must be of the same node type.
  4. For each node pair (j,y), do step (2) to form the result node D(l).
  5. If a node from A(i) or B(x) is not paired with some node from the other input script, then that node and all its descendants do not appear in the result D. Attempt to maximise the number of paired nodes, to maximise the total information content of the result D.
  6. For a pairing set {(j,y), (k,z),...}, where A(j) and B(y) match to give node D(l), and A(k) and B(z) match to give node D(m), there is a time-order arrow from D(l) to D(m) only if there is one from A(j) to A(k), and one from B(y) to B(z).
  7. If a slot s(v) from A matches with a slot s(w) from B to give a slot s(?X) on D, then wherever else a slot s(v) from A matches with a slot s(w) from B (with the same values v and w, which may be either constants or aces), it gives a slot s(?X) with the same ace value ?X on D.
  8. Any slot s(?X) on D, whose ace value ?X is not repeated elsewhere on D, gives no information and can be removed.

Script intersection always gives a result.

The result of a script intersection can be checked from its definition in terms of script inclusion. The result D must obey D <s A and D <s B; and it should not be possible to add any information to D (to add any nodes, slots or time order arrows), keeping both these relations true.

The script subtraction E = B- A is defined as the script E with smallest information content which obeys E Us A = B. Thus E only exists if B >s A. An algorithm to find E is:

  1. Check that B >s A; if not, fail. Otherwise set E = B.
  2. Start at a leaf node A(i) of A, and find the node B(x) of B which matched it in the inclusion test. For any slot s(v) on A(i), if the same slot s(v) occurs on B(x), remove this slot and value from E(x).
  3. If all slots have been removed from a leaf node E(x), remove the node.
  4. When all children of a node A(j) have been treated in this manner, do (2) for the parent node A(j) and its matching B(y).
  5. If all slots have been removed from a non-leaf node E(y), and all its child nodes have been removed, remove the node E(y).

The result of a script subtraction can be checked from its definition E Us A = B.

A.2 M-Script Operations

An m-script M represents a set of scripts, which is called its scope, sigma(M). M-scripts look like scripts, in which some nodes are trump nodes (denoted by labels !A, !B etc. on the node), and there may be a trump link between any pair of trump nodes.

In the descriptions which follow, the script which is derived from an m-script M by removing all its trump nodes and trump links will be denoted by Ms. This is always within the scope of M, Ms e sigma(M). (Here, the letter e denotes set membership; it is an attempt at epsilon)

We shall consider only m-scripts where each trump node has at most one trump link attached to it; so every trump node is either a lone trump node, or one of a pair attached by a trump link. These m-scripts are apparently sufficient for language. For m-scripts not obeying this constraint, the algorithms are more complex to state.

To define precisely the scope of any m-script, we need to define truncated subtrees of the m-script. For every node of a type which may be a trump node, there can be certain slots on nodes of that type. We divide these slots into `up' slots (which are considered to be above the trump node, and this unaffected by the presence of that trump node and its links) and `down' slots, which are below the trump node. Most slots are `down' slots, but it is useful for language processing to make one or two slots `up' slots.

Then for an m-script with N trump links, there are (N+1) truncated subtrees. The first of these, M[1] is rooted at the root node (conventionally node 1) of M, and is truncated at any trump nodes; it contains only the `up' slots on those nodes, and no child nodes below them. For every trump node i, there is a truncated subtree M[i] whose root is the trump node, and which is truncated at any trump node below i. It contains only the `down' slots of trump node i, and only the `up' slots of lower trump nodes where M[i] is truncated.

Thus the information in M is partitioned between its truncated subtrees M[i]; every slot from M appears on precisely one of the M[i]. Although any trump node appears on two of the M[i] (it is the root node of one of them, and a leaf of another) its slots are partitioned between the two.

An m-script M is well-formed if, wherever it has a trump link from node i to node j, M[j] >s M[i]. Assume the input m-scripts in all of the algorithms below have been checked to be well-formed.

The definition of the scope of an m-script is implicitly given in the algorithm to determine whether some script, S, is in the scope of M:

Scope test: To check that script S is in the scope of M, or that S e sigma(M), do the following steps:

  1. Check that S >s M . If not, fail. If S >s M, label the nodes of S so that M[i] is matched by S[i] in the script inclusion test.
  2. Form the truncated subtrees M[i] of M, and the corresponding truncated subtrees S[i] of S.
  3. Above the top trump nodes of M, S and M must be equal; so check that S[1] = M[1]. If not, fail.
  4. If there is a trump link from node i to node j of M, the subtrees of S must obey S[j] = S[i] Us M[j].

N m-includes M, or N >m M, if any script S in the scope of N is also in the scope of M.

M-inclusion test: To check that N >m M, do the following steps:

  1. Check that N >s M; if not , fail. If so, label the nodes of N so that M[i] is matched by N[i] in the script inclusion test.
  2. Above the top trump nodes of M, N and M must be equal; so check that N[1] = M[1].
  3. To ensure that Ns e sigma(M), if M has a trump link from node i to node j, then the truncated subtrees of N must obey N[j] = N[i] Us M[j].
  4. To ensure that sigma(N) < sigma(M), wherever N has a trump node i, M must also have a trump node on node i, or its parent node, or its grandparent node, etc.
  5. To ensure that sigma(M) sigma(N), wherever M has a trump link from node i to node j, either N must have a trump link from i to j, or N must have no trump nodes on or above i or j.

If any of these tests fail, then N >m M is not true.

(In what follows, I am using the notation Ps for the script part of P - i.e P without any trump nodes or links)

M-unification: P = M Um N, is the P with largest possible scope obeying P >m M and P >m N; therefore sigma(P) < sigma(M) and sigma(P) < sigma(N). P may not exist, if sigma(M) and sigma(N) do not overlap. To find P, do the following steps:

  1. Calculate R = Ms Us Ns. If this unification fails, the m-unification fails. If P exists, then Ps >s R. Label the nodes of M, N, and R so that matching nodes have the same node labels.
  2. If P exists, then Ps e sigma(M) and Ps e sigma(M). So if M has a directed trump link from node i to node j, P must obey Ps[j] = Ps[i] Us M[j]. R might not obey this relation, but it may be possible to `round out' R by adding extra information to make it do so. This is done by two calculations: Ps[j] = R[i] Us M[j] (forward rounding), followed by Ps[i] = R[i] Us (Ps[j] - M[j]) (backward rounding) in succession.
  3. Do (2) for all trump links on N or M. If any of the unifications fail, then the m-unification fails. (When the trump links of M and N overlap in complex ways, this step may involve quite a complex iteration to satisfy all the trump link equations of M and N together; but this rarely happens). We now have a `minimal' Ps which is in the scope of both M and N, but it has no trumps; we need to expand its scope as far as possible, by adding trumps to it.
  4. To enlarge the scope sigma(P) as much as possible, if both N and M have a trump node i, put a trump on node i of P. If N has a trump on node i, and M has a trump on the parent or grandparent (etc.) of i, put a trump on node i of P.
  5. To ensure sigma(P) does not extend beyond either sigma(N) or sigma(M), if either N or M has a trump link for node i to j, put a trump link from node i to j of P. (again, there are subtleties here from overlapping trump links of M and N, but they are rarely important)

The resulting m-script P can be checked from the definition of m-unification. P should obey both P >m M and P >m N; but it should not be possible to extend the scope of P (remove nodes or slots; add trump nodes; or remove trump links) without making this untrue.

M-unification is the key operation for language generation and understanding; for both of these, there is one m-unification per word m-script. The initial unification in step (1) tests the applicability of the word (syntactic and semantic constraints). In step (2), the m-unification requires one unification for each trump link in the word m-script:

M-intersection: Q = M ^m N, is the Q with smallest possible scope obeying M >m Q and N >m Q; it always exists. To find Q, do the following steps:

  1. Calculate Qs = Ms ^s Ns. Label the nodes of M, N, and Q so that matching nodes have the same node labels.
  2. Put a trump on any node i of Q where (a) Q(i) differs from M(i) or N(i) in any slots or child nodes, or (b) either M or N has a trump on node i, or (c) where some ancestor node of Q(i) has a trump because of (a) or (b). The scope of this Q includes the scopes of both M and N, but it can possibly be narrowed by adding in trump links.
  3. For any two trump nodes i and j on Q, add a trump link from i to j if both M[j] = M[i] Us Q[j] and N[j] = N[i] Us Q[j], and if neither M nor N has any trump nodes without the same trump link on node i or j.

(step 3 must be replaced by a rather more complex procedure in some cases with complex trump configurations)

M-intersection is just script intersection, with extra steps (2) and (3) to discover trump nodes and links. Step (3) discovers a trump link wherever both the input m-scripts M and N obey the trump link relation.

M-intersection is the key operation used to discover word m-scripts in language learning. However, for this purpose there is a key difference from the definition given above, in the discovery of trump links.

When forming `learning example' scripts by partially understanding sentences and observing the meaning they refer to, the child may observe not just the true meaning script X, but also some extra information; she infers a script Y which includes X, Y >s X. Therefore the trump link relation M[j] = M[i] Us Q[j] becomes just M[j] >s M[i] because of extra information on the `downstream' end M[j] of the trump link. (As the `upstream' end M[i] comes from previously learnt words, it has no extra information.)

So if we did m-intersections of these example scripts as above, we would not discover the trump links. We need in stead to use `broad m-intersection', which differs from plain m-intersection, replacing (3) by (3') :

3' For any two trump nodes i and j on Q, add a trump link from i to j if both M[j] >s M[i] and N[j] >s N[i], and if neither M nor N has any trump nodes without the same trump link on node i or j.

In practice a word is learnt by m-intersection of several learning examples, which because of the commutative and associative property of m-intersection (an m-script algebra identity) can be done in any order.

Appendix B : Proof of Bayesian Optimality

This appendix contains a proof that, within a certain mathematical model of learning, a Bayesian learning procedure gives the best possible fitness; so that natural selection will tend (other things being equal) to produce animals with Bayesian learning in preference to any other.

The mathematical model of learning uses some abstractions and approximations (such as discrete time steps) but is nevertheless quite a good approximation to the problem of learning social causal regularities in a primate group. Primates are under intense selection pressure to compete socially, and need to learn these regularities as fast as possible. If Bayesian learning is optimal, this implies that primate selection by social competition leads to near-Bayesian learning of social regularities. In this theory, social learning is the forerunner of language learning, which will therefore inherit its Bayesian character.

We characterise animal learning as follows:

To make this abstract framework a little more concrete in the context of primate social learning: There may be three distinct causal regularities in some primate social group: R1 : `When Brutus bites Cassius, Cassius becomes frightened', R2 : `If you threaten Cassius when he is frightened, he will give you food, and R3 : `If you threaten Cassius when he is not frightened, he will bite you'. Suppose R2 and R3 are well known, and the task of learning is to discover, by observation, whether R1 holds.

The prior probability that R1 holds is quite small, p(R1) = 0.1; and even if it does hold, there is no a priori knowledge about its cause-effect probability s1(R1); so r(s1) = 0.1 in the range 0<s<1.

Suppose a monkey observes 49 events E(1) .. E(49); in 12 of these, Brutus bites Cassius, and on 10/12 occasions, Cassius is then frightened. Normally, Cassius is only frightened 20% of the time. Now, in time step 50, Brutus again bites Cassius. Should our monkey go and threaten Cassius ? If the rule does hold (and has high rule probability s) then Cassius will be frightened and our monkey will get some food (fitness increment +0.01 on some scale); but if the rule does not hold (those 10/12 occasions were just flukes; Cassius was frightened for some other reason) then our monkey will get a bite (fitness increment -0.1).

The proof that Bayesian learning gives the optimum decision rule is now quite simple:

The way an animal chooses an action Ak, given sense data Dm , is summarised in a decision function Fk(Dm ). Fk gives a value for each action index k and sense data Dm . The animal acts as if, when having sense data Dm , it calculates all the decision functions Fk for each possible k, and chooses the action Ak with the largest Fk .

Decision functions can be used to describe any relation between sense data and actions ; that is, to describe the input-output relation of any cognitive system. So they give an external, functional specification of any possible brain. Our problem is: what is the best possible set of decision functions Fk ?

Figure 2 shows the relations between these concepts.

Figure 2: Flow of causation for the mathematical model of animal cognition. Time ordering is from left to right, and arrows flow from causes to effects. The probabilities P of states of affairs S, of sense data D and of outcomes O are discussed in the text. The horizontal dashed line denotes the interface between the animal and its environment.

The requirement equation defines the best possible decision functions Fk(Dm) - those which give best average outcome V, and which therefore give greatest possible fitness.

The expected value from taking action Ak in the state of affairs Si is the average value of all possible outcomes, weighted by their probabilities :

Using the definition:

we can calculate the expected value from one state of affairs Si , which gives sense data Dm , if the animal uses decision functions Fk to choose an action:

The theta function picks out one term from the sum over actions Ak - the term which has the largest Fk. So it embodies the decision rule to pick out one action by calculating the Fk and picking the largest. U is the average value resulting from that choice.

The average value for a single encounter is got by summing over all states of affairs and sense data sets, weighted by their probabilities:

where :

Gk depends only on things outside the animal's control Ñ on probabilities of states, probabilities of sense data, probabilities of outcomes and values of outcomes. Gk(Dm) is effectively the average value of doing action Ak when having sense data Dm .

In the design of a cognitive system, Gk cannot be varied, but the decision functions Fk can.

(2.4) is a sum over all Dm, where (because of the q function) for each Dm only one Gk enters the sum. The maximum Vavg is got by setting

since this choice means that for every Dm , the q function picks out the largest of the Gk and adds it to the sum. Any other Fk gives a worse average outcome, by sometimes picking a smaller Gk .

Therefore the optimum decision rule is given by

Thus equation (2.7) is the requirement for animal cognition Ñ the optimum form of cognition in all circumstances. I shall call it the Requirement Equation.

We can multiply all the Fk by any positive common factor, and it will not alter the choice of action. This enables us to rewrite (2.7) in two parts :

(2.8) is recognisable as Bayes' theorem, applied to find the most likely state of affairs Si in the light of the sense data Dm; then (2.9) chooses the best possible action, in the light of the likely states Si. For many problems, finding the state by (2.8) is the hard part, and then choosing an action by (2.9) is comparatively easy.