3.1 Bayesian Learning gives Optimal Fitness
3.2 Testing for Rule Scripts
3.3 Discovering New Rule Scripts
3.4 Speed and Robustness of Learning
3.5 Top-Down and Bottom-Up Learning
3.6 Learning Exceptions from Negative Evidence
3.7 Learning Noun Meanings
3.8 M-script Learning
3.9 The Primary Learning Algorithm
3.10 Gathering Negative Evidence
3.11 The Secondary Learning Algorithm
3.12 The Fundamental Theorem of Language Learning
3.13 Summary of the Learning Theory
I have sketched how the m-script model of language, being equivalent to unification-based grammars, can describe the production and understanding of typical adult language - including complex sentence meanings and diverse syntactic constructs in a wide range of languages. It does this in a way which depends principally on m-scripts and m-unification.
If we assume that the ability to m-unify is an innate capability of the human brain, then to learn a language is to learn the m-scripts of its words. If, therefore, we can find out how a child can learn word m-scripts by attending to adult language, we have a working theory of language learning to compare with the evidence. This section describes the proposed method for learning any word m-script.
I propose a Bayesian probabilistic theory of language learning. The reasons for using Bayesian learning, rather than any other learning framework, are:
Appendix B gives a general proof that Bayesian learning, with prior probabilities which match actual probabilities in the environment, gives best possible fitness. Here I shall give an intuitive argument that it does so.
Consider an animal which learns rule scripts in order to predict outcomes of social situations and to choose social actions. For instance, a monkey may learn the rule script `If I bite X and X has higher rank than me, then X will bite me back'; he may then use this rule script to know when he should, or should not, bite other monkeys. Knowing the rule script helps him to maximise the benefits of biting, and minimise the penalties.
He can learn such a rule script by observing some number of occasions when the prediction of the rule is confirmed. Suppose L occasions must be observed in order to learn the rule. How large should L be? (i.e., how many examples are needed to learn a rule?) Two things may go wrong:
Each possible rule script R has a prior probability PA(R) of being true.(A is a subscript; see formula below) Since there are very many possible rule scripts, this prior probability is rather small for a typical rule. To learn a rule R, we apply Bayes' theorem to the two alternatives: R is true, or R is not true:
Here D stands for the available sense data, which consists of a sequence of scripts (describing actual events) which the learner observes [1]. Initially, when no events have been observed, P(D|R) = 1 and so P(R|D) = PA(R) = small; the learner should not believe the rule R.
Suppose that rule R makes a striking (and otherwise unlikely) prediction. For every event in D which confirms this prediction, P(D| not R) is multiplied by a small number (the small probability of the prediction, if the rule were not true), whereas P(D|R) is not diminished (as the rule predicts the event). So after even a few such events confirming the rule, the `handicap' of small PA(R) in equation (3.1) is overcome; PA(R) P(D|R) becomes much larger than P(D| not R), so P(R|D) grows very close to 1. The learner should then believe the rule and act as if it were true.
The Bayesian learning criterion is to choose L (the number of positive examples required to learn the rule) just so that P(R|D) in (3.1) is close to 1 - above a threshold [2] (1-d). This minimises the aggregate penalty of (1) and (2) above:
The penalty (1) for slow learning rises linearly with L, and the penalty (2) for spurious learning falls off exponentially with L. As is proved in Appendix B, the Bayesian criterion chooses L to precisely balance these two penalties, minimising their sum and achieving best overall fitness. No other learning criterion does this.
For instance, if a rule script R has a very small prior probability PA(R), then by (3.1) it will take longer to learn this rule; more examples L are required to overcome the handicap of small PA(R). But because PA(R) is so small, this rule very rarely holds; so the fitness penalty of learning it slowly is on average a small penalty. Bayesian learning is tuned to learn important rules fast, other rules more slowly.
The Bayesian criterion means that most rules are learnable from 6 - 12 examples (which is very fast learning, compared for instance to many neural net models, which require thousands of examples). The performance of vervet monkeys in learning social regularities (Cheney & Seyfarth 1990) shows they can learn complex symbolic rules, like scripts, from a few training examples, as the Bayesian model would predict.
The optimality of Bayesian learning reflects itself not only in very fast learning - as fast as is possible without rampant errors - but also in very robust learning, which is highly immune to random errors and noise in the data. From the more detailed Bayesian treatment of script learning below, we shall see how this robustness emerges. It is a very important property for biological fitness, and for a child learning a language from noisy, error-prone data.
We define a model for Bayesian learning of rule scripts as follows:
We treat the time dimension as discrete; the learner observes a sequence of factual scripts F(t) describing events at times t = 1...N. The average amount of information in each event script F is Iavg (e.g. for definiteness, take Iavg = 30 bits).
A rule script R describes a causal relation between the events at times t and (t+1). R is a script with two scenes - a `cause' or `condition' scene C(R) and an `effect' scene E(R). If at time t, the conditions of the rule are satisfied, then with probability s(R) the effects of the rule will follow at time (t+1). This is computed as follows:
The rule conditions are tested by script inclusion (step 1). For instance, the condition C(R) = `I bite ?X' is satisfied by a factual script F(t) = `I bite Caesar'. The consequences of the rule are calculated by unification (step 2). A variable identity in the rule such as `?X' may be fixed by this unification, and the same fixed value will appear in the effect E(G). This is how a rule such as `if I bite ?X, ?X will bite me' can predict `Caesar will bite me'.
If the rule fires, the factual script for the next time step, F(t+1), will reflect the consequences of the rule E(G), but may also have other information unrelated to the rule (step 3). The actual outcome may be `Caesar bites me and takes my banana' - which includes the rule effect E.
The rule probability s(R) (the probability that the rule fires, if its conditions are satisfied) is typically large. It is not necessarily 1; the cause does not always lead to the effect. s(R) should not be confused with the prior probability PA(R) that the rule holds at all. For most rules, this prior probability is very small.
Suppose we know what rule script R we are testing for. Given a sequence of factual scripts F(1) ...F(N), how do we test whether R holds ? We apply Bayes' theorem, equation (3.1), to the two possibilities `Rule R holds' and `Rule R does not hold'. Here D stands for the sequence of factual scripts F(t). To calculate the terms in (3.1):
To calculate P(D|not R) : If R does not hold, then the probability P(D|not R) is simply 2**(-N Iavg) - because the probability of some event script F(t), with information content Iavg , occurring at random is just 2**(-Iavg) , and it happens N times over.
To calculate P(D|R) : Group the factual scenes F(t) into pairs for successive time intervals, forming two-scene scripts H(t) =[F(t),F(t+1)]. If a rule R fires between time steps t and (t+1), then H(t) s R.
Suppose there are Nc occasions where F(t) s C(R) - i.e. where the conditions of the rule were satisfied; and that on Nce of these occasions, H(t) s R - i.e. the conditions were satisfied, and the rule fired. The best estimate of the rule probability [3] s(R) is then s = (Nce / Nc) .
For each of the Nce `successful' occasions, the contribution to P(D|R) is a factor s*2**(Ie - Iavg). For each of the (Nc - Nce) `failed' occasions, the contribution to P(D|R) is a factor (1-s)*2**(-Iavg). For each of the (N - Nc) occasions when the rule conditions were not satisfied, there is a factor 2**(- Iavg). These are just the probabilities of the observed outcomes on each occasion [4].
To calculate PA(R) : We need a simple model of the prior probabilities PA(R) which somehow penalises more complex rule scripts (because there are so many more of them) and ensures that the expected number of true rules is finite. To do this, we can use the information content I(R) of the rule scripts, choosing
Using l 2-3 penalises complex rule scripts sufficiently to ensure convergence; with finite C, the sum of P(R) for all rules (the expected average number of rules which hold) is finite.
The total information content of a rule script can be written as a sum of the information contents of its cause and effect scenes; I(R) = I(C) + I(E). Typically these two parts are roughly equal.
We can then multiply these factors together, calculating both terms PA(R) P(D|R) and P(D| not R) of the Bayes formula, and compare them. If PA(R)P(D|R) is the larger of the two, then there is sufficient evidence to believe the rule holds; otherwise, any apparent evidence is more likely to be a fluke.
This comparison can be envisaged in a simple graphical form, which I call a crossover diagram, shown in figure 9. This plots the log (to base 2) of the two terms in the denominator of (3.1), calculated as described above, against the number of occasions Nc where the conditions of the rule are satisfied. By inspection of this diagram, comparing the slopes of the lines, you can see which term dominates the denominator, and therefore which hypothesis (Rule or no Rule) is to be believed, as a function of Nc. As soon as the `Rule' line is well above the `No Rule' line, the rule should be learnt.
Note that the two lines in this diagram both apply to the case where the rule actually holds; but each line applies to a different hypothesis, of `Rule' and `No Rule', entertained by the learner.
Figure 9: Crossover diagram for two hypotheses, `Rule' and `No rule' to calculate how much evidence must accumulate before the rule is believed.
At first, the `rule' line has the handicap of the small prior probability of the rule, which produces the initial gap between the two lines. As more cases Nc accumulate (where the rule conditions are satisfied), so evidence for the rule accumulates, until the two lines cross over; after that, the rule should be believed, and therefore learnt.
The `no rule' line has a slope -Iavg because every observed script has a probability 2**[-Iavg]. The `rule' line is a weighted average of two cases - where the rule fires (i.e. its effect ensues) or does not fire. These two cases contribute respectively s[log s + I(C) - Iavg ] and (1-s)[log(1-s) - Iavg ] , so the difference of the slopes of the `rule' and `non-rule' lines is sI(C) + [s log s + (1-s) log(1-s)]. Of these terms, if I(C) is moderately large (10 bits or more) s I(C) will dominate, except for very small s.
The crossover between the two lines (the point beyond which the rule can be believed) will occur when
or equivalently where
Nce is the number of positive examples, where the conditions of the rule are satisfied and the rule fires. (3.3) shows the important result that the number of positive examples required to learn a rule is independent of the rule probability s(R) - therefore it is independent of the number of negative examples, where the rule does not fire. Highly `unreliable' rules, with small s(R), can be learnt.
So in a typical case where I(C) + I(E) 2I(E) , lambda =3, around six positive examples are required to learn any rule - regardless of how simple or complex it is, and regardless of the number of negative examples in between. For any rule, six positive examples are enough to provide definite evidence for the rule - overcoming its small prior probability [5].
`Six positive examples' is a useful rule of thumb for remembering the speed of Bayesian learning. However, it depends crucially on l, the coefficient in the simple model (3.2) of prior probabilities of rules. In practice, we expect the form of the prior probabilities to be more complex than that, so `six' is just a ball-park; the number may be anywhere in the range 2 - 10.
The `6 positive examples' rule of thumb is approximately independent of s; but it does not work down to very small s. In particular, for the case s = 2**[-Ie], the precise form of the crossover formula shows that the rule can never be learned. This is because, at this value of s, the probability of the effect occurring is identical, whether or not the rule holds; so the rule is vacuous.
Chemical Analogy: The vertical axis of the crossover diagram is -(Chemical energy). You learn a new rule/atom when the energy cost of introducing it is more than offset by its energy benefits across the molecules you have observed - giving a lower energy (= more likely) interpretation of the data as a whole.
This completes the Bayesian criterion of sufficient evidence for a rule; but it does not yet tell us how to discover likely candidate rules. There is a simple algorithm to do this.
If , for a rule R, on several occasions t its conditions C(R) are satisfied and the effect of the rule ensues, then, as described above, H(t) s R. To discover a rule script from a sequence F(1)...F(N) of factual event scripts, construct all the event pair scripts H(t) = [F(t),F(t+1)] and form all their pairwise intersections, Mtu = H(t) ^s H(u) = [F(t) s F(u), F(t+1) s F(u+1)].
Usually, when you intersect together two random event scripts F(t) and F(u), most of the information in the two does not coincide, so the result F(t) ^s F(u) has a small information content, in the region of Iavg/4. Whatever it is, this `background' level Iback can be determined empirically, and random fluctuations above it will have exponentially decaying probability P(I) 2**(Iback - I). Random fluctuations more than around 6 bits above Iback will be rare.
If, for a pair of times t and u, the rule R was in action at both times, then H(t) ^s R and H(u) ^s R. From this it follows (via script algebra identities) that their intersection Mtu also obeys Mtu ^s R. Therefore M must have at least the information content I(R), as well as an information content (in both scenes) of [Iavg-I(C)]/4 from random coincidences.
So if we look at the information content of all the pairwise intersections, there will be a clear gap of 3/4 I(R) between the haves and the have-nots - those which have the same underlying rule script R, and those which do not. The probability of a `have-not' looking like a `have' by chance - of it having an information content I(R) larger than it should - is of order 2**[-3/4 I(R)], which for information-rich rule scripts (say I(R) > 20 bits) can be safely ignored.
If Mtu has large information content, and Muv also has large information content, this implies that scripts H(t), H(u) and H(v) all reflect the action of the same underlying rule R. In this way we can easily discover groups of scripts in which the same rule R was in action, and form the intersection X = H(t) ^s H(u) ^s H(v) ^s H(w). By the script algebra, this must still satisfy X ^s R, and if we are following the `six positive examples' heuristic above, it is likely that X will be a very good approximation to R; the chances of some superfluous piece of information (eg the value of some slot) appearing identically on the 6 scripts H(t), H(u),... are very small indeed.
Therefore the procedure of taking pairwise intersections Mtu and then grouping them together is a reliable way of finding the rule scripts which are operating.
While pure script intersection is very robust against extra information in any of the scripts, it is not so robust against missing information in any of them; if one script happens to have a node or slot missing, it will be missing in the resulting rule script. So in practice we should move a little towards `majority voting' for features on the rule script - requiring, for instance, that at least two scripts in the set lack a feature before we remove it from the result. This makes a trade-off between the two kinds of robustness - robustness against extra information, and robustness against missing information.
It is then simple find how many F(t) obey F(t) >s C(R), that is, to find how many times the condition of the rule was satisfied (but its effects did not necessarily follow). In this way we can estimate the rule probability s(R).
There may be cases where two rules R1 and R2 have the same `condition' part C(R), but distinct `effect' parts E(R1) and E(R2). Such a pair of rules in effect says `If C holds, then either E(R1) or E(R2) will follow, with probabilities ...'. In language learning, this is relevant to homonyms, where the same word (cause part) has different meanings (effect part).
If there is such a pair of rules, then some of the intersection pairs Mtu will have large information content I(C) + [Iagv-I(C)]/4 in their condition part, but random information content Iavg/4 in their effect part - because, for instance, R1 was acting at time t but R2 was acting at time u. This is easily recognisable in the pairs Mtu ; in this case, use only Mtu with large information content in both cause and effect parts, to define the groups of scripts to intersect together. Then both rules will be discovered.
At first sight, this algorithm seems to require time of order N**2 to form all the pairwise intersections Mtu ; therefore, the amount of processing to be done at each new time step (generalising H(t) with all previous H(t')) grows linearly with time, which would be unacceptable. However, this problem is easily solved. If all the previous H(t) are stored in some kind of associative memory, then it is only necessary to retrieve the few most likely matches, rather than the whole set. In this respect, a subsumption graph of H(t) would be suitable. For learning word scripts, associative retrieval based on the sound of the word is also possible.
So the algorithm can learn rule scripts from experience with constant computational cost per unit time, requiring about six positive examples per rule.
Chemical Analogy: When the same structure is seen in six or more supposedly random molecules, that structure is likely to be an elementary constituent.
This learning algorithm is robust in several ways:
For language learning, this means that the child only has to hear a word correctly and infer its correct referent from non-linguistic evidence on a minority of occasions; the word's meaning script can still be inferred reliably.
The major issue not yet tackled is how the algorithm behaves when there are two similar rules - in particular, a general rule and an exception rule. This important case is addressed in the next subsection.
The Bayesian derivation makes it clear that six positive examples is about the minimum needed for reliable learning. Anything using less would tend to be unreliable, for at least two reasons:
So one-shot learning would require very special constraints to make it robust and reliable.
There is an important question concerning the order of learning: suppose there is a general rule R, and a number of special-case rules R1 , R2 etc. They are special cases, in that their conditions are stronger than the conditions of R; that is, C(Ri) >s C(R) for i = 1,2. The subcase rules are exceptions to the general rule; either their effect parts may be different E(Ri) not = E(R), or they may give rise to the same effects with different probabilities s(Ri) not = s(R).
The question is: which is learnt first, the general rule or the special cases ? If the general rule is learnt first, we call this top-down learning; but if the special cases are learnt first, we shall call it bottom-up learning.
The Bayesian learning theory gives an answer to this question, but it is not the same answer in all cases; it depends on the form of the rules.
The question of top-down versus bottom-up learning is a very important one for language learning, related to the issues of learning general rules of syntax, over-generalisation, and so on. It turns out that for language learning, the theory predicts that bottom-up learning predominates; but there is also some top-down learning in a few cases.
We first sketch the mechanisms for top-down and bottom-up learning, in the Bayesian framework:
Why should one of these happen, rather than the other ?
Since the conditions of the general rule are less restrictive than those of the special-case rules, its conditions are satisfied more frequently than they are (every time some C(Ri) is satisfied, then automatically C(R) is satisfied). So the learning examples accumulate faster for the general rule than for any special case. On these grounds, we expect the general rule to be learnt first, and the top-down mechanism to dominate.
However, it may not be possible to recognise instances of the general rule in the data. Recall that the learning algorithm takes pairwise intersections Mtu of learning examples H(t) and H(u) - and that when the rule R is operating on both occasions t and u, the intersection Mtu is readily distinguishable from a `background' Mtu , by its information content. But this only works if the rule R has large information content (say, greater than 10 bits) in both its cause and effect parts.
Rules with small information content in their cause or effect parts cannot be recognised directly in the data. The only way to find such rules is by the bottom-up mechanism described above. In this mechanism, each special rule Ri has already been `cleaned up' by script intersection when it was learned; do it does not have a lot of random noise in it. The small information content of the general rule R can be found in the intersection of the Ri .
We conclude that the top-down mechanism dominates for rules with large information content; but rules with small information content can only be found by the bottom-up mechanism.
For language learning, the m-scripts for open-class morphemes have large information content on both sides (the `sound' side and the `meaning' side), so they can be learnt direct from the data. M-scripts for broad syntactic rules, or for closed-class morphemes, all have small information content. So they can only be learnt by the bottom-up, secondary mechanism, after several narrower rules have been learnt (as described in section 3.9). So the bottom-up mechanism dominates for language.
Nevertheless, there are occasions where top-down learning of exception rules is used in language learning, to correct over-regularisation of irregular forms. So I shall briefly describe how that works in the Bayesian theory.
Consider a general rule R and an exception Re, where C(Re) >s C(R).
If the `effect' parts of the two rules E(R1) and E(R2) are easily distinguishable, then the examples of each rule are easily distinguishable; each rule can be learnt individually as described above, using of the order of six positive examples.
However, consider the case where the `effect' parts of R and Re are identical, E(Re) = E(R), but this effect occurs with different probabilities in the two cases. The probability of firing is s for R ,and t for Re. For instance, s might be near 1, and t near zero; this means that in the more general case, E is very likely to happen, but in the special case, it will probably not happen. Suppose that R is learnt first (top-down learning).
If s = t, the exception rule predicts exactly the same as the general rule; in effect, there is no exception to learn. When s t, the exception rule makes a different prediction from the general rule, but this could not be detected from any one example.
To learn the exception rule, when t < s, one needs to observe several examples where the conditions of the exception rule Re are satisfied (and so the conditions of R are also satisfied), and to realise that the effect sometimes does not occur when predicted by the general rule R. In other words, one has to learn from implicit negative evidence.
The Bayesian analysis tells us how many examples must be observed to do this. As before, we can calculate the probabilities P(D|S) for the two different hypotheses `R only' (no exception) and `R and Re ' (exception). These are input to Bayes' theorem (3.1) and the crossover analysis done as before. Again, the two hypotheses give straight lines on the crossover graph.
Whereas for the simple learning case (section 3.2) the difference between the slope of the `rule' and `no rule' lines was typically large (or order 10 or more), for the `rule and exception' case the difference turns out to be small - typically 1 or less. In other words, evidence in favour of the exception rule can only accumulate slowly.
Where the two lines cross (the number of examples needed to learn the exception) depends on their zero intercepts - which depend on the Bayesian prior probabilities built into the learning mechanism. With the simple model (3.2) of independent priors for each rule, this difference is lambda*I(Re). For a typical rule with I(Re)= 20 bits and lambda=3, lambda*I(Re) =60. With a slope difference less than 1 in the crossover diagram, this would require more than 60 examples to learn the exception.
However, rules often have exceptions. That means that the simple model of prior probabilities in (3.2) is not realistic for exception rules. A more realistic model might be P(R and Re| R) = 2**lambda [I(Re)-I(R)]; this allows frequent exception rules, and by penalising the exception less in Bayes' theorem, allows faster learning - in, say, 5 - 10 examples.
Monkeys can learn exception rules from implicit negative evidence. Cheney and Seyfarth (1990) have shown how monkeys habituate to a particular peer's alarm call (but not to all peers' alarm calls) if that peer's call proves, on a number of occasions, to be false alarm. The general rule might be `If any individual ?X gives an alarm call, there is an eagle about with probability 0.8' and vervets can learn an exception `If Brutus gives an alarm call, there is an eagle about with probability 0.1'. The monkey has to learn this by observing occasions where Brutus gave a call and the eagle (expected by the general rule) failed to materialise; nobody explicitly gives him the information `that was a false alarm'.
Cheney and Seyfarth's results show that vervet monkeys can learn just these exception rules from implicit negative evidence, and can learn them from a fairly small number of examples - as the model above predicts [6]. The same process for learning an exception from implicit negative evidence is available for language learners.
Since the m-scripts for nouns have no trump links, they are identical in form to simple scripts; so they can be learnt by the script learning mechanism which we have described. This gives a computational theory of how children learn the meanings of nouns.
Figure 6 illustrates a typical noun script which is to be learned. To apply the script learning theory to this problem, consider the left branch of a noun m-script (the branch containing just the sound of the word) as the `cause' part of a rule script; and treat the right branch (which contains the relevant meaning script) as the `effect' part.
In this model, the child hears people speaking, in circumstances where she can see what is happening and construct one or more representations of it, in script form. On each such occasion (labelled by t) she forms `sound/meaning scripts' H(t) whose left branch represents just some sequence of contiguous phonemes (sequences rather longer than words give efficient learning) and whose right branch is some piece of script meaning inferred from the scene - specifically, the script description of some entity involved. There may be several H(t) formed for one sentence/occasion. Then the H(t) are input to the learning algorithm as described above. A typical H(t) is shown in Figure 11.
Figure 11: Script H(t) formed by a child observing a parent say `The cat is in disgrace', and input to the noun learning algorithm.
Pairwise script intersections Mtu are formed, keeping only those which have significant information over the random noise level Iback in both the left branch (i.e. which have the same sequence of phonemes occurred on both occasions) and the right branch (some significant overlap in meaning).
On the occasion shown in figure 11, the child hears her parent say the phonemes `th', `e', `ca', `t' and so on, in that sequence, as represented by the left-hand branch of the script (the precise way in which the child divides the sound stream into sections such as phonemes is not important, as long as it is consistent). She also observes a black quadruped, which happens to be under the table and near a bowl - as represented by the right-hand branch of figure 11.
On taking pairwise intersections Mtu, the only other H(t) which give above-noise information content on both sides of Mtu will be those in which the phonemes constituting `cat' are present in the left branch, and the black quadruped is present in the right branch. For instance, the sentence `Look at my shoes' will give high information on the left branch (mimicking `cat') but no high information content on the right branch.
Figure 12 shows the H(u) resulting from a second occasion; and Figure 13 shows the script Mtu which is the script intersection of figures 11 and 12.
Figure 12: Script H(u) formed by the child hearing someone say `My cat can catch mice' and observing the nearby scene.
Figure 13: Script Mtu formed by intersection of the two scripts from figures 11 and 12.
Note that this Mtu has high information content in both branches, but still contains some information not germane to the concept `cat' - information about a colour `black' and some relationship to a table.
Those Mtu with high information content on both branches define groups of occasions t, u, v, ... to be intersected together. Combining about six examples in this way will give a clean form of the meaning of the noun `cat'. In the left branch, the result will only contain the sound `cat' because all other phonemes (before or after) are lost in the intersection; and in the right branch, any meaning not central to the concept `cat' (such as the information about colours and tables in figure 13) will also be rejected by the intersection process.
Script intersection prunes a number of noisy examples, to associate the core sound with the core meaning. As long as both the left branch and the right branch of the underlying word m-script have adequate information content (say, 10 bits or more), the useful pairings Mtu (where both H(T) and H(u) contain both the core sound and the core meaning) will be clearly separated from other pairings (where either the sound or the meaning is missing) by the information-theoretic criterion, that the information content of Mtu be well above background noise in both branches.
This procedure can reliably learn noun meaning scripts of any complexity; it is not restricted to simple meaning structures like that in the right branch of figure 13.
Homonyms will give some pairs Mtu whose left branch has significant structure (the sound of the homonym) but whose right branch is noise (because two different senses were intersected together). However, as described above, only those Mtu where both branches have high information content are used to define the groupings for intersection; in this way the different senses of a homonym can all be learnt independently.
For a homonym such a `bank' or `palm', each distinct sense is learnt with a rule probability s(R) which depends linearly on its frequency of occurrence. A rare word sense will have s(R) near zero, and a common word sense will have s(R) near 1. When choosing between senses of a homonym, we use Bayesian inference to find the maximum likelihood interpretation; the rule probability of each word sense enters into this likelihood calculation, leading us to favour common word senses. (this model is a little over-simplified; we should also allow for context-dependence of the probability s(R), distinguishing s(bank|finance) from s(bank|fishing) )
This form of learning is memory-intensive, requiring the child to store hundreds or even thousands of scripts H(t). However, I do not regard this as a drawback. As adults, we have the capacity to remember tens of thousands of words, and a similar number of places we have visited. Even rats can neurally encode memories of hundreds of distinct places (O'Keefe 1987). Why should not a similar memory capacity be used in childhood for such an important purpose ?
As discussed previously, the H(t) should be stored in some kind of associative memory (or a subsumption graph memory) so that not all the pairings H(t) s H(u) need be literally formed; only promising candidates (eg those where several phonemes in sequence match) need be made, to keep the computational costs acceptable.
In this way the script intersection mechanism solves two hard problems together: segmenting the sound stream into words, and factoring out core word meanings from random extra meaning present on particular occasions.
The robustness of the rule learner carries over to this application. Even if the child frequently misinterprets the scene, or mishears the word, still a few clean examples, (interspersed amongst any number of spoiled ones) are sufficient for learning. Even if on every occasion, the child adds much irrelevant occasion-specific detail to the meaning, still the intersection mechanism will reliably project out the common core meaning.
Robustness follows basically from statistics and the independence of different learning occasions. The chances of some spurious phoneme appearing five times over in the left branch, or of some spurious element of meaning appearing five times over in the right-hand branch, are very small indeed. Both would have to happen in the same five examples to affect the result. Therefore taking a script intersection of five or more occasions eliminates spurious noise with high reliability (while retaining the core sound and meaning). And on the very few occasions when it does not, a couple more examples will do so.
This mechanism can learn not just the meanings of nouns, but of several other classes of word as well. For instance, it can learn the bare meaning of a verb (without, however, learning how this meaning is linked to the meanings of the verb's arguments); similarly for an adjective or adverb. Such early learning of bare verb meanings is not, in this theory, a necessary prerequisite to learning the full verb m-script.
It cannot, however, learn anything about closed-class morphemes - because their meanings have an information content (typically 1 - 5 bits) which is too small to operate the learning mechanism. Recall that it requires rule scripts to have large information content (say, 10 bits or more) in both their `cause' and `effect' branches in order to clearly distinguish operation of the rule script from random noise; closed-class morphemes fail on one or both counts.
Learning bare meanings of isolated words - nouns or otherwise - is only a start to learning a language. It tells the child nothing about the syntax of the language, how sentences hang together, or how complex meanings are constructed out of simple ones. To know these things, the child needs to learn the unbounded script functions by which language meanings are constructed. She must learn the full m-scripts for words.
While functionally, an m-script is much more powerful than a script (it is an unbounded function from scripts to scripts), structurally, an m-script is very much like a script; it is just a script with a few trump links. If you can learn a script, and then learn its trump links, you have learnt an m-script. The only thing you need to learn about a trump link is where to put its two ends. It turns out that this is a rather simple learning problem, compared to the problems we have solved already.
Just as rule scripts are discovered by script intersection, so m-scripts are discovered m-intersection, described in section 2 and Appendix A. To m-intersect two scripts (or m-scripts) together, first you do a script intersection, and then do some further simple tests to discover any trump links.
There is a general procedure to learn a word m-script from a few sentence examples where (a) you know the meanings of all other words in the sentence, and (b) you can infer the meaning of the sentence from non-linguistic clues. It is a fairly direct extension of the noun-learning procedure we have already described. The procedure is:
M-intersection projects out the word meaning structures and phoneme sequences as before, but it also discovers the trump links which give m-scripts their power as script functions. It learns the full word m-script for the unknown word.
We shall show how this works out by an example, learning an intransitive verb. Suppose the child has learnt the noun m-scripts for a few common and proper nouns, by the script intersection mechanism above; she then knows m-scripts for Mummy, Fred, girl and a few other nouns. Suppose she then hears the sentence Fred shouts in a context where she can hear who is shouting.
She forms an initial SMS from the sounds she hears. She then applies the m-script for Fred to add an entity node representing Fred, and adds the full meaning which she has inferred non-linguistically as a right branch. The resulting SMS is shown in figure 14.
Figure 14: SMS H(t) constructed by the child on hearing the sentence `Fred shouts' and using her knowledge of the word `Fred' to partially understand it.
A second example Mummy shouts at Fred gives rise to the SMS H(u) in figure 15.
Figure 15: Another example, when the child hears the words `Mummy shouts at Fred', using her knowledge of the words `Mummy' and `Fred' to partially understand it.
The result of m-intersecting these two SMS together is shown in Figure 16.
Figure 16: The result of m-intersecting the two scripts of figures 14 and 15, which is the m-script for the word `shouts'.
M-intersection, like script intersection, projects out the common structure from two or more scripts, and rejects all the rest. Thus it discovers the shared meaning in the right branch (the core meaning of shouts), and the shared sound sequence `shouts' in the left branch - rejecting non-shared sounds such as `Mummy'. It also discovers shared variable identities [7] such as `?A' in figure 16. However, most importantly, it discovers the trump link (the curved arrow) which enables the argument of shouts (e.g. Fred or Mummy) to be incorporated in its meaning script.
If an m-script W has a trump link from node a to node b, then all the scripts H within its scope (the learning examples) should obey the trump link relation H[b] = H[a] Us W[b]. However, the child observes not just the speaker's intended meaning, but may observe other irrelevant aspects of the scene - so she may observe a meaning script N which includes the intended meaning M, so that N s M. This means the learning examples do not necessarily obey H[b] = H[a] Us W[b], but must obey a weaker relation H[b] >s H[a].
Therefore to learn a word m-script, we use a variant of m-intersection which discovers a trump link whenever the relation H[b] >s H[a] is obeyed by all the learning examples.
This looser form of m-intersection can discover trump links even in the presence of noise in the meaning scripts. Can it do so reliably ? Suppose it erroneously put a trump link between two nodes c and d. It would only do this if the relation H[d] >s H[c] were obeyed in all the (6 or so) SMS used for the m-intersection. For two random script subtrees H[c] and H[d], the probability of having H[d] >s H[c] by chance is of order 2**[-I(H[c])]; for even one input script this probability is usually quite small, and the chances of it happening for all six SMS H(t) are very small indeed. So the m-intersection algorithm has only a small chance of discovering a spurious trump link.
M-intersection reliably discovers trump links, and thus reliably discovers the ways in which word m-scripts link their meanings to the meanings of their arguments. It discovers full word m-scripts; figure 16 is the full word m-script for `shouts', containing all its syntax and semantics. Once the child has learnt this m-script, she can reliably use the word `shouts' both for understanding and language generation.
As before, each possible word m-script W has a prior probability 2**[-lambda*I(W)], and a crossover analysis is used to find how much evidence is needed to overcome this small prior probability, and hence to `believe' that the m-script exists. As before, the result of this analysis is that about six clean learning examples are needed - which is also the number required to determine reliably what nodes, slots and values are part of the m-script, and to determine its trump links.
The same procedure works to find more complex m-scripts, such as that for the word `gives' in figure 8. Even though `gives' has three separate trump links (one each for the donor, the recipient and the gift) it does not take more than about six good examples to define all the trump links and all the information in the m-script.
One further elaboration is required for the algorithm of the previous section, to make it capable of learning any open-class word (eg noun, verb, adjective or adverb) in a language. The algorithm of 3.7 is good for finding any verb which may be the main verb of a sentence, because that is the last m-script to be applied in constructing the meaning, when understanding the sentence; the SMS H(t) are made by partially understanding all the verb arguments (to make their left branches) and by inferring the full meaning from non-linguistic clues (to make their right branches).
Consider the problem of learning the m-script for red from example sentences such as Peter burst the red balloon and Madge bought a red lollipop. Suppose all other words are known, but red is unknown. Applying the m-script for red is not the last step in understanding the sentence, so we cannot just do all the other understanding steps and then pair the result with the full sentence meaning. We need a way to isolate the unknown meaning portions red balloon and red lollipop as input to the learning algorithm.
The way to do this is to use both partial sentence understanding and partial sentence generation in constructing each SMS. Given a sentence Peter burst the red balloon, given a non-linguistic understanding of the full sentence meaning F, and given word m-scripts for all words except `red', the algorithm is as follows:
Figure 17: First stage of generation from the meaning script describing `Peter burst the red balloon'. The m-script for `burst' is applied right-left, to create the left branch, passing meanings back along the trump link arrows. The scene marked `X' is then used to help learn the word `red'.
In these examples, the script meaning representations are simplified, and the word sounds are represented as single scenes, rather than as one scene per phoneme, for simplicity. Also, some part of the SMS which obviously differ between examples, so will be eliminated by the m-intersectoin, have been stripped off from the figures. The learning algorithm is not affected by these simplifications.
Figure 18: A learning script H(t) obtained by pairing the meaning scene X from figure 17 (right branch) with the partially understood phrase `red balloon' (left branch)
Figure 19: M-script R for the word `red', found by m-intersecting several learning scripts H, including that shown in figure 18.
Since the learning script H(t) in figure 18, and all the others H(u) ... like it, obey the relation H[b] s H[a], the m-intersection R = H(t) ^m H(u) ^m H(v) ... discovers the trump link from node a to node b. This trump link enables the script function for red to convert the meaning script for balloon into that for red balloon; the meaning script for wooden sailing boat to that for red wooden sailing boat and so on.
We can now see that the previous learning procedures - the first one for nouns, and the second one for verbs - are special cases of this primary learning procedure. For nouns, there is no partial understanding stage necessary; for main verbs, there is no partial generation stage necessary; but in general, one can use both partial understanding and partial generation to narrow down to the unknown word, followed by m-intersection of several examples to find its m-script.
In normal language use, speaker and listener form the same SMS in their heads. The SMS contains the word m-scripts for all words in the sentence. If you know the m-scripts for all the words, and know the meaning script, this SMS is defined redundantly; every scene within it is constrained in two independent ways. So even if you do not know the m-script for one of the words, you can still reconstruct the same SMS. Children do this, and m-intersect the SMS to find the m-script for the unknown word.
To see why each SMS is defined redundantly, regard an SMS as a chain of the form:
[word sounds]<->[word m-scripts]<->[sentence meaning]
Every scene along this chain is redundantly specified; each scene is
either: (a) A heard sound, and in the left branch of a word m-script
or: (b) The right branch of a word m-script, and part of the left branch of another word m-script
or: (c) The right branch of a word m-script, and the full meaning script.
So in each of (a), (b) and (c), a scene of the SMS is defined in two independent ways. Because the SMS is redundantly specified, even if any one word m-script is missing, the SMS can still be constructed. The scenes of the missing word m-script will each be defined just once (in one of (a), (b) or (c)) rather than twice; but it will be the same SMS. The examples above illustrate this.
Chemical Analogy: Normally, speaker and listener synthesise the same SMS molecule from opposite ends; the speaker from the `meaning script' end and the listener from the `word sound' end.
If a child does not know the m-script for one word, but can infer the meaning script by other means, she can still synthesise the same SMS molecule, working inwards from both ends.
Collecting a set of these SMS molecules with the same unknown word, and analysing them by m-intersection, she can project out the missing word atom.
This learning procedure uses linguistic knowledge powerfully to narrow down the range of things the speaker might be referring to. From the example Peter burst the red balloon, knowledge of the meaning and syntax of burst is used to deduce that red can refer only to the thing which Peter burst; then knowledge of the word balloon is used to further constrain the possibilities - red must be some attribute of a balloon. Also, the only phonemes left unaccounted for are those in the word red, which are bounded on either side by known words.
To use an analogy: at first, the child must pick pieces of meaning out of her environment with clumsy fingers, and try to link them with word sounds; but later, her growing knowledge of language is like a pair of tweezers which enables her to pick out much more finely the meanings that adults are referring to.
So as the child's vocabulary grows, she can use this stronger learning procedure - which uses both partial understanding and partial generation to narrow the range of possibilities - to learn each new word from a few clean learning examples, rather than from many noisy, contaminated ones. This linguistic `cleaning' of the signal allows the procedure to learn some quite `small' words, which are phonetically not very salient and have small information content in their meanings. For instance, prepositions such as in, above etc. may now be learnable by this procedure. However, it is not the only way available to learn closed-class morphemes; for these, another procedure is also available.
Intuitively, the communicative fidelity of language depends on generation and understanding being (in some sense) the reverse of one another, so that hearers recover the same meanings as speakers start from. In this theory, the reversibility arises because generation and understanding use the same process of m-unification, run either `forward' or `backward'. In the secondary learning process, learners use both together to isolate learning examples for new words.
This full primary learning process has another essential use for the learner, in that it creates occasions for negative evidence. It is known that explicit negative evidence is rare and unreliable (Marcus 1993), and implicit negative evidence is a hard concept to define. For instance, if adults never say goed for six months, how much negative evidence does that constitute ? How many times might an adult have said goed in that period? On what occasions does the child notice the lack of goed? On this basis, it seems that any rare word or construct risks being unlearned by implicit negative evidence.
In this theory, the child can receive definite, usable negative evidence on a countable number of occasions, as part of the primary learning process. Suppose she believes that goed is a word, with the obvious meaning. On some occasions, she will hear an adult speaking, infer non-linguistically what is being referred to, and start to generate a sentence describing it (step (2) above). This `silently generated' sentence contains goed; the child may then record that the adult did not say goed but in stead said went. So this one occasion can serve both as positive evidence for went and negative evidence for goed.
By accumulating enough negative learning examples against the goed m-script (probably more than six examples, as discussed in section 3.6) the child may learn that goed is not a word.
I shall call this the primary unlearning process. It is an explicit computational realisation of one of the operational principles in Slobin's (1985) model of language acquisition: OP (REVIEW): MONITORING. Compare utterances you hear with forms that you would produce in the same situation. Store mismatches and attempt to accommodate your grammar to unassimilated input forms....
Theories of language learning have gone through many contortions, because it has been believed that children receive no negative evidence. The primary learning mechanism enables the child to capture pieces of negative evidence almost as easily as she captures positive evidence.
There is a second learning procedure which may be used for learning `bound' closed-class morphemes, such as the regular morphology of nouns and verbs - m-scripts which may have rather small information content, so cannot be discovered by the primary learning process.
Suppose the child has learnt several forms of the same verb - the m-script gives of figure 8, together with other forms such as give and giving. For each one, we shall suppose the sound scenes are phoneme-encoded, with a sequence of contiguous sound scenes for g, iv s and so on. Each verb form describes the same basic giving action, but particular forms have extra information - for instance the form gives specifies that the time is the present, and that the donor is third person singular (this information can be found in the m-script of figure 8).
Now if we m-intersect together the different m-scripts for of give, gives, giving , two things happen:
Suppose the child has also learnt m-scripts for the third person singular form of several verbs - gives, cuts, eats etc. All of these specify (time:present), and specify that the actor is third-person singular; but the actions in the verbs are diverse. If we m-intersect all these m-scripts together, then two things happen:
By taking these m-intersections, therefore, we have factorised verb m-scripts into `stem' and `inflection' parts. We can also show that if we m-unify the stem m-script for some verb with an inflection m-script, we recover the full m-script for the inflected verb. Therefore full inflected m-scripts need not be stored; they can be reconstituted by m-unification.
Denote full inflected verb m-scripts by V, verb root m-scripts by R, and inflection m-scripts by I. The root and inflection m-scripts are derived by m-intersections such as:
R(give-) = V(gives) ^m V(give) ^m V(giving)....
= V(gives) ^m X
I(-s) = V(gives) ^m V(hopes) ^m V(lets) ...
= V(gives) ^m Y
Then an identity of the m-script algebra shows that the reconstituted inflected verb script [R(give-) m I(-s)] cannot contradict the original inflected verb m-script:
R(give-) Um I(-s) = (V(gives) ^m X) Um (V(gives) ^m Y)
<m V(gives)
Therefore the reconstituted script cannot contain more information than the original, and cannot contradict it; but it still might contain less information. However, we know that [R(give-) Um I(-s)] must contain all the root verb information in R(give-), and it must contain all the third-person present information in I(-s); so we can show by enumeration that it contains all the information we need in the original inflected form. To all intents and purposes it is equivalent. It is usually identical.
This secondary learning process enables the child to learn reliably the m-scripts for closed-class morphemes such as -s, -ing, -ed and so on - even though the sound scene in their left-hand branch is not very salient, and the amount of meaning in their right-hand branch may be very small (too small for the primary learning process to work). It can do this because the input for secondary learning is not `noisy' observations of everyday scenes, but consists of word m-scripts which have already been cleaned up by an m-intersection process. Secondary learning starts with clean, low-noise data, to extract a small signal.
The secondary learning process has several distinct and important uses in the theory:
The secondary learning of broad pseudo-words gives us two ways of handling word ambiguities, which are mutually consistent. If we thought the missing verb might be cooking or bringing , we could try both verbs and take the script intersection of the resulting sentence meanings; or if we have no clue, we just use a `vanilla transitive verb', got by secondary learning, whose meaning is a script intersection of many transitive verb meanings. In either case we get a temporary meaning which is a script intersection of the true meaning with something else - so is guaranteed to be a subset of the true meaning.
The primary and secondary learning procedures are part of a continuous spectrum. There are intermediate points along this spectrum, which can lead to learning the same word. This is illustrated in Figure 20.
Figure 20: (a) Primary learning, (b) secondary learning and (c) hybrid primary/secondary learning
Figure 20(a) shows the primary learning process, in which a number of learning example SMS (circles) are m-intersected together to form a word m-script (square).
Figure 20(b) shows a pure secondary learning process; several primary learning processes deliver m-scripts, and they in turn are input to a further m-intersection.
Figure 20(c) shows a hybrid learning process. To m-scripts from primary learning, and a number of new SMS, are m-intersected together.
Because m-intersection is commutative and associative, it is hard to distinguish between these processes; they all give the same result.
Chemical Analogy: Very light word atoms cannot be reliably detected in noisy large sentence molecules. In stead, the sentence molecules are first analysed down to heavier `pseudo-atoms' which are actually small molecules. These (purified) small molecules are later analysed further, to reveal the light atoms of closed-class morphemes. The analysis can be done in different orders and still gives the same result.
We can prove an important self-consistency property of this set of learning procedures. It is the equivalent (for language) of the stability of DNA replication as the basis of life.
This fundamental theorem of language learning is:
If speakers compose sentences by m-unification of word m-scripts, and people learn words by m-intersection of sentences they hear, then a set of word m-scripts (a language) will propagate stably through a population.
The sketch of the proof is as follows: consider a word whose m-script is W, which is used by speakers in a population. They compose sentence-meaning structures S by m-unifying W with m-scripts for other words U, V. etc.: S = U Um V Um W Um... = W Um X. (Here S is an SMS like that in figure 7, which speakers use for generation, and learners reconstruct as learning examples). Somebody learns his own version W' of W by m-intersecting these SMS together, W' = S ^m S' ^m S''... = (W Um X) ^m (W Um Y) ^m...
A simple result of the m-script algebra then implies that W' >mW. Therefore W' is consistent with W; it contains all the information (nodes, slots, trump links) that W contains, and may contain more if there are random coincidences between the learning examples S, S' etc. However, if enough learning examples are used, these random coincidences will be vanishingly rare; and if they accumulated they would tend to make a word too narrow in scope to be useful, and speakers would tend to broaden it again. Therefore the learning process faithfully transmits the m-script W through a population of speakers.
Since this applies to any individual word m-script W, it also works for an arbitrary collection of word m-scripts W, X, Y...... It works, for instance, in the presence of homonymy and polysemy - when the same word sound has several distinct meaning scripts (i.e. distinct m-scripts). So it guarantees that any language will be faithfully transmitted by the learning process.
Chemical Analogy: Each language has its own distinctive stock of word atoms. Speakers synthesise these together by m-unification to form sentence molecules. Learners re-synthesise the same molecules by attending to inferred meanings, and analyse them down to word atoms. They can only recover the same word atoms which the molecules were originally made from.
This shows that the m-intersection learning mechanism propagates word m-scripts stably from one generation to the next, just as DNA replication propagates genotypes stably. As for DNA replication, there must be mutation-like processes which lead to language change; but these happen against a background of stable, faithful replication.
Every word (more precisely, every word sense) in a language is described by an m-script with two branches. This m-script is a script function from its left branch to its right branch. The left branch describes the word sound and other meaning scenes which are arguments of the function; it defines the syntax of the word (e.g. it defines order and agreement constraints between the arguments). The right branch describes the full meaning of the word including the meanings of its arguments. Trump links from left to right branches convey meaning scripts of unbounded complexity from the arguments into the full meaning.
Each word script function is applied left-to-right, to build up complex meaning scripts, in language understanding. It is applied right-to-left, to break down complex meaning scripts into their component parts, in language generation. This applies to all parts of speech.
In both generation and understanding, word application is non-destructuve - adding to the sentence-meaning structure rather than changing. Speaker and listener build up the same SMS in their minds, which underpins the faithful communication of language.
While the `function' view of word m-scripts is most appropriate for language generation and understanding, the m-script algebra view is more appropriate for learning. If a word with m-script W is used to construct several SMS Z, then they all obey Z >m W.
The strategy for learning a word m-script is to collect a few example SMS (X, Y, Z...), then to m-intersect them together forming W' = X ^m Y ^m Z.. M-intersection of the SMS constructs the m-script W' with the smallest possible scope which contains all of them, which must (by an m-script algebra identity) obey W' >m W ; given six or so examples, coincidental similarities between X, Y, Z.. are rapidly weeded out, and W' is a very good approximation to the word m-script W.
Different strategies for collecting SMS are used at different stages of learning:
The strategies (1) and (2) can be regarded as special cases of (3), which constitutes the most general method for primary word learning.
The secondary learning procedure also uses m-intersection; but as input it uses word m-scripts rather than raw sentences and meaning scripts. This allows it to extract the m-scripts for closed-class morphemes by a `factorisation' process.
These learning mechanisms together give a child the means to learn robustly the m-script for any word in the language, and thus to learn a full adult language. They make many predictions about the nature and course of the learning process; these are compared with observations in section 5.
[1] In deriving the form of Bayes' rule above, we have used the fact that PA(not R) = 1-PA(R) 1.
[2] The value of d depends on the relative penalties of (a) believing the rule if it is not true and (b) not believing the rule if it is true.
[3] Using this best estimate is a heuristic to get to the answer; a more rigorous appoach is to integrate over all possible s(R) between 0 and 1, using probability densities. It gives the same answer.
[4] We assume that the outcome F(t+1) always has an average information content Iavg bits ; if the rule has operated (with probability s), so that F(t+1) s G, then Ie(R) of these bits are accounted for by the rule, and the remaining bits occur with probability 2**(Ie - Iavg). Otherwise, if the rule does not fire (with probability (1-s) the Iavg bits occur with probability 2**(-Iavg).
[5] For larger l, the number would be larger, for simple or complex rules. Therefore (perhaps surprisingly) a large value for l does not bias a learner towards simpler rules; it just makes all rule learning a bit slower.
[6] The design of the experiment ensured that the monkeys knew the general rule before learning the exception; it was top-down learning.
[7] The same variable `?A' appears in both branches of the m-script. This means that the person mentioned before the sound `shout' (in the left branch) is the same as the person doing the shouting (in the right branch meaning structure). The variable `?A' can convey a bare identity between the branches, but the trump link can convey much more - a whole script subtree.