Appendix C: A Fundamental Theorem of Language Learning

This fundamental theorem of language learning, stated in section 3.11, is:

If speakers compose sentences by m-unification of word m-scripts, and people learn words by m-intersection of sentences they hear, then any set of word m-scripts (a language) will propagate stably through generations of speakers.

This theorem is important because without it, there could not be so many diverse yet stable languages in the world; just as without DNA replication, there could not be diverse and stable species. Its proof was sketched in section 3.11; this appendix gives a more detailed (though still not rigorous) derivation. The derivation gives insight into how the whole theory of language generation, understanding and learning hangs together.

We assume (a) that every m-script has a trump on its top script node, and (b) that each word m-script has, in its left branch, a sequence of `sound scenes' one per phoneme, rather than one sound scene for the whole word sound.

I shall first derive the result under simplifying assumptions, then show how they can be relaxed. The simplifying assumptions are:

To understand a simple unambiguous sentence with no `gap' constructs, the processing is as follows:

  1. Form a phoneme sequence script P, with a top holder node and one sound scene per phoneme heard, with time order arrows between the sound scenes.
  2. Select word m-scripts W1, W2 , W3 etc. whose sequences of phonemes match the sequences of phonemes heard, and m-unify them with P, forming successively (P Um Wi ), ((P Um Wi ) Um Wj), etc.
  3. Since two m-scripts will, in general, not m-unify, the order of m-unification is highly constrained. Each word m-script must be m-unified with all scenes of its left branch matched - both the sound scenes, and the meaning scenes (which must match meanings produced by previously m-unified words). All words except nouns require one or more meaning scenes to match their left branch; so the first word to be m-unified must be a noun.
  4. Each word m-unified adds the meaning scene of its right branch to the growing structure. Continue until all phonemes have been consumed by some word, and all meaning scenes except one have been consumed by a word. If you cannot, the sentence is ungrammatical.
  5. The one remaining meaning scene is the meaning of the sentence.

The script algebra implies that m-unification is exactly commutative, A Um B = B Um A, and partially associative (A Um (B Um C)) approx = ((A Um B) Um C). It is `partially' associative in two senses:

This means that if a product (((A Um B) Um C) Um D)... can be formed at all, in any order, then it will be the same irrespective of which order it is formed in, and can be written without all the brackets as (A Um B Um C Um D..), with the constituent m-scripts in any order.

(Note this order of m-unification has nothing to do with the order of words in the sentence. That is defined by time-order constraints between scenes of the SMS, regardless of how the SMS was formed).

For a grammatical sentence, with phoneme sequence script P and words W1 , W2 , ... W3 , the m-unification

Z = (((P Um Wi) Um Wj) Um Wk)... Um Wn)

exists for some order of m-unification, and obeys

Z >mWi

for every word Wi in the sentence. Furthermore, since every phoneme in P has been consumed by some word W, P itself adds no information to Z which is not in the product of the W; so Z can be rewritten without P:

Z = ((Wi Um Wj) Um Wk)... Um Wn)

= (Wi Um Wj Um Wk... Um Wn)

The meaning M of the sentence can be found within Z as the right-hand meaning scene of the last-unified word Wn; it can be identified as the only scene which has no trump links going out of it to other scenes (it has trump links coming into it).

Since this full meaning scene is a part of Z, it follows that

Z >m M

so

Z = Z Um M

= (Wi Um Wj Um Wk Um ...Wn Um M)

Suppose some speaker starts with exactly the meaning M and wishes to express it. M was the result of m-unifying the last-applied word Wn with the penultimate structure ((Wi Um Wj) Um Wk)...); this guarantees that we can form (M Um Wn), which is what the speaker does, as the first step in generation. Subsequent steps in generation are to m-unify the m-scripts for all the words in the sentence, forming

Z' = ((((M Um Wn) Um Wk) Um ....Wi)

= (Wi Um Wj Um Wk m ...Wn Um M)

= Z

Here we have used the `partial associative' property of m-unification to show that the speaker and the listener form the same m-script structure Z in their heads.

The same result can be shown in a less abstract manner, by considering the action of individual word m-scripts in generation and understanding. Suppose a word W has a sound scene and N (typically N= 0, 1, or 2) meaning scenes in its left branch, and one meaning scene in its right branch. There is a trump link from each left-branch meaning scene to the right-branch meaning scene.

In either case, the structure Z so created has N `left branch' meaning scenes S1...SN each connected by a trump link to the `right branch' meaning scene M, and each obeying a trump link relation:

M[i] = W[i] Us Si

These trump link relations define 1:1 relations between their ends; if all the Si are fixed, then M is fixed, and conversely, if M is fixed, then all the Si are fixed.

The full m-script structure Z above is a network of meaning scenes connected by trump links. Most scenes have trump links coming in to them, and one trump link going out. The only scenes with no trump links coming in are noun meanings; the only scene with no trump link going out is the full meaning M.

Because each trump link defines a 1:1 relation between its ends, the structure is over-determined. If you start at the full meaning scene M and work right-to-left (backwards over trump links), you must get the same result as by starting at the noun meaning and working left-to-right (forward over trump links) - because the trump link equations, relating the scripts, are the same in both cases. That is why `Z formed left-right' in understanding must be the same as `Z formed right-left' in generation.

Now consider a child hearing a sentence in which she does not know just one word, but knows all the other word m-scripts; and where she can infer the meaning M from non-linguistic clues. The child can construct the same m-script structure Z by working from both ends towards the middle. If the missing word is a noun, its meaning script can be constructed back from M along the trump links of other words. Otherwise, all meaning scenes in Z can be found by working along trump links from either end.

The only part of Z which cannot be built in this way is the trump links of the unknown word W; Z obeys the trump link equations, but does not have the relevant trump nodes or a trump link. So the child actually constructs a near-copy Z' of Z which, because it does not have the trump nodes, obeys Z' >m Z. Since Z >m W, this implies that Z' >m W.

Suppose the child hears a number of sentences containing this unknown word W. For each sentence Si she constructs an m-script structure Z'i obeying Z'i >m W.

If you m-intersect together a number of structures Z'i, all of which obey Z'i >m W, then the result W' = (Z'1 ^m Z'2 ...Z'n) must also obey W' >m W. As explained in section 3.8, as the number of examples Z' increases, the amount of extraneous information (which survives in W', but is not a part of W) decreases exponentially; and the m-intersection automatically discovers the trump links in W. After about six examples, we can expect all extraneous information to have been eliminated, leaving W' = W; the unknown word will have been correctly learned.

This faithful learning applies to any word W; so if adults use any arbitrary set of word m-scripts W1 , W2 ,... for communication by m-unification as above, and if the child learns by m-intersection, she can learn all the Wi by starting from the nouns and working upwards. This proves the fundamental theorem in the case of no ambiguities, homonyms of gap constructions.

The extension of the theorem to take account of these other complexities is not difficult. Principally, we rely on the robustness of the learning mechanism in rejecting `bad' learning examples Z ,and in being able to pick out a small signal from a large amount of noise.

Homonyms: If the child ever tries to combine examples Z1 and Z2 which refer to different senses of a homonym, the result of this m-intersection will have very small information content in its right-hand meaning branch, so will be rejected by the learning mechanism.

Other Ambiguities: If the child somehow selects the wrong sense of an ambiguous sentence, this too will be rejected by the learning mechanism for too small information content. Generally, it is safest to avoid including ambiguous sentences in the learning set, for slower but safe learning.

Gap Constructs: Here, the m-script structure Z contains scenes which are not determined by the meaning of some noun, and have no incoming trump links; the only meaning they have (in the first stage of understanding) comes from the argument restrictions of the verb. However, if the child has inferred the correct full meaning M, the identity of the gap filler can be got by tracing information backwards along trump links from M to the gap entity, in the `generation' direction.

Appendix D: The Language Learning Program

This appendix describes a program which can understand, generate and learn a fragment of English using the mechanisms of this theory.

The program is written in Prolog, and uses some specific features of LPA Prolog as implemented on the Mac. Since it was developed in 1993/4 and has not been kept up to date, it now seems to work only on obsolete hardware (Mac SE30). Source code is available on request.

A graphical editor can be used to build and display scripts and m-scripts, and was used to generate some of the diagrams in this paper.

The program is based on an implementation of the script algebra and m-script algebra operations, using algorithms similar to (but more complete than) those described in Appendix A. The implementation can be checked by testing that the results of the operations obey the relations of the script algebra and the m-script algebra. The m-script operations can handle the trump link structures needed for word m-scripts, but cannot correctly handle trump link structures of arbitrary complexity.

The program is implemented using m-scripts which differ from the ones shown in this paper in one respect. In stead of having one script node at the top with several scene nodes hanging from it (as in this paper), a word m-script has one 'script holder' node at the top with two script nodes below it - one for the left branch and one for the right. The scenes then hang from these script nodes.With that form, the understanding process is slightly more complex than I now think it needs to be; but that is the way I did it, and I haven't yet had time to modify the program. I used a chart parser mechanism to keep track of time order constraints which I now believe can be done automatically as part of m-unification; but apart from that, there are no real changes in how understanding, generation or learning are done.

In word m-scripts, the sound of one word is represented as one `sound scene' containing the whole sound of the word, rather than as a sequence of `phoneme sound scenes'. Semantic representations of word meanings broadly use the principles of lexical and conceptual semantics as explored by Jackendoff, Levin , Pinker and others, but are rather crude in places; it is not often that the precise details of semantic representation affect the broad principles of language processing and learning (although of course they affect its realisation in many ways).

The Language Understanding Component takes as input typed sentences, typically containing only words whose m-scripts have been defined to the program. It proceeds by m-unification and other operations until a single meaning structure has been found, or some form of failure (eg from an ungrammatical sentence) has occurred. It digests sentences at a roughly linear time cost of 10 - 100 seconds per word on the Mac SE30.

It uses a chart parsing mechanism to control the selection of sound scenes, and meaning scenes from previous word m-unifications, to m-unify with the left branch of the next word to be processed. Word sounds are initially arranged as arcs between neighbouring nodes i and i+1 in a chart of nodes from 0 to N (for an N-word sentence). Nouns are processed first, each giving an entity-describing scene. This scene is inserted as an arc spanning the same two nodes as the noun sound. For homonyms, two or more entity scene arcs span the same pair of nodes.

To m-unify any word, its left-branch scenes (including the word sound scenes, and other meaning scenes) must be matched with contiguous arcs of the chart, and all the constraints of m-unification must be satisfied by the scenes on selected arcs. Then the right-hand branch of the result (the meaning of the word, with its arguments filled in) is put in the chart as a new arc spanning from the start to the end of the contiguous arcs which the word matched.

Verb m-scripts can, when necessary, be m-unified with one or more of their noun argument scenes missing (so consuming a gap). Higher priority is given to verb m-unifications which do not require gaps.

In this way, meaning scenes are built up which span longer and longer stretches of the chart. The sentence is fully understood only when a unique meaning scene spans the whole chart.

The m-unification of words is interspersed, in quasi-parallel fashion, with various other operations:

With this rather ragbag set of techniques, the program is able to handle a wide range of the constructs of adult English, generally getting a sensible meaning structure out at the end - even when two or more complex features combine in the same meaning. It certainly cannot handle all the difficult and intricate cases studied by linguists; but there is no reason to suppose that it could not be progressively refined to do so.

The set of words known to the program is currently about 400, which includes about 80 non-verbs and 10 different m-scripts per verb for each of 30 verbs; and multiple senses for many words, particularly prepositions. Typical of the sentences it can successfully understand are:

You can give the biscuit to me. Fred told Lucy the lamp was broken. Charlie wanted to go home. The boy who Lucy thought ate her cake went home. John will not have hurt himself. The big boy should not have eaten lucy's biscuit. Who did Lucy think ate her biscuit ? Lucy's biscuit was eaten by Fred. Fred cooked and ate the fish. Who went to London ? John. Lucy told Anna to go home when Charlie left. Has every bishop been seen with an actress ?

For the Generation Component of the program, the same word m-script are arranged in a subsumption graph according to their right-hand meaning branches; searching down this graph from the root can give successively better and better approximations to the required meaning at any stage of generation.

The generation component starts with a meaning structure, which may be one got by the understanding component, or may be constructed by hand (and therefore have meanings not fully expressible with the program's limited vocabulary). A search is made down the subsumption graph for the word whose meaning W best captures the meaning M to be expressed, in that:

Sometimes, when no appropriate word is to hand, the program has to choose some very bland word such as `do' or `thing'; but it carries on anyway. The m-script for W is m-unified with M, which creates the word sound scene of W (to be said later) and some new meaning scenes M'. Meaning in M but not in W has been carried back along the trump links into these new meaning scenes. The process is then repeated on each of the new meaning scenes M' , selecting other words to match them, until all meaning has been consumed or lost. At this stage you have a set of word sound scenes, with time order constraints between them (from time-order arrows in the word m-scripts). The words can then be said (or rather, typed out by the program). Various choices have to made using appropriate heuristics and methods:

This generation component can satisfactorily generate the same set of sentences as the understanding component can understand. It behaves gracefully when it cannot fully express the meaning - coming up with something grammatical, useful and not misleading.

The generation and understanding components are both used by the learning component, which we can now describe.

In a typical run of the learning program, one or more words are excised from its full vocabulary, and it is presented with a few sentences, each of which contains an unknown word. It is also given a meaning structure, which is the actual meaning of the sentence, but with some amount of random extra meaning slots and nodes tacked on. These simulate things the child may notice or infer about the scene, which are not part of the speaker's meaning. For each such sentence-meaning pair, the learning procedure does these steps:

  1. From the provided meaning scene, generate as far as possible, using only m-scripts for known words which were heard in the sentence. This consumes words at either end of the sentence, leaving a stretch of words in the middle which are not used up (the unknown word is within the stretch of unused words); and one un-consumed meaning scene which corresponds to those words. This meaning scene forms the right branch of a learning example script.
  2. From the stretch of unused words, understand as far as possible using the m-scripts of the known words. This yields a number of meaning scenes, and a word sound scene for the unknown word, which form the left branch of the same learning example script.

The program constructs one of these learning example scripts for each of the given sentence-meaning pairs, and m-intersects them together. It uses the `broad' m-intersection criterion, inferring a trump link between nodes i and j whenever S[j] s S[i] for all learning examples.

(Note that this differs from the procedure used in the proof of the fundamental theorem of language learning in Appendix C, in that a heavily pruned structure is constructed from each sentence-meaning pair, rather than the full sentence-meaning m-script Z containing all other word m-scripts, as in the proof of the theorem; but this pre-pruning just makes the m-intersection operation easier, without changing the result. Without pruning, the m-intersection operation will still remove all the extra example-specific structure.)

The result of running the program is that this learning algorithm reliably discovers the correct m-script for words of any part of speech. M-intersection finds just the structures in the left and right branches (given about 4 or more learning examples) with occasional extra slots, and then finds all the trump links - never proposing spurious ones.

The program has also been run in `bootstrap' mode starting from zero vocabulary, like an infant. When fed carefully-chosen example sentence-meaning pairs, starting with nouns, it has bootstrapped itself to a vocabulary of 50 words, and could easily go further.

This learning simulation has not simulated the Bayesian significance criterion in any detail, but has relied only on the `six examples' rule of thumb which emerges from the Bayesian calculation. Nor has it yet simulated the selection of appropriate examples from the many inappropriate examples a child hears during the day, and the information-theoretic criterion for rejecting inappropriate examples (eg homonyms). However, I believe that if these components of the program were to be built, they would work. The Bayesian statistics which show how they work are not complex and do not make any implausible assumptions.

The program shows that the tight interconnection between language generation, understanding and learning via the m-script algebra works not just in theory, but also in a running computer implementation. Something like this implementation might also be found in the human brain.