Language acquisition has been high on the agenda of cognitive science for forty years, shaping theoretical linguistics and informing many child language studies. There are many theoretical ideas about parts of the acquisition process. Yet there are very few fully-articulated, workable models of first language learning and even fewer have been compared with a wide range of data. The notable exception is Pinker's (1984, 1989) theory of language acquisition.
This paper describes a new broad-scope theory of language learning, with the following features:
The last point is the most important. I have compared the theory with over 100 key facts about language acquisition, finding good, unforced agreement in the vast majority of cases. In many cases the theory can give a clear, crisp account of facts which are puzzling in most current language learning theories. These comparisons are summarised in Table 5.1 on page 47. I believe no other theory of language learning can claim such broad agreement with the facts.
There is a shorter version of this paper (of about 40 pages) available from the author, or visible at http://public.logica.com/~wordenr, which you may wish to read first - to get an overview of the theory, and to see some of the key comparisons with data. This full version gives a more complete description of the theory, and contains all 101 comparisons with data.
The theory of language learning is part of a larger theory of language evolution, learning and performance. I need to describe other aspects of this theory for three reasons:
These descriptions of the non-learning aspects are kept as short as possible, to keep the focus on the learning theory (section 3), and the comparisons with data (section 5).
Theories of language learning have been polarised between two camps:
This theory does not fit neatly in either camp. It does not posit de novo language-specific structures or learning mechanisms in the brain, nor does it rely on broad ill-defined learning mechanisms; it proposes a Bayesian learning mechanism, evolved specifically for primate social intelligence and extended for language, with a precise mathematical structure. This structure underpins the robustness, expressiveness and diversity of languages.
The mathematics of the theory is in three linked parts: (1) Script algebra, which is the discrete mathematics of feature structures, and will be familiar to many computational linguists; (2) M-script algebra, which extends this to functions on feature structures (as used, for instance, in categorial grammars); and (3) Bayesian learning theory, which is a simple application of probability theory. None of them is complex, or requires anything beyond school maths; but the power and self-consistency of the learning theory hinge on them. This mathematical/computational basis is established in sections 2 and 3.
While the maths is elementary, it may be unfamiliar and inaccessible to some. Fortunately, many of its important consequences can be understood by a simple analogy to chemistry, which I shall develop alongside the maths, in highlighted paragraphs.
Section 2 describes the theory of language understanding and generation. It is a unification-based theory, where sentence meanings are feature structures, built up by successive unifications of meaning elements. Many syntactic constraints are constraints on the unifiability of feature structures. The theory is comparable with other unification-based grammars such as LFG and GPSG, showing that it has similar power - and can handle complex features of many adult languages. Like them, it has a reversible model of language understanding and generation.
The theory is fully lexicalised; every word (or word sense) is represented in the brain by a structure called an m-script, which embodies all the syntax and semantics of the word. There are no separate phrase structure rules, transformations or parameters. Therefore if we can learn the m-scripts for words, we can learn a language.
Since our knowledge of child language acquisition comes only from studies of language production and comprehension, we need a theory of production and comprehension in order to compare the learning theory with data. Effects which have been attributed to learning limitations can often be understood as production effects - arising from children's strategies for speaking with limited vocabularies. The model of language production is a key part of the theory.
Section 3 describes the process for learning the m-script for each word, and how this leads to `bootstrap' learning of a language. Many of the background assumptions are as in Pinker's theory - for instance, that the child learns by hearing sentences in contexts where he can infer their meaning non-linguistically. But the statistical and mathematical basis of learning is different.
It is a Bayesian learning theory, which can be shown to give optimum learning performance. The learning procedure projects out common structure from examples (rejecting random extra noise), and has a Bayesian criterion of sufficient evidence. This means it can learn the m-script for any word from a few noisy examples. It can gather implicit negative evidence and learn from it.
Section 4 discusses the evolutionary origins of language, and the processes of historic language change; there are parallels between the two. I propose that the capacity to use scripts (which underlie language meanings) evolved to support primate social intelligence; so they have a 20 million year evolutionary history and require a fast, robust learning mechanism. M-scripts arose more recently, in part to support a primate theory of mind.
Language learning allows word m-scripts to reproduce and propagate through a population of speakers, and so to evolve (as a form of Dawkins' `memes'). They evolve to maximise the speed and efficiency of communication, and evolve much faster than the brains which use them. This accounts for many prominent features of language (such as approximate regularity, grammatical subjects, and the Greenberg-Hawkins universals) as the results of language change (m-script evolution) rather than innate features of the human brain.
Section 5 compares the predictions of the learning theory with observations. I first discuss some general properties of language acquisition (such as its speed, robustness, and approximate order of acquisition). I then discuss particular observations, in the order: acquisition of the lexicon, phrase structure, morphology, complement-taking verbs, auxiliaries, alternating verb arguments, pronouns and movement, and finally bilingual language acquisition.
For the majority of these 101 comparisons, the m-script theory is in good unforced agreement with the data, not requiring extra assumptions. Where extra assumptions are required, they do not strain credibility. I have found no major conflicts between the theory and the data. However, I have not been able, in the time and space, to examine the data as thoroughly as, for instance, Pinker (1984) does in his comparisons; much work remains to be done for a full evaluation of the theory. Nevertheless, the initial indications from these comparisons are positive.
Section 6 compares this theory with other theories of language learning. I discuss Pinker's (1984, 1989) theories (which have much in common with the m-script theory, and from which I have borrowed the treatments of some phenomena), then discuss Principles & Parameters theories, Connectionist theories, Slobin's Operating Principles, and Siskind's computational model of lexical acquisition.
Section 7 concludes, summarising the main results of this work.
There are four appendices: (A) describing algorithms for the m-script operations which underpin the theory; (B) showing that Bayesian learning gives optimal performance; (C) deriving a fundamental theorem of language learning; and (D) describing the computer program which implements the theory.
2.1 Representing Language Meanings as Scripts
2.2 Script Functions to Build Language Meanings
2.3 Word M-scripts and Language Processing
2.4 Processes of Understanding
2.5 The Process of Generation
2.6 Procedural Skills of Language
2.7 Segmenting the Sound Stream
2.8 Retrieval of Words from Memory
This section steps back from the problem of language learning, to describe a theory of language understanding and generation. This defines what has to be learned, to learn a language. The theory is unification-based, with similarities to LFG (Kaplan & Bresnan 1981), categorial grammars (Oehrle et al 1988), and GPSG (Gazdar et al 1985), but with distinctive features of its own. It has a simple mathematical structure.
Using the mathematics, the essence of the theory of language, and of the learning theory, can be stated quite simply; so I ask the reader's patience in setting up this elementary mathematics (or, for the non-mathematical, the chemical analogy).
It may be worth reading the shorter version of the paper to get an overview of the language processing mechanisms before reading this more systematic description.
In this theory of language, the meaning of every sentence is described by a symbolic structure called a script. A script is a kind of feature structure, as used in much computational linguistics (e.g. Shieber 1986; Pollard & Sag 1987), and in lexical and conceptual semantics (e.g. Levin & Pinker 1991; Jackendoff 1991). For instance, scripts are directly comparable to the uncommitted f-structures of LFG.
I assume that when understanding a sentence, we form the script representation in our heads, before forming other representations such as mental images and mental models (Johnson-Laird 1983). Similarly, to speak a sentence, we first form its script meaning structure in our heads [1].
It is proposed that the first script representation capacity evolved some 20 million years ago to support primate social intelligence (see section 4).
A script is a tree-like structure made of a few different types of nodes; each node contains information encoded in the values of various slots. They are pure trees, with no structure-sharing. However, slot values can be variables, and different slots on one script may have the same variable value - a form of value-sharing.
The basic form of the script representation derives from the work of Schank (1972, 1977) and Nelson (1985). The choices of tree structure, slots and values used in this representation are broadly derived from the lexical semantics of Jackendoff (1991), Pinker (1989), Talmy (1985) and others (e.g. in Levin & Pinker 1991). There are many similarities to the structures used in cognitive semantics (Lakoff 1987; Langacker 1991).
Warning: The script structures used in this paper are intended to be illustrative, not definitive. I have not worked hard to find the very best all-round script representation, or to incorporate all the insights from Jackendoff, Pinker, and others. As this is an evolving theory, the choices of nodes and slots may even change a little between examples. I believe that the script structures used in the brain are actually rather more complex than those illustrated here. However, in most cases, the form of the learning theory, and the tests of it, do not turn on fine details of the script representation; where they do, it will be noted.
An example of a script encoding a simple meaning, `Charlie broke the doll' is shown in figure 1 below. This example is taken from a program, described in Appendix D, which implements the language theory described here.
Figure 1: A typical sentence meaning structure. Heavy lines denote the script tree structure. The light lines connect each script node to a box showing the information at that node, as slot:value pairs.
Node types are marked inside the node circles, as follows - sr:script; se:scene; en:entity; pr:property. There are only these node types. Slots and their values are shown in the boxes attached to nodes. For most slots, the value is one of a small set of allowed values.
In this simple script, most of the action is defined in one scene, the top 'se' node, which designates an event [des:event]. This is a bounded transitive act [act: act2] which certainly happened [pol:certain]. Reading off the nodes below that from left to right, they represent:
(1) Charlie - the agent in this scene, and also the topic; human, masculine, and third person.
(2) The doll - a three-dimensional object with the property of being lifelike. It is the patient of this scene. It is given a variable identity ?C in order to refer to it elsewhere.
(4) The scene which results from Charlie's action; in which the same doll (identity ?C) goes into a state of being broken. Breaking is a causative verb.
(5) The time at which this scene happens, in the past.
Because script trees can be indefinitely broad and deep, they can express an unbounded set of meanings - as is required for language.
Some aspects of meaning cannot be expressed directly in scripts, because there are no innate slots in the script representation capable of expressing them. For instance, the full description of an artefact such as a bicycle or a hammer cannot be expressed in a script. We assume that some script slots may contain links to other meaning representations about hammers - such as mental images, or the procedural skills of using them - which can be accessed after language understanding, or before production, but do not form a central aspect of the language faculty itself.
An important property of scripts (not illustrated in this example) is that a script may represent a sequence of events in a partial or total ordering. Then each event is represented by one `scene' node below a `script' node, with time-order constraints (represented by arrows) between the scene nodes defining the allowed time ordering. This time-ordering relates closely to constituent-order rules in grammars.
We adopt a simple model-theoretic semantics for these scripts, which does not address (for instance) some subtle distinctions of meaning needed for embedded attitude reports, nested quantifiers and so on, as are addressed by Montague semantics (Dowty et al 1981), situation semantics (Barwise & Perry 1983) and other formalisms. These distinctions are not important for many tests of the learning theory, but will need to be addressed at some future time.
In this simple model-theoretic semantics, each script represents a set of possible social situations. A simple script with one scene describes a social situation in which the events of that scene are going on, and in which other things, not mentioned in the script, may also be going on. Similarly, a script with two scenes linked by a time-order arrow describes an extended situation in which those two scenes occur in that time sequence, and other scenes may occur as well.
In this sense any script denotes an infinite set of possible social situations. This set is called the scope of the script; the scope of script S is written as sigma(S).
Note: the three key script operations are denoted by symbols which should be like the set-theory operations of set inclusion, union and intersection, followed by subscripts s for script operations, m for m-script operations. For this hypertext version, I use >s for script inclusion, ^s for script intersection, and Us for script unification. Similarly I shall use >m , ^m and Um for the corresponding m-script operations. I shall also use > ^ and U (without 'subscripts') for set operations.
There are three key operations on scripts (inclusion, unification, and intersection), which are defined in terms of the model-theoretic semantics, and can be done by symbolic manipulations of the tree structures. Their definitions are:
Script A includes script B (written as A >s B) if A contains all the information in B, and possibly more. Script inclusion is defined in terms of scope sets:
A >s B iff sigma(A) < sigma(B) (2.1)
That is, A includes B if any situation described by A is also described by B. Intuitively, A must contain all the information in B, and may contain more; but it does not contradict any information in B. Therefore script inclusion is the inverse of subsumption for feature structures. With this inversion, the words `subsumption' and `inclusion' can be freely interchanged [2].
Although script inclusion is defined in terms of the infinite scope sets, it can be tested by looking at the tree structures of scripts A and B, to check that all the information in B is also in A. This algorithm is described in Appendix A.
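The structural test for inclusion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Appendix A algorithm: the `Node` class, the `role` slot and the example scripts are hypothetical, and the sketch ignores variable slot values and time-order arrows. It checks that every node and slot of B can be found in A, trying different pairings of child nodes by backtracking.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                     # 'sr', 'se', 'en' or 'pr'
    slots: dict = field(default_factory=dict)     # slot name -> value
    children: list = field(default_factory=list)  # child Nodes


def includes(a: Node, b: Node) -> bool:
    """Return True if a >s b, i.e. all the information in b is also in a."""
    if a.kind != b.kind:
        return False
    # Every slot of b must be present in a with the same value.
    if any(k not in a.slots or a.slots[k] != v for k, v in b.slots.items()):
        return False
    return _embed(a.children, b.children)


def _embed(a_kids: list, b_kids: list) -> bool:
    """Match each child of b to a distinct child of a, with backtracking."""
    if not b_kids:
        return True
    first, rest = b_kids[0], b_kids[1:]
    for i, candidate in enumerate(a_kids):
        remaining = a_kids[:i] + a_kids[i + 1:]
        if includes(candidate, first) and _embed(remaining, rest):
            return True
    return False


# Hypothetical scripts: `specific` carries all the information in `general`.
specific = Node("se", {"des": "event", "act": "act2"}, [
    Node("en", {"role": "agent", "sex": "male"}),
    Node("en", {"role": "patient"}),
])
general = Node("se", {"des": "event"}, [Node("en", {"role": "patient"})])

print(includes(specific, general))  # True
print(includes(general, specific))  # False
```

The more specific script includes the more general one but not the reverse, which mirrors the inversion in (2.1): more information means a smaller scope set.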
The language learning theory is a Bayesian theory, depending on probabilities of events and prior probabilities of rules. These probabilities figure in the definitions of the two remaining script operations.
For any script S, there is a probability P(S) that the events described in the script take place within some time interval; this is a rough measure of the 'size' of the scope set sigma(S). We can also define an information content I(S), which can be calculated by adding up the information content of all slots on S. A simple slot `sex', with two equally likely values `male' and `female', adds 1 bit to I(S); similarly a time-order arrow adds 1 bit to I(S). The probability and the information content are approximately related by
P(S) = 2**-I(S) (2.2)
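As a small worked example of (2.2), the following Python sketch adds up slot information and converts it to a probability. It assumes, beyond what is stated above, that a slot with n equally likely values contributes log2(n) bits (which gives 1 bit for the two-valued `sex' slot); the function names and the example slot counts are hypothetical.

```python
import math


def information_content(values_per_slot, time_arrows=0):
    """I(S) in bits: log2(n) per slot with n equally likely values, 1 per arrow."""
    return sum(math.log2(n) for n in values_per_slot) + time_arrows


def probability(bits):
    """Equation (2.2): P(S) = 2 ** -I(S)."""
    return 2.0 ** -bits


# Hypothetical script: two binary slots, one four-valued slot, one time-order arrow.
bits = information_content([2, 2, 4], time_arrows=1)
print(bits, probability(bits))  # 5.0 0.03125
```

Each extra bit of information halves the probability, and so roughly halves the 'size' of the scope set.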
The unification of scripts A and B, written as A Us B, is the simplest script which combines all the information in both of them. It is defined in terms of script inclusion:
C = A Us B is the script with smallest information content I(C) satisfying both C >s A and C >s B. (2.3)
So the scope set sigma(C) is a subset of sigma(A), and of sigma(B); it is the largest possible subset expressible as a script. If sigma(A) and sigma(B) do not overlap, C is not defined.
If it exists, what does C = (A Us B) look like? C has all the structure - nodes and slots - of both A and B, but they are matched together to get maximum overlap of nodes and slots - making C as small as possible. Both A and B can be seen within C.
Script unification is very like unification in Prolog (Clocksin & Mellish 1979), or unification of feature structures (Pollard & Sag 1987), and involves some trial-and-error matching of nodes to get the best possible overlap between A and B. There is a fairly simple algorithm for script unification, described in Appendix A.
When two scripts A and B are unified, the information in the result is a set union of the information in the two inputs; any defined slot in either A or B must appear in the result.
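The trial-and-error matching can be illustrated by a brute-force Python sketch. It is a simplified stand-in, not the Appendix A algorithm, and rests on several assumptions: the `Node` class and the `role`, `sex` and `person` slots are hypothetical, variables and time-order arrows are ignored, information content is approximated by counting one unit per node and per slot, and unification fails only when the two root nodes clash (clashing child nodes are simply kept side by side).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    kind: str
    slots: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def size(node: Node) -> int:
    """Crude stand-in for I(S): one unit per node and per slot."""
    return 1 + len(node.slots) + sum(size(c) for c in node.children)


def unify(a: Node, b: Node) -> Optional[Node]:
    """Smallest script containing both a and b, or None if the roots clash."""
    if a.kind != b.kind:
        return None
    if any(a.slots[k] != b.slots[k] for k in a.slots.keys() & b.slots.keys()):
        return None
    return Node(a.kind, {**a.slots, **b.slots},
                _merge_children(a.children, b.children))


def _merge_children(a_kids: list, b_kids: list) -> list:
    """Try every pairing of children and keep the smallest combined result."""
    if not b_kids:
        return list(a_kids)
    first, rest = b_kids[0], b_kids[1:]
    best = [first] + _merge_children(a_kids, rest)  # keep `first` unmatched
    for i, candidate in enumerate(a_kids):
        merged = unify(candidate, first)
        if merged is None:
            continue
        option = [merged] + _merge_children(a_kids[:i] + a_kids[i + 1:], rest)
        if sum(map(size, option)) < sum(map(size, best)):
            best = option
    return best


a = Node("se", {"act": "act2"}, [Node("en", {"role": "agent", "sex": "male"})])
b = Node("se", {"pol": "certain"}, [
    Node("en", {"role": "agent", "person": "third"}),
    Node("en", {"role": "patient"}),
])
c = unify(a, b)
print(size(c), len(c.children))  # 9 2
```

The two agent nodes overlap and are merged into one, so the result has two entity nodes rather than three; every slot defined in either input appears in the result.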
Chemical Analogy : The unification of two scripts is like a chemical compound got by combining two atoms or molecules. The shared nodes and slots are like the shared valence electrons.
Chemical energy = - log(probability) = Information content. The two scripts (atoms or molecules) combine in the minimum-energy, minimum information configuration.
Minimum energy = maximum likelihood.
The intersection of scripts A and B, written A ^s B, projects out all the information which A and B have in common and loses all other information. It, too, is defined from script inclusion:
D = A ^s B is the script with largest information content I(D) satisfying both A >s D and B >s D. (2.4)
This means that the scope set sigma(D) contains sigma(A), and contains sigma(B); it is the smallest scope expressible in a script which does this. It always exists, because a script with no information has the largest possible scope set, which contains all others.
Script intersection is like feature structure generalisation, or generalisation in Prolog [3]. It is done by a simple algorithm, described in Appendix A.
Script intersection is the central operation of the language learning theory, so it is worth illustrating by an example, shown in figure 2.
Figure 2: Illustration of script intersection. Information on nodes is denoted as slot:value pairs.
This shows the intersection of two simple scripts. It is done by matching them together node by node, retaining only information which is common to both, and trying out different node pairings to maximise the amount of retained information. Points to note are:
Script intersection is used in language learning to project out the common meaning of a word from several instances of its use.
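A brute-force Python sketch of this projection is given below. It is an illustration under assumptions, not the Appendix A algorithm: the `Node` class, the `role`, `class`, `sex` and `name` slots and the second example sentence are hypothetical, variables and time-order arrows are ignored, and retained information is approximated by counting one unit per node and per slot. Two nodes of different types are taken to share nothing, so the function returns None for them.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    kind: str
    slots: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def size(node: Node) -> int:
    """Crude stand-in for I(S): one unit per node and per slot."""
    return 1 + len(node.slots) + sum(size(c) for c in node.children)


def intersect(a: Node, b: Node) -> Optional[Node]:
    """Largest structure common to a and b; None if the node types differ."""
    if a.kind != b.kind:
        return None
    shared = {k: v for k, v in a.slots.items()
              if k in b.slots and b.slots[k] == v}
    return Node(a.kind, shared, _pair_children(a.children, b.children))


def _pair_children(a_kids: list, b_kids: list) -> list:
    """Try every pairing of children and keep the one retaining most."""
    if not a_kids or not b_kids:
        return []
    first, rest = a_kids[0], a_kids[1:]
    best = _pair_children(rest, b_kids)  # option: leave `first` unmatched
    for i, candidate in enumerate(b_kids):
        common = intersect(first, candidate)
        if common is None:
            continue
        option = [common] + _pair_children(rest, b_kids[:i] + b_kids[i + 1:])
        if sum(map(size, option)) > sum(map(size, best)):
            best = option
    return best


# Two hypothetical uses of `broke': `Charlie broke the doll', `Mummy broke the cup'.
s1 = Node("se", {"des": "event", "act": "act2"}, [
    Node("en", {"role": "agent", "class": "human", "sex": "male"}),
    Node("en", {"role": "patient", "class": "object", "name": "doll"}),
])
s2 = Node("se", {"des": "event", "act": "act2"}, [
    Node("en", {"role": "patient", "class": "object", "name": "cup"}),
    Node("en", {"role": "agent", "class": "human", "sex": "female"}),
])
d = intersect(s1, s2)
print(d.slots, [k.slots for k in d.children])
```

The result keeps the scene with a human agent and an object patient and discards who the agent was and what was broken, which is the kind of common meaning that several uses of one word would share.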
Chemical Analogy : As unification is like chemical synthesis, script intersection is like analysis.
In intersection, two script/molecules are compared with each other, and the output molecule is their largest shared substructure.
The terms `include', `unify' and `intersect' have a simple link to set theory, as applied to the information in scripts. Because more information implies a smaller scope set, and vice versa, the relations to scope sets show an inversion. For inclusion, the inversion is evident in equation (2.1): script A includes B if the scope of B includes the scope of A. For intersection and unification, the corresponding relations are:
sigma(A Us B) < sigma(A) ^ sigma(B) (2.5)
sigma(A ^s B) > sigma(A) U sigma(B) (2.6)
The inversion between ^ and U is evident in the equations.
These simple set-theoretic relations lead to important relations between results of script operations - similar to the relations of elementary set theory. Typical relations of this script algebra are :
A ^s B = B ^s A (2.7)
A Us (A ^s B) = A (2.8)
A Us (B Us C) = approx (A Us B) Us C (2.9)
A ^s (B Us C) = approx (A ^s B) Us (A ^s C) (2.10)
Some of these relations are only approximate; but in practical examples, the approximate relations turn out to be nearly precise, typically to within one or two slots or nodes difference. For most purposes we shall assume they are exact.
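The relations (2.7) - (2.10) can be checked directly in a deliberately simplified setting. The Python sketch below assumes single-node scripts, represented as plain dictionaries of slot:value pairs, so that unification is a consistent union of slots and intersection is the set of shared slots; in this flat case the relations hold exactly. The function names and slot names are hypothetical.

```python
def unify(a: dict, b: dict):
    """Union of slot:value pairs, or None if a slot has conflicting values."""
    merged = dict(a)
    for slot, value in b.items():
        if slot in merged and merged[slot] != value:
            return None
        merged[slot] = value
    return merged


def intersect(a: dict, b: dict) -> dict:
    """Slot:value pairs common to both scripts."""
    return {k: v for k, v in a.items() if k in b and b[k] == v}


A = {"sex": "male", "class": "human"}
B = {"class": "human", "role": "agent"}
C = {"role": "agent", "person": "third"}

print(intersect(A, B) == intersect(B, A))                     # (2.7)
print(unify(A, intersect(A, B)) == A)                         # (2.8)
print(unify(A, unify(B, C)) == unify(unify(A, B), C))         # (2.9)
print(intersect(A, unify(B, C))
      == unify(intersect(A, B), intersect(A, C)))             # (2.10)
```

All four lines print True for these examples. For full script trees, where node matching is involved, (2.9) and (2.10) are only approximate, as noted above.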
These three operations, of inclusion, unification and intersection, are at the core of the theory of language. They are familiar from the study of feature structures and unification grammars (Shieber 1986; Pollard & Sag 1987), but their algebraic structure, in equations like (2.7) - (2.10), is perhaps not so well known. This algebraic structure is crucial for the self-consistency of the learning theory. It can be understood simply in the chemical analogy:
Chemical Analogy : The relations of the script algebra have analogues which are common-sense `theorems' of chemistry:
An important type of script is a rule script, used to describe some causal regularity. A rule script has at least a cause scene and an effect scene, with a time-order arrow between them; the rule states that if the cause scene takes place, then the effect scene is likely to follow with a certain probability P, defined as part of the rule. Variable slot values are important in rule scripts, to state regularities involving `the same individual' in both cause and effect scenes. If a particular script S is an example of a rule script R in action, then S >s R; all the variable identities in R are replaced by fixed identities in S.
Rule scripts are proposed to be an important element of primate social intelligence - the way primates represent important regularities of social life, and predict social outcomes. Examples of these rule scripts are discussed in (Worden 1996b). However, they are not powerful enough to express language rules, and it is worth describing why they cannot.
A rule script may be regarded as a function from scripts to scripts - a function which, given a script containing just the cause scene C, delivers a script containing just the effect scene E. If the rule script has some variable slot values (in both its cause and effect scenes), and the cause scene fixes these variables, they must then have the same values in the effect scene; thus the effect E depends on the cause C. This function from scripts to scripts may be written as E = f(C). However, it is not a very powerful function, since the only things that can change in the result E are the values of slots, each of which has only a small finite set of possible values. So there can only be a small finite set of values for the result E. As the function f can only deliver a finite set of values, whatever its argument, it is called a bounded function.
A key property of language is its unboundedness - its ability to express an unbounded set of meanings from a finite set of tokens. So if language meanings are delivered by some form of function application, it is clear that the required functions cannot be bounded functions. That is why rule scripts are not powerful enough, as functions, to perform the operations of language.
In this theory, each word of a language is denoted by a function between scripts - a function which, given some scripts as arguments, delivers another script as result. The result script contains both the meaning of the word itself, and other meanings from the arguments. Each word script function combines these meanings to build up more complex meaning scripts.
Sentence meanings are built up by successive applications of word functions, which convert word sounds to meanings. For instance, to understand `Angry Fred shouts', first a word function f for the word `Fred' uses the sound `Fred' as argument, and delivers the noun meaning script F, denoting Fred: f(`Fred') = F. An adjective word function a for `angry' then acts on this noun meaning, a(`angry', F) = M, delivering a script M describing Fred in a state of anger; then finally, the function h for an intransitive verb `shouts' acts on M, h(M, `shouts') = S, to deliver the meaning script S for the sentence `Angry Fred shouts'.
The succession of states of the sentence is then:
`angry' `Fred' `shouts' (state 1)
`angry' F `shouts' (state 2)
M `shouts' (state 3)
S (state 4)
where a word script function converts from each state to the next - leading finally to the full meaning.
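The succession of states can be mimicked with ordinary functions. The Python sketch below is purely illustrative: scripts are nested dictionaries, and the function bodies, the slot names (`name`, `state`, `role`) and the value `shout` are hypothetical stand-ins for whatever the real word functions deliver. The argument order of each call follows the text above.

```python
def entity(**slots):
    """Build a bare entity node with the given slots."""
    return {"kind": "en", "slots": slots, "children": []}


def f_fred(sound):
    """Noun function: f(`Fred') = F."""
    return entity(name="Fred")


def a_angry(sound, noun):
    """Adjective function: attach a property node below the noun meaning."""
    prop = {"kind": "pr", "slots": {"state": "angry"}, "children": []}
    return {**noun, "children": noun["children"] + [prop]}


def h_shouts(subject, sound):
    """Intransitive verb function: wrap the subject meaning in a scene."""
    agent = {**subject, "slots": {**subject["slots"], "role": "agent"}}
    return {"kind": "se", "slots": {"act": "shout"}, "children": [agent]}


F = f_fred("Fred")         # state 1 -> state 2
M = a_angry("angry", F)    # state 2 -> state 3
S = h_shouts(M, "shouts")  # state 3 -> state 4
print(S)
```

Each function passes on everything in its argument and adds something of its own, so the result grows with the argument instead of being drawn from a small fixed set of values.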
This functional view of language has been extensively developed in Categorial Grammars (Oehrle et al 1988).
There is always some `last' word function application, which delivers the full meaning of a whole sentence (typically the last word applied is the main verb). The set of possible sentence meanings is unbounded, so we need unbounded script functions to build up full sentence meanings.
The required script functions are called m-scripts. An m-script looks very much like a script, with one notational extension; but their semantics are rather different and their three main operations, while analogous to the three script operations of unification, inclusion and intersection, are more powerful.
Possible evolutionary origins for m-scripts are described in section 4. Here we shall just describe the formal and mathematical properties of m-scripts, as required to describe their role in language.
There is a close parallel between scripts and m-scripts. The key to this parallel lies in defining a model-theoretic semantics for m-scripts, analogous to that for scripts.
Where a script represents a set of situations, an m-script represents a set of scripts. The set of scripts denoted by an m-script M is called its scope, sigma(M).
An m-script looks much like a script, but may also have trump nodes, and trump links. Trump nodes are nodes labelled !A, !B etc., and a trump link is a curved arrow from one trump node to another.
To define what set of scripts is represented by an m-script M (i.e. to define its scope set), denote the nodes in its tree structure by i = 1,2,...m, and denote the subtree of M with its root at node i by M[i]. Similarly denote subtrees of scripts.
Consider an m-script M with one trump node, at node i. Script S is in sigma(M) (i.e. S is in the set of scripts represented by M) if above the trump node i, S = M (i.e. they have exactly the same nodes, and the same information on each node), and the subtrees below the trump node obey
S[i] >s M[i] (2.11)
That is the definition of a trump node. So below the trump node, S must have at least as much information as M, but may have extra nodes and slots - any amount of them.
So the scope of an m-script with a trump node is an infinite set of scripts; the m-script denotes that infinite set. An example of an m-script with one trump node, and some of the scripts in its scope set, is shown in figure 3.
Figure 3: a basic m-script with one trump node, and a few of the scripts which it denotes.
This m-script has one trump on node 3. Therefore, for a script to be within its scope, the script must have no extra structure above that node, or below other nodes (such as 2) which are not below node 3. But it may have any amount of extra structure below node 3, as long as that structure includes the structure already below node 3 in M. All these scripts obey S[3] >s M[3], and therefore also obey S >s M.
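The scope test for a single trump node can be sketched in Python as follows. This is an illustration under assumptions, not the Appendix A algorithm: the `Node` class with its `trump` flag, the slot names and the example m-script are hypothetical, the m-script is assumed to have only one trump node, and above the trump node the children of S and M are assumed to appear in the same order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str
    slots: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    trump: bool = False  # True marks a trump node (used on m-scripts only)


def includes(a: Node, b: Node) -> bool:
    """a >s b: every node and slot of b can be found in a."""
    if a.kind != b.kind:
        return False
    if any(k not in a.slots or a.slots[k] != v for k, v in b.slots.items()):
        return False
    return _embed(a.children, b.children)


def _embed(a_kids: list, b_kids: list) -> bool:
    if not b_kids:
        return True
    first, rest = b_kids[0], b_kids[1:]
    return any(includes(c, first) and _embed(a_kids[:i] + a_kids[i + 1:], rest)
               for i, c in enumerate(a_kids))


def in_scope(s: Node, m: Node) -> bool:
    """Is script s in sigma(m)? Exact match above the trump, (2.11) below it."""
    if m.trump:
        return includes(s, m)
    if (s.kind != m.kind or s.slots != m.slots
            or len(s.children) != len(m.children)):
        return False
    return all(in_scope(sc, mc) for sc, mc in zip(s.children, m.children))


# Hypothetical m-script: node 1 is the scene, node 2 the agent, node 3 a trump.
M = Node("se", {"act": "act2"}, [
    Node("en", {"role": "agent"}),
    Node("en", {"role": "patient"}, trump=True),
])
S_below = Node("se", {"act": "act2"}, [
    Node("en", {"role": "agent"}),
    Node("en", {"role": "patient", "dim": "3d"}, [Node("pr", {"is": "lifelike"})]),
])
S_elsewhere = Node("se", {"act": "act2"}, [
    Node("en", {"role": "agent", "sex": "male"}),
    Node("en", {"role": "patient"}),
])
print(in_scope(S_below, M), in_scope(S_elsewhere, M))  # True False
```

Extra structure is accepted only on or below the trump node; an extra slot on node 2 takes the script outside the scope.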
In practice, m-scripts with a lone trump node are not of much interest for language, and this example was only used to introduce the idea of trump nodes. To be useful, they must be linked together.
Consider an m-script M with two trump nodes i and j, and a trump link from node j to node i. For a script S to be in the scope of M, S and M must, as before, be equal above the trump nodes. Below the trump nodes, the two subtrees of S must obey S[i] >s M[i] and S[j] >s M[j], as before. They must also obey
S[i] = S[j] Us M[i] (2.12)
That is the constraint implied by the trump link between nodes i and j. By this equation, each trump link requires a unification - which (in this theory) is what introduces the unification into unification-based grammars.
Trump links are central to the theory of language, so they will be illustrated by a simple example, shown in figure 3.
Figure 3: Example of an m-script and its scope set - some of the scripts which it represents.
This example shows a simple m-script, M, and a few of the scripts S1, S2, S3, ... in its scope set sigma(M) - the scripts that it denotes. The nodes of M have been numbered to make the scope definition clear. M has a trump link from node 2 to node 3, so any script S in its scope must obey S[3] = S[2] Us M[3], where S[3] is the subtree of S below node 3, etc. You can check that the three examples given obey this constraint.
This definition extends straightforwardly to an m-script with several trump links; see Appendix A for details.
So while S can contain extra information (beyond that in M) below both trump nodes, with a trump link the extensions below nodes i and j are not independent. The subtrees S[i] and S[j] must be related by a unification. This means that if the subtree below one trump node is fixed, then the subtree below the other end is also fixed by the trump link. This works in either direction; S[j] is a script function of S[i], or conversely S[i] is a function of S[j]. That is how m-scripts act as script functions.
There is one constraint on well-formed m-scripts. This is that an m-script must be within its own scope. Therefore the subtrees of any m-script must obey its own trump link equations.
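The way a trump link turns an m-script into a function, and the well-formedness constraint, can both be shown in a flattened Python sketch. It assumes that the subtrees below the two trump nodes are single nodes, written as dictionaries of slot:value pairs, so that Us reduces to a consistent union of slots; the function names and the `angry' example are hypothetical. Only the forward direction of (2.12), from S[j] to S[i], is shown.

```python
def unify_flat(a: dict, b: dict):
    """Us for single-node subtrees: union of slots, None on a clash."""
    if any(a[k] != b[k] for k in a.keys() & b.keys()):
        return None
    return {**a, **b}


def apply_trump_link(m_i: dict, s_j: dict):
    """Equation (2.12), read forwards: S[i] = S[j] Us M[i]."""
    return unify_flat(s_j, m_i)


def is_well_formed(m_i: dict, m_j: dict) -> bool:
    """An m-script lies in its own scope, so M[i] = M[j] Us M[i]."""
    return unify_flat(m_j, m_i) == m_i


# Hypothetical m-script for `angry': node j accepts any entity,
# node i is the same entity with the extra slot state:angry.
m_j = {}
m_i = {"state": "angry"}

s_j = {"name": "Fred", "sex": "male"}
s_i = apply_trump_link(m_i, s_j)
print(s_i)
print(is_well_formed(m_i, m_j))
```

Whatever is supplied below node j reappears below node i together with the word's own contribution, so the output is not confined to a small finite set of values.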
The definition of trumps and trump links defines the scope sigma(M) of any m-script M. Then, just as for scripts, we can define three key m-script operations in terms of the scopes:
An m-script G m-includes m-script H (written as G >m H) if any script in the scope of G is also in the scope of H; that is, if G contains more information about allowed scripts in its scope than H:
G >m H iff sigma(G) < sigma(H) (2.13)
The relation G >m H can only hold if G has all the structure (nodes and slots) of H, possibly more. The structure of H can be seen within G. But G cannot have any extra trump nodes, unless they are on or beneath trump nodes of H; and G may have extra trump links, not in H.
The m-unification of m-scripts M and N, written as M Um N, is the simplest m-script which combines all the information in both of them. It is defined from m-inclusion:
P = M Um N is the m-script with smallest information content I(P) which satisfies both P >m M and P >m N. (2.14)
The information content of an m-script is its usual script information content plus a term for trump structure. An m-script with bigger information content has smaller scope, and vice versa. If two m-scripts have identical script trees, the one with fewer and lower trump nodes, and more trump links, has a smaller scope set and so has greater information content. A 'top trump' on the root node has less information content than any other trump configuration, because it allows the greatest freedom to expand the structure.
Two m-scripts cannot necessarily m-unify, if their scope sets do not overlap. In particular, if plain unification (ignoring trump nodes) does not work for them, then m-unification will not work either.
If it exists, what does P = M Um N look like? Just as for ordinary script unification, P has all the structure (nodes and slots) of both M and N; M and N can both be found within the structure of P. As for ordinary script unification, N and M are matched so as to maximise their overlap, making P as simple as possible. Any trump node of P must be on or below a trump node of M, and below a trump node of N. Finally, all the trump links of both M and N must appear in P. This means that P may have extra nodes and slots (not required in ordinary unification) just to make it satisfy all its own trump link relations.
The m-intersection of m-scripts M and N, written as M ^m N, projects out all the information which M and N have in common and loses all other information. The m-intersection always exists. It is defined from m-inclusion:
Q = M ^m N is the m-script with largest information content I(Q) which satisfies both M >m Q and N >m Q. (2.15)
Again, the result of an m-intersection looks much like the result of a script intersection. Q = M ^m N only has those nodes and slots which appear in both M and N, but M and N are matched so as to maximise this overlap, making Q as big as possible. If either N or M has a trump node, then Q must have one in the same place; and Q can only inherit a trump link from M if N has one in the same place. However, crucially, m-intersection can `discover' trump nodes and trump links, creating them where they were not before:
This creation of trump links by m-intersection is a vital part of language learning.
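The following Python sketch illustrates the idea of `discovering' a trump link from two examples. It is a toy under strong assumptions, not the algorithm of Appendix A: scripts are flattened to slot-value dictionaries, each example is a hypothetical (left entity, right meaning) pair, and the criterion used (the right-hand script includes the whole left-hand script in every example) is an assumption made for illustration only.

```python
def intersect(a, b):
    """Toy script intersection: keep only slots with the same value in both."""
    return {s: v for s, v in a.items() if s in b and b[s] == v}

def includes(big, small):
    """True if every slot and value of `small` also appears in `big`."""
    return all(s in big and big[s] == v for s, v in small.items())

def m_intersect_examples(ex1, ex2):
    """Generalise two (left_entity, right_meaning) examples.

    Plain intersection loses whatever differs between the examples.
    A trump link is proposed (hypothetical criterion) when, in both
    examples, the right-hand meaning contains the whole left-hand entity.
    """
    left = intersect(ex1[0], ex2[0])
    right = intersect(ex1[1], ex2[1])
    link = includes(ex1[1], ex1[0]) and includes(ex2[1], ex2[0])
    return {"left": left, "right": right, "trump_link": link}

# Two illustrative observations of the word `hungry' (invented data).
fred = ({"animate": True, "name": "Fred"},
        {"animate": True, "name": "Fred", "hungry": True})
mary = ({"animate": True, "name": "Mary"},
        {"animate": True, "name": "Mary", "hungry": True})

result = m_intersect_examples(fred, mary)
```

In this toy case the names are lost by plain intersection, but the proposed link records that whatever appears below the entity on the left must reappear on the right.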
Note the close analogy between these definitions and the definitions of the script operations. Because the m-script operations are defined set-theoretically, in the same way as the script operations, they obey algebraic relations precisely analogous to those of the script algebra; these form the m-script algebra.
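The set-theoretic definitions (2.13) to (2.15) can be tried out on a very small finite universe. In the Python sketch below, everything concrete is an assumption made for illustration: a `script' is just an (animal, mood) pair, an m-script is a pair of constraints in which None means unconstrained, and scope size stands in for information content (smaller scope, more information).

```python
from itertools import product

ANIMALS = ("cat", "dog")
MOODS = ("hungry", "happy")
UNIVERSE = set(product(ANIMALS, MOODS))          # all toy scripts

# Hypothetical family of expressible m-scripts: (animal, mood) constraints.
FAMILY = list(product((None,) + ANIMALS, (None,) + MOODS))

def scope(m):
    """sigma(M): the set of toy scripts allowed by the constraints in m."""
    animal, mood = m
    return frozenset(s for s in UNIVERSE
                     if animal in (None, s[0]) and mood in (None, s[1]))

def m_includes(g, h):
    """G >m H iff sigma(G) is a subset of sigma(H), as in (2.13)."""
    return scope(g) <= scope(h)

def m_unify(m, n):
    """(2.14): least-information P with P >m M and P >m N, or None."""
    candidates = [p for p in FAMILY
                  if scope(p) and m_includes(p, m) and m_includes(p, n)]
    return max(candidates, key=lambda p: len(scope(p)), default=None)

def m_intersect(m, n):
    """(2.15): most-information Q with M >m Q and N >m Q; always exists."""
    candidates = [q for q in FAMILY if m_includes(m, q) and m_includes(n, q)]
    return min(candidates, key=lambda q: len(scope(q)))
```

For example, m_unify(("cat", None), (None, "hungry")) combines both constraints, while m_unify(("cat", None), ("dog", None)) returns None because the scope sets do not overlap.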
Chemical Analogy: This goes over from script to m-script operations unchanged; m-unification is like chemical synthesis, and m-intersection is like analysis. Now we can bring the analogy closer to its use in language:
While the m-script operations are defined set-theoretically in terms of their scopes, they can be calculated by algorithms operating on the m-script tree structures. These algorithms are more complex than those for script operations, but always start by doing the script operations. They are described in Appendix A.
Strictly, the m-script operations act only between m-scripts, not between scripts and m-scripts. However, by treating a script as a special m-script, we can m-unify scripts and m-scripts together.
In the theory of language, each word is defined by an m-script (call it W), which has two branches - a left branch and a right branch - hanging from a top script node. The left branch consists of one or more scene nodes (and all the structure below them) and the right branch is just the rightmost scene node with the structure below it. These two branches will be denoted by L(W) and R(W). There may be zero, one or more trump links, each one directed from a node in the left branch to a node in the right branch. An example with a trump link is shown in figure 6 below.
For a script in the scope of this m-script, if the left branch is known, then the right branch is fixed by the trump links; and vice versa. Therefore the m-script acts as a function from scripts to scripts. It is:
Therefore m-scripts are functions from scripts to scripts, which can deliver, as results, the arbitrarily complex meanings of language. They are applied left-to-right in language understanding, and right-to-left for generation.
To apply an m-script M as a left-to-right function f on a script S, proceed as follows:
To apply the same m-script M right-to-left to a script T, do the same, but in step (1) check that T >s R(M); then the m-unification P = M Um T(top) adds a left branch L(P) = f^-1(T), calculating the inverse script function.
We can regard an m-script as a function with several arguments, X = f(A, B, ...). Here A, B, etc. are all scenes below a top script node of a script S, and the m-script is applied to S as before. This is how a verb m-script works with one argument each for the agent, patient, etc. When these m-scripts are applied right-to-left, they calculate all the `argument' scripts A, B, etc. from the one `result' script X.
I shall illustrate by an example, how (a) an m-script for each word embodies all the syntactic and semantic constraints on the use of that word, as well as its meaning, and (b) language understanding and generation are done by the repeated application of these word m-script functions, to build up or dissect meaning scripts.
First consider the simplest m-scripts in any language, the m-scripts for nouns and proper nouns. A typical one of these is shown in figure 5.
Figure 5: M-script F defining the meaning of the name 'Fred'
This has a left branch script and a right branch script, so it is a function from scripts to scripts - albeit a very trivial function. As it has no trump links, it is a bounded function, which can only deliver one result scene - the scene describing the individual Fred (its right-hand branch). As its argument (the left-hand branch) it has just a single scene containing the sound `Fred'.
This script can be used to convert from sounds to meanings in a simple, one-word, manner. When hearing the sound `Fred', applying the script function left-to-right delivers the meaning script F which denotes Fred: f(`Fred') = F. Alternatively, given the same meaning script F, the function can be applied right-to-left, to deliver the sound `Fred': f^-1(F) = `Fred'.
Next consider the m-script function for the adjective `hungry'. This is shown in figure 6.
Figure 6: M-script G for the sound "Hungry"
This m-script, like all other word m-scripts, can be regarded as a function h from its left-hand branch to its right-hand branch. This time, however, as it has a trump link, it is a non-trivial function.
It is a partial function, and can only be applied to scripts which include its left branch. To do this, the argument (left-hand branch) must have a scene containing the sound hungry, followed by a scene containing an animate entity. There may be any amount of extra information below the animate entity node.
When the function is applied, the trump link requires a unification, between nodes a and b (see figure 6):
S[b] = S[a] Us M[b] (2.16)
The result S[b] combines what we knew before about the entity (in the argument script S[a]) with new information in M[b] (the property node for being hungry - see figure 6).
So the m-script for hungry can add the property node for hunger to a meaning script S (of arbitrary complexity) describing some animate entity.
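Equation (2.16) can be sketched in Python under a strong simplification: scripts are flattened to slot-value dictionaries, so unification is a merge that fails when two values of the same slot clash. The slot names and the helper apply_hungry are hypothetical, chosen only to mirror the nodes a and b of figure 6.

```python
def unify(a, b):
    """Toy script unification: merge slots, or return None on a clash."""
    out = dict(a)
    for slot, value in b.items():
        if slot in out and out[slot] != value:
            return None
        out[slot] = value
    return out

M_A = {"animate": True}                   # what the left branch requires at node a
M_B = {"animate": True, "hungry": True}   # what the m-script holds at node b

def apply_hungry(s_a):
    """Left-to-right application: S[b] = S[a] Us M[b].

    The function is partial: it applies only if S[a] includes M[a].
    """
    if not all(s_a.get(slot) == value for slot, value in M_A.items()):
        return None
    return unify(s_a, M_B)

fred = {"animate": True, "name": "Fred", "sex": "male"}
hungry_fred = apply_hungry(fred)
```

Whatever extra information the argument carries (here the name and sex slots) passes through to the result, which is the sense in which the function is unbounded.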
We can now illustrate the whole process of understanding the phrase `hungry Fred', shown in figure 7.
Figure 7: the script Z created in understanding "hungry Fred". Z is a sentence-meaning structure (SMS).
We start off having heard the sounds, with just two scenes, describing the sounds hungry and Fred. Understanding then proceeds by successive m-unifications (equivalently: successive function applications) to build up more complex meaning structures.
Denote this whole script by Z, and its four subtrees, below the four scene nodes, by Z[1] ... Z[4]. The full script structure Z is called a Sentence-Meaning Structure (SMS) since it contains both the word sounds of the sentence, and the meaning script.
Initially we have just the two heard sounds, Z[1] = Fred and Z[2] = hungry. Then we apply the m-script for the word `Fred', calculating Z' = Z Um F and so adding Z[3] = f(Z[1]) to the structure; and finally we apply the m-script for `hungry', calculating Z'' = Z' Um G and so adding Z[4] = h(Z[2], Z[3]). Both m-script applications simply add structure to Z, and both m-scripts can be seen in the final structure of Z (figure 7). Z[4] is the meaning script resulting from the understanding process.
For language generation, we start with the meaning script Z[4], and apply the same word m-scripts right-to-left. First we apply the m-script for hungry, calculating Z' = Z Um G and so adding Z[2] and Z[3]; then we apply the m-script for `Fred', calculating Z'' = Z' Um F and so adding Z[1]. This gives us scripts for two word sounds (Z[1] and Z[2]) which can then be said.
Note that in both language understanding and generation, the same SMS (Z in figure 7) is built by m-unification. In understanding, we start with just the sounds Z[1] and Z[2], and calculate the meaning Z[4]; in generation, we go the other way round, from the meaning Z[4] to the sounds Z[2] and Z[1]. The end result is the same in both cases.
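The two directions can be traced in a small Python sketch. It uses the numbering of the understanding walkthrough (Z[1] and Z[2] are the sound scenes for Fred and hungry, Z[3] is the meaning of Fred, Z[4] is the final meaning). The two-word lexicon is hard-wired and the flat slot-value dictionaries are an assumption; this is an illustration of the bookkeeping, not of m-unification itself.

```python
FRED_MEANING = {"animate": True, "name": "Fred"}   # right branch of F
HUNGRY_PROPERTY = {"hungry": True}                 # property added by G

def understand(fred_sound, hungry_sound):
    """Left-to-right: from the two heard sounds to the meaning Z[4]."""
    if (fred_sound, hungry_sound) != ("Fred", "hungry"):
        raise ValueError("sounds not covered by this two-word lexicon")
    z = {1: fred_sound, 2: hungry_sound}
    z[3] = dict(FRED_MEANING)                  # Z[3] = f(Z[1])
    z[4] = {**z[3], **HUNGRY_PROPERTY}         # Z[4] = h(Z[2], Z[3])
    return z

def generate(meaning):
    """Right-to-left: from the meaning Z[4] back to the two sounds."""
    z = {4: dict(meaning)}
    if meaning.get("hungry") is not True:
        raise ValueError("meaning does not match the `hungry' m-script")
    z[2] = "hungry"                            # sound scene for `hungry'
    z[3] = {s: v for s, v in meaning.items() if s not in HUNGRY_PROPERTY}
    if z[3] != FRED_MEANING:
        raise ValueError("remaining meaning does not match `Fred'")
    z[1] = "Fred"                              # sound scene for `Fred'
    return z

heard = understand("Fred", "hungry")
said = generate({"animate": True, "name": "Fred", "hungry": True})
```

Both calls end with the same four-scene structure, although the scenes are filled in a different order.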
Chemical Analogy: The SMS is a molecule that we synthesise by m-unification of word atoms. Each word atom can be seen within the structure of the molecule. Speaker and listener both build up the same molecule in their heads. You can synthesise the SMS molecule in various different orders (as the speaker and listener do) and still get the same molecule; communication is faithful.
This example illustrates several features which remain true for understanding and generation of much more complex sentences in adult language:
It will not yet be obvious that this function application model can cope with the full complexities of adult language. To provide some evidence that it can, I shall do two things:
The m-script formalism can be compared to any unification-based grammar formalism, but for definiteness I shall use Lexical Functional Grammar (LFG) (Kaplan & Bresnan 1982). In terms of LFG, each word m-script embodies a phrase structure rule and a lexical entry, including all its functional equations. To see how this happens, consider the m-script for a more complex word, `gives', shown in figure 8.
Figure 8: M-script for the word "gives"
Again, this m-script is a function from its left-hand side to its right-hand side - to be applied left-to-right for language understanding (to build meaning scripts) or the inverse direction for generation (to break them down). On its left-hand side, as well as the script for the sound `gives', it requires as arguments three meaning scripts - one each for the donor, the recipient and the gift.
The right hand branch expresses the meaning of `A gives B C' - roughly that A acts on B, with the result that B possesses C. The unifications required by the trump links mean that, whatever script structures appear in the arguments representing A, B and C, these same structures will appear in the full meaning script.
This shows how the m-script formalism is roughly equivalent in power to LFG, and can be expected to handle the same range of language features. That supports the key claim of this paper, that if you can learn m-scripts, you can learn a language.
Each word m-script has one or more scenes in its left branch, and only one meaning scene in its right branch. These scenes can all be classified into three main types:
In terms of this classification, the description of the gives m-script above is:
[entity] gives [entity] [entity] <-> [event]
We can discern some standard shapes of the different parts of speech in a language, as follows:
Noun, pronoun: ball <-> [entity]
Adjective: green [entity] <-> [entity]
Article: the [entity] <-> [entity]
Preposition: [entity] on [entity] <-> [entity]
Preposition: [event] to [entity] <-> [event]
Adverb: boldly [event] <-> [event]
Simple transitive verb: [entity] hits [entity] <-> [event]
Complement-taking verb: [entity] tells [entity] [event] <-> [event]
Auxiliary: [entity] can [event] <-> [event]
These are summaries of the number and type of scene nodes in left and right branches of the word m-scripts. The order of scenes in the left-hand branch is not significant, but has been given in English-like form.
Chemical Analogy: When combining word atoms to form SMS sentence molecules, an event scene on the left branch of one word must match with an event scene from the right branch of another; similarly for entity scenes. This means that the atoms and molecules have a kind of two-valued, directed valency which must be respected in synthesis.
Like chemical valency, it is not always precisely satisfied. In particular, verb m-scripts are often m-unified (in language understanding) when some of their required entity scenes are not present, and these `gap' entities must be determined later.
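The valency constraint can be sketched as a simple type check in Python. The table below copies the shapes listed above; the function apply_word and its error behaviour are hypothetical, and the sketch ignores `gap' entities, word order and everything below the scene nodes.

```python
# word -> (scene types required on the left branch, type of the right branch)
SHAPES = {
    "Fred":   ([], "entity"),
    "ball":   ([], "entity"),
    "green":  (["entity"], "entity"),
    "the":    (["entity"], "entity"),
    "on":     (["entity", "entity"], "entity"),
    "to":     (["event", "entity"], "event"),
    "boldly": (["event"], "event"),
    "hits":   (["entity", "entity"], "event"),
    "tells":  (["entity", "entity", "event"], "event"),
    "can":    (["entity", "event"], "event"),
    "gives":  (["entity", "entity", "entity"], "event"),
}

def apply_word(word, arg_types):
    """Return the result scene type if the argument types fit the word."""
    needed, result = SHAPES[word]
    # The order of left-branch scenes is not significant, so compare sorted.
    if sorted(arg_types) != sorted(needed):
        raise TypeError(f"{word!r} needs {needed}, got {list(arg_types)}")
    return result

# `Fred hits the green ball', built up from the nouns outwards.
fred = apply_word("Fred", [])
ball = apply_word("the", [apply_word("green", [apply_word("ball", [])])])
sentence = apply_word("hits", [fred, ball])
```

An attempt such as apply_word("boldly", ["entity"]) raises TypeError, the analogue of an unsatisfied valency.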
You may wish to sketch for yourself the process whereby a sentence such as `Fred gives the boy a banana' is understood or generated (ignoring for a moment the articles). In understanding, all the nouns are processed first using simple m-scripts like that in figure 5; this assembles the three arguments required for the `gives' m-script above, which then m-unifies with them to build the full meaning structure. This looks like the right-hand branch of figure 8, with the three unknown entities filled in by unifications. Because the `gives' m-script has time order arrows on its left-hand side, the noun phrases for the three entities involved must appear in that time order in the sentence. For generation, the process goes in reverse, leading to the same SMS.
If, instead of `banana', we had `the lollipop he found in the park', the full meaning structure for this noun phrase would be assembled as the `gift' entity before applying the `gives' m-script. Being an unbounded script function, `gives' can pass over an arbitrarily complex entity script, as a description of the gift.
For a language with weaker word-order constraints, such as Latin, some or all of the time-order arrows on the left-hand side would be missing, allowing several word orders; but there might correspondingly be stronger agreement constraints (e.g. requiring that the donor be marked as an agent).
The m-script formalism differs from many unification-based grammar formalisms, such as LFG, in three respects:
Apart from these differences, which will be discussed later, it is equivalent in power to most unification-based grammars and can handle the same range of language constructs. How it does so is illustrated in the next subsection.
The syntax and semantics of any part of speech can be expressed in m-scripts, as illustrated above. Sentences of any complexity can be understood by the same mechanism of repeated m-unifications, one for each word. Each m-unification adds to the SMS and builds a part of the meaning script by unifying elements of meaning together, and this process can build meaning scripts of arbitrary depth and complexity.
To handle typical sentences of adult language, m-unification is the core process, but is supplemented in various ways:
These methods have all been tested in a program which can use a vocabulary of about 400 words to understand or generate a wide variety of English sentences, containing many of the complex constructs of adult language. Understanding time is roughly linear in sentence length, and there seem to be no hard limits on the vocabulary size or sentence complexity it could handle. More details of the program are given in Appendix D.
The synchronous processing of semantic and syntactic information, with no branching at ambiguities, has advantages of psycholinguistic plausibility. It enables us to form meaning structures in real time mid-way through hearing a sentence, in spite of structural ambiguities - as introspectively we know we do.
Since language is so often used in conditions of haste, high background noise and interference, the handling of ambiguity - without incurring a combinatorial explosion of analyses, and arriving reliably at the most likely interpretation - is a very important practical aspect of the theory. All types of ambiguity (structural, word sense, missing or unclear words) are handled by the same two-stage mechanism:
A maximum likelihood choice of interpretations can be shown to give best average performance (Worden 1995, and Appendix B) and, as we will see below, has a very close link to the learning theory, which is also Bayesian.
Chemical Analogy: In cases of ambiguity, the listener may combine the same set of word atoms in two or more different ways to get different sentence molecules. Or for homonyms, he may have a choice of two word m-script atoms for the same word sound.
In these cases, the listener may continually analyse the candidate molecules as they are forming, to find the shared common substructure of all candidates; and then later use other knowledge to find the preferred minimum energy configuration (the maximum likelihood interpretation).
Language generation is done by the same basic mechanism as language understanding (m-unification, to build up an SMS, but in approximately reverse order). In outline, the process is as follows:
Start from a meaning script, which is the meaning you intend to convey. Choose a word to m-unify with it, matching the meaning to the right branch of the word m-script. This adds new scenes corresponding to the left branch of the word m-script (including its word sound scene, and other meaning scenes). Elements of meaning are passed back into the new meaning scenes by the trump links. In turn, these new meaning scenes are consumed by m-unifying them with other word m-scripts.
In this way, a full SMS is built, consuming all the original meaning and creating word sound scenes for all the words to be said. The SMS contains the necessary time-order constraints between words, limiting the orders in which they can be said.
In one sense, generation is simpler than understanding, because one starts from a known meaning structure, and never has to cope with structural ambiguities or missing information. However, it poses two new problems:
The first problem is particularly acute for children with small vocabularies; how they handle it determines what they say, and so influences much of the data we have on child language learning. It is therefore important to understand the generation mechanism, and its impact on child language learning data.
The challenge facing the child is to say `the truth, the whole truth and nothing but the truth' (although `truth' may be intended meaning, rather than literal truth on some occasions!). We can characterise these three criteria in script terms:
(A) The truth : At each stage, the meaning script of the selected word should not contradict the intended meaning script - e.g. they should not have different values of the same slot.
(B) The whole truth: The meaning script should not omit any nodes or slots in the intended meaning, unless those nodes and slots are below a trump node, and so will be `passed across' on trump links to be expressed by other words later in the generation process.
(C) Nothing but the truth: The meaning script of the selected word should not add any nodes and slots which are not in the intended meaning.
With a finite vocabulary, these three criteria cannot be precisely satisfied at once. At any stage in generation, for any candidate word, we can make a measure of the mismatches between the word and the intended meaning of types (A), (B) and (C), in information-theoretic terms (i.e. the number of bits of mismatch of each type), and choose the word which minimises a weighted sum of these mismatch scores.
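A minimal Python sketch of this word choice follows. It departs from the text in labelled ways: scripts are flat slot-value dictionaries, mismatches are counted in slots rather than bits, and the weights and the small vocabulary are invented for illustration. The parameter passed_on is a hypothetical stand-in for slots below a trump node, which other words will express later.

```python
def mismatch(word_meaning, intended, passed_on=()):
    """Count mismatches of types (A), (B) and (C) in slots, not bits."""
    # (A) the truth: same slot, different value.
    a = sum(1 for s, v in word_meaning.items()
            if s in intended and intended[s] != v)
    # (B) the whole truth: intended slots the word omits and nothing passes on.
    b = sum(1 for s in intended
            if s not in word_meaning and s not in passed_on)
    # (C) nothing but the truth: slots the word adds.
    c = sum(1 for s in word_meaning if s not in intended)
    return a, b, c

def choose_word(vocab, intended, weights=(3.0, 1.0, 2.0), passed_on=()):
    """Pick the word minimising the weighted sum of mismatch counts."""
    def score(word):
        a, b, c = mismatch(vocab[word], intended, passed_on)
        return weights[0] * a + weights[1] * b + weights[2] * c
    return min(vocab, key=score)

intended = {"kind": "dog", "size": "big", "colour": "brown"}
vocab = {
    "dog": {"kind": "dog"},
    "cat": {"kind": "cat"},
    "puppy": {"kind": "dog", "size": "small"},
    "animal": {},
}
best = choose_word(vocab, intended)
```

With the full vocabulary the sketch settles on `dog'; if that word is removed, the less specific `animal' wins over the contradicting `cat' and `puppy', a small-scale version of graceful behaviour with an inadequate vocabulary.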
Thus at each stage, the speaker chooses the best-matching word according to these criteria, and applies it by m-unification. There is a further criterion in each word choice; each step of m-unification strips out as much meaning as possible from the meaning passed across on trump links, and which remains to be said; but it is also possible to use words which re-introduce some of this stripped-out meaning to say some things redundantly.
Thus language generation is a process of successive approximation, with word choices to be made at each stage. I have described a simple `discrepancy minimisation' model of how these choices are made, which seems to work well in the computer implementation of the theory (it generates acceptable versions of all the sentences which the program can understand, and behaves gracefully when its vocabulary is inadequate; see Appendix D).
However, in reality the choice of words will be guided by many pragmatic factors, above all by those elements of meaning which the speaker most needs to convey accurately. For children with small vocabularies, these factors may have important effects on production-based measures of language learning.
Because words do not match the intended meaning exactly, language generation may modify the original meaning script - adding to it in places, and leaving out some of its meaning. Nevertheless, the SMS which is formed by generation contains all the m-scripts for the words spoken, and is the same as the SMS formed by the hearer.
The main thrust of this paper is that every language consists of a set of word m-scripts; therefore, if we can somehow learn word m-scripts, we can learn a language.
For this to be strictly true, we would require two conditions to hold:
I believe that (1) holds to a good approximation, and I have illustrated it by simple examples above. (2) is also a moderately good approximation, because:
However, (2) is certainly not 100% accurate. There is also a `periphery' of procedural skills in language use, which we may not assume are innate or universal, and which may therefore need to be learnt. Some of these are:
So there is quite a diverse list of these procedural skills. While some of them may be subject to innate constraints, we may also be fairly sure that some or all of them are learnt, and improve with practice. The mechanisms for learning them are expected to be rather different from the mechanism (described in section 3 below) for learning the declarative structure of word m-scripts.
The theory does not yet address the learning of these procedural skills; in a few of the comparisons with language learning data (section 5), where additional assumptions need to be made, these assumptions are often about the procedural skills and their acquisition.
It might seem, from the description so far, that this model of language requires some separate solution to the problem of word segmentation, so that word m-scripts with a single `sound scene' describing the sound of the word, like those in figures 5, 6 and 8, may be used for language understanding and generation. However, this is not the case. Suppose that we have some way of segmenting the sound stream into smaller units (such as syllables, phonemes, diphones or whatever) which in turn can be put together to make words. Then a word m-script, instead of having just one `sound scene' describing the sound of the word, may have a sequence of contiguous sound scenes describing its sequence of phonemes.
The models of generation and understanding will work just as well with this modified sequence-of-phonemes form of m-script; and crucially, the sequence-of-phonemes form solves the problem of sound segmentation for language understanding. The child need only observe a sequence of phonemes, and word m-unification automatically segments them into words. As we shall see below, the sequence-of-phonemes representation also solves the problem of sound segmentation in learning.
Therefore the sequence-of-phonemes form of m-scripts is probably the form used in the brain. However, for notational simplicity, I shall continue to use the `single word sound' notation for most word m-scripts.
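The segmentation effect can be imitated with an ordinary dictionary-based dynamic programme. This Python sketch is a stand-in technique, not m-unification: it matches only the sound sequences of known words, ignores meaning, and returns the first complete segmentation it finds rather than the most likely one. The phoneme symbols and lexicon are invented.

```python
def segment(phonemes, lexicon):
    """Split a phoneme sequence into known words, or return None.

    best[i] holds one segmentation of phonemes[:i], if any was found.
    """
    n = len(phonemes)
    best = [None] * (n + 1)
    best[0] = []
    for i in range(n):
        if best[i] is None:
            continue                      # no way to reach position i
        for word, seq in lexicon.items():
            j = i + len(seq)
            if j <= n and best[j] is None and tuple(phonemes[i:j]) == seq:
                best[j] = best[i] + [word]
    return best[n]

# Hypothetical lexicon: word -> sequence of (pseudo-)phoneme symbols.
LEXICON = {
    "hungry": ("h", "u", "ng", "g", "r", "ee"),
    "Fred": ("f", "r", "e", "d"),
}
stream = ["h", "u", "ng", "g", "r", "ee", "f", "r", "e", "d"]
words = segment(stream, LEXICON)
```

The word boundaries fall out of matching the stored phoneme sequences; no separate segmentation step is applied to the stream.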
For both generation and understanding, it is necessary to retrieve rapidly and efficiently just a few relevant word m-scripts from the many thousands stored in the brain.
For language understanding, this is not a problem; some kind of associative retrieval, based on the sound of words, will do the job.
For production, this will not do, as the word sound is not initially known. However, if word m-scripts are stored in a subsumption graph (a tree of scripts, where the script stored at any node subsumes any scripts stored on its descendants - i.e. is included in them) then at any stage of production, the search for good candidate words can be done by starting at the root of the graph, and descending it to find successively better approximations to the intended meaning.
This graph search takes time of order log(N), where N is the size of the vocabulary, and so is fast even for large vocabularies. This enables us to choose words to express what we mean (minimising the mismatch from the intended meaning, as discussed above) in real time.
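The descent can be sketched in Python as follows. The tree, the flat slot-value scripts and the rule for choosing among children (take the child with the most slots that is still included in the intended meaning) are all assumptions for illustration; the sketch uses strict inclusion rather than the weighted mismatch measure, and the logarithmic cost relies on the tree being reasonably balanced with bounded branching.

```python
class Node:
    """A word stored in the subsumption tree with its (toy) meaning script."""
    def __init__(self, word, meaning, children=()):
        self.word = word
        self.meaning = meaning
        self.children = list(children)

def includes(specific, general):
    """True if `specific' contains every slot and value of `general'."""
    return all(specific.get(s) == v for s, v in general.items())

def find_word(root, intended):
    """Descend from the root towards the best approximation of `intended'."""
    node = root
    while True:
        # Only children whose meaning is included in the intended meaning.
        fitting = [c for c in node.children if includes(intended, c.meaning)]
        if not fitting:
            return node.word
        node = max(fitting, key=lambda c: len(c.meaning))

# Hypothetical vocabulary; each parent's script is included in its children's.
tree = Node("thing", {}, [
    Node("animal", {"animate": True}, [
        Node("dog", {"animate": True, "kind": "dog"}),
        Node("cat", {"animate": True, "kind": "cat"}),
    ]),
    Node("ball", {"animate": False, "kind": "ball"}),
])

word = find_word(tree, {"animate": True, "kind": "dog", "size": "big"})
```

Each step inspects only the children of one node, so most of the vocabulary is never examined.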
[1] To be precise, the process of language generation can sometimes alter the script meaning structure; see below.
[2] I have used the term inclusion (rather than subsumption) by analogy with set inclusion, to emphasise the close link between the script operations and set theory.
[3] I have used the term `intersection' rather than `generalisation' to maintain the clear link to set theory. When two scripts A and B are intersected together, the information in the result is like a set intersection of the information in the inputs. Any slot and value must appear in both A and B to appear in the result.