This section contains brief remarks on other theories of language acquisition, comparing them with the m-script theory.
Pinker's (1984, 1989) theory of language acquisition is, to my knowledge, the only other broad-range theory which has been defined with some computational precision and systematically compared with a wide range of acquisition data. The 1984 theory is a broad treatment of many aspects of acquisition, and the 1989 theory works within this framework to address the puzzle of alternating verb argument structures. The relevant data are always examined critically and thoroughly, especially where they seem to challenge the theory - much more thoroughly than I have been able to examine the data in this paper.
Pinker's theory has many features in common with this theory, but there are also important differences. To list first the main points of resemblance:
So the m-script theory owes a large debt to Pinker's work. The main differences are:
I will claim - inviting readers to judge for themselves - that in each of the differences (1) - (5) this theory is an advance over Pinker's.
(1) is a difference of economy of hypothesis. One broad unified theory is to be preferred over a collection of sub-theories, if it can fit the data. The comparisons of section 5 show that this theory does fit the data, with many confirmations and very few counter-examples. Whether it can continue to do so, after the kind of in-depth comparisons with the data that Pinker has made, remains to be seen.
Again in (2), a fully lexicalised theory seems to score over `separate syntax' theories in economy of hypothesis. We know that children have to learn very many word meanings; if they can learn the syntax of their language by the same mechanism, rather than needing separate mechanisms, this is more economical both for the theory and for them. But again, the real question is: does a fully lexicalised theory fit the data ? In general, a non-lexicalised theory tends to predict some sharp `watershed' events on the day a child acquires some key rule of syntax, and there is no evidence for such watersheds (C5). The fully-lexicalised nature of this theory makes other significant, confirmed predictions in (C6), (C7), (C9), (F1), (I3) and (I4).
In difference (3), one-memory theories of learning have, I believe, usually been adopted as a theoretical convenience, and not for any cognitively or biologically-motivated reasons. The one-memory restriction makes computational theories of learning simpler to think about, but has no other rationale. There is evidence that many animal species can make abstractions and generalisations from experience. This usually requires the comparison of several learning examples; why not allow it for language learning ?
Human memory capacity is very large, and there is no reason why children should not store hundreds or thousands of learning examples (scripts for sentence-meaning pairs) in their minds - and selectively retrieve just the few which are relevant to one word, when they are ready to learn it.
The advantage of this multi-example learning is seen most clearly in difference (4). In its categorical, all-or-nothing learning procedures, Pinker's theory is haunted by a `one false move' prediction. If the acquisition of a strategic piece of syntax - such as a major phrase structure rule - can be triggered by a single example, what is to stop a single mis-analysed or misheard example from leading to a false move which sends the child off down a blind alley ? Learning examples need to be `vetted' very carefully; too rigorous vetting would inhibit all learning, while too loose vetting leads to many false moves. The most effective kind of vetting (waiting for several examples to accumulate) is ruled out by the one-memory assumption.
In contrast, the Bayesian criterion built into the m-script theory allows the child to wait until several examples show that a rule is statistically significant, in the presence of noise, before adopting it; and even then, there are no strategic, language-wide learning decisions to take, because syntax is fully lexicalised. This makes language learning very robust.
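The idea of waiting for evidence to accumulate can be illustrated with a minimal sketch. It is not the criterion used in the m-script model; it is a generic odds-ratio update, and every number in it (the prior odds, the two agreement probabilities and the adoption threshold) is an assumption chosen only for illustration.

```python
import math

# All numbers below are illustrative assumptions, not values from the m-script model.
PRIOR_ODDS = 0.01          # a candidate rule is a priori unlikely to be real
P_AGREE_IF_RULE = 0.9      # a real rule still fails on about 10% of noisy examples
P_AGREE_IF_CHANCE = 0.3    # probability that an example agrees by coincidence
THRESHOLD_ODDS = 20.0      # adopt the rule only when the odds exceed 20:1


def log_odds_after(examples):
    """Posterior log-odds that the rule is real, given a list of booleans
    (True = the example agreed with the candidate rule)."""
    log_odds = math.log(PRIOR_ODDS)
    for agreed in examples:
        if agreed:
            log_odds += math.log(P_AGREE_IF_RULE / P_AGREE_IF_CHANCE)
        else:
            log_odds += math.log((1 - P_AGREE_IF_RULE) / (1 - P_AGREE_IF_CHANCE))
    return log_odds


def adopt(examples):
    """True when the accumulated evidence passes the adoption threshold."""
    return log_odds_after(examples) > math.log(THRESHOLD_ODDS)


def examples_needed():
    """Smallest number of consecutive agreeing examples that triggers adoption."""
    n = 0
    while not adopt([True] * n):
        n += 1
    return n


if __name__ == "__main__":
    print("adopt after one example?", adopt([True]))
    print("agreeing examples needed:", examples_needed())
```

With these made-up numbers a single agreeing example leaves the odds far below the threshold, so one misheard sentence cannot trigger a false move; a run of agreeing examples is needed (seven here, but the count depends entirely on the assumed probabilities).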
There is evidence that Bayesian-like learning criteria are used in many aspects of human and animal cognition (Anderson 1990; Gallistel 1990) - as one would expect, because Bayesian learning is provably optimal (see Appendix B). There seems to be no good reason to exclude this robust Bayesian weighing of the evidence from language learning.
The `one false move' problem is exacerbated by the assumption (5) that the child uses no negative evidence. This forces Pinker to postulate a system of marking some parts of rules (but not others) as provisional, able to be retracted later on the basis of other evidence. The retraction rules often seem complex and contorted. A false retraction can be just as harmful as a false move, so the vetting criteria for retractions need to be framed with great care. Can retractions themselves be provisional ? If so, Pinker is coming close to weighing the evidence of multiple examples by probabilistic criteria.
It is now well established that children do not rely on explicit negative evidence such as parental corrections, and many theories, like Pinker's, have struggled with the supposed lack of negative evidence. In this theory, the primary learning process requires the child to `silently generate' sentences describing the inferred meaning of an adult's sentence, comparing them with what she hears. This not only picks out the sounds and meanings of new words from those of known words, but also provides negative evidence on supposedly-known words; the child may often observe `where I would have said X, an adult said Y'. Enough of these negative learning examples can change or retract an m-script.
The negative evidence mechanism allows the m-script theory to account straightforwardly for many disparate pieces of evidence, such as (B10), (B15), (D8), (G4), (G5), (H2), and (H3), where other theories (including Pinker's) struggle. These, and other examples where we easily learn the jagged edges of our language, constitute a strong case for allowing negative evidence (A).
In summary, while Pinker's assessments of the language data are unmatched, and have inspired several features of this theory, the m-script theory has a more soundly-based, robust and unified learning mechanism, which allows it to account for much of the evidence more easily than Pinker's theory.
The Principles and Parameters (P&P) approach to language acquisition is based on a core assumption which this theory does not share - that the language learning problem is so hard (and the child's learning data so poor) that it must be reduced to a problem of setting a few parameters.
The m-script theory shows (in a working computer model) that there are robust and powerful learning mechanisms, which can rapidly learn the complex structures of language in the presence of noise. Around six examples per word are sufficient, and it seems that children can gather these examples easily at the required rate. Therefore the m-script learning mechanism is a working counterexample to the core assumption of P&P theories.
If we do not need to accept the P&P approach because of the supposed poverty of the stimulus, how do P&P theories fare in comparison with the child language data ? I believe that the data have not, over the years, given any striking confirmation of any of the core predictions of P&P models; rather, they have posed a series of problems which have necessitated successive dilutions of the original P&P idea - by maturation hypotheses, lexicalisation of parameters, etc. Without having counted the score, I very much doubt whether a systematic comparison of P&P with a broad range of child language data (as in Table 5.1) would yield anything like the measure of agreement shown by the m-script theory.
However, there is reason to hope that much work in the broader generative grammar framework (as opposed to P&P learning theory) is complementary to the m-script theory. Generative grammar has addressed issues, such as gaps and movement, where the m-script theory has (so far) little to say; and it has gained many useful insights into these phenomena. These insights have previously been expressed in terms of several levels of structure (such as D-structure and S-structure) which are hard to reconcile with the one-level m-script formalism. However, in Chomsky's (1992) minimalist programme, the underlying structures are now simpler and more compatible with the m-script approach.
If you believe that m-scripts have merit as a theory of basic language learning, then maybe it will be possible to translate the insights of generative grammar on gaps, movement and other complex phenomena into m-script terms, to the benefit of both theories.
Ultimately, it seems likely that our theories of language learning will be connectionist, since that is how most computation is done in the brain. However, the question at issue is whether (a) neural nets will just provide components in the implementation of a language engine in the brain, or whether (b) language learning itself uses something like today's connectionist learning schemes. Pinker and Prince (1988) have characterised (a) as `Implementational Connectionism' and (b) as `Eliminative Connectionism'; it is the eliminative version that I will discuss first.
Rumelhart and McClelland's (1985) neural net learning model for verb past-tense morphology raised hopes that this form of raw connectionism might encroach on the symbolic-AI heartland of language. This model stirred up a lively controversy, leading ultimately (I believe) to the result that simple neural net learning does not adequately fit the facts of verb over-regularisation.
My aim here is not to comment on that specific issue, but on the wider prospects for eliminative connectionist models of language. There has, to my knowledge, been no published demonstration that neural nets can make any inroad on the core problem of human language - which is the richness and productivity of its meaning structures - and there are two reasons to expect that none will be made.
(1) Productivity: Today's generation of neural nets encode knowledge in a finite (typically small) number of connection weights. While these may easily encode small bounded symbolic structures (such as the relation between verb stems and their past inflections), nobody has yet found a good way to make these connection weights encode an unbounded symbolic structure such as a script tree. Connectionist nets do not have the unbounded, productive representational power of language (Fodor & Pylyshyn 1988).
All demonstrations of neural nets in language have confined themselves to low-dimensionality sub-problems such as verb morphology or the simpler facets of syntax. They have never ventured (in print) out onto the high-dimensionality ocean of productive, meaningful language.
(2) Learning Speed: Bayesian learning has optimal performance (learning the required rules from a few examples) when the inbuilt prior probabilities match the actual probabilities in the environment. The reason why neural nets can learn many different patterns (but typically learn them very slowly) is that neural nets do not have strong prior probability biases built into their structure; so they are open to many patterns, and fast learners of none (Denker et al. 1987). It is at present hard to see how the required prior probabilities could be built into a neural net.
When neural net learning results are reported, the learning times are measured in epochs, or thousands of trials. This is far too slow for a child, making it highly unlikely that these learning mechanisms have anything to do with human language learning.
While the prospects for eliminative connectionism are bleak, an exploration of implementational connectionism may be much more fruitful:
So a connectionist implementation of the m-script theory might be constructed out of relatively few network types, and might give good performance - in contrast to eliminative connectionist models.
Slobin (1973, 1985) has put forward a model of language acquisition consisting of about 40 Operating Principles (OPs) abstracted from extensive cross-linguistic study of acquisition in over a dozen languages (Slobin et al. 1985). A typical OP is:
OP (POSITION): FIXED WORD ORDER. If you have determined that word order expresses basic semantic relations in your language, keep the order of morphemes in a clause constant.
Slobin's operating principles are, as he emphasises, derived bottom-up as an attempt to understand actual language acquisition data, rather than top-down from any overriding theoretical expectations. They are an excellent summary of much that we know about language acquisition.
However, beyond that level, they are (on their own) hard to interpret theoretically. Being stated verbally, they leave room for interpretation, especially when two or more OPs make different predictions. If, for instance, we took the OPs as specifications of language learning sub-modules in the brain, and tried to build computational models of those sub-modules, there would be many questions to answer (for instance, about how competition between OPs is resolved) and we would probably end up with a highly complex theory. Much extra detail would need to be added to realise each individual OP computationally (Pinker 1986), as well as `glue' to bind the OPs together as a working theory (Bowerman 1985).
It is perhaps more useful not to go from OPs to a theory, but from a theory to OPs. For any computational theory of language learning, we may ask - does each OP emerge from the workings of the theory ? If not, how does the theory account for the data which are summarised by that OP ? Or does the theory actually contradict an OP ? In this way, Slobin's OPs can serve as a very valuable link between computational theories and data, highlighting gaps in any theory, and potential conflicts between theory and data.
I have not yet done this systematically for the m-script learning theory, but it may be very worthwhile to do so. From a preliminary review which I have made, many of the OPs seem to have a clear and direct interpretation in the m-script theory; others require more careful analysis (e.g. to see what the Bayesian probability theory predicts in specific cases); and a few clearly go beyond the scope of the m-script theory as currently formulated.
If precisely-stated algorithms of language acquisition are rare, then working computational models are even rarer. One of these is Siskind's (1996) model of lexical acquisition, which has interesting similarities to, and differences from, the lexical acquisition part of the m-script theory.
As in the m-script theory, Siskind assumes that word meanings are represented by some kind of feature structure (~ script), which can be inferred from learning examples. In each example, the word is heard along with some inferred sentence meaning structure which contains (~ includes) the word meaning. Siskind presents an algorithm which, like that of section 3, can infer a word meaning structure from a small number of learning examples. He presents results from running the algorithm on artificial meaning examples.
The major differences between Siskind's algorithm and the m-intersection algorithm of this theory are:
The logic by which Siskind's algorithm converges on the correct word meaning, using only a few learning examples, is quite similar to the logic of m-intersection. When two meaning scripts are intersected together, the only slots which survive are slots which appear on both input scripts; thus the slots on the result are something like a set intersection of the slots on the inputs. Random coincidences between learning examples are rapidly eliminated by more examples. The first stage of Siskind's algorithm uses essentially this set intersection of slots in the inputs - and therefore rapidly converges on the set of slots in the word meaning, as the number of examples increases.
However, script intersection converges on the true meaning from above, and stops when a Bayesian criterion of sufficient evidence is satisfied (typically after 6 or so examples). Siskind's algorithm converges on the true meaning set both from below and from above, and stops when the two versions coincide.
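The convergence `from above' can be sketched with flat slot sets. This is a deliberately crude illustration: real m-scripts are trees rather than sets, the Bayesian stopping criterion is omitted, and the word, the slot names and the example meanings are all invented for the purpose.

```python
def narrow_from_above(examples):
    """Return the candidate slot set after each example, obtained by
    intersecting all the slot sets seen so far."""
    history = []
    candidate = None
    for slots in examples:
        # The first example is taken whole; each later one can only remove slots.
        candidate = set(slots) if candidate is None else candidate & set(slots)
        history.append(candidate)
    return history


# Hypothetical sentence meanings heard together with one unknown word.
heard_with_word = [
    {"round", "toy", "red", "on-floor"},
    {"round", "toy", "red", "in-hand"},
    {"round", "toy", "blue", "in-hand"},
]

if __name__ == "__main__":
    for n, candidate in enumerate(narrow_from_above(heard_with_word), start=1):
        print(n, sorted(candidate))
```

Each new example can only shrink the candidate set, so coincidental slots (here the colour and location) drop out as soon as one example lacks them, while slots present in every example survive.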
By separating the two steps, Siskind's algorithm does not use structural information in the first step. If one example has a slot in position A on a script tree, and a second example has the same slot in position B, then script intersection will (correctly) eliminate the slot from the word meaning - while Siskind's algorithm does not do so until later. This difference in speed between the two algorithms is probably not very significant, and would be hard to discern in child learning data.
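The effect of structural information can be shown in a small sketch under strong simplifying assumptions. A script is encoded here as a set of (position, slot) pairs with fixed position labels, which is far cruder than a script tree, and neither function is Siskind's algorithm or m-intersection; they only contrast an intersection that ignores positions with one that respects them.

```python
def flat_intersection(a, b):
    """Intersect slot names only, ignoring where each slot sits."""
    return {slot for _, slot in a} & {slot for _, slot in b}


def structural_intersection(a, b):
    """Keep a slot only if it occurs at the same position in both examples."""
    return a & b


# Hypothetical examples: "round" sits at position A in one and B in the other.
example_1 = {("A", "round"), ("C", "animate")}
example_2 = {("B", "round"), ("C", "animate")}

if __name__ == "__main__":
    print(sorted(flat_intersection(example_1, example_2)))
    print(sorted(structural_intersection(example_1, example_2)))
```

The flat version keeps `round' because the name appears in both examples, whereas the position-sensitive version drops it at once and keeps only the slot that occupies the same position in both.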
While Siskind's algorithm has similarities to the m-script algorithm, and probably has comparable performance, I would claim some advantages for the m-intersection algorithm:
The theory described in this paper has, I believe, a transparency, simplicity and power which commend it theoretically:
Transparency: Each word in a language is represented by an m-script structure which can be easily drawn or envisaged, and understood. The core operations of m-unification and m-intersection can be done with pencil and paper, and can be understood by an analogy to chemical reactions.
Simplicity: The structure of every language is completely embodied in the m-scripts of its words, with no other parameters or rules. One operation of m-unification is used for both language understanding and production. All the syntax and semantics of a language are learnt by the m-intersection mechanism. Language regularities arise from m-script evolution.
Power: the m-script formalism can express meanings of unbounded complexity, and the syntax of adult languages. M-intersection learning is fast, robust, and can learn the m-script for any word. Negative evidence can be gathered to correct any error.
However, these would be of little consequence if the theory did not agree with the data. The most important result in this paper is the comparison with empirical data on language learning, summarised in Table 5.1. When compared with many diverse facts of language learning, the theory gives a natural and unforced agreement, without requiring extra ad hoc assumptions, in 84 out of 101 cases. Extra assumptions are required to fit the other 17 comparisons. I have not yet encountered any major, theory-threatening difficulty.
Even allowing for my possible selectivity in the choice of comparisons, and blindness to the faults of my own theory, this is an encouraging result. There is no other theory of language learning (with the possible exception of Pinker's (1984) theory) which claims this level of success in accounting for such a broad range of facts. Why does this theory do well in comparing with child language data?
The single most important answer is that the Bayesian learning mechanism is powerful enough to do the job. Other theories have generally used much less powerful learning mechanisms, and therefore have problems comparing with children's consummate ability to learn. The power of the Bayesian m-script mechanism spans four distinct dimensions:
It is an obvious fact that children's language learning exhibits all of (1) - (4). Pure symbolic and P&P theories generally fail on (3) and (4), while connectionist models fail on (1) and (2).
We can confirm that the Bayesian learning theory delivers all of (1) - (4) in two ways: either by doing the mathematical analysis to show that it does, or by building it into a computer program and observing its performance. I have done both, and am confident that others can reproduce the result.
The coherence of the learning theory rests heavily on the claim that to learn a language is just to learn a set of word m-scripts. In other words, a language is just a set of word m-scripts. This claim is supported by the computer program which handles a non-trivial sample of English in this way, and by the correspondence of m-scripts with unification-based grammars such as LFG.
The view of language underlying this theory has much in common with cognitive linguistics (Lakoff 1987; Langacker 1990) - in its grounding of language in semantics rather than syntax, in the biological origin of scripts in social cognition, and in the strong links between scripts and other cognitive structures. However, cognitive linguistics has adopted a connectionist model of language learning (Langacker 1990) which, I argue in this paper, cannot fit the facts. Perhaps this theory can be the learning theory for cognitive linguistics.
A key departure of this theory from the mainstream is the proposal that language regularities do not reflect regularities in the mind, but result from a process of m-script evolution for efficient communication. Thus the deep structure of language does not tell us as much as we thought about the deep structure of the mind; it tells us how language evolves on a neutral substrate of mind. For those who want to use language to learn about the mind, this may be a disappointment.
However, in telling us less about the mind, it may actually be telling us more. In science, less is more; a theory with fewer assumptions and simpler structures is better than a more complex theory. So if a theory such as this one, without elaborate language-specific structures in the brain (but with general social learning mechanisms instead) can fit the data, then it is to be preferred; it gives us a simpler, more transparent theory of the mind.
If this sounds dangerously empiricist after thirty years of Chomskyan nativism, so be it. The test of a theory is not how well it caters for our empiricist or nativist leanings; but how well it fits the data. On the test of language acquisition data (which was the primary motivation for Chomskyan nativism) this theory does better than any existing nativist theory.
The all-round agreement with the data leads me to claim that maybe this is the way we learn our first language. If others wish to prove this claim wrong - by an in-depth examination of the data on one language, or by working out the theory in greater depth, or by carrying out new experiments - they have my support; because if this theory is at all on the right track (as the data seem to show it is) then proving it wrong will yield new insights and progress.
Acknowledgements: I gratefully acknowledge my debt to the many researchers whose empirical results on language learning make child language such a vital and stimulating field. Some are cited here, many who should be are not. This theory would be nothing without their results.