R. P. Worden
Charteris Ltd
rpw@charteris.com
http://dspace.dial.pipex.com/jcollie/
This paper describes a precise sense in which language change is a form of evolution - not of the language itself,
but of individual words which constitute the language. Many prominent features of language can be understood as
the result of this evolution.
In a unification-based theory of language, each word sense is represented in the brain by a re-entrant feature structure, which embodies the syntax, semantics and phonology of the word. When understanding or generating a sentence, we unify together the feature structures of the words in the sentence to form a derivation structure.
Unification is like chemical synthesis of the word feature structures, and there is a complementary operation - generalisation - which is like chemical analysis. Generalisation is used for language learning, where the child projects out the feature structure for one word from several derivations containing the word. Because chemical analysis can only recover the elements originally used for synthesis, by this mechanism the child can learn only the feature structures of words used by adults.
In this way, the feature structure for a word replicates itself faithfully from the brains of adults to the brain of a child. Each word is an independent package of information which replicates and propagates through generations of people, just as DNA replicates faithfully and propagates through generations. Words are memes.
The faithful mechanism of DNA replication (with rare mutations) underlies the evolution of the structure and diversity of life. Similarly the faithful reproduction of word memes underlies the structure and diversity of language. As word feature structures propagate from generation to generation, they undergo slow changes from selection pressures which cause many types of language regularities.
In this analogy, each language is an ecology and each word is one species in the ecology. In language as in nature, different word species exert selection pressures on one another. The selection pressures on a word are those factors which cause people to use it more or less often, and to learn it more or less easily - useful meaning, distinctive sound, lack of ambiguities, productivity, learnability, economy of expression, and social acceptance. I illustrate by examples how these selection pressures act on words.
The approximate regularity of languages is a result of these selection pressures - requiring case markings of nouns to line up with the common verbs, so that case-marked languages are most commonly nominative/accusative or ergative/absolutive; while mixed-ergative languages are still possible. Similarly the three main patterns of meanings of verbs of motion, observed by Talmy (1985), arise by selection of verb feature structures. The theory predicts 'domains of regularity' in languages, with irregular borders.
Some of the strongest selection pressures on words arise from the need to avoid hard ambiguities, which cannot be handled by real-time language processing functions. If some complex construct is ambiguous, it may still be tractable if the generalisation of the two possible meaning structures has large information content. This requirement places strong mutual selection pressures on the syntax of words, giving rise to many of the Greenberg-Hawkins universals.
There are two competing explanations of language structure - that it reflects genetic evolution of the human brain, or that it arises from the evolution of word memes. Evolution of word memes goes much faster than evolution of brains, and so removes the selection pressures which could lead to genetic evolution of language structures in brains. For many language structures, there is no need to suppose that they reflect innate structure of the brain.
The idea that language change is somehow analogous to evolution has a long and chequered history. Pre-Darwinian evolution, Lamarckian evolution and teleology have all been invoked in dubious explanations which gave the idea of language evolution a bad name - but nevertheless, some kind of Darwinian evolution of languages is now regarded as a valid tool for thinking about language change (McMahon 1994).
One form of this idea goes as follows: each word is represented in the brain by a package of information that embodies the sound, syntax and meaning of the word. When people speak to one another, they combine the word packages to make sentence packages. A child, observing the sentence packages passing between adults, can somehow extract the component word packages and thus learn the words. Thus the word packages propagate from generation to generation.
This propagation process is illustrated in figure 1.
Figure 1: propagation of a word package from one generation to the next.
Here, different colours of word package denote different parts of speech. Over several generations, as word packages reproduce via the speaking/learning mechanism, some words are more successful than others - more commonly used or easier to learn - and there is competition between the different 'species' of words. This competition leads to changes in the balance of word species in a language, and so leads to language change. Such a change is illustrated in figure 2.
Figure 2: selection of word packages by differential reproduction over several generations.
Here each circle denotes a person, and the single boxes in the circles denote words that they know. The joined boxes between the circles denote sentences made up of words. The 'red' species of word becomes more populous over two generations. (In this figure, red denotes a specific word rather than a part of speech.)
In this model, words are a form of Dawkins' (1976) 'memes' - culturally transmitted replicators that propagate through a population. Such a picture is quite appealing, and could be used to model aspects of language change. However, as it stands it is unsatisfactory because of its looseness. Just what are these packages of information - what is in a word package and what is not? What is the mechanism of reproduction and what information can it propagate? Such a theory is like the theory of Darwinian evolution before the discovery of DNA replication - it is quite plausible, but fundamental questions remain about how it really works. Until we find the answers to these questions, the idea remains an appealing story rather than a predictive theory.
When the structure and replication mechanism of DNA were discovered, Darwinian evolution could be put on a much firmer footing. Core questions, such as the relation between Mendelian discrete inheritance and continuously variable traits, could begin to be answered. The essence of the DNA chemical replication mechanism is shown in figure 3.
Figure 3: DNA replication mechanism
The DNA molecule is a double helix of complementary base pairs, which replicates by splitting as shown in the diagram.
Each resulting strand gathers matching bases (represented by short strips) from the surrounding cell, and because
the minimum energy configuration has precisely complementary base pairs, two exact copies of the original molecule
are made.
DNA replication is done by a combination of chemical analysis (splitting the helix in two) and synthesis (accumulating new bases from the surrounding cell) - precisely matched to preserve the information in the order of the base pairs. The DNA molecule may be many millions of base pairs long, but can still replicate precisely with very few copying errors. Any sequence of legal base pairs can be replicated; in this sense the replication is completely transparent to any sequence, and can pass on any genetic information.
It is this extreme precision and transparency of DNA replication which underlies the huge diversity of life. Because each DNA molecule can carry a large amount of information, and can propagate it faithfully from generation to generation, this information can be used for the design of living things. Crossing and mutation create diversity - but without the precise DNA replication, that diversity would never be preserved for long enough for selection to act on it.
We now see the challenge faced by a 'words as memes' theory of language evolution. If the word packages do not carry enough information, or cannot be reproduced faithfully enough, they cannot serve as the DNA of language; changes introduced by selection of word memes might be wiped out by any imprecision of the replication mechanism. From what we have said so far, word replication (by a child observing the words used by her parents) might well be 'sloppy' enough to wipe out subtle changes.
Is there a mechanism of word replication which is precise enough to serve as a basis for evolution of word memes?
There is such a mechanism, within the framework of unification-based grammars such as Categorial Grammars (Oehrle et al 1988), HPSG (Pollard & Sag 1993) or LFG (Kaplan & Bresnan 1982). In this framework, I have developed a working theory of language learning (i.e. of word replication). There is not space here to describe the theory fully; it is described in a separate paper available from the web site http://dspace.dial.pipex.com/jcollie/. I shall give enough detail here to show how word replication, like DNA replication, works by a process of analysis and synthesis - and how it propagates information precisely and transparently from one generation to the next.
In Categorial Unification Grammars (Karttunen 1986; Uszkoreit 1986; Zeevat et al 1988; Bouma and Van Noord 1993), each word is described by a feature structure - a tree-like information structure in the brain, with information stored on the nodes of the tree. We do not know what neural encoding is used in the brain for such structures, but the details of the encoding are not important to the theory - what is important is the computational properties of the feature structures, however they are realised in neurons.
The feature structure for each word embodies the sound of the word, the meaning of the word, and the syntax associated with it. In this respect, Categorial Grammars are fully lexicalised - all the syntax of a language is embodied in the feature structures of its words, rather than being embodied, say, in any separate phrase structure rules or parameters. If you can learn the feature structures of the words, then you can learn the whole language. Two simple feature structures, for the words 'Fred' and 'sleeps' are shown in figure 4. The tree-like form, with information encoded in slot values such as 'cla:human' (meaning - the class of this entity is human) on the nodes, is evident. The curved lines are re-entrant links, which imply that any sub-trees beneath the two ends of the link must be shared (always identical).
Figure 4: Word Feature structures and their unification in a Categorial Grammar
In a paper presented at a previous conference in this series (Worden 1998) I argued that language evolved from primate social intelligence, and uses a tree-like internal representation of meanings which I called 'scripts', after Schank & Abelson (1977). The script theory of primate social intelligence has been used (Worden 1996) to analyse aspects of primate social learning and behaviour such as those observed by Cheney & Seyfarth (1990); it was then extended to a theory of language and language learning, arguing that language and primate social intelligence use the same computational mechanisms, both for learning and performance. I have more recently re-formulated the language learning theory in terms of Categorial Grammars, in order to link it explicitly to a mainstream theory of language; but the essentials of the theory are unchanged. The feature structures of Categorial Unification Grammars are identical to the scripts used in (Worden 1998), and it is still proposed that these feature structures arose in primate social intelligence. The learning mechanism is the same in both cases.
The key operation on feature structures is called unification (Shieber 1986), and has been extensively studied mathematically (Siekmann 1989; Carpenter 1992). It is like a form of chemical synthesis - to unify two feature structures, you overlay them where they have structure in common, and include all the structure of each in the result. The unification of two feature structures A and B is written (A u B). An example is shown in figure 4 above, where the feature structures for 'Fred' and 'sleeps' are unified together to find the meaning of the sentence 'Fred sleeps', which is the right-hand branch of the result.
In Categorial Unification Grammars and other unification-based grammars such as HPSG and LFG, unification is the central operation for both sentence understanding and generation. This gives a very powerful model of language performance, which can embody intricate syntactic and semantic features and has been tested against many languages. Mature adult languages can be well described by unification-based grammars (e.g. Pollard & Sag 1993).
In the radically lexical version of categorial grammar which I shall use here, if a sentence consists of words w, x, y, … with feature structures W, X, Y, …, then the sentence is understood by unifying together the feature structures of all its words. The result of this unification is a feature structure called the derivation, D = W u X u Y …, which always contains the meaning of the sentence in one of its branches. D also contains the feature structure for every word in the sentence amongst its sub-structures, because it is made by unifying them together. Within these feature structures are the sounds of all the words of the sentence - which was the starting point for sentence understanding.
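As a minimal sketch of the unification operation described above, feature structures can be modelled as nested Python dictionaries with atomic values on the leaves. This is an illustration under simplifying assumptions, not the representation used in the paper: re-entrant links are not modelled, and the slot names `subject`, `verb`, `meaning`, `animate` and `act`, together with the word `rock`, are hypothetical (only `cla:human` comes from the text).

```python
class UnificationFailure(Exception):
    """Raised when two feature structures carry conflicting information."""


def unify(a, b):
    """Overlay two feature structures and keep all the structure of each.

    Feature structures are nested dicts; leaves are atomic values.
    Re-entrant links are NOT modelled in this sketch.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # start from a copy of the top level of a
        for slot, b_value in b.items():
            if slot in result:
                # Shared slot: the sub-structures must themselves unify.
                result[slot] = unify(result[slot], b_value)
            else:
                # Slot present only in b: include it unchanged.
                result[slot] = b_value
        return result
    if a == b:
        return a
    raise UnificationFailure(f"{a!r} clashes with {b!r}")


# Hypothetical word feature structures (slot names are assumptions).
fred = {"subject": {"sound": "fred", "cla": "human", "animate": "yes"}}
sleeps = {
    "subject": {"animate": "yes"},
    "verb": {"sound": "sleeps"},
    "meaning": {"act": "sleep"},
}
rock = {"subject": {"sound": "rock", "animate": "no"}}

# The derivation D = Fred u sleeps contains both word structures.
derivation = unify(fred, sleeps)
print(derivation)
```

In this sketch the derivation for 'Fred sleeps' contains everything from both word structures, while `unify(rock, sleeps)` raises `UnificationFailure` because the two structures disagree on the hypothetical `animate` slot.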
How do these word feature structures replicate, as a child learns a language? There is another operation on feature structures, called generalisation (Shieber 1986). This is the complementary operation to unification - and as unification is like chemical synthesis, so generalisation is like chemical analysis. The generalisation of two feature structures A and B is written as (A ^ B), and is formed by placing the two structures one on top of the other and only retaining the parts they both have in common.
Suppose there are several derivations D1, D2, D3, … for sentences all of which contain a word W. Since D1, D2, D3, … all contain the feature structure for W within them, their generalisation (D1 ^ D2 ^ D3 …) will also contain W as a substructure - and if there are several distinct derivations, their generalisation contains little else, so that to a very good approximation (D1 ^ D2 ^ D3 …) = W. By generalising the derivations, we can recover the original feature structure W.
In this sense, unification and generalisation play complementary roles in word replication, like chemical synthesis
and analysis. Unification combines W with other feature structures, and generalisation recovers it from the results.
Suppose a child learning a language does not know the word W, but knows all the other words in the sentence. Hearing sentences containing W and the other words, she can often infer the intended meaning from the context. In this way, she can reconstruct the derivations D1, D2, D3, …, even though she does not know the feature structure for W. By generalising the Ds together, she can recover the feature structure and thus learn a new word W. That is how children learn their native language.
Just as chemical analysis cannot recover what you did not put in by synthesis, so generalisation cannot recover what you did not put in by unification. This means that apart from small perturbations, children generally cannot learn words unless they hear people using them.
The learning process is illustrated in figure 5, where a child observes two distinct uses of the word 'sleeps', in the sentences 'Fred sleeps' and 'cat sleeps'. She already knows the feature structures for 'Fred' and 'cat', and in both cases she can infer the intended meaning of the sentence (the right-hand branches of D1 and D2). You may check that she is then able to reconstruct both derivations D1 and D2 from what she knows (the word sounds, the feature structures for 'Fred' and 'cat' and the inferred meanings). Then by forming the generalisation D1 ^ D2 she can recover the feature structure for the unknown word 'sleeps', and thus learn a new word - syntax, semantics and phonology all together.
Figure 5: The word 'sleeps' is incorporated in two derivations by unification, and is recovered from them by
generalisation.
(detailed note: the derivations D1 and D2 as reconstructed by the child do not have the re-entrant link - the curved line - which is an important part of the feature structure for 'sleeps'. But this link is recovered as part of the generalisation operation)
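The recovery of 'sleeps' from two derivations can be sketched in the same spirit. The code below is only an illustration under stated assumptions: feature structures are nested dictionaries, the two derivations are written out by hand rather than reconstructed by a learner, the slot names (`subject`, `verb`, `meaning`, `actor`, `animate`, `act`) and the value `cla: cat` are hypothetical, and re-entrant links are not modelled, so the recovery of the link mentioned in the note above is not shown.

```python
from functools import reduce


def generalise(a, b):
    """Keep only what two feature structures have in common.

    Returns None when two atomic values (or an atom and a sub-tree)
    share nothing; shared slots with no common content are kept as
    empty nodes.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        shared = {}
        for slot in a.keys() & b.keys():
            common = generalise(a[slot], b[slot])
            if common is not None:
                shared[slot] = common
        return shared
    return a if a == b else None


# Hand-written derivations for 'Fred sleeps' and 'cat sleeps'
# (hypothetical slot names; re-entrancy omitted).
d1 = {
    "subject": {"sound": "fred", "cla": "human", "animate": "yes"},
    "verb": {"sound": "sleeps"},
    "meaning": {"act": "sleep", "actor": {"cla": "human", "animate": "yes"}},
}
d2 = {
    "subject": {"sound": "cat", "cla": "cat", "animate": "yes"},
    "verb": {"sound": "sleeps"},
    "meaning": {"act": "sleep", "actor": {"cla": "cat", "animate": "yes"}},
}

# D1 ^ D2: whatever is specific to 'Fred' or 'cat' drops out.
learned_sleeps = reduce(generalise, [d1, d2])
print(learned_sleeps)
```

What survives the generalisation is the material common to both derivations: the sound 'sleeps', the sleeping act, and the requirement (in this toy encoding) that the subject and actor be animate. Further derivations could be folded in with the same `reduce` call.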
This basic learning mechanism is embedded in a Bayesian learning theory, which shows that typically a child requires about six good examples of the use of any word (where she knows the other words in a sentence, and can infer the intended meaning of the sentence) in order to learn the word feature structure. The learning mechanism is highly robust against poorly-understood or misheard examples, and is an explicit counterexample to Chomsky's (1988) 'poverty of the stimulus' argument. Children get plenty of stimulus to learn any word of their native language, and thus to learn its syntax.
If the feature structures for several words have structure in common, there is a secondary learning process - generalising all the similar word feature structures together - which projects out their shared structure to learn a lexical or morphological rule relating the different words (e.g. the -ed rule for the English past tense). This saves learning all the regular forms individually. Exceptions to these regular rules (e.g. that 'goed' is not a word) can also be learned without explicit negative evidence or parental input, by gathering implicit negative evidence from the child's learning data.
This theory of language learning is described in more detail in (Worden 1997, 1998b), where it is compared with many empirical facts of language learning, showing good agreement.
The learning process of unification and generalisation is the language equivalent of DNA replication. It works for any word feature structure - no matter how complex, or what part of speech - and given a few examples of each word, it works reliably to reconstruct the original feature structure. That is the faithful replication of word feature structures on which the evolution of words is built. I have built a program which replicates word feature structures by just this process, and it is able to replicate faithfully all parts of speech in a representative fragment of English.
The resulting model of language evolution, by replication of word feature structures, is shown in figure 6.
Figure 6: Replication of word feature structures by unification and generalisation.
For words, as for DNA, faithful replication depends on a precise sequence of chemical synthesis and analysis. For replication of words, the robustness of the process depends on the Bayesian learning theory; for DNA, robustness of replication depends on the statistical mechanics of the relevant molecules and enzymes at room temperature.
Learning of word feature structures by unification and generalisation is sufficiently robust, faithful and transparent to allow word feature structures to replicate over many generations, and thus to evolve. As words evolve, the language changes. In this theory, 'evolution of words' is not just a rough analogy with biological evolution - it is a precise evolutionary theory of language change, and can be used to analyse directly the observed forms of language change. Nevertheless, the analogy with biological evolution is still useful for understanding how the theory works.
In this analogy, each word is like a separate species - not mixing its 'genotype' (feature structure) with that of other words - and so a language is like an entire ecology. Every word has a 'niche' which is a part of meaning space; different word species may compete with one another to occupy useful niches in the space of meanings which people want to express. By combining productively with other words, a word may effectively expand its meaning niche and so prosper; this is the selection pressure which has led to the unbounded productivity of language, and to all of syntax. As the whole of a language is defined by its word feature structures, evolution of the words constitutes the whole of language change. The full Oxford English Dictionary is a museum of mainly extinct word species.
Each word species is a parasitic life form, no less so than a virus. A virus needs a living cell to host it, and a word needs a human brain. As I stand here, I have 50,000 separate species of word parasite in my head, and I am breathing them out all over you. You in turn are absorbing them, because you have virtually the same 50,000 parasites already in your head. Some of these species are thousands of years old, and some are much younger; for instance, the meme 'meme' is only twenty years old, but is already quite robust.
As we shall see, this model of language change can account for many language-specific features, as well as for cross-language universals. A language universal is not necessarily a clue to some underlying constraint of the human brain; it may simply arise from the evolution of words.
In order to see how word feature structures evolve over many generations, we need to understand the selection pressures which shape their evolution. 'Fitness' of a word species depends on how frequently it is used by speakers, and on how easy it is for children to learn the word when they hear it. We can identify six main factors which determine the fitness of a word feature structure:
· Useful Meaning: A word will tend to be used frequently if it expresses a meaning which people find useful, and need to express often.
· Productivity: The use of a word depends not only on its own intrinsic meaning, but also on the ease with which it combines with other words to express useful compound meanings - on the productivity of the constructs in which it figures. Evolution of word species is essentially symbiotic; the fitness of a word species depends on how well it co-operates with other species.
· Economy: If a meaning can be expressed in two different ways, and one of them is quicker and more economical than the other (e.g. uses fewer syllables), then the quicker construct will tend to be used more. An economical construct may invade the meaning niche of a more long-winded construct, and supplant it.
· Lack of Ambiguity: As language grows in productivity and complexity of sentences, the scope for ambiguity multiplies. The human mind is remarkably good at resolving language ambiguities on the fly; but any word feature structure which adds unnecessarily to the level of ambiguity will tend to be avoided, and be selected against.
· Ease of learning: The learning mechanism is unconscious and automatic; to learn a word feature structure, a child needs to collect about six examples of the use of the word, in unambiguous sentences where she knows all the other words in the sentence and can infer the intended meaning non-linguistically. Ease of learning therefore equates to frequency of use in unambiguous constructs, in circumstances where the intended meaning can be inferred. Because of the secondary learning mechanism, regular constructs can reduce the amount of learning required, so making learning easier.
· Social Identification: We judge people by their language, and know that we are judged by our own language. In any circumstances where people wish to identify themselves with some social group, or to be seen to be affiliated to that group, they will tend to adopt the language of the group. Word evolution is continually shaped by people's social aspirations.
These six selection pressures act together in different ways at different times to shape the words of a language.
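To illustrate how a modest fitness advantage can compound over generations, here is a toy sketch of differential reproduction between two constructs competing for the same meaning niche. It is not a model taken from the paper: the multiplicative update rule, the fitness values and the names `short_form` and `long_form` are all assumptions chosen for illustration, with `short_form` standing for a more economical construct.

```python
def next_generation(shares, fitness):
    """One round of differential reproduction.

    Each construct's share of usage is reweighted by its fitness and the
    shares are renormalised to sum to 1 (an assumed, standard
    replicator-style update).
    """
    weighted = {word: shares[word] * fitness[word] for word in shares}
    total = sum(weighted.values())
    return {word: value / total for word, value in weighted.items()}


def simulate(shares, fitness, generations):
    """Return the list of share dictionaries, one per generation."""
    history = [shares]
    for _ in range(generations):
        shares = next_generation(shares, fitness)
        history.append(shares)
    return history


# Hypothetical starting point: a rare but more economical construct.
initial_shares = {"short_form": 0.05, "long_form": 0.95}
assumed_fitness = {"short_form": 1.2, "long_form": 1.0}

history = simulate(initial_shares, assumed_fitness, generations=30)
for generation in (0, 10, 20, 30):
    print(generation, round(history[generation]["short_form"], 3))
```

Under these assumed numbers the economical construct starts as a small minority and comes to dominate its niche within a few tens of generations, which is the qualitative behaviour the selection argument relies on.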
Very many features of languages can be regarded as the result of word evolution - in fact arguably every feature of a language can be regarded in this way. I shall use six examples to illustrate how word evolution creates language structure:
· Semantic role selection: how languages distinguish between agents, patients, actors and themes
· Verbs of motion: what extra meaning is packaged with the verb, and what needs to be expressed in other ways
· Language-wide phonetic shifts: such as the great vowel shift in English
· Language Universals: such as the Greenberg-Hawkins universals, leading to broad generalisations such as the Head Parameter
· Features of Creole languages: such as the typical Tense/Mood/Aspect order of particles before the verb
· Regularity and irregularity: the mixture of the two which we find in most languages.
For each example, I shall describe which of the six main selection pressures have given rise to the language features we see. The main selection pressures involved are shown in the table below, and are listed at the head of each sub-section.
Table: the six selection pressures (Meaning, Productivity, Economy, Ambiguity, Learning, Social) form the columns and the six examples form the rows. Three pressures are marked for semantic role selection, three for verbs of motion, two for language-wide phonetic shifts, two for language universals, two for features of Creole languages, and three for regularity and irregularity.
In a sentence such as 'Fred punched Tony' it is very important to know who ended up with a bloody nose - who took the agent role, and who was the patient, in the punching event. Languages use a variety of devices to signal this information (Andrews 1985).
There are three main roles to distinguish - agent and patient of a transitive verb, and actor (or theme) of an intransitive verb. However, since there are at most two main semantic roles for any verb, languages only need to convey a binary distinction. They do this in one of three main ways - by word order, by a nominative/accusative case marking on nouns, or by an ergative/absolutive case marking on nouns. Every language uses one or more of these devices, which shows the strength of the selection pressure to avoid such an important ambiguity.
In languages which use word ordering to define roles (e.g. English SVO ordering), this ordering constraint is built into the feature structure for each verb. In such languages, case marking of the nouns tends to be redundant, and so is not used - a selection pressure for economy acting on noun feature structures.
If a few common verbs determine semantic roles by word order, this reduces the need for case marking on nouns. Once nouns lose their case markings, the remaining verbs must in turn use word order to determine semantic roles. Thus a symbiotic selection pressure acts back and forth between nouns and verbs to create a regularity in the language.
In a more weakly ordered language such as Latin or Japanese, where the verb feature structures do not determine semantic role by position, the nouns need to have case markings to distinguish agents and patients. There are two dominant systems of case marking - nominative/accusative and ergative/absolutive. In both systems, the more frequently occurring case (the one which occurs in two out of the three main roles) tends to be the unmarked case with fewer phonemes - an example of the selection pressure for economy in action. This is shown in figure 7 below.
Fig 7: Most common case marking systems for nouns
The case markings are usually at least partially regular, which means that cases for many nouns do not need to be learnt individually, reducing the learning load - an example of selection for ease of learning.
When verb feature structures have no time-order constraints of the SVO variety on their role-fillers, the matching of entities to roles is achieved entirely by case markings on the noun feature structures. Therefore the same verb feature structures can work unchanged with either form of noun case marking - nominative/accusative or ergative/absolutive. So there is no mutual verb-noun selection pressure to line up all nouns along the nominative/accusative axis or the ergative/absolutive axis, and mixed-ergative languages (which have both systems of case marking) are known, but rare (Andrews 1985; Comrie 1989).
What then is the selection pressure which disfavours such mixtures, and drives most languages to one axis or
the other?
While nouns and verbs can work together perfectly well in a mixed-ergative language, we must also consider adjectives.
In weakly ordered languages, adjective-noun agreement is very important in matching each adjective with the appropriate
noun, which may be separated from it in a sentence. However, if nouns have both nominative/accusative and ergative/absolutive
case marking, the required markings for adjectives will be more complex and hard to learn. So the selection pressure
for lack of ambiguity in matching adjectives with nouns will tend towards pure nominative/accusative or pure ergative/absolutive
languages. We would expect this to be a weaker selection pressure than the pressure to assign semantic roles correctly,
and so to act more slowly. Mixed-ergative languages may be long-lived remnants of a collision between languages
with opposite case-marking schemes, where the adjective-matching selection pressure has not yet extinguished either
form of case marking in favour of the other.
The main elements of meaning which can be directly encoded in verbs of motion, rather than being expressed by other
particles or phrases nearby, are the following:
· The motion itself
· The path relative to the ground of the motion (into, under, over,…)
· The manner of motion (rolling, staggering, …)
· The form of the moving thing (round, long, …)
Talmy (1985) has noted that every language encodes only two of these - the fact of motion and just one of the last three properties. Languages such as English directly encode the motion and the manner of motion; Spanish encodes the motion and its relation to the ground; and some American languages such as Atsugewi encode just the motion and the form of the moving thing. This is illustrated in figure 8 below.
Figure 8: direct encoding of information in verbs of motion
Why does no language encode more or fewer than two of the attributes of the motion, and why is the choice of which two regular across any language? We can understand why, in terms of the selection pressures on the verb feature structures.
The basic need to encode some of these aspects of the motion arises from the selection pressure to express useful
meanings. The choice then is between expressing those meanings intrinsically in the verb itself, and expressing
them in nearby particles and phrases. This choice is determined by a tradeoff between economy (for which you might
tend to express everything in the verb itself) and ease of learning.
Consider a language where the verbs encode not two, but three of the possible elements of meaning. Suppose each optional element of meaning (relation to ground, manner, form of figure) has ten distinct values. To encode motion plus one optional element, you require ten verbs of motion (e.g. the horizontal bars on figure 9 below, for verbs encoding path). But to encode motion plus two optional elements, you require a hundred verbs (like the red square in figure 9, encoding path and manner) - which requires much more learning to master the system. Not only that, but each such verb will occur ten times less frequently in the child's learning data, and so will take ten times longer to learn. Going from one optional element to two will impose a massive extra learning load.
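The arithmetic behind this learning-load argument can be worked through in a few lines. The sketch below takes the supposition of ten values per optional element from the text and the figure of about six examples per word from the learning theory; the assumption that all motion verbs are used equally often, and the function names, are mine.

```python
VALUES_PER_ELEMENT = 10   # supposition in the text: ten values per optional element
EXAMPLES_PER_WORD = 6     # about six good examples needed to learn a word


def verbs_needed(optional_elements, values=VALUES_PER_ELEMENT):
    """Number of distinct motion verbs when each verb encodes
    motion plus the given number of optional elements."""
    return values ** optional_elements


def expected_sentences_per_verb(optional_elements):
    """Expected number of motion-verb sentences a child must hear before a
    given verb has occurred EXAMPLES_PER_WORD times, assuming (hypothetically)
    that all motion verbs are used equally often."""
    return EXAMPLES_PER_WORD * verbs_needed(optional_elements)


for k in (1, 2):
    print(
        f"{k} optional element(s): {verbs_needed(k)} verbs, "
        f"about {expected_sentences_per_verb(k)} motion sentences per verb learned"
    )
```

Moving from one optional element to two multiplies both the number of verbs to be learned and the exposure needed for each verb by ten, under the uniform-usage assumption.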
Figure 9: possible sub-spaces for meanings of verbs of motion
Why then is the choice of which optional element to encode a language-wide choice? One answer comes from niche sizes. Consider a language as shown in figure 9 which already has several verbs encoding path of motion (the blue horizontal bars). Now if a new verb arises which encodes just motion and manner of motion (the green vertical bar), about half of its meaning niche is already taken by the path-encoding verbs - so it will not be used so frequently, and will therefore not reproduce strongly. By contrast, there are completely empty niches for the remaining 'path' verbs. So a majority of path verbs will tend to drive out manner verbs, and to reinforce itself. The same arguments apply in three dimensions, adding the dimension 'form of moving object'. In every language, one dimension will come to dominate.
Labov (1981) has shown how phonetic changes tend to be of two quite distinct kinds - discrete phonetic changes which affect specific lexical items, and then may diffuse across the lexicon ('lexical diffusion' changes), and graded sound changes which affect the whole lexicon together ('neo-grammarian' changes). It is the second type of phonetic change which I shall discuss here, exemplified by the great vowel shift in mediaeval English (McMahon 1994). This complex change had several components, one of which is illustrated in figure 10.
Figure 10: Some of the sound changes in the great vowel shift
This global change across the language can be understood in terms of two selective forces on word feature structures - social identification and avoidance of ambiguity.
We are all familiar with the association between accent and social groupings, which affects all words in the vocabulary. Group-specific accents may be initiated by any random event - perhaps even from the accent of one individual whom others wish to emulate - and can then reinforce themselves by mutual imitation. They affect the whole vocabulary, because people use them as much as possible, to 'label' themselves as part of the desirable social group. Assume that one such vowel change was at the origin of the great vowel shift, where one vowel began to sound more and more like some other vowel.
Once this vowel moved so far as to start to become indistinguishable from some other vowel (the next one in the chain), that second vowel too would have to change in order to minimise ambiguity. That is, the word species driven along this path by social selection pressure would start to invade the phonetic niches of other words, and to drive them out. Once started, of course, the second vowel change could also be reinforced by the pressures of social identification. The second change might only be necessary to remove ambiguities in certain words, but would be adopted in all words for easy social identification. This would create new ambiguities, leading to a third vowel change, and so on. In this way, the vowel shift could propagate round a whole chain of vowels, until it came round to a vowel sound which had been left 'vacant' by the original change. At this point the cycle could stop.
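The chain mechanism just described can be sketched as a simple procedure on a ring of phonetic slots. The labels V1 to V4 are abstract placeholders, not the actual vowels of the great vowel shift, and the ring layout is an assumption made for illustration.

```python
def push_chain(slots, start):
    """Simulate a push chain on a ring of phonetic slots.

    slots[i] holds a vowel label, or None if the slot is vacant.
    The vowel at `start` moves one slot onward (the socially driven change);
    each displaced vowel then moves onward in turn to avoid ambiguity,
    until a vowel lands in a vacant slot.
    """
    slots = list(slots)
    n = len(slots)
    moving = slots[start]
    slots[start] = None               # the original change leaves a vacancy
    pos = (start + 1) % n
    steps = []
    while True:
        displaced = slots[pos]
        slots[pos] = moving
        steps.append((moving, pos))
        if displaced is None:         # reached a vacant slot: the chain stops
            break
        moving = displaced
        pos = (pos + 1) % n
    return slots, steps


final, steps = push_chain(["V1", "V2", "V3", "V4"], start=0)
print(final)       # every vowel has moved one slot round the ring
print(len(steps))  # number of individual vowel changes in the chain
```

If a slot further along the chain is already vacant, the loop stops there instead of travelling all the way round to the slot left by the original change.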
Greenberg (1966), Hawkins (1994) and others have discovered a number of universals of language structure, some of which hold with high reliability over all known languages, and which have been interpreted as evidence of language-specific structures in the human brain. Typical of these is Greenberg's (1966) Universal Number 2, which states that:
"In languages with prepositions, the genitive almost always follows the governing noun, while in languages with postpositions it almost always precedes."
Many of these universals can be understood not as a reflection of any structure in the brain, but more simply as the result of evolution of word feature structures. The key selective forces responsible for universal no. 2, and others like it, are the forces of reducing ambiguity while retaining productivity.
English is a language with prepositions rather than postpositions ('fiddler on the roof' refers to a fiddler, not a roof) and the genitive follows the governing noun ('man of action' is a man, not an action). In languages such as Japanese both go the other way round; universal no. 2 says that in essentially all languages the genitives and adpositions are similarly linked. Why are the English feature structures for 'on' and 'of' linked in this way?
One answer lies in the handling of structural ambiguities. Consider a compound phrase like 'the lid of the box on the table' which can be read in two ways:
((the lid of the box) on the table)
(the lid of (the box on the table))
Because English genitives and prepositions branch the same way, both of these readings refer to some kind of lid. In Japanese, both readings would refer to a kind of table. However, in a language which did not obey Greenberg's universal no. 2, the two readings might be radically different (e.g. one reading would refer to a kind of lid, the other reading to a kind of table). This is shown in the table below.
If the two readings have nothing in common (as in the last column, where one is a lid and the other is a table),
then the structural ambiguity is a 'hard' ambiguity which might severely affect any later processing of the sentence.
Therefore it needs to be resolved immediately before further processing. If the two readings are rather similar
(both being a kind of lid) then this is a 'soft' ambiguity and we can use their shared meaning to carry on processing
the sentence, coming back to resolve the ambiguity later when we have more information. So it is much easier to
handle the soft structural ambiguities in a language which obeys Greenberg's universal number 2.
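The distinction between hard and soft ambiguities can be made concrete with a small sketch that computes the head noun of each bracketing. It is a toy under stated assumptions: the surface word order is held fixed for all three configurations (which a real postpositional language would not do), and the `head_first` flags stand in for the branching direction of the genitive and the adposition.

```python
def head(tree, head_first):
    """Return the head noun of a bracketed phrase.

    tree is either a noun (str) or a tuple (left, connector, right);
    head_first maps each connector to True (the left part is the head)
    or False (the right part is the head).
    """
    if isinstance(tree, str):
        return tree
    left, connector, right = tree
    return head(left if head_first[connector] else right, head_first)


def ambiguity(head_first):
    """Classify the ambiguity of 'the lid of the box on the table'."""
    reading_1 = (("lid", "of", "box"), "on", "table")
    reading_2 = ("lid", "of", ("box", "on", "table"))
    heads = (head(reading_1, head_first), head(reading_2, head_first))
    return heads, "soft" if heads[0] == heads[1] else "hard"


print(ambiguity({"of": True, "on": True}))    # English-like: both head-first
print(ambiguity({"of": False, "on": False}))  # both head-last
print(ambiguity({"of": True, "on": False}))   # mixed: violates universal no. 2
```

In the two consistent configurations both readings share a head, so a listener can carry on with the shared meaning; in the mixed configuration the readings have different heads and must be resolved at once.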
Language has developed to use large complex sentences which have many of these structural ambiguities. Therefore it is very important to be able to handle them 'on the fly', getting the gist of a sentence without resolving all ambiguities immediately. This has imposed a selection pressure on the feature structures for genitives and adpositions, forcing them to line up with universal no. 2 in any language.
Similar accounts apply to many of the structural universals discovered by Greenberg and others. For instance, the structural ambiguity in 'I saw the man near the steps' would be hard to handle in a language which had VO order and postpositions; so VO order is generally linked with prepositions, OV order with postpositions. VO/OV order must also be consistent with genitive order to avoid other similar hard ambiguities; the three features of VO/OV order, pre/postpositions, and genitive binding direction all exert selection pressures on each other to be mutually consistent and so to avoid hard ambiguities. This is shown in figure 11 below.
Figure 11: Mutual selection pressures leading to a head-first or head-last language
The other important 'headedness' parameter - whether relative clauses are before or after the noun - is also linked to some of these parameters (e.g. postpositions and relative clauses after the nouns will lead to hard ambiguities) but I have not yet checked these fully.
The only way to obey these constraints simultaneously is to be clearly a head-first or head-last language - to have a well-defined Head Parameter. Feature structures for verbs, prepositions and genitives are mutually selected to line up in this way.
In spite of the selection pressure, languages often depart from this ideal for extended periods; when they do so, they use other devices, such as more elaborate case marking, to control the resulting ambiguities (as in Latin, which is essentially a prepositional, OV language). These devices have their costs, and still entail a selection pressure towards simple head-first or head-last configurations. Languages descended from Latin, such as French, have moved to a purer head-first configuration.
Hawkins (1994) accounts for the same data using his 'Early Immediate Constituent' (EIC) principle, which acts to minimise a short-term memory load in sentence understanding. Such considerations probably do provide a selection pressure on words, and so could be argued to drive language change by word evolution. However, Hawkins' account requires some quite subtle counting of EIC metrics to make it work, and I would argue that the effects of hard structural ambiguities on processing efficiency are much more marked (e.g. having to hold two quite distinct meanings in short-term memory while processing much of a sentence) and are more clear-cut; they provide a stronger selection pressure on words to conform to the head-first or head-last configuration.
Therefore the Head Parameter, which has been taken as evidence for innate language-specific structure in the brain, and is central to the 'Principles and Parameters' model of language acquisition (Chomsky 1988), is not evidence for any innate structure in the brain; it can be just as well explained by the evolution of word feature-structures in historic time.
Bickerton and others have noted how Creole languages have striking features in common which apparently cannot be explained as arising from their antecedent languages. Bickerton (1984) has proposed that these shared features are evidence for the 'bioprogram', the core language endowment in the human mind which distinguishes modern human language from an earlier proto-language.
For Bickerton's account to hold, these 'bioprogram' constraints on language must be strong enough to have got true language started in the human species, but weak enough to be overridden by modern languages.
An alternative account is that the universal features of Creoles arise not from some innate (and overrideable) structure in the human mind, but are the first fruits of rapid word evolution before slower changes give rise to the different structures of mature languages. In other words, when a new language is created out of Pidgin words by first-generation speakers of that language, the selection pressures on word feature structures are very different from the selection pressures in a mature language; the distinctive features of Creoles arise from these distinctive selection pressures.
I suggest that the strongest selection pressure on word feature structures in the early stages of a Creole is ease of learning; if a feature structure can be learnt fast, it will spread fast through a population, even if it is less economical than others. Later, more economical constructs will gain ground.
I have not worked out this alternative account in detail. However, to illustrate how it might work I shall use two examples:
(1) Creole languages often use separate discrete particles to express elements of meaning which in other languages are expressed in other ways; for instance, tense is expressed by a particle, rather than by verb morphology. These particles occur frequently, so tend to be easy to learn; whereas verb morphology is learnt by a more complex two-stage process.
(2) In Creoles the particles expressing tense, mood and aspect always occur in the order tense-mood-aspect (TMA) before the verb (Bickerton 1984), whereas for mature languages a more typical order is MTA. So a Creole sentence takes the form 'Fred (T) (M) (A) swim' compared with English 'Fred might have been swimming' which is 'Fred (M) (T) (A) swim'. Why is the mood marker closer to the verb in Creoles?
For a Creole, mood consists of just a binary realis/irrealis distinction, whereas for more mature languages the irrealis mood is split into a number of different modalities (might, could, should) which are as much about the subject's attitude to the action as about the action itself. For ease of learning, a marker should be as close as possible to the thing it modifies (so learners can easily make the connection, in spite of possibly mishearing other parts of the sentence). Therefore we may suppose that in Creoles the realis/irrealis mood marker is drawn towards the verb that it modifies, whereas in mature languages the modality markers are drawn towards the subject.
These accounts are clearly not yet fully persuasive; however, they suggest that an explanation of Creole structure in terms of selection pressures on words, rather than the structure of the human mind, may be worth investigating further.
The examples given so far have hinted that the word evolution picture may give a valid account of one of the most
puzzling features of language - the mixture of regularity and irregularity which we observe in every language.
Linguists have carefully catalogued the regularities, and then built theories of language on them - after which
the irregularities became something of an embarrassment, a set of facts without explanation, often to be relegated
to a dim 'periphery' of language. Given the pervasiveness of irregularities, they deserve a better explanation
than that.
The theory of categorial grammar and learning outlined in section 2 can support languages of arbitrary irregularity. The theory is fully lexicalised, so the syntax associated with every word is packaged in the feature structure for that word; a language could exist in which every word had different syntax packaged with it. However, because of the selection pressures on words, such a language would not stay that way for long. The three main selection pressures driving towards regularity are:
· Productivity: If many word feature structures have shared common 'shapes', they are easily interchangeable like Lego bricks in common patterns, giving a rapid combinatorial explosion of possible meanings. However, if each word had its own idiosyncratic shape, finding patterns which fit properly together would be a new creative exercise each time, sharply limiting the productivity of the language.
· Ambiguity: The examples of semantic roles (4.1) and language universals (4.4) illustrate how the need to avoid ambiguities can often force different words of a language into a common mould of regularity.
· Ease of learning: Regular syntax and morphology can be learnt by a secondary learning process (not described here) which reduces the learning required - so that for instance the full morphology of every new verb or noun need not be learnt.
On this basis, therefore, we might expect every language to continually converge to a state of greater and greater regularity. However, I suggest that two main forces prevent this.
The first can be understood from the analogy of a ferromagnetic crystal, in which neighboring atoms, through their magnetic moments, tend to line each other up along a common axis of magnetism. The mutual selective forces exerted on one another by words are of this form - tending to line up the verbs of motion along a certain 'axis', or to line up many parts of speech along a head-first or head-last axis, and so on.
However, in ferromagnetic solids, all atoms do not take the same alignment. Once the atoms in a certain region have become lined up in one direction, they stabilise each other in that direction - so it then becomes more difficult for any influence from neighbouring regions to realign them. The result is that the solid splits up into a number of domains, each of which has a regular alignment, but which have irregular borders. This is illustrated in figure 12 below.
Figure 12: Domains of regularity in a language, as in a ferromagnetic crystal
I suggest that words in a language may show similar behaviour - words of similar meanings may form 'domains of
regularity', stabilising each other in those patterns and resisting change from other domains. Each new word (such
as W in the diagram) is drawn into one of the domains A, B or C - but the domains have irregular and unpredictable
boundaries.
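The formation of self-stabilising domains can be illustrated with a deliberately simple one-dimensional toy. Each word in a row adopts the pattern shared by both of its neighbours, if they agree, and otherwise keeps its own. The row layout, the two pattern labels A and B, and the update rule are assumptions made for this sketch, not part of the theory.

```python
def relax(words: str) -> str:
    """Repeat neighbour-alignment sweeps until the row of words is stable.

    Each word takes the pattern of its two neighbours when they agree
    (a local majority); the words at the two ends never change.
    """
    state = list(words)
    for _ in range(len(state) + 1):       # safety cap on the number of sweeps
        nxt = []
        for i, w in enumerate(state):
            left = state[i - 1] if i > 0 else w
            right = state[i + 1] if i < len(state) - 1 else w
            nxt.append(left if left == right else w)
        if nxt == state:                  # no word changed: domains are stable
            break
        state = nxt
    return "".join(state)


print(relax("AABAABBABBB"))  # isolated misfits are absorbed into domains
print(relax("AAABBBAAA"))    # existing domains hold their boundaries
```

Isolated misfits are pulled into the surrounding domain, but once a run of two or more like words has formed, neither neighbouring domain can realign it, so where the boundaries fall depends on the starting arrangement.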
One example of this is the mixed ergative languages such as Yidijn. In such languages, the nominative/accusative nouns are usually at the animate end of the scale, and the ergative/absolutive nouns tend to be at the inanimate end - they form distinct domains, rather than overlapping at random (Anderson 1985). Each domain is self-stabilising (e.g. through interaction with the adjectives which tend to apply to animate and inanimate nouns respectively - see section 4.1) but they have different alignment.
The second major force leading to irregularity is, of course, language mixing - typically caused by conquest or invasion. Here the conquering group brings its own language, which the conquered emulate and absorb (i.e. the word feature structures propagate like viruses from conquerors to conquered), producing an irregular mixture from two (possibly more regular) antecedent languages. This social/political force is the main initial agent of irregularity. The resulting irregularity may take a variety of long-lasting knock-on forms as the many new small domains of regularity - created and intermingled by the language collision - jostle to re-form their boundaries.
So the mixture of regularity and irregularity which we find in all languages - and which is often an embarrassment for theories based on regularity - emerges naturally from the theory of word evolution. Languages, like species, are always travelling, never arriving; their words journey together in semi-regular bands which are frequently colliding and reforming.
There are two competing explanations of the structure of language - that it reflects language-specific structures in the human brain (i.e. that it arises from biological evolution), or that it reflects just the functional requirements for language. The theory of this paper is of the latter kind, because it is the functional requirements for language which create selection pressures for the evolution of words.
If a particular feature of universal grammar (say, the Head parameter) can arise from two distinct mechanisms, how do we choose which mechanism to believe? One relevant piece of information is the relative speed of the two mechanisms. The speed of any evolutionary process depends on three main factors (Worden 1995):
· The inter-generation time for replication
· The maximum number of offspring from one successful replicator in one generation
· The strength of the selection pressures (difference in fitness between least and most fit)
On each of these three factors, words are expected to evolve much faster than people. While children learn some words from their parents, with a generation time of 20 years, they can learn many words from their peers with a generation time of one year or less. While a reproductively successful person may have up to 10 children, a successful word can spawn hundreds of copies of itself. And finally, while intelligence and loquacity certainly contribute to the fitness of people, other non-cognitive selection pressures are equally important; and the selection pressure for particular parts of grammar is probably very weak. By contrast, if a word feature structure is a misfit, or is supplanted by another word, its outlook for replication is very bleak.
Because these drivers of selection are all so much stronger for words than for people's language device in the brain, we would expect words to evolve much faster - by a factor of 1000 or more. This expectation is borne out by the rapid evolution of languages over historic time, compared to the very small changes in human intellect for at least the last 50,000 years.
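The first two factors can be put into rough numbers. The sketch below compares the maximum per-year growth exponent of a successful replicator, ln(offspring) / generation time. This measure is an assumption chosen here for illustration and is not the speed-limit formula of Worden (1995); the value 100 is taken as a lower bound for 'hundreds' of copies.

```python
import math


def growth_exponent(max_offspring: float, generation_years: float) -> float:
    """Per-year exponential growth rate of a maximally successful replicator."""
    return math.log(max_offspring) / generation_years


people = growth_exponent(max_offspring=10, generation_years=20)
words = growth_exponent(max_offspring=100, generation_years=1)

print(f"people: {people:.3f} per year")
print(f"words:  {words:.3f} per year")
print(f"ratio:  {words / people:.0f}")
```

On these figures the generation time and offspring number alone favour words by a factor of about 40. The third factor, the strength of selection, is not modelled here, and would have to supply the rest of the difference estimated above.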
If two competing explanations of some change both seem to fit the facts, you should believe the faster one - the faster mechanism will get there first and make the change, even if the slower one might have done so in time. In fact, the faster mechanism will probably remove any selection pressure which could have driven the slower mechanism. If the words of our languages naturally line themselves up to be head-first or head-last, over hundreds of years, then our brains are under no selection pressure to evolve a head parameter (i.e. to force the words into line) over millions of years.
The idea that languages evolve has always seemed like a nice idea, but has been hard to cash out into a predictive theory. This is because the basic mechanism of language replication - the DNA of language - has been unknown; and without it, the evolutionary story lacks crucial detail. Constraints on language replication may prevent or divert evolutionary changes. However, there is now a precise working theory of language learning, formulated in the framework of categorial grammars, which enables us to understand how words replicate, and so how they evolve.
This theory is in many ways the direct opposite of the Chomskyan 'Principles and Parameters' picture of language acquisition and language change (Chomsky 1981; Lightfoot 1991). In the P&P picture, language acquisition is such a hard problem that it has to be narrowed down to the problem of setting a few parameters on the basis of language data - and the restriction to a few parameters must be innately specified in the wiring of the human brain.
In the theory of this paper, however, language acquisition is not such a highly restrictive process. The unification/generalisation mechanism for word replication can faithfully replicate any word feature structure, embodying any of the wide range of grammatical constructs seen in the world's languages. This general mechanism for learning and using feature structures is directly descended from primate social intelligence, in a biologically motivated way, and uses the same brain areas (Worden 1998a). It does not place arbitrary limits on the evolution of word feature structures - rather, it transmits information transparently, letting words evolve freely wherever the dominant selection pressures take them.
I have shown how this picture of word evolution can account for some of the most prominent features of languages, such as:
· The diversity of language syntax
· The domains of syntactic regularity and irregularity seen in languages
· The speed of language change
· Many language universals
· Special features of Creole languages
So much of the structure of language can be accounted for in this picture, that we may make a new working hypothesis as follows: The only constraints on language from the structure of the human brain are any innate limits on the form of word feature structures themselves (for instance, limits on the features allowed on a certain type of node). Apart from these, all language structure arises from the evolution of word feature structures under the selection pressures of use and learning. This working hypothesis is, as noted above, essentially the opposite of the Chomskyan working hypothesis; but it currently seems to be in good agreement with the data.
You may be concerned that the theory of word selection has so many mechanisms available to it - the six types of selection pressure described in section 3, and used in section 4 to analyse features of languages and language change - that it is under-constrained. Theories of evolution have always faced this danger of being under-constrained; but in this case, as the changes can often be observed as they occur, there is at least the prospect of directly verifying that the proposed selection pressures are acting on words. While theories of language are generally not proved false by single decisive tests, this theory is open to disproof by weight of evidence - if enough evidence accumulated for language changes which cannot be attributed to observable selection pressures on words.
The key result of this paper is that language structure tells us less than we thought about the structure of the human mind. Language universals do not reveal the underlying structure of our minds, but arise simply from selection pressures acting on word feature structures as they reproduce over many generations. To those who want to learn more about the mind, this result may seem a disappointment; a possible source of detailed information about the mind is revealed not to be one.
However, on reflection, they need not be disappointed - because in scientific theories, less is more. This theory need not assume that the mind has a whole range of language parameter switches, to be set on hearing 'trigger data' (or to be set to intermediate positions for irregular languages). It assumes just that the human mind has a few general and powerful mechanisms for learning and using feature structures, which have evolved from our primate social intelligence. These mechanisms place few constraints on the word feature structures, which then evolve freely as we use them. As long as this model fits the data, a simpler theory of the human mind is a better one.
Anderson, S. R. (1985) Inflectional Morphology, in Shopen, T. (ed) Language Typology and Syntactic Description, Vols. I-III, Cambridge University Press, Cambridge UK
Andrews, A. (1985) The major functions of the noun phrase, in Shopen, T. (ed) Language Typology and Syntactic Description, Vols. I-III, Cambridge University Press, Cambridge UK
Bickerton, D. (1984) The language bioprogram hypothesis. Behavioral and Brain Sciences 7 173-188.
Carpenter, B. (1992) The logic of typed feature structures, Cambridge University Press
Cheney, D.L. and R.M.Seyfarth (1990) How monkeys see the world, University of Chicago Press
Chomsky, N. (1981) Lectures on Government and Binding, Dordrecht: Foris
Chomsky, N. (1988) Language and problems of knowledge: the Managua lectures, MIT press, Cambridge, Mass.
Comrie, B. (1989) Language Universals and Linguistic Typology, Blackwell, Oxford
Dawkins, R. (1976) The selfish gene, Oxford University Press.
Greenberg, J. H. (1966) Some universals of grammar with particular reference to the order of meaningful elements, in Greenberg, J. H. (ed) Universals of language, 2nd edition, MIT press.
Hawkins, J. A. (1994) A Performance Theory of Order and Constituency, Cambridge University Press
Kaplan, R. M. and J. Bresnan (1981) Lexical Functional Grammar: a Formal System for Grammatical Representation
Karttunen, L. (1986) Radical Lexicalism, CSLI report CSLI-86-68, Stanford University
Labov, W. (1981) Resolving the neogrammarian controversy, Language 57: 267-308
Lightfoot, D. (1991) How to set Parameters: Arguments from Language Change, MIT press, Cambridge, Mass.
McMahon, A. M. S. (1994) Understanding Language Change, Cambridge University Press
Oehrle, R. T., E.Bach and D. Wheeler (eds) (1988) Categorial grammars and natural language structures, Reidel, Dordrecht
Pollard, C. and I. Sag (1993) Head-driven Phrase Structure Grammar, University of Chicago Press
Schank, R.C. and R.P.Abelson (1977) Scripts, Plans, Goals and Understanding: an Inquiry into Human Knowledge Structures, Lawrence Erlbaum Associates, Hillsdale, New Jersey
Shieber, S. (1986) An introduction to unification-based approaches to grammar, CSLI, Stanford, CA.
Siekmann, J. H. (1989) Unification theory, J. Symbolic Computation 7, 207-274
Talmy, L. (1985) Lexicalisation patterns: semantic structure in lexical forms, in Shopen, T. (ed) Language Typology and Syntactic Description, Vols. I-III, Cambridge University Press, Cambridge
Uszkoreit, H. (1986) Categorial Unification Grammars, Proceedings of COLING 1986, Bonn
Worden, R.P. (1995) A Speed Limit for Evolution, Journal of Theoretical Biology 176, 137-152.
Worden, R. P. (1996) Primate Social Intelligence, Cognitive Science, 20(4): 579-616
Worden, R. P. (1997) A Theory of Language Learning, draft paper at http://dspace.dial.pipex.com/jcollie/
Worden, R. P. (1998a) The Evolution of Language from Primate Social Intelligence, to be published in The Evolution of Phonology and Syntax, Hurford, Studdert-Kennedy & Knight (eds) Cambridge University Press
Worden, R. P. (1998b) A Theory of Learning for Categorial Grammars, draft paper at http://dspace.dial.pipex.com/jcollie/
Zeevat, H., E. Klein and J. Calder (1988) Unification Categorial Grammar, in N. Haddock, E. Klein and G. Morrill (eds.) Categorial Grammar, Unification Grammar and Parsing, Edinburgh Working Papers in Cognitive Science Volume 1, Centre for Cognitive Science, University of Edinburgh