Thursday, February 26, 2009

"Oldest English Words" Identified

I stumbled across this piece on the BBC website that discusses a computer-enabled mathematical model that analyses the rate of change of words in English and related languages. It appears to me that this is a new application of glottochronology, a technique that has been around for quite a while.

The researchers at Reading University in the UK claim "I", "we", "two" and "three" are among the most ancient English words. They also say that they can use their model to predict when words will drop out of the language, and list "squeeze", "guts", "stick" and "bad" as probable early casualties.

One intriguing feature of this model is an algorithm that allows you to build a phrasebook of common words between two periods of time:

"You type in a date in the past or in the future and it will give you a list of words that would have changed going back in time or will change going into the future," Professor Pagel told BBC News.

"From that list you can derive a phrasebook of words you could use if you tried to show up and talk to, for example, William the Conqueror."

That brought me up short. Doesn't this guy know William the Conqueror spoke French?


Peculiar said...

Whenever I read about glottochronology and Indo-European linguistics, it blows my mind that it can possibly qualify as science. The number of what-ifs would seem to add up to a very shaky structure indeed. But it has come through with some pretty striking instances of predictive power, and the whole thing really does seem to work, even though so many component lines of reasoning are questionable.

It sometimes seems to me that linguistics is in a state analogous to geology in the 19th Century. The theory is there, and we can learn a lot by applying it to the facts available. But unlike geology, where our data-gathering increased by orders of magnitude in a century, there's not much hope that we'll ever find significantly more ancient language data than we have now. Sure, maybe a linguistically interesting name of some Ostrogoth king might pop up, or a Central Asian oasis could yet yield a new form of Tocharian, but even that would remain a drop in the bucket of human speech. How much testing of hypotheses can you really do without access to new data? It's like doing geology in only one country: you can learn a lot in Scotland, but you're never going to get to plate tectonics.

Don't get me wrong, I do find this stuff worthwhile and fascinating. It's pretty damn cool that we can give plausible arguments that, say, Finno-Ugrian borrowed Indo-European words a couple millennia before anyone was writing in that neighborhood. But it's inevitable that there's a lot we'll never know, and therefore a lot we'll misunderstand. That's observational science for you, I guess, but it seems like you bump against the limits a lot faster in linguistics.

Reid Farmer said...

Very well said, Mr. P.

One of the other things that really bothered me about this article is the fact that this Pagel guy isn't even a linguist - he's a biologist. Maybe that contributed to his Duke William gaffe.

therese said...

Not surprising that Pagel is an evolutionary biologist - the techniques are similar between tree-building for phylogenetic work and building language trees. One night when I was playing around online looking at language trees I came across a presentation on it where the first slide was "Why do we make tree" Answer: "If we don't the phylogeneticists will for us".

Peculiar: The application of predictive models using limited available data is basically what systematics is. Whether you're asking questions about the relatedness of languages or species, you take a very limited amount of available data (say all the extant species in a group, assuming you're lucky and can actually get all the species), feed it all into ModelTest, which picks the best model (out of 54, I think it is), then you stick it in MrBayes and wait anywhere from an hour to 30 days (model-based analyses take forever even with supercomputers) until the results appear. From there you can predict not only relationships between taxa but also likely ancestral states of characters, which characters are homologous vs. convergent, etc. The real science is being able to figure out what characters actually belong in the matrix, i.e. are not likely to have evolved multiple unrelated times, are coded correctly, and have some sort of phylogenetic signal. The rest of the science comes later while trying to interpret the analyses. Typically when it's all done you've got 10 different analyses, all with slightly different clades. Figuring out what to do there is where the fun begins...