Monday, September 11, 2006

Word Play

Chris Anderson, author of "The Long Tail", introduces in his blog a fascinating (at least to me) idea about the interplay of words: their frequency of occurrence as opposed to their length, their contextual meaning, and their juxtaposition.

To quickly summarize his thesis, Chris cites Zipf's Law, which states that the frequency of words falls off exponentially as a function of their length, and that the rank of these words shares a power-law relationship with their frequency of occurrence/usage. This was based on analyzing James Joyce's opus Ulysses and was stated, at the time, to hold true for the English language. Chris then cites Wentian Li, who shows that this is the case for all languages, including imaginary ones made up of words banged out on keyboards by monkeys. Chris then concludes that Zipf's Law is more of a general mathematical law than one governing and explaining the linguistic features of a particular, or for that matter any, language.
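For anyone who wants to poke at the monkey claim, here is a rough Python sketch. It is my own illustration, not Li's procedure or anything from Chris's post: the five-letter alphabet, the 20% chance of hitting the space bar, and the function names are all assumptions. It types random text, ranks the resulting "words" by frequency, and fits a straight line to log(frequency) against log(rank). A power law would show up as a roughly straight line with a negative slope.

```python
import math
import random
from collections import Counter


def monkey_text(n_chars, alphabet="abcde", space_prob=0.2, seed=0):
    """Type n_chars random keystrokes; the space bar is hit with probability space_prob."""
    rng = random.Random(seed)
    chars = []
    for _ in range(n_chars):
        if rng.random() < space_prob:
            chars.append(" ")
        else:
            chars.append(rng.choice(alphabet))
    return "".join(chars)


def rank_frequency(text):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    counts = Counter(text.split())
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    freqs = rank_frequency(monkey_text(200_000))
    print("distinct words:", len(freqs))
    print("log-log slope:", loglog_slope(freqs))
```

The same two functions, rank_frequency and loglog_slope, can be pointed at any real text to compare its slope with the monkey's. A single fitted slope is a crude summary, so treat it as a first look rather than a test of the law.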

Just thinking about this, one can already see that certain words (like "I") are more likely than certain others of the same length (I can't think of any right now, but you are free to fill in the blanks). And this, I am sure, is a feature of the English language. There might be different relationships in different languages, but there is enough evidence to suggest that not all languages (real and invented) follow the power law blindly. Rules of linguistic construction are the anomalies, and these rules, in my opinion, were first formalized by years of talking/speaking before they were codified into a written language system. I am no linguist and certainly no expert on the evolution of languages, and I am shooting from the hip here, but Li, in my opinion, has overgeneralized Zipf's Law.
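A small Python sketch of the point about same-length words, under loud assumptions: the sample sentence is made up by me and is far too short to prove anything, and counts_by_length is just a name I picked. It groups the words of a text by length and lists how often each one occurs, so you can see whether words of equal length are used equally often.

```python
import re
from collections import Counter, defaultdict


def counts_by_length(text):
    """Map word length -> {word: count} for the lowercase words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    by_len = defaultdict(dict)
    for word, count in counts.items():
        by_len[len(word)][word] = count
    return by_len


# Made-up sample sentence (an assumption, not real data).
sample = "the cat saw the dog and the dog saw the owl but the owl saw no one"
groups = counts_by_length(sample)

for length in sorted(groups):
    # Most frequent first; ties broken alphabetically.
    ranked = sorted(groups[length].items(), key=lambda kv: (-kv[1], kv[0]))
    print(length, ranked)
```

Feeding it a real book instead of the toy sentence would be the actual experiment. In monkey-typed text every word of a given length is equally likely by construction, so any large spread within one length in real text is the kind of deviation I am talking about.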

Edit: A fun way to learn about the Long Tail phenomenon

6 comments:

Anonymous said...

Deadly Dilipa...oow(out of words).

Some South Asian languages were actually codified at the same time they were spoken. But the generalization could be explained by humans having had a common pool of words to use. Any isolated or resource-limited country (e.g. Japan) should be studied for Zipf's Law.

Having said this, some languages also have the peculiarity that they "borrow" words and ideas. This might explain some of the similarities.

Countries that have had hardly more than 3 to 10 languages and limited migration are rare, and these languages should serve as our baseline.

Anonymous said...

Hmm...and along we were taught about semantics and grammar! So much for creative writing. We are Zipf's puppets after all..sigh!

Anonymous said...

oops..that was 'all along'. See? I have no control over my words now!

EmperorFrost said...

anonymous:
do mention your name next time. I have a couple of comments. Why do you expect a language that has developed in relative isolation to follow a power law? Even if this language were developed in isolation, just by the way its grammar has evolved there ought to be truckloads of violations of any power law. And I would expect it to get worse for the power law when it comes to languages that were developed by borrowing and adapting from others. If this were the development path for a language, then it surely must have been oral before any formalizations of grammar were made. Take for example the various words that crept into spoken English long before they were formalized into the language itself. My thesis is that random languages (like 'monkey') follow Zipf's Law. Human languages are quite likely to disobey.

NaN:
you really have an interesting point there, maybe if I choose my words based on Zipf's Law I can write (oh the horrors!) creatively!

More interestingly, typos (akin to monkey, as they are totally random and do not follow any grammar) ought to be governed by Zipf's Law. I think I am going to cook up an experiment to verify this. Don't expect it very soon though.

Anonymous said...

Dilipaaa..

Every rule (grammar, semantics) has an exception. Every language has its dialects and deviations that make it 'unique', and branding them under any law would make a rule out of it (terrible!). However they are classified together, languages (Hindi/Bhojpuri or Thai, for example) borrow from each other and have similar issues: though nominally the same, they differ in how their words are used. I may chose to speak the words or construct my own thoughts without regard to the rule and still have someone understand.

Everyone migrated/adapted from Africa, so we should have some rules in common with African languages (Zulu/Swahili), but they are the ones with the fewest rules and are abstract, while the ones with maximal isolation (Japan till the 13th century, or Inuit) have structured languages that describe (Inuit, for one, has 26 descriptions for the color red!). These are baseless laws or pure mathematical fiction.


PS:
I wish to remain out of reach as well... guess ?.

EmperorFrost said...

anonymous,
a couple of comments -
"I may chose to speak the words or construct my own thoughts without regard to the rule and still have someone understand."
If there are only 2 people (or groups of people) who understand each other's 'rules', then I would most likely term it code and not language. As soon as you bring some scale into the equation, there have to be some generalizable "rules" for everyone to follow. A language thrives on being spoken and understood by many. As you scale up further, these "rules" become softer, resulting in localized rules, each being followed by communities of people. These localized rules constitute the dialects of the language in question. These communities are mostly held together by some thread of commonality besides dialect: something religious, geographical etc.

w.r.t. the rest of your comment, I do not see anything that contradicts what I have been saying: languages will defy any power laws. You have argued for the same. So we are in agreement here. But where I disagree with you is in your comparison of languages such as Zulu/Swahili and Japanese (your example of a language developed in isolation). It is your opinion (garnered from both your comments) that languages such as Japanese are better examples for studies of Zipf's Law, primarily because of their rigid codification (on account of their relative isolation). While I tend to sort of agree that Japanese might be a better candidate than Zulu, I still maintain that even Japanese will fail any power law tests. And as for Japanese being a better candidate, it is my opinion that the reason is not isolation but the onset of civilization, science and progress. Zulu was more of a spoken tongue than Japanese. Japanese was codified into a rigid language long before Zulu ever was. Since Zulu/Swahili were primarily languages used just for communication and not for codifying science etc., there was little need to go beyond speaking, which accounts for very large deviations from any mathematical laws. Japanese, on the other hand, quickly evolved from being a purely spoken tongue to one that was used to write and document science, literature etc. Though I agree with you in principle, the reason according to me was the development of the language into a script and beyond.