14

Is there a natural language which is known not to follow Zipf's law? I'm interested to see if it's really universal.

This is what Zipf's law states:

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

Edit - To clarify a bit: imagine that you could snoop on every conversation held by native speakers everywhere, in every context, over the course of one year, record every word, count the occurrences, and calculate their frequencies. Would they deviate from Zipf's law? We can't do that in reality, but for languages with a substantial written corpus we can get a broad enough sample.
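The experiment described above can be sketched in miniature: count word frequencies in a sample, rank them, and compare each observed frequency against the 1/rank prediction. This is an illustrative sketch only (the toy sentence and the helper name `zipf_table` are my own, not from the question); a real test would need a corpus of millions of words.

```python
from collections import Counter

def zipf_table(text, top=10):
    """Rank words by frequency and pair each observed frequency
    with the Zipf prediction f(r) = f(1) / r."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common(top)
    f1 = ranked[0][1]  # frequency of the top-ranked word
    return [(rank, word, freq, f1 / rank)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy sample; deviations from the prediction are expected at this scale.
sample = "the cat sat on the mat and the dog sat on the log"
for rank, word, freq, predicted in zipf_table(sample):
    print(rank, word, freq, round(predicted, 1))
```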

sashoalm
  • 510
  • 1
  • 5
  • 15
  • 5
    Surely this is missing the point of Zipf's law? It's not meant to be a linguistic universal in the same way that, say, the binding conditions might be. It doesn't apply to languages, but to corpora, and no matter how big your corpus, you can never equate that corpus with the language it's written in. Asking, e.g. "Does Portuguese obey Zipf's law?" is meaningless. – P Elliott Jan 12 '14 at 15:44
  • Does that mean word frequencies differ between the spoken and written language? – sashoalm Jan 12 '14 at 16:33
  • sashoalm, it is true that written and spoken language have different word frequencies, but that is not what P Elliott meant. He meant that a corpus is a large but finite subset of the utterances of a language. – Mitch Jan 13 '14 at 02:37
  • Sure, Zipf's law is an inexact observation and pretty much all natural languages follow it to some extent. Some might be close or further away, but none follow it exactly, it's not gravity. There might be some man-made languages used by machines that are 'flat' or all words have the same frequency (say in cryptography). – Mitch Jan 13 '14 at 02:48
  • @Mitch Isn't that how statistics works? For example, with voting, you don't need to ask everyone in America how they'll vote, you just need to ask (for example) 100,000 people. So long as you pick them at random (they're not all from the same city - that'd be called a biased sample), the average you get should be pretty close to the real average. That works even for potentially infinite data sets - a random sample, if statistically significant, should still be close to the averages. – sashoalm Jan 13 '14 at 04:45
  • 1
    sashoalm, I don't agree with P Elliott about his intent of corpus vs. language. A sample vs. the entirety/rule that produces it is a well-known statistical concept. The Zipf curve for Finnegans Wake is well known to be noticeably skewed from the language-wide Zipf curve (in the opposite direction from, say, low-complexity text like the Twilight series). All human languages follow Zipf's law more or less, none exactly. – Mitch Jan 13 '14 at 13:41
  • Part of Zipf's law comes from just plain reality. 50% of car accidents occur within 1/2 mile of home? You spend 50% of your driving within 1/2 mile of your home. – Mitch Jan 13 '14 at 13:45
  • @Mitch "A sample vs the entirety/rule that produces them is a well-known statistical concept" - that's fine as long as you could in principle collect all of the data. When taking a sample of a population, for example, the population is finite, so you can in principle calculate how large your sample has to be to be repsentative. The set of all possible sentences (assuming that's how we define a language in its totality) is infinite. – P Elliott Jan 14 '14 at 13:46
  • 2
    @PElliott Samples (which are always a finite set) can come from a discrete or continuous space; the continuous space is necessarily infinite, but the discrete space can be too. Anyway, with respect to language, how do you come up with a -grammatical- rule like 'articles come before a noun', which is about an infinite number of sentences, from only a finite sample of heard utterances? – Mitch Jan 14 '14 at 14:26
  • @Mitch Exactly! To put it another way, if you have a loaded die, you can measure its probability of giving you 1, 2, 3, 4, 5 or 6 from, say, 1000 throws - and it would be a pretty good approximation, even though you could potentially make an infinite number of throws! That's how statistics works, as I understand it. – sashoalm Jan 14 '14 at 14:37
  • I updated my question to say "natural languages", to remove any ambiguity. – sashoalm Jan 14 '14 at 14:45
  • @Mitch Crucially, we can't formulate rules like 'articles come before a noun' just by using a corpus. We have to collect negative evidence, i.e. we construct examples where the article comes after the noun, and then see whether speakers find it acceptable. There are plenty of odd constructions, like parasitic gaps, which have been extremely important in linguistic theory but are vanishingly rare in corpora. – P Elliott Jan 14 '14 at 14:49
  • @Mitch In other words, nobody formulates linguistic universals based purely on corpus data. – P Elliott Jan 14 '14 at 14:50
  • 1
    @PElliott sure, but that's not the point. Both the linguist (explicitly) and the language learner (implicitly) are creating universal rules from a finite set of data. So the distinction between language and corpora doesn't help. A Zipf curve is calculated from a corpus, but Zipf's law is about the language. – Mitch Jan 14 '14 at 19:36
  • @Mitch The data set used by the linguistic/learner may be finite, but it includes rules about what isn't allowed. It's not possible to formulate a linguistic universal purely on the basis of a finite data-set in the absence of negative evidence, i.e. a corpus. – P Elliott Jan 15 '14 at 11:02
  • As a thought experiment, imagine a book written in a language where all words are equally distributed. That means every word in it occurs the same number of times, so once you use a word, you can't use it again until you've used every other word in the book at least once. This includes function words such as "the", "that", "in" or "if", and common words like "man", "more", and "up"; for every "the" you need an "internet", a "tomorrow", a "glove", etc. If the distribution is linear, it barely improves; you have to use e.g. 1 "for" for every 10 "the"s, 1 "internet" for every 10 "hard"s, etc. – melissa_boiko Jan 30 '18 at 02:08

4 Answers

10

As with all natural laws, Zipf's law is an approximation. If you take a large corpus, and compute the Zipf curve, it will more or less follow a Zipf distribution (with coefficients thrown in to account for slack).

This doesn't mean that every language follows the exact rule 'the second most common lexical item is 1/2 as frequent as the most common'. It's a loose observation. One can do a regression analysis to discover the exact coefficients for a particular language.
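Such a regression is a least-squares fit of log(frequency) against log(rank); the negated slope estimates the exponent s in f(r) ∝ r^(−s), with s ≈ 1 for a classically Zipfian corpus. A minimal sketch, where the synthetic corpus and the helper name `zipf_exponent` are my own illustrative assumptions:

```python
import math
from collections import Counter

def zipf_exponent(words):
    """Estimate s in f(r) ~ C * r**(-s) by ordinary least squares
    on log(frequency) versus log(rank)."""
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope  # close to 1.0 for a perfectly Zipf-like corpus

# Synthetic corpus whose rank-r word occurs exactly 2520 // r times
# (2520 is divisible by 1..10, so frequencies follow 1/r exactly):
words = [f"w{r}" for r in range(1, 11) for _ in range(2520 // r)]
print(round(zipf_exponent(words), 3))  # 1.0
```

A real corpus would give a slope near, but not exactly, 1, and the residuals of the fit are one way to quantify how far a given text diverges.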

Even within a language there are divergences. It isn't hard to find works whose Zipf curves diverge from the language-wide one. Joyce's Finnegans Wake uses so many rare and made-up words that its tail is thick and long. Children's literature, by contrast, aims to be easily understood, so it has few rare words and drops off much more sharply.

Zipf's law doesn't just approximate word frequencies but also letter frequencies, city sizes, income ranks, and many other rank vs. frequency graphs. It is taken into account in decrypting substitution ciphers, and also in creating codes (artificial languages) that do not follow Zipf's law.
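Frequency analysis against a simple substitution cipher is the same rank comparison applied to letters: the ciphertext letter that occupies the rank where a common letter normally sits is the first guess for its plaintext value. A hedged sketch (the Caesar-shifted pangram and the `letter_ranks` helper are illustrative assumptions, not from the answer):

```python
from collections import Counter
import string

def letter_ranks(text):
    """Rank letters by frequency; in a simple substitution cipher,
    the top-ranked ciphertext letters likely map to common plaintext
    letters such as e, t, a."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    return [letter for letter, _ in counts.most_common()]

# "The quick brown fox jumps over the lazy dog" under a Caesar shift of 3:
ciphertext = "Wkh txlfn eurzq ira mxpsv ryhu wkh odcb grj"
print(letter_ranks(ciphertext)[:2])  # ['r', 'h'] -> plaintext 'o' and 'e'
```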

Adam Bittlingmayer
  • 7,664
  • 25
  • 40
Mitch
  • 4,455
  • 24
  • 44
  • 2
    True, except that the harmonic pattern of city sizes was not discovered by Zipf but much earlier by Auerbach. Credit where credit is due. – fdb Jan 14 '14 at 23:52
  • I know very little statistics. Is there a standard measure of how closely a given corpus follows an ideal Zipf curve (e.g. a number that would be high for a typical text, and low for Finnegans Wake)? – melissa_boiko Sep 02 '16 at 14:07
  • This answer recaps what Zipf's law is, and says that not all Zipf curves are the same. Of course they are not, as there's a free parameter in Zipf's law. But it does not really answer the question, which I find very interesting. – famargar Mar 02 '17 at 09:31
  • @famargar Good point that I did not exactly outright answer the question. The question is essentially an empirical or statistical one - I don't know of a test of all current languages as to how close they fit to the Zipf law. So I don't know the answer. I just highly suspect that all natural languages are roughly Zipf-like because a human language would have to be pretty strange to not follow it. – Mitch Mar 02 '17 at 14:53
  • 1
    Also the distribution of income was already captured in the Lorenz curve, when Zipf was three years old. (@fdb) – Keelan Sep 07 '17 at 05:51
8

Zipf’s law, as I understand it, is not really about languages, but about statistics and probability. It is just one of several formulations of the fact that many non-arbitrary sequences of numbers (frequency of words in a given corpus; population size of cities in relation to their rank; annual turnover of ranked companies; etc., etc.) are not evenly distributed along a decimal scale, but are more or less evenly distributed along a logarithmic scale. As such, it ought to work with all language corpora, including texts that avoid the use of certain letters.

fdb
  • 24,134
  • 1
  • 35
  • 70
1

Spanish doesn't follow it. Not even remotely.

https://en.wikipedia.org/wiki/Most_common_words_in_Spanish

-1

A sample of any text written under formal constraints (constrained writing) and/or according to the grammar of an avoidance speech style (including honorific speech styles in languages that have honorific lexemes but no neutrally polite speech style) would contradict Zipf's law.

This can best be shown by the lipogrammatic example of A Void by Perec. This novel was written without the letter E, the most frequently used letter in French, and hence lacks many of the most frequent words in the French vocabulary (e.g. it contains no de or des, which have among the highest frequencies in French texts elsewhere).

In short, such a conversation would present a large diversity in lexical occurrence due to the very existence of idiolects.

Manjusri
  • 2,781
  • 1
  • 19
  • 27
  • 1
    I'm interested in the second part of your answer. Does that mean that French deviates from Zipf's law? That is, does the entire known written corpus of French also deviate statistically in its word frequencies? – sashoalm Jan 11 '14 at 16:17
  • This means Zipf's law is not universal, because it is not applicable to constrained writing. Elaborating the first part of my answer further, I would suggest that Russian could be that kind of a language (e.g. one of the most frequent lexemes in Russian for 2012 was сэр, a russified form of sir, but I doubt this is a highly frequent lexeme in most Russian texts). – Manjusri Jan 11 '14 at 20:57
  • 1
    I've edited my question to clarify what I mean. In particular, the novel "A Void", as I understand it, deviates from Zipf's law, but it also deviates from the language-wide statistical distribution for French. – sashoalm Jan 11 '14 at 21:21
  • Any text in a language deviates from the statistical distribution of the language as a whole. – Manjusri Jan 11 '14 at 22:28
  • 3
    Maybe someone should actually do a word count on Perec’s “La disparition”. I would not be surprised if it did conform with Zipf’s law. – fdb Jan 11 '14 at 22:59
  • 2
    The question was about particular languages in their totality, not about specific texts in those languages. I can write a three volume text consisting of a single recurring word, kind of "Buffalo buffalo buffalo ...", but it won't have anything to do with the general tendencies of English. – Yellow Sky Jan 11 '14 at 23:35
  • 1
    This is merely proof of the fact that you (or another downvoter) haven't caught the meaning of my answer, as usual. Not a big surprise. – Manjusri Jan 12 '14 at 06:06
  • 1
    @Manjusri Don't worry about it. Seems someone has changed their upvote to a downvote on my question, too, probably the same guy that downvoted you. But let's not obsess about points, the discussion is what is important :) – sashoalm Jan 12 '14 at 11:32
  • 2
    @YellowSky it doesn't make sense to ask about Zipf's law with respect to a particular language in its totality, since the totality of a language is an infinite set of sentences, and an infinite set of sentences by definition will fail to meet Zipf's law. – P Elliott Jan 13 '14 at 00:56
  • If you remove one letter (like in "A void") the remaining letters should still obey Zipf's law, just in a different order compared to the order that includes "e". – kaleissin Jan 14 '14 at 18:30
  • Removing one letter, we remove words. E.g. de/des is one of the most frequently used prepositions in French, and removing it changes the grammar. This could push words twenty places down the list, etc. – Manjusri Jan 14 '14 at 20:47