
Sorry if this is a little more mathematical in nature, but my question is:

Suppose I took a document of some length, whether it be a news article, a book, or something of that sort. What sort of relationship would I expect between the document's length and the number of unique words contained in it?

There are two constraints a graph of this relationship would have to respect: the line y = x, meaning every word I read is unique, and the line y = |English|, since English has a finite number of words, or at least some sort of upper bound (I don't want to get into a discussion of how many words are in the English language).

Perhaps more practically: how long a document, on average, would I need to retrieve to get a list of 100 unique English words, 1,000 unique English words, or some other value? This would of course differ between languages, but I am interested in English.

Otavio Macedo
demongolem
  • There is no relationship (beyond the fact that the first page or so is limited by having so few words). A document is built of words in the writer's vocabulary, which is finite. A document generally has a purpose (or topic), and its content will be limited to words relating to that purpose. You could have your printer print 3000 copies of a page with the same word repeated on it, and get 1 unique word across 3000 pages, or you could pick up a dictionary and have thousands of unique words across hundreds of pages. –  Dec 10 '12 at 21:24
  • I think you're looking for Heaps' law. It basically says that as the corpus size increases, the number of unique words increases sublinearly. I'm sure people who know computational linguistics have some hard numbers. –  Dec 10 '12 at 21:34
  • @Jim I am talking about real documents such as a CNN writer might put together, not toy documents. As a document increases in length, the same words will not be repeated over and over again. Rather, a longer document would expound further on the given topic. Agreed some words would be more common, but there is a relationship. – demongolem Dec 10 '12 at 21:37
  • @SáT - If this were on-topic, I would upvote you. Thanks, that looks good. – demongolem Dec 10 '12 at 21:38
  • @demongolem I suggest you use real documents, not CNN, because a fifth-grade vocabulary is quite limiting. Isn’t that what they are told to use? Unless it has gone down since. Yes, I’m serious, not ragging. – tchrist Dec 10 '12 at 22:40

1 Answer


Thanks for the comments on Heaps’ Law. I followed one of the links to a related law, Herdan’s Law.

I’ll quote Wikipedia here:

The rule is as follows: if V is the number of different words in the text, and n is the length of the text, then V will be proportional to n to the power β:

        V ∝ nᵝ

where β ranges from 0.5 to 1 depending on the text.

(The equation looks better in one of the stackexchange families that understands equation formatting. My apologies.)

If one takes the conservative β of 0.5, look for a corpus of about 10,000 words to get you 100 unique words (10000^0.5 = 100). At a β of 0.6, a corpus of roughly 2,200 words would do (2200^0.6 ≈ 101).
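
Here is a minimal Java sketch of that back-of-the-envelope calculation (my own illustration; like the figures above, it takes the constant of proportionality to be 1):

    public class CorpusSizeEstimate {

        // Invert V = n^beta (with the constant of proportionality taken as 1)
        // to estimate the corpus length n needed for a target vocabulary V.
        static double corpusLengthFor(int uniqueWords, double beta) {
            return Math.pow(uniqueWords, 1.0 / beta);
        }

        public static void main(String[] args) {
            System.out.printf("beta = 0.5: about %.0f words%n", corpusLengthFor(100, 0.5)); // ~10000
            System.out.printf("beta = 0.6: about %.0f words%n", corpusLengthFor(100, 0.6)); // ~2154
        }
    }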

=== EDIT: Added real numbers for "Green Eggs and Ham" and a couple of CNN articles.

I tested "Green Eggs and Ham" and a pair of CNN articles to find the number of unique words.

  1. Green Eggs and Ham: 791 words, 55 unique.
  2. Angry with Obama, GOP threatens political war next year: 1553 words, 573 unique.
  3. New voter rules announced ahead of Egypt referendum: 408 words, 235 unique.

The β values for these three are 0.600, 0.864, and 0.908. To find each β, I used V = nᵝ and took the log of both sides, so β = log(V) / log(n).

Testing was done in Java. Sam-I-Am counts as a single word, as does decision-making.
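
For anyone who wants to reproduce the counts, the test was roughly along the lines of the sketch below (a simplified illustration; the exact tokenization rules and the case-folding are assumptions on top of the hyphen handling described above):

    import java.util.HashSet;
    import java.util.Set;

    public class UniqueWordCount {

        public static void main(String[] args) {
            String text = "Sam I am. I am Sam. That Sam-I-am!"; // replace with the full document text

            // Split on anything that is not a letter or a hyphen, so hyphenated
            // forms such as Sam-I-am and decision-making stay single tokens.
            String[] tokens = text.toLowerCase().split("[^a-z\\-]+");

            Set<String> unique = new HashSet<>();
            int n = 0;                     // total word count
            for (String token : tokens) {
                if (token.isEmpty()) continue;
                n++;
                unique.add(token);
            }
            int v = unique.size();         // unique word count

            // Solve V = n^beta for beta by taking the log of both sides.
            double beta = Math.log(v) / Math.log(n);
            System.out.printf("n = %d, V = %d, beta = %.3f%n", n, v, beta);
        }
    }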

=== EDIT: Added graphic for "Green Eggs and Ham"

The x-axis shows the independent variable, the position of the word in the corpus. The y-axis shows the dependent variable, the running count of unique words.

[Graph: Green Eggs and Ham, word count vs. unique word count]
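
Each point in the plot is simply (word position, unique words seen so far); for the opening "Sam I am. I am Sam. That" the first data points are (1,1), (2,2), (3,3), (4,3), (5,3), (6,3), (7,4). A sketch of how those running counts can be generated (again an illustration, not the exact plotting code):

    import java.util.HashSet;
    import java.util.Set;

    public class RunningUniqueCount {

        public static void main(String[] args) {
            String text = "Sam I am. I am Sam. That"; // opening words; replace with the full text
            String[] tokens = text.toLowerCase().split("[^a-z\\-]+");

            Set<String> seen = new HashSet<>();
            int position = 0;
            for (String token : tokens) {
                if (token.isEmpty()) continue;
                position++;
                seen.add(token);
                // One data point per word: x = position in the corpus, y = unique words so far.
                System.out.println(position + "," + seen.size());
            }
            // Prints: 1,1  2,2  3,3  4,3  5,3  6,3  7,4
        }
    }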

rajah9
  • If the law is named after Gustav Herdan, better check to see if it's been widely verified. – jlawler Dec 11 '12 at 05:29
  • Your mathematics is wrong (10000^0.6 is about 251 not 2200) and in any case there is the key word "proportional" which means the statement is not about the number of different words in a given text but about how it changes when the text length changes. – Henry Dec 11 '12 at 07:37
  • It is odd that the Wikipedia article for Herdan's Law says it is also known as Heaps' Law, but the article for Heaps' Law gives different values for β (specifically, the page for Heaps' Law states 0.4–0.6, and instead of a pure proportionality it includes a linear factor K with a value from 10–100). Can anyone shed light on this inconsistency? – acattle Dec 11 '12 at 10:23
  • This may work as some sort of very rough rule of thumb for your average prose text, but it is easy enough to come up with counter examples, like a poem being a repetition of two words over five lines. – Cerberus Dec 11 '12 at 13:34
  • @Cerberus I certainly agree with this. It would seem that at the low end and the high end of the x axis there is great error and it is somewhere in the middle of the curve where one would get the better performance. – demongolem Dec 11 '12 at 15:35
  • @Henry Agreed, the key word is "proportional." The math I used was 10000 ^ .5 = 100 and 2200 ^ .6 = 101.26. The OP was looking for corpus size to get 100 unique words. – rajah9 Dec 11 '12 at 17:17
  • @demongolem: Yes. However, one's definition of a "document" also matters. The curve will probably jump up if you go from "half a chapter v. whole chapter" to "one chapter v. 1.5 chapters" in a novel, if the second chapter is about a different character/situation/etc. It's all very complicated, and the regularity depends on the genre too. – Cerberus Dec 11 '12 at 18:44
  • @Cerberus: Yes, you're right, of course. Just like Gadsby would be an exception to letter counts, Green Eggs and Ham would be an exception to unique word counts. (There are probably others, but that was the first one that sprang to my mind.) –  Dec 11 '12 at 18:46
  • @J.R.: Yes, exactly! The word count will probably not go up by much any more after the first two pages of Green Eggs and Ham. In almost any genre, the curve will also be steep for the first sentence or so, then flatten out later on. – Cerberus Dec 11 '12 at 19:05
  • @Cerberus, Added a graphic that addresses your supposition. – rajah9 Dec 11 '12 at 23:34
  • @rajah9: Oh, that's very interesting! I have a few more questions; I'm a bit confused as to how you went about calculating those figures. Last time I did mathematics was in high school, so please bear with me. 1. Did you use separate data points for each additional word of GE&H? Or per every x words? 2. Since V and n^β are "proportional", that means the formula is V=kn^β, where k is a constant. What constant have you found? Or is it 1? And how have you determined the values k and β? 3. You found values for β greater than 1 for two texts; does this mean you found constants below 1 for those? – Cerberus Dec 12 '12 at 02:04
  • @Cerberus, A1. The first words are "Sam I am. I am Sam. That" The first seven data points are (1,1), (2,2), (3,3), (4,3), (5,3), (6,3), (7,4). I used separate data points for each word. A2. I used k=1 because it was such a small set. A3. I had transposed the V and n columns for the CNN articles and have since recalculated their β values (now less than 1, which makes more sense!) Thanks for the gentle nudge. I have edited the answer and added the math steps. – rajah9 Dec 12 '12 at 14:42
  • It might be worth posting a question on maths.SE about the similarities / differences in Heaps' Law / Herdan's Law and the ambiguities in their Wikipedia entries. – hippietrail Dec 17 '12 at 02:22