
It has been suggested that to construct an uncrackable yet memorable passphrase with 256 bits of entropy, the passphrase should be manifested as a poem.

The answerer estimated the entropy per word based on the frequency of common words, but deferred to linguistic experts to determine the entropy lost to grammar.

What is the expected per word entropy of random yet grammatical text?
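For scale, the per-word entropy directly fixes how long such a passphrase must be. A quick sketch (the bits-per-word values here are illustrative placeholders, not settled figures):

    # Words needed for a 256-bit passphrase at various per-word entropies.
    # The entropy values are illustrative, not established estimates.
    for bits_per_word in (11.0, 9.8, 7.9, 5.7):
        print(f"{bits_per_word:4.1f} bits/word -> {256 / bits_per_word:5.1f} words")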

Sir Cornflakes

2 Answers


The figure for entropy of any language will depend on the model we use to compute it. This is much like how someone who speaks English well would see lower entropy in English than someone who barely speaks the language.

The model that Shannon used gave him a figure of 11 bits per word. Grignetti (1963) reported 9.83 bits per word. Some of the relatively modern techniques described in Chapter 6 of the classic Manning and Schütze textbook show entropy values of about 7.9 bits per word, when tested on Jane Austen's Persuasion. This example, being from a textbook that's over a decade old, is likely to have been superseded by better models.
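For concreteness, here is a minimal sketch of the weakest such model, a unigram estimate based on word frequencies alone (the toy text is hypothetical; richer models that condition on context, like those in Manning and Schütze, give lower figures):

    import math
    from collections import Counter

    def unigram_entropy_per_word(words):
        """Per-word entropy from word frequencies alone:
        no grammar or word order is taken into account."""
        counts = Counter(words)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Hypothetical usage on any tokenised text:
    words = "the cat sat on the mat and the dog sat on the rug".split()
    print(f"{unigram_entropy_per_word(words):.2f} bits/word")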

Given all this, if I had to guess, I'd estimate about 5 bits per word.


Update: The paper by Montemurro & Zanette (2011) answers your question somewhat more directly. For English, they report an average entropy of 9.1 bits per word for shuffled text, and 5.7 for the original text. So that shows you how much you gain by taking "grammar" (read "word order") into account.
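A rough way to reproduce that kind of comparison is to estimate bigram (conditional) entropy on a text and on a shuffled copy of it. A sketch, assuming a plain-text corpus file (the filename is hypothetical):

    import math
    import random
    from collections import Counter

    def bigram_entropy_per_word(words):
        """H(w_i | w_{i-1}) estimated from raw counts (no smoothing)."""
        pair_counts = Counter(zip(words, words[1:]))
        prev_counts = Counter(words[:-1])
        n = sum(pair_counts.values())
        return -sum((c / n) * math.log2(c / prev_counts[prev])
                    for (prev, _), c in pair_counts.items())

    words = open("persuasion.txt").read().lower().split()  # hypothetical corpus file
    shuffled = list(words)
    random.shuffle(shuffled)  # destroys word order, keeping word frequencies
    print(f"original: {bigram_entropy_per_word(words):.2f} bits/word")
    print(f"shuffled: {bigram_entropy_per_word(shuffled):.2f} bits/word")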

Now, I think that if someone were to devise software that takes real grammar into account, including syntax, semantics, pragmatics, etc., we could conceivably achieve about 4 bits/word.

As an aside, in all this, Finnish comes out as an interesting language, with an entropy of 7.1 bits/word.

prash
  • Late, but I can't thank you enough for such a practical analysis! – Gracchus Jun 30 '14 at 07:51
  • @Gracchus: I have added a newer reference. – prash Jun 30 '14 at 12:16
  • Doesn't Finnish have higher entropy just because it is agglutinative and therefore has longer words? – Moss Jul 29 '14 at 04:45
  • @Moss I don't know if it has longer words, but with richer morphology it will have a larger number of words for expressing the same concept. "Car" may have 15 different forms (this is how I estimated) based on the roles it can play in a sentence. Making up long words is a feature of German, though. – prash Jul 29 '14 at 20:26
  • "Now, I think if someone were to devise software that takes real grammar into account, including syntax, semantics, pragmatics, etc. we could conceivably achieve about 4 bits/word.", This is not true. Entropy can not count for semantics and pragmatics, I am astonish at what you say about entropy and semantics and pragmatics, please see Shannon's classical article for reference. – XL _At_Here_There Nov 10 '14 at 06:30
  • @XL_at_China No need to be astonished. Have you read papers on the significance of latent variable methods in parsers? These annotations have taken on some of the role of semantic annotation in parsers. And secondly, there are corpora that include semantic annotations already. – prash Nov 10 '14 at 12:13
  • @XL_at_China Can you please stay focused on the topic and not go into non sequiturs? If you have something concrete to say, please include it in your answer below. – prash Nov 10 '14 at 13:07
  • @prash, please read some books on computability, model theory, and information theory; then you will know that I am focused on the topic and can draw a conclusion about your answer. – XL _At_Here_There Nov 11 '14 at 01:00
  • @XL_at_China You really should take your own advice. – prash Nov 11 '14 at 01:07
  • @prash, LOL, so a researcher has to know the definition of semantics and related theorems, the definition of pragmatics, and to have read Shannon's classic article. Otherwise, classical information theory gets applied to a misunderstood "semantics" and "pragmatics". – XL _At_Here_There Nov 11 '14 at 02:05
  • Let's end this argumentative discussion; it is useless and benefits no one. – XL _At_Here_There Nov 11 '14 at 02:16

I am the developer of the Readable Passphrase plugin for KeePass, which is all about creating random yet grammatical text. So here is a brief empirical analysis of what it produces.

Based on the phrases produced, passphrases from a ~14k-word dictionary (version 0.15 of the plugin) contain between 7.4 and 9.3 bits of entropy per word, depending on which grammatical forms of a word are allowed. Given that this is when trying very hard to be random, I'd interpret those numbers as an upper bound on the entropy present in normal prose.

I took great pains when developing the plugin to count the number of combinations different phrases might produce, but this analysis is entirely dependent on my counts being correct. (Entropy is derived from the part of speech and allowed grammatical forms for each word in the pattern. It is also reported as a range, as different parts of speech affect what is grammatically allowed.)

Method

  • Create 1000 passphrases which follow a fixed grammatical pattern.
  • Determine the theoretical entropy of such a phrase based on the number of words in the dictionary.
  • Find the average number of words per phrase = total words generated / 1000.
  • Find the average entropy per word = theoretical entropy / average words per phrase (see the sketch after this list).
  • Rinse and repeat for a different pattern.
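A minimal sketch of those last steps (the phrase list and entropy figure below are stand-ins, not the plugin's real output):

    def per_word_entropy(phrases, phrase_entropy_bits):
        """Average words per phrase, then theoretical phrase entropy
        divided by that average, as in the method above."""
        avg_words = sum(len(p.split()) for p in phrases) / len(phrases)
        return avg_words, phrase_entropy_bits / avg_words

    # Stand-in for 1000 generated phrases and a computed 44.8-bit pattern:
    phrases = ["my trite one examines the supply"] * 1000
    avg, bits = per_word_entropy(phrases, phrase_entropy_bits=44.8)
    print(f"{avg:.2f} words/phrase, {bits:.2f} bits/word")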

Basic Pattern

<noun> <verb> <noun> - aka strength NormalRequired.

Note that nouns may be common, proper, or derived from an adjective, and may take a definite or indefinite article or a personal pronoun. The first noun may also be replaced by a number from 0-999 (digits). The verb uses present, past, and future tenses. The entire phrase may be in the interrogative.

Samples:

my trite one examines the supply
should Waldo knit the sophist
how does their bifocal thing coil the daydream
should a secret thing enqueue the decade
the 1 risk whams a whaler
  • Average words per phrase: 5.21
  • Theoretical entropy per phrase (bits, min / avg / max): 39.5 / 44.8 / 46.4
  • Entropy per word (min / avg / max): 7.58 / 8.59 / 8.90
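The per-phrase figures come from multiplying the number of choices available at each slot in the pattern. A sketch with hypothetical slot sizes (not the plugin's actual dictionary counts):

    import math

    # Hypothetical counts for <article> <noun> <verb> <article> <noun>;
    # verbs get a factor of 3 for the three allowed tenses.
    slot_choices = [8, 5000, 3000 * 3, 8, 5000]

    # Independent slots: phrase entropy is the sum of log2(choices).
    phrase_bits = sum(math.log2(n) for n in slot_choices)
    print(f"{phrase_bits:.1f} bits per phrase")  # ~43.7 with these counts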

Long Pattern

<noun> <adjective> <adverb> <verb> <adverb> <preposition> <adjective> <noun> <conjunction> <noun> - aka strength InsaneRequiredAnd

In addition to the previous pattern, this includes plural nouns and demonstratives, plus the continuous present, continuous past, perfect, and subjunctive verb tenses, and intransitive verbs. Note that intransitive verbs dramatically shorten these phrases (mostly because the plugin doesn't handle them very well). The conjunction is either and or or.

Samples:

should their streaky one variably elevate plus these hoarse liars because of the overdone oddity
should these mellow ones feasibly repose apart from this torrid poacher but not this depleted hewer
when does this disfigured one decisively replace except for the deadly intruders or the real sawdust
the 22 armful of logicians earlier smooched amidst that gay turret and even a homemade skywriter
their convex thing profited evermore
  • Average words per phrase: 13.50
  • Theoretical entropy per phrase (bits, min / avg / max): 119.49 / 123.03 / 124.17
  • Entropy per word (min / avg / max): 8.85 / 9.12 / 9.20

Comment

Adding extra parts of speech adds, at best, 1.5 bits of entropy per word, while also introducing considerably more complexity (making the phrase much harder to remember).

In order to get to 9 bits per word, the length and complexity of the phrase get quite out of hand. It would take non-trivial but reasonable effort to memorise, but once done, you're very close to the magic 128 bits.
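As a sanity check on that, simple arithmetic with the averages reported above:

    # Long pattern: ~9.12 bits/word on average.
    print(f"{128 / 9.12:.1f} words")  # ~14.0 words for 128 bits,
                                      # just past the 13.5-word average phrase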

The shorter phrases are pretty easy to pick up (I memorised several of them during development of the plugin!).

And yes, for those interested in the plugin, it does have phrases in-between these.

ligos