10

I am wondering what the most succinct written language is.

I would call one language more succinct than another if that language could communicate the same idea as another with fewer characters. I am not a linguist, so please forgive me if there is anything about this post that is misguided and feel free to make any corrections needed.

hippietrail
user93189

3 Answers

11

Edit: clarified that I used the full declaration, not only Article 1.

The way to answer your question is to take a sample of the same long text translated into many languages. The only example I know is the Universal Declaration of Human Rights, which has 380 reasonably complete translations hosted at http://www.unicode.org/udhr/. Then one just has to count the characters to get the answer.
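
The counting step can be sketched in a few lines of Python. This is only a minimal sketch, not the script used for the numbers below: the names `count_characters` and `rank_by_length` are made up for illustration, and the two sample strings are Article 1 only (copied from this answer), whereas the figures below are for the full declaration.

```python
def count_characters(text: str) -> int:
    """Count Unicode code points in the text, exactly as encoded."""
    return len(text)


def rank_by_length(translations: dict[str, str]) -> list[tuple[str, int]]:
    """Return (language, character count) pairs, shortest text first."""
    counts = [(lang, count_characters(text)) for lang, text in translations.items()]
    return sorted(counts, key=lambda pair: pair[1])


# Article 1 only, for illustration (hypothetical input dictionary).
article_1 = {
    "English": (
        "All human beings are born free and equal in dignity and rights. "
        "They are endowed with reason and conscience and should act towards "
        "one another in a spirit of brotherhood."
    ),
    "Chinese, Mandarin (Traditional)": (
        "人人生而自由,在尊嚴和權利上一律平等。"
        "他們賦有理性和良心,並應以兄弟關係的精神相對待。"
    ),
}

for language, n_chars in rank_by_length(article_1):
    print(language, n_chars)
```

With real data, the dictionary would be filled from the downloaded translation files after stripping the boilerplate; how those files are laid out is not assumed here.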

Caveats of the method

However, the method has several caveats.

1. Variety of the Translations

According to the website, different orthographies of the same language correspond to different translations, which is probably what explains the difference between the two Chinese versions (3737 characters vs 4677). The size of this difference (about 20%) is quite large for two orthographies that both use one character per syllable and encode the same language. It may reflect stylistic differences between the two translators, and I suppose it can serve as a warning to take all the numbers below with a big grain of salt. The error bars are likely to be bigger than 20%.
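
To make the size of that gap concrete: it can be expressed relative to either text, and the two conventions give different figures. A small sketch of the arithmetic, using only the two counts quoted above:

```python
traditional = 3737  # characters, Chinese (Traditional), as quoted above
simplified = 4677   # characters, Chinese (Simplified), as quoted above

difference = simplified - traditional
relative_to_longer = difference / simplified
relative_to_shorter = difference / traditional

print(f"{difference} characters")                          # 940 characters
print(f"{relative_to_longer:.1%} of the longer text")      # 20.1%
print(f"{relative_to_shorter:.1%} of the shorter text")    # 25.2%
```

So the figure of roughly 20% corresponds to measuring the gap against the longer (Simplified) text.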

Some of the translations do not contain the preamble, while most of them do. So it’s not exactly the same text each time. Other differences might exist, since I guess no one can check all 380 languages.

I’ve come across two identical texts for different languages (Ashéninka Perené and Cashinahua). I guess it’s a technical error, and there are probably other errors in the files.

2. Specific nature of the text.

It is a legal text. There is no reason for poetry, love letters, or historical tales to be equally succinct. For example, the translation of Article 1 in “Ashéninka, Pichis” seems to turn 2 sentences into 5, with a quote. I can only make sense of this if the article needs to define terms which don’t have an exact translation.

3. Definition of a character

I have used the simplest definition of a character: a Unicode character in the file, as encoded in the file (below the copyright boilerplate). However, this definition has several problems:

  1. The files are full of whitespace.
  2. They are not normalized, neither as NFC (é is one character) nor as NFD (é would be e + acute accent: 2 characters). The upside is that each file presumably uses the form most usual for its language (NFC for Korean, an unnormalized version for Vietnamese).
  3. The definition kept by Unicode sometimes has more to do with the history of the digital encoding of a script than with the real linguistic status of an entity as a character.

    • For example, Korean is written in Hangul, an alphabet where the 2 or 3 letters (jamos) describing a syllable are placed in a square the size of a Chinese character. Currently, there are two canonically equivalent ways of encoding a Korean syllable in Unicode: one (NFC) with one character, and one (NFD) with a character for each jamo, that is two or three characters per syllable. All that should of course be invisible to the reader.

    • Another set of examples is the way abugidas are encoded. An abugida is an alphabet where the vowel is essentially a diacritic added to the consonant. So the graphical unit is a syllable, but one can distinguish both the consonant and the vowel within this unit. Some of them, like the Ethiopic Ge'ez script used for Amharic, are encoded with one character per syllable, while others, like most (all?) Indic scripts, are encoded as complex scripts with one character per consonant and one per vowel, that is typically two characters per syllable. While one could argue that this factor of 2 corresponds to a more succinct digital representation of the script in the Unicode standard, it does not correspond to anything in the script itself.
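
The effect of these choices on the count can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch; the helper name `count_variants` is made up for illustration and is not part of the analysis above.

```python
import unicodedata


def count_variants(text: str) -> dict[str, int]:
    """Character counts under several definitions of 'character'."""
    return {
        "as_encoded": len(text),
        "nfc": len(unicodedata.normalize("NFC", text)),
        "nfd": len(unicodedata.normalize("NFD", text)),
        "no_whitespace": sum(1 for ch in text if not ch.isspace()),
    }


# U+00E9 is the precomposed e-acute: 1 character in NFC, 2 in NFD.
print(count_variants("\u00e9"))

# Three Hangul syllables: 3 characters in NFC, 8 jamos in NFD.
print(count_variants("한국어"))

# Whitespace alone can change the count: 5 characters, 2 of them non-blank.
print(count_variants("a  b\n"))
```

The same word can thus be counted as 3 or 8 characters depending only on the normalization form, which is the ambiguity described in the Korean example.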

Results

The most succinct language is Mandarin Chinese in traditional characters, followed by the same language in simplified characters. Unsurprisingly, the 10 most succinct languages include languages written with one character per syllable (Chinese, Yi, Amharic, Korean) and Japanese (which is a very special system). However, a few languages written in an alphabetic script, Latin or Cyrillic (Waama, Even, Cashinahua, Beti), make it into the top 10, while some languages written with one character per syllable (Tigrinya, Cree, Vai, etc.) don’t, and neither does any abjad (an alphabet without vowels).

Article 1 in various languages, as an illustration.

Below is Article 1 of the declaration in a few languages, sorted by the total number of characters used in the declaration. This number is given after the language name and counts the characters in the full declaration (i.e. the 30 articles, plus the preamble when present).

I first give the 10 most succinct languages, then English for reference, and finally the least succinct language.

Chinese, Mandarin (Traditional) : 3737

第一條

人人生而自由,在尊嚴和權利上一律平等。他們賦有理性和良心,並應以兄弟關係的精神相對待。

Chinese, Mandarin (Simplified) : 4677

第一条

人人生而自由,在尊严和权利上一律平等。他们赋有理性和良心,并应以兄弟关系的精神相对待。

Yi, Sichuan : 4910

ꋍꏢꏡꌠ

ꊿꂷꃅꄿꐨꐥ,ꌅꅍꀂꏽꐯꒈꃅꐥꌐ。ꊿꊇꉪꍆꌋꆀꁨꉌꑌꐥ,ꄷꀋꁨꂛꊨꅫꃀꃅꐥꄡꑟ。

Waama : 5121

Pɔpɔɔma (1)

Yiriba na bà sikindo dare bà mɛɛri, da seena yirimma mii bà ta da i nɛki bà tɔɔba.

Even : 5299 (no preamble)

Статья 1

Бэйил бокэтчур омэн хилкич нян урумкэр балдаритно, теми ноҥардук эгдьэн ҥи‐да ачча. Бэйил бөкэтчур мэн долан акагчимур биннэтын.

Ashéninka Perené or Cashinahua : 5576

(The two files are almost identical! It's probably an error. I guess that the following is in one of the two languages, but I don’t know which.)

Artículo 1

Yudabu dasibi jabiaskadi akin, xinantidubuki. Javen taea jau jaibunamenunbunven.

Japanese : 5865

第1条

すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについて平等である。人間は、理性と良心とを授けられており、互いに同胞の精神をもって行動しなければならない。

Amharic : 6393 (No preamble)

አንቀጽ፡፩፤

የሰው፡ልጅ፡ሁሉ፡ሲወለድ፡ነጻና፡በክብርና፡በመብትም፡እኩልነት፡ያለው፡ነው።፡የተፈጥሮ፡ማስተዋልና፡ሕሊና፡ስላለው፡አንዱ፡ሌላውን፡በወንድማማችነት፡መንፈስ፡መመልከት፡ይገባዋል።

Korean : 6400

(The file is in NFC: 1 character per syllable. A file normalized in NFD, with 1 character per jamo, would be two to three times longer.)

제 1 조

모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다. 인간은 천부적으로 이성과 양심을 부여받았으며 서로 형제애의 정신으로 행동하여야 한다.

Beti : 7428

Atiñ 1

Abiali bod bese, tege ai sesala, bene etie dzia a mis memvende y'enyiñ, dzom dzia etu fili nkóbó, fili ntsogan, fili mboan. Ve abiali te, mod ose ayem dze ene abe, dze ene mbeñ asu e mod mbog antoa ai mfi na enyiñ ewulu mezen mene sosoo. ...

English : 12322

Article 1

All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

Ashéninka, Pichis : 28432

IÑAAPINCATHAYEETERI 1.

Maaroni atziripayeeni, ovaquera intzimapaaque, eero ocantzi iñaashitacaavaitaityaari iromperanataityaari. Eejatzi oquemitari iroñaaca te apantyaaro amanitashireteri atziri ancanteri: "Te pirjiperote eeroca, iriima irinta iriitaque ñaaperori". Eejatzi oquemitari te oncameethate intzime aparoni atziri antayetashityaarone caari ishinetaacairi pashine irantero. Tema maaroni ayotziro ampampithashirvaayeta, ayotziro tsicarica otzimayetzi cameethatatsiri anteri o tsicarica otzimi caariperotatsiri, irootaque ocovaperotantari iroñaaca entacotavacaayetya anquemitacaantanaquero arentzitavacaatyeeyaami ocaaquiini.

octosquidopus
  • 1
    Probably a chapter of the Bible, using only modern day language translations (i.e. excluding King James for English) could be appropriate as well? I bet it's been translated to far more languages than such declaration! – Joe Pineda Feb 21 '14 at 00:03
  • if you find me a collection of text files of a few hundred Bible translations, I'll rerun the same analysis – Frédéric Grosshans Feb 21 '14 at 08:59
  • What about Omniglot's collection of translations of "The Babylon's Tower" narrative into several hundred languages? http://www.omniglot.com/babel/index.htm I had thought you'd base your comparison off their collection of translations of the Human Rights Declaration into more or less same number of langs http://www.omniglot.com/udhr/index.htm – Joe Pineda Feb 21 '14 at 16:15
  • @JoePineda : The texts at omniglot.com (both Babel and the UDHR) have two problems for me: 1. they are much shorter than the full UDHR hosted at unicode.org, which I used (including all 30 articles and not only the 1st); 2. they are not offered as a set of separate text files, so processing them would need some (admittedly small) extra work. – Frédéric Grosshans Dec 09 '14 at 17:13
  • 1
    This depends a lot on the translation. You can see that Beti has a lot of trouble rendering dignity (even though I don't speak the language), whereas English not only uses the same word, due to Latin influence, but also tries to mimic the original French doués de with endowed with, which perhaps doesn't capture the whole meaning (is that related to dues? Do you owe it to me or whomever to use your ability to reason?). You can see that Beti follows the French original, and that everything up to the first "and", if that's analogous to the comma in Beti, is packed into just "Abiali bod bese". – vectory Dec 25 '18 at 21:36
  • @vectory: indeed! This translation dependence is well illustrated by the two Mandarin versions, as I stress in the first caveat paragraph above. – Frédéric Grosshans Dec 26 '18 at 11:58
  • How do Traditional and Simplified Mandarin manage to diverge by nearly 900 characters? That seems highly suspect; assuming the translation is the same, the character count should also be the same. There’s no NFC/NFD distinction with Chinese characters, is there? – Janus Bahs Jacquet Jul 23 '20 at 13:48
  • @JanusBahsJacquet : I guess the translation is not the same, but I'm not sure it's enough to explain such a large difference – Frédéric Grosshans Jul 23 '20 at 21:43
7

The number of symbols a written language uses is inversely proportional to the length of a text in that language: the more symbols a language has, the shorter its texts are. This is true for any notation system, not only human languages. For example, our decimal numeral system uses 10 different symbols (0 to 9), while the binary numeral system uses 5 times fewer, only two (0 and 1), so the number "9" (one symbol) becomes "1001" (four symbols) in binary.
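
The numeral example can be checked directly. The sketch below (the helper `digits_in_base` is a name made up for illustration) counts the digits a number needs in a given base. For large numbers the binary-to-decimal length ratio approaches log2(10) ≈ 3.32, the figure pointed out in the comments below, rather than the ratio of the inventory sizes (5).

```python
import math


def digits_in_base(n: int, base: int) -> int:
    """Number of digits needed to write the positive integer n in the given base."""
    if n <= 0 or base < 2:
        raise ValueError("n must be positive and base at least 2")
    count = 0
    while n > 0:
        n //= base
        count += 1
    return count


print(digits_in_base(9, 10), digits_in_base(9, 2))   # 1 4, as in the example

big = 10 ** 100 - 1                                   # a 100-digit decimal number
print(digits_in_base(big, 2) / digits_in_base(big, 10))   # 3.33
print(math.log2(10))                                  # about 3.32
```

In other words, text length shrinks with the logarithm of the number of symbols, so a larger inventory helps, but far less than proportionally.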

What follows from this is that the language with the biggest number of symbols will have the shortest texts. Chinese, with its 50,000+ symbols, of which up to 3,000 are required to read a newspaper and 8,000 are needed by an educated person, is the first candidate for the Most Succinct Written Language Award.
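
As a rough illustration of what such inventory sizes can buy, the sketch below computes the most information a single symbol could carry, log2 of the inventory size. This is an idealized upper bound that assumes every symbol is equally likely, which no real writing system satisfies. The function name `max_bits_per_symbol` and the 26-letter comparison row are additions for illustration.

```python
import math


def max_bits_per_symbol(inventory_size: int) -> float:
    """Upper bound on information per symbol, assuming equally likely symbols."""
    return math.log2(inventory_size)


inventories = {
    "Chinese, all characters (50,000)": 50_000,
    "Chinese, educated reader (8,000)": 8_000,
    "Chinese, newspaper (3,000)": 3_000,
    "26-letter alphabet (added for comparison)": 26,
}

for label, size in inventories.items():
    print(f"{label}: at most {max_bits_per_symbol(size):.1f} bits per symbol")
```

Under this idealization a character from a 3,000-symbol inventory carries at most about 11.6 bits against about 4.7 for one of 26 letters, a factor of roughly 2.5 rather than 3,000/26.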

UPDATE: There are constructed languages designed specifically to compress so much information into a single word that expressing it would take a whole sentence in any natural language. The most famous of these is Ithkuil. It has a very complicated script: an Ithkuil sentence written in that script contains far fewer "characters" than there are words in the English translation of the sentence. So, if you consider constructed languages too, then Ithkuil is also a good candidate.

Yellow Sky
  • On the other hand, depending on your criteria, you could count Mayan as the most succinct also - since it lets you build blocks of syllables, you can have whole phrases in a single block. Depends on whether the criteria for succinctness is the physical length of the text or the number of individual letters used. – Sjiveru Dec 16 '13 at 22:39
  • 1
    @Sjiveru - user93189 has a very well-formulated criterion for a language to be succinct: it is to "communicate the same idea as another with fewer characters". Mayan blocks consisted of glyphs, and the glyphs, not the blocks, are the Mayan characters. All in all there are about 300 different Mayan glyphs, which is far fewer than the thousands of characters in Chinese. Besides, if you wrote a text in any language by laying out atoms in the shape of its symbols, such a text would definitely be immeasurably shorter than any single Chinese character printed in a book. :D – Yellow Sky Dec 16 '13 at 22:54
  • Lol, that's quite true! I guess 'succinct' to me means something quite different. – Sjiveru Dec 17 '13 at 00:02
  • Take into account that Classical Chinese is even more terse and succinct than any modern derivative of it! So surely it'd win, hands-down, in the contest among natural languages :) – Joe Pineda Dec 17 '13 at 01:14
  • 1
    We're talking about fictional languages here, really. If one had a writing system with 50 million symbols, one could get the text length down to a really low level, with an appropriate choice of representation. Or, with the entire IPA available, plus tones and various suprasegmental phenomena, one could get well into the thousands. That's basically the premise behind "Speedtalk", from Robert Heinlein's novella "Gulf". But it wouldn't really work, as this blog post makes clear. – jlawler Dec 17 '13 at 01:31
  • @YellowSky, I'm not sure that the concepts of "character" and/or "glyph" are that well defined. E.g. the digraph "fi" is written as 1 glyph in several fonts. Some languages lack consistency in whether to treat certain things as one character or 2, e.g. Spanish "ll", Danish "aa", Dutch "ij" and so on. – dainichi Dec 17 '13 at 04:21
  • How do you define "inversely proportional"? So since Chinese orthography has more than 1500 times more symbols than English orthography, texts are 1500 times shorter? That is not even true for the binary and decimal systems. The binary system needs base-2-log 10 ~ 3.32 times more symbols than the decimal system, not 5. – dainichi Dec 17 '13 at 04:27
  • @dainichi - When I wrote about the binary system, I meant 10/2=5, that was about the number of symbols in its inventory (0, 1), not about the rate at which binary numbers are longer than decimal. – Yellow Sky Dec 17 '13 at 08:49
  • 1
    Not a linguist myself but a question i am also interested in and one that would make perfect sense from an Information Theory point of view. So, what if you normalise number of symbols by alphabet length? Would that make the evaluation valid? – A_A Dec 17 '13 at 16:39
  • @YellowSky, you wrote "The number of the symbols a written language uses is inversely proportional to the length of the text in that language". AFAIK, that means that if there are 5 times as many symbols, the length of the text is one fifth. Which is wrong. – dainichi Dec 18 '13 at 00:30
  • @dainichi - I'm not a mathematician, I think you have the right formula in your comment above. – Yellow Sky Dec 18 '13 at 00:32
  • @dainichi, I don't think that you can say one way or another whether the number of symbols is inversely proportional to the length of the text in that language, unless you have a more precise relationship in mind. Such a relationship would be one where we could ensure that the other variables could be held constant while varying either the number of symbols or the length of text. Otherwise, all that we can say is that, the number of symbols in a written language seems or does not seem to be inversely proportional to the length of the text. – user93189 Dec 18 '13 at 03:25
  • @user93189, "inversely proportional" has a precise definition which is not satisfied here. Put in "seems" or not, that doesn't make a difference. Maybe what is meant is that they are negatively correlated, but that is a different thing. – dainichi Dec 18 '13 at 04:10
  • @dainichi, what is that definition and why is it not satisfied? – user93189 Dec 18 '13 at 04:15
  • 1
    @dainichi - I explained in my answer what I mean by "inversely proportional": the more symbols a language has, the shorter texts are. Naturally, the ratio of that is different in every particular case, I never said the factor of it is always 1. And that's just theory, IRL things are sometimes different, e. g. texts in Hebrew can be shorter than the same in English, but Hebrew has fewer letters than English, that's because different languages have a different degree of redundancy and it's difficult to calculate. – Yellow Sky Dec 18 '13 at 09:57
6

I do not know which language is the most succinct one, but I can tell you that this is actually a question of Information Theory, rather than Linguistics.

Essentially, what you are asking about is which {language, script} combination has the lowest entropy. That is, take a random sentence and cut it off at a random point (including halfway through a word), and ask speakers to guess what comes next. Repeat this a few thousand times and make a statistical estimate of the average uncertainty of your language. Claude Shannon [1] was the first to apply this method, and he found that English (written in the standard Roman alphabet) has, on average, an information content of about 1 bit/letter. Given that there are 8 bits in 1 byte (and assuming one byte per letter), the most efficient compression algorithm might be able to compress a random English text to 1/8th of its original length, but not more.
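
Shannon's guessing experiment needs human subjects, so it cannot be reproduced in a few lines. As a much cruder stand-in, the sketch below estimates entropy from single-character frequencies only. This is not Shannon's method: it ignores context entirely and therefore overestimates the per-letter information content of real text. The function name `unigram_entropy` is made up for illustration.

```python
import math
from collections import Counter


def unigram_entropy(text: str) -> float:
    """Entropy in bits per character, from single-character frequencies only."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(unigram_entropy("aabb"))   # 1.0: two equally likely symbols
print(unigram_entropy("abcd"))   # 2.0: four equally likely symbols

sample = "All human beings are born free and equal in dignity and rights."
print(round(unigram_entropy(sample), 2))
```

The gap between such a context-free estimate and Shannon's figure of about 1 bit/letter is a measure of how much of English is predictable from what precedes each letter.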

So, your question can be rephrased as:

  • Given the most efficient compression algorithm A, which {language, script} combination LS is the one that, given a random text T, A(T(LS)) results in the highest reduction in length relative to the original text?
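
No real program is "the most efficient compression algorithm", but a general-purpose compressor gives a rough proxy. The sketch below uses Python's standard `zlib` module as a stand-in for A; that choice, the UTF-8 encoding, and the function names are assumptions for illustration only. Note that the result depends on the byte encoding, and that short texts are dominated by the compressor's fixed overhead.

```python
import zlib


def compressed_size(text: str, encoding: str = "utf-8") -> int:
    """Size in bytes of the text after encoding and zlib compression."""
    return len(zlib.compress(text.encode(encoding), 9))


def compression_ratio(text: str, encoding: str = "utf-8") -> float:
    """Compressed size divided by encoded size (smaller means more redundant)."""
    raw = text.encode(encoding)
    if not raw:
        raise ValueError("text must not be empty")
    return compressed_size(text, encoding) / len(raw)


# A highly repetitive text compresses to a small fraction of its size.
print(compression_ratio("ab" * 500))
```

To compare {language, script} combinations this way, one would run `compressed_size` on long parallel translations of the same text; comparing the absolute compressed sizes sidesteps the question of how many bytes each script needs per character.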

[1] Shannon, Claude. 1951. Prediction and entropy of printed English. Bell System Technical Journal 30:50-64.

Koldito
  • 2
    I believe O.P. wants the opposite: the language with the "highest" entropy, the one whose writing packs so much information into so few characters that a compression algorithm would reduce it insignificantly or not at all. As indicated, Classical Chinese in classical characters wins this competition hands down: incredibly succinct grammar, each char packing about 15 bits of info (though for most common texts you can get away with just 13 bits, 11 for very simple ones). – Joe Pineda Feb 20 '14 at 23:57
  • Written French has a very low entropy: it doesn't have as many consonant and vowel clusters as English, and some groups are fairly common (eau, au, que), so their appearance is very predictable. I'd bet a French text of some length would compress more than an English one of the same length... – Joe Pineda Feb 21 '14 at 00:00
  • 1
    However, there's indeed a linguistic aspect to this question besides the information-theoretic one: besides the redundancy inherent to all languages and the (il)logicalities of their traditional writing systems, some languages have more concise grammars whereas others require much more verbiage to get the same idea across. You can't really count how many bits words, naked of any writing system, encode... – Joe Pineda Feb 21 '14 at 00:07