Which language would be easiest for a computer to parse?

Question

I have an idea for a programming language that would work more like a spoken language. "sentences" would have an initial context in which specific subjects, verbs, and objects would have meaningful relationships. I'm thinking I can make it more powerful and global if I use something like Esperanto as a base which other languages could translate to.

So I have 2 questions:

Which (constructed|native) language would be the best choice based on:
1. regularity of grammar
2. easiest to translate other languages to
3. is easy to learn
4. is expressive
5. some other criteria that might be useful (popularity would be nice but doesn't matter as much as regularity and simplicity)
Do you know of any existing projects that do something similar or would be useful for some other reason?

Some of your criteria I don't think really make much sense. There's no language that's universally easy to translate to (translations are usually easier when the two languages are more similar--there's no language that's similar to all the languages in the world), how easy a language is to learn depends on your native language, and all human languages are expressive. Constructed languages would probably be the ones with the fewest irregularities though. — Askalon, Mar 24 '12 at 20:31
This is an interesting idea. Askalon surely made a valid point, too. I would say Esperanto is your best bet, given its structure, vocabulary and popularity, so I would recommend you to stick to your original idea. Care to share some more details of your language? I don't know of any similar project. — kamil-s, Mar 24 '12 at 20:41
Depending on how the language is to be structured, I'd recommend using standard SOV agglutinatibe grammar. It's very similar among all SOV languages, to the point where it's not unknown to find morpheme-by-morpheme matches among complex words and phrases in Turkish, Tamil, and Japanese, all of which are of this type, though totally unrelated. — jlawler, Mar 24 '12 at 22:24
As a bonus, you get an RPN language, because that's the way SOV languages structure their arguments: Subject-Object-Verb. — jlawler, Mar 24 '12 at 22:25
I don't know about Tamil and Japanese but Turkish I actually speak and I really don't think it would be any easier to parse than Esperanto. It's actually quite difficult to parse for human beginners from time to time. Besides, it would only make things easier if Oggy aimed the language at Middle Eastern programmers. For Americans and Europeans, Esperanto vocabulary will be much easier to memorize. — kamil-s, Mar 25 '12 at 07:35
@KamilS. & jlawler: all the literature I have come across for parsers for various languages demonstrate the worst performance with Turkish. Turkish is one of the more difficult languages to parse -- at least with the kind of techniques used for European languages. — prash, Mar 25 '12 at 13:39
Well, yes, of course. Turkish is left-branching; all SOV languages are. European languages are right-branching, like English. But over half of the languages of the world are SOV. It's one of the most stable configurations for sentence generation. And the OP's not building a parser, but a programming language. — jlawler, Mar 25 '12 at 14:24
This was tried in the 1960s, with disasterous results. COBOL was designed to resemble spoken languages, and it was hated by coders. The programming languages generally want to be precise and terse, and have a definite way to say something, so that they are easy to read. Algorithms are notoriously hard to specify in natural language (think about how long it would take you to describe how to alphebetize a dictionary precisely). Nevertheless, Perl borrowed constructions like $_ ("it") and free order from natural language with great success. — Ron Maimon, Mar 26 '12 at 07:42
You could also go with a controlled version of an otherwise difficult-to-parse language. — Nate Glenn, Mar 29 '12 at 17:42

score 11 · Answer 1 · edited Mar 30 '14 at 21:28

11

lojban, a conlang, already meets these goals.
They already have a few parsers. One of them even has a YACC grammar.

I have come across literature that give the impression that it is possible to make a parser for Sanskrit, that would generate unambiguous parses. But I don't know of well-executed Sanskrit parsers, myself.

A few additional thoughts on the matter: Morphology is one of the biggest reasons why parsers make mistakes. Chinese does not have morphology at all. Chinese parsers perform well. However, at the moment, English parsers are the most accurate of all. This is possibly because parsers for English have received far more attention than parsers for any other language.

edited Mar 30 '14 at 21:28

Franck Dernoncourt

1,588
2
12
35

answered Mar 25 '12 at 13:26

prash

3,649
3
25
33

5

Lojban is hard for people. I don't speak it myself but those I know that do, claim that most use of Lojban by people is creolized/simplified/ambiguous. It might seem that we meatsacks need underspecification and metaphors, extreme precision being incompatible with our natures. – kaleissin Mar 25 '12 at 14:31
@kaleissin: I haven't found willing friends to learn with, so I haven't learnt lojban either. However, I have read the grammar a bit. I get the impression that underspecification is part of the formalism. You can call a stack of books a "table", or a person a "donkey" etc. The morphology supports such metaphors straight-away. Only the grammar is precise. It allows sentences to be more imprecise than possible with natural languages. – prash Mar 25 '12 at 15:08
@kaleissin: citation: http://www.lojban.org/tiki/Lojban%20Introductory%20Brochure#unambiguity. You can find specific examples in their books. – prash Mar 25 '12 at 15:16
3

Kaleissin is right: lojban fails miserably at point 3, easy to learn. Ease of translation might be a very serious issue, too. And I'm not so sure you can actually call it similar to a human natural language. – kamil-s Mar 26 '12 at 19:59
@KamilS.: {{Citation needed}} ;-) – prash Mar 26 '12 at 22:33
@prash You're not serious, right? – kamil-s Mar 26 '12 at 22:59
@KamilS.: I am always very serious about assertions thrown about with nothing backing them. Trust me on this. There is no end to the kind of things people try to get away with. – prash Mar 27 '12 at 08:52
@prash Now I'm sure you are very serious. But note that there is also no end to how absolutely obvious things people spend money and effort to prove merely to look more scientific. Would you expect me to quote the Oxford Dictionary if I said house means 'a type of building' in English? What would this give you? Just somebody else's opinion. What kind of backing up would you expect for what I wrote? How about your own judgement? – kamil-s Mar 27 '12 at 13:29
@KamilS.: My own judgement? From what I have studied of the grammar, I concur with the Q&A answer to "Lojban seems complex. How hard is it to learn?". You made the claim "fails miserably at point 3", which you did not back up either by your own reasoning or with citations. Don't you see the difference between this and asking for the definition of house? – prash Mar 27 '12 at 13:29
@prash Well, apart from the obvious exaggeration for rhetoric purposes, I don't actually see much of a difference. Do you really need a study to convince you that a language which is in many ways more similar to a programming than to a natural language, is not easy to learn for a human? Unless you understand learn as merely memorizing the grammar, without really being able to actually speak the language? Because if so, why even bother fashioning a new programming language after natural languages, which is what the main question is about? – kamil-s Mar 27 '12 at 13:48
Q1: Yes. Programming languages have tiny grammars compared to natural languages. I need better reasoning or citations to accept the claim that natural languages are easier to learn. Q2: The link I gave in my previous comment already answers that in detail. Q3: I can write a large essay about that. But since I'm not the one who asked the original question here, I'll leave it to Oggy Transfluxitor Jones. – prash Mar 27 '12 at 14:00
Q1: Your choice. Q2: All I can see in your link are advertising claims not backed up by even a reference to a study of some kind. Q3: Right. – kamil-s Mar 27 '12 at 15:42
@KamilS.: Q2: The Q&A was written based on their experience with people who have learnt the language. But you're right. Your way is easier. Hence my claim: lojban is the most brilliantly easy to learn language ever. Unless you bother to demonstrate you know what you're talking about, this is my final comment on the matter. – prash Mar 27 '12 at 16:46
let us continue this discussion in chat – kamil-s Mar 27 '12 at 17:07
4

On Sanskrit parsers: http://sanskrit.inria.fr/ has one (by Gerard Huet); also Oliver Hellwig and University of Hyderabad also have parsers and similar tools on their websites. I'm not really sure it's always possible to generate unambiguous parses though: often literature is written so as to intentionally allow multiple readings. (Called śleṣa) (E.g. puns.) – ShreevatsaR Apr 06 '12 at 09:23
@ShreevatsaR: Thanks for the info. I haven't learnt Sanskrit at all. Can you tell me if śleṣa exploit syntactic ambiguity (as opposed to semantic or pragmatic)? – prash Apr 07 '12 at 01:21
2

"Morphology is one of the biggest reasons why parsers make mistakes. Chinese does not have morphology at all. Chinese parsers perform well." wtf? first of all, at least look up "Chinese morphology" on wikipedia before making such broad claims. – unhammer Sep 06 '12 at 08:52
Second, simple morphology doesn't necessarily make parsing easier; for syntax, it can make it harder, e.g. without a case system, you might have to rely on word order etc. to say if a word is subject or object; with a case system (more complex morphology), the word form itself is often enough. – unhammer Sep 06 '12 at 08:55
@unhammer: 1. You need to look it up yourself. 2. Please re-read what I wrote. I was talking about the current state of parsing. If you have info on papers that demonstrate superior parsing accuracy on morphologically rich languages, please post them. – prash Sep 06 '12 at 22:00
2
e.g. "Most modern varieties of Chinese have the tendency to form new words through disyllabic, trisyllabic and tetra-character compounds. In some cases, monosyllabic words have become disyllabic without compounding, as in 窟窿 kulong from 孔 kong; this is especially common in Jin." Also, Chinese classifiers are bound morphemes. I'd agree if you said "very little morphology", but "none at all" is simply false.

unhammer

Sep 07 '12 at 07:35

1

The problem with (especially statistical) parsing of morphologically rich languages is simply coverage – because words can take so many forms, you're much more likely to see new forms in your training data. Not so much a problem with rule-based parsing, e.g. with Constraint Grammar (or at least statistical methods using a rule-derived lexicon), where if you have one form of the word, you have all. On the other hand, on seeing dog.n.m.sg.acc you have so much more syntactical info on the word than on seeing dog.n, e.g. you can immediately rule out if it is a subject or a head of an adj.pl

– unhammer Sep 07 '12 at 07:41

score 6 · Answer 2 · answered Mar 25 '12 at 14:43

There is also the newer conlang Ithkuil. As with Lojban it has the problem that it forces the human to change behavior by being much more precise and specific than normal. No ambiguity comes at a cost to the speaker, being vague has its own rewards.

It might be that learning languages like Lojban and Ithkuil are easier for programmers since they are already used to an absurd level of precision when coding. Because of this I don't see how using a programming language that was also a human language would gain us anything: No matter the syntax of the programming language, in order for the computer to be able to do something predictable with it, the human coder would still have to be a pedantic perfectionist crossing all t's and dotting all i's.

Humans do fuzzy matching and inference very well, computers not so much.

score 3 · Answer 3 · answered Sep 02 '12 at 08:32

I have also tried to design a language based on the regularity of esperanto... but well... too much recursive definitions for atoms, thus I ended up in approximatively nothing serious.

But having this in mind, I'd like to tell you about Toki Pona.

It's an extremely simplified constructed language. The base corpus contains only 120 words. (though the wikipedia article says it is more, because some words association do have a meaning of their own and thus must be learnt)

I took a little time to start learning it, and I must say that I am impressed by the descriptive power of such a tiny designed language.

Really worth losing 2 or 3 hours to learn the basics.

Having tried to write a parser for tokipona myself, what I found is that some words (prepositions, esp) can be alternate parts of speech. This makes it hard to decide when prep phrases are starting. Also, I found that writing the parse worked better when I sometimes started from the right, sometimes from the left. I suspect that means the grammar uses mixed branching. Other than that, writing a parser would be a week of evenings for an average developer. — MatthewMartin, Sep 04 '12 at 15:37

widged · Answer 4 · 2012-09-01T13:58:59.840

The initial question is asking two very different questions. One is what languages have a grammar/syntax easy to parse. Another is could computer programs extract meaning from the content that they parse.

With regards to question 1, @Nick Anderegg is correct. English is as good as any other language as its grammar is fixed / parsable. Sure, you have ambiguities that prevent complete determinism. With natural languages,an input can sometimes lead to different interpretations. For instance because some token can be words or verbs. Two approaches. One you stick to a simplified version of the language.

Hypertalk was a computer language using simplified english:

put "100,100" into pos
repeat with x = 1 to the number of card buttons
  set the location of card button x to pos
  add 15 to item 1 of pos
end repeat

More recently, [isla][1], a computer language for young children

my name is 'Mary'
my summary is 'You are a boy. You have no shoes.'
hallway is a room
hallway summary is 'A candle burns on a table.'

If you take the approach to simplify, then the natural language that your computer language is inspired from is somewhat irrelevant. You can "translate" the rules to adapt to the patterns used in other natural languages. In javascript, the JS-i18n project proposed to translate the javascript syntax to a developer's native language.

Another way around the parsing ambiguity problem is to add a few rules that removes these ambiguities. For instance, you force the user to clarify the role for each token (talk as Verb vs talk as Noun). The initial question then becomes could you find some natural languages that could be parsed with 100% accuracy with no rule added?

The second question is what human language would the the easiest for a computer easily extract meaning from (cf. "meaningful relationships"). Computer languages are not designed to extract meaning. Computer languages execute one instruction (program statement) at a time. They don't need to know the meaning/intent of the block that they are part of. Dependencies, preconditions, logical inference, etc., yes. Understanding of the meaning (which often depends on an understanding of the greater context), no.

You can write computer programs that extract very complex patterns from natural languages. But that doesn't meant that they "understand" what you say. Google translate or Siri are great pieces of software, however they don't really actually understand you. Many computer programs that appear to process language the way we do rely on extraction of patterns instead of extraction of meaning. They can produce frustrating or hilarious results with [people with accents][3]. Good coverage of issues and current solutions in this presentation on "[Can computers understand our language?][4]"

Back to the Turing's Chinese Room. Is displaying a performance close to the human one evidence that the program does it the way human do it? If you look at the latest Loebner's prize (2012), the level of conversation that chatbot achieve are far from being in par with humans:

Judge: Where did you go to school? 
Chip: I wasn't aware that I told you I went somewhere. I found out though, that majority of the students in Japan don't pursue tertiary education anymore. They can already work right after graduating from high school. 
Judge: Are you Japanese? 
Chip: I haven't achieved self-awareness yet, so I'm not sure. 
Judge: Why not ask somebody? 
Chip: No idea. 
Judge: Why not ask me? 
Chip: Okay. 
Judge: Well, I think you may be Japanese - you are very respectful and polite 
Chip: That's an interesting opinion

Base line. The word language means "a system of communication". Human/Natural Languages and Computer Languages both contain the word "Language" but they have very different purposes.

"A programming language is an artificial language designed to express computations that can be performed by a machine, particularly a computer." (source: [codeproject][5])

If you want to design a good computer language, then don't look at natural languages for inspiration! If you want to create a computer languages that is easy to learn, straightforward to use, then go visual, the way [scratch][6] did. A javascript equivalent, watebear will let you create your custom micro-language that follows the syntactic patterns of [any natural language][8]. You are then considering a 4th or 5th generation programming language (4GL/5GL)

5GL or fifth-generation language is programming that uses a visual or graphical development interface to create source language that is usually compiled with a 3GL or 4GL language compiler. (source: [programming languages generation][9])

4GL/5GL are typically used to implement domain specific languages. If you want a more general purpose language, then you will have to be get further away from natural languages. You go from 4GL to [3GL or less][10]. The closer to 1GL, the closer the language syntax is from the machine code and the faster it is to execute. However, it also is more difficult for humans to modify. These different generations are described in some details in this article on [programming languages][11]

(Some links missing... spam prevention, no more than 2 hyperlinks)

score 0 · Answer 5 · answered Aug 06 '12 at 21:30

I would have to nominate English. It meets all of your criteria.

The grammar is fairly regular in that odd constructions are possible, but simple sentences follow a fairly standard form. As for the irregularity of the grammar, there are a finite number of irregular verbs. Everything else follows set patterns.

2, 3, and 4 are easily knocked out by English. English spelling might be a disaster, but in terms of actual grammar, English is simple. Tenses and conjugation are regular and simple, there are no complex conjugations, and word order feels right for programming. Since English is SVO, and since programs follow the syntax "you have this, now make it do that" English fits. You could assign something with the sentence "String s is equal to 'example.'" This fits in with the current programming paradigm. At any rates, people who want to program learn English; nobody would have to learn a new language if they are already a programmer.

English definitely has popularity. It's the second most spoken language in the world, and as I already said, most programmers already speak it. It's a universal language in the computer world.

As for the practicality of it, it seems like the more "natural" a programming language is, the less functional it actuall is. Compare C to C++ to Java. C can do anything; with Java, you have to do more to do small, system-level operations. Coding is an art form all its own. If you're aiming for elegance, try Python. Python is pretty.

I have to disagree. English is riddled with polysemy and ambiguity, like most natural languages. Being only slightly inflectional it doesn't have many endings which signal the part-of-speech definitively for words. It does separate words with spaces though, which makes it easier than Chinese, Japanese, or Thai - at least in written language. And yes I've tried to parse English in computer code many times over the years, at least for fun. — hippietrail, Aug 07 '12 at 09:59
A controlled subset of English called Attempto Controlled English was designed to be easy to parse. If you want English to be easy to parse, then you should use a controlled subset of the language instead of the entire English language. — Anderson Green, Sep 12 '13 at 20:27
@AndersonGreen: Yes indeed this is actually something very worthwhile to learn if you want to get the best results out of machine translators. At least to translate from your language to benefit somebody else, as you are then codifying your own known unambiguous concepts into language. Going the other way there's not much you can do. — hippietrail, Mar 31 '14 at 09:53

Which language would be easiest for a computer to parse?

5 Answers5