
I have an idea for a programming language that would work more like a spoken language. "sentences" would have an initial context in which specific subjects, verbs, and objects would have meaningful relationships. I'm thinking I can make it more powerful and global if I use something like Esperanto as a base which other languages could translate to.

So I have 2 questions:

  1. Which (constructed|native) language would be the best choice based on:

    1. regularity of grammar
    2. easiest to translate other languages to
    3. is easy to learn
    4. is expressive
    5. some other criteria that might be useful (popularity would be nice but doesn't matter as much as regularity and simplicity)
  2. Do you know of any existing projects that do something similar or would be useful for some other reason?
Otavio Macedo
    Some of your criteria I don't think really make much sense. There's no language that's universally easy to translate to (translations are usually easier when the two languages are more similar--there's no language that's similar to all the languages in the world), how easy a language is to learn depends on your native language, and all human languages are expressive. Constructed languages would probably be the ones with the fewest irregularities though. – Askalon Mar 24 '12 at 20:31
  • This is an interesting idea. Askalon surely made a valid point, too. I would say Esperanto is your best bet, given its structure, vocabulary and popularity, so I would recommend you to stick to your original idea. Care to share some more details of your language? I don't know of any similar project. – kamil-s Mar 24 '12 at 20:41
  • Depending on how the language is to be structured, I'd recommend using standard SOV agglutinative grammar. It's very similar among all SOV languages, to the point where it's not unknown to find morpheme-by-morpheme matches among complex words and phrases in Turkish, Tamil, and Japanese, all of which are of this type, though totally unrelated. – jlawler Mar 24 '12 at 22:24
  • As a bonus, you get an RPN language, because that's the way SOV languages structure their arguments: Subject-Object-Verb. – jlawler Mar 24 '12 at 22:25
  • I don't know about Tamil and Japanese but Turkish I actually speak and I really don't think it would be any easier to parse than Esperanto. It's actually quite difficult to parse for human beginners from time to time. Besides, it would only make things easier if Oggy aimed the language at Middle Eastern programmers. For Americans and Europeans, Esperanto vocabulary will be much easier to memorize. – kamil-s Mar 25 '12 at 07:35
  • @KamilS. & jlawler: all the literature I have come across for parsers for various languages demonstrate the worst performance with Turkish. Turkish is one of the more difficult languages to parse -- at least with the kind of techniques used for European languages. – prash Mar 25 '12 at 13:39
  • Well, yes, of course. Turkish is left-branching; all SOV languages are. European languages are right-branching, like English. But over half of the languages of the world are SOV. It's one of the most stable configurations for sentence generation. And the OP's not building a parser, but a programming language. – jlawler Mar 25 '12 at 14:24
  • This was tried in the 1960s, with disastrous results. COBOL was designed to resemble spoken language, and it was hated by coders. Programming languages generally want to be precise and terse, and to have a definite way to say something, so that they are easy to read. Algorithms are notoriously hard to specify in natural language (think about how long it would take you to describe precisely how to alphabetize a dictionary). Nevertheless, Perl borrowed constructions like $_ ("it") and free word order from natural language with great success. – Ron Maimon Mar 26 '12 at 07:42
  • You could also go with a controlled version of an otherwise difficult-to-parse language. – Nate Glenn Mar 29 '12 at 17:42
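
jlawler's RPN parallel is easy to see in code: in postfix notation, as in SOV sentences, the arguments come first and the operation comes last. A minimal sketch (the token names `add`/`sub`/`mul` and the stack discipline are illustrative assumptions, not taken from any of the languages discussed here):

```python
# Toy RPN ("postfix") evaluator: operands first, operator last,
# loosely mirroring Subject-Object-Verb order ("3 4 add" ~ "three four add-them").
def eval_rpn(tokens):
    ops = {"add": lambda a, b: a + b,
           "sub": lambda a, b: a - b,
           "mul": lambda a, b: a * b}
    stack = []
    for tok in tokens:
        if tok in ops:
            b = stack.pop()  # the "object" argument
            a = stack.pop()  # the "subject" argument
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))
    return stack.pop()

print(eval_rpn("3 4 add".split()))        # 7.0
print(eval_rpn("3 4 add 2 mul".split()))  # 14.0
```

Note that no parentheses or precedence rules are needed: the order of the tokens alone determines the structure, which is part of why SOV-style ordering is attractive for a machine-parseable language.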

5 Answers

  1. lojban, a conlang, already meets these goals.
  2. They already have a few parsers. One of them even has a YACC grammar.

I have come across literature giving the impression that it is possible to build a parser for Sanskrit that would generate unambiguous parses. But I don't know of any well-executed Sanskrit parsers myself.

A few additional thoughts on the matter: Morphology is one of the biggest reasons why parsers make mistakes. Chinese does not have morphology at all. Chinese parsers perform well. However, at the moment, English parsers are the most accurate of all. This is possibly because parsers for English have received far more attention than parsers for any other language.

Franck Dernoncourt
prash
    Lojban is hard for people. I don't speak it myself but those I know that do, claim that most use of Lojban by people is creolized/simplified/ambiguous. It might seem that we meatsacks need underspecification and metaphors, extreme precision being incompatible with our natures. – kaleissin Mar 25 '12 at 14:31
  • @kaleissin: I haven't found willing friends to learn with, so I haven't learnt lojban either. However, I have read the grammar a bit. I get the impression that underspecification is part of the formalism. You can call a stack of books a "table", or a person a "donkey" etc. The morphology supports such metaphors straight-away. Only the grammar is precise. It allows sentences to be more imprecise than possible with natural languages. – prash Mar 25 '12 at 15:08
  • @kaleissin: citation: http://www.lojban.org/tiki/Lojban%20Introductory%20Brochure#unambiguity. You can find specific examples in their books. – prash Mar 25 '12 at 15:16
  • Kaleissin is right: lojban fails miserably at point 3, easy to learn. Ease of translation might be a very serious issue, too. And I'm not so sure you can actually call it similar to a human natural language. – kamil-s Mar 26 '12 at 19:59
  • @KamilS.: {{Citation needed}} ;-) – prash Mar 26 '12 at 22:33
  • @prash You're not serious, right? – kamil-s Mar 26 '12 at 22:59
  • @KamilS.: I am always very serious about assertions thrown about with nothing backing them. Trust me on this. There is no end to the kind of things people try to get away with. – prash Mar 27 '12 at 08:52
  • @prash Now I'm sure you are very serious. But note that there is also no end to how absolutely obvious things people spend money and effort to prove merely to look more scientific. Would you expect me to quote the Oxford Dictionary if I said house means 'a type of building' in English? What would this give you? Just somebody else's opinion. What kind of backing up would you expect for what I wrote? How about your own judgement? – kamil-s Mar 27 '12 at 13:29
  • @KamilS.: My own judgement? From what I have studied of the grammar, I concur with the Q&A answer to "Lojban seems complex. How hard is it to learn?". You made the claim "fails miserably at point 3", which you did not back up either by your own reasoning or with citations. Don't you see the difference between this and asking for the definition of house? – prash Mar 27 '12 at 13:29
  • @prash Well, apart from the obvious exaggeration for rhetoric purposes, I don't actually see much of a difference. Do you really need a study to convince you that a language which is in many ways more similar to a programming than to a natural language, is not easy to learn for a human? Unless you understand learn as merely memorizing the grammar, without really being able to actually speak the language? Because if so, why even bother fashioning a new programming language after natural languages, which is what the main question is about? – kamil-s Mar 27 '12 at 13:48
  • Q1: Yes. Programming languages have tiny grammars compared to natural languages. I need better reasoning or citations to accept the claim that natural languages are easier to learn. Q2: The link I gave in my previous comment already answers that in detail. Q3: I can write a large essay about that. But since I'm not the one who asked the original question here, I'll leave it to Oggy Transfluxitor Jones. – prash Mar 27 '12 at 14:00
  • Q1: Your choice. Q2: All I can see in your link are advertising claims not backed up by even a reference to a study of some kind. Q3: Right. – kamil-s Mar 27 '12 at 15:42
  • @KamilS.: Q2: The Q&A was written based on their experience with people who have learnt the language. But you're right. Your way is easier. Hence my claim: lojban is the most brilliantly easy to learn language ever. Unless you bother to demonstrate you know what you're talking about, this is my final comment on the matter. – prash Mar 27 '12 at 16:46
  • On Sanskrit parsers: http://sanskrit.inria.fr/ has one (by Gerard Huet); Oliver Hellwig and the University of Hyderabad also have parsers and similar tools on their websites. I'm not really sure it's always possible to generate unambiguous parses though: literature is often written so as to intentionally allow multiple readings (called śleṣa), e.g. puns. – ShreevatsaR Apr 06 '12 at 09:23
  • @ShreevatsaR: Thanks for the info. I haven't learnt Sanskrit at all. Can you tell me if śleṣa exploit syntactic ambiguity (as opposed to semantic or pragmatic)? – prash Apr 07 '12 at 01:21
  • "Morphology is one of the biggest reasons why parsers make mistakes. Chinese does not have morphology at all. Chinese parsers perform well." wtf? first of all, at least look up "Chinese morphology" on wikipedia before making such broad claims. – unhammer Sep 06 '12 at 08:52
  • Second, simple morphology doesn't necessarily make parsing easier; for syntax, it can make it harder, e.g. without a case system, you might have to rely on word order etc. to say if a word is subject or object; with a case system (more complex morphology), the word form itself is often enough. – unhammer Sep 06 '12 at 08:55
  • @unhammer: 1. You need to look it up yourself. 2. Please re-read what I wrote. I was talking about the current state of parsing. If you have info on papers that demonstrate superior parsing accuracy on morphologically rich languages, please post them. – prash Sep 06 '12 at 22:00
  • e.g. "Most modern varieties of Chinese have the tendency to form new words through disyllabic, trisyllabic and tetra-character compounds. In some cases, monosyllabic words have become disyllabic without compounding, as in 窟窿 kulong from 孔 kong; this is especially common in Jin." Also, Chinese classifiers are bound morphemes. I'd agree if you said "very little morphology", but "none at all" is simply false. – unhammer Sep 07 '12 at 07:35
  • The problem with (especially statistical) parsing of morphologically rich languages is simply coverage – because words can take so many forms, you're much more likely to encounter forms that never appeared in your training data. Not so much a problem with rule-based parsing, e.g. with Constraint Grammar (or at least statistical methods using a rule-derived lexicon), where if you have one form of the word, you have all. On the other hand, on seeing dog.n.m.sg.acc you have much more syntactic info on the word than on seeing dog.n; e.g. you can immediately rule out its being a subject or a head of an adj.pl. – unhammer Sep 07 '12 at 07:41
  • @unhammer: 1. I am not a Chinese linguist, but what you pasted looks more like an equivalent of compound words in English. I don't see how this is "morphology". 2. I'm aware of all that, but, like I said, I was talking about the current state of parsing. – prash Sep 07 '12 at 19:26
  • @kamil-s: It would be a very interesting experiment to try to raise some children as Lojban speakers as has been done with Esperanto, Hebrew (when it was not "living" in the usual sense), Klingon, and Sanskrit. – hippietrail Mar 31 '14 at 09:36
  • @prash: Compounding is part of morphology. Also morphology and syntax are often better considered as a single unit in which case the best term is "morphosyntax". Chinese is a good example of a language where it's better to think of morphosyntax than of morphology vs syntax, because there are many ambiguous cases where you could analyse something as either morphological or syntactic, which turns out just to be wasteful hairsplitting based on terminology rather than on the reality of how the language at hand actually works. – hippietrail Mar 31 '14 at 09:42
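
unhammer's point about case systems can be illustrated with a toy parser: when grammatical role is marked on the word form itself, word order no longer has to carry that information. A minimal sketch (the suffixes "-ga" and "-o" are loosely modeled on Japanese subject/object particles, and the hyphenated token format is an assumption made up purely for illustration):

```python
# Toy "case-marked" parser: role is read off the word's suffix,
# so both orderings below parse identically - position is irrelevant.
def parse_case_marked(tokens):
    roles = {}
    for tok in tokens[:-1]:  # assume the last token is the verb (SOV-ish)
        stem, _, case = tok.rpartition("-")
        if case == "ga":     # nominative marker -> subject
            roles["subject"] = stem
        elif case == "o":    # accusative marker -> object
            roles["object"] = stem
    roles["verb"] = tokens[-1]
    return roles

print(parse_case_marked(["dog-ga", "bone-o", "eats"]))
print(parse_case_marked(["bone-o", "dog-ga", "eats"]))
# Both print: {'subject': 'dog', 'object': 'bone', 'verb': 'eats'}
```

A caseless language would need a fixed position (or statistics) to recover the same roles, which is the trade-off unhammer describes: richer morphology means harder coverage for statistical parsers, but more syntactic information per word form.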