16

Given a list of languages the listener is able to understand or classify, how would you generate textual output in a standard phonetic alphabet, for example the IPA, that would sound like a language when read aloud by someone familiar with reading that notation?

Mechanical snail
blunders
  • Do you want the text to read like gibberish in a real language (i.e. English gibberish), or like gibberish for a fake language? – Nathan Oct 16 '11 at 01:26
  • @Nathan: To keep it simple, the text would be in IPA; it's not meant to be read as a language, it's meant to be heard as one. The only reason I require the output to be IPA is that text-to-speech would almost always sound fake, while human-read text would be much harder to identify as fake based solely on the sound. Does that answer your question? – blunders Oct 16 '11 at 01:39
  • Almost. You just want the person to say "Hey, that sounds like language!", not "Hey, that kinda sounds like English!", right? – Nathan Oct 16 '11 at 01:41
  • @Nathan: Correct. I'm able to randomly make up phonetic sounds that resemble no existing language and still sound like a language, but I really just do it; there's no real process to it. I'm aware of methods of speech disguise, but was hoping for something a little more complex, meaning it couldn't be deciphered or proven to be non-language. – blunders Oct 16 '11 at 02:07
  • So it should be unintelligible, not actually be real language, yet not be possible to prove that it's not some unknown language. One way would be to use the phoneme inventory and phonology of a well-described language to generate well-formed strings, then filter out those that are attested in the lexicon of that language. Many languages have lots of well-formed strings that aren't actually employed in the lexicon. But if all the strings conform to the rules of the language I'm not sure if that means it isn't a real language, even though it would be gibberish (ie unintelligible)? – Gaston Ümlaut Oct 16 '11 at 03:52
  • 4
    Another way would be to make up a phoneme inventory and set of phonological rules that could, conceivably, be from a real language and then generate strings according to those rules. I'm not sure what you'd do about intonation as this depends on meaning. But intonation is one of the aspects of language that isn't encoded in writing systems so you could perhaps ignore it. Anyway, isn't this what Marc Okrand did with Klingon, though he went on to add semantics and grammar? (Klingon doesn't sound much like a natural language, so it would be interesting to see if linguists could be fooled!) – Gaston Ümlaut Oct 16 '11 at 04:01
  • @GastonÜmlaut Yeah, I was thinking about something along those lines. Only, I would include rules for acceptable syllables, then use those rules to generate the strings. Something like: select a syllable structure (eg, CV). For each element of the syllable, select an appropriate sound. You could get really fancy and come up with rules for alternations, but I'm not sure if that would add to the prima facie realness of the language. – Nathan Oct 16 '11 at 04:12
  • +1 @Gaston Ümlaut: Just to be clear, the output is IPA, which supports notating intonation as far as I know. Meaning if you believe intonation would be a factor, you should account for it. – blunders Oct 16 '11 at 08:08
  • 4
    Read up on Markov chains - they're used for all kinds of things that come in patterned sequences, including language in audio and orthographic writing, so it would work just as well in IPA transcription. And with the same limitations. – hippietrail Oct 16 '11 at 10:56
  • Someone seems to be working on such a project and asked a question on SE Computer Science: http://cs.stackexchange.com/questions/21896/. Read the comments to the question. Maybe he can give you more details. – babou Feb 22 '14 at 00:33
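The comments above sketch a recipe: pick a syllable template, then fill each slot from a phoneme inventory. A minimal sketch in Python, with a wholly invented inventory and a fixed CV template (nothing here is drawn from a real language):

```python
import random

# Invented phoneme inventory; any real use would swap in an inventory
# chosen to be plausible as a natural language.
ONSETS = ["p", "t", "k", "m", "n", "s", "l", "r"]
NUCLEI = ["a", "e", "i", "o", "u"]

def make_word(min_syllables=2, max_syllables=4):
    """Build a word out of CV syllables, per Nathan's suggestion."""
    n = random.randint(min_syllables, max_syllables)
    return "".join(random.choice(ONSETS) + random.choice(NUCLEI)
                   for _ in range(n))

print(" ".join(make_word() for _ in range(8)))
```

Swapping CV for richer templates (CVC, CCV, ...) is just a matter of adding slots and inventories for them.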

2 Answers

12

This is a real question with a real answer, published by real linguists to address other real linguistics questions. (It also has applications in amateur linguistics and in non-linguistics fields, like generating lorem ipsum text for design layouts.)

Wuggy (http://crr.ugent.be/programs-data/wuggy) will generate similar words given a pre-existing word list from any language. It does not do a perfect job of generating phonotactically valid words, but it comes close.

A better way to generate random words is to work out the phonotactics of the target language: which patterns of consonants and vowels are permitted, and what is permitted as an onset, nucleus, and coda.

Ideally you'd choose sounds according to their frequency in an existing corpus, but a uniform distribution might be fine as a first approximation. Then you generate words by choosing symbols at random, constrained by the phonotactic rules.
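A minimal sketch of that step, assuming a toy onset-nucleus-coda phonotactics; the frequency weights are invented and merely stand in for corpus counts:

```python
import random

# Toy phonotactics: each syllable is (onset)(nucleus)(coda).
# Weights are made up; an empty string means the slot is absent.
ONSETS = {"t": 5, "k": 4, "s": 3, "m": 2, "": 3}   # "" = onsetless syllable
NUCLEI = {"a": 6, "i": 4, "u": 3, "e": 2}
CODAS  = {"": 6, "n": 3, "s": 1}                   # "" = open syllable

def pick(weighted):
    """Draw one symbol, weighted by its (pretend) corpus frequency."""
    symbols, weights = zip(*weighted.items())
    return random.choices(symbols, weights=weights)[0]

def syllable():
    return pick(ONSETS) + pick(NUCLEI) + pick(CODAS)

def word(n_syllables=3):
    return "".join(syllable() for _ in range(n_syllables))

print(word())
```

Every output is phonotactically well-formed by construction, which is exactly what a naive letter-by-letter sampler cannot guarantee.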

Markov chains work, but phonotactic rules are only loosely like Markov chains. A Markov chain that only pays attention to the most recent letter could generate words that don't follow the onset-nucleus-coda pattern and that are too long or too short.
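To make the limitation concrete, here is a letter-level (order-1) Markov chain trained on a tiny invented sample lexicon; because its states are single letters, word length and syllable structure can drift in exactly the way described above:

```python
import random
from collections import defaultdict

# Invented sample lexicon, standing in for a real word list.
SAMPLE = ["takani", "mosuke", "narito", "kisamo", "tonaki"]

def train(words):
    """Record which letter can follow which (^ = start, $ = end)."""
    table = defaultdict(list)
    for w in words:
        chars = ["^"] + list(w) + ["$"]
        for a, b in zip(chars, chars[1:]):
            table[a].append(b)
    return table

def generate(table):
    out, state = [], "^"
    while True:
        state = random.choice(table[state])
        if state == "$":
            return "".join(out)
        out.append(state)

table = train(SAMPLE)
print([generate(table) for _ in range(5)])
```

Outputs are locally plausible letter pairs, but nothing constrains overall word shape or length.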

To generate words in a language with interesting morphology, you'd need to select the relevant prefixes and suffixes at random and apply whatever changes are necessary to attach those morphemes to the stem.
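A toy illustration of that idea, with invented affixes and a single made-up morphophonemic rule (an epenthetic vowel breaks consonant clusters at morpheme boundaries):

```python
import random

VOWELS = set("aeiou")
# Invented affix inventory; "" means no affix in that slot.
PREFIXES = ["", "ka", "un"]
SUFFIXES = ["", "ta", "nim"]

def join(a, b):
    """Attach two morphemes, inserting an epenthetic 'e' between
    two adjacent consonants (a made-up sandhi rule)."""
    if a and b and a[-1] not in VOWELS and b[0] not in VOWELS:
        return a + "e" + b
    return a + b

def inflect(stem):
    return join(join(random.choice(PREFIXES), stem),
                random.choice(SUFFIXES))

print(inflect("mak"))
```

A real system would need language-specific rules per affix, but the shape is the same: choose morphemes, then repair the boundaries.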

hippietrail
MatthewMartin
  • 1
    It also has applications in gaming! Here's a cute little article on Simlish, the gibberish language heard in the Sims games. It seems like they didn't take a very thoughtful approach, just stuck voice actors into the studio and told them to make stuff up, but it does sound convincing. To me it's like babytalk English, angry Italian stereotype and substitute curse words. – mollyocr Oct 18 '11 at 17:20
  • Also, we English speakers have all sorts of generalized language impressions, like saying that Chinese sounds like "chingchong" or doing a nasal-y French laugh. I love seeing what English sounds like to other cultures. – mollyocr Oct 18 '11 at 17:24
  • “A (possible) Markov chain only pays attention to the most recent letter” -> That's only if you consider that the states are the letters themselves: it's classical to “enrich” the states so that they remember more information (a state could be a whole syllable or even more). Your length issue is more serious: if I'm not mistaken, a Markov chain producing text will only generate words whose length follows an exponential distribution. But I guess you don't have to go that far from the Markovian world to solve this problem... – JPP Oct 31 '11 at 11:51
  • (cont'd) The naive idea would be to make a coupling with another random process: Imagine that your Markov chain has some states labeled as “safe exit points”, i.e. states generating a slice of word that can reasonably end up at the end of a word. Next to your Markov chain, place a Swiss cuckoo clock that strikes at random times (following the wanted distribution — needless to say, true Swiss clocks strike at deterministic times!). When it strikes, the Markov chain is replaced by an “emergency route Markov chain” which is devised to take you safely and quickly (but still randomly)... – JPP Oct 31 '11 at 12:00
  • (cont'd) from the current state to a “safe exit point” where the word is completed. It seems to me that if the safe exit points are dense enough (and so if you don't pass much time in the emergency route), you will get words of the good (random) length. Obviously, this would take a lot of serious research and fine tuning to really work (I bet people already have worked on that idea) but I cannot see a good reason why such a strategy would fail. – JPP Oct 31 '11 at 12:05
  • If the algorithm generates a word and another algorithm has to evaluate if first algorithm got it right (i.e. is at an exit point), why not discard the first and just use the second algorithm? The phonotactic constraints are generally known and generating words that match that is algorithmically sort of hard, while markov chains are easy if you have a transition matrix. If refining a markov chain is difficult, I'd expect people to switch to generating words according to the real phonotactic constraints e.g. (C)V(CV(N))* for Japanese – MatthewMartin Oct 31 '11 at 13:41
  • @MatthewMartin: Just a heads-up that this question/answer pair upvote wise is doing well, and if there's any additional information you're able to add it would likely have a huge impact on pushing the question to the question/answer to the "next-level". If you have any suggestions on improving the questions itself, feel free to edit the question, or comment below the question itself. Thanks! – blunders May 12 '12 at 03:16
  • For humor value, some of the things I got when I fed an English dictionary to a Markov-chain-based algorithm in the past: goatempts curvaccidatodgy zeppasmo almly zookelpful quinedibly votrepinabitized czarskiing qweryphs zuccon xercococcur forkplay fruckage beepfat qualmly wussyfoo gluejew qwerbage weenie feelworldvic vyingfishpad squettled jarfed clatchfin owlywed xylorying frazziest pornumb bromsdam sunkyard fryingpie menhand xersnigget twirlifted gyroite yowlywed zomboide gagiosex knivie runtnest fobbyhoe dogey glasmanly ptoriery – Justin Olbrantz Oct 22 '12 at 05:03
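The state "enrichment" JPP describes can be as simple as raising the chain's order. A sketch of an order-2 chain, where each state is the last two letters rather than one, trained on the same kind of tiny invented lexicon:

```python
import random
from collections import defaultdict

# Invented sample lexicon, standing in for a real word list.
SAMPLE = ["takani", "mosuke", "narito", "kisamo", "tonaki"]

def train(words, order=2):
    """Map each run of `order` letters to the letters that can follow it."""
    table = defaultdict(list)
    for w in words:
        padded = "^" * order + w + "$"   # ^ pads the start, $ ends the word
        for i in range(len(padded) - order):
            table[padded[i:i + order]].append(padded[i + order])
    return table

def generate(table, order=2):
    state, out = "^" * order, []
    while True:
        nxt = random.choice(table[state])
        if nxt == "$":
            return "".join(out)
        out.append(nxt)
        state = state[1:] + nxt          # slide the two-letter window

table = train(SAMPLE)
print([generate(table) for _ in range(5)])
```

Higher orders capture more local structure (closer to syllables), at the cost of needing more training data; the word-length problem JPP raises remains.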
2

I love the subject of why languages sound the way they do. Prosody goes a long way toward explaining it, I think.

It would be great to know how others see (or hear!) me speaking my own language. Here's one perspective: a wonderful short film in fake English.

Tom