
I am a literacy researcher looking to create an add-on package in R that offers quantitative methods for discourse analysis. I am writing a function that takes a chunk of text and measures the syllables per word, so

x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle', 'pronunciation')

would yield: 1, 1, 2, 2, 1, 3, 5

Currently I have an algorithm that is about 90-95% accurate, following these rules:

  1. The patterns eeing and eing count as two vowels (so seeing becomes sVV and being becomes bVV);
  2. An e at the end of a word is dropped (so come becomes cVm);
  3. Two vowels next to each other (as in peach) count as one vowel sound (it becomes pVch);
  4. Any remaining single vowel counts as a V (Yoda becomes YVdV);
  5. The sum of the Vs is the number of syllables in the word.
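As a rough sketch, the rules above can be expressed as a chain of regex substitutions in R (`count_syllables` is a hypothetical name; the consonant + "le" step reflects the refinement mentioned later in the comments):

```r
# Rough sketch of the rules above as regex substitutions (hypothetical name).
count_syllables <- function(word) {
  w <- tolower(word)
  w <- gsub("ee?ing$", "VV", w)        # Rule 1: seeing -> sVV, being -> bVV
  w <- gsub("[^aeiouy]le$", "Vl", w)   # consonant + "le" counts as a syllable
  w <- gsub("e$", "", w)               # Rule 2: drop a silent final e (come -> com)
  w <- gsub("[aeiouy]{2,}", "V", w)    # Rule 3: adjacent vowels -> one V (peach -> pVch)
  w <- gsub("[aeiouy]", "V", w)        # Rule 4: any remaining vowel -> V
  max(1L, nchar(gsub("[^V]", "", w)))  # Rule 5: count the Vs (at least 1 per word)
}

sapply(c("dog", "cat", "pony", "cracker", "shoe", "Popsicle"), count_syllables)
#  1 1 2 2 1 3
```

Even with these rules, a word like pronunciation comes out one syllable short (the i-a and i-o sequences each collapse to one V), which is exactly where an exceptions dictionary would help.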

Things that throw a monkey wrench into the problem are:

  • compound words;
  • non-silent final "e"s.

These are the rule breakers that don't follow my algorithm. Does anyone know of a dictionary of syllable rule breaker words?

I know this is available, as a similar approach was used by Franklin Mark Liang (a guru in the art of syllabication) in his 1983 dissertation. The same approach is used by the online syllable counter, so I know such a dictionary exists. Combining this dictionary with my algorithm would make my approach very accurate.
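The dictionary-plus-algorithm combination could be sketched as a dictionary-first lookup with an algorithmic fallback (all names here are hypothetical; the dictionary is assumed to be a named vector of syllable counts built from whatever word list is found):

```r
# Hypothetical sketch: look each word up in a syllable dictionary first;
# anything not found is passed to an algorithmic fallback function.
dict_syllables <- function(words, syllable_dict, fallback) {
  counts <- syllable_dict[tolower(words)]
  miss <- is.na(counts)
  counts[miss] <- vapply(words[miss], fallback, numeric(1))
  unname(counts)
}

# e.g. dict_syllables(c("fire", "dog"), c(fire = 1), some_rule_based_counter)
```

Building the dictionary vector from a pronouncing word list is then a one-time cleaning step, and the rule-based algorithm only has to handle out-of-dictionary words.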

Otavio Macedo
Tyler Rinker
    Well, if you want a list of exceptions, they'll have to be exceptions to the specific rules you're using, no? Seems to me that the right approach is to get an all-purpose pronouncing dictionary and run your algorithm on the whole thing. Any word on which it makes the wrong prediction goes in your exception list. – Leah Velleman Dec 19 '11 at 23:09
  • I'd actually go the other way around. I'm not looking for 100% accuracy, but as high as I can get without writing a ton of code. I'd run it through the dictionary first; anything not in there gets run through the algorithm. This would greatly increase the accuracy. This is the approach used by others. I'm sure they use a tailored dictionary, but I'm looking for a more general one that I may augment, and I need some place to start with the dictionary. – Tyler Rinker Dec 19 '11 at 23:13
  • Right, that makes sense. So then you're not necessarily looking for a list of exceptions, but just a general-purpose list of words and their syllable counts? – Leah Velleman Dec 20 '11 at 03:40
  • Correct. I think I'm going to go with NETtalk link, which seems to be the standard, unless someone has a better suggestion. I've already imported the file and cleaned it up for my purposes; now to finish the R coding. Sorry I wasn't clear on my needs, as they became clear to me only as I worked through the problem. I'll let you know how my accuracy is coming. (I improved it again by adding a rule I found about consonant + le.) – Tyler Rinker Dec 20 '11 at 05:31
  • Since you now have an improved understanding of the problem, it would be a good idea to edit the question to make clear what are you looking for. – Otavio Macedo Dec 20 '11 at 20:09
  • You don't mention the problem of syllabic consonants. There's a good discussion [here](http://phonetic-blog.blogspot.com/2011/12/more-syllabic-consonants.html) of the environment that conditions their occurrence. – Gaston Ümlaut Dec 23 '11 at 04:30
  • @Gaston Ümlaut I am just now seeing your comment. The link you provide seems to be broken. It sounds interesting, and I am wondering if you have a current link? – Tyler Rinker Jan 22 '12 at 01:03
  • @TylerRinker Sorry, I miswrote the link. Actually there are two posts on that blog that may be interesting for you: this and this. – Gaston Ümlaut Jan 22 '12 at 12:25
  • Another issue is that the syllabification may vary between speakers. How many syllables does "fire" have? "Interest"? – Mechanical snail Dec 15 '12 at 10:50

1 Answer


The NETtalk data set is what I am looking for. Here is a link to the actual data set LINK.

Tyler Rinker
  • Glad you found your answer, but I'm a little confused. The link to the wiki on NETtalk just seems to describe a rather general-purpose learning neural network with 211 data sets. Which data set in particular did you find useful? – Mark Beadles Dec 22 '11 at 21:01
  • @Mark You're correct, the wiki link doesn't provide much about the NETtalk data file. Here's the link to the file: LINK. I'll add this to my answer as well. – Tyler Rinker Dec 22 '11 at 21:18
  • Thanks, that does make a lot more sense :) useful looking data set too. – Mark Beadles Dec 22 '11 at 21:22