
Most algorithms for splitting text into sentences that I've found rely on the punctuation being correct. However, in many real-world applications there will be a substantial number of punctuation errors (missing periods, extraneous periods, etc.). Are there sentence-splitting algorithms that deal with this?

hippietrail
Alexey Romanov
  • For English speech, yes; intonation gives it away. For English spelling, no, not really. Not only doesn't English record intonational information reliably in its orthography, but what little does leak through into punctuation is quite inconsistently applied, and that's never going to change. Get used to culling ambiguous parse options of text through non-syntactic information. Like frames, for instance. – jlawler Feb 04 '13 at 18:51
  • I was in fact imagining something like "if the likelihood of a sentence as given is too low, try splitting it in different reasonable places (or joining with next/previous sentence). If one of these splits increases likelihood substantially, repeat." – Alexey Romanov Feb 05 '13 at 07:46
  • Or just trying to select and run a suitable machine learning algorithm, which would presumably assign high weights to punctuation and to capital letters, but could detect ends of sentence even if they are missing. The question is if this achieves enough accuracy... – Alexey Romanov Feb 05 '13 at 07:57
  • It'll give you enough semantics to translate roughly, but it's unlikely to give sentence boundaries, at least as the writer intended them. – jlawler Feb 05 '13 at 23:20
  • The more I think about it, though, just the change in number of options should give you most constituent boundaries, including sentences, regardless of punctuation. But you couldn't distinguish sentence boundaries from other constituent boundaries without more processing of the kinds of options. – jlawler Feb 10 '13 at 23:58

2 Answers

7

Here's how such a sentence boundary algorithm could be built.

The crucial thing is to come up with sequences of words (n-grams) that have a high likelihood of occurring at the end of a sentence or at the start of a sentence and a small likelihood of occurring within a sentence.

A list of these n-grams could be compiled with a large corpus such as the written part of the British National Corpus. Once the list is complete, the algorithm would go through the text without punctuation and for every word boundary look up the preceding n-grams in the list. For example, in a sequence of words such as

have you seen it it's incredible

once the algorithm is at have you seen it, it would look up seen it in the list, which would give a score such as: occurred 56 times at the end of a sentence and 22 times within a sentence, i.e. a score of 56/22 ≈ 2.5. There should also be a list of three-word sequences, so it would look up you seen it, which might have a score of 13/2 = 6.5, and then go on with a four-word list. Now average all scores, and if the average passes a certain threshold (which would need to be determined empirically), the algorithm sets a sentence boundary.

The same could be done with n-grams/word sequences occurring sentence-initially, and n-grams which occur at sentence boundaries.
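The scoring scheme described above could be sketched in Python roughly as follows. Everything here is illustrative, not a tested implementation: the function names, the add-one smoothing, and the threshold value are my own assumptions, and a real system would need a large training corpus and empirical tuning.

```python
from collections import defaultdict

def train_boundary_ngrams(sentences, max_n=3):
    """Count how often each n-gram ends a sentence vs. occurs inside one."""
    end_counts = defaultdict(int)
    internal_counts = defaultdict(int)
    for words in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                if i + n == len(words):  # this n-gram touches the sentence end
                    end_counts[gram] += 1
                else:
                    internal_counts[gram] += 1
    return end_counts, internal_counts

def boundary_score(preceding, end_counts, internal_counts, max_n=3):
    """Average the end/internal ratios of the 1..max_n-grams ending here."""
    scores = []
    for n in range(1, max_n + 1):
        if len(preceding) < n:
            break
        gram = tuple(preceding[-n:])
        ends = end_counts.get(gram, 0)
        internal = internal_counts.get(gram, 0)
        scores.append((ends + 1) / (internal + 1))  # add-one smoothing
    return sum(scores) / len(scores)

def split_sentences(words, end_counts, internal_counts, threshold=2.0):
    """Place a boundary wherever the averaged score clears the threshold."""
    sentences, current = [], []
    for word in words:
        current.append(word)
        if boundary_score(current, end_counts, internal_counts) > threshold:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

Trained on a tokenized corpus (each sentence a list of words), `split_sentences` would then walk through unpunctuated text and cut wherever sequences like "seen it" score high as sentence-enders.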

Edit: As @P Elliott pointed out in the comments, a somewhat similar algorithm has been implemented (A Maximum Entropy Approach to Identifying Sentence Boundaries) and has high accuracy.

havlock
robert
  • Your answer is very interesting. Have you done any experiments on what you describe as an algorithm? Or has anyone else experimented with it? And why do you not write down an explicit algorithm in the form we usually read or encounter? – XL _At_Here_There Aug 12 '13 at 12:30
  • No, I haven't written or tested any such algorithm and don't know of any examples. I guess the more verbose description I gave is more readable than a formal description. – robert Aug 12 '13 at 12:34
  • As far as I know, there is no algorithm for the question now. You know, computer science is not like traditional linguistics; it has to be based on proof or experiment. We cannot present an algorithm by imagination. I begin to understand why my answers on the forum are usually downvoted. – XL _At_Here_There Aug 12 '13 at 12:40
  • It could be a little more gently articulated, but I do agree with the spirit of @XL_at_China's comment. As an answer to a question about whether there exists a sentence-splitting algorithm, I would expect at least a reference to an algorithm which has actually been successfully implemented. The kind of brute-force method you suggest does seem plausible, but I wouldn't say that it counts as an answer to the question – there's no way to evaluate it without actually implementing it. – P Elliott Aug 12 '13 at 13:23
  • In fact, if it's as simple as this, I'd be frankly amazed if it hadn't been implemented. The algorithm described in this paper sounds similar to what you have in mind: http://www.aclweb.org/anthology-new/A/A97/A97-1004.pdf. – P Elliott Aug 12 '13 at 13:33
  • Well, so far it seems that no such algorithm exists (and if somebody comes up with one I'll be happy to upvote that answer). So technically the answer should be "No, there is none". Suggesting a way such an algorithm could work is surely more helpful than that. Also, this kind of pattern learning is often used in NLP, for example to train parsers. And evaluation (short of actual implementation) is possible in the sense that I can think of other algorithms for the problem that seem less likely to succeed, given previous NLP algorithms. – robert Aug 12 '13 at 13:34
  • @PElliott, sorry, maybe I spoke too frankly. – XL _At_Here_There Aug 12 '13 at 13:35
  • Thanks for the link @P Elliott, I'll add it to the answer. – robert Aug 12 '13 at 13:36
  • No problem @robert – I hope I didn't sound too critical; your answer was interesting and well articulated. One difference between the algorithm I've linked to and the one you describe is that theirs only really looks at bigrams. They found that looking at larger contexts didn't meaningfully improve performance. Their model also focuses on existing punctuation in the text, deciding whether it marks a sentence boundary or not, rather than looking at every single word boundary. The way that you've described it, I'd think that your algorithm might end up being computationally intractable. – P Elliott Aug 12 '13 at 13:47
  • http://en.wikiquote.org/wiki/Fred_Jelinek – Fred Jelinek is often quoted as saying "Every time I fire a linguist, the performance of our speech recognition system goes up". He was too frank to be polite. – XL _At_Here_There Aug 12 '13 at 14:31
2

I built a sentence segmenter that also works well on unpunctuated or partially punctuated text. You can find it at https://github.com/bedapudi6788/deepsegment .

This model is based on the idea that Named Entity Recognition techniques can be used for sentence boundary detection (i.e. identifying the beginning or end of a sentence). I used data from Tatoeba to generate the training data and trained a BiLSTM+CRF model with GloVe embeddings and character-level embeddings for this task.
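To illustrate the NER-style framing, here is a toy sketch of how such training data might be generated from a collection of punctuated sentences: the first word of each sentence gets a begin tag and everything else an outside tag, with punctuation stripped to simulate the errors the model must cope with. The tag names and the preprocessing here are my own assumptions, not necessarily what the deepsegment repository actually uses.

```python
def make_training_example(sentences):
    """Concatenate sentences without punctuation; tag the first word of
    each sentence B-sent and every other word O, as in NER-style tagging."""
    words, tags = [], []
    for sentence in sentences:
        for i, word in enumerate(sentence.split()):
            words.append(word.lower().strip(".!?,"))  # simulate missing punctuation
            tags.append("B-sent" if i == 0 else "O")
    return words, tags
```

A sequence tagger (such as a BiLSTM+CRF) trained on pairs like these can then recover sentence starts from raw, unpunctuated word streams.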