9

I'm an amateur writer who happens to be a professional programmer.

I say this because I've recently jumped back into a personal research project in which the goal is to automate the de-anonymization of passages or, more simply, figure out the true author behind an anonymous piece of writing.

To do this, I'm extracting the key elements of a person's writing style and matching them against writers whose styles I've already indexed. The patterns I currently look for are:

  • Word choices & vocabulary
  • Readability score: calculated by the Flesch-Kincaid index (what Microsoft Word uses)
  • Punctuation preferences: whether 1 or 2 spaces are used after a sentence, frequency of exclamation points, commas, etc.
  • Sentence structure: where the various elements of a sentence are usually located within that sentence, as well as general sentence length
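
For illustration, here is a minimal Python sketch of how the readability and punctuation features above might be extracted. It assumes the standard Flesch-Kincaid grade level formula and a crude vowel-group syllable counter; the function names `count_syllables` and `style_features` are hypothetical, and the regex-based sentence splitting is a simplification rather than a robust tokenizer.

```python
import re

def count_syllables(word):
    # Rough heuristic (an assumption): one syllable per run of vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def style_features(text):
    # Naive sentence split on terminal punctuation; abbreviations will confuse it.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    n_sent = max(1, len(sentences))
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    # Flesch-Kincaid grade level formula.
    fk_grade = 0.39 * (n_words / n_sent) + 11.8 * (syllables / n_words) - 15.59
    return {
        "fk_grade": fk_grade,
        "words_per_sentence": n_words / n_sent,
        "commas_per_sentence": text.count(",") / n_sent,
        "exclamations_per_sentence": text.count("!") / n_sent,
        # Spacing habit after a sentence: exactly two spaces vs. exactly one.
        "double_space_after_sentence": len(re.findall(r"[.!?]  (?=\S)", text)),
        "single_space_after_sentence": len(re.findall(r"[.!?] (?=\S)", text)),
    }
```

Calling `style_features(passage)` on each indexed sample gives a small feature dictionary per author that can then be compared against the same features for the anonymous passage.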

What other patterns can be used to explicitly define a writing style?

drusepth

1 Answer

7

The question title asks for ways of defining a writer's unique style, but the body indicates that the aim is to identify authors by analysing their writing. Identifying authors from their style, known as 'authorship attribution', does not necessarily involve characterising a 'unique style': it relies on statistical differences between authors rather than on features unique to any one of them. As a result, even when a human reader could not tell that two pieces were by different authors, statistical calculations based on various measures may detect the difference with a fair degree of certainty.

Authorship attribution has a long tradition, particularly in theological studies and philology, but has developed many new methods with the advent of natural language processing. Many different metrics have been employed in authorship attribution, such as:

  • word length
  • sentence length
  • character frequencies
  • word frequencies
  • vocabulary richness (various functions)

It's not possible to list every measure: Rudman (1998) estimated that over 1,000 different measures had been developed by that point.
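
To make the metrics listed above concrete, here is a minimal Python sketch that computes them for one passage. The function name `stylometric_profile` is hypothetical, the tokenization is a simple regex, and the type-token ratio stands in for "vocabulary richness" only as one of the many functions that have been proposed (it is sensitive to text length, so it is best compared across samples of similar size).

```python
import re
from collections import Counter

def stylometric_profile(text, top_n=5):
    # Naive tokenization (an assumption): regex words, split sentences on . ! ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    letters = [c for c in text.lower() if c.isalpha()]
    word_counts = Counter(words)
    n_words = max(1, len(words))
    return {
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "mean_sentence_length": len(words) / max(1, len(sentences)),
        # Relative frequency of each letter in the passage.
        "char_freq": {c: n / len(letters) for c, n in Counter(letters).items()},
        "top_words": word_counts.most_common(top_n),
        # Distinct words divided by total words: one simple richness measure.
        "type_token_ratio": len(word_counts) / n_words,
    }
```

Profiles like this, computed for texts of known authorship, would then be fed to whatever statistical comparison is chosen; the sketch covers only feature extraction, not the attribution step itself.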

Some useful references are listed below.

Gaston Ümlaut