Does anyone know of text message corpora?

Question

I am looking for a large corpus of text messages. By large, I mean at least 15,000 text messages in my sample. I am fine with combining several smaller corpora into a larger one, as I will also be adding thousands of text messages relating to the patterns of interest in my research.

Clarification on Requirements:

- Text messages should predominantly be in English (US/American ideally), although a mixture of Spanish (Mexican Spanish) is also good.
- The corpus can be either free or available for a reasonable fee.
- They should be text messages, as I am specifically looking for euphemisms and slang used in text messaging as well as emoticons.

It isn't exactly what you're asking for but have you considered using the Twitter API to build a corpus of tweets? — acattle, Oct 28 '12 at 11:00
not free, but look at this: https://catalog.ldc.upenn.edu/LDC2017T07 — Daniel, Oct 26 '17 at 04:23
@Daniel this is in Egyptian Arabic, not English (see first bulleted requirement) in question — Dan, Nov 24 '17 at 22:53

score 11 · Answer 1 · edited May 12 '17 at 10:18

I only found this corpus on the NUS (National University of Singapore) site, but luckily it has a lot of entries.

It has a download¹ for a corpus containing ~10,000 text messages, which was the original corpus.

But if you go to this page², there is a table listing the corpora available for download (they range from ~10,000 to ~51,000 text messages). The top row is the most recent corpus; you can download it in either XML or SQL format, or download the statistics.

I must remind you (and everyone who uses it) that, if you choose it, you should make sure to follow the instructions given by the researchers.

Update: please see the comment below my answer.

Note: It looks like the download site is down; instead, you can download all the data from their GitHub repo.
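As a rough sketch of how the XML download could be folded into a larger combined corpus, as the question describes: the snippet below pulls the message texts out of an XML string and merges them with a second list of messages. The element names (`smsCorpus`, `message`, `text`), the sample messages, and the helper names `extract_texts` and `merge_corpora` are all assumptions made for illustration; check the actual file before relying on them.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List

# Hypothetical sample; the element names in the real export may differ.
SAMPLE_XML = """<smsCorpus>
  <message id="1"><text>c u l8r :)</text></message>
  <message id="2"><text>ok lol</text></message>
  <message id="3"><text>c u l8r :)</text></message>
</smsCorpus>"""


def extract_texts(xml_string: str, text_tag: str = "text") -> List[str]:
    """Return the stripped content of every <text_tag> element that is non-empty."""
    root = ET.fromstring(xml_string)
    return [
        el.text.strip()
        for el in root.iter(text_tag)
        if el.text and el.text.strip()
    ]


def merge_corpora(*corpora: Iterable[str]) -> List[str]:
    """Concatenate corpora, dropping exact duplicates and keeping first-seen order."""
    seen = set()
    merged = []
    for corpus in corpora:
        for msg in corpus:
            if msg not in seen:
                seen.add(msg)
                merged.append(msg)
    return merged


downloaded = extract_texts(SAMPLE_XML)
own_messages = ["ok lol", "omw, brt"]  # hypothetical self-collected messages
combined = merge_corpora(downloaded, own_messages)
print(len(combined), combined)
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`. Dropping exact duplicates is a design choice; if message frequency matters for the research, keep the duplicates instead.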

edited May 12 '17 at 10:18 by Michael Henretty
answered Feb 13 '12 at 22:50 by Alenanno

@Dan O'Day I'm one of the organizers of the NUS SMS Corpus, referenced in this answer. We had a recent paper in Language Resources and Evaluation that describes all of the SMS corpora we found, paid or free, that are of some non-trivial scale (you can get the preprint from the corpus website, linked in the answer). Unfortunately, none of them really match your U.S. Eng/Spa (Mexican) requirements. The NUS corpus' English messages are largely from Singaporean university students, so the language you'll see used is quite a bit different from what you would see in U.S. SMS. – Min-Yen Kan Oct 27 '12 at 14:32
@majnemɪzdæn Consider accepting the answer if it solves your question. :) – Alenanno Sep 30 '14 at 10:29
@Alenanno my apologies, I vacillate only because technically this did not meet my requirements (the Singaporean English didn't work for my requirements). The reality is that there likely is not a corpus that does. I ended up beginning to create one, but do not have enough samples yet to make it all that significant. I did upvote this answer, however - but perhaps another corpus will come to light? – Dan Sep 30 '14 at 13:06
@majnemɪzdæn Don't worry, there is nothing forcing you to accept, especially if you're not 100% satisfied, but I doubt there's a corpus that will, as you said. – Alenanno Sep 30 '14 at 14:49
It was recently brought to my attention that Purdue is no longer hosting the corpus I developed. Here it is on the Web Archive: https://web.archive.org/web/20131123115633/http://cybersecurity.cit.purduecal.edu/content/tmcorpus.html – Dan Nov 24 '17 at 22:55
The link is dead: http://wing.comp.nus.edu.sg:8080/SMSCorpus/history.jsp – Haha TTpro Jan 16 '18 at 03:07

Hbar · Answer 2 · 2021-09-30T13:06:14.047

Here are two other datasets:

The NER dataset has the named entities replaced by entity types, such as [ORGANIZATION], [DATE], etc. The Arruda dataset is mainly English and surprisingly emoji-free. Neither dataset provides much detail on how it was created.

Google has a search engine for datasets that is a good starting point when looking for datasets. It covers Kaggle, the US Government, some scientific journals, as well as Statista, Data World, and other data aggregators.

score 2 · Answer 3 · answered Sep 30 '21 at 13:24

A standard tool in the search for corpora is the CLARIN Virtual Language Observatory (VLO). Searching for "twitter" and setting the language facet to English gives "Twitter sentiment for 15 European languages" as the top result. The corpus is under a free licence.
