I need a corpus of modern English with part-of-speech tags in order to train a part-of-speech tagger. My domain is the Hansard, the transcript of speeches in the British Parliament. I have tried training on corpora of American English, with unimpressive results. Are there any open-access corpora for this domain that I could use, or at least corpora of modern educated British English? I am aware of the Hansard Corpus and several similar corpora, but I need a corpus that I can download for use with the Python NLTK library.

Sir Cornflakes
  • A side note: NLTK also has some generic reader modules for loading external corpora. This depends on the format they’re in, but one can often find that a friendly soul has written an appropriate module for NLTK. – Jeremy Needle Jul 14 '16 at 14:54
  • The NLTK book lists quite a lot of annotated corpora for various languages. You might find something there. – Natalie Clarius Jul 14 '16 at 15:13
  • See also this question http://linguistics.stackexchange.com/questions/12323/biggest-freely-available-english-corpus – Sir Cornflakes Mar 08 '17 at 16:41

1 Answer

As far as I am aware, most corpora of British English are not freely available. There are, however, POS-tagged corpora of British English that include formal language and can be downloaded by individuals, although I am not aware of any that are open source. These include:

  • the British National Corpus, which can be downloaded in its XML version from the Oxford Text Archive (http://ota.ox.ac.uk/desc/2554), free in the UK (via Shibboleth).

  • various corpora, downloadable in various formats including a 'linear text' format, which can be purchased from the Brigham Young University corpus interface (http://corpus.byu.edu/full-text/).
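
If you go the BNC route, getting the XML into a shape a tagger can train on is not much work. The following is only a rough sketch using the Python standard library. It assumes the XML edition marks sentences as <s> elements containing <w> (word) and <c> (punctuation) elements with a CLAWS5 tag in a "c5" attribute; the SAMPLE string and the tagged_sentences helper are made up for illustration and are not taken from the corpus or from NLTK.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment written by hand to mimic the assumed BNC XML layout:
# <s> sentences holding <w> words and <c> punctuation, each with a "c5" tag.
SAMPLE = """<bncDoc>
  <wtext>
    <div>
      <p>
        <s n="1"><w c5="AT0" hw="the" pos="ART">The </w><w c5="NN1" hw="house" pos="SUBST">House </w><w c5="VVD" hw="divide" pos="VERB">divided</w><c c5="PUN">.</c></s>
      </p>
    </div>
  </wtext>
</bncDoc>"""


def tagged_sentences(xml_text):
    """Return a list of sentences, each a list of (token, c5_tag) pairs."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        tokens = []
        # iter() walks all descendants, so words nested inside other
        # elements within the sentence are picked up as well.
        for el in s.iter():
            if el.tag in ("w", "c") and el.text:
                tokens.append((el.text.strip(), el.get("c5")))
        if tokens:
            sentences.append(tokens)
    return sentences


sents = tagged_sentences(SAMPLE)
print(sents)
```

A list of sentences in this shape, each a list of (word, tag) pairs, is the kind of training data that NLTK's trainable taggers such as nltk.UnigramTagger take. NLTK also ships a BNCCorpusReader for the XML edition, which may save you from writing the parsing step yourself.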

ATJ
  • Those who are not specifically looking to download a corpus may find the following (non-exhaustive) list of English corpora helpful: http://www.corpora4learning.net/resources/corpora.html#BE. – ATJ Mar 06 '17 at 22:09