Corpora available at the RCEP

Corpora are large electronic collections of language data. Many corpora consist not only of written data such as newspaper articles but also contain samples of spoken language or even transcripts of conversations, which is one of the reasons why corpora have become increasingly attractive for pragmatic research.

At the RCEP, students have access to the following corpora:

Corpus name | Size (in words) | Spoken/written? | Variety sampled | Tagged? | Available at RCEP?
ACE – Australian Corpus of English | 1 million | both | Australian | no | yes

Developer:
Pam Peters, Peter Collins and David Blair at Macquarie University, Sydney
Sampling period:
1986
Size:
1 million words; 500 text samples of approx. 2,000 words
Contents:

written and spoken language; modelled on LOB and BROWN

Variety sampled:
Australian English
Annotation:
untagged
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of ACE

ANC – American National Corpus | 100 million (aim) | both | American | yes | yes

Developer:
Randi Reppen
Sampling period:
from 1990 (ongoing)
Size:
aim 100 million words
Contents:

written and spoken texts (written part 90%), genres comparable to BNC

Variety sampled:
American English
Annotation:
XML tagged
Availability:

available for students at the RCEP

Homepage:

Homepage of the American National Corpus (ANC)

BROWN Corpus | 1 million | written | American | yes | yes

Developer:
Nelson Francis and Henry Kucera at Brown University, Providence, Rhode Island
Sampling period:
early 1960s
Size:
1 million words
Contents:

written language; 500 text samples of approx. 2,000 words; 15 text categories

Variety sampled:
American English
Annotation:
untagged and POS-tagged versions
Availability:

Available for students at the RCEP (ICAME)

Homepage:

Manual of the BROWN Corpus

CEECS – Corpus of Early English Correspondence Sampler | 450,000 | written | British | no | yes

Developer:
Terttu Nevalainen, Helena Raumolin-Brunberg and colleagues at the Department of English, University of Helsinki
Sampling period:
1418-1680
Size:
450,000 words
Contents:

personal letters

Variety sampled:
British English
Annotation:
no annotation
Availability:

Available for students at the RCEP (ICAME)

Homepage:

Manual of the CEECS Corpus

COLT – Bergen Corpus of London Teenage Language | 500,000 | spoken | British | yes | yes

Developer:
University of Bergen, Norway
Sampling period:
1993
Size:
500,000 words
Contents:

transcripts of spoken language of London teenagers (COLT is part of the BNC)

Variety sampled:
British English
Annotation:
POS tagging
Availability:

available for students at the RCEP (ICAME)

Homepage:

Homepage of the COLT Corpus

FLOB – Freiburg-LOB Corpus of British English | 1 million | written | British | no | yes

Developer:
Christian Mair at the University of Freiburg
Sampling period:
1990s
Size:
1 million words
Contents:

written language; 500 text samples of approx. 2,000 words; 15 text categories (matches the original LOB corpus)

Variety sampled:
British English
Annotation:
untagged
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the FLOB corpus

FROWN – Freiburg-BROWN Corpus of American English | 1 million | written | American | no | yes

Developer:
Christian Mair at the University of Freiburg
Sampling period:
1990s
Size:
1 million words
Contents:

500 text samples of approx. 2,000 words; 15 text categories (matches the Brown Corpus)

Variety sampled:
American English
Annotation:
untagged
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the FROWN corpus

Helsinki Corpus of English Texts: Diachronic Part | 1.5 million | written | British | no | yes

Developer:
M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
Sampling period:
ca. 750 to 1700
Size:
1.5 million words
Contents:

samples of Old, Middle and Early Modern English texts

Variety sampled:
British English
Annotation:
untagged
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the Helsinki Corpus

Helsinki Corpus of Older Scots | 830,000 | written | Northern British | no | yes

Developer:
Anneli Meurman-Solin at the Department of English, University of Helsinki
Sampling period:
1450-1700
Size:
830,000 words
Contents:

Older Scots texts covering 15 prose genres

Variety sampled:
Northern British English
Annotation:
untagged
Availability:

available for students at the RCEP (ICAME)

Homepage:

Bibliography of the Helsinki Corpus of Older Scots (no specific manual available online)

ICE – International Corpus of English + SPICE-Ireland – Systems of Pragmatic Annotation for the Spoken Component of ICE-Ireland | 1 million | both | all | yes | some parts

Developer:
Jeffrey L. Kallen and John M. Kirk
Sampling period:
1990s
Size:
500 texts, each 2,000 words (1 million words)
Contents:

500 texts, spoken and written language (spoken part 60%):
Spoken (300)

  • Dialogue (180)
    • Private (100)
    • Public (80)
  • Monologue (120)
    • Unscripted (70)
    • Scripted (50)

Written (200)

  • Non-printed (50)
    • Non-professional writing (20)
    • Correspondence (30)
  • Printed (150)
    • Informational (learned) (40)
    • Informational (popular) (40)
    • Informational (reportage) (20)
    • Instructional (20)
    • Persuasive (10)
    • Creative (20)

(Figures adapted from Kennedy (1998: 55))
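
The composition listed above can be checked with simple arithmetic. The following is a minimal Python sketch; the nested-dictionary layout and the names `ICE_DESIGN`, `WORDS_PER_TEXT` and `total` are illustrative assumptions and not part of any ICE software. It confirms that the category counts add up to 500 texts, that 500 texts of about 2,000 words give roughly 1 million words, and that the spoken part is 60%.

```python
# Text counts per category, copied from the list above
# (figures adapted from Kennedy 1998: 55).
# The nested-dict layout and all names are illustrative assumptions.
ICE_DESIGN = {
    "spoken": {
        "dialogue": {"private": 100, "public": 80},
        "monologue": {"unscripted": 70, "scripted": 50},
    },
    "written": {
        "non-printed": {"non-professional writing": 20, "correspondence": 30},
        "printed": {
            "informational (learned)": 40,
            "informational (popular)": 40,
            "informational (reportage)": 20,
            "instructional": 20,
            "persuasive": 10,
            "creative": 20,
        },
    },
}

WORDS_PER_TEXT = 2000  # approximate length of each text sample


def total(node):
    """Recursively sum the text counts in a nested category dict."""
    if isinstance(node, int):
        return node
    return sum(total(child) for child in node.values())


n_texts = total(ICE_DESIGN)
spoken_share = total(ICE_DESIGN["spoken"]) / n_texts

print(n_texts)                   # 500
print(n_texts * WORDS_PER_TEXT)  # 1000000
print(spoken_share)              # 0.6
```

The same helper can be applied to any subtree, for example `total(ICE_DESIGN["written"]["printed"])`, to check an individual subtotal against the list.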

SPICE-Ireland

  • provides pragmatic and discourse annotation and
  • adds a prosodic transcription for 100 of the 300 texts of the spoken component of the ICE-Ireland Corpus.
Variety sampled:
the aim is to sample all varieties of English
Annotation:
Textual markup, word class tagging, syntactic parsing (+ additional tags in some components)
Availability:
  • The Hong Kong, East Africa, India, Philippines and Singapore components can be freely downloaded from the homepage
  • Great Britain is available for students at the RCEP and on the corpus computer in the IAAK library (Lehrstuhl Esser)
  • Ireland is available for students at the RCEP
  • SPICE-Ireland is currently available only at Prof. Schneider's office but will be transferred to the RCEP soon.
Homepage:

Homepage of the ICE corpus


Manual of the ICE corpus

Kolhapur Corpus | 1 million | written | Indian | no | yes

Developer:
S. K. Verma at University of Lancaster and Shivaji University, Kolhapur
Sampling period:
1978
Size:
1 million words, 500 text samples of approx. 2,000 words
Contents:

written language; modelled on BROWN and LOB

Variety sampled:
Indian English
Annotation:
untagged
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the Kolhapur Corpus

Lampeter Corpus of Early Modern English Tracts | 1.1 million | written | British | yes | yes

Developer:
Josef Schmied, Claudia Claridge and Rainer Siemund at TU Chemnitz
Sampling period:
1640-1740
Size:
1.1 million words
Contents:

non-literary prose texts of Early Modern English (various genres)

Variety sampled:
British English
Annotation:
textual markup
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the Lampeter Corpus (PDF)

Lancaster Parsed Corpus | 140,000 | written | British | yes | yes

Developer:
Roger Garside, Geoffrey Leech and Tamas Varadi at the University of Lancaster
Sampling period:
1961
Size:
140,000 words
Contents:

parsed subcorpus of the LOB

Variety sampled:
British English
Annotation:
POS tagging, syntactic parsing
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the Lancaster Corpus

LLC – London-Lund Corpus of Spoken English | 500,000 | spoken | British | yes | yes

Developer:
Randolph Quirk and Sidney Greenbaum at University College London; Jan Svartvik at Lund University
Sampling period:
1960s, 1975-81, 1985-88
Size:
500,000 words
Contents:

spoken language, based on the Survey of English Usage (SEU, 1959, University College London) and on the Survey of Spoken English (SSE, 1975, Lund University)

Variety sampled:
British English
Annotation:
prosodic and discourse annotation
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the LLC

LOB – Lancaster/Oslo-Bergen Corpus | 1 million | written | British | yes | yes

Developer:
Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo, in collaboration with Knut Hofland, Norwegian Computing Centre for the Humanities, Bergen
Sampling period:
1961
Size:
1 million words
Contents:

written language; 500 text samples of approx. 2,000 words; 15 text categories; British counterpart of Brown corpus

Variety sampled:
British English
Annotation:
untagged and POS-tagged versions
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the LOB Corpus

Newdigate Newsletter Corpus | 750,000 | written | British | no | yes

Developer:
Philip Hines, Jr., Norfolk, Virginia
Sampling period:
1692
Size:
750,000 words
Contents:

a series of more than 2,000 newsletters in the Newdigate series (most of which are addressed to Sir Richard Newdigate, Warwickshire)

Variety sampled:
British English
Annotation:
untagged
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the Newdigate Corpus

PoW – Polytechnic of Wales Corpus | 65,000 | spoken | British | yes | yes

Developer:
The Computational Linguistics Unit at University of Wales College of Cardiff
Sampling period:
1978-1984
Size:
65,000 words
Contents:

transcripts of spoken child language

Variety sampled:
British English
Annotation:
POS tagging, syntactic parsing
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the PoW Corpus

(SB)CSAE – Santa Barbara Corpus of Spoken American English | 249,000 | spoken | American | yes | yes

Developer:
John W. Du Bois, Wallace L. Chafe, Sandra A. Thompson, Charles Meyer, Robert Englebretson
Sampling period:
1990s
Size:
249,000 words
Contents:

transcripts and audio files of naturally occurring interaction from all over the US (mostly face-to-face conversations)

Variety sampled:
American English
Annotation:
transcripts are time-stamped, overlap indicated; marked-up version on talkbank.org
Availability:

available for students at the RCEP (parts 1-4)

marked-up open access version on talkbank.org

Homepage:

Homepage of the Santa Barbara Corpus of Spoken American English

SEC – Lancaster/IBM Spoken English Corpus | 52,000 | spoken | British | yes | yes

Developer:
University of Lancaster and IBM Scientific Centre
Sampling period:
1984-87
Size:
52,000 words
Contents:

spoken language; transcripts of radio broadcasts, recordings made at the University of Lancaster

Variety sampled:
British English
Annotation:
prosodic markup, POS tagged
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the SEC

Wellington Corpus of Written New Zealand English | 1 million | written | New Zealand | no | yes

Developer:
Laurie Bauer at Victoria University, Wellington
Sampling period:
1986-90
Size:
1 million words; 500 text samples of approx. 2,000 words
Contents:

written language; modelled on BROWN and LOB

Variety sampled:
New Zealand English
Annotation:
untagged
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the Wellington Corpus (written)

Wellington Corpus of Spoken New Zealand English | 1 million | spoken | New Zealand | yes | yes

Developer:
Janet Holmes, Bernadette Vine and Gary Johnson at Victoria University, Wellington
Sampling period:
1988-94
Size:
1 million words; 500 text samples of approx. 2,000 words
Contents:

spoken language; formal, semi-formal and informal speech

Variety sampled:
New Zealand English
Annotation:
discourse markup
Availability:

available for students at the RCEP (ICAME)

Homepage:

Manual of the Wellington Corpus (spoken)
