Corpus Linguistics

Corpus linguistics is the study of language using natural texts. A corpus is a collection of real documents used for the study of language (the plural of corpus is corpora). The documents included in a corpus are generally diverse, to provide as broad a selection of native language use as possible. This gives a much better understanding of day-to-day use of the language than textbooks do. Textbooks are generally written in a formal style, while native texts, such as letters, transcripts of conversations, news articles and fiction, provide a much richer picture of how language is used in practice.

Most corpora are finite; that is, a target length is defined before the collection project begins. Some are open-ended, like the COBUILD monitor corpus at the University of Birmingham (UK), which is constantly updated. There are advantages to a never-ending project: for example, it can capture new and rare words and usages that might not be unearthed in a corpus with a limited word count. There is also a potential disadvantage: each document cannot be screened as carefully, so documents might contain inaccuracies.

Many corpora are online and readily available. The University of Virginia Library has an English language Digital Collections section that is free and easily searchable.

Scholars who dig deeply enough will find words that do not exist in common dictionaries, but there are alternative sources, such as the Phrontistery, an extensive compilation of weird and unusual words.
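As a rough illustration of how such words surface, the sketch below compares the words in a small corpus against a dictionary word list and reports the ones the list does not contain. The documents, the word list and the function names are all hypothetical toy examples, and the tokenizer is deliberately simple (lowercase letters with an optional apostrophe); a real project would need a fuller word list and more careful tokenization.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def words_missing_from_dictionary(documents, dictionary_words):
    """Return the sorted corpus words that are not in the word list."""
    known = {w.lower() for w in dictionary_words}
    found = set()
    for doc in documents:
        found.update(tokenize(doc))
    return sorted(found - known)

# Hypothetical toy data: two "documents" and a tiny dictionary word list.
corpus = [
    "The cat sat on the mat.",
    "A quockerwodger sat on the mat.",
]
dictionary = ["a", "the", "cat", "sat", "on", "mat"]

missing = words_missing_from_dictionary(corpus, dictionary)
print(missing)
```

Words reported this way are only candidates: a typo or a proper name is flagged in the same way as a genuinely rare word, so the list still has to be checked by hand.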

Corpora in other languages:
Danish Corpus Page
French Corpus
German Corpus
Portuguese Corpus
Spanish newspaper text
Cree Language Website
Manitoba Aboriginal languages
Interactive Native American Language Repository
Samala Chumash language tutorial
Omushkego Oral History Project

Additional Resources:
Improvising corpora for ELT: quick-and-dirty ways of developing corpora for language teaching
Centre for English Corpus Linguistics at the Université catholique de Louvain (Belgium)
Italian database in audio format containing the 500,000-word LIP-Corpus.
Corpus linguistics at ICT4LT - Information and Communications Technology for Language Teachers.

Some corpora contain more than one language for comparison. A parallel corpus contains the same texts translated into different languages. Linguateca Compara offers an open-ended fiction collection of searchable English-Portuguese and Portuguese-English text translations.
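To make the idea of a parallel corpus concrete, here is a minimal sketch that stores aligned sentence pairs and looks up every pair whose source sentence contains a given word. The sentence pairs and the function name search_parallel are invented for illustration; this is not how Compara or any other particular service is implemented.

```python
import re

# Hypothetical aligned English-Portuguese sentence pairs (toy data).
parallel_corpus = [
    ("The house is old.", "A casa é velha."),
    ("She bought a new house.", "Ela comprou uma casa nova."),
    ("The book is on the table.", "O livro está na mesa."),
]

def search_parallel(pairs, word):
    """Return the pairs whose source sentence contains `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]

hits = search_parallel(parallel_corpus, "house")
for src, tgt in hits:
    # Print each source sentence next to its aligned translation.
    print(src, "|", tgt)
```

Because the sentences are aligned, a search in one language returns the matching translation in the other, which lets a researcher see how the same word is rendered in different contexts.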

A corpus may be annotated or unannotated. Unannotated text is in its natural state, without a researcher's notes or comments. Annotated text has been analyzed and footnoted with linguistic information and references, which generally makes it a more useful tool. Tagging, usually done by computer, is another form of annotation that assigns a code to specific words to indicate their part of speech.
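The sketch below shows one simple way tagged text can be represented and queried. It assumes a "word/TAG" layout and a hand-tagged example sentence; the tag labels (DET, NOUN, VERB, PUNCT) follow the style of the Universal Dependencies tagset, but both the format and the sentence are illustrative assumptions, and real corpora use a variety of tagsets and file formats.

```python
from collections import Counter

# Hypothetical hand-tagged sentence in "word/TAG" format.
tagged_text = "The/DET dog/NOUN chased/VERB the/DET ball/NOUN ./PUNCT"

def parse_tagged(text):
    """Split whitespace-separated "word/TAG" tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last slash, so a word containing "/" survives.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

def words_with_tag(pairs, tag):
    """Return the words that carry the given part-of-speech tag."""
    return [word for word, t in pairs if t == tag]

pairs = parse_tagged(tagged_text)
tag_counts = Counter(tag for _, tag in pairs)
nouns = words_with_tag(pairs, "NOUN")
print(tag_counts)
print(nouns)
```

With tags in place, questions such as "which nouns appear in this text?" become simple lookups, which is part of what makes an annotated corpus more useful than raw text.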
