Statistical Analysis of Corpus Data with R
A Gentle Introduction for Computational Linguists and Similar Creatures
Statistical Analysis of Corpus Data with R is an online course by Marco Baroni and Stephanie Evert. It is based on a number of previous courses on similar topics taught together by the authors, in particular the course on R Programming for (Computational) Linguists given at the DGfS Fall School in Computational Linguistics (Potsdam, 2007).
News:
The SIGIL course is currently being restructured – a new Web page will be launched when a stable state has been reached. You can already download updated versions of most of the course units below.
New slide sets (work in progress)
R code examples in the slide sets below make use of functions and data sets included in a supporting R package or available as separate files, depending on whether the slides have been updated yet. Please install the following software and data:
- The corpora package (version 0.6), which is available on CRAN and can be installed with any standard R package manager.
- Any additional data and code files required by the unit you're studying. These are listed together with the handouts and exercises below. You can also download a ZIP archive with most data sets (2.9 MiB).
- Notice: Some slides may still refer to data sets in the SIGIL package, which was rejected by CRAN. Please use the corpora package instead, making sure that you have installed version 0.6 or newer.
It is recommended that you put all data and code files in an RStudio project directory (or your current working directory). All code examples in the slides and exercises will make this assumption.
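As a rough sketch, the setup described above might look as follows in an R session. The CRAN mirror URL is just one possible choice, and the check for brown.stats.txt assumes that you have already downloaded that file from the data sets listed further down into the current directory.

```r
# Install the corpora package from CRAN if it is missing or older than version 0.6.
# The mirror below is an assumption; any CRAN mirror should work.
if (!requireNamespace("corpora", quietly = TRUE) ||
    packageVersion("corpora") < "0.6") {
  install.packages("corpora", repos = "https://cloud.r-project.org")
}
library(corpora)

# Show the directory in which R looks for files given by relative names.
getwd()

# Check whether one of the course data files is visible from here
# (assumes brown.stats.txt has been downloaded into this directory).
file.exists("brown.stats.txt")
```

If file.exists() returns FALSE, either the file has not been downloaded yet or R is running in a different directory than the one you saved it to.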
SIGIL course units
- Unit 1: General introduction / First steps in R (updated on 12.07.2015)
- Unit 2: Corpus frequency data & statistical inference (updated on 20.06.2016)
- Unit 3: Descriptive and inferential statistics for continuous data
- Unit 4: Collocations, keywords & contingency tables
- Unit 5: Word frequency distributions and Zipf's law: Using add-on packages (updated on 23.06.2016)
- Unit 6: Regression and the general linear model
- Unit 7: Multivariate analysis (updated on 06.04.2023)
- Unit 8: The non-randomness of corpus data & generalised linear models (updated on 26.03.2010)
- Unit 9: Inter-annotator agreement
Old version of the SIGIL course
- Introduction (slides, handout)
- Hypothesis tests for corpus frequency data (slides, handout)
- Word frequency distributions with zipfR (slides, handout)
- Clustering and dimensionality reduction (slides, handout, data sets)
- Using statistical association measures for collocation extraction
  - Part 1: contingency tables and association scores (slides, handout)
  - Part 2: large-scale processing and evaluation (slides, handout)
- The limitations of random sampling methods (slides, handout)
- A short introduction to the mathematics of regression and linear models (slides, handout, R examples)
- Statistical models
- Collected R code (ZIP archive) from handouts
- Some other sample R scripts (ZIP archive) with detailed comments
Data sets
- brown.stats.txt (basic type-token statistics for the Brown corpus)
- lob.stats.txt (basic type-token statistics for the LOB corpus)
- bnc_metadata.tbl* (metadata information from the British National Corpus)
- bigrams.100k.spc (frequency spectrum of bigrams from the first 100k tokens of Brown)
- bigrams.100k.tfl (type frequency list of bigrams from the first 100k tokens of Brown)
- bigrams.vgc (vocabulary growth curve of bigrams in the Brown corpus)
- comp.stats.txt* (distributional information for different types of Italian noun-noun compounds)
- brown_bigrams.tbl (bigram collocations in the Brown corpus, with full contingency tables)
- krenn_pp_verb.tbl* (German PP-verb collocations with manual MWE annotation)
- bnc_gender_small.tbl (data set for identification of author gender in the BNC)
Download ZIP archive with all data sets (2.9 MB).
* These files contain Unicode strings with accented characters. If you are running R on a Windows computer, specify the option encoding="UTF-8" when loading the files with read.delim() in order to handle such strings correctly.
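The following sketch illustrates the effect of this option without needing any of the course files. It writes a small tab-delimited file to a temporary location and reads it back. The two rows are made up for illustration and are not taken from any of the data sets above.

```r
# Two made-up rows containing accented characters, written as Unicode escapes
# so that the script itself is plain ASCII.
lines <- c("word\tfreq", "caff\u00e8\t3", "\u00dcbung\t5")

# Write the lines as UTF-8 bytes to a temporary tab-delimited file.
tmp <- tempfile(fileext = ".tbl")
writeLines(enc2utf8(lines), tmp, useBytes = TRUE)

# Read the file back, declaring its strings to be UTF-8 as recommended above.
demo <- read.delim(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
print(demo)

# Inspect how R has marked the encoding of the strings it read.
Encoding(demo$word)
```

For one of the starred files the call would look the same with the file name in place of tmp, for example read.delim("krenn_pp_verb.tbl", encoding="UTF-8"), assuming the file is in your working directory.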
Exercises