1 English texts from the International Corpus of English (NZ, JA and HK components)

source("../multivar_utils.R")
## Loading required package: Matrix
suppressMessages(library(data.table))
library(ggplot2)
library(ggExtra)
library(grid) # needed to work around bug in ggMarginal
library(seriation)
## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus
library(colorspace)

1.1 Data set and pre-processing

We begin by loading data tables containing metadata and feature vectors for the three components of ICE to be analysed:

Meta <- fread("ice_metadata.csv", encoding="UTF-8")
setkey(Meta, file)
Features <- fread("ice_features_new.tsv", encoding="UTF-8")
colnames(Features)[1] <- "file" # ID variable is called "file" for historical reasons
setkey(Features, file)

Unfortunately, three texts from ICE-NZ consist solely of extra-corpus material (these are cases where ICE “texts” are composites of multiple shorter texts that Stella has split up in pre-processing). We therefore need to restrict the Meta table to the texts that remain.

Meta <- Meta[file %in% Features$file]

For comparison, we also load the old feature vectors (which included extra-corpus material in the texts). We have to apply the same adjustment as for the metadata.

OldFeatures <- fread("ice_features.tsv", encoding="UTF-8")
colnames(OldFeatures)[1] <- "file"
setkey(OldFeatures, file)
OldFeatures <- OldFeatures[file %in% Features$file]

There are 3198 rows in all three tables (metadata, new features, and old features) and the text IDs (file) match for new features (TRUE) and old features (TRUE). Since we index the data.tables by text ID, which automatically sorts them alphabetically, we cannot randomize the order of the rows as in previous analyses.
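These figures come from inline R code; a minimal sketch of the underlying consistency checks (using only the objects defined above) could look like this:

stopifnot(nrow(Meta) == nrow(Features), nrow(Meta) == nrow(OldFeatures))
all.equal(Meta$file, Features$file)    # TRUE = IDs match the new features
all.equal(Meta$file, OldFeatures$file) # TRUE = IDs match the old features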

1.1.1 Normalization based on word counts

Note that relative frequencies in the feature vectors are based on word counts rather than token counts now. This was made possible by the new feature extraction pipeline using the cwb-featex tool (and by the fact that ICE is small enough so that the very inefficient query for word counts can still be executed).
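To make the distinction concrete, here is a toy illustration with invented numbers (assuming the usual convention that token counts include punctuation, while word counts do not):

n.feature <- 30    # hypothetical raw occurrences of some feature in one text
n.word    <- 2000  # word count (excluding punctuation)
n.token   <- 2300  # token count (including punctuation etc.)
c(per.word = n.feature / n.word, per.token = n.feature / n.token)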

1.1.2 Cleanup

First we check the data to see whether we have to clean it up:

  • the feature salutation_S is excluded because it is extremely sparse (and hence produces extreme z-scores)
  • the frequency of attributive adjectives (atadj_W) is highly correlated with the overall frequency of adjectives (adj_W), especially in the English texts; we therefore use non-attributive (“predicative”) adjectives (predadj_W) rather than adj_W
  • as predadj_W was not included in the old feature matrix, we have to calculate the value there (using a little trickery so it ends up in the same column as in the new feature matrix despite the renamed features; see the sanity check after the code below)
Features[, qw("salutation_S adj_W") := NULL]
OldFeatures[, predadj_W := (adj_W - atadj_W)]
OldFeatures[, adj_W := atadj_W]
OldFeatures[, atadj_W := predadj_W]
OldFeatures[, qw("salutation_S predadj_W") := NULL]
setnames(OldFeatures, qw("adj_W atadj_W"), qw("atadj_W preadj_W"))
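As a quick sanity check of the trickery above, we can verify that the two tables now have identical column layouts (a sketch, assuming the layouts are meant to agree exactly):

stopifnot(identical(colnames(Features), colnames(OldFeatures)))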

We also discard 354 texts with fewer than 100 words or fewer than 10 sentences, because their quantitative features would be too unreliable and prone to create outliers in the multivariate analysis. Before we do so, let’s check whether this affects the sub-corpora in substantially different ways.

to.discard <- Features[, !(word >= 100 & sent >= 10)]
table(to.discard, Meta$variety)
##           
## to.discard   HK   JA   NZ
##      FALSE 1113  922  809
##      TRUE   140  133   81

The discard rates are similar enough across the three varieties, so let’s proceed, not forgetting to filter the old feature vectors, too.

Features <- subset(Features, word >= 100 & sent >= 10)
Meta <- Meta[Features$file, ]
OldFeatures <- OldFeatures[Features$file, ]

After cleanup, there are 2844 texts and 44 features in the data set, including sentence, token and word counts. The two tables are still consistent, both for new features (TRUE) and old features (TRUE).

1.2 Text categories

A revised set of text categories has been provided in file text_categories.csv, which defines both category labels at three different layers of granularity (with 32, 20 and 12 categories) and the standard ordering of the categories.

TextCat <- fread("text_categories.csv", encoding="UTF-8")

We use this information to generate appropriate factor levels and colour coding for later visualisation. First, we check that there are no unexpected duplicates and that full names, short labels and category codes match at every layer.

has.distinct <- function (tbl, n=32) 
  stopifnot(length(unique(do.call(paste, as.list(tbl)))) == n)
has.distinct(TextCat[, .(text_cat)])
has.distinct(TextCat[, .(textcat32)])
has.distinct(TextCat[, .(short32)]) # combinations of name, short label
has.distinct(TextCat[, .(code32)])  # and code are necessarily unique
has.distinct(TextCat[, .(textcat20)], 20)
has.distinct(TextCat[, .(short20)], 20)
has.distinct(TextCat[, .(code20)], 20)
has.distinct(TextCat[, .(textcat20, short20, code20)], 20)
has.distinct(TextCat[, .(textcat12)], 12)
has.distinct(TextCat[, .(short12)], 12)
has.distinct(TextCat[, .(code12)], 12)
has.distinct(TextCat[, .(textcat12, short12, code12)], 12)

We now collect text category names, short labels and codes in the specified ordering (to be used as factor levels and for labeling visualisations). Note that the levels are aligned at each granularity, so it is easy to map between names, labels and codes. Similar vectors of levels are created for the three varieties and for written vs. spoken mode.

types.variety <- qw("NZ JA HK")
types.mode <- qw("W S")
types.textcat32 <- unique(TextCat$textcat32)
types.short32 <- unique(TextCat$short32)
types.code32 <- unique(TextCat$code32)
types.textcat20 <- unique(TextCat$textcat20)
types.short20 <- unique(TextCat$short20) 
types.code20 <- unique(TextCat$code20)
types.textcat12 <- unique(TextCat$textcat12)
types.short12 <- unique(TextCat$short12)
types.code12 <- unique(TextCat$code12)

We also generate aligned rainbow colour vectors for the three layers of granularity, with short labels for easy lookup. For the less fine-grained categories, the colour of the “middle” sub-category is selected.

col.vec <- rainbow_hcl(32, c = 80, l = 60)
rainbow.32 <- structure(col.vec, names=types.short32)
tmp <- TextCat[, .(col = col.vec[mean(.I)]), by=short20]
rainbow.20 <- structure(tmp$col, names=tmp$short20)
stopifnot(all.equal(names(rainbow.20), types.short20))
tmp <- TextCat[, .(col = col.vec[mean(.I)]), by=short12]
rainbow.12 <- structure(tmp$col, names=tmp$short12)
stopifnot(all.equal(names(rainbow.12), types.short12))

An overview table of the colour vectors shows that they are correctly aligned. It is also exported to a PDF file as a handy reference.

par(mfrow=c(1, 3), mar=c(0,0,1,0))
mp <- barplot(rep(1, 32), col=rainbow.32, horiz=TRUE,
              xlim=c(0, 3), xaxt="n", main="32 categories")
text(1.1, mp, types.textcat32, adj=c(0, .5))
barplot(rep(1, 32), col=rainbow.20[TextCat$short20], 
        horiz=TRUE, xlim=c(0, 3), xaxt="n", main="20 categories")
text(1.1, mp, TextCat$textcat20, adj=c(0, .5))
barplot(rep(1, 32), col=rainbow.12[TextCat$short12], 
        horiz=TRUE, xlim=c(0, 3), xaxt="n", main="12 categories")
text(1.1, mp, TextCat$textcat12, adj=c(0, .5))

invisible(dev.copy2pdf(file="pdf_proc/colour_key_textcat.pdf", out.type="cairo"))

1.3 Revise metadata

We can now merge the additional information into the Meta table and remove meta variables that are no longer needed. Before the merge, we make sure that the text_cat labels are identical in both tables.

Meta[, text_cat := sub("[1-3]$", "", text_cat)] # for lookup in TextCat
stopifnot(setequal(TextCat$text_cat, unique(Meta$text_cat)))
Meta[, qw("text_type code_type code_gen text_gen") := NULL]
Meta <- merge(Meta, TextCat, by="text_cat")
Meta[, text_cat := NULL]

We also have to make sure that Meta is still aligned with Features.

setkey(Meta, file)
stopifnot(all.equal(Meta$file, Features$file))

Finally, all meta-variables are coded as factors with correct levels and ordering.

Meta <- transform(
  Meta,
  variety = factor(variety, levels=types.variety),
  mode = factor(mode, levels=types.mode),
  textcat32 = factor(textcat32, levels=types.textcat32),
  short32 = factor(short32, levels=types.short32),
  code32 = factor(code32, levels=types.code32),
  textcat20 = factor(textcat20, levels=types.textcat20),
  short20 = factor(short20, levels=types.short20),
  code20 = factor(code20, levels=types.code20),
  textcat12 = factor(textcat12, levels=types.textcat12),
  short12 = factor(short12, levels=types.short12),
  code12 = factor(code12, levels=types.code12))

1.4 Metadata distributions

Let us now take a look at the metadata categories. We have made sure that filenames are unique IDs: TRUE.
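The inline check behind this statement might look as follows (a sketch):

stopifnot(!anyDuplicated(Meta$file)) # file names are unique text IDs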

knitr::kable(xtabs(~ textcat32 + variety, data=Meta))
textcat32                    NZ   JA   HK
face-to-face conversations   90   98  105
phonecalls                   10   15   11
classroom lessons            20   22   22
broadcast discussions        20   21   23
broadcast interviews         10   12   13
parliamentary debates        14   10   11
legal cross-examinations     18   15   16
business transactions        10   12   10
spontaneous commentaries     22   44   20
unscripted speeches          36   32   51
demonstrations               11   12   12
legal presentations          11   20   13
broadcast news               57   32   43
broadcast talks              36   25   45
non-broadcast talks          13   13   17
student essays               29   26   11
exam scripts                 14   40   13
social letters               43   93  107
business letters             78  111  133
humanities                   10   10   10
social sciences              10   10   16
natural sciences             10   10   12
technology                   13   14   15
pop humanities               11   11   20
pop social sciences          12   12   27
pop natural sciences         11   23   21
pop technology               16   23   43
press news reports           95   67  117
administrative writing       14   20   21
skills/hobbies               13   24   27
press editorials             31   22   63
novels and short stories     21   23   45
knitr::kable(xtabs(~ textcat20 + variety, data=Meta))
textcat20                    NZ   JA   HK
conversations/phonecalls    100  113  116
classroom lessons            20   22   22
broadcast interactions       30   33   36
parliamentary debates        14   10   11
legal cross-examinations     18   15   16
business transactions        10   12   10
unscripted monologues        58   76   71
demonstrations               11   12   12
legal presentations          11   20   13
scripted monologues         106   70  105
student writing              43   66   24
social letters               43   93  107
business letters             78  111  133
academic writing             43   44   53
popular-scientific writing   50   69  111
news reports                 95   67  117
administrative writing       14   20   21
skills and hobbies           13   24   27
press editorials             31   22   63
creative writing             21   23   45
knitr::kable(xtabs(~ textcat12 + variety, data=Meta))
textcat12                NZ   JA   HK
private                 100  113  116
public                   92   92   95
unscripted               80  108   96
scripted                106   70  105
student writing          43   66   24
letters                 121  204  240
academic writing         43   44   53
popular writing          50   69  111
reportage                95   67  117
instructional writing    27   44   48
persuasive writing       31   22   63
creative writing         21   23   45
knitr::kable(xtabs(~ mode + variety, data = Meta))
mode   NZ   JA   HK
W     431  539  701
S     378  383  412

There is some imbalance in the number of text samples across the three varieties and in their distribution over text categories, but this is due to the design of the ICE corpora and to artefacts of their compilation.

1.5 Text lengths

Text lengths vary wildly and still include many fairly short texts with rather unreliable feature counts. The distributions look roughly balanced across the three varieties, and there is a large group of texts with approximately 2,000 words, which points to a target text size of 2,000 words per text in the corpus design.
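A quick tabulation makes this cluster visible (a sketch; the ±10% window around 2,000 words is an arbitrary choice):

Features[, sum(abs(word - 2000) <= 200)] # texts within 10% of 2,000 words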

The two plots below compare how excluding extra-corpus material has affected the text sizes.

grid.newpage()
tmp <- cbind(Features, Meta[, .(variety)])
p <- ggplot(tmp, aes(x=word, y=sent, col=variety)) +
  scale_x_log10(limits=c(100, 10000)) + scale_y_log10(limits=c(10, 1700)) +
  geom_point(size=.4) + labs(x="number of words", y="number of sentences") +
  guides(colour = guide_legend(override.aes = list(size=2))) +
  labs(title="New (w/o extra-corpus material)")
p <- ggMarginal(p, groupColour=TRUE, groupFill=TRUE)
grid.draw(p)

grid.newpage()
tmp <- cbind(OldFeatures, Meta[, .(variety)])
p <- ggplot(tmp, aes(x=word, y=sent, col=variety)) +
  scale_x_log10(limits=c(100, 10000)) + scale_y_log10(limits=c(10, 1700)) +
  geom_point(size=.4) + labs(x="number of words", y="number of sentences") + 
  guides(colour = guide_legend(override.aes = list(size=2))) +
  labs(title="Old (including extra-corpus material)")
p <- ggMarginal(p, groupColour=TRUE, groupFill=TRUE)
grid.draw(p)

We also check whether text lengths differ between text categories (without extra-corpus material).

grid.newpage()
tmp <- cbind(Features, Meta[, .(variety, short12)])
p <- ggplot(tmp, aes(x=word, y=sent, col=short12)) +
  scale_x_log10() + scale_y_log10() + scale_colour_manual(values=rainbow.12) +
  geom_point(size=.4) + labs(x="number of words", y="number of sentences") + 
  guides(colour = guide_legend(override.aes = list(size=2))) +
  labs(title="New (w/o extra-corpus material)")
p <- ggMarginal(p, groupColour=TRUE, groupFill=TRUE)
grid.draw(p)

2 Feature distributions

We transform the feature matrix from a data.table into an actual numeric matrix M, excluding the first four columns (the text ID and the sentence, token and word counts).

M <- as.matrix(Features[, -c(1:4)])
rownames(M) <- Features$file
OldM <- as.matrix(OldFeatures[, -c(1:4)])
rownames(OldM) <- OldFeatures$file

Different features have entirely different ranges and distributions:

par(mar=c(8,2,2,0.1))
boxplot(M, ylim=c(0,3), las=3, main="raw values")

We therefore standardize all features to z-scores, as in previous work. The distributions are still highly skewed, with some extreme outliers. As an alternative to removing very sparse features, we apply a signed logarithmic transformation to deskew the feature distributions.

Z <- scale(M)
par(mar=c(8,2,2,0.1))
boxplot(Z, las=3, main="z-scores")

ZL <- signed.log(Z)
boxplot(ZL, las=3, main="log-transformed z-scores")
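The signed.log function is defined in multivar_utils.R. A typical implementation of such a signed logarithmic transformation looks like the sketch below (not necessarily the exact definition used here):

# preserves the sign, is approximately linear near 0 (log1p(|x|) = |x| for
# small x) and compresses large magnitudes, taming the extreme outliers
signed.log.sketch <- function (x) sign(x) * log1p(abs(x))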

The old features need to be scaled with the same parameters (rather than being standardised independently), so that we can map them into the same space. (In practice, the parameters rarely differ by more than 1%, so this would not have made a big difference.)

OldZ.1 <- scale(OldM)
OldZ <- scale(OldM, center=attr(Z, "scaled:center"), scale=attr(Z, "scaled:scale"))
OldZL <- signed.log(OldZ)
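The independently standardised version OldZ.1 is kept only for this comparison; one way to quantify the discrepancy between the two sets of scaling parameters (a sketch) is:

# ratios close to 1 confirm that the parameters differ by only a few percent
range(attr(OldZ.1, "scaled:center") / attr(Z, "scaled:center"))
range(attr(OldZ.1, "scaled:scale") / attr(Z, "scaled:scale"))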

We check for collinearities and excessive correlation patterns between the features.

fnames <- colnames(Z)
cor.colors <- diverge_hsv(101, power=1)
par(cex=.5)
hmap(cor(Z), zlim=c(-1, 1), col=cor.colors, margins=c(7, 7), 
     keysize=1, cexRow=.8, cexCol=.8,
     main="correlation of z-scores for all texts")

hmap(cor(ZL), zlim=c(-1, 1), col=cor.colors, margins=c(7, 7), 
     keysize=1, cexRow=.8, cexCol=.8,
     main="correlation of log-transformed z-scores")

The correlations look reasonable. An overall block structure is visible – which presumably corresponds to the oral-written dimension – but the correlations within the blocks are fairly weak and the features are less directly linked to noun and verb frequency than in Biber’s analysis.
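To substantiate this impression, we can list the most strongly correlated feature pairs (a sketch; the threshold of |r| > 0.7 is an arbitrary choice):

C <- cor(ZL)
idx <- which(abs(C) > 0.7 & upper.tri(C), arr.ind=TRUE)
data.frame(f1 = rownames(C)[idx[, 1]], f2 = colnames(C)[idx[, 2]],
           r = round(C[idx], 3))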

3 Serialize pre-processed data set

We also add word and sentence counts to the metadata table, so they can be used for filtering.

Meta[, word := Features$word]
Meta[, sent := Features$sent]
Meta[, word.old := OldFeatures$word]
Meta[, sent.old := OldFeatures$sent]

And we create nicely readable labels for the features.

feature.names <- 
  gsub("_+", " ",
       sub("_(\\pL)$", "/\\1", 
           colnames(Z), perl=TRUE))
cat(paste(feature.names, collapse=", "), "\n")
## word/S, lexical density, nn/W, np/W, nominal/W, neoclass/W, poss pronoun/W, pronoun all/W, p1 perspron/P, p2 perspron/P, p3 perspron/P, it/P, pospers1/W, pospers2/W, pospers3/W, atadj/W, predadj/W, prep/W, finite/S, past tense/F, will/F, modal verb/V, verb/W, infinitive/F, passive/F, coordination/F, subordination/F, interrogative/S, imperative/S, title/W, place adv/W, time adv/W, nom initial/S, prep initial/S, adv initial/S, text initial/S, wh initial/S, disc initial/S, nonfin initial/S, subord initial/S, verb initial/S

Finally, save all objects into a single .rda file. We also provide a fixed (i.e. reproducible) random ordering index for plots.

set.seed(42)
rand.idx <- sample(nrow(Meta))
save(Meta, rand.idx,
     Features, M, Z, ZL,
     OldFeatures, OldM, OldZ, OldZL,
     types.variety, types.mode,
     types.textcat32, types.short32, types.code32,
     types.textcat20, types.short20, types.code20,
     types.textcat12, types.short12, types.code12,
     rainbow.32, rainbow.20, rainbow.12, feature.names,
     file="ice_preprocessed.rda")