Dissemination Dynamics of Receding Words: A Diachronic Case Study of Whom

Bohmann, Axel; Bohmann, Martin; Hinrichs, Lars

doi:10.3389/frai.2021.654154

ORIGINAL RESEARCH article

Front. Artif. Intell., 29 June 2021

Sec. Language and Computation

Volume 4 - 2021 | https://doi.org/10.3389/frai.2021.654154

This article is part of the Research Topic Computational Sociolinguistics

Axel Bohmann¹*

Martin Bohmann²,³

Lars Hinrichs⁴

¹Englisches Seminar, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
²Institute for Quantum Optics and Quantum Information - Vienna (IQOQI), Austrian Academy of Sciences, Vienna, Austria
³Vienna Center for Quantum Science and Technology (VCQ), Vienna, Austria
⁴Department of English, The University of Texas at Austin, Austin, TX, United States

We explore the relationship between word dissemination and frequency change for a rapidly receding feature, the relativizer whom. The success of newly emerging words has been shown to correlate with high dissemination scores. However, the reverse, a correlation of lower dissemination scores with receding features, has not been investigated. Based on two established and two newly developed measures of word dissemination (across texts, linguistic environments, registers, and topics), we show that a general correlation between dissemination and frequency does not obtain in the case of whom. Different dissemination measures diverge from each other and show internally variable developments. These can, however, be explained with reference to the specific sociolinguistic history of whom over the past 200 years. Our findings suggest that the relationship between dissemination and word success is not static, but needs to be contextualized against different stages in individual words’ life-cycles. Our study demonstrates the applicability of large-scale, quantitative measures to qualitatively informed sociolinguistic research.

Introduction

The Sociolinguistics of Emergence and Attrition

Sociolinguistic research is predominantly concerned with the emergence and spread of linguistic innovations, but has paid less attention to the dynamics of receding features. The canonical S-curve pattern of linguistic change (Labov, 1994) proceeds along three idealized stages—barely perceptible incipient change, rapid frequency increase through incrementation, and establishment of the feature within the community—to a theoretical steady state. Feature dynamics beyond this point are less well-understood. Yet, sociolinguists stand to gain insight from attention to receding features. These are of interest in their own right as part of a community’s repertoire, but also because systematic comparison of the dynamics involved in feature emergence and attrition can lead to a more comprehensive understanding of linguistic change in general.

The dynamics of lexical emergence have recently been addressed through large-scale computational-statistical methods. Grieve et al. (2017) develop a procedure to identify emerging words in a corpus of 8.9 billion Twitter messages, based on initially low frequency and a high increase in frequency over a given time period. In a follow-up study, Grieve (2018) predicts the further success of 54 emerging words identified in Grieve et al. (2017) as a function of word length, part-of-speech, underlying word-formation process, and novelty of the word’s referent. The latter predictor is shown to be particularly relevant in determining the frequency development of innovative words, whereas part-of-speech does not appear to play a significant role.

A further important predictor of a word’s success is its social dissemination, defined by Altmann et al. (2011) as the ratio between the number of social units (e.g. speakers or texts) in a sample that use the word and the expected number of social units using the word. This expected number is calculated under the assumption of random spread of the word across social units, given its relative frequency and each social unit’s total word count. Altmann et al. (2011) and Altmann et al. (2013) find higher dissemination scores to be a strong predictor of a word’s continued increase in frequency.

The notion of social dissemination has been taken up in Garley and Hockenmaier (2012) as well as in Stewart and Eisenstein (2018). In both of these studies, its predictive power is less evident, which may in part be attributed to the inclusion of proper nouns in Altmann et al. (2011). Usage of these may be more directly linked to social dynamics than usage of general innovations (Stewart and Eisenstein, 2018: 4368). Stewart and Eisenstein extend the concept of dissemination from the social to the linguistic context of words. They calculate linguistic dissemination based on a comparison between expected and observed unique trigram frequencies in which a given word occurs and show, on the basis of several statistical models, that this metric effectively predicts future frequency developments.

These large-scale, quantitative findings are conceptually related to recent work in a more qualitative perspective. Squires (2014) traces how one specific phrase coined by a TV personality is taken up on Twitter. Fans of the show from which the phrase originates first use it in direct reference to the initial situation of utterance; the phrase then gradually spreads to wider discursive contexts and becomes increasingly detached from its origin. Squires (2014) refers to this process as “indexical bleaching.” Given that indexicality describes the connection of a sign to the specific contexts it is embedded in, the notion of indexical bleaching may be related to Altmann et al.’s (2011) concept of social dissemination, including its extension in Stewart and Eisenstein (2018): the further a linguistic unit is indexically bleached, the more evenly disseminated it can be expected to be. Importantly, Squires’ focus on an individual form allows her to trace the indexical dynamics involved in its spread in more detail. As such, her analysis is able to go beyond a static relationship between indexical focus and a word’s successful spread. She concludes that “indexical strength catalyzes uptake, but indexical loss facilitates diffusion” (Squires, 2014: 58).

This observation implies that the role of dissemination (which we take to be inversely related to indexical strength) in predicting a form’s future frequency development may assume different shapes at different stages of that form’s life-cycle. Most of the studies cited above have restricted their focus to the rapid emergence of innovative words, and to predictions about their relatively short-term success. Altmann et al. (2013) also consider the development of established words over longer time periods, yet their focus remains on frequency increase. The extent to which the dynamics of receding forms, i.e. those that are firmly established in the language but decrease in frequency, mirror those of emerging ones is currently not well understood.

We focus on one particular such form, the relativizer whom, in order to shed light on the question of how frequency decline interacts with dissemination during an extended phase of attrition. In addition to implementing Altmann et al.’s (2011) original measure and Stewart and Eisenstein’s (2018) extension of it, we also address dissemination across registers and topics. This is done on the basis of a multi-dimensional analysis (Biber, 1988) and a topic model for the corpus under consideration. In contrast to Altmann et al.’s (2011) approach, focusing on text-level properties like register and topic enables us to treat the range of texts in our corpus not simply as distinct units, but to systematically relate them to one another in terms of their linguistic characteristics and discourse content. Tracing the association between a form and specific register contexts and topics is arguably a more immediate window into indexical focusing than simply quantifying its presence or absence in a number of texts which are conceived as otherwise undifferentiated units. Compared to Stewart and Eisenstein’s (2018) measure, our newly developed dissemination indices relate to characteristics of the textual environment on the whole, instead of the immediate collocation behavior of a word.

A Rapidly Receding Word: Whom

Standard English allows for nine different devices to introduce relative clauses (RCs): that, which, who, whose, whom, when, where, why, and zero (that is, the absence of an overt element introducing a relative clause). Competition among these forms is in part governed by categorical rules, e.g. the fact that that is only permissible for introducing restrictive RCs, and in part by probabilistic constraints. The latter have been the focus of many recent studies and are relatively well-documented for the three most prolific members of the set, which, that, and zero (e.g. Guy and Bayley, 1995; Levey, 2006; Hinrichs et al., 2015). In addition to language-internal constraints like antecedent noun phrase length, RC length and whether the relativizer assumes the subject or object role in the RC, Hinrichs et al. (2015) show that relativizer choice is susceptible to the influence of prescriptivist norms. Together with broader stylistic drifts, such as the colloquialization of written English (Leech et al., 2009), these factors account for a marked frequency decrease of which during the second half of the 20th century.

Although characterized by a similarly drastic decline in frequency over the past 200 years, whom has received comparatively little attention. This form is commonly regarded as a case-marked variant of who expressing objective case, analogous to the correspondence between she and her, he and him, etc. (although see Lasnik and Sobin, 2000, for a competing account). Accordingly, traditional prescriptive grammar would require whom instead of who in RCs with human antecedents in which the relativizer occurs in the object position (Aarts, 1994: 73), as in (1).

(1) going for the jugular of anyone whom he considers an enemy <COHA_fic_1988_782035>

As early as 1921, Sapir (1921: 167) predicted that “within a couple of hundred years from to-day not even the most learned jurist will be saying ‘Whom did you see?’” Sapir identified several factors that conspire to render whom a moribund form: the general erosion of the English case-inflectional paradigm, the isolation of whom from the case-invariant remaining relativizers on the one hand and the system of personal pronouns on the other, as well as a purported “clumsiness” (Sapir, 1921: 171) in its phonetic shape. He further anticipated a general retreat of who and its variants from the class of relativizers in favor of highlighting their role as interrogative pronouns.

Many of these predictions have been borne out over the past 100 years. In the Corpus of Historical American English (COHA; Davies, 2010), the relative frequency of whom is consistently about an order of magnitude smaller than that of who throughout the 20th century. In terms of relative frequency, COHA shows a steady decrease of whom between 1810 and 2009, as can be seen in Figure 1. Using the Spearman correlation coefficient between relative frequency and year as an operationalization of the rise or fall of a word, whom is the fifth most rapidly receding item in the entire corpus, after shall, nor, vain, and whence. Along with this general decline, linguists have noted an increasing stylistic restriction to formal contexts, with prescriptivist discourse as an important catalyst (Aarts, 1994).
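The rise-or-fall operationalization described above can be illustrated with a minimal sketch. The yearly frequencies are invented for illustration and are not COHA values, and the helper names `rank` and `spearman` are hypothetical; the sketch only shows how a monotonic decline maps onto a strongly negative coefficient.

```python
def rank(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical relative frequencies (per million words), NOT taken from COHA.
years = [1830, 1870, 1910, 1950, 1990]
freq_per_million = [520.0, 410.0, 260.0, 150.0, 120.0]
rho = spearman(years, freq_per_million)
```

Because the coefficient depends only on ranks, a word that declines in every sampled period receives the minimum value regardless of how steep the decline is, which is why the measure ranks words by the consistency rather than the size of their change.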

FIGURE 1. Frequency development of whom over 180 years of written American English.

Figure 1, however, also shows that the rate of decline has slowed considerably in the second half of the 20th century. Empirical research on the recent past of written English has come to varied conclusions as to the fate of whom. Bauer (1994: 76) contends that avoidance of the word “has probably been noticeable throughout the [20th] century” and that ongoing change is relatively negligible, an observation also shared by Mair (2006: 141–143). Aarts and Aarts (2002: 128), on the other hand, find “staggering” rates of decline, both in written and spoken corpora, between the 1960s and the 1990s. Figure 1, based on larger and more systematic corpus data than the studies previously cited, would seem to confirm Mair’s (2006: 143) verdict that “whom now seems to have reached the tail end of the characteristic S-shaped curve of progression in linguistic change.”

Despite disagreement about the most recent frequency developments, there is overwhelming consensus in the literature regarding the stylistic aspects of whom, namely: a strong association with very formal, almost exclusively written kinds of discourse. The fact that whom has not completely disappeared from the language is often attributed to its institutional backing in the educational system (Mair, 2006: 134). A discrepancy between actual usage and prescriptive norms means that most people “will recognize it as correct in a wider range of contexts […], but probably not use it” (Bauer, 1994: 77).

The strong stylistic connotations of whom are evident in meta-linguistic discourse as well. In present-day internet culture, a class of memes is circulating which capitalizes on these indexicalities. The structural template for these memes pairs a sequence of images with a sequence of words. The images are repetitions of the same motif, a stylized X-ray of a human head, showing a rise in brain size with every iteration. The words form the sequence who, whom, whomst, whomst’d.¹ The rhetorical effect is an equation of linguistic forms with levels of intellectual superiority. The fact that both whomst and whomst’d are nonce words created for the context of this meme indicates the level of metalinguistic play inherent in it. These two words are constructed by attaching graphemic material to the base word that does not add any semantic content. In the case of whomst, it is likely that the -st sequence is used in analogy to archaic second-person singular verb inflections that were still common in Early Modern English. The grammatical information these suffixes used to bear is nowadays encoded on the subject only. The position of whom in this sequence construes this form as similarly burdened by unnecessary graphemic material but indicative of intellectual attainment. The meme consequently suggests a change in status for whom in that it has largely lost its grammatical function of case-distinction but gained indexical strength linking it to educated and hyper-formal contexts.

The properties described above make whom a suitable candidate for a contextualized analysis of various dissemination measures. Its frequency development over the past 200 years follows a clear trajectory which mirrors that of the S-curve often observed in the spread of linguistic innovations. The factors contributing to its decline, while not yet analyzed in a quantitative perspective, are well-attested. In addition, metalinguistic discourse surrounding the correct usage of whom in the form of prescriptive and descriptive linguists’ comments is documented for at least as far back as the 18th century (Aarts, 1994). These facts enable us to formulate specific hypotheses regarding the dissemination of whom at different time periods and to contextualize observable dissemination developments against prior knowledge about the feature.

Research Objectives

We investigate the dynamics of dissemination that whom has undergone over the course of 180 years, between 1830 and 2009. Based on four quantitative measures, two established and two newly developed ones, we trace change in the dissemination of whom in this time period, which is characterized by continuous, but abating frequency decline. As can be seen in Figure 1, this decline is particularly rapid in the second half of the 19th and the first half of the 20th century, with the slope flattening again after around 1950.

On the basis of the literature on success during emergence, summarized in the section The Sociolinguistics of Emergence and Attrition, it would be valid to expect decrease in frequency to correspond with decrease in dissemination. This is the general statistical relationship that obtains in all the quantitative studies cited above, and is also a plausible hypothesis on purely theoretical terms. In the power-law distribution of any language’s vocabulary, the most common items are likely shared by all speakers and across contexts, whereas low-frequency items in the long tail of the distribution can be expected to show stronger contextual sensitivity (Kretzschmar, 2015), i.e. lower dissemination. As a word’s general frequency declines, one may consequently expect it to specialize into narrower niches of usage. In analogy to Squires’ (2014) term, we call this process “indexical focusing.” The tendency of receding forms to cluster in formulaic expressions serves as a case in point. In its extreme version, this process leaves receding words entirely unproductive and semantically intransparent outside of the larger constructions they are embedded in. Examples of such items include shrift and kith in the expressions to make short shrift and kith and kin. The baseline hypothesis for the analysis below, then, is that the frequency decline of whom will coincide with a decline in dissemination. However, Squires (2014) reminds us that this relationship may not be static.

Our focus on an individual word comes at the expense of generalizability. There is no guarantee that the dynamics we observe for whom are shared by all, or even the majority of, receding forms in the language. While recognizing this limitation, we suggest that this narrow focus also brings important advantages. In order to make statistical generalizations like those described in Altmann et al. (2011), Altmann et al. (2013), or Stewart and Eisenstein (2018) more immediately relevant to sociolinguistic research, they need to be understood in relation to individual features of interest. Unlike phenomena in statistical physics and other core sciences, words in a language are not merely units with certain statistical properties, but are embedded in individual histories of social meaning and metalinguistic reflection. The sociolinguistic record contains a large number of features about which a good deal is known in this respect. It is consequently possible to formulate specific expectations as to the relationship between frequency developments and dissemination measures for such features that go beyond general regularities. A consideration of individual words’ social role in conjunction with observable statistical properties promises to enrich our understanding of both these perspectives.

Our aim is to make the notion of dissemination tangible from a situated sociolinguistic perspective and to evaluate the utility of each dissemination measure for future application in contextualized sociolinguistic research. Specifically, we ask how well the four measures correlate with change in frequency, as well as how strongly correlated they are with each other. If no correlation between frequency and a given dissemination measure can be found, the utility of that measure is open to question. If the dissemination measures show no or only weak correlation amongst each other, this fact requires further attention. Our assumption is that, despite being operationalized at different levels, dissemination is a general property which we expect to take a similar shape independent of its precise quantification.

Materials and Methods

Corpus

Our analysis is based on the Corpus of Historical American English (COHA; Davies, 2010), which includes samples of written American English for each year between 1810 and 2009. The corpus is sub-divided into four genres: news, magazine, fiction, and non-fiction writing. Each word in the corpus is annotated with lemma and part-of-speech information.

Due to the difficulty of sampling historical language data, several aspects of the COHA sampling frame are not consistent throughout the 200 years it covers. For instance, the sparsity of texts for some genres from the more distant past has resulted in the inclusion of fewer, but longer individual texts for much of the 19th century. Further, newspaper texts are only sampled from 1860 onwards and different archives were used for the extraction of text samples for different time periods.² The effect of archival sources is visible especially for magazine writing, for which our register analysis (see below) shows a marked difference between texts before and after 1900.

Consideration of the above factors led us to exclude the first two decades of COHA (1810–1829) from the analysis. With a median number of 14.5 texts per year, these do not offer sufficient data for our analyses, most of which treat individual texts as the relevant units. We further note that the irregularities mentioned above are not fully resolved before the sampling point 1925. From this time on, both the archives used for text sampling and the mean number and word count of texts per year are consistent. While our analysis covers the years from 1830 up to 2009, then, the results are expected to be most robust for the latter half of this time period.

We work with the full-text, offline version of COHA, which includes lemma and part-of-speech information for each word. For each year between 1830 and 2009, we calculate the four dissemination measures for whom described in the following sections.

Social Dissemination

Following Altmann et al. (2011), we measure social dissemination (D^S) as the ratio between the observed and the expected number of social units a word occurs in at a given time. For our purposes, the social units of relevance are the individual corpus texts. In other words, we divide the number of documents whom occurs in by the number of documents it is expected to occur in. To calculate the latter number, a probability of observing whom in each text is calculated based on the text’s word count and the relative frequency of whom in the corpus at the time point under consideration. These probabilities are then summed to approximate the expected document count. The assumption for this baseline model is that all words occur randomly in the texts, with a probability corresponding to their relative frequency. The probability of finding the word whom at least once in the $i$-th text, which has a length of $m_{i}$ words, is then given by $T_{i} = 1 - e^{- f m_{i}}$, where $f$ is the relative frequency of whom in the considered year. Based on this, we can calculate the expected number of texts containing whom via $\tilde{T} = \sum_{i = 1}^{N_{T}} T_{i}$, where $N_{T}$ is the number of texts in the considered year. With this expectation of the baseline model, we can calculate the dissemination coefficient

$$D^{S} = \frac{T}{\tilde{T}}$$

which is the ratio between the number of texts in which whom is used, $T$, and the expected number of texts following the baseline model, $\tilde{T}$. A value of D^S = 1 corresponds to dissemination of a word across texts as if its occurrence were entirely random. Values below 1 indicate “clumping” (Altmann et al., 2013: 3), i.e. the use of the word in a smaller set of texts than expected. The closer to 0 D^S is, the less regularly disseminated the corresponding word is. Under-dissemination is interpreted by Altmann et al. (2013) as a sign of low word vitality.
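The calculation of D^S for a single year can be sketched as follows. This is a minimal illustration of the formulas above, not the code used for the analysis; the function name `social_dissemination` and the toy text lengths are assumptions made for the example.

```python
import math


def social_dissemination(text_lengths, texts_with_word, rel_freq):
    """D^S = T / T~ for one year.

    text_lengths    -- word count m_i of every text sampled for that year
    texts_with_word -- observed number T of texts containing the word
    rel_freq        -- relative frequency f of the word in that year
    """
    # T_i = 1 - exp(-f * m_i): chance that text i contains the word at least once
    expected = sum(1.0 - math.exp(-rel_freq * m) for m in text_lengths)
    return texts_with_word / expected


# Invented toy year (not COHA data): four texts, the word occurs in two of them.
lengths = [2000, 5000, 12000, 40000]
f = 1e-4  # one occurrence per 10,000 words
d_s = social_dissemination(lengths, texts_with_word=2, rel_freq=f)
```

In this toy year the baseline expects slightly more than two texts to contain the word, so the resulting D^S falls somewhat below 1, which would count as mild clumping. Note that long texts contribute an expected value close to 1 each, so the measure is sensitive to the mix of text lengths in a given year.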

Linguistic Dissemination

Stewart and Eisenstein (2018) define linguistic dissemination (D^L) as the difference between the log count of unique trigrams a word occurs in $(C^{3})$ and the word’s expected log unique trigram count $({\tilde{C}}^{3})$ . Since the logarithms of frequency and unique trigram count are highly correlated (Egghe, 2007; Stewart and Eisenstein, 2018: 4364), it is possible to calculate the expected log trigram count based on a word’s frequency. In Stewart and Eisenstein (2018), this is done by fitting a linear model for all words at a given time point, with the words’ log frequencies as the predictor and their log trigram counts as the outcome variable. Linguistic dissemination is then defined as the residual error between the model prediction and the observed log trigram count $(D^{L} = C^{3} - {\tilde{C}}^{3})$ . Positive values indicate higher-than-expected numbers of trigrams, i.e. particular linguistic versatility, whereas negative values indicate a restriction of the linguistic contexts a word occurs in. Negative D^L is a predictor of frequency decline.

We treat individual sentences as the relevant context for trigram detection and do not consider trigrams across sentence boundaries. Each document in the raw, unannotated version of COHA is split at sentence-final punctuation marks (periods, question and exclamation marks, semicolons, and colons). For copyright reasons, the offline version of COHA replaces sequences of ten words at set intervals with ten @ symbols. We treat these like sentence-final punctuation and do not allow trigrams to extend across them. If a word occurs in a place in the sentence that does not permit a right or a left trigram neighbor, i.e. in the first, second, last, or second-to-last position, we still register three unique trigrams. In these cases, we insert “<START>” or “<END>” instead of actual words into the trigram in order to replicate the method in Stewart and Eisenstein (2018).
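The trigram extraction described above might be sketched as follows. Several details are assumptions rather than documented choices: tokens are split on whitespace only, the ten-@ placeholder is assumed to appear as ten @ symbols separated by optional whitespace, and two pad tokens per sentence edge are used so that every occurrence yields three trigrams. The function name `unique_trigrams` is hypothetical.

```python
import re

# Assumption: sentences end at . ? ! ; : and at COHA's run of ten "@" placeholders.
BOUNDARY = re.compile(r"[.?!;:]+|(?:@\s*){10}")


def unique_trigrams(text, target):
    """Collect the distinct trigrams containing `target`, never crossing a boundary."""
    found = set()
    for sentence in BOUNDARY.split(text.lower()):
        tokens = sentence.split()
        # Assumption: two pad tokens per side, so every hit yields three trigrams.
        padded = ["<START>"] * 2 + tokens + ["<END>"] * 2
        for i, token in enumerate(padded):
            if token != target:
                continue
            # The target can be the last, middle, or first element of a trigram.
            for start in (i - 2, i - 1, i):
                found.add(tuple(padded[start:start + 3]))
    return found


sample = "To whom it may concern: the man whom she saw left. Whom did you see?"
trigrams = unique_trigrams(sample, "whom")
```

Collecting the trigrams in a set means that repeated contexts are counted once, which is what the unique trigram count $C^{3}$ requires before the logarithm is taken.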

Counting all trigrams for each word at a given time period proved computationally intractable. We therefore restrict ourselves to a random selection from a list of 17,912 words that occur at least 1,000 times in the corpus on the whole. For each time period, 10,000 items from this list of words are drawn and their unique trigram counts and frequencies of occurrence are measured. Given the regular relationship between log frequency and log unique trigram count, this amount of data is sufficient to reliably estimate the coefficient of the linear model and hence D^L.
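Given frequencies and unique trigram counts for the sampled words, D^L is the residual of a simple linear regression in log-log space. The sketch below illustrates this with five invented words; in the analysis the model is fitted to the 10,000 sampled items per time period. The function name `linguistic_dissemination` and all counts are assumptions made for the example.

```python
import math


def linguistic_dissemination(counts, target):
    """Residual of `target` from an OLS fit of log trigram count on log frequency.

    counts -- dict mapping word -> (frequency, unique trigram count) for one period
    """
    xs = [math.log(freq) for freq, _ in counts.values()]
    ys = [math.log(tri) for _, tri in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    freq, tri = counts[target]
    expected_log_tri = intercept + slope * math.log(freq)
    return math.log(tri) - expected_log_tri  # D^L = C^3 - C~^3


# Invented counts for illustration only; real input would be the sampled word list.
toy = {
    "house": (1000, 2400),
    "river": (500, 1300),
    "green": (2000, 4500),
    "whom": (800, 1500),
    "often": (1500, 3600),
}
d_l = linguistic_dissemination(toy, "whom")
```

In this toy set, whom occurs in fewer distinct trigrams than its frequency predicts, so its residual is negative. Because the residuals of a least-squares fit with an intercept sum to zero, D^L is always relative to the other words in the same time period.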

Register Dissemination

In addition to social and linguistic dissemination, we also propose a measure of register dissemination (D^R). Our notion of register is closely in line with that developed by Biber (e.g. Biber, 1988; Biber and Conrad, 2009; Biber, 2012), both in how we conceptualize and how we quantify it. The term is defined as “a variety associated with a particular situation of use” (Biber and Conrad, 2009: 6). While the relevant situational parameters may relate to medium and context of communication, communicative goals and norms, and a number of other extra-linguistic factors (see Biber and Conrad, 2009: chap. 2), they have a direct and measurable bearing on the linguistic properties of a stretch of discourse.

To measure the interrelationship between situational properties and linguistic characteristics, the exploratory method of multi-dimensional analysis (MDA; Biber, 1988) has been established in the corpus-linguistic community. This method proceeds by compiling a corpus of relevance for the analysis, i.e. one that represents the situational parameters of interest, as well as a number of linguistic features hypothesized to play an important role in register differentiation. Such features are usually relatively common, high- to mid-frequency ones, such as the frequency of passive-voice constructions, personal pronouns, or non-standard words in a text. For each corpus text, the frequency profile of each feature is measured. The resulting text-feature matrix is subjected to exploratory factor analysis (Thompson, 2004) in order to discover a small number of latent “dimensions of variation” (Biber, 1988) that capture a large amount of the total variance of the extracted features. Each dimension is characterized by the linguistic features it is most strongly associated with, and each corpus text can be scored on a continuum for each dimension. Qualitative consideration of the most strongly associated features and the highest- or lowest-scoring kinds of texts for a dimension drives the interpretation and labeling of each dimension.

We perform such an analysis for the entirety of the COHA data. We use 65 of the features proposed in Biber (1988) and 24 additional ones largely adapted from Bohmann (2019). In addition, we also include the relative frequency of each of the 100 most common part-of-speech trigrams in COHA. The resulting 116,614 × 179 text-feature matrix is subjected to factor analysis with the psych package (Revelle, 2020) in R (R Core Team, 2020). Following an inspection of the variances accounted for by the first 100 components of a principal component analysis over the features, we decided to extract five factors from the data. We use a principal axes factor solution rotated to the promax criterion, which allows for moderate inter-factor correlations. The factor scores for each text are calculated using the regression method (see Thompson, 2004; Revelle, 2020 for details).

Space does not permit a full discussion of the dimensions and the qualitative process that produced interpretations and labels for each. Here, we restrict ourselves to an overview in tabular form. Table 1 shows the five dimensions (i.e. factors developed in the factor analysis) with the labels we have chosen for them. The most strongly associated features, genres, and the dimensions’ development over time give an indication of what aspects of linguistic variation each captures.

TABLE 1. The five dimensions of variation in COHA.

Both the social and linguistic dissemination measures are based on discrete counts, which are not available for register as we operationalize it. A different method for quantifying register dissemination is therefore required than those used for social and linguistic dissemination above. Two options suggest themselves. First, similarly to Altmann et al. (2011) we can treat the presence or absence of whom in a text as a binary variable. For each step in the time period under analysis, we can then divide our corpus in two groups of texts, those including and not including whom. Both of these groups can be characterized as multivariate Gaussian distributions in the five-dimensional register-score space. Register dissemination can then be treated as the distinctiveness of the whom-texts from those without whom in register space. If there is significant overlap between both groups, this can be taken to indicate relatively wide dissemination, whereas if the groups are found to be largely distinct, this is a sign of register-specificity. The amount of overlap between two multivariate Gaussian distributions can be expressed as the Bhattacharyya distance (Bhattacharyya, 1943) between them.
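For two multivariate Gaussians, the Bhattacharyya distance has a closed form that combines a mean-separation term and a covariance-mismatch term. The sketch below is a minimal illustration with simulated five-dimensional scores, not COHA register scores; the function name `bhattacharyya_distance` and the simulated groups are assumptions made for the example.

```python
import numpy as np


def bhattacharyya_distance(a, b):
    """Bhattacharyya distance between Gaussians fitted to the rows of a and b."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.cov(a, rowvar=False)
    cov_b = np.cov(b, rowvar=False)
    cov = (cov_a + cov_b) / 2.0
    diff = mu_a - mu_b
    # Mean term: (1/8) * diff^T * cov^{-1} * diff
    mean_term = diff @ np.linalg.solve(cov, diff) / 8.0
    # slogdet returns (sign, log|det|); log-determinants avoid under/overflow
    logdet = np.linalg.slogdet(cov)[1]
    logdet_a = np.linalg.slogdet(cov_a)[1]
    logdet_b = np.linalg.slogdet(cov_b)[1]
    cov_term = 0.5 * (logdet - 0.5 * (logdet_a + logdet_b))
    return mean_term + cov_term


# Simulated register scores (five dimensions); not derived from COHA.
rng = np.random.default_rng(0)
without_whom = rng.normal(0.0, 1.0, size=(500, 5))
with_whom_near = rng.normal(0.1, 1.0, size=(500, 5))
with_whom_far = rng.normal(1.5, 1.0, size=(500, 5))

d_near = bhattacharyya_distance(with_whom_near, without_whom)
d_far = bhattacharyya_distance(with_whom_far, without_whom)
```

A distance near zero corresponds to heavily overlapping groups, i.e. wide register dissemination, whereas larger values indicate that the texts containing the word occupy a distinct region of register space.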

This method is susceptible to differences in text length, since longer texts have a higher baseline probability of including a given feature and hence ending up in the whom-group. One solution would be to sub-divide larger texts into smaller segments to achieve uniform text length, and to treat each segment as a sample in its own right. While this would be a feasible solution in principle, a more plausible one is to treat relative frequency of whom in a text as a scalar variable. Doing so accounts for the effect of text length in a principled way without requiring further manipulation of the data. Instead of creating distinct groups, this method situates texts on a whom-frequency continuum.

In order to quantify the association between whom and specific register properties, we fit a linear model at each year, with relative frequency of whom as the outcome and each text’s scores for the five dimensions as the predictor variables. The adjusted R² values of these models are taken as indices of register specificity. A dissemination coefficient with similar properties to that proposed by Altmann et al. (2011) can then be obtained by subtracting this adjusted R² from 1. The more predictive power the joint dimension scores yield regarding relative frequency of whom, the higher the model’s R² value and the lower the corresponding D^R. As with Altmann et al.’s (2011) index, a value of 1 indicates completely even dissemination in register space, whereas values below 1 suggest register clumping and consequently decreased vitality of the form.
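The per-year computation of D^R can be sketched with an ordinary least-squares fit. The data below are simulated and do not come from COHA, and the function name `register_dissemination` is an assumption; the sketch only illustrates the relationship between adjusted R² and the resulting coefficient.

```python
import numpy as np


def register_dissemination(dim_scores, rel_freq):
    """D^R = 1 - adjusted R^2 of an OLS model predicting rel_freq from dim_scores."""
    n, p = dim_scores.shape
    design = np.column_stack([np.ones(n), dim_scores])  # add intercept column
    coefs, *_ = np.linalg.lstsq(design, rel_freq, rcond=None)
    residuals = rel_freq - design @ coefs
    ss_res = float(residuals @ residuals)
    ss_tot = float(((rel_freq - rel_freq.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return 1.0 - adj_r2


# Simulated year: 400 texts scored on five dimensions (not COHA data).
rng = np.random.default_rng(1)
scores = rng.normal(size=(400, 5))
noise = rng.normal(size=400)
freq_register_bound = 2.0 * scores[:, 0] + 0.5 * noise  # tied to dimension 1
freq_register_free = noise                              # unrelated to register

d_r_bound = register_dissemination(scores, freq_register_bound)
d_r_free = register_dissemination(scores, freq_register_free)
```

The register-bound variable yields a D^R close to 0 and the register-free variable a D^R close to 1. Since adjusted R² can be slightly negative when the predictors explain nothing, D^R computed this way can marginally exceed 1.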

In addition to the general D^R, the values of each dimension’s model coefficients can also be traced over time, giving a sense of which register dimensions are most predictive of whom-frequency and which are most subject to change over time.

Topic Dissemination

Apart from social and register properties, discourse topic may be an important predictor of linguistic variation. We create a topic model for COHA, which we restrict to 100,000 randomly selected texts for computational reasons. Specifically, we use latent Dirichlet allocation (LDA), which represents a predefined number of topics as probability distributions over the words in the corpus and treats every corpus text as a probability distribution over all topics (Blei et al., 2003).

Before generation of the topic model, the corpus data were preprocessed in the following manner: all words were lemmatized based on the information already included in COHA, and only words from the part-of-speech categories noun, verb, adjective, and adverb were retained. Sequences of proper nouns such as “United States” were treated as single words, once again drawing on the information provided in COHA. Finally, the top 1,000 bigrams and trigrams with a minimum absolute frequency of 100 were also treated as single units. Extraction and ranking of bi- and trigrams was done with NLTK’s collocations module (Bird et al., 2009), which uses pointwise mutual information as its association metric.

The LDA models themselves were constructed with the gensim module (Řehůřek and Sojka, 2010) in Python (Python Software Foundation, 2020), with the parameters chunksize set to 2,000, passes to 5, and iterations to 200. Such models were built for numbers of topics between 9 and 200. For each of these, model coherence was calculated with the C_v measure proposed in Röder et al. (2015). The candidate with the highest coherence is a 25-topic model. As with the register dimensions, it is not our aim here to discuss individual topics. Therefore, Table 2 simply shows the top five words in each topic to give a sense of the range and plausibility of the model on the whole.

TABLE 2. The 25 LDA topics developed for COHA.

Our procedure of quantifying topic dissemination is largely the same as that for quantifying register dissemination, with one addition. The factor analytic procedure that produces register scores ensures that these are already uncorrelated, or only moderately correlated in the case of oblique rotation methods (Thompson, 2004). The opposite is the case for the topic probabilities of each text. Since these always sum to 1, they are fully collinear as a set and cannot be used directly as model predictors. We therefore subject them to principal component analysis and use the values of the principal components as predictors. This has disadvantages if one wishes to explore the effects of individual topics, but is entirely robust for evaluating the predictive power of the topic structure on the whole.
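The collinearity problem and its resolution through principal component analysis can be illustrated with a simulated document-topic matrix. The sketch below is not the analysis code: the Dirichlet-sampled matrix stands in for real LDA output, the function name `topic_components` is hypothetical, and the number of retained components (24) is an assumption, since the text does not state how many were used.

```python
import numpy as np


def topic_components(topic_probs, n_components):
    """Project document-topic probabilities onto their first principal components."""
    centered = topic_probs - topic_probs.mean(axis=0)
    # SVD of the centered matrix: rows of vt are the principal axes
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ vt[:n_components].T, explained


# Simulated document-topic matrix: 300 texts, 25 topics, rows sum to 1 (not COHA).
rng = np.random.default_rng(2)
theta = rng.dirichlet(np.full(25, 0.3), size=300)

# The sum-to-one constraint leaves at most 24 non-degenerate directions of variance.
components, explained = topic_components(theta, n_components=24)
```

The last entry of `explained` is numerically zero, reflecting the sum-to-one constraint, and the retained component scores are mutually uncorrelated. They can therefore serve as predictors in the same kind of linear model as the register dimension scores.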