Word frequency effects found in free recall are rather due to Bayesian surprise

Musca, Serban C.; Chemero, Anthony

doi:10.3389/fpsyg.2022.940950

HYPOTHESIS AND THEORY article

Front. Psychol., 25 August 2022

Sec. Cognitive Science

Volume 13 - 2022 | https://doi.org/10.3389/fpsyg.2022.940950

Serban C. Musca¹*

Anthony Chemero²

¹Department of Psychology, Université Rennes 2, Rennes, France
²Department of Philosophy and Psychology, University of Cincinnati, Cincinnati, OH, United States

The inconsistent relation between word frequency and free recall performance (sometimes a positive one, sometimes a negative one, and sometimes no relation) and the non-monotonic relation found between the two cannot all be explained by current theories. We propose a theoretical framework that can explain all extant results. Based on an ecological psychology analysis of the free recall situation in terms of environmental and informational resources available to the participants, we propose that because participants’ cognitive system has been shaped by their native language, free recall performance is best understood as the end result of relational properties that preexist the experimental situation and of the way the words from the experimental list interact with those. In addition, we borrow from predictive coding theory the idea that the brain constantly predicts “what is coming next” so that it is mainly prediction errors that will propagate information forward. Our ecological psychology analysis indicates there will be “prediction errors” because the word frequency distribution in an experimental word list is inevitably different from the particular Zipf’s law distribution of the words in the language that shaped participants’ brains. We further propose that the particular distributional discrepancies inherent to a given word list will trigger, as a function of the words that are included in the list, their order, and the words that are absent from the list, a surprisal signal in the brain, something that is isomorphic to the concept of Bayesian surprise. The precise moment when Bayesian surprise is triggered will determine which word of the list it is associated with, and that word will benefit from it and become more memorable as a direct function of the magnitude of the surprisal. Two experiments are presented that show a proxy of Bayesian surprise explains the free recall performance and that no effect of word frequency is found above and beyond the effect of that proxy variable. We then discuss how our view can account for all data extant in the literature on the effect of word frequency on free recall.

Introduction

Contradictory results have been reported in the literature that examines the effect of the frequency of occurrence of words in the language (hereafter, word frequency, WF) on free recall (hereafter, FR). It first appeared that WF was positively related to FR performance, with high frequency (HF) words having a higher probability of being recalled than low frequency (LF) words (e.g., Hall, 1954; Murdock, 1960; Sumby, 1963). However, it was found that the experimental design influences the results, with different effects arising when WF is manipulated between subjects (pure lists: a participant receives either only LF words or only HF words) vs. within subjects (mixed list: all participants receive a list comprising words of all frequencies). FR performance on mixed lists has been found to be negatively related to WF in some studies (DeLosh and McDaniel, 1996; Merritt et al., 2006; Ozubko and Joordens, 2007), but sometimes no relationship was found (Watkins et al., 2000; Ward et al., 2003; Ozubko and Joordens, 2007), and other studies found a positive relationship between WF and FR performance (Balota and Neely, 1980; Watkins et al., 2000; Hicks et al., 2005).

Tentative explanations go back to the 1970s (e.g., Brown et al., 1977; see also Glanzer and Adams, 1985) but recent work (Lohnas and Kahana, 2013) indicates that a consensus has not yet been reached. Lohnas and Kahana (2013) observed that the frequency values of the “LF” and “HF” word lists varied greatly between studies. This prompted them to conjecture a non-monotonic relationship between WF and probability of recall, one that would have the potential to reconcile the extant divergent results. Their experimental results do indeed show a non-monotonic relationship between WF and FR performance. Having partitioned the word pool into ten log frequency bins ranging from LF to HF, Lohnas and Kahana (2013) found the probability of FR as a function of the frequency bin to have the shape of a check mark: it was high for the first bin (i.e., the bin of lowest word frequencies), then plummeted to its lowest value for the second bin (a value significantly lower than that for the first bin) and increased from the third bin on, reaching its peak around the tenth bin (i.e., the bin of the highest frequencies considered). While this result pattern helps to make sense of the contradictory results found in the literature, it does raise the question of why WF has such a peculiar non-monotonic effect on recall performance. This could lead one to suspect the observed effect is not a genuine effect of WF, or that some other variable is also at play. On the other hand, if it is indeed a genuine WF effect, the non-monotonic relationship found poses a challenge to extant models, all of which, to the best of our knowledge, posit a monotonic relationship between WF and FR.

Explanations exist for the negative and for the positive relationship between WF and recall, but no explanation encompasses both. A decreasing FR performance as WF increases (i.e., a negative relationship) can be explained by proposing that “recall performance results from the relative contributions of individual-item information and serial-order information” and that for “common items [i.e., HF words], order encoding should decrease from pure lists to mixed lists [while the] reverse should occur for unusual items [i.e., LF words]: order encoding should increase from pure to mixed lists” (DeLosh and McDaniel, 1996; Merritt et al., 2006). Ozubko and Joordens (2007) explain a negative relation by the asymmetrically strong links between low- and high-frequency words, with high-low intra-list associations being stronger (than the low-high ones) and thus giving rise to better performance on low-frequency words in the mixed list with random word order.

Increased FR performance as WF increases (i.e., a positive relationship) is accounted for by explanations originating from the two-process (generate-recognize) theories of recall and recognition (Kintsch, 1968; Anderson and Bower, 1972): HF words are easier to recall because more experience with HF words makes it easier to generate them as retrieval candidates than LF words. In particular Balota and Neely (1980) state that “this can be accommodated by [generate-recognize theories] by arguing that the nodes corresponding to HF words are much more likely to be generated than are nodes corresponding to LF words. Thus, even though it may be more difficult to recognize a generated node corresponding to an HF word than to recognize a generated node corresponding to an LF word, there are so many more HF than LF nodes generated that the net result is superior RCL [recall] of HF words.” A different explanation (Gillund and Shiffrin, 1984) starts from the proposal that HF words have stronger associative relations to other items than do LF words, both in terms of prior associations and in terms of experimental associations formed during the study. Because HF words have stronger associations than LF words, pure HF lists are more easily recalled than pure LF lists. Moreover, as LF words in mixed lists are cued by HF words, there is a better recall of LF words in mixed lists as compared to pure LF lists (which would explain why often no WF effect is found in mixed lists). Finally, because LF words are less effective cues than HF words, HF words are easier to recall in pure HF lists than in mixed lists.

Regardless of the ongoing debate on the merit of these explanations (and others not mentioned here), one is still free to pick the theoretical explanation of the WF effect that fits their results, and no particular theory can explain a non-monotonic relationship, such as the one found by Lohnas and Kahana (2013).

The ideas that we propose here aim at explaining all the result patterns found in the relationship between WF and FR: positive and negative relations, and also no relation, for mixed lists; a consistent positive relation for pure lists. Our tentative explanation will also speak to why the use of pure lists increases the probability of finding a WF effect on FR (as opposed to when a mixed list is used). Our claims are the following. Firstly, inevitable discrepancies exist between the WF distribution of those-words-in-that-experimental-list and that of the words in participants’ native language. Secondly, people’s brains are sensitive to such distributional discrepancies, such that Bayesian surprise (Itti and Baldi, 2009; Baldi and Itti, 2010) occurs, the result of which is that some words become more memorable than others. These claims offer a very different account of the data, one grounded in an ecological psychology approach to cognition in general, and in an ecological psychology analysis of the experimental situation used to derive FR performance. We present two behavioral experiments to support our account, and discuss the value of our interpretation both with respect to the results of these experiments and with respect to how it may explain when and why WF effects on FR performance have been found in the literature.

Inevitable distributional discrepancies between those-words-in-that-experimental-list and the words in participants’ native language

It is central to our main claims that the WF distribution of words in an experimental list is inevitably different from the WF distribution of words in the language. We will thus make this point here and consider its implications.

Zipf (1935) showed that the WF distribution in English¹ is very skewed and right-tailed, meaning that a word’s frequency of use is inversely proportional to its frequency rank (see Figure 1), a distribution later referred to as Zipfian. We like to think Zipf’s line of research did not have the influence it deserves because, to the best of our knowledge, no researcher has tried to make sure the WF distribution of the words in their experimental list(s) was similar to that in participants’ native language, or to explain the WF effect(s) on FR as arising from the discrepancies between these two distributions. Of course, one possibility is that those distributional discrepancies do not matter. Not only do we suggest they do, but we will present arguments suggesting that this is how WF exerts its effect on FR performance.

FIGURE 1

Figure 1. Zipfian distribution of words in American English: count of words per bin (all words in American English are considered for this example). Bins are created by partitioning the frequency range into ten intervals of equal log frequency width. The abscissa represents occurrences of a word per one million words (log scale). Data is based on the SUBTLEX-US database (Brysbaert and New, 2009).
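The binning described in the caption can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `log_bin_counts` is ours, and the input is an idealized toy lexicon in which the word of rank r occurs 10000 / r times per million words, not the SUBTLEX-US data.

```python
import math

def log_bin_counts(frequencies, n_bins=10):
    """Count words in n_bins intervals of equal width on a log10 frequency axis."""
    logs = [math.log10(f) for f in frequencies]
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in logs:
        # The largest value would index one past the end, so clamp it into the last bin.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical lexicon (an assumption for illustration only): frequency is
# inversely proportional to rank, as in an idealized Zipfian distribution.
toy_frequencies = [10000 / rank for rank in range(1, 10001)]
counts = log_bin_counts(toy_frequencies)
print(counts)  # counts per bin, from the lowest to the highest frequency bin
```

With such an input the lowest frequency bin holds most of the words and each successive bin holds fewer, which is the shape the figure describes.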

Figure 2 displays a visual comparison between the WF distribution of all English nouns in the singular in the CELEX 2 database (Baayen et al., 1995), in panel A, and that of the English nouns in the singular used by Lohnas and Kahana (2013) in their experiment, in panel B. A visual inspection reveals what further analyses confirm. Firstly, the distribution of our reference population of words (panel A) follows a Zipfian distribution, that is, the relation between the ranks of the words and their frequency of use follows a decaying exponential law: a decaying exponential model explains 96.1% of the variance in the data (p < 0.0001). Secondly, the distribution of the experimental word pool shown in panel B of Figure 2 does not follow a Zipfian law: a decaying exponential law fits the data poorly, explaining only 25.34% of the variance (p > 0.14). In fact, the distribution is not different from a Gaussian distribution (W = 0.8966, p > 0.2). An exact multinomial test, carried out under a model comparison approach based on the calculation of the Bayes factor (BF), confirmed the difference between the two distributions: it yielded a posterior probability of virtually one for the model supposing the two distributions are different, BF₁₀ > 10⁹⁹.

FIGURE 2

Figure 2. Count of words per bin (and percentage out of a total per bin, italicized, on top of each bar): for all English nouns in the CELEX 2 database (A) vs. for the words used by Lohnas and Kahana (2013) in their experiment (B). Note the massive shape difference between the two distributions. The abscissa represents occurrences of a word per one million words (log scale). See text for details.

One could argue that there is no way to be sure that other published studies did not use a WF distribution close to a Zipfian one. It is important to understand that this cannot be the case. Indeed, if one used in their experimental list just one noun from the three bins of highest WF, they would also have to include about 139 (!) words from the bin of the lowest WF (and about 117 from the next bin, and so on), which makes for quite an impractical experimental list. The problem is even more acute when one considers not the pool of all experimental words that were used (as we considered in our example here) but the specific composition of a particular experimental list: because a list in an FR experiment can only comprise a limited number of words, the distributional discrepancies would be even greater.
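The arithmetic behind this argument can be sketched as follows. The bin counts are hypothetical placeholders (the text does not list the CELEX 2 counts per bin), and `proportional_list` is a name introduced here for illustration. The point is only that matching the language's proportions while keeping a single word in the highest frequency bins inflates the list to an unusable length.

```python
def proportional_list(bin_counts, anchor_bins, anchor_words=1):
    """Words needed per bin for a list that mirrors the language's bin shares.

    bin_counts: number of language words in each frequency bin (low to high WF).
    anchor_bins: indices of the bins that together contribute `anchor_words`.
    """
    anchor_total = sum(bin_counts[i] for i in anchor_bins)
    scale = anchor_words / anchor_total
    return [count * scale for count in bin_counts]

# Hypothetical counts per bin, NOT taken from CELEX 2 (illustration only).
toy_counts = [6000, 4000, 2000, 1000, 500, 250, 120, 30, 15, 5]

# One list word drawn from the three highest frequency bins taken together.
needed = proportional_list(toy_counts, anchor_bins=[7, 8, 9])
print([round(n, 1) for n in needed])
print("total list length:", round(sum(needed), 1))
```

Even with these made-up counts, a list that respects the language's proportions would need over a hundred words from the lowest bin alone.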

In other words, one cannot construct an experimental list of words the WF distribution of which would not be at odds with the WF distribution of the words in the participant’s native language. There are important consequences to this situation. One such consequence is that one cannot experimentally evidence the effect of such WF distribution discrepancy other than by measuring it. More importantly, one cannot control its effect (in the sense of partialling out its undue influence) but by measuring it and including it as a variable in all the statistical models that test for the putative effect of other variables of interest (e.g., WF, age of acquisition, etc.). Finally, if WF exerts its influence on FR performance through the WF distribution discrepancy we discussed, no (additional) WF effect should be found once one controls for the variable that measures the WF distribution discrepancy (i.e., with this latter variable included in the statistical model).

Priors, expectations, surprisal and Bayesian surprise

Our view builds heavily on predictive coding theory (e.g., Rao and Ballard, 1999; Friston, 2003; Bubic et al., 2010; Huang and Rao, 2011; Clark, 2013). Initially developed based on findings in the field of perception (e.g., Srinivasan et al., 1982), predictive coding theory rests on the idea that the brain is “using top-down connections to try to generate, using high-level knowledge, a kind of ‘virtual version’ of the sensory data via a deep multilevel cascade… [with] the top-down flow as attempting to predict and fully ‘explain away’ the driving sensory signal, leaving only any residual ‘prediction errors’ to propagate information forward” (Clark, 2013). Later on, predictive coding theory was extended to action and motor control (e.g., Brown et al., 2011; see also free energy theory: Friston and Stephan, 2007; Friston et al., 2009; Friston, 2010) and lately to cognition in general (e.g., Lupyan and Clark, 2015; Spratling, 2016). For instance, Lupyan and Clark (2015) conclude that “predictive processing [i.e., processing in a model that implements hierarchical predictive coding] thus provides a plausible mechanism for many of the reported effects of language on perception, thought, and action.” We would like to add free recall to that list. While we do not endorse all the assumptions of predictive coding, we do agree with the general idea that the brain continuously predicts what is coming next, possesses the neural circuitry necessary to sense any significant divergence between its predictions and what actually occurs, and then adjusts itself (i.e., learns, memorizes) to the differences found between what was predicted and what occurred. In other words, its learning depends positively on the unexpected: the more unexpected an occurrence, the more learning/memorization is induced.

What bears special relevance to our proposal is the fact that “prediction error, … the divergence from the expected signal,… reports the ‘surprise’ induced by a mismatch between the sensory signals encountered and those predicted” (Clark, 2013). Clark (2013) very appropriately stresses that “[m]ore formally – and to distinguish it from surprise in the normal, experientially loaded sense – this is known as surprisal (Tribus, 1961)”. This is a crucial point since Tribus’ surprisal has nothing to do with the notion of cognitive surprise as it has been construed since the late 1960s. The latter, cognitive surprise, is best described in Kamin’s (1969) words: “[…] perhaps […] it is necessary that the US [unconditioned stimulus] instigate some mental work on the part of the animal. This mental work will occur only if the US is unpredicted, if it in some sense surprises the animal” (Kamin, 1969, p. 293, our emphasis). Distinguishing surprisal from “cognitive” surprise is necessary in order to avoid the pitfall of implying that participants have to be able to report experiencing surprise for there to be an effect. Indeed, while a rat that receives the first electric shock of its life may be surprised in the phenomenological sense (i.e., it may experience surprise in addition to its brain experiencing surprisal), there is no need to suppose that learning can be modulated only by something the learner (e.g., a human participant) is conscious of (i.e., surprise).

More recently Itti and Baldi (Itti and Baldi, 2009; Baldi and Itti, 2010) introduced the concept of Bayesian surprise, which seems virtually identical to that of Tribus’ surprisal but has the advantage of bringing the Bayesian framework into play, along with a mathematical definition and the possibility of making numeric predictions: “The amount of information contained in a piece of data can be measured by the effect this data has on its observer. Fundamentally, this effect is to transform the observer’s prior beliefs into posterior beliefs, according to Bayes theorem.” (Baldi and Itti, 2010). This relates well to predictive coding: “Predictive coding […] still depends upon known priors.” (Friston, 2003). The question that must be answered then is what the known priors are and where they come from in the case of (memorizing and recalling) words.
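To make the quoted idea concrete: in Itti and Baldi's formulation, the effect of a datum on an observer is quantified as the Kullback-Leibler divergence between the posterior and the prior beliefs. The sketch below applies that definition to a hypothetical observer entertaining just two hypotheses; the function names and the numbers are illustrative assumptions, not values from any study.

```python
import math

def posterior(prior, likelihood):
    """Bayes' theorem over a discrete set of hypotheses."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]

def bayesian_surprise(prior, likelihood):
    """KL divergence of the posterior from the prior, in bits.

    Assumes every prior probability is strictly positive.
    """
    post = posterior(prior, likelihood)
    return sum(q * math.log2(q / p) for q, p in zip(post, prior) if q > 0)

# Hypothetical observer: hypothesis A is believed much more than hypothesis B.
prior = [0.9, 0.1]

# Likelihood of three possible data under (A, B); all values are made up.
uninformative = bayesian_surprise(prior, [0.5, 0.5])  # beliefs unchanged
confirming = bayesian_surprise(prior, [0.9, 0.1])     # favors the believed hypothesis
contradicting = bayesian_surprise(prior, [0.1, 0.9])  # favors the unlikely hypothesis

print(uninformative, confirming, contradicting)
```

A datum that leaves beliefs unchanged carries no surprise, whereas a datum that shifts belief toward the hypothesis previously held unlikely carries the most. Nothing in this computation requires the observer to report being surprised.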

To answer this question, we begin by taking a broadly ecological approach to the issue (Gibson, 1979; Turvey et al., 1981; Agre, 1997; Chemero, 2009), according to which one must carefully examine environmental and informational resources available to the participant before positing any specialized cognitive processes. Moreover, according to this approach, informational and environmental resources are often higher-order, relational properties. This means we assume the neural structures that enable FR are determined by experience with a language but do not explicitly represent it as a model, list, or searchable structure. Likewise, we do not assume that information on the frequency of the to-be-memorized words is available to the participants in an experiment, or that the statistical structure of the language (WF, among other properties) is explicitly represented by the cognitive machinery of participants in such a way that this machinery could make use of the frequency of the to-be-memorized words and yield the WF effects. In this sense, WF, as manipulated experimentally, is not directly responsible for the FR performance.

What is available to a participant in the FR task is a recently presented list of words, a cognitive system that has been shaped by their native language, and relations between these. Over developmental time the statistical structure of the language being learned and used shapes the neural structures in the same way that weather patterns and flowing water affect the landscape. The number of encounters with a word (operationalized by its WF) and other factors (e.g., what other words co-occurred, on what occasions, etc.) are responsible for the particular pattern of these neural structures. WF in the environment of the past linguistic experience is thus a distal cause with respect to FR performance, as it is (one of) the cause(s) that has been shaping the cognitive system over a long period of time and made it what it is at the time of the experiment (and its FR part). FR performance is then the end result of relational properties that preexist the experimental situation and of the way the words from the experimental list interact with those.²

Our account differs from extant models in that it does not posit specialized cognitive processes to account for FR. At its worst, positing a specialized cognitive process can seem overly ad hoc and, hence, unexplanatory. A single account of the relationship to FR that makes sense of both the positive and negative WF relations without positing multiple processes should be preferable. To achieve that, our strategy was to find the higher-order property of the environment the participants in the experiments are responding to. Only after we know exactly what information participants are using in order to recall words does it make sense to speculate about the cognitive and neural mechanisms that enable the recall (Gibson, 1979; Turvey et al., 1981; Agre, 1997; Chemero, 2009). In particular, we consider that the participants in FR experiments are not responding just to the to-be-recalled words of an experimental list, but to a particular relationship between the to-be-recalled words and the corpus of the language as a whole (with the latter having shaped participants’ cognitive system in a particular way), a relationship that we will capture through a proxy variable we call surprisal proxy (hereafter, SP).

Thus, the known priors, in the theoretical framework we propose, are provided by the particular neural structures of a participant that were previously shaped and determined by the statistical properties of the words in the participant’s native language. That a neural signal is triggered when something about the actual stimulus is at odds with the brain’s expectations has been convincingly argued for in neuroscience (e.g., Steinberg et al., 2013), so we contend that Bayesian surprise³, a neural signal, is triggered in a participant’s brain when some statistical properties of a word are at odds with the brain’s expectations, given the brain’s priors. Thus it is Bayesian surprise that makes some words from a word list more memorable and more prone to being recalled than others. As mentioned before, Bayesian surprise is the end result of an interaction between, on the one hand, a particular word in a given position in a list comprising those words in particular (and not comprising others) and, on the other hand, a brain whose linguistic priors are what they are as a result of that participant having competently used their native language for many years prior to the experimental situation.

Surprisal proxy, an experimental index of Bayesian surprise, and its interpretation

Because we have no direct means of measuring Bayesian surprise, we constructed an index of Bayesian surprise that we called surprisal proxy (SP) in order to test our view of FR performance, starting from the distributional properties of the words in the language and the to-be-recalled words in a list. SP is only intended as a variable that can be used to experimentally test our proposed account, and one should refrain from reifying SP into a concept or confounding it with Tribus’ notion of surprisal. This is the reason why we have not called SP Bayesian surprise: Bayesian surprise is the concept at work in our theoretical explanation, while SP is just a handy experimental proxy for it.

Before giving details on how the SP index is computed, some clarifications are in order. Firstly, we do not contend that FR performance is based on the computation of SP in the brain. Crucially, we assume that no computation of the kind we will detail below when constructing our SP variable is carried out, implicitly or explicitly, by the participants. Secondly, the SP value for a to-be-remembered word on a list is not a transformation of that word’s frequency value. That is to say, while for a given WF database the WF value associated with a word is a fixed number⁴, the very same word will have SP values that are low or high depending on which other words are in (and which are absent from) that list, and also on its order of occurrence with respect to the other words of the list. This latter point is made more manifest in the following presentation of the computation of SP.

Because WF is a discrete variable, computation of SP is carried out by intervals. Any given WF interval will include a certain number of words in the language, and potentially one or more words of the word list. The larger the considered WF interval, the larger will be the number of words in the language that are comprised within it. If no discrepancy exists between list and language in the distribution of WF, a large WF interval will also include a proportionally large number of words from an experimental word list. Another parameter to consider in the computation of SP is the particular form of the WF distribution (see Figure 1), which makes it such that for two WF intervals of equal widths, a (much) higher number of words in the language will fall into a WF interval situated in the low WF values — compare the leftmost and the second leftmost bins, or, more extremely, the leftmost and the rightmost bins in Figure 1. As there are far fewer words in the experimental list as compared to the words in the language, the words in the experimental list are used to define the WF intervals. The general idea of the computation of SP is the following (numeric examples follow):

(i) a given WF interval considered generally comprises a single experimental word (the exception being if there are two or more words of the exact same WF in the experimental word list); given the number of words in the word list, we can compute the percentage of the words from the list that are comprised in that WF interval, PctList (PctList is then the number of list words that fall within the interval divided by the total number of words in the list);

(ii) we determine how many words fall within that same WF interval in the language, and given the total number of words in the language, we can compute the percentage of the words from the language that are comprised in that WF interval, PctLanguage. Because the total number of words in the language may be very high in comparison to the number of words in the language that fall in the considered WF interval, PctLanguage is generally very low so a further step consists in taking the square root of PctLanguage, sqrt(PctLanguage);

(iii) the width of the interval, WInt is taken into account;

(iv) we divide PctList by WInt and compute the ratio of sqrt(PctLanguage) to the result, which amounts to multiplying sqrt(PctLanguage) by WInt and dividing by PctList; the log of the result is SP. In other words, for one word (or many words of the exact same WF) of the experimental list we have

$$SP = \log\left(\frac{\sqrt{PctLanguage}}{PctList / WInt}\right) = \log\left(\frac{\sqrt{PctLanguage} \times WInt}{PctList}\right)$$

The numeric examples that follow correspond to a base-10 logarithm.
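As a reading aid, the product form of the definition can be expanded with the usual logarithm rules. This is only a rearrangement of the formula above, using the symbols defined in steps (i) to (iv), and adds no assumption of its own:

$$SP = \tfrac{1}{2}\log(PctLanguage) + \log(WInt) - \log(PctList)$$

Read this way, SP grows with the share of the language's words that fall within the interval and with the width of the interval, and it shrinks as the share of list words within the interval grows.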

There are however a number of additional important details to be taken into account in the computation of SP. The first step in computing SP is defining the distribution of words in the language that we take as reference. Considering an experiment in which the words in the experimental lists are French nouns in singular form, we would take as a reference all nouns in French in the singular form (hereafter called only ‘words’, for simplicity’s sake) of all frequencies⁵. There are 24,530 French nouns in the singular form in the database we used, LEXIQUE (New et al., 2001, 2004), the reference database for frequency of use of words in adults in French. Next, one must ensure the distribution of our reference population of words follows a Zipfian distribution, that is, the relation between the ranks of the words and their frequency of use follows a decaying exponential law. This was indeed the case, as a decaying exponential model explains 99.84% of the variance in the data (p < 0.0001).

Importantly, the computation of SP takes into account the words so far presented to a participant in the experiment, which means a word’s SP value depends on that word’s position in the list. Let us suppose the first six words of a 30-word list that are presented to a participant are, in their order of presentation, plongeur (diver), cercle (circle), brouette (wheelbarrow), esquimau (eskimo), poireau (leek) and oie (goose), and that their respective word frequencies are 1.69, 42.43, 5.14, 0.88, 0.88, and 5.2 (occurrences per million words, from LEXIQUE). The word plongeur adds one word of its frequency to the experimental list (this general formulation is introduced to take into account the case of two or more words that immediately follow each other and have the exact same frequency), and the interval it is associated with is [0.07⁶; 1.69], that is, an interval of width 1.62. In this interval, there are 12,513 words in French, and there are a total of 24,530 French words in this example. The SP value for plongeur as the first word in the list is thus log{[sqrt(12,513/24,530)*1.62]/(1/30)}, that is about 1.54. The word cercle is associated with the interval [1.69; 42.43] because the lower bound is given by the word with the highest frequency (i) that has already been presented and (ii) the frequency of which does not exceed that of the word at hand, here the frequency of the first word, plongeur. In this interval, there are 6,204 words, and the width of this interval is 40.74. The SP value for cercle as the second word in the list is thus log{[sqrt(6,204/24,530)*40.74]/(1/30)}, that is about 2.79. The SP value for brouette as the third word in the list is about 1.58 (there is nothing noteworthy about its computation). The words esquimau and poireau add to the experimental list two words of the exact same frequency and the interval they are associated with is [0.07; 0.88], an interval of width 0.81 and containing 10,037 words. The SP value for both esquimau and poireau as the fourth and fifth words in the list is thus log{[sqrt(10,037/24,530)*0.81]/(2/30)}, which is about 0.89. Finally, the word oie adds one word of its frequency to the experimental list and the interval it is associated with is [5.14; 5.2], an interval containing 29 words, and the width of this interval is 0.06. The SP value for oie as the sixth word in the list is thus log{[sqrt(29/24,530)*0.06]/(1/30)}, which is about −1.21.
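The worked example can be retraced with a short script. It is a sketch under stated assumptions rather than the authors' implementation: the function names are ours, the logarithm is taken to base 10 because that is what reproduces the values quoted in the text, the lower bound uses a strict comparison (the text only specifies how ties between immediately consecutive words are handled), and the count for brouette's interval is left out because the text does not state it.

```python
import math

def surprisal_proxy(n_language_in_interval, n_language_total,
                    n_list_in_interval, list_length, width):
    """SP = log10(sqrt(PctLanguage) * WInt / PctList)."""
    pct_language = n_language_in_interval / n_language_total
    pct_list = n_list_in_interval / list_length
    return math.log10(math.sqrt(pct_language) * width / pct_list)

def presentation_intervals(frequencies, floor):
    """Return (lower, upper, n_list_words) for each run of consecutive equal frequencies.

    The lower bound is the highest already-presented frequency below the
    current one, or `floor` when no such frequency has been presented.
    """
    seen, result, i = [], [], 0
    while i < len(frequencies):
        f = frequencies[i]
        run = 1
        while i + run < len(frequencies) and frequencies[i + run] == f:
            run += 1
        lower = max([s for s in seen if s < f], default=floor)
        result.append((lower, f, run))
        seen.append(f)
        i += run
    return result

LIST_LENGTH = 30      # words in the experimental list
N_LANGUAGE = 24530    # French singular nouns in the reference database
FLOOR = 0.07          # lowest interval bound used in the text's example

# plongeur, cercle, brouette, esquimau, poireau, oie (per million words)
frequencies = [1.69, 42.43, 5.14, 0.88, 0.88, 5.2]
intervals = presentation_intervals(frequencies, FLOOR)

# Language word counts per interval as given in the text; the count for
# brouette's interval [1.69; 5.14] is not stated, so it is omitted here.
language_counts = {(0.07, 1.69): 12513, (1.69, 42.43): 6204,
                   (0.07, 0.88): 10037, (5.14, 5.2): 29}

sp_values = {}
for lower, upper, n_list in intervals:
    n_language = language_counts.get((lower, upper))
    if n_language is not None:
        sp_values[(lower, upper)] = surprisal_proxy(
            n_language, N_LANGUAGE, n_list, LIST_LENGTH, upper - lower)
        print(f"[{lower}; {upper}] SP = {sp_values[(lower, upper)]:.2f}")
```

Because each interval is derived from the words already presented, reordering the same six words changes the intervals and therefore the SP values, which is the position dependence stressed at the start of the paragraph.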

The interpretation of an SP value is quite straightforward. One may very loosely think of the numerator of SP, sqrt(PctLanguage), as “what is expected” and of the denominator, PctList, as what is observed in the experimental list. If the interval width and/or the number of words in the language that fall within an interval is/are low (cf. word oie), little is expected, and finding that one of the 30 words of the list falls in that interval should not generate any Bayesian surprise because, so to speak, “nothing is lacking with respect to what was expected” (SP is about −1.21). By contrast, the word cercle falls into a very large interval that comprises many words in the language. This makes manifest that one or more words of a lower WF were expected and were not present in the list. When the word cercle is presented, it entails the absence of all those other words of lower WF from the experimental list, which generates Bayesian surprise. That Bayesian surprise is associated with the word cercle (SP is about 2.79): that word is present when Bayesian surprise occurs, and there is nothing else to associate Bayesian surprise with at that time. It is crucial to understand that there is nothing surprising about the word cercle. What is surprising to the cognitive system is that something else was expected (before a word of such a “high” WF) and that expectation is contradicted by the very presence of the word cercle: whatever was expected did not occur, the word cercle was presented instead. Based only on their SP values, among the six words in our example, the one that is expected to have the highest (lowest) recall probability is cercle (oie), with an SP of about 2.79 (−1.21).

Experiments

The approach we opted for in the experiments described below is motivated by the following. Firstly, as previously discussed, one cannot compare two conditions, one with a list of words that conforms to the Zipfian distribution in the language and the other with a list that does not, and observe when a WF effect on FR performance is obtained. One is thus left with the option of controlling the effect of these discrepancies at a statistical level. We introduced and defined the SP variable for this reason, as a proxy for the Bayesian surprise that we suppose the brain derives from the distributional discrepancies. When testing for the influence of a variable (e.g., WF) on FR performance, we will do so with SP in the model.

The possible outcomes are thus the following. If participants’ cognitive systems are not sensitive to the discrepancies that exist between the WF distributions in the language and in the experimental lists, SP will not be a predictor of FR performance. This is a perfectly valid prediction despite being a null hypothesis prediction, because we will carry out all statistical analyses in a Bayesian framework that allows for validating the null hypothesis if the model derived from it fits the data best. In this case, if WF is indeed a genuine predictor of FR performance, that is, if it exerts its influence in a way different from what we propose here, the effect of WF should manifest itself.

On the other hand, if SP explains FR performance, that is, there is no WF effect above and beyond that of SP, then one must conclude that the manipulation of WF does not have a direct effect on FR performance. In this case, WF manipulations rather affect FR performance by creating statistical discrepancies that are detected by the brain, which by the mechanism of Bayesian surprise makes some words more memorable.

These predictions are tested in two classic FR experiments where participants are first asked to memorize a list of words that are presented to them one at a time and then asked to recall the memorized words in whatever order these come to mind. For both experiments, we chose a reasonably high number of words per list to make sure there is enough variability for each of the descriptors/predictors (i.e., WF, age of acquisition, etc.; see Experiment 1), so as to be able to test for the influence of each of these descriptors on FR performance in a within-subjects design.

Experiment 1

Experiment 1 was a proof of concept and as such suffers from some shortcomings. Among these are a small number of participants, a backward counting task between the memorization phase and the FR phase that was probably too long, and a programming error that led to the first four words of a list always being the same and always being presented in the same order. We decided to include it because it yielded some remarkable results.

Method

Participants

A total of fourteen participants (4 men) aged 19-25 years (mean = 20.65; SD = 1.63), all second-year psychology students at the University Rennes 2 (Rennes, France), participated in the experiment for course credit. They were all native speakers of French and were randomly assigned to one of the two experimental groups (see next section) so they memorized and recalled a single list of words.

Stimuli and apparatus

The words used as stimuli (cf. Appendix A) are 72 concrete and quite common French words that are names of objects presented as line drawings in Snodgrass and Vanderwart’s (1980) set and are also found in Alario and Ferrand (1999). The words are basic enough to be deemed suitable as experimental material in patients with anomia (e.g., McCarthy and Kartsounis, 2000) and as therapeutic material in patients with post-stroke aphasia (Macoir et al., 2017). The particular words retained here were chosen so as to maximize the variability along the WF dimension and also because a full set of descriptors exists for each of them — in French, there are only a few hundred words for which all the descriptors mentioned below are available. Two equivalent lists of 36 words were constructed (L1 and L2, cf. Appendix A) and each participant was presented randomly with one of the lists (seven participants saw L1).

The word descriptors were the following. Number of phonemes (PhN) and printed frequency (WFA) come from LEXIQUE (New et al., 2001, 2004). The frequency of use of words in books for French children (WFC) is taken from the MANULEX database (Lété et al., 2004). Age of acquisition (AoA), the age at which a speaker first consistently knows the meaning of a word, is taken from Chalard et al. (2003). Conceptual familiarity (CFam), the familiarity of the concept a word refers to, is based on Bonin et al. (2003). We also included animacy (Anim) as a predictor, a binary variable that opposes animate to inanimate things, because animacy was found to influence performance in a naming task (see Howard et al., 1995), possibly because it is information that is available before name phonology for the object to be named (e.g., van Turennout et al., 1997). For SP, our last predictor, note that a given word does not have a fixed SP value: its SP value depends on that word’s position in the list, the words the list contains, which of those were already presented, and the words that are not present in the list.

As no correlation was different from one experimental list to the other (all p > .05), pairwise correlations between all numeric descriptors are presented in Table 1 for both experimental lists combined⁷ (point-biserial correlation was used to correlate the dichotomous animacy variable with the other numeric variables).
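As a reminder of what the point-biserial coefficient computes, the sketch below shows that it coincides with the ordinary Pearson correlation once the dichotomous variable is coded 0/1. The function names and the toy values are hypothetical and serve only as an illustration; they are not data from the experiment.

```python
import math


def pearson_r(x, y):
    """Plain Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def point_biserial(binary, values):
    """Point-biserial r from group means and the population SD of all values."""
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    n1, n0, n = len(group1), len(group0), len(values)
    mean_all = sum(values) / n
    sd_all = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean_diff = sum(group1) / n1 - sum(group0) / n0
    return mean_diff / sd_all * math.sqrt(n1 * n0 / n ** 2)


# Hypothetical toy values: 1 = animate, 0 = inanimate; scores are made up.
anim = [1, 0, 1, 0, 0, 1]
score = [2.1, 3.4, 1.8, 2.9, 3.1, 2.5]

r_pb = point_biserial(anim, score)
r_pearson = pearson_r(anim, score)
print(abs(r_pb - r_pearson) < 1e-12)  # the two computations agree
```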

Table 1. Pairwise correlations between the word descriptors considered in Experiment 1.

The task was run with E-Prime 2.0 (PST Inc., PA, United States) on an IBM-compatible computer running Windows and using a 3:4 ratio 17” screen. Stimuli were presented on a black background in white, lowercase, bold, 24-point Courier New characters. They were displayed centered both horizontally and vertically on the screen. Participants sat in front of the screen at an approximate viewing distance of 60 cm.

Design and procedure

WF and other word descriptors vary within-subject. Each participant was tested individually. Participants were welcomed and received the instructions on the computer screen. They were instructed to memorize all the words that were going to be presented to them and were told they would have to recall those words but that they would not be required to recall them in the order they were presented to them.

Each stimulus was presented once, for 3,000 ms, and was followed by a black screen for 1,500 ms. The first four words of the list⁸ and the other 32 words (always presented in a random order that differed from one participant to another) were presented without interruption so that nothing distinguished the former from the latter. After the last word of the list was presented, a message indicated to the participant that the presentation of the to-be-memorized words was over, and the participant was then instructed to count backward in threes from 300 (i.e., 300, 297, 294, etc.). The participant carried out this task for 120 s, then was handed a sheet of paper and a pencil and asked to write down all the words they could remember. They had 5 min to complete the FR task.

Results

In order to avoid confounding an effect of the order of presentation with that of a word descriptor, the four low WF words that were always presented in the same order at the beginning of the list to all participants (see the previous section) were excluded from the analyses. Out of the thirty-two remaining words, the number of freely recalled words varied between four and eighteen (mean = 8.79, SD = 3.75).

All analyses that follow are carried out within a Bayesian model comparison approach that consists in building all models, comparing them, and retaining the model that shows the best fit to the data. Model fit to data is evaluated through the Bayesian Information Criterion (BIC: Schwarz, 1978), with the lowest BIC value reflecting the most probable model. All models are mixed-effect logistic regression models with a logit link on the binomial dependent variable (one that codes for whether a word was recalled or not, hereafter Resp), a random subject effect that allows taking into account the variations in mean performance between participants and possibly one or more other fixed-effect independent variables. Data analysis and interpretation were carried out with R (R Core Team, 2014) using the R2STATS GUI (Noël, 2014) based on the lme4 library (Bates et al., 2015).
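For reference, the criterion used for model comparison can be written out as follows. This is the standard definition of the BIC, not a formula taken from the article; the symbols $\hat{L}$ (maximized likelihood), $k$ (number of estimated parameters), and $n$ (number of observations) are introduced here for illustration. The value $n = 448$ is an assumption based on 14 participants and 32 analyzed words each.

```latex
\mathrm{BIC} = -2 \ln \hat{L} + k \ln n
% Assumed n = 14 \times 32 = 448 observations, so each additional
% parameter costs \ln 448 \approx 6.10 BIC points; an added predictor
% lowers the BIC only if it reduces -2 \ln \hat{L} by more than that.
```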

The null model is formally written as:

$$\ln\left[\frac{\phi_{ij}}{1 - \phi_{ij}}\right] = \beta_{0j} = \gamma_{00} + u_{0j}, \tag{1}$$

with ϕ_ij = P(Resp_ij = 1 | β_j) and u_0j ∼ N(0, ψ)

and an augmented model that includes a word descriptor (e.g., WF in adults, WFA) as a fixed-effect predictor is formally written as:

$$\ln\left[\frac{\phi_{ij}}{1 - \phi_{ij}}\right] = \beta_{0j} + \beta_{1j}\,\mathrm{WFA}_{ij} = \gamma_{00} + \gamma_{10}\,\mathrm{WFA}_{ij} + u_{0j}, \tag{2}$$

with ϕ_ij = P(Resp_ij = 1 | β_j) and u_0j ∼ N(0, ψ)

If we denote the participant variable as Subj and keep in mind that all models are mixed-effect logistic regression models with a logit link on the binomial dependent variable Resp, the same two models can be written more informally using the notation introduced by Bates et al. (2015) as Resp∼(1| Subj) (the null model formally defined in (1)) and Resp∼(1| Subj)+WFA (the augmented model formally defined in (2)), respectively. For simplicity’s sake, we will use the latter notation throughout. Also for simplicity’s sake, we do not present here the more complex random slopes models we built (i.e., models in which the effect of the predictor to be tested is not constrained to have the same slope for all participants, but is given the supplementary degree of freedom of a slope that differs between participants), because the results were the same and those more complex models did not fit the data better.
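To make the structure of models (1) and (2) concrete, the following sketch maps the linear predictor onto a recall probability through the inverse logit. It does not fit anything: the function name `recall_probability` and all coefficient values are hypothetical and chosen only for illustration (the analyses reported here were run in R with lme4).

```python
import math


def recall_probability(gamma_00, u_0j, gamma_10=0.0, wfa=0.0):
    """Inverse logit of the linear predictor of model (2).

    With gamma_10 = 0 this reduces to the null model (1): a grand
    intercept gamma_00 plus the random intercept u_0j of participant j.
    """
    log_odds = gamma_00 + u_0j + gamma_10 * wfa
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical values, for illustration only (not estimates from the data).
p_null = recall_probability(gamma_00=-1.0, u_0j=0.3)
p_augmented = recall_probability(gamma_00=-1.0, u_0j=0.3, gamma_10=0.5, wfa=2.0)
print(p_null, p_augmented)
```

In this parametrization, a participant with a larger u_0j recalls more words overall, whereas gamma_10 shifts the log-odds of recall by the same amount per unit of WFA for every participant.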

We first checked whether there was a list effect. To do so, we defined a model similar to that defined in (2), but including List (L1 vs. L2) as a predictor (instead of WFA), that is, Resp∼(1| Subj)+List, and compared its BIC (BIC = 539.69) to that of the null model defined in (1), Resp∼(1| Subj) (BIC = 534.28). As the BIC for the null model is lower than that of the augmented model including List (i.e., the null model is the better of the two models), we can conclude that performance does not depend on the list. Thus, all subsequent analyses are carried out on all data, irrespective of the list.
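The comparison above can be reproduced from the two reported BIC values. The sketch below also converts the BIC difference into an approximate Bayes factor using the common exp(ΔBIC/2) approximation; that conversion is an assumption added here for illustration and is not part of the reported analyses. Variable names are hypothetical.

```python
import math

# BIC values reported in the text for the two models.
bic = {
    "Resp ~ (1|Subj)": 534.28,         # null model (1)
    "Resp ~ (1|Subj) + List": 539.69,  # augmented model with List
}

# The retained model is the one with the lowest BIC.
best_model = min(bic, key=bic.get)

# BIC difference in favor of the null model.
delta_bic = bic["Resp ~ (1|Subj) + List"] - bic["Resp ~ (1|Subj)"]

# Approximate Bayes factor for the null over the List model
# (assumption: BF is approximated by exp(delta_BIC / 2)).
approx_bf_null = math.exp(delta_bic / 2)

print(best_model, round(delta_bic, 2), round(approx_bf_null, 1))
```

Under this approximation, a difference of 5.41 BIC points corresponds to the data being roughly 15 times more likely under the null model than under the model including List.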

In what follows, the details of the models mentioned are given in Table 2. The next analysis concerned the existence of order effects. We considered a linear effect of word order in the list (M2) and a quadratic one (M3), but both models fit the data worse than the null model (M0). The lack of primacy and recency effects is somewhat surprising, but less so if one considers that the first four words, which were systematically presented in the same order, were discarded from the analyses (which may have led to the absence of a primacy effect) and also that the word presentation was followed by a 2-min backward counting task (which may have reduced the recency effect enough to mask it).

Table 2. Models and model fit (BIC) to the data of Experiment 1.

Having established that there were no order effects, we now turn to the one-by-one analysis of all the word descriptors. These were considered in their raw form or log-transformed, and the models tested their linear effect. In addition, when the relationship between a predictor and the dependent variable appeared graphically to have a quadratic form, a model considering a quadratic effect was also defined and tested. The models considering adult WF as a predictor, either in its raw form (M4) or log-transformed (M5), both fit the data worse than the null model (M0), which means that WF cannot explain the data. The same is true for words’ age of acquisition (models M6 and M7), word length (number of phonemes; model M8), conceptual familiarity (models M9, M10, and M11), and child WF (models M12, M13, M14, and M15).

Including SP as a predictor yields a model (M16) that is better than the null model (M0). The other word descriptor that seems to explain the performance is animacy (cf. M17), with better recall for animate than for inanimate things (39.56% vs. 24.37% mean percent recall). A model that includes both predictors (i.e., SP and Anim) under an interactive effect assumption (M18) does not account for the data better than the model that supposes only an effect of SP (i.e., M16). The same is true for a model that includes both predictors under an additive effect assumption (M19)⁹.

To summarize, no word descriptor among those considered (WF included) other than SP explains the FR performance in this experiment. The best model involving adult WF as a predictor (although, again, not better than the null model), Resp∼(1| Subj)+log₁₀WFA, has a poor fit to the data (this conclusion is confirmed by the inspection of Figure 3A). On the other hand, SP exhibits a clear relationship to FR performance (see Figure 3B), one that is monotonic (the model, Resp∼(1| Subj)+SP, tests for a linear effect of SP). This supports our idea that the higher the Bayesian surprise associated with a word, the higher its probability of subsequent recall. The next experiment will put this view to a further test.

Figure 3. Free recall probability as a function of (A) word frequency in adults (log scale), and (B) SP. The dashed line stands for the mean probability of the recall curve. See text for details.

Experiment 2

No WF effect was found in Experiment 1 (whereas SP had an effect on FR performance). This may have been a consequence of the low number of participants in that experiment (i.e., of a lack of statistical power). Accordingly, substantially more participants took part in Experiment 2. However, Experiment 2 is not just a replication of Experiment 1 with more participants. Unlike Experiment 1, Experiment 2 includes a dummy task, presented to the participants before the memorization phase of the FR experiment, that introduces a manipulation aimed at drawing participants’ attention to WF. This manipulation is intended to favor the emergence of a WF effect on FR performance.