REVIEW article

Front. Commun., 02 June 2023
Sec. Language Communication

A closer look at the sources of variability in scalar implicature derivation: a review

  • 1Faculty of Modern Languages and Communication, Universiti Putra Malaysia, Serdang, Malaysia
  • 2Institute of Cognitive Science, Osnabrück University, Osnabrück, Germany

For more than 20 years, studies in experimental pragmatics have provided invaluable insights into the cognitive processes involved in deriving scalar implicatures and achieving inferential comprehension. However, the reports have always contained a notable degree of variability that has remained inadequately discussed in the literature. For instance, upon closer inspection of the experimental record, one can always find a group of individuals who tend to be largely pragmatic, overwhelmingly logical, or sometimes mixed, showing no clear preference. There also exist newly devised paradigms that prompt a radically different type of response than other paradigms, thus providing new evidence that casts doubt on long-established findings in the field. More recent research on scalar diversity further suggests that differences in the semantic structure of scalar words can lead to differing rates of scalar implicatures and can modulate the time invested in pragmatic processing. Indeed, one can contend that the current empirical landscape on scalar implicatures is characterized by three primary sources of variability: inter-individual, methodological, and linguistic. What factor or factors are behind these patterns of variability, and how can we interpret them in light of a pragmatic theory? This paper has a twofold objective: one is to review the previous experimental record on scalar implicatures through a variability-based lens, and the other is to discuss the factor(s) that could account for this observed variability in the literature. Avenues for future research are provided.

1. Introduction

For over two decades, the study of scalar implicatures has served as a testing ground for several accounts in experimental pragmatics. Studies mainly investigated how readers and listeners evaluate utterances containing weak scalar words (e.g., <some, all>, <or, and>, <might, must>) and how the implicature embedded in them manifests itself in real-time processing, especially whether it is generated automatically, as posited by the default account (e.g., Levinson, 2000), or with processing costs, as posited by Relevance Theory (e.g., Sperber and Wilson, 1986; Wilson and Sperber, 1998). To adjudicate this theoretical debate, a great deal of experimentation has been carried out, using a variety of testing techniques which include but are not limited to eye-tracking paradigms (e.g., Huang and Snedeker, 2009a, 2018; Politzer-Ahles and Matthew Husband, 2018), mouse-tracking paradigms (e.g., Tomlinson et al., 2013), event-related potentials (ERPs) (e.g., Noveck and Posada, 2003; Chevallier et al., 2010; Nieuwland et al., 2010; Hunt et al., 2013; Spychalska et al., 2016), sentence verification tasks (e.g., Bott and Noveck, 2004; Chevallier et al., 2008; Bott et al., 2012), dual tasks (e.g., De Neys and Schaeken, 2007; Marty and Chemla, 2013), and reading comprehension vignettes (Breheny et al., 2006), among others. Results have generally favored Relevance Theory (see Noveck, 2018; Breheny, 2019; Khorsheed et al., 2022b for reviews). This evidence was also replicated across different populations and languages (Katsos et al., 2016), and it was obtained from materials covering a wide range of scalar terms (van Tiel et al., 2016). However, upon closer inspection of the data in these studies and the relevant literature, one finds different variability patterns that are worth attending to. In our review, we have identified three specific categories of variability: inter-individual, methodological, and linguistic.

Regarding inter-individual variability, numerous studies have consistently revealed that individuals tend to prefer one particular reading over another. For instance, in a study conducted by Noveck and Posada (2003), which used the “Some cats have ears” type of material, participants fell into two qualitatively distinct groups: those who primarily adhered to a logical reading, and those who predominantly adopted a pragmatic interpretation. Sometimes there was also a smattering of “mixed” participants who did not show a consistent preference for either reading (see Bott and Noveck, 2004). This observation was also reported in other studies mainly employing sentence verification paradigms to measure participants' response rates (see Feeney et al., 2004; Breheny et al., 2006; De Neys and Schaeken, 2007; Pouscoulous et al., 2007; Bott et al., 2012; Heyman and Schaeken, 2015; Antoniou et al., 2016; Mazzaggio and Surian, 2018). However, the precise factors contributing to this inter-individual variability are still unknown.

When it comes to methodology, recent reports have highlighted that certain testing paradigms elicit a completely different type of response than others. For instance, in the context of binary judgment tasks, the developmental literature has shown that young children, even when linguistically competent, are less sensitive to underinformative some expressions than adults (e.g., Noveck, 2001; Papafragou and Musolino, 2003; Pouscoulous et al., 2007). However, when modifications are made to the binary judgment task, such as introducing a middle response option (Katsos and Bishop, 2011), incorporating a rating scale (Jasbi et al., 2019), or employing a rewarding system (Bleotu et al., 2021), children demonstrate greater sensitivity to implicatures than previously observed. Such methodological adjustments and their resulting effects have also been observed in the adult literature (e.g., Grodner et al., 2010; Politzer-Ahles and Fiorentino, 2013). This research has revealed that adults can process scalar implicatures as quickly as logical responses, thus providing evidence opposing the bulk of findings in the literature.

Another intriguing aspect emerging from the literature, especially from work on semantic-pragmatic interaction, is the influence of different linguistic structures on the derivation of scalar implicatures (Verstraete, 2005; Baker et al., 2009; van Tiel et al., 2016, 2019; Gotzner et al., 2018; Sun et al., 2018; van Tiel and Pankratz, 2021). For instance, van Tiel et al. (2016) assessed the triggering phenomenon for 43 weak and strong scalar pairs drawn from a variety of grammatical categories, including adjectives, auxiliary verbs, main verbs, and adverbs. Results showed that endorsement rates for the tested scalar terms are highly variable: crucial factors accounting for this variance include whether the strong scalar expression denotes an upper bound and the nature of the underlying measurement scales these expressions participate in (see also Gotzner et al., 2018).

This line of research currently casts doubt on the assumption that a single mechanism accounts for all scalar implicatures and suggests that variability could either stem from different processing paths associated with the inference itself or from how alternatives are composed for different scalar expressions.

In a nutshell, the panorama presented above shows that the empirical landscape on scalar implicatures has three distinct patterns of variability that deserve attention and further investigation: inter-individual variability, variability triggered by differences in the apparatus of testing paradigms (i.e., methodological), and variability engendered by differences in the semantic structure of lexical scales (i.e., linguistic). What factor or factors are behind this variability? This paper has a twofold objective: one is to review the previous experimental record on scalar implicatures, and the other is to discuss the factors that underlie these variability patterns. In what follows, we discuss each category of variability in a separate section and showcase the underlying factors, with relevant discussion and directions for future research.

2. Inter-individual variability

Recent reports in the literature show that participants vary in the tendency with which they derive scalar implicatures: while some individuals consistently prefer logical readings, others prefer pragmatic ones (e.g., Noveck and Posada, 2003; Bott and Noveck, 2004; De Neys and Schaeken, 2007). One may ask why this occurs. This question has recently been the center of numerous discussions in the literature, and the general results suggest that several factors contribute to this phenomenon. These factors include, but are not limited to, individual differences in working memory capacity (Feeney et al., 2004; Dieussaert et al., 2011; Marty and Chemla, 2013; Antoniou et al., 2016), Theory of Mind ability (Fairchild and Papafragou, 2021; Khorsheed et al., 2022a), and other personality characteristics which may include one's social and/or communication skills (Nieuwland et al., 2010; Yang et al., 2018), systemizing skill (Pijnacker et al., 2009; Chevallier et al., 2010; Barbet and Thierry, 2016), language proficiency (Antoniou et al., 2020; Khorsheed et al., 2022a), and attitudes toward honesty and integrity (Feeney and Bonnefon, 2013; Mazzarella, 2015), among others. In our view, these variability factors can be divided into two distinct groups: internal cognitive factors and external social factors. While the internal factors are essentially related to the involvement of internal cognitive processes in scalar implicature derivation (i.e., working memory capacity and theory of mind), the external social factors pertain to personality traits and characteristics that may impact an individual's decision to accept or reject the scalar implicature (e.g., age, proficiency, politeness). The discussion below examines these two qualitatively distinct groups of factors and their effects on scalar implicature derivation.

2.1. Internal cognitive factors

2.1.1. Working memory capacity

Many studies agree that scalar implicature derivation is a process that draws on one's working memory resources. This observation was mainly supported by experiments utilizing response deadlines (Bott and Noveck, 2004; Bott et al., 2012), dual tasks (De Neys and Schaeken, 2007; Dieussaert et al., 2011; Marty and Chemla, 2013), and direct measures of working memory capacity (Antoniou et al., 2016; Fairchild and Papafragou, 2021). For instance, Bott and Noveck (2004, Experiment 4) tested the likelihood that scalar implicature computation involves cognitive resources by manipulating the time available to participants (i.e., a Long condition vs. a Short condition). While the former condition allowed participants a relatively long time to respond (3 s), the latter allowed only a relatively short time (900 ms). Crucially, the latter condition was designed to pressure participants and narrow down their cognitive resources. As Bott and Noveck (2004) reasoned, participants with more cognitive resources (i.e., in the Long condition) would draw scalar implicatures more often than participants with curtailed cognitive resources (i.e., in the Short condition). The task required French participants to read categorical sentences (e.g., “Some elephants are mammals”) and judge them as true or false, alongside control sentences. Bott and Noveck (2004) found that participants were more successful at interpreting the scalar implicature when given more time to draw on their working memory resources, but were less likely to draw the implicature when their cognitive resources were limited. Notably, this effect was only observed in the underinformative items and not in the patently true or patently false control sentences (for similar evidence, see De Neys and Schaeken, 2007; Bott et al., 2012; Marty and Chemla, 2013; Tomlinson et al., 2013). At the time, this finding had implications for studies interested in inter-individual variability: the propensity to be dominantly logical or pragmatic in a given task might be a proxy of individual differences in working memory capacity, such that those with greater working memory capacity may derive more scalar implicatures than participants with lower working memory capacity (see Feeney et al., 2004; Banga et al., 2009; Dieussaert et al., 2011; Janssens et al., 2014; Heyman and Schaeken, 2015; Antoniou et al., 2016).

For instance, in a direct attempt to investigate how individual differences in working memory capacity may influence the derivation rate of scalar implicatures, Antoniou et al. (2016) conducted a study that employed two measures of working memory alongside a pragmatic task. While the pragmatic task involved statements such as There are hearts on some of the cards, which participants had to judge as true or false based on a visual display showing hearts on all five cards, the working memory measures comprised a backward digit span task and a reading span task. In a regression model, Antoniou et al. (2016) tested the relationship between participants' scalar implicature derivation rate and their overall working memory scores, in addition to a battery of personality measures included in the model. Their results revealed that only working memory capacity could account for the variance in participants' scalar implicature derivation rate. Specifically, individuals with greater working memory capacity were more likely to reject underinformative sentences compared to those with lower working memory capacity. Their finding was replicated in some studies (e.g., Yang et al., 2018; Fairchild and Papafragou, 2021), but not observed elsewhere (Banga et al., 2009; Dieussaert et al., 2011). For instance, Banga et al. (2009) investigated the relationship between working memory capacity and scalar implicature derivation in sentences such as Some elephants have trunks, and their findings did not reveal any significant difference in the rate of scalar implicatures between individuals with lower and higher working memory abilities (see also Janssens et al., 2014; Heyman and Schaeken, 2015). This discrepancy prompts inquiries regarding the potential factors that contribute to the presence or absence of a working memory effect. Could it be attributed to a confounding variable?

Antoniou et al. (2016) suggest that, besides the cognitive effort required for the calculation of the implicature, some experimental designs place greater demands on participants' cognitive resources than others, and that this is a potential explanation for the observed discrepancies in the literature. In a similar vein, Heyman and Schaeken (2015) propose that the cognitive cost invested in scalar implicature derivation may be relatively small, placing little demand on working memory resources, so that participants with limited working memory capacity can still derive pragmatic interpretations in proportions comparable to those with high working memory capacity. Indeed, these two views seem to be corroborated by the experimental evidence obtained from dual tasks (e.g., De Neys and Schaeken, 2007; Dieussaert et al., 2011; Marty and Chemla, 2013). For instance, Dieussaert et al. (2011) used a dual-task methodology in which adult participants were instructed to evaluate the truth value of underinformative statements like Some tulips are flowers while their executive working memory was experimentally burdened by concurrent memorization of simple and complex dot patterns: low vs. high cognitive load. Interestingly, Dieussaert et al. (2011) found no direct relationship between participants' memory scores and the proportion of pragmatic responses, but they observed an interaction effect between working memory capacity and the cognitive load imposed by the memorization task. More specifically, Dieussaert et al. (2011) showed that low working memory capacity itself does not lead to fewer pragmatic interpretations unless an additional cognitive load is imposed.
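
To make the statistical logic of such an interaction concrete, the following is a minimal, hypothetical sketch in Python. The simulated data, variable names, and coefficients are our own illustrative assumptions, not Dieussaert et al.'s (2011) materials or analysis; the sketch merely shows what a working memory by cognitive load interaction looks like in a trial-level logistic regression.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000  # simulated trials with underinformative sentences

# Hypothetical predictors: a centered working-memory score and a load condition
# (0 = low dot-pattern load, 1 = high dot-pattern load).
wm_span = rng.normal(0, 1, n)
load = rng.integers(0, 2, n)

# Assumed data-generating pattern: working memory matters mainly under high load,
# mirroring the interaction described in the text (not the published estimates).
logit_p = 0.8 + 0.1 * wm_span - 1.0 * load + 0.6 * wm_span * load
pragmatic = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))  # 1 = pragmatic ("false") response

df = pd.DataFrame({"pragmatic": pragmatic, "wm_span": wm_span, "load": load})

# Logistic regression with the critical interaction term.
model = smf.logit("pragmatic ~ wm_span * load", data=df).fit()
print(model.summary())
# A reliable wm_span:load coefficient, rather than a main effect of wm_span alone,
# is what "an interaction between working memory capacity and cognitive load" means here.
```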

While these aforementioned explanations for the working memory discrepancy seem plausible from a methodological standpoint, we contend that the presence or absence of a working memory effect in scalar implicature derivation may reflect the varying effort that participants expend to make the linguistic utterance maximally relevant to their expectations. This viewpoint is consistent with Relevance Theory (e.g., Sperber and Wilson, 1986), which asserts that human cognition is geared toward the maximization of relevance in communication and that the processing of an utterance, and the mental effort associated with it, may vary greatly with the extent to which an individual is willing to bridge the gap between the linguistic meaning of an utterance and the intended meaning of the speaker. This bridging process may involve enrichments, revisions, and re-organizations of existing beliefs and plans. As such, the processing cost observed in computing scalar implicatures and the demands imposed on working memory may vary as a function of the extent to which an individual engages in deeper mind-reading activity, or Theory of Mind (ToM) (Noveck, 2018; Fairchild and Papafragou, 2021; Ronderos and Noveck, 2023). To make this observation more evident, the following subsection discusses ToM in relation to working memory capacity and scalar implicature derivation.

2.1.2. Theory of mind

The post-Gricean view on how scalar implicatures are made and entertained suggests that a Theory of Mind (ToM) component is integrated into the process responsible for inference-making (Noveck and Sperber, 2007; Noveck, 2018). In essence, ToM refers to one's ability to attribute beliefs, intents, and desires to others and to use these attributions to make predictions about another's behavior (e.g., Apperly et al., 2010; Apperly, 2012; Bergen and Grodner, 2012). Recent research in experimental pragmatics has revealed a reliable relationship between ToM and pragmatic interpretations (Marocchini and Domaneschi, 2022), including scalar implicatures (Fairchild and Papafragou, 2021; Ronderos and Noveck, 2023). However, a currently debated question is whether involvement in ToM reasoning underlies the behavioral variation in scalar implicature derivation between individuals. An indirect response to this question comes from Antoniou et al. (2016), whose results revealed a negative relationship between participants' age and the rate of pragmatic responses. Antoniou et al. tentatively explained this link between age and the derivation rate of scalar implicatures by suggesting that older adults are less likely to employ ToM reasoning than younger adults (see also Bernstein et al., 2011; Henry et al., 2013), and hence generate scalar implicatures to a lesser extent than their younger peers. However, this explanation remained speculative and lacked empirical support. Antoniou et al. (2016) suggested that a conclusive link between ToM and derivation rate can only be established by employing a direct measure of ToM, such as the tasks utilized in previous studies on ToM (Keysar et al., 2000, 2003; Apperly et al., 2010).

Capitalizing on Antoniou et al.'s (2016) results, a recent study by Fairchild and Papafragou (2021) investigated the role of ToM in the observed variation in scalar implicature derivation. Recognizing that ToM reasoning requires cognitive effort (Epley et al., 2004; Apperly et al., 2010; Lin et al., 2010), the authors included executive function (EF), especially working memory capacity, as a control measure in their design. Their study employed five tasks: a dual scalar implicature task, an auditory backward digit span task, a simple scalar implicature task, and the Mind in the Eyes and Strange Stories tasks (see their first experiment). The composite score obtained from the digit span task and the high cognitive load trials embedded in the dual scalar implicature task was taken as a measure of EF, whereas the data obtained from the Mind in the Eyes and Strange Stories tasks were used to create a composite ToM score. Fairchild and Papafragou (2021) analyzed the relationships among these measures, and their preliminary results revealed that both EF and ToM are significant predictors of scalar implicature derivation rate: participants with better EF and ToM abilities are more inclined to adopt a pragmatic interpretation of sentences such as some dogs are mammals. Interestingly, however, when the shared variance between EF and ToM was controlled for, only ToM exhibited a unique contribution to the variability in pragmatic responses; that is, participants with higher ToM abilities tend to derive more scalar implicatures compared to those with lower ToM abilities (see also Khorsheed et al., 2022a for similar evidence). These results raise skepticism regarding the direct role of working memory in scalar implicature derivation: the presence or absence of a working memory effect (or processing cost) in scalar implicature computation seems to be contingent on the degree of ToM that an individual employs to discern a speaker's intention(s) (see Ronderos and Noveck, 2023, for a more lucid account). That said, the involvement of ToM in scalar implicature derivation is possibly both individual- and situation-particularized, and further investigation into these variability factors in scalar implicature processing may add valuable insights to this line of research.
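
As a rough illustration of the analytic move behind this conclusion, namely separating the shared variance of two correlated predictors, the sketch below uses simulated scores (hypothetical data of our own making, not Fairchild and Papafragou's dataset) to show how zero-order correlations can implicate both EF and ToM while a joint model reveals a unique contribution of only one of them.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300

# Simulated composite scores: EF and ToM are correlated, as in the study's logic.
ef = rng.normal(0, 1, n)
tom = 0.6 * ef + rng.normal(0, 0.8, n)

# Assumed data-generating pattern: only ToM carries unique weight for the
# rate of pragmatic ("false") responses to underinformative sentences.
si_rate = 0.5 + 0.15 * tom + rng.normal(0, 0.1, n)

df = pd.DataFrame({"si_rate": si_rate, "ef": ef, "tom": tom})

# Zero-order relationships: both predictors correlate with the SI rate...
print(df.corr().round(2))

# ...but when both are entered jointly, each coefficient reflects a unique
# contribution once the shared variance is controlled for; with this simulated
# pattern, only the ToM coefficient should remain reliable.
model = smf.ols("si_rate ~ ef + tom", data=df).fit()
print(model.summary())
```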

2.2. External social factors

As previously discussed, the external social factors cover various personality traits and characteristics that may partly impact the tendency with which one penalizes pragmatic violations (Nieuwland et al., 2010; Feeney and Bonnefon, 2013; Mazzarella et al., 2018; Yang et al., 2018; Terkourafi et al., 2020; Khorsheed et al., 2022a). Nieuwland et al. (2010) were the first to provide breakthrough evidence falling into this category. Their work examined the impact of processing underinformative statements (e.g., Some people have lungs) on the N400 event-related potential (ERP) and whether individual variation in N400 modulation is a product of differences in personality-related skills. As a measure of participants' pragmatic skill, Nieuwland et al. (2010) used the Communication subscale of the Autism Spectrum Quotient (Baron-Cohen et al., 2001) and found that participants with a low score on the Communication subscale (referred to as “pragmatically skilled participants”) are more sensitive to pragmatic violations than participants with a high score. The same effect was replicated by Zhao et al. (2015), who found that acoustically presented underinformative sentences (e.g., “Some tigers have tails”) elicit a prominent ERP effect in participants with a low score on the Communication and Social Skill subscales (“the high pragmatic ability group”).

In a similar vein, Barbet and Thierry (2016) recently showed that individuals with high systemizing skills, as measured by the Systemizing Quotient (SQ-R) (Wheelwright et al., 2006), reject pragmatic violations more frequently than those with low systemizing skills. Systemizing skill refers to the extent to which one can analyze systems, extract the rules that govern them, and predict their outputs (Baron-Cohen et al., 2003; Wheelwright et al., 2006). According to Baron-Cohen et al. (2003), individuals with a highly systemizing brain are more likely to attend to details and features in objects and systems, and they treat these particular details as measured variables. They can also identify the effect of a certain operation on the input by finding its effects elsewhere in the output (correlation rules): “If the speaker says X, then A (input) changes to B (output)”, reflecting a strong sensitivity to patterns. Given this, some hearers may base their judgments on statistical patterns that help them gauge the likelihood that a potential interpretation is relevant to the speaker's intended meaning. This view was supported by work on individuals with high-functioning autism and Asperger's syndrome (Pijnacker et al., 2009; Chevallier et al., 2010), which showed that participants with high systemizing abilities are more likely to reject sentences with pragmatic violations. This finding was also replicated among neurotypical individuals, where results revealed a positive relationship between one's systemizing ability and one's propensity to reject underinformative sentences (Barbet and Thierry, 2016).

According to several reports (Feeney and Bonnefon, 2013; Heyman and Schaeken, 2015; Antoniou et al., 2016; Mazzaggio and Surian, 2018), the personality-based account covers a broad range of factors that extend beyond those previously discussed. These include, but are not limited to, extroversion, agreeableness, conscientiousness, neuroticism, and openness, as measured by the Big Five Inventory (B5) (John et al., 2008); communication, attention control, attention switching, attention to detail, and imagination, as measured by the Autism Spectrum Quotient (ASQ) (Baron-Cohen et al., 2001); and politeness and honesty, as measured by the Honesty/Integrity/Authenticity scale (HIA) (Goldberg et al., 2006), among others. Nevertheless, it is noteworthy that while certain autistic traits seem to exhibit a significant relationship with the derivation rate of scalar implicatures, other general personality characteristics do not show such a link (see Heyman and Schaeken, 2015; Antoniou et al., 2016; Khorsheed et al., 2022a). This discrepancy could potentially be driven by theoretical considerations: while the aforementioned measures obtained from the ASQ and SQ-R are indicative of autism-related traits whose pronounced prevalence among some individuals may interact with the internal cognitive processes responsible for scalar implicature computation (see Noveck, 2018, p. 184–192), the effects obtained from the B5, HIA, or other general personality characteristics are possibly contingent on situational factors (e.g., Feeney and Bonnefon, 2013; Mazzarella, 2015; Holtgraves and Perdew, 2016; Terkourafi et al., 2020), or may be linked to metalinguistic and egocentric decisions. For instance, an individual may accept an underinformative statement even when they realize that the use of some in “Some cats have ears” is underinformative. We suggest that this variability is arbitrary and inconsistent, and therefore its impact on the derivation rate of scalar implicatures remains uncertain.

3. Methodological variability

As spelled out extensively by Noveck and Sperber (2007), experimental work needs to ensure that a given effect is robust. That is, one wants to replicate the same result over and over again across a variety of comparable tasks and testing paradigms. When two studies produce comparable outcomes, their findings bolster each other. However, when a study produces a new kind of result in a predictable manner, it pays to tease out the factors that underlie the observed effect.

In fact, for more than two decades now, studies on scalar implicatures have used a wide variety of techniques to examine whether the process responsible for scalar implicature derivation is cognitively costly or cost-free. The initial findings from the developmental literature (Noveck, 2001) and the adult literature (Bott and Noveck, 2004) showed that scalar implicatures involve processing costs. At the time, however, this evidence was considered counter-intuitive, and one worry was that the results seemed to be based on “reasoning” tasks, and thus risked not being generalizable to other tasks such as sentence processing. Accordingly, this led scholars to design diverse testing paradigms with varying degrees of veracity. In the context of adult processing, scalar implicatures were investigated using text comprehension vignettes (Breheny et al., 2006), sentence processing via eye-tracking paradigms (Huang and Snedeker, 2009a), and dual tasks showing that an added cognitive burden can impair pragmatic processing (De Neys and Schaeken, 2007). These follow-up studies further confirmed the initial finding that scalar implicatures involve processing costs (e.g., Dieussaert et al., 2011; Marty and Chemla, 2013; Tomlinson et al., 2013; Spychalska et al., 2016; van Tiel and Schaeken, 2017; Fairchild and Papafragou, 2021).

Over the last decade, however, the adult literature has witnessed several experimental and contextual manipulations under which no processing costs are observed. This trend was notably seen in the work of Grodner et al. (2010) (see also Bergen and Grodner, 2012; Politzer-Ahles and Fiorentino, 2013; Hartshorne et al., 2015; Barbet and Thierry, 2018). For instance, Grodner et al. (2010) proposed that if the partitive “Some of” is phonetically reduced to “summa,” if the paradigm does not involve numbers within the subitizing range, and if the context draws attention to the underinformative use of “summa” in contexts where “alla” (meaning “all of”) is the case, then “summa” cases may appear to be as reactive as “alla” cases, although not entirely (see their first-half “summa” results on p. 46). In a similar vein, Barbet and Thierry (2018) investigated scalar implicature on single words, including critical quantified terms such as “some,” by utilizing a Stroop-like task that was contextually neutral and not biased toward either an upper-bound or lower-bound reading. Their study aimed to explain discrepancies observed in previous context-dependent reading experiments, especially the processing effort observed in reading the “some-region” in Breheny et al. (2006) and Bergen and Grodner (2012), but not in Politzer-Ahles and Fiorentino (2013) or Hartshorne et al. (2015). Although Barbet and Thierry (2018) did not find an N450 effect associated with pragmatic interpretation, they did observe a P600 effect potentially indicative of pragmatic processing (Spotorno et al., 2013; Spychalska et al., 2016).

In the context of children and implicature understanding, much work has been devoted to accounting for Noveck's original discovery that younger children tend to be largely logical before gradually incorporating pragmatic considerations as they grow older. Specifically, Noveck (2001) investigated how 5-, 7-, and 9-year-olds evaluate utterances with weak logical terms (e.g., X might be Y) in contexts in which the stronger alternative (X must be Y) is true. Noveck's results showed that 5-year-old children, even when linguistically competent, respond more logically than older children and adults (i.e., treating might as compatible with must rather than as excluding it). As Noveck (2001) pointed out, children's ability to provide more nuanced interpretations of underinformative utterances develops progressively with age. Noveck's finding sparked considerable interest among researchers, leading them to explore whether children's limited engagement in pragmatic computations compared to adults represents a genuine developmental effect or is influenced by other experimental factors. This in turn prompted scholars to develop more user-friendly paradigms. While for some researchers the interest was to show that younger children can perform in an adult-like manner (Papafragou and Musolino, 2003), for others it was to demonstrate that a developmental effect still exists, even with simplified tasks (Pouscoulous et al., 2007).

For instance, Papafragou and Musolino (2003) conducted a study in Greece in which 5-year-old children and adults were presented with a set of vignettes narrated by an experimenter using toy props. One of these vignettes described three horses jumping over a log, followed by a summary provided by a puppet stating that “Some of the horses jumped over the log.” Papafragou and Musolino (2003) showed that an overwhelming majority of the children (88%) accepted the puppet's response as an accurate summary of the story, whereas the adults did not (i.e., akin to Noveck's main finding). However, it was proposed that children's apparently logical behavior could be a result of their limited awareness of the ultimate goal of the task, which is to penalize sentences that are not sufficiently informative. As such, in a follow-up experiment, Papafragou and Musolino (2003) made adjustments to the design and used more explicit instructions, including giving the children four warm-up lessons and training them to detect anomalous uses of scalar terms. The results of this experiment revealed that while children's ability to discern underinformative statements had indeed improved, their overall performance still fell notably short of that demonstrated by adults (~50 vs. 90%). The training effects also seemed to dissipate when the same participants were tested a week later (see Guasti et al., 2005).

In a more recent examination of the developmental effect identified by Noveck (2001), researchers proposed that children's non-adult behavior may be linked to artifacts inherent in the binary judgment task. It was suggested that children are highly sensitive to pragmatic violations but struggle to express their pragmatic abilities when tasked with providing binary truth-value judgments (Katsos and Bishop, 2011). To investigate this proposal, Katsos and Bishop (2011) devised a new task in which participants could rate the appropriateness of a speaker's utterance on a 3-point scale: for example, children could reward a puppet's underinformative response with a “small,” “big,” or “huge strawberry,” depending on how well it described a given story. In one of their stories, Katsos and Bishop (2011) had 5- to 6-year-old children watch a pile of five carrots on the left side of a screen while a mouse repeatedly crossed over from the right side and carried the carrots, one by one, back to its starting position. At the end of the narrative, an animated fictional character described the scene with a less-than-optimally informative statement: “The mouse picked up some of the carrots.” According to Katsos and Bishop (2011), children would exhibit sensitivity to pragmatic violations if their evaluation of the speaker's statement merited the middle reward rather than the largest one (indicating a response that is somewhat forgiving but still more stringent than complete agreement). The results revealed that children almost always assigned the speaker the middle reward on the scale, indicating a heightened sensitivity to underinformative sentences compared to previous findings using binary judgment tasks. This effect has also been replicated in other studies (Schaeken et al., 2018; Jasbi et al., 2019; Bleotu et al., 2021).

Given these findings, a question arises as to why children appear to be overwhelmingly “pragmatic” in rating tasks but logical in binary judgment tasks. In other words, which task provides a more valid measure of children's genuine inferential ability? Several scholars have sought to address this question. For instance, Noveck (2018) suggests that ternary judgment tasks, or tasks using rewards and ratings, are metalinguistic tasks that, in essence, ask participants to focus on an utterance's “appropriateness” and/or “how well” the speaker says it, and thus risk failing to provide a reliable measure of how children process underinformative utterances (see Noveck, 2018, p. 89–90 for alternative explanations). In contrast, binary judgment tasks are deemed reasoning-based tasks premised on the assumption that listeners should be able to recognize, without explicit instruction, that “the purpose of the task is not to determine whether a given sentence is true or false in a given context but rather whether the sentence in question can be used felicitously in that context” (Papafragou and Musolino, 2003, p. 269). Therefore, children's difficulty in deriving scalar implicatures in eye-tracking experiments (Huang and Snedeker, 2009b), and in contexts in which the communicative expectation is made maximally relevant (Papafragou and Musolino, 2003; Pouscoulous et al., 2007), casts doubt on the claim that the developmental effect is an artifact of an overt judgment response.

Recent priming studies on scalar implicatures suggest that these discrepancies may be attributed to scale-activating confounds that these tasks may give rise to during scalar implicature derivation: while the binary judgment task does not make the relevance of the stronger alternative pronounced (e.g., Bott and Frisson, 2022), the “rewarding” task has the potential to encourage participants to treat the task as being about an utterance's “goodness” as opposed to its interpretation (Skordos and Papafragou, 2016). Alternatively, children's main problem in processing scalar implicatures may lie in their inability to retrieve and activate alternatives (Barner et al., 2011; Skordos and Papafragou, 2016; Rees and Bott, 2018; Gotzner et al., 2020), possibly due to limited processing resources (Chierchia et al., 2001; Reinhart, 2004; Barner et al., 2011). However, recent work by Jasbi et al. (2019) indicates that adults, much like children, exhibit a preference for “weak” response options in 3-, 4-, and 5-point Likert scale tasks. As such, this finding seems to challenge the notion that children's inclination to deviate from a fully logical endorsement toward a “medial” response is due to the interplay between response options and scale activation, because adults, despite having the cognitive ability to call up the stronger alternative on the scale, still exhibit a preference for a “weak” response option in rating tasks.

In this review, our contention is that when children engage in tasks involving a scale or rewards, they perhaps assess the “fit” of a given utterance or situation based on its “numerical weight” on the scale, establishing a one-to-one correspondence. In other words, a “reward” that is smaller in size or value might be a product of a self-perceived estimation of “how appropriate” the utterance is, or “how appropriately” the speaker delivers it in a given situation (reminiscent of Noveck's (2018) explanation above). Should our observation be true, this harks back to Degen and Tanenhaus's (2015) naturalness account and their gumball paradigm: participants tended to show fast reaction times and high accuracy rates when some was meant to depict an “intermediate” number of gumballs, but a reversed pattern when some was meant to describe a small set size (0 gumballs) or an unpartitioned set size (all 13 gumballs), which is considered a less natural representation of the world. This being so, the results obtained from rating tasks raise concerns, as they may not truly reflect genuine inference-making but rather “exact description” readings.

The other aspect of our argument is that participants, both children and adults, may be more attracted to an “intermediate” response than to a strict, downright response in higher-cognition tasks: while the optimally false response to a scalar implicature is likely to constitute a fully-fledged inferential step (i.e., the upper limit of human processing capacity), an “intermediate” response may represent a more lenient response and a less refined understanding. Notwithstanding, this proposition remains doubtful in light of the qualitative feedback furnished by children in Papafragou and Musolino's study (2003, p. 267). The majority of children's verbatim justifications merely echoed the puppet's logical “some” (74%), whereas adults consistently invoked the stronger “all” on the same scale (98%). Such qualitative feedback in itself suggests that children's logical responses are characteristically different from adults' pragmatic responses, although children and adults do act superficially alike when a middle response option is added to the judgment task. That said, future research should place special focus on teasing out the factors underlying the effects observed in rating tasks. The results may have implications for methodological robustness in experimental pragmatics and the broader field of experimental research.

4. Linguistic variability

Recent empirical research suggests that different scalar terms differ in their potential to give rise to implicatures (Baker et al., 2009; Doran et al., 2012; Beltrama and Xiang, 2013; van Tiel et al., 2016, 2019; Gotzner et al., 2018; Simons and Warren, 2018; Sun et al., 2018). For instance, Doran et al. (2012) examined the triggering of scalar implicatures across a range of different scale types, including gradable adjectives, ranked orderings, and quantificational items. Results showed that pragmatic interpretations are less likely to arise for gradable adjectives than for quantifiers, cardinal numerals, or rank orderings. Beltrama and Xiang (2013) also provided evidence that adjectival scales behave differently from modal scales with respect to the implicatures they trigger. In the case of adjectival scales, weak adjectives (e.g., “decent”) always trigger implicatures negating the middle and strong scale-mates (e.g., “good,” “excellent”), but the middle adjectives, which constitute another potential trigger on the same scale, fail to generate an upper-bounding inference; the adjectives themselves therefore differ in the extent to which they give rise to implicatures. Beltrama and Xiang (2013) found no such difference in modal scales, such as <possible, likely, certain>. In other words, the modal scales exhibit clear trigger boundaries between the weak, middle, and strong terms, such that the use of a weaker term on the scale (i.e., “possible,” “likely”) gives rise to the proposition that the stronger scalar term (i.e., “certain”) does not hold. This adjective-modal discrepancy was explained in terms of boundedness, specifically the unbounded nature of adjectival measurement scales as opposed to the bounded nature of modal scales.

A more comprehensive investigation into the factors underlying the variability in scalar implicature derivation rates across different scale structures comes from work by van Tiel et al. (2016), who assessed the triggering phenomenon for 43 weak and strong scalar pairs drawn from a variety of grammatical categories, including adjectives, auxiliary verbs, main verbs, and adverbs. In their experiment, a fictional character made a statement involving a weak term, and participants were asked to decide whether or not this statement implied that the corresponding statement with a stronger member of the same scale was false. For example, participants were queried about whether the statement in (1) entailed the scalar inference in (2).

(1) John is attractive

(2) John is not stunning

The results of their study provided evidence that endorsement rates for the tested scalar terms were highly variable. For instance, while only a few participants endorsed the potential scalar inference in (2) triggered by the weak term attractive in (1), almost all participants endorsed the scalar inference in sentences associated with some. van Tiel et al. (2016) sought to explain the factors underlying this diversity in endorsement rates by examining the semantic distance between the weaker and stronger term, their association strength, the availability of the stronger term, its relative frequency, and the presence of an upper bound on the underlying measurement scale. van Tiel et al. (2016) demonstrated that upper boundedness and the semantic distance between scale-mates were the only significant factors affecting the rate of inferred implicatures (see also Stateva et al., 2019 for work on cross-linguistic differences relating to the numerical bounds of quantifiers).
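
To illustrate the shape of such an item-level analysis, the sketch below regresses simulated per-scale endorsement rates on two structural predictors. The data, predictor codings, and coefficients are invented for illustration only; they are not van Tiel et al.'s (2016) materials, codings, or results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_scales = 43  # van Tiel et al. tested 43 scalar pairs; the values below are simulated

# Hypothetical item-level predictors: whether the scale has an upper bound (0/1)
# and a normed semantic-distance rating between the weak and strong term.
bounded = rng.integers(0, 2, n_scales)
distance = rng.normal(0, 1, n_scales)

# Assumed pattern: endorsement rates rise with boundedness and semantic distance.
endorsement = np.clip(0.4 + 0.3 * bounded + 0.1 * distance
                      + rng.normal(0, 0.1, n_scales), 0, 1)

df = pd.DataFrame({"endorsement": endorsement,
                   "bounded": bounded,
                   "distance": distance})

# Linear regression over per-scale endorsement rates; the coefficients and the
# model's R-squared indicate how much of the scalar diversity these two
# structural properties can account for in this toy dataset.
model = smf.ols("endorsement ~ bounded + distance", data=df).fit()
print(model.params.round(2), "\nR^2:", round(model.rsquared, 2))
```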

Benz et al. (2018) later revisited van Tiel et al.'s (2016) methodology, raising the concern that their task may have triggered negative strengthening. In other words, the experimental material in van Tiel et al.'s study involved scalar terms whose interpretation may lead to negating the stronger scale-mate, which may in turn give rise to negative strengthening. For example, the utterance “John is not stunning,” in sentence (2) above, may be strengthened to convey that “John is rather ugly,” thereby triggering a confound that is incompatible with the semantic meaning of attractive. This means that participants may have derived the scalar implicature but decided to cancel it because the strengthened enrichment of the stronger scale-mate appeared to conflict with their interpretation of the implicature-modified weaker term.

In a follow-up study, Gotzner et al. (2018) tested a set of 70 adjective pairs balanced in scale structure and found that endorsements of the scalar implicature are “anti-correlated with the degree of negative strengthening of the stronger scale-mate.” Similar to the findings of van Tiel et al. (2016), upper-bound entailments and semantically distant scale-mates gave rise to higher endorsement rates in the scalar implicature task. Gotzner et al. (2018) also demonstrated that adjectives are not per se less likely to yield scalar implicatures; rather, their behavior depends on the scale structure underlying the semantics of the scalar expressions. For example, adjectival scales like <possible, certain>, which denote a lower and an upper bound, behave similarly to the <some, all> scale. Gotzner et al. (2018) found that polarity, the adjectival extremeness of the strong term (e.g., “gigantic” vs. “large”), and the nature of the standard invoked by the weaker scale-mate (minimum-standard adjectives like dirty vs. relative adjectives like large) accounted for about 66% of the observed variance in the endorsement rates.

More recently, this scalar diversity work has been taken to the domain of language processing (van Tiel et al., 2019; van Tiel and Pankratz, 2021). For instance, van Tiel et al. (2019) examined the processing of both positive and negative scalar words and found that rejecting underinformative utterances containing positive scalar words (e.g., “might,” “some,” “or”) consistently led to processing slowdowns, whereas rejecting utterances with negative scalar words (e.g., “low” and “scarce”) did not result in noticeable processing delays. This work proposed that the processing cost observed in previous studies may not generalize to the entire family of scalar words, and that different scalar words may undergo distinct processing mechanisms (see Khorsheed et al., 2022b for review and further discussion).

In another interesting study, Alexandropoulou et al. (2022) examined how the scale structure underlying different types of adjectives may affect the derivation of scalar implicatures. They employed an incremental decision task in which participants read temporarily ambiguous sentences and had to distinguish a target referent (“warm water”) from a competitor (“hot water”). The visual scene either contained a contrast item (cold water) or no contrast item (along with one or two unrelated distractors). Alexandropoulou et al. (2022) found distinct verification strategies for different classes of adjectives, following from the role of context in their lexical semantics (building on Aparicio et al., 2016). Specifically, the immediate visual context facilitated the derivation of scalar implicatures triggered by relative adjectives (“warm but not hot”), whereas scalar implicatures triggered by minimum-standard adjectives (“breezy but not windy”) were computed robustly and independently of the visual context. The authors concluded that different kinds of scalar meaning are computed incrementally and potentially in parallel.

Overall, an important unresolved question arising from this line of research is whether strong terms are less likely to be activated or whether different inference mechanisms are involved across different scalars (see also Gotzner and Romoli, 2022 for further discussion). To tease apart how alternatives are constructed across scales and how listeners derive inferences about those alternatives, future work should systematically compare different scalar expressions within well-established processing paradigms and derive predictions for their differing behavior based on their specific lexical-semantic properties. In essence, this body of work sheds new light on the borderline between semantic and pragmatic meaning.

5. Conclusion

This paper has reviewed the experimental record on scalar implicatures and explored three sources of variability in their derivation: inter-individual, methodological, and linguistic. It has highlighted how this variability has contributed to our understanding of the conditions, mechanisms, and factors involved in the emergence and processing of scalar implicatures, while also underscoring the challenges it poses to existing theories and scholars in the field. Currently, the empirical evidence does not conclusively favor one pragmatic account over others, leaving room for ongoing debates and discussions. In light of this, our paper provides new directions for future research that may aid in resolving key debates surrounding the processing of scalar implicatures.

Author contributions

AK prepared the first draft of the manuscript, except for Section 4 which was jointly composed with NG. Both authors have made revisions to the paper and share responsibility for all parts. Both authors contributed to the article and approved the submitted version.

Funding

This work was supported by the German Research Foundation (DFG) as part of an Emmy Noether grant awarded to NG (GO 3378/1-1) and open access funds by the University of Osnabrück.

Acknowledgments

The authors would like to express their thanks and appreciation to TS and to two reviewers whose comments and feedback on a prior version of this manuscript have greatly improved our paper.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Alexandropoulou, S., Discher, H., Herb, M., and Gotzner, N. (2022). “Incremental pragmatic interpretation of gradable adjectives: the role of standards of comparison,” in Semantics and Linguistic Theory, eds J. R. Starr, J. Kim, and B. Öney (El Colegio de Mexico and the Universidad Nacional Autónoma de México), 32, 481–497.

Antoniou, K., Cummins, C., and Katsos, N. (2016). Why only some adults reject under-informative utterances. J. Pragmat. 99, 78–95. doi: 10.1016/j.pragma.2016.05.001

Antoniou, K., Veenstra, A., Kissine, M., and Katsos, N. (2020). How does childhood bilingualism and bi-dialectalism affect the interpretation and processing of pragmatic meanings? Bilingualism 23, 186–203. doi: 10.1017/S1366728918001189

Aparicio, H., Xiang, M., and Kennedy, C. (2016). “Processing gradable adjectives in context: a visual world study,” in Semantics and Linguistic Theory, eds S. D'Antonio, M. Moroney, and C-R. Little (Palo Alto, CA: Stanford University; LSA and CLC Publications), 25, 413–432.

Apperly, I.A. (2012). What is “theory of mind”? Concepts, cognitive processes and individual differences. Q. J. Exp. Psychol. 65, 825–839. doi: 10.1080/17470218.2012.676055

Apperly, I. A., Carroll, D. J., Samson, D., Humphreys, G. W., Qureshi, A., and Moffitt, G. (2010). Why are there limits on theory of mind use? Evidence from adults' ability to follow instructions from an ignorant speaker. Q. J. Exp. Psychol. 63, 1201–1217. doi: 10.1080/17470210903281582

Baker, R., Doran, R., McNabb, Y., Larson, M., and Ward, G. (2009). On the non-unified nature of scalar implicature: an empirical investigation. Int. Rev. Pragmat. 1, 211–248. doi: 10.1163/187730909X12538045489854

Banga, A., Heutinck, I., Berends, S. M., and Hendriks, P. (2009). Some implicatures reveal semantic differences. Linguist. Netherlands 26, 1–13. doi: 10.1075/avt.26.02ban

Barbet, C., and Thierry, G. (2016). Some alternatives? Event-related potential investigation of literal and pragmatic interpretations of some presented in isolation. Front. Psychol. 7, 1479. doi: 10.3389/fpsyg.2016.01479

Barbet, C., and Thierry, G. (2018). When some triggers a scalar inference out of the blue. An electrophysiological study of a Stroop-like conflict elicited by single words. Cognition 177, 58–68. doi: 10.1016/j.cognition.2018.03.013

Barner, D., Brooks, N., and Bale, A. (2011). Accessing the unsaid: the role of scalar alternatives in children's pragmatic inference. Cognition 118, 84–93. doi: 10.1016/j.cognition.2010.10.010

Baron-Cohen, S., Richler, J., Bisarya, D., Gurunathan, N., and Wheelwright, S. (2003). The systemizing quotient: an investigation of adults with Asperger syndrome or high–functioning autism, and normal sex differences. Philos. Transact. R. Soc. London Ser. B Biol. Sci. 358, 361–374. doi: 10.1098/rstb.2002.1206

Baron-Cohen, S., Wheelwright, S., Skinner, R., Martin, J., and Clubley, E. (2001). The Autism Spectrum Quotient: evidence from Asperger syndrome/high functioning autism, males and females, scientists and mathematicians. J. Autism Dev. Disord. 31, 5–17. doi: 10.1023/A:1005653411471

Beltrama, A., and Xiang, M. (2013). “Is 'good' better than 'excellent'? An experimental investigation on scalar implicatures and gradable adjectives,” in Proceedings of Sinn und Bedeutung 17, eds E. Chemla, V. Homer and G. Winterstein (Paris: École normale supérieure), 81–98.

Benz, A., Bombi, C., and Gotzner, N. (2018). “Scalar diversity and negative strengthening,” in Proceedings of Sinn und Bedeutung, eds U. Sauerland and S. Solt (Berlin: ZAS), 22: 191–203.

Bergen, L., and Grodner, D. J. (2012). Speaker knowledge influences the comprehension of pragmatic inferences. J. Exp. Psychol. Learn. Mem. Cognit. 38, 1450. doi: 10.1037/a0027850

Bernstein, D. M., Thornton, W. L., and Sommerville, J. A. (2011). Theory of mind through the ages: Older and middle-aged adults exhibit more errors than do younger adults on a continuous false belief task. Exp. Aging Res. 37, 481–502. doi: 10.1080/0361073X.2011.619466

Bleotu, A. C., Benz, A., and Gotzner, N. (2021). Where truth and optimality part. Experiments on implicatures with epistemic adverbs. Exp. Linguist. Mean. 1, 47–58. doi: 10.3765/elm.1.4863

Bott, L., Bailey, T. M., and Grodner, D. (2012). Distinguishing speed from accuracy in scalar implicatures. J. Mem. Lang. 66, 123–142. doi: 10.1016/j.jml.2011.09.005

Bott, L., and Frisson, S. (2022). Salient alternatives facilitate implicatures. PLoS ONE 17, e0265781. doi: 10.1371/journal.pone.0265781

Bott, L., and Noveck, I. A. (2004). Some utterances are underinformative: The onset and time course of scalar inferences. J. Mem. Lang. 51, 437–457. doi: 10.1016/j.jml.2004.05.006

Breheny, R. (2019). “Language processing, relevance and questions,” in Relevance, Pragmatics and Interpretation, eds B. Clark, K. Scott, and R. Carston (Cambridge: Cambridge University Press), 42–52.

Breheny, R., Katsos, N., and Williams, J. (2006). Are generalised scalar implicatures generated by default? An on-line investigation into the role of context in generating pragmatic inferences. Cognition 100, 434–463. doi: 10.1016/j.cognition.2005.07.003

Chevallier, C., Noveck, I. A., Nazir, T., Bott, L., Lanzetti, V., and Sperber, D. (2008). Making disjunctions exclusive. Q. J. Exp. Psychol. 61, 1741–1760. doi: 10.1080/17470210701712960

Chevallier, C., Wilson, D., Happé, F., and Noveck, I. (2010). Scalar inferences in autism spectrum disorders. J. Autism Dev. Disord. 40, 1104–1117. doi: 10.1007/s10803-010-0960-8

Chierchia, G., Crain, S., Guasti, M. T., Gualmini, A., and Meroni, L. (2001). “The acquisition of disjunction: Evidence for a grammatical view of scalar implicatures,” in Proceedings of the 25th Boston University Conference on Language Development, Vol. 25 (Boston, MA), 157–168.

De Neys, W., and Schaeken, W. (2007). When people are more logical under cognitive load: dual task impact on scalar implicature. Exp. Psychol. 54, 128–133. doi: 10.1027/1618-3169.54.2.128

Degen, J., and Tanenhaus, M. K. (2015). Processing scalar implicature: a constraint-based approach. Cogn. Sci. 39, 667–710. doi: 10.1111/cogs.12171

Dieussaert, K., Verkerk, S., Gillard, E., and Schaeken, W. (2011). Some effort for some: Further evidence that scalar implicatures are effortful. Q. J. Exp. Psychol. 64, 2352–2367. doi: 10.1080/17470218.2011.588799

Doran, R., Ward, G., Larson, M., McNabb, Y., and Baker, R. E. (2012). A novel experimental paradigm for distinguishing between what is said and what is implicated. Language 88, 124–154. doi: 10.1353/lan.2012.0008

Epley, N., Keysar, B., Van Boven, L., and Gilovich, T. (2004). Perspective taking as egocentric anchoring and adjustment. J. Pers. Soc. Psychol. 87, 327–339. doi: 10.1037/0022-3514.87.3.327

Fairchild, S., and Papafragou, A. (2021). The role of executive function and theory of mind in pragmatic computations. Cogn. Sci. 45, e12938. doi: 10.1111/cogs.12938

Feeney, A., and Bonnefon, J. F. (2013). Politeness and honesty contribute additively to the interpretation of scalar expressions. J. Lang. Soc. Psychol. 32, 181–190. doi: 10.1177/0261927X12456840

Feeney, A., Scrafton, S., Duckworth, A., and Handley, S. J. (2004). The story of some: everyday pragmatic inference by children and adults. Can. J. Exp. Psychol. 58, 121–132. doi: 10.1037/h0085792

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., et al. (2006). The international personality item pool and the future of public-domain personality measures. J. Res. Pers. 40, 84–96. doi: 10.1016/j.jrp.2005.08.007

Gotzner, N., Barner, D., and Crain, S. (2020). Disjunction triggers exhaustivity implicatures in 4- to 5-year-olds: investigating the role of access to alternatives. J. Semant. 37, 219–245. doi: 10.1093/jos/ffz021

Gotzner, N., and Romoli, J. (2022). Meaning and alternatives. Ann. Rev. Linguist. 8, 213–234. doi: 10.1146/annurev-linguistics-031220-012013

Gotzner, N., Solt, S., and Benz, A. (2018). Scalar diversity, negative strengthening, and adjectival semantics. Front. Psychol. 9, 1659. doi: 10.3389/fpsyg.2018.01659

Grodner, D. J., Klein, N. M., Carbary, K. M., and Tanenhaus, M. K. (2010). “Some,” and possibly all, scalar inferences are not delayed: evidence for immediate pragmatic enrichment. Cognition 116, 42–55. doi: 10.1016/j.cognition.2010.03.014

Guasti, M. T., Chierchia, G., Crain, S., Foppolo, F., Gualmini, A., and Meroni, L. (2005). Why children and adults sometimes (but not always) compute implicatures. Lang. Cogn. Process. 20, 667–696. doi: 10.1080/01690960444000250

Hartshorne, J. K., Snedeker, J., Liem Azar, S. Y.-M., and Kim, A. E. (2015). The neural computation of scalar implicature. Lang. Cognit. Neurosci. 30, 620–634. doi: 10.1080/23273798.2014.981195

Henry, J. D., Phillips, L. H., Ruffman, T., and Bailey, P. E. (2013). A meta-analytic review of age differences in theory of mind. Psychol. Aging 28, 826. doi: 10.1037/a0030677

Heyman, T., and Schaeken, W. (2015). Some differences in some: examining variability in the interpretation of scalars using latent class analysis. Psychol. Belg. 55, 1–18. doi: 10.5334/pb.bc

Holtgraves, T., and Perdew, A. (2016). Politeness and the communication of uncertainty. Cognition 154, 1–10. doi: 10.1016/j.cognition.2016.05.005

Huang, Y. T., and Snedeker, J. (2009a). Online interpretation of scalar quantifiers: Insight into the semantics-pragmatics interface. Cogn. Psychol. 58, 376–415. doi: 10.1016/j.cogpsych.2008.09.001

Huang, Y. T., and Snedeker, J. (2009b). Semantic meaning and pragmatic interpretation in 5-year-olds: evidence from real-time spoken language comprehension. Dev. Psychol. 45, 1723. doi: 10.1037/a0016704

Huang, Y. T., and Snedeker, J. (2018). Some inferences still take time: prosody, predictability, and the speed of scalar implicatures. Cogn. Psychol. 102, 105–126. doi: 10.1016/j.cogpsych.2018.01.004

Hunt, L., Politzer-Ahles, S., Gibson, L., Minai, U., and Fiorentino, R. (2013). Pragmatic inferences modulate N400 during sentence comprehension: evidence from picture-sentence verification. Neurosci. Lett. 534, 246–251. doi: 10.1016/j.neulet.2012.11.044

Janssens, L., Fabry, I., and Schaeken, W. (2014). 'Some' effects of age, task, task content and working memory on scalar implicature processing. Psychol. Belg. 54, 374–388. doi: 10.5334/pb.ax

Jasbi, M., Waldon, B., and Degen, J. (2019). Linking hypothesis and number of response options modulate inferred scalar implicature rate. Front. Psychol. 10, 1–14. doi: 10.3389/fpsyg.2019.00189

John, O. P., Naumann, L. P., and Soto, C. J. (2008). “Paradigm shift to the integrative Big Five trait taxonomy: History, measurement, and conceptual issues,” in Handbook of Personality: Theory and Research, 3rd Edn (New York, NY: The Guilford Press), 114–158.

Katsos, N., and Bishop, D. V. M. (2011). Pragmatic tolerance: implications for the acquisition of informativeness and implicature. Cognition 120, 67–81. doi: 10.1016/j.cognition.2011.02.015

Katsos, N., Cummins, C., Ezeizabarrena, M. J., Gavarró, A., Kraljević, J. K., Hrzica, G., et al. (2016). Cross-linguistic patterns in the acquisition of quantifiers. Proc. Natl. Acad. Sci. U. S. A. 113, 9244–9249. doi: 10.1073/pnas.1601341113

Keysar, B., Barr, D. J., Balin, J. A., and Brauner, J. S. (2000). Taking perspective in conversation: the role of mutual knowledge in comprehension. Psychol. Sci. 11, 32–38. doi: 10.1111/1467-9280.00211

Keysar, B., Lin, S., and Barr, D. J. (2003). Limits on theory of mind use in adults. Cognition 89, 25–41. doi: 10.1016/S0010-0277(03)00064-7

Khorsheed, A., Md. Rashid, S., Nimehchisalem, V., Geok Imm, L., Price, J., and Ronderos, C. R. (2022a). What second-language speakers can tell us about pragmatic processing. PLoS ONE 17, e0263724. doi: 10.1371/journal.pone.0263724

Khorsheed, A., Price, J., and van Tiel, B. (2022b). Sources of cognitive cost in scalar implicature processing: a review. Front. Commun. 7, 990044. doi: 10.3389/fcomm.2022.990044

Levinson, S. C. (2000). Presumptive Meanings: The Theory of Generalized Conversational Implicature. Cambridge, MA: The MIT Press.

Lin, S., Keysar, B., and Epley, N. (2010). Reflexively mindblind: using theory of mind to interpret behavior requires effortful attention. J. Exp. Soc. Psychol. 46, 551–556. doi: 10.1016/j.jesp.2009.12.019

Marocchini, E., and Domaneschi, F. (2022). “Can you read my mind?” Conventionalized indirect requests and Theory of Mind abilities. J. Pragmat. 193, 201–221. doi: 10.1016/j.pragma.2022.03.011

Marty, P. P., and Chemla, E. (2013). Scalar implicatures: working memory and a comparison with only. Front. Psychol. 4, 1–12. doi: 10.3389/fpsyg.2013.00403

Mazzaggio, G., and Surian, L. (2018). A diminished propensity to compute scalar implicatures is linked to autistic traits. Acta Linguist. Acad. 65, 651–668. doi: 10.1556/2062.2018.65.4.4

Mazzarella, D. (2015). Politeness, relevance and scalar inferences. J. Pragmat. 79, 93–106. doi: 10.1016/j.pragma.2015.01.016

Mazzarella, D., Trouche, E., Mercier, H., and Noveck, I. (2018). Believing what you're told: politeness and scalar inferences. Front. Psychol. 9, 1–12. doi: 10.3389/fpsyg.2018.00908

Nieuwland, M. S., Ditman, T., and Kuperberg, G. R. (2010). On the incrementality of pragmatic processing: an ERP investigation of informativeness and pragmatic abilities. J. Mem. Lang. 63, 324–346. doi: 10.1016/j.jml.2010.06.005

Noveck, I. (2018). Experimental Pragmatics: The Making of a Cognitive Science. Cambridge: Cambridge University Press.

Noveck, I., and Sperber, D. (2007). “The why and how of experimental pragmatics: the case of ‘scalar inferences',” in Advances in Pragmatics, ed N. Burton-Roberts (Basingstoke: Palgrave).

Noveck, I. A. (2001). When children are more logical than adults: experimental investigations of scalar implicature. Cognition 78, 165–188. doi: 10.1016/S0010-0277(00)00114-1

Noveck, I. A., and Posada, A. (2003). Characterizing the time course of an implicature: an evoked potentials study. Brain Lang. 85, 203–210. doi: 10.1016/S0093-934X(03)00053-1

Papafragou, A., and Musolino, J. (2003). Scalar implicatures: experiments at the semantics-pragmatics interface. Cognition 86, 253–282. doi: 10.1016/S0010-0277(02)00179-8

Pijnacker, J., Hagoort, P., Buitelaar, J., Teunisse, J. P., and Geurts, B. (2009). Pragmatic inferences in high-functioning adults with autism and Asperger syndrome. J. Autism Dev. Disord. 39, 607–618. doi: 10.1007/s10803-008-0661-8

Politzer-Ahles, S., and Fiorentino, R. (2013). The realization of scalar inferences: context sensitivity without processing cost. PLoS ONE 8, e63943. doi: 10.1371/journal.pone.0063943

Politzer-Ahles, S., and Matthew Husband, E. (2018). Eye movement evidence for context-sensitive derivation of scalar inferences. Collabra Psychol. 4, 1–13. doi: 10.1525/collabra.100

Pouscoulous, N., Noveck, I. A., Politzer, G., and Bastide, A. (2007). A developmental investigation of processing costs in implicature production. Lang. Acquis. 14, 347–375. doi: 10.1080/10489220701600457

Rees, A., and Bott, L. (2018). The role of alternative salience in the derivation of scalar implicatures. Cognition 176, 1–14. doi: 10.1016/j.cognition.2018.02.024

Reinhart, T. (2004). The processing cost of reference set computation: acquisition of stress shift and focus. Lang. Acquis. 12, 109–155. doi: 10.1207/s15327817la1202_1

Ronderos, C. R., and Noveck, I. (2023). Slowdowns in scalar implicature processing: isolating the intention-reading costs in the Bott & Noveck task. Cognition 238, 105480. doi: 10.1016/j.cognition.2023.105480

Schaeken, W., Van Haeren, M., and Bambini, V. (2018). The understanding of scalar implicatures in children with autism spectrum disorder: dichotomized responses to violations of informativeness. Front. Psychol. 9, 1266. doi: 10.3389/fpsyg.2018.01266

Simons, M., and Warren, T. (2018). A closer look at strengthened readings of scalars. Q. J. Exp. Psychol. 71, 272–279. doi: 10.1080/17470218.2017.1314516

Skordos, D., and Papafragou, A. (2016). Children's derivation of scalar implicatures: alternatives and relevance. Cognition 153, 6–18. doi: 10.1016/j.cognition.2016.04.006

Sperber, D., and Wilson, D. (1986). Relevance: Communication and Cognition. Cambridge, MA: Harvard University Press.

Spotorno, N., Cheylus, A., Van Der Henst, J.-B., and Noveck, I. A. (2013). What's behind a P600? Integration operations during irony processing. PLoS ONE 8, e66839. doi: 10.1371/journal.pone.0066839

Spychalska, M., Kontinen, J., and Werning, M. (2016). Investigating scalar implicatures in a truth-value judgement task: evidence from event-related brain potentials. Lang. Cognit. Neurosci. 31, 817–840. doi: 10.1080/23273798.2016.1161806

Stateva, P., Stepanov, A., Déprez, V., Dupuy, L. E., and Reboul, A. C. (2019). Cross-linguistic variation in the meaning of quantifiers: implications for pragmatic enrichment. Front. Psychol. 10, 957. doi: 10.3389/fpsyg.2019.00957

Sun, C., Tian, Y., and Breheny, R. (2018). A link between local enrichment and scalar diversity. Front. Psychol. 9, 1–12. doi: 10.3389/fpsyg.2018.02092

Terkourafi, M., Weissman, B., and Roy, J. (2020). Different scalar terms are affected by face differently. Int. Rev. Pragmat. 12, 1–43. doi: 10.1163/18773109-01201103

Tomlinson, J. M., Bailey, T. M., and Bott, L. (2013). Possibly all of that and then some: scalar implicatures are understood in two steps. J. Mem. Lang. 69, 18–35. doi: 10.1016/j.jml.2013.02.003

van Tiel, B., and Pankratz, E. (2021). Adjectival polarity and the processing of scalar inferences. Glossa 6, 32. doi: 10.5334/gjgl.1457

van Tiel, B., Pankratz, E., and Sun, C. (2019). Scales and scalarity: processing scalar inferences. J. Mem. Lang. 105, 93–107. doi: 10.1016/j.jml.2018.12.002

van Tiel, B., and Schaeken, W. (2017). Processing conversational implicatures: alternatives and counterfactual reasoning. Cogn. Sci. 41, 1119–1154. doi: 10.1111/cogs.12362

van Tiel, B., Van Miltenburg, E., Zevakhina, N., and Geurts, B. (2016). Scalar diversity. J. Semant. 33, 137–175. doi: 10.1093/jos/ffu017

Verstraete, J. C. (2005). Scalar quantity implicatures and the interpretation of modality: problems in the deontic domain. J. Pragmat. 37, 1401–1418. doi: 10.1016/j.pragma.2005.02.003

Wheelwright, S., Baron-Cohen, S., Goldenfeld, N., Delaney, J., Fine, D., Smith, R., et al. (2006). Predicting autism spectrum quotient (AQ) from the systemizing quotient-revised (SQ-R) and empathy quotient (EQ). Brain Res. 1079, 47–56. doi: 10.1016/j.brainres.2006.01.012

Wilson, D., and Sperber, D. (1998). “Pragmatics and time,” in Relevance Theory: Applications and Implications, eds R. Carston and S. Uchida (Amsterdam: John Benjamins), 1–22. doi: 10.1075/pbns.37.03wil

Yang, X., Minai, U., and Fiorentino, R. (2018). Context-sensitivity and individual differences in the derivation of scalar implicature. Front. Psychol. 9, 1–14. doi: 10.3389/fpsyg.2018.01720

Zhao, M., Liu, T., Chen, G., and Chen, F. (2015). Are scalar implicatures automatically processed and different for each individual? A mismatch negativity (MMN) study. Brain Res. 1599, 137–149. doi: 10.1016/j.brainres.2014.11.049

Keywords: scalar implicature, semantics, pragmatics, variability, testing paradigms, scales

Citation: Khorsheed A and Gotzner N (2023) A closer look at the sources of variability in scalar implicature derivation: a review. Front. Commun. 8:1187970. doi: 10.3389/fcomm.2023.1187970

Received: 16 March 2023; Accepted: 19 May 2023;
Published: 02 June 2023.

Edited by: Thomas L. Spalding, University of Alberta, Canada

Reviewed by: Salvatore Pistoia-Reda, University of Siena, Italy; Greta Mazzaggio, University of Florence, Italy

Copyright © 2023 Khorsheed and Gotzner. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Ahmed Khorsheed, amkhorsh@gmail.com; Nicole Gotzner, nicole.gotzner@uni-osnabrueck.de

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.