Extracting Summary Statistics of Rapid Numerical Sequences

Rosenbaum, David; Glickman, Moshe; Usher, Marius

doi:10.3389/fpsyg.2021.693575

ORIGINAL RESEARCH article

Front. Psychol., 01 October 2021

Sec. Cognitive Science

Volume 12 - 2021 | https://doi.org/10.3389/fpsyg.2021.693575

Extracting Summary Statistics of Rapid Numerical Sequences

David Rosenbaum¹^*

Moshe Glickman^2,3

Marius Usher¹

¹School of Psychological Sciences, Tel-Aviv University, Tel Aviv, Israel
²Department of Experimental Psychology, University College London, London, United Kingdom
³Max Planck Centre for Computational Psychiatry and Ageing Research, London, United Kingdom

We examine the ability of observers to extract summary statistics (such as the mean and the relative-variance) from rapid numerical sequences of two digit numbers presented at a rate of 4/s. In four experiments (total N = 100), we find that the participants show a remarkable ability to extract such summary statistics and that their precision in the estimation of the sequence-mean improves with the sequence-length (subject to individual differences). Using model selection for individual participants we find that, when only the sequence-average is estimated, most participants rely on a holistic process of frequency based estimation with a minority who rely on a (rule-based and capacity limited) mid-range strategy. When both the sequence-average and the relative variance are estimated, about half of the participants rely on these two strategies. Importantly, the holistic strategy appears more efficient in terms of its precision. We discuss implications for the domains of two pathways numerical processing and decision-making.

Introduction

Imagine a stock-market operator viewing a rapid sequence of stock returns on which she needs to make a fast buy/sell decision, or alternatively, a person who faces a crowd of people, each exhibiting a distinct emotional expression toward the person, who then needs to decide if to approach or not. In situations such as this, the rapid extraction of summary statistics of the elements (numerical returns or emotional expressions), in particular their average, has obvious advantages. Recent research over the last two decades has convincingly demonstrated that humans have a remarkable ability to extract summary statistics from large sets of visual elements, briefly presented together or in fast sequence, with regards to visual properties such as size, orientation, and even emotional expression (Ariely, 2001; Dakin, 2001; Parkes et al., 2001; Chong and Treisman, 2005; Haberman and Whitney, 2011; Robitaille and Harris, 2011; Allik et al., 2013; Khayat and Hochstein, 2018; Rosenbaum et al., 2021). Similarly, it has been reported that humans can extract the arithmetic average (the simplest summary statistic) from rapid numerical (or numerosity) sequences (Malmi and Samson, 1983; Brezis et al., 2015; Katzin et al., 2021). For example, Brezis et al. (2018) reported that human observers are able to provide quite accurate estimates of numerical average for sequences of 2-digit numbers (sequence length 6–12) that are presented at a rate of 4/sec. Moreover, they demonstrated the precision of these estimates increases with sequence length, and decreases with the sequence mean and variance. Finally, they proposed a model based on a population of broadly number-magnitude detectors (Dehaene, 1992, 2007; Feigenson et al., 2004), which accounted for all these data patterns (see next section). Furthermore, in a recent study, Hadar et al. (2021) have supported the idea that averaging is a type of gist extraction that is facilitated by an “abstract” mind-set (Gilead et al., 2019).

While this extensive research has focused on the extraction of an ensemble-average, there is less research on the extraction of higher order summary statistics, such as the variance (Kareev et al., 2002; Morgan et al., 2008; Solomon, 2010; Bronfman et al., 2015; Ward et al., 2016), in particular for numerical sequences. This shortage stands out, given the well-known impact of the variance of a payoff-set on risk-preferences (Markowitz, 1952; Weber, 2010; Summerfield and Tsetsos, 2012; Zeigenfuse et al., 2014; Erev et al., 2017; Glickman et al., 2018; Vanunu et al., 2019, 2020). One idea that has been proposed in recent research is that summary statistics (such as the sum, the average or the variance) are automatically extracted and relied on to form intuitive preferences (Betsch et al., 2001; De Gardelle and Summerfield, 2011; Brusovansky et al., 2019; Vanunu et al., 2019). Note, however, that in order to account for risk-preferences, the intuitive system (Glöckner and Betsch, 2008) needs to extract not only the average but also a measure of the alternative-variance (Weber, 2010; Vanunu et al., 2019). Finally, intuitive decisions between complex alternatives are often subject to marked individual differences (Newell and Shanks, 2003; Glöckner and Betsch, 2008; Betsch and Glöckner, 2010; Brusovansky et al., 2018, Brusovansky et al., 2019). While some people form preferences that reflect summary statistics, others form preferences that are based on simplifying heuristics that focus on a single aspect of the information (e.g., Take The Best; TTB). Experimental context is another factor that appears to affect the weight that people give to outlier elements¹ (De Gardelle and Summerfield, 2011; Vandormael et al., 2017; Vanunu et al., 2020). Similarly, Vanunu et al. (2019) have shown that the weight by which outlier elements contribute to risk preferences depends on how the evaluations are made, one at a time or in groups (implying a complexity cost to the estimations of several summary statistics, in parallel). It is thus important to examine if similar individual differences take place in the explicit estimation of summary statistics and to better understand the factors that determine the type of the estimation strategy a person is likely to deploy.

Motivated by these shortcomings, the aim of this paper was to probe further into the capacity of human observers to extract summary statistics of rapid numerical sequences. In particular, we extend previous investigations with regards to the following questions: (i) Can human observers extract higher order summary statistics (such as the relative sequence-variance²) at the same time as they extract the average? (ii) What types of mechanisms do people deploy when making such estimations? We contrast between holistic frequency-based estimations and sequential rule-based and working-memory limited strategies such as the mid-range; see next section for details. (iii) Are there individual differences in the mechanism or the strategy that human observers deploy in these tasks? (iv) Does the frequency with which participants deploy such strategies depend on the type of summary statistic that are required and on the distribution of values that are used as samples, and (v) which of those strategies (or mechanisms) result in more efficient estimations?

We start with a brief computational section that provides the motivation for our experiments. Following, we present four experiments in which human observers were asked to evaluate summary statistics of rapid (4 Hz) numerical sequences. In Experiment 1 (a and b) only the average is evaluated, while in Experiments 2 and 3, the participants are required to make two estimations for each sequence: average and relative-variance in Experiment 2, and average and confidence in Experiment 3. Finally, we present a computational analysis of the data, which is focused on individual differences with regards to the estimation mechanism and we examine their relative efficiencies.

Computational Modeling of Numerical Averaging: Dependency on Sequence-Length

We are focusing on the intuitive averaging of the type that people can make for rapid sequences (four per second) of two digit numbers under time pressure (see section Methods of Experiment 1). This belongs to the domain of approximate numerical estimation (Dehaene, 1992; Dehaene et al., 1998; Feigenson et al., 2004) and excludes an explicit computation of the average based on summing the numbers and division by the sequence length. Malmi and Samson (1983) considered two alternative ways of computing a sequence average: (i) a running average with decreasing weights, (ii) the estimation of the “center of mass” from a frequency distribution, and they concluded in favor of the latter. More recently, Brezis et al. (2015), Brezis et al. (2018) population averaging model (see Figure 1 for illustration), provides a computationally explicit mechanism that generates a noisy frequency representation of a numerical sequence and estimates its center of mass.

FIGURE 1

Figure 1. The population averaging model. (A) Numbers activate broadly tuned number- magnitude detectors. (B) In this illustration we show the noisy population profile in response to the presentation of an example sequence of 3 numbers: 20, 50, 80. (C) The estimation of the average is based on the central of mass of the population response (reproduced from Brezis et al., 2015).

Accordingly, when presented with a sequence of numbers, the sequence's average is estimated from the population-averaging of the number-magnitude tuned activation profile, by weighting the contribution of each detector's activity according to its preferred number magnitude (Georgopoulos et al., 1986). At display's offset, the center of mass is estimated by summing each neuron's detector's activation multiplied by its preferred magnitude and dividing by the sum of the overall network's activity (see Brezis et al., 2018, for a biological plausible neural implementation):

\begin{array}{l} M e a n = \frac{(\sum F_{i} \cdot T_{i})}{\sum F_{i}} & (1) \end{array}

where for each detector i, F = firing rate; T = the detector's preferred magnitude. The sum is taken over all numerosity detectors.

The population averaging model makes a particular prediction on how the precision of the estimate depends on the length of the sequence. As the sequence-length (n), increases, the frequency representation becomes less noisy (due to Poisson variability of neural detectors, which decreases with the activation magnitude) and therefore the precision of the estimate increases (Figure 2). This is thus equivalent to a computation of the average, based on noisy representations of its magnitudes.

\begin{array}{l} M e a n = \frac{\sum_{i = 1}^{n} X_{i} + ε_{i}}{n}, ε N (0, σ^{2}) & (2) \end{array}

where, X_i is the ith item of the sequence and σ²is the variance of the encoding noise.

FIGURE 2

Figure 2. RMSD (root mean square deviations) as a function of sequence length- predictions for different models. The dashed rectangle indicates the relevant range of sequence-lengths in our experiments. In all simulations, sets of n numbers in the range of 0–100 are sampled from a uniform distribution. Blue and black: Normative-holistic and population averaging models. Both models show similar predictions in which accuracy improves as sequence length increases. Red: Analytic-random model with k = 3. Accuracy deteriorates as sequence length increases. Magenta: Mid-range model. Non-monotonic function. Shows a mild improvement as sequence length increases in the relevant experimental range.

For simplification, we will here approximate the population averaging with this simplified model, which we label the normative-holistic averaging model. Normativity here corresponds to the property that all the items are used and weighted equally in the estimation of the average, while holism involves the idea that the estimation does not rely on a sequential application of symbolic rules and is not subject to WM capacity limitations [see contrast with heuristic models below, and Glöckner and Betsch (2008) for a similar model that is normative and yet automatic]. The normative-holistic model predicts that the precision of estimation would improve with $\frac{1}{\sqrt{n}}$ (Figure 2 blue line), which results from the averaging of the encoding noise, and is similar to that of the population-averaging model (Figure 2 black line).

We contrast the normative-holistic model estimation with a heuristic computation based on symbolic representations of the numbers that are subject to working memory capacity limitations (Ashcraft, 1992). The simplest such model, assumes that out of the n-numbers, the observers are able to remember a few random samples (2 or 3) and use them to evaluate the average of the sequence by computing their mean.³ For the case in which the samples stored in working memory are random, we label this the analytical-random model, as it relies on explicit (rule based) computation of the average based on what is available in working memory.⁴ This analytic-random model is a variant of the procedural arithmetic model (Anderson et al., 1996), in which procedural operations are serially performed on symbols available in WM. Since with increasing n, the few WM-samples provide a less accurate representation of the set, the precision of the analytical-random model decreases with n (Figure 2, red line for k = 3).

It is possible, however, to consider a more sophisticated analytic version of this strategy, which still computes the average based on few (e.g., two samples), but selects those samples in a non-random way (Myczek and Simons, 2008; but see Chong et al., 2008). For example, an efficient way to select two samples is to select the maximum and minimum of the sequence, resulting in the so called, mid-range heuristic.⁵ As shown in Figure 2 (magenta line), the mid-range heuristic predicts a milder improvement with sequence-length (for n > 7), since the mid-range provides a better estimate of the average, the longer the sequence is. Note that despite using one less samples (2 instead of 3) the mid-range precision is higher than that of the random-analytic with n > 3. However, this advantage comes with a cost. For the case that the numbers are selected from a non-symmetric, skewed distribution, the mid-range strategy predicts systematic deviations.

Summary of Computational Contrasts Between Estimations Strategies

We have contrasted several mechanisms that observers can deploy to estimate the average of a rapid numerical sequence. The normative-holistic mechanism assumes each item contributes to the estimate but is subject to a potentially large encoding noise, which averages out with n. The analytical-random model assumes an exact (symbolic) computation of the average based on few (2–3) random samples, and it predicts precision to decrease with n. Finally, the mid-range model, predicts a milder improvement in the precision estimate with n (for n > 7), but it also predicts that only the extreme values contribute to the average estimate. Thus, to contrast between estimations strategies, we will focus on how the precision of the estimate (computed either as RMSD or as a Pearson correlation between actual and estimated sequence-average⁶) depends on sequence-length. In addition, we focus on the decision weights of the ranked items (De Gardelle and Summerfield, 2011; Vandormael et al., 2017; Vanunu et al., 2020), to probe the deployment of mid-range strategies.⁷ Finally, we wish to examine if the estimation mechanism varies with task complexity (e.g., estimating the average only or both the average and the relative-variance, Vanunu et al., 2019) or with the distribution from which the values are sampled, which is likely to modulate gains based on prior expectations (Summerfield and De Lange, 2014).

To anticipate our results, we found that observers can estimate relative-variance of rapid numerical sequences remarkably well, while they also estimate the sequence average. Moreover, while we find individual differences in the estimation strategy, most observers deploy a holistic mechanism when the task requires average estimation only, and about half of them deploy such a holistic mechanism even when they estimate both the average and the relative variance. Furthermore, we found that the likelihood of these two strategies is affected by the distribution from which the values are sampled. Finally, we find that the participants who deploy the holistic mechanism tend to be more precise in their estimations.

Experiments

Four experiments were carried out to probe individual differences in the mechanism by which human observers evaluate rapid numerical sequences. In all the experiments the sequences presented two digit numbers at a rate of 4/s and sequence-length was randomly selected to be either 6 or 12 items (based on Brezis et al., 2015, we avoided shorter sequence lengths to discourage explicit computation strategies, which were found for sequences of 4 items). Previous research has indicated that the number magnitude is automatically encoded when subjects are presented with two digit numbers (Dehaene et al., 1990; Fitousi and Algom, 2019). In Experiments 1a and 1b, we require participants to evaluate only the sequence average (see Figure 3). These experiments differ in the distribution from which the numbers are sampled: uniform in 1a, and skewed in 1b. Under an adaptive observer assumption, skewed distributions are likely to make the deployment of a mid-range estimations less likely.⁸ In Experiments 2 and 3 we used (like in Experiment 1a) numbers that are sampled from a uniform distribution, but we required the participants to provide two estimates for each sequence. In addition to the average, Experiment 2 required an estimation of sequence-variance, while Experiment 3 required an estimation of the confidence in the estimation of the average. Since confidence is expected to reflect the uncertainty on the value of the estimate (Yeung and Summerfield, 2012; Lebreton et al., 2015) and this is likely to increase with the variance of the sequence, we consider the confidence estimation, an indirect/implicit estimation of the variance. Since mid-range (or min/max) strategies provide a simple and natural way to estimate the sequence-variance, we can expect that the deployment of this strategy will increase in Experiments 2 and 3 (see experimental methods for participants and procedure details for all experiments).

FIGURE 3

Figure 3. Illustration of an experimental trial (Exp. 1–2). (A) Each trial begins with a 250 ms fixation cross, after which a sequence of two-digit numbers is presented (250 ms/numeral). The sequence-length is 6 or 12 (randomized between trials). (B) The participants were asked to convey the sequence's average, by vertically sliding a mouse-controlled bar set on a number ruler (white bar) between 1 and 100. The number corresponding to the bar's location is being dynamically displayed (30 in the display).

We thus manipulated two factors in our experiments: (I) the distribution from which the values are sampled (Experiment 1a vs. Experiment 1b), and the task complexity (estimating only the average; Exp 1 (a/b) or estimating both the average and the relative-variance (Experiment 2 and 3). Finally, the number of participants (N = 25) in all our experiments was selected based on previous studies that relied on model selection to classify participants to decision strategies (Newell and Shanks, 2003; Glöckner and Betsch, 2008; Pleskac et al., 2019).⁹

Experiment 1a

Methods

Participants

Twenty-five undergraduate from Tel-Aviv University (Mean age = 22.2; SD = 1.7) participated in the experiment. All participants were naive to the purpose of the experiment and had normal, or corrected-to-normal, vision. Informed consent was obtained from all subjects. Participants were awarded with course credit for their participation. All procedures and experimental protocols were approved by the ethics committee of the Psychology department of Tel Aviv University (Application 743/12). All experiments were carried out in accordance with the approved guidelines.

Apparatus and Stimuli

Displays were generated by an Intel I7 personal computer attached to a 24″ Asus 248qe monitor with a 144 Hz refresh rate, using 1920 × 1080 resolution graphics mode. Responses were collected via the computer mouse. Viewing distance was approximately 60 cm from the monitor. The stimulus consisted of sequences of 6 and 12 numerical values which were presented with a presentation rate of 4 HZ and were uniformly distributed. To generate the sequences, we used a 3 × 3 orthogonal design of mean: 40, 50, or 60 and ranges (to control variance) of mean ± 10, mean ± 25 or mean ± 39. The numbers in each sequence were independently sampled from one of these nine distributions (random between trials).¹⁰

Procedure and Design

Each trial began with a fixation display that consisted of a black 0.2° × 0.2° fixation cross (+) that remained on the screen for 250 ms. Then, a sequence of 6 or 12 numbers (randomized between trials) was presented. Once the sequence terminated, the participants were required to estimate the sequence's mean value using an analog ruler that was displayed and ranged from 1 to 100 (see Figure 3). Participants underwent 300 experimental trials divided into 10 blocks. Each block terminated with performance-feedback (real-average/estimated-average correlation) and a short, self-paced break.

Results

We used two measures to quantify each subject's precision in averaging. The first is the Pearson correlation of the real and estimated averages of the sequences of each participant (see Figure 6A, for an example of an individual observer). The average Pearson correlation was high [r₍₂₉₈₎ = 0.75, SD = 0.12] and was significantly higher than zero (all participants' p's < 0.001). Second, we computed the root mean square deviation (RMSD) between the real averages and the participants' responses (note that higher value of RMSD imply lower accuracy). The RMSD was significantly lower than the simulated RMSD generated by randomly shuffling participant's responses across trials [Actual RMSD = 7.7; Shuffled RMSD = 14.8; t₍₂₄₎ = 18.7, p < 0.001, Cohen's d = 3.6].

To test the sequence-length effect, we carried out a paired sample t-test between the RMSD of the six items condition compared to the 12 items condition, which showed a significant difference. The precision improved (RMSD decreased) with sequence-length: six items sequences (Mean = 8.08, SD = 3.3) and 12 items sequences (M = 7.4, SD = 2.9); t₍₂₄₎ = 3.84; p < 0.01, Cohen's d = 0.77 (see Figure 4A). These results are consistent with the predictions of the normative-holistic model which predicts better performance for the 12 items sequences relative to the six items sequences due to the fact that uncorrelated single item noise averages out, but is also consistent with predictions of the mid-range model. In addition we found that precision decreases with sequence variance (see Figure 4B): repeated measure ANOVA with the within subject factor of the 3 groups of variance; F_(1.2,29.1) = 48.8; p < 0.001, $η_{p}^{2}$ = 0.67, indicating a linear trend.

FIGURE 4

Figure 4. (A) The sequence length effect for all four experiments (* corresponds to a statistically significant effect). In all experiments, the participants performed significantly better (lower RMSD) in the 12 items condition compared to the six items (at group level). (B) The Variance effect for all experiments (excluding 1b) which did not manipulated variance. All 3 experiments showed a linear trend in which precision decreases as variance increases.

In order to probe whether participants based their average estimations mostly on extreme values, as predicted by a midrange strategy (see Figure 5, red plot), or on all items with equal weights as predicted by a normative-holistic strategy (see Figure 5, blue plot), we ran separate regressions for each of the sequences' length conditions, using the sequence ranked numbers as predictors (see Methods section; in the twelve-items sequences we paired the items in order to have six predictors in both conditions) and we averaged both conditions weights. As shown in Figure 5A, the regression weights show a mild U-shaped pattern (black line), indicating a tendency to overweight extreme values as predicted by a mid-range strategy, but note that the decision weights are closer to normative-holistic model (blue line), according to which participants based their estimations on all items. A paired sample t-test between the weight given to inlying ranks [2–5] and outlying ranks [1 and 6] (Vandormael et al., 2017) was statistically significant; t₍₂₄₎ = 3.75, p < 0.001. The mean difference between outlying and inlying ranks across participants was 0.10.¹¹

FIGURE 5

Figure 5. Ranking-weighting profile of the numbers (A) Experiment 1a. Real data regression (black) compared with normative-holistic and mid-range predictions (blue and red, respectively). (B–D) Same analysis for Experiments 1b, 2, and 3, respectively.

We next examined individual differences. We found remarkable individual differences in the sequence-length effect (see Figure 6B) and in the curvature of the ranked decision weights (U-shape pattern), indicating the presence of different estimation strategies (Figure 6C). Remarkable individual differences in the precision of the estimations were also found (Figure 6D).

FIGURE 6

Figure 6. Experiment 1a (A) Trial by trial performance of a representative individual observer. The scatter plot depicts the participant's estimation (y-axis) for each of the presented number sequence averages (x-axis). Blue and red lines represent the regression and identity lines, respectively. (B) Individual differences in RMSD-difference (δ RMSD) between the two sequence length conditions (positive values indicate improvement in performance as sequence length increases; Participants are sorted by RMSD-difference (from the smallest from the smallest difference to the largest). (C) Individual differences in the difference between outliers' weight (the averaged coefficient of the first and last items in the sequence) and inliers' weight (the averaged coefficient of all items between the second and the one before last; sorted by outlier-inlier difference, from the smallest difference to the largest). (D) Individual differences in the correlations between the estimated average and the actual average (chance corresponds to r = 0). Participants in (B–D) are sorted independently.

Experiment 1b

Experiment 1b was identical to Experiment 1a, with the only exception that in the majority of trials the distributions used to generate the sequences were skewed rather than uniform (same means). In addition, the variance was not manipulated factorially, as in Experiment 1a, however, the trials had a high variability in their sequence mean and variance (see Methods section). Based on the adaptivity assumption, this was expected to reduce the reliance on the mid-range strategy (see footnote 8), as indicated in flatter ranked regression weights (closer to the blue than the red-line in Figure 5B).