The Fewer Reasons, the More You Like It! How Decision-Making Heuristics of Image Quality Estimation Exploit the Content of Subjective Experience

Leisti, Tuomas; Vaahteranoksa, Mikko; Olives, Jean-Luc; Peltoketo, Veli-Tapani; Häkkinen, Jukka

doi:10.3389/fpsyg.2022.867874

ORIGINAL RESEARCH article

Front. Psychol., 21 June 2022

Sec. Perception Science

Volume 13 - 2022 | https://doi.org/10.3389/fpsyg.2022.867874

This article is part of the Research TopicScene-Dependent Image Quality and Visual AssessmentView all 7 articles

The Fewer Reasons, the More You Like It! How Decision-Making Heuristics of Image Quality Estimation Exploit the Content of Subjective Experience

Tuomas Leisti¹^*

Mikko Vaahteranoksa²

Jean-Luc Olives²

Veli-Tapani Peltoketo²

Jukka Häkkinen¹

¹Department of Psychology and Logopedics, Faculty of Medicine, University of Helsinki, Helsinki, Finland
²Huawei Technologies Oy (Finland) Co., Ltd., Helsinki, Finland

Imaging science has approached subjective image quality (IQ) as a perceptual phenomenon, with an emphasis on thresholds of defects. The paradigmatic design of subjective IQ estimation, the two-alternative forced-choice (2AFC) method, however, requires viewers to make decisions. We investigated decision strategies in three experiments both by asking the research participants to give reasons for their decisions and by examining the decision times. We found that typical for larger quality differences is a smaller set of subjective attributes, resulting from convergent attention toward the most salient attribute, leading to faster decisions and better accuracy. Smaller differences are characterized by divergent attention toward different attributes and an emphasis on preferential attributes instead of defects. In larger differences, attributes have sigmoidal relationships between their visibility and their occurrence in explanations. For other attributes, this relationship is more random. We also examined decision times in different attribute configurations to clarify the heuristics of IQ estimation, and we distinguished a top-down-oriented Take-the-Best heuristic and a bottom-up visual salience-based heuristic. In all experiments, heuristic one-reason decision-making endured as a prevailing strategy independent of quality difference or task.

Introduction

It is deceptively easy to regard the visual quality of an image as something essentially objective. An image can be described almost exhaustively by measuring the light emitted from a display or reflected from a print. Therefore, it also appears that quality can be measured solely using this information. Nevertheless, only subjective evaluations offer first-hand data about quality, and even instrumental measurements of quality require a subjective reference, which the measurements eventually try to predict (Engeldrum, 2004b).

Why do subjective estimations play such a significant role if images are fully described by objective measures? The answer can be sought from the general idea of quality as a property, which can order images according to their utility, excellence, beauty, or simply preference (Janssen and Blommaert, 1997; Keelan, 2002; Engeldrum, 2004a). This ordering is logically possible only if quality is defined as a one-dimensional quantitative property. Hence, there must be a rule for how the inherent multidimensionality of image can be transformed into this quality dimension. The result, a quality scale, should be able to put products in order and should therefore conform both to the axioms of transitivity and completeness and to the general opinion about the quality in order to correctly predict consumer choices, which usually is the final goal of marketers, engineers, and designers. The quality scale is typically operationalized as a mean opinion score (MOS), which simply is a mean quality rating of the image given by a group of participants in subjective image quality (IQ) tests.

To understand the relation between the physical properties of an image and its quality scores, which reflect how these properties are perceived, much of the methodology of subjective IQ estimation has been adopted from psychophysics (Keelan, 2002; Jin et al., 2017). Imaging science has thus conceptualized quality mainly as a physical, objective phenomenon, and its psychological counterpart is only a subjective reflection of objective quality. The ultimate goal is to find appropriate psychophysical functions that can predict subjective ratings from objective properties of the image (Keelan, 2002; Engeldrum, 2004b; Jin et al., 2017). This is feasible, of course, if only one dimension, such as blur, varies between the images. However, when IQ is multidimensional, that is, images vary simultaneously according to several dimensions, such as blur, noise, contrast, and color saturation, the situation is more complicated because only some visual dimensions, such as color saturation, lightness, and hue, are integrated at the perceptual level (Garner and Felfoldy, 1970). Most quality dimensions are perceived as separate, and therefore, the multidimensional quality estimation task is essentially a judgment and decision-making problem. This is evident, for example, when the viewer must decide between blurry and noisy images. The first aim of this paper is to examine how research participants recruited to IQ tests make decisions about quality and how these decisions influence the test results.

Another challenge for the traditional psychophysical approach is that, psychologically, estimation of IQ is also a result of active, experiential, and interpretative activity. For example, we tend to think blur is an IQ defect, but professional photographers also use blur as an artistic effect or as a way to attract viewers’ attention. Furthermore, new camera phones use “bokeh” effect to create artistic-looking photos, simulating the narrow depth of field typical for photographs taken with professional cameras. Whether the viewer considers blur an advantage or a disadvantage therefore depends on the interpretation; participants tend to give higher ratings to IQ when blur is considered artistic (Radun et al., 2008). Thus, subjective quality is not merely a psychophysical function of physical image properties; quality estimations result from both subjective interpretation and objective, perceptible features of the image.

“Artistic” is one IQ attribute that is difficult to define using objective properties of an image, but it is not the only one. When people are asked to describe the reasons for their IQ ratings, they can use similar, rather abstract attributes such as “warm,” “atmospheric,” “good colors,” “vivid,” “soft,” or “fresh” (Leisti et al., 2009; Radun et al., 2016; Virtanen et al., 2019, 2020). Lower level properties, such as sharpness, noise, contrast, or color fidelity, are easier to measure objectively, but they do not present an exhaustive description of the subjective factors that determine the viewer’s experience of IQ (Radun et al., 2016). A wide semantic gap exists between the subjective descriptions of quality and the objective properties of an image. Therefore, the second aim of this paper is to examine what information research participants use in their decisions; we are interested in the decision space (Nyman et al., 2010) from which the attributes used in the quality evaluation are sought.

Interpretation-Based Quality: Probing the Experience of Quality

When quality evaluation is based on subjective aspects that are nearly impossible to measure objectively, the question is how to gain information about the crucial quality attributes and build a model describing the associations between physical properties, these subjective quality attributes, and overall quality. The Interpretation-Based Quality (IBQ) method was developed to understand the experiences that people exploit in their judgments of IQ (Nyman et al., 2005, 2006; Radun et al., 2008, 2010; Virtanen et al., 2019, 2020). Initially, the purpose was to bridge the semantic gap between low-level properties of the image and high-level attributes of subjective experience by examining how viewers interpret differences in quality in natural images. Radun and colleagues (Radun et al., 2008, 2010; Virtanen et al., 2020) did this by gathering subjective descriptions of quality from interviews of research participants about the relevant aspects of their quality judgments, analyzing these descriptions qualitatively, and exploring the underlying structure and dimensionality between these descriptions and physical stimuli. The IBQ method incorporated these interviews into experimental designs and controlled laboratory conditions such that the data provided by descriptions could be associated with instrumental data and experiment parameters using statistical and computational methods (Radun et al., 2008, 2010; Eerola et al., 2011).

The approach employed by the IBQ approach therefore represents subjective-to-objective mapping, which first describes the subjective phenomena, such as the subjective experience of IQ as it manifests in quality descriptions in this case, and then seeks the objective counterparts of the subjective attributes of experience (see Albertazzi, 2013; Felin et al., 2017). A similar approach has been employed in vision science when, for instance, visual illusions are used to study the functioning of the human vision system (Albertazzi, 2013). After describing the relevant dimensions of experience, models can be created that predict the quality experience and subsequent ratings on the basis of objectively measurable physical metrics (e.g., Eerola et al., 2011). This kind of top-down, interpretative approach complements the prevalent objective-to-subjective mapping tradition in IQ, adopted from psychophysics.

A significant difference exists between the subjective-to-objective and the objective-to-subjective approaches. When IQ estimation is considered similar to the estimation of lightness or contrast in simple stimuli, any disagreement between participants is regarded as error. If subjective experience is considered primary, however, quality evaluation is a preference task and no objectively correct answer exists. This preferential aspect is evident in the case of no-reference IQ in particular, where no “original,” unprocessed reference image exists, only different versions of the same scene (Engeldrum, 2004a). Photographs may not have, for example, a correct solution for lightness levels or color balance; instead, many equally natural solutions can exist (Felin et al., 2017). Moreover, it is questionable whether consumers want realistic photographs because they seem to prefer more colorful images (Janssen and Blommaert, 1997). Although the objective-to-subjective approach works well when estimating the visibility, thresholds, and saliency of image artifacts, it does not capture the meaning of these artifacts to the participant, particularly in complex, multidimensional everyday environments. Preferential or esthetic attributes, such as contrast, naturalness, and colorfulness, cause even more difficulties because their effects on IQ cannot be determined by visibility (Keelan, 2002).

What Is the Subjective Experience of Quality and Why Is It Important?

As the IBQ approach claims to examine the subjective experience of quality, it should also define this phenomenon. In philosophy and psychology, subjective experience refers to the pure, non-reflective content of consciousness such as seeing red or feeling anger or pain (Morsella, 2005; Baumeister and Masicampo, 2010). All of the relevant low-level phenomena of quality, such as blur, grain, colors, contrast, and lightness level, are experienced somehow and the participants’ ratings reflect judgments based on these experiences.

How are the physical properties of images experienced? The human visual system (HVS) consists of numerous feature detectors that are sensitive to different aspects of the visual scene such as line orientations, spatial frequencies, movement, and color (Zeki and Bartels, 1999; Kravitz et al., 2013). These functional aspects of the HVS have been adopted to the IQ metrics decades ago (Teo and Heeger, 1994; Sheikh and Bovik, 2006); the HVS-based IQ models use similar channels to process image information and apply knowledge about HVS properties, such as contrast sensitivity, to estimate the visibility of the defect. The problem with these HVS models lies in the subsequent step: what to do with these HVS-adapted visual features? The usual solution is just to sum all types of degradations, using the Minkowski metric rule, to derive an overall estimation of IQ (Engeldrum, 2002; Keelan, 2002; Jin et al., 2017). This bottom-up approach does not, however, take into account the meaning of the visual information. HVS-based models have been criticized for not considering, for instance, the structure of the image, leading to low correlations with MOS values (Wang et al., 2004).

What kind of picture emerges if the problem is approached from top-down and IQ is conceptualized as a subjective experience? In cognitive neuroscience, there is a converging consensus that the role of subjective experience is to integrate information from massive parallel sets of independent processors in the brain (Dennett, 2001; Baars, 2005; Morsella, 2005; Dehaene et al., 2006; Morsella et al., 2016). Therefore, the results from the detectors in the HVS are not experienced as such, and more importantly, their information is not mechanically summed in order to achieve an estimation of IQ. Instead, subjective visual experience is a result of active interaction between the bottom-up and top-down processes and interpretation of the resulting information, based on current task needs (O’Regan and Noë, 2001; Hochstein and Ahissar, 2002; Lappin, 2013). Visual experience emerges, when the bottom-up or feed-forward processes first provide a gist of the visual scene, and the top-down processes then amplify the details by focusing attention on the task-relevant aspects (Hochstein and Ahissar, 2002; Crick and Koch, 2003; Lamme, 2006; Kravitz et al., 2013). The “bandwidth” of subjective experience is relatively narrow, thus, only a minor subset of all visual information is represented in detail (Cohen et al., 2016). Eye movements, guided by involuntary and voluntary attention, are needed to acquire details over the entire visual scene.

Information from the feature detectors is integrated into percepts that are relevant from the action point of view (Cisek and Kalaska, 2010; Morsella et al., 2016). For example, when a participant’s task is to evaluate the IQ, information about different IQ features becomes available in subjective experience, enabling the individual to make the required decisions and complete the task. The interpretation of the task and image properties has a significant effect on the attention regulation of the participant in a quality estimation task (Radun et al., 2016). Attentional focus amplifies and attenuates visual information at the visual cortex, changing the way the image and its quality is experienced (Hochstein and Ahissar, 2002; Dehaene et al., 2006; Tse et al., 2013). What people experience is therefore highly context-dependent. Perception, decision-making, and motor control form a tightly interconnected, dynamic system (Cisek and Kalaska, 2010).

How Subjective Experience Becomes a Pairwise Choice

The two-alternative forced-choice (2AFC) method is the basis of many IQ grading systems (Keelan, 2002; Keelan and Urabe, 2003). The 2AFC method is sensitive and it enables testers to describe quality differences in just noticeable differences (JNDs; Keelan, 2002). Unlike category scales, such as Likert, JND provides an unambiguous, well-defined measure of quality difference between two images. It is therefore the unit of measurement of IQ standards such as quality ruler (Keelan and Urabe, 2003; Jin and Keelan, 2010).

The drawback of the 2AFC method is a narrow dynamic range. Large amounts of blur, noise, or color distortion exceed the threshold for consciousness without intention. Detection of artifacts is not probabilistic at this stage, and IQ cannot be scaled using probabilistic methods. Even when differences are multidimensional, saliency of certain attributes captures involuntary attention, providing a heuristic reason for rejecting the photograph. These kinds of decision tasks thus rely on simple choice heuristics, require little voluntary search for defects, and are easy, fast, and reliable (Gigerenzer et al., 1999). There is not much practical difference whether large differences in the task are one-dimensional (supra-threshold task) or multidimensional (heuristic task). Table 1 provides a schematic categorization of different IQ estimation tasks, differentiating them by two factors, dimensionality and quality difference within an image pair.

TABLE 1

Table 1. Image quality estimation tasks categorized according to the dimensionality of the differences and the magnitude of the overall quality difference.

When quality differences between alternatives within a pair are small, less than two JNDs, one-dimensional and multidimensional tasks become completely different tasks. I will call them threshold tasks and conflict-resolution tasks. When JND less than two in a one-dimensional task is caused solely by a small difference in visibility of a single attribute, in the multidimensional task, it can also be caused by conflict between dimensions. How participants make judgments and choices between blurry and noisy images, for instance, should be very different from one-dimensional threshold tasks, typical for psychophysics, where comparisons are made between images with different levels of blur.

When conflict emerges, a voluntary decision about the importance of different attributes is required. Here, a deliberative approach is an automatic brain reaction (Alter et al., 2007; Botvinick, 2007), which involves more detailed analysis of the attributes and conscious reasoning about the importance of different attributes (Shafir et al., 1993) in order to resolve the conflict. Subsequently, the decision process slows down because the task requires serial top-down control. Attributes in subjective experience form the “decision space” (Nyman et al., 2010; Morsella et al., 2016), which represents the aspects governing the choice.

Research on judgment and decision-making has traditionally suggested that a normative solution to such multidimensional choice problems is a compensatory strategy, which uses all available data and weights it according to its importance (e.g., Payne et al., 1988). In most cases, compensatory strategy requires too much time and cognitive resources (Simon, 1955), and there is much evidence that heuristic, one-reason strategies perform well in most real-life choices (Gigerenzer et al., 1999). In other words, for most decisions, only one reason is required for a satisfactory choice, which diminishes the time and effort involved. Some studies suggest that decision strategies gradually shift toward a more heuristic style, and compensatory strategies are more typical for novices (Garcia-Retamero and Dhami, 2009; Leisti and Häkkinen, 2018). Experts can therefore rely on more efficient strategies in their decisions.

When there is only one reason, the question that follows is what determines the specific reason. So far, it is known that the attributes unfolding in the decision space are dependent on the context such as image content (Radun et al., 2008). Not only is the visibility of artifacts dependent on the content, but also personal interpretation of image properties differs between contents. Therefore, solving the decision problem returns to the question of how the alternatives are interpreted and experienced. Subjective phenomena always have personal meaning that is not contained within the physical stimuli (Albertazzi, 2013), thus, the view that IQ consists of static component attributes, or “-nesses” that are subjective representations of objective image properties (Engeldrum, 2004b) becomes problematic.

The IBQ approach is based on the attribute data that participants produce spontaneously as reasons for choices. This differs from typical approaches that rely on psychophysical threshold tasks or category scales, where experimenters specifically prompt observers to evaluate quality on predefined attributes scales. The weakness of these ready-made scales is that the subjective decision space of the participants cannot be known beforehand, as the emerging set of attributes is dependent on the personal interpretation of the task (Radun et al., 2016). Asking consumers to evaluate products with a predefined attribute may interfere with their personal approach by diverting attention away from attributes that they would normally consider important (Tordesillas and Chaiken, 1999; Radun et al., 2016). This may not only change the weighting of the individual attributes (Wilson and Schooler, 1991) but may also interfere with the consumer’s experience of quality, which is dependent on the aspect receiving attention (Tse et al., 2013; Yamada et al., 2014).

Purpose of the Study

The purpose of this study is to describe the decision space that unfolds to participants when they are required to make decisions in a 2AFC task and how they use the attributes that emerge in this space. We are specifically interested in small multidimensional differences present between flagship camera phones. This is a context where we expect the quality deviations to be most dependent on experiential aspects and personal taste instead of defects, for which there are several instrumental measures available. Our approach is exploratory and focuses on the following aspects: IQ differences within image pairs, numbers of reported reasons, decision times, and specific IQ attributes.

Our introduction opened up two orthogonal research questions, the first concerning the roles of subjective experience and decision-making in IQ estimation, and the second differences between small and large quality differences. Cross-sections of these research questions yield four specific themes: what kind of reasons is reported when differences between images are small or large and what kind of decision strategies are applied to the attributes that emerge from those experiences when differences are small or large?

Experiment 1

Methods

Participants

Participants (N = 32) were recruited from the student email lists of the University of Helsinki to participate in an experiment about decision-making and IQ. The number of participants was dictated by the counterbalancing of the 32 stimulus images evenly in each condition between participants. We tested participants for visual acuity (Lea numbers), near contrast vision (near F.A.C.T.), and color vision (Farnsworth D15). All participants passed the tests. They received a movie ticket as compensation for their participation. The mean age of the participant group was 26.1 years (SD = 4.4). Of participants, 27 were females and five males.

Stimuli

The stimuli were based on 32 predetermined image contents that represented typical use scenarios of camera phones (Figure 1). Scenarios were selected according to their location in photospace (Keelan, 2002), which is a frequency distribution of photographs taken by ordinary users of point-and-shoot cameras, located in two dimensions according to their shooting distance and illumination level. Most frequent use cases in photospace were stressed in content selection, but the selection also included multiple skin colors and challenging cases, defined by the experts (authors MV and J-LO).

FIGURE 1

Figure 1. Image contents used in the study. Of 32 contents, 25 were in landscape mode (A) and seven in portrait mode (B). For certain contents, there were duplicates taken with back and front camera or with different zooming or flash settings. The models gave a written consent for the publication of the photographs.

We used four flagship camera phones from leading manufacturers to create four different versions for each content. A professional photographer took five different photographs of each content with each device. From these five photographs, the photographer chose the best image for the experiment based on his own opinion. The images (N = 128) were rescaled to 2,560 × 1,440 (landscape) or 1,080 × 1,440 (portrait) resolution. We presented each image on Eizo 27” ColorEdge CG2730 display, calibrated to sRGB color space, 120 cd/m² luminance, 2.2 gamma, and D65 white point. The displays had no known differences in uniformity. The ambient illumination in the laboratory was set at 20 lux, using D65 fluorescent lamps.

The stimulus material is available upon request from the corresponding author.

Procedure

After providing informed consent and passing the vision tests, participants were given the instruction that the task would be a paired comparison task, where they should choose the better of the two images. We emphasized that the task is subjective, i.e., there are no right or wrong answers. We asked the participants to consider which of the images they would save for general use, e.g., putting it in a photo album or on social media or showing it to family or friends.

We used the PsychoPy (Peirce, 2007) environment for creating the experiments. We used the 2AFC method, thus, with 32 contents and four devices for each content, there were altogether 192 image pairs for the 2AFC task. We divided the experiment into 32 blocks; in each block, participants evaluated six image pairs of single content in random order. In half of the blocks, we employed the IBQ method (Radun et al., 2008); participants provided explanations for their decisions. In the rest of the blocks, the participants only made choices, without explanation, to reduce the experiment duration.

The conditions with or without explanations formed super-blocks, consisting of half of the contents. We varied the order of these super-blocks and counterbalanced them between participants. In other words, half of the participants completed all of the silent contents first and then the contents with explanations, and vice versa. We counterbalanced also the contents within the super-blocks and randomized them between participants.

Within each trial, two stimulus images were presented on two parallel calibrated displays at the same time, and a third non-calibrated display next to the keyboard was used for answering. Simultaneously with the images, two buttons appeared on the response display for participants to indicate their preference. After selecting the better image, the text field appeared below the buttons for the participant to explain their choice in Finnish (in explanations condition). After this, the participant proceeded to the next image by pressing the “next” button below the text field. Participants could not proceed if they had not indicated a choice or the text field was empty. Between trials, a neutral gray rectangle replaced the images for 500 ms.

The Ethics Review Board in Humanities and Social and Behavioral Sciences of the University of Helsinki approved the experimental protocols of this study (decision no. 40/2017).

Data Analysis

Quantitative Analysis

When the images produced by the four different devices were compared pairwise, the total number of choices was six for each content. With 32 image contents, the total number of pairs was 192. We transformed the choice probabilities in each pair further into just noticeable differences (JNDs). We first used the logit transformation:

\begin{array}{l} logit (p) = ln (\frac{p}{1 - p}) & (1) \end{array}

where p and 1-p represent the choice probabilities of the images. We then rescaled logit values into JND values using the formula

\begin{array}{l} J N D = \frac{logit (p)}{logit (.75)} & (2) \end{array}

Qualitative Analysis

In the qualitative analysis of explanations for the choices, we followed the approach employed by Radun et al. (2008, 2010). We used Atlas.ti software (Muhr, 2004) for this purpose. Before the analysis, we imported the explanation data with identifiers for participant ID, image content, experiment trial, and image such that each attribute could be linked to each trial, stimulus, and participant. In the first phase, we coded each explanation with attribute codes that we found in the explanation, without making any interpretation of the meanings of the attributes; e.g., the explanation, such as “This photo is blurry and faded,” was coded with codes “blurry” and “faded.” In other words, codes denote a certain attribute that has been used to explain the selection or rejection of a certain image from a certain pair. We made the coding blindly, without knowing the identities of the cameras and the contents. In the second phase, we further streamlined the coding scheme by creating more exact definitions for each attribute code and merging similar attributes (e.g., “blurry,” “unsharp,” and “not sharp” into “unsharp”). After this, we re-analyzed the explanation data using these definitions and corrected possible deviations from these definitions.

Quantitative Analysis: Attributes

In addition to counts, we calculated other descriptive measures for attributes. First, we calculated measure of accuracy for each attribute i in each pair j:

\begin{array}{l} accura c y_{i j} = | \frac{n_{p} - n_{q}}{n_{p} + n_{q}} | & (3) \end{array}

where n_p and n_q are the counts of the attribute in each image p and q. When accuracy measures were calculated over several pairs or attributes, means were weighted according to the count of the attributes. We then calculated a measure of valence for each attribute to determine whether the attribute was considered positive or negative. This measure describes the proportion of attribute mentioned with the selected or rejected image in relation to all occurrences:

\begin{array}{l} valenc e_{i j} = | \frac{n_{selected} - n_{rejected}}{n_{selected} + n_{rejected}} | & (4) \end{array}

where n_selected are n_rejected counts of attribute i for selected and rejected images in each pair j. Overall valence was calculated as the weighted mean over all pairs, and the weight was determined by the attribute counts.

Results and Discussion

Subjective Attributes

Our qualitative analysis yielded 52 subjective IQ attributes¹ that participants mentioned more than once (Appendix A). Participants mentioned some aspect of sharpness, color, or lightness level most often as the principal reason for choice. In addition, there were 30 positive and 21 negative attributes that occurred only once in explanations and could not be merged with other attributes; these were omitted from further analyses.

The Number of the Reported Reasons in the 2AFC Trials

From the viewpoint of the reported reasons, participants’ choices can be explained by a rather simple heuristic strategy: in most trials, participants reported one attribute (Mean = 1.2; Standard Deviation = 0.51; later abbreviated as M and SD, respectively) for selecting and one attribute (M = 0.9; SD = 0.62) for rejecting an alternative. Based on the valence calculated for each attribute, only a small minority of the attributes given to the selected alternative were negative (M = 0.06; SD = 0.27). The same applied to the positive attributes given to the rejected alternative (M = 0.8; SD = 0.28). The number of attributes was approximately the same for all contents; the maximum mean number of positive attributes for selecting was 1.3 and the minimum 0.9. The corresponding figures for rejection and negative attributes were 1.1 and 0.5. This kind of answering scheme may have been also prompted by the test design, which included one field for explaining the selection and another for explaining the rejection.

The IQ Attributes and the Magnitude of the Quality Difference

We transformed the choice distributions within the pairs to JND values using logit transformation and then divided all 192 pairs into groups according to the quality differences between the alternatives. The step between groups was one JND. The first group (JND = 0) consisted of all pairs with a difference below 0.5 JNDs, the second group (JND = 1) with a difference between 0.5 JND and 1.5, etc. Figure 2A shows the distribution of pairs in these quality difference groups.

FIGURE 2

Figure 2. Distribution of quality differences (A) and the qualitative shift between one and three just noticeable differences (JNDs) illustrated by three measures: mean number of different attributes (B), mean accuracy of attributes (C), and mean decision time (D).

After dividing the trials into categories according to their quality differences, we calculated the mean number of different attributes, the mean accuracy of the attributes, and the mean response time in each category and plotted the results in Figure 2. Visual examination of Figure 2 suggests that decisions in pairs with large differences are made with a smaller number of different attributes (Figure 2B), with high accuracy (Figure 2C) and rapidly (Figure 2D), whereas larger variety of attributes, lower accuracy, and slower decisions are typical for small differences. Correlational analysis supports this impression: spearman correlations between quality difference (in JNDs) and the number of different attributes, mean accuracy, and mean decision time were r (190) = −0.49, r (190) = 0.58, and r (190) = −0.60, respectively (all p < 0.001).

We further analyzed the total number of attributes in each pair given by all participants, the total number of different attributes given by all participants, and the mean number of attributes given in each trial, and divided the attributes according to their valence (positive or negative) and whether they were given to the selected or the rejected alternative. The decrease in the number of different attributes (Figure 2B) is mainly due to a decrease in the number of reasons that conflict with the majority choice (Figure 3B). In other words, when the difference between the images is small, both the number of positive attributes for the rejected alternative and the number of negative attributes for the selected alternative are larger (Pearson correlation coefficients in Table 2, second row).

FIGURE 3

Figure 3. Mean number of attributes that participants used in each pair according to their quality difference (A), number of different attributes that participants used (B), and total number of reasons given in each pair (C). Black lines represent the number of chosen alternatives, gray lines the number of rejected alternatives, and solid lines and dashed lines the positive and negative attributes, respectively.

TABLE 2

Table 2. Spearman correlation coefficients between quality difference and total number of attributes, mean number of different attributes, and mean number of attributes in a choice.

When participants and pairs are examined individually, the number of reasons increases slightly as the preference difference increases (Spearman r = 0.28; p < 0.001; Figure 3A). This is due to the increasing numbers of positive attributes for the selected alternative and negative attributes for the rejected alternative (Table 2, third row). These increases are, however, rather low: an average from 1.05 to 1.2 for positive attributes and from 0.76 to 0.93 for negative attributes. Finally, Figure 3C illustrates the total counts of positive and negative attributes for better and worse alternatives as a function of quality difference (also Table 2, first row).

On an individual level, only one reason is usually required to justify a choice, independent of quality difference. However, when we examine the number of different attributes over a larger group of participants, the decision space expands significantly when differences are small. A larger number of different attributes indicates that with small quality differences participants’ attention diverges to different image properties and image areas due to a lack of salient quality defects. However, participants’ prevailing decision strategy does not seem to change at different quality levels.

To test the hypothesis that participants use the same decision strategy in all of their choices, independent of quality level or other factors, we divided all 3,074 choices, where reasons for choices were given, into quartiles according to their decision times and calculated mean numbers of attributes given in each quartile. The result is shown in Figure 4, suggesting that no radical change in decision strategy occurs when participants use more or less time to make a choice. We tested this by estimating the coefficient B₁ in a linear regression model:

\begin{array}{l} y = B_{0} + B_{1} t + B_{2} i d_{1} + B_{3} i d_{2} + B_{4} i d_{3} \dots + B_{n + 1} i d_{n} & (5) \end{array}

where t indicates the decision time and y the number of the attributes. To control for the effect of individual differences, we included participant identities in the model as dummy variables id₁…id_n that indicated the identity of the participant 1…n with value of 1, the value being otherwise 0.

FIGURE 4

Figure 4. Mean number of attributes in all choices, divided into quartiles according to the decision times. The x-axis represents the different quartiles of decision times. Black lines represent the number of chosen alternatives, gray lines the number of rejected alternatives, and solid lines and dashed lines the positive and negative attributes, respectively.

According to the estimated coefficient B₁ of all models, longer decision times do not mean a larger number of attributes. On the contrary, longer decision times are related to a smaller number of positive attributes for the selected alternative, the value of the coefficient B₁ being −0.0052 (the standardized coefficient β₁ was −0.49), suggesting that longer decision times are due to participants having difficulties in finding a reason to make a choice. Student’s t-test showed that the coefficient B₁ differed from zero [t(3,071) = −2.63; p = 0.009]. The proportion of the variance R² explained by the model was 0.11.

The model predicting the number of negative attributes for the selected alternative indicated that there is a slight increase in the number of attributes with increasing decision time, as the coefficient B₁ was 0.002 [β₁ = 0.042; t(3,071) = 2.17; p = 0.03; R² = 0.05]. This implies that longer decision times may involve a conflict that the negative aspects of the selected alternative induce to the choice.

Nevertheless, longer decision times do not mean an increase in the number of positive or negative attributes given to the rejected alternative, as the B₁ coefficient did not differ from zero in those models, according to t-test [t(3,071) = 0.190; p = 0.85 and t(3,071) = −1.84; p = 0.066, respectively].

Only the number of attributes that participants use to explain selection, not rejection, are related to decision times, suggesting that heuristic strategy prevails at all time spans. In this strategy, participants seek reasons for selecting certain alternative and may hesitate if the preferred alternative has also negative, conflicting properties. Conflict between positive attributes of both alternatives in a pair, however, does not cause an increase in decision times. Participants appear to focus on finding plausible reasons for selecting one alternative and do not use additional time to deliberate over the positive aspects of both alternatives, indicating a form of confirmation bias.

The Use of Attributes When Quality Differences Are Large and Small

We further analyzed how different subjective attributes are used in pairs with large and small quality differences. We used two JNDs as a cut-point and divided the pairs into two categories: the small difference category (difference less than two JNDs) and the large difference category (difference more than two JNDs). We then calculated the proportion of each attribute in these small and large difference categories.

We based our analysis on the hypothesis that the use of attributes can contribute to small quality differences through three mechanisms: (1) divergent attention to several attributes and dilution of overall difference; (2) lower accuracy of attributes, caused by smaller visual differences; and (3) use of attributes with ambiguous or low diagnostic value due to lack of more diagnostic attributes.

In Figure 5, we plotted the proportion of the attributes in the large difference category against the number of all other attributes mentioned within the pair. For instance, if the attribute in question is “sharp,” we counted all other attributes used by all participants, such as “grainy” and “natural colors.” The figure illustrates that most subjective attributes appear to describe rather small differences, and large differences are associated with a few clear defects, such as blur, noise, or red eyes. Hence, when differences are smaller, participants’ attention toward different attributes diverges, increasing the number of attributes. Attributes referring to colors, contrast, and lightness level are typical for the smaller differences. When quality differences are larger, salient attributes attract the attention of most participants, leading to a higher consensus and a smaller number of attributes. This is in line with our third hypothesis. In other words, people have fairly high tolerance for differences in preferential attributes and appear to focus on them only when no visible defects or artifacts exist. Detection of defects is a heuristic decision rule for the participants; in pairs where differences are large, participants make fast choices using a limited set of attributes, which clearly differentiate the alternatives.

FIGURE 5

Figure 5. Proportion of attributes in the large difference category and number of other attributes mentioned in the pair. The subjective attributes most often mentioned for large quality differences were less often accompanied by other attributes. However, when quality differences were small, there was a plethora of different subjective attributes, most referring to color balance and general lightness level. In other words, choices in large quality difference pairs is usually explained by smaller number of subjective attributes than in small quality difference pairs.

Figure 6 illustrates the relation between the proportion of the attribute in the large difference category and its accuracy. It is evident that the least accurate attributes are less specific and given in pairs where quality difference is small. Such attributes are, for instance, “colors good,” “colors bad,” “lightning good,” and “clear.” However, attributes referring to sharpness are also relatively inaccurate despite their frequency in larger quality differences and apparent clear meaning. Because the attributes are brought up spontaneously, it is peculiar that people use attributes like “sharpness” when no clear, shared understanding about the sharpness difference exists. We examine this further in Experiment 2.

FIGURE 6

Figure 6. Subjective attributes according to their accuracy and their proportion in the large difference category. Only attributes with frequency more than 31 are included; the accuracy estimation of rarer attributes is biased.

There are also attributes in the small difference group that are accurate, for instance, the more specific attributes referring to colors such as “colors faded,” “gray,” and “yellowish.” Figure 6 shows that weaker accuracy in smaller quality pairs is mostly caused by the use of less accurate attributes, not weaker general accuracy of all attributes. A notable exception to this rule is “sharpness.”

In addition to accuracy and divergent attention, the use of attributes with low diagnostic value may lead to small quality differences. For instance, the attribute “bright” does not clearly indicate whether the image is good or not and is not therefore very diagnostic, unlike the attribute “grainy,” which immediately reveals that the IQ is not good. In Figure 7, we have plotted attributes according to their proportion in large difference pairs against the valence of these attributes, showing that some attributes, typically used in small difference decisions, are neither positive nor negative, such as light colors, brightness, and blurry background. Most of the attributes, however, are unambiguously positive or negative, even when quality differences are small. Thus, heuristic quality estimation strategy seems to avoid attributes that have unclear valence and seeks plausible, justifiable reasons.

FIGURE 7

Figure 7. Subjective attributes according to their valence and proportion in the large difference category (for calculation of valence, see section “Data Analysis”). Most attributes align to either end of the valence dimensions. The named attributes in the middle of the valence dimension represent the small minority of all attributes, are typically mentioned in pairs with small difference, and have rather neutral meanings. Location of other attributes in these two dimensions is reported in Appendix A.

Difference Between Trials With Explanations and “Silent” Trials

We have earlier shown that performance in IQ estimation tasks differs slightly between trials where participants are required to give reasons for their decisions and trials, which do not have such requirement. Most importantly, participants are typically more consistent in their decisions when explanations are required (Leisti and Häkkinen, 2016, 2018). This is probably due to a more thorough information search, which also leads to more pronounced differences between alternatives (Leisti et al., 2014). We found this preference polarization also in this study; the mean preference difference was 1.65 JNDs when explanations were required and 1.46 JNDs in silent trials, suggesting that participants were more unanimous in the former condition. Despite the differences in consistency, Spearman correlation coefficient between conditions was r(190) = 0.89. We come back to this issue in Experiment 3.

Experiment 2: How Threshold Task Differs From Quality Estimation Task

It is important to note the difference between subjective attributes that occur in the explanations for decisions and the psychophysical or psychometric tasks where participants estimate the magnitude of a single attribute. The frequency of an attribute in explanations is not directly associated with its magnitude because its occurrence is related to its importance in the quality estimation and to the magnitudes of other attributes. One factor is also accessibility of an attribute (Kahneman, 2003); some attributes are more familiar and more often associated with quality, so participants may be biased toward these attributes. As clearly shown by the results of Experiment 1, participants predominantly seek one plausible reason to justify their choice and therefore use the most salient attribute in explanations, potentially masking the magnitudes of other attributes.

Similarly, when people are free to use any vocabulary that they desire, there is a possibility that attributes will not have the same meanings between participants. Some aspects of quality may also be difficult to verbalize, leading participants to use less specific expressions such as “good colors” or “good lighting.” In addition, with small, near-threshold differences, it may be difficult for naïve participants to distinguish between sharpness, graininess, or contrast.

In Experiment 2, we used the same materials and a similar method, but the choices were no longer explained; instead, the participants were asked to estimate the quality after the choices using content-specific attributes with buttons similar to those used for indicating their choices. For instance, participants were asked whether image A or B was sharper, more natural, or had better skin tone, depending on the content. These attributes were derived from the qualitative analysis of Experiment 1, representing the most important aspects of IQ in each content.

Experiment 2 had a dual purpose. First, we wanted to understand the relation between the visual magnitude of each attribute, determined by a threshold task, and the frequency of its counterpart in subjective explanations. Second, attributes clearly differ in their accuracy; we therefore wanted to explore whether certain attributes are more ambiguous than others, i.e., their meaning differs between participants, and whether the differences in accuracy are caused by “false positives,” i.e., cases where some participants have detected differences where none exist.