
ORIGINAL RESEARCH article

Front. Psychol., 22 November 2016
Sec. Cognition

Replication Rate, Framing, and Format Affect Attitudes and Decisions about Science Claims

Ralph M. Barnes1*, Stephanie J. Tobin2, Heather M. Johnston3, Noah MacKenzie4 and Chelsea M. Taglang5
  • 1Department of Psychology, Montana State University, Bozeman, MT, USA
  • 2School of Psychology, Australian Catholic University, Banyo, QLD, Australia
  • 3Psychology Department, Columbus State Community College, Columbus, OH, USA
  • 4Department of Social Sciences, Clermont College, University of Cincinnati, Batavia, OH, USA
  • 5Department of Psychology and Counseling, Hood College, Frederick, MD, USA

A series of five experiments examined how the evaluation of a scientific finding was influenced by information about the number of studies that had successfully replicated the initial finding. The experiments also tested the impact of frame (negative, positive) and numeric format (percentage, natural frequency) on the evaluation of scientific findings. In Experiments 1 through 4, an attitude difference score served as the dependent measure, while a measure of choice served as the dependent measure in Experiment 5. Results from a diverse sample of 188 non-institutionalized U.S. adults (Experiment 2) and 730 undergraduate college students (Experiments 1, 3, and 4) indicated that attitudes became more positive as the replication rate increased and attitudes were more positive when the replication information was framed positively. The results also indicate that the manner in which replication rate was framed had a greater impact on attitude than the replication rate itself. The large effect for frame was attenuated somewhat when information about replication was presented in the form of natural frequencies rather than percentages. A fifth study employing 662 undergraduate college students in a task in which choice served as the dependent measure confirmed the framing effect and replicated the replication rate effect in the positive frame condition, but provided no evidence that the use of natural frequencies diminished the effect.

Introduction

“The gold standard for reliability is independent replication.”

Frank and Saxe (2012).

“Even 100 failures in a row to replicate, following statistical significance, will not impugn the original decision to reject H0 and affirm H1.”

Sohn (1998).

Science journalists may announce that a particular substance causes cancer, only to announce a few weeks or months later that subsequent studies failed to confirm the earlier finding. The public may be informed that a new drug treats a particular disorder safely, only to be told that further research indicates that the drug is either not safe or not effective (or both). After the findings of an initial study have been made public, science journalists might report that subsequent studies have replicated or, more likely, failed to replicate the initial results. How are members of the general public, who are regularly bombarded with conflicting scientific reports, influenced by this information? This paper presents several studies that investigate how information about research replication rates (the percentage of attempted replications of a study that succeeded) impacts non-scientists' attitudes about science claims. Furthermore, the current studies investigate how the format in which replication rate information is presented can influence attitudes.

Many scientists feel that replication is a critical component of science (e.g., Ioannidis and Khoury, 2011; Santer et al., 2011; Tomasello and Call, 2011). Commonly touted benefits of replication include the ability to detect bias and random error (Bayarri and Berger, 1991), the ability to eliminate investigator error (Cohen, 1997) and Type 1 error as explanations for research results, and the creation of a cumulative body of work (Fowler, 1995). The importance of replication has been emphasized by some as a reaction to claims that a large proportion of scientific findings are false positives (Ioannidis, 2005; see also Risch, 2000; Sterne and Davey Smith, 2001; Hirschhorn et al., 2002; Wacholder et al., 2004; Simmons et al., 2011).

The concept of replication is discussed in the mass media, and information regarding replications (and failed replications) therefore has the potential to influence popular opinion regarding scientific claims. For instance, the media informed the general public about failures to replicate Pons and Fleischmann's cold fusion results (Browne, 1989), Andrew Wakefield's autism results (Boseley, 2010), Benveniste's water memory results (Sullivan, 1988), and the finding that the XMRV retrovirus was associated with chronic fatigue syndrome (Tuller, 2011). A widely disseminated New Yorker article aimed at a non-scientist audience (Lehrer, 2010) frankly addressed issues such as replicability and the file drawer effect.

The present studies were not designed to answer the question of how (or whether) confidence in a finding should be modified based on the outcome of replication efforts. Rather, the present goal is descriptive: we wish to find out how knowledge about replication efforts affects the manner in which non-scientists alter their confidence in research findings. Many non-scientists lack a firm grasp of scientific and research methods, so we have no clear expectation of the amount of influence replication information will have on lay persons. However, we can advance the more conservative hypothesis that knowledge of a failed replication of a study will cause lay persons to reduce their confidence in that study.

When people make decisions under conditions of uncertainty, the framing of the information can influence their decisions (Tversky and Kahneman, 1981; Levin et al., 1998; Gong et al., 2013). Broadly speaking, framing effects are revealed when the wording used to describe a given piece of information has an impact on the choices and decisions individuals make. Framing effects have been demonstrated to affect choice, attitude, and behavior in a wide range of tasks, but not all framing studies explore the same phenomenon. Levin et al. (1998) include three types of framing in their typology: risky choice, goal, and attribute. In attribute framing, the dependent measure of interest is the evaluation of a single option (e.g., whether or not a claim is true) rather than a choice between independent options (e.g., whether or not one option is preferable to another); the framing paradigm used in the present studies is therefore an example of attribute framing. A consistent finding in the attribute framing literature is that a particular alternative is rated more favorably when described positively than when described negatively (Davis and Bobko, 1986).

In the attribute framing literature (see Sher and McKenzie, 2006 for a review), researchers sometimes claim that two different frames contain logically equivalent information (i.e., contain the same logical content). Cases where both frames contain logically equivalent information may not be equivalent in other crucial aspects, however. Some of the mechanisms proposed to explain attribute framing are concerned with the ways that differently framed statements are nonequivalent. The first possibility (Sher and McKenzie, 2006) is that two different logically equivalent frames may be informationally inequivalent because of information leakage: the communicator's choice of frame may contain information (e.g., the choice that the communicator favors) and this leaked information may be detected by the recipient of the communication (McKenzie and Nelson, 2003; Sher and McKenzie, 2006). The second possibility is that people interpret “60% replication rate” to mean “at least 60% have succeeded in replicating” and interpret “40% failure to replicate” to mean “at least 40% have failed to replicate” (Macdonald, 1986; Mandel, 2001). A third explanation of the attribute framing effect relies on the notion that people are sensitive to the descriptive valence of the words employed. Levin and Gaeth (1988) have argued that an attribute described with a positive label evokes favorable associations in memory, while an attribute described with a negative label evokes unfavorable associations in memory.

Not only do those who intend to communicate replication rate information to others have a choice of frame (positive, negative), they also have a choice of numeric format. The claim "10 studies attempted to replicate finding X, and 6 succeeded" is presented in terms of natural frequencies (Hoffrage et al., 2002). An important characteristic of natural frequencies is that the total sample size and the number of cases in a given subset are both transparent. When the same information is converted to a proportion (0.6) or percentage (60%), the number of cases in the subset and the total sample size are no longer available. Thus, proportion and percentage retain only one of the three pieces of information found in the natural frequency format.
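To make the information loss concrete, the sketch below (Python; the variable names are ours, purely for illustration) contrasts the two representations: the natural frequency keeps both the subset count and the total number of attempts, while the percentage keeps only their ratio.

    # Natural frequency representation: both counts are explicit.
    replications = {"succeeded": 6, "attempted": 10}

    # Percentage representation: only the ratio survives the conversion.
    percent = 100 * replications["succeeded"] / replications["attempted"]  # 60.0

    # From 60% alone, neither the number of successful replications (6)
    # nor the number of attempts (10) can be recovered.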

Natural frequencies have been compared with probabilities in the context of the debate concerning Bayesian reasoning (Gigerenzer, 1996; Kahneman and Tversky, 1996). Some have claimed that, due to their evolutionary heritage, humans have a natural predisposition to think in terms of natural frequencies (Gigerenzer and Hoffrage, 1995; Cosmides and Tooby, 1996; Brase et al., 1998; Hoffrage et al., 2000). Replacing probabilities with natural frequencies tends to minimize a number of biases (Fiedler, 1988; Gigerenzer et al., 1991; Koehler et al., 1994). Slovic et al. (2000) found that risk estimates were impacted by numeric format (probabilities vs. natural frequencies) and Purchase and Slovic (1999) found that when risks are small, presenting them in terms of frequency rather than proportion or percent has a tendency to alarm people. Regardless of the theoretical value of the manipulation, there is a simple practical reason for presenting participants with probability information in both percentage and natural frequency formats: if the format in which probability information is presented interacts with information about replication rate, then knowledge of this effect would be of use to producers and consumers of science information.

Note that in the English language one generally does not substitute "no percent" for "0%" or "all percent" for "100%." When using natural frequencies, however, an individual sometimes has the option to use either Arabic numerals or verbal descriptors to specify the quantity of a subcategory. For instance, a particular situation could be described with the phrase "0 of the 5 studies replicated the effect" or the phrase "none of the 5 studies replicated the effect." These two phrases are mathematically (and informationally) equivalent, so one might predict that they would be psychologically equivalent as well. As Feynman (1967) has pointed out, however, mathematical equivalence is not the same as psychological equivalence. Somewhat related to this issue, there is extensive research exploring the unique aspects of qualitative probability expressions and quantitative probability values (Toogood, 1980; Beyth-Marom, 1982; Nakao and Axelrod, 1983; Budescu and Wallsten, 1985; Mosteller and Youtz, 1990). Qualitative probability expressions include such verbal expressions as mostly, very likely, exceptionally unlikely, and rare, while quantitative probability values include such numeric expressions as 90% likelihood and 1% likelihood. Though this literature reveals that these two manners of presenting probability information are not always interpreted the same way by participants, it should be noted that, unlike the natural frequency phrases given as an example above, qualitative probability expressions are neither logically nor mathematically equivalent to quantitative probability values. Since Arabic numerals might have different psychological effects than qualitative terms, and both are found in media reports on science, we will use both kinds of expressions in the current study.

One of our major goals concerns a practical application; we would like to determine how high the replication rate of a study has to be before the knowledge about the replication rate will actually cause an increase, rather than a decrease, in confidence about that finding. Our second major goal is to determine if replication rate affects participants in a way that is independent of numeric format. Our third major goal is to determine what effect, if any, the framing of replication rate information has on attitudes. We have chosen attitude difference as the dependent measure with which to achieve our research goals.

Our first hypothesis (H1) is that participants will be sensitive to replication rate information. In the absence of a compelling theoretical reason to expect a more complex, nonlinear trend, it is our hypothesis that the relationship between attitude difference scores and replication rate will be linear: higher replication rates will be associated with more favorable attitudes toward the claims. Given the findings reported in studies that employed an attribute framing paradigm, our second hypothesis (H2) is that the framing of the replication rate information will have at least as strong an impact as the replication rate information itself. More specifically, it is our hypothesis that positively framed replication rates will lead to more favorable attitudes than negatively framed replication rates. Our third hypothesis (H3) is that the framing effect will be less pronounced when the numeric format of replication rate information is natural frequencies rather than percentages. The third hypothesis is based on research indicating that human reasoning is facilitated by the use of natural frequencies. Finally, it is our hypothesis (H4) that when replication rate information is presented as natural frequencies, the attitude difference will not depend on whether the natural frequency is presented with Arabic numerals (e.g., 0 out of 10) or words (e.g., none of the 10).

Experiment 1a

The goal of Experiment 1 was to test our first two hypotheses using replication rate information in percentages. Thus, replication rate information was presented in terms of percentages and was framed in either a positive or negative fashion. We predicted that increasingly larger replication rates would be associated with more positive attitudes and that this would be evidenced by a significant linear trend, with no other significant trend components. We also predicted that negative frames would lead to more negative attitudes than would positive frames and that the effect size associated with framing would be at least moderate.

Methods

Ethics Statement

Ethics approval for Studies 1 through 5 was obtained from all institutions at which data was collected. These institutions include Columbus State Community College (CSCC), Haverford College (HC), Lafayette College (LC), Montana State University (MSU), Salem State University (SSU), University of Cincinnati-Clermont College (UCC), and The University of Houston (UH). For Study 1, data was collected and IRB approval was granted from CSCC, HC, SSU, and UH. For Study 2, IRB approval was granted from Lafayette College. For Studies 3 and 4, data was collected, and IRB approval was granted from CSCC, LC, and UCC. For Study 5, data was collected, and IRB approval was granted from MSU and UCC.

In all cases a written information sheet or a computer screen containing the information was provided to participants, but the committees waived the need for written informed consent from the participants as the research was low risk. Some minor deception was necessary to test our hypotheses, participants' rights and welfare were not adversely affected, and a full debriefing was provided at the end of the study.

Participants

One hundred and eighty-three undergraduate students enrolled in psychology courses volunteered to participate. Data from 11 participants were excluded because of failure to answer all the items in the questionnaire and/or failure to follow instructions. Seventy-eight percent of the remaining 172 participants were female. The age range for the sample was 18–49 and the average age was 21.9. Seven participants were students at a private liberal arts college, 119 were enrolled at a private university, 11 were enrolled at a state college, and 35 were enrolled at a community college. All participants were given extra-credit in their psychology courses as compensation. The stimuli, procedure, and primary dependent measure employed in Experiments 1 through 4 were novel. For this reason, we did not know what effect sizes were likely to obtain. We therefore aimed to recruit 100 participants per frame for each experiment. Due to incomplete questionnaires and failures to follow instructions, the number of participants per frame was never exactly 100. However, in all studies the decisions to terminate data collection were always made prior to looking at any of the data.

Stimuli and Procedure

Stimuli consisted of 24 science claims about topics that were either fictions created by the authors or obscure topics that would be unfamiliar to participants. Thus, the claims were designed to function as the “blank predicates” that are commonly used by psychologists interested in deductive reasoning. Twelve of the claims were critical to this study (see Table S1) and the other 12 were distractor items that were included to keep participants from determining the purpose of the study. The stimuli for all experiments have been included in the Supplementary Materials.

Additional information for the 12 critical items consisted of the replication rate of studies that supported each initial claim (see Tables S2, S3). The additional information was framed in either a positive or negative fashion (a between-subjects manipulation). Frame was chosen as a between-subjects variable because we worried that if items on a given questionnaire switched between positively and negatively framed information, participants might be more likely to misread an item and give a response appropriate to the wrong frame. In the positive frame condition, six replication rates (percentages of studies that successfully replicated the results of an earlier study) were chosen (9, 17, 24, 69, 77, and 84%), creating six potential versions of additional information for each initial claim. In the negative frame condition, six replication failure rates (percentages of studies that failed to replicate the results of an earlier study) were used (91, 83, 76, 31, 23, and 16%), creating six potential versions of additional information for each initial claim. For a given frame (positive, negative) and initial claim, the six related pieces of additional information differed only in the percentage of successful replications reported. The additional information created for distractor items did not include any information regarding the replication (or lack thereof) of scientific studies. Distractor items consisted of science claims that contained additional supporting information that was unrelated to replication rate.

Each participant was randomly assigned to complete 1 of 16 (8 positively framed, 8 negatively framed) paper and pencil questionnaire variants (see Stimuli S1 for an example). Each questionnaire variant contained 24 science claims: 12 distractor items and 12 critical items. Six of the critical items were science claims in isolation (baseline items) and six of the critical items were science claims paired with additional information. For each critical item, a questionnaire would include either the baseline version of the item or the claim paired with additional information, but never both. As with the critical items, 6 of the distractor items were science claims in isolation and 6 of the distractor items were science claims paired with additional information. Half of the questionnaire variants contained replication rate presented in a positive frame and half contained replication rate presented in a negative frame. In each questionnaire the initial section of 24 science claims was followed by several demographic questions.

The sequence of the 24 claims was randomized with the constraint that no more than three distractor items or three critical items could appear in a row. Additionally, no more than two non-baseline items could appear in a row. Pairings of critical items and replication rates were unique to each questionnaire. For instance, if a 17% replication rate was matched to science claim number 1 in variant 8, then none of the other variants would pair science claim number 1 with that particular replication rate.
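For readers who want a concrete picture of the sequencing constraint, the sketch below (Python; the item labels and the rejection-sampling approach are ours, not a description of the authors' actual procedure) re-shuffles a 24-item list until no more than three items of the same type, and no more than two non-baseline items, appear consecutively.

    import random

    # Hypothetical item pool: 12 critical and 12 distractor items, half of each
    # type presented in isolation (baseline) and half paired with extra info.
    items = ([("critical", "baseline")] * 6 + [("critical", "info")] * 6 +
             [("distractor", "baseline")] * 6 + [("distractor", "info")] * 6)

    def max_run(seq, predicate):
        # Length of the longest run of consecutive elements satisfying predicate.
        best = run = 0
        for element in seq:
            run = run + 1 if predicate(element) else 0
            best = max(best, run)
        return best

    def valid(order):
        return (max_run(order, lambda it: it[0] == "critical") <= 3 and
                max_run(order, lambda it: it[0] == "distractor") <= 3 and
                max_run(order, lambda it: it[1] == "info") <= 2)

    order = items[:]
    random.shuffle(order)
    while not valid(order):      # re-shuffle until the run constraints are met
        random.shuffle(order)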

After each science claim, respondents were presented with an opportunity to indicate their attitude toward the claim using a 6-point scale that ranged from strongly favor (1) to strongly oppose (6). It was unlikely that participants would find each of the critical science claims equally compelling; therefore difference scores, rather than raw attitude scores, served as the dependent measure. Had we used raw attitude scores, we would not have been able to determine the unique contributions of the plausibility of the claim and the information about replication rate. We used the responses to the baseline items (critical items not paired with additional information) to calculate mean attitude scores for each of the 12 claims. For each trial on which a critical item was paired with additional replication rate information, the attitude score was subtracted from the mean baseline attitude score for that claim (the mean score of the claim in isolation). A negative difference score indicates that participants found a science claim to be less convincing when it was followed by additional information. For instance, across all participants presented with the baseline version of claim number 10, the average attitude was 3.1. If a participant responded with a "4" on the 1–6 scale for claim 10 on a trial in which claim 10 was followed by additional information about the percentage of studies that successfully replicated the research upon which the claim was based, then that participant's difference score would be −0.9, indicating a shift away from the average baseline attitude of nearly one point on the 6-point scale.
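A minimal sketch of this difference-score computation is given below (Python with pandas; the column names and the small illustrative data set are ours, not the authors' materials). Baseline trials supply a mean attitude for each claim, and the difference score for an information trial is that baseline mean minus the participant's rating, reproducing the 3.1 − 4 = −0.9 example above.

    import pandas as pd

    # Hypothetical long-format data: one row per participant x critical item.
    # 'rating' is the 1-6 attitude response and 'has_info' flags trials on
    # which replication information followed the claim.
    df = pd.DataFrame({
        "claim":    [10, 10, 10, 10],
        "rating":   [3.0, 3.2, 4.0, 2.0],
        "has_info": [False, False, True, True],
    })

    # Mean baseline attitude for each claim (claims presented in isolation).
    baseline = df[~df["has_info"]].groupby("claim")["rating"].mean()

    # Difference score: baseline mean minus the rating given when the claim
    # was paired with replication information (3.1 - 4.0 = -0.9 here).
    info_trials = df[df["has_info"]].copy()
    info_trials["diff_score"] = info_trials["claim"].map(baseline) - info_trials["rating"]
    print(info_trials)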

In this experiment, and all subsequent experiments, decisions to terminate data collection were made before any data were looked at. All of our studies, pilot studies, and conditions related to the topic of replication rate have been reported. Over the course of the current paper, all independent and dependent variables have been fully reported. That is, the authors have no additional studies or data related to this topic in a "file drawer."

Results and Discussion

Descriptive statistics for Experiment 1a are presented in Table 1 and the pattern of data for Experiment 1a is also presented graphically in Figure 1 (Data Available from Montana State University Scholarworks, https://doi.org/10.15788/M23014). Preliminary analyses considering gender and age, conducted on this experiment (and also on Experiments 1b, 2, 3, and 4), revealed no significant effects involving either gender or age (p > 0.19 across Experiments 1a, 1b, 2, 3, and 4). Given these findings, and because there were no theoretical reasons to expect that the replication rate or framing effects would be influenced by these variables, all reported analyses involve data in aggregate form.

Table 1. Mean and standard deviation of difference scores for Experiments 1a and 1b.

Figure 1. Difference score as a function of frame for Experiments 1a and 1b. Error bars indicate 95% confidence intervals.

A 2 × 6 mixed factorial ANOVA of attitude difference scores considering frame (positive, negative) as the between-subjects factor and replication rate (9, 17, 24, 69, 77, 84%) as the within-subjects factor revealed a significant main effect of frame, F(1, 170) = 83.4, p < 0.001, ηp2 = 0.329, and a significant main effect of replication rate, F(5, 850) = 41.1, p < 0.001, ηp2 = 0.195. There was, however, no interaction between frame and replication rate (p > 0.39). The lack of a significant interaction between frame and replication rate indicated that the difference in the slopes (reported in Table 2) between the two frame conditions was not statistically significant. Further analysis of the main effect of replication rate revealed a significant linear trend, F(1, 170) = 124.8, p < 0.001, ηp2 = 0.423. Note that due to the large number of participants in this and subsequent studies, it is likely that a great number of analyses will yield significant differences. Because of this problem, we will not report trend analyses with effect sizes that are less than moderate (i.e., ηp2 < 0.06), as we feel that they have no practical meaning. For each replication rate, a single sample t-test was conducted to compare the difference score to zero. Here (and elsewhere) we used the Bonferroni correction to maintain the family-wise error rate at 0.05, and the results can be seen in Table 1.
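For readers who wish to reproduce this style of analysis on their own data, the sketch below (Python; the pingouin package, the simulated data, and the column names are assumptions of ours, not the authors' software or values) runs a 2 x 6 mixed ANOVA on long-format difference scores and then Bonferroni-corrected one-sample t-tests of each frame x replication-rate cell against zero.

    import numpy as np
    import pandas as pd
    import pingouin as pg
    from scipy import stats

    # Simulated stand-in for the real difference scores: one row per
    # participant x replication rate, with frame varying between subjects.
    rng = np.random.default_rng(0)
    rates = [9, 17, 24, 69, 77, 84]
    rows = []
    for pid in range(40):
        frame = "positive" if pid < 20 else "negative"
        offset = 0.0 if frame == "positive" else -0.9
        for rate in rates:
            rows.append({"participant": pid, "frame": frame, "rep_rate": rate,
                         "diff_score": offset + 0.01 * rate + rng.normal(0, 0.5)})
    data = pd.DataFrame(rows)

    # 2 x 6 mixed ANOVA: frame between subjects, replication rate within subjects.
    aov = pg.mixed_anova(data=data, dv="diff_score", within="rep_rate",
                         subject="participant", between="frame")
    print(aov)

    # One-sample t-tests of each cell mean against zero, Bonferroni-corrected
    # across the 12 frame x replication-rate cells.
    for (frame, rate), cell in data.groupby(["frame", "rep_rate"]):
        res = stats.ttest_1samp(cell["diff_score"], 0.0)
        print(frame, rate, round(res.statistic, 2), min(res.pvalue * 12, 1.0))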

Table 2. Descriptive statistics for Experiments 1a, 1b, 2, 3, and 4.

Because the mean attitude change values for the six replication rates reflected a strong linear trend, it was possible to calculate the replication rate associated with no change in attitude by determining the line of best fit through the mean values of the six attitude difference scores for each frame. Some readers may find this calculation useful, as it makes transparent the relative influence of framing and replication rate. This information may be more interesting to science journalists and researchers in the field of science communication than to researchers in the field of judgment and decision making (JDM). When calculating the line of best fit through the mean difference scores for each frame (positive, negative), replication rate was assigned to the x axis and attitude difference to the y axis, as depicted in Figure 1. The resulting equations for the lines of best fit, the x intercept, y intercept, and slope values for Experiment 1a (as well as Experiments 1b, 2, 3, and 4) are presented in Table 2. The x intercept is the point at which knowledge of the replication rate has no effect on attitude. Values on the x axis above this level are associated with increasingly favorable attitudes toward science claims, while values below this level are associated with increasingly negative attitudes toward science claims. The slope represents the degree to which participants are sensitive to replication rate information. A slope of 0 would obtain if participants did not revise their attitudes about a science claim at all in the face of information regarding the percentage of studies that had successfully replicated the finding upon which the claim was based. While the slopes calculated for each of the two frames in Experiment 1a are different, the difference is not statistically significant, as indicated by the non-significant interaction between frame and replication rate. The impact of the framing of the information (positive, negative) can be described in two ways. The difference scores in the positive condition were more positive than those in the negative condition (n = 96, M = 0.256 and n = 76, M = −0.669, respectively). Described in this way, framing the information in a negative fashion leads (on average) to a difference score 0.925 lower than that expected in the positive condition. The other way of describing the impact of the frame is to compare the x intercept values for the two frame conditions. The x intercept for the positive frame condition is 31.3 while the x intercept for the negative condition is 79.4. This means that when replication rate information is framed in a positive manner, replication rates greater than 31.3% will lead to a more favorable attitude toward the claim. In contrast, when the information is framed negatively, replication rates would have to exceed 79.4% (i.e., fewer than about 21% would have to fail to replicate) before individuals would adopt a more favorable attitude toward the claim.
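The fit and the x intercept described above can be computed in a few lines; the sketch below (Python with NumPy) uses the six replication rates from Experiment 1a but placeholder mean difference scores, since the actual means are those reported in Table 1.

    import numpy as np

    # Replication rates from Experiment 1a (x axis) and placeholder mean
    # difference scores for one frame (y axis); substitute the means from
    # Table 1 to reproduce the values in Table 2.
    rates = np.array([9, 17, 24, 69, 77, 84])
    mean_diffs = np.array([-0.45, -0.30, -0.20, 0.35, 0.45, 0.55])  # placeholders

    # Least-squares line of best fit: diff = slope * rate + intercept.
    slope, intercept = np.polyfit(rates, mean_diffs, 1)

    # The x intercept is the replication rate at which the predicted attitude
    # change is zero; higher rates predict increasingly favorable attitudes.
    x_intercept = -intercept / slope
    print(slope, intercept, x_intercept)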

Experiment 1a confirmed several of our hypotheses. First, the experiment revealed that non-scientists (i.e., students) make use of replication rate information in developing attitudes about science claims. The results also indicate that the relationship between attitude difference and replication rate is roughly linear (H1). That is, attitudes toward a claim increase as a simple linear function of the percentage of attempted replications that were successful. As measured by partial eta-squared, the effect size for frame was even stronger than the effect size for replication rate, indicating that the framing of the information has a very strong impact on attitude (H2).

Experiment 1b

The six replication rate values employed in Experiment 1a were not evenly spaced and therefore left a number of large gaps. In order to provide a clearer picture of participant response to replication rate information, in Experiment 1b we repeated Experiment 1a using new replication rate values, chosen to fill the gaps left by the values used in Experiment 1a. As in Experiment 1a, percentages that could represent simple ratios (e.g., 25 and 50%) were avoided so that the replication rate would imply that a large number of replications had been attempted.

Methods

Participants

Two hundred and thirteen undergraduate students enrolled in psychology courses volunteered to participate. Data from 17 participants were excluded because of failure to answer all the items in the questionnaire and failure to follow instructions. One hundred and sixty of the remaining 196 participants were female. The age range for the sample was 18–48 and the average age was 22.2. Thirty participants were enrolled at a private liberal arts college, 109 were enrolled at a private university, 20 were enrolled at a state college and 37 were enrolled at a community college. All participants were given extra-credit in their psychology courses as compensation.

Stimuli and Procedure

The stimuli and procedure of Experiment 1b were identical to those of Experiment 1a with the exception that a different set of replication rate/replication failure rate values was employed. The eight questionnaires presenting replication rate information in a positive frame used six different percentages (4, 32, 42, 54, 92, and 97%) specifying the number of studies that successfully replicated a finding. The eight questionnaires presenting replication rate information in a negative frame employed the corresponding percentages (96, 68, 58, 46, 8, and 3%) specifying the number of studies that failed to replicate a particular finding.

Results and Discussion

Descriptive statistics for Experiment 1b are presented in Table 1 and the pattern of data for Experiment 1b is also presented graphically in Figure 1. A 2 × 6 mixed factorial ANOVA of attitude difference scores considering frame (positive, negative) as the between-subjects factor and replication rate (4, 32, 42, 54, 92, 97%) as the within-subjects factor revealed a significant main effect of frame, F(1, 194) = 91.7, p < 0.001, ηp2 = 0.321, and a significant main effect of replication rate, F(5, 970) = 35, p < 0.001, ηp2 = 0.153. There was, however, no interaction between frame and replication rate (p > 0.3). The lack of a significant interaction between frame and replication rate indicated that the difference in the slopes for each of the two conditions was not statistically significant. Further analysis of the main effect of replication rate indicated a significant linear trend, F(1, 194) = 97.5, p < 0.001, ηp2 = 0.334.

Experiment 2

Experiments 1a and 1b both confirmed the predictions that attitudes toward science claims would be linearly related to replication rate (H1) and that the framing of this information would have a strong impact (H2). Though the two experiments employed non-overlapping sets of replication rate values, the patterns of results in the two parts of Experiment 1 were very similar to one another. A common criticism of psychological research is that the samples employed in many of the studies are not representative of the population. In regard to Experiments 1a and 1b, this is a valid criticism: the impact of replication rate information and framing on the general population is more interesting than its impact on college students enrolled in introductory psychology courses. In order to address this issue, Experiment 2 was a near-replication of Experiments 1a and 1b using a diverse sample of non-institutionalized U.S. adults above the age of 24. We gathered additional demographic information from participants in Experiment 2 in order to determine if replication rate and framing effects varied as a function of age, education level, income, or self-reported scientific knowledge.

Though many of the participants in Experiment 2 had no college experience, it was predicted that participants in Experiment 2 would, in general, respond to replication rate information in much the same way as the undergraduate college students sampled in Experiments 1a and 1b. It may be that the term "replication" is not well understood by participants with less education or a lower level of self-reported scientific knowledge. If so, the replication rate effect might be smaller for these participants. However, we are not aware of any studies showing that framing effects are enhanced or attenuated among college students compared to other segments of the population, so we predicted that the strong framing effect found in Experiment 1 would also be found in Experiment 2.

Methods

Participants

Two-hundred and twenty-four participants were recruited through an opt-in internet panel managed by a survey research firm (Marketing Systems Group). Data from 22 participants were excluded because the participants completed the survey in <4 min. Data from 14 additional respondents were excluded for failure to answer all the non-demographic items or failure to follow instructions. The remaining participants comprised 188 non-institutionalized adults living in the U.S. The age of respondents ranged from 25 to 83 with a mean of 48.8 and a median of 47. Thirty-nine states were represented in the sample and 78 of the respondents were female. Seventy-nine percent of the respondents identified themselves as non-Hispanic white, while 10.1, 8.5, 1.6, and 0.5% identified themselves as black, Hispanic, Asian, and Native American, respectively. Additionally, 47.3% of respondents had earned at least one college degree, and 49.2% of the respondents were from households with an annual income below $50 k per year. All participants were paid $2.75 for their participation.

Stimuli and Procedure

Care was taken to make sure that the on-line questionnaires were as similar as possible to the earlier paper and pencil questionnaires. The stimuli and procedure of Experiment 2 were identical to Experiment 1a with three exceptions. The 16 questionnaire variants created for Experiment 2 contained a more extensive set of demographic questions. Additionally, the questionnaires in Experiment 2 were administered via the internet using Opinio software. In Experiment 2, the questionnaires presenting replication rate information in a positive frame employed six different percentages (4, 17, 42, 69, 84, 97%) specifying the number of studies that successfully replicated a finding. The six percentages were selected from among the set of 12 percentages used in Experiments 1a and 1b, with the goal of choosing values that were evenly spaced. The eight questionnaires presenting replication rate information in a negative frame employed the corresponding percentages (96, 83, 58, 31, 16, 3%) specifying the number of studies that failed to replicate a particular finding.

Several demographic questions followed the presentation of the science claims and responses to these questions were used to categorize participants. In terms of the variable of age, participants were grouped into either the young category (age 25–47, n = 93) or old category (age 48+, n = 94). In terms of the variable education, participants were categorized on the basis of whether they had (n = 89) or had not (n = 99) earned a bachelor's degree. In terms of the variable knowledge, participants were categorized based on how much knowledge they felt they had in regards to science. The low knowledge group (n = 120) included all participants that indicated that they were either not very knowledgeable or somewhat knowledgeable about science and the high knowledge group (n = 68) included all participants who indicated that they were either moderately knowledgeable or very knowledgeable about science. In terms of the variable income, participants were grouped into either the low income category (<$50 k/yr, n = 92) or the high income category ($50 k/yr or more, n = 95).

Results and Discussion

As with Experiments 1a, 1b, 3, and 4, preliminary analyses involving age and gender were conducted for Experiment 2. In addition to those analyses, we also conducted analyses of knowledge, income, and education. The preliminary analyses for Experiment 2 revealed no significant effects involving any of these five subject variables (p > 0.21 in all cases). Therefore, we report analyses of the data in aggregate form. Descriptive statistics for Experiment 2 are presented in Table 3 and the pattern of data for Experiment 2 is presented graphically in Figure 2. A 2 × 6 mixed factorial ANOVA considering frame (positive, negative) as the between-subjects factor and replication rate (4, 17, 42, 69, 84, 97%) as the within-subjects factor revealed a significant main effect of frame, F(1, 186) = 36.2, p < 0.001, ηp2 = 0.163, and a significant main effect of replication rate, F(5, 970) = 29.1, p < 0.001, ηp2 = 0.135. There was, however, no interaction between frame and replication rate (p > 0.2). The lack of a significant interaction between frame and replication rate indicated that the difference in the slopes for each of the two conditions was not statistically significant. Further analysis of the main effect of replication rate revealed a significant linear trend, F(1, 186) = 85.4, p < 0.001, ηp2 = 0.315.

Table 3. Mean and standard deviation of difference scores for Experiment 2.

Figure 2. Difference score as a function of frame for Experiment 2. Error bars indicate 95% confidence intervals.

Even though the samples included in Experiments 1 and 2 were very different, the results of both studies paint a very consistent picture: replication rate and framing both have a strong impact on attitude. Note also that a different set of six replication rates was employed in each of Experiments 1a, 1b, and 2, indicating that the pattern of results is not an artifact of any particular set of replication rates.

Experiment 3

Some have suggested that humans are better able to process information when it is presented in the format of natural frequencies (Gigerenzer and Hoffrage, 1995; Cosmides and Tooby, 1996). To address this possibility, in Experiment 3 the replication rate information was presented in the form of natural frequencies (e.g., “4 of the 10 studies that attempted to replicate the findings succeeded in doing so”). Throughout Experiment 3 the total number of attempted replications was always 10. Note that the replication rates employed in the earlier studies (e.g., 97%) implied that far more than 10 research teams attempted to replicate each finding. In general, the natural frequencies reported in Experiment 3 used Arabic numerals (e.g., 4 of 10). As we constructed the stimuli, however, we noticed that the phrases “0 out of 10” and “10 out of 10” could be awkward and unnatural. We therefore replaced Arabic numerals with words in some of the 0 of 10 and 10 of 10 versions of the stimuli.

One goal of Experiment 3 was to determine if changing the format in which the replication rate is presented would have any impact on attitude (H3). Because natural frequencies are easier to work with and result in fewer reasoning "errors," it was our hypothesis that the framing effects seen in Experiments 1 and 2 would be attenuated in Experiment 3. Let us assume that the more dubious a finding is, the greater the number of research teams that will attempt to replicate it. Since the studies described in Experiment 3 involved only 10 research teams attempting replication, while the studies reported in Experiments 1 and 2 implied much larger numbers of teams attempting replication, our non-scientist participants may have more confidence in the findings reported in Experiment 3.

We predicted that there would be no difference in attitude as a function of the form in which the natural frequency data is presented (Arabic numerals vs. words) (H4). If this is the case, then the data from the 0/none and 10/all conditions can be combined and analyzed along with the other four conditions. If the form in which the numeric data is presented does impact attitude, however, then those conditions must be analyzed separately.

Methods

Participants

Two hundred undergraduate students enrolled in psychology courses volunteered to participate. Data from 14 participants were excluded because of failure to answer all the items in the questionnaire and/or failure to follow instructions. One hundred and twenty-seven of the remaining 186 participants were female. The age range for the sample was 18–55 and the average age was 22.4. Fifty-seven participants were students enrolled at a private liberal arts college, 124 were enrolled at a community college, and 5 participants did not indicate the institution they were attending. All participants were given extra-credit in their psychology courses as compensation.

Stimuli and Procedure

The stimuli and procedure of Experiment 3 were identical to those of Experiment 1 with several exceptions. All participants in Experiment 3 completed an on-line questionnaire generated by Opinio. Additionally, the replication rate information was presented in the form of natural frequencies, rather than percentages. In all cases, the additional information indicated that 10 studies had been conducted in an attempt to replicate a key study upon which the initial claim was based. Questionnaires presenting replication rate information in a positive frame employed six different ratios (0 of 10, 2 of 10, 4 of 10, 6 of 10, 8 of 10, and 10 of 10) specifying the number of studies that successfully replicated a finding. Questionnaires presenting replication rate information in a negative frame employed six different ratios (10 of 10, 8 of 10, 6 of 10, 4 of 10, 2 of 10, and 0 of 10) specifying the number of studies that failed to replicate a finding. The wording in some items was slightly altered in order to accommodate the change from percentages to natural frequencies (see Tables S4, S5). For some of the conditions (i.e., 2 of 10, 4 of 10, 6 of 10, and 8 of 10) natural frequencies were always presented with Arabic numerals. For the 0 of 10 and 10 of 10 conditions, however, Arabic numerals (e.g., 0 of 10) were used for some items while words (e.g., none of the 10) were used for other items.

Results and Discussion

Analysis of Conditions Only Using Natural Frequencies

Descriptive statistics for Experiment 3 are presented in Table 4 and the pattern of data for Experiment 3 is also presented graphically in Figure 3. Four levels of replication rate (2, 4, 6, and 8 out of 10) were always presented with Arabic numerals. In the 10 of 10 and 0 of 10 conditions, however, half of the items were presented with Arabic numerals (10, 0) while the rest were presented with the equivalent words (all, none). Therefore, the conditions that included two different kinds of stimuli to indicate natural frequencies were analyzed separately using independent samples t-tests and the descriptive statistics for these conditions are summarized in Table 5.

Table 4. Mean and standard deviation of difference scores for Experiment 3.

Figure 3. Difference score as a function of frame for Experiment 3. Error bars indicate 95% confidence intervals.

Table 5. Descriptive statistics for the 0 of 10 and 10 of 10 replication rate conditions of Experiment 3.

These four levels of replication rate were analyzed with a 2 × 4 mixed factorial ANOVA considering frame (positive, negative) as the between-subjects factor and replication rate (2 of 10, 4 of 10, 6 of 10, 8 of 10) as the within-subjects factor. The ANOVA revealed a significant main effect of frame, F(1, 184) = 4.76, p = 0.03, ηp2 = 0.025, and a significant main effect of replication rate, F(3, 552) = 17.9, p < 0.001, ηp2 = 0.09. There was also an interaction between frame and replication rate, F(3, 552) = 5.61, p = 0.001, ηp2 = 0.03. An examination of the means revealed that the frame effect was most pronounced when the replication rate was 6 out of 10. However, the effect size associated with this interaction was weak and therefore the interaction may not have any practical meaning. Further analysis of the main effect of replication rate revealed a significant linear trend, F(1, 184) = 40.2, p < 0.001, ηp2 = 0.179.

Though the ANOVA revealed both a significant main effect of frame and a significant interaction between frame and replication rate, the related effect sizes were much smaller than the effect size of the framing effect seen in Experiments 1 and 2. Due to the large n of the study and the relatively small effect sizes, we consider the statistically significant results to have little practical significance.

Analysis of Conditions Using Natural Frequencies and Arabic Numerals

As noted above, in two levels of replication rate (0 of 10; 10 of 10) some questionnaire items included Arabic numerals (10, 0) while others included the equivalent words (all, none). Therefore, these conditions were analyzed separately using independent samples t-tests and the descriptive statistics for these conditions are summarized in Table 5. In three of the four comparisons (positive frame: 0 of 10 and 10 of 10; negative frame: 10 of 10) the Arabic numeral version was significantly different from the equivalent information presented with a word (p < 0.045 in all cases). The results of the independent samples t-tests justify our omission of the 10 of 10 and 0 of 10 conditions from the earlier analysis of variance.

Perhaps more interesting than the results of the independent samples t-tests, however, were the distributions of difference scores in the 0 of 10 replication rate conditions (see the "distribution" column in Table 5). Note that two out of four of these conditions were characterized by bimodal distributions. It should be noted that none of the 2, 4, 6, or 8 out of 10 conditions of Experiment 3 were characterized by bimodal distributions, nor were any of the conditions in Experiments 1a, 1b, or 2. The statements associated with bimodal distributions included "none of the 10 succeeded in replicating" (positive frame) and "10/all of 10 failed to replicate" (negative frame). Note that the only condition containing a double negative, "0/none of 10 failed to replicate" (negative frame), was characterized by a unimodal distribution. The pattern of unimodal and bimodal distributions further justifies our omission of the 10 of 10 and 0 of 10 conditions from the earlier ANOVA.

In Experiment 3, as in Experiments 1 and 2, there was an effect of replication rate, characterized by a strong linear trend (H1), and an effect of frame (H2). Unlike the previous experiments, however, the effect of frame in Experiment 3 was much weaker and therefore might have little practical relevance. The smaller magnitude of the framing effect in Experiment 3 appears to lend some support to our prediction that natural frequencies would be associated with an attenuated framing effect (H3). The significant main effect of replication rate ruled out the possibility that participant attitude was only influenced by set size. However, we cannot determine if participants were influenced by the ratio between the subset and the entire set or only by the size of the subset. In either case, one would expect to see a strong linear trend of replication rate.

Experiment 4

In Experiment 3, participants were always informed that 10 research teams had attempted to replicate each finding. The results of Experiment 3 differed from those of Experiments 1 and 2 in that they did not reveal a strong framing effect. The lack of a framing effect in Experiment 3 may be an artifact of the specific natural frequencies employed. In Experiment 4, participants were informed that 5, not 10, research teams attempted to replicate each finding. Experiment 4 has the potential to confirm the attenuated framing effect (H3) seen in Experiment 3. Additionally, Experiment 4 may confirm the unexpected pattern of results found in the 0 of 10 and 10 of 10 conditions of Experiment 3 (H4).

Methods

Participants

One hundred and ninety-six undergraduate students enrolled in psychology courses volunteered to participate. Data from 20 participants were excluded because of failure to answer all the items in the questionnaire and/or failure to follow instructions. One hundred and twenty-seven of the remaining 177 participants were female. The age range for the sample was 18–55 and the average age was 26.3. Five participants were students enrolled at a private liberal arts college, 166 were enrolled at a community college, and 5 participants did not indicate the institution they were attending. All participants were given extra credit in their psychology courses as compensation.

Stimuli and Procedure

The stimuli and procedure of Experiment 4 were identical to those of Experiment 3 with several exceptions. In all cases, the additional information indicated that five studies had attempted to replicate a key study upon which the initial claim was based. Questionnaires presenting replication rate information in a positive frame employed six different ratios (0 of 5, 1 of 5, 2 of 5, 3 of 5, 4 of 5, and 5 of 5) specifying the number of studies that successfully replicated a finding. Questionnaires presenting replication rate information in a negative frame employed six different ratios (5 of 5, 4 of 5, 3 of 5, 2 of 5, 1 of 5, and 0 of 5) specifying the number of studies that failed to replicate a finding. As with Experiment 3, words were substituted for Arabic numerals for some items (H4). The Experiment 4 substitutions occurred in the same places as the substitutions of Experiment 3 (see Tables S4, S5).

Results and Discussion

Analysis of Conditions Only Using Natural Frequencies

Descriptive statistics for the 1 of 5, 2 of 5, 3 of 5, and 4 of 5 conditions, in which natural frequencies were always presented with Arabic numerals, are presented in Table 6 and the pattern of data for these conditions is also presented graphically in Figure 4. A 2 × 4 mixed factorial ANOVA considering frame (positive, negative) as the between-subjects factor and replication rate (1 of 5, 2 of 5, 3 of 5, 4 of 5) as the within-subjects factor revealed a significant main effect of frame, F(1, 175) = 19, p < 0.001, ηp2 = 0.098, and a significant main effect of replication rate, F(3, 525) = 15.5, p < 0.001, ηp2 = 0.081. There was no interaction between frame and replication rate, F(3, 525) = 1.78, p > 0.15, ηp2 = 0.01. Further analysis of the main effect of replication rate indicated a significant linear trend, F(1, 175) = 40.1, p < 0.001, ηp2 = 0.187.

Table 6. Mean and standard deviation of difference scores for Experiment 4.

Figure 4. Difference score as a function of frame for Experiment 4. Error bars indicate 95% confidence intervals.

Analysis of Conditions Using Natural Frequencies and Arabic Numerals

In the 5 of 5 and 0 of 5 conditions some items included Arabic numerals (0, 5) while others included the equivalent words (none, all). These conditions were analyzed separately using independent samples t-tests and the descriptive statistics are summarized in Table 7. No significant differences were found between the Arabic numeral and word versions in any of the conditions (p > 0.09 in all cases). In contrast to Experiment 3, the results of the independent samples t-tests do not appear to justify our omission of the 5 of 5 and 0 of 5 conditions from the earlier analysis of variance.

Table 7. Descriptive statistics for the 0 of 5 and 5 of 5 replication rate conditions of Experiment 4.

The pattern of distributions for the four conditions does, however, justify the omission of the 0 of 5 and 5 of 5 conditions from the 2 × 4 mixed factorial ANOVA of the Experiment 4 difference scores. For the replication rate of 0 of 5 there were four conditions (positive Arabic, negative Arabic, positive word, negative word) and three of these conditions were characterized by bimodal distributions (the fourth was characterized by a flat distribution). This is in contrast to the 1 of 5, 2 of 5, 3 of 5, 4 of 5, and 5 of 5 conditions of Experiment 4, which were not characterized by bimodal or flat distributions. In a pattern similar to that found in Experiment 3, the statements associated with bimodal distributions included "0/none of the 5 succeeded in replicating" (positive frame) and "5/all of 5 failed to replicate" (negative frame). As in Experiment 3, the only condition containing a double negative, "0/none of 5 failed to replicate" (negative frame), was characterized by a unimodal distribution. These results indicate that the findings of Experiment 3 were not a fluke: the manner in which natural frequencies are presented (Arabic numerals vs. words) does make a difference to participants.

Experiments 3 and 4 Compared to Experiments 1 and 2

The results of Experiment 3 indicate that a much smaller framing effect obtained when natural frequencies were employed. A statistically significant and practically meaningful framing effect was found in Experiment 4; however, this effect was smaller than that found in either Experiments 1 or 2. Though the effect size of framing was much larger when the replication rate information was presented in the form of percentages, it is clear that the use of natural frequencies does not actually eliminate the effect.

Experiment 5

Experiments 1 through 4 provided a great deal of information regarding how attitude is affected by replication rate information; however, there are two limitations of Experiments 1 through 4. These experiments do not demonstrate that the effects would also obtain in tasks employing choice as the dependent measure, nor do they rule out the possibility that the effects might be due to participants being presented with different percentages within the same questionnaire. Regarding the first point: although Kahneman et al. (1999) have argued that willingness to pay (a decision-based dependent measure) is appropriately viewed as equivalent to an expression of attitude, we cannot be certain that the findings of Experiments 1 through 4 will generalize to a task that relies on choice as a dependent measure. The main variables employed in the first four experiments [replication rate, frame (positive, negative), and numeric format (percentage, natural frequency)] were included in Experiment 5. Participants in Experiment 5 were not asked for their attitude toward a science claim, however. Instead they were presented with evidence for and against a pair of prescription drugs that could be used to treat the same symptoms. After reading about the two drugs, participants indicated which of the two drugs they would choose in order to treat these symptoms.

It is our hypothesis that participants will be more likely to choose a particular drug if the positive findings regarding that drug are associated with a high replication rate (H1) and if the replication rate is framed in a positive manner (H2). We also predict that the framing effect will be larger in the percentage conditions than in the natural frequency conditions (H3).

Methods

Participants

Participants were 458 students enrolled at a community college and 204 college students enrolled at a state university. All students received extra-credit in their psychology courses for their participation. Four hundred and sixty-four participants were female; the age range for the sample was 18–60 and the average age was 24.1.

Stimuli and Procedure

All community college participants completed an online questionnaire created using Opinio software. Participants at the state university completed the same questionnaire on paper. All participants read the following scenario about two drugs:

“Assume that you have just visited your physician where you learned that you suffer from very high blood pressure. Your physician is willing to write out a prescription for one of two drugs to treat high blood pressure: Ansulfazor or Trocamazor.

You know this about Ansulfazor: An experiment conducted by Dr. Howel revealed that Ansulfazor was slightly more effective than Trocamazor, but Ansulfazor had somewhat more side-effects than Trocamazor.

You know this about Trocamazor: An experiment conducted by Dr. Simonson indicated that Trocamazor was much more effective than Ansulfazor.”

A final sentence always followed the information presented above. There were 12 versions of this final sentence, which differed in terms of replication rate (8%, 43%, 81%, 92%, 4 of 10, 2 of 5) and frame (positive, negative). Each participant was presented with only 1 of the 12 variations, as Experiment 5 was a completely between-subjects design. For example, the sentence included in the positively framed 43% replication rate condition was, “Additionally, 43% of the researchers who tried to replicate Dr. Simonson's findings succeeded in doing so.” The wording for the other 11 conditions can be found in Table S6.
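
To make the between-subjects structure concrete, the sketch below enumerates the 12 replication rate × frame cells in Python. Only the positively framed 43% sentence quoted above is filled in; the wordings for the other 11 cells are in Table S6, so they appear here only as placeholders, and the variable names are ours rather than part of the original materials.

```python
from itertools import product

# Six replication rate levels and two frames used in Experiment 5.
replication_rates = ["8%", "43%", "81%", "92%", "4 of 10", "2 of 5"]
frames = ["positive", "negative"]

# Only one final sentence is quoted in the text; the other 11 are in Table S6.
known_sentences = {
    ("43%", "positive"): (
        "Additionally, 43% of the researchers who tried to replicate "
        "Dr. Simonson's findings succeeded in doing so."
    ),
}

# Build the 12 between-subjects cells; each participant saw exactly one.
conditions = [
    {
        "rate": rate,
        "frame": frame,
        "final_sentence": known_sentences.get((rate, frame), "<see Table S6>"),
    }
    for rate, frame in product(replication_rates, frames)
]

assert len(conditions) == 12
for cell in conditions:
    print(cell["rate"], cell["frame"], "->", cell["final_sentence"])
```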

Participants were asked to imagine that they had to choose one of these two drugs and then indicate the one that they would choose (this constituted the sole dependent measure). Once participants indicated their choice (or clicked “next” to move to the next screen) they were presented with a series of demographic questions identical to those in Experiments 3 and 4.

Results

Descriptive statistics for Experiment 5 are presented in Table 8, and a summary of the chi-squared analyses for Experiment 5 is presented in Table 9. Chi-squared tests revealed a significant framing effect for five of the six replication rate conditions (p < 0.001 in all five cases). The framing effect was not significant in the 8% replication rate condition, although the data were in the predicted direction. The significant results for the natural frequency conditions indicate that, in tasks in which choice serves as the dependent measure, framing effects are not attenuated when the replication rate information is communicated in terms of natural frequency.
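
As an illustration of the framing analysis described above, the sketch below runs a single 2 × 2 chi-squared test of frame (positive, negative) against drug choice (Trocamazor, Ansulfazor) for one replication rate condition, using SciPy. The cell counts are invented placeholders; the actual counts are those summarized in Tables 8 and 9.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 2 table for one replication rate condition (e.g., 43%).
# Rows: frame (positive, negative); columns: choice (Trocamazor, Ansulfazor).
# These counts are illustrative placeholders, not the data from Table 8.
observed = np.array([
    [38, 17],  # positive frame
    [15, 40],  # negative frame
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```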

Table 8. Descriptive statistics for Experiment 5: percentage of sample favoring each drug as a function of replication rate and frame.

Table 9. Summary of chi-squared analyses for Experiment 5.

A series of chi-squared analyses was conducted to explore the possibility that the choice task used in Experiment 5 would reveal replication rate effects. These analyses were not conducted on the natural frequency conditions of Experiment 5, as the two conditions (4 of 10, 2 of 5) contain the same replication rate information. For the positive frame, percentage replication rate conditions, each replication rate was compared to every other replication rate using individual chi-squared tests. The data from all comparisons of the positively framed data were in a direction consistent with a replication rate effect, though not all comparisons reached statistical significance. All comparisons were significant at the 0.05 level (p-values ranged from <0.001 to 0.04) except for three: the 43 vs. 92% comparison did not quite reach the 0.05 level (p = 0.06), and the 43 vs. 81% and the 81 vs. 92% comparisons were non-significant (p = 0.17 and 0.60, respectively). Chi-squared tests were also used to determine whether similar replication rate effects were found for the negative frame, percentage replication rate conditions. In contrast to the analyses of the positive frame condition, no significant effects of replication rate were found in the negative frame condition (p-values ranged from 0.24 to 0.93). It may be that the negative frame wording in the choice task was so confusing that participants simply chose an option at random.
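
The pairwise replication rate comparisons can be sketched in the same way: within the positive frame, each pair of percentage conditions gets its own 2 × 2 chi-squared test of replication rate against drug choice. The counts below are again hypothetical placeholders standing in for the condition-level counts reported in Table 8.

```python
from itertools import combinations

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical (chose Trocamazor, chose Ansulfazor) counts for each positively
# framed percentage condition; the real values are reported in Table 8.
positive_frame_counts = {
    "8%": (12, 43),
    "43%": (25, 30),
    "81%": (33, 22),
    "92%": (38, 17),
}

# Compare every pair of replication rates with its own 2 x 2 chi-squared test.
for rate_a, rate_b in combinations(positive_frame_counts, 2):
    table = np.array([positive_frame_counts[rate_a],
                      positive_frame_counts[rate_b]])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{rate_a} vs. {rate_b}: chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
```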

General Discussion

The results of the five studies generally supported three of our four hypotheses. Our samples of college students and members of the general public responded to replication rate information in a sensible fashion: the lower the replication rate of a study, the less favorable attitudes were toward a claim based on that study (H1). There was an even stronger effect of framing, however: participants had more favorable attitudes toward science claims when the replication rate was framed in a positive, rather than a negative, fashion (H2). The framing effect found in the studies employing attitude difference as a dependent measure was replicated in a study in which choice served as the dependent measure. Sensitivity to replication rate was found regardless of whether the replication rate was presented as a percentage or a natural frequency. Based on effect size, x-intercept values, and the mean values of the positive and negative frame conditions, the framing effect appeared to be more pronounced in the percentage experiments than in the two natural frequency experiments (H3). While this suggests the possibility that the framing effect might be attenuated when replication rate information is presented in natural frequency format, we found no support for H3 when choice served as the dependent measure. Finally, the results of Experiments 3 and 4 indicated that the format in which the natural frequencies were presented (words vs. Arabic numerals) did have an impact on participants, and this finding was evidence against H4.

The examples employed in the questionnaires involved natural rather than social science claims. We do not know whether our hypotheses would have been supported had we employed social science claims in lieu of natural science claims. Additionally, the results of the current studies cannot be compared to an objective benchmark. We do not know the “right” or “rational” amount by which attitude regarding a science claim ought to be altered in the face of information about replication rate. So while the results may speak to such issues as how non-scientists are influenced by framing and replication rate information, they cannot tell us whether or not participants were being influenced in the “correct” fashion.

When replication rate information was presented in natural frequency format (Experiments 3 and 4), both the number of attempted replications and the quantity of the subset of the sample that succeeded/failed to replicate the finding were available. In responding to science claims paired with replication rate information presented in a natural frequency format, participants could use (A) both pieces of information, (B) neither piece of information, (C) the quantity of the subset only, or (D) the number of attempted replications only. The strong effect of replication rate in both experiments ruled out options B and D. Because we could not run statistical analyses comparing Experiments 3 and 4, we cannot rule out option C. However, we feel that the data better support option A: the ratio of successful replications to attempted replications is the most likely factor influencing participant attitudes.

The presence of a strong framing effect in all five studies is consistent with the findings of most studies involving what Levin et al. (1998) refer to as attribute frames. The general finding of these studies is that a particular alternative is rated more favorably when described positively than when described negatively. Experiments 1 through 4 extended previous research on this topic by allowing us to compare the effect size of framing with the effect size of replication rate. Not only have we demonstrated that framing is more influential than replication rate, we have also shown how much more influential it is. Given our results, one could predict that a U.S. adult who read a New York Times article claiming that 10% of the attempted replications of study X had failed to replicate the finding would have the same attitude about study X as a similar individual who read a Wall Street Journal article claiming that 45% of the attempted replications of study X had been successful. Studies 1 through 4 indicate that a scientist or science journalist communicating with members of the general public may be able to minimize the impact of framing by communicating replication rate information as natural frequencies rather than percentages.

On the one hand, the results of our attitude experiments suggest that natural frequencies are associated with an attenuated framing effect. On the other hand, there was no evidence that natural frequencies attenuate the framing effect in the choice task. Even in the tasks using attitude as a dependent measure, there was still a significant framing effect in the natural frequency conditions, and the difference in the magnitude of the framing effect between the percentage and the natural frequency conditions was not large.

Although there is a long and rich tradition of exploring the manner in which individuals use and respond to qualitative probability expressions and quantitative probability values (Toogood, 1980; Beyth-Marom, 1982; Nakao and Axelrod, 1983; Budescu and Wallsten, 1985; Mosteller and Youtz, 1990), the literature is silent regarding the impact of using Arabic numerals versus equivalent words in natural frequency expressions (e.g., 5 of 5 vs. all of 5). The finding in Experiment 3 that the impact of some natural frequency expressions was influenced by the manner in which the quantity of the subcategory was described (Arabic numeral vs. word) was unexpected and is difficult to integrate with the literature on natural frequencies. The effect was not replicated in Experiment 4 (although the trends were in the same direction), so it is difficult to interpret the findings.

Even though framing effects are usually established in studies employing choices or decisions as dependent measures, we relied primarily on tasks involving attitude as the dependent measure. In the one experiment in which choice served as the dependent measure (Experiment 5), we replicated the framing effect and partially replicated the replication rate effect. Both the attitude and choice tasks indicated that framing is more important than replication rate. Kahneman et al. (1999) argued that patterns of results found in tasks employing choice and contingent valuation (CVM) as dependent measures ought to be found in tasks using attitude as a dependent measure. They claimed that choice is a special case of comparative valuation, and they argued that attitude and CVM responses are highly correlated and that the same things that affect dollar valuations (frames, anchors, etc.) tend also to affect attitudes (see also Kahneman and Ritov, 1994; Kahneman et al., 1998; Payne et al., 1999). The fact that we partially replicated the results of Experiments 1 through 4 with a study employing choice as the dependent measure is consistent with the perspective of Kahneman et al. (1999).

The current studies found that agreement with a claim increases with the percentage of researchers who replicate (and thus agree with) the claim. The results of the current studies have so far been discussed in terms of judgment and decision making (JDM) and science communication. However, social learning researchers have found empirical evidence of learning through strategies that include “copy the majority” (Boyd and Richerson, 1985; Pike and Laland, 2010) and “copy the behavior with the highest absolute number of demonstrators” (Bond, 2005). To the degree that attitude change is related to behavioral change, the replication rate effect in the present studies might suggest a way to form a bridge between JDM and social learning approaches. The results of the current studies are consistent with past studies that have explored attribute framing, but the current studies are unique in that they demonstrate an effect of framing upon interpretations of a given replication rate and do so using both attitude and choice as dependent measures. There are important practical implications of the rather large framing effect shown in these studies. Scientists, politicians, policy makers, journalists, corporations, and think tanks use scientific findings in order to influence public opinion about a wide range of critical issues. The current studies indicate that non-scientists are strongly influenced by knowledge of replication, and they also illustrate that the framing of the replication rate of a given scientific finding is an important tool that communicators can exploit. In fact, the current results suggest that public opinion regarding an issue such as global warming may be influenced as much by the manner in which the empirical findings are framed as by the findings themselves.

Author Contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer MF and the handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2016.01826/full#supplementary-material

References

Bayarri, M. J., and Berger, J. (1991). Comment. Stat. Sci. 6, 379–382. doi: 10.1214/ss/1177011578

Beyth-Marom, R. (1982). How probable is probable? A numerical translation of verbal probability expressions. J. Forecast. 1, 257–269. doi: 10.1002/for.3980010305

Bond, R. (2005). Group size and conformity. Group Process. Intergroup Relat. 8, 331–354. doi: 10.1177/1368430205056464

Boseley, S. (2010, February 2). Lancet retracts ‘utterly false’ MMR paper. The Guardian.

Boyd, R., and Richerson, P. J. (1985). Culture and the Evolutionary Process. Chicago, IL: Chicago University Press.

Brase, G. L., Cosmides, L., and Tooby, J. (1998). Individuation, counting and statistical inference: the role of frequency and whole-object representations in judgment under uncertainty. J. Exp. Psychol. 127, 3–21. doi: 10.1037/0096-3445.127.1.3

Browne, M. W. (1989, May 3). Physicists debunk claim of a new kind of fusion. The New York Times. Available online at: http://partners.nytimes.com/library/national/science/050399sci-cold-fusion.html

Budescu, D. V., and Wallsten, T. S. (1985). Consistency in interpretation of probabilistic phrases. Organ. Behav. Hum. Decis. Process. 36, 391–405. doi: 10.1016/0749-5978(85)90007-X

Cohen, A. J. (1997). Replication. Epidemiology 8, 341–343.

Cosmides, L., and Tooby, J. (1996). Are humans good intuitive statisticians after all? Rethinking some conclusions of the literature on judgment under uncertainty. Cognition 58, 1–73. doi: 10.1016/0010-0277(95)00664-8

Davis, M. A., and Bobko, P. (1986). Contextual effects on escalation processes in public sector decision making. Organ. Behav. Hum. Decis. Process. 37, 121–138. doi: 10.1016/0749-5978(86)90048-8

Feynman, R. P. (1967). The Character of Physical Law. Cambridge, MA: MIT Press.

Fiedler, K. (1988). The dependence of the conjunction fallacy on subtle linguistic factors. Psychol. Res. 50, 123–129. doi: 10.1007/BF00309212

Fowler, L. L. (1995). Replication as regulation. Polit. Sci. Polit. 28, 478–481. doi: 10.1017/S1049096500057711

Frank, M. C., and Saxe, R. (2012). Teaching replication. Perspect. Psychol. Sci. 7, 600–604. doi: 10.1177/1745691612460686

Gigerenzer, G. (1996). On narrow norms and vague heuristics: a rebuttal to Kahneman and Tversky. Psychol. Rev. 103, 592–596. doi: 10.1037/0033-295X.103.3.592

Gigerenzer, G., and Hoffrage, U. (1995). How to improve Bayesian reasoning without instruction: frequency formats. Psychol. Rev. 102, 684–704. doi: 10.1037/0033-295X.102.4.684

Gigerenzer, G., Hoffrage, U., and Kleinbölting, H. (1991). Probabilistic mental models: a Brunswikian theory of confidence. Psychol. Rev. 98, 506–528. doi: 10.1037/0033-295X.98.4.506

Gong, J., Zhang, Y., Yang, Z., Huang, Y., Feng, J., and Zhang, W. (2013). The framing effect in medical decision-making: a review of the literature. Psychol. Health Med. 18, 645–653. doi: 10.1080/13548506.2013.766352

Hirschhorn, J. N., Lohmueller, K., Byrne, E., and Hirschhorn, K. (2002). A comprehensive review of genetic association studies. Genet. Med. 4, 45–61. doi: 10.1097/00125817-200203000-00002

Hoffrage, U., Gigerenzer, G., Krauss, S., and Martignon, L. (2002). Representation facilitates reasoning: what natural frequencies are and what they are not. Cognition 84, 343–352. doi: 10.1016/S0010-0277(02)00050-1

Hoffrage, U., Lindsey, S., Hertwig, R., and Gigerenzer, G. (2000). Communicating statistical information. Science 290, 2261–2262. doi: 10.1126/science.290.5500.2261

Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Med. 2:e124. doi: 10.1371/journal.pmed.0020124

Ioannidis, J. P., and Khoury, M. J. (2011). Improving validation practices in “omics” research. Science 334, 1230–1232. doi: 10.1126/science.1211811

Kahneman, D., and Ritov, I. (1994). Determinants of Stated Willingness to pay for public goods: a study in the headline method. J. Risk Uncertain. 9, 5–37. doi: 10.1007/BF01073401

Kahneman, D., and Tversky, A. (1996). On the reality of cognitive illusions. Psychol. Rev. 103, 582–591. doi: 10.1037/0033-295X.103.3.582

Kahneman, D., Ritov, I., and Schkade, D. (1999). Economic preferences or attitude expressions? An analysis of dollar responses to public issues. J. Risk Uncertain. 19, 203–235. doi: 10.1023/A:1007835629236

Kahneman, D., Schkade, D., and Sunstein, C. R. (1998). Shared outrage and erratic awards: the psychology of punitive damages. J. Risk Uncertain. 16, 49–86. doi: 10.1023/A:1007710408413

Koehler, J. J., Gibbs, B. J., and Hogarth, R. M. (1994). Shattering the illusion of control: multi-shot versus single-shot gambles. J. Behav. Decis. Mak. 7, 183–191. doi: 10.1002/bdm.3960070304

Lehrer, J. (2010, December 13). The truth wears off: Is there something wrong with the scientific method? The New Yorker, pp. 52–57.

Levin, I. P., and Gaeth, G. J. (1988). How consumers are affected by the framing of attribute information before and after consuming the product. J. Cons. Res. 15, 374–378. doi: 10.1086/209174

Levin, I. P., Schneider, S. L., and Gaeth, G. J. (1998). All frames are not created equal: a typology and critical analysis of framing effects. Organ. Behav. Hum. Decis. Process. 76, 149–188. doi: 10.1006/obhd.1998.2804

Macdonald, R. R. (1986). Credible conceptions and implausible probabilities. Br. J. Math. Stat. Psychol. 39, 15–27. doi: 10.1111/j.2044-8317.1986.tb00842.x

Mandel, D. R. (2001). Gain-loss framing and choice: separating outcome formulations from descriptor formulations. Organ. Behav. Hum. Decis. Process. 85, 56–76. doi: 10.1006/obhd.2000.2932

McKenzie, C. R., and Nelson, J. D. (2003). What a speaker's choice of frame reveals: reference points, frame selection, and framing effects. Psychon. Bull. Rev. 10, 596–602. doi: 10.3758/BF03196520

Mosteller, F., and Youtz, C. (1990). Quantifying probabilistic expressions. Stat. Sci. 5, 2–16.

Nakao, M. A., and Axelrod, S. (1983). Numbers are better than words: verbal specifications of frequency have no place in medicine. Am. J. Med. 74, 1061–1065. doi: 10.1016/0002-9343(83)90819-7

Payne, J. W., Bettman, J. R., and Schkade, D. A. (1999). Measuring constructed preferences: towards a building code. J. Risk Uncertain. 19, 243–270. doi: 10.1023/A:1007843931054

Pike, T. W., and Laland, K. N. (2010). Conformist learning in nine-spined sticklebacks' foraging decisions. Biol. Lett. 6, 466–468. doi: 10.1098/rsbl.2009.1014

Purchase, I. F. H., and Slovic, P. (1999). Quantitative risk assessment breeds fear. Hum. Ecol. Risk Assess. 5, 445–453. doi: 10.1080/10807039.1999.10518869

Risch, N. J. (2000). Searching for genetic determinants in the new millennium. Nature 405, 847–856. doi: 10.1038/35015718

Santer, B. D., Wigley, T. M., and Taylor, K. E. (2011). The reproducibility of observational estimates of surface and atmospheric temperature change. Science 334, 1232–1233. doi: 10.1126/science.1216273

Sher, S., and McKenzie, C. R. (2006). Information leakage from logically equivalent frames. Cognition 101, 467–494. doi: 10.1016/j.cognition.2005.11.001

Simmons, J. P., Nelson, L. D., and Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366. doi: 10.1177/0956797611417632

Slovic, P., Monahan, J., and MacGregor, D. G. (2000). Violence risk assessment and risk communication: the effects of using actual cases, providing instruction, and employing probability versus frequency formats. Law Hum. Behav. 24, 271–296. doi: 10.1023/A:1005595519944

Sohn, D. (1998). Statistical Significance and Replicability: why the former does not presage the latter. Theory Psychol. 8, 291–311. doi: 10.1177/0959354398083001

Sterne, J. A., and Davey Smith, G. (2001). Sifting the evidence-what's wrong with significance tests? BMJ 322, 226–231. doi: 10.1136/bmj.322.7280.226

Sullivan, W. (1988, July 27). Water that has a memory? Skeptics win second round. The New York Times. Available online at: http://www.nytimes.com/1988/07/27/us/water-that-has-a-memory-skeptics-win-second-round.html?src=pm

Tomasello, M., and Call, J. (2011). Methodological challenges in the study of primate cognition. Science 334, 1227–1228. doi: 10.1126/science.1213443

Toogood, J. H. (1980). What do we mean by “usually”? Lancet 1:1094. doi: 10.1016/S0140-6736(80)91544-5

Tuller, D. (2011). Fatigue syndrome study is retracted by journal. The New York Times. Available online at: http://www.nytimes.com/2011/12/23/health/research/science-journal-retracts-chronic-fatigue-syndrome-paper.html

Tversky, A., and Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science 211, 453–458. doi: 10.1126/science.7455683

Wacholder, S., Chanock, S., Garcia-Closas, M., El ghormli, L., and Rothman, N. (2004). Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J. Natl. Cancer Inst. 96, 434–442. doi: 10.1093/jnci/djh075

Keywords: framing, natural frequencies, probability judgments, public perception of science, replication, representation of information

Citation: Barnes RM, Tobin SJ, Johnston HM, MacKenzie N and Taglang CM (2016) Replication Rate, Framing, and Format Affect Attitudes and Decisions about Science Claims. Front. Psychol. 7:1826. doi: 10.3389/fpsyg.2016.01826

Received: 07 July 2016; Accepted: 03 November 2016;
Published: 22 November 2016.

Edited by:

Antonino Vallesi, University of Padua, Italy

Reviewed by:

Marco Perugini, University of Milano-Bicocca, Italy
Michele Furlan, University of Padua, Italy

Copyright © 2016 Barnes, Tobin, Johnston, MacKenzie and Taglang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Ralph M. Barnes, ralph.barnes@montana.edu

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.