Student-delivered behavior-specific praise: a systematic literature review and meta-analysis

Royer, David James; Ennis, Robin Parks

doi:10.3389/feduc.2024.1444394

SYSTEMATIC REVIEW article

Front. Educ., 13 September 2024

Sec. Special Educational Needs

Volume 9 - 2024 | https://doi.org/10.3389/feduc.2024.1444394

This article is part of the Research Topic: Behavior-specific praise in preK-12 settings: Expanding the knowledge base

David James Royer¹*†

Robin Parks Ennis²†‡

¹Department of Special Education, Early Childhood, and Prevention Science, University of Louisville, Louisville, KY, United States
²Department of Curriculum and Instruction, University of Alabama at Birmingham, Birmingham, AL, United States

Behavior-specific praise is an easy-to-implement, teacher-delivered strategy that supports academic engagement while preventing and reducing disruptive behavior. When teachers let students know specifically what they did to meet academic, behavioral, and/or social expectations, students who find teacher attention reinforcing are more likely to engage in the same behavior more often in the future. While teacher-delivered behavior-specific praise was classified as a potentially evidence-based practice using Council for Exceptional Children standards, less is known about the effects of students delivering behavior-specific praise to their peers. This systematic literature review and meta-analysis explored the literature base and found 36 articles meeting inclusion criteria. Fifteen articles included positive peer reporting as the independent variable, 20 included tootling as the intervention, two compared those interventions, and three used an “other” form of peer praise (i.e., peer praise notes, peer monitor tokens). Nine tootling articles met all eight quality indicators by absolute coding, and 32 out of all 36 studies met an 80% weighted quality indicator coding criterion for being methodologically sound. From these, we classified positive peer reporting in the mixed evidence category and tootling in the evidence-based practice category. We discuss benefits of various components in each type of peer praise intervention and limitations of the literature review, and we make recommendations for future researchers.

Introduction

Students with emotional and behavioral disorders (EBD) are those who have difficulty meeting school expectations, from following the rules, to performing academically at grade level, to sustaining appropriate peer and adult relationships (Mundschenk and Simpson, 2014). Point prevalence estimates indicate 12% of students have at least a moderate EBD and 20% have at least a mild EBD (Forness et al., 2012), yet only 0.5% of students received special education services under the emotional disturbance (ED) category of Individuals with Disabilities Education Improvement Act (2004) each year from 2011 to 2020 (latest data), down from 0.7% in years 2005–2007 and 0.6% in years 2008–2010 (U.S. Department of Education, 2022). This means most students with EBD, those classified as having externalizing (e.g., aggression, defiance, arguing, disruptive behavior, rule violations, substance use; Romer et al., 2020) and/or internalizing (e.g., withdrawal, negative affect, anxiety, depression; Romer et al., 2020) behavior patterns, attend general education classes and do not receive special education support.

As a result of the challenges associated with EBD, students with or at risk for EBD often experience social isolation, peer rejection, and fewer positive interactions with students and adults (Zweers et al., 2021). Certainly this makes sense, as young students especially may not have the social skills to develop a good relationship with someone who is volatile, or they may not want to be friends with a peer they perceive to frequently get in trouble at school. Similarly, for adults, without the skills and strategies needed to support students with EBD, it can be difficult for teachers to maintain a positive or supportive relationship with them (O’Connor et al., 2011), and it may seem easier for some teachers to simply send a student with EBD out of the classroom when they are repeatedly disruptive, for example. Over the last 20 years or so, however, more schools are working to adopt tiered models that prevent most challenging behavior, such as positive behavioral interventions and supports (PBIS; Sugai and Horner, 2019) and the comprehensive, integrated, three-tiered (Ci3T; Lane et al., 2019b) model of prevention. Within these models of increasingly intensive student supports, educators are empowered with tools and low-intensity strategies that increase their classroom self-efficacy and give them confidence in their ability to keep students with challenging behavior in the classroom learning.

One way for teachers to increase positive interactions with all students at Tier 1, including those with or at risk for EBD, is to focus on the ratio of positive statements to corrections and reprimands (Caldarella et al., 2023). Many teachers receive training at some point early in their career about having a 4:1 or 5:1 ratio of positive to negative statements, such as learning for every academic correction (e.g., “you forgot to take the reciprocal of the fraction”) to also give a few positive acknowledgments (e.g., “I see you were following the mnemonic we learned yesterday and I like that you remembered to isolate the variable”) immediately, and more later even if unrelated to the initial correction (e.g., “Your printing is very neat,” “Thank you for raising your hand and waiting quietly,” “Well done”) in order to get to a higher ratio of positives to negatives. These positive statements can have the most impact on future student academic performance and behavior when they are specific in identifying exactly what the student did well (Brophy, 1981).

Behavior-specific praise (BSP) is a form of positive reinforcement that specifically acknowledges desired behaviors and strengthens the likelihood socially acceptable behaviors will occur more often in the future, especially when students like the attention (Cooper et al., 2020). BSP statements can be written or oral, indicating precisely the behavior observed (including academic behavior) that met expectations (Menzies et al., 2023). For example, a teacher might say to on-task students during math work, “I like the way you are using your small white board to show your math work” when there was a past incident of inappropriate white board use or to promote continued appropriate use. When BSP is sincere, varied, targets effort instead of ability, and the student finds attention reinforcing, what was specifically praised is more likely to occur more often (Lane et al., 2015). This contrasts with general praise, where a specific action is not identified, such as saying, “Good job” or giving a thumbs up with a smile. General praise is a good way to increase positive interactions with students too, but BSP has the added benefit of specifying exactly what expectations were met, not only for the child receiving the praise but as a reminder to all students nearby (Sutherland et al., 2000). There are many studies showing the impact BSP has on increasing academic engagement and decreasing disruptive behavior, and the strategy has been classified as a potentially evidence-based practice applying Council for Exceptional Children (2014) standards (Royer et al., 2019).

Unfortunately, naturally occurring rates of positive feedback are “alarmingly low” (Scott et al., 2017, p. 61), even at the elementary level where the rate per minute is 0.137 on average, with a positive to negative ratio of 3:1. That means an elementary student typically receives positive feedback from their teacher only about every 7.3 min, and the rates and ratios are even lower at middle school (0.061 or every 16.4 min; 1.74:1) and high school (0.033 or every 30.3 min; 0.65:1; Scott et al., 2017). Clearly, there is a need for students to receive higher rates of positive interactions from their teachers, and potentially, their peers can help as well.
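As a quick arithmetic check, the average time between positive statements is the reciprocal of the per-minute rate. The sketch below is illustrative only: it uses the rates cited above from Scott et al. (2017), and the variable and function names are hypothetical.

```python
# Per-minute rates of teacher positive feedback, as cited from Scott et al. (2017).
rates_per_min = {"elementary": 0.137, "middle": 0.061, "high": 0.033}


def minutes_between(rate_per_min: float) -> float:
    """Average minutes between statements is the reciprocal of the rate."""
    return 1.0 / rate_per_min


intervals = {
    level: round(minutes_between(rate), 1)
    for level, rate in rates_per_min.items()
}
print(intervals)  # {'elementary': 7.3, 'middle': 16.4, 'high': 30.3}
```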

Student-delivered behavior-specific praise

Students can deliver praise to their peers and help increase the number of positive interactions, especially for students with EBD and/or who are socially isolated (usually due to internalizing behavior patterns) or socially rejected (usually due to externalizing behavior patterns). Collins et al. (2020) conducted a meta-analytic review of peer-reporting interventions utilizing single-case research designs and identified 21 studies meeting inclusion criteria. Their findings suggested peer reporting interventions had a positive impact on student behavior outcomes, noting variability among included studies’ approaches to peer praise. Additionally, authors compared studies using log response ratios, tau (measure of overlap), and moderating effects of targeted contextual variables. While Collins et al. (2020) applied elements of What Works Clearinghouse standards to their inclusion criteria, they did not code studies for quality indicators and did not evaluate each approach to peer praise in isolation. When Ennis et al. (2020) mapped the 50-year knowledge base on BSP, they found six journal articles on peer praise, such as peer praise notes used to increase social interactions among three junior high school students at risk for EBD (Peterson Nelson et al., 2008) and peer praise notes to reduce problem behaviors at recess for an elementary school with 462 students (Teerlink et al., 2017). Even more prolific than praise notes were the approaches to peer praise called positive peer reporting and tootling.

Positive peer reporting

Positive peer reporting (PPR) is a brief period of time for peers to publicly praise typically one “star” target student with BSP, encouraging prosocial behavior and earning tokens for each appropriate BSP toward a class reward (group contingency). PPR interventions are generally designed to increase the frequency and improve the quality of the target student’s prosocial interactions with peers (Morrison and Jones, 2007) and have added benefits for the whole class’s behavior given the group contingency. When the star is not known (one variation of PPR), students, in theory, are on their best behavior in case they will be the ones publicly praised later. Studies vary in terms of how long a student was the star (e.g., changed each day, each week), how many stars (e.g., one, three), when peers observe the star for prosocial behaviors (e.g., all day, during one subject), if the star is known or unknown, and when and for how long peers publicly praise the star (e.g., end of subject for 3 min, end of day for 10 min). All but one PPR study was published before 2014, when the peer praise literature turned all but exclusively to investigating tootling interventions.

Tootling

Tootling is a classwide application of PPR where students observe all peers instead of one or a few stars and privately report specific prosocial behaviors on index cards to the teacher. Each appropriate tootle with required components (e.g., names of both the student giving and the student receiving praise, a praise statement that is specific) earns points toward a class reward (group contingency). The name tootling comes from merging ‘toot your own horn’ and tattling and is intended to be the opposite of tattling (Skinner et al., 1998). Tootling interventions vary in terms of how long peer observations last (e.g., all day, during one subject), how many tootles can be written and turned in (e.g., two maximum per session, unlimited), and when and for how long the teacher reads tootles to the class (e.g., end of subject for 5 min, end of day for 3 min). Some studies included a public posting of tootles for everyone to read, either using technology like Class Dojo for live display when entered by students electronically (McHugh Dillon et al., 2019) or using paper posted to a bulletin board (Harry et al., 2023).

In a seemingly transitionary time of researchers shifting focus from PPR to tootling, two studies compared PPR to tootling. Barahona (2010) found neither intervention reduced disruptive behavior more than a minimal amount across three elementary grade 3 general education classrooms, while in contrast, Sherman (2012) found both PPR and tootling increased appropriate behavior and reduced inappropriate behavior for four students in general education classrooms grades 3–6. More analyses are therefore needed to determine how PPR compares to tootling and how effective peer-delivered praise is, generally.

Purpose

Given the emphasis in PBIS and Ci3T tiered models of prevention on teachers using the low-intensity strategy of behavior-specific praise (BSP) to support positive, productive, safe learning environments, and given Ennis et al. (2020) found six peer praise studies but did not include theses and dissertations, the purpose of this systematic literature review and meta-analysis was to explore student-delivered praise further. Specifically, our research questions were: (a) To what extent did peer praise interventions address Council for Exceptional Children (2014) quality indicators of methodologically sound studies? (b) What is the evidence-based practice status of peer praise according to Council for Exceptional Children (2014) guidelines, applying an 80% minimum criterion for methodologically sound studies (Lane et al., 2009)? (c) What was the magnitude of effects for peer praise interventions?

Method

Search and article selection

We conducted an exhaustive search of student-delivered BSP research, involving four search steps: (1) electronic, (2) hand, (3) ancestral, and (4) expert nomination (Lane et al., 2022). First, we searched Educational Resources Information Center (ERIC), ProQuest Dissertations and Theses Global, American Psychological Association (APA) PsycINFO, APA PsycARTICLES, and Research Library through December 2023 using Boolean search terms (behavio* AND specific AND praise AND peer) OR (tootling), “peer praise note*,” and “positive peer reporting.” This search returned 183 unique manuscripts (articles, theses/dissertations) after duplicates were removed (see Figure 1). Both authors independently screened titles and abstracts for inclusion; interrater reliability (IRR) was 97.81%, and Cohen’s κ, which takes chance agreement into consideration, was 0.95 (95% CI = [0.90, 1.00]), indicating near-perfect agreement (Cohen, 1960; Landis and Koch, 1977). This screening resulted in 60 manuscripts to read in full. Both authors independently read these in full and found 35 manuscripts for inclusion (91.38% IRR; κ = 0.81, 95% CI = [0.65, 0.97], indicating substantial agreement). Next, both authors conducted independent hand searches of any journal with two or more published studies included in our electronic search (i.e., Journal of Behavioral Education, Journal of Positive Behavior Interventions, School Psychology Review) and found no additional articles for inclusion (IRR = 100%; κ = 1.00). Both authors then conducted independent ancestral searches of included studies’ references, yielding 43 titles to screen. We obtained abstracts, and of those, 13 studies were then obtained to read in full, with two additional articles identified for inclusion (κ = 0.94, 95% CI = [0.91, 0.96], indicating near-perfect agreement) for a total of 37 included studies. Finally, we contacted corresponding authors and journal editors to inquire about any additional studies utilizing student-delivered BSP; while five articles were nominated from this step, no additional manuscripts were included. Later, when we began quality indicator coding, we realized one article (Wilson et al., 2001) involved counting tootles without ever sharing them aloud with students (thus students never heard peer praise intended for them), so we excluded it at that stage, resulting in 36 total studies.
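To illustrate how percent agreement and Cohen's κ differ, the following minimal sketch computes both for two raters' include/exclude decisions. The ratings are invented for illustration and are not the screening data from this review; the function names are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Percentage of items on which both raters made the same decision."""
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * agreements / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement: product of each rater's marginal proportions per category.
    p_expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical decisions for 10 manuscripts (1 = include, 0 = exclude).
rater_a = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

print(percent_agreement(rater_a, rater_b))       # 90.0
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.78
```

In this made-up example the raters agree on 90% of items, yet κ is lower because some of that agreement would be expected by chance alone.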

Figure 1. Systematic search procedures and inclusion illustration for peer-delivered behavior-specific praise (peer praise) literature review.

Inclusion criteria

The included studies met six criteria. First, independent variable(s) included, primarily, student-delivered verbal or written BSP, defined as “providing students with praise statements that explicitly describe the behavior being praised” (Allday et al., 2012, p. 87), and was not packaged with other interventions (e.g., precorrection, Good Behavior Game, peer tutoring). Group contingency, self-monitoring, performance feedback, and other forms of increasing peer-delivered BSP were acceptable pseudo-packages (components of a peer-to-peer praise intervention). If students tootled to teachers about peers, studies were included when tootles were read aloud or somehow shared with peers later. Second, dependent variable(s) included at least one of the following student outcome measures: challenging behavior (e.g., disruptive behavior, problem behavior, aggression, off-task), time on task/academic engaged time, social skills, social interactions (including compliments and encouragements), and/or social status. Third, participants were school-age youth, general education or special education, from grades preK-12. Fourth, the intervention took place in a school setting, including university-sponsored laboratory schools (non-clinical) and alternative schools for students with severe behavior when part of a public or private school district. Studies conducted in residential treatment centers were also included if the study took place in the school setting. Home settings, or clinics resembling classroom settings, were excluded as they were highly controlled settings, varying substantially from traditional school settings. Fifth, the study followed an experimental design: single case or group. Sixth, the study was a thesis, dissertation, or journal article available in English. We did not place a date restriction and accepted articles from any year.

Coding procedures

To understand both the rigor and relevance of the included studies, we conducted both quality indicator (QI) and descriptive coding. Both authors have published numerous quality assessment reviews, both together (e.g., Ennis et al., 2017; Royer et al., 2019) and separately (e.g., Royer et al., 2017; Ennis and Losinski, 2019); therefore, we did not first code practice articles from outside this review before coding the included studies. We met and reviewed the elements relevant to this review prior to coding, discussing potential nuances to QIs, then coded one article at a time independently before meeting to discuss discrepancies and clarify QIs before coding the next study.

QI coding

We independently coded included articles for Council for Exceptional Children (2014) QIs of methodologically sound studies. QI 1.0 examines context and setting and we required studies to have at least one demographic variable to describe the setting that confirmed inclusion (e.g., school setting). QI 2.0 examines the participants and we again required at least one demographic variable for participants (2.1) and a description of why students or classes (depending on the case of analysis) were targeted for inclusion (2.2). QI 3.0 examines the intervention agent. Since student-delivered BSP typically involved implementation steps by adults and students, we required one demographic variable for each type of interventionist (3.1) and required evidence of both adult and student interventionist training, including an active check for understanding or use of a script to deliver the intervention (3.2). The remaining QIs did not require unique clarifications or distinction for this review, including (4.0) description of practice, (5.0) implementation fidelity, (6.0) internal validity, (7.0) outcome measures/dependent variables (DVs), and (8.0) data analysis. Certain quality indicators are only applicable to either single-case (i.e., 6.5, 6.6, 6.7, 8.2) or group (i.e., 6.4, 6.8, 6.9, 7.6, 8.1, 8.3) design methodology and we applied them accordingly. For additional details on QI components, please see Council for Exceptional Children (2014).

We independently coded articles in a QI matrix (Lane et al., 2019a) in MS Excel one at a time, then compared and discussed any disagreements before coding the next. The mean IRR was 98.36% across all 36 articles (range = 89.29%–100%) and 97.11% by QI component (range = 80.00%–100%). Overall κ for QI coding was 0.89 (95% CI = [0.83, 0.94]), indicating near-perfect agreement.

During QI coding, both authors independently made notes of descriptive characteristics of the studies that correspond to the Council for Exceptional Children (2014) QI. The first author’s coding was used to create the descriptive table and the second author verified all information cell-by-cell, and while no errors were found, she suggested 24 refinements (out of 288 table cells) for easier readability. IRR for descriptive coding was 91.67%.

Evaluation procedures for classifying the evidence base of practices

For a study to be included in calculations for an evidence-based practice category, it had to meet 80% or more of QIs (Lane et al., 2009) using weighted coding, and if the study utilized single-case design, it had to include at least three cases (e.g., students, classrooms) and QI 6.5 had to be met (the design had to provide the possibility of at least three demonstrations of effect). We reviewed studies meeting these criteria and classified them as having either positive, neutral or mixed, or negative effects according to Council for Exceptional Children (2014) standards. For group studies, we used author-published effect sizes or calculated effect sizes when enough data were provided (e.g., n, M, and SD per group), then followed What Works Clearinghouse cut scores (as listed in Council for Exceptional Children, 2014) for positive (d ≥ 0.25), neutral or mixed (−0.25 < d < 0.25), or negative (d ≤ −0.25) effects.
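The two screening rules in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the coding matrix used in the review: it assumes weighted coding awards each QI partial credit equal to the proportion of its components met, and the component counts and function names are hypothetical.

```python
def weighted_qi_score(components_met):
    """Sum of per-QI proportions of components met (assumed partial-credit rule)."""
    return sum(sum(flags) / len(flags) for flags in components_met.values())


def absolute_qi_score(components_met):
    """Number of QIs for which every component was met."""
    return sum(all(flags) for flags in components_met.values())


def is_methodologically_sound(components_met, criterion=0.80):
    """Apply the 80% weighted criterion (6.4 of 8 QIs when there are eight QIs)."""
    return weighted_qi_score(components_met) >= criterion * len(components_met)


def classify_group_effect(d):
    """What Works Clearinghouse cut scores as listed in the text."""
    if d >= 0.25:
        return "positive"
    if d <= -0.25:
        return "negative"
    return "neutral or mixed"


# Hypothetical study: the number of components per QI is invented for illustration.
study = {
    "QI1": [True],
    "QI2": [True, True],
    "QI3": [True, False],
    "QI4": [True, True],
    "QI5": [True, True, False],
    "QI6": [True, True, True],
    "QI7": [True, True, True],
    "QI8": [True],
}

print(absolute_qi_score(study))            # 6
print(round(weighted_qi_score(study), 2))  # 7.17
print(is_methodologically_sound(study))    # True
print(classify_group_effect(0.31))         # positive
```

Under these assumptions, a study can fall short of the absolute criterion (all eight QIs fully met) and still count as methodologically sound under weighted coding.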

We then used these classifications to determine if student-delivered BSP met Council for Exceptional Children (2014) criteria for an evidence-based practice (EBP), potentially EBP, mixed evidence, insufficient evidence, or negative effects. Council for Exceptional Children (2014) standards state an evidence-based practice (intervention, strategy, or practice scientifically validated through rigorous research methodology) has one of the following: (a) two group design studies utilizing randomized assignment with 60 or more participants, (b) four group design studies not utilizing randomized assignment with 120 or more participants, (c) five single-case studies (each with at least three participants and 75% or more showing therapeutic outcomes) with 20 or more total participants, or (d) a combination of group and single-case studies. Combinations can include one group randomized with 30 or more total participants and three single-case studies with 10 or more total participants, or two group non-randomized with 60 or more total participants and three single-case studies with 10 or more total participants. Additionally, no study can have negative effects and the ratio of studies with positive effects to neutral or mixed effects must be at least 3:1. More details about potentially EBP, insufficient evidence, and negative effects category criteria can be found in Council for Exceptional Children (2014).
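The evidence-based practice rules summarized above can be expressed as a simple decision function. The sketch below is a hedged illustration, not an official implementation of the Council for Exceptional Children (2014) standards: it assumes only studies with positive effects count toward the study and participant totals, assumes every study passed in is already methodologically sound (including at least three participants per single-case study), and uses hypothetical class and field names.

```python
from dataclasses import dataclass


@dataclass
class Study:
    design: str        # "group_randomized", "group_nonrandomized", or "single_case"
    participants: int
    effect: str        # "positive", "neutral_or_mixed", or "negative"


def _positive_totals(studies, design):
    """Count positive-effect studies of one design and sum their participants."""
    hits = [s for s in studies if s.design == design and s.effect == "positive"]
    return len(hits), sum(s.participants for s in hits)


def meets_ebp_criteria(studies):
    """Return True if the study set satisfies the EBP rules described in the text."""
    # No study may have negative effects.
    if any(s.effect == "negative" for s in studies):
        return False
    # Positive to neutral-or-mixed ratio must be at least 3:1.
    n_positive = sum(s.effect == "positive" for s in studies)
    n_mixed = sum(s.effect == "neutral_or_mixed" for s in studies)
    if n_positive < 3 * n_mixed:
        return False
    gr_k, gr_n = _positive_totals(studies, "group_randomized")
    gn_k, gn_n = _positive_totals(studies, "group_nonrandomized")
    sc_k, sc_n = _positive_totals(studies, "single_case")
    return (
        (gr_k >= 2 and gr_n >= 60)
        or (gn_k >= 4 and gn_n >= 120)
        or (sc_k >= 5 and sc_n >= 20)
        or (gr_k >= 1 and gr_n >= 30 and sc_k >= 3 and sc_n >= 10)
        or (gn_k >= 2 and gn_n >= 60 and sc_k >= 3 and sc_n >= 10)
    )


# Hypothetical evidence base: five single-case studies with four participants each.
evidence = [Study("single_case", 4, "positive") for _ in range(5)]
print(meets_ebp_criteria(evidence))  # True
```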

Data extraction and analysis

We calculated effect sizes for each dependent variable in group and single-case design studies that were eligible to contribute to the evidence-based practice classification (i.e., met our Council for Exceptional Children, 2014 80% weighted criterion, met QI 6.5, and had three or more cases if a single-case research design study). First, we extracted data from graphs using WebPlotDigitizer (Rohatgi, 2024) prior to performing analysis. When a study had multiple outcome measures, our primary focus was on outcomes of academic engagement/on-task behavior and disruptive behaviors. When study designs included multiple intervention conditions, such as students serving as peer praise recipient and peer praise teller (e.g., Chenier, 2010), we combined intervention conditions into one and compared those results to baseline.

For withdrawal/reversal and multiple baseline designs, we utilized a web-based calculator (Pustejovsky et al., 2023) to calculate between-case standardized mean difference (BC-SMD) effect size estimates. For the one eligible alternating treatment design study (Thoele, 2024), we utilized a web-based calculator (Manolov and Onghena, 2018) to calculate an average difference between successive observations (ADISO) value. ADISO values can be standardized for comparison across studies by dividing by the standard deviation. For group design studies, we used author-provided n, M, and SD for each group to calculate Hedges’s g. BC-SMD and standardized ADISO effect sizes are comparable to standardized mean differences from group comparison design studies (Valentine et al., 2016). Effect sizes were interpreted as small (0.20–0.50), medium (0.50–0.80), or large (≥0.80; Fritz et al., 2012). When determining if a single-case research design study had positive, neutral or mixed, or negative effects for consideration for the evidence base, we relied on the more conservative visual analysis in keeping with Council for Exceptional Children (2014) standards for evidence-based practices (as opposed to substituting our calculated effect size estimates).
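For readers unfamiliar with Hedges's g, the following sketch shows one common way to compute it from each group's n, M, and SD and to apply the interpretation bands above. It is an illustration under assumptions rather than the exact procedure used here: it uses the usual approximation for the small-sample correction factor, assigns the shared boundary values (0.50, 0.80) to the higher band, labels values under 0.20 "negligible" (a label not used in the text), and runs on invented numbers.

```python
import math


def hedges_g(n1, m1, sd1, n2, m2, sd2):
    """Standardized mean difference with a small-sample correction."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / pooled_sd
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)  # common approximation of J
    return correction * d


def interpret(effect_size):
    """Interpretation bands from the text, applied to the absolute value."""
    size = abs(effect_size)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "negligible"  # assumed label for values below 0.20


# Hypothetical intervention vs. comparison group summary statistics.
g = hedges_g(n1=20, m1=10.0, sd1=4.0, n2=20, m2=8.0, sd2=4.0)
print(round(g, 2), interpret(g))  # 0.49 small
```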

We calculated both fixed-effect (assumes one true effect size underlies all studies; more weight given to larger studies with less variance) and random-effects (true effect size may vary across studies; studies with larger variances receive less weight) model (Dettori et al., 2022) meta-analyses for (a) all studies we were able to calculate an effect size for, (b) PPR studies separately, and (c) tootling studies separately, following formulas described by Schluter (2024). We constructed a forest plot of each study’s dependent variables’ effect sizes and the three overall peer praise category meta-analysis results following procedures demonstrated by Lajeunesse (2021).
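The difference between the two models can be seen in a small inverse-variance sketch. This is a generic illustration, not the formulas from Schluter (2024): the between-study variance is estimated here with the DerSimonian-Laird method, which is an assumption, and the effect sizes and variances are invented.

```python
import math


def fixed_effect(effects, variances):
    """Inverse-variance weighted mean assuming one true effect size."""
    weights = [1.0 / v for v in variances]
    estimate = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    std_error = math.sqrt(1.0 / sum(weights))
    return estimate, std_error


def random_effects(effects, variances):
    """Random-effects mean; tau^2 from DerSimonian-Laird (assumed estimator)."""
    weights = [1.0 / v for v in variances]
    fe_estimate, _ = fixed_effect(effects, variances)
    # Cochran's Q measures dispersion of effects around the fixed-effect mean.
    q = sum(w * (y - fe_estimate) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w**2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)
    # Between-study variance is added to each study's own variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    estimate = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    std_error = math.sqrt(1.0 / sum(re_weights))
    return estimate, std_error, tau2


# Hypothetical effect sizes and sampling variances for three studies.
effects = [0.2, 0.5, 0.8]
variances = [0.04, 0.04, 0.04]

fe_est, fe_se = fixed_effect(effects, variances)
re_est, re_se, tau2 = random_effects(effects, variances)
print(round(fe_est, 2), round(fe_se, 3))                  # 0.5 0.115
print(round(re_est, 2), round(re_se, 3), round(tau2, 3))  # 0.5 0.173 0.05
```

With these made-up inputs both models give the same pooled estimate, but the random-effects standard error is larger because it also reflects variation in true effects across studies.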

Results

The 36 included studies represented 13 dissertations, four theses, and 19 journal articles, spanning from 1976 to 2024. The journal articles were published in 13 unique journals, with the Journal of Behavioral Education and the Journal of Positive Behavior Interventions containing three articles each. Dissertations and theses represented 10 unique institutions, with University of Southern Mississippi and Louisiana State University each accounting for four dissertations/theses.

QI 1.0: Context and setting

All studies met QI 1.0 for context and setting by providing at least one detail about the school and/or classroom setting, allowing us to determine inclusion criteria (see Figure 2 for a summary of QI coding across studies). Published studies implemented peer-delivered BSP across the preK-12 continuum, with most taking place in elementary schools (n = 24; see Table 1 for descriptive characteristics of all studies). Similarly, studies also took place across the least restrictive environment continuum, with most taking place in general education settings including whole school (n = 26), followed by special education classrooms (n = 6), and residential settings (n = 4). Of note, many studies reported school- or facility-wide implementation of positive behavioral interventions and supports, with some even reporting school- or facility-wide fidelity scores (e.g., Sherman, 2012; Kennedy et al., 2014). While most studies took place in academic settings, a few studies took place in alternate settings, including the playground (Chenier, 2010; Teerlink et al., 2017) and homework time during after-school care (Kirkpatrick et al., 2019).

Figure 2. Methodological rigor of student-delivered behavior-specific praise (peer praise) studies. Peer praise studies are on the abscissa, and Council for Exceptional Children (2014) QIs met are on the primary ordinate (shaded cells = met, clear cells = not met). The secondary ordinate displays QIs met by absolute (triangles; 8.0 QIs required) and weighted (circles; 6.4 QIs required, 80%) coding to be considered methodologically sound. The weighted coding criterion of 6.4 is indicated by the horizontal black line. CEC = Council for Exceptional Children; QI = quality indicator.

Table 1. Descriptive characteristics and study effect classification (EC) for peer-delivered behavior-specific praise (peer praise) studies.

QI 2.0: Participants

All studies met QI 2.1 for providing at least one detail about study participants. 77.78% of studies met QI 2.2 for reporting details of why the student or class was targeted for intervention (e.g., disability status, challenging behavior, classroom management support needs). Many studies utilized data-based decision making to identify students for participation, with some studies confirming teacher or principal referrals of students or classrooms with direct observation screenings (e.g., Wright, 2019). Some studies utilized the class as a unit of analysis by pooling student data (e.g., Grieger et al., 1976), others examined the data of target students within classrooms (e.g., Ervin et al., 1996), and some studies reported both class and target student data (e.g., Lambert, 2014; McHugh et al., 2016).

QI 3.0: Intervention agent

For QI 3.1, 86.11% of studies met this QI by including demographics about both the adult and student (i.e., delivering BSP to peers) interventionists. However, only 63.89% of studies met QI 3.2 by providing sufficient information about the training of both interventionists. Some authors provided adults and/or students with a script to ensure fidelity of all implementation steps of the peer praise intervention—McHugh et al. (2016) even included procedures for rehearsing the script with feedback. Lum et al. (2019) is one example of many where authors assessed and reported fidelity of the training steps to ensure researchers remembered to execute all training steps with all interventionists. A few authors, including Wright (2019), even reported training integrity with IOA for researchers training teachers and teachers training students.

QI 4.0: Description of practice

100% of studies met QI 4.1 and 4.2 by including adequate details on study procedures and materials. Within the 36 studies examining student-delivered BSP, there was some variation among intervention procedures. Twenty studies examined the tootling intervention, 15 examined positive peer reporting, two studies compared the two approaches (Barahona, 2010; Sherman, 2012), and three studies evaluated peer praise outside of positive peer reporting or tootling procedures by using written peer praise notes or peer helpers’ verbal praise (Lund, 2000; Kennedy et al., 2014; Teerlink et al., 2017).

QI 5.0: Implementation fidelity

For QI 5.1, an impressive 94.44% of studies assessed and reported implementation fidelity data. 100% of studies met QI 5.2 for either directly reporting dosage or reporting information from which dosage could be inferred (e.g., graphed data with estimated daily dosage). However, only 72.22% of studies met QI 5.3, as some studies did not include language making it clear that fidelity was assessed throughout the intervention and/or intervention phases. Of note, Steeves (2017) utilized exemplary procedures for tracking dosage in a group design study, having teachers self-report daily implementation fidelity outcomes. Lambert (2014) and Lambert et al. (2015) both collected IOA of implementation fidelity data between two observers to ensure accuracy, a robust procedure though not required by Council for Exceptional Children (2014) QIs.

QI 6.0: Internal validity

QIs 6.1, 6.2, and 6.3 refer to both group and single-case research design studies. QI 6.1, met by 94.44% of studies, refers to the researcher’s ability to control the independent variable. As an exemplar, Lum et al. (2017) included procedures during the withdrawal phase for explicitly telling teachers to remove all intervention materials (e.g., tootle submitting container, poster of group contingency progress) and to tell students, if asked, that the class was not going to tootle. QI 6.2, met by 97.22% of studies, refers to adequate description of baseline/comparison conditions. Both Hoff and Ronk (2006) and McHugh et al. (2016) provided detailed descriptions of not only how data were collected during baseline conditions but also what instructional procedures occurred (e.g., weekly social skills meeting; brief science lessons with hands-on activities and worksheets). QI 6.3, met by 80.56% of studies, refers to baseline/control conditions having no or extremely limited access to the independent variable. We marked QI 6.3 as not present in studies that did not include explicit mention of removing materials, telling teachers not to implement, and/or limiting access to the intervention in control/withdrawal conditions. An exemplar, Kirkpatrick et al. (2019), included assessing fidelity of baseline and withdrawal conditions to report that 0% of implementation steps were implemented.

Within the 36 included studies, there were four (11.11%) that utilized group research design methodology. Of those four, 100% met QI 6.4 for clearly describing/utilizing best practices for group assignment, 75.0% met QI 6.8 for reporting (or allowing our calculation of) overall attrition, but only 25.0% (i.e., Steeves, 2017) met QI 6.9 for reporting directly or including enough data to allow us to calculate differential attrition.

Of the 32 (88.89%) single-case research design studies, 29 (90.63%) met QI 6.5 for using an experimental design that provided for the possibility of at least three demonstrations of effect. 31 studies (96.88%) met QI 6.6 for including at least three data points in all baseline conditions, and 27 studies (84.38%) met QI 6.7 for utilizing a design that controls for common threats to internal validity.
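The percentages in this section follow directly from study counts. As a minimal illustrative sketch (not the authors' analysis code), the Python snippet below recomputes the single-case percentages above from the stated counts. The function name `percent_meeting` is hypothetical, and half-up rounding is an assumption, chosen because it reproduces the reported 90.63% for 29 of 32 studies.

```python
from decimal import Decimal, ROUND_HALF_UP


def percent_meeting(n_met: int, n_total: int) -> Decimal:
    """Percentage of studies meeting a quality indicator, to two decimals.

    Half-up rounding is an assumption; Python's built-in round() rounds
    exact ties to even and would turn 90.625 into 90.62.
    """
    if n_total <= 0 or not 0 <= n_met <= n_total:
        raise ValueError("need 0 <= n_met <= n_total and n_total > 0")
    pct = Decimal(100 * n_met) / Decimal(n_total)
    return pct.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Counts taken from the paragraph above (32 single-case studies in total).
single_case_total = 32
counts_met = {"QI 6.5": 29, "QI 6.6": 31, "QI 6.7": 27}

for qi, n_met in counts_met.items():
    print(f"{qi}: {n_met}/{single_case_total} = {percent_meeting(n_met, single_case_total)}%")
```

The same helper applies to the full sample, for example 34 of 36 studies for QI 5.1.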

QI 7.0: Outcome measures/DVs

100% of studies met QIs 7.1, 7.2, and 7.3 for utilizing socially important outcomes, clearly defining dependent variables and their measurement, and reporting effects of all dependent variables. 88.89% of studies met QIs 7.4 and 7.5 for utilizing appropriate timing of dependent variable data collection (i.e., group designs close to end of intervention, single-case three or more data points per condition) and providing adequate evidence of group measure reliability or IOA of single-case research design direct observation dependent variables. Of note, McHugh et al. (2016) and Lum et al. (2017) impressively reported κ to account for chance agreement between two raters. QI 7.6 refers to group design methodology, and 75.0% of the four included studies met this QI for including adequate evidence of validity. For example, both Murphy (2013) and Gray (2023) included measures of social validity, as did Steeves (2017), who additionally discussed construct validity.
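For readers less familiar with κ, Cohen's kappa adjusts observed agreement for the agreement two raters would reach by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below is illustrative only. It assumes two observers assigning one categorical code per observation interval, the interval data are invented, and the cited studies may have computed κ differently.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same items into categories."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: proportion of items coded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical interval codes (1 = academically engaged, 0 = not engaged).
observer_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
observer_2 = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(f"kappa = {cohens_kappa(observer_1, observer_2):.2f}")
```

In this invented example the observers agree on 8 of 10 intervals (80% agreement), but because half of that agreement is expected by chance, κ is lower than raw percent agreement.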

QI 8.0: Data analysis

QIs 8.1 and 8.3 apply to group design methodology. Of the four group studies in this sample, 75% met QI 8.1 for employing appropriate data analysis techniques, and 50% met QI 8.3 for reporting measures of effect or sufficient information from which we could calculate effect sizes. QI 8.2 applies to single-case research design methodology and requires studies to include a clear graph reporting data from all conditions for each unit of analysis. Of the 32 included single-case studies, 96.88% met this QI.

Evidence base supporting student-delivered behavior-specific praise

Based on Council for Exceptional Children (2014) standards for EBPs, tootling met criteria for classification as an evidence-based practice in two ways: by having a minimum of five single-case research design studies with 20+ participants, and by having at least one group design study with 30+ participants together with at least three single-case research design studies with 10+ participants. PPR did not meet criteria for evidence-based practice, potentially EBP, or mixed evidence: of the five single-case research design studies that met Council for Exceptional Children (2014) weighted criteria for methodological rigor, only one had positive effects while four had neutral or mixed effects. Because two studies with positive effects were needed for the mixed evidence category, we classified PPR into the insufficient evidence category.
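The two routes to evidence-based practice classification described above can be written as simple threshold checks. The sketch below is a deliberately simplified illustration and not the full Council for Exceptional Children (2014) decision rules. It encodes only the two thresholds named in this paragraph and assumes that participant counts are summed across methodologically sound studies with positive effects. The class `SoundStudy`, the function `ebp_routes_met`, and all study data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundStudy:
    """A methodologically sound study (hypothetical record layout)."""
    design: str             # "single_case" or "group"
    participants: int
    positive_effects: bool


def ebp_routes_met(studies):
    """Check the two evidence-based practice routes described in the text.

    Assumption: only studies with positive effects count toward a route,
    and participant totals are summed across those studies.
    """
    single_case = [s for s in studies if s.design == "single_case" and s.positive_effects]
    group = [s for s in studies if s.design == "group" and s.positive_effects]
    sc_n = sum(s.participants for s in single_case)
    group_n = sum(s.participants for s in group)
    return {
        # Route 1: five or more single-case studies with 20+ participants.
        "single_case_route": len(single_case) >= 5 and sc_n >= 20,
        # Route 2: one or more group studies (30+) plus three or more single-case studies (10+).
        "combined_route": (len(group) >= 1 and group_n >= 30
                           and len(single_case) >= 3 and sc_n >= 10),
    }


# Invented data: one positive study among five, as in the PPR pattern described above.
few_positive = [SoundStudy("single_case", 3, True)] + [SoundStudy("single_case", 3, False)] * 4
# Invented data: six positive single-case studies plus one positive group study.
many_positive = [SoundStudy("single_case", 4, True)] * 6 + [SoundStudy("group", 40, True)]

print(ebp_routes_met(few_positive))
print(ebp_routes_met(many_positive))
```

A complete implementation would also need the remaining categories (potentially EBP, mixed evidence, insufficient evidence, negative effects) and the other conditions in the standards, which are not reproduced here.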

Figure 3 contains a forest plot of estimated effect sizes for all studies meeting 80% or more of QIs, our weighted Council for Exceptional Children (2014) criterion for methodologically sound studies. Each study is marked by the type of student-delivered BSP intervention employed: tootling (k = 20), positive peer reporting (k = 15), and other (k = 3), with two studies marked as both PPR and tootling given authors compared the two interventions. The forest plot concludes with overall effect sizes for student-delivered BSP, inclusive of all studies and categories of peer praise, and then we considered the evidence base for PPR and tootling separately given the large and clear divide of studies into these categories. The important work of the three studies utilizing direct peer praise was inadequate in number for consideration of a separate evidence-based practice categorization or omnibus effect size.

Figure 3. Effect sizes of dependent variables in student-delivered behavior-specific praise studies eligible to contribute to evidence-based practice classification [i.e., met Council for Exceptional Children, 2014 80% weighted criterion, met QI 6.5, had 3+ cases if SCRD]; ▸, positive peer reporting study; †, tootling study; *, other study (peer praise notes or peer assistants).

Discussion

It was encouraging to find 34 of the 36 studies (94.44%) met QI 5.1 for reporting implementation fidelity results, as some past systematic literature reviews found far fewer studies met this important QI (e.g., 47.92% of studies coaching educators to increase BSP in Ennis et al., 2020; 46.15% of instructional choice studies in Royer et al., 2017). Results across the studies included in this systematic literature review showed student-delivered BSP can improve academic engaged time and reduce the disruptive behavior and social isolation of students with or at risk for EBD. Even PPR studies, which collectively provided insufficient evidence under Council for Exceptional Children (2014) standards for EBP, showed most individual student participants improved on multiple outcomes; however, some studies did not have enough participants (minimum needed is three), did not meet QI 6.5 (study design provides for the possibility of at least three demonstrations of intervention effect), or had <75% of participants showing improvement (see Table 1), and thus those studies could not be considered in the EBP classification calculations. Individual students who were socially withdrawn/rejected increased their social interactions when they received peer-delivered BSP as the star in PPR studies (e.g., Short, 1999; Chenier, 2010). Such results showed how student-delivered BSP can help increase the number of classroom positive interactions and support teachers who might not always be able to give as much attention to quiet students as they would like, perhaps because they feel drained of energy from going “through the same cycle with the [disruptive] behavior kids” over and over (Lanza, 2020, p. 36).

It was interesting to note the clear shift in studies from PPR to tootling in 2014, though it is unclear why the shift occurred at that time. Skinner et al. (1998) introduced the concept and name tootling at a 1998 conference, and the first tootling study was by Shelton (2002) a few years later. The next tootling study was Cihak et al. (2009), then researchers compared tootling to PPR in Barahona (2010) and Sherman (2012), with the final PPR study a year later by Murphy (2013) and all others through 2024 being tootling except an outlier 2023 dissertation (Gray, 2023). Perhaps this follows an ‘evolution’ in student-delivered BSP: from a single student being the PPR ‘star’ receiving all peer BSP, to having three students as stars, to having the star(s) be unknown so more students engage in expected behaviors hoping peers will notice and report on them later if they end up being the star(s), to scaling up peer praise classwide with tootling where all students are now observed by peers. We understand why teachers might prefer tootling because all students can receive BSP from their peers. In PPR studies, all students were reminded of behavior expectations when they were told to be on the lookout for the star(s) meeting those expectations, but in the end, only the star(s) received attention in the form of BSP from peers, so teachers may have perceived PPR as less effective. Since these studies had neutral or mixed evidence, it also could be that many students found being the ‘star’, and thus the center of attention at the end of the day or session, embarrassing or aversive. At least one student in PPR studies, “Katie” (Moroz and Jones, 2002), did better when she was the praiser, not the recipient of peer praise (while some students would certainly desire to stand center-stage and have peer praises heaped upon them). These could be reasons why investigation shifted to tootling, where, in theory, the whole class would have better behavior as everyone can tootle on everyone. Possible downsides to tootling compared to PPR include the loss of students receiving that BSP directly from peers (because teachers read tootles aloud, whereas PPR stars hear BSP from peers) and the fact that not all tootles are shared with the intended recipients when teachers only read 3–5 at the designated time. This might be balanced by a limitation of PPR studies, though: students in PPR studies do not write down the good behavior they notice and might forget who and what they saw by the time of the PPR reporting session. This lower dosage of BSP for the star in PPR studies might be comparable to how tootles are not all read to students.

Even with the shift to tootling, which allowed for classwide student recognition from peers, it was surprising that most students in tootling studies who were praised by peers on a tootle slip probably never knew it. In almost all tootling studies, only 3–5 tootles were read at the end of the tootling period, class, or day, followed by all tootles being counted and the group reward tracker updated. What happened to the tootles after the few were read aloud was not reported in studies except for Thoele (2024), who sent tootles home after the group reward was met. Typical procedures therefore appear to be missing the important opportunity of letting all students hear or read the praise that was intended for them. It would be a quick and easy step for teachers or a student leader to at least distribute tootles to the recipients if there was not time in the day to read them all. There were some exceptions to this in tootling studies, however, where all students were able to receive their peers’ praise. For example, Harry et al. (2023) and Barahona (2010) did not read tootles aloud to the class but instead publicly posted all tootles on a bulletin board after class where students could read them the next day or gave them to students to keep, respectively. Teachers in Ray (2019) did read all tootles aloud, and it is possible the teacher in Cihak et al. (2009) read all tootles aloud during the 20 min allotted at the end of class, but it was not explicitly stated. In McHugh Dillon et al. (2019), students typed tootles into computer stations that immediately displayed them on an interactive whiteboard for everyone to read. In these cases, which happened to have the largest effect sizes for academic engagement, students were able to read not only the praise intended for them specifically but also all tootles given by any student. This might have provided additional reminders to students about what behaviors were expected and/or helped increase student motivation to meet expectations in the hope of receiving similar tootles themselves the next day.

We expected to find more peer praise note studies in theses and dissertations and were surprised to find the literature so clearly split between PPR and tootling. We thought more interventions would have taught students to say or write BSP statements immediately and directly to their peers, just like adult educators say or write specific praise for students. However, there were only three. Lund (2000) had fifth-grade peer monitors use BSP and give tokens to students contingent on quiet on-task behaviors on a fixed-interval schedule, with results showing engagement improved dramatically and disruptions decreased for both token earners and peer monitors. Two other studies utilized peer praise notes as previously reported in the 50-year map of BSP literature of Ennis et al. (2020): Kennedy et al. (2014) compared teacher-written and student-written praise notes during art class for grades 2–4 in a residential facility and found both worked equally well to reduce inappropriate behavior; Teerlink et al. (2017) implemented peer praise notes schoolwide at recess for an elementary school with 2–3 students per grade trained to be peer praisers, demonstrating peer praise notes were effective at reducing playground office discipline referrals. We hope future researchers will continue to investigate the effects of student-delivered specific praise notes, as there were not enough studies to evaluate the practice for EBP determination, but it appears to be a promising practice that has students directly and immediately recognizing appropriate and prosocial behavior of their peers using BSP.

We found it interesting that PPR was not classified as at least a potentially EBP, given that it was in PPR studies that students heard directly from peers what they did well to earn their specific praise, even though the reporting did occur at the end of the session or day (delayed reinforcement). We expected praise heard directly from peers to be more impactful compared to tootles read by teachers. We acknowledge, of course, many PPR studies were missing the required minimum participants to meet Council for Exceptional Children (2014) QIs or used a design that did not allow for the possibility of three demonstrations of effect, so it still could be that PPR’s direct sharing of BSP to peers is more impactful than having teachers read student praises. Given the limited number of PPR studies meeting QIs, it is difficult and perhaps not appropriate to compare PPR effect sizes to tootling study effect sizes: some are higher, some lower, and some similar, but there are just too few.

Something to consider regarding the EBP classification of tootling is how we know what the effective component(s) of the intervention are. We learn in each study the class tootling goal and how many times the class met the goal, which indicates the dosage of tootles written for the class, but readers do not learn the dosage per student, not even for target students (similarly in PPR studies, dosage of praise statements received by the star was not reported). It could be that popular students received the most tootles. It could be that dosage is not important because the key component might be knowing peers are watching for good behavior even if they do not fill out a tootle or even if your tootle is not read aloud. Future studies should report the dosage of tootles written and received (read aloud or posted for reading) for each target student and the average per day per student in the classroom. In addition to unknown dosage, we also cannot isolate whether peer-specific praise is a key factor in tootling: it is the teacher who reads tootles, so it is possible students receiving praise perceive it as teacher attention even though it was written by a peer. Many of the early studies of PPR targeted socially isolated students or peers who needed to increase positive interactions with peers (e.g., Ervin et al., 1996; Moroz and Jones, 2002). It seems counter-intuitive, then, to have the intervention be so teacher-driven, limiting the potential for positive social interactions and praise directly between peers. Plus, the teacher typically praised the appropriate behavior mentioned in the tootle, as well as both the student who wrote and the student who received each tootle drawn, so it could be that teacher attention/praise is most responsible for changes in student outcomes. Additionally, observed changes in student behavior may be partially attributed to the interdependent group contingency in each tootling study, and we cannot know to what degree. We do know group contingency interventions, especially in general education classrooms, were found to be an evidence-based practice when What Works Clearinghouse standards were applied (Maggin et al., 2017). The group contingency component of tootling interventions, for some students, might be the strongest motivator for good behavior, more so than peer praise or teacher attention, and receiving the group reward may be the most reinforcing aspect for some students. We believe changes in student behavior during tootling studies are most likely a combination of teacher attention, peer attention, group contingency/reward, and knowing your peers are watching you to write a tootle that might be read later. Future researchers could run a component analysis study to more definitively determine active ingredients in tootling interventions, with and without group contingency, and/or compare typical tootling procedures to truly student-delivered BSP interventions where students immediately and directly praise peers when they observe targeted prosocial behavior.

We encourage readers to keep in mind we used two different methods to look at the effects of each study. Under Council for Exceptional Children (2014) standards, for a single-case research design study to be classified as having positive effects, 75% of cases needed to show a functional relation in the therapeutic direction; if not, the study was classified as neutral or mixed effects. Of the two methods we used, this is the more conservative approach using visual analysis. In comparison, when we calculated BC-SMD, all participant data were used, which could result in an effect size that, if examined in isolation (e.g., without having CEC classification at hand), could seem to indicate overall positive results. For example, Chenier (2010) had two of three PPR students with positive outcomes and the third student with neutral results; when looking at all three students as a non-concurrent multiple baseline, there was not a functional relation. Yet, the BC-SMD estimated effect size was 0.52 (medium effect), likely due to the large level changes in the two students who had therapeutic outcomes plus the small increase in level for the one student with neutral results. A similar comparison can be made in another PPR study, Moroz and Jones (2002), as well as in tootling studies such as Wright (2019). Wright (2019) demonstrated a functional relation in two of three classrooms (66.7%) in their A-B-A-B design where 75% was needed for positive results, so the Council for Exceptional Children (2014) classification was neutral or mixed evidence; the BC-SMD estimated effect size was 0.46 for disruptive behavior (small effect) and 0.91 for academically engaged behavior (large effect) when all student data are considered in the examination of mean level changes despite the lack of a functional relation. We therefore suggest readers interpret BC-SMD effect sizes with caution and with overall CEC study designation in mind. This is in alignment with the recommendations of Maggin et al. (2017), who also applied BC-SMD effect size estimates in their meta-analysis of single-case research design group contingency studies. The authors noted that much more investigation is required in terms of how researchers separate assessments of effect size and methodological rigor in single-case research, but that using parametric analysis and visual analysis together in systematic literature reviews and meta-analyses is supported.
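The 75% rule described above can be expressed as a one-line threshold check. The following is a minimal sketch under stated assumptions: the function name `classify_single_case_study` is hypothetical, "75% of cases" is read as "at least 75%", and the sketch covers only the positive versus neutral-or-mixed distinction discussed in this paragraph.

```python
def classify_single_case_study(cases_with_functional_relation: int,
                               total_cases: int,
                               threshold: float = 0.75) -> str:
    """Label a single-case study by the share of cases showing a functional relation.

    Assumption: a study has 'positive' effects when at least `threshold`
    of its cases show a functional relation in the therapeutic direction.
    """
    if total_cases <= 0 or not 0 <= cases_with_functional_relation <= total_cases:
        raise ValueError("need 0 <= cases_with_functional_relation <= total_cases, total_cases > 0")
    proportion = cases_with_functional_relation / total_cases
    return "positive" if proportion >= threshold else "neutral or mixed"


# Two of three cases (66.7%), as described for Chenier (2010) and Wright (2019).
print(classify_single_case_study(2, 3))
# A hypothetical study with three of four cases (exactly 75%).
print(classify_single_case_study(3, 4))
```

Because this rule counts cases rather than averaging outcomes, a study can fall short of 75% while still producing a medium or large BC-SMD estimate, which is the contrast drawn in the paragraph above.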

Limitations

As with any literature review, it is possible, despite our best efforts to be systematic in our search, that we missed including some studies. We followed Lane et al.’s (2022) established guidelines for an exhaustive search to prevent missing articles and included theses and dissertations to best represent the full literature base on peer praise. Future researchers might additionally attempt to conduct forward ancestral searches of the included studies. Similarly, although all steps of our study review process after procurement of articles (i.e., QI coding, descriptive coding, study evidence-based practice classification, effect size calculations) were completed by two authors with high levels of reliability, there is always the possibility there was an error in our coding or that other researchers may interpret results differently. Thus, we encourage interpretations of our results regarding the student-delivered BSP body of literature be made with caution as readers keep these limitations in mind.

Educational implications

Teachers in schools where PBIS or Ci3T is implemented might consider implementing one of the versions of student-delivered BSP. The whole school might even try it as a Tier 1 prevention effort that extends PBIS to the student level as peer praisers, where teachers get help from students implementing the low-intensity strategy of BSP as a positive reinforcement for meeting schoolwide behavior expectations. Or, teachers might notice many students in their classroom need support staying on task or engaging in more prosocial behavior and decide to implement a version of student-delivered BSP in their classroom only, such as tootling, all day or for a particular time of day where behavior is most challenging. If just one or two students are socially isolated and not being included by peers, in addition to reteaching appropriate social skills lessons for all students, teachers could implement PPR and make those students the ‘star’ at a higher rate than peers. Teaching students to specifically praise their peers with PPR or tootling would not take more than a few minutes each day, would not interfere with other teacher-delivered low-intensity strategies that support engagement and reduce disruptions, and may help teachers increase their classroom self-efficacy to keep all students in the room learning.

The delayed specific praise seen in PPR and tootling studies worked for almost all student participants but not everyone, so it might work for more and have even larger impacts if teachers taught students to praise peers directly and immediately (e.g., “Thanks for cleaning up the floor around all our group’s desks, Robyn”) for a targeted time of day when challenging behavior is known to occur most often, or even the full day. Teachers could then reinforce direct and immediate student praise with teacher-delivered BSP (e.g., “Jayson, I love how you thanked Robyn for cleaning up the whole group table”) to encourage student BSP to occur more regularly. A recommended component often considered key to the powerful impact of BSP is immediacy (Ennis et al., 2018), so making the shift in the classroom to praising peers right away instead of waiting until the end of the hour or the day might help students stay even more on task with appropriate behavior. Similarly, when done authentically, praising the recipient directly might be more impactful compared to students telling the teacher what they saw (praise recipient hears it but not directly addressed to them) or writing down what they saw for the teacher to read to the class later (praise recipient learns about it from the teacher but does not hear it from the praiser). In most tootling studies, teachers only read 3–5 tootles, so most students did not hear if a peer recognized their prosocial behavior, whereas teaching students to praise peers directly and immediately would allow all students to hear the praise intended for them and thus be more reinforcing to the behavior being specifically praised.

Most included studies took place at elementary grade levels when young students seek teacher attention, so it might make more sense to study student-delivered BSP at the middle and high school levels. Adolescent students in secondary schools tend to seek peer attention more than adult attention, so perhaps peer praise is best suited for middle and high school settings where students already seek out peer approval. Future researchers should do more peer praise studies at the secondary level to test if adolescents are indeed more motivated by and reinforced by peer attention in the form of student-delivered BSP compared to elementary students who typically desire teacher attention.

Summary

We conducted an exhaustive systematic literature review on student-delivered BSP to peers and found 36 articles focused primarily on positive peer reporting (PPR) and tootling interventions. We used Council for Exceptional Children (2014) standards for evidence-based practices to code included articles for quality indicators (QI) using a weighted 80% criterion and classified PPR in the insufficient evidence category and tootling in the evidence-based practice category. We calculated each eligible (80% QI met; QI 6.5 met; three or more cases in single-case research designs) study’s effect size, either between-case standardized mean difference estimate (A-B-A-B withdrawal/reversal and multiple baseline designs), standardized average difference between successive observations (for one alternating treatment design), or Hedges’s g (two group designs), then calculated random-effects meta-analytic effect sizes of 0.2254 (small effect) for PPR, 1.0238 (large effect) for tootling, and 0.7408 for all eligible studies. Future researchers should (a) continue to investigate PPR with sufficient participants using methodologically sound research designs, (b) conduct tootling studies in middle and high school settings, (c) conduct component analysis studies of the tootling intervention to determine active ingredients (e.g., teacher attention, peer praise, dosage of teacher and student praise), (d) conduct additional peer praise note studies to allow determination of evidence-based practice category, and (e) conduct studies where students across contexts are taught to directly and immediately recognize peer prosocial behavior using BSP.
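To illustrate how study-level effect sizes are pooled in a random-effects meta-analysis, the sketch below uses the DerSimonian-Laird estimator, one common choice. This is an assumption for illustration only: the estimator and software behind the pooled values reported above are not stated in this section, and the effect sizes and variances in the example are invented rather than taken from the included studies.

```python
import math


def random_effects_dl(effects, variances):
    """Pool effect sizes with a DerSimonian-Laird random-effects model.

    Returns (pooled effect, standard error, tau^2 between-study variance).
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    if any(v <= 0 for v in variances):
        raise ValueError("sampling variances must be positive")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q and the method-of-moments estimate of tau^2 (floored at 0).
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau^2 to each study's sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


# Invented effect sizes and sampling variances for three hypothetical studies.
pooled, se, tau2 = random_effects_dl([0.2, 1.0, 0.6], [0.01, 0.01, 0.01])
print(f"pooled = {pooled:.3f}, SE = {se:.3f}, tau^2 = {tau2:.3f}")
```

When between-study variance is large relative to sampling variance, the random-effects weights become more nearly equal across studies, so small studies influence the pooled estimate more than they would under a fixed-effect model.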

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

DR: Writing – original draft, Writing – review & editing. RE: Writing – original draft, Writing – review & editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

References indicated with an asterisk (*) were included in the literature review.