- Department of Physical Therapy, Faculty of Rehabilitation Science, Nagoya Gakuin University, Nagoya, Aichi, Japan
This study examined the reliability and validity of judging system scores of past hip-hop dance competitions in Japan. The analysis focused on the scores for each assessment category separately. Judges’ scores were obtained from national dance competitions held annually in Japan between 2014 and 2019. In these competitions, five experienced judges evaluated the dancers’ performances. The judges scored on a 10-point scale in five categories as follows: creativity, expression and interpretation, impression, technical quality, and synchronisation. This study found that the technical quality category demonstrated good reliability, whilst the impression category showed poor reliability. Systematic bias was significant for all categories. There are no levels of difficulty defined for technique, no criteria set for correct movement and no explanation provided for each scoring level, which suggests that each judge may have interpreted the criteria for evaluating hip-hop dance differently. Developing these definitions and identifying the biases that affect evaluation would help to ensure a reliable evaluation system.
Introduction
Hip-hop dance is freestyle dance that began as street dancing, a part of the hip-hop culture (Craine and Mackrell, 2010), which includes breaking, rocking, popping, house and street jazz dances (Ojofeitimi et al., 2012). It has spread rapidly and many hip-hop dance competitions have been held worldwide. Originally, the impression of the audience was considered to be the most important factor in evaluating hip-hop dance; the winner of a competition was determined based on the audience’s extent of excitement. However, in recent years, hip-hop dance has become more competitive. It was first considered an Olympic sport in the 2018 Youth Olympics and will make its debut in the 2024 Olympics (International Olympic Committee, 2021). In this context, clear evaluation criteria must be defined for hip-hop dance to be considered a viable competition so that dancers, judges and audiences share a common understanding of the definition of superior hip-hop dance performance.
In Olympic artistic gymnastics, evaluations are divided into artistic and technical categories. Scores are determined by absolute evaluations that are based on the difficulty and kinematic criteria for all techniques, as defined in the Code of Points (Fédération Internationale de Gymnastique, 2021). Many studies have examined the reliability of this evaluation system using the results of past competitions, and high reliability has been reported (Leskošek et al., 2010; Atiković et al., 2011; Bučar et al., 2012; Pajek et al., 2013, 2014). In figure skating, another Olympic sport, final scores are calculated based on scores for technical elements, programme components and any deductions (International Skating Union, 2021). The reliability of figure-skating judges has also been investigated. Inter-judge correlation has been found to be above 0.9 for both technical and artistic scores (Lockwood et al., 2005). Thus, both artistic gymnastics and figure-skating competitions employ highly reliable evaluation systems.
In Dancesport (competitive ballroom dancing), a new evaluation system based on absolute evaluation was introduced in 2013; this replaced the previous evaluation system, which was found to be relative (World DanceSport Federation, 2021b). In the new evaluation system, as in artistic gymnastics and figure skating, evaluation is divided into artistic and technical aspects. The scoring system is based on a 10-point scale, with a performance description defined for each level. Research on the reliability of the new evaluation system reported that the mean correlation amongst all judges was 0.48 (Premelč et al., 2019), which was lower than the correlations reported for artistic gymnastics and figure-skating competitions. Insufficient description of performance at each level was identified as a reason for the poor reliability.
At the biggest hip-hop dance competitions worldwide, multi-member groups compete, and their performances are evaluated across 10 categories in two domains (Table 1; Hip Hop International, 2021). The combined scores of the 10 categories are used to rank the competitors. Although descriptions of each category’s evaluations have been publicised, detailed kinematic criteria for techniques and the criteria for assessing each level along a scale have not been described. Thus, judges are likely to score dancers based on their own interpretations and criteria. At the 2018 Youth Olympic Games, all break-dancing (a form of hip-hop dance) matches were set up in a battle format, either individual or group, and the winner was determined by a relative evaluation based on which dancer was better in each of the six categories in three domains (Table 2; World DanceSport Federation, 2021a).
As in figure skating and artistic gymnastics, in hip-hop dance competitions, including break-dancing, performances have been evaluated in categories that include both technical and artistic aspects (Tables 1, 2). For the technical aspect of the assessment, difficulty levels for techniques have not been established, and the correct movements for each technique have not been defined; thus, it is not clear how judges evaluate the technical aspect of performance. Studies have reported that factors such as facial expression (Cunningham et al., 1990) and body shape (Tovée et al., 1999; Pawlowski et al., 2000), as well as movement, affect the judges’ evaluation of dance performances. Sato and Hopper (2021) found that the reliability of the judges’ scores varied when the actual dancer videos and humanoid animations created from actual dancer movements were evaluated, suggesting that dancer appearance impacted the evaluation of judges. Although several categories exist within the evaluation of the artistic aspect of hip-hop dancing in the current system (Table 1), evaluation categories that consider biases such as those (un) favouring facial expression or body shape have not been developed. To date, the reliability of the evaluation systems currently in use has not been reported based on the results of past competitions in hip-hop dance. To develop an objective evaluation system, the reliability of current evaluation systems must first be examined.
This study analysed judges’ scoring of hip-hop dance competitions held in the past, ascertaining each judging category separately and examining the reliability of the scores.
Materials and methods
Judges’ scores were obtained from national dance competitions held annually in Japan from 2014 to 2019. The performances in these competitions were not videotaped. These competitions were open to dancers of elementary to junior high school age, and the analysis used the results from each year’s competition final, in which the dance teams that had won the preliminary rounds performed. Each dance team consisted of 5–40 dancers. The dance genre covered in this competition was hip-hop, which includes rocking, popping, breaking, house and street jazz. This study was approved by the Nagoya Gakuin University Research Ethics Committee. All data used in the analysis were anonymised, and participants were offered opt-out opportunities.
Five experienced judges evaluated the dancers’ performances in each competition. They were not the same individuals each year. The judges scored on a 10-point scale in five categories, as follows: creativity, expression and interpretation, impression, technical quality and synchronisation. There were no descriptions of performance for each point level (0–10), and the judges were not allowed to share or discuss their evaluations with each other. The final scores for each of the five categories for individual dance teams were calculated as the mean of the five judges’ scores.
Descriptive statistics of all judges’ scores for each category were calculated for each year of the competition. The following statistical values were calculated for validity analysis (Bučar et al., 2012). Signed and absolute deviations from the final score for individual judges were calculated as measures of bias. Mean rank and deviation from the expected rank were also assessed for individual judges. The expected rank was calculated as (m + 1)/2, where m is the number of judges, with reference to Bučar et al. (2012). The reliability of the evaluation was assessed using intra-class correlation coefficients (ICC) for a single rater and for the mean of five raters, for both two-way random (consistency) and fixed (agreement) effects (Premelč et al., 2019). Kendall’s W (Kendall’s coefficient of concordance) was also calculated. Kendall’s W values below 0.40 were considered poor, 0.40–0.50 moderate, 0.50–0.70 good and greater than 0.70 excellent. ICC values were interpreted as follows: less than 0.40, poor reliability; 0.40–0.75, fair to good reliability; greater than 0.75, excellent reliability (Fleiss et al., 2013). All data were analysed using SPSS Statistics software (version 25.0; SPSS Inc., Chicago, IL, United States).
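The bias and concordance measures described above can be illustrated with a short sketch. This is not the SPSS procedure used in the study; it is a minimal pure-Python illustration, assuming the scores for one category are arranged as one row per team and one column per judge. The function names (`average_ranks`, `judge_bias`, `kendalls_w`), the toy scores, the ranking direction (1 = lowest score) and the tie correction applied to Kendall’s W are all assumptions made for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def judge_bias(scores):
    """scores[t][j] = score given to team t by judge j (one category).

    Returns, per judge: mean signed deviation and mean absolute deviation
    from the final score, mean rank within the panel, and the deviation of
    that mean rank from the expected rank (m + 1) / 2.
    """
    m = len(scores[0])
    finals = [mean(row) for row in scores]  # final score = mean of the m judges
    signed = [mean(row[j] - f for row, f in zip(scores, finals)) for j in range(m)]
    absolute = [mean(abs(row[j] - f) for row, f in zip(scores, finals)) for j in range(m)]
    within_team = [average_ranks(row) for row in scores]  # rank judges per team
    mean_rank = [mean(r[j] for r in within_team) for j in range(m)]
    expected = (m + 1) / 2
    rank_dev = [r - expected for r in mean_rank]
    return signed, absolute, mean_rank, rank_dev


def kendalls_w(scores):
    """Kendall's coefficient of concordance with a correction for tied ranks."""
    n, m = len(scores), len(scores[0])
    per_judge = [average_ranks([row[j] for row in scores]) for j in range(m)]
    rank_sums = [sum(per_judge[j][t] for j in range(m)) for t in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    ties = 0.0
    for ranks in per_judge:
        for value in set(ranks):
            t = ranks.count(value)  # size of this tie group
            ties += t ** 3 - t
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)


# Hypothetical scores: 4 teams (rows) rated by 5 judges (columns).
toy_scores = [
    [7, 8, 7, 6, 8],
    [6, 7, 6, 6, 7],
    [9, 9, 8, 8, 9],
    [5, 6, 6, 5, 7],
]
signed, absolute, mean_rank, rank_dev = judge_bias(toy_scores)
print("signed deviation per judge:", signed)
print("rank deviation per judge:", rank_dev)
print("Kendall's W:", kendalls_w(toy_scores))
```

A judge whose signed deviation and rank deviation are both consistently positive in this sketch would be scoring above the panel mean across teams, which is the kind of systematic bias the validity analysis is intended to detect.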
Results
Amongst the five categories, the highest mean score was 7.35 ± 1.03, for impression, and the lowest was 7.10 ± 1.13, for technical quality (Table 3). Appendix 1 presents the statistics of scores for individual judges, and Table 4 shows values extracted from them, indicating the best and worst deviations in judging. In terms of score bias, the maximum absolute deviation from the final score and mean rank deviation from the expected rank were generally significant for all categories. Regarding the correlation between the scores of the individual judges and the final score, which is the mean of the five judges, technical quality demonstrated the largest maximum correlation coefficient and impression demonstrated the smallest minimum correlation coefficient in most of the competition years.
Table 3. Mean, minimum and maximum values for five categories and the final scores for the 2015–2019 competitions.
In terms of score reliability, the Kendall’s W values ranged from 0.319 to 0.681 (Table 5). In each year of the competition, the category with the highest reliability was technical quality; its values ranged from 0.576 to 0.681, most of them indicating good reliability. The category with the lowest reliability was impression; its values ranged from 0.319 to 0.448, most of them indicating poor reliability. The ICC results were similar: the single-measure ICC coefficients for absolute agreement and consistency for technical quality demonstrated fair to good reliability, and the average-measure ICC coefficients for absolute agreement and consistency showed good to excellent reliability for almost all categories.
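The distinction between the single-measure and average-measure ICC, and between consistency and absolute agreement, can be sketched from the mean squares of a two-way analysis of variance. The following is a minimal illustration using the standard two-way ICC formulas, not a reproduction of the SPSS output reported here; the function name `icc_two_way`, the dictionary keys and the toy scores are assumptions made for illustration.

```python
def icc_two_way(scores):
    """Two-way ICCs for scores[t][j] = score of team t by judge j.

    Returns single-measure and average-measure coefficients for both the
    consistency and the absolute-agreement definitions.
    """
    n, k = len(scores), len(scores[0])  # n teams, k judges
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for teams (rows), judges (columns) and residual error.
    ss_rows = k * sum((r - grand) ** 2 for r in row_means)
    ss_cols = n * sum((c - grand) ** 2 for c in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-team mean square
    msc = ss_cols / (k - 1)              # between-judge mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    diff = msr - mse
    return {
        "consistency_single": diff / (msr + (k - 1) * mse),
        "agreement_single": diff / (msr + (k - 1) * mse + k * (msc - mse) / n),
        "consistency_average": diff / msr,
        "agreement_average": diff / (msr + (msc - mse) / n),
    }


# Hypothetical scores: each judge ranks the 4 teams identically, but the
# judges differ from one another by a constant offset (systematic bias).
toy_scores = [
    [5, 6, 7],
    [6, 7, 8],
    [7, 8, 9],
    [8, 9, 10],
]
for name, value in icc_two_way(toy_scores).items():
    print(name, round(value, 3))
```

In this toy case the consistency coefficients equal 1 because the judges order the teams identically, whereas the agreement coefficients are lower because the between-judge offsets count as disagreement. Averaging over judges raises the coefficient relative to the single-measure form, which is in line with the average-measure values being higher than the single-measure values.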
Discussion
To develop hip-hop dance competition and elevate its competitive status, an evaluation system with high reliability must be developed. This study was the first to examine the reliability of evaluation results of hip-hop dance competitions.
Regarding the reliability, the Kendall’s W values ranged from 0.319 to 0.681, which was comparable to the reliability assessments for Dancesport (Premelč et al., 2019), thus indicating that the reliability was not high. In contrast, high reliability has been reported for judging in artistic gymnastics competitions (Leskošek et al., 2010; Atiković et al., 2011; Bučar et al., 2012; Pajek et al., 2013). In artistic gymnastics, the level of difficulty and correct movements for all techniques are defined, and point deductions are described in detail in the Code of Points. However, in hip-hop dance, there are no defined criteria for the difficulty of a technique or a correct movement, and there are no descriptions of each level of the 10-point scale. This means that each judge may interpret the criteria for evaluation and evaluate the performance differently in hip-hop dance. Various biases also reportedly affect judges’ evaluations, including the position of the judges (Dallas et al., 2011), experience of the judges (Flessas et al., 2015), order of the performances (Plessner, 1999) and reputation of the dancers (Findlay and Ste-Marie, 2004). Factors such as the dancers’ facial expression and appearance also affect performance evaluations (Cunningham et al., 1990; Tovée et al., 1999; Pawlowski et al., 2000; Sato and Hopper, 2021). These biases may have contributed to the low reliability found in this study.
In hip-hop dance, dancers typically perform in groups. Similarly, rhythmic gymnastics involves a group competition, in which five competitors perform, and the judges must evaluate the performances of the five gymnasts simultaneously. Previous studies reporting the reliability of performance evaluations in artistic gymnastics and figure skating all concerned individual competitions, and no studies to date have investigated the reliability of performance evaluations in team competitions such as rhythmic gymnastics. When judges pay attention to one competitor, they lose information about the execution of the other competitors. Flessas et al. (2015) reported that when evaluating the five-gymnast ensemble routines in rhythmic gymnastics, international-level judges did not rely on eye fixation to detect errors and may have used other cognitive strategies, as compared to novice and national-level judges. Thus, evaluating performance in the case of group competitions can be considered more challenging, and this may also have affected the reliability results of hip-hop dance.
This study assessed systematic bias in judging to evaluate score validity. For all categories, the values of absolute deviations from the final score and mean rank and deviation from the expected rank were larger than those values for artistic gymnastics (Bučar et al., 2012), suggesting a more significant systematic bias. Fernandez-Villarino et al. (2013) reported that the special circumstances in which judges must evaluate dancers of different ages and skill levels in one competition could create problems, thereby making it difficult for judges to distinguish performances. The competitions analysed in this study were open to students from elementary to junior high school age; thus, a wide range of skill levels was likely observed and incorporated into performance evaluations. This wide range may be one of the reasons for the higher systematic bias that was found. Pajek et al. (2014) suggested that a possible reason for the low validity of artistic scores in artistic gymnastics was poorly defined criteria in the Code of Points. In this study, biases due to judges’ perceptions and preferences in evaluating the quality of performance and differences in interpretation of the judging criteria are assumed to contribute to score variability.
Amongst the evaluation categories used in this study, technical quality and synchronisation fall within the technical category, whilst creativity, expression/interpretation and impression fall within the artistic category. Technical quality, on the technical side, demonstrated the highest reliability, whilst impression, on the artistic side, showed the lowest reliability. Similar results were found in figure skating (Lockwood et al., 2005), artistic gymnastics (Pajek et al., 2014) and Dancesport (Premelč et al., 2019). These results imply that the artistic side of evaluation may be more affected by factors such as facial expression and body shape, as previous studies have demonstrated (Cunningham et al., 1990; Tovée et al., 1999; Pawlowski et al., 2000). Therefore, a new evaluation system that accounts for this effect would improve reliability on the artistic side of evaluation of hip-hop dance.
To implement a reliable evaluation system in hip-hop dance competitions, a detailed description of each level for each category must be provided as a first step. A clear evaluation system or tool will help judges interpret the criteria in the same way, thus reducing score variability due to differences in interpretation. Second, evaluation categories must also be reconsidered. In hip-hop dance, many factors other than movement are considered to affect performance evaluation. Evaluation categories should be based on the factors that affect performance evaluation. In artistic gymnastics and figure-skating competitions, rankings are determined by the final technical and artistic point scores. In hip-hop dance, the difficulty of the technique is important, but the artistic aspect is also important. The weight of the technical and artistic aspects in the evaluation, including the number of evaluation categories for each of these two aspects, must be considered. Third, biases that have been reported, including the order of performance, the position of the judges and the experience of the judges, should also be verified in hip-hop dance. Fourth, using a video system that is designed to record performances and observe them immediately afterwards would allow judges to observe dances multiple times; the use of such a system should be considered. Fifth, in hip-hop dance competitions performed in a group format, the evaluation criteria must be provided separately for individual dancers’ performance and group performance. In rhythmic gymnastics, the evaluation criteria are separately defined for the evaluation of individual gymnasts and collaborative performances (Fédération Internationale de Gymnastique, 2021).
In this study, the results of hip-hop dance competitions in which multiple dance groups’ performances are ranked by performance scores (similar to gymnastics and figure skating) were analysed. However, break dancers will most likely compete in a one-on-one battle format at the Paris Olympics. Break dancing originated in hip-hop culture, and the winner is determined by the extent of the audience’s excitement, which is influenced by their preferences and subjective impressions of the performance. However, as hip-hop dance (including break dancing) has grown in popularity, objective evaluation systems have been developed to combat potential biases such as reputation and style preferences (Fogarty, 2018). Although the competition format for break dancing at the Paris Olympics is unknown, our findings can be used to develop a reliable standard of evaluation in a battle format. As mentioned earlier, multiple evaluation categories (divided into technical and artistic aspects) should be established, as well as a detailed description of each level for each category. Given that the characteristics of break dancing are strongly linked to the creative expression of one’s identity, emotions and artistic sensibilities, the weighting of technical and artistic aspects should also be considered in the final score (Fogarty, 2018). Competing to determine which dancer is better scored in these evaluation categories allows for a more reliable evaluation.
This study has a few limitations. First, the performances of dancers with a wide range of skill levels were used for evaluation, as the competitions from which the data were pulled and analysed were open to elementary and junior high school-aged participants. Study results may have been different using data from competitions with more skilled adult dance performances. Second, only the scores of judges from competitions in Japan were analysed. Further studies should be undertaken to investigate scores from competitions held in other countries and world competitions. Third, it is not clear how the judges who participated in the competitions analysed in this study varied in their ability to evaluate the performance accurately and consistently. Since judges’ experience is an important factor influencing evaluation reliability (Flessas et al., 2015), this factor may have influenced this study’s results.
This study was the first to investigate the reliability of the evaluation results of hip-hop dance competitions. The study’s results will contribute to the development of a more reliable evaluation system for hip-hop dance competitions. To implement a reliable evaluation system, the reliability of the evaluation must be constantly investigated and feedback must be provided at the same time the system is developed. An evaluation system that can be explained objectively provides not only reliable evaluations but also guidelines for dancers and coaches to use as they work towards achieving high scores in competitions. A new evaluation system will ensure that hip-hop dance continues to develop as an Olympic sport.
Data availability statement
The datasets generated and/or analyzed during the current study are not publicly available due to contract with the organisation that provided the data but are available from the corresponding author on reasonable request.
Ethics statement
Studies involving human participants were reviewed and approved by the Nagoya Gakuin University Research Ethics Committee. Written informed consent from the participants or the participants’ legal guardians/next of kin was not required to participate in this study in accordance with the national legislation and the institutional requirements.
Author contributions
NS contributed to the conception and design of the study, organized the database, performed the statistical analysis, and wrote the manuscript.
Funding
This work was supported by the Nagoya Gakuin University Grant (2021–2024).
Conflict of interest
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2022.934158/full#supplementary-material
References
Atiković, A., Kalinski, S. D., Bijelić, S., and Vukadinović, N. A. (2011). Analysis results judging world championships in men’s artistic gymnastics in the London 2009 year. Sport. Log. 7, 95–102. doi: 10.5550/sgia.110702.en.095A
Bučar, M., Čuk, I., Pajek, J., Karacsony, I., and Leskošek, B. (2012). Reliability and validity of judging in women’s artistic gymnastics at university games 2009. Eur. J. Sport Sci. 12, 207–215. doi: 10.1080/17461391.2010.551416
Craine, D., and Mackrell, J. (2010). The Oxford Dictionary of Dance. New York: Oxford University Press.
Cunningham, M. R., Barbee, A. P., and Pike, C. L. (1990). What do women want? facialmetric assessment of multiple motives in the perception of male facial physical attractiveness. J. Pers. Soc. Psychol. 59, 61–72. doi: 10.1037/0022-3514.59.1.61
Dallas, G., Mavidis, A., and Chairopoulou, C. (2011). Influence of angle of view on judges’ evaluations of inverted cross in men’s rings. Percept. Mot. Skills 112, 109–121. doi: 10.2466/05.22.24.27.PMS.112.1.109-121
Fédération Internationale de Gymnastique (2021). Rules. Available at: https://www.gymnastics.sport/site/rules (Accessed November 9, 2021).
Fernandez-Villarino, M. A., Bobo-Arce, M., and Sierra-Palmeiro, E. (2013). Practical skills of rhythmic gymnastics judges. J. Hum. Kinet. 39, 243–249. doi: 10.2478/hukin-2013-0087
Findlay, L. C., and Ste-Marie, D. M. (2004). A reputation bias in figure skating judging. J. Sport Exerc. Psychol. 26, 154–166. doi: 10.1123/jsep.26.1.154
Fleiss, J. L., Levin, B., and Paik, M. C. (2013). Statistical Methods for Rates and Proportions. New Jersey: John Wiley & Sons.
Flessas, K., Mylonas, D., Panagiotaropoulou, G., Tsopani, D., Korda, A., Siettos, C., et al. (2015). Judging the judges’ performance in rhythmic gymnastics. Med. Sci. Sports Exerc. 47, 640–648. doi: 10.1249/MSS.0000000000000425
Fogarty, M. (2018). “Why are breaking battles judged? The rise of international competitions” in The Oxford Handbook of Dance and Competition. ed. S. Dodds (New York: Oxford University Press), 409–428.
Hip Hop International (2021). Rules and regulations. Available at: http://www.hiphopinternational.com/officialrules/ (Accessed November 9, 2021).
International Olympic Committee (2021). Breaking. Available at: https://olympics.com/en/sports/breaking/ (Accessed November 9, 2021).
International Skating Union (2021). ISU judging system. Available at: https://www.isu.org/figure-skating/rules/fsk-judging-system (Accessed November 9, 2021).
Leskošek, B., Čuk, I., Karácsony, I., Pajek, J., and Bučar, M. (2010). Reliability and validity of judging in men’s artistic gymnastics at the 2009 university games. Sci. Gymnast J. 2, 25–34.
Lockwood, K. L., McCreary, D. R., and Liddell, E. (2005). Evaluation of success in competitive figure skating: an analysis of interjudge reliability. Avante 11, 1–9.
Ojofeitimi, S., Bronner, S., and Woo, H. (2012). Injury incidence in hip hop dance. Scand. J. Med. Sci. Sports 22, 347–355. doi: 10.1111/j.1600-0838.2010.01173.x
Pajek, M. B., Čuk, I., Pajek, J., Kovač, M., and Leskošek, B. (2013). Is the quality of judging in women artistic gymnastics equivalent at major competitions of different levels? J. Hum. Kinet. 37, 173–181. doi: 10.2478/hukin-2013-0038
Pajek, M. B., Čuk, I., Pajek, J., Kovač, M., and Leskošek, B. (2014). The judging of artistry components in female gymnastics: a cause for concern? Sci. Gymnast J. 6, 5–12.
Pawlowski, B., Dunbar, R. I., and Lipowicz, A. (2000). Tall men have more reproductive success. Nature 403:156. doi: 10.1038/35003107
Plessner, H. (1999). Expectation biases in gymnastics judging. J. Sport Exerc. Psychol. 21, 131–144. doi: 10.1123/jsep.21.2.131
Premelč, J., Vučković, G., James, N., and Leskošek, B. (2019). Reliability of judging in dance sport. Front. Psychol. 10:1001. doi: 10.3389/fpsyg.2019.01001
Sato, N., and Hopper, L. S. (2021). Judges’ evaluation reliability changes between identifiable and anonymous performance of hip-hop dance movements. PLoS One 16:e0245861. doi: 10.1371/journal.pone.0245861
Tovée, M. J., Maisey, D. S., Emery, J. L., and Cornelissen, P. L. (1999). Visual cues to female physical attractiveness. Proc. Biol. Sci. 266, 211–218. doi: 10.1098/rspb.1999.0624
World DanceSport Federation (2021a). Buenos Aires 2018 youth Olympic games rules and regulations. Available at: https://www.worlddancesport.org/News/BreakingForGold/BAYOG_Rules_and_Regulations-2667 (Accessed November 9, 2021).
World DanceSport Federation (2021b). Judging systems. Available at: https://www.worlddancesport.org/Rule/Competition/General/Judging_Systems (Accessed November 9, 2021).
Keywords: hip-hop dance, judging system, aesthetic sport, reliability, validity, competition, Japan
Citation: Sato N (2022) Improving reliability and validity in hip-hop dance assessment: Judging standards that elevate the sport and competition. Front. Psychol. 13:934158. doi: 10.3389/fpsyg.2022.934158
Edited by:
George Waddell, Royal College of Music, United Kingdom
Reviewed by:
João Nunes Prudente, University of Madeira, Portugal
Pirkko Markula, University of Alberta, Canada
Copyright © 2022 Sato. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Nahoko Sato, nsato@ngu.ac.jp