AUTHOR=Clark Corinna C. A., Rooney Nicola J.

TITLE=Does Benchmarking of Rating Scales Improve Ratings of Search Performance Given by Specialist Search Dog Handlers?

JOURNAL=Frontiers in Veterinary Science

VOLUME=Volume 8 - 2021

YEAR=2021

URL=https://www.frontiersin.org/journals/veterinary-science/articles/10.3389/fvets.2021.545398

DOI=10.3389/fvets.2021.545398

ISSN=2297-1769

ABSTRACT=Rating scales are widely used to rate working dog behaviour and performance. Instruments used to rate ability have often been designed by training and practitioner organisations, with relatively little consideration of how seemingly insignificant aspects of the scale design might alter the validity of the results obtained. Here we illustrate how manipulating one aspect of rating scale design, the provision of verbal benchmarks or labels, can affect the ability of observers to distinguish between differing levels of search dog performance in an operational environment.

We compared inter-rater reliability, raters’ ability to discriminate between different levels of search dog performance, and their use of the whole scale before and after being presented with benchmarked scales for the same traits. Raters scored the performance of two separate types of explosives search dog, High Assurance Search (HAS) and Vehicle Search (VS) dogs, from short (approximately 30 s) video clips, using 11 previously validated traits. Taking each trait in turn, for the first five clips raters were asked to give a score from 1, representing the lowest amount of the trait evident, to 5, representing the highest. Raters were then given a list of adjective-based benchmarks (e.g. very low, low, intermediate, high, very high) and scored a further five clips for each trait. For certain traits (e.g. Motivation and Independence) the reliability of scoring improved when benchmarks were provided, indicating that their inclusion may reduce ambivalence in scoring, ambiguity of meanings, and cognitive difficulty for raters. However, this effect was not universal: the reliability of ratings for some traits remained unchanged (e.g. Control) or even decreased (e.g. Distraction). There were also some differences between VS and HAS raters (e.g. for Confidence, reliability increased for VS raters and decreased for HAS raters). There were few improvements in the spread of scores across the range, but some indication of more favourable scoring.

This was a small study of operational handlers and trainers using training video footage from realistic operational environments, and there are potential confounding effects. Nevertheless, it illustrates why it is vitally important to validate all aspects of rating scale design, even those that may seem inconsequential.