Using Expert Elicitation to Abridge the Welfare Quality® Protocol for Monitoring the Most Adverse Dairy Cattle Welfare Impairments

Tuyttens, Frank A. M.; de Graaf, Sophie; Andreasen, Sine Norlander; de Boyer des Roches, Alice; van Eerdenburg, Frank J. C. M.; Haskell, Marie J.; Kirchner, Marlene K.; Mounier, Luc.; Kjosevski, Miroslav; Bijttebier, Jo; Lauwers, Ludwig; Verbeke, Wim; Ampe, Bart

doi:10.3389/fvets.2021.634470

ORIGINAL RESEARCH article

Front. Vet. Sci., 28 May 2021

Sec. Animal Behavior and Welfare

Volume 8 - 2021 | https://doi.org/10.3389/fvets.2021.634470

This article is part of the Research Topic Animal Welfare Assessment, Volume II

Frank A. M. Tuyttens^1,2*

Sophie de Graaf^1,3

Sine Norlander Andreasen⁴

Alice de Boyer des Roches⁵

Frank J. C. M. van Eerdenburg⁶

Marie J. Haskell⁷

Marlene K. Kirchner⁸

Luc. Mounier⁵

Miroslav Kjosevski⁹

Jo Bijttebier¹

Ludwig Lauwers^1,3

Wim Verbeke³

Bart Ampe¹

¹Animal Sciences Unit, Institute for Agricultural and Fisheries Research (ILVO), Merelbeke, Belgium
²Department of Nutrition, Genetics and Ethology, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium
³Department of Agricultural Economics, Ghent University, Ghent, Belgium
⁴Department of Veterinary and Animal Sciences, University of Copenhagen, Frederiksberg, Denmark
⁵Université Clermont Auvergne, INRAE, VetAgro Sup, UMR Herbivores, Saint-Genès-Champanelle, France
⁶Department of Veterinary and Animal Sciences, Section of Animal Welfare and Disease Control, University of Copenhagen, Frederiksberg, Denmark
⁷Scotland's Rural College, Department of Population Health Sciences, Section Farm Animal Health, Faculty of Veterinary Medicine, Utrecht University, Utrecht, Netherlands
⁸Animal Behavior and Welfare, Animal and Veterinary Sciences, SRUC, Edinburgh, United Kingdom
⁹Animal Welfare Center, Faculty of Veterinary Medicine, Ss. Cyril and Methodius University in Skopje, Skopje, North Macedonia

The Welfare Quality® consortium has developed and proposed standard protocols for monitoring farm animal welfare. The uptake of the dairy cattle protocol has been below expectation, however, and it has been criticized for the variable quality of the welfare measures and for a limited number of measures having a disproportionately large effect on the integrated welfare categorization. Aiming for a wide uptake by the milk industry, we revised and simplified the Welfare Quality® protocol into a user-friendly tool for cost- and time-efficient on-farm monitoring of dairy cattle welfare with a minimal number of key animal-based measures that are aggregated into a continuous (and thus discriminative) welfare index (WI). The inevitable subjective decisions were based upon expert opinion, as considerable expertise about cattle welfare issues and about the interpretation, importance, and validity of the welfare measures was deemed essential. The WI is calculated by summing, over all measures, the severity score (i.e., how severely a welfare problem affects cow welfare) multiplied by the herd prevalence. The selection of measures (lameness, leanness, mortality, hairless patches, lesions/swellings, somatic cell count) and their severity scores were based on expert surveys (14–17 trained users of the Welfare Quality® cattle protocol). The prevalence of these welfare measures was assessed in 491 European herds. Experts allocated a welfare score (from 0 to 100) to 12 focus herds for which the prevalence of each welfare measure was benchmarked against all 491 herds. Quadratic models indicated a high correspondence between these subjective scores and the WI (R² = 0.91). The WI allows both a numerical (0–100) and a qualitative (“not classified” to “excellent”) evaluation of welfare.
Although it is sensitive to those welfare issues that most adversely affect cattle welfare (as identified by EFSA), the WI should be accompanied by a disclaimer that lists adverse or favorable effects that cannot be detected adequately by the current selection of measures.

Introduction

A tool to correctly assess and monitor animal welfare is key to many initiatives to improve the welfare of livestock (1). Obviously, the characteristics of this monitoring tool depend on how it is to be applied. For example, the tool may be very elaborate, refined, high tech, and comprehensive if it is to be used in experimental animal welfare research or for in-depth assessments of a limited number of focal herds by a multidisciplinary team of highly trained specialists. The focus of the current study, however, is on a tool that is to be taken up widely by the food industry at large (e.g., for an animal welfare label on food products). For this type of application, the logistic feasibility, the costs, and the user-friendliness are major constraints. At the same time, as socioeconomic stakes can be high, decisions about the animal welfare status allocated to herds or food products ought to be transparent, non-disputable, and accepted as valid by the main stakeholders (e.g., farmer, auditor, retailer, consumer).

Balancing these logistic and scientific requirements is a huge challenge. As a multidimensional societal concept, the number of ways that the welfare of livestock can be affected positively or negatively, and how these effects can be assessed, is very diverse and almost endless. The scientific ambition to accurately document any small change in the status of any of these multiple animal welfare aspects is poorly compatible with the industry demand that the tool is cost efficient and easy to implement. Hence, choices will need to be made about which aspects of welfare to include and about the resolution by which these will be documented. These choices will be subjective to some degree because the conception of animal welfare is partly values based, and people differ in what they consider important or desirable for animals to have a good life (2).

Another characteristic of the monitoring tool that depends on the intended application concerns the need to aggregate the information from the individual welfare measures into an integrated, balanced overall welfare index (WI). Such aggregation may be redundant in case the tool is used to provide farm-specific feedback on how certain welfare problems in a herd could be addressed. However, it is essential for the purpose of the tool developed in this study, namely, to inform consumers about the general welfare status of the animals from which food is derived (1). In fact, aggregating data from various welfare measures into a WI reflecting the overall welfare status of the herd is one of the most difficult challenges in animal welfare science (3). As there is no “gold standard” for overall herd welfare, aggregating data on various welfare measures into an overall index again requires some degree of subjectivity (4).

Standardized methodologies for assessing the welfare of various categories of farm animals, including broiler chickens, laying hens, growing pigs, sows, veal calves, and dairy cattle, were developed in the European Welfare Quality® (WQ) project (5). The WQ protocols have been praised for being very comprehensive and for the implementation of a hierarchical approach to integrate data on a multitude of predominantly animal-based welfare measures enabling the assignment of farms or herds to one of the four overall welfare categories (not classified, acceptable, enhanced, and excellent). Although issues about consistency over time (6–9) and about reliance on complete and standardized farm/slaughterhouse records (10–12) have been raised, the WQ protocols have been criticized mainly with regard to (i) their feasibility [mainly labor costs per farm, e.g., (11, 13)], (ii) the variable quality of the welfare measures included in the protocol (8, 10, 14), and (iii) the way these measures are aggregated into an overall WI (15–21). Indeed, uptake of the WQ protocols by the authorities and food industry at large for improving and better marketing of farm animal welfare has been below expectation. Although stakeholders have expressed interest in welfare monitoring of various types of farm animals, they have emphasized that the labor demand of about one farm or herd per day per certified assessor needs to be reduced. de Jong et al. (11) have addressed these industry concerns by proposing time-saving simplifications to the WQ broiler chicken protocol but, to our knowledge, no such modifications have been shown to be promising for the other protocols. Such simplification is particularly needed for the dairy cattle protocol, as it takes 4.4–7.7 h to complete for a herd of 25–200 cows, respectively, excluding the time needed for making the appointment and for travel (22).

Criticisms of welfare measures often relate to their poor reliability, validity, or feasibility (10, 11, 13, 14). There is a growing consensus now that animal-based measures are preferred for directly assessing the outcome of the complex effects of the environment and management on the animal's actual state of welfare (1, 23, 24). Although one of the novel characteristics of the WQ protocols was the emphasis on animal-based measures, the WQ protocols also include resource- or management-based measures that have been criticized for describing the potential or risk for good or bad welfare rather than directly measuring the welfare status itself. The dairy cattle protocol, for example, relies on resource-based measures for assessing 3 of the 12 welfare criteria (water availability and cleanliness for the criterion absence of prolonged thirst, tethering for the criterion ease of movement, and pasture access for the criterion expression of other behaviors). It is particularly worrying that sensitivity analyses have revealed that a limited number of (often resource-based) measures seem to have a disproportionately large effect on the overall welfare categorization [e.g., 88% of the overall dairy cattle welfare categorization is predicted by water availability and cleanliness (17)], whereas some key (often animal-based) measures such as lameness and mortality have a negligible effect (16–18, 21). This appears to be an unwanted side effect of the very complex and hard-to-understand (and hence poorly transparent to most end-users) integration method, which was needed to aggregate so many measures of different scales with different thresholds.

Aiming for a wide uptake by the milk industry, in the current study, we revised and simplified the WQ dairy cattle protocol with a view to (i) drastically reduce the time needed to complete an assessment, (ii) make use of a minimal number of key animal-based measures, and (iii) transparently aggregate these measures into a continuous (and thus discriminative) WI. We describe and illustrate the steps in the development of this revised and simplified protocol for quantifying the level of herd welfare, albeit without claiming to be exhaustive. The WI is based upon the intuitively sensible method of Burow et al. (25) in which the relative weight of each welfare measure depends on its severity score (expert judgement of how severely a given welfare problem affects the welfare of an individual cow) multiplied by the herd prevalence for that measure. Moreover, we investigate the extent to which the integration method should allow compensation of poor scores with better scores. In some studies (4), it is argued that such compensation should be restrained, as good results on one aspect cannot compensate for poor scores on other aspects (e.g., having a good body condition score cannot compensate for being severely lame). Other studies, however, indicate that compensation between welfare aspects may be possible [reviewed by Leknes and Tracey (26)]. At present, there is little evidence that compensation reduction is warranted, let alone what type of compensation-reduction method best corresponds with expert opinion. The latter is examined in one of the proposed steps in this study. Some of the steps inevitably demand subjective decisions. These were based upon expert (defined as an animal scientist trained to use the WQ dairy cattle protocol) opinion, as considerable expertise about cattle welfare issues and about the interpretation, importance, and validity of the welfare measures was deemed essential. 
For this study, we opted not to involve people without in-depth knowledge and expertise in dairy cattle welfare and the measures involved because of doubts about their ability to adequately balance the importance of different welfare measures. Indeed, the relative importance that ought to be allocated to a given welfare measure could depend on how exactly it is measured on-farm (e.g., selection of and size of the sample, to what extent confounding factors may influence the measures, objectivity of the measure). Moreover, it has been shown that detailed information on how data on welfare measures are collected on-farm can significantly influence the relative weights they are given by experts (27). Even for dairy cattle welfare experts, it can be a daunting task to make decisions about overall welfare status by integrating the scores of the various measures in such a way that the outcome reflects the range of what can be expected among real farms and allows realistic differentiation between these farms. Expert welfare scoring of herds was, therefore, based on a large database of WQ data that reflects a wide range of dairy herd types in Europe, thereby ensuring a substantial but realistic spread in observed values.

Materials and Methods

Our approach to revise and simplify the WQ dairy cattle protocol involved five steps. The same steps can be used to revise and simplify the other WQ protocols or to add additional welfare measures if this would be deemed desirable. The first four steps inevitably require subjective decisions for which experts with knowledge of the WQ dairy cattle protocol were consulted. We emailed 31 researchers who were known to the authors, to our network, or to the Welfare Quality Network to have been trained to use the WQ dairy cattle protocol. These trained users were in turn asked to provide contact details of any additional animal welfare scientists who would be suitable (i.e., trained to use the WQ protocol). Fourteen declined the invitation to participate because they could not fill out the survey in time or did not respond. All experts who agreed to participate in the current study had experience with the WQ protocol for dairy cattle (i.e., were trained to perform the WQ protocol for dairy cattle and had used it to assess the welfare of dairy herds), were animal scientists, and had authored at least one peer-reviewed scientific paper about dairy cattle welfare involving the WQ protocol. Although we did not select for this, all participating experts were from Europe (the WQ protocols are used predominantly in Europe), and a total of eight nationalities were represented (British, Spanish, Macedonian, Dutch, Finnish, Austrian, German, and French). No experts whose input was used in the analyses were involved in creating the surveys.

Step 1 entails selecting animal-based welfare measures to be included in the protocol. At the core of Steps 2 and 3 is the WI. Based upon Burow et al. (25), the WI was constructed from perceived severity of welfare problems (“severity score”) and observed prevalence of these welfare problems. The severity scores for the various welfare measures were determined in Step 2 by asking the experts to score how severely each of the selected welfare problems (that are quantified by the selected measures) impairs the welfare of an animal. The following formula forms the basis to integrate data on selected welfare measures into one score:

$$\text{Welfare index score} = \frac{1}{n_m} \times \sum_{m=1}^{n_m} S_m \times rP_m$$

Here, n represents “number” (so nm is the number of measures), m refers to “measure,” S represents the “severity score,” which ranged from 0 to 100, and rP refers to “relative prevalence,” which is calculated as the prevalence in the herd divided by the prevalence at the 97.5th percentile of that measure among all herds in the EU database. In the proposed formula, rP rather than absolute prevalence was used so that each herd covered the same possible spectrum for each measure. The prevalence at the 97.5th percentile was set as the maximum for each measure score, to prevent an extreme prevalence value of a single measure from having a disproportionately large influence on the score. Therefore, herds with values equal to or higher than the 97.5th percentile were automatically given the maximum measure score. This allowed for a uniform method to determine thresholds for the different compensation-reduction methods (CRMs) that were tested. To achieve a score on a scale of 0 (very poor welfare) to 100 (excellent welfare) and to test various CRMs, the formula was complemented as follows:

$$\text{Welfare index score} = 100 - \frac{100}{S_{max}} \times \sum_{m=1}^{n_m} S_m \times rP_m \times C_m$$

Here, Cm is the “compensation-reduction factor” for measure m (value between 1 and Cmax), and Smax is the sum of the products of Sm and the maximal compensation-reduction factor ($S_{max} = \sum_{m=1}^{n_m} S_m \times C_{max}$). To gain input for this formula, we performed two independent online surveys among the dairy cattle welfare experts. In Step 3, the WI is calculated, and its correspondence with expert opinion is analyzed: the experts' welfare scores for several fictitious herds are compared with the integrated WI obtained from the aforementioned formula with various CRMs. Step 4 consists of interpreting the WI (what score indicates poor/good welfare). Step 5 comprises checking to what degree the selected welfare measures are associated with factors that have the most severe impact on dairy cattle welfare. The five steps are elaborated below.
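The WI formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic as described in the text, not the authors' implementation: the function names, the measure names, and all numeric inputs are hypothetical, and the compensation-reduction factors default to 1 (no CRM).

```python
def relative_prevalence(prevalence, p975):
    """Herd prevalence relative to the 97.5th-percentile prevalence, capped at 1."""
    if p975 <= 0:
        raise ValueError("the 97.5th-percentile prevalence must be positive")
    return min(prevalence / p975, 1.0)


def welfare_index(severity, prevalence, p975, c=None, c_max=1.0):
    """WI = 100 - (100 / Smax) * sum(Sm * rPm * Cm), with Smax = sum(Sm * Cmax)."""
    if c is None:
        c = {m: 1.0 for m in severity}  # Cm = 1: no compensation reduction
    s_max = sum(s * c_max for s in severity.values())
    penalty = sum(
        severity[m] * relative_prevalence(prevalence[m], p975[m]) * c[m]
        for m in severity
    )
    return 100.0 - 100.0 * penalty / s_max


# Hypothetical inputs for illustration only (not values from the study).
severity = {"measure_a": 80.0, "measure_b": 40.0}  # severity scores, 0 to 100
p975 = {"measure_a": 40.0, "measure_b": 10.0}      # 97.5th-percentile prevalence (%)
herd = {"measure_a": 10.0, "measure_b": 5.0}       # observed herd prevalence (%)

print(round(welfare_index(severity, herd, p975), 2))
```

A herd with zero prevalence on every measure scores 100 in this sketch, and a herd at or above the 97.5th percentile on every measure scores 0, which matches the intended scale.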

Step 1: Selecting Welfare Measures

Welfare measures were selected from the WQ protocol for dairy cattle (22). We used three criteria for selecting measures: (1) they ought to be animal-based, (2) it must be possible to express them as a percentage to allow using the proposed WI-formula, and (3) they must be considered as important for dairy cattle welfare by the experts. The importance of the measures was based upon an online survey in which 17 experts ranked all WQ measures (n = 27) on importance for the overall welfare status of a herd of dairy cattle. Although the experts were presumed familiar with each of these measures, the precise methodology could be consulted in the WQ protocol for the assessment of dairy cow welfare (www.welfarequalitynetwork.net). The experts were told that, for the ranking, they could consider (inter alia) reliability, validity, perceived relevance, and prevalence. Subsequently, we compared the selected measures with the outcomes of published studies in which expert opinion had also been used to rank cattle welfare measures on importance (25, 28–30). Hence, in theory, measures could have been added had the literature search revealed important animal-based measures that had not passed our initial selection (but this was not the case in our study).

Step 2: Determining Severity Scores

To determine the severity scores for the selected measures, 14 of the aforementioned 17 experts completed a second survey. In this second survey, they were asked to score how severely the welfare of an individual cow is affected by each of the six selected welfare impairments on a scale of 0 (not severe at all) to 100 (extremely severe). The experts were informed that they could take (their perception of) both the degree and duration of suffering into account. In the ensuing Step 3, median severity scores were used in calculating the WI.

Step 3: Calculating WI and Testing Coherence With Expert Opinion

For checking correspondence between expert scores and aggregated WIs, in the subsequent part of the second survey, the 14 selected experts were presented with a graph showing the observed prevalence distribution of all selected welfare measures for 491 European herds that had been assessed using the WQ protocol (Figure 1). To reflect the current range present in Europe across various herding systems, existing WQ datasets were collated from seven European research institutes and included data from 10 countries [Macedonia, The Netherlands, France, Belgium, Scotland, Denmark, Romania, Northern Ireland, Spain, and Austria; more details in de Graaf et al. (20)]. In the graph, six “focus herds” were highlighted per expert (example: Figure 1; data shown in Table 1). These focus herds were fictitious but were based upon real herd data from the European dataset. In total, 12 focus herds were created to fit the following descriptions: (1) two herds that scored high in prevalence on all measures, taking the European dataset as a reference (indicating poor welfare); (2) two herds that scored low (indicating good welfare) on all measures; (3) two herds that scored medium on one-half of the measures and high on the other half; (4) two herds with the reverse split, i.e., high on the first half of the measures and medium on the other half; (5) two herds that scored medium on all measures except for one (high for somatic cell count > 400,000); and (6) two herds that scored medium on all measures but high for one (high for severe lameness). High scoring measures in the latter two mentioned herds were chosen randomly from the selected measures. High prevalences belonged to the top 5% for all welfare measures, medium prevalences lay between 40 and 60%, and low prevalences were from the lowest 5%. Each expert was presented with six focus herds, one of the two for each category (Table 1).
Experts were asked to allocate a welfare score to each focus herd they were presented with using a tagged visual analog scale from 0 to 100. Tags were “Not Classified (<20),” “Acceptable (20–55),” “Enhanced (55–80),” and “Excellent (>80),” following WQ categorization (22). Each of the 12 focus herds was thus scored by six to eight experts. Subsequently, the degree of correspondence between expert scores and WIs was calculated with varying CRMs. One of the tested CRMs was “veto,” where thresholds are defined for each measure above which a value cannot be compensated for. This is achieved by automatically attributing the worst possible welfare score to a herd, independent of the prevalence of other welfare problems. The other tested CRMs use various formulas to allocate increasingly more weight to worse scores on a certain measure. The formulas tested in the current study were “Discrete,” “Linear,” “Broken line,” and “Exponential” (illustrated in Figure 2). In addition, scores were calculated without CRM (“no CRM”), thus allowing full compensation between measures as default.

Figure 1. One of the graphs presented to experts in the second survey showing the distribution of all herds in the database (n = 491) for the six selected measures. Colored triangles mark six (of the 12) focus herds.

Table 1. Prevalences of the 6 selected dairy cattle welfare measures for each of the 12 fictitious herds to which the experts (n = 14) allocated an integrated index score.

Figure 2. Illustration of the compensation reduction methods (except Veto) tested in this study with a maximal compensation of 3 and a threshold of 40. No compensation reduction method (CRM, black line) results in the diagonal (value before and after compensation is the same). Discrete gives no compensation reduction for measures up to a certain threshold of Sm*rPm, above which the Sm*rPm score is multiplied with maximum fixed value Cm. For linear CRM, Cm increases linearly with an increasing Sm*rPm score of the welfare measures. The broken line CRM gives no compensation reduction for measures up to a certain threshold of Sm*rPm, above which Cm increases in a linear manner. Exponential CRM increases Cm exponentially with an increasing Sm*rPm score of the welfare measures.
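The compensation-reduction factors described in the Figure 2 caption could be sketched as follows. The text does not give the exact functional forms, so everything beyond the caption is an assumption made for illustration: here Cm equals 1 at a score of 0, the linear, broken line, and exponential variants reach Cmax at a score of 100, and the exponential form is taken to be Cmax ** (x / 100). The function names are hypothetical, and the defaults mirror the caption (threshold of 40, maximal compensation of 3).

```python
def cm_none(x, threshold=40.0, c_max=3.0):
    """No CRM: the Sm*rPm score is left unchanged."""
    return 1.0


def cm_discrete(x, threshold=40.0, c_max=3.0):
    """Discrete: factor 1 up to the threshold, fixed Cmax above it."""
    return c_max if x > threshold else 1.0


def cm_linear(x, threshold=40.0, c_max=3.0):
    """Linear (assumed form): factor rises from 1 at x = 0 to Cmax at x = 100."""
    return 1.0 + (c_max - 1.0) * x / 100.0


def cm_broken_line(x, threshold=40.0, c_max=3.0):
    """Broken line (assumed form): 1 up to the threshold, then linear to Cmax at x = 100."""
    if x <= threshold:
        return 1.0
    return 1.0 + (c_max - 1.0) * (x - threshold) / (100.0 - threshold)


def cm_exponential(x, threshold=40.0, c_max=3.0):
    """Exponential (assumed form): Cmax ** (x / 100), i.e. 1 at x = 0 and Cmax at x = 100."""
    return c_max ** (x / 100.0)


def adjusted_score(x, crm, threshold=40.0, c_max=3.0):
    """Sm*rPm score after compensation reduction: x multiplied by its factor Cm."""
    return x * crm(x, threshold, c_max)


# Compare the factors for a hypothetical Sm*rPm score of 70.
for crm in (cm_none, cm_discrete, cm_linear, cm_broken_line, cm_exponential):
    print(crm.__name__, round(crm(70.0), 3))
```

The veto method is omitted, as in Figure 2, because it replaces the whole herd score rather than reweighting a single measure.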

For discrete, broken line, and veto CRM, a threshold at which compensation reduction starts needed to be determined. For all CRMs apart from veto, it also had to be determined what the maximum level of compensation reduction (Cmax) was. We checked which threshold value of S*rP (ranging between 5 and 70 in increments of 5) and which value for Cmax (1.5, 2, 3, 5, or 10) corresponded best with expert opinion based on model R². For the 20 models with the highest R², we also calculated the Akaike information criterion (AIC) and four additional metrics for quantifying the agreement between the model prediction and the experts' opinion: root mean square error (RMSE), mean absolute difference, Liao's improved concordance correlation coefficient [ICCC, (31)], and the Bland–Altman 95% limits of agreement [LOA, (32)]. We ranked these 20 models according to the six agreement metrics and calculated the mean rank (giving equal weight to each of the six metrics). The model with the lowest mean rank was selected as the model (i.e., type of CRM) that provided the best fit with the opinion of the experts.
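The mean-rank selection described above can be illustrated with a short Python sketch. The model names and metric values are invented for the example and are not results from the study. Two further assumptions are made here: the limits of agreement are summarized by their width (narrower taken as better), and ties are broken by input order rather than by averaged ranks.

```python
def mean_ranks(models, higher_is_better):
    """Rank models on each metric (1 = best) and average the ranks with equal weight."""
    names = list(models)
    totals = {name: 0.0 for name in names}
    for metric, higher in higher_is_better.items():
        ordered = sorted(names, key=lambda n: models[n][metric], reverse=higher)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    n_metrics = len(higher_is_better)
    return {name: total / n_metrics for name, total in totals.items()}


# Direction of each of the six agreement metrics (loa_width is an assumed summary).
higher_is_better = {
    "r2": True, "aic": False, "rmse": False,
    "mad": False, "iccc": True, "loa_width": False,
}

# Hypothetical candidate models; all values are made up for illustration.
models = {
    "model_a": {"r2": 0.80, "aic": 610.0, "rmse": 8.1, "mad": 6.2, "iccc": 0.83, "loa_width": 31.0},
    "model_b": {"r2": 0.78, "aic": 622.0, "rmse": 8.9, "mad": 6.9, "iccc": 0.81, "loa_width": 34.0},
    "model_c": {"r2": 0.79, "aic": 615.0, "rmse": 8.0, "mad": 6.5, "iccc": 0.82, "loa_width": 33.0},
}

ranks = mean_ranks(models, higher_is_better)
best = min(ranks, key=ranks.get)  # lowest mean rank = best overall agreement
print(best, ranks)
```

Equal weighting means that a model that is best on most metrics can still be selected even when another model wins on a single metric, as model_c does on RMSE in this made-up example.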

Statistical analyses were performed using the program R 3.2.2 (R Foundation for Statistical Computing, Vienna, Austria). Both linear and quadratic models were used to test correspondence between expert scoring and the integrated scores to determine if adding a CRM to the WI formula generated a better fit for varying thresholds and values of C. The Agreement Interval package was used to calculate the measures of agreement.

Step 4: Interpreting the WI

To interpret the WI scores in terms of bad/medium/good welfare, we asked the experts to score overall welfare for the 12 focus herds on a tagged visual analog scale with labels for four welfare categories following WQ categorization (“not classified” from 0 to 20, “acceptable” from 20 to 55, “enhanced” from 55 to 80, and “excellent” from 80 to 100). To extrapolate thresholds of these welfare categories, we (scatter) plotted the expert scores against the WI scores for the 12 fictitious herds and added the best fitting curve. We then identified the three points where the best-fitting curve intersects with the WQ thresholds of the scale on which the experts scored (expert scores 20, 55, and 80).
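As a sketch of this extrapolation step, suppose the best-fitting curve is a quadratic of the form expert score = a + b·WI + c·WI². The WI thresholds are then the points where this curve crosses the expert scores 20, 55, and 80. The coefficients below are hypothetical placeholders, not the fitted values from the study, and the function name is illustrative.

```python
import math


def wi_at_expert_score(a, b, c, target, lo=0.0, hi=100.0):
    """Return the WI values in [lo, hi] where a + b*WI + c*WI**2 equals target."""
    if c == 0:
        if b == 0:
            return []  # constant curve: no single crossing point
        roots = [(target - a) / b]
    else:
        disc = b * b - 4.0 * c * (a - target)
        if disc < 0:
            return []  # the curve never reaches the target score
        sq = math.sqrt(disc)
        roots = [(-b - sq) / (2.0 * c), (-b + sq) / (2.0 * c)]
    return sorted({r for r in roots if lo <= r <= hi})


# Hypothetical quadratic coefficients for illustration only.
a, b, c = 5.0, 0.2, 0.007

for label, expert_score in (("acceptable", 20.0), ("enhanced", 55.0), ("excellent", 80.0)):
    print(label, [round(wi, 1) for wi in wi_at_expert_score(a, b, c, expert_score)])
```

Restricting the roots to the 0 to 100 range discards the second, out-of-scale solution of the quadratic, so each category boundary maps to a single WI value whenever the fitted curve is monotonic over the scale.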

Step 5: Exhaustiveness Check

In Step 5, we assessed to what degree the selected measures are indicative of the “worst adverse effects” (factors that have the most severe impact) on dairy cattle welfare. To this end, we compared the selection of welfare measures with a list of worst adverse effects on dairy cattle welfare and associated animal-based welfare measures in a European Food Safety Authority (EFSA) report by Nielsen et al. (30). In this report, worst adverse effects were selected based upon several other EFSA reports (24, 33–37), Presi and Reist (38), Brenninkmeyer and Winckler (39), and expert opinion (Table 2).

Table 2. Summary of which of the “worst adverse effects” for dairy cattle welfare are associated with the selection of welfare measures in the current study based upon Nielsen et al. (30).