
ORIGINAL RESEARCH article

Front. Psychol., 15 June 2023
Sec. Quantitative Psychology and Measurement
This article is part of the Research Topic Applied Data Science in Psychology.

Reconstructing individual responses to direct questions: a new method for reconstructing malingered responses

  • 1Department of Surgical, Medical, Molecular & Critical Area Pathology, University of Pisa, Pisa, Italy
  • 2Scuola IMT Alti Studi Lucca, Lucca, Italy
  • 3Department of General Psychology, University of Padua, Padua, Italy

Introduction: The false consensus effect consists of an overestimation of how common a subject’s opinion is among other people. This research demonstrates that an individual’s endorsement of a question may be predicted from their estimate of peers’ responses to the same question. Moreover, we aim to demonstrate how this prediction can be used to reconstruct the individual’s response to a single item as well as the overall response to all of the items, making the technique suitable and effective for malingering detection.

Method: We validated the procedure of reconstructing individual responses from peers’ estimates in two separate studies, one addressing anxiety-related questions and the other the Dark Triad. The questionnaires, adapted to our purposes, were administered to two groups of participants, for a total of 187 subjects across both studies. Machine learning models were used to estimate the results.

Results: According to the results, individual responses to a single question requiring a “yes” or “no” response are predicted with 70–80% accuracy. The participant’s overall score on all questions (total test score) is predicted with a correlation of 0.7–0.77 with the actual results.

Discussion: The application of the false consensus effect format is a promising procedure for reconstructing truthful responses in forensic settings, where the respondent is highly likely to alter his or her true (genuine) responses and truthful responses to the tests are missing.

Introduction

Truthful responses to sensitive questions are difficult to collect as, under some conditions, the respondent is deceptive and responds based on social desirability. For example, pedophiles do not respond “yes” to direct questions such as “Are you a pedophile?” Similarly, drunk drivers will not admit guilt when confronted with direct questions of the “did you do it?” type (Locander et al., 1976). When answering such questions, most respondents express socially desirable responses or a tendency to give overly positive self-descriptions (Paulhus, 2002, p. 50). As a result, researchers who take respondents’ answers at face value tend to underestimate the prevalence of undesirable characteristics while overestimating the prevalence of desirable characteristics.

The deceptive attitudes of respondents have led researchers to devise questioning techniques that guarantee complete anonymity in order to facilitate truthful responses to direct questions addressing sensitive issues. Among the best-known techniques are the stochastic lie detector and the crosswise technique (Hoffmann and Musch, 2016). These techniques enable respondents to conceal their responses to sensitive questions while allowing the researcher to estimate the prevalence of a sensitive characteristic across the entire sample. Although they effectively determine the overall truthfulness of responses at group level, they do not address the issue of accurately estimating the truthfulness of responses for individuals. This problem arises, for example, when a person undergoing a psychiatric evaluation for disability or insurance purposes is asked direct questions, such as “Did you think about suicide?,” as their responses may be influenced by external incentives and are collected using clinical questionnaires. Clinical questionnaires are typically built as a list of symptom-related items, and subjects may simulate or exaggerate psychiatric symptoms when responding (Resnick, 1997). This behavior, called faking bad or malingering, is commonly observed in forensic settings, such as insurance claims or insanity claims in criminal proceedings (Sartori et al., 2017, 2020). Several malingering detection techniques based on validity scales (i.e., MMPI-2 and MCMI-III) or specific questionnaires (i.e., SIMS) have been proposed to detect faking in psychological testing (Young, 2014). For example, the SIMS distinguishes malingerers from honest respondents with high accuracy (Van Impelen et al., 2014) by collecting responses to questions covering a wide range of pseudo-psychopathology.

However, these methods can only detect the presence or absence of a distorted response style and cannot determine whether a claimant is feigning depression or another condition. No procedure appears to be available to estimate the true level of depression in an individual once malingering has been detected, through some form of correction procedure. In other words, no model exists to recover the truthful responses underlying dishonest, malingered responses.

In addition to the tendency to alter a truthful response into a more severe description (malingering), the opposite phenomenon is also observed. Faking good, also known as dissimulation, is the tendency of subjects to give socially desirable responses rather than choosing responses that reflect their true feelings (Zerbe and Paulhus, 1987). In legal settings, such a tendency, independent of psychopathology, may give rise to the dissimulation of psychopathology or, in other words, the denial of psychopathological symptoms (e.g., suicidal ideation). Regarding faking good on psychological questionnaires, desirability scales have been developed to assess a participant’s propensity to present others with a more desirable psychological profile (Kowalski et al., 2018). Currently, there is no procedure for reconstructing a participant’s true level of response when they provide abnormally high, socially acceptable responses on psychological questionnaires.

In this proof-of-concept paper, we use the phenomenon known as the “false consensus effect” to reconstruct truthful responses on a psychological questionnaire. The false consensus effect refers to the tendency to overestimate the proportion of people in a given population whose responses share characteristics with one’s own response (see Ross et al., 1977; Mullen et al., 1985, for a meta-analysis). In more detail, prevalence estimates for OTHER questions (also called consensus estimates) may foretell overt personal behaviors. The typical format of the OTHER questions is: “How many persons out of 100 would you guess will respond ‘yes’ to the following question? I always think about suicide.” For example, Botvin et al. (1992) showed that peers’ smoking estimation might predict future adolescent smoking habits. The fact that people tend to base their estimates of others on their own characteristics is well established, even though the underlying mechanisms are not fully understood (Marks and Miller, 1987). Notably, this tendency is so strong that it persists even when people are explicitly instructed about the bias. For example, Krueger and Clement (1997) coached participants about the false consensus phenomenon just before they made their prevalence estimates and still found no reduction in false consensus (Oostrom et al., 2017). Thus, people seem unable to avoid revealing information about themselves, even when aware of exhibiting this phenomenon. It is worth noting that an alternative explanation of the false consensus effect was put forward by Dawes (1989), who explained the phenomenon using Bayesian analysis.

In the current investigation, to reconstruct truthful responses, we applied the “false consensus effect” as follows:

- Each participant was required to respond to the original version of the questionnaire, which required a YES/NO response (called ME questions).

- The participant was also required to respond to a variant of the original question in the following format: “How many out of 100 persons will respond YES to the following question: Do you think about suicide?” (OTHER% responses). The expected response is, therefore, a percentage (e.g., 10%).

The study aimed to evaluate whether it is possible to predict ME responses from OTHER% responses at the single-subject level. This prediction is expected on the basis of the false consensus effect. To derive accurate predictions, we applied state-of-the-art machine learning (ML) techniques (Mazza et al., 2019; Pace et al., 2019; Orrù et al., 2020a,b, 2021), as ML appears to boost predictive accuracy over more traditional psychometric techniques. ML models were used to predict a subject’s TRUE/FALSE responses to direct questions from that subject’s consensus prevalence estimates for the same questions. As an additional predictor, the average estimated prevalence of the TRUE response on the same item, computed across all subjects, was also included.

To anticipate the results, we found that individual consensus estimates can be used to predict the participants’ own TRUE/FALSE responses to a single question with 75–80% accuracy. The same consensus prevalence estimates may be used to predict the overall percentage of questions to which a subject answered TRUE, with a correlation of around 0.7. The procedure was validated in two separate investigations using distinct questionnaires evaluating anxiety and the Dark Triad.

Study 1: anxiety

Methods

Participants

For this experiment, one hundred healthy participants (79 females) were recruited using a mailing list platform. All subjects were volunteers and provided informed consent before starting the online questionnaire. The experimental procedure was approved by the local ethics committee of the University of Padua, in accordance with the Declaration of Helsinki.

The participants’ mean age was 28.9 years (females: 27.6, males: 34.2), with an average of 16.4 years of education (SD = 2.7).

Stimuli and experimental procedure

The questionnaire consisted of two sections, in which participants were asked to respond twice to the same item. In the first step, participants were required to provide an estimation of the prevalence of “True” responses among their peers (OTHER%) before indicating their personal response of “True/False” to direct questions (ME responses). The subjects answered all the items in the OTHER% category first, followed by all the items in the ME category. This order was chosen because studies have shown that providing the prevalence estimation first causes a more pronounced false consensus effect (Mullen et al., 1985).

The questionnaire comprised 27 items classified as follows:

- Anxiety (A+): (n = 10) items adapted from Spielberger et al. (1983). Anxious responders are expected to answer TRUE.

- Anxiety (A–): (n = 10) positively reframed version of the previous 10 items. Anxious responders are expected to respond FALSE to such items.

- Bizarre items (B): (n = 3) SIMS-adapted items (Smith and Burger, 1997). Honest subjects respond FALSE to items such as “I forget how to get back home,” whereas positive answers indicate that the subject is a malingerer.

- Control items (C): (n = 4) items that are endorsed by most subjects, such as “I like pizza.”

Bizarre (B) and control (C) items were included to check the quality of the data (Monaro et al., 2018a,b). Honest subjects were expected to respond positively to control items (C) and negatively to bizarre items (B).

When responders take a test in a real setting prone to faking (e.g., insurance claims), it has been suggested (Van Impelen et al., 2014) that items that are rarely endorsed, or items that are frequently endorsed, by responders should be included in order to evaluate the overall accuracy of the responses. For this reason, we also included bizarre (B) and control (C) items.

Data analysis

ML techniques implemented in the Weka software (Frank et al., 2016) were used to analyze the data. Weka is a Java-based collection of ML algorithms for data mining tasks. It includes tools for data preparation, classification, regression, clustering, association rule mining, and visualization.

We used 10-fold cross-validation to test the ML models in order to obtain realistic estimates of single-subject, single-question responses to ME questions. Cross-validation is usually a very good procedure for determining the extent to which a result is replicable, at least for what has been referred to as exact replication (Cumming, 2008). When no cross-validation is used, the results are inflated and overly optimistic, and the model may not replicate when applied to out-of-sample data (Bokhari and Hubert, 2018). We applied: (1) ML classification techniques to predict the specific ME response (TRUE/FALSE) of a participant based on their prevalence estimation on OTHER%; (2) ML regressors to predict the percentage of TRUE responses given by a single participant to a set of items (A+, A–, C, B).
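To make the pipeline concrete, a minimal sketch of the cross-validated classification step is given below. The original analyses were run in Weka; scikit-learn is used here only as an illustrative stand-in, and the file name and column names (other_pct, mean_other_pct, me_true) are hypothetical, not taken from the study materials.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Hypothetical long-format data: one row per participant x item, with the item label,
# the participant's OTHER% estimate (0-1), and the ME response coded 0/1.
responses = pd.read_csv("study1_long.csv")

# MEANOTHER%: the average OTHER% estimate for each item across all participants.
responses["mean_other_pct"] = responses.groupby("item")["other_pct"].transform("mean")

X = responses[["other_pct", "mean_other_pct"]]
y = responses["me_true"]

# 10-fold cross-validation, mirroring the procedure described above.
acc = cross_val_score(GaussianNB(), X, y, cv=10, scoring="accuracy")
auc = cross_val_score(GaussianNB(), X, y, cv=10, scoring="roc_auc")
print(f"accuracy = {acc.mean():.3f}, AUC = {auc.mean():.3f}")
```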

Results of Study 1

The analysis was carried out on the entire set of 27 questions, each presented once in the OTHER% format and once in the ME format. The results were obtained by analyzing 2,700 stimuli for the OTHER% questions and 2,700 for the ME questions. The percentage of TRUE and FALSE ME responses by item type is presented in Table 1. As expected, few participants endorsed the socially undesirable bizarre (B) items, while the majority endorsed the control items (C). TRUE responses were around 50% for A+ and A– items, 70% for control items (C), and 13% for bizarre items (B).


Table 1. True and False responses to items indexing A+, A–, C, and B.

In Table 2, average OTHER% estimations are reported separately for TRUE responses to ME questions and FALSE responses to ME questions. The magnitude of the effect size (d = 1.2) indicated that the OTHER% estimations differed based on whether the subject endorsed the item when responding to ME questions. For the effect size interpretation, Cohen (1988, 1995) specifies the following intervals: 0.1–0.3 for a modest effect, 0.3–0.5 for an intermediate effect, and 0.5 and above for a large effect.
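For readers who wish to verify the computation, the sketch below shows the standard pooled-standard-deviation formula for Cohen’s d. The group means are those reported in the text; the standard deviations and group sizes are illustrative placeholders, not the values in Table 2.

```python
import numpy as np

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Means of 62% vs. 36% as in the text; SDs and group sizes assumed for illustration only.
print(cohens_d(m1=62, s1=22, n1=1400, m2=36, s2=21, n2=1300))  # ~1.2
```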


Table 2. Columns report the average (SD) of OTHER% and MEANOTHER%, given TRUE and FALSE ME responses to direct questions.

The correlation between the TRUE/FALSE ME response and OTHER% was r = 0.506. MEANOTHER% indicates the average value of the same estimate across all participants; its correlation with individual ME responses was 0.512. The value of the Pearson r correlation ranges from –1 (a perfect negative correlation) to +1 (a perfect positive correlation) (Pearson, 2008).

From the data reported in Table 2, a strong false consensus effect emerged. Participants who gave high OTHER% estimates also tended to endorse the corresponding sentence when responding to the ME questions. Those who answered TRUE gave a significantly higher average OTHER% estimate (62%) than those who answered FALSE (36%).

Prediction of the specific ME (TRUE/FALSE) response based on OTHER% and MEANOTHER% prevalence estimation

We then examined whether individual ME responses can be predicted from OTHER% responses. A Naive Bayes classifier1 was trained and validated to classify a single response to a single item as TRUE or FALSE on the basis of the OTHER% estimate and the MEANOTHER% value. The first value was the participant’s estimate of the percentage of peers expected to endorse the item (e.g., How many out of 100 people would respond TRUE to the following question: “I forget how to get back home” → 12%). MEANOTHER% indicated the average value of the same figure across all participants for the target question. This information contributed to the classification by allowing the participant’s estimate to be compared with those of all other participants. The results of the Naive Bayes classifier trained using 10-fold cross-validation were the following: correctly classified items = 2,061/2,700, accuracy = 76.33%, with a mean absolute error (MAE) = 2.97% and AUC = 0.84. In Table 3, the confusion matrix derived from cross-validation is reported. This good result also generalizes to other classifiers based on different statistical assumptions (e.g., logistic regression, SVM, decision tree), whose results are comparable to the figures reported in detail above. This indicates that the result is robust across models and is not a product of model hacking (cherry-picking the best-performing model; Orrù et al., 2020a). Best-performing models are usually difficult to interpret, giving rise to a clear interpretability/accuracy trade-off (Johansson et al., 2011). In short, interpretable models are usually not the best performers, and the best-performing classifiers are usually not interpretable. One strategy consists of using hard-to-interpret ML models to estimate maximum accuracy and easy-to-interpret decision rule models for more confidence-based evaluations. One such classifier is the C4.5 decision tree (Quinlan, 1993), whose output is a set of easy-to-understand if-then decision rules. Running the C4.5 decision tree algorithm, we obtained the decision tree represented in Figure 1. The three decision rules shown in Figure 1 yielded an accuracy of 75.78% and an AUC = 0.803.
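The interpretable model used in the paper is Weka’s C4.5; the sketch below is a rough scikit-learn analogue (a shallow CART tree, not the authors’ implementation) that prints comparable if-then rules. It reuses the hypothetical X and y variables from the cross-validation sketch above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# X, y: the OTHER%/MEANOTHER% predictors and ME responses defined in the earlier sketch.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=["other_pct", "mean_other_pct"]))
# The printed rules have the same form as Figure 1, e.g., a first split such as
# "mean_other_pct > 0.52 -> class TRUE".
```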


Table 3. Confusion matrix derived from cross-validation.


Figure 1. C4.5 decision tree. The final leaf reports the number of instances classified by the rule and the number of errors. For example, the first node is the following: IF MEANOTHER% > 0.516, THEN the response predicted by this leaf is TRUE, with an accuracy of 82% (597 responses fall in this leaf, with 107 errors).

The AUC value ranges from 0 to 1. A model whose predictions are 100% incorrect has an AUC of 0.0, while a model whose predictions are 100% correct has an AUC of 1.0. In general terms, an AUC of 0.5 indicates no discrimination in a forced choice between two alternatives. An AUC from 0.7 to 0.9 is considered good, and more than 0.9 is regarded as remarkable (Carter et al., 2016).
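As a quick, generic illustration of how the AUC is obtained (toy numbers, unrelated to the study data), the AUC equals the proportion of TRUE/FALSE pairs in which the TRUE case receives the higher predicted probability:

```python
from sklearn.metrics import roc_auc_score

y_true  = [1, 1, 0, 0, 1]            # toy labels: 1 = TRUE, 0 = FALSE
y_score = [0.9, 0.4, 0.5, 0.2, 0.6]  # toy predicted probabilities of TRUE

# 5 of the 6 TRUE/FALSE pairs are ranked correctly -> AUC = 5/6, about 0.83
print(roc_auc_score(y_true, y_score))
```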

Prediction of the percentage of TRUE responses of a single participant for a given set of items (i.e., A+; A–; C; B)

The analysis reported above was carried out on all of the stimuli presented to the participants. The ML models were then run separately on A+, A–, B, and C items. Similar results were observed for the anxiety-related items as well as for the bizarre (B) and control (C) items.

We evaluated the prediction of the percentage of TRUE responses (%TRUE) to ME questions given by each participant to the sets of items belonging to A+, A–, C, and B. For every participant, four TRUE% scores were computed, one for each item category (A+, A–, C, and B), and a regression model was developed to predict this value based on item type (A+, A–, C, and B), OTHER%, and MEANOTHER%. A linear ridge regression model (using 10-fold cross-validation) yielded a correlation between the actual and predicted scores of r = 0.719, with a mean absolute error (MAE) of 0.168. A similar result (r = 0.72, MAE = 0.17) was obtained by a similar ridge regression model using only the average OTHER% and MEANOTHER% (without any information about item type). This indicates that the same model can accurately predict the ME responses independently of their content (A+, A–, C, and B). Finally, the results obtained by different regressors (MLP, SVM regressor) indicate that the reported result was robust across regressors based on differing assumptions.
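A minimal sketch of this participant-level regression step is given below, again using scikit-learn as a stand-in for the Weka regressors. The aggregated table, file name, and column names are hypothetical: one row per participant and item category, with the participant’s mean OTHER% for that category, the group’s MEANOTHER%, and the participant’s proportion of TRUE ME responses as the target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_absolute_error

scores = pd.read_csv("study1_by_category.csv")  # hypothetical aggregated table

X = scores[["other_pct", "mean_other_pct"]]
y = scores["true_pct"]

# Out-of-sample predictions from 10-fold cross-validation, then r and MAE.
pred = cross_val_predict(Ridge(alpha=1.0), X, y, cv=10)
print("r   =", np.corrcoef(pred, y)[0, 1])
print("MAE =", mean_absolute_error(y, pred))
```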

Prediction of the percentage of TRUE responses of a single participant for a given set of items (i.e., A+; A–)

In order to verify that the prediction accuracy was not inflated by predictions on the control (C) and bizarre (B) items, we replicated the analysis using only the A+ and A– items. A linear regression analysis was conducted with 10-fold cross-validation, considering the Anxiety (A+) and Anti-anxiety (A–) items. The results indicate that the correlation between the true and estimated data (ME TRUE%) was r2 = 0.7246, p < 0.001 (mean absolute error = 0.1674). The same result was confirmed with other regressor models (e.g., support vector machine regression).

In addition, we investigated the possibility of inferring a subject’s response to individual items (ME TRUE or ME FALSE) from the OTHER estimates made by the group. Individual ME responses were predicted (using a Random Forest classifier) with 77.7% accuracy on the basis of OTHER% estimates.

Study 2: Dark Triad

The Dark Triad is a group of three sub-clinical personality traits generally seen as negative and socially undesirable. The three traits are: (1) Narcissism, which is characterized by excessive self-importance, a lack of empathy for others, and a need for admiration; (2) Machiavellianism, which involves a tendency to be manipulative and exploit others for personal gain; (3) Psychopathy, which is characterized by a lack of remorse or guilt, a tendency to be deceitful and manipulative, and a lack of empathy and concern for others. These traits are often referred to as the “Dark Triad” because they are associated with negative and socially undesirable behaviors (Paulhus and Williams, 2002). A super-short version of the Dark Triad psychometric questionnaire has been proposed (Dirty Dozen; Jonason and Webster, 2010) and was used here.

Methods

Participants

Overall, 87 participants took part in this online study (37 females and 50 males). Their average age was 48.4 years (SD = 9.03; range: 28–71). The average educational level was 18 years (SD = 1.5; range: 15–22). As in the previous experiment, participants volunteered to complete the online questionnaire under the experimenter’s supervision and provided informed consent.

Stimuli and experimental procedure

The questionnaire included 31 questions. It was constructed using an Italian version of the Dirty Dozen (Jonason and Webster, 2010). In addition to the canonical 12 items of the original questionnaire (DD), we added 12 items from the recently introduced construct of the Light Triad (LT) (Kaufman et al., 2019), which may be considered the opposite of the Dark Triad. Seven control items (C) were added, with the expectation that they would elicit a high or low percentage of TRUE responses (e.g., “I like pizza” and “I do not know the days of the week,” respectively). Control items are items that are endorsed or rejected by a high number of responders but are unrelated to the psychological dimension investigated by the Dark and Light Triads. Each item was presented twice, as in Study 1. The first presentation required the respondent to estimate the percentage of people expected to endorse the item (OTHER%). The second required the participant to indicate their personal “True/False” response to the direct question (ME response). The participant was required to click on one of the 10% range boxes (from 0 to 100%) appearing beneath the presented item. This method of collecting the participant’s prevalence estimate differs from the procedure used in Study 1, which used a sliding cursor. After all 31 questions were presented in the above format, the participant was required to give his/her ME response (e.g., I am an honest person: TRUE/FALSE) to all questions.

Data analysis

The data analysis procedure was the same as in Study 1. All results reported are based on a 10-fold cross-validation procedure.

Results of Study 2

The total number of responses collected and analyzed was 2,697 OTHER% prevalence estimates (31 questions presented to 87 participants) and 2,697 ME responses. Across all items, 55% of the responses to direct questions were TRUE. When participants responded TRUE, their estimate of the percentage of TRUE responses on OTHER% was, on average, 60.89%, whereas when they responded FALSE, the average estimate was 29.8% (d = 1.203). The correlation between OTHER% and TRUE/FALSE was 0.52, and between MEANOTHER% and TRUE/FALSE it was 0.61. As observed in Study 1, a strong consensus effect emerged: participants endorsed more ME items when the corresponding OTHER prevalence estimates were higher, suggesting that they were influenced by their own responses while estimating those of others.

Prediction of the individual ME (TRUE/FALSE) response based on OTHER% and MEANOTHER% prevalence estimation

The strong false consensus effect made it possible to predict individual responses on the basis of OTHER% estimates. Classification accuracy obtained using the Naive Bayes classifier was 77.60% (n = 2,093/2,697) with an AUC = 0.87. The confusion matrix of the out-of-sample classifications, obtained using the Naive Bayes classifier reported above, is shown in Table 4. Similar classification accuracies were obtained with other classifiers, indicating that the prediction did not depend on the specific assumptions made by any one classifier (Salzberg, 1994). The analysis was conducted for all 31 stimuli presented to the participants. In addition, separate analyses are reported for control items (C), Dirty Dozen items (DD), and Light Triad items (LT) to determine whether the same results could be observed for items belonging to the same category.


Table 4. Confusion matrix derived from cross-validation.

Prediction of the percentage of TRUE responses of a single participant for a given set of items (i.e., control, DD, LT)

The dataset comprised 261 examples (87 participants × 3 item categories: DD, LT, and control), each corresponding to a participant’s TRUE% score for one category. To predict the percentage of TRUE responses to the ME questions for each participant, a first set of regressors was evaluated using the type of question (C, DD, LT) as a predictor, together with OTHER% and MEANOTHER%. The results for the regressors are reported in Table 5. Irrespective of the regressor, the correlation between the actual and predicted value was around 0.77. In this specific case, removing the question type (C, DD, LT) from the predictors resulted in a slightly lower correlation; a corresponding benefit of including question type as an input was not observed in Study 1.
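One way to include the item category alongside OTHER% and MEANOTHER% is to one-hot encode it, as in the hedged sketch below (same caveats as the earlier sketches: scikit-learn stand-in, hypothetical file and column names).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

scores = pd.read_csv("study2_by_category.csv")  # hypothetical: 261 rows (87 x 3 categories)

# Turn the C / DD / LT label into dummy variables; pass the numeric predictors through.
pre = ColumnTransformer(
    [("item_type", OneHotEncoder(), ["item_type"])],
    remainder="passthrough",
)
model = make_pipeline(pre, Ridge(alpha=1.0))

X = scores[["item_type", "other_pct", "mean_other_pct"]]
y = scores["true_pct"]
pred = cross_val_predict(model, X, y, cv=10)
print("r =", np.corrcoef(pred, y)[0, 1])  # correlation between actual and predicted TRUE%
```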


Table 5. Correlation between the actual and the predicted value of the TRUE% response for different regressors.

Discussion

We investigated the feasibility of using ML-based methods to reconstruct answers to direct questions on sensitive topics (such as anxiety evaluation and undesirable personality traits) based on the consensus estimations obtained from indirect questions on the same subject.

Two experiments were designed as follows: first, participants had to estimate the prevalence of TRUE responses to a given item among peers (the subjective estimate of the percentage of YES responses that a group of 100 people would give). The same subject was then asked to give his or her own response (TRUE/FALSE) to the same question. Two studies were undertaken with a total of 187 individuals, and two questionnaires were used to address two very different clinically relevant topics: anxiety and the Dark Triad. These issues are likely to elicit faked responses in opposite directions. In medico-legal settings, claimants are prone either to fake bad, aggravating their symptoms (to gain an advantage), or to fake good, trying to hide personality traits or behaviors that would put them at a disadvantage, as in child-custody claims. The items addressing anxiety and the Dark Triad were adapted from standard questionnaires covering clinical (anxiety) and subclinical (Dark Triad) personality characteristics. To include positive and negative verbal expressions of the same psychological construct, items addressing the same issues were also presented in a reverse-framed form.

Taking all results together, we found that individual responses to single direct questions requiring TRUE/FALSE answers can be reconstructed with 75–88% accuracy using the subject’s consensus estimate on the same question together with the group’s corresponding estimate. Accurate prediction is possible at both the group level and the individual subject level, as individual responses to a single direct question can be reconstructed, through the ML algorithm, with an accuracy of 75–80%, depending on the issue under investigation. ME responses to some issues (e.g., bizarre items such as “I forget how to get back home”) are predicted more successfully than others (e.g., anxiety items). These findings suggest that consensus estimation can be used to reconstruct subject responses regardless of the item’s content. Prevalence estimation among peers can predict single-item responses as well as total scale scores.

Machine learning regressors were developed, and the ME scores predicted from OTHER% responses correlated with the observed values at 0.72–0.77. The “false consensus effect” describes a correlation between the participant’s actual response and his or her estimate of peers’ responses to the same question. If the subject’s response is TRUE, he or she will estimate that most subjects will respond in the same way; if the response is FALSE, he or she will estimate that most subjects will likewise respond FALSE. This phenomenon emerges clearly in our data, with an effect size of around d = 1 (a large effect) on almost all items.

The results obtained using this technique may aid in resolving the problem of reconstructing true responses when responses to direct questions are deceptive and distorted by internal incentives, such as social desirability, or by external incentives, as in compensation claims. The possibility of indirectly and accurately predicting individual responses to direct questions from indirect estimates of peers’ responses to the same question opens promising avenues for studying truthful responses to sensitive issues.

In summary, prevalence estimation may be used to reconstruct truthful responses to potentially deceptive items, allowing us to determine whether the examinee faked the test and to estimate his or her true score. These experiments support the idea that the false consensus effect could be at the core of future psychometric questionnaires that help identify genuine answers from examinees at risk of faking.

The present investigation suffered from a number of limitations. Even though the material presented to the ML algorithms consisted of every single item repeated twice for each subject, providing thousands of observations for each experiment, the sample size could have been larger. Future studies should increase the sample size and enlarge its variability. Even though age and education did not correlate with the type of answers given or with the consensus estimates, future research should examine whether a more balanced sample (for instance, less educated and with more males) would produce different results.

The use of the false consensus effect in forensic settings is a promising procedure for reconstructing truthful responses, and the studies reported here indicate that the participant response to single items may be estimated accurately using an indirect method.

However, the method’s efficiency in reconstructing truthful responses has yet to be evaluated in high-stakes faking, when the participant has a clear advantage or disadvantage in successfully hiding the true response. The studies reported here, in fact, did not require the subjects to alter their responses intentionally and can, in some ways, be considered preliminary research toward a method for reconstructing the true response to questions that the subject answers in a presumably less transparent manner.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

The studies involving human participants were reviewed and approved by the Ethics Committee of UNIPD. The patients/participants provided their written informed consent to participate in this study.

Author contributions

GO, GS, and EO: conceptualization. EO: data curation and investigation. GS and GO: formal analysis. GS: methodology and supervision. GO, EO, MM, CS, CC, PP, AG, and GS: writing – review and editing. All authors contributed to the article and approved the submitted version.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1. ^The Naive Bayes algorithm is a method that uses the probabilities of each feature (independent variable) to predict the class to which the individual case belongs. It is referred to as “naive” because all features are regarded as independent, which is rarely the case in real life. Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class is independent of all other attributes. This is a strong and frequently false assumption, but it results in a fast and effective classification method that, despite these apparently unrealistic assumptions, has been shown to perform well in practice.
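In standard notation (a textbook formulation, not taken from the article), the classifier assigns the class

\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{k} P(x_i \mid c),

where the x_i are the predictor values (here, OTHER% and MEANOTHER%) and the product reflects the independence assumption described above.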

References

Bokhari, E., and Hubert, L. (2018). The lack of cross-validation can lead to inflated results and spurious conclusions: a re-analysis of the MacArthur violence risk assessment study. J. Classif. 35, 147–171. doi: 10.1007/s00357-018-9252-3

Botvin, G. J., Botvin, E. M., Baker, E., Dusenbury, L., and Goldberg, C. J. (1992). The false consensus effect: predicting adolescents’ tobacco use from normative expectations. Psychol. Rep. 70, 171–178. doi: 10.2466/pr0.1992.70.1.171

Carter, J. V., Pan, J., Rai, S. N., and Galandiuk, S. (2016). ROC-ing along: evaluation and interpretation of receiver operating characteristic curves. Surgery 159, 1638–1645. doi: 10.1016/j.surg.2015.12.029

Cohen, J. (1988). Set correlation and contingency tables. Appl. Psychol. Meas. 12, 425–434.

Cohen, W. W. (1995). “Fast effective rule induction” in Proceedings of the twelfth international conference on machine learning, 115–123.

Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspect. Psychol. Sci. 3, 286–300. doi: 10.1111/j.1745-6924.2008.00079.x

Dawes, R. M. (1989). Statistical criteria for establishing a truly false consensus effect. J. Exp. Soc. Psychol. 25, 1–17. doi: 10.1016/0022-1031(89)90036-X

Frank, E., Hall, M. A., and Witten, I. H. (2016). The WEKA workbench. Online appendix for “Data mining: practical machine learning tools and techniques”. 4th Edn. Morgan Kaufmann.

Hoffmann, A., and Musch, J. (2016). Assessing the validity of two indirect questioning techniques: a stochastic lie detector versus the crosswise model. Behav. Res. Methods 48, 1032–1046. doi: 10.3758/s13428-015-0628-6

Johansson, B. T., Lesnic, D., and Reeve, T. (2011). Numerical approximation of the one-dimensional inverse Cauchy–Stefan problem using a method of fundamental solutions. Inverse Problems Sci. Engin. 19, 659–677. doi: 10.1080/17415977.2011.579610

Jonason, P. K., and Webster, G. D. (2010). The dirty dozen: a concise measure of the dark triad. Psychol. Assess. 22, 420–432. doi: 10.1037/a0019265

Kaufman, S. B., Yaden, D. B., Hyde, E., and Tsukayama, E. (2019). The light vs. dark triad of personality: contrasting two very different profiles of human nature. Front. Psychol. 10:467. doi: 10.3389/fpsyg.2019.00467

Kowalski, C. M., Rogoza, R., Vernon, P. A., and Schermer, J. A. (2018). The dark triad and the self-presentation variables of socially desirable responding and self-monitoring. Personal. Individ. Differ. 120, 234–237. doi: 10.1016/j.paid.2017.09.007

Krueger, J., and Clement, R. W. (1997). Estimates of social consensus by majorities and minorities: the case for social projection. Personal. Soc. Psychol. Rev. 1, 299–313. doi: 10.1207/s15327957pspr0104_2

Locander, W., Sudman, S., and Bradburn, N. (1976). An investigation of interview method, threat and response distortion. J. Am. Stat. Assoc. 71, 269–275. doi: 10.1080/01621459.1976.10480332

Marks, G., and Miller, N. (1987). Ten years of research on the false-consensus effect: an empirical and theoretical review. Psychol. Bull. 102, 72–90. doi: 10.1037/0033-2909.102.1.72

Mazza, C., Monaro, M., Orrù, G., Burla, F., Colasanti, M., Ferracuti, S., et al. (2019). Introducing machine learning to detect personality faking-good in a male sample: a new model based on Minnesota multiphasic personality inventory-2 restructured form scales and reaction times. Front. Psych. 10, 1–10. doi: 10.3389/fpsyt.2019.00389

Monaro, M., Gamberini, L., and Sartori, G. (2018a). “Spotting faked identities via mouse dynamics using complex questions” in Proceedings of the 32nd international BCS human computer interaction conference, 32, 1–9.

Monaro, M., Toncini, A., Ferracuti, S., Tessari, G., Vaccaro, M. G., De Fazio, P., et al. (2018b). The detection of malingering: a new tool to identify made-up depression. Front. Psych. 9:249. doi: 10.3389/fpsyt.2018.00249

Mullen, B., Atkins, J. L., Champion, D. S., Edwards, C., Hardy, D., Story, J. E., et al. (1985). The false consensus effect: a meta-analysis of 115 hypothesis tests. J. Exp. Soc. Psychol. 21, 262–283. doi: 10.1016/0022-1031(85)90020-4

Oostrom, J. K., Köbis, N. C., Ronay, R., and Cremers, M. (2017). False consensus in situational judgment tests: what would others do? J. Res. Pers. 71, 33–45. doi: 10.1016/j.jrp.2017.09.001

Orrù, G., Gemignani, A., Ciacchini, R., Bazzichi, L., and Conversano, C. (2020b). Machine learning increases diagnosticity in psychometric evaluation of alexithymia in fibromyalgia. Front. Med. 6:319. doi: 10.3389/fmed.2019.00319

Orrù, G., Mazza, C., Monaro, M., Ferracuti, S., Sartori, G., and Roma, P. (2021). The development of a short version of the SIMS using machine learning to detect feigning in forensic assessment. Psychol. Injury Law 14, 46–57. doi: 10.1007/s12207-020-09389-4

Orrù, G., Monaro, M., Conversano, C., Gemignani, A., and Sartori, G. (2020a). Machine learning in psychometrics and psychological research. Front. Psychol. 10:2970. doi: 10.3389/fpsyg.2019.02970

Pace, G., Orrù, G., Monaro, M., Gnoato, F., Vitalini, R., Boone, K. B., et al. (2019). Malingering detection of cognitive impairment with the B test is boosted using machine learning. Front. Psychol. 10:1650. doi: 10.3389/fpsyg.2019.01650

Paulhus, D. L., and Williams, K. M. (2002). The dark triad of personality: narcissism, Machiavellianism, and psychopathy. J. Res. Pers. 36, 556–563. doi: 10.1016/S0092-6566(02)00505-6

Pearson, K. (2008). “Pearson’s correlation coefficient” in Encyclopedia of public health (Dordrecht: Springer), 1090–1091.

Quinlan, R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

Resnick, P. J. (1997). “The malingering of posttraumatic disorders” in Clinical assessment of malingering and deception. ed. R. Rogers. 2nd ed (New York, NY: Guilford Press), 84–103.

Ross, L., Greene, D., and House, P. (1977). The “false consensus effect”: an egocentric bias in social perception and attribution processes. J. Exp. Soc. Psychol. 13, 279–301. doi: 10.1016/0022-1031(77)90049-X

Salzberg, S. L. (1994). C4.5: Programs for machine learning by J. Ross Quinlan (Morgan Kaufmann Publishers, Inc., 1993).

Sartori, G., Orrù, G., and Scarpazza, C. (2020). “The methodology of forensic neuroscience” in Neuroscience and Law (Cham: Springer), 453–473.

Sartori, G., Zangrossi, A., Orrù, G., and Monaro, M. (2017). “Detection of malingering in psychic damage ascertainment” in P5 medicine and justice (Cham: Springer), 330–341.

Smith, G. P., and Burger, G. K. (1997). Detection of malingering: validation of the structured inventory of malingered symptomatology (SIMS). J. Am. Acad. Psychiatry Law 25, 183–189.

Spielberger, C. D., Gorsuch, R. L., Lushene, R. E., Vagg, P. R., and Jacobs, G. A. (1983). Manual for the state-trait anxiety inventory. Palo Alto, CA: Consulting Psychologists Press.

Van Impelen, A., Merckelbach, H., Jelicic, M., and Merten, T. (2014). The structured inventory of malingered symptomatology (SIMS): a systematic review and meta-analysis. Clin. Neuropsychol. 28, 1336–1365. doi: 10.1080/13854046.2014.984763

Wang, Y., and Witten, I. H. (1997). “Induction of model trees for predicting continuous classes” in Poster papers of the 9th European conference on machine learning.

Young, G. (2014). Resource material for ethical psychological assessment of symptom and performance validity, including malingering. Psychol. Injury Law 7, 206–235. doi: 10.1007/s12207-014-9202-2

Zerbe, W. J., and Paulhus, D. L. (1987). Socially desirable responding in organizational behavior: a reconception. Acad. Manag. Rev. 12, 250–264. doi: 10.5465/amr.1987.4307820

Keywords: false consensus effect, dark triad, deception, malingering, fake good, fake bad

Citation: Orrù G, Ordali E, Monaro M, Scarpazza C, Conversano C, Pietrini P, Gemignani A and Sartori G (2023) Reconstructing individual responses to direct questions: a new method for reconstructing malingered responses. Front. Psychol. 14:1093854. doi: 10.3389/fpsyg.2023.1093854

Received: 09 November 2022; Accepted: 22 May 2023;
Published: 15 June 2023.

Edited by:

Aaron Sujar, Rey Juan Carlos University, Spain

Reviewed by:

Cristian Ramos-Vera, César Vallejo University, Peru
Barbara Poletti, Italian Auxological Institute (IRCCS), Italy

Copyright © 2023 Orrù, Ordali, Monaro, Scarpazza, Conversano, Pietrini, Gemignani and Sartori. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Graziella Orrù, graziella.orru@unipi.it
