Using Item Response Theory for the Development of a New Short Form of the Eysenck Personality Questionnaire-Revised

Colledani, Daiana; Anselmi, Pasquale; Robusto, Egidio

doi:10.3389/fpsyg.2018.01834

ORIGINAL RESEARCH article

Front. Psychol., 02 October 2018

Sec. Quantitative Psychology and Measurement

Volume 9 - 2018 | https://doi.org/10.3389/fpsyg.2018.01834

This article is part of the Research Topic Clinical Psychometrics: Old Issues and New Perspectives

Daiana Colledani*

Pasquale Anselmi

Egidio Robusto

Department of Philosophy, Sociology, Education and Applied Psychology, School of Psychology, University of Padova, Padova, Italy

The present work aims at developing a new version of the short form of the Eysenck Personality Questionnaire-Revised, which includes Psychoticism, Extraversion, Neuroticism, and Lie scales (48 items, 12 per scale). The work consists of two studies. In the first one, an item response theory model was estimated on the responses of 590 individuals to the full-length version of the questionnaire (100 items). The analyses allowed the selection of 48 items that discriminate well, are distributed along the latent continuum of each trait, and show neither misfit nor differential item functioning. In the second study, the functioning of the new form of the questionnaire was evaluated in a different sample of 300 individuals. Results of the two studies show that the reliability of the four scales is better than, or equal to, that of the original forms. The new version outperforms the original one in approximating scores of the full-length questionnaire. Moreover, convergent validity coefficients and relations with clinical constructs were consistent with the literature.

Introduction

In the view of Eysenck (see Eysenck and Eysenck, 1975, 1991), the structure of personality may be effectively described by three main traits: psychoticism (P), extraversion (E), and neuroticism (N). These dimensions are also known as the “Giants Three” and represent basic, independent, and biologically founded traits. They characterize all subjects, to varying degrees, and allow for effectively describing behavioral, emotional, and individual differences among adults and young people. According to the authors, PEN traits do not represent pathological dimensions in themselves, but could lead to the development of abnormal conditions only in particular situations (Eysenck and Eysenck, 1991). In this perspective, neurosis and psychosis should be conceived as pathological exaggerations of the underlying traits of neuroticism and psychoticism (Eysenck and Eysenck, 1991; Mor, 2010).

Extraversion and neuroticism were the first two dimensions included in Eysenck's model and were conceptualized as orthogonal continua (Eysenck and Eysenck, 1964, 1991). The neuroticism dimension describes a trait opposed to emotional stability, and defines the degree to which a person is predisposed to experience negative affect (Eysenck and Eysenck, 1964, 1991; Mor, 2010). Individuals with high levels of this trait tend to be worried, apprehensive, moody, fed-up, and irritable (Eysenck and Eysenck, 1991; Eysenck and Barrett, 2013). Extraversion is the second dimension included in the model and depicts sociable, carefree, friendly, convivial, easygoing, and impulsive individuals. This trait is opposed to introversion, which, in contrast, defines introspective, quiet, serious, and reserved individuals (Eysenck and Eysenck, 1975, 1991; Eysenck and Barrett, 2013). The third dimension included in Eysenck's model was psychoticism, or toughmindedness. The typical toughminded individual is hostile, aggressive, untrusting, cold, unemotional, rude, lacking in human feelings, and unfriendly. At the opposite pole of the continuum are individuals with a well-adjusted personality, who are agreeable, empathic, tolerant, conscientious, open-minded, friendly, and warm (Eysenck and Eysenck, 1975, 1991; Eysenck and Barrett, 2013).

Over the years, a series of instruments has been developed for the assessment of PEN traits in both young people and adults (e.g., Eysenck and Eysenck, 1964, 1975; Eysenck et al., 1985). These instruments also included a Lie (L) scale, which measures dissimulation and the tendency to deceive (Eysenck and Eysenck, 1964). Several contributions have been offered for the refinement of the psychometric properties of Eysenck's questionnaires, as well as for the development of brief versions (Eysenck et al., 1985; Francis and Pearson, 1988; Corulla, 1990; Francis et al., 1992; Francis, 1996). The psychometric properties and factor structure of all these instruments have been investigated in cross-cultural research (e.g., Hosokawa and Ohyama, 1993; Maltby and Talley, 1998; Forrest et al., 2000; Qian et al., 2000; Scholte and De Bruyn, 2001; Aluja et al., 2003; Alexopoulos and Kalaitzidis, 2004; Dazzi et al., 2004; Francis et al., 2006; Karanci et al., 2006; Tiwari et al., 2009; Picconi et al., 2018). Unidimensionality of the N and L scales has been widely supported in the literature (e.g., Lajunen and Scherler, 1999; Ferrando, 2001; Ferrando and Chico, 2001; Ferrando and Anguiano-Carrasco, 2009; Dazzi, 2011). Contrasting results have been found concerning the E scale: there are several studies supporting the unidimensionality of this scale (e.g., Rocklin and Revelle, 1981; Ferrando and Chico, 2001; Dazzi, 2011), but there is also some evidence suggesting the presence of two dimensions (Eysenck and Eysenck, 1963; Vidotto et al., 2008). Finally, there is large agreement in the literature that the P scale comprises different facets (e.g., Howarth, 1986; Roger and Morris, 1991), which nevertheless contribute to a single dimension (Chico and Ferrando, 1995; Dazzi, 2011).

Eysenck's instruments have been extensively employed for clinical, forensic, educational, and organizational purposes (e.g., Nyborg, 1997; Judge et al., 2000; Wood and Newton, 2003; Laidra et al., 2007; Smillie et al., 2009; Almiro et al., 2016), and all scales showed significant relations with a variety of psychologically and clinically relevant constructs and behaviors. Research, for instance, suggests that individuals with high levels of neuroticism may experience symptoms of anxiety and depression (e.g., Eysenck, 1991; Saklofske et al., 1995; del Barrio et al., 1997; Dazzi et al., 2004; Jylhä and Isometsä, 2006), and may also be more likely to be exposed to stress and health problems (e.g., Denney and Frisch, 1981; Huang et al., 2015; Bergomi et al., 2017). In contrast, extraversion appears to be mainly linked to adaptive social behavior, mental well-being, happiness, and life satisfaction (e.g., Lu, 1995; Mor, 2010; Gale et al., 2013). Moreover, this trait has been found to be negatively related to symptoms of anxiety and depression, to self-reported mental disorder, and to health care use for psychiatric reasons (e.g., del Barrio et al., 1997; Jylhä and Isometsä, 2006). Finally, psychoticism has often been cited in relation to inappropriate social behaviors, such as unsafe sexual habits, heavy drinking, criminal behavior, dysfunctional impulsivity, gambling, and drug abuse (e.g., Barnes et al., 1984; Blaszczynski et al., 1985; Bogaert, 1993; Lodhi and Thakur, 1993; Francis, 1996; Conrad et al., 1997; Grau and Ortet, 1999; Hoyle et al., 2000; Chico et al., 2003; Heaven et al., 2004; Gudgeon et al., 2005; Colledani, 2018).

The short form of the Eysenck Personality Questionnaire-Revised (EPQ-R; Eysenck et al., 1985; Eysenck and Eysenck, 1991) includes 48 items (out of the 100 of the EPQ-R), 12 for each of the four dimensions. This version of the instrument has been translated into several languages and is widely used, across different countries, for scientific and clinical purposes (Hosokawa and Ohyama, 1993; Aluja et al., 2003; Alexopoulos and Kalaitzidis, 2004; Dazzi et al., 2004; Francis et al., 2006; Tiwari et al., 2009; Sanavio et al., 2013). However, it suffers from the same drawbacks as the full-length version. In particular, the P scale exhibited poor reliability with a restricted range of scores and a strong positive skewness (Bishop, 1977; Block, 1977; Claridge, 1981; Hosokawa and Ohyama, 1993; Katz and Francis, 2000; Alexopoulos and Kalaitzidis, 2004). In addition, several items showed differential item functioning (DIF) across gender (Eysenck et al., 1985; Eysenck and Eysenck, 1991; Lynn and Martin, 1997; Forrest et al., 2000; Karanci et al., 2006; Escorial and Navas, 2007), which makes the comparison between groups questionable.

A better selection of the items from the full-length version of the instrument could allow for reducing some of the aforementioned drawbacks. The present work aims at developing a new version of the short form of the EPQ-R with improved psychometric properties.

Item response theory (IRT; Bock, 1997; Thissen and Steinberg, 2009) is one of the most promising approaches to this aim. There are several successful applications of IRT for the development and validation of measurement scales (see Da Dalt et al., 2013, 2015; Balsamo et al., 2014; Anselmi et al., 2015; Zanon et al., 2016; Sotgiu et al., 2018). Moreover, compared with classical test theory, IRT was found to provide more diagnostic information useful for the development of brief scales (Spence et al., 2012; Bortolotti et al., 2013; Petrillo et al., 2015). IRT allows for identifying the items that are best at discriminating different levels of the latent trait of interest, while ensuring that the entire trait continuum is covered. Selecting these items can result in a brief version of the scale that produces scores very similar to those obtained with the full-length scale and has the same external validity (i.e., the same correlations with other constructs; Reise and Henson, 2000; Spence et al., 2012). Moreover, IRT allows for detecting items that are unclear, ambiguous, or which exhibit DIF. These items should not be included in the brief scale. Despite the advantages offered by IRT, only a few studies have employed this approach for the refinement of Eysenck's instruments (e.g., Ferrando, 2001; Ferrando and Chico, 2001; Escorial and Navas, 2007; Maij-de Meij et al., 2008). Recently, Colledani et al. (2018) used IRT for developing a new version of the abbreviated form of the Junior EPQ-R (6 items per scale). The new version outperformed the original one on several aspects.

This work includes two main studies. In Study 1, a series of analyses were performed on the responses to the full-length version of the EPQ-R in order to select the 48 items (12 per scale) with the best psychometric properties. In Study 2, the functioning of the new short form was tested in a new data sample. Reliability, validity, and factor structure were examined. Relationships of the new scales with social desirability, the dimensions of the Five Factor Model (FFM), and clinically relevant constructs were verified.

Study 1

Participants

A total of 590 participants took part in the study (mean age = 36.69 years, SD = 14.16; from 18 to 75 years; 55.8% females). They were recruited from different Italian regions through convenience sampling. All participants were native Italian speakers and completed the questionnaire anonymously and voluntarily. All standards for research with human subjects were respected. Written informed consent was obtained from the participants. The project was approved retrospectively by the Ethical Committee for the Psychological Research of the University of Padova (Protocol n. 2622), since prospective ethics approval was not required at the time the research was conducted.

Instruments

The participants were presented with the Italian version of the EPQ-R (Dazzi et al., 2004; Dazzi, 2011). The instrument consists of 100 dichotomous items (yes/no), 32 for the P scale (e.g., “Should people always respect the law?,” “Do you enjoy hurting people you love?”), 23 for the E scale (e.g., “Do you enjoy meeting new people?,” “Can you get a party going?”), 24 for the N scale (e.g., “Would you call yourself a nervous person?,” “Are you often troubled about feelings of guilt?”), and 21 for the L scale (e.g., “Are all your habits good and desirable ones?,” “Have you ever cheated at a game?”). The questionnaire was administered individually in paper-and-pencil format.

The Italian version of the questionnaire has good reliability and the four-factor structure was confirmed (α = 0.67, 0.78, 0.85, and 0.75 for P, E, N, and L scales, respectively; Dazzi et al., 2004; Dazzi, 2011). The reliability found in the current sample (α = 0.60, 0.79, 0.85, and 0.77 for P, E, N, and L scales) is in line with the literature.

Studies in the Italian context have also tested the factor structure and the psychometric characteristics of the short version of the instrument (Dazzi et al., 2004). Consistent with cross-cultural findings, results supported the four-factor structure of the instrument and showed reliability coefficients that were satisfactory for the E, N, and L scales but lower for P (α = 0.37, 0.77, 0.83, and 0.70 for P, E, N, and L, respectively; Dazzi et al., 2004). The reliability found in the current sample (α = 0.40, 0.73, 0.83, and 0.73 for P, E, N, and L scales) is in line with the literature.

Analysis Strategy

The two-parameter logistic (2PL) model (see Thissen and Steinberg, 2009) was separately estimated on the responses to each of the four scales of the questionnaire. This model describes the probability that a subject endorses a certain item as a function of the latent trait level of the subject (parameter θ), the “endorsability” level of the item (i.e., the ease of providing a “yes” response to that item; parameter ε), and the ability of the item to differentiate subjects with different trait levels (parameter δ). In the case of the P scale, for instance, the greater the value of parameter θ, the greater the level of psychoticism of the subject; the greater the value of parameter ε, the greater the ease of responding “yes” to the item (i.e., of providing a response that is indicative of the presence of psychoticism); the greater the value of parameter δ, the greater the ability of the item to differentiate between subjects with different levels of psychoticism. All the analyses were run using the packages “difR” (Magis et al., 2016) and “ltm” (Rizopoulos, 2012) for the statistical environment R (R Core Team, 2016).
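As a minimal sketch of the response function just described, the following Python snippet assumes the common logistic form with linear predictor δ(θ + ε). The article does not spell out the exact parameterization, so the way ε enters the formula is an assumption, and the function name and the parameter values are purely illustrative.

```python
import math

def p_endorse(theta: float, epsilon: float, delta: float) -> float:
    """Probability of a "yes" response under an assumed 2PL parameterization.

    theta   : latent trait level of the subject
    epsilon : easiness ("endorsability") of the item
    delta   : discrimination of the item
    """
    # Assumption: the linear predictor is delta * (theta + epsilon), so a larger
    # epsilon makes the item easier to endorse at any given trait level.
    return 1.0 / (1.0 + math.exp(-delta * (theta + epsilon)))

# Illustrative (made-up) values, not estimates from the study.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_endorse(theta, epsilon=-1.0, delta=1.5), 3))
```

Under this form, an item is endorsed with probability 0.5 when θ = −ε, and δ controls how steeply the probability rises around that point.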

The 2PL assumes unidimensionality of the scales. Confirmatory factor analyses were run on the data of each of the four scales (for a reasonable fit, CFI ≥ 0.90, RMSEA < 0.08; see Hu and Bentler, 1999; Marsh et al., 2004; Brown, 2006). These analyses confirmed the unidimensionality of N [$\chi^{2}_{(252)}$ = 1046.791, p ≤ 0.001; RMSEA = 0.073; CFI = 0.919] and L [$\chi^{2}_{(189)}$ = 532.901, p ≤ 0.001; RMSEA = 0.056; CFI = 0.900]. Fit indices of the E scale were close to acceptance [$\chi^{2}_{(230)}$ = 808.417, p ≤ 0.001; RMSEA = 0.065; CFI = 0.890]. The unidimensional model did not fit the data of the P scale [$\chi^{2}_{(464)}$ = 1841.233, p ≤ 0.001; RMSEA = 0.071; CFI = 0.467]. An exploratory factor analysis on this scale suggested a four-factor solution with 7 items out of 32 exhibiting cross-loadings. In line with the literature (e.g., Howarth, 1986; Roger and Morris, 1991; Chico and Ferrando, 1995; Dazzi, 2011), this result confirms that the P scale defines a complex and multifaceted construct.
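The fit criteria above can be illustrated with a short Python sketch. It assumes the widely used point estimate RMSEA = sqrt(max(χ² − df, 0) / (df · (N − 1))) with N = 590; the article does not state which formula or estimator its software used, so this is a consistency check under that assumption rather than a reproduction of the original analysis. The function names are hypothetical.

```python
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    # Assumed point estimate: sqrt(max(chi2 - df, 0) / (df * (n - 1))).
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def reasonable_fit(cfi: float, rmsea_value: float) -> bool:
    # Cut-offs quoted in the text: CFI >= 0.90 and RMSEA < 0.08.
    return cfi >= 0.90 and rmsea_value < 0.08

# (chi-square, degrees of freedom, CFI) as reported in the text.
reported = {
    "N": (1046.791, 252, 0.919),
    "L": (532.901, 189, 0.900),
    "E": (808.417, 230, 0.890),
    "P": (1841.233, 464, 0.467),
}

for scale, (chi2, df, cfi) in reported.items():
    value = rmsea(chi2, df, n=590)
    print(scale, round(value, 3), reasonable_fit(cfi, value))
```

Under this assumed formula, the computed RMSEA values agree with the reported ones to three decimals, and only the N and L scales meet both cut-offs.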

Item Selection for the New Short Scales

DIF and item fit statistics were used to identify the items with the poorest psychometric properties, which were excluded from the new short scales.

Three item fit statistics were used: infit, outfit (Wright and Masters, 1982), and the index suggested by Bock (1972). Infit and outfit are two χ²-based statistics, the former being effective in detecting unexpected responses to items close to a subject's trait level, the latter being effective in detecting unexpected responses to items far from the subject's trait level. In this work, items with infit and/or outfit higher than 1.4 (Wright and Linacre, 1994) were considered misfitting and not included in the new short scales. The index suggested by Bock involves grouping subjects into n categories on the basis of their latent trait level and comparing, for each group, the observed and expected proportions of subjects endorsing the item (Bock, 1972; Reise, 1990). In this work, subjects were grouped into four categories, and the items that displayed a medium (0.3 ≤ Φ < 0.5) to large (Φ ≥ 0.5) effect size (Cohen, 1988) were not selected for inclusion in the new questionnaire.
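To make the infit and outfit criterion concrete, here is a minimal Python sketch using the usual mean-square definitions for dichotomous items: outfit is the unweighted mean of squared standardized residuals, and infit is the sum of squared residuals divided by the sum of the model variances. The response vectors and model probabilities are made up for illustration, and the helper names are hypothetical.

```python
def infit_outfit(responses, probs):
    """Mean-square infit and outfit for one dichotomous item.

    responses : observed 0/1 answers, one per subject
    probs     : model-implied probabilities of a "yes" for the same subjects
    """
    squared_residuals = [(x - p) ** 2 for x, p in zip(responses, probs)]
    variances = [p * (1.0 - p) for p in probs]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(s / v for s, v in zip(squared_residuals, variances)) / len(probs)
    # Infit: information-weighted mean square.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

def is_misfitting(infit, outfit, cutoff=1.4):
    # Cut-off quoted in the text: infit and/or outfit higher than 1.4.
    return infit > cutoff or outfit > cutoff

probs = [0.9, 0.8, 0.5, 0.2, 0.1]      # made-up model probabilities
expected_pattern = [1, 1, 1, 0, 0]     # responses in line with the model
surprising_pattern = [1, 1, 1, 0, 1]   # last subject endorses an unlikely item

print(infit_outfit(expected_pattern, probs))
print(infit_outfit(surprising_pattern, probs))
```

In the second pattern a single highly unexpected "yes" inflates outfit more than infit, which reflects the greater sensitivity of outfit to responses far from the subject's trait level.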

Items exhibiting gender DIF were also excluded from the new questionnaire. Both uniform and non-uniform DIF were considered. The former is a systematic bias expressing a different probability of endorsing an item for the members of a specific group. The latter is a non-systematic bias which varies with the latent trait level. Females were used as the reference group. Effect sizes of uniform and non-uniform DIF were evaluated by the R² difference test (Nagelkerke, 1991; Gómez-Benito et al., 2009), with values higher than 0.035 denoting moderate DIF and values higher than 0.07 denoting strong DIF (Jodoin and Gierl, 2001; Magis et al., 2016).
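The effect-size rule above can be sketched in Python as follows. The snippet assumes that Nagelkerke's R² is obtained from the log-likelihoods of a null model and a fitted logistic regression, and that the DIF effect size is the difference in R² between nested models with and without the group terms. The function names, the label "negligible" for values at or below 0.035, and the numeric inputs are illustrative assumptions and are not taken from the article or from the difR package.

```python
import math

def nagelkerke_r2(loglik_null: float, loglik_model: float, n: int) -> float:
    # Cox-Snell R^2 rescaled by its maximum attainable value.
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_value = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_value

def classify_dif(delta_r2: float) -> str:
    # Thresholds quoted in the text: > 0.035 moderate, > 0.07 strong.
    if delta_r2 > 0.07:
        return "strong"
    if delta_r2 > 0.035:
        return "moderate"
    return "negligible"

# Made-up log-likelihoods for one item (n = 590 respondents).
r2_without_group = nagelkerke_r2(-400.0, -350.0, 590)
r2_with_group = nagelkerke_r2(-400.0, -335.0, 590)
print(classify_dif(r2_with_group - r2_without_group))
```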

Parameters ε and δ were examined to select, among the remaining items, those with the greatest discrimination that together cover the entire trait continuum.

Assessment of the Psychometric Characteristics of the New Short Scales

Reliability and validity of the newly developed PEN-L scales were evaluated and compared with those of the original short scales. Reliability was evaluated through Cronbach's α and the test information function (TIF). The TIF indicates how well the test measures the latent trait levels over the entire range of interest (Baker, 2001; Petrillo et al., 2015). The larger the value of the TIF, the greater the accuracy with which the latent trait levels are measured. The TIF depends on the latent trait range under consideration and on the number of items in the test (Baker, 2001). In this work, the old and new short scales had the same length (12 items), and the TIF was defined on the same range of latent trait levels (−5 to 5). Validity was evaluated using a bias index and the correlation between scores obtained with full-length and short scales. The bias index was computed as the average difference (in absolute terms) between the parameters θ estimated on the full-length scales and those estimated on the short scales. Low biases suggest that the latent trait estimates obtained with the short scales approximate those of the full-length versions. In addition, the correlations between scores obtained with the full-length and short scales were computed and corrected for common items using Levy's (1967) method.

Results

Three of the 32 items of the P scale exhibited uniform and non-uniform gender DIF of moderate (Items 68 and 91) or strong (Item 12) size. Fit statistics were adequate for all the items. From the remaining 29 items, 12 were selected taking into account their parameters ε and δ. This resulted in a new short scale that differed from the original one in eight items (see Table 1). Specifically, Item 91 was replaced because it showed uniform and non-uniform gender DIF of moderate size. These modifications yielded a new scale with increased reliability (α increased from 0.40 to 0.62; TIF increased from 8.13 to 12.86) and with scores that better approximate those obtained with the full-length scale (bias decreased from 0.37 to 0.18; corrected correlation increased from 0.47 to 0.52). It is worth noting that Cronbach's α of the new 12-item scale (0.62) closely resembles that of the full 32-item scale (0.60).

Table 1. Easiness (ε) and discrimination (δ) parameters for the 32 items of the Psychoticism scale.

Regarding the 23 items of the E scale, only Item 55 exhibited uniform gender DIF of moderate size and no item showed misfit. Selecting 12 items on the basis of their parameters ε and δ, we obtained a new E scale that differed from the original one in three items (see Table 2). The differences in reliability and validity between the new and original scales were small but nevertheless in favor of the new version (α increased from 0.73 to 0.75; TIF increased from 16.62 to 16.83; bias decreased from 0.21 to 0.19; corrected correlation increased from 0.74 to 0.77).

Table 2. Easiness (ε) and discrimination (δ) parameters for the 23 items of the Extraversion scale.

Concerning the N and L scales, no item exhibited gender DIF or misfit. Therefore, items were selected considering their ε and δ parameters. For both scales, the new versions differed from the original ones in two items (see Tables 3, 4). Item 35 was present in the previous version of the N scale but was not included in the new one because of its redundant content. Reliability of the new scales closely resembles that of the original versions (α = 0.83, 0.82; TIF = 20.86, 20.80 for original and new N scale, respectively; α = 0.73, 0.74; TIF = 13.86, 14.15 for original and new L scale, respectively). Concerning the N scale, a slight decrease in bias was observed (from 0.22 to 0.16). The other indexes remained substantially unchanged (bias = 0.20, 0.18 for original and new L scale, respectively; corrected correlation = 0.74, 0.75 for original and new L scale, respectively; 0.83, 0.84 for original and new N scale, respectively).

Table 3. Easiness (ε) and discrimination (δ) parameters for the 24 items of the Neuroticism scale.

Table 4. Easiness (ε) and discrimination (δ) parameters for the 21 items of the Lie scale.

Discussion

This study aimed at developing a new short version of the EPQ-R with improved psychometric characteristics. IRT-based statistics allowed the identification of 48 items that show no gender DIF or misfit, discriminate well, and are well distributed along the four latent trait continua. The new version of the P scale differs from the original one in eight items (out of 12), the E scale in three, and the N and L scales in only two. The largest improvement was reached for the P scale, which in the literature was found to perform less well than the other three scales (e.g., Bishop, 1977; Block, 1977; Claridge, 1981). In particular, the new version is not affected by gender DIF and outperforms the original one in reliability and in approximating the scores obtained with the full-length form. The new versions of the other three scales performed as well as, or slightly better than, the original ones. Although small in size, these improvements are valuable considering that they were obtained by substituting a small number of items and reducing content redundancy.

Study 2

This study aimed at investigating the functioning of the new version of the short EPQ-R on a new data set. In addition to reliability and factor structure, construct validity was evaluated by taking into account relationships with social desirability, the dimensions of the FFM, and measures of anxiety and depression.