Using Item Response Theory for the Development of a New Short Form of the Eysenck Personality Questionnaire-Revised

Colledani, Daiana; Anselmi, Pasquale; Robusto, Egidio

doi:10.3389/fpsyg.2018.01834

ORIGINAL RESEARCH article

Front. Psychol., 02 October 2018

Sec. Quantitative Psychology and Measurement

Volume 9 - 2018 | https://doi.org/10.3389/fpsyg.2018.01834

This article is part of the Research Topic Clinical Psychometrics: Old Issues and New Perspectives

Daiana Colledani*

Pasquale Anselmi

Egidio Robusto

Department of Philosophy, Sociology, Education and Applied Psychology, School of Psychology, University of Padova, Padova, Italy

The present work aims at developing a new version of the short form of the Eysenck Personality Questionnaire-Revised, which includes Psychoticism, Extraversion, Neuroticism, and Lie scales (48 items, 12 per scale). The work consists of two studies. In the first, an item response theory model was estimated on the responses of 590 individuals to the full-length version of the questionnaire (100 items). The analyses allowed the selection of 48 items that discriminate well, are distributed along the latent continuum of each trait, and show neither misfit nor differential item functioning. In the second study, the functioning of the new form of the questionnaire was evaluated in a different sample of 300 individuals. Results of the two studies show that the reliability of the four scales is equal to or better than that of the original forms. The new version outperforms the original one in approximating scores of the full-length questionnaire. Moreover, convergent validity coefficients and relations with clinical constructs were consistent with the literature.

Introduction

In the view of Eysenck (see Eysenck and Eysenck, 1975, 1991), the structure of personality may be effectively described by three main traits: psychoticism (P), extraversion (E), and neuroticism (N). These dimensions are also known as the “Giant Three” and represent basic, independent, and biologically founded traits. They characterize all individuals to varying degrees and allow behavioral, emotional, and individual differences among adults and young people to be described effectively. According to the authors, PEN traits do not represent pathological dimensions in themselves, but could lead to the development of abnormal conditions only in particular situations (Eysenck and Eysenck, 1991). From this perspective, neurosis and psychosis should be conceived as pathological exaggerations of the underlying traits of neuroticism and psychoticism (Eysenck and Eysenck, 1991; Mor, 2010).

Extraversion and neuroticism were the first two dimensions included in Eysenck's model and were conceptualized as orthogonal continua (Eysenck and Eysenck, 1964, 1991). The neuroticism dimension describes a trait opposed to emotional stability, and defines the degree to which a person is predisposed to experience negative affect (Eysenck and Eysenck, 1964, 1991; Mor, 2010). Individuals with high levels of this trait tend to be worried, apprehensive, moody, fed-up, and irritable (Eysenck and Eysenck, 1991; Eysenck and Barrett, 2013). Extraversion is the second dimension included in the model and depicts sociable, carefree, friendly, convivial, easygoing, and impulsive individuals. This trait is opposed to introversion, which defines introspective, quiet, serious, and reserved individuals (Eysenck and Eysenck, 1975, 1991; Eysenck and Barrett, 2013). The third dimension included in Eysenck's model was psychoticism, or toughmindedness. The typical toughminded individual is hostile, aggressive, untrusting, cold, unemotional, rude, unfriendly, and lacking in human feelings. At the opposite pole of the continuum are individuals with a well-adjusted personality, who are agreeable, empathic, tolerant, conscientious, open-minded, friendly, and warm (Eysenck and Eysenck, 1975, 1991; Eysenck and Barrett, 2013).

Over the years, a series of instruments has been developed for the assessment of PEN traits in both young people and adults (e.g., Eysenck and Eysenck, 1964, 1975; Eysenck et al., 1985). These instruments also included a Lie (L) scale, which measures dissimulation and the tendency to deceive (Eysenck and Eysenck, 1964). Several contributions have been offered for the refinement of the psychometric properties of Eysenck's questionnaires, as well as for the development of brief versions (Eysenck et al., 1985; Francis and Pearson, 1988; Corulla, 1990; Francis et al., 1992; Francis, 1996). The psychometric properties and factor structure of all these instruments have been investigated in cross-cultural research (e.g., Hosokawa and Ohyama, 1993; Maltby and Talley, 1998; Forrest et al., 2000; Qian et al., 2000; Scholte and De Bruyn, 2001; Aluja et al., 2003; Alexopoulos and Kalaitzidis, 2004; Dazzi et al., 2004; Francis et al., 2006; Karanci et al., 2006; Tiwari et al., 2009; Picconi et al., 2018). The unidimensionality of the N and L scales has been widely supported in the literature (e.g., Lajunen and Scherler, 1999; Ferrando, 2001; Ferrando and Chico, 2001; Ferrando and Anguiano-Carrasco, 2009; Dazzi, 2011). Results for the E scale are mixed: several studies support its unidimensionality (e.g., Rocklin and Revelle, 1981; Ferrando and Chico, 2001; Dazzi, 2011), but there is also some evidence suggesting the presence of two dimensions (Eysenck and Eysenck, 1963; Vidotto et al., 2008). Finally, there is broad agreement in the literature that the P scale comprises different facets (e.g., Howarth, 1986; Roger and Morris, 1991), which nevertheless contribute to a single dimension (Chico and Ferrando, 1995; Dazzi, 2011).

Eysenck's instruments have been extensively employed for clinical, forensic, educational, and organizational purposes (e.g., Nyborg, 1997; Judge et al., 2000; Wood and Newton, 2003; Laidra et al., 2007; Smillie et al., 2009; Almiro et al., 2016), and all scales have shown significant relations with a variety of psychologically and clinically relevant constructs and behaviors. Research, for instance, suggests that individuals with high levels of neuroticism may experience symptoms of anxiety and depression (e.g., Eysenck, 1991; Saklofske et al., 1995; del Barrio et al., 1997; Dazzi et al., 2004; Jylhä and Isometsä, 2006), and may also be more exposed to stress and health problems (e.g., Denney and Frisch, 1981; Huang et al., 2015; Bergomi et al., 2017). In contrast, extraversion appears to be mainly linked to adaptive social behavior, mental well-being, happiness, and life satisfaction (e.g., Lu, 1995; Mor, 2010; Gale et al., 2013). Moreover, this trait has been found to be negatively related to symptoms of anxiety and depression, to self-reported mental disorder, and to health care use for psychiatric reasons (e.g., del Barrio et al., 1997; Jylhä and Isometsä, 2006). Finally, psychoticism has often been cited in relation to inappropriate social behaviors, such as unsafe sexual habits, heavy drinking, criminal behavior, dysfunctional impulsivity, gambling, and drug abuse (e.g., Barnes et al., 1984; Blaszczynski et al., 1985; Bogaert, 1993; Lodhi and Thakur, 1993; Francis, 1996; Conrad et al., 1997; Grau and Ortet, 1999; Hoyle et al., 2000; Chico et al., 2003; Heaven et al., 2004; Gudgeon et al., 2005; Colledani, 2018).

The short form of the Eysenck Personality Questionnaire-Revised (EPQ-R; Eysenck et al., 1985; Eysenck and Eysenck, 1991) includes 48 of the 100 items of the EPQ-R, 12 for each of the four dimensions. This version of the instrument has been translated into several languages and is widely used, across different countries, for scientific and clinical purposes (Hosokawa and Ohyama, 1993; Aluja et al., 2003; Alexopoulos and Kalaitzidis, 2004; Dazzi et al., 2004; Francis et al., 2006; Tiwari et al., 2009; Sanavio et al., 2013). However, it suffers from the same drawbacks as the full-length version. In particular, the P scale has exhibited poor reliability, with a restricted range of scores and a strong positive skewness (Bishop, 1977; Block, 1977; Claridge, 1981; Hosokawa and Ohyama, 1993; Katz and Francis, 2000; Alexopoulos and Kalaitzidis, 2004). In addition, several items have shown differential item functioning (DIF) across gender (Eysenck et al., 1985; Eysenck and Eysenck, 1991; Lynn and Martin, 1997; Forrest et al., 2000; Karanci et al., 2006; Escorial and Navas, 2007), which makes comparisons between groups questionable.

A better selection of items from the full-length version of the instrument could reduce some of the aforementioned drawbacks. The present work aims at developing a new version of the short form of the EPQ-R with improved psychometric properties.

Item response theory (IRT; Bock, 1997; Thissen and Steinberg, 2009) is one of the most promising approaches to this aim. There are several successful applications of IRT to the development and validation of measurement scales (see Da Dalt et al., 2013, 2015; Balsamo et al., 2014; Anselmi et al., 2015; Zanon et al., 2016; Sotgiu et al., 2018). Moreover, compared with classical test theory, IRT has been found to provide more diagnostic information useful for the development of brief scales (Spence et al., 2012; Bortolotti et al., 2013; Petrillo et al., 2015). IRT makes it possible to identify the items that best discriminate between different levels of the latent trait of interest, while ensuring that the entire trait continuum is covered. Selecting these items can result in a brief version of the scale that produces scores very similar to those obtained with the full-length scale and has the same external validity (i.e., the same correlations with other constructs; Reise and Henson, 2000; Spence et al., 2012). Moreover, IRT makes it possible to detect items that are unclear, ambiguous, or exhibit DIF. These items should not be included in the brief scale. Despite the advantages offered by IRT, only a few studies have employed this approach for the refinement of Eysenck's instruments (e.g., Ferrando, 2001; Ferrando and Chico, 2001; Escorial and Navas, 2007; Maij-de Meij et al., 2008). Recently, Colledani et al. (2018) used IRT to develop a new version of the abbreviated form of the Junior EPQ-R (6 items per scale). The new version outperformed the original one in several respects.

This work includes two main studies. In Study 1, a series of analyses was performed on the responses to the full-length version of the EPQ-R in order to select the 48 items (12 per scale) with the best psychometric properties. In Study 2, the functioning of the new short form was tested in a new sample. Reliability, validity, and factor structure were examined. Relationships of the new scales with social desirability, the dimensions of the Five Factor Model (FFM), and clinically relevant constructs were also examined.

Study 1

Participants

A total of 590 participants took part in the study (mean age = 36.69 years, SD = 14.16; range 18 to 75 years; 55.8% females). They were recruited from different Italian regions through convenience sampling. All participants were native Italian speakers and completed the questionnaire anonymously and voluntarily. All standards for research with human subjects were respected. Written informed consent was obtained from the participants. The project was approved retrospectively by the Ethical Committee for the Psychological Research of the University of Padova (Protocol n. 2622), since prospective ethics approval was not required at the time the research was conducted.

Instruments

The participants were presented with the Italian version of the EPQ-R (Dazzi et al., 2004; Dazzi, 2011). The instrument consists of 100 dichotomous items (yes/no): 32 for the P scale (e.g., “Should people always respect the law?,” “Do you enjoy hurting people you love?”), 23 for the E scale (e.g., “Do you enjoy meeting new people?,” “Can you get a party going?”), 24 for the N scale (e.g., “Would you call yourself a nervous person?,” “Are you often troubled about feelings of guilt?”), and 21 for the L scale (e.g., “Are all your habits good and desirable ones?,” “Have you ever cheated at a game?”). The questionnaire was administered individually in paper-and-pencil format.

The Italian version of the questionnaire has good reliability, and its four-factor structure has been confirmed (α = 0.67, 0.78, 0.85, and 0.75 for P, E, N, and L scales, respectively; Dazzi et al., 2004; Dazzi, 2011). The reliability found in the current sample (α = 0.60, 0.79, 0.85, and 0.77 for P, E, N, and L scales) is in line with the literature.

Studies in the Italian context have also tested the factor structure and the psychometric characteristics of the short version of the instrument (Dazzi et al., 2004). Consistent with cross-cultural findings, results supported the four-factor structure of the instrument and showed reliability coefficients that were satisfactory for the E, N, and L scales but lower for P (α = 0.37, 0.77, 0.83, and 0.70 for P, E, N, and L, respectively; Dazzi et al., 2004). The reliability found in the current sample (α = 0.40, 0.73, 0.83, and 0.73 for P, E, N, and L scales) is in line with the literature.

Analysis Strategy

The two-parameter logistic (2PL) model (see Thissen and Steinberg, 2009) was estimated separately on the responses to each of the four scales of the questionnaire. This model describes the probability that a subject endorses a given item as a function of the latent trait level of the subject (parameter θ), the “endorsability” of the item (i.e., the ease of providing a “yes” response to that item; parameter ε), and the ability of the item to differentiate between subjects with different trait levels (parameter δ). In the case of the P scale, for instance, the greater the value of θ, the higher the subject's level of psychoticism; the greater the value of ε, the easier it is to respond “yes” to the item (i.e., to provide a response that indicates the presence of psychoticism); and the greater the value of δ, the better the item differentiates between subjects with different levels of psychoticism. All analyses were run using the packages “difR” (Magis et al., 2016) and “ltm” (Rizopoulos, 2012) for the statistical environment R (R Core Team, 2016).
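As an illustration of how the three parameters interact, the following minimal sketch computes the endorsement probability for a single item. The exact parameterization used by the authors is not stated in the text; the form logit P = δ(θ + ε), in which a larger ε makes a “yes” response easier, is an assumption chosen to match the verbal description above, and the function name `p_endorse` is hypothetical.

```python
import math


def p_endorse(theta: float, eps: float, delta: float) -> float:
    """Probability of a "yes" response under a 2PL-type model.

    Assumed parameterization (not confirmed by the source):
    logit P = delta * (theta + eps), so that a larger easiness eps
    raises the probability and a larger discrimination delta makes
    the probability change more steeply with theta.
    """
    return 1.0 / (1.0 + math.exp(-delta * (theta + eps)))


if __name__ == "__main__":
    # Same item evaluated at three trait levels (illustrative values only).
    for theta in (-1.0, 0.0, 1.0):
        print(theta, round(p_endorse(theta, eps=0.5, delta=1.2), 3))
```

Under this assumed form, the probability equals 0.5 when θ = −ε, so an item with high easiness is endorsed with probability 0.5 already at low trait levels.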

The 2PL assumes unidimensionality of the scales. Confirmatory factor analyses were run on the data of each of the four scales (for a reasonable fit, CFI ≥ 0.90 and RMSEA < 0.08; see Hu and Bentler, 1999; Marsh et al., 2004; Brown, 2006). These analyses confirmed the unidimensionality of N [$\chi^{2}_{(252)}$ = 1046.791, p ≤ 0.001; RMSEA = 0.073; CFI = 0.919] and L [$\chi^{2}_{(189)}$ = 532.901, p ≤ 0.001; RMSEA = 0.056; CFI = 0.900]. Fit indices of the E scale were close to acceptable [$\chi^{2}_{(230)}$ = 808.417, p ≤ 0.001; RMSEA = 0.065; CFI = 0.890]. The unidimensional model did not fit the data of the P scale [$\chi^{2}_{(464)}$ = 1841.233, p ≤ 0.001; RMSEA = 0.071; CFI = 0.467]. An exploratory factor analysis on this scale suggested a four-factor solution with 7 items out of 32 exhibiting cross-loadings. In line with the literature (e.g., Howarth, 1986; Roger and Morris, 1991; Chico and Ferrando, 1995; Dazzi, 2011), this result confirms that the P scale defines a complex and multifaceted construct.
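The reported RMSEA values can be checked against the reported χ² statistics. The sketch below uses the common point estimate RMSEA = sqrt(max(χ² − df, 0) / (df · (N − 1))) with N = 590; the estimator and software used for these confirmatory analyses are not stated in the text, and some programs use N instead of N − 1, so this is a plausibility check rather than a reproduction of the authors' computation.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Common RMSEA point estimate from a chi-square statistic."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# (chi-square, degrees of freedom) as reported for each scale in Study 1.
REPORTED = {
    "N": (1046.791, 252),
    "L": (532.901, 189),
    "E": (808.417, 230),
    "P": (1841.233, 464),
}

if __name__ == "__main__":
    for scale, (chi2, df) in REPORTED.items():
        print(scale, round(rmsea(chi2, df, n=590), 3))
```

Rounded to three decimals, the four results agree with the RMSEA values given in the text (0.073, 0.056, 0.065, and 0.071). The degrees of freedom also match a one-factor model with k items, k(k − 1)/2 − k, for k = 24, 21, 23, and 32.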

Item Selection for the New Short Scales

DIF and item fit statistics were used to identify the items with the poorest psychometric properties, which were then excluded from the new short scales.

Three item fit statistics were used: infit, outfit (Wright and Masters, 1982), and the index suggested by Bock (1972). Infit and outfit are two χ²-based statistics, the former being effective in detecting unexpected responses to items close to a subject's trait level, the latter being effective in detecting unexpected responses to items far from the subject's trait level. In this work, items with infit and/or outfit higher than 1.4 (Wright and Linacre, 1994) were considered misfitting and not included in the new short scales. The index suggested by Bock involves grouping subjects into n categories on the basis of their latent trait level and comparing, for each group, the observed and expected proportions of subjects endorsing the item (Bock, 1972; Reise, 1990). In this work, subjects were grouped into four categories, and items that displayed a medium (0.3 ≤ Φ < 0.5) to large (Φ ≥ 0.5) effect size (Cohen, 1988) were not selected for inclusion in the new questionnaire.
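To make the infit/outfit distinction concrete, here is a minimal sketch of the usual mean-square definitions for one dichotomous item: outfit is the unweighted mean of squared standardized residuals, while infit weights the squared residuals by the model variance p(1 − p). The function names are hypothetical, the model probabilities are supplied by hand rather than estimated, and how the authors' software computed these statistics for the 2PL is not described in the text.

```python
def item_fit(responses, probs):
    """Return (infit, outfit) mean squares for one dichotomous item.

    responses: observed 0/1 answers, one per subject.
    probs: model-implied endorsement probabilities for the same subjects.
    """
    sq_resid = [(x - p) ** 2 for x, p in zip(responses, probs)]
    variances = [p * (1.0 - p) for p in probs]
    # Outfit: plain average of squared standardized residuals, so a single
    # surprising answer from a subject far from the item weighs heavily.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(responses)
    # Infit: information-weighted, dominated by subjects near the item.
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit


def is_misfitting(infit, outfit, cutoff=1.4):
    """Apply the 1.4 cutoff used in the text to either statistic."""
    return infit > cutoff or outfit > cutoff


if __name__ == "__main__":
    # The second subject says "yes" despite a model probability of 0.1.
    infit, outfit = item_fit([1, 1, 0, 0], [0.9, 0.1, 0.1, 0.5])
    print(round(infit, 3), round(outfit, 3), is_misfitting(infit, outfit))
```

In the toy data, the single off-target surprise inflates outfit more than infit, which is the behaviour the paragraph describes.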

Items exhibiting gender DIF were also excluded from the new questionnaire. Both uniform and non-uniform DIF were considered. The former is a systematic bias expressing a different probability of endorsing an item for the members of a specific group. The latter is a non-systematic bias that varies with the latent trait level. Females were used as the reference group. Effect sizes of uniform and non-uniform DIF were evaluated by the R² difference test (Nagelkerke, 1991; Gómez-Benito et al., 2009), with values higher than 0.035 denoting moderate DIF and values higher than 0.07 denoting strong DIF (Jodoin and Gierl, 2001; Magis et al., 2016).
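The effect-size rule can be sketched as follows, assuming the usual logistic-regression reading of the R² difference test: Nagelkerke's R² is computed for nested models (with and without the group terms) and the difference is compared with the 0.035 and 0.07 thresholds. The log-likelihood values in the example are invented for illustration, and the label "negligible" for differences at or below 0.035 is an assumption, since the text names only the moderate and strong categories.

```python
import math


def nagelkerke_r2(ll_null: float, ll_model: float, n: int) -> float:
    """Nagelkerke's R-squared from log-likelihoods of the intercept-only
    model (ll_null) and the fitted model (ll_model) on n observations."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_value = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_value


def classify_dif(delta_r2: float) -> str:
    """Thresholds from the text: > 0.035 moderate, > 0.07 strong."""
    if delta_r2 > 0.07:
        return "strong"
    if delta_r2 > 0.035:
        return "moderate"
    return "negligible"  # label assumed, not given in the text


if __name__ == "__main__":
    n = 590
    ll_null = -400.0          # hypothetical intercept-only log-likelihood
    ll_trait_only = -330.0    # hypothetical: trait score as only predictor
    ll_with_group = -310.0    # hypothetical: trait + group + interaction
    delta = (nagelkerke_r2(ll_null, ll_with_group, n)
             - nagelkerke_r2(ll_null, ll_trait_only, n))
    print(round(delta, 3), classify_dif(delta))
```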

Parameters ε and δ were examined to select, among the remaining items, those that together cover the entire trait continuum and have the greatest discrimination.
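The text does not describe a formal selection algorithm, so the following is only one hypothetical way such a selection could be operationalised, not the authors' procedure: items are sorted by easiness, split into as many contiguous groups as items are wanted, and the most discriminating item in each group is kept. The function name, the grouping rule, and the example parameter values are all assumptions made for illustration.

```python
def select_items(items, k):
    """Pick k items that spread over the easiness range.

    items: list of (item_id, eps, delta) tuples for items that already
           passed the DIF and fit screening.
    k: number of items to keep.
    Hypothetical heuristic: one item per easiness group, chosen by
    highest discrimination within the group.
    """
    if k > len(items):
        raise ValueError("cannot select more items than are available")
    ordered = sorted(items, key=lambda item: item[1])
    n = len(ordered)
    selected = []
    for g in range(k):
        # Contiguous, nearly equal-sized groups along the easiness order.
        group = ordered[g * n // k:(g + 1) * n // k]
        best = max(group, key=lambda item: item[2])
        selected.append(best[0])
    return selected


if __name__ == "__main__":
    pool = [("a", -2.0, 0.5), ("b", -1.8, 1.2), ("c", 0.0, 0.9),
            ("d", 0.1, 0.4), ("e", 2.0, 1.5), ("f", 2.2, 0.7)]
    print(select_items(pool, k=3))
```

A purely mechanical rule like this ignores item content, whereas the Results mention that at least one item was dropped for redundant content, so judgment clearly played a role alongside the parameter values.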

Assessment of the Psychometric Characteristics of the New Short Scales

Reliability and validity of the newly developed PEN-L scales were evaluated and compared with those of the original short scales. Reliability was evaluated through Cronbach's α and the test information function (TIF). The TIF indicates how well the test measures the latent trait levels over the entire range of interest (Baker, 2001; Petrillo et al., 2015). The larger the value of the TIF, the greater the accuracy with which the latent trait levels are measured. The TIF depends on the latent trait range under consideration and on the number of items in the test (Baker, 2001). In this work, the old and new short scales had the same length (12 items), and the TIF was defined on the same range of latent trait levels (−5 to 5). Validity was evaluated using a bias index and the correlation between scores obtained with the full-length and short scales. The bias index was computed as the average absolute difference between the parameters θ estimated on the full-length scales and those estimated on the short scales. Low bias suggests that the latent trait estimates obtained with the short scales approximate those of the full-length versions. In addition, the correlations between scores obtained with the full-length and short scales were computed and corrected for common items using Levy's (1967) method.

Results

Three of the 32 items of the P scale exhibited uniform and non-uniform gender DIF of moderate (Items 68 and 91) or strong (Item 12) size. Fit statistics were adequate for all the items. From the remaining 29 items, 12 were selected taking into account their parameters ε and δ. This resulted in a new short scale that differed from the original one in eight items (see Table 1). Specifically, Item 91 was replaced because it showed uniform and non-uniform gender DIF of moderate size. These modifications yielded a new scale with increased reliability (α increased from 0.40 to 0.62; TIF increased from 8.13 to 12.86) and with scores that better approximate those obtained with the full-length scale (bias decreased from 0.37 to 0.18; corrected correlation increased from 0.47 to 0.52). It is worth noting that Cronbach's α of the new 12-item scale (0.62) is close to that of the full 32-item scale (0.60).

Table 1. Easiness (ε) and discrimination (δ) parameters for the 32 items of the Psychoticism scale.

Regarding the 23 items of the E scale, only Item 55 exhibited uniform gender DIF of moderate size, and no item showed misfit. Selecting 12 items on the basis of their parameters ε and δ, we obtained a new E scale that differed from the original one in three items (see Table 2). The differences in reliability and validity between the new and original scales were small but favored the new version (α increased from 0.73 to 0.75; TIF increased from 16.62 to 16.83; bias decreased from 0.21 to 0.19; corrected correlation increased from 0.74 to 0.77).

Table 2. Easiness (ε) and discrimination (δ) parameters for the 23 items of the Extraversion scale.

Concerning the N and L scales, no item exhibited gender DIF or misfit. Therefore, items were selected considering their ε and δ parameters. For both scales, the new versions differed from the original ones in two items (see Tables 3, 4). Item 35 was present in the previous version of the N scale but was not included in the new one because of its redundant content. Reliability of the new scales is very close to that of the original versions (α = 0.83, 0.82; TIF = 20.86, 20.80 for original and new N scale, respectively; α = 0.73, 0.74; TIF = 13.86, 14.15 for original and new L scale, respectively). Concerning the N scale, a slight decrease in bias was observed (from 0.22 to 0.16). The other indexes remained substantially unchanged (bias = 0.20, 0.18 for original and new L scale, respectively; corrected correlation = 0.74, 0.75 for original and new L scale, respectively; 0.83, 0.84 for original and new N scale, respectively).

Table 3. Easiness (ε) and discrimination (δ) parameters for the 24 items of the Neuroticism scale.

Table 4. Easiness (ε) and discrimination (δ) parameters for the 21 items of the Lie scale.

Discussion

This study aimed at developing a new short version of the EPQ-R with improved psychometric characteristics. IRT-based statistics allowed the identification of 48 items that show no gender DIF or misfit, discriminate well, and are well distributed along the four latent trait continua. The new version of the P scale differs from the original one in eight items (out of 12), the E scale in three, and the N and L scales in only two each. The largest improvement was achieved for the P scale, which in the literature has been found to perform less well than the other three scales (e.g., Bishop, 1977; Block, 1977; Claridge, 1981). In particular, the new version is not affected by gender DIF and outperforms the original one in reliability and in approximating the scores obtained with the full-length form. The new versions of the other three scales performed as well as, or slightly better than, the original ones. Although small, these improvements are valuable given that they were obtained by substituting a small number of items and reducing content redundancy.

Study 2

This study aimed at investigating the functioning of the new version of the short EPQ-R on a new data set. In addition to reliability and factor structure, construct validity was evaluated by taking into account relationships with social desirability, the dimensions of the FFM, and measures of anxiety and depression.

Participants

Participants were 300 native Italian speakers aged between 18 and 65 (mean age = 29.28, SD = 10.38; 60.2% females). They were recruited from different Italian regions using convenience sampling. All participants were presented with the new version of the short EPQ-R, whereas a subsample of 158 participants (mean age = 34.73, SD = 9.88; 68.7% females) also received the other measures. Participation in the study was anonymous and voluntary, and all standards for research with human subjects were respected. Written informed consent was obtained from the participants. The project was approved retrospectively by the Ethical Committee for the Psychological Research of the University of Padova (Protocol n. 2622), since prospective ethics approval was not required at the time the research was conducted.

Instruments

The new form of the short EPQ-R devised in Study 1 was administered to all participants.

The five traits of the FFM of personality (i.e., extraversion, agreeableness, conscientiousness, emotional stability, and openness) were measured through the Italian version (Ubbiali et al., 2013; Chiorri et al., 2016) of the Big Five Inventory (BFI; John et al., 2008). The questionnaire consists of 44 items answered on a five-point Likert scale (from 1 “Strongly disagree” to 5 “Strongly agree”; e.g., “I see myself as someone who is full of energy” for extraversion; “I see myself as someone who is helpful and unselfish with others” for agreeableness; “I see myself as someone who perseveres until the task is finished” for conscientiousness; “I see myself as someone who worries a lot” for emotional stability; “I see myself as someone who is ingenious, a deep thinker” for openness). Convincing evidence was found concerning construct validity, factor structure, gender invariance, and reliability (α from 0.75 to 0.86; Ubbiali et al., 2013; Chiorri et al., 2016; α from 0.73 to 0.83 in the current sample).

The Impression Management (IM) scale of the Italian brief version (Bobbio and Manganelli, 2011) of the Balanced Inventory of Desirable Responding (BIDR; Paulhus, 1991) was also administered. The scale comprises 8 items answered on a six-point Likert scale (from 1 “Strongly disagree” to 6 “Strongly agree”) and assesses the conscious tendency of individuals to provide positively inflated self-descriptions (e.g., “I have never dropped litter on the street”). Internal consistency of the scale ranges from 0.73 to 0.81 (Bobbio and Manganelli, 2011; in the current sample, α = 0.75).

The trait scale of the State-Trait Anxiety Inventory (STAI-Y; Spielberger et al., 1983; Pedrabissi and Santinello, 1989) was used to evaluate anxiety. The scale comprises 20 items answered on a four-point Likert scale (from 1 “Not at all” to 4 “Very much”). The instrument evaluates the tendency of people to experience general anxiety and the relatively stable predisposition to view stressful situations as threatening (e.g., “I am regretful”). The Italian version of the questionnaire showed adequate validity and reliability (α from 0.85 to 0.90; Pedrabissi and Santinello, 1989; in the current sample, α = 0.92).

Finally, the Italian version of the Patient Health Questionnaire-9 (PHQ-9; Spitzer et al., 1999; Kroenke et al., 2001) was used to evaluate depressive symptoms. The questionnaire is a self-administered instrument and assesses the nine DSM-IV (American Psychiatric Association, 2000) criteria for depression. Respondents are asked to evaluate the presence of depressive symptoms over the last 2 weeks through nine items scored on a four-point Likert scale (from 0 “Not at all” to 3 “Nearly every day”; e.g., “Feeling tired or having little energy”). This instrument showed adequate reliability (α from 0.86 to 0.89), and good sensitivity and specificity (see Kroenke et al., 2001). In the current sample, α equals 0.81.

Analysis Strategy

Reliability of the new version of the short EPQ-R was tested through Cronbach's α. Construct validity was evaluated by computing convergent validity coefficients and by analyzing the factor structure of the instrument.
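For reference, Cronbach's α for a k-item scale is k/(k − 1) · (1 − Σ item variances / variance of total scores). The sketch below implements this standard formula for a small response matrix; the function name and the toy data are hypothetical and unrelated to the study's samples.

```python
def cronbach_alpha(data):
    """Cronbach's alpha for a list of respondents' item-score lists.

    data: one inner list per respondent, all of the same length k >= 2.
    Uses sample variances (denominator n - 1) throughout.
    """
    k = len(data[0])

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_variances = [variance([row[j] for row in data]) for j in range(k)]
    total_variance = variance([sum(row) for row in data])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Four respondents, three dichotomous (0/1) items; illustrative only.
    answers = [[1, 1, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
    print(round(cronbach_alpha(answers), 3))
```

Because α depends on the item variances, scales made of strongly skewed dichotomous items (as reported for the P scale) tend to yield lower values than scales with more evenly endorsed items.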

Convergent validity was evaluated considering correlations between the four PEN-L traits, the five dimensions of the FFM, social desirability, and indexes of depression and trait anxiety. In line with the literature, L scores are expected to correlate positively with the IM scale of the BIDR (e.g., Gillings and Joseph, 1996), while PEN traits are expected to correlate with BFI scales, depression, and trait anxiety. In particular, positive correlations are expected between E scores of the EPQ-R and the extraversion measure of the BFI, while negative correlations are expected between the P scale and agreeableness and conscientiousness. Positive correlations are also expected between the N scale of the EPQ-R and the neuroticism measure of the BFI (e.g., McCrae and Costa, 1985; Draycott and Kline, 1995; Saggino, 2000; Barbaranelli et al., 2003; Scholte and De Bruyn, 2004; Heaven et al., 2013). Neuroticism, in addition, is expected to correlate positively with indexes of anxiety and depression (STAI-Y; Spielberger et al., 1983; PHQ-9; Spitzer et al., 1999; Kroenke et al., 2001). In contrast, extraversion is expected to correlate negatively with these two clinical indexes.

An Exploratory Structural Equation Model (ESEM; Asparouhov and Muthén, 2009) was run to evaluate the factor structure. The ESEM framework integrates confirmatory factor analysis (CFA), structural equation modeling (SEM), and exploratory factor analysis (EFA). ESEMs give access to all the common statistics of SEM/CFA but, at the same time, overcome the restrictions associated with the confirmatory approach. CFA fixes non-target loadings to zero and, therefore, may be inadequate for complex and multifaceted constructs where many cross-loadings may be expected (Marsh et al., 2009, 2010, 2011, 2014). When this is the case, fit problems and upward-biased estimates of correlations between factors can be observed (Cole et al., 2007; Marsh and Hau, 2007; Marsh et al., 2010). As in EFA, ESEMs allow for the free estimation of cross-loadings between items and non-target factors. In this work, ESEM was run using Mplus 7 (Muthén and Muthén, 2012) with the weighted least squares mean- and variance-adjusted (WLSMV) estimator. This method is recommended for binary or ordinal observed data (e.g., Flora and Curran, 2004; Brown, 2006) such as the dichotomous items of the EPQ-R. In the model, the 48 items were the indicators and four factors were modeled. The GEOMIN oblique rotation was used. To evaluate the goodness of fit of the model, several fit indexes were considered: χ², Comparative Fit Index (CFI; Bentler, 1990), Weighted Root Mean Square Residual (WRMR; Yu, 2002), and Root Mean Square Error of Approximation (RMSEA; Browne and Cudeck, 1993) with its 90% confidence interval (90% CI) and the test of close fit (CFit; Browne and Cudeck, 1993). A solution fits the data well when χ² is non-significant (p ≥ 0.05). Since this statistic is sensitive to sample size, the other fit measures were also considered. In particular, a solution fits the data well when CFI is close to 0.95 (0.90 to 0.95 for reasonable fit), WRMR is close to 1.0, and RMSEA is smaller than 0.06 (0.06 to 0.08 for reasonable fit) with CFit non-significant (see Hu and Bentler, 1999; Marsh et al., 2004; Brown, 2006).
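The cutoffs above can be written as a small decision rule, shown here together with the standard degrees-of-freedom count for an exploratory model with p indicators and m factors, ((p − m)² − (p + m)) / 2. The function names and the three verbal labels are hypothetical; treating the "close to" wording as hard thresholds is a simplification; and WRMR and CFit are left out because the text gives no sharp cutoff for them.

```python
def efa_df(p: int, m: int) -> int:
    """Degrees of freedom of an exploratory m-factor model for p indicators."""
    return ((p - m) ** 2 - (p + m)) // 2


def classify_fit(cfi: float, rmsea: float) -> str:
    """Simplified reading of the CFI and RMSEA cutoffs quoted in the text."""
    if cfi >= 0.95 and rmsea < 0.06:
        return "good"
    if cfi >= 0.90 and rmsea < 0.08:
        return "reasonable"
    return "below cutoffs"


if __name__ == "__main__":
    # 48 items and four factors, as specified for the ESEM.
    print(efa_df(48, 4))
    # Example inputs only; any pair of CFI and RMSEA values can be supplied.
    print(classify_fit(cfi=0.93, rmsea=0.025))
```

For 48 indicators and four factors the count is 942, which matches the degrees of freedom reported for the ESEM in the Results.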

Results

Cronbach's α coefficients were 0.55, 0.80, 0.81, and 0.70 for P, E, N, and L scales, respectively. These values were consistent with those of Study 1. Compared with the original version, the largest improvement was reached for P scale, as observed in Study 1.

Convergent validity coefficients are reported in Table 5. All four PEN-L traits correlated in the expected direction with the considered constructs. The E scale showed a strong positive relation with the extraversion measure of the BFI (0.727). The P scale was negatively related to agreeableness (−0.323) and conscientiousness (−0.321). The N scale was strongly correlated with neuroticism (0.709). Relations with anxiety and depression were also in the expected directions. The N scale showed positive relations with scores on the PHQ-9 (0.619) and STAI-Y (0.697), while moderate negative relations were found between these two indexes and the E scale (r = −0.409, −0.405 for PHQ-9 and STAI-Y, respectively). Finally, the L scale showed a strong positive correlation with the IM scale of the BIDR.

Table 5. Cronbach's αs and correlations between the four PEN-L traits, STAI-Y, PHQ-9, BIDR-IM, and the five BFI dimensions.

Results of the ESEM supported the four-factor structure of the instrument {$\chi^{2}_{(942)}$ = 1122.686, p < 0.001; RMSEA = 0.025 [0.019, 0.031]; CFit ≅ 1.000; CFI = 0.930; WRMR = 0.864}. The model is represented in Table 6. All items loaded on the intended factor and cross-loadings were, in general, lower than the loadings on the target factor.

Table 6. Exploratory structural equation modeling.

Discussion

The analyses performed in this study provide further evidence of the adequate psychometric properties of the new short form of the EPQ-R. Concerning reliability, results are in line with those of Study 1 and confirm that, compared with the original version, the largest improvement was observed for the P scale. Concerning validity, both the factor structure of the instrument and its convergent validity are supported.

Final Remarks

This work aimed at developing a new and improved version of the short form of the EPQ-R. This instrument is well-known and widely used in different settings. However, some weaknesses have been pointed out, especially for the P scale (e.g., Bishop, 1977; Block, 1977; Claridge, 1981). An IRT approach was used to develop the new instrument. This approach allowed for removing items with misfit or gender DIF, and for identifying the items that best discriminated between different levels of the traits, while ensuring that the respective continua were covered. As suggested in the literature, following these criteria for item selection should lead to a short scale with the same psychometric properties as the full-length instrument (Reise and Henson, 2000; Spence et al., 2012). In fact, results of this work show that the new short form of the EPQ-R approximated the scores obtained with the full-length form better than the original short version did. In addition, convergent validity of the new scale was consistent with the literature (e.g., Saklofske et al., 1995; Gillings and Joseph, 1996; del Barrio et al., 1997; Dazzi et al., 2004; Jylhä and Isometsä, 2006; Mor, 2010). The moderate to strong relationships between Eysenck's traits and clinical constructs provide further evidence of the usefulness of assessing these traits in clinical settings.

A strength of the present work is that it provides a solution to some well-known drawbacks of the full-length EPQ-R and of its existing short form (Eysenck et al., 1985; Eysenck and Eysenck, 1991). The largest improvement was obtained for the P scale. The new version is not affected by gender DIF and outperforms the original one in reliability and in approximation of the full-length form. The new versions of the other three scales performed as well as the original ones, or slightly better. These improvements are small in size, yet notable considering that they were obtained by substituting a small number of items and reducing content redundancy.

In the present work, separate analyses have been performed on each of the four scales by using a unidimensional IRT model. An alternative could have been to examine the four scales at once through a multidimensional IRT (MIRT) model (see Haberman et al., 2008; Reckase, 2009). MIRT models offer some advantages over unidimensional IRT models. They could allow for a better understanding of the traits measured by an instrument and of how well individual items measure each of them (Ackerman, 1994). Moreover, MIRT models could provide a more precise estimation of scale reliability (Cheng et al., 2009) and item parameters (Finch, 2010). In the present work, some of these advantages are not very relevant. On the one hand, the factor structure of the EPQ-R has been widely tested and validated in the literature (e.g., Hosokawa and Ohyama, 1993; Maltby and Talley, 1998; Forrest et al., 2000; Qian et al., 2000; Scholte and De Bruyn, 2001; Aluja et al., 2003; Alexopoulos and Kalaitzidis, 2004; Dazzi et al., 2004; Francis et al., 2006; Karanci et al., 2006; Tiwari et al., 2009; Picconi et al., 2018). On the other hand, for scales whose length is analogous to that of the four EPQ-R scales (i.e., from 21 to 32 items), unidimensional IRT models have been found to provide item parameter estimates whose precision equals or exceeds that of the estimates produced by MIRT models (Finch, 2010). Finch (2010) investigated the precision of MIRT estimates on tests measuring as few as two traits. For larger numbers of traits (e.g., the four traits of the EPQ-R), the number of parameters of a MIRT model increases considerably. Thus, the sample size of Study 1 (590 individuals) might not have been adequate for performing a multidimensional analysis.
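The growth in the number of parameters can be made concrete with a small counting sketch. It rests on assumptions that are not stated in this section: a two-parameter logistic (2PL) parameterization with one slope and one intercept per item in the unidimensional case, and an exploratory compensatory multidimensional 2PL in which every item has a slope on every dimension, with D(D − 1)/2 slopes fixed for rotational identification. Function names are hypothetical. The item counts (32 items for the longest scale, 100 items overall) and the sample size of 590 are those mentioned in the article.

```python
def params_unidimensional_2pl(n_items):
    """Free item parameters: one slope and one intercept per item (assumed 2PL)."""
    return 2 * n_items


def params_exploratory_m2pl(n_items, n_dims):
    """Free item parameters of an exploratory multidimensional 2PL (assumed)."""
    slopes = n_items * n_dims
    intercepts = n_items
    # Assumption: D(D - 1)/2 slopes are fixed to identify the rotation.
    fixed_for_rotation = n_dims * (n_dims - 1) // 2
    return slopes + intercepts - fixed_for_rotation


N_RESPONDENTS = 590  # Study 1 sample size

longest_scale = params_unidimensional_2pl(32)   # one scale at a time
full_mirt = params_exploratory_m2pl(100, 4)     # all 100 items, 4 traits

print(f"unidimensional, 32 items: {longest_scale} parameters, "
      f"{N_RESPONDENTS / longest_scale:.1f} respondents per parameter")
print(f"multidimensional, 100 items: {full_mirt} parameters, "
      f"{N_RESPONDENTS / full_mirt:.1f} respondents per parameter")
```

Under these assumptions a joint four-dimensional calibration has several times more free parameters than any single-scale calibration, which is the concern raised about the sample size.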

Concerning the P scale, despite notable improvements, reliability remains rather low. This result, however, was expected. The P scale, perhaps because of its complex and clinical nature, is the most problematic and controversial scale of the instrument (e.g., Eysenck et al., 1985). Future research, therefore, should try to develop a new pool of items that effectively capture the multifaceted aspects of this trait.

In the present work, a new short version of the EPQ-R has been devised, which consists of 12 items for each of the four scales. An abbreviated form consisting of only 6 items per scale also exists in the literature (Francis et al., 1992). This abbreviated form suffers from the same weaknesses that have been pointed out for Eysenck's other questionnaires. Future research should try to devise a new version of the abbreviated form by using the IRT approach.

Data Availability Statement

The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher.

Author Contributions

DC contributed to the conception and design of the study, conducted the research, performed the statistical analyses, and wrote the first draft of the manuscript. DC and PA wrote sections of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Ackerman, T. A. (1994). Using multidimensional item response theory to understand what items and tests are measuring. Appl. Meas. Educ. 7, 255–278. doi: 10.1207/s15324818ame0704_1