Traditional Masculinity and Femininity: Validation of a New Scale Assessing Gender Roles

Kachel, Sven; Steffens, Melanie C.; Niedlich, Claudia

doi:10.3389/fpsyg.2016.00956

ORIGINAL RESEARCH article

Front. Psychol., 05 July 2016

Sec. Personality and Social Psychology

Volume 7 - 2016 | https://doi.org/10.3389/fpsyg.2016.00956

Sven Kachel*

Melanie C. Steffens*

Claudia Niedlich

Department of Social and Economic Psychology, University of Koblenz and Landau, Landau, Germany

Gender stereotype theory suggests that men are generally perceived as more masculine than women, whereas women are generally perceived as more feminine than men. Several scales have been developed to measure fundamental aspects of gender stereotypes (e.g., agency and communion, competence and warmth, or instrumentality and expressivity). Although omitted in later versions, Bem's original Sex Role Inventory included the items “masculine” and “feminine” in addition to more specific gender-stereotypical attributes. We argue that it is useful to be able to measure these two core concepts in a reliable, valid, and parsimonious way. We introduce a new and brief scale, the Traditional Masculinity-Femininity (TMF) scale, designed to assess central facets of self-ascribed masculinity-femininity. Studies 1–2 used known-groups approaches (participants differing in gender and sexual orientation) to validate the scale and provide evidence of its convergent validity. As expected, the TMF reliably measured a one-dimensional masculinity-femininity construct. Moreover, the TMF correlated moderately with other gender-related measures. Demonstrating incremental validity, the TMF predicted gender and sexual orientation better than established adjective-based measures. Furthermore, the TMF was connected to criterion characteristics, such as judgments as straight by laypersons for the whole sample, voice pitch characteristics for the female subsample, and contact with gay men for the male subsample, and outperformed other gender-related scales. Taken together, as long as gender differences continue to exist, we suggest that the TMF provides a valuable methodological addition for research into gender stereotypes.

Introduction

Every time a group of people is addressed as “Ladies and Gentlemen!” the pervasiveness of gender over all other social categories is demonstrated. Gender is also one of the first social categories that children learn in today's societies, and thus knowledge of gender stereotypes is evident from early childhood on (for a recent review, see Steffens and Viladot, 2015) and into adulthood, with both adolescents and college students construing their self-concepts in line with the gender stereotypes they have internalized (e.g., Nosek et al., 2002; Steffens et al., 2010). Since the 1970s, following Bem's (1974) pioneering work, many scales have been designed, developed, and widely used for measuring traits traditionally considered as typically male vs. typically female (Constantinople, 1973). In recent years, such measures have often failed to find between-gender differences in self-ascriptions of gender stereotypical traits (e.g., Sczesny et al., 2004), which is presumably due to changes in gender roles across the decades (e.g., Diekman and Eagly, 2000; Wilde and Diekman, 2005; Ebert et al., 2014). Still, gender differences in self-ascriptions do continue to exist, and there are attempts to measure different aspects of masculinity and femininity, including, for example, everyday behavior such as housework (Athenstaedt, 2003). In the present paper, we argue that a scale that reliably and validly measures differences in an individual's underlying conceptualization of his or her own masculinity-femininity would be valuable for gender research. To date, these constructs can only be measured using two items, “masculine” and “feminine,” which is somewhat limited given that established standards of psychological assessment typically recommend using a larger number of items (e.g., Bühner, 2010). In the present article, we introduce a new, extended, but still parsimonious scale, the Traditional Masculinity-Femininity Scale, TMF, to fill this gap. 
Using a known-groups approach, we present two studies testing this measure's reliability as well as its incremental and criterion validity, and we provide evidence for its convergent validity.

We define “traditional masculinity” and “traditional femininity” as relatively enduring characteristics encompassing traits, appearances, interests, and behaviors that have traditionally been considered relatively more typical of men and women, respectively (adapting the definitions provided by Constantinople, 1973). It is important to note that the focus of the present paper is on gender-related self-assessment. Complementary research has investigated many different aspects of gender, for example, gender-role norms (e.g., Athenstaedt, 2000; Thompson and Bennet, 2015; Klocke and Lamberty, unpublished manuscript).

In a seminal study on masculinity and femininity, Deaux and Lewis (1984) investigated the perceived relationship between gender and gender-related components, such as role behaviors (e.g., head of household vs. takes care of children), traits, occupations, and physical characteristics (e.g., tall, broad-shouldered vs. soft voice, graceful). The researchers showed that these components were interdependent, impacting on one another, as well as on perceived gender and sexual orientation. In other words, participants readily generalized from one component to the others. In addition, physical appearance played a particularly large role. Such findings indicate that gender stereotypes may be based on some sort of “core” masculinity and femininity. Similarly, individuals may use such “core” masculinity and femininity in their self-construal.

The first attempts to gauge masculinity and femininity placed these constructs on a bipolar spectrum and involved measuring simple collections of personality traits on which women and men differed on average (for a review, see Constantinople, 1973). By contrast, Bem's pioneering Sex Role Inventory (BSRI; Bem, 1974) used gender-stereotypical traits to independently measure masculinity and femininity (e.g., masculine items such as competitive and dominant, and feminine items such as affectionate and gentle). She referred to women and men who scored high on both scales as androgynous. Importantly, “masculine” and “feminine” were included as items in these original scales, but were excluded from the revised version (Bem, 1979) because of problematic loadings on the factors on which the masculine and feminine traits loaded, respectively. Exploratory factor analyses showed an unstable factor structure but often converged on three-factor solutions: Masculine traits on one factor, feminine traits on a second factor, and masculine-feminine along with participant gender on a third factor (e.g., Niedlich et al., 2015, see review by Choi and Fuqua, 2003). It has thus been suggested that the two independent masculinity and femininity trait dimensions are complemented by one bipolar masculinity-femininity dimension (see Constantinople, 1973; Spence et al., 1975; Bem, 1979) that reflects gender identity instead of gender-role related aspects (e.g., Bem, 1979; Spence and Buckner, 2000). As Choi and Fuqua (2003) suggest, inventories such as the BSRI “may not capture the complex and multidimensional nature of masculinity/femininity.” Instead, “masculinity and femininity could be two higher order constructs, with each having its own subconstructs” (p. 873).
Similar to other scales (e.g., Personal Attributes Questionnaire, PAQ, by Spence et al., 1975), the BSRI appears to tap more specific constructs, often referred to as instrumentality/agency and expressivity/communion (e.g., Fiske et al., 2002; Abele and Wojciszke, 2007), rather than masculinity and femininity in general. For the present purposes it is important to note that if masculinity and femininity are directly measured they should load on one bipolar masculinity-femininity dimension.

Another limit to the practical use of these established scales pertains to the generally small magnitude of gender differences found on these two dimensions (e.g., Deaux, 1984). In other words, women and men appear rather similar on “masculinity” and “femininity.” More recently, gender differences have not emerged at all between graduates with the same major (see Abele, 2000). In short, scales that have been developed to assess aspects of masculinity and femininity have recently failed to find gender differences (see also Sczesny et al., 2004; Evers and Sieverding, 2014). This could indicate that gender differences in masculinity and femininity are a thing of the past (Alvesson, 1998). However, it could also mean that the scales do not tap the most relevant aspects of the constructs on which gender differences continue to exist. For example, gender roles have changed over the last decades, particularly women's roles, so that today's women possess more of the traits traditionally considered as masculine (e.g., Diekman and Eagly, 2000; Spence and Buckner, 2000; Wilde and Diekman, 2005; Ebert et al., 2014). According to these findings, instrumental traits have become more socially desirable for women and expressive traits have become more socially desirable for men (Swazina et al., 2004).

In order to overcome limitations of the discussed scales, there have been attempts to measure other aspects of masculinity and femininity to account for the multiple dimensions they are reflected in, such as physical appearance, behaviors, attitudes, and interests (e.g., Spence and Buckner, 2000; Blashill and Powlishta, 2009). For example, Athenstaedt (2003) observed considerable gender differences in everyday behavior such as “putting flowers on the desk” (feminine) and “putting the meat on the barbeque” (masculine), strongly suggesting the continued importance of gender differences. Complementing these existing approaches, we suggest directly assessing the presumed higher-order constructs, namely masculinity and femininity. However, instead of using only these two items, we constructed a scale that can be tested empirically with regard to its reliability and validity.

Scale Construction

We introduce the TMF scale, an instrument for measuring gender-role self-concept. Appendix A1 in Supplementary Material shows all items, both English translations and original German wordings. Each item initially included in scale construction was selected based on theoretical considerations, as outlined in the following. We argue that we can measure the “core” of masculinity/femininity by referring to three central aspects, identified by Constantinople (1973), that we summarize using the term gender-role self-concept: Namely, gender-role adoption, gender-role preference, and gender-role identity. Constantinople (1973) defines gender-role adoption as the actual manifestation (i.e., how masculine-feminine a person considers her- or himself) and gender-role preference as the desired degree of masculinity-femininity (i.e., how masculine-feminine a person ideally would like to be). According to Kagan (1964), gender-role identity refers to a comparison of gender-related social norms and the gender-related characteristics of the individual (e.g., how a person actually looks compared to expected gender-typical appearances according to societal norms). Hence, for gender-role identity social comparisons as well as references to different gender-related aspects are emphasized (e.g., looks, behaviors etc.), whereas gender-role adoption and preference are based on non-relative, absolute statements. Following the former approach, we use TMF as a reference point. Based on dimensions identified as important in previous research, the TMF encompasses gender-role identity with regard to physical appearance, behavior, interests, and attitudes and beliefs (e.g., Deaux and Lewis, 1984; Athenstaedt, 2003). As mentioned, physical appearance was shown to play a particularly large role in implicating other components of gender stereotypes (Deaux and Lewis, 1984). 
Athenstaedt (2003) advocated the inclusion of gender-stereotypical behaviors in addition to traits, so this domain was included in the TMF as well. Lippa (2008) found that gender-related interests were highly relevant in discriminating women and men as well as lesbians/gay men from straight people. Additionally, his study showed that instrumental and expressive traits were outperformed by these gender-related interests in predicting participants' gender. Consequently, we included gender-related interests in the TMF (instead of gender-related traits). Finally, regarding attitudes and beliefs, gender differences have often been found, for example, with regard to attitudes toward minority groups (e.g., Sidanius et al., 1994; Kite and Whitley, 1996). We therefore also included self-assessment of attitudes and beliefs in the TMF.

One advantage of the TMF is that each of the mentioned scale dimensions is measured on a global level and not by various specific indicator items. Different from the instruments described above, which infer masculinity-femininity from the degree of affirmation of specific traits and behaviors, the TMF aims to directly assess masculinity-femininity. For example, “Traditionally, my behavior would be considered as…” 1 (not at all masculine) to 7 (very masculine). We consider it an asset of the scale that it is thus independent of specific stereotype content regarding masculinity and femininity that depend on culture and time (e.g., intelligent and ambitious as masculine, childlike and shy as feminine, see BSRI; in the General Discussion we discuss how far this global conception can also be considered a limitation). The TMF consists of six items only: One for gender-role adoption (“I consider myself as…”), one for gender-role preference (“Ideally, I would like to be…”), and four for gender-role identity (“Traditionally, my 1. interests, 2. attitudes and beliefs, 3. behavior, and 4. outer appearance would be considered as…”) in order to measure an individual's gender-role self-concept in a parsimonious way. All of them have high face validity. Each item is to be independently rated in terms of femininity and masculinity. A 7-point-scale is used to gauge the extent to which the participant feels feminine or masculine, how feminine or masculine she or he ideally would like to be, and how feminine and masculine her or his appearance, interests, attitudes, and behavior would traditionally be seen. Construct validity is tested in the studies described below. The TMF was used with masculinity and femininity as two unipolar dimensions (Study 1: 1, not at all masculine, to 7, very masculine, and 1, not at all feminine, to 7, very feminine) vs. one bipolar dimension (pilot study, Study 2; 1, very masculine, to 7, very feminine) in order to check for dimensionality.
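
The scoring of the six items can be illustrated with a minimal Python sketch. It assumes the bipolar response format (1, very masculine, to 7, very feminine) and an unweighted mean of the six ratings as the scale score; the item keys and the function name `tmf_score` are hypothetical labels chosen for illustration, not the authors' materials.

```python
from statistics import mean

# Hypothetical keys for the six TMF items (illustrative names, not the authors').
TMF_ITEMS = (
    "adoption",           # "I consider myself as..."
    "preference",         # "Ideally, I would like to be..."
    "interests",          # "Traditionally, my interests would be considered as..."
    "attitudes_beliefs",  # "... attitudes and beliefs ..."
    "behavior",           # "... behavior ..."
    "appearance",         # "... outer appearance ..."
)


def tmf_score(responses, low=1, high=7):
    """Return the unweighted mean of the six item ratings for one respondent.

    Assumes the bipolar format: low = very masculine, high = very feminine.
    """
    missing = [item for item in TMF_ITEMS if item not in responses]
    if missing:
        raise ValueError(f"missing items: {missing}")
    ratings = [responses[item] for item in TMF_ITEMS]
    if any(not low <= rating <= high for rating in ratings):
        raise ValueError(f"ratings must lie between {low} and {high}")
    return mean(ratings)


# Made-up responses for one respondent, for illustration only.
example = {
    "adoption": 2, "preference": 2, "interests": 3,
    "attitudes_beliefs": 3, "behavior": 2, "appearance": 3,
}
print(tmf_score(example))
```

For the unipolar format used in Study 1, the same function could be applied separately to the six masculinity ratings and the six femininity ratings.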

Overview of the Present Research

We validated the TMF in various ways. First, we conducted an item analysis and a factor analysis. As suggested by findings reported by Bem (1979), Constantinople (1973), and Spence et al. (1975; see Introductory Section), the TMF's items should load on one factor and tap a one-dimensional masculinity-femininity construct. Hence, we expected the TMF to measure a one-dimensional gender-role self-concept (Hypothesis 1).

Validation by Using the Known-Groups Approach

Based on the idea that gender differences are not a thing of the past, as indicated in the introduction, a valid masculinity and femininity scale should show these gender differences. Therefore, we expected men and women to differ considerably on self-ascriptions on the TMF, with men being more masculine and less feminine than women (Hypothesis 2).

Moreover, a valid masculinity and femininity scale should show differences between people differing in sexual orientation. The essence of gender stereotypes of straight women and men is that they conform to traditional gender roles (e.g., Kite and Deaux, 1987; Kite and Whitley, 1996; Madon, 1997; Blashill and Powlishta, 2009). Lay people expect straight women to be more feminine and less masculine than lesbians, and straight men to be more masculine and less feminine than gay men. Similarly, straight women's and men's self-ascriptions are, on average, more gender-typed than those of lesbians and gay men (see meta-analysis by Lippa, 2005). Bisexual women were found to score on masculinity-femininity in between lesbians and straight women (Lippa, 2005). Therefore, we used the known-groups approach as an established method for testing a scale's validity (e.g., Howitt and Cramer, 2008). We expected lesbians' self-ascriptions on the TMF to be less feminine and more masculine compared to straight women (Hypothesis 3a). Bisexual women should score in between (Hypothesis 3b). Additionally, we expected straight men's self-ascriptions to be more masculine and less feminine compared to gay men (Hypothesis 3c).

Because straight women and men conform to gender roles more than lesbians and gay men, comparing lesbians and gay men constituted a stricter test of the TMF. Consistent with Hypothesis 2 and gender self-stereotyping but contradictory to implicit gender inversion theory (Kite and Deaux, 1987; which we turn to in General Discussion), we hypothesized lesbians to be more feminine and less masculine than gay men (Hypothesis 4).

The idea that differences in “core” masculinity and femininity underlie differences in lesbians' and gay men's vs. straight women and men's self-ascriptions in gender typicality can formally be conceived as masculinity-femininity mediating the relationship between sexual orientation and responses on scales such as the BSRI (Hypothesis 5).

Validation by Implicit and Explicit Gender-Related Measures

A common critique of self-report measures is that they could reflect differences in social desirability more than “true” underlying differences in traits. Using implicit measures relying on response-time differences, such as an Implicit Association Test (IAT), may minimize this problem (Greenwald et al., 1998). Implicit measures are assumed to assess the impulsive system: Habitual, repeated, long-term associations between concepts (Strack and Deutsch, 2004), including self-related concepts (e.g., Steffens and Schulze-Koenig, 2006). We expected lesbians to associate themselves more with masculine and less with feminine than straight women on such an implicit measure (Hypothesis 6).

Adults' masculinity-femininity is related to (recalled) gender conformity during adolescence (e.g., Safir et al., 2003) and childhood (e.g., Lippa, 2008). Thus, gender-role instruments for assessing current traits and behaviors as well as recalled gender-typical behaviors, preferences, and interests during childhood were also suitable for testing convergent validity. We assumed all these characteristics to show moderate correlations with the TMF (Hypothesis 7).

Additionally, we expected the TMF to predict sexual orientation within one gender group better than other gender-related scales. We assumed the TMF to outperform other gender-related scales when predicting sexual orientation of women and men (Hypothesis 8).

Hypotheses Based on Criterion Validity

As indicated above, lay people use gender-typicality as an indicator for judging someone's sexual orientation (Rieger et al., 2010; Valentova et al., 2011). People self-reporting gender-typical characteristics are likely to be perceived as straight, whereas people who do not display such characteristics are more likely to be perceived as lesbian or gay on pictures, videos, and speech recordings. Hence, targets who are perceived as straight could be those who self-describe as gender-typical in masculinity-femininity ratings (Hypothesis 9).

Additionally, there is some evidence that voice pitch characteristics, also called fundamental frequency features, of lesbians and gay men are shifted toward what is typical for straight men and straight women, respectively. Generally, compared to straight women, straight men show voice pitches that are lower on average, in variability, and in range (e.g., Pierrehumbert et al., 2004; Munson and Babel, 2007). Average voice pitch has been found to be lower in straight compared to gay men (Baeck et al., 2011) and higher in straight women compared to lesbians (Camp, 2009). Hence, we assumed gender-typical masculinity-femininity self-ratings to be reflected in gender-typical patterns of voice pitch characteristics (Hypothesis 10).

Furthermore, contact frequency of straight women and men with lesbians and gay men is linked to attitudes toward them (e.g., Swank and Raiz, 2010): A lower contact frequency is connected to more negative beliefs about lesbians and gay men. One belief about lesbians and gay men is that they transgress gender roles, on average (e.g., Kite and Whitley, 1996). It thus seems plausible that people who are more gender-typical themselves are those who have less contact with lesbians and gay men and hold more negative beliefs. Hence, we assumed gender-typical masculinity-femininity self-ratings to be connected to more current contact with straight women and men and less current contact with lesbians and gay men (Hypothesis 11).

Hypotheses Concerning Test-Retest Reliability and Predictive Validity

Finally, the TMF was expected to show at least moderate test-retest reliabilities given that people were re-invited after a 1-year period (Hypothesis 12). From a scale validation perspective, it is desirable to present analyses in which the predictor is truly assessed before the criterion. Therefore, we expected at least moderate predictive validity for other gender-related features at second measurement (Hypothesis 13).

Pilot Study

The pilot study had two aims. First, we tested the factor structure of the scale's version that contained six bipolar items. We assumed the TMF items to load on one factor (Hypothesis 1). Additionally, we wanted to determine the appropriateness of every single item by using an item analysis. Second, we assessed the scale's validity using a known-groups approach (Hypothesis 2).

Methods

At the end of an online survey that had a different purpose, participants filled in the 6-item version of the TMF (see Appendix in Supplementary Material) and indicated their gender (response options: male, female, both, none, no response). Overall, 319 participants finished the study. Thirteen of them were excluded from further analysis because they described themselves as both male and female or neither, or because they did not disclose their gender. Data from 188 women and 118 men were used for analysis. Their age ranged from 18 to 41 (M = 23.6, SD = 3.1). They were students of different majors from different German universities (specifically, in Thuringia). Participants received no compensation for participation. Approval for all studies reported in this paper was obtained from the board of ethics (= human subjects committee) of the School of Humanities and Social Sciences at the Friedrich-Schiller-University of Jena. All studies were carried out in accordance with its recommendations, with written informed consent obtained from all participants in accordance with the Declaration of Helsinki.

Results

In order to check for one-dimensionality of the TMF, an exploratory principal axis factoring (PAF) was conducted. Sample adequacy was confirmed by a Kaiser-Meyer-Olkin (KMO) criterion of 0.87. All items were suitable for factor analysis as indicated by item-specific KMO values >0.79 and moderate to high communalities (0.57–0.88). According to a graphical scree-plot analysis, a one-factor solution was confirmed. There was a steep decline of explained variance from factor one (77%) to factor two (10%). Each of the six items was represented well by the factor (factor loadings ranged from 0.75 to 0.94).

Reliability of the TMF was high (Cronbach's α = 0.94). As indicated by the coefficients in Table 1, no items needed to be deleted to improve reliability. Item-specific homogeneity was high and ranged from 0.66 to 0.72 (see Table 1). Corrected item-total correlations ranged from 0.72 to 0.91, suggesting that each item represented the scale well. Moreover, item “difficulties” (i.e., item means rescaled to a 0–1 range) ranged from 0.51 to 0.59. Accordingly, every item received almost equal masculinity and femininity ratings, indicating that averaged across the sample containing women and men, items received “androgynous” responses, as one would expect. When computing item “difficulties” separately for each gender group, findings pointed in the expected directions: “Difficulties” ranged from 0.18 to 0.35 for the male sample, indicating “masculine” responses, and from 0.60 to 0.85 for the female sample, indicating “feminine” responses.
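
The reliability and “difficulty” indices reported here can be sketched in a few lines of Python. The sketch uses the standard formula for Cronbach's α; the “difficulty” function rests on the assumption that an item's difficulty is its mean rescaled from the 1–7 response range to 0–1, which is one common convention for rating scales and is consistent with the values reported, but is not spelled out in the text. The function names and the toy data are hypothetical.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items table (list of lists).

    Note: undefined (division by zero) if all respondents have the same total.
    """
    k = len(rows[0])                       # number of items
    items = list(zip(*rows))               # one tuple of ratings per item
    item_variance_sum = sum(variance(column) for column in items)
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


def item_difficulty(ratings, low=1, high=7):
    """Assumed definition: item mean rescaled to the 0-1 range."""
    return (mean(ratings) - low) / (high - low)


# Made-up ratings (4 respondents x 3 items), for illustration only.
toy = [
    [2, 3, 2],
    [5, 6, 6],
    [6, 6, 7],
    [3, 2, 3],
]
print(cronbach_alpha(toy))
print([item_difficulty(column) for column in zip(*toy)])
```

Under this assumed definition, a value near 0.5 corresponds to the midpoint of the bipolar scale, values below 0.5 to predominantly “masculine” responses, and values above 0.5 to predominantly “feminine” responses.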

Table 1. Item Characteristics of the TMF in the Pilot Study for the Whole Sample (left-hand values, n = 306) and Separately for Men (middle values, n = 118) and Women (right-hand values, n = 188).

We found the expected bimodal distribution of the TMF scores. Men and women differed significantly in terms of the scale mean, M_male = 2.56 (SD = 0.80), M_female = 5.28 (SD = 0.76), t₍₃₀₄₎ = −29.83, p < 0.001, and on every item, all ts₍₂₈₇₎ < −10.41, all ps < 0.001. With the exception of two outlier individuals, the overlap between men's and women's scores was very small (see Figure 1). According to Kolmogorov-Smirnov statistics, the TMF scores were normally distributed for men (Z = 0.99, p = 0.28) and women (Z = 0.78, p = 0.58). Predicting gender by the TMF scores in a logistic regression analysis was 97% accurate [B = 4.43, SE = 0.69, χ²₍₁₎ = 41.38, p < 0.001; Nagelkerke's R² = 0.92; Model χ²₍₁₎ = 347.87, p < 0.001].
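
As a plausibility check, the scale-level t statistic and the Wald χ² can be approximately recomputed from the summary values reported above. The sketch assumes a pooled-variance (Student's) t test and a Wald statistic computed as (B/SE)²; neither computational detail is stated in the text, and the function name `pooled_t` is hypothetical. Because the inputs are rounded, the results should land close to, but not exactly on, the reported values.

```python
from math import sqrt


def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t statistic with pooled variance, from summary data."""
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (m1 - m2) / standard_error, df


# Summary statistics as reported for men (n = 118) and women (n = 188).
t_value, df = pooled_t(2.56, 0.80, 118, 5.28, 0.76, 188)
print(f"t({df}) = {t_value:.2f}")

# Wald chi-square for the logistic regression slope, assuming (B / SE) ** 2.
wald = (4.43 / 0.69) ** 2
print(f"Wald chi-square = {wald:.2f}")
```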

Figure 1. Distribution of the TMF scores separately for men (n = 118) and women (n = 188) in the pilot study. The lines in the bars represent medians and bars indicate the range between 75th and 25th percentile. Error bars show the range of masculinity-femininity scores for non-outliers. Dots represent outlying values (1.5 SD above/below median).

Taken together, confirming Hypothesis 1, we found that the TMF tapped a one-dimensional construct, which is in line with lay ascriptions and previous findings regarding the items masculine and feminine. All factor loadings were similar (Δ < 0.2), so that an unweighted additive overall score was justified (Bortz and Döring, 2006). Its single items represented the overall scale very well and were strongly connected to each other. Hence, no item had to be excluded due to low item-specific homogeneity (Bortz and Döring, 2006). Moreover, confirming Hypothesis 2, the TMF was shown to discriminate between women and men at the scale and at the item level. Therefore, we kept all items in the TMF.

Study 1

The aim of Study 1 was to test the one-dimensionality, reliability, and validity of the TMF. We used a known-groups approach, with lesbians, bisexual, and straight women, to assess which of several gender-related scales is best in differentiating between these groups. In addition to the TMF, we used the BSRI as the gold standard in gender-related assessment. We also used the Gender Role Behavior Scale (GRB, Athenstaedt, 2003) and a newly created measure of childhood gender conformity (see Appendix in Supplementary Material). Moreover, an Implicit Association Test (IAT, Greenwald et al., 1998) was used to measure implicit associations of self with masculine vs. feminine.

We assumed that the TMF would reflect a one-dimensional masculinity-femininity construct (Hypothesis 1). Furthermore, we expected that on each measure, straight women would score higher on femininity and/or lower on masculinity as compared to lesbians (Hypothesis 3a). Bisexual women should score in between (Hypothesis 3b). Additionally, on an IAT (see below for details), we assumed straight women to associate more with feminine and less with masculine than lesbians (Hypothesis 6). Gender-related measures should be correlated with each other (Hypothesis 7), and scores on each measure should predict sexual orientation. We also tested the incremental validity of the TMF over the other measures. The TMF should predict sexual orientation better than other gender-related scales (Hypothesis 8). Finally, the TMF should measure a higher-order factor “core” masculinity-femininity that mediates effects of sexual orientation on other gender-related scales (Hypothesis 5). If women differ in masculinity-femininity based on their sexual orientation, indirect effects of the more specific masculinity-femininity related measures via the TMF on sexual orientation should be observed.

Method

Participants

Participants were 126 women from Germany and Luxembourg who took part in the study voluntarily and without compensation. Their age ranged from 19 to 47 years (M = 31.13, SD = 8.52). Participants were recruited either at the University of Trier or by a snowball technique. Given their scores on a Kinsey-like scale, they were divided into three groups of 47 straight women (Kinsey scores: 6–7), 32 bisexual women (3–5), and 47 lesbians (1–2). Most of the women were well educated, with 50% possessing university entrance qualifications and 40% holding a university degree. With α = 0.05 and N = 126, based on Cohen's (1977) conventions, medium-size regression coefficients (f ² = 0.15) could be detected with a statistical power of 1 − β = 0.95 in a multiple linear regression with six predictors (Faul et al., 2007).

Materials

Implicit association test

In essence, IATs comprise two combined tasks in which stimuli that belong to four concepts are mapped onto two responses in different ways. IATs are based on the following idea: If someone is able to react relatively fast when two concepts share a response, these concepts appear to be associated for that person. In detail, stimuli were presented that represent the concepts self, others, feminine, and masculine. In one task, stimuli representing self or feminine required one response, and stimuli representing others or masculine required the other response (e.g., left vs. right key press). In the other task, stimuli representing self or masculine required one response, and stimuli representing others or feminine required the other response. A person considering herself feminine should be able to react faster in the self-feminine/others-masculine than in the self-masculine/others-feminine task.

We labeled one dimension for the IAT “typically feminine” vs. “typically masculine.” The associated attributes presented were feminine, female vs. masculine, male (in German: feminin, weiblich; maskulin, männlich, see Steffens et al., 2008). The other dimension was “self” vs. “others.” The stimuli on that dimension were synonyms of the superordinate concepts (me, self vs. you, others; in German: Ich, Selbst; Du, Andere). Participants were informed that concepts would be displayed throughout at the top left or right screen corner. Their task during the IAT would be to sort words belonging to these concepts by pressing the respective response key on the left or right as quickly as possible. A stimulus word would appear (e.g., feminine) after which participants would respond by pressing the appropriate key (e.g., left for typically feminine). The word would then be replaced by the next stimulus (e.g., me). Participants would again select the appropriate key (e.g., left for self). Each crucial, combined task consisted of four blocks of 62 trials. The order of the eight stimuli was randomized within each block, and the same eight stimuli were presented over and over. The reaction-stimulus interval was 200 ms. Missing reactions and errors led to an appropriate visual feedback (e.g., in case of errors, F! was shown for 200 ms). Participants received feedback on errors and reaction times after each block (e.g., given 10% errors or more: “You committed many errors. Please react more slowly and more correctly.”).

The IAT effect was computed similarly to the IAT D effect (Nosek et al., 2005), except that no “error penalty” was used (see Steffens et al., 2008). Specifically, the reaction time difference between the self-feminine/others-masculine and the self-masculine/others-feminine task was computed and divided by each individual's standard deviation across both tasks. To avoid the artificially high scores obtained with very long scales, internal consistency was estimated based on the average reaction time difference in response to each of the eight stimuli. In other words, the IAT was treated as an eight-item scale (following Steffens and Buchner, 2003). All internal consistencies are presented in Table 2.
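As a rough illustration of the scoring just described, the following Python sketch computes a D-like effect without an error penalty and Cronbach's α over per-stimulus difference scores. The function names, the input layout, the sign convention (positive values mean faster responses in the self-feminine/others-masculine task), and the use of the sample standard deviation are assumptions made for illustration; they are not taken from the original analysis.

```python
from statistics import mean, stdev, variance


def iat_effect(rt_self_feminine, rt_self_masculine):
    """D-like IAT effect without an error penalty (illustrative sketch).

    Both arguments are one participant's reaction times (ms) in the
    self-feminine/others-masculine and self-masculine/others-feminine tasks.
    Assumed sign convention: positive = faster in the self-feminine task.
    """
    # Standard deviation of this participant's latencies across both tasks.
    sd_both_tasks = stdev(list(rt_self_feminine) + list(rt_self_masculine))
    return (mean(rt_self_masculine) - mean(rt_self_feminine)) / sd_both_tasks


def cronbach_alpha(item_scores):
    """Cronbach's alpha for rows of participants and columns of items.

    Here each "item" would be the reaction time difference for one of the
    eight stimuli, so every row holds eight values.
    """
    k = len(item_scores[0])
    # Variance of each item across participants.
    item_variances = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the participants' sum scores.
    total_variance = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

Treating the eight stimuli as items keeps the scale length fixed at eight regardless of the number of trials, which is the reason given above for not estimating consistency from individual trials.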

Table 2. Internal Consistencies (Cronbach's α, with number of items) and Correlations between Measures in Study 1.

Bem sex-role inventory

We translated the English short version of the BSRI (Bem, 1979) into German. It consisted of 30 items: 10 for the Masculinity Scale (e.g., self-reliant, ambitious), 10 for the Femininity Scale (e.g., warm, tender), and 10 neutral items, each rated on a 7-point scale from 1 (never applies) to 7 (always applies). Participants were asked to rate the extent to which each trait described them.

Traditional masculinity-femininity

The TMF was used as described in the Section Scale Construction with two unipolar dimensions, masculinity and femininity (12 items overall, see Appendix in Supplementary Material).

Childhood gender role behavior (CGRB)

Five items with a 7-point scale were used to measure whether participants remembered having been rather feminine, or rather typical girls, during childhood (see Appendix A2 in Supplementary Material). For example, we asked whether they had played with girls and played girls' games, and whether they had liked wearing skirts and dresses.

Sexual orientation

As indicated in Section Participants, participants' sexual orientation was assessed using their responses to the item “Regarding sexual orientation, I identify as …” on a Kinsey-like scale from 1 (exclusively lesbian) to 7 (exclusively straight). This was also the first item of a translated version of the Assessment of Sexual Orientation Scale (Coleman, 1987). Several additional items were originally used (sexual behavior: gender of partner and ideal partner; sexual fantasies; and emotional bindings). To be consistent with Study 2, we used only the first item to group participants as lesbians (scores 1–2), bisexual women (scores 3–5), and heterosexual women (scores 6–7). The first item also correlated highly with the overall scale (r = 0.95), corroborating the decision to use only one item.
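The grouping rule can be written out as a small Python sketch. The cut-offs are those stated above; the function name and the group labels are hypothetical and serve only as illustration.

```python
def orientation_group(score):
    """Map a response on the 7-point Kinsey-like item to a group label.

    Cut-offs follow the text: 1-2 lesbian, 3-5 bisexual, 6-7 heterosexual.
    The label strings are illustrative, not taken from the original data.
    """
    if not 1 <= score <= 7:
        raise ValueError("score must lie between 1 and 7")
    if score <= 2:
        return "lesbian"
    if score <= 5:
        return "bisexual"
    return "heterosexual"
```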

Gender role behavior scale

Participants rated themselves on a 7-point scale ranging from 1 (not at all typical) to 7 (very typical) on 52 everyday typically feminine or masculine behaviors (GRB, Athenstaedt, 2003; e.g., “watch soap operas,” “change light bulbs”).

Procedure

Participating students were tested at the University of Trier in a lab cubicle equipped with an iMac. The participants recruited via the snowball technique were tested individually in their homes or offices (as they wished) using an iBook. The instructions, the implicit tests, and the questionnaires were presented by a custom-written HyperCard computer program. Initially, participants were asked to report their age, educational background, and size of hometown. Then, they started with the IAT. IAT task order was held constant because of the correlational nature of the study (see e.g., Banse et al., 2001, for discussion). All participants did the self-masculine/others-feminine task first. After the IAT, the questionnaires were presented in the order described in the Materials Section; accordingly, data for the TMF was collected before all other scales. Finally, participants were debriefed and thanked.