ORIGINAL RESEARCH article

Front. Educ., 29 August 2023
Sec. Assessment, Testing and Applied Measurement
This article is part of the Research Topic Insights in Assessment, Testing, and Applied Measurement: 2022

Factor structure and invariance of the scale to measure teaching performance in the area of social sciences

  • 1Autonomous University of Baja California, Ensenada, Mexico
  • 2Promoter Network of Diagnostic Evaluation Methods and Educational Innovation, Ensenada, Mexico

The use of scales to evaluate teaching from the students’ perspective is a method frequently used in educational systems around the world. The objective of this study is to analyze the factorial structure of the Teaching Performance Evaluation Scale (EEDDocente, for its acronym in Spanish), designed to provide information that supports evidence-based decision-making for the improvement of teaching in the area of Social Sciences, and to measure its invariance by School stage and Educational program. The sample consisted of 1,849 students of the Bachelor’s Degrees in Law, Psychology, Accounting, Administration, Education Sciences, Communication Sciences, Computer Science, and Sociology of the School of Administrative and Social Sciences (FCAyS) of the Autonomous University of Baja California, Mexico. Based on a three-factor model that meets the fit and quality criteria, a Multi-group Confirmatory Factor Analysis (MGCFA) was performed to measure the invariance of the EEDDocente by School stage and Educational program. It is concluded that the three-factor model can be used to measure, from the students’ perspective, the performance of teachers in the area of Social Sciences. Likewise, it is concluded that measurement invariance across groups is achieved, providing evidence to support mean-difference analyses between the different Educational programs.

1. Introduction

The evaluation of teaching in Higher Education Institutions (HEIs) is one of the most relevant components linked to the goal of improving educational quality (Calatayud, 2021; Torquemada, 2022; Bleiberg, 2023; Silva, 2023). The measurement of the effectiveness of teaching practice occupies a central place in HEI strategies, as it generates information on the teaching and learning process that serves as an input for charting routes to improve the quality, relevance, effectiveness, and accountability of education systems around the world (Chen and Hoshower, 2003; Liebowitz, 2021; Seivane and Brenlla, 2021; Camacho, 2022; Zhao et al., 2022). In addition, it is a crucial input for the improvement of and feedback on teacher performance in its multiple dimensions, thus serving the formative function of this process (Marsh, 2007; Luna and Torquemada, 2008; Liebowitz, 2022; Silva et al., 2022).

Around the world, accountability and the growing demand to ensure improved learning among the future professionals graduating from universities have placed the evaluation of teaching performance at the center of educational policies (Vaillant, 2016; Liebowitz, 2022). At the same time, there is growing interest in the methodological aspects, techniques, and instruments for collecting information (questionnaires, attitude scales, interviews, focus groups, classroom observation), and in which respondents (students, teachers, managers, external experts) are best suited to provide reliable, valid, sufficient, and relevant data on the evaluation of teaching (Marsh, 1984; Cruz Ávila, 2007; Romero and Martínez, 2017; Zamora, 2021; Bleiberg, 2023).

In particular, there is growing concern about the use of Students’ Evaluations of Teaching (SET) to make high-impact or consequential decisions in processes such as promotion, tenure, and the awarding of bonuses and incentives (Boring et al., 2016; Hornstein, 2017; Wang and Guan, 2017; Benton, 2018; Ching, 2018; Mitchell and Martin, 2018; Bazán-Ramírez et al., 2021). Likewise, the purposes of teacher performance evaluation have mainly been oriented toward conditioning the hiring or dismissal of academic staff and deciding who receives an economic incentive or job promotion based on the evaluation results (Stroebe, 2016; Gómez and Valdés, 2019). However, more and more decision makers, education systems, and HEIs see an opportunity and advantage in using relatively brief SETs, at mid- or end-of-course, to provide formative feedback on teacher performance and competencies (Marsh, 2007; Silva et al., 2022).

The study of university teacher performance began to become widespread internationally in the 1980s, as part of the accountability processes derived from changes in government policies for financing higher education (Cisneros-Cohernour and Stake, 2010; Zamora, 2021; García-Olalla et al., 2022). The evaluation of teaching performance has its genesis in the first student learning assessment systems in the United States during the 1920s; by the second half of that decade, learning assessment served as a tool to evaluate teaching (Alcaraz-Salarirche, 2015; Zhao et al., 2022). For its part, the SET was an innovation in HEIs in the United States, a consequence of the consumer orientation of the capitalist system: students, as users of the educational service, are the ones who should evaluate it (García, 2000). During the 1980s and early 1990s, numerous studies were carried out on the subject, for example, those of Cohen (1981, 1983), Feldman (1988, 1989a,b, 1990, 1992, 1993), and Marsh (1984, 1986, 1987, 1993, 2007).

Teacher performance evaluation must maintain high and solid technical quality standards to fulfill its main purpose, which is linked to improving teaching practices and student learning. However, in the mid-1990s, studies began to emerge that questioned the reliability and usefulness of quantitative instruments to evaluate teaching (Theall et al., 2001; García, 2014; Boring et al., 2016; Benton, 2018; Ching, 2018). Among the most recurrent criticisms are that the SET presents problems of logic and structure in the components and characteristics used to test teaching effectiveness, that it elicits random and biased responses, and that it rests on students’ subjective judgment of teaching (Stroebe, 2016, 2020; Wang and Guan, 2017; Ching, 2018; Zhou and Qin, 2018; Gu et al., 2021).

Despite these criticisms, it is undeniable that scales and questionnaires have been the most widely used mechanism to evaluate university teachers (Wang and Guan, 2017; Zamora, 2021), and that questionnaires are viable instruments to measure the effectiveness of teaching in HEIs (García, 2014; Mohammadi, 2021). However, evaluation questionnaires must maintain validity, reliability, relevance, and pertinence for their uses and consequences in the educational context (Messick, 1995; Kane, 2006; International Test Commission, 2013; Spooren et al., 2013; American Educational Research Association, 2018; Reyes et al., 2020; Lera et al., 2021).

The evaluation of the quality of teaching practice through experience, certifications, academic degrees, and other such factors shows little correlation with the effectiveness of teaching performance (Williams and Hebert, 2020). Thus, the evaluation of teaching based on the perception of students currently plays a preponderant role in processes to improve the quality of teaching in universities (Aleamoni, 1999; Salazar, 2008; Mohammadi, 2021; Zamora, 2021). The SET gives HEIs a reference for the improvement of teaching practice, as long as the performance measures maintain a high level of objectivity and methodological rigor and are related to the dimensions in which the academic objectives of education systems are implemented (House, 1998; Navarro and Ramírez, 2018; Seivane and Brenlla, 2021). At this point, it is important to mention that most SET criteria or dimensions are defined by committees of specialists, who base their work on models of indicators of teaching quality and effectiveness, but with a strong influence of philosophical and pedagogical principles and of each HEI’s policies and regulations governing the functions of academic staff.

Among the first syntheses of criteria to design SET are those proposed by Feldman (1976) and Hildebrand et al. (1971). By analyzing students’ points of view, Feldman (1976) proposed three categories for effective teaching: (a) Presentation, which includes teachers’ enthusiasm for teaching, their knowledge of the subject, and clarity of presentation and organization of the course; (b) Facilitation, which refers to the availability of teachers for consultation, respect for students, and their ability to encourage students to achieve learning in class; and (c) Regulation, which includes teachers’ ability to set clear objectives, classroom management skills, appropriateness of course materials and activities, and fairness in student assessment and feedback. For their part, Hildebrand et al. (1971) and Hildebrand (1973) propose five factors to measure teaching effectiveness: (a) Analysis and synthesis skills, which refers to the teacher’s mastery of class content; (b) Clarity and organization, which refers to the teacher’s ability to present course topics; (c) Interaction with the group, which refers to the teacher’s ability to interact with students and maintain the active participation of the group; (d) Interaction with each student, which refers to the teacher’s ability to establish trust and respect with each individual student; and (e) Dynamism and enthusiasm, which refers to the teacher’s enthusiasm and pleasure in teaching the subject. More recently, authors such as Marsh (1987), Marsh and Dunkin (1997), Richardson (2005), and Schellhase (2010) have proposed models of nine to 10 criteria (assignments and readings, breadth of coverage, examinations and grading, group interaction, individual rapport, instructor enthusiasm, learning and academic value, organization and clarity, workload and difficulty, and summative evaluation) to assess the quality of instruction.

In Ibero-America, several authors have focused on the design and validity of measures of teacher performance through scales based on various criteria models of teaching effectiveness and quality; in particular, on obtaining the psychometric properties of the measurement instruments and on evidence of their internal consistency and reliability. In this sense, the studies by García-Gómez-Heras et al. (2017) in Spain, Estrada et al. (2019) in Nicaragua, Bazán-Ramírez et al. (2021) in Peru, and Márquez and Madueño (2016) and Bazán-Ramírez and Velarde-Corrales (2021) in Mexico are noteworthy. García-Gómez-Heras et al. (2017) focused on revealing which professors’ behaviors were most appreciated by first-year students of the degrees taught at the Faculty of Health Sciences of the Rey Juan Carlos University of Madrid (Medicine, Nursing, Physiotherapy, Dentistry, Psychology, and Occupational Therapy). The authors applied the questionnaire developed by Tuncel (2009) on the behaviors of university teachers that influence the academic performance of students. This questionnaire is made up of 48 items organized into six factors: (a) Emotional aptitude of university teachers, (b) Teacher-student interaction, (c) Achievement of educational objectives, (d) Theory-practice relationship, (e) Organization and planning, and (f) Feedback.

Likewise, Estrada et al. (2019; see also Gómez and Valdés, 2019) conducted a study to establish the psychometric properties of the Opinion Questionnaire for the Evaluation of Teaching Performance (OQETP), composed of 38 items and focused on evaluating teaching performance from the students’ perception, at the National University of Trujillo, Nicaragua. The OQETP items are presented on a Likert-type scale with five response categories and are organized into eight dimensions: (a) Formal Responsibility, (b) Methodology, (c) Communication, (d) Materials, (e) Attitude, (f) Evaluation, (g) Motivation, and (h) Satisfaction.

In Peru, Bazán-Ramírez et al. (2021) analyzed the factorial structure of the Teaching Performance Scale for Psychology Teachers (EDDPsic) and measured its invariance between groups (according to gender, age, and academic stage). This instrument was designed based on the model of five didactic performance criteria (Carpio et al., 1998; Silva et al., 2014). In total, the EDDPsic is made up of 18 items (K = 18) organized into five subscales: (a) Competence Exploration (k = 3), (b) Criteria Explanation (k = 5), (c) Illustration (k = 3), (d) Feedback (k = 4), and (e) Evaluation (k = 3). Their study involved 316 Psychology students, from basic cycles (fourth and sixth semesters) and disciplinary-professional cycles (eighth and tenth semesters), from two public universities in Peru. They also performed a Multigroup Confirmatory Factor Analysis (MGCFA) with the five-factor model that showed the best fit indices. Based on their results, they determined the invariance of the scale measure across the three study variables (age, sex, and academic stage), for which the participants were divided into independent groups. The results revealed adequate fit for the Configural model in each of the three variables (χ2 p > 0.05, CFI < 0.01, RMSEA ≤ 0.06), so the structure of the model was considered the same for each group. Similarly, evidence of factorial invariance was obtained for the Weak (M1), Strong (M2), and Strict (M3) models for age (M1: ΔCFI = −0.004; M2: ΔCFI = −0.001; M3: ΔCFI = −0.001) and gender (M1: ΔCFI = −0.001; M2: ΔCFI = −0.001; M3: ΔCFI = −0.001). In the case of the academic stage variable, evidence of invariance was obtained for the Weak and Strong models (M1: ΔCFI = −0.004; M2: ΔCFI = −0.000) but not for the Strict model (M3: ΔCFI = −0.018).

In Mexico, Márquez and Madueño (2016) analyzed the psychometric properties of an instrument made up of 16 items (K = 16) applied to students at a university in Sonora to gather their opinions on the basic competencies of teachers of undergraduate courses. From the 30,224 questionnaires answered, the construct validity of the instrument was determined using the principal components method with Varimax rotation, extracting two factors: (a) Pedagogical mediation (k = 11) and (b) Teaching attitudes (k = 5). For their part, Bazán-Ramírez and Velarde-Corrales (2021) evaluated the performance of teachers and students in their didactic interactions through the self-reports of 124 psychology students in Mexico. The authors obtained evidence of convergent and divergent construct validity for five didactic performance criteria, for both the teacher and the student, by means of EFA and CFA. The validation confirmed the theoretical structure of five factors corresponding to the five dimensions: (a) Exploration of competencies, (b) Explicitness of criteria, (c) Illustration, (d) Feedback, and (e) Evaluation, derived from the models of didactic performance (Carpio et al., 1998; Silva et al., 2014). The authors also performed descriptive analyses of the students’ responses to the didactic performance criteria according to their academic stage, age, and sex.

In summary, the models of criteria and instruments to evaluate teaching in HEIs present a wide diversity of components and characteristics, and these instruments generally present acceptable psychometric properties of reliability and validity. However, most of them comprise a large number of criteria and items, which results in instruments that can support a deeper and more granular diagnosis but are difficult to apply where students must repeatedly answer an instrument for each of their teachers at the end of each school year and throughout their university studies. It is also important to highlight that most measurement models based on more than five criteria do not satisfactorily meet all the necessary technical quality criteria. In this regard, several authors mention that one of the problems of SET is that multidimensional models that try to cover a large number of theory-based criteria present internal structure problems (Stroebe, 2016, 2020; Ching, 2018). This is explained to some extent by universities including in their teacher evaluations criteria that refer to affective components, such as student satisfaction with the class, interest in the subject content, and the teacher’s capacity for empathy, among others. So far it can be concluded that instruments for measuring the effectiveness and quality of teaching that seek to include a wide variety of criteria present problems of logic and internal structure, as well as difficulties of application in evaluation strategies that require students to respond repeatedly at a specific time in the school year. Another important problem is that most SETs evaluate different criteria, making it impossible to conduct comparative studies that help evaluate policies to improve the quality of teaching across different educational programs, schools, and universities.

This paper analyzes the psychometric properties and the construct validity evidence of internal structure and invariance of the Teaching Performance Evaluation Scale (EEDDocente, for its acronym in Spanish), applied at the middle of each school stage to assess the performance of each of the teachers in the different educational programs of the School of Administrative and Social Sciences (FCAyS, for its acronym in Spanish) of the Autonomous University of Baja California (UABC, for its acronym in Spanish). The EEDDocente is applied biannually with the purpose of identifying strengths and weaknesses of teaching performance from the students’ perspective, and thus providing feedback on teaching and informing the design of teacher training and updating courses (Cashin and Downey, 1992; Liebowitz, 2021; Zamora, 2021; Silva et al., 2022).

Despite the variety of instruments for the evaluation of teaching practice, the relevance of the EEDDocente lies in its purpose, design, and objective, which seek to maintain coherence between the instrument and the use of its results (Stroebe, 2016; Estrada et al., 2019; Gómez and Valdés, 2019; Aravena-Gaete and Gairín, 2021). The EEDDocente was designed to provide information to identify teachers’ needs for updating and continuous training, and to influence the improvement of performance and teaching practices at the classroom level. Among its specific characteristics, the EEDDocente focuses on student-centered teaching and, on that basis, the information provided by the scale seeks to generate processes of reflection on teaching practice and a change in teachers’ conception and vision of how university teaching is carried out (Tomás-Folchy and Durán-Bellonch, 2017).

However, there is no evidence related to the internal structure and invariance of this instrument. This paper aims to address this problem and contribute to the existing literature by analyzing the internal structure of the three-factor model of a reduced version (K = 15) of the EEDDocente, which is based on categories from solid theoretical models: (a) Classroom organization, (b) Teaching quality, and (c) Learning assessment/feedback (Hildebrand et al., 1971; Feldman, 1976; König et al., 2017; Nasser-Abu, 2017; Chan, 2018; Bazán-Ramírez et al., 2021; Henríquez et al., 2023). Likewise, invariance analysis makes it possible to rule out systematic student bias when teaching is evaluated across the different educational programs in which students are enrolled.

2. Method

2.1. Participants

We analyzed the responses of a focused sample of 1,849 students, out of a total of 4,226 enrolled in the FCAyS of the UABC, who participated in the internal teaching performance evaluation strategy 2022–1 (conducted in the first semester of the year). To select the sample of participants, the FCAyS Teaching Evaluation Coordination randomly selects, during the second semester of each school stage, two groups from each semester of the eight current educational programs (Law, Psychology, Accounting, Business Administration, Educational Sciences, Communication Sciences, Computer Science, and Sociology), one from the morning shift and one from the afternoon shift. In addition, it randomly chooses four groups of the Common core of the Areas of knowledge of Legal Sciences, Accounting and Administrative Sciences, and Social Sciences, two from the morning shift and two from the afternoon shift. Table 1 shows the distribution of the study sample by Educational program, Common core, and Area of knowledge. Note that the number of participants differs widely across Areas of knowledge; in particular, between the Area of Legal Sciences, with 366 participating students, and the other two Areas of knowledge, in each of which almost twice as many students participated. Likewise, there is a considerable difference between the samples of participating students per School stage [Basic stage (1st and 2nd semester) N = 632, Disciplinary stage (3rd, 4th, 5th, and 6th semester) N = 816, Terminal stage (7th and 8th semester) N = 392].

Table 1. Distribution of the sample of participating students by Educational program, Common core and Area of knowledge of the FCAyS.

2.2. Measurement

The Scale for the Evaluation of Teaching Performance (EEDDocente) was designed by the coordinators of teacher evaluation of the FCAyS (Henríquez et al., 2017, 2018; Henríquez and Arámburo, 2021) with the purpose of providing, at the middle of each school stage, relevant information based on the opinion of the students on the performance of each teacher who teaches classes in the current educational programs, favoring continuous training and decision-making to improve teaching. In total, a student can answer the EEDDocente up to seven times, depending on the number of teachers who teach the different classes in the semester in progress. The EEDDocente is a typical-performance test made up of 25 ordered-response items (K = 25) with four categories: 1 = Strongly disagree, 2 = Disagree, 3 = Agree, 4 = Strongly agree. During the design of the EEDDocente, a committee made up of teachers and graduates of the Social Sciences area of the university where the scale is applied was formed, and it participated together with specialists in the writing of the items. This sought to ensure that the scale items were designed from a student-centered teaching approach. The items are organized into three subscales corresponding to the underlying dimensions: (a) Course organization, which refers to the teacher’s ability to explain in a clear and organized manner the contents of the subject matter and the objectives and activities in class, as well as to use didactic strategies appropriately to awaken the students’ interest in the learning objectives; (b) Quality of teaching, which refers to the teacher’s ability to relate the contents of the subject matter to those of other classes, encourage group participation in class activities, establish norms of coexistence in the classroom, and make adjustments to favor the achievement of the group’s learning objectives; and (c) Evaluation and feedback of learning, which refers to the teacher’s ability to apply strategies for the evaluation of and feedback on learning with a formative approach, differentiate between students who learn more and less easily, adapt their teaching strategies and forms of evaluation, establish forms of evaluation related to real-life problems, and show openness to corrections and adjustments regarding disputed grades or errors. As a foundation, the EEDDocente is based on consolidated multidimensional conceptual models commonly reported in the literature on the evaluation of teaching by students (Marsh, 1984, 1993, 2007; Feldman, 1988, 1993; Centra, 1993; Braskamp and Ory, 1994; Arreola, 2007; Fink, 2008; Bazán-Ramírez et al., 2021). Table 2 shows the items that make up the three subscales of the EEDDocente.

Table 2. Items of the Scale for the Evaluation of Teaching Performance (EEDDocente).

2.3. Procedure

The protocol and procedure for applying the instrument were approved by the FCAyS-UABC Management and supervised by the FCAyS Teaching Evaluation Coordination, in accordance with current institutional research ethics standards. It should be noted that the application of the EEDDocente is part of the internal teaching performance evaluation strategy of the FCAyS, administered at the middle of each school cycle by the Teaching Evaluation Coordination of said faculty. In particular, the application is carried out with the support of students who provide their professional social service and who are previously trained to administer the evaluation instrument in the classroom. The students to whom the instruments are administered are informed beforehand of the objectives and procedures of the evaluation strategy, and of the confidentiality, safeguarding, and use of their answers to promote the continuous training of teachers, research, and decision-making to improve the performance of teachers who teach classes at the FCAyS. On this occasion, the EEDDocente was administered during school hours in each of the classrooms of the 80 randomly selected groups that make up the sample. On average, the explanation of the purpose of the teacher evaluation, the instructions, and the application of the EEDDocente lasted 25 min. In addition, at the end of the application, students were encouraged to answer all the items on the scale.

2.4. Data analysis

The data analysis is organized in four stages: (1) purification of the database, descriptive statistics, and elimination of atypical cases; (2) verification of the preliminary assumptions of normality, reliability, and linearity; (3) explained variance, measure of sampling adequacy, and analysis of the internal structure through the application of Confirmatory Factor Analysis (CFA); and (4) measurement of invariance using Multi-Group Confirmatory Factor Analysis (MGCFA). Following the recommendations of Hu and Bentler (1999) and Hirschfeld and Von-Brachel (2014), statistical analyses were performed with the support of the dplyr (Wickham et al., 2019), psych (Revelle and Revelle, 2015), lavaan (Rosseel, 2012), and semTools (Jorgensen et al., 2022) packages for the open-source software R (R Core Team, 2022), run in RStudio version 1.4.
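
As a point of reference, the toolchain described above might be assembled as in the following sketch; the file name and column layout are hypothetical, since the text does not specify them, while the packages are the ones cited.

```r
# Load the packages cited in the text
library(dplyr)
library(psych)
library(lavaan)
library(semTools)

# Hypothetical input: one row per completed scale, 25 item columns
# (Q...), plus grouping variables for School stage and Knowledge Area
eed <- read.csv("eeddocente_2022_1.csv", stringsAsFactors = FALSE)
```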

In the first stage, the database was cleaned, eliminating missing and atypical cases based on the Tukey fences test. As a result of this procedure, 1,679 of the 1,849 original cases remained, of which 549 are from the Basic stage, 748 from the Disciplinary stage, and 374 from the Terminal stage. Subsequently, the mean, standard deviation, standard error, and item-total correlation (rpbis) of each of the items, as well as the overall index and subscale indices of the EEDDocente, were estimated.
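
A minimal sketch of this stage-1 cleaning, assuming the item responses sit in columns whose names begin with Q (a hypothetical naming convention) in the data frame `eed` from the previous block:

```r
item_cols <- grep("^Q", names(eed), value = TRUE)

# Drop cases with missing item responses, then compute the total score
eed_clean <- eed %>%
  filter(if_all(all_of(item_cols), ~ !is.na(.x))) %>%
  mutate(total = rowSums(across(all_of(item_cols))))

# Tukey fences: keep totals inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q   <- quantile(eed_clean$total, c(0.25, 0.75))
iqr <- IQR(eed_clean$total)
eed_clean <- filter(eed_clean,
                    total >= q[1] - 1.5 * iqr,
                    total <= q[2] + 1.5 * iqr)

# Descriptives and corrected item-total correlations (rpbis, r.drop)
psych::describe(eed_clean[item_cols])
psych::alpha(eed_clean[item_cols])$item.stats
```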

In the second stage, the assumption of normality was tested by applying Mardia’s (1970) multivariate test of skewness and kurtosis, with an acceptance criterion of p ≥ 0.05, followed by the Kolmogorov–Smirnov test with Lilliefors correction. The skewness coefficient makes it possible to identify the tendency of participants to respond in a way biased toward one of the response categories (Vance et al., 1983), while the kurtosis coefficient indicates the degree of concentration of responses in a central area of the distribution. In the Kolmogorov–Smirnov test, if the value of p is less than α (0.05, default value), the null hypothesis that the distribution is normal is rejected (Dallal and Wilkinson, 1986). As a result of this procedure, items Q4.3, Q4.7, Q4.10, Q4.12, Q4.13, Q4.14, Q4.15, and Q4.16, which presented values well outside the recommended bounds of −1 to +1 for the skewness and kurtosis coefficients (Hair et al., 2019), were eliminated. Likewise, items Q10.7 and Q10.8, which did not meet the cut-off criterion of rpbis ≥ 0.2, were eliminated (Brown, 2012).
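
These stage-2 checks could be reproduced along the following lines; the use of the `nortest` package for the Lilliefors-corrected test is an assumption, since the text does not name the implementation used.

```r
library(nortest)  # assumed source of the Lilliefors-corrected KS test

# Mardia's multivariate skewness and kurtosis test
psych::mardia(eed_clean[item_cols], plot = FALSE)

# Univariate skewness/kurtosis against the -1/+1 bounds
desc    <- psych::describe(eed_clean[item_cols])
flagged <- rownames(desc)[abs(desc$skew) > 1 | abs(desc$kurtosis) > 1]
flagged  # candidate items for elimination (Hair et al., 2019 rule)

# Lilliefors-corrected Kolmogorov-Smirnov test on the overall index
lillie.test(eed_clean$total)
```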

For their part, global and subscale reliability were verified by estimating the standardized Rho Alpha coefficient (ρ) and McDonald’s Omega coefficient (ω) together with Cronbach’s Alpha (α) (Cronbach, 1951, 1988; McNeish, 2018; Raykov and Marcoulides, 2019). The quality criteria for the reliability coefficients were ρ ≥ 0.70, ω ≥ 0.80, and α ≥ 0.70 (Cronbach, 1951, 1988; Katz, 2006; Zhang and Yuan, 2016; Nájera-Catalán, 2019). Once the preliminary analyses and the quality criteria were taken into account, we proceeded to analyze the three-factor model [Course Organization (F1), Quality of Teaching (F2), and Evaluation and Feedback of Learning (F3)] that underlies the internal structure of the EEDDocente through the application of CFA. For this, the Weighted Least Squares (WLS) and Robust Weighted Least Squares (WLSMV) estimation methods were applied (Jöreskog and Sörbom, 1979; Brown, 2015; Kline, 2015; Gazeloglu and Greenacre, 2020). In the evaluation of the fit indices, the recommendations of Hu and Bentler (1999) and Hair et al. (2019) were followed. In particular, the fit indices and criteria were the Comparative Fit Index (CFI) ≥ 0.95, the Tucker-Lewis Index (TLI) ≥ 0.95, the Standardized Root Mean Square Residual (SRMR) ≤ 0.08, and the Root Mean Square Error of Approximation (RMSEA) ≤ 0.06 (Browne and Cudeck, 1993; Schreiber et al., 2006). For the subsequent analysis, only items with factor loadings ≥ 0.43 were considered.
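
A sketch of the reliability estimates and the three-factor CFA in lavaan follows; the item names are hypothetical placeholders for the 15 retained items, mapped onto the three subscales.

```r
# Reliability (alpha and hierarchical omega) for the retained items
psych::alpha(eed_clean[item_cols])
psych::omega(eed_clean[item_cols], nfactors = 3)

# Hypothetical placeholder item names for the 15 retained items
modelo <- '
  F1 =~ Q1 + Q2 + Q3 + Q4 + Q5         # Course Organization
  F2 =~ Q6 + Q7 + Q8 + Q9 + Q10        # Quality of Teaching
  F3 =~ Q11 + Q12 + Q13 + Q14 + Q15    # Evaluation and Feedback
'
fit <- cfa(modelo, data = eed_clean, estimator = "WLSMV",
           ordered = TRUE)
fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "tli",
                   "rmsea", "srmr"))
standardizedSolution(fit)  # check loadings against the 0.43 cut-off
```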

Finally, an MGCFA was carried out to measure invariance by School stage and Educational program based on the adjusted three-factor model of the EEDDocente. To verify the assumption of invariance by School stage, three groups were considered: (a) students of the Basic stage (1st and 2nd semester), (b) students of the Disciplinary stage (3rd, 4th, 5th, and 6th semester), and (c) students of the Terminal stage (7th and 8th semester). To verify the assumption of invariance by Knowledge Area, three groups were considered: (a) students enrolled in the Accounting and Administrative Sciences educational programs, (b) students enrolled in the Legal Sciences educational programs, and (c) students enrolled in the Social Sciences educational programs. The Configural, Weak, Strong, and Strict invariance models were contrasted (Dimitrov, 2010; Milfont and Fischer, 2010). For this, the recommendations of Byrne et al. (1989) and Vandenberg and Lance (2000) were followed, focusing the analysis on increasingly restrictive comparisons of the model parameters. To consider the factorial invariance between models adequate, the criterion was that the Chi-square difference (Δχ2) not be significant (p > 0.05). However, since Δχ2 is affected by sample size, the recommendations of Vandenberg and Lance (2000), Cheung and Rensvold (2002), and Dimitrov (2010) were followed, establishing RMSEA values close to the cut-off criterion of 0.08, a difference in RMSEA between models of less than 0.015 (ΔRMSEA ≤ 0.015), and differences in CFI and TLI between models of less than 0.010 (ΔCFI and ΔTLI < 0.010) (Chen, 2007; Dimitrov, 2010; Putnick and Bornstein, 2016).
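
The invariance sequence can be sketched with lavaan and semTools as below, assuming a hypothetical grouping column `stage` and the `modelo` specification above. For brevity, the sketch uses the classic group.equal sequence from the Hirschfeld and Von-Brachel (2014) tutorial the authors cite; with ordinal WLSMV estimation, thresholds take the place of intercepts, and semTools::measEq.syntax() offers identification-correct constraints for that case.

```r
# Increasingly restrictive models: configural -> weak -> strong -> strict
configural <- cfa(modelo, data = eed_clean, group = "stage")
weak   <- cfa(modelo, data = eed_clean, group = "stage",
              group.equal = "loadings")
strong <- cfa(modelo, data = eed_clean, group = "stage",
              group.equal = c("loadings", "intercepts"))
strict <- cfa(modelo, data = eed_clean, group = "stage",
              group.equal = c("loadings", "intercepts", "residuals"))

# Delta-chi2, delta-CFI, and delta-RMSEA between nested models
summary(compareFit(configural, weak, strong, strict))
```

The same call pattern applies to the Knowledge Area comparison by swapping the hypothetical `stage` column for the program/area grouping variable.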

3. Results

3.1. Descriptive results and preliminary analyses

The average of the general index of the EEDDocente was 86.61, with a standard deviation of 11.05. Likewise, the average score of the scale items was 3.41 (4 = Strongly agree), with item Q4.4 having the lowest average score (3.07) and Q4.11 the highest (3.62). For its part, the average item-total correlation of the scale was 0.64, meeting the cut-off criterion (rpbis ≥ 0.2). Likewise, the items presented, on average, moderate correlations among themselves (0.42), with correlation coefficients that ranged between 0.26 and 0.74. Table 3 shows the descriptive results of the items and the general index of the EEDDocente.

Table 3. Descriptive statistics (n = 1,679, K = 15).

Regarding the assumption of normality, significant results (p < 0.001) were obtained with Mardia’s (1970) multivariate test of skewness and kurtosis, rejecting the assumption of multivariate normality in the study sample. Likewise, the results of the Kolmogorov–Smirnov test with Lilliefors correction yielded values that reject the normal distribution of the general index (D = 0.12, p < 0.001). The global reliability coefficients of the EEDDocente (α = 0.92, ρ = 0.92, and ω = 0.93) meet the quality criteria established a priori. Likewise, the three subscales meet the quality criteria α ≥ 0.70 and ρ ≥ 0.70. However, regarding McDonald’s Omega coefficient (ω), subscales 1 and 2 [Course Organization (F1) and Quality of Teaching (F2)] present values below the ω ≥ 0.80 criterion. Table 4 shows the overall and subscale reliability coefficients of the EEDDocente.

Table 4. Overall and subscale internal consistency values of EEDDocente.

3.2. Confirmatory factor analysis

The fit indices estimated using WLSMV (χ2 = 52.80, df = 87, p = 0.999, CFI = 1.0, TLI = 1.0, GFI = 0.999, RMSEA = 0.000, SRMR = 0.21) were adequate for the three-factor model of the EEDDocente, whereas the WLS solution (χ2 = 251.21; df = 87, p = 0.000; CFI = 0.868; TLI = 0.841; GFI = 0.936; NNFI = 0.814; RMSEA = 0.034; SRMR = 0.057) fell below the CFI and TLI cut-off criteria. In turn, the factors presented, on average, moderate correlations among themselves (0.64), with correlation coefficients ranging between 0.58 and 0.76. In addition, the standardized factor loadings of the three-factor model showed significant and adequate values (see Figure 1).

Figure 1. Three-factor first-order CFA model of EEDDocente.

3.3. Factorial invariance

Table 5 shows the results of the factorial invariance parameters of the three-factor model of the EEDDocente by School stage and by Knowledge Area. The three-factor model of teacher performance from the students’ perception was adequate for the groups according to School stage (Basic stage, Disciplinary stage, and Terminal stage) and Knowledge Area (Accounting and Administrative Sciences, Legal Sciences, and Social Sciences). The Configural invariance model presented a good fit for all study groups. In particular, the differences between the Weak, Strong, and Strict models, both for the groups based on School stage and on Knowledge Area, meet the cut-off criteria (ΔCFI < 0.010, ΔRMSEA ≤ 0.015) (Chen, 2007; Dimitrov, 2010; Putnick and Bornstein, 2016). With the differences obtained between the Weak (ΔRMSEA = −0.002 and ΔCFI = −0.001), Strong (ΔRMSEA = −0.001 and ΔCFI = −0.002), and Strict (ΔRMSEA = 0.000 and ΔCFI = −0.005) models for the groups by School stage, and the Weak (ΔRMSEA = −0.002 and ΔCFI = −0.001), Strong (ΔRMSEA = −0.002 and ΔCFI = −0.002), and Strict (ΔRMSEA = 0.002 and ΔCFI = −0.008) models for the groups of educational programs by Knowledge Area, factorial invariance is verified.

Table 5. Fit indices to evaluate the factorial invariance by school stage and area of knowledge of the three-factor model of the EEDDocente.

4. Discussion

The development and validation of the EEDDocente represents an important contribution to the study and measurement of teacher performance from the perspective of students (Shevlin et al., 2000; Whittington, 2001; Campbell et al., 2005; Richardson, 2005). In particular, this study provides evidence of reliability, internal structure, and factorial invariance that allows for further comparative studies and thus evidence-based decision-making. In contrast to high-stakes assessments, the use of this type of assessment for the purpose of performance improvement and continuous teacher training is a rare practice, but a vital one for the improvement of classroom education in all education systems around the world. By way of discussion, the most relevant findings of the study are presented and contrasted with the results of other similar studies.

In particular, the reduced version (K = 15) of the EEDDocente complies with the psychometric quality criteria of reliability and internal structure. The global reliability coefficients of the EEDDocente meet the cut-off criteria (α = 0.92, ρ = 0.92, and ω = 0.93), and the reliability coefficients per subscale are very close to what was expected. Likewise, the results of the CFA corroborate the three-factor structure proposed by the Teaching Evaluation Coordination of the FCAyS (Henríquez et al., 2017, 2018; Henríquez and Arámburo, 2021). The multidimensional three-factor model with 15 items presents adequate factor loadings (between 0.50 and 0.84) and an acceptable fit. With this, the structure of the EEDDocente, which addresses some of the most relevant teaching competencies across educational levels, endorses and consolidates its underlying theoretical model. This is consistent with other studies of similar instruments that present similar theoretical dimensions in their structure (Marsh, 1984, 1993, 2007; Feldman, 1988, 1993; Centra, 1993; Braskamp and Ory, 1994; Fink, 2008; Silva et al., 2014; Irigoyen et al., 2016; Bazán-Ramírez and Velarde-Corrales, 2021; Bazán-Ramírez et al., 2021). It is important to mention that the eliminated items do not affect the interpretation of the construct, maintaining the three basic dimensions of the EEDDocente defined at the outset by the design committee. Likewise, with a shorter scale, the response time and the potential problems related to the average number of times a regular FCAyS student must answer the EEDDocente per semester are reduced.

In addition, the study provides new findings on factorial invariance by School stage and by Educational programs in the Knowledge Areas of Accounting and Administrative Sciences, Legal Sciences, and Social Sciences in samples of university students. The Configural invariance model presented a good fit for all study groups, and the differences in the parameters of the Weak and Strong models are adequate. This guarantees that the EEDDocente can be considered to be on the same scale for the different groups under study and confirms that the three-factor model measures in the same way in all of them (Vandenberg and Lance, 2000; Wang and Wang, 2012). Furthermore, the Bayesian Information Criterion (BIC) presented a sequential reduction, which can be interpreted as a sign of equivalence between the samples (Cheung and Rensvold, 2002). In this regard, Chen (2007) mentions that the RMSEA and SRMR tend to reject invariant models when sample sizes are unequal between groups, so it is advisable to use the CFI as the main criterion to establish invariance.

It must be recognized that the main limitation of the study is that, although the student samples are large, they are not equivalent in size across the study groups. In particular, there is a difference of more than 100 individuals between the School stage groups of the Basic stage (N = 632) and the Disciplinary stage (N = 816), and this difference increases when compared with the number of students participating in the Terminal stage (N = 392). The same happens with the number of participants in the educational programs by Area of Knowledge, where 366 students from the Knowledge Area of Legal Sciences participated, while in Accounting and Administrative Sciences and in Social Sciences almost twice as many participated (N = 807 and N = 676, respectively).

5. Conclusion

By way of conclusion, the findings derived from the reliability analysis and the CFA provide evidence that supports an adequate fit of the three-factor structural model [Class Organization (F1), Teaching Quality (F2), and Assessment and Feedback on Learning (F3)] of the reduced version (K = 15) of the EEDDocente to evaluate teaching performance across the School stages (Basic stage, Disciplinary stage, and Terminal stage) and the Areas of knowledge (Accounting and Administrative Sciences, Legal Sciences, and Social Sciences). In addition, the factorial invariance analyses by School stage and by Educational programs grouped into Areas of Knowledge show an adequate fit of the Configural model and adequate differences in the parameters of the Weak, Strong, and Strict models. These results indicate that none of the study groups presents a systematic tendency to answer the items higher or lower than the rest of the groups (Vandenberg and Lance, 2000; Meredith and Teresi, 2006; Wang and Wang, 2012), providing evidence to carry out comparative studies. With all this, it is guaranteed that the EEDDocente complies with the standards of reliability, internal structure validity, and invariance, and its use as a brief and easy-to-administer instrument is supported, representing an important contribution to the study and measurement of teaching from the students’ perspective. For future research, it is recommended to ensure the equivalence of the samples of the study groups to favor the analysis of the metric and factorial invariance of the EEDDocente and to carry out comparative and predictive studies. It is also important to consider the application of the EEDDocente in other schools and universities in order to have a brief instrument that provides relevant information at the end of each school stage, based on the opinion of the students, on teaching performance, favoring continuous training and decision-making to improve the effectiveness and quality of teaching.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

The protocol and procedure of the evaluation strategy on which the present study was based were approved by the FCAyS-UABC administration and supervised by the FCAyS-UABC Teaching Evaluation Coordination, in accordance with the institutional norms in force regarding research ethics. For the application of the EEDDocente, a group of students was trained as applicators of the instrument and the voluntary participation of the students of each educational program of the FCAyS-UABC was requested, who were previously informed about the objectives and procedures of the study and about the confidentiality, safeguarding and use of the information obtained based on their answers.

Author contributions

PH and JP-M contributed to the idea of research, its conceptualization, implementation, and methodology, were in charge of writing the manuscript, and also contributed to the analysis and interpretation of data, revising the English version, and writing in Frontiers format. CC provided support in the search for additional bibliographic information, was responsible for drafting the introduction to the manuscript, and reviewed the writing style of the article. JZ contributed to the design and implementation of the EEDDocente, and also provided elements for the delimitation of the participants. All authors contributed to the article and approved the submitted version.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Alcaraz-Salarirche, N. (2015). Aproximación Histórica a la Evaluación Educativa: De la Generación de la Medición a la Generación Ecléctica [Historical Approach to Educational Evaluation: From the Measurement Generation to the Eclectic Generation]. Revista Iberoamericana de Evaluación Educativa. 8:1. Available at: https://dialnet.unirioja.es/servlet/articulo?codigo=5134142 (Accessed January 21, 2023).

Aleamoni, L. (1999). Student rating myths versus research facts from 1924 to 1998. J. Pers. Eval. Educ. 13, 153–166. doi: 10.1023/a:1008168421283

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education [AERA, APA, NCME] (2018). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Aravena-Gaete, M. E., and Gairín, J. (2021). Evaluación del desempeño docente: Una mirada desde las agencias certificadoras. [Evaluation of teacher performance: a look from the certification agencies]. Prof. Rev. Currículum Form. Prof. 25, 297–317. doi: 10.30827/profesorado.v25i1.8302

Arreola, R. A. (2007). Developing a comprehensive faculty evaluation system: A guide to designing, building, and operating large-scale faculty evaluation systems. 3rd ed. Bolton, MA: Anker.

Bazán-Ramírez, A., and Velarde-Corrales, N. (2021). Auto-reporte del estudiantado en criterios de desempeño didáctico en clases de Psicología [students self-report within didactic performances criteria in psychology classes]. J. Behav. Health Soc. 13, 22–35. doi: 10.22201/fesi.20070780e.2021.13.1.78071

Bazán-Ramírez, A., Pérez-Morán, J. C., and Bernal-Baldenebro, B. (2021). Criteria for teaching performance in psychology: invariance according to age, sex, and academic Stage of Peruvian students. Front. Psychol. 12:764081. doi: 10.3389/fpsyg.2021.764081

Benton, S. L. (2018). Best practices in the evaluation of teaching. Best Pract. Eval. Teach. 69, 1–18.

Bleiberg, J. (2023). “Revisiting teacher evaluation a decade after reforms” in International encyclopedia of education. eds. R. Tierney, F. Rizvi, and K. Ercikan. 4th ed (Amsterdam, Netherlands: Elsevier)

Boring, A., Ottoboni, K., and Stark, P. B. (2016). Student evaluations of teaching (mostly) do not measure teaching effectiveness. Sci. Open Res. 1, 1–11. doi: 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1

Boštjančič, E., Komidar, L., and Johnson, R. B. (2018). Factorial validity and measurement invariance of the Slovene version of the cultural intelligence scale. Front. Psychol. 9:1499. doi: 10.3389/fpsyg.2018.01499

Braskamp, L. A., and Ory, J. C. (1994). Assessing faculty work: Enhancing individual and institutional performance. San Francisco, CA: Jossey-Bass.

Brown, J. D. (2012). “Classical test theory” in The Routledge handbook of language testing measurement. eds. G. Fulcher and F. Davidson (Abingdon: Routledge)

Brown, T. A. (2015). Confirmatory factor analysis for applied research. New York: Guilford Publications.

Browne, M. W., and Cudeck, R. (1993). “Alternative ways of assessing model fit” in Testing structural equation models. eds. K. A. Bollen and J. S. Long (Thousand Oaks, California: Sage)

Byrne, B. M., Shavelson, R. J., and Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: the issue of partial measurement invariance. Psychol. Bull. 105, 456–466. doi: 10.1037/0033-2909.105.3.456

Calatayud, M. (2021). Evaluación docente y mejora profesional. Descubrir el Encanto de su complicidad. [teacher evaluation and professional improvement. Discover the charm of their complicity]. Rev. Iberoam. Eval. Educ. 14, 87–100. doi: 10.15366/riee2021.14.1.005

Camacho, I. (2022). El desempeño docente y su implicación en la enseñanza [Teacher performance and its involvement in teaching]. Form. Estrat. 6, 105–120.

Campbell, H. E., Steiner, S., and Gerdes, K. (2005). Student evaluations of teaching: how you teach and who you are. J. Public Aff. Educ. 11, 211–231. doi: 10.1080/15236803.2005.12001395

Carpio, C., Pacheco, V., Canales, C., and Flores, C. (1998). Comportamiento inteligente y juegos de lenguaje en la enseñanza de la psicología [Intelligent behavior and language games in the teaching of psychology]. Acta Comport 6, 47–60.

Cashin, W. E., and Downey, R. G. (1992). Using global student rating items for summative evaluation. J. Educ. Psychol. 84, 563–572. doi: 10.1037/0022-0663.84.4.563

Centra, J. A. (1993). Reflective faculty evaluation: Enhancing teaching and determining faculty effectiveness. San Francisco, CA: Jossey-Bass.

Chan, W. M. (2018). Teaching in higher education: students’ perceptions of effective teaching and good teachers. Soc. Sci. Educ. Res. Rev. 5, 40–58.

Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Struct. Equ. Model. 14, 464–504. doi: 10.1080/10705510701301834

Chen, Y., and Hoshower, L. B. (2003). Student evaluation of teaching effectiveness: an assessment of student perception and motivation. Assess. Eval. High. Educ. 28, 71–88. doi: 10.1080/02602930301683

Cheung, G. W., and Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Struct. Equ. Model. 9, 233–255. doi: 10.1207/S15328007SEM0902_5

Ching, G. (2018). A literature review on the student evaluation of teaching: an examination of the search, experience, and credence qualities of SET. High. Educ. Eval. Dev. 12, 63–84. doi: 10.1108/HEED-04-2018-0009

Cisneros-Cohernour, E., and Stake, R. (2010). La evaluación de la docencia en educación superior: de evaluaciones basadas en opiniones de estudiantes a modelos por competencias [the evaluation of teaching superior education: from evaluations based on opinions of students in models of competences]. Rev. Iberoam. Eval. Educ. 3, 218–231.

Cohen, P. A. (1981). Student ratings of instruction and student achievement: a meta-analysis of multisection validity studies. Rev. Educ. Res. 51, 281–309. doi: 10.2307/1170209

Cohen, P. A. (1983). Comment on a selective review of the validity of student ratings of teaching. J. High. Educ. 54, 448–458. doi: 10.2307/1981907

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika 16, 297–334. doi: 10.1007/BF02310555

Cronbach, L. J. (1988). Internal consistency of tests: analyses old and new. Psychometrika 53, 63–70. doi: 10.1007/BF02294194

Cruz Ávila, M. (2007). Una propuesta para la evaluación del profesorado universitario [A proposal for the evaluation of university teaching staff]. Doctoral dissertation. Universitat Autònoma de Barcelona. Available at: https://www.tdx.cat/bitstream/handle/10803/5285/mca1de1.pdf (Accessed January 16, 2023).

Dallal, G., and Wilkinson, L. (1986). An analytic approximation to the distribution of Lilliefors's test statistic for normality. Am. Stat. 40, 294–296. doi: 10.1080/00031305.1986.10475419

Dimitrov, D. M. (2010). Testing for factorial invariance in the context of construct validation. Meas. Evaluat. Counsel. Dev. 43, 121–149. doi: 10.1177/0748175610373459

Estrada, L. A., Yglesias, L. A., Miranda, A. E., Díaz, J. K., and Díaz, S. M. (2019). Propiedades Psicométricas de un Cuestionario sobre Evaluación del Desempeño Docente Universitario desde la Percepción del Estudiante [psychometric properties of a questionnaire on evaluation of university teaching performance from the perception of the student]. Rev. Investig. Estadíst. 2, 92–102.

Feldman, K. A. (1976). The superior college teacher from the students’ view. Res. High. Educ. 5, 243–288. doi: 10.1007/BF00991967

Feldman, K. A. (1988). Effective college teaching from the students’ and faculty’s view: matched or mismatched priorities? Res. High. Educ. 28, 291–329. doi: 10.1007/BF01006402

Feldman, K. A. (1989a). Instructional effectiveness of college teachers as judged by teachers themselves, current and former students, colleagues, administrators, and external (neutral) observers. Res. High. Educ. 30, 137–194.

Feldman, K. A. (1989b). The association between student ratings of specific instructional dimensions and student achievement: refining and extending the synthesis of data from multisectional validity studies. Res. High. Educ. 30, 583–645.

Feldman, K. A. (1990). An afterword for “the association between student ratings of specific instructional dimensions and student achievement”: refining and extending the synthesis of data from multisection validity studies. Res. High. Educ. 31, 315–317.

Feldman, K. A. (1992). College students’ views of male and female college teachers: part I- evidence from the social laboratory experiments. Res. High. Educ. 33, 317–375.

Feldman, K. A. (1993). College students’ views of male and female teachers: part II- evidence from students’ evaluations of their classroom teachers. Res. High. Educ. 34, 151–211.

Fink, D. (2008). “Evaluating teaching: a new approach to an old problem” in To improve the academy: Resources for faculty, instructional, and organizational development. eds. S. Chadwick-Blossey and D. R. Robertson (Hoboken, NJ: Jossey-Bass)

García, J. M. (2000). “Las dimensiones de la efectividad docente, validez y confiabilidad de los cuestionarios de evaluación de la docencia: síntesis de investigación internacional [the dimensions of teaching effectiveness, validity and reliability of teacher evaluation questionnaires: international research synthesis]” in Evaluación de la docencia. eds. M. Rueda and Y. F. Díaz-Barriga (Mexico: Paidós), 41–62.

García, J. M. (2014). ¿Para qué sirve la evaluación de la docencia? Un estudio exploratorio de las creencias de los estudiantes [How useful is the evaluation of teaching? An exploratory study on students’ beliefs]. Educ. Policy Anal. Arch. 22, 1–20. Available at: https://www.redalyc.org/pdf/2750/275031898010.pdf (Accessed January 14, 2023).

García-Gómez-Heras, S., Gil-Madrona, P., Hernandez-Barrera, V., Lopez-de-Andres, A., and Carrasco-Garrido, P. (2017). Importance of university teacher behaviour in the faculty of health science. Aust. Med. J. 10, 800–810. doi: 10.21767/AMJ.2017.3128

García-Olalla, A., Villa-Sánchez, A., Aláez, M., and Romero-Yesa, S. (2022). Aplicación y resultados de un sistema para evaluar la calidad de la docencia universitaria en una década de experimentación [Implementation and results of a system to evaluate the quality of University teaching in a decade of experimentation]. Rev. Investig. Educ. 40, 51–68. doi: 10.6018/rie.401221

Gazeloglu, C., and Greenacre, Z. A. (2020). Comparison of weighted least squares and robust estimation in structural equation modeling of ordinal categorical data with larger sample sizes. Cumhuriyet Sci. J. 41, 193–211. doi: 10.17776/csj.648054

Gómez, L. F., and Valdés, M. G. (2019). The evaluation of teacher performance in higher education. Propósitos Represent. 7, 479–515. doi: 10.20511/pyr2019.v7n2.255

Gu, R., Wang, H. N., and Lou, L. S. (2021). Optimization and application of data analysis strategy for college students’ evaluation of teaching. J. Zhejiang Univ. Tech. 20, 201–207.

Hair, J. F., Black, W. C., Babin, B. J., and Anderson, R. E. (2019). Multivariate Data Analysis. Hampshire: Cengage Learning EMEA.

Henríquez, P., and Arámburo, V. (2021). Evaluación del desempeño docente por áreas de conocimiento: El Caso de la Facultad de Ciencias Administrativas y Sociales de la Universidad Autónoma de Baja California, México. [evaluation of teaching performance by areas of knowledge: the case of the Faculty of Administrative and Social Sciences of the Autonomous University of Baja California, Mexico]. Act. Investig. Educ. 21, 1–20. doi: 10.15517/aie.v21i3.46294

Henríquez, P., Arámburo, V., and Boroel, B. (2018). Análisis de las estrategias de enseñanza según áreas de conocimiento en el nivel educativo superior: Percepciones de estudiantes de la Facultad de Ciencias Administrativas y Sociales (FCAYS), Universidad Autónoma de Baja California (UABC), México. [Analysis of teaching strategies according to areas of knowledge at the higher education level: Perceptions of students from the School of Administrative and Social Sciences (FCAYS), Autonomous University of Baja California (UABC), Mexico]. Compendio Investigativo de Academic Journals Celaya 2018, 14, 2320–2325.

Henríquez, P., Arámburo, V., and Dávila, E. (2017). Percepción de los estudiantes universitarios acerca de las estrategias pedagógicas y de evaluación del aprendizaje utilizadas por sus profesores: el caso de la FCAYS de la UABC. [perception of university students about the pedagogical and learning assessment strategies used by their professors: the case of the FCAYS at UABC]. In Memorias electrónicas del XIV Congreso Nacional de Investigación Educativa, COMIE. San Luis Potosí: México, 1–14. Available at: https://www.comie.org.mx/congreso/memoriaelectronica/v14/doc/1956.pdf (Accessed January 24, 2023).

Henríquez, P., Pérez-Morán, J. C., Cid, C. D., and Zamora, J. E. (2023). Reporte técnico de las propiedades psicométricas de estructura factorial e invarianza de la Escala de Evaluación del Desempeño Docente (EEDDocente) de la FCAyS [technical report on the psychometric properties of factorial structure and invariance of the Teaching Performance Evaluation Scale (EEDDocente) of the FCAyS]. Documento interno [internal document].

Hildebrand, M. (1973). The character and skills of the effective professor. J. High. Educ. 44, 41–50. doi: 10.2307/1980624

Hildebrand, M., Wilson, R. C., and Dienst, E. R. (1971). Evaluating university teaching. Berkeley, CA: Center for Research and Development in Higher Education.

Hirschfeld, G., and Von-Brachel, R. (2014). Multiple-group confirmatory factor analysis in R–a tutorial in measurement invariance with continuous and ordinal indicators. Pract. Assess. Res. Eval. 19, 1–12. doi: 10.7275/qazy-2946

Hornstein, H. A. (2017). Student evaluations of teaching are an inadequate assessment tool for evaluating faculty performance. Cogent Educ. 4, 1–8. doi: 10.1080/2331186x.2017.1304016

House, E. R. (1998). Acuerdos institucionales para la evaluación. [Institutional arrangements for evaluation]. Perspectivas 28, 123–131.

Hu, L. T., and Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct. Equ. Model A Multidiscip. J. 6, 1–55. doi: 10.1080/10705519909540118

International Test Commission. (2013). ITC Guidelines on Test Use. ITC-G-TU-20131008. Available at: https://www.intestcom.org/files/guideline_test_use.pdf. (Accessed January 15, 2023).

Irigoyen, J. J., Jiménez, M., and Acuña, K. (2016). Discurso didáctico e interacciones sustitutivas en la enseñanza de las ciencias [Didactic discourse and substitute interactions in teaching Sciences]. Enseñ. Investig. Psicol 21, 68–77.

Jöreskog, K. G., and Sörbom, D. (1979). Advances in factor analysis and structural equation models. Cambridge, MA: ABT Books.

Jorgensen, T. D., Mansolf, M., and Enders, C. K. (2022). Pooled Score Tests for Multiply Imputed Data Using semTools. Available at: https://rdrr.io/cran/semTools/man/lavTestScore.mi.html (Accessed January 23, 2023).

Kane, M. (2006). “Validation” in Educational measurement. ed. R. L. Brennan (New Jersey: National Council on Measurement in Education and American Council on Education)

Katz, M. H. (2006). Multivariable analysis (2nd Edn.). Cambridge: Cambridge University Press.

Kline, R. B. (2015). Principles and practice of structural equation modeling. 4th Edn. New York: Guilford Publications.

König, J., Ligtvoet, R., Klemenz, S., and Rothland, M. (2017). Effects of opportunities to learn in teacher preparation on future teachers’ general pedagogical knowledge: analyzing program characteristics and outcomes. Stud. Educ. Eval. 53, 122–133. doi: 10.1016/j.stueduc.2017.03.001

Lera, M., León, J., and Ruiz, P. (2021). Adaptation of the teacher efficacy scale to measure effective teachers’ educational practices through students’ ratings: a multilevel approach. Psicothema 33, 509–517. doi: 10.7334/psicothema2020.262

Liebowitz, D. (2021). Teacher evaluation for accountability and growth: should policy treat them as complements or substitutes? Labour Econ. 71:102024. doi: 10.1016/j.labeco.2021.102024

Liebowitz, D. (2022). Teacher evaluation for growth and accountability: under what conditions does it improve student outcomes? Harv. Educ. Rev. 92, 533–565. doi: 10.17763/1943-5045-92.4.533

Luna, E., and Torquemada, A. (2008). Los cuestionarios de evaluación de la docencia por los alumnos: balance y perspectivas de su agenda. [questionnaires for the evaluation of teaching by students: balance and perspectives of their agenda]. Rev. Electrón. Investig. Educ. Espec. 10, 1–11.

Mardia, K. V. (1970). Measures of multivariate skewness and kurtosis with applications. Biometrika 57, 519–530. doi: 10.1093/biomet/57.3.519

Mardia, K. V. (1980). “Tests for univariate and multivariate normality” in Handbook of statistics. Amsterdam, Netherlands: Elsevier.

Márquez, L., and Madueño, M. L. (2016). Propiedades psicométricas de un instrumento para apoyar el proceso de evaluación del docente universitario. [psychometric properties of an instrument to support the evaluation process of the university professor]. Rev. Electrón. Investig. Educ. 18, 53–61.

Marsh, H. W. (1984). Students’ evaluations of university teaching: dimensionality, reliability, validity, potential biases, and utility. J. Educ. Psychol. 76, 707–754. doi: 10.1037/0022-0663.76.5.707

Marsh, H. W. (1986). Applicability paradigm: students’ evaluations of teaching effectiveness in different countries. J. Educ. Psychol. 78, 465–473. doi: 10.1037/0022-0663.78.6.465

Marsh, H. W. (1987). Students’ evaluations of university teaching: research findings, methodological issues, and directions for future research. Int. J. Educ. Res. 11, 253–388. doi: 10.1016/0883-0355(87)90001-2

Marsh, H. W. (1993). Multidimensional students' evaluations of teaching effectiveness. J. High. Educ. 64, 1–18. doi: 10.1080/00221546.1993.11778406

Marsh, H. W. (2007). “Students’ evaluations of university teaching: dimensionality, reliability, validity, potential biases and usefulness” in The scholarship of teaching and learning in higher education: An evidence-based perspective. eds. R. P. Perry and J. C. Smart (Berlin, Germany: Springer Science and Business Media)

Marsh, H. W., and Dunkin, M. J. (1997). “Students’ evaluations of university teaching: a multidimensional perspective” in Effective teaching in higher education: Research and practice. eds. R. P. Perry and J. C. Smart (New York: Agathon), 241–320.

McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychol. Methods 23, 412–433. doi: 10.1037/met0000144

Meredith, W., and Teresi, J. A. (2006). An essay on measurement and factorial invariance. Med. Care 44, S69–S77. doi: 10.1097/01.mlr.0000245438.73837.89

Messick, S. (1995). Validity of psychological assessment: validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am. Psychol. 50, 741–749. doi: 10.1037/0003-066X.50.9.741

Milfont, T. L., and Fischer, R. (2010). Testing measurement invariance across groups: applications in cross-cultural research. Int. J. Psychol. Res. 3, 111–130. doi: 10.21500/20112084.857

Mitchell, K., and Martin, J. (2018). Gender bias in student evaluations. PS Polit. Sci. Polit. 51, 648–652. doi: 10.1017/S104909651800001X

Mohammadi, M. (2021). Dimensions of teacher performance evaluation by students in higher education. Shanlax Int. J. Educ. 9, 18–25. doi: 10.34293/education.v9i2.3673

Nájera-Catalán, H. E. (2019). Reliability, population classification and weighting in multidimensional poverty measurement: a Monte Carlo study. Soc. Indic. Res. 142, 887–910. doi: 10.1007/s11205-018-1950-z

Nasser-Abu Alhija, F. (2017). Teaching in higher education: good teaching through students’ lens. Stud. Educ. Eval. 54, 4–12. doi: 10.1016/j.stueduc.2016.10.006

Navarro, C., and Ramírez, M. (2018). Mapeo sistemático de la literatura sobre evaluación docente (2013-2017). [systematic mapping of the literature on teacher evaluation (2013-2017)]. Educ. Pesqui. 44, 1–23. doi: 10.1590/S1678-4634201844185677

Putnick, D. L., and Bornstein, M. H. (2016). Measurement invariance conventions and reporting: the state of the art and future directions for psychological research. Dev. Rev. 41, 71–90. doi: 10.1016/j.dr.2016.06.004

Raykov, T., and Marcoulides, G. A. (2019). Thanks coefficient alpha, we still need you! Educ. Psychol. Meas. 79, 200–210. doi: 10.1177/0013164417725127

Revelle, W. (2015). psych: procedures for psychological, psychometric, and personality research [R package]. Comprehensive R Archive Network.

Reyes, E., Luna, E., and Caso, J. (2020). Evidencias de validez del Cuestionario de Evaluación de la Competencia Docente Universitaria. [evidence of validity of the university teaching competence assessment questionnaire]. Perfiles Educ. 42, 106–122. doi: 10.22201/iisue.24486167e.2020.169.58931

Richardson, J. T. (2005). Instruments for obtaining student feedback: a review of the literature. Assess. Eval. High. Educ. 30, 387–415. doi: 10.1080/02602930500099193

Romero, T., and Martínez, A. (2017). Construcción de instrumentos de evaluación del desempeño docente universitario desde una perspectiva cualitativa. [Construction of evaluation instruments of university teaching performance from a qualitative perspective]. Rev. Univ. Caribe 18, 34–43. doi: 10.5377/ruc.v18i1.4800

Rosseel, Y. (2012). lavaan: an R package for structural equation modeling. J. Stat. Softw. 48, 1–36. doi: 10.18637/jss.v048.i02

Salazar, J. (2008). Diagnóstico Preliminar sobre Evaluación de la Docencia Universitaria. Una Aproximación a la Realidad en las Universidades Públicas y/o Estatales de Chile. [Preliminary Diagnosis on Evaluation of University Teaching. An Approach to Reality in Public and/or State Universities in Chile]. Rev. Iberoam. Eval. Educ. 1:e3.

Schellhase, K. C. (2010). The relationship between student evaluation of instruction scores and faculty formal educational coursework. Athl. Train. Educ. J. 5, 156–164. doi: 10.4085/1947-380X-5.4.156

Schreiber, J. B., Nora, A., Stage, F. K., Barlow, E. A., and King, J. (2006). Reporting structural equation modeling and confirmatory factor analysis results: a review. J. Educ. Res. 99, 323–338. doi: 10.3200/JOER.99.6.323-338

Seivane, M. S., and Brenlla, M. E. (2021). Evaluación de la calidad docente universitaria desde la perspectiva de los estudiantes. [evaluation of university teaching quality from the students' perspective]. Rev. Iberoam. Evalu. Educ. 14, 35–46. doi: 10.15366/riee2021.14.1.002

Shevlin, M., Banyard, P., Davies, M., and Griffiths, M. (2000). The validity of student evaluation of teaching in higher education: love me, love my lectures? Assess. Eval. High. Educ. 25, 397–405. doi: 10.1080/713611436

Silva, L. A. (2023). Congruencia entre la práctica docente y la evaluación de la docencia por parte del estudiantado en educación superior: estudio de casos en la Universidad Veracruzana. [congruence between teaching practice and the evaluation of teaching by students in higher education: case study at the Universidad Veracruzana]. Rev. Educ. 47, 114–128. doi: 10.15517/revedu.v47i1.51978

Silva, H., Morales, G., Pacheco, V., Camacho, A., Garduño, H., and Carpio, C. (2014). Didáctica como conducta: una propuesta para la descripción de las habilidades de enseñanza [didactics as behavior: a proposal for the description of teaching skills]. Rev. Mexic. Anál. Conduc. 40, 32–46. doi: 10.5514/rmac.v40.i3.63679

Silva, J., Solís, P., Huaman, J., and Quispe, F. (2022). La evaluación formativa en el desempeño docente universitario: Revisión sistemática de literatura. [Formative evaluation in university teaching performance: Systematic literature review]. Tecnohum. Rev. Cien. 2, 1–14. doi: 10.53673/th.v2i4.177

Spooren, P., Brockx, B., and Mortelmans, D. (2013). On the validity of student evaluation of teaching: the state of the art. Rev. Educ. Res. 83, 598–642. doi: 10.3102/0034654313496870

Stroebe, W. (2016). Student evaluations of teaching: no measure for the TEF. Times Higher Education. Available at: https://www.timeshighereducation.com/comment/student-evaluations-teaching-no-measure-tef (Accessed July 3, 2023).

Stroebe, W. (2020). Student evaluations of teaching encourages poor teaching and contributes to grade inflation: a theoretical and empirical analysis. Basic Appl. Soc. Psychol. 42, 276–294. doi: 10.1080/01973533.2020.1756817

R Core Team. (2022). Writing R extensions. R Foundation for Statistical Computing, pp. 1–208. Available at: https://cran.r-project.org/doc/manuals/R-exts.html (Accessed July 4, 2023).

Theall, M., Abrami, P. C., and Mets, L. A. (eds.) (2001). The student ratings debate: Are they valid? How can we best use them? San Francisco, CA: Jossey-Bass.

Tomás-Folch, M., and Durán-Bellonch, M. (2017). Comprendiendo los factores que afectan la transferencia de la formación permanente del profesorado. Propuestas de mejora [understanding the factors affecting transfer of university teachers' permanent training. Proposals for improvement]. Rev. Electrón. Interuniv. Form. Prof. 20, 145–157. doi: 10.6018/reifop/20.1.240591

Torquemada, A. (2022). Evaluación, desarrollo, innovación y futuro de la docencia universitaria, de la Red Iberoamericana de Investigadores en Evaluación de la Docencia [evaluation, development, innovation and future of university teaching, by the Ibero-American Network of Researchers in Teaching Evaluation]. Perfiles Educ. 44, 200–204.

Tuncel, S. D. (2009). Determining effective teacher behavior contributing to students’ academic success. Int. J. Phys. Educ. 1, 15–18.

Vaillant, D. (2016). El fortalecimiento del desarrollo profesional docente: una mirada desde Latinoamérica. [the strengthening of teacher professional development: a look from Latin America]. J. Supranational Policies Educ. 5, 5–21. doi: 10.15366/jospoe2016.5

Vance, R. I., Winne, P. S., and Wright, E. S. (1983). A longitudinal examination of rater and ratee effects in performance ratings. Pers. Psychol. 36, 609–620. doi: 10.1111/j.1744-6570.1983.tb02238.x

Vandenberg, R. J., and Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: suggestions, practices, and recommendations for organizational research. Organ. Res. Methods 3, 4–70. doi: 10.1177/109442810031002

Wang, D. F., and Guan, L. (2017). Higher education quality evaluation from the perspective of students: theoretical construction and reflection. J. Nat. Inst. Educ. Admin. 5:75. doi: 10.3969/j.issn.1672-4038.2017.05.005

Wang, J., and Wang, X. (2012). Structural equation modeling applications using Mplus. Chichester: John Wiley & Sons Ltd.

Whittington, L. (2001). Detecting good teaching. J. Public Aff. Educ. 7, 5–8. doi: 10.1080/15236803.2001.12023490

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D. A., François, R., et al. (2019). Welcome to the Tidyverse. J. Open Source Softw. 4:1686. doi: 10.21105/joss.01686

Williams, K., and Hebert, D. (2020). Evaluation systems: a literature review on issues and impact. Res. Issues Contemp. Educ. 5, 42–50.

Zamora, E. (2021). La evaluación del desempeño docente mediante cuestionarios en la universidad: su legitimidad según la literatura y los requerimientos para que sea efectiva [the evaluation of teaching performance through questionnaires at the university: its legitimacy according to the literature and the requirements for it to be effective]. Rev. Actual. Investig. Educ. 21, 1–23. doi: 10.15517/aie.v21i3.46221

Zhang, Z., and Yuan, K. H. (2016). Robust coefficients alpha and omega and confidence intervals with outlying observations and missing data: methods and software. Educ. Psychol. Meas. 76, 387–411. doi: 10.1177/0013164415594658

Zhao, L., Xu, P., Chen, Y., and Yan, S. (2022). A literature review of the research on students’ evaluation of teaching in higher education. Front. Psychol. 13:1004487. doi: 10.3389/fpsyg.2022.1004487

Zhou, J. L., and Qin, Y. (2018). The basic types of college students’ teaching evaluation behavior deviation and its relationship with students’ background characteristics. Fudan Educ. Forum 2018:6.

Keywords: students’ evaluation of teaching, higher education students, validity, confirmatory factor analysis, invariance

Citation: Henríquez PS, Pérez-Morán JC, del Cid García CJ and Zamora JE (2023) Factor structure and invariance of the scale to measure teaching performance in the area of social sciences. Front. Educ. 8:1229129. doi: 10.3389/feduc.2023.1229129

Received: 25 May 2023; Accepted: 11 August 2023;
Published: 29 August 2023.

Edited by:

Gavin T. L. Brown, The University of Auckland, New Zealand

Reviewed by:

Diana Pereira, University of Minho, Portugal
Michalis P. Michaelides, University of Cyprus, Cyprus

Copyright © 2023 Henríquez, Pérez-Morán, del Cid García and Zamora. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Juan Carlos Pérez-Morán, jucarpint@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.