- 1College of Electrical and Electronic Engineering, National Polytechnic School, Quito, Ecuador
- 2Facultad de Ciencias de la Educación, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
- 3Department of Developmental Psychology and Didactics, University of Alicante, Alicante, Spain
The purpose of this work is twofold: to validate a rating scale for the student evaluation of teaching in the context of a polytechnic higher education institution, and to present the methodological procedure for reducing that scale to a short form. We explored the relationship between the long and short versions of the scale and examined their invariance with respect to relevant variables such as gender. Data were obtained from a sample of 6,110 students enrolled in a polytechnic higher education institution, most of whom were male. Data analysis included descriptive analysis, intraclass correlations, exploratory structural equation modeling (ESEM), confirmatory factor analysis, correlations between the short and long forms corrected for shared error variance, gender measurement invariance, reliability based on congeneric correlated factors, and correlations with academic achievement using the class as the unit of analysis in a multisection design. Results showed four highly correlated factors that do not exclude a general factor, with an excellent fit to the data; configural, metric, and scalar gender measurement invariance; high reliability for both the long and short scales and subscales; high correlations between the short and long forms; and moderate but significant correlations between both versions of the scale and academic performance, with both individual and class-aggregated data. In conclusion, this work shows that it is possible to develop a short form of a student evaluation of teaching scale that maintains the same high reliability and validity indices as the longer scale.
Introduction
Academic failure and dropout rates in higher education in Ecuador, especially in Engineering studies, are very high. Sandoval-Palis et al. (2020) found a dropout rate of around 70% in the first year of university studies at the National Polytechnic School. Braxton et al. (2000) and Kuh (2002) point to the quality of teaching as one of the determining aspects of academic failure and dropout. Likewise, instructional factors are among the key factors explaining academic success and dropout: Schneider and Preckel (2017) highlight, among other aspects, the effects of teacher-student interaction, the type of communication, the preparation, organization, and presentation of content by the teacher, the teacher's planning, and the feedback provided to the student.
Student evaluation of teaching (SET) ratings are a widespread procedure in higher education institutions (Richardson, 2005; Zabaleta, 2007; Huybers, 2014). SET is a useful tool for formative aims, such as feedback for the improvement of instruction, and for administrative decision-making about recruitment, career progression, or economic incentives (Linse, 2017). A systematic review of the subject shows that very few publications on the validation of student evaluation of university teaching (SET) scales in South America are indexed in the major databases, such as Scopus and Web of Science (WoS) (Pimienta, 2014; Andrade-Abarca et al., 2018), with only a few more appearing when the scope of the search is expanded (Fernández and Coppola, 2008; Montoya et al., 2014).
In the Ecuadorian context, Aguilar and Bautista (2015) and Andrade-Abarca et al. (2018) validated questionnaires in the setting of an Ecuadorian polytechnic university, while the review by Loor et al. (2018) on the evaluation of university teaching staff concluded that the quality of the evaluation process needs to be improved.
Student Evaluation of Teaching Ratings Scales
The instruments normally used to measure students' evaluation of their teachers, programs, and students' satisfaction with their instruction are known as standard rating scales. However, research on student evaluation of teaching ratings has not yet provided clear answers to some questions about their validity (Marsh, 2007a,b; Spooren et al., 2013; Hornstein, 2017; Uttl et al., 2017).
Many evaluation instruments have been constructed and validated within the home institution itself, and the results of such validations have not always been published; in some instances, the instruments have not even been tested for psychometric quality (Richardson, 2005). In addition, there is a lack of consensus on the number and type of dimensions (Spooren et al., 2013), due to conceptual problems related to the lack of a theoretical framework about what effective teaching is, and to methodological problems concerning the measurement of these dimensions as a data-driven process (in which different post-hoc analytic techniques are used). It therefore seems necessary to use the most common dimensions, those associated with greater teaching effectiveness.
A question concerning construct validity that arises in relation to student evaluation of teaching rating scales is whether they have a one-dimensional (Abrami et al., 1997; Cheung, 2000) or a multidimensional structure. Marsh et al. (2009) defended the application of exploratory structural equation modeling (ESEM), which integrates confirmatory (CFA) and exploratory factor analysis (EFA), to analyse issues related to multidimensional student evaluations of university teaching (SETs), on the basis that measures can be obtained both for the specific dimensions and for a general factor of teaching quality.
An open and controversial question related to criterion validity is the relationship of SET scores to student academic achievement. To answer this question, a series of review and meta-analytic studies have been carried out (Cohen, 1981; Feldman, 1989; Clayson, 2009; Uttl et al., 2017). Taken together, the results show that, when multiple sections are included and previous academic achievement is controlled, SET is moderately related to academic achievement; however, the effect is smaller than that found in some earlier meta-analytic studies (Cohen, 1981; Feldman, 1989), at only around 10%.
Another methodological question concerns systematic bias in the evaluations. This problem is present when a given characteristic of students systematically influences their evaluations of teachers (e.g., gender; Centra and Gaubatz, 2000; Badri et al., 2006; Basow et al., 2006; Darby, 2006; Boring, 2015). A possible source of bias is the discipline: if the evaluation of teaching is situational and affected by academic discipline, with ratings higher in fields such as education and the liberal arts and lower in areas such as business and engineering (Clayson, 2009), then new studies seem necessary in areas different from the previous ones, such as the technical fields, where there are fewer studies on the subject.
The present study was carried out in a context different from most previous studies (Clayson, 2009): student evaluations of teaching in a higher education institution, the National Polytechnic School of Ecuador, a South American country, where students study technical subjects such as engineering, architecture, and biotechnology. Unfortunately, in South America there is a shortage of reliable and valid SET scales for polytechnic higher education institutions, even though SET has been a widespread procedure in these institutions since the early 1980s (Pareja, 1986).
The Council of Ecuadorian Higher Education establishes the obligatory nature of the evaluation of the teaching staff of higher education institutions, both for their entry and for their promotion, in the Career and Ladder Regulations of the Professor and Researcher of the Higher Education System; teachers may even be dismissed from teaching if they receive two consecutive performance evaluations below 60%, or four comprehensive performance evaluations below 60% during their career (Consejo de Educación Superior, 2017).
The evaluation of the quality of teaching in the National Polytechnic School of Ecuador uses different procedures, including self-assessment, evaluation by peers and managers, and evaluation by students through evaluation questionnaires. The elaboration of this questionnaire is based on the criteria proposed by the institution itself and the guidelines suggested by the Higher Education Council (Consejo de Educación Superior, 2017).
The instrument of student evaluation of teaching used in the National Polytechnic School is the “Cuestionario de Evaluación de la Enseñanza del Profesor de la Escuela Politécnica Nacional del Ecuador” (Teacher Evaluation Questionnaire of the National Polytechnic School). The elaboration of the questionnaire was based on the previous SET literature (Toland and De Ayala, 2005; Marsh, 2007a; Mortelmans and Spooren, 2009) and began with a proposal of several criteria of effective teaching. Next, a teaching committee, part of the management team of the National Polytechnic School, developed a set of items. This committee consisted of five tenured principal professors with extensive experience in teaching quality, a representative of the administrative sector, and a student. The aspects to be evaluated and the specific items that make up the questionnaire are approved each academic year by the management team of the National Polytechnic School. The items are grouped theoretically into the following four factors: 1. Planning, mastery, and clarity in the explanation of the subject matter (e.g., “The teacher conveniently expresses the class objectives and contents, indicating their relationship with the student's training”). 2. Methodology and resources (e.g., “The teacher prepared teaching material apart from the textbook and made it known”). 3. Teacher-student relationship (e.g., “The teacher created a climate of trust and productivity in class”). 4. Evaluation (e.g., “The evaluation events are related to the teaching given”). Although the number and type of dimensions of effective teaching remain an open question (Spooren et al., 2013), these four dimensions are present in most of the SET literature (Feldman, 1989; Richardson, 2005; Huybers, 2014).
Thus, face and content validity were taken into account during the development of the instrument. Face validity indicates whether an instrument seems appropriate; that is, it does not address what the instrument measures but what it appears to measure, i.e., the extent to which the items of a SET instrument appear relevant to a respondent (Spooren et al., 2013; Rispin et al., 2019). Content validity refers to whether the content of the construct has been covered in an exhaustive and representative way. Content validity is established through consensus based on the informed opinion of experts; it is recommended to include at least five experts in the evaluation of content validity (Yaghmale, 2009). However, the empirical validation of the institutional questionnaire has been minimal, limited to a descriptive analysis of the individual items; it lacks a complete process of construct and criterion validation, as well as an estimation of the reliability of the scale and the subscales that make up the questionnaire.
Although many studies have addressed the validation of student evaluation of teaching scales in higher education, few have done so in the specific scope of polytechnic institutions using SEM approaches, and there are very few examples of the rigorous development of short teacher assessment scales. For this reason, our work tries to contribute to filling this gap.
Scale Reduction
A line of work has recently developed around reducing the length of existing scales or constructing scales with a reduced number of items. The lack of time for administering scales, respondent fatigue, and possible stereotyped responses to scales that are too long, or that form part of a battery of scales applied within the same study, have led to proposals for short scales (Gogol et al., 2014; Lafontaine et al., 2016). These scales have to be short enough to allow a rapid assessment of the target constructs, but long enough to ensure appropriate reliability, validity, and accurate parameter estimation.
Short scales are considered to present psychometric disadvantages compared to long scales with regard to both reliability and validity, as they can be more affected by random measurement errors (Lord and Novick, 1968; Credé et al., 2012).
In short-form scales, the proposed number of items per factor varies from one to four. Thus, several authors propose scales and subscales in which each factor includes four items (Marsh et al., 1998, 2009, 2010; Poitras et al., 2012). Other authors, such as Credé et al. (2012), point out the loss of psychometric quality when scales have between one and three items. On the other hand, Kline (2016) notes that construct validation procedures, such as confirmatory factor analysis and other modeling methods, require at least three indicators per factor for a model to be identified. From a point of view that combines theoretical demands with practical interest, the PISA study of 2000 and the German PISA study of 2003 used short scales with three items (Brunner et al., 2010).
Another group of studies proposes the use of short scales based on the finding that the reliability and validity of short measures are similar to those of the corresponding longer scales, and that short forms correlate highly with the long scales (Nagy, 2002; Christophersen and Konradt, 2011; Gogol et al., 2014). Gogol et al. (2014) compared the reliability and validity of three-item and single-item measures to those of the corresponding longer scales, finding satisfactory reliability and validity indices for all short forms and high correlations with the long scales; however, single-item measures showed the lowest reliability indices and correlations with the longer scales. Based on these results, the authors defended the use of short scales.
In sum, there are empirically founded reasons to propose short scales of three or four items. Although three items seem sufficient to guarantee the reliability and validity of the measure, in some cases, such as when additional assumptions are made about the psychometric properties of the items and factors (item error variances, factor variances, etc.) or when the hierarchical nature of the data is taken into account in multilevel analysis, four items per factor are recommended for accurate parameter estimation (Marsh et al., 1998).
Research Objectives
Hence, in this work, the following objectives are established:
1. Validate a Student Evaluation of Teaching Rating Scale and a short version of the corresponding long scale, including four items for each measured dimension, in a large sample of higher education students enrolled in a polytechnic higher education institution.
2. Test alternative structures of the dimensions of the Student Evaluation of Teaching Rating Scale.
3. Find the relationship between the long and short forms of the scale and academic achievement.
4. Examine whether the scores are invariant with respect to relevant variables such as the gender of the students in the context of scientific-technological studies.
5. Considering the hierarchical nature of the data, analyze the ratings of teaching given by individual students nested in different groups, classes, or sections, where each group evaluates a different teacher.
Materials and Methods
Participants
The sample comprised 6,110 students of the National Polytechnic School of Ecuador who rated the teaching of their teachers. These students were enrolled in 28 different degree programs across eight faculties and attended 358 different classes. Of the students, 68.3% were male and 31.7% female; the higher percentage of male students is representative of the population of students in polytechnic studies. The average age was 22.6 years (SD = 3.2). The students rated the teaching of their teachers during the 2016–17 academic year.
The sample of teachers comprised 310 teachers, most of whom were male (62.8%), aged between 26 and 57 years (mean = 43.7), belonging to all professional categories, from assistant professor to principal professor, with a majority (42%) of full professors, and with extensive teaching experience (mean = 18.6 years).
This sample of participants corresponds to the students enrolled in the aforementioned studies, who took part in the evaluation process of the teaching staff of their institution, the EPN, at the end of a semester.
Measures
Students' evaluations of teaching were obtained from the “Cuestionario de Evaluación de la Enseñanza del Profesor de la Escuela Politécnica Nacional del Ecuador” [Teacher Evaluation Questionnaire of the National Polytechnic School], approved by the teaching staff for the 2016–17 academic year. This scale comprises 32 items grouped theoretically into the following four factors: 1. Planning, mastery, and clarity in the explanation of the subject matter (items 1–9); 2. Methodology and resources (items 10–15); 3. Evaluation (items 16–23); 4. Teacher–student relationship (items 24–32). The response scale ranges from 1 to 5 (1: do not agree at all; 2: agree a little; 3: moderately agree; 4: strongly agree; 5: totally agree). The full and reduced scales, with the items grouped into the four theoretical dimensions, are included in Appendix A.
The measures of student academic performance were obtained for a subsample of 1,538 students, consisting of those students for whom data on academic performance were available in the university's administrative computerized records. There is no known evidence that this subsample is biased with respect to the total sample used in this study. Academic performance at the end of the semester was operationalized as the grade awarded by the teacher, based on a final exam: a written examination, both theoretical and practical. These final exams were the same across sections in some cases and differed between sections in others. Different sections follow the same program and the same assessment criteria, which are specified in the study program of each course; therefore, the exams, although different, can be considered largely equivalent. There are also common general rules for all exams at the National Polytechnic School of Ecuador. Final grades ranged from 0 to 40 for all courses.
Students' age and gender, as well as teachers' age, gender, and experience, were collected from administrative records.
Procedure
The data were collected from the existing computer records in the administration of the Polytechnic School, and permission for access to them was granted to the academic staff of the Institution. The data provided by the institution were anonymous, with only one identification code for each student.
The application of the evaluation of teaching scale by the students was carried out toward the end of the semester, before they knew their final grades. All the teachers were evaluated by the students in a similar period of time. All the students had to evaluate the teachers to be able to access their final grades. The student evaluation of teaching was conducted through an electronic platform on which the data were recorded.
The impact that faculty procedures for student evaluations of teaching have on response rates, especially for electronic evaluations, has been analyzed by several authors. Young et al. (2019) found that evaluations made by students were considerably higher when faculty gave students in-class time to complete the evaluation than when an electronic form was issued by the administration. However, other studies did not find differences between evaluations made with electronic questionnaires and paper-and-pencil questionnaires, or when a more representative sample responded instead of a smaller, more biased sample (Nowell et al., 2014).
As response rates to electronic administration are lower than to paper-and-pencil questionnaires, the procedure followed in this work consisted in requiring all the students to answer the evaluation survey in order to access their final grades. This procedure has proved useful and valid in some higher education institutions (Leung and Kember, 2005; Nair and Adams, 2009).
Data Analysis
Preliminary Analyses
We explored means, standard deviations, skewness, and intraclass correlations (ICCs) for all items. Skewness indicates the asymmetry of the distribution, while ICC gives information about the non-independence of data, that is, the similarity of students' responses in the same class.
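For illustration, a minimal sketch of these descriptives and of a one-way random-effects ICC(1) is given below in Python (pandas). The DataFrame, its column names, and the generated responses are hypothetical, and the average class size is used for unbalanced classes as a simplification; this is a sketch of the idea, not the exact computation used in the study.

```python
import numpy as np
import pandas as pd

def icc1(df, item, group="class_id"):
    """One-way random-effects ICC(1): (MS_between - MS_within) /
    (MS_between + (k - 1) * MS_within), with k the average class size
    (a simplification when class sizes are unequal)."""
    grand_mean = df[item].mean()
    by_class = df.groupby(group)[item]
    n_classes = by_class.ngroups
    ss_between = (by_class.size() * (by_class.mean() - grand_mean) ** 2).sum()
    ss_within = ((df[item] - by_class.transform("mean")) ** 2).sum()
    ms_between = ss_between / (n_classes - 1)
    ms_within = ss_within / (len(df) - n_classes)
    k = len(df) / n_classes
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical ratings: one row per student, item columns, and a class id
rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "class_id": np.repeat(np.arange(20), 15),
    "item_1": rng.integers(1, 6, size=300),
    "item_2": rng.integers(1, 6, size=300),
})

items = ["item_1", "item_2"]
summary = ratings[items].agg(["mean", "std", "skew"]).T
summary["icc"] = [icc1(ratings, it) for it in items]
print(summary.round(3))
```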
Construct Validity
To gather evidence of the scale's construct validity, we followed the recommendations of Schmitt et al. (2018). There are different methods to retain the “best” factor structure, for instance, exploratory factor analysis (EFA), confirmatory factor analysis (CFA), or exploratory structural equation modeling (ESEM). EFA has the disadvantage that its results are difficult to replicate across samples, while CFA leads to biased loadings and factor correlations because it requires cross-loadings on non-target factors to be zero (Garn et al., 2018). ESEM combines EFA and CFA, provides goodness-of-fit indices, and allows testing for multiple-group measurement invariance (Xiao et al., 2019). Schmitt et al. (2018) recommend using EFA when there is no a priori theory, CFA when there is a strong theory and evidence of the scale structure, and ESEM when the a priori theory is sparse. Howard et al. (2018) add that ESEM should be retained over CFA when the factor correlations differ between the two methods.
Another interesting issue in factor analysis, specifically for multidimensional structures, is bi-factor models (Morin et al., 2016). Bi-factor models are used to divide the covariance between a global factor (e.g., teachers' style) and specific factors (e.g., Methodology and resources or Teacher-student relationship).
Therefore, in view of the above information and our data, we tested the following models: a one-factor CFA, a four-factor CFA, a four-factor ESEM, and a bi-factor four-factor ESEM (see Figure 1). To select the factor structure, we relied on adjusted χ2-difference tests and on changes in CFI and RMSEA. The estimation method was robust maximum likelihood (MLR) because the data were non-normal; moreover, as responses were not independent, we corrected the χ2 and standard errors using a sandwich estimator (Muthen and Satorra, 1995; Muthén and Muthén, 2020). All analyses were conducted with Mplus 8.4 (Muthén and Muthén, 2020).
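Because robust (MLR) χ2 values cannot simply be subtracted, the adjusted χ2-difference test rescales the difference using each model's scaling correction factor. A minimal Python sketch of that computation, following the commonly documented scaled-difference procedure for MLR output, is shown below; the numbers are placeholders, not values from this study.

```python
from scipy.stats import chi2

def scaled_chi2_diff(t0, c0, d0, t1, c1, d1):
    """Scaled (Satorra-Bentler type) chi-square difference test for nested
    models estimated with robust ML.  t = robust chi-square, c = scaling
    correction factor, d = degrees of freedom; model 0 is the more
    restrictive (nested) model, model 1 the comparison model."""
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)   # difference-test scaling correction
    trd = (t0 * c0 - t1 * c1) / cd         # scaled chi-square difference
    return trd, d0 - d1, chi2.sf(trd, d0 - d1)

# Placeholder numbers only (not the values reported in this study)
trd, ddf, p = scaled_chi2_diff(t0=2500.0, c0=1.40, d0=458,
                               t1=1400.0, c1=1.35, d1=374)
print(f"scaled delta chi2({ddf}) = {trd:.2f}, p = {p:.4f}")
```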
Short Version
To choose items for the short version, we took into account factor loadings, corrected item-test correlations, reliability, and the theoretical relevance of the items (Marsh et al., 2010). To test the agreement between the two versions, we relied on Levy's correction of the short- vs. long-form correlation, which accounts for the error variance shared by both forms due to the common subset of items (Levy, 1967; Barrett, 2015). Moreover, because the correlation only reflects the monotonic association between the two forms, we also relied on the Gower index (Gower, 1971; Barrett, 2012), whose values range between 0 and 1, with values close to 1 indicating agreement.
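As an illustration of the agreement index (Levy's correction is not reproduced here), the following is a minimal Python sketch of the Gower index for two score vectors on the same 1-5 metric; the example scores are hypothetical, not data from this study.

```python
import numpy as np

def gower_agreement(x, y, score_range):
    """Gower agreement between two score vectors on the same metric:
    1 minus the mean absolute difference divided by the possible range.
    Unlike a correlation, values near 1 require near-identical scores,
    not merely the same ordering."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 1.0 - np.mean(np.abs(x - y)) / score_range

# Hypothetical per-student subscale means from the short and long forms,
# both on the 1-5 response scale (possible range = 4)
short_form = [4.25, 3.50, 4.75, 2.25, 3.75]
long_form = [4.11, 3.56, 4.67, 2.44, 3.89]
print(round(gower_agreement(short_form, long_form, score_range=4), 3))
```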
Gender Measurement Invariance
To test whether male and female students interpret the scale similarly, we performed a measurement invariance test (Vandenberg and Lance, 2000). Specifically, we compared three models: configural, metric, and scalar (Muthén and Muthén, 2020). The configural model has factor loadings, intercepts, and residual variances free across groups and factor means fixed at zero in all groups. In the metric model, factor loadings are held equal across groups, while intercepts and residual variances are free across groups, and factor means are fixed at zero in all groups. Finally, in the scalar model, factor loadings and intercepts are equal across groups, while residual variances are free across groups, and factor means are constrained to zero in one group and free in the other group. For model comparisons, we used the adjusted χ2-difference tests and changes in CFI and RMSEA.
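As a small illustration of the decision logic for these comparisons, a sketch is given below; the specific ΔCFI and ΔRMSEA cutoffs are commonly cited conventions and are an assumption of this example, not values taken from the paper.

```python
def invariance_supported(fit_free, fit_restrictive,
                         max_cfi_drop=0.010, max_rmsea_rise=0.015):
    """Judge whether the more restrictive model (e.g., metric vs. configural)
    fits essentially as well as the less restrictive one, using conventional
    change-in-fit cutoffs (the cutoffs are assumptions, not from the paper).
    Each fit argument is a dict with 'cfi' and 'rmsea' entries."""
    cfi_drop = fit_free["cfi"] - fit_restrictive["cfi"]
    rmsea_rise = fit_restrictive["rmsea"] - fit_free["rmsea"]
    return cfi_drop <= max_cfi_drop and rmsea_rise <= max_rmsea_rise

# e.g., metric vs. configural, with illustrative fit values
print(invariance_supported({"cfi": 0.991, "rmsea": 0.035},
                           {"cfi": 0.990, "rmsea": 0.036}))
```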
Reliability
Finally, to estimate the reliability of the short and long forms, we did not use Cronbach's alpha because there is increasing evidence of its lack of accuracy and of the difficulty of meeting its assumptions of parallel or tau-equivalent items (Zhang and Yuan, 2016; McNeish, 2018). Cho (2016) proposes different formulas to estimate reliability when items lack parallelism, tau-equivalence, or both, not only for unidimensional but also for multidimensional structures.
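As a rough sketch of the idea behind congeneric reliability, the snippet below computes the single-factor composite reliability (coefficient omega) from standardized loadings and residual variances; Cho's (2016) correlated-factors coefficient for the total score additionally incorporates the factor covariances, which is omitted here. The loadings are hypothetical.

```python
import numpy as np

def congeneric_reliability(loadings, error_variances):
    """Composite reliability of a single congeneric factor: squared sum of
    standardized loadings divided by the total variance of the sum score.
    For a correlated-factors total score, the numerator additionally sums
    the cross-factor covariance terms (see Cho, 2016)."""
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(error_variances, dtype=float)
    return lam.sum() ** 2 / (lam.sum() ** 2 + theta.sum())

# Hypothetical standardized loadings and residual variances for one subscale
print(round(congeneric_reliability([0.80, 0.75, 0.85, 0.70],
                                   [0.36, 0.44, 0.28, 0.51]), 3))
```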
Criterion Validity: Relation With Academic Achievement
To analyse the relationships between student ratings of teaching and academic performance, the data were analyzed both individually and grouped into sections. In principle, the validity of students' ratings might be evidenced by the correlation between SET and academic achievement. Nevertheless, students' grades cannot be assumed to constitute a simple measure of teaching effectiveness, because each group may be evaluated differently (Richardson, 2005). The key evidence cited in support of student evaluations of teaching as a measure of a teacher's instructional effectiveness comes from multisection studies, in which different professors teach the same subject following the same outline and, at the end of the semester, all sections take the same exam or equivalent ones (Cohen, 1981; Uttl et al., 2017). To find the correlation between scale scores and academic performance, the data were therefore analyzed both at the individual level and, as in a typical multisection study, with the class average as the unit of analysis.
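A minimal sketch of both levels of analysis in Python is shown below; the DataFrame, its column names, and the numbers are hypothetical and stand in for the study's records.

```python
import pandas as pd

# Hypothetical data: one row per student, with the SET total score, the
# final grade (0-40), and the section (class) identifier.
data = pd.DataFrame({
    "section_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "set_total":  [4.2, 3.8, 4.5, 3.1, 3.4, 2.9, 4.8, 4.6, 4.9],
    "final_grade": [32, 28, 35, 22, 25, 20, 38, 36, 39],
})

# Student-level (individual) correlation
r_individual = data["set_total"].corr(data["final_grade"])

# Multisection-style analysis: aggregate to section means so that each
# section, taught by a different teacher, contributes a single data point
section_means = data.groupby("section_id")[["set_total", "final_grade"]].mean()
r_section = section_means["set_total"].corr(section_means["final_grade"])

print(round(r_individual, 3), round(r_section, 3))
```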
Results
Preliminary Analyses
Means varied between 3.85 for Item 15 and 4.07 for Item 9, while standard deviations ranged from 1.02 for Item 2 to 1.16 for Item 11. Skewness varied from −0.840 for Item 15 to −1.120 for Item 1. More information can be found in Appendix B.
Construct Validity
We compared the four proposed models. The four-factor CFA fit significantly better than the one-factor CFA (Δχ2 = 10217.93, df = 8, p < 0.001). Similarly, the four-factor ESEM fit significantly better than the four-factor CFA (Δχ2 = 1272.977, df = 84, p < 0.001). Finally, the bi-factor four-factor ESEM fit significantly better than the four-factor ESEM (Δχ2 = 1143.317, df = 28, p < 0.001).
The structure with the best fit was the bi-factor four-factor ESEM (see Table 1). However, to retain this structure, moderate-high factor loadings were required in the global factor (Howard et al., 2018), and in this case, the factor loading absolute values were between 0.024 and 0.228, with an average value of 0.093. Therefore, we discarded the bi-factor four-factor ESEM and proceeded to explore the four-factor ESEM structure. This structure provided moderate to high loadings and low cross-loadings (see Appendix A). Specifically, for Planning, mastery, and clarity in the explanation of the subject matter (Factor 1), the loadings ranged between 0.508 and 0.857, for Methodology and resources (Factor 2) between 0.601 and 0.856, for Evaluation (Factor 3) between 0.385 and 0.885, and for Teacher-student relationship (Factor 4) between 0.629 and 0.958. Thus, we decided to retain this structure.
Short Version
Construct Validity
Following the recommendations of Marsh et al. (2010), we selected four items for each subscale. Next, we tested the selected structure via ESEM. The chi-square test and fit indices were: χ2(62, N = 6,110) = 509.115, p < 0.001, CFI = 0.992, RMSEA = 0.034 (90% CI = 0.032, 0.037). For Planning, mastery, and clarity in the explanation of the subject matter, the loadings ranged between 0.676 and 0.898; for Methodology and resources, between 0.572 and 0.916; for Evaluation, between 0.672 and 0.864; and for Teacher-student relationship, between 0.675 and 0.946 (see Appendix C).
Agreement Between Both Versions
As shown in Table 2, Levy's corrected correlation and the Gower index revealed a high concurrence between both forms, ranging from r = 0.893 to r = 0.974.
Gender Measurement Invariance
32-Item Scale
Multiple-group analyses examining potential gender differences showed that the difference in fit between the configural and the metric model was non-significant (Δχ2 = 93.127, df = 112, p = 0.902). Similarly, the comparison between the metric and the scalar model was non-significant (Δχ2 = 126.335, df = 140, p = 0.790). Thus, we found no gender differences in loadings, intercepts, or factor means in the long-form scale (see Table 3).
16-Item Scale
The comparison between the configural and the metric model revealed a non-significant difference in fit (Δχ2 = 38.043, df = 48, p = 0.847). Similarly, the comparison between the metric and the scalar model was non-significant (Δχ2 = 55.838, df = 60, p = 0.629). Thus, we did not find gender differences in loadings, intercepts, or factor means in the short form either (see Table 4).
Reliability
32-Item Scale
The reliability of the scale was assessed using the Congeneric Correlated Factors formula. Reliability for the whole scale was 0.980, for Planning, mastery, and clarity in the explanation of the subject matter 0.949, for Methodology and resources 0.901, for Evaluation 0.948, and for Teacher-student relationship 0.947.
16-Item Scale
The reliability for the whole scale was 0.972, for Planning, mastery, and clarity in the explanation of the subject matter 0.904, for Methodology and resources 0.901, for Evaluation 0.920, and for Teacher-student relationship 0.919.
Correlation With Academic Achievement
Table 5 shows the correlations of the long and short versions of the teaching evaluation scale with academic performance, using both individual data and data aggregated by section. All the correlations were statistically significant, with low to moderate values. Both the subscales and the total scale showed significant correlations with academic performance. The correlations for the reduced scale were very similar to those of the long scale. In addition, the correlations for the data aggregated by class or section were slightly higher than those for the individual data.
Table 5. Correlations of the long and short versions of the teaching evaluation scale with academic performance, using individual and section-aggregated data.
Discussion
The results clearly show the structural validity of the student evaluation of teaching ratings developed at the National Polytechnic School of Ecuador. Given that the main objective of this study was to propose a short scale with demonstrated reliability and validity, confirmatory factor analysis (CFA) and exploratory structural equation modeling were used.
Results showed a multidimensional model with four highly correlated factors that do not exclude a general factor, with an excellent fit to data, both in the long scale and in the short version of the scale. The structure with the best fit was the bi-factor four-factor ESEM; however, the factor loadings on the global factor were low (Howard et al., 2018) and, thus, the four-factor ESEM structure was retained.
Based on a sample of 26,746 students who took the Program for International Student Assessment (PISA) of 2012, Scherer et al. (2016) found that bi-factor exploratory structural equation modeling outperformed alternative approaches with respect to model fit.
Researchers are divided over the existence of a second-order general factor (Abrami et al., 1997; Cheung, 2000) versus different first-order correlated factors (Marsh, 1991b, 2007a). As for the practical implications of this issue, perhaps the most accurate conclusion is the one provided as early as 1991 by Marsh (1991a) himself: “I have chosen a middle ground recommending the use of both specific dimensions and global ratings” (p. 419).
The use of academic performance measures as an external criterion for the validity of student evaluation of teaching (SET) rating scales is very common in validation work and has been called a strong test of criterion validity. However, the meta-analyses (Cohen, 1981; Feldman, 1989; Marsh, 2007a; Clayson, 2009; Uttl et al., 2017) show moderate (0.50–0.20) to small (0.20–0.00) positive correlations between SET scores and student achievement. Although these results provide some evidence of the convergent validity of SET scales, given the variety of views concerning good teaching and the variety in the measurement and predictors of student achievement (Spooren et al., 2013; Schneider and Preckel, 2017), academic achievement should not be the only indicator of the criterion validity of SET scales.
Although student evaluation of teaching rating scales are multidimensional, many researchers defend the use of single, global scores (Apodaca and Grad, 2005). For this reason, even when recognizing the multidimensional and hierarchical structure of the dimensions evaluated, many studies of this issue use global scores; meanwhile, the feedback provided to teachers for the improvement of teaching practice includes a profile of scores on the different dimensions, which shows the strengths and weaknesses of each teacher's methods.
Given the possibility of student gender bias in student evaluation of teaching, configural, metric, and scalar gender measurement invariance were tested. Previous research has shown that female students are likely to provide higher SET ratings (e.g., Badri et al., 2006; Darby, 2006). Bonitz (2011) found that gender variations in SET scores could be due to gender differences in traits, such as agreeableness, that correlate with SET scores. The results of this study, however, showed configural, metric, and scalar gender measurement invariance in the context of scientific-technological studies.
Although the literature on gender bias in SET shows that male students express a bias in favor of male professors (Centra and Gaubatz, 2000; Boring, 2017; Mitchell and Martin, 2018; American Sociological Association, 2019), the extensive review by Kreitzer and Sweet-Cushman (2021) shows that the effect of gender is conditional on other factors. Other work shows that the gender bias against perceived female instructors disappears (Uttl and Violo, 2021). The results of Rivera and Tilcsik (2019) even show that these gender differences can disappear in scales with six points or fewer, like the scale used here.
The results of this work also show the concurrent validity of the reduced scale of 16 items, which showed a high correlation with the full scale of 32 items. Levy's corrected correlation and the Gower index revealed high concurrence between both forms, with values above 0.90. These results are slightly higher than those obtained in other studies that also showed a high degree of agreement between long and short forms of such scales (Gogol et al., 2014; Lafontaine et al., 2016).
The high values of the reliability coefficients, estimated according to the assumptions of the SEM model used, are also noteworthy for both the long and short scales and their subscales. These values were higher than 0.90 and reached 0.98 and 0.97 for the whole scales. The congeneric correlated factors procedure (Cho, 2016) was applied because it allows for different factor loadings and yields multidimensional reliability coefficients, unlike Cronbach's alpha, which assumes that all factor loadings are equal (i.e., tau-equivalence) and thus underestimates reliability.
On the other hand, the results also showed moderate, significant correlations between both the long and short versions of the scale with academic performance, taking individual and aggregate data in classes or sections.
The evidence in support of student evaluations of teaching as a measure of teachers' instructional effectiveness comes from studies showing correlations between measures of student evaluation and student achievement, a strong test of criterion validity.
The results obtained with aggregate data, taking the section as the unit of analysis, showed a moderate and statistically significant correlation (0.26) between student ratings and final performance. This result is expected from studies of instructors' teaching effectiveness, in which multisection studies are considered more appropriate for capturing the true relationship between student evaluations of teaching and academic performance (Cohen, 1981; Uttl et al., 2017).
However, the relationship of the students' evaluation of teaching with their academic performance is lower than that found in some previous meta-analytic studies (Cohen, 1981), but higher than that found in the meta-analysis of Uttl et al. (2017) of the studies published to that date, when small study size effects and prior academic achievement were considered. Taken together, the results demonstrated the good psychometric qualities of the Teacher Evaluation Questionnaire of the National Polytechnic School and its construct and criterion validity, as well as its high reliability. In addition, the psychometric indices of the short version of this scale suggest the possibility of developing short scales of three or four items that are equally reliable and valid.
In addition, the relationships obtained between the long and short versions of the new instrument and academic performance have practical implications for teaching. This instrument may help teachers adapt their teaching to student needs and preferences in the context of the specific characteristics of polytechnic studies.
However, we must not lose sight of the open controversy between students' perceptions of the quality of teaching, or perceptions of learning, and their actual learning. In the context of STEM (Science, Technology, Engineering, and Mathematics) instruction, Deslauriers et al. (2019) found that students in active classrooms learned more, but their perception of learning was lower than that of their peers receiving passive instruction.
Regarding the limitations of this study and possible future studies, given that the long and short forms were administered as part of the full scale, and despite the Levy and Gower corrections applied when computing the correlation between the two versions, it would be necessary to administer the long and short scales independently to the same sample. In addition, it would be convenient to examine the factor structure of the short scale in an independent, representative sample of students. In this study we analyzed the relationship with academic achievement; it might be of interest to explore the relationship with higher education engagement (Vizoso et al., 2018) or general pedagogical knowledge (Klemenz et al., 2019). Finally, obtaining longitudinal data in the same and different samples of the National Polytechnic School could serve to deepen the validity evidence for the scale developed in this work.
It should also be taken into account that these results were obtained in a single institution, which limits their generality; however, it is the largest institution of polytechnic studies (science, biotechnology, engineering, architecture, etc.) in Ecuador and draws students from all over the country.
In sum, this work provides evidence of the validity of a teaching evaluation scale in the setting of a polytechnic institution of higher education, as well as a rigorous methodological procedure for the validation of short versions of teaching evaluation scales.
Data Availability Statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author Contributions
TS: project administration and data curation. JL: methodology and writing review and editing. RG-C: conceptualization and resources. J-LC: supervision and writing original draft. All authors contributed to the article and approved the submitted version.
Funding
This research was supported by the Secretaría Nacional de Educación Superior, Ciencia, Tecnología e Innovación, Ecuador, (SENESCYT; PIC-18-INE-EPN-002).
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Supplementary Material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2021.635543/full#supplementary-material
References
Abrami, C. P., d'Apollonia, S., and Rosenfield, S. (1997). “The dimensionality of student ratings of instruction: what we know and what we do not,” in Effective Teaching in Higher Education: Research and Practice, eds R. E. Perry and J.C. Smart (New York, NY: Agathon Press), 321–367.
Aguilar, R. M., and Bautista, M. J. (2015). Teacher profiles and excellence: a study at the Universidad Técnica Particular of Loja, Ecuador. Revista Iberoamericana de Educación a Distancia 18, 225–250. doi: 10.5944/ried.18.2.13920
American Sociological Association (2019). Statement on Student Evaluations of Teaching. Available online at: https://www.asanet.org/sites/default/files/asa_statement_on_student_evaluations_of_teaching_feb132020.pdf (accessed April 6, 2021).
Andrade-Abarca, P. S., Ramón-Jaramillo, L. N., and Loaiza-Aguirre, M. I. (2018). Application of the SEEQ as an instrument to evaluate university teaching activity. Revista de Investigación Educativa 36, 259–275. doi: 10.6018/rie.36.1.260741
Apodaca, P., and Grad, H. (2005). The dimensionality of student ratings of teaching: integration of uni- and multidimensional models. Stud. High. Educ. 30, 723–748. doi: 10.1080/03075070500340101
Badri, M. A., Abdulla, M., Kamali, M. A., and Dodeen, H. (2006). Identifying potential biasing variables in student evaluation of teaching in a newly accredited business program in the UAE. Int. J. Educ. Manag. 20, 43–59. doi: 10.1108/09513540610639585
Barrett, P. (2015). Levy's Short vs. Long Form Corrected Correlation. Auckland: Advanced Projects R&D Ltd.
Basow, S. A., Phelan, J. E., and Capotosto, L. (2006). Gender patterns in college students' choices of their best and worst professors. Psychol. Women Quart. 30, 25–35. doi: 10.1111/j.1471-6402.2006.00259.x
Bonitz, V. S. (2011). Student Evaluation of Teaching: Individual Differences and Bias Effects. Iowa State University. Digital Repository. Available online at: https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=3183&context=etd (accessed April 6, 2021).
Boring, A. (2015). Gender Biases in Student Evaluations of Teachers. Documents de Travail de l'OFCE 2015-13. Paris: Observatoire Francais des Conjonctures Economiques (OFCE). Available online at: http://www.anneboring.com/uploads/5/6/8/5/5685858/aboring_gender_biases_in_set_april_2014.pdf (accessed April 6, 2021).
Boring, A. (2017). Gender biases in student evaluations of teaching. J. Publ. Econ. 145, 27–41. doi: 10.1016/j.jpubeco.2016.11.006
Braxton, J., Milen, J., and Sullivan, A. (2000). The influence of active learning on the college student departure process: toward a revision of Tinto's Theory. J. Higher Educ. 17, 569–590. doi: 10.2307/2649260
Brunner, M., Keller, U., Dierendonck, C., Reichert, M., Ugen, S., Fischbach, A., et al. (2010). The structure of academic self-concepts revisited: the nested Marsh/Shavelson model. J. Educ. Psychol. 102, 964–981. doi: 10.1037/a0019644
Centra, J. A., and Gaubatz, N. B. (2000). Is there gender bias in student evaluations of teaching? J. High. Educ. 71, 17–33. doi: 10.1080/00221546.2000.11780814
Cheung, D. (2000). Evidence of a single second-order factor in student ratings of teaching effectiveness. Struct. Equat. Model. 7, 442–460. doi: 10.1207/S15328007SEM0703_5
Cho, E. (2016). Making reliability reliable: a systematic approach to reliability coefficients. Org. Res. Methods 19, 651–682. doi: 10.1177/1094428116656239
Christophersen, T., and Konradt, U. (2011). Reliability, validity, and sensitivity of a single-item measure of online store usability. Int. J. Hum. Comp. Stud. 69, 269–280. doi: 10.1016/j.ijhcs.2010.10.005
Clayson, D. E. (2009). Student evaluations of teaching: are they related to what students learn? A meta-analysis and review of the literature. J. Market. Educ. 31, 16–30. doi: 10.1177/0273475308324086
Cohen, P. A. (1981). Student ratings of instruction and student achievement: a meta-analysis of multisection validity studies. Rev. Educ. Res. 51, 281–309. doi: 10.3102/00346543051003281
Consejo de Educación Superior (2017). Reglamento de Carrera y Escalafón del Profesor e Investigador del Sistema de Educación Superior. [Career and Ladder Regulations of the Professor and Researcher of the Higher Education System]. Available online at: https://bit.ly/2Y6Jc0w (accessed April 20, 2019).
Credé, M., Harms, P., Niehorster, S., and Gaye-Valentine, A. (2012). An evaluation of the consequences of using short measures of the Big Five personality traits. J. Personal. Soc. Psychol. 102, 874–888. doi: 10.1037/a0027403
Darby, J. A. (2006). Evaluating courses: an examination of the impact of student gender. Educ. Stud. 32, 187–199. doi: 10.1080/03055690600631093
Deslauriers, L., McCarty, L. S., Miller, K., Callaghan, K., and Kestin, G. (2019). Measuring actual learning versus feeling of learning in response to being actively engaged in the classroom. Proc. Natl. Acad. Sci. U. S. A. 116, 19251–19257. doi: 10.1073/pnas.1821936116
Feldman, K. A. (1989). The association between student ratings of specific instructional dimensions and student achievement: refining and extending the synthesis of data from multisection validity studies. Res. High. Educ. 30, 583–645. doi: 10.1007/BF00992392
Fernández, N., and Coppola, N. (2008). An approach to evaluation of academic teaching in some Ibero-American countries. A comparative perspective between resemblances, differences, and convergence. Perspectivas em Políticas Públicas 1, 131–163.
Garn, A. C., Morin, A. J. S., and Lonsdale, C. (2018). Basic psychological need satisfaction toward learning: a longitudinal test of mediation using bifactor exploratory structural equation modeling. J. Educ. Psychol. 111, 354–372. doi: 10.1037/edu0000283
Gogol, K., Bunner, M., Goetz, T., Martin, R., Ugen, S., Keller, U., et al. (2014). “My questionnaire is too long!” The assessments of motivational-affective constructs with three-item and single-item measures. Contemp. Educ. Psychol. 39, 188–205. doi: 10.1016/j.cedpsych.2014.04.002
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics 27, 857–871. doi: 10.2307/2528823
Hornstein, H. A. (2017). Student evaluations of teaching are an inadequate assessment tool for evaluating faculty performance. Cogent Educ. 4:1. doi: 10.1080/2331186X.2017.1304016
Howard, J. L., Gagné, M., Morin, A. J. S., and Forest, J. (2018). Using bifactor exploratory structural equation modeling to test for a continuum structure of motivation. J. Manag. 44, 2638–2664. doi: 10.1177/0149206316645653
Huybers, T. (2014). Student evaluation of teaching: the use of best–worst scaling, Assess. Eval. High. Educ. 39, 496–513. doi: 10.1080/02602938.2013.851782
Klemenz, S., König, J., and Schaper, N. (2019). Learning opportunities in teacher education and proficiency levels in general pedagogical knowledge: new insights into the accountability of teacher education programs. Educ. Assess. Eval. Accountabil. 31, 221–249. doi: 10.1007/s11092-019-09296-6
Kline, R. B. (2016). Principles and Practice of Structural Equation Modeling, 4th edn. New York, NY: Guilford Press.
Kreitzer, R. J., and Sweet-Cushman, J. (2021). Evaluating student evaluations of teaching: a review of measurement and equity bias in SETs and recommendations for ethical reform. J. Acad. Ethics 21:9400. doi: 10.1007/s10805-021-09400-w
Kuh, G. D. (2002). Organizational culture and student persistence: prospects and puzzles. J. College Student Retent. 3, 23–39. doi: 10.2190/U1RN-C0UU-WXRV-0E3M
Lafontaine, M.-F., Brassard, A., Lussier, Y., Valois, P., Shaver, P. R., and Johnson, S. M. (2016). Selecting the best items for a short-form of the experiences in close relationships questionnaire. Eur. J. Psychol. Assess. 32, 140–154. doi: 10.1027/1015-5759/a000243
Leung, D. Y. P., and Kember, D. (2005). Comparability of data gathered from evaluation questionnaires on paper and through the Internet. Res. High. Educ. 46, 571–591. doi: 10.1007/s11162-005-3365-3
Levy, P. (1967). The correction for spurious correlation in the evaluation of short form tests. J. Clin. Psychol. 23, 84–86. doi: 10.1002/1097-4679(196701)23:1<84::AID-JCLP2270230123>3.0.CO;2-2
Linse, A. R. (2017). Interpreting and using student rating data: guidance for faculty serving as administrators and on evaluation committees. Stud. Educ. Eval. 54, 94–106. doi: 10.1016/j.stueduc.2016.12.004
Loor, K. J., Gallegos, M. R., Intriago, M. M., and Guillén, X. (2018). University faculty evaluation: Ibero-America trends. Educación Médica Superior 32, 239−252.
Lord, F. I., and Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Marsh, H. W., Hau, K. T., Balla, J. R., and Grayson, D. (1998). Is more ever too much? The number of indicators per factor in confirmatory factor analysis. Multivariate Behav. Res. 33, 181–230. doi: 10.1207/s15327906mbr3302_1
Marsh, H. W. (1991a). A multidimensional perspective on students' evaluations of teaching effectiveness: reply to Abrami and d'Apollonia (1991). J. Educ. Psychol. 83, 416–421. doi: 10.1037/0022-0663.83.3.416
Marsh, H. W. (1991b). Multidimensional students' evaluations of teaching effectiveness: a test of alternative higher-order structures. J. Educ. Psychol. 83, 285–296. doi: 10.1037/0022-0663.83.2.285
Marsh, H. W. (2007a). “Students' evaluations of university teaching: dimensionality, reliability, validity, potential biases and usefulness,” in The Scholarship of Teaching and Learning in Higher Education: An Evidence Based Perspective, eds R. P. Perry and J. C. Smart (New York, NY: Springer), 319–383. doi: 10.1007/1-4020-5742-3_9
Marsh, H. W. (2007b). Do university teachers become more effective with experience? A multilevel growth model of students' evaluation of teaching over 13 years. J. Educ. Psychol. 99, 775–790. doi: 10.1037/0022-0663.99.4.775
Marsh, H. W., Martin, A. J., and Jackson, S. E. (2010). Introducing a short version of the physical self description questionnaire: new strategies, short-form evaluative criteria, and applications of factor analyses. J. Sport Exerc. Psychol. 32, 438–482. doi: 10.1123/jsep.32.4.438
Marsh, H. W., Muthén, B., Asparouhov, T., Lüdtke, O., Robitzsch, A., Morin, A. J. S., et al. (2009). Exploratory structural equation modeling, integrating CFA and EFA: application to students' evaluations of university teaching. Struct. Equat. Model. 16, 439–476. doi: 10.1080/10705510903008220
McNeish, D. M. (2018). Thanks coefficient alpha, we'll take it from here. Psychol. Methods 23, 412–433. doi: 10.1037/met0000144
Mitchell, K. M., and Martin, J. (2018). Gender bias in student evaluations. Polit. Sci. Politics 51, 648–652. doi: 10.1017/S104909651800001X
Montoya, J., Arbesú, I., Contreras, G., and y Conzuelo, S. (2014). Evaluation of university teaching in Mexico, Chile and Colombia: analysis of the experiences. Revista Iberoamericana de Evaluación Educativa 7, 15–42.
Morin, A. J. S., Katrin Arens, A., and Marsh, H. W. (2016). A bifactor exploratory structural equation modeling framework for the identification of distinct sources of construct-relevant psychometric multidimensionality. Struct. Equat. Model. 23, 116–139. doi: 10.1080/10705511.2014.961800
Mortelmans, D., and Spooren, P. (2009). A revalidation of the SET37 questionnaire for student evaluations of teaching. Educ. Stud. 35, 547–552. doi: 10.1080/03055690902880299
Muthen, B. O., and Satorra, A. (1995). Complex sample data in structural equation modeling. Sociol. Methodol. 25, 267–316. doi: 10.2307/271070
Muthén, L. K., and Muthén, B. O. (2020). Mplus User's Guide, 8th edn. Los Angeles, CA: Muthén & Muthén.
Nagy, M. S. (2002). Using a single-item approach to measure facet job satisfaction. J. Occup. Org. Psychol. 75, 77–86. doi: 10.1348/096317902167658
Nair, C. S., and Adams, P. (2009). Survey platform: a factor influencing online survey delivery and response rate. Qual. High. Educ. 15, 291–296. doi: 10.1080/13538320903399091
Nowell, C., Gale, L. R., and Kerkvliet, J. (2014). Non-response bias in student evaluations of teaching. Int. Rev. Econ. Educ. 17, 30–38. doi: 10.1016/j.iree.2014.05.002
Pareja, F. (1986). La educación superior en el Ecuador [The higher education in Ecuador]. Caracas: Regional Center for Higher Education in Latin America and the Caribbean (CRESALC)-UNESCO.
Pimienta, J. H. (2014). Development and validation of an instrument for measuring teacher performance based on competencies. Revista de Docencia Universitaria 12, 231–250. doi: 10.4995/redu.2014.5648
Poitras, S. C., Guay, F., and Ratelle, C. F. (2012). Using the self-directed search in research: selecting a representative pool of items to measure vocational interest. J. Career Dev. 39, 186–207. doi: 10.1177/0894845310384593
Richardson, J. T. E. (2005). Instruments for obtaining student feedback: a review of the literature. Assess. Eval. High. Educ. 30, 387–415. doi: 10.1080/02602930500099193
Rispin, K., Davis, A. B., Sheafer, V. L., and Wee, J. (2019). Development of the Wheelchair Interface Questionnaire and initial face and content validity. Afri. J. Disabil. 8:a520. doi: 10.4102/ajod.v8i0.520
Rivera, L. A., and Tilcsik, A. (2019). Scaling down inequality: rating scales, gender bias, and the architecture of evaluation. Am. Sociol. Rev. 84, 248–274. doi: 10.1177/0003122419833601
Sandoval-Palis, I., Naranjo, D., Vidal, J., and Gilar-Corbi, R. (2020). Early dropout prediction model: a case study of university levelling course students. Sustainability 12:9314. doi: 10.3390/su12229314
Scherer, R., Nilsen, T., and Jansen, M. (2016). Evaluating individual students' perceptions of instructional quality: an investigation of their factor structure, measurement invariance, and relations to educational outcomes. Front. Psychol. 7:110. doi: 10.3389/fpsyg.2016.00110
Schmitt, T. A., Sass, D. A., Chappelle, W., and Thompson, W. (2018). Selecting the “best” factor structure and moving measurement validation forward: an illustration. J. Personal. Assess. 100, 345–362. doi: 10.1080/00223891.2018.1449116
Schneider, M., and Preckel, F. (2017). Variables associated with achievement in higher education: a systematic review of meta-analyses. Psychol. Bullet. 143, 565–600, doi: 10.1037/bul0000098
Spooren, P., Brockx, B., and Mortelmans, D. (2013). On the validity of student evaluation of teaching: the state of the art. Rev. Educ. Res. 83, 1–45. doi: 10.3102/0034654313496870
Toland, M., and De Ayala, R. J. (2005). A multilevel factor analysis of students' evaluations of teaching. Educ. Psychol. Measure. 65, 272–296. doi: 10.1177/0013164404268667
Uttl, B., and Violo, V. C. (2021). Small samples, unreasonable generalizations, and outliers: gender bias in student evaluation of teaching or three unhappy students? ScienceOpen Res. doi: 10.14293/S2199-1006.1.SOR.2021.0001.v1
Uttl, B., White, C. A., and Gonzalez, D. W. (2017). Meta-analysis of faculty's teaching effectiveness: student evaluation of teaching ratings and student learning are not related. Stud. Educ. Eval. 54, 22–42. doi: 10.1016/j.stueduc.2016.08.007
Vandenberg, R. J., and Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: suggestions, practices, and recommendations for organizational research. Org. Res. Methods 3, 4–70. doi: 10.1177/109442810031002
Vizoso, C., Rodríguez, C., and Arias-Gundín, O. (2018). Coping, academic engagement and performance in university students. High. Educ. Res. Dev. 37, 1515–1529. doi: 10.1080/07294360.2018.1504006
Xiao, Y., Liu, H., and Hau, K.-T. (2019). A comparison of CFA, ESEM, and BSEM in test structure analysis. Struct. Equat. Model. 26, 665–677. doi: 10.1080/10705511.2018.1562928
Yaghmale, F. (2009). Content validity and its estimation. J. Med. Educ. 3:e105015. doi: 10.22037/jme.v3i1.870
Young, K., Joines, J., Standish, T., and Gallagher, V. (2019). Student evaluations of teaching: the impact of faculty procedures on response rates. Assess. Eval. High. Educ. 44, 37–49. doi: 10.1080/02602938.2018.1467878
Zabaleta, F. (2007). The use and misuse of student evaluation of teaching. Teach. High. Educ. 12, 55–76. doi: 10.1080/13562510601102131
Keywords: criterion validity, reliability, scale validation, short scale development, structure validity, student evaluation of teaching
Citation: Sánchez T, León J, Gilar-Corbi R and Castejón J-L (2021) Validation of a Short Scale for Student Evaluation of Teaching Ratings in a Polytechnic Higher Education Institution. Front. Psychol. 12:635543. doi: 10.3389/fpsyg.2021.635543
Received: 30 November 2020; Accepted: 24 May 2021;
Published: 05 July 2021.
Edited by:
Anna Mystkowska-Wiertelak, University of Wrocław, Poland
Reviewed by:
Yu-Yu Hsiao, University of New Mexico, United States
Anna Włodarczyk, Universidad Católica del Norte, Chile
Copyright © 2021 Sánchez, León, Gilar-Corbi and Castejón. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Jaime León, jaime.leon@ulpgc.es