ORIGINAL RESEARCH article

Front. Psychol., 09 December 2020
Sec. Quantitative Psychology and Measurement

Developing and Validating Tests of Reading and Listening Comprehension for Fifth and Sixth Grade Students in Portugal

Bruna Rodrigues1*, Irene Cadime1, Fernanda Leopoldina Viana2 and Iolanda Ribeiro3
  • 1Psychology Research Centre, School of Psychology, University of Minho, Braga, Portugal
  • 2Research Centre on Child Studies, Institute of Education, University of Minho, Braga, Portugal
  • 3School of Psychology, University of Minho, Braga, Portugal

An efficient assessment of reading and linguistic abilities in school children requires reliable and valid measures. Moreover, measures that include vertically equated test forms for different academic grade levels allow the direct comparison of results across multiple time points and avoid floor and ceiling effects. Two studies were conducted to this end. The purpose of the first study was to develop tests of reading and listening comprehension in European Portuguese, with vertically scaled test forms for students in the fifth and sixth grades, using Rasch model analyses. The purpose of the second study was to collect evidence for the validity of these tests based on the relationships of test scores with other variables. The samples included 454 and 179 students for the first and second study, respectively. The data from both studies provided evidence of good psychometric characteristics for the test forms: unidimensionality and local independence of the items, as well as adequate reliability and evidence of validity. The developed test forms are an important contribution in the Portuguese educational context, as they allow for the assessment of students’ performance in these skills across multiple time points and can be used both in research and practice.

Introduction

The product of listening and reading comprehension is an integrated mental representation of the meaning of a text (Oakhill et al., 2019). The processes necessary to extract meaning from written or oral language are generally similar: integration of information, making inferences, association of what one reads/hears with one’s previous knowledge, and construction of the meaning of the material (Perfetti et al., 2005; Cain and Oakhill, 2008).

The assessment of these skills allows us to identify at-risk readers, to support the development of intervention and teaching programs, and to monitor the students’ progress in these areas over time (Santos et al., 2016b). For this purpose, the use of standardized measures with robust psychometric qualities is essential (Salvia et al., 2010).

The overall aim of this paper is to describe the development and validation of two vertically scaled forms of a reading comprehension and a listening comprehension test for Portuguese students in the fifth and sixth grades.

Reading and listening comprehension tests developed to assess specific age or grade-level groups are useful tools to examine inter-individual differences (i.e., to compare a student’s performance with a normative group). However, when the goal is to compare the achievement of the same student across different time points (intra-individual differences), the administration of the same test at different educational levels has several disadvantages: the use of the same test across a wide range of academic grades is problematic due to learning effects and the reactivity effects of the measures, and the results can be influenced by extreme floor or ceiling effects in the lower and upper grades, respectively (Santos et al., 2016b). A solution to these problems is to use different, grade-specific test forms with equated scores. Equating is a statistical process that converts the scores obtained on different test forms to a single metric, so that the test forms can be used at different points in time and the scores can be directly compared to assess the development of these skills in the same individual over time (Kolen and Brennan, 2014).

Equating models based on item response theory analyses are widely used (Wilson and Moore, 2011). Item response theory analyses, including Rasch model analyses, allow researchers to assess “not only the difficulty level of a specific item, but also permit interval scaling for the assessment of change, assessment of the dimensionality of a set of items, and specification of the range of items (in terms of ‘ability scores’) that characterize a particular measurement device” (Francis et al., 2005, p. 375). Item response theory analyses also allow researchers to perform differential item functioning (DIF) analyses. DIF analyses support equity in testing: identifying items that favor one group over another prevents bias in the comparison of test scores between different groups (Walker and Beretvas, 2001). The development of standardized tests following item response theory, and specifically the Rasch model, has several advantages, such as allowing the selection of the items most appropriate to the competence level of the group to be evaluated and the vertical equating of different versions of the same test (Prieto and Delgado, 2003; Kolen and Brennan, 2014).
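
For reference, the dichotomous Rasch model underlying the analyses reported below expresses the probability that person p answers item i correctly as a function of the difference between the person's ability θ_p and the item's difficulty b_i, both on the same logit scale (this is the standard formulation, not specific to the present tests):

$$P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}$$

Because persons and items share one scale, item difficulties estimated for different test forms can be anchored to a common vertical scale, which is what makes the equating described in this paper possible.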

Collecting empirical evidence of validity is also crucial in test development. According to the American Educational Research Association et al. (2014), “validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (p. 11). Evidence based on relationships with other variables is one source of validity evidence and refers to “the degree to which these relationships are consistent with the construct underlying the proposed test score interpretations” (American Educational Research Association et al., 2014, p. 16). It implies the identification of variables relevant to the construct to be measured and the analysis of the relationships between them. Reading comprehension requires the development of basic reading skills, such as oral reading fluency: fluent reading is necessary for the higher-level processes of reading comprehension to take place. Accordingly, medium-to-large correlation coefficients between these skills have been found across a wide range of orthographies of varying depths, in students up to the sixth grade (Yovanoff et al., 2005; Padeliadu and Antoniou, 2014; Fernandes et al., 2017). However, as the automaticity of basic reading processes increases across the school years, successful text comprehension becomes more dependent on higher-order skills, such as vocabulary, memory, reasoning, and comprehension monitoring (Yovanoff et al., 2005; Sesma et al., 2009; Ouellette and Beers, 2010; Ribeiro et al., 2015a; Nouwens et al., 2016; Fernandes et al., 2017).

With regard to listening comprehension, given that it involves linguistic processes similar to the ones used in reading comprehension, similar results have been observed for the relationship between listening comprehension, vocabulary, and working memory (Ouellette and Beers, 2010; Florit et al., 2014; Tighe et al., 2015; Kim, 2016; Jiang and Farquharson, 2018).

Analogical reasoning also seems to play an important role in solving comprehension tasks, since it supports inference-making processes (Tzuriel and George, 2009). Accordingly, previous studies have shown that verbal and non-verbal reasoning have medium-to-large correlations with reading and listening comprehension across several orthographies (Ribeiro et al., 2015a; Tighe et al., 2015; Potocki et al., 2017).

Moreover, readers who successfully comprehend a text employ planning strategies (e.g., evaluating the text’s difficulty before reading) to begin reading metacognitively, and monitoring strategies (e.g., summarizing information in the text) to make sense of what they read (Botsas, 2017). However, empirical studies yield mixed results when the use of reading strategies is assessed with self-report measures. For example, in a sample of Croatian students from the fifth to the eighth grade, perceived use of reading strategies was significantly associated with reading comprehension only for eighth graders (Kolić-Vehovec and Bajšanski, 2006), whereas in a study conducted with Chinese students, perceived use of reading strategies was moderately correlated with reading comprehension among fifth graders (Law, 2009).

Finally, previous studies have also found medium-to-large correlation coefficients between teachers’ ratings of students’ reading skills and students’ performance on standardized tests that assess reading and listening comprehension from kindergarten to the fifth grade (Feinberg and Shapiro, 2003, 2009; Gilmore and Vance, 2007; Viana et al., 2015; Santos et al., 2016a).

The Present Study

Various measures of reading assessment for elementary school students have been developed in Portugal. One of these measures is the Battery of Reading Assessment (Santos et al., 2015, 2016b), which is composed of vertically scaled forms to assess word reading, listening comprehension, and reading comprehension from the first to the fourth grade. The special attention paid to assessing reading and listening comprehension in the lower elementary grades can be explained by the importance and impact of learning during the primary school years on the subsequent years. However, national-level reports in Portugal have shown that the number of children who still have reading difficulties past the lower grades of elementary school remains high (Monteiro et al., 2014). These data raise growing concerns about reading difficulties emerging in the later years of schooling: students who succeed in learning to read in the primary grades, but then fall behind in the upper elementary or middle school grades (Leach et al., 2003; Catts et al., 2005; Lipka et al., 2006). This phenomenon imposes the need for robust measures that not only allow further research in this field, but also help to assess and monitor comprehension performance beyond the lower elementary school grades. Therefore, the present study aimed to expand the Battery of Reading Assessment to fifth and sixth graders in Portugal.

This paper reports the procedures and results of two studies. The purpose of the first study was to develop listening and reading comprehension tests, with two vertically scaled test forms for European Portuguese students in the fifth and sixth grades, using Rasch model analyses. The second study aimed to collect validity evidence for the two vertically scaled forms of each test based on the relationships of test scores with other variables, by analyzing the relationships between the developed test forms and measures used as external criteria: oral reading fluency, vocabulary, working memory, comprehension monitoring, verbal and abstract reasoning, teachers’ ratings, and academic achievement. Based on the research literature, it was expected that the scores on the test of reading comprehension would be positively correlated with all the other variables, and that the scores on the test of listening comprehension would be correlated with the measures of vocabulary, working memory, verbal and abstract reasoning, teachers’ ratings, and academic achievement.

Study 1

Materials and Methods

Participants

All participants were native speakers of European Portuguese, attending schools located in northern Portugal. The sample included 222 fifth graders (Mage = 10.95 years, SD = 0.58; 52.3% boys; 77% attending public schools) and 232 sixth graders (Mage = 11.98 years, SD = 0.42; 52.6% boys; 89.2% attending public schools). Students who qualified for educational intervention at the selective and/or additional levels were not included in the sample. With regard to socioeconomic status, 43.7% of the fifth graders and 26.7% of the sixth graders benefited from school social support (i.e., reduced-price meals at school, access to a book loan service, and support for the acquisition of school supplies). Regarding maternal education, 16.2% of the fifth graders’ mothers had completed a university degree, 28.8% had completed high school, and 55% had a lower educational level. For the sixth graders, 30.2% of the mothers had completed a university degree, 27.6% had completed high school, and 37.9% had a lower educational level (4.3% missing information).

Study Design and Measures

The non-equivalent groups with anchor test design is the most appropriate equating procedure for constructing measures when at least two groups that differ in ability level respond to different test forms (de Ayala, 2009; Kolen and Brennan, 2014). For this purpose, test forms should include a set of items common to adjacent grades and a set of unique items for each test form, which allows researchers to calibrate each test form separately as well as sequentially using vertical equating (Kolen and Brennan, 2014). This kind of equating is used when groups of subjects differ in ability level and tests differ in difficulty (Baker, 1984), and when the goal is to compare performance in skills that are expected to develop over time, such as listening and reading comprehension (de Ayala, 2009).

Specific test forms for each grade were developed to assess reading and listening comprehension at the end of the fifth and sixth grades: the Test of Reading Comprehension of Narrative Texts (TRC-n5/6) and the Test of Listening Comprehension of Narrative Texts (TLC-n5/6). Each test form included a booklet with three original texts (two texts unique to each grade and one text common to adjacent grades), authored by Portuguese writers of literature for children, and a worksheet containing the test items. The length of the texts ranged from 551 to 882 words for the TRC-n5 and TLC-n5, and from 574 to 700 words for the TRC-n6 and TLC-n6. Each test form comprised items unique to that form and items common to the test forms for the adjacent grades (see Table 1). For the TRC-n5 and TLC-n5, the anchor items (and the respective text) were derived from the test forms previously validated for fourth graders (Santos et al., 2015, 2016b). To select these anchor items, the mean difficulty level of the items of each text composing the fourth-grade test forms was computed; the text whose items had the highest mean difficulty was selected, and the respective items were used as anchor items. The anchor items (and the respective text) of the TRC-n6 and the TLC-n6 were derived from the TRC-n5 and the TLC-n5, respectively.

Test items were multiple-choice questions with three options. Prior studies have shown that three options are optimal for multiple-choice items, being as psychometrically efficient as four or five options (Delgado and Prieto, 1998; Rodriguez, 2005). Items were developed to assess the four levels of comprehension (literal, inferential, critical, and reorganization) described in the taxonomy by Català et al. (2001), which was also used in the development of the test forms for primary school students (Santos et al., 2015, 2016b). The items and the response options were developed by the researchers and later revised by text comprehension experts with extensive experience in teacher training. In the TRC-n, the student silently reads the text passages, which are followed by multiple-choice questions, and marks the chosen option on the answer sheet (pencil-and-paper format).
In the TLC-n5/6, the students listened to the texts, divided into short passages, and to the respective items, which were presented only orally through an audiotaped recording. The recording was stopped after each item so that all students had time to mark their response; the presentation of the next item proceeded only after all students had marked the chosen option on their answer sheet or decided not to respond. The testing procedure included two example items for all test forms, and there was no time limit to complete each test.

Table 1. Items in each test form of the TRC-n and TLC-n.
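
To make the anchor-item structure summarized in Table 1 concrete, the following sketch (in Python; the item counts are hypothetical, only the sample sizes come from Study 1) assembles the sparse response matrix that the non-equivalent groups with anchor test design produces, in which the anchor columns are the only items observed in both grades:

```python
# Illustrative sketch of the non-equivalent groups with anchor test design:
# each grade answers its own unique items plus a shared anchor set. Sample
# sizes match Study 1; the item counts here are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n5, n6 = 222, 232                         # fifth and sixth graders
unique5, unique6, anchors = 26, 24, 6     # hypothetical item counts

cols = ([f"U5_{i}" for i in range(unique5)]
        + [f"A_{i}" for i in range(anchors)]
        + [f"U6_{i}" for i in range(unique6)])
data = pd.DataFrame(np.nan, index=range(n5 + n6), columns=cols)

# Fifth graders see the unique-5 items plus the anchors; sixth graders see
# the anchors plus the unique-6 items. The anchors are the only overlap.
data.iloc[:n5, :unique5 + anchors] = rng.integers(0, 2, (n5, unique5 + anchors))
data.iloc[n5:, unique5:] = rng.integers(0, 2, (n6, anchors + unique6))

grade = np.where(np.arange(n5 + n6) < n5, "grade 5", "grade 6")
for name, block in [("unique-5 items", cols[:unique5]),
                    ("anchor items", cols[unique5:unique5 + anchors]),
                    ("unique-6 items", cols[unique5 + anchors:])]:
    # Fraction of administered (non-missing) responses per group and block.
    frac = data[block].notna().groupby(grade).mean().mean(axis=1)
    print(name, dict(frac.round(2)))
```

Running it shows that unique-5 items are observed only for grade 5, unique-6 items only for grade 6, and the anchors for both; this overlap is what links the two calibrations onto one vertical scale.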

Procedure

Legal authorizations for data collection were obtained from the ethics committee of the University of Minho, the Portuguese Ministry of Education, and the respective school boards. Informed consent for the test administration was collected in advance from parents or legal guardians. The anonymity and confidentiality of the data were guaranteed. Children were informed of the goals and characteristics of the study and were told that they could withdraw from the study at any time. All tests were administered collectively in the classroom by trained psychologists.

Data Analyses

Ten cases had missing data in the TLC-n6, but the missing values represented only 0.14% of the total data. Five outliers for the TRC-n6 were found in the exploratory data analyses and were, therefore, removed. Unidimensionality of the test forms was tested using principal component analyses (PCA) of the linearized Rasch residuals; eigenvalues lower than 2.0 and/or explained variance lower than 5% for secondary dimensions support this assumption (Linacre, 2018). Correlations between the items’ linearized Rasch residuals were calculated to examine the assumption of local independence of the items: correlations higher than 0.70 may indicate that the response to an item does not depend exclusively on the person’s ability but is influenced by the performance on another item (Linacre, 2018). Reliability was analyzed by calculating the Rasch person- and item-separation reliability coefficients (PSR and ISR), as well as the Kuder-Richardson formula 20 (KR20); all coefficients should be higher than 0.70 (Nunnally and Bernstein, 1994). Infit and outfit mean square statistics were analyzed to assess person and item fit to the model; these values should be smaller than 1.5 (Linacre, 2002). The mean ability of the students who selected each answer option was also computed for each item of the test forms, with the expectation that the students with the highest ability levels would choose the correct options (Linacre, 2018).

Differential item functioning analyses were conducted to test the invariance of measurement as a function of sex and socioeconomic status for all items of each test form, using the Rasch-Welch procedure and a significance level of 5% (Linacre, 2018). Besides statistical significance, DIF size was also considered for practical significance: DIF was considered notable if the DIF contrast was ≥ |1.0| logit (Wright and Douglas, 1975, 1976). The displacement of the anchor items was also analyzed to evaluate the stability of the common items’ difficulty between adjacent grades; anchor item displacement can be as large as 1.0 logit without much impact on measurement (Linacre, 2018). The literature also suggests a minimum of 20% of anchor items in tests with 40 or more items for equating purposes (Kolen and Brennan, 2014). Both criteria were taken into account in decisions about deleting anchor items.
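
As a minimal sketch of how the item-fit and reliability statistics above are defined (the textbook formulation, not the Winsteps implementation), assume a scored 0/1 response matrix X with persons in rows and items in columns, plus person abilities theta and item difficulties b from a previous calibration:

```python
import numpy as np

def rasch_p(theta, b):
    """Model probability of a correct response for every person-item pair."""
    theta, b = np.asarray(theta, float), np.asarray(b, float)
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def item_fit(X, theta, b):
    """Infit and outfit mean squares per item (values above 1.5 flag misfit)."""
    P = rasch_p(theta, b)
    W = P * (1.0 - P)                      # model variance of each response
    Z2 = (X - P) ** 2 / W                  # squared standardized residuals
    outfit = Z2.mean(axis=0)               # unweighted mean square
    infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)  # information-weighted
    return infit, outfit

def kr20(X):
    """Kuder-Richardson formula 20 for dichotomously scored items."""
    p = X.mean(axis=0)                     # proportion correct per item
    k = X.shape[1]
    return (k / (k - 1)) * (1.0 - (p * (1 - p)).sum() / X.sum(axis=1).var(ddof=1))
```

The PCA of residuals mentioned above can then be run on the standardized residuals (X - P) / sqrt(W), checking that the eigenvalues of secondary dimensions stay below 2.0.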

The TRC-n and TLC-n forms were linked in three steps. In the first step, the calibration of the versions for the fifth grade was performed by fixing the parameters of the common items at the values obtained in the versions for the fourth grade. In the second step, the items with inappropriate psychometric characteristics were removed. The characteristics considered were: item misfit; a point-measure correlation (the correlation between the response to the item and the construct measured by the set of items) lower than 0.15; a correct option not chosen by the participants with the highest levels of the latent trait; the presence of DIF as a function of sex or socioeconomic status; and, in the case of anchor items, a displacement higher than 1.0. In the third step, the number of unique items was reduced, taking into account the same criteria adopted in the development of the test forms for primary school students (Santos et al., 2015, 2016b): the spread of difficulty (items distributed along the continuum of ability of each grade sample were chosen), redundancy (the number of items of similar difficulty levels was reduced by discarding some redundant items), and the comprehension level (the proportion of items of each comprehension level in the initial pool was maintained in the final pool, and when two or more items had similar difficulty levels and measured the same comprehension level, the item with the higher point-measure correlation was selected). The same steps were followed for the versions for the sixth grade, with the anchor item parameters obtained in the fifth grade being fixed in the first step. The test forms were then linked through a final set of calibrations using the unique and anchor items selected in the previous steps, and the reliability coefficients were recalculated. All these analyses were conducted using the Rasch measurement software Winsteps 3.92.1 (Linacre, 2018). Descriptive statistics and one-way analyses of variance to test differences between grades in the scaled scores obtained on each test form were computed using IBM® SPSS Statistics 26. Statistically significant differences were expected between the mean scores of the three grade levels, with successively higher values in the upper grades.
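
The study fixes anchor-item parameters directly in Winsteps, but the effect of common-item linking can be illustrated with the classical mean-sigma transformation (Kolen and Brennan, 2014). The sketch below, with hypothetical difficulty values, rescales a new form's difficulties so that its anchors match the base scale:

```python
# Hedged illustration of common-item linking via the mean-sigma method;
# the study itself used fixed-anchor calibration in Winsteps, which serves
# the same goal. All difficulty values here are made up.
import numpy as np

def mean_sigma_link(b_anchor_base, b_anchor_new, b_new_form):
    """Rescale new-form difficulties so the anchors match the base scale."""
    A = np.std(b_anchor_base, ddof=1) / np.std(b_anchor_new, ddof=1)
    B = np.mean(b_anchor_base) - A * np.mean(b_anchor_new)
    return A * np.asarray(b_new_form) + B

# Hypothetical anchor difficulties estimated separately in each calibration:
base = np.array([-0.8, -0.2, 0.1, 0.5, 1.0, 1.4])    # base (grade-5) scale
new = np.array([-1.5, -0.9, -0.6, -0.1, 0.4, 0.8])   # grade-6 run, own scale

print(mean_sigma_link(base, new, new))  # anchors mapped onto the base scale
```

After such a transformation (or after fixed-anchor calibration), the displacement statistic quantifies how far each anchor's difficulty would drift if re-estimated freely, which is the stability check applied above.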

Results

Dimensionality and Local Independence of the Items

Results of the PCA of the residuals revealed that, for the initial forms, all the secondary dimensions had eigenvalues close to 2.0 and explained less than 5% of the variance (see Table 2). The variance explained by the measures was about four times higher than the variance explained by the first secondary dimension, and the residuals’ correlations were much lower than 0.70. Therefore, these results support the assumptions of unidimensionality and local independence of the items required for Rasch model analyses.

Table 2. PCA of the residuals and reliability coefficients by test form.

Item Analyses

Table 3 presents descriptive statistics for the Rasch estimated parameters for each test form.

Table 3. Descriptive statistics of the estimated parameters by test form.

Tests of Reading Comprehension

In the TRC-n5 none of the items exceeded the reference value of 1.5 for infit and outfit statistics (see Table 3) and the highest mean ability value was obtained by students who chose the correct answer option for all 39 items. However, one item exhibited a difficulty value higher than the maximum value for person ability, meaning that it was too difficult for fifth graders. The same item presented a point-measure correlation lower than 0.15. Moreover, four items were flagged with DIF as a function of sex and two items were flagged with DIF as a function of socioeconomic status. Two of these six items were anchor items. Consequently, only one out of these two (the one with the highest DIF contrast) was eliminated in order to maintain the percentage of anchor items close to the minimum value of 20%. The item that was maintained in the measure obtained a DIF contrast of 0.65 and, therefore, its impact was considered not notable. In addition to the six items with inappropriate psychometric characteristics mentioned above, one more item was deleted according to the criteria for selection of unique items. Therefore, seven items were removed from the initial version of TRC-n5. Thus, the final version of TRC-n5 was composed of 32 items with six anchor items (18.8% of the total number of items).

In the TRC-n6 initial pool of 46 items, four items presented difficulty levels lower than the minimum person ability value, meaning that they were too easy for sixth graders. Further, one item had infit and outfit values higher than 1.5 and a negative point-measure correlation; for this item, the students who chose the correct answer option were not the ones with the highest average ability levels. A second item had a point-measure correlation lower than 0.15. This item and two others were also flagged with DIF as a function of sex. Additionally, four items were flagged with DIF as a function of socioeconomic status. Therefore, eight items were removed from the TRC-n6. According to the criteria for the selection of unique items, eight other items were also removed. The final version of the TRC-n6 was composed of 30 items with eight anchor items (26.7% of the total number of items). Figure 1 presents the item and person parameter locations on the vertical scale resulting from the final recalibration of the TRC-n5 (left) and the TRC-n6 (right). Mean values of the person ability standardized scores were 111 (SD = 10) for the TRC-n5 and 120 (SD = 10) for the TRC-n6. In the validation study of the version for the fourth grade (TRC-n4; Santos et al., 2016b), the mean was 108 (SD = 10). Person ability values were significantly greater at higher grade levels, F(2, 670) = 90.874, p < 0.001. Post-hoc tests revealed significant differences (p < 0.001) between the scaled scores obtained on the three TRC-n test forms.

Figure 1. Person-item variable map for TRC-n5 (left) and TRC-n6 (right). Items are identified by the text to which they are related (WP, The history of white pencil; ST, A very special trip; LB, The lost bread; MC, A mysterious chest; LT, Loose thoughts), followed by the item’s number. The comprehension level assessed by each item is presented in superscript (LC, Literal Comprehension; IC, Inferential Comprehension; R, Reorganization; CC, Critical Comprehension). Anchor items appear in bold.

Tests of Listening Comprehension

Of the TLC-n5 initial pool of 48 items, six items exhibited difficulty values lower than the minimum value for person ability (see Table 3), meaning that they were too easy for fifth graders. Additionally, one item exhibited a difficulty value higher than the maximum value for person ability, indicating that it was too difficult for fifth graders. Regarding the infit and outfit statistics, two items exceeded the reference value of 1.5. Three items had point-measure correlations lower than 0.15; for one of these three, the highest mean ability value was not obtained by students who chose the correct answer option, suggesting that the students with greater listening comprehension abilities chose an incorrect alternative. Additionally, four items were removed because they were flagged as having DIF regarding sex or socioeconomic status. According to the criteria for the selection of unique items, four additional items were also removed. Therefore, a total of 13 items were removed, and the final version of the TLC-n5 was composed of 35 items with eight anchor items (22.9% of the total number of items).

In the TLC-n6 initial pool of 39 items, four items had difficulty values lower than the minimum person ability value, meaning that they were too easy for sixth graders. Regarding the fit statistics, none of the items exceeded the reference value of 1.5. Only one item had a point-measure correlation lower than 0.15. Two items were flagged as having DIF both as a function of sex and of socioeconomic status and were, therefore, eliminated. Additionally, four items were flagged as having DIF only as a function of sex and six as having DIF only as a function of socioeconomic status; only seven of these 10 items were removed, in order to maintain acceptable reliability coefficients (PSR, KR20, and ISR). The three with the lowest DIF contrasts, ranging between 0.61 and 0.78, were maintained in the test form. In summary, 10 items were removed and the final version of the TLC-n6 was composed of 29 items, among which six were anchor items (20.69% of the total number of items). Figure 2 presents the item and person parameter locations on the vertical scale resulting from the final recalibration of the TLC-n5 (left) and the TLC-n6 (right). Mean values of the person ability standardized scores were 124 (SD = 10) for the TLC-n5 and 128 (SD = 10) for the TLC-n6. In the validation study of the version for the fourth grade (TLC-n4; Santos et al., 2015), the mean was 122 (SD = 10). Person ability values were significantly greater at higher grade levels, F(2, 711) = 24.721, p < 0.001. Post-hoc tests revealed significant differences (p < 0.05) between the scaled scores obtained on the three TLC-n test forms.

Figure 2. Person-item variable map for TLC-n5 (left) and TLC-n6 (right). Items are identified by the text to which they are related (DB, The Dentuça Bigodaça; CO, A composition; HV, Holidays in the Village; CA, The camping; BP, The bird and the pine), followed by the item’s number. The comprehension level assessed by each item is presented in superscript (LC, Literal Comprehension; IC, Inferential Comprehension; R, Reorganization; CC, Critical Comprehension). Anchor items appear in bold.

Reliability

The PSR and KR20 values were moderate, and the ISR coefficients were very high for the initial and final versions of all test forms. The elimination of items from the initial test forms to the final test forms did not cause a sharp decrease in reliability (see Table 2).

Study 2

Materials and Methods

Participants

A group of 179 students participated in the study of validity evidence of the TRC-n and the TLC-n forms: 94 were fifth graders (Mage = 10.96 years, SD = 0.55; 52.1% were girls) and 85 were sixth graders (Mage = 12.01 years, SD = 0.50; 52.9% were boys). All students attended public schools located in northern Portugal and were native speakers of European Portuguese. Students with special educational needs who were identified for selective and/or additional intervention were not included in the sample.

Measures

TRC-n5/6 and TLC-n5/6

In this study, the final versions of these tests, as developed in Study 1, were used.

Test of Reading Fluency (TRF)

This test assesses oral reading fluency in students from the first to the sixth grade, using an unpublished text composed of 1,160 words. Students were asked to read the text aloud for 3 min. Word omissions, substitutions, and mispronunciations were scored as errors. Self-corrections within 3 s of the error, repeated words, mispronunciations due to dialect or regional differences, hesitations, and words read slowly but correctly were not scored as errors. The number of words read correctly per minute was calculated as the mean across the 3 min.
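
As a small worked example of the fluency score just described (the numbers are hypothetical):

```python
# Words-correct-per-minute score as described above: correct words read in
# the 3-min period, averaged per minute.
def wcpm(words_attempted: int, errors: int, minutes: float = 3.0) -> float:
    """Words read correctly per minute, averaged over the reading period."""
    return (words_attempted - errors) / minutes

# Hypothetical student: 412 words attempted, 13 errors -> 133.0 WCPM.
print(wcpm(words_attempted=412, errors=13))
```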

Vocabulary Subtest of the Wechsler Intelligence Scale for Children-III (WISC-III; Wechsler, 2003)

This subtest comprises 30 items consisting of words presented orally, which the students were asked to define orally, as completely as possible. Each item is scored with zero, one, or two points, depending on the quality of the response, and raw scores are converted to standardized scores. Administration is discontinued after four consecutive failures. High reliability and good indicators of validity were found for the Portuguese version of the WISC-III (Simões and Albuquerque, 2002).

Digit Span Subtest of the WISC-III (Wechsler, 2003)

This subtest includes two digit span tasks (forward and backward). Each series consists of two trials, and the task is discontinued when both trials of the same series are failed. Items are scored with zero or one point, and raw scores are converted to standardized scores. Reliability coefficients for the Portuguese Digit Span subtest ranged between 0.71 and 0.90 (Simões and Albuquerque, 2002).

Reading Strategy Use (Ribeiro et al., 2015b)

This scale is composed of 22 items that assess cognitive and metacognitive strategy use (10 and 12 items, respectively). Each item consists of a proposition that represents a reading strategy, and the student’s task is to mark the frequency of its use on a 7-point Likert scale ranging from 1 (never) to 7 (always). The adaptation and validation studies of this scale for the Portuguese population supported the one-dimensional structure assumption. Cronbach’s alpha coefficient was 0.85 for the 22 items (Ribeiro et al., 2015b).

Verbal Reasoning Subtest (Almeida and Lemos, 2006)

This subtest assesses the ability to infer and apply relationships in tasks with verbal content. It consists of 20 multiple-choice items with four options (one correct), involving analogies between words, to be answered within a 4-min time limit. Items are scored with zero or one point. The reliability coefficient for this test was 0.72 (Almeida and Lemos, 2006).

Abstract Reasoning Subtest (Almeida and Lemos, 2006)

This subtest is composed of 20 items representing analogies with geometric figures, to be answered within a 5-min time limit. Items are multiple-choice questions with four options (one correct) and are scored with zero or one point. The reliability coefficient for this test was 0.71, and statistically significant correlations were obtained with school achievement in subjects such as Portuguese and mathematics (Almeida and Lemos, 2006).

Teacher Ratings of Students’ Reading Skills

Teachers were asked to rate students’ performance in oral reading fluency, listening, and reading comprehension in the classroom, using a scale ranging from 1 (very poor) to 5 (excellent).

Academic Results

Two indicators of academic achievement were collected: the final grade in the subject “Portuguese” and the grade point average obtained at the end of the academic year.

Procedure

Procedures similar to the ones used in Study 1 were employed. The TRC-n5/6, the TLC-n5/6, both reasoning subtests, and the strategy use scale were administered collectively, while the remaining tests were administered individually.

Data Analyses

Descriptive statistics and Pearson and Spearman correlation coefficients between all variables were calculated using IBM® SPSS Statistics 26. The size of the correlations was evaluated according to the following criteria: 0.10, 0.30, and 0.50 indicate a small, a medium, and a large effect, respectively (Cohen, 1992).
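
A minimal sketch of these analyses (with simulated scores; the study used SPSS, here scipy's pearsonr and spearmanr), including the classification of effect sizes by Cohen's (1992) benchmarks:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def cohen_label(r: float) -> str:
    """Classify a correlation by Cohen's (1992) effect-size benchmarks."""
    r = abs(r)
    if r >= 0.50:
        return "large"
    if r >= 0.30:
        return "medium"
    if r >= 0.10:
        return "small"
    return "negligible"

# Hypothetical data: simulated TRC-n5 scores and a correlated criterion.
rng = np.random.default_rng(1)
trc = rng.normal(111, 10, 94)
fluency = 0.6 * (trc - 111) + rng.normal(0, 8, 94)

r, p = pearsonr(trc, fluency)            # for continuous measures
print(f"r = {r:.2f} ({cohen_label(r)}), p = {p:.3f}")
rho, p_s = spearmanr(trc, fluency)       # for ordinal measures, e.g., ratings
print(f"rho = {rho:.2f} ({cohen_label(rho)})")
```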

Results

Table 4 provides the descriptive statistics and the correlations between the scores on the TRC-n5/6 and TLC-n5/6 and the other measures used as external criteria for the students’ abilities.

Table 4. Descriptive statistics and correlations in the fifth and the sixth grades.

High correlations were found between the TRC-n and TLC-n test forms in both grades. The correlations between the TRC-n forms and oral reading fluency, assessed with the TRF, were moderate in the fifth grade and high in the sixth grade. The scores on the TRC-n were also moderately correlated with the use of reading strategies in the fifth grade, but not in the sixth grade. All TRC-n and TLC-n test forms were highly correlated with the vocabulary measure and with the students’ academic results. The correlations of the TRC-n and the TLC-n with the measures of working memory and abstract reasoning were moderate in the fifth grade and low in the sixth grade, and the correlations with verbal reasoning were medium-to-large in the fifth grade but small in the sixth grade. Moderate-to-high correlations with teacher ratings were found for all test forms in both grades.

Discussion

The aim of the first study was to develop vertically scaled test forms for listening and reading comprehension for Portuguese students in the fifth and sixth grades, through the application of Rasch model analyses. In the second study, evidence for the validity of these forms was collected, based on relationships of the test scores with other variables.

Regarding the first study, the selection of items for each form of the TRC-n and TLC-n took into account item misfit, point-measure correlations lower than 0.15, unexpected responses (i.e., the highest mean ability value being obtained by students who chose an incorrect answer option), and the presence of DIF. Items flagged with any of these problems were removed from the respective test forms. Anchor items with high displacement values were also eliminated from each test form. Unique items for each test form were selected using the same criteria adopted in the development of the TRC-n and TLC-n for primary school (first to fourth grades) in previous studies (Santos et al., 2015, 2016b). The TRC-n5 and the TRC-n6 were composed of 32 and 30 items, with six and eight anchor items, respectively. Regarding the TLC-n, the test form for the fifth grade included 35 items, eight of which were anchor items, and the test form for the sixth grade included 29 items, with six anchor items. Anchor items represented about 20% of the final pool of items in all test forms, as recommended by Kolen and Brennan (2014).

All items of the TRC-n and TLC-n final forms revealed appropriate psychometric characteristics. In order to maintain this percentage of anchor items in the final versions, one item with socioeconomic status-related DIF was kept in the TRC-n5. Additionally, to maintain acceptable reliability, another three items with DIF (one with sex-related DIF and two with socioeconomic status-related DIF) were retained in the TLC-n6. These items with DIF represent a low percentage of the test forms (3.1 and 6.9%, respectively) and most likely have a low impact on the validity of the test forms’ scores, given that their DIF size was not considered notable according to the Wright and Douglas (1975, 1976) criteria, and that it is common to find about 15% of items with DIF in achievement tests (Narayanan and Swaminathan, 1994; Buzick and Stone, 2011). Post-hoc test results showed that the TRC-n and the TLC-n were able to capture the improvements in students’ reading and listening comprehension across subsequent grades. Additionally, evidence of adequate reliability was obtained for the final test forms. In summary, the results of the first study provided evidence of good psychometric properties for all forms of the TRC-n and the TLC-n.

Regarding the second study, the TRC-n forms were highly correlated with the TLC-n forms. Correlations of moderate magnitude between listening and reading comprehension at the same grade levels have been found in other studies (Diakidoy et al., 2005; Ouellette and Beers, 2010; Tobia and Bonifacci, 2015). The close relationship between reading and listening comprehension is congruent with the idea that the cognitive processes involved in both skills are the same (Perfetti et al., 2005; Cain and Oakhill, 2008). Additionally, most correlations between the developed test forms and the external criteria were positive and statistically significant. The moderate-to-high correlations between the TRC-n and the oral reading fluency measure are consistent with findings from studies with speakers of a wide range of languages and orthographies, enrolled in the fifth and sixth grades (Yovanoff et al., 2005; Padeliadu and Antoniou, 2014; Fernandes et al., 2017).

The TRC-n was moderately correlated with reading strategy use in the fifth grade, a result similar to the one obtained by Law (2009). In contrast, in the sixth grade this relationship was not statistically significant, similar to the results reported by Kolić-Vehovec and Bajšanski (2006). These mixed findings can be explained by a decrease in the influence of reading strategy use on reading comprehension as students progress from lower to higher grades, when other variables, such as vocabulary, gain more influence on reading comprehension (Ouellette and Beers, 2010). Congruent with this idea, the results of this study revealed large correlations between vocabulary and the TRC-n and TLC-n forms. Similar results were observed in previous research conducted with fifth and sixth graders (Yovanoff et al., 2005; Ouellette and Beers, 2010; Nouwens et al., 2016; Fernandes et al., 2017). This finding suggests that vocabulary has a strong influence on both reading and listening comprehension in this phase of development.

Most correlations between the TRC-n and TLC-n forms, working memory, and reasoning skills were statistically significant, similar to what has been reported in other studies with students of different ages and countries (Sesma et al., 2009; Ribeiro et al., 2015a; Tighe et al., 2015; Potocki et al., 2017; Jiang and Farquharson, 2018). However, the effect size of these relationships was lower in the sixth grade compared to the fifth grade. These results might suggest that cognitive variables, such as working memory and reasoning, assume less weight in comprehension as students reach upper elementary and middle school grades.

Finally, medium-to-large correlation coefficients were found between the TRC-n and TLC-n forms and the teachers’ ratings regarding students’ performance in oral reading fluency, listening, and reading comprehension, and high correlations were found with academic achievement indicators. Prior studies reported correlations of similar magnitude between these variables (Feinberg and Shapiro, 2003, 2009; Gilmore and Vance, 2007; Viana et al., 2015; Santos et al., 2016a). These results suggest that the scores in the developed test forms are a fair representation of actual school achievement of the students. Overall, the findings of the second study provide evidence of validity for the developed test forms.

The development of the TRC-n5/6 and TLC-n5/6 forms is an important contribution to the assessment of reading and listening comprehension in the Portuguese educational context. These two vertically scaled test forms, used in combination with the test forms developed for primary school (Santos et al., 2015, 2016b), allow the assessment and monitoring of students’ performance in these skills across multiple time points from the first to the sixth grade, enabling the direct comparison of scores and avoiding learning effects. Students who obtain low scores on these tests should be referred for a more comprehensive assessment, including the assessment of cognitive abilities, and for appropriate intervention, since students may present difficulties in listening comprehension, in reading comprehension, or in both skills (Nation, 2005). Activities focused on the explicit teaching of vocabulary, the activation of prior knowledge, the teaching of comprehension strategies, questioning, comprehension monitoring, inference making, and retelling are usually effective in promoting comprehension skills in both modalities (oral and written) (Snowling and Hulme, 2012; Hogan et al., 2014).

Moreover, given that different types of comprehension are assessed, the results of these tests can also inform instructional decision-making by indicating which types (literal, inferential, reorganization, and/or critical comprehension) need further attention. For example, students might be able to respond correctly to questions that require only the comprehension of information explicitly stated in the text, but be unable to make inferences when asked to. In this case, teachers can design lessons to foster inferential comprehension, including strategies such as the expansion, activation, and mobilization of relevant prior knowledge, or strategies targeting the integration of in-text and out-of-text information (Barth and Elleman, 2017).

These tests are also adequate tools for large-scale testing in schools, which can positively affect different stakeholders. Abu-Alhija (2007) pointed out its benefits for students, teachers, parents, administrators, and policymakers. Regarding students, large-scale testing results can encourage them to work more efficiently by describing their knowledge at the time of the assessment and by flagging what needs to be further studied and learned. Teachers can also benefit from this type of test, because large-scale assessments can help to identify strengths and weaknesses in the curriculum, to detect deficit areas in students’ knowledge, and to redirect instruction; this information can also motivate teachers to invest in their professional development and to enhance instruction. One possible positive effect of large-scale testing on parents is to stimulate their involvement in school activities. Such tests can also help administrators, for example, in the evaluation of the quality of their programs. Finally, large-scale testing can help policymakers evaluate the effectiveness of educational policies and promote a better allocation of resources (Abu-Alhija, 2007).

This study also had certain limitations, such as the use of a convenience sampling technique and the concentration of data collection in northern Portugal; thus, the results should be generalized with caution. Another limitation was the retention of some items that showed DIF: although the size of the differences was not large, these items can introduce some bias in testing, especially when comparing the results of students from different socioeconomic levels.

The collection of evidence based on the consequences of testing can be an important aim for future studies. The administration of tests in educational contexts is based on the idea that the interpretation of the scores should produce benefits, such as improvements in students’ learning and motivation through the selection of efficacious instructional strategies. Interviews and focus groups with teachers and students, as well as classroom observations, can be used to obtain data on this type of validity. Further research may also analyze the classification accuracy of these tests in identifying students with reading difficulties in the fifth and sixth grades. Taking into account that prior literature has shown that different results can be obtained for comprehension tests with texts of distinct typologies (RAND Reading Group, 2002), future research may also focus on the development of TRC-n and TLC-n forms for the same grade levels using expository texts. The separate or combined use of reading and listening comprehension tests with narrative and expository texts may contribute to the identification of specific comprehension difficulties and thus guide intervention programs centered on training strategies for extracting meaning from texts of one or both typologies.

Author’s Note

IC was in the Research Centre on Child Studies at the time of the study and is now in the Psychology Research Centre of the University of Minho.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation, to any qualified researcher.

Ethics Statement

This study was reviewed and approved by the ethics committee of the University of Minho, the Portuguese Ministry of Education, and the school boards involved. Written informed consent was collected from the children’s parents or legal guardians.

Author Contributions

BR made substantial contributions to the conception and design of the study, data collection, statistical data analysis and interpretation, and discussion of the results. IC contributed to the design of the study, statistical data analysis and to the interpretation, and discussion of the results. FV and IR made substantial contributions to the conception and design of the study, elaboration of the tests’ content, and interpretation and discussion of the results. All authors were involved in drafting the manuscript and revising it critically for important intellectual content.

Funding

This work was financially supported by the Portuguese Foundation for Science and Technology (FCT) and the Portuguese Ministry of Science, Technology and Higher Education through national funds, within the framework of the Psychology Research Centre (UIDB/PSI/01662/2020) and the Research Centre on Child Studies (UIDB/CED/00317/2020). Bruna Rodrigues’s work is also supported by an FCT grant (SFRH/BD/129582/2017) through the Operational Programme Human Capital (POCH).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We would like to thank the school boards and parents for their collaboration and the students for participating in this study.

References

Abu-Alhija, F. N. (2007). Large-scale testing: benefits and pitfalls. Stud. Educ. Eval. 33, 50–68. doi: 10.1016/j.stueduc.2007.01.005

Almeida, L. S., and Lemos, G. (2006). Bateria de Provas de Raciocínio: Manual Técnico [Battery of Reasoning Tests: Technical Manual]. Braga: Centro de Investigação em Psicologia, Universidade do Minho.

American Educational Research Association American Psychological Association and National Council on Measurement in Education (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.

Baker, F. B. (1984). Ability metric transformations involved in vertical equating under item response theory. Appl. Psychol. Meas. 8, 261–271. doi: 10.1177/014662168400800302

Barth, A. E., and Elleman, A. (2017). Evaluating the impact of a multistrategy inference intervention for middle-grade struggling readers. Lang. Speech Hear. Serv. Sch. 48, 31–41. doi: 10.1044/2016_LSHSS-16-0041

Botsas, G. (2017). Differences in strategy use in the reading comprehension of narrative and science texts among students with and without learning disabilities. Learn. Disabil. Contemp. J. 15, 139–162. doi: 10.4324/9780203876428

Buzick, H., and Stone, E. (2011). Recommendations for Conducting Differential Item Functioning (DIF) Analyses for Students with Disabilities Based on Previous DIF Studies. Princeton, NJ: Educational Testing Service.

Cain, K., and Oakhill, J. (2008). “Cognitive bases of children’s language comprehension difficulties: where do we go from here?,” in Children’s Comprehension Problems in Oral and Written Language. A Cognitive Perspective, eds K. Cain and J. Oakhill (New York, NY: The Guilford Press), 283–295. doi: 10.1080/14675986.2010.533041

Català, G., Català, M., Molina, E., and Monclús, R. (2001). Evaluación de la Comprensión Lectora: Pruebas ACL (1.o - 6.o de Primaria) [Reading Comprehension Assessment: ACL Tests (1st - 6th Grade)]. Barcelona: Editorial Graó.

Catts, H. W., Hogan, T. P., and Adlof, S. M. (2005). “Developmental changes in reading and reading disabilities,” in The Connections between Language and Reading Disabilities, eds H. W. Catts and A. G. Kamhi (Mahwah, NJ: Lawrence Erlbaum Associates Inc.), 25–40.

Cohen, J. (1992). A power primer. Psychol. Bull. 112, 155–159. doi: 10.1037//0033-2909.112.1.155

de Ayala, R. J. (2009). The Theory and Practice of Item Response Theory. New York, NY: The Guilford Press. doi: 10.1080/15305058.2011.556771

Delgado, A. R., and Prieto, G. (1998). Further evidence favoring three-option items in multiple-choice tests. Eur. J. Psychol. Assess. 14, 197–201. doi: 10.1027/1015-5759.14.3.197

Diakidoy, I. A. N., Stylianou, P., Karefillidou, C., and Papageorgiou, P. (2005). The relationship between listening and reading comprehension of different types of text at increasing grade levels. Read. Psychol. 26, 55–80. doi: 10.1080/02702710590910584

Feinberg, A. B., and Shapiro, E. S. (2003). Accuracy of teacher judgments in predicting oral reading fluency. Sch. Psychol. Q. 18, 52–65. doi: 10.1521/scpq.18.1.52.20876

Feinberg, A. B., and Shapiro, E. S. (2009). Teacher accuracy: an examination of teacher-based judgments of students’ reading with differing achievement levels. J. Educ. Res. 102, 453–462. doi: 10.3200/JOER.102.6.453-462

Fernandes, S., Querido, L., Verhaeghe, A., Marques, C., and Araújo, L. (2017). Reading development in European Portuguese: relationships between oral reading fluency, vocabulary and reading comprehension. Read. Writ. 30, 1987–2007. doi: 10.1007/s11145-017-9763-z

Florit, E., Roch, M., and Levorato, M. C. (2014). Listening text comprehension in preschoolers: a longitudinal study on the role of semantic components. Read. Writ. 27, 793–817. doi: 10.1007/s11145-013-9464-1

Francis, D. J., Fletcher, J. M., Catts, H. W., and Tomblin, J. B. (2005). “Dimensions affecting the assessment of reading comprehension,” in Children’s Reading Comprehension and Assessment, eds S. G. Paris and S. A. Stahl (Mahwah, NJ: Lawrence Erlbaum Associates Inc.), 369–394. doi: 10.1111/j.1467-9817.2006.00332.x

Gilmore, J., and Vance, M. (2007). Teacher ratings of children’s listening difficulties. Child Lang. Teach. Ther. 23, 133–156. doi: 10.1177/0265659007073876

Hogan, T. P., Adlof, S. M., and Alonzo, C. N. (2014). On the importance of listening comprehension. Int. J. Speech Lang. Pathol. 16, 199–207. doi: 10.3109/17549507.2014.904441

Jiang, H., and Farquharson, K. (2018). Are working memory and behavioral attention equally important for both reading and listening comprehension? A developmental comparison. Read. Writ. 31, 1449–1477. doi: 10.1007/s11145-018-9840-y

Kim, Y.-S. G. (2016). Direct and mediated effects of language and cognitive skills on comprehension of oral narrative texts (listening comprehension) for children. J. Exp. Child Psychol. 141, 101–120. doi: 10.1016/j.jecp.2015.08.003

Kolen, M. J., and Brennan, R. L. (2014). Test Equating, Scaling, and Linking: Methods and Practices, 3rd Edn. New York, NY: Springer. doi: 10.1007/978-1-4939-0317-7

Kolić-Vehovec, S., and Bajšanski, I. (2006). Metacognitive strategies and reading comprehension in elementary-school students. Eur. J. Psychol. Educ. 21, 439–451. doi: 10.1007/bf03173513

Law, Y. K. (2009). The role of attribution beliefs, motivation and strategy use in Chinese fifth-graders’ reading comprehension. Educ. Res. 51, 77–95. doi: 10.1080/00131880802704764

Leach, J. M., Scarborough, H. S., and Rescorla, L. (2003). Late-emerging reading disabilities. J. Educ. Psychol. 95, 211–224. doi: 10.1037/0022-0663.95.2.211

Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Meas. Trans. 16:878.

Linacre, J. M. (2018). A User’s Guide to WINSTEPS and MINISTEP: Rasch-Model Computer Programs. Program Manual 4.3.1. Available online at: winsteps.com (accessed January 15, 2020).

Lipka, O., Lesaux, N. K., and Siegel, L. S. (2006). Retrospective analyses of the reading development of grade 4 students with reading disabilities: risk status and profiles over 5 years. J. Learn. Disabil. 39, 364–378. doi: 10.1177/00222194060390040901

Monteiro, A., Fonseca, D., Rodrigues, E. M., Monteiro, I., Rebelo, I., Silva, M. B., et al. (2014). Processo de Avaliação Externa da Aprendizagem: Provas Finais de Ciclo e Exames Nacionais 2014 [Process of External Learning Assessment: Final Cycle Tests and National Exams 2014]. Available online at: http://www.dge.mec.pt/sites/default/files/JNE/relatorio_anual_do_jne_2014.pdf (accessed March 4, 2020).

Narayanan, P., and Swaminathan, H. (1994). Performance of the Mantel-Haenszel and simultaneous item bias procedures for detecting differential item functioning. Appl. Psychol. Meas. 18, 315–328. doi: 10.1177/014662169401800403

Nation, K. (2005). “Children’s reading comprehension difficulties,” in The Science of Reading: A Handbook, eds M. J. Snowling and C. Hulme (London: Blackwell Publishing Ltd), 248–265. doi: 10.1002/9780470757642

Nouwens, S., Groen, M. A., and Verhoeven, L. (2016). How storage and executive functions contribute to children’s reading comprehension. Learn. Individ. Differ. 47, 96–102. doi: 10.1016/j.lindif.2015.12.008

Nunnally, J. C., and Bernstein, I. H. (1994). Psychometric Theory, 3rd Edn. New York, NY: McGraw-Hill.

Oakhill, J., Cain, K., and Elbro, C. (2019). “Reading comprehension and reading comprehension difficulties,” in Reading Development and Difficulties: Bridging the Gap Between Research and Practice, eds D. A. Kilpatrick, R. M. Joshi, and R. K. Wagner (Cham: Springer Nature Switzerland), 83–115. doi: 10.1007/978-3-030-26550-2_5

Ouellette, G. P., and Beers, A. (2010). A not-so-simple view of reading: how oral vocabulary and visual-word recognition complicate the story. Read. Writ. 23, 189–208. doi: 10.1007/s11145-008-9159-1

Padeliadu, S., and Antoniou, F. (2014). The relationship between reading comprehension, decoding, and fluency in Greek: a cross-sectional study. Read. Writ. Q. 30, 1–31. doi: 10.1080/10573569.2013.758932

Perfetti, C. A., Landi, N., and Oakhill, J. (2005). “The acquisition of reading comprehension skill,” in The Science of Reading: A Handbook, eds M. J. Snowling and C. Hulme (London: Blackwell Publishing Ltd.), 227–247.

Potocki, A., Sanchez, M., Ecalle, J., and Magnan, A. (2017). Linguistic and cognitive profiles of 8- to 15-year-old children with specific reading comprehension difficulties: the role of executive functions. J. Learn. Disabil. 50, 128–142. doi: 10.1177/0022219415613080

Prieto, G., and Delgado, A. R. (2003). Análisis de un test mediante el modelo de Rasch [Analysis of a test using the Rasch model]. Psicothema 15, 94–100.

RAND Reading Study Group (2002). Reading for Understanding: Toward an R&D Program in Reading Comprehension. Santa Monica, CA: RAND Corporation.

Ribeiro, I., Cadime, I., Freitas, T., and Viana, F. L. (2015a). Beyond word recognition, fluency, and vocabulary: the influence of reasoning on reading comprehension. Aust. J. Psychol. 68, 107–115. doi: 10.1111/ajpy.12095

Ribeiro, I., Dias, O., Oliveira, ÍM., Miranda, P., Ferreira, G., Saraiva, M., et al. (2015b). Adaptação e validação da escala Reading Strategy Use para a população Portuguesa [Adaptation and validation of the Reading Strategy Use scale for the Portuguese population]. Rev. Iberoam. Diagn. Eval. Psicol. 1, 25–36.

Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: a meta-analysis of 80 years of research. Educ. Meas. Issues Pract. 24, 3–13. doi: 10.1111/j.1745-3992.2005.00006.x

Salvia, J., Ysseldyke, J. E., and Bolt, S. (2010). “Technical adequacy,” in Assessment in Special and Inclusive Education, eds J. Salvia, J. E. Ysseldyke, and S. Bolt (Belmont, CA: Wadsworth Cengage Learning), 53–71.

Santos, S., Cadime, I., Viana, F. L., Chaves-Sousa, S., Gayo, E., Maia, J., et al. (2016a). Assessing reading comprehension with narrative and expository texts: dimensionality and relationship with fluency, vocabulary and memory. Scand. J. Psychol. 58, 1–8. doi: 10.1111/sjop.12335

Santos, S., Cadime, I., Viana, F. L., Prieto, G., Chaves-Sousa, S., Spinillo, A. G., et al. (2016b). An application of the Rasch model to reading comprehension measurement. Psicol. Refl. Crít. 29:38. doi: 10.1186/s41155-016-0044-6

Santos, S., Viana, F. L., Ribeiro, I., Prieto, G., Brandão, S., and Cadime, I. (2015). Development of listening comprehension tests with narrative and expository texts for Portuguese students. Span. J. Psychol. 18:E5. doi: 10.1017/sjp.2015.7

Sesma, H. W., Mahone, E. M., Levine, T., Eason, S. H., and Cutting, L. E. (2009). The contribution of executive skills to reading comprehension. Child Neuropsychol. 15, 232–246. doi: 10.1080/09297040802220029

Simões, M. R., and Albuquerque, C. P. (2002). Estudos com a versão Portuguesa da WISC-III no âmbito da validade concorrente e preditiva: relação com as classificações escolares [Studies with the Portuguese version of the WISC-III on concurrent and predictive validity: relationship with school grades]. Psychologica 29, 153–168.

Snowling, M. J., and Hulme, C. (2012). Interventions for children’s language and literacy difficulties. Int. J. Lang. Commun. Disord. 47, 27–34. doi: 10.1111/j.1460-6984.2011.00081.x

Tighe, E. L., Spencer, M., and Schatschneider, C. (2015). Investigating predictors of listening comprehension in third-, seventh-, and tenth-grade students: a dominance analysis approach. Read. Psychol. 36, 700–740. doi: 10.1080/02702711.2014.963270

Tobia, V., and Bonifacci, P. (2015). The simple view of reading in a transparent orthography: the stronger role of oral comprehension. Read. Writ. 28, 939–957. doi: 10.1007/s11145-015-9556-1

Tzuriel, D., and George, T. (2009). Improvement of analogical reasoning and academic achievements by the Analogical Reasoning Programme (ARP). Educ. Child Psychol. 26, 71–94.

Viana, F. L., Santos, S., Ribeiro, I., Chaves-Sousa, S., Brandão, S., Cadime, I., et al. (2015). Listening comprehension assessment: validity studies of two vertically scaled tests for Portuguese students. Univ. Psychol. 14, 345–354. doi: 10.11144/Javeriana.upsy14-1.lcav

Walker, C. M., and Beretvas, S. N. (2001). An empirical investigation demonstrating the multidimensional DIF paradigm: a cognitive explanation for DIF. J. Educ. Meas. 38, 147–163. doi: 10.1111/j.1745-3984.2001.tb01120.x

Wechsler, D. (2003). Escala de Inteligência de Wechsler para Crianças [Wechsler Intelligence Scale for Children] (WISC-III), 3rd Edn. Lisboa: CEGOC-TEA.

Wilson, M., and Moore, S. (2011). Building out a measurement model to incorporate complexities of testing in the language domain. Lang. Test. 28, 441–462. doi: 10.1177/0265532210394142

Wright, B. D., and Douglas, G. A. (1975). Best Test Design and Self-Tailored Testing (MESA Research Memorandum No. 19). Chicago, IL: Statistical Laboratory, Department of Education, University of Chicago.

Wright, B. D., and Douglas, G. A. (1976). Rasch Item Analysis by Hand (MESA Research Memorandum No. 21). Chicago, IL: Statistical Laboratory, Department of Education, University of Chicago.

Yovanoff, P., Duesbery, L., Alonzo, J., and Tindal, G. (2005). Grade-level invariance of a theoretical causal structure predicting reading comprehension with vocabulary and oral reading fluency. Educ. Meas. Issues Pract. 24, 4–12. doi: 10.1111/j.1745-3992.2005.00014.x

Keywords: reading comprehension, listening comprehension, Rasch model, vertical equating, validity evidence

Citation: Rodrigues B, Cadime I, Viana FL and Ribeiro I (2020) Developing and Validating Tests of Reading and Listening Comprehension for Fifth and Sixth Grade Students in Portugal. Front. Psychol. 11:610876. doi: 10.3389/fpsyg.2020.610876

Received: 27 September 2020; Accepted: 18 November 2020;
Published: 09 December 2020.

Edited by: Jin Eun Yoo, Korea National University of Education, South Korea

Reviewed by: Heikki Juhani Lyytinen, University of Jyväskylä, Finland; Caroline Hornung, University of Luxembourg, Luxembourg

Copyright © 2020 Rodrigues, Cadime, Viana and Ribeiro. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Bruna Rodrigues, bruna.fct.psi@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.