AUTHOR=Hackl Veronika, Müller Alexandra Elena, Granitzer Michael, Sailer Maximilian TITLE=Is GPT-4 a reliable rater? Evaluating consistency in GPT-4's text ratings JOURNAL=Frontiers in Education VOLUME=8 YEAR=2023 URL=https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2023.1272229 DOI=10.3389/feduc.2023.1272229 ISSN=2504-284X ABSTRACT=
This study reports the Intraclass Correlation Coefficients of feedback ratings produced by OpenAI's GPT-4, a large language model (LLM), across various iterations, time frames, and stylistic variations. The model was used to rate responses to tasks related to macroeconomics in higher education (HE), based on their content and style. Statistical analysis was performed to determine the absolute agreement and consistency of ratings in all iterations, and the correlation between the ratings in terms of content and style. The findings revealed high interrater reliability, with ICC scores ranging from 0.94 to 0.99 for different time periods, indicating that GPT-4 is capable of producing consistent ratings. The prompt used in this study is also presented and explained.
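The intraclass correlation analysis summarized in the abstract can be illustrated with a minimal two-way ANOVA-based computation. This is a sketch only: the specific ICC model and software the authors used are not stated here, so the formulas below (single-rater, two-way model: ICC for consistency and ICC for absolute agreement, following Shrout and Fleiss / McGraw and Wong conventions) are an assumption, and the rating matrix is made-up example data, not the study's data.

```python
def icc_two_way(data):
    """Compute single-rater two-way ICCs for an n-subjects x k-raters matrix.

    Returns (icc_consistency, icc_agreement). Assumes a two-way model in the
    Shrout-Fleiss / McGraw-Wong convention; this is illustrative, not
    necessarily the exact model used in the study.
    """
    n = len(data)        # number of rated texts (rows)
    k = len(data[0])     # number of rating iterations / "raters" (columns)
    grand = sum(x for row in data for x in row) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(data[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA decomposition: rows (subjects), columns (raters), residual.
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    # Consistency ICC ignores systematic rater offsets; agreement ICC
    # penalizes them via the column (rater) mean square.
    icc_consistency = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)
    icc_agreement = (ms_r - ms_e) / (
        ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n
    )
    return icc_consistency, icc_agreement


# Hypothetical example: a second iteration that rates every text exactly
# one point higher is perfectly consistent but not in absolute agreement.
shifted = [[1, 2], [2, 3], [3, 4]]
icc_c, icc_a = icc_two_way(shifted)
```

For the `shifted` matrix the consistency ICC is 1.0 while the agreement ICC drops to 2/3, which mirrors the abstract's distinction between consistency and absolute agreement of GPT-4's repeated ratings.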