Generative AI in education: ChatGPT-4 in evaluating students’ written responses
Reliably grading short writing assignments with an older LLM was hard
Jauhiainen, J. S., & Garagorry Guerra, A. (2024). Generative AI in education: ChatGPT-4 in evaluating students’ written responses. Innovations in Education and Teaching International, 0(0), 1–15. https://doi.org/10.1080/14703297.2024.2422337
One of the reasons reading AI-related research in preprint form is so useful is that the results are more recent and more relevant. This newly published study, which must have been conducted a year ago, reports that grading student writing with GPT-4 was difficult. While this is good to know, it’s unlikely that anyone would try to grade student writing with a model as old as GPT-4 today.
“In our findings, 31.2% of the initial grades changed upon re-evaluation, with 25.7% changing by more than one grade level. Repeated evaluations help identify inaccuracies and inconsistencies in LLM’s grading. In a specific test where 54 responses were evaluated ten times each, ChatGPT-4’s final grade to the answer was consistent in 68.7% of cases, varied by one grade in 27.4%, and by two grades in only 3.9%. For a five-criteria evaluation involving 2,700 gradings, the consistency was 71.6%, with variations of one and two grades at 20.6% and 6.8%, respectively.”
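The repeated-evaluation check the authors describe is straightforward to sketch: grade the same response several times and look at how the grades spread. Below is a minimal illustration in Python, assuming the official OpenAI client; the rubric wording, grade scale, and function names are hypothetical and not taken from the paper.

```python
from collections import Counter
from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt; the paper's actual grading prompts are not published here.
GRADING_PROMPT = (
    "You are grading a student's written response on a 0-5 scale. "
    "Reply with the numeric grade only.\n\n"
    "Rubric: {rubric}\n\nStudent response: {response}"
)

def grade_once(rubric: str, response: str) -> str:
    """Ask the model for a single grade for one student response."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": GRADING_PROMPT.format(rubric=rubric, response=response),
        }],
    )
    return completion.choices[0].message.content.strip()

def grade_consistency(rubric: str, response: str, runs: int = 10) -> Counter:
    """Grade the same response repeatedly and tally how often each grade appears.

    A tally spread across more than one grade is the kind of inconsistency
    the study found in roughly a third of initial gradings.
    """
    return Counter(grade_once(rubric, response) for _ in range(runs))

# Usage:
# tally = grade_consistency(my_rubric, my_student_answer)
# print(tally.most_common())
```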
“ChatGPT-4 successfully assessed the factual accuracy of responses against reference learning materials, performing well at least at lower levels of the knowledge taxonomy. However, higher-level evaluations require precise instructional prompts, detailed student responses and explicit instructions for students to demonstrate their higher-level knowledge, which were not required in this study.”
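That second finding is essentially a prompting requirement: to grade above the lower levels of the knowledge taxonomy, the prompt has to supply the reference material and explicitly demand evidence of higher-order thinking. A hypothetical prompt-construction sketch along those lines (the wording is mine, not the authors’):

```python
def build_grading_prompt(reference_material: str, question: str, response: str) -> str:
    """Compose a grading prompt that grounds factual checks in the reference
    material and explicitly asks for higher-level (analysis/evaluation) evidence."""
    return (
        "Grade the student response on a 0-5 scale, using ONLY the reference "
        "material below to judge factual accuracy.\n\n"
        f"Reference material:\n{reference_material}\n\n"
        f"Question:\n{question}\n\n"
        f"Student response:\n{response}\n\n"
        "Criteria: (1) factual accuracy against the reference material; "
        "(2) explicit evidence of higher-level reasoning, such as analysis "
        "or evaluation, demonstrated in the student's own words.\n"
        "Reply with the numeric grade and a one-sentence justification."
    )
```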