Using Large Language Models for Automated Grading of Student Writing about Science
GPT-4 required both a rubric and an example answer to grade reliably.
Impey, C., Wenger, M., Garuda, N., Golchin, S., & Stamer, S. (2024). Using Large Language Models for Automated Grading of Student Writing about Science. arXiv. https://doi.org/10.48550/arXiv.2412.18719
Automated grading of student essays has been a research area of interest for decades, and this team from the University of Arizona makes some interesting contributions, including documenting the comparability of instructor-created and LLM-generated rubrics, and the finding that getting acceptable performance from the LLM required that both an example answer and a rubric be included in the prompt.
The rapid development of AI has introduced the possibility of using large language models (LLMs) to evaluate student writing. An experiment was conducted using GPT-4 to determine if machine learning methods based on LLMs can match or exceed the reliability of instructor grading in evaluating short writing assignments on topics in astronomy. The audience consisted of adult learners in three massive open online courses (MOOCs) offered through Coursera. One course was on astronomy, the second was on astrobiology, and the third was on the history and philosophy of astronomy.
…
The promise and peril of AI for education cannot be fully elucidated in a simple pilot study like this [evaluating answers from 120 students to 12 questions across the three courses]. However, the results of using an LLM to grade student writing assignments in these MOOCs are promising.
For research question 1: Can the LLM generate grades comparable to those of an instructor? The answer is yes. GPT-4 was able to produce grades comparable to an instructor's when prompted with appropriate information, which in this case included an example answer and a rubric.
For research question 2: Can the LLM match or exceed the reliability of peer/instructor grading? The answer is yes: GPT-4's grading is consistent and reliable, matching or exceeding the reliability of peer grading.
For research question 3: Can the LLM create a grading rubric that will produce LLM grades comparable to those of an instructor? The answer is yes: GPT-4 produced grades comparable to an instructor's when given LLM-generated rubrics, indicating that the LLM-generated rubrics were of similar utility to the instructor-provided rubrics for this automated grading procedure.
Unsurprisingly, the authors note that the LLM had a much harder time matching instructors’ grading when the instructors themselves weren’t exactly clear about what they were looking for:
The performance of the LLM in matching the instructor grades is better for the astronomy and astrobiology courses than it is for the history and philosophy class, where questions are more open-ended and it is challenging even for an instructor to create a concrete rubric.
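The paper doesn't publish its exact prompts, but the setup it describes (a rubric and an example answer supplied alongside each student response) is easy to picture in code. Here is a minimal sketch using the OpenAI Python client; the rubric text, example answer, and 0-10 scale are illustrative placeholders, not the authors' materials.

```python
# Minimal sketch of rubric-plus-example grading with GPT-4 via the
# OpenAI Python client. The rubric, example answer, and 0-10 scale
# below are placeholders, not the materials used in the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """Score out of 10:
- 4 pts: correctly identifies Rayleigh scattering as the cause
- 3 pts: explains the wavelength dependence (blue scatters more)
- 3 pts: clear, well-organized writing"""

EXAMPLE_ANSWER = (
    "The sky looks blue because air molecules scatter short-wavelength "
    "blue light much more strongly than longer red wavelengths, an "
    "effect known as Rayleigh scattering."
)

def grade(student_answer: str) -> str:
    """Return GPT-4's score and a one-sentence justification."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep the grading as repeatable as possible
        messages=[
            {
                "role": "system",
                "content": "You are grading short answers in an "
                           "introductory astronomy MOOC.",
            },
            {
                "role": "user",
                "content": (
                    f"Rubric:\n{RUBRIC}\n\n"
                    f"Example full-credit answer:\n{EXAMPLE_ANSWER}\n\n"
                    f"Student answer:\n{student_answer}\n\n"
                    "Give a score out of 10 and one sentence explaining it."
                ),
            },
        ],
    )
    return response.choices[0].message.content

print(grade("The sky is blue because it reflects the color of the ocean."))
```

Setting the temperature to 0 reduces (but doesn't eliminate) run-to-run variation, which matters when the question being asked is whether the model can match the reliability of human graders.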
To Janet's point below, if GenAI were to replace all grading and feedback, at least at the moment, I'd be concerned and see that as swinging the "automation" pendulum too far too soon with this tech. On the other hand, if GenAI is used to augment human grading and feedback, allowing instructors to reallocate time to the students who need more personalized support, I think that could be a big win. For example, I'm not sure it's the best use of a professor's time to be providing the same basic grammar and writing feedback on essays in a college Writing 101-type course, as that seems like something GenAI could do well (potentially better than a human?). Instead of taking 20 hours to read and provide basic feedback on essays, the instructor could use that time to work with smaller groups of students, or with the individuals who are struggling most and need personalized attention to succeed.
So students are expected to pay tuition for educators who don't review their work? Hmmm... why pay?
Sure, reading papers and dissertation drafts was a lot of work. But maybe because in those pre-AI days I did the reading and made an effort to support and encourage students as well as to grade them, I still hear from them a decade later. If they contact me, we can reflect on the research.