Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses
A small, fine-tuned model outperforms a frontier model at grading open-ended questions
Baral, S., Worden, E., Lim, W.-C., Luo, Z., Santorelli, C., Gurung, A., & Heffernan, N. (2024). Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses (No. arXiv:2411.08910). arXiv. https://doi.org/10.48550/arXiv.2411.08910
This paper is a great example of the power of fine-tuning smaller models. The authors' much smaller GOAT model outperformed GPT-4 (and an older model called SBERT-Canberra) at assigning scores to student answers to open-ended math questions.
The GOAT model is the authors' fine-tuned LLM, tailored to a dataset of student open responses and the teacher-provided scores for those responses. To build GOAT they fine-tune Mistral 7B, chosen because it has been shown to outperform Llama 13B on math, reading comprehension, and reasoning benchmarks. They fine-tune with LoRA because it uses less GPU memory and time and helps avoid catastrophic forgetting.
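As a rough sketch, a LoRA setup like this could be written with Hugging Face transformers and peft. The paper does not report its exact stack, adapter rank, or target modules, so those choices below are assumptions, not the authors' configuration.

```python
# Sketch of LoRA fine-tuning setup for Mistral 7B (assumed stack: transformers + peft).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small low-rank adapter matrices while the base weights stay frozen,
# which is why it needs less GPU memory and is less prone to catastrophic forgetting.
lora_config = LoraConfig(
    r=16,                                  # adapter rank (assumed; not reported in the paper)
    lora_alpha=32,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```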
To build input-output pairs for fine-tuning, the authors use an illustrative grading rubric to design an instructional prompt for each pair (Figure 3 in the paper), combining a math problem and a student's answer into the input and treating a real teacher's score as the desired output. They used 4,000 entries from the training split for fine-tuning and 1,000 entries for testing.
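A minimal sketch of how one such input-output pair might be assembled is below. The actual prompt wording comes from the paper's Figure 3 and rubric, which are not reproduced here, so this template, the score scale, and the example values are all assumptions.

```python
# Hypothetical template for turning (problem, student answer, teacher score)
# into an instruction-style training example; the real prompt is in the paper's Figure 3.
def build_example(problem: str, student_answer: str, teacher_score: int) -> dict:
    prompt = (
        "You are a math teacher grading a student's open-ended response.\n"
        "Using the rubric, score the response from 0 (incorrect) to 4 (fully correct).\n\n"
        f"Problem: {problem}\n"
        f"Student answer: {student_answer}\n"
        "Score:"
    )
    return {"input": prompt, "output": str(teacher_score)}

example = build_example(
    problem="Explain why 3/4 is greater than 2/3.",
    student_answer="With a common denominator, 9/12 is more than 8/12.",
    teacher_score=4,
)
```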
Fine-tuning runs for 4 epochs with 10 warm-up steps, an initial learning rate of 0.0002, and a cosine annealing schedule. To stay within memory constraints, the authors use gradient accumulation with 2 accumulation steps over micro-batches of 2. Training on a single A100 GPU takes approximately 2 hours and ends with near-zero training loss.
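Expressed as Hugging Face TrainingArguments, those reported hyperparameters might look like the sketch below; anything not stated in the text (output path, logging cadence) is an assumption.

```python
# Reported hyperparameters mapped onto transformers.TrainingArguments (field names assumed).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="goat-mistral-7b-lora",   # hypothetical output path
    num_train_epochs=4,                  # 4 epochs
    warmup_steps=10,                     # 10 warm-up steps
    learning_rate=2e-4,                  # initial learning rate of 0.0002
    lr_scheduler_type="cosine",          # cosine annealing schedule
    per_device_train_batch_size=2,       # micro-batch size of 2
    gradient_accumulation_steps=2,       # accumulate 2 steps to address memory constraints
    logging_steps=10,                    # assumed logging cadence
)
```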
GOAT was not as good as GPT-4 at writing feedback to students. However, because it is so much smaller, GOAT is significantly cheaper to run than GPT-4. That makes a multi-model approach to grading student work attractive: use GOAT for scoring and GPT-4 for writing feedback, which should give higher-quality scores at lower cost than using GPT-4 alone.
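A hedged sketch of that two-model pipeline is below. The helper functions are hypothetical stand-ins for whatever inference calls you would use; neither the function names nor the routing logic come from the paper.

```python
# Hypothetical routing: the cheap fine-tuned scorer assigns the grade,
# and a larger model writes feedback consistent with that grade.
def score_with_goat(problem: str, answer: str) -> int:
    """Placeholder: call the fine-tuned GOAT model and parse its score."""
    raise NotImplementedError

def feedback_with_gpt4(problem: str, answer: str, score: int) -> str:
    """Placeholder: ask GPT-4 to write feedback that matches the given score."""
    raise NotImplementedError

def grade_response(problem: str, answer: str) -> dict:
    score = score_with_goat(problem, answer)                 # specialized, low-cost scoring
    feedback = feedback_with_gpt4(problem, answer, score)    # richer written feedback
    return {"score": score, "feedback": feedback}
```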