Improving the Validity of Automatically Generated Feedback via Reinforcement Learning
Yes, you can fine-tune open weights LLMs to behave more pedagogically.
Scarlatos, A., Smith, D., Woodhead, S., & Lan, A. (2024). Improving the Validity of Automatically Generated Feedback via Reinforcement Learning. In Artificial Intelligence in Education (AIED 2024) (Vol. 14829, pp. 280–294). Springer. https://doi.org/10.1007/978-3-031-64302-6_20
This work is directly relevant to my recent writing about Open Educational Language Models (OELMs), which describes using openly licensed data to fine-tune open weights LLMs to behave more pedagogically. In this case, Llama 2 is fine-tuned to give higher-quality feedback on students' math answers.
From the abstract:

In this work, we address both problems of automatically generating and evaluating feedback while considering both correctness and alignment. First, we propose a rubric for evaluating math feedback and show that GPT-4 is able to effectively use it to annotate human-written and LLM-generated feedback. Second, we propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4’s annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO). We show that our methods significantly increase the correctness and alignment of generated feedback with Llama 2, an open-source LLM.
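To make the training recipe a bit more concrete, here is a minimal sketch of DPO fine-tuning over feedback preference pairs using the Hugging Face TRL library. The model name, dataset columns, prompt format, and hyperparameters below are my own illustrative assumptions, not the authors' released code; in the paper, the "chosen"/"rejected" labels come from GPT-4's rubric annotations of feedback pairs.

```python
# Minimal sketch (illustrative assumptions, not the authors' released code): DPO fine-tuning
# of Llama 2 over feedback preference pairs, using the Hugging Face TRL library.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; any causal LM works for the sketch
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token

# Preference pairs: for each (question, incorrect answer) prompt, the feedback that scored
# higher under GPT-4's rubric annotations becomes "chosen", the lower-scoring one "rejected".
# This single example is illustrative only.
pairs = Dataset.from_dict({
    "prompt": ["Question: 3/4 + 1/8 = ?\nStudent answer: 4/12\nFeedback:"],
    "chosen": ["Close! Rewrite both fractions over a common denominator of 8 before adding the numerators."],
    "rejected": ["Incorrect. The answer is 7/8."],
})

config = DPOConfig(
    output_dir="llama2-feedback-dpo",
    beta=0.1,                        # strength of the implicit KL penalty toward the reference model
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                     # with no ref_model given, TRL freezes a copy as the reference
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,      # older TRL releases call this argument `tokenizer`
)
trainer.train()
```

Compared with PPO-style RLHF, DPO needs no separate reward model or online sampling loop: the preference pairs themselves carry the training signal, which is why rubric-based annotations from GPT-4 can be turned directly into fine-tuning data.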