LearnLM Team, Modi, A., Veerubhotla, A. S., Rysbek, A., Huber, A., Wiltshire, B., Veprek, B., Gillick, D., Kasenberg, D., Ahmed, D., Jurenka, I., Cohan, J., She, J., Wilkowski, J., Alarakyia, K., McKee, K. R., Wang, L., Kunesch, M., Schaekermann, M., … Assael, Y. (2024). LearnLM: Improving Gemini for Learning. arXiv. https://doi.org/10.48550/arXiv.2412.16429
It’s great to see progress being made on this critically important problem. And, as is often the case, progress on hard problems comes from changing your mental framing of the problem.
Today’s generative AI systems are tuned to present information by default rather than engage users in service of learning as a human tutor would. To address the wide range of potential education use cases for these systems, we reframe the challenge of injecting pedagogical behavior as one of pedagogical instruction following, where training and evaluation examples include system-level instructions describing the specific pedagogy attributes present or desired in subsequent model turns. This framing avoids committing our models to any particular definition of pedagogy, and instead allows teachers or developers to specify desired model behavior. It also clears a path to improving Gemini models for learning—by enabling the addition of our pedagogical data to post-training mixtures—alongside their rapidly expanding set of capabilities.
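To make the framing concrete, here is a rough sketch of what a pedagogically instructed training or evaluation example might look like: the desired tutoring behavior lives in a system-level instruction rather than in the model’s defaults. The field names, attribute wording, and conversation format below are my own illustration, not the paper’s actual data schema.

```python
# Hypothetical sketch of "pedagogical instruction following": the pedagogy the
# model should exhibit is spelled out in a system-level instruction, so the
# same base model can be steered toward different tutoring behaviors.
# Field names and instruction wording are illustrative, not LearnLM's format.

example = {
    "system_instruction": (
        "You are a tutor. Do not reveal full solutions. Ask one guiding "
        "question per turn, check the student's understanding before moving "
        "on, and encourage the student when they make progress."
    ),
    "conversation": [
        {"role": "user", "content": "Can you solve 2x + 6 = 14 for me?"},
        # The desired model turn follows the pedagogy described above
        # instead of simply presenting the answer:
        {
            "role": "model",
            "content": (
                "Let's work through it together. What could you do first "
                "to get the term with x by itself?"
            ),
        },
    ],
}
```

Under this framing, changing the pedagogy is a matter of changing the system instruction, not retraining toward a single fixed definition of good teaching.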
The authors also made good progress on the design of their evals, which is an under-appreciated problem when working with LLMs generally, and especially so with LLMs in educational contexts. This is already proving helpful in my own work.
In our initial tech report we discussed a taxonomy of pedagogy evaluation designs, and reported results of four human evaluations with different methodologies (Sections 4 and 5 in Jurenka et al. [1]). Here, we focus on scenario-guided, conversation-level pedagogy evaluations and side-by-side comparisons. We improved the clarity and coverage of our learning scenarios, added System Instructions specific to each scenario, and updated the pedagogy rubric and questions. Guiding the conversations with scenarios is especially important in multi-turn settings [10]: without scenarios, the unconstrained nature of human-AI interactions frequently leads to meandering conversations, offering a poor basis for comparison. In contrast, scenario-based approaches support relatively repeatable, controlled comparisons of the capabilities of different conversational AI systems. Scenario frameworks also help with evaluation coverage, ensuring that we test a diverse range of use cases.
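As a rough sketch of how a scenario-guided, side-by-side evaluation could be organized: each scenario fixes the learning context and a scenario-specific System Instruction, and raters then compare whole conversations from two models against a shared rubric. The concrete fields, rubric wording, and rater interface below are illustrative assumptions on my part, not the paper’s actual setup.

```python
# Illustrative structure for a scenario-guided, conversation-level,
# side-by-side pedagogy evaluation. Field names, rubric questions, and the
# rater interface are assumptions for the sketch, not the paper's protocol.

scenario = {
    "id": "algebra-homework-help",
    "learner_persona": "9th grader stuck on solving linear equations",
    "learning_goal": "Solve one-variable linear equations independently",
    "system_instruction": (
        "Act as a patient tutor. Guide with questions, give hints rather "
        "than answers, and adapt to the student's responses."
    ),
}

rubric_questions = [
    "Which conversation better avoided giving away the answer?",
    "Which conversation better checked the learner's understanding?",
    "Which conversation was more encouraging and supportive?",
]

def collect_ratings(scenario, conversation_a, conversation_b, rater):
    """Ask one rater to compare two conversations for the same scenario,
    one rubric question at a time (rater.compare is a hypothetical hook)."""
    return {
        question: rater.compare(scenario, conversation_a, conversation_b, question)
        for question in rubric_questions
    }
```

Fixing the scenario and System Instruction is what makes the comparison repeatable: both models face the same learner, the same goal, and the same pedagogical constraints, so rater judgments reflect differences in tutoring behavior rather than differences in what the conversation happened to be about.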