Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments
Maybe it's easier to identify cheating on multiple-choice questions than on essays
Strugatski, A., & Alexandron, G. (2024). Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments (No. arXiv:2412.02713). arXiv. https://doi.org/10.48550/arXiv.2412.02713
You’ve probably read reports that generative AI can ace a wide range of standardized exams, far exceeding average human performance. What if that outstanding performance left a psychometric trace that you could follow to detect cheating?
From the abstract:

In this paper, we propose a method based on the application of Item Response Theory to address this gap. Our approach operates on the assumption that artificial and human intelligence exhibit different response patterns, with AI cheating manifesting as deviations from the expected patterns of human responses. These deviations are modeled using Person-Fit Statistics. We demonstrate that this method effectively highlights the differences between human responses and those generated by premium versions of leading chatbots (ChatGPT, Claude, and Gemini), but that it is also sensitive to the amount of AI cheating in the data. Furthermore, we show that the chatbots differ in their reasoning profiles. Our work provides both a theoretical foundation and empirical evidence for the application of IRT to identify AI cheating in MCQ-based assessments.
Lots to unpack here, but this is really promising work.
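To get a feel for what "deviations from expected human patterns" means in practice, here is a minimal sketch of one common person-fit statistic, the standardized log-likelihood l_z, under a 2PL IRT model. The abstract doesn't specify which person-fit statistic or IRT model the authors used, so the 2PL choice, the item parameters, and the toy response pattern below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def lz_person_fit(responses, theta, a, b):
    """Standardized log-likelihood person-fit statistic (l_z) under a 2PL IRT model.

    responses : 0/1 array of one test-taker's scored answers
    theta     : that test-taker's estimated ability
    a, b      : item discrimination and difficulty parameters (same length as responses)

    Large negative values flag response patterns that are unlikely for a human
    with that ability estimate -- the kind of deviation person-fit methods target.
    """
    responses = np.asarray(responses, dtype=float)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)

    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # P(correct | theta) for each item
    log_lik = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (log_lik - expected) / np.sqrt(variance)

# Toy example (hypothetical item parameters): a pattern that misses the easy
# items but nails the hard ones -- a profile human test-takers rarely produce.
a = np.array([1.2, 0.9, 1.0, 1.1, 0.8])
b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])        # items ordered easy -> hard
aberrant = np.array([0, 0, 1, 1, 1])             # wrong on easy, right on hard
print(lz_person_fit(aberrant, theta=0.0, a=a, b=b))   # about -3.5: strong misfit
```

The intuition is the one the paper leans on: a human who answers hard items correctly almost never misses the easy ones, whereas a chatbot's errors need not follow human difficulty ordering, so its answer sheets can drift into the misfit tail of a statistic like this.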