AI fails to match expert judgments in education assessment; needs continual learning to improve

An artificial intelligence system developed to evaluate teacher engagement in online video lectures has failed to deliver results comparable to expert human judgment, according to a new peer-reviewed study published in Education Sciences. The model, trained on recorded university lectures using deep learning techniques, showed only limited agreement with professional evaluations when tested in a controlled setting, raising concerns about the reliability and real-world applicability of automated educational tools.
The system, developed by researchers at the University of Southern Queensland, was part of a multi-phase project aimed at automating the assessment of online teaching effectiveness. It was designed to identify visual and behavioral indicators of teacher engagement, such as facial expressions, eye contact, tone of voice, and interaction patterns, using convolutional neural networks. Training data included 25 annotated videos, and the model was evaluated on two additional lecture recordings.
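The paper's code is not reproduced here, but the general pipeline described above can be sketched in a few lines. The snippet below is an illustrative assumption rather than the authors' implementation: it samples frames from a lecture video with OpenCV and passes them through a small convolutional network that outputs one score per engagement indicator. Every name, layer size, and sampling rate in it is hypothetical.

```python
# Illustrative sketch only -- not the authors' implementation.
# Assumes OpenCV for frame extraction and PyTorch for the CNN.
import cv2
import torch
import torch.nn as nn

NUM_INDICATORS = 15      # the study lists 15 indicators; their names are not modeled here
FRAME_INTERVAL = 30      # assumed: sample roughly one frame per second at 30 fps

class EngagementCNN(nn.Module):
    """Small convolutional classifier emitting one score per engagement indicator."""
    def __init__(self, num_indicators=NUM_INDICATORS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, num_indicators)

    def forward(self, x):
        z = self.features(x).flatten(1)
        return torch.sigmoid(self.head(z))   # independent probability per indicator

def frames_from_video(path, interval=FRAME_INTERVAL, size=(224, 224)):
    """Yield resized RGB frame tensors sampled every `interval` frames."""
    cap = cv2.VideoCapture(path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            frame = cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2RGB)
            yield torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0
        idx += 1
    cap.release()
```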
Statistical validation revealed substantial discrepancies between the AI's assessments and those of two experienced human raters. Cohen's Kappa values of 0.09 and 0.07 indicated negligible chance-corrected agreement, and the Intraclass Correlation Coefficient of 0.45 reached only moderate consistency at best. Bland–Altman analysis showed wide limits of agreement, and both Pearson and Spearman coefficients revealed weak or near-zero correlation between the model's scores and the human ratings.
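To make these statistics concrete, the sketch below computes the same family of agreement measures (Cohen's kappa, Pearson and Spearman correlations, and Bland–Altman bias with limits of agreement) for two rating vectors. The numbers are placeholders, not the study's data, and the ICC is only referenced in a comment because its exact formulation is not specified here.

```python
# Agreement statistics between an AI rater and a human rater.
# Example data are placeholders, not values from the study.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

ai_scores    = np.array([3, 2, 4, 3, 1, 4, 2, 3, 5, 2])   # hypothetical ordinal ratings
human_scores = np.array([4, 2, 3, 5, 2, 3, 1, 4, 4, 3])

# Cohen's kappa: chance-corrected agreement for categorical/ordinal ratings
kappa = cohen_kappa_score(ai_scores, human_scores)

# Pearson (linear) and Spearman (rank) correlations
r_pearson, _ = pearsonr(ai_scores, human_scores)
r_spearman, _ = spearmanr(ai_scores, human_scores)

# Bland-Altman: mean difference (bias) and 95% limits of agreement
diff = ai_scores - human_scores
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)

# ICC can be computed with, e.g., pingouin's intraclass_corr() on long-format data.

print(f"kappa={kappa:.2f}  pearson={r_pearson:.2f}  spearman={r_spearman:.2f}")
print(f"Bland-Altman bias={bias:.2f}, limits of agreement = {bias:.2f} +/- {loa:.2f}")
```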
The researchers attributed part of the gap to the static nature of the AI model, which had not been updated in two years. While human evaluators had adapted their understanding of engagement criteria over time, the AI system remained fixed in its original state. The study described this as a central limitation and warned that without continual updates, such tools may quickly become obsolete in dynamic educational environments.
The AI model was trained on 15 specific engagement indicators identified through a systematic review. These included behaviors such as encouraging student questions, maintaining eye contact, and using varied tone and pitch. Each video frame was converted into an image and classified according to the presence or absence of these behaviors. Internal metrics from the training phase showed 68% precision, 75% recall, and 79% balanced accuracy. However, these figures did not hold when the model was validated against independent expert ratings.
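The training-phase figures correspond to standard frame-level classification metrics. As a rough illustration, assuming binary present/absent labels per indicator, they could be computed as follows (placeholder labels, not the study's data):

```python
# Frame-level classification metrics of the kind reported during training.
# Labels below are placeholders, not the study's data.
import numpy as np
from sklearn.metrics import precision_score, recall_score, balanced_accuracy_score

# 1 = indicator present in frame, 0 = absent (hypothetical single-indicator example)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

precision = precision_score(y_true, y_pred)          # of predicted positives, how many were right
recall    = recall_score(y_true, y_pred)             # of actual positives, how many were found
bal_acc   = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall, robust to imbalance

print(f"precision={precision:.2f}  recall={recall:.2f}  balanced_accuracy={bal_acc:.2f}")
```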
Researchers emphasized that while AI models can assist in educational evaluation, they should not be relied upon in isolation. The model’s failure to align with expert assessments highlights the difficulty of automating complex pedagogical judgments, especially those involving affective or interpersonal dimensions. The authors also cautioned against using such tools to make high-stakes decisions about instructor performance without additional oversight.
The study supports broader concerns about model staleness in artificial intelligence systems used in education. As pedagogy evolves and learner expectations shift, static models may fail to capture current best practices. The authors recommended developing AI systems with continual learning capabilities or regular retraining protocols to keep models aligned with changing professional standards.
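One lightweight way to operationalize such a retraining protocol, offered here purely as a sketch and not as the authors' proposal, is to re-validate the deployed model against fresh expert ratings at regular intervals and flag it for retraining whenever chance-corrected agreement drops below a chosen threshold. The threshold, function name, and data below are all illustrative assumptions.

```python
# One possible retraining trigger (illustrative, not the authors' protocol):
# periodically re-validate the deployed model against fresh expert ratings
# and flag it for retraining when agreement falls below a chosen threshold.
import numpy as np
from sklearn.metrics import cohen_kappa_score

KAPPA_THRESHOLD = 0.4   # assumed cut-off; roughly "moderate agreement" on common kappa scales

def needs_retraining(ai_ratings, expert_ratings, threshold=KAPPA_THRESHOLD):
    """Return (degraded, kappa): degraded is True when agreement falls below the threshold."""
    kappa = cohen_kappa_score(np.asarray(ai_ratings), np.asarray(expert_ratings))
    return kappa < threshold, kappa

if __name__ == "__main__":
    flag, kappa = needs_retraining([3, 2, 4, 1, 3, 5], [4, 2, 3, 2, 4, 5])  # placeholder data
    print(f"kappa={kappa:.2f}, retrain={'yes' if flag else 'no'}")
```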
In addition to updating the model, the researchers proposed expanding the training dataset and refining annotation practices to improve future versions. They also suggested incorporating educator feedback during development and deploying more transparent model outputs to allow instructors to interpret and contextualize AI-generated feedback.
Despite its shortcomings, the model showed potential for use in formative feedback and professional development contexts, provided it is regularly maintained and clearly presented as a support tool rather than a decision-maker. The authors concluded that AI models in education must remain subordinate to human expertise and that their role should be clearly communicated to educators and institutions alike.
The findings underscore a broader tension in education technology between the promise of scalable automation and the irreplaceable value of human judgment in teaching. The study serves as a cautionary example for institutions adopting AI-driven evaluation systems and stresses the importance of continuous model validation, stakeholder engagement, and responsible use.
The early access version of the study, titled "Evaluating an Artificial Intelligence (AI) Model Designed for Education to Identify Its Accuracy: Establishing the Need for Continuous AI Model Updates," is published in Education Sciences.
First published in: Devdiscourse