Do Chatbots Really Teach? A Deep Dive into Their Use, Design, and Educational Impact

A comprehensive review by European researchers reveals that while educational chatbots are widely used and often effective, most lack theoretical grounding and standardized evaluation. The study urges integrating pedagogical frameworks and holistic assessment to maximize their educational value.


CoE-EDP, VisionRI | Updated: 17-08-2025 09:40 IST | Created: 17-08-2025 09:40 IST

In a timely and methodical effort to decode the fast-evolving role of chatbots in education, researchers from the Open Universiteit (Netherlands), Maastricht University, and the Universidad Politécnica de Valencia have delivered one of the most comprehensive reviews to date. Drawing from 71 peer-reviewed studies published between 2019 and 2023, the team undertook a systematic analysis using the PRISMA framework to understand the objectives, technologies, theories, evaluation strategies, and effectiveness of educational chatbots. Their findings reveal a field teeming with potential but fraught with inconsistencies, especially in terms of theoretical grounding and evaluation quality.

The review highlights that a large majority, 65%, of chatbots are teaching-oriented, designed to enhance learning experiences by delivering content, answering questions, or offering tailored feedback. Around 31% serve administrative functions, such as managing schedules or facilitating communication, while only 4% attempt to bridge both learning and service functions. These chatbots are largely deployed in higher education settings, with heavy use in STEM fields, especially computer science and health-related disciplines. However, their distribution across 31 academic domains shows that chatbot integration is a widespread trend, not confined to technical subjects.

Missing the Pedagogical Mark: The Theory Gap

One of the most revealing insights from the study is the widespread absence of theoretical grounding in chatbot design. Only 18% of studies incorporated learning, motivational, or social theories in meaningful ways. Theories like constructivism, scaffolding, or self-regulated learning, which can play a vital role in shaping learner engagement and cognitive processes, are rarely used. Even when mentioned, theoretical frameworks often appear superficially, without influencing design choices or evaluation methods.

The consequence of this gap is significant. The review notes that all chatbots categorized as ineffective lacked any theoretical basis. Meanwhile, those few that incorporated theory tended to show more consistent success in achieving learning outcomes. The disconnect likely stems from the interdisciplinary nature of chatbot development, where computer scientists, rather than educators, often lead the process. As a result, many chatbots risk being technologically innovative but pedagogically shallow, limiting their long-term educational value.

Platforms Dominate, but Transparency Lags

The technological backbone of chatbot development is dominated by builder platforms such as Dialogflow, IBM Watson, and Rasa, used in nearly 60% of the cases. These tools offer accessible, low-code environments that are ideal for developers with limited programming backgrounds. However, the widespread reliance on platforms often comes with a lack of transparency; many studies failed to specify what kind of underlying natural language processing, AI models, or rule-based systems powered the bots.

Beyond platforms, about 17% of chatbots used AI-driven models, while 13% combined AI with rule-based logic in hybrid systems. Pure rule-based bots, once common, now represent a small minority. Surprisingly, advanced foundation models like GPT-3 or ChatGPT were rarely used, despite their rising prominence in the wider technology landscape. This could be due to the timing of the studies or to concerns around cost, accessibility, and ethics. The authors see vast potential in integrating large language models into education but emphasize the need for transparent, theory-driven design and careful evaluation.

Evaluation Practices: Varied and Often Superficial

One of the major challenges the authors identify is the lack of consistent and comprehensive evaluation frameworks. Across the 71 studies, they documented 37 distinct evaluation criteria, with most studies focusing on perceptual measures like satisfaction, usability, and trust. These are typically gauged using surveys and standardized tools such as the System Usability Scale (SUS). While important, these metrics offer only a partial view of educational effectiveness.
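For readers unfamiliar with the SUS mentioned above, its scoring is a simple fixed formula: ten 1-5 Likert items, with positively worded (odd) and negatively worded (even) items scored in opposite directions, then scaled to 0-100. A minimal sketch (the example responses are illustrative, not from the review):

```python
def sus_score(responses):
    """Compute a System Usability Scale score from ten 1-5 Likert responses.

    Odd-numbered items (positively worded) contribute (response - 1);
    even-numbered items (negatively worded) contribute (5 - response).
    The summed contributions are multiplied by 2.5 to give a 0-100 score.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5

# A respondent who fully agrees with every positive item and fully
# disagrees with every negative one reaches the maximum score.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # → 100.0
```

Because the scale tops out at 100 regardless of what the chatbot teaches, a high SUS says the bot is pleasant to use, not that learning occurred, which is exactly the limitation the review flags.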

Behavioral outcomes such as learning gains or engagement are somewhat more informative, often measured through pre- and post-tests or quiz performance. Fewer studies focused on technical evaluations like response accuracy or BLEU scores, and even fewer explored psychological outcomes such as motivation, self-efficacy, or anxiety. The researchers caution that relying solely on user impressions or technical precision can miss the deeper educational impact. Encouragingly, more recent studies show a shift toward integrating behavioral and psychological measures, signaling a maturing approach to chatbot evaluation.
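One common way to turn the pre- and post-test results mentioned above into a comparable learning-gain measure (not a method the review prescribes, just a widely used convention) is Hake's normalized gain, the fraction of the available improvement a learner actually achieved:

```python
def normalized_gain(pre, post, max_score=100.0):
    """Hake's normalized gain: g = (post - pre) / (max_score - pre).

    Expresses improvement as a fraction of the headroom left above the
    pre-test score, making gains comparable across different baselines.
    """
    if not (0 <= pre < max_score and 0 <= post <= max_score):
        raise ValueError("scores must lie in [0, max_score], with pre < max")
    return (post - pre) / (max_score - pre)

# Moving from 40 to 70 closes half of the remaining 60-point gap.
print(normalized_gain(40, 70))  # → 0.5
```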

What Works, and What Still Needs Work

Effectiveness, as defined in the study, hinges on whether a chatbot achieves its stated educational or support objectives. Based on this, 76% of chatbots were deemed effective, while 12% were partially effective, 6% ineffective, and another 6% lacked sufficient data. Interestingly, the same platform and technology sometimes yielded different results, depending on how well the chatbot’s design aligned with educational goals and theoretical principles.

To better understand these dynamics, the authors used social network analysis to map the relationships between objectives, technology types, evaluation criteria, theoretical frameworks, and effectiveness. The strongest pattern emerged around teaching-oriented chatbots, built on platforms, evaluated with perceptual and behavioral measures, and occasionally supported by educational theory. In contrast, rule-based systems, theory-light designs, and purely technical evaluations were underrepresented among successful cases.
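The mapping described above can be pictured as a co-occurrence network: each design attribute is a node, and an edge's weight is the number of studies in which two attributes appear together. A toy sketch with entirely hypothetical study records (the real analysis covered 71 studies and many more attributes):

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-study records: each set lists the features a study reports.
studies = [
    {"teaching", "platform", "perceptual-eval", "effective"},
    {"teaching", "platform", "behavioral-eval", "theory", "effective"},
    {"service", "rule-based", "technical-eval", "ineffective"},
    {"teaching", "platform", "perceptual-eval", "effective"},
]

# Count how often pairs of features co-occur; in a social network analysis
# these counts become the edge weights between attribute nodes.
edges = Counter()
for study in studies:
    for a, b in combinations(sorted(study), 2):
        edges[(a, b)] += 1

# The heaviest edges reveal the dominant pattern, here the
# teaching + platform + effective cluster the review describes.
print(edges.most_common(3))
```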

The review offers a valuable blueprint for advancing chatbot use in education. The authors call on developers, educators, and researchers to co-create chatbots that are not just technically sound but theoretically informed, pedagogically aligned, and rigorously evaluated. With foundation AI models gaining traction, the timing is ripe to steer chatbot innovation toward more meaningful, equitable, and impactful learning experiences. The message is clear: educational chatbots can indeed be transformative, but only when built on a foundation of smart pedagogy, not just smart code.

  • FIRST PUBLISHED IN:
  • Devdiscourse