Assessing Multimodal AI with KoNET: A New Standard for Korean Educational AI
KoNET, developed by NAVER Cloud AI and KAIST AI, is a pioneering benchmark that evaluates multimodal large language models (MLLMs) using Korea’s national educational tests, enabling direct AI-human performance comparisons. It highlights AI’s strengths and weaknesses in multilingual, multimodal reasoning and underscores the need for localized, culturally aware AI training beyond English-centric benchmarks.
KoNET, a pioneering benchmark developed by researchers at NAVER Cloud AI and KAIST AI, is designed to evaluate multimodal large language models (MLLMs) using Korea’s national educational tests. Unlike conventional AI evaluation methods that focus heavily on English, KoNET leverages rigorous Korean national exams to assess AI models across various educational levels and subjects. The benchmark consists of four standardized tests: the Korean Elementary General Educational Development Test (KoEGED), its Middle (KoMGED) and High (KoHGED) school counterparts, and the College Scholastic Ability Test (KoCSAT). These tests, widely recognized for their high difficulty and cognitive demand, present a real-world challenge for AI systems, making them well suited for assessing the reasoning, comprehension, and problem-solving capabilities of modern AI models. Given the rapid integration of AI into education, tutoring, and automated assessment technologies, KoNET provides a unique opportunity to measure AI performance against human benchmarks, especially in non-English languages where AI evaluation remains limited.
Why KoNET Matters in AI Research
The development of KoNET stems from a fundamental problem in existing AI benchmarking: a lack of multilingual and multimodal assessment tools. Current AI benchmarks are overwhelmingly English-centric, failing to evaluate how AI models perform in lower-resource languages such as Korean. While Large Language Models (LLMs) have made significant progress in text generation, reasoning, and comprehension, their ability to understand and process complex Korean-language questions remains largely untested. Moreover, most AI benchmarks do not compare AI proficiency directly with human performance, making it difficult to determine their practical value in real-world educational settings. Another major limitation of conventional benchmarks is the absence of multimodal components, even though real-world exams require students to interpret both text and images to arrive at the correct answer. KoNET bridges these gaps by integrating visual question-answering (VQA) tasks, text comprehension, and multimodal problem-solving, making it one of the most advanced non-English AI assessment frameworks available.
How KoNET Works: The AI Testing Process
KoNET was constructed using publicly available national exam PDFs from the Korea Institute of Curriculum and Evaluation (KICE). The dataset comprises 2,377 images and corresponding questions, covering subjects like mathematics, science, language arts, and social studies. Unlike conventional text-based AI benchmarks, KoNET employs a multimodal format, requiring AI models to interpret exam questions embedded within images rather than processing pre-formatted textual inputs. This aligns with recent advancements in AI, where multimodal models must understand both visual and textual content simultaneously. One of the most distinctive features of KoNET is the inclusion of human error rate data from the KoCSAT, enabling a direct AI-human comparison. This allows researchers to analyze where AI models excel and where they fail compared to human test-takers, providing deeper insights into AI reasoning and comprehension abilities.
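To make that structure concrete, the minimal Python sketch below shows how a single KoNET-style item might be represented when a question image is paired with its answer key and, for KoCSAT, a human error rate. The field names and the example values are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KoNETItem:
    """Illustrative record for one exam question rendered as an image.

    Field names are hypothetical; the real KoNET dataset builder defines its own schema.
    """
    exam: str                                  # e.g. "KoEGED", "KoMGED", "KoHGED", "KoCSAT"
    subject: str                               # e.g. "mathematics", "language arts"
    image_path: str                            # the question as it appears on the exam sheet
    answer: str                                # gold answer key, e.g. choice "4"
    human_error_rate: Optional[float] = None   # only available for KoCSAT items

# A hypothetical example entry
item = KoNETItem(
    exam="KoCSAT",
    subject="language arts",
    image_path="konet/kocsat/lang_017.png",
    answer="4",
    human_error_rate=0.62,   # fraction of test-takers who missed this question
)
```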
To evaluate AI performance on KoNET, researchers tested 18 open-source LLMs, 20 open-source MLLMs, 4 closed-source LLMs, and 4 closed-source MLLMs, including models from OpenAI, Google, Meta, and NAVER Cloud AI. The experiment employed Chain-of-Thought (CoT) prompting, a reasoning approach that guides AI models step by step through complex problem-solving tasks. Additionally, Optical Character Recognition (OCR) APIs were used to extract textual content from images for non-visual AI models, ensuring a fair comparison between text-based and multimodal models.
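As a rough illustration of that pipeline, the sketch below evaluates one question with a text-only model: an OCR step converts the exam image to text, and a Chain-of-Thought prompt asks the model to reason step by step before committing to a final choice. The `ocr_extract` and `query_model` helpers are placeholders for whichever OCR API and model endpoint are actually used, and the prompt wording is an assumption rather than the paper's exact prompt.

```python
import re

def ocr_extract(image_path: str) -> str:
    """Placeholder: call an OCR API on the exam image and return the extracted text."""
    raise NotImplementedError

def query_model(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM endpoint and return its reply."""
    raise NotImplementedError

# Chain-of-Thought style instruction: solve step by step, then state the answer
# on the last line in the form "정답: <choice number>".
COT_TEMPLATE = (
    "다음은 한국 국가시험 문제입니다.\n"
    "{question}\n\n"
    "단계별로 차근차근 풀이한 뒤, 마지막 줄에 '정답: <번호>' 형식으로 답하세요."
)

def evaluate_question(image_path: str, gold_answer: str) -> bool:
    """OCR the exam image, prompt a text-only LLM with CoT, and score the final choice."""
    question_text = ocr_extract(image_path)
    reply = query_model(COT_TEMPLATE.format(question=question_text))
    match = re.search(r"정답[:\s]*([0-9])", reply)   # pull the final choice number from the reply
    return bool(match) and match.group(1) == gold_answer
```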
Key Findings: AI vs. Human Performance on KoNET
The results of the experiment uncovered several key insights. First, closed-source models such as GPT-4o, Claude 3.5, and HyperCLOVA X significantly outperformed open-source models, particularly in the most difficult exam sections. This suggests that open-source AI models lack sufficient tuning for Korean-language tasks, while commercial AI models benefit from proprietary data and optimization. Interestingly, MLLMs did not always outperform LLMs, challenging the assumption that multimodal models are inherently superior. In several cases, text-based LLMs with OCR support achieved higher accuracy than MLLMs trained for image-text interpretation, highlighting the difficulties AI faces in processing Korean characters embedded in images.
Another notable finding was the dramatic decline in AI performance as question difficulty increased, particularly in the KoCSAT subset, which is also challenging for human test-takers. This demonstrates that AI still struggles with higher-order problem-solving and contextual reasoning, even when trained on large-scale datasets. Models with a stronger Korean-language focus performed better in culturally specific contexts. For example, EXAONE-3.0-7.8B-Instruct, which was specifically trained for bilingual tasks in Korean and English, outperformed similarly sized English-centric AI models. A striking case was a high school literature question based on Yongbieocheonga (Songs of the Dragons Flying to Heaven), a 1445 Korean historical text. While most AI models failed to interpret the historical and literary context correctly, EXAONE-3.0-7.8B provided the right answer, demonstrating that cultural and linguistic familiarity significantly impacts AI performance.
Future Implications and the Road Ahead
One of KoNET’s most groundbreaking contributions is its direct comparison of AI and human error rates. By leveraging KoCSAT performance data from over 505,000 students, researchers discovered that AI models and humans make errors in different ways. AI models excel at comprehension-based tasks, where human test-takers often make mistakes due to fatigue or attention lapses. However, humans outperform AI in knowledge-based recall and historical/mathematical problem-solving, particularly in long-tail factual recall questions. This suggests that while AI can analyze vast amounts of information rapidly, it struggles with complex, knowledge-driven reasoning.
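For readers curious how such a comparison reduces to code, the sketch below shows one plausible way to contrast per-question model correctness with published human error rates. The data layout and the 0.3 threshold are purely illustrative assumptions, not the authors' method.

```python
from statistics import mean

def error_rate_gap(results):
    """results: list of dicts like {"model_correct": bool, "human_error_rate": float}.

    Returns the model's overall error rate, the average human error rate, and the
    questions the model misses even though most test-takers answer them correctly.
    """
    model_err = mean(0.0 if r["model_correct"] else 1.0 for r in results)
    human_err = mean(r["human_error_rate"] for r in results)
    easy_for_humans_hard_for_ai = [
        r for r in results if not r["model_correct"] and r["human_error_rate"] < 0.3
    ]
    return model_err, human_err, easy_for_humans_hard_for_ai
```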
KoNET’s broader applications extend into AI-powered tutoring, automated grading, and education technology development. By providing a realistic, high-quality benchmark, KoNET can help fine-tune AI models for multilingual and multimodal learning environments. However, the study highlights some limitations. Since KoNET primarily consists of multiple-choice questions, it does not fully capture an AI model’s ability to express complex reasoning in written responses. Additionally, as national exams evolve annually, regular updates to KoNET’s dataset are required to maintain its relevance and prevent data leakage.
KoNET sets a new standard for AI evaluation in the Korean educational landscape, addressing critical gaps in multilingual and multimodal AI assessment. The research underscores the importance of localized AI training, demonstrating that AI models trained on English-based benchmarks alone cannot accurately predict performance in non-English environments. By making the KoNET dataset builder open-source, the authors encourage global AI research communities to develop more inclusive and culturally adaptive AI systems. As AI becomes more deeply integrated into education and learning technologies, KoNET will serve as a vital tool for ensuring AI’s effectiveness in diverse linguistic and academic contexts.

