The AI oversight trap: When smarter models make the same mistakes

CO-EDP, VisionRI | Updated: 12-02-2025 17:10 IST | Created: 12-02-2025 17:10 IST

Artificial intelligence is evolving at an astonishing pace, with language models (LMs) becoming more capable and autonomous. However, as these models become integral to decision-making and evaluation, a fundamental question arises: Can AI truly oversee AI? Recent research suggests that AI models tend to think alike, which could lead to systemic risks in AI oversight.

A study titled "Great Models Think Alike and this Undermines AI Oversight", authored by Shashwat Goel, Joschka Strüber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, and Jonas Geiping, investigates how model similarity affects AI oversight. Posted to arXiv in 2025, the study introduces a novel metric, Chance Adjusted Probabilistic Agreement (CAPA), to measure functional similarity between LMs. The research reveals that as AI models become more capable, their errors increasingly align, making oversight less reliable. This has profound implications for the safety, reliability, and fairness of AI systems.

AI oversight and the challenge of model similarity

The study highlights a growing trend in AI governance: relying on LMs to evaluate other LMs. This practice, known as "AI Oversight," is becoming more common due to the increasing cost and complexity of human supervision. AI is now used to score model outputs, provide feedback, and even fine-tune other AI models. While this offers scalability and efficiency, the researchers identify a critical flaw: models that are too similar tend to reinforce each other's biases and mistakes.

To quantify this, the study introduces CAPA, a metric that measures how often two models make the same mistakes on the same inputs, beyond the agreement their individual accuracy levels would predict by chance. Using CAPA, the researchers show that as AI models become more advanced, their mistakes become more correlated. Errors are no longer independent but systematic, posing a significant challenge for oversight mechanisms that rely on AI judges or automated evaluations.
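
To make the idea concrete, the sketch below computes a simple chance-adjusted agreement score from two models' per-question correctness. It is an illustration in the spirit of CAPA rather than the paper's exact formula (which also incorporates the models' output probabilities), and the toy correctness lists are made up.

```python
# Illustrative chance-adjusted agreement between two models, in the spirit of
# CAPA but not the paper's exact formula (CAPA also uses the models' output
# probabilities). Inputs: per-question correctness flags (1 = correct).

def chance_adjusted_agreement(correct_a, correct_b):
    n = len(correct_a)
    assert n == len(correct_b) and n > 0

    # Observed agreement: both right or both wrong on the same question.
    observed = sum(a == b for a, b in zip(correct_a, correct_b)) / n

    # Expected agreement if errors were independent, given only each
    # model's overall accuracy.
    acc_a, acc_b = sum(correct_a) / n, sum(correct_b) / n
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    # 0 = no more alike than chance; 1 = identical error patterns.
    return (observed - expected) / (1 - expected + 1e-12)

# Toy data (made up): two models with similar accuracy and overlapping mistakes.
model_a = [1, 1, 0, 1, 0, 1, 0, 1]
model_b = [1, 1, 0, 1, 0, 0, 0, 1]
print(chance_adjusted_agreement(model_a, model_b))
```

A score near zero means the two models are no more alike in their errors than their accuracies alone would predict; a score near one means they stumble on essentially the same questions.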

The affinity bias of AI judges

One of the most striking findings of the study is that AI models used as judges tend to favor models that are similar to themselves. This is analogous to human affinity bias, where people tend to prefer individuals with similar traits. The research shows that when a language model evaluates another AI model's output, it assigns higher scores to models that share its own error patterns.

For example, an AI judge trained on a particular dataset may favor outputs that mirror its own linguistic structures and reasoning patterns. Even when a more capable model provides a superior answer, the judge may score it lower simply because the answer deviates from the judge's own way of reasoning. This presents a major obstacle for AI oversight, as it means AI evaluations could be inherently biased and unreliable.
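
One simple way to probe for this bias, sketched below, is to check whether a judge's average score for each candidate model rises with that candidate's similarity to the judge. The similarity and score values here are placeholders, not figures from the study, and a real analysis would also control for each candidate's actual accuracy.

```python
# Hypothetical check for affinity bias in an LM judge: does the judge's average
# score for a candidate model track how similar that candidate is to the judge?
# All numbers below are placeholders, not measurements from the study.
from scipy.stats import pearsonr

# similarity_to_judge[i]: chance-adjusted agreement between candidate i and the judge.
# judge_score[i]: average score the judge assigned to candidate i's answers.
similarity_to_judge = [0.12, 0.25, 0.31, 0.44, 0.58, 0.63]
judge_score = [6.1, 6.4, 6.9, 7.2, 7.8, 8.0]

r, p = pearsonr(similarity_to_judge, judge_score)
print(f"similarity vs. judge score: r = {r:.2f}, p = {p:.3f}")
# A clearly positive r that persists after controlling for each candidate's
# actual accuracy would be consistent with the affinity bias described above.
```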

Weak-to-strong generalization: When AI learns from AI

Another key area explored in the study is weak-to-strong generalization, in which a stronger model improves by learning from a weaker one. This occurs when a less capable model provides annotations or feedback that a more capable student model is then fine-tuned on. The study finds that the performance gains are larger when the weak supervisor and the strong student are less similar.
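
The sketch below mimics this setup with small scikit-learn classifiers standing in for language models: a weak supervisor labels unlabeled data, and a stronger student is trained on those noisy labels instead of the ground truth. It is a minimal illustration of the training pattern, not the paper's experimental pipeline, and the dataset and model choices are arbitrary.

```python
# Minimal weak-to-strong sketch with toy scikit-learn classifiers standing in
# for language models; this is not the paper's experimental pipeline.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, n_informative=5,
                           random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, train_size=500, random_state=0)
X_unlab, X_test, _, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                              random_state=0)

# Weak supervisor: a simple model trained on a small labeled set.
weak = LogisticRegression(max_iter=1000).fit(X_sup, y_sup)

# The weak model annotates unlabeled data; the stronger student is trained on
# those noisy labels, never seeing the ground truth.
weak_labels = weak.predict(X_unlab)
strong = GradientBoostingClassifier(random_state=0).fit(X_unlab, weak_labels)

print(f"weak supervisor accuracy: {weak.score(X_test, y_test):.3f}")
print(f"strong student accuracy:  {strong.score(X_test, y_test):.3f}")
```

If the student ends up more accurate than the supervisor that labeled its training data, weak-to-strong generalization has occurred.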

However, as AI models become increasingly homogeneous, the opportunity for complementary learning diminishes. If all models are trained using similar data and architectures, their learning gaps narrow, making it harder to transfer diverse knowledge. This could slow down the progress of AI development and make models more susceptible to shared blind spots.

The risk of correlated failures

Perhaps the most concerning implication of the study is the observation that as AI models improve, their errors become more correlated. This means that instead of improving oversight, AI may simply amplify common mistakes across different systems. If AI models continue to be trained on similar datasets, use the same architectures, and rely on AI oversight, they risk developing systemic blind spots.

For example, if multiple AI systems are used to detect fraudulent transactions but they all share similar training biases, they may collectively fail to identify new types of fraud. Similarly, in medical AI applications, correlated errors could result in consistent misdiagnoses, leading to dangerous outcomes. The study suggests that AI oversight strategies must account for diversity in model training to mitigate these risks.
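
A toy simulation makes the panel-level risk concrete: majority voting across several detectors helps much less when their errors stem from a shared blind spot than when they fail independently. All of the error rates below are invented for the illustration.

```python
# Toy simulation of correlated vs. independent failures in a 5-model panel
# that decides by majority vote. All error rates are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_models = 100_000, 5

# Independent panel: each model errs on its own 20% of cases.
indep_errors = rng.random((n_cases, n_models)) < 0.20

# Correlated panel: a shared blind spot (15% of cases) trips every model,
# plus a small amount of private error, giving a similar per-model error rate.
shared = rng.random((n_cases, 1)) < 0.15
private = rng.random((n_cases, n_models)) < 0.05
corr_errors = shared | private

def majority_error_rate(errors):
    # The panel fails when a majority of its members err on the same case.
    return (errors.sum(axis=1) > errors.shape[1] // 2).mean()

print("independent panel:", round(majority_error_rate(indep_errors), 3))
print("correlated panel: ", round(majority_error_rate(corr_errors), 3))
```

With these made-up rates, the five independent detectors are collectively wrong on only about 6 percent of cases, while the correlated panel, despite each member having a similar individual error rate, fails on about 15 percent, essentially whenever the shared blind spot is hit.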

The future of AI oversight requires diversity

The findings of this study highlight a fundamental paradox in AI governance: as AI models become more similar, their ability to oversee each other weakens. This raises critical questions about the reliability of AI judges, the effectiveness of weak-to-strong learning, and the risks of correlated failures.

To address these challenges, researchers propose that AI oversight should incorporate greater model diversity. This could involve training models on more varied datasets, using different architectures, and developing hybrid AI-human oversight systems. Additionally, transparency in AI evaluation processes is crucial: benchmarking methodologies should account for model similarity to avoid reinforcing biases.
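
As a small example of what accounting for model similarity could look like in practice, the sketch below picks the three-member evaluation committee with the lowest average pairwise similarity from a pool of candidate judges. The judge names and similarity scores are hypothetical placeholders, not models or measurements from the study.

```python
# Hypothetical sketch of "accounting for model similarity": pick the 3-judge
# evaluation committee with the lowest average pairwise similarity. The judge
# names and similarity scores are placeholders, not data from the study.
from itertools import combinations

similarity = {  # symmetric pairwise similarity (e.g., CAPA-style scores)
    ("judge_a", "judge_b"): 0.72, ("judge_a", "judge_c"): 0.35,
    ("judge_a", "judge_d"): 0.40, ("judge_a", "judge_e"): 0.66,
    ("judge_b", "judge_c"): 0.38, ("judge_b", "judge_d"): 0.45,
    ("judge_b", "judge_e"): 0.70, ("judge_c", "judge_d"): 0.30,
    ("judge_c", "judge_e"): 0.41, ("judge_d", "judge_e"): 0.44,
}
judges = sorted({name for pair in similarity for name in pair})

def avg_similarity(committee):
    pairs = list(combinations(sorted(committee), 2))
    return sum(similarity[pair] for pair in pairs) / len(pairs)

best = min(combinations(judges, 3), key=avg_similarity)
print("most diverse committee:", best, f"(avg similarity {avg_similarity(best):.2f})")
```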

As AI continues to shape decision-making across industries, ensuring diverse oversight mechanisms will be key to building more robust, fair, and trustworthy AI systems. The future of AI governance depends on breaking the cycle of similarity and fostering greater model independence, interpretability, and accountability.
