Legal AI is being misjudged: Benchmarks don’t match real-world law

CO-EDP, VisionRI | Updated: 29-11-2025 10:35 IST | Created: 29-11-2025 10:35 IST

A new evidence review has raised urgent concerns about the way large language models are being tested for use in the legal sector, warning that most current evaluation methods fail to reflect how these systems will function inside courts, law firms, and public advisory settings. The analysis finds that widely used benchmarks and metrics offer an incomplete and often misleading view of model performance, leaving major risks unaddressed as legal institutions consider adopting AI.

The study, titled “A rapid evidence review of evaluation techniques for large language models in legal use cases: trends, gaps, and recommendations for future research,” published in AI & Society, reviewed 140 peer-reviewed papers published since the launch of ChatGPT-4, covering a wide range of legal applications including legal judgement prediction, legal question-answering, contract drafting, compliance detection, and legal analysis.

The review warns that unless evaluation methods change, legal AI research will continue producing misleading results that do not prepare institutions for the risks of real-world deployment.

Legal tasks are being oversimplified into narrow technical exercises

The review found that legal use cases are routinely broken down into tasks that do not reflect the complexity of legal reasoning, procedure, or decision-making. Many studies transform rich legal problems into classification tasks, especially binary classification. This includes predicting guilty or not-guilty outcomes, identifying compliance or violation in a document, or tagging components within contracts.

The review notes that this approach ignores essential elements of legal work. Lawyers, judges, and advisors must evaluate context, interpret ambiguous information, justify conclusions, and consider the social and professional consequences of their decisions. A binary outcome cannot capture this complexity. It also forces LLMs to deliver definitive answers even when the underlying case would reasonably be considered uncertain or requires deeper reasoning.

A minority of studies attempted to account for this by adding reasoning or explanation tasks, allowing models to write short justifications or generate legal judgements. These additions reflect a more realistic set of requirements but remain rare. In most studies, tasks are framed narrowly so that they fit the available quantitative metrics rather than the realities of legal work.

This reductionism not only distorts the nature of legal work but also shapes the direction of research. When benchmarks reward simple classification accuracy, researchers tend to design tasks that fit those metrics rather than tasks that fit actual legal needs. As a result, the authors argue, legal AI research risks drifting further away from its intended domain.

Benchmarking culture drives a misalignment between evaluation and legal reality

The review shows that most studies rely on general-purpose NLP metrics such as accuracy, precision, recall, F1 scores, BLEU, ROUGE, and Exact Match. These metrics are cheap, quantitative, and easy to compute, but they were designed for tasks like translation, summarization, and information retrieval, not for legal reasoning.

The authors highlight several fundamental problems with this approach. Metrics like F1, precision, and recall assume that an LLM’s output can be mapped to a fixed ground-truth label. But LLMs generate free-form text, and mapping that text to a label often requires forced rules that do not capture nuance. Meanwhile, BLEU and ROUGE focus on word overlap rather than meaning, which makes them unreliable for tasks where the semantic correctness of the answer matters more than its wording.

This leads to situations where a model produces legally sound reasoning expressed differently from a reference answer, yet receives a low score. Conversely, a model might match the wording of the ground truth while misunderstanding key concepts, yet receive a high score. The review points to several studies showing these mismatches directly.
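To make this mismatch concrete, the minimal sketch below (in Python, with invented example sentences that are not drawn from the review) scores two candidate answers against a reference using a simple unigram-overlap measure of the kind ROUGE-1 approximates. A correct paraphrase scores far lower than a near-verbatim answer that reverses the legal conclusion.

```python
# Illustrative sketch: a word-overlap score (ROUGE-1-style recall) can rank a
# legally wrong answer above a correct paraphrase. All texts are invented examples.
from collections import Counter

def unigram_recall(reference: str, candidate: str) -> float:
    """Fraction of reference words that also appear in the candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

reference = "the clause is unenforceable because it violates the statute of frauds"

# Correct legal conclusion, expressed in different words -> low overlap score.
paraphrase = "this provision cannot be enforced since it fails the writing requirement"

# Near-verbatim wording, but the legal conclusion is reversed -> high overlap score.
wrong_but_similar = "the clause is enforceable because it violates the statute of frauds"

print(f"correct paraphrase: {unigram_recall(reference, paraphrase):.2f}")  # ~0.18
print(f"wrong but similar:  {unigram_recall(reference, wrong_but_similar):.2f}")  # ~0.91
```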

The authors argue that this benchmarking culture has become self-reinforcing. Because these metrics are easy to implement, researchers continue to use them even though they poorly represent legal reasoning. This has created a cycle in which evaluation practices shape task design, and task design shapes the types of questions researchers ask about LLM performance.

The result is an expanding body of research that appears rigorous but does not provide the information legal institutions need. The review calls for a shift away from benchmark-driven evaluation toward context-driven evaluation grounded in real workflows.

Ethical, social, and contextual factors are missing from most experiments

The third major theme is the lack of attention to ethical and social dimensions. The authors find that only a small number of studies integrate fairness, bias, privacy, or user experience into their evaluation design. Most papers mention ethical risks only in passing but do not embed them in the tasks or metrics used.

This is particularly concerning in legal contexts where biased or inaccurate outputs can produce serious harm, such as discriminatory judgement predictions, incorrect legal advice, or flawed document analysis. The review highlights examples where LLMs were tested without considering how their performance might differ across demographic groups or legal systems.

A few studies stand out as exceptions. One tested racial bias in legal judgement prediction by masking identity information and measuring whether the model showed skewed predictions. Another developed a Legal Safety Score that combines accuracy and fairness when evaluating statutory reasoning in the Indian legal system.
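As a rough illustration of the masking approach described in the first of these studies, and not that study's actual protocol, the hypothetical sketch below swaps an identity term inside otherwise identical case facts and reports how often the predicted outcome flips. The model, case text, and identity terms are all placeholders.

```python
# Hypothetical sketch of a counterfactual bias check for judgement prediction.
# `predict_outcome` stands in for whatever model is under evaluation; the case
# template and identity terms are invented placeholders, not data from the review.
from typing import Callable

def counterfactual_flip_rate(
    predict_outcome: Callable[[str], str],  # returns e.g. "guilty" / "not guilty"
    case_templates: list[str],              # each contains the marker "[IDENTITY]"
    identity_terms: list[str],
) -> float:
    """Share of cases whose predicted outcome changes when only the identity term changes."""
    flips = 0
    for template in case_templates:
        predictions = {
            predict_outcome(template.replace("[IDENTITY]", term))
            for term in identity_terms
        }
        if len(predictions) > 1:  # same facts, different outcome -> potential bias
            flips += 1
    return flips / max(len(case_templates), 1)

if __name__ == "__main__":
    # Deliberately biased toy stand-in "model" that keys off a surname.
    def toy_model(text: str) -> str:
        return "guilty" if "Smith" in text else "not guilty"

    templates = ["The defendant, [IDENTITY], was found at the scene with the stolen goods."]
    rate = counterfactual_flip_rate(toy_model, templates, ["Mr. Smith", "Mr. Patel"])
    print(f"flip rate: {rate:.2f}")  # 1.00 for this toy model
```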

These exceptions show what more context-sensitive evaluation could look like. However, they remain rare, and the authors argue that ethical concerns should be integrated far more deeply into the design of tasks and metrics.

In addition to ethics, the review finds that most research does not meaningfully engage legal practitioners or real users. In many studies, lawyers only appear at the final stage to rate model outputs. While helpful, the authors say this is insufficient. Evaluations need to account not just for output quality but also for usability, workflow compatibility, and the realistic expectations of both experts and laypeople seeking legal help.

Some studies conducted workshops and user studies showing that laypeople often treat LLMs like search engines, entering vague prompts that lead to poor outputs. These findings highlight the gap between idealized model performance and real-world behavior.

A roadmap for more reliable legal AI evaluation

The authors provide a series of recommendations to align evaluation methods with real-world needs. These include embedding legal context directly into the design of tasks, ensuring that evaluation metrics reflect professional standards, and adopting mixed-methods approaches combining quantitative metrics with qualitative human judgment.

The review highlights the importance of socio-technical analysis, studying AI systems not just as technical artifacts but as tools that operate within institutions, workflows, norms, and legal cultures. This means collaborating with lawyers, regulators, technologists, and end users early in the research process instead of relying exclusively on technical benchmarks.

The authors also call for more attention to jurisdictional differences. Legal standards vary sharply between countries, and evaluation methods that work in one setting may not apply elsewhere. Models must be tested in ways that reflect the cultural and procedural norms of each legal system.

Lastly, the study calls for metrics that capture semantics, reasoning quality, and fairness rather than surface-level text patterns. Early attempts like the Legal Text Score demonstrate that domain-specific metrics can be created when researchers prioritize contextual accuracy and meaning.
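The review does not prescribe a formula, but one way such a domain-specific metric could be composed is sketched below: a semantic-correctness score (however obtained, for example from expert rating or embedding similarity) discounted by a fairness penalty such as the counterfactual flip rate sketched earlier. This is a hypothetical illustration, not the actual metric discussed in the review.

```python
# Hypothetical composition of a domain-specific metric: semantic correctness
# discounted by a fairness penalty. Both inputs are assumed to come from
# separate evaluators (e.g. expert judgment for correctness, a counterfactual
# flip rate for bias); neither is defined by the review itself.
def legal_eval_score(semantic_correctness: float, bias_flip_rate: float) -> float:
    """Combine correctness (0-1) and bias (0-1, lower is better) into one score."""
    if not (0.0 <= semantic_correctness <= 1.0 and 0.0 <= bias_flip_rate <= 1.0):
        raise ValueError("both inputs must lie in [0, 1]")
    return semantic_correctness * (1.0 - bias_flip_rate)

print(legal_eval_score(semantic_correctness=0.9, bias_flip_rate=0.0))  # 0.9
print(legal_eval_score(semantic_correctness=0.9, bias_flip_rate=0.5))  # 0.45, heavily penalized
```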

First published in: Devdiscourse