AI credit card fraud detection models overstate accuracy
A new international study published in Computers has uncovered deep flaws in how artificial intelligence models are evaluated for credit card fraud detection.
Titled “A Systematic Review of Machine Learning in Credit Card Fraud Detection Under Original Class Imbalance,” the research presents the first systematic review focused exclusively on fraud detection studies that retain real-world data imbalance, exposing methodological shortcuts that may be inflating model accuracy.
Why the study matters: Original data imbalance changes everything
In real financial systems, fraudulent transactions make up a tiny fraction of all credit card operations, often less than 0.2 percent. Yet most AI studies artificially balance datasets before training models, creating conditions that rarely exist in production environments.
The authors argue that this practice distorts results, leading to over-optimistic performance claims that can mislead financial institutions. Their review analyzes 1,425 academic papers published between 2019 and mid-2025, narrowing them to 44 studies that explicitly maintained the original class imbalance when building and testing machine learning models.
The team used the PRISMA 2020 systematic review methodology and a Kitchenham-style software-engineering framework to ensure reproducibility and transparency. They then examined four critical research questions:
- What datasets are being used?
- Which machine learning algorithms perform best without resampling?
- What evaluation metrics are reported?
- How do explainability and transparency factor into model design?
Their conclusions reveal a field still dominated by convenience choices rather than operational realism.
Tree ensembles dominate, but explainability lags
Across the 44 retained papers, the study found that tree-based ensemble methods such as Random Forest, XGBoost, and LightGBM overwhelmingly lead the field when data imbalance is preserved. These models outperform deep neural networks and support vector machines under extreme skew because they can handle non-linear patterns while maintaining interpretability.
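In practice, much of the imbalance handling in these ensembles comes down to a single knob: weighting the minority class by its inverse frequency (exposed, for example, as `scale_pos_weight` in XGBoost or `class_weight` in scikit-learn's Random Forest). A minimal sketch of that computation, using illustrative counts that match the roughly 0.2 percent fraud rate cited above:

```python
# Hedged sketch: computing an inverse-frequency weight for the minority
# (fraud) class, as commonly passed to tree ensembles. The counts are
# illustrative, not taken from the reviewed studies.
n_legit, n_fraud = 99_800, 200   # ~0.2% fraud rate

# Weight each fraud so both classes contribute equally to the training loss.
scale_pos_weight = n_legit / n_fraud
print(scale_pos_weight)  # 499.0
```

With this weight, a misclassified fraud costs the model as much during training as 499 misclassified legitimate transactions, which is what lets tree ensembles learn from the rare class without resampling the data.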
Despite this, only two studies implemented any practical explainable-AI (XAI) framework. Both used SHAP (Shapley Additive Explanations) to identify which transaction features most influenced fraud predictions. The near absence of interpretability tools, the authors warn, threatens trust and adoption in live banking environments.
Another striking result is the over-reliance on one benchmark dataset, the widely used European Credit Card Fraud Dataset, employed in 73 percent of the reviewed papers. While it offers standardized data, its dominance limits generalizability, preventing researchers from capturing regional fraud behavior or evolving attack vectors. Other datasets, such as IEEE-CIS Fraud Detection and synthetic generators like BankSim and PaySim, appeared far less frequently.
The analysis also highlighted that most studies measure success using AUC-ROC, even though this metric performs poorly on rare-event problems. The authors advocate shifting to AUC-PR (Area Under the Precision–Recall Curve) because it better reflects a model’s ability to identify minority-class frauds without over-counting easy true negatives.
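The gap between the two metrics is easy to demonstrate on a toy score distribution. In the sketch below (scores and counts are illustrative, not from the paper), a ranker that places 98 legitimate transactions mostly below 2 frauds earns a near-perfect AUC-ROC, while its average precision, an estimate of AUC-PR, is markedly lower:

```python
def roc_auc(pos, neg):
    # AUC-ROC = probability a random positive outranks a random negative
    # (ties count as 0.5).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(pos, neg):
    # Average precision: mean of the precision at each rank where a
    # positive is recovered (a step-wise estimate of AUC-PR).
    ranked = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg], reverse=True)
    tp, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            tp += 1
            ap += tp / rank
    return ap / len(pos)

fraud_scores = [0.9, 0.6]                      # 2 frauds
legit_scores = [0.7] + [0.5] * 4 + [0.1] * 93  # 98 legitimate transactions

print(round(roc_auc(fraud_scores, legit_scores), 3))            # 0.995
print(round(average_precision(fraud_scores, legit_scores), 3))  # 0.833
```

The 93 easy true negatives at the bottom of the ranking inflate AUC-ROC but contribute nothing to AUC-PR, which is exactly the over-counting the authors object to.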
Performance metrics reveal an accuracy illusion
The review challenges the assumption that high accuracy equals good fraud detection. When datasets are 99.8 percent legitimate, a naïve model can achieve 99 percent accuracy simply by predicting every transaction as non-fraudulent.
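The illusion takes only a few lines to reproduce. The toy sample below mirrors the 0.2 percent fraud rate described earlier; the "model" never flags anything, yet its accuracy looks excellent:

```python
# A minimal illustration of the accuracy illusion on an imbalanced sample.
# 10,000 illustrative transactions at a 0.2% fraud rate.
labels = [1] * 20 + [0] * 9_980      # 1 = fraud, 0 = legitimate
predictions = [0] * len(labels)      # naive model: predict "legitimate" always

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / sum(labels)

print(f"accuracy = {accuracy:.3f}")  # 0.998, despite catching zero fraud
print(f"recall   = {recall:.3f}")    # 0.000
```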
To expose this illusion, the researchers compared how Precision, Recall, F1-score, and AUC-PR vary under original imbalance. They found that models achieving high accuracy often suffered from low Recall, missing up to half of all fraudulent cases. Conversely, models tuned for higher Recall risked more false alarms, raising customer-experience concerns.
The paper concludes that Precision and Recall must be the minimum reporting standard, and confusion matrices should always accompany headline metrics. Only a handful of studies included these details.
Crucially, the team calls for cost-sensitive evaluation, linking misclassification penalties to real financial impact. The absence of cost-aware metrics, they argue, keeps much of the research disconnected from how fraud teams actually weigh false positives versus false negatives.
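A cost-sensitive comparison of that kind can be sketched in a few lines. The penalty figures below are assumptions for illustration only, not numbers from the paper; the point is that once a missed fraud is priced far above a false alarm, the "noisier" high-Recall model can be the cheaper one:

```python
# Hedged sketch: comparing two hypothetical models by expected financial
# cost rather than accuracy. Per-case penalties are illustrative assumptions.
COST_FN = 500.0  # assumed average loss per missed fraud (false negative)
COST_FP = 5.0    # assumed review/friction cost per false alarm (false positive)

def total_cost(false_negatives, false_positives):
    return false_negatives * COST_FN + false_positives * COST_FP

# Model A: high accuracy but misses half the frauds (low Recall).
# Model B: many more false alarms but catches nearly all frauds.
cost_a = total_cost(false_negatives=100, false_positives=50)
cost_b = total_cost(false_negatives=10, false_positives=800)

print(cost_a)  # 50250.0
print(cost_b)  # 9000.0
```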
Where the field goes next: From metrics to meaning
While the literature shows steady progress in applying machine learning to fraud detection, Baisholan and her co-authors identify four key areas demanding immediate reform.
- Data diversity: The dominance of a single European dataset creates a narrow research view. The authors urge the creation of open, multi-institutional datasets that preserve transaction semantics without compromising privacy, enabling more robust global benchmarking.
- Metric reform: They recommend a field-wide shift toward Precision–Recall-based evaluation and away from misleading overall-accuracy figures. Integrating financial-cost metrics could bring academic modeling closer to real-world deployment standards.
- Explainability integration: Few models provide actionable reasoning behind decisions. The authors call for built-in explainability using methods like SHAP or LIME, supported by transparent feature dictionaries that can be safely disclosed to fraud analysts.
- Practical collaboration: The review points out that small and medium-sized enterprises struggle with the computational demands of deep models and the privacy barriers around sharing data. The team proposes federated learning frameworks, allowing banks to collaborate on model training without exposing sensitive customer information.
Their conclusions also highlight the need to evaluate data drift, the gradual change in spending behavior that weakens models over time. By incorporating adaptive learning pipelines, financial institutions could maintain model accuracy across changing patterns of fraud.
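One common way to quantify such drift is the Population Stability Index (PSI), which compares how transactions were distributed across feature buckets at training time versus in live traffic. The bin shares below are illustrative assumptions; a PSI above roughly 0.2 is often treated as a signal that retraining is due:

```python
import math

# Hedged sketch: Population Stability Index (PSI) as a drift monitor.
# Bucket shares are illustrative, not from the paper.
def psi(expected_shares, actual_shares):
    # Sum of (actual - expected) * ln(actual / expected) over buckets.
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected_shares, actual_shares))

# Share of transactions per amount bucket: at training time vs. today.
train_dist = [0.50, 0.30, 0.15, 0.05]
live_dist  = [0.35, 0.30, 0.20, 0.15]

drift = psi(train_dist, live_dist)
print(round(drift, 3))
```

Feeding a monitor like this into an adaptive learning pipeline is one way to operationalize the retraining the authors recommend.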
FIRST PUBLISHED IN: Devdiscourse

