Banks could strengthen credit card fraud screening with an ensemble machine learning model
A machine learning framework could help financial institutions identify suspicious credit card transactions with greater accuracy and resilience, according to a new study published in the journal Frontiers in Artificial Intelligence.
The study, titled "Applying supervised machine learning algorithms and ensemble models to enhance credit card fraud detection," tested supervised ML models, resampling methods, behavioral features and ensemble learning techniques to improve fraud detection across highly imbalanced credit card transaction datasets.
Fraud detection faces rising pressure from digital payments and imbalanced data
The growing use of credit cards has expanded the attack surface for fraud, with the study noting that global payment card fraud losses reached USD 27.85 billion in 2018 and were projected to rise to USD 35.67 billion by 2023. In the United States, reported fraud losses exceeded USD 10 billion in 2023, marking a 14% increase from the previous year.
Credit card fraud is a threat not only to individual cardholders but also to banks, merchants, governments and payment networks. Fraud can involve stolen cards, skimming, phishing and unauthorized use of card details. As mobile banking, e-commerce and digital wallets grow, fraud patterns are becoming faster, more complex and harder to catch through traditional rule-based systems.
Machine learning is increasingly used to address this problem because it can learn patterns from historical transaction data and classify new transactions as legitimate or suspicious. But the study highlights a major obstacle: fraud datasets are highly imbalanced. In normal payment systems, legitimate transactions vastly outnumber fraudulent ones. In the primary datasets used by the researchers, fraudulent transactions made up only 0.5% of records. This imbalance can distort model performance.
A system may appear highly accurate simply by classifying most transactions as legitimate, while still missing the rare fraud cases that matter most. For banks and payment processors, missing fraud can lead to financial loss, while excessive false alarms can block genuine transactions and frustrate customers. The study therefore emphasizes the need to balance recall, which captures fraud cases, with precision, which reduces unnecessary alerts.
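The accuracy trap described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not code from the study: the 1,000-transaction sample is made up for illustration, with 5 fraud cases to mirror the 0.5% fraud share reported for the primary data, and the "model" simply labels everything as legitimate.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 as fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return (accuracy, precision, recall); undefined ratios become 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Illustrative data only: 5 frauds among 1,000 transactions (0.5%).
y_true = [1] * 5 + [0] * 995
always_legitimate = [0] * 1000  # a "model" that never flags anything

accuracy, precision, recall = metrics(y_true, always_legitimate)
print(accuracy, precision, recall)
```

The do-nothing classifier scores 99.5% accuracy while catching no fraud at all, which is why the study reports recall and precision alongside accuracy.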
To address this challenge, the researchers tested several supervised machine learning models: Decision Tree, Logistic Regression, Naïve Bayes, Random Forest, Artificial Neural Network and XGBoost. They also applied three resampling techniques: Random Under-Sampling, Random Over-Sampling and Synthetic Minority Over-Sampling Technique. These methods were used to reduce bias toward legitimate transactions and improve the ability to detect rare fraud events.
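To make the two random resampling methods concrete, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not the researchers' implementation: the function names are hypothetical, fraud is assumed to be labelled 1 and to be the smaller class, and the toy data is invented. SMOTE is not shown; rather than copying minority rows it synthesizes new ones by interpolating between neighbouring minority samples, and an implementation is available as `SMOTE` in the imbalanced-learn library.

```python
import random


def random_under_sample(rows, labels, seed=0):
    """Drop majority (label 0) rows at random until both classes are equal in size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept = rng.sample(majority, len(minority))  # sample without replacement
    idx = minority + kept
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


def random_over_sample(rows, labels, seed=0):
    """Duplicate minority (label 1) rows at random until both classes are equal in size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = majority + minority + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Toy data (hypothetical): 200 single-feature rows, the first 2 are fraud.
rows = [[i] for i in range(200)]
labels = [1] * 2 + [0] * 198

under_rows, under_labels = random_under_sample(rows, labels)
over_rows, over_labels = random_over_sample(rows, labels)
print(len(under_labels), len(over_labels))
```

Under-sampling discards most of the legitimate transactions, while over-sampling keeps all of them and repeats the fraud rows, so the two methods trade information loss against a larger and more repetitive training set.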
The study used six datasets, combining synthetic and real-world credit card and payment fraud data. The primary training dataset contained 1.3 million synthetic transactions, while five unseen datasets were used to test whether the best model could generalize beyond the data on which it was trained. This multi-dataset validation was central to the research because many fraud detection studies rely on a single dataset and may not show how models perform under different data conditions.
Behavioral features and ensemble models improve detection performance
The researchers built their framework around the Cross Industry Standard Process for Data Mining, a structured lifecycle for machine learning projects. The process included business understanding, data understanding, data preparation, modeling, optimization, evaluation and testing on unseen data.
Data preparation played a major role. The study applied feature transformation, encoding, scaling, data splitting and feature selection. The researchers used both filter and wrapper methods to identify the most relevant variables, including correlation-based selection, variance thresholding, ANOVA, Gini index, recursive feature elimination and forward feature selection. This hybrid approach aimed to remove irrelevant or redundant features while keeping variables that improved prediction.
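Of the filter methods listed, variance thresholding is the simplest to illustrate: a column whose values barely change cannot help separate fraud from legitimate activity. The sketch below is a minimal illustration with hypothetical column names and values, not the study's pipeline.

```python
from statistics import pvariance


def variance_threshold(columns, min_variance=0.0):
    """Keep the names of columns whose population variance exceeds the cutoff.

    columns: dict mapping a feature name to its list of numeric values.
    """
    return [
        name for name, values in columns.items()
        if pvariance(values) > min_variance
    ]


# Hypothetical feature columns; "currency_code" is constant and carries no signal.
columns = {
    "amount": [12.0, 250.0, 40.0, 9.5],
    "currency_code": [1, 1, 1, 1],
    "hour": [10, 3, 14, 22],
}

kept = variance_threshold(columns)
print(kept)  # ['amount', 'hour']
```

Wrapper methods such as recursive feature elimination work differently: they repeatedly train a model and drop the least useful features, which costs more compute but accounts for how features interact.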
The researchers created behavioral features designed to capture unusual customer activity. These included transaction frequency, transaction timing and anomaly scores. The goal was to move beyond basic transaction attributes and detect deviations from a cardholder's normal behavior. For example, transactions occurring outside a user's usual time window or sudden bursts in transaction activity could provide signs of possible fraud.
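The article does not give the exact feature definitions, so the following is only a sketch of how such behavioral signals might be derived. The field names (`card_id`, `timestamp`), the one-hour window and the two output features are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def add_behavioral_features(transactions, window=timedelta(hours=1)):
    """Add two per-card features to each transaction, processed in time order.

    tx_count_in_window: earlier transactions on the same card within `window`.
    unusual_hour: True if the card has history but none at this hour of day.
    """
    history = defaultdict(list)  # card_id -> timestamps already processed
    enriched = []
    for tx in sorted(transactions, key=lambda t: t["timestamp"]):
        past = history[tx["card_id"]]
        recent = sum(1 for ts in past if tx["timestamp"] - ts <= window)
        seen_hours = {ts.hour for ts in past}
        unusual = bool(past) and tx["timestamp"].hour not in seen_hours
        enriched.append({**tx, "tx_count_in_window": recent, "unusual_hour": unusual})
        past.append(tx["timestamp"])
    return enriched


# Hypothetical card: two daytime purchases, then a burst at 3 a.m.
transactions = [
    {"card_id": "A", "timestamp": datetime(2024, 1, 1, 10, 0)},
    {"card_id": "A", "timestamp": datetime(2024, 1, 2, 10, 30)},
    {"card_id": "A", "timestamp": datetime(2024, 1, 3, 3, 0)},
    {"card_id": "A", "timestamp": datetime(2024, 1, 3, 3, 5)},
    {"card_id": "A", "timestamp": datetime(2024, 1, 3, 3, 10)},
]

features = add_behavioral_features(transactions)
for row in features:
    print(row["timestamp"], row["tx_count_in_window"], row["unusual_hour"])
```

In this toy example the first 3 a.m. purchase is flagged as an unusual hour, and the two that follow show a rising count within the window, which is the kind of burst the paragraph above describes.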
The results showed that standalone models varied sharply in performance. Logistic Regression and Naïve Bayes were weaker overall, while Decision Tree and Artificial Neural Network produced more acceptable results. Random Forest and XGBoost were among the strongest standalone models, particularly after resampling and threshold tuning.
Threshold optimization was used to improve the trade-off between precision and recall. The default classification threshold of 0.5 may not be the best choice for fraud detection, where identifying more fraud cases can matter more than maximizing overall accuracy. The researchers tested alternative thresholds for Random Forest and XGBoost, finding that Random Forest with Random Over-Sampling at a 0.2 threshold and XGBoost with SMOTE at a 0.7 threshold each produced a better balance between precision and recall.
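Threshold tuning amounts to re-scoring the same predicted probabilities at different cut-offs. The sketch below is illustrative only: the scores and labels are invented, and picking the threshold with the highest F1-score is an assumption, since the article does not state which criterion the researchers used.

```python
def precision_recall_at(y_true, scores, threshold):
    """Flag a transaction as fraud when its score is at or above the threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def best_threshold_by_f1(y_true, scores, candidates):
    """Return (threshold, f1, precision, recall) for the best-F1 candidate."""
    best = None
    for t in candidates:
        p, r = precision_recall_at(y_true, scores, t)
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        if best is None or f1 > best[1]:
            best = (t, f1, p, r)
    return best


# Invented validation data: six legitimate and four fraudulent transactions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.60, 0.25, 0.40, 0.70, 0.90]

best = best_threshold_by_f1(y_true, scores, [0.2, 0.3, 0.5, 0.7])
print(best)
```

On this toy data the default 0.5 cut-off misses half of the fraud cases, while 0.2 catches all of them at the cost of more false alarms. In practice the threshold would be chosen on held-out validation data, not on the test set used for final reporting.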
The study tested bagging, boosting and stacking models. Ensemble learning combines multiple models to improve predictive stability and reduce reliance on any single algorithm. In fraud detection, this is valuable because different models may capture different transaction patterns.
Among the ensemble approaches, the bagging model delivered the best overall performance. The selected Bagging 1 model combined Decision Tree, Random Forest and Artificial Neural Network learners under different resampling conditions. It achieved 0.99 accuracy, 0.90 recall and 0.77 precision in the study's final summary, showing that it could identify most fraudulent transactions while keeping false positives at a manageable level.
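The core idea of bagging is to train each learner on a bootstrap sample (drawn with replacement) and combine the learners by vote. The sketch below shows that mechanism only. It uses one-split decision stumps as stand-in base learners, which is an assumption for brevity; the study's Bagging 1 model combined Decision Tree, Random Forest and Artificial Neural Network learners. The transaction amounts are invented.

```python
import random


def fit_stump(X, y):
    """Find the single feature/threshold/polarity split with the fewest training errors."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for polarity in (1, 0):  # label predicted when value >= thr
                errors = sum(
                    1 for row, label in zip(X, y)
                    if (polarity if row[f] >= thr else 1 - polarity) != label
                )
                if best is None or errors < best[0]:
                    best = (errors, f, thr, polarity)
    return best[1:]


def stump_predict(stump, row):
    f, thr, polarity = stump
    return polarity if row[f] >= thr else 1 - polarity


def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train each stump on its own bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, row):
    """Majority vote across all stumps; 1 means fraud."""
    votes = sum(stump_predict(m, row) for m in models)
    return 1 if votes * 2 > len(models) else 0


# Invented data: small legitimate amounts, large fraudulent ones.
X = [[10], [12], [15], [20], [22], [25], [30], [35], [400], [450], [500], [520]]
y = [0] * 8 + [1] * 4

models = fit_bagging(X, y)
print(bagging_predict(models, [600]), bagging_predict(models, [18]))
```

Because every learner sees a slightly different sample, individual mistakes tend to be outvoted, and because the learners are trained independently they can be fitted in parallel.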
The researchers found that bagging was more stable than boosting and stacking across key performance measures. Boosting with Decision Tree and AdaBoost performed strongly, but other boosting configurations were weak. Stacking models also performed well, with one stacking model showing balanced precision and recall and another emphasizing recall. Still, the bagging model was selected because it showed the most consistent performance and better generalization potential.
The addition of behavioral features improved results further. When novel features were added to the training dataset, the Bagging 1 model's F1-score rose from 0.79 to 0.83, precision rose from 0.73 to 0.77, and recall rose from 0.86 to 0.89. On the first unseen dataset, the same feature additions improved F1-score and precision while keeping recall stable at 0.88. These gains suggest that behavioral patterns can strengthen fraud detection beyond standard transaction variables.
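The reported F1-scores follow directly from the precision and recall figures, since F1 is their harmonic mean. A quick check, using only the numbers quoted above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


before = f1_score(0.73, 0.86)  # without the behavioral features
after = f1_score(0.77, 0.89)   # with the behavioral features
print(round(before, 2), round(after, 2))  # 0.79 0.83
```

Both values match the F1-scores given for the Bagging 1 model before and after the new features were added.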
Real-world deployment needs monitoring, privacy safeguards and lower false positives
The study claims that an integrated fraud detection framework can outperform isolated model improvements. Instead of focusing only on algorithm selection, the researchers combined feature engineering, class imbalance handling, threshold tuning, ensemble diversity and unseen-data validation. This broader pipeline is positioned as a scalable approach for financial fraud detection.
The selected bagging model also generalized well: on several unseen datasets it reached high accuracy, recall and precision, with some datasets recording scores above 0.95 across major metrics. This matters because real financial systems encounter transaction patterns that differ from training data. A model that performs well only on one dataset may fail when deployed in a live banking environment.
The researchers also acknowledge practical constraints. Ensemble models can increase computational complexity, and real-time fraud detection requires low-latency systems that can process large transaction volumes quickly. Bagging models are relatively scalable because they can be parallelized, but deployment may still require model pruning, distributed computing or incremental learning.
False positives remain a major operational concern. A model with strong recall may catch more fraud, but if it flags too many legitimate transactions, banks could face customer dissatisfaction, transaction delays and additional review costs. The study therefore recommends further work on thresholds and decision boundaries to balance fraud detection with operational costs.
The researchers also call for continuous model monitoring because fraud tactics evolve. Static models can lose effectiveness as criminals adapt to detection systems. Live financial systems must be updated to manage concept drift, where the statistical patterns of transactions change over time. This is especially important in digital banking, where fraud methods can shift quickly across channels and regions.
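The article does not say which drift checks the researchers have in mind. One widely used option, shown here as a swapped-in example and not as the study's method, is the population stability index (PSI), which compares how a feature or model score is distributed in a reference period against a recent one. The bin count, the smoothing constant and the sample values below are assumptions.

```python
import math


def psi(reference, recent, bins=10, eps=1e-6):
    """Population stability index between two samples of one numeric variable.

    Bins are equal-width over the reference range; values outside that range
    fall into the first or last bin. Empty bins are floored at `eps` so the
    logarithm is defined.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    rec_p = proportions(recent)
    return sum((r - e) * math.log(r / e) for e, r in zip(ref_p, rec_p))


# Invented model scores: a reference month and a recent month shifted upward.
reference_scores = [i / 100 for i in range(100)]
shifted_scores = [min(v + 0.3, 0.99) for v in reference_scores]

print(psi(reference_scores, reference_scores))  # 0.0, identical distributions
print(psi(reference_scores, shifted_scores) > 0.25)  # True
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, though the cut-off that should trigger retraining is a judgment call for each institution.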
The study recommends more advanced feature engineering, stronger use of user behavior analytics, deeper investigation of bagging techniques, and scalable pipelines capable of processing large volumes of transactions in real time. It also points to the need for collaborative research and responsible data sharing among researchers, banks, industry stakeholders and regulators, while protecting privacy and security.
Future research could also explore model interpretability. Financial institutions must often explain why transactions are blocked or flagged, and regulators may require transparent decision-making in automated systems. Interpretability becomes especially important when machine learning models affect customer access to payments.
The researchers also suggest expanding fraud detection into network-based analysis to identify coordinated fraud groups. Social network analysis, already studied in anti-money laundering contexts, could help detect connected suspicious actors rather than treating each transaction as an isolated event.
First published in: Devdiscourse