Stacked AI model improves credit default forecasting


CO-EDP, VisionRI | Updated: 12-03-2026 19:52 IST | Created: 12-03-2026 19:52 IST

A new study suggests that lenders may get their strongest overall read on credit default risk by combining several machine learning models rather than relying on a single algorithm. The researchers found that a stacked ensemble model delivered the best overall balance of discrimination, recall and precision in tests built around a benchmark credit dataset, while also offering a clearer explanation of why borrowers were flagged as high or low risk.

Published in Mathematical and Computational Applications, the study Predictive Modelling of Credit Default Risk Using Machine Learning and Ensemble Techniques examines one of the main tensions in modern lending: how to improve prediction without sacrificing transparency. That question has become more urgent as banks and other lenders face pressure to automate more decisions while still meeting regulatory demands for explainability and wider expectations of fairness and accountability.

Traditional methods such as logistic regression still matter because they are easier to explain and easier to defend before regulators, but they often struggle to capture the messy, non-linear relationships that shape borrower behaviour. More advanced machine learning systems can spot those patterns, but they are often treated as black boxes. The study's authors, Mathibela and Maposa, set out to test whether a hybrid framework could narrow that gap by combining several model types with a formal interpretability layer.

Using the German Credit Dataset, a long-running benchmark in credit risk research, the authors built a full comparison pipeline around 1,000 credit applications and 20 predictor variables. The data included demographic, financial and credit history inputs and carried a 30 percent default rate. The researchers used a stratified 70-30 train-test split, cleaned missing values, encoded categorical features, scaled numerical variables and applied synthetic oversampling to address class imbalance. They then trained logistic regression, Random Forest, XGBoost and multilayer perceptron models, and merged them into a stacked ensemble with logistic regression as the meta-learner. To strengthen the analysis, they paired standard performance scores with bootstrapped confidence intervals and formal significance testing, including McNemar’s test and Friedman’s test with Nemenyi post-hoc analysis.
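The stacking setup described above can be sketched in a few lines of scikit-learn. This is an illustrative approximation, not the authors' published code: XGBoost is stood in for by scikit-learn's GradientBoostingClassifier, the synthetic oversampling step is omitted for brevity, and a synthetic dataset mimics the 1,000-row, roughly 30-percent-default shape of the German Credit Dataset.

```python
# Hedged sketch of the paper's comparison pipeline (not the authors' code).
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the German Credit Dataset: 1,000 rows, 20 features, ~30% defaults.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

# Stratified 70-30 train-test split, as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

base_learners = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
    ("mlp", make_pipeline(StandardScaler(),
                          MLPClassifier(max_iter=500, random_state=0))),
]

# Stacked ensemble with logistic regression as the meta-learner.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(f"stacked ensemble AUC: {auc:.3f}")
```

On real data the base learners' out-of-fold predictions (which StackingClassifier generates internally) become the meta-learner's inputs, which is what lets the ensemble weigh each model's strengths.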

Stacked ensemble sets the pace, but Random Forest stays close

Under baseline conditions, the stacked ensemble turned in the strongest overall result. It posted an AUC of 0.761, the highest among all models tested, along with precision of 0.783, recall of 0.806 and an F1 score of 0.794. Those numbers marked it as the most balanced performer in the study, especially on the task that matters most to many lenders: identifying likely defaulters without letting error rates spiral out of control. By contrast, Random Forest delivered the highest raw accuracy at 0.736 and an AUC of 0.749, while logistic regression and XGBoost each landed at 0.733 on AUC and the multilayer perceptron finished at 0.720.

That ranking matters because it cuts against one of the common assumptions in machine learning for finance. XGBoost is often treated as the stronger tabular-data model, yet in this study Random Forest beat it on both AUC and accuracy. The result suggests that credit default risk modelling remains heavily dependent on the shape and size of the data, the level of noise in the features and the way imbalance is handled. For lenders and model validators, that is a reminder that algorithm choice still has to be empirical rather than driven by reputation alone.

The paper also adds a caution that is easy to lose in performance headlines. The authors ranked the five models across multiple metrics and found statistically meaningful differences overall, with the stacked ensemble taking the best average rank at 1.4, followed by Random Forest at 2.8. But the ensemble’s edge over Random Forest did not reach conventional statistical significance, even though it did significantly outperform the multilayer perceptron. In practical terms, that means the most complex system was the best overall performer, but not by a margin large enough to settle the matter for every lender. A bank that values simplicity, easier governance and faster deployment could still justify choosing Random Forest over the stacked ensemble.
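The rank comparison behind that caution rests on the Friedman test, which checks whether models' rankings across repeated evaluations differ more than chance would allow. A minimal SciPy sketch follows; the per-resample scores here are made-up illustrative numbers, not the paper's results, and the Nemenyi post-hoc step (available via packages such as scikit-posthocs) is omitted.

```python
# Illustrative Friedman test across five models; scores are invented examples.
from scipy.stats import friedmanchisquare

# Each list holds one model's metric across repeated evaluations
# (e.g. bootstrap resamples); positions line up across lists.
lr = [0.72, 0.74, 0.71, 0.73, 0.72]
rf = [0.75, 0.76, 0.74, 0.75, 0.74]
xgb = [0.73, 0.74, 0.72, 0.73, 0.73]
mlp = [0.71, 0.72, 0.70, 0.72, 0.71]
stacked = [0.76, 0.77, 0.75, 0.76, 0.76]

# The test ranks the models within each evaluation, then asks whether
# the average ranks differ significantly overall.
stat, p = friedmanchisquare(lr, rf, xgb, mlp, stacked)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```

A significant Friedman result says only that *some* difference exists among the models, which is why the post-hoc pairwise test is needed before claiming, for example, that the ensemble beats Random Forest.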

That trade-off became clearer when the researchers moved beyond standard scoring and tried to mirror the real economics of lending. In credit decisions, the cost of missing a true defaulter is usually higher than the cost of wrongly rejecting a safe applicant. Once the models were adjusted to reflect that imbalance, Random Forest became the most sensitive detector of defaults, lifting recall to 0.823. The stacked ensemble, however, held onto the top AUC at 0.761 and kept a strong balance between precision and recall. Logistic regression shifted into a more conservative role in this setting, producing the highest precision at 0.825, a result that could appeal to institutions that want high confidence before labelling a borrower high risk.

Threshold optimisation pushed the story further. Instead of keeping the default cut-off at 0.50, the authors tuned the decision threshold to better match risk priorities. In that stricter setting, Random Forest produced the highest precision at 0.877, with the stacked ensemble close behind at 0.872. Yet the ensemble still preserved the strongest overall discrimination on AUC. The move also underscored the paper’s operational point: the best credit default risk model depends on what a lender is trying to protect. If the priority is catching as many likely defaulters as possible, one model may stand out. If the priority is avoiding false alarms and limiting unnecessary loan rejections, another may look stronger.
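Threshold tuning of this kind is mechanically simple: sweep candidate cut-offs over the model's predicted probabilities and keep the one that maximises whatever criterion the lender cares about. The sketch below uses F1 on synthetic stand-in data; a precision- or recall-focused institution would swap in its own objective.

```python
# Minimal decision-threshold sweep (illustrative data, not the study's).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Evaluate F1 at each candidate cut-off instead of the default 0.50.
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_te, proba >= t) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print(f"best threshold by F1: {best:.2f} (F1 = {max(scores):.3f})")
```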

The error analysis made those trade-offs concrete. At baseline, logistic regression generated the fewest false positives, with 24, showing its conservative tilt. Random Forest captured the highest number of true positives, at 144, which made it the strongest single detector of actual defaults in the initial round. After class-imbalance handling and threshold tuning, the stacked ensemble cut false positives sharply, from 39 to 16, and Random Forest lowered them from 35 to 15. But both gains came with a cost: false negatives rose, meaning more actual defaulters slipped through. That shift reflects the core business tension in credit risk systems. Lowering one type of mistake often raises the other, and institutions need to decide which risk is more expensive.

Interpretability moves from side issue to central test

The researchers built interpretability into the evaluation itself through SHAP, a method that estimates how much each feature contributes to a prediction. That matters in credit default risk because lenders increasingly need to explain not only which model performed best, but why it made a given decision.

On the global level, the SHAP analysis pointed to current account status as the dominant predictor by a wide margin, with a mean absolute SHAP value of 0.153. Loan duration followed at 0.064 and savings account status at 0.063. Credit history and credit amount showed more moderate influence at 0.046 and 0.038 respectively. The pattern suggests that immediate liquidity and near-term financial condition carried more weight in the model than some lenders might expect from loan size alone. That is an important result because it shifts the focus from simple exposure measures to the borrower’s current financial resilience.
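Producing SHAP values requires the dedicated `shap` library, but the underlying idea of a global feature ranking can be illustrated with scikit-learn's permutation importance, which measures how much shuffling each feature degrades model performance. To be clear, this is a stand-in, not SHAP: the numbers it produces are not additive per-prediction attributions, only an analogous global ordering, computed here on synthetic data.

```python
# Permutation importance as a rough stand-in for a global SHAP ranking.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the mean drop in test accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

# Rank features by mean importance, most influential first.
ranking = result.importances_mean.argsort()[::-1]
for i in ranking[:5]:
    print(f"feature {i}: mean importance {result.importances_mean[i]:.4f}")
```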

The feature rankings also add nuance to older credit scoring assumptions. Credit amount has often been treated as a core warning sign, but in this dataset it was not among the dominant drivers of risk. Instead, the model leaned more heavily on signals tied to accessible cash, repayment duration and prior credit behaviour. For lenders, that may support more refined underwriting strategies that focus on a borrower’s ability to absorb short-term stress rather than simply the nominal size of the loan.

At the local level, the SHAP results showed how those same features combined in individual cases. In a high-risk borrower example, negative current account status, limited savings, a high instalment burden, a larger credit amount and weak credit history all pushed the prediction toward default. That kind of case-level explanation is important for regulated lending because it moves model output closer to an auditable decision trail. Instead of stopping at a numerical score, the framework can show which borrower characteristics drove the outcome.

This part of the study also intersects with a larger policy debate around fairness and accountability in machine learning. The authors argue that lenders need systems that are not only accurate, but also transparent enough to justify decisions to regulators and applicants. Their framework is aimed at that balance. Even so, the paper is careful not to overstate the result. SHAP improves interpretability, but it is not the same as a full fairness audit, and the authors did not test demographic parity, equalised odds or other formal fairness measures. That leaves an important part of responsible AI unresolved, even as the paper advances the explainability side of the problem.

Limits of the data

The German Credit Dataset is a standard benchmark, but it is also historical and relatively small, with only 1,000 observations. That makes it useful for controlled comparison, but not enough on its own to prove that the same rankings will hold in live, contemporary lending portfolios. The authors also relied on a single stratified train-test split rather than repeated cross-validation or time-based validation, which means the performance estimates may carry more variance than they would under a more exhaustive testing design.

The authors acknowledge other gaps as well. Their cost-sensitive approach uses class weighting rather than institution-specific loss matrices, so it captures the direction of business trade-offs without measuring the full economic impact of lending errors. The confidence intervals came from bootstrap resampling of a single test set, which improves robustness but does not replace repeated experiments across new samples. The study may also have lacked enough statistical power to detect very small performance differences between the top two models, especially between the stacked ensemble and Random Forest.
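The bootstrap procedure the authors used can be sketched directly: resample the test set's (label, score) pairs with replacement many times, recompute the metric on each resample, and read the interval off the percentiles. The labels and scores below are synthetic placeholders.

```python
# Percentile bootstrap CI for AUC on a single held-out test set (illustrative).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 300
y_true = rng.integers(0, 2, size=n)                # stand-in test labels
y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.2, size=n), 0, 1)

aucs = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)               # resample with replacement
    if len(np.unique(y_true[idx])) < 2:            # AUC needs both classes
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC 95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```

As the study notes, this quantifies sampling noise within one test set; it cannot substitute for fresh samples or repeated experiments.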

Even with those limits, the paper makes a practical contribution that reaches beyond one benchmark dataset. It offers a reproducible Python-based workflow for comparing credit default risk models under several conditions at once: standard performance, class imbalance, threshold tuning, interpretability and statistical significance. That full pipeline is one of the study’s strongest selling points because real-world lenders rarely choose models on a single metric. They choose them based on a mix of predictive strength, governance demands, operational cost and the ability to defend outcomes internally and externally.

The authors position the framework as a methodological benchmark rather than a one-size-fits-all production system. That is an important distinction. The study does not claim that one stacked ensemble can be dropped into every lending context and solve the problem. Instead, it argues that institutions can adapt the method to their own data, cost structures and compliance standards. In that sense, the paper’s most important message may be that credit scoring model selection is strategic rather than purely technical. A lender focused on maximum sensitivity to default risk may favour one setting. A lender focused on reducing false alarms and keeping explanations simple may favour another.

Future research will need to test whether the same balance between performance and interpretability holds under more realistic conditions. The authors point to several next steps:

  • embedding fairness constraints directly into ensemble training
  • modelling borrower risk over time rather than from a single snapshot
  • using institution-specific cost inputs, and
  • validating the framework across larger datasets from different regions and economic environments.

They also call for stronger validation designs, including nested cross-validation and walk-forward testing, to better match real deployment conditions.

FIRST PUBLISHED IN: Devdiscourse