When Ratings Mislead: How Machine Learning Redefines Aid Project Effectiveness
A study by the World Bank Group and MIT GOV/LAB finds that donor project ratings rarely reflect real development impact, while machine learning reveals that projects tailored to local context are far more effective. The research calls for a shift from superficial ratings to deeper, context-driven evaluation methods.
A groundbreaking study by researchers from the World Bank Group and the MIT Governance Lab is reshaping how the global development community thinks about foreign aid. The study, titled "Minding the Gap: Aid Effectiveness, Project Ratings and Contextualization," is authored by Diana Goldemberg, Luke Jordan, and Thomas Kenyon. Using a mix of econometric modeling and state-of-the-art machine learning techniques, the research exposes critical flaws in the way development projects are evaluated and offers a bold new approach to measuring what truly makes aid effective. Drawing on thousands of project documents and aid data from 183 countries over three decades, the study reveals that donor-assigned project ratings, long used to gauge success, often fail to reflect actual improvements in health, education, sanitation, and other key development indicators.
Ratings Don’t Reflect Reality
One of the most startling findings is that project outcome ratings assigned by major donor agencies such as the World Bank, the Asian Development Bank, and the UK's former DFID bear little relation to actual improvements in people's lives. In most sectors examined, including education, health, water and sanitation, and energy, the ratings were statistically insignificant as predictors of sectoral outcomes. The only exception was fiscal policy, where ratings showed a modest correlation with outcomes such as improved tax collection. For the rest, the ratings, often based on whether a project met its predefined objective, offered no meaningful signal of whether it genuinely contributed to national development goals. This undermines the value of ratings as a primary accountability tool and raises concerns that they may incentivize risk-averse project design over truly transformative work.
Machine Learning Brings Fresh Insight
To move beyond the limitations of traditional evaluation methods, the researchers turned to machine learning. They employed large language models to create text embeddings, mathematical representations of the content of project documents like development objectives, results frameworks, and completion reports. These embeddings allowed them to quantify similarities and differences between projects in ways no manual coding exercise could. What emerged was a powerful new metric called “contextualization,” which measures how much a project deviates from sectoral norms and aligns with country-specific realities. Projects with higher contextualization, those that appear more tailored to the local political, institutional, and social context, showed stronger associations with improved development outcomes. The further a project’s design diverged from a generic sector blueprint, the more likely it was to succeed.
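The article does not reproduce the authors' pipeline, but the core idea behind an embedding-based contextualization score can be sketched in a few lines of Python. In the sketch below, the embedding model, the sample texts, and the centroid-distance definition are illustrative assumptions, not the study's published method.

```python
# Minimal sketch of a "contextualization" score: embed project documents,
# then measure how far each project sits from its sector's average document.
# Model choice and the centroid-distance definition are assumptions here.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any text-embedding model works

# Hypothetical project design documents for one sector.
sector_docs = [
    "Expand rural primary school access with standard classroom construction.",
    "Improve learning outcomes via teacher training aligned with the curriculum.",
]
project_doc = "Community-managed schools adapted to pastoralist migration seasons."

# Embed all texts into a shared vector space (unit-normalized vectors).
sector_vecs = model.encode(sector_docs, normalize_embeddings=True)
project_vec = model.encode([project_doc], normalize_embeddings=True)[0]

# The generic sector "blueprint" is the centroid of the sector embeddings.
centroid = sector_vecs.mean(axis=0)
centroid /= np.linalg.norm(centroid)

# Contextualization proxy: cosine distance from the sector centroid.
# Higher values mean the design diverges more from the sector norm.
contextualization = 1.0 - float(project_vec @ centroid)
print(f"Contextualization score: {contextualization:.3f}")
```

On this reading, a project whose documents sit far from its sector's centroid is, by construction, less like the generic blueprint and more tailored to its setting.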
Predictive Power Far Beyond Ratings
The authors then tested the predictive power of various features using ensemble modeling techniques such as gradient-boosted trees and random forest classifiers. These models significantly outperformed traditional regression methods and simpler linear baselines. In classification tasks predicting whether a project would lead to improved sector outcomes five years after completion, the best models reached an area under the ROC curve (AUC) of 0.7. In regression tasks, they explained up to 86 percent of the variance in residual outcomes, a dramatic improvement over previous studies, which typically explained no more than about 30 percent. Text embeddings, particularly those reflecting contextualization, emerged as the most important predictors, far outweighing conventional factors like project size, manager experience, or even institutional quality. Notably, the inclusion of project ratings in these models added little to no explanatory value.
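For readers unfamiliar with this kind of evaluation, the following sketch shows the general shape of the exercise: train a gradient-boosted classifier to predict improved outcomes, then score it with AUC. The synthetic data and feature layout are assumptions for illustration, not the study's data.

```python
# Sketch of the evaluation described above: predict whether a project is
# followed by improved sector outcomes, and score the model with AUC.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Stand-ins for the paper's features: document embeddings (random here)
# alongside conventional covariates such as project size.
X = rng.normal(size=(n, 32))
# Synthetic label loosely driven by the first few "embedding" dimensions,
# mimicking the finding that text features carry most of the signal.
y = (X[:, :4].sum(axis=1) + rng.normal(scale=2.0, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# AUC: the probability the model ranks a random positive case above a
# random negative one; 0.5 is chance, 1.0 is perfect ranking.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"Test AUC: {auc:.2f}")
```

An AUC of 0.7, as the study reports, means the model ranks a randomly chosen successful project above an unsuccessful one about 70 percent of the time.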
Reviews Matter More Than Scores
While ratings fall short, one overlooked source of insight proved unexpectedly rich: the implementation completion reports (ICRs) that teams prepare at the end of a project. When analyzed using the same text embedding techniques, these reviews provided valuable signals about project effectiveness. Though ICRs require time and expertise to digest and are rarely used in quantitative analysis, the machine learning models showed that they can help predict whether a project truly contributed to long-term sectoral improvements. Unlike ratings, which reduce a project's story to a six-point scale, ICRs offer detailed narratives about what worked, what didn't, and why. In several model specifications, embeddings derived from these reports even outperformed all other variables in predicting success.
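How one might check that ICR text outweighs conventional variables can also be sketched: feed both feature groups to a model and compare their aggregate importance. Everything below, including the data and the feature split, is synthetic and illustrative rather than the authors' specification.

```python
# Sketch: use ICR-text embeddings as features alongside conventional
# covariates, then compare each group's aggregate importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n, n_embed, n_conv = 800, 16, 4
icr_embeddings = rng.normal(size=(n, n_embed))  # stand-in for ICR embeddings
conventional = rng.normal(size=(n, n_conv))     # e.g. project size, staffing
X = np.hstack([icr_embeddings, conventional])
# Synthetic outcome dominated by the text features, echoing the finding.
y = (icr_embeddings[:, 0] + 0.2 * conventional[:, 0]
     + rng.normal(scale=1.0, size=n) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X, y)

# Sum per-feature importances within each group to see which carries
# more of the predictive signal.
imp = clf.feature_importances_
print(f"ICR embedding share: {imp[:n_embed].sum():.2f}")
print(f"Conventional share:  {imp[n_embed:].sum():.2f}")
```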
Rethinking What Success Looks Like
The study has profound implications for the design and evaluation of development aid. It argues that success should be defined not merely by whether internal objectives were met, but by a project's actual contribution to sector-wide progress. Contextualization, the degree to which a project's design is responsive to the realities of the country in which it operates, emerges as a vital, measurable ingredient in making aid effective. Surprisingly, the study finds that contextualization does not correlate strongly with common markers of quality such as project manager education, team composition, or whether staff were based in headquarters or in-country. This suggests that genuine adaptation to context is not just about technical skill or location; it may require a more fundamental shift in mindset, incentives, and institutional culture.
Ultimately, the researchers offer not just a critique but a pathway to reform. Their approach is scalable, replicable, and grounded in the language of project documents that donors already produce. By embracing machine learning and focusing on context, development institutions can move beyond superficial ratings to truly understand what drives impact. As the global development sector grapples with complex challenges and increasing demands for accountability, this study offers a timely and transformative contribution to the way we evaluate what works and what doesn’t in the fight against poverty.
FIRST PUBLISHED IN: Devdiscourse