Can AI optimize building retrofits? Research shows promise in CO₂ reduction but gaps in economic reasoning
Researchers from Michigan State University have conducted one of the first systematic evaluations of large language models (LLMs) in the domain of building energy retrofits, where decisions on upgrades such as insulation, heat pumps, and electrification can directly impact energy savings and carbon reduction.
The study, titled “Can AI Make Energy Retrofit Decisions? An Evaluation of Large Language Models,” published on arXiv, examines whether LLMs can reliably guide retrofit decision-making across diverse U.S. housing stock. It addresses the limitations of conventional methods, which are often too technical, data-heavy, or opaque for practical adoption, particularly at large scale.
How accurate are AI models in selecting retrofit measures?
The researchers tested seven widely used LLMs (ChatGPT o1, ChatGPT o3, DeepSeek R1, Grok 3, Gemini 2.0, Llama 3.2, and Claude 3.7) on a dataset of 400 homes drawn from 49 states. Each home profile included details such as construction vintage, floor area, insulation levels, heating and cooling systems, and occupant patterns. The models were asked to recommend retrofit measures under two separate objectives: maximizing carbon dioxide reduction (the technical context) and minimizing payback period (the sociotechnical context).
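The exact prompts used in the study are not reproduced here, but the setup is easy to picture. The sketch below shows, in Python, one hypothetical way a home profile could be assembled into a retrofit query under each of the two objectives; the field names, candidate measures, and wording are illustrative assumptions rather than the authors' actual protocol.

```python
# Hypothetical sketch of how a home profile might be framed as a retrofit
# prompt under each objective. Field names, candidate measures, and wording
# are illustrative assumptions, not the study's actual prompts.

home = {
    "state": "MI",
    "vintage": 1978,
    "floor_area_sqft": 1850,
    "wall_insulation": "R-11",
    "heating_system": "natural gas furnace, AFUE 0.80",
    "cooling_system": "central AC, SEER 10",
    "occupancy": "2 adults, home on evenings and weekends",
}

MEASURES = [
    "attic insulation upgrade",
    "wall insulation upgrade",
    "air sealing",
    "heat pump replacing furnace and AC",
    "heat pump water heater",
    "window replacement",
]

def build_prompt(profile, objective):
    """Assemble a retrofit-recommendation prompt for a single home."""
    lines = [f"{key.replace('_', ' ')}: {value}" for key, value in profile.items()]
    return (
        "You are advising on a residential energy retrofit.\n"
        "Home profile:\n" + "\n".join(lines) + "\n"
        f"Candidate measures: {', '.join(MEASURES)}\n"
        f"Objective: {objective}\n"
        "Rank the five most suitable measures and briefly justify each."
    )

print(build_prompt(home, "maximize annual CO2 reduction"))
print(build_prompt(home, "minimize simple payback period"))
```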
The analysis found that LLMs delivered strong results on the technical optimization task. Top-1 accuracy, measured by whether a model's single best recommendation matched the benchmark, reached 54.5 percent, and Top-5 accuracy, where any of the model's five leading suggestions could match, climbed as high as 92.8 percent, even without fine-tuning. This reflects the models' ability to track physics-based benchmarks in scenarios where clear engineering goals, such as cutting carbon emissions, are prioritized.
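Top-1 and Top-5 accuracy are standard ranking metrics, and the short sketch below shows how they are typically computed against a physics-based benchmark: a recommendation counts as a hit if the benchmark's best measure appears among the model's top k suggestions. The matching rule and toy data are assumptions for illustration, not the paper's exact scoring code.

```python
# Illustrative Top-k scoring against a physics-based benchmark: a model's
# recommendation list counts as a hit if the benchmark's best measure
# appears within its top k entries. The matching rule and toy data are
# assumptions, not the paper's exact evaluation code.

def top_k_accuracy(llm_rankings, benchmark_best, k):
    hits = sum(
        best in ranking[:k]
        for ranking, best in zip(llm_rankings, benchmark_best)
    )
    return hits / len(benchmark_best)

llm_rankings = [
    ["heat pump", "attic insulation", "air sealing", "wall insulation", "windows"],
    ["air sealing", "attic insulation", "heat pump", "windows", "wall insulation"],
]
benchmark_best = ["heat pump", "heat pump"]

print(top_k_accuracy(llm_rankings, benchmark_best, k=1))  # 0.5
print(top_k_accuracy(llm_rankings, benchmark_best, k=5))  # 1.0
```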
On the other hand, when the focus shifted to minimizing payback period, results weakened substantially. Top-1 accuracy fell as low as 6.5 percent in some models, with only Gemini 2.0 surpassing 50 percent at the broader Top-5 threshold. The study concludes that economic trade-offs, which require balancing upfront investment against long-term savings, remain difficult for LLMs to interpret accurately.
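The underlying economics can be illustrated with the simple payback formula, upfront cost divided by annual savings. The figures in the sketch below are invented, not drawn from the study, but they show why a ranking by payback can diverge sharply from a ranking by carbon savings, which is exactly the kind of trade-off the models struggled with.

```python
# Simple payback period: upfront installed cost divided by annual bill
# savings. The figures below are invented, not taken from the study; they
# only show how a ranking by payback can invert a ranking by CO2 savings.

measures = {
    "heat pump replacement": {"cost": 14000, "annual_savings": 900, "co2_cut_kg": 2500},
    "air sealing":           {"cost": 1200,  "annual_savings": 250, "co2_cut_kg": 400},
}

for name, m in measures.items():
    payback_years = m["cost"] / m["annual_savings"]
    print(f"{name}: payback about {payback_years:.1f} yr, "
          f"CO2 cut {m['co2_cut_kg']} kg/yr")
# heat pump replacement: payback about 15.6 yr, CO2 cut 2500 kg/yr
# air sealing: payback about 4.8 yr, CO2 cut 400 kg/yr
```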
How consistent and reliable are AI-generated decisions?
The study also examined whether different LLMs converged on the same recommendations. Here, performance was less encouraging. Consistency between models was low, and in some cases their agreement was worse than chance. Interestingly, the models that performed best in terms of accuracy, such as ChatGPT o3 and Gemini 2.0, were also the ones most likely to diverge from other systems. This indicates that while some models may excel, they do not necessarily produce results that align with peers, creating challenges for standardization in real-world applications.
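The paper's precise consistency metric is not reproduced here, but one common way to test whether agreement beats chance is Cohen's kappa on each pair of models' Top-1 picks, as in the sketch below with toy data; a kappa near zero or below it means the models agree no more often than random guessing would.

```python
# One common way to check whether inter-model agreement beats chance (not
# necessarily the paper's metric): Cohen's kappa on pairs of Top-1 picks.
# Kappa near zero, or below it, means agreement no better than chance.

from collections import Counter
from itertools import combinations

def cohens_kappa(picks_a, picks_b):
    n = len(picks_a)
    observed = sum(a == b for a, b in zip(picks_a, picks_b)) / n
    freq_a, freq_b = Counter(picks_a), Counter(picks_b)
    expected = sum(freq_a[m] * freq_b[m] for m in set(freq_a) | set(freq_b)) / n**2
    return (observed - expected) / (1 - expected)

# Toy Top-1 picks from three hypothetical models over five homes.
picks = {
    "model_a": ["heat pump", "air sealing", "heat pump", "attic insulation", "heat pump"],
    "model_b": ["air sealing", "air sealing", "windows", "heat pump", "attic insulation"],
    "model_c": ["heat pump", "windows", "heat pump", "attic insulation", "air sealing"],
}

for (name_a, a), (name_b, b) in combinations(picks.items(), 2):
    print(name_a, "vs", name_b, round(cohens_kappa(a, b), 2))
```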
The findings underscore the difficulty of relying on AI for high-stakes energy decisions when consensus is lacking. In practice, building owners, policymakers, and utility companies require not just accurate but also consistent recommendations. Low inter-model reliability highlights the importance of developing frameworks that validate and harmonize AI outputs before they can be integrated into large-scale retrofit programs.
What shapes AI reasoning in retrofit decisions?
The researchers also explored how LLMs arrive at their decisions. Sensitivity analysis showed that most models, like physics-based baselines, prioritized location and building geometry. Variables such as county, state, and floor space were consistently weighted as the most influential factors. However, the models paid less attention to occupant behaviors and technology choices, even though these can be critical in shaping real-world outcomes.
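A sensitivity analysis of this kind can be approximated with a one-at-a-time perturbation loop, sketched below with a toy rule-based stand-in for the model; the procedure is an illustrative assumption, not the authors' method, but it shows how "which inputs flip the recommendation" becomes an influence score.

```python
# A minimal one-at-a-time sensitivity sketch (illustrative, not the
# authors' method): perturb one input field at a time, re-query the
# recommender, and score each field by how often the Top-1 answer flips.

def sensitivity(recommend, base_profile, perturbations):
    baseline = recommend(base_profile)
    scores = {}
    for field, alternatives in perturbations.items():
        flips = sum(
            recommend(dict(base_profile, **{field: value})) != baseline
            for value in alternatives
        )
        scores[field] = flips / len(alternatives)
    return scores

def toy_recommend(profile):
    # Stand-in for an LLM call: a crude rule keyed on climate and vintage.
    if profile["state"] in {"MN", "MI", "ME"} and profile["vintage"] < 1980:
        return "attic insulation upgrade"
    return "heat pump replacement"

base = {"state": "MI", "vintage": 1978, "floor_area_sqft": 1850,
        "occupancy": "2 adults"}
perturbations = {
    "state": ["TX", "FL", "MN"],
    "floor_area_sqft": [900, 3200],
    "occupancy": ["1 adult", "family of 5"],
}
print(sensitivity(toy_recommend, base, perturbations))
# In this toy, changing the state flips the answer most often, while floor
# area and occupancy never do, mirroring the location-heavy weighting.
```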
The reasoning patterns offered further insight. Among the tested systems, ChatGPT o3 and DeepSeek R1 provided the most structured, step-by-step explanations. Their workflows followed an engineering-like logic, beginning with baseline energy assumptions, adjusting for envelope improvements, calculating system efficiency, incorporating appliance impacts, and finally comparing outcomes. Yet, while the logic mirrored engineering principles, it was often simplified, overlooking nuanced contextual dependencies such as occupant usage levels or detailed climate variations.
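That stepwise logic can be written down as a very small calculation, and doing so makes the simplification visible. The sketch below uses made-up coefficients rather than values from the paper: a baseline estimate, an envelope factor, a system-efficiency factor, an appliance adjustment, and a final comparison, with no treatment of occupant behavior or local climate detail.

```python
# A deliberately oversimplified version of that stepwise logic, with
# made-up coefficients rather than values from the paper: baseline use,
# an envelope factor, a system-efficiency factor, an appliance adjustment,
# and a final comparison. Occupant behavior and climate detail are ignored,
# which is exactly the simplification the authors flag.

def estimate_annual_kwh(floor_area_sqft, intensity_kwh_per_sqft,
                        envelope_factor, system_efficiency_gain,
                        appliance_delta_kwh):
    baseline = floor_area_sqft * intensity_kwh_per_sqft           # step 1: baseline
    after_envelope = baseline * envelope_factor                   # step 2: envelope
    after_systems = after_envelope * (1 - system_efficiency_gain) # step 3: systems
    return after_systems + appliance_delta_kwh                    # step 4: appliances

current = estimate_annual_kwh(1850, 12.0, envelope_factor=1.0,
                              system_efficiency_gain=0.0, appliance_delta_kwh=0)
retrofit = estimate_annual_kwh(1850, 12.0, envelope_factor=0.85,
                               system_efficiency_gain=0.30, appliance_delta_kwh=-500)
print(f"current about {current:.0f} kWh/yr, retrofit about {retrofit:.0f} kWh/yr")  # step 5
```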
The authors also noted that prompt design played a key role in outcomes. Slight adjustments in how questions were phrased could significantly shift model reasoning. For example, if not explicitly instructed to consider both upfront cost and energy savings, some models defaulted to choosing the lowest-cost option when evaluating payback. This sensitivity suggests that successful deployment of AI in retrofit contexts will depend heavily on careful prompt engineering and domain-specific adaptation.
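The effect is easy to see in two hypothetical prompt variants, shown below; the wording is invented for illustration, but the contrast captures the kind of underspecification the authors describe, where a vague objective leaves the model free to fall back on the lowest upfront cost.

```python
# Two hypothetical prompt phrasings (invented wording, not from the study).
# The vague version leaves the payback objective underspecified, which can
# let a model default to the cheapest measure; the explicit version spells
# out both quantities the calculation needs.

vague_prompt = (
    "Which retrofit measure should the homeowner choose to get a good payback?"
)

explicit_prompt = (
    "Rank the candidate retrofit measures by simple payback period, "
    "computed as upfront installed cost divided by expected annual "
    "utility-bill savings. Consider both quantities for every measure "
    "before ranking."
)
```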
A cautious but forward-looking conclusion
The evaluation highlights both the promise and the limitations of current LLMs in building energy retrofits. On one hand, the ability to achieve near 93 percent alignment with top retrofit measures in technical contexts shows significant potential for AI to streamline decision-making and improve energy efficiency strategies. On the other, weak performance in sociotechnical trade-offs, low inter-model consistency, and simplified reasoning demonstrate that these tools are not yet ready to replace domain expertise.
To sum up, the authors find that LLMs can complement, but not substitute for, traditional methods and expert judgment in retrofit planning. They recommend further development of domain-specific models, fine-tuning with validated datasets, and hybrid approaches that integrate AI with physics-based simulations to ensure accuracy and traceability.
For policymakers and practitioners, the study provides an important benchmark: AI can indeed assist in advancing retrofit strategies, especially for carbon reduction, but its current shortcomings demand careful oversight. As cities and communities push toward energy transition goals, ensuring that AI systems are transparent, consistent, and context-aware will be essential before they can be deployed at scale.
FIRST PUBLISHED IN: Devdiscourse

