LLMs show strengths in climate literacy but struggle with weather data precision

Artificial intelligence is rapidly redefining the landscape of meteorology and climate science, not through specialized physics-based models alone, but now through versatile large language models (LLMs) capable of visual interpretation, map creation, and scientific synthesis. A recent study, “The Applications of AI Tools in the Fields of Weather and Climate—Selected Examples,” published in Atmosphere (2025), presents a comprehensive evaluation of leading LLMs, including ChatGPT, Gemini, Claude, Perplexity, and others, on tasks ranging from cloud classification and map comparison to numerical data extraction and literature review support.
Using a standardized testing methodology across diverse platforms, the study analyzed each model’s performance in meteorological workflows through real-world scenarios. These included recognizing clouds from photos, creating maps from datasets, comparing temperature visuals, retrieving data from charts, and identifying key research in climate literature. The results reveal a nuanced picture: while modern LLMs display remarkable capabilities in certain domains, they also suffer from critical limitations that prevent their autonomous use in operational forecasting and scientific decision-making.
Can AI tools rival human experts in meteorological image classification?
The ability to recognize cloud types from photographs is a longstanding challenge, even for trained meteorologists. Yet in this study, the latest iteration of ChatGPT (o3-mini) demonstrated outstanding performance, achieving an average F1 score of 0.83—the highest among all tools evaluated. It also showed superior precision and recall in identifying complex cloud types based on the World Meteorological Organization’s classification system. Other models such as Gemini 2.0 and Claude 3.5 Sonnet also performed well, although they exhibited occasional confusion with visually similar cloud types like stratocumulus and altocumulus.
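For readers less familiar with the metric, the short Python sketch below illustrates how a macro-averaged F1 score of this kind is derived from per-class precision and recall. The cloud labels and predictions are invented for illustration and are not drawn from the study's test set.

```python
# Illustrative only: how a macro-averaged F1 score is computed from per-class
# precision and recall in a cloud-classification evaluation.
# The labels below are hypothetical, not the study's data.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["cumulus", "stratocumulus", "altocumulus", "cirrus", "cumulus", "cirrus"]
y_pred = ["cumulus", "altocumulus",   "altocumulus", "cirrus", "cumulus", "cirrus"]

precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```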
Interestingly, the tools consistently fared better when analyzing standard cloud atlas images compared to privately sourced photographs. This suggests that some models may have been indirectly trained on publicly available datasets, reinforcing the importance of evaluating AI in less curated, real-world contexts. Moreover, while most models correctly identified cloud genera, misclassifications in species and varieties were frequent, highlighting gaps in nuanced meteorological knowledge across platforms. Despite this, the improvements in model performance over a one-year testing period point to rapid evolutionary gains that could make automated cloud classification a viable support tool in early warning systems.
How well do LLMs support geospatial tasks like map creation and interpretation?
In climate science, maps are fundamental for interpreting spatial temperature trends, storm impacts, and hydrological shifts. The study evaluated whether LLMs could generate or interpret maps from either image inputs or raw data files. When it came to interpretation, models like Claude 3.5 Sonnet and SciSpace excelled, producing detailed geographic assessments and accurate temperature extrapolations from comparative maps. ChatGPT-4o and Bard also offered region-specific insights with numerical accuracy, although others like Copilot and Gemini provided only general summaries with limited technical value.
However, the ability to generate maps proved to be more uneven. Only ChatGPT-4o and ScholarGPT succeeded in producing readable and informative weather maps from structured Excel datasets. Gemini, Copilot, and Claude either failed to visualize the data or generated distorted, unlabelled, or incomplete graphics. This inconsistency reflects a broader issue in AI tool design: many LLMs can produce code (e.g., in Python or R), but few can consistently execute geospatial rendering unless paired with specific plug-ins or human intervention. While AI-assisted mapping can accelerate research workflows or aid in teaching, these tools remain suboptimal for publication-grade cartography.
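To give a sense of what this task involves, the minimal Python sketch below shows the kind of script an LLM would need to produce and execute to turn a tabular dataset into a simple temperature map. The file name and column names are assumed for illustration and are not taken from the paper's dataset.

```python
# A minimal sketch of a map-generation script of the kind tested in the study.
# The input file "station_temperatures.xlsx" and the columns "lat", "lon",
# "temp_c" are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("station_temperatures.xlsx")  # hypothetical input file

fig, ax = plt.subplots(figsize=(8, 6))
sc = ax.scatter(df["lon"], df["lat"], c=df["temp_c"], cmap="coolwarm", s=40)
fig.colorbar(sc, ax=ax, label="Temperature (°C)")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Station temperatures")
plt.savefig("temperature_map.png", dpi=200)
```

Even when a model writes a script like this correctly, rendering the final image still depends on an execution environment, which is where several of the evaluated tools fell short.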
Are LLMs reliable in extracting data or reviewing scientific literature?
Despite notable successes in image recognition and mapping, LLMs struggled significantly with numerical precision tasks. When asked to extract hourly temperature data from a line graph image, none of the models achieved results within 0.2°C of the actual values. Even the best performer, Claude 3.5 Sonnet, showed an average deviation of 0.4°C, while others like DataAnalyst and Gemini either produced flatline trends or incorrect estimations. This finding highlights a critical blind spot in current LLM capabilities: while visually adept, they lack the spatial and mathematical alignment necessary for precise data extraction from visual graphs.
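To illustrate how such an accuracy check might be scored, the short sketch below computes the mean absolute deviation between model-extracted and actual hourly temperatures. The numbers are invented for demonstration and are not the study's measurements.

```python
# Illustrative scoring of chart-reading accuracy: mean absolute deviation between
# model-extracted hourly temperatures and the true values. All values below are
# made up for demonstration purposes.
actual    = [12.1, 11.8, 11.5, 12.4, 14.0, 15.6, 16.9, 17.3]   # °C, from the chart's source data
extracted = [12.0, 12.3, 11.0, 12.8, 14.5, 15.2, 17.4, 17.0]   # °C, as read off the graph by a model

deviations = [abs(a - e) for a, e in zip(actual, extracted)]
mean_abs_dev = sum(deviations) / len(deviations)

print(f"mean absolute deviation: {mean_abs_dev:.2f} °C")
print("within 0.2 °C target:", mean_abs_dev <= 0.2)
```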
In contrast, literature review support showcased the strongest area of LLM utility. Tools like Consensus and ChatGPT o1-preview accurately identified key peer-reviewed articles on humid heat waves, presenting full citations and clear summaries. ChatGPT-4o and Academic Assistant Pro also performed reasonably well, though some tools returned off-topic or non-academic sources. Compared to earlier versions like ChatGPT 3.5, which were prone to hallucinations and fabrication, the current generation represents a major leap forward in bibliographic accuracy and scientific relevance. This capacity to expedite literature reviews can be a time-saving asset for researchers, provided outputs are verified against source databases.
First published in: Devdiscourse