AI can strengthen safety assurance in high-risk autonomous systems

A team of NASA Ames–affiliated researchers has unveiled a new framework that uses artificial intelligence to verify, validate, and strengthen the safety of AI-enabled systems operating in high-risk environments. Their work targets a growing global concern: the reliability of autonomous vehicles, aerospace systems, and other mission-critical platforms increasingly powered by deep neural networks. With conventional verification methods failing to keep pace with the opaque and unpredictable nature of modern AI, the researchers argue that safety engineering must now evolve by using advanced AI models to fight the very risks posed by AI components.

The study, titled Fighting AI with AI: Leveraging Foundation Models for Assuring AI-Enabled Safety-Critical Systems, published on arXiv, introduces two complementary components, REACT and SemaLens, designed to bridge long-standing gaps between natural-language system requirements, formal specifications, perception models, and real-world system behaviors. The researchers propose a full-lifecycle assurance pipeline capable of detecting design errors early, accelerating verification workflows, and offering a new pathway toward certifying AI-integrated autonomous systems.

AI components push safety engineering beyond traditional verification limits

The rapid integration of AI into mission-critical systems has created a unique set of safety challenges that conventional engineering methods cannot adequately address. Deep neural networks drive perception, classification, and decision-making in autonomous rovers, aircraft, and vehicles, but their internal logic remains largely inaccessible to standard verification techniques. The authors highlight that AI components behave in ways that are emergent, nonlinear, and sensitive to data conditions, making their outputs hard to test and their failure modes difficult to predict.

The foundation of the problem lies in what the researchers describe as a semantic gulf between high-level natural-language requirements and the low-level pixel or vector representations processed by neural networks. System requirements are typically written by humans in descriptive language, while AI modules interpret the world through raw sensor inputs. This mismatch makes it nearly impossible to trace whether the neural network meets its stated requirements or to detect subtle inconsistencies before deployment.

Compounding these AI-specific challenges are traditional requirements-engineering problems that have persisted for decades. Natural-language requirements are frequently ambiguous, incomplete, or contradictory. Translating them into precise specifications for safety-critical systems is laborious, error-prone, and often requires specialized expertise in formal logic. The result is a scalability bottleneck that threatens the reliability of next-generation autonomous systems.

The researchers stress that these gaps are not theoretical. In safety-critical domains including aerospace, requirement errors discovered post-deployment have historically led to mission failures, human harm, and costly redesigns. As system complexity grows with the addition of AI modules, early detection of requirement flaws becomes urgently necessary. Without tools that can bridge both the linguistic and computational fault lines, the certification of AI-driven systems remains an unresolved challenge.

This study positions its framework as a decisive step toward closing this gap by directly deploying advanced language and vision models to interpret requirements, formalize system behavior, and analyze perception logic at a semantic level.

New AI-assurance tools: REACT and SemaLens offer end-to-end verification pipeline

The proposed solution is based on two synergistic components: REACT and SemaLens. Together, they create a unified pipeline that moves from informal requirements written in everyday English to testable implementations evaluated against high-level conceptual expectations.

REACT, short for Requirements Engineering with AI for Consistency and Testing, acts as a requirements assistant powered by large language models. It addresses one of the most persistent pain points in safety-critical engineering: the difficulty of translating vague human-written requirements into precise, verifiable formal specifications.

The system begins by accepting raw natural-language requirements and transforming them into structured English with a constrained grammar. This transformation produces interpretations that are consistent, unambiguous, and suitable for formalization. Instead of forcing a single interpretation, REACT generates multiple candidate versions, each capturing a plausible reading of the original text. This approach acknowledges the inherent ambiguity of human language and lets engineers choose the version that matches their intended meaning.
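As a hypothetical illustration (the example is ours, not one drawn from the paper), consider the raw requirement "The rover must stop when it detects an obstacle." Candidate interpretations in structured English might read:

    Candidate A: Whenever an obstacle is detected, the rover shall stop at the next step.
    Candidate B: Whenever an obstacle is detected, the rover shall eventually stop.

The two candidates make the hidden ambiguity explicit: "when" can mean immediately or at some later point.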

Once the candidate requirements are created, REACT’s validation module uses formal reasoning to highlight the differences among them. Instead of presenting formal logic directly, the system shows human-readable scenarios, helping engineers quickly select the correct requirement without needing deep expertise in formal methods. The validated requirement then moves into REACT’s formalization module, which converts structured English into formal specifications such as Linear Temporal Logic over finite traces (LTLf). This step is crucial for downstream verification, enabling consistency checks, conflict detection, and compliance with industry standards.
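Continuing the hypothetical example above, the two candidates would formalize into distinct LTLf specifications:

    Candidate A: G(obstacle_detected -> X stop)    (stop at the very next step)
    Candidate B: G(obstacle_detected -> F stop)    (stop at some later step)

A trace in which the rover stops two steps after detection satisfies Candidate B but violates Candidate A, which is exactly the kind of human-readable scenario the validation module can surface.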

The final REACT module generates test cases directly from the formal requirements. These tests provide coverage guarantees and serve as input for the SemaLens component, enabling a seamless transition from textual requirements to perception-testing workflows.
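The paper's own tooling is not reproduced here, but a minimal Python sketch (ours; the trace encoding and function are illustrative assumptions) shows how a formalized requirement such as Candidate A above can mechanically label test traces as passing or failing:

    # Check finite traces against the hypothetical requirement
    # G(obstacle_detected -> X stop); in LTLf semantics, X is
    # unsatisfiable at the final step of a finite trace.
    def satisfies_g_implies_next(trace, antecedent, consequent):
        for i, state in enumerate(trace):
            if state.get(antecedent):
                if i + 1 >= len(trace) or not trace[i + 1].get(consequent):
                    return False
        return True

    passing = [{"obstacle_detected": True}, {"stop": True}]
    failing = [{"obstacle_detected": True}, {}, {"stop": True}]  # stops one step too late

    assert satisfies_g_implies_next(passing, "obstacle_detected", "stop")
    assert not satisfies_g_implies_next(failing, "obstacle_detected", "stop")

Labeled traces of this kind are what let generated tests carry coverage guarantees: each test is tied to a specific formula it exercises.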

SemaLens, the second major component, leverages multimodal foundation models such as CLIP to analyze and interpret the behavior of neural networks in perception systems. It performs spatial and temporal reasoning on sequences of images and videos, interpreting them through human-understandable concepts rather than raw pixel values.

Its monitoring function can be used offline to identify safety-critical scenarios across large datasets or online to detect deviations in real time during system operation. SemaLens can evaluate whether a perception module recognizes key semantic concepts in an image, such as an obstacle, surface change, or environmental condition, and determine whether the system behaves consistently with its requirements.
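A minimal sketch of such a concept check, assuming an off-the-shelf CLIP model from Hugging Face (the model name, concept list, and threshold are our illustrative choices, not values from the paper):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    concepts = ["an obstacle on the path", "a clear path", "a sudden surface change"]
    frame = Image.open("camera_frame.png")  # one frame from the perception stream

    inputs = processor(text=concepts, images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

    # Flag the frame if the safety-relevant concept dominates.
    if probs[0] > 0.5:  # illustrative threshold
        print("obstacle concept detected with score", float(probs[0]))

Run offline over a dataset, the same loop surfaces safety-critical scenarios in bulk; run online, it becomes a lightweight runtime monitor.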

The image generation module uses diffusion models to create diverse test images and videos, expanding the range of conditions under which perception systems can be evaluated. This is particularly important for data-sparse environments or edge cases that are difficult to simulate through conventional methods. The test module introduces novel coverage metrics that assess how well images reflect relevant semantic features, enabling both black-box and white-box coverage analysis.
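A sketch of how such images might be produced with an open-source diffusion pipeline (the model and prompts are our assumptions; the paper does not prescribe a specific generator):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Prompts target semantic features the requirements care about,
    # including rare conditions that are hard to capture in the field.
    prompts = [
        "rocky planetary surface with a boulder blocking the path, dusk lighting",
        "rocky planetary surface, clear path, harsh midday sun and long shadows",
    ]
    for i, prompt in enumerate(prompts):
        pipe(prompt).images[0].save(f"generated_test_{i}.png")

Each generated image can then be scored with the coverage metrics above to confirm it actually exhibits the intended semantic feature.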

Finally, the AED (Analyze, Explain, and Debug) module uses vision-language models (VLMs) to interpret the internal behavior of perception models. By aligning embeddings from the neural network with those of the foundation model, it enables the extraction of human-readable concepts that explain why the model makes certain decisions. This component supports debugging, detection of brittle features, and evaluation of robustness against adversarial or unusual inputs.
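Conceptually, the alignment step might look like the following sketch (entirely illustrative: the learned projection, dimensions, and concept list are our assumptions, and the placeholders stand in for real embeddings):

    import torch
    import torch.nn.functional as F

    concept_names = ["rock", "shadow", "crater", "wheel track"]
    clip_text_emb = torch.randn(len(concept_names), 512)  # placeholder CLIP text embeddings
    net_emb = torch.randn(256)          # placeholder activation from the perception net
    align = torch.nn.Linear(256, 512)   # learned map into the foundation model's space (assumed)

    sims = F.cosine_similarity(align(net_emb).unsqueeze(0), clip_text_emb)
    ranked = sorted(zip(concept_names, sims.tolist()), key=lambda kv: -kv[1])
    print("concepts most active in this decision:", ranked[:2])

In a real pipeline the ranking would point engineers toward the concepts driving a decision, making brittle or spurious features visible.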

Together, REACT and SemaLens produce a full chain of assurance tools that connect requirements to runtime behavior—a capability the researchers emphasize as vital for the future of safe AI deployment.

Foundation models offer a scalable future for safe autonomous systems

The study outlines several benefits arising from the integration of REACT and SemaLens. For REACT, the most impactful advantage is the ability to perform rigorous analysis while remaining accessible to engineers without formal-methods expertise. By shifting complex logic into AI-assisted tools, the framework reduces the manual burden typically required to ensure clarity, consistency, and completeness in requirements. It enables early-stage verification and validation, preventing costly redesigns and enhancing the scalability of large, complex projects.

SemaLens contributes a different set of strengths, including improved runtime reliability, enhanced interpretability of perception models, and reduced manual annotation for debugging. Its multimodal capabilities allow it to reason across images, text, and temporal patterns, something traditional verification methods cannot achieve. By generating semantically diverse test inputs, it ensures that perception models are challenged across a wider spectrum of conditions, supporting safer operations in unpredictable environments.

The integration of these two components delivers a rare combination: scalability, semantic reasoning, conceptual explainability, and end-to-end traceability. These capabilities are essential as aerospace, autonomous vehicles, and robotic systems increasingly rely on learning-enabled modules that traditional engineering cannot easily validate.
