The next AI leap: LLMs can process multimedia without task-specific training

Artificial Intelligence (AI) has advanced significantly, particularly with Large Language Models (LLMs) demonstrating powerful reasoning capabilities. Traditionally, AI models needed extensive training on domain-specific datasets to handle multimodal tasks like image captioning, video understanding, and audio analysis. However, a new approach has emerged that enables LLMs to perform these tasks without any specialized training.
A recent study titled "LLMs Can See and Hear Without Any Training", authored by Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, and Rohit Girdhar, introduces MILS (Multimodal Iterative LLM Solver), a novel test-time optimization method that lets LLMs perform multimodal reasoning without any task-specific multimodal training. Published by Meta AI in collaboration with UT Austin and UC Berkeley, the research demonstrates how MILS allows LLMs to generate accurate captions for images, videos, and audio, improve text-to-image generation, facilitate style transfer, and even perform cross-modal arithmetic, all through an iterative optimization process. This article explores the study's methodology, findings, and implications for the future of AI-driven multimodal understanding.
What is MILS and how does it work?
MILS, or Multimodal Iterative LLM Solver, is an innovative framework that allows LLMs to perform multimodal tasks without requiring training on labeled datasets. The approach is based on a simple yet effective iterative feedback loop between two key components: a GENERATOR, which proposes candidate solutions for a given task, and a SCORER, which evaluates these outputs and provides feedback to refine them. This loop continues until the responses are optimized, allowing MILS to progressively improve accuracy. Unlike conventional models that rely on large pre-trained datasets, MILS leverages test-time reasoning, making it a zero-shot learning approach for multimodal AI. This capability allows MILS to generalize across tasks, enabling image and video captioning, audio description, and text-to-image generation without requiring additional training.
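To make that loop concrete, here is a minimal Python sketch of the generator-scorer pattern. It is an illustration rather than the paper's implementation: the toy generator and word-overlap scorer below stand in for the LLM and the multimodal scoring model that MILS uses.

```python
import random

# Minimal sketch of a MILS-style test-time optimization loop.
# In the paper, the GENERATOR is an LLM that proposes candidates (conditioned
# on the best-scored candidates so far) and the SCORER is a multimodal model
# that rates how well each candidate matches the image, video, or audio input.

def mils_optimize(generate, score, steps=10, beam=5):
    """Propose candidates, keep the top-scoring ones, and repeat."""
    best = []  # list of (score, candidate) pairs retained across iterations
    for _ in range(steps):
        candidates = generate(best)                    # GENERATOR step
        scored = [(score(c), c) for c in candidates]   # SCORER step
        best = sorted(best + scored, reverse=True)[:beam]
    return best[0][1]  # highest-scoring candidate found

# Toy demonstration: recover a hidden "caption" by random proposal and scoring.
TARGET = "a dog running on the beach"
VOCAB = TARGET.split() + ["cat", "city", "night", "red"]

def toy_generate(best):
    # A real GENERATOR would condition on `best`; here we just sample words.
    return [" ".join(random.sample(VOCAB, 5)) for _ in range(8)]

def toy_score(candidate):
    # Word overlap with the target stands in for CLIP-style similarity.
    return len(set(candidate.split()) & set(TARGET.split()))

print(mils_optimize(toy_generate, toy_score))
```

The same skeleton applies across modalities: only the scorer changes, which is what lets a single text-only LLM drive image, video, and audio tasks.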
A major breakthrough of MILS is its ability to generate highly accurate captions for images, videos, and audio without being explicitly trained for those tasks. In image captioning, MILS demonstrated superior semantic accuracy compared to existing zero-shot models, successfully describing visual scenes with greater detail and relevance. For video captioning, the study found that MILS could extract meaningful descriptions from moving visuals, performing comparably to models trained on video datasets. In audio captioning, MILS effectively generated descriptions of sound events, showcasing an ability to process and interpret auditory information without prior exposure. The success of MILS in these diverse tasks highlights its flexibility and generalization capabilities, making it a promising solution for multimodal AI applications across different domains.
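The scoring side of such a pipeline can be as simple as ranking candidate captions with an off-the-shelf vision-language model. The snippet below, using Hugging Face's CLIP implementation, illustrates the idea; the model name, image path, and candidate captions are placeholders, and the paper's exact scorers differ by modality.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Rank candidate captions against an image using CLIP similarity.
# No captioning-specific training is involved; CLIP was trained only to
# match images and text in general.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path
candidates = [
    "a bird perched on a branch",
    "a busy city street at night",
    "a bowl of fruit on a wooden table",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_candidates)

print("best caption:", candidates[logits.argmax().item()])
```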
MILS also enhances AI-powered text-to-image (T2I) generation, refining the quality and accuracy of images created by diffusion models. By iteratively adjusting text prompts based on feedback from a scoring model, MILS significantly improves the alignment between generated images and textual descriptions. This process helps address common issues in AI-generated visuals, such as incorrect object rendering and inconsistencies in image structure. Additionally, the study found that MILS-enhanced text-to-image models produce more aesthetically pleasing images with higher semantic fidelity to the original prompts. This improvement makes MILS particularly valuable for creative AI applications in areas like digital design, automated content creation, and AI-generated art, where precision and quality are crucial.
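In outline, the prompt-refinement loop looks like the sketch below. The three callables are placeholders: in practice they would wrap a diffusion model, a CLIP-style alignment scorer, and an LLM prompted to reword the text; the lambdas at the bottom are trivial stand-ins so the example runs.

```python
def refine_prompt(user_prompt, generate_image, alignment, llm_rewrite, rounds=5):
    """Iteratively reword a text-to-image prompt, keeping the version whose
    generated image aligned best with the user's original request."""
    current = user_prompt
    best_prompt, best_score = current, float("-inf")
    for _ in range(rounds):
        image = generate_image(current)          # render with the current prompt
        score = alignment(image, user_prompt)    # how well it matches the request
        if score > best_score:
            best_prompt, best_score = current, score
        current = llm_rewrite(current, score)    # LLM proposes a reworded prompt
    return best_prompt

# Trivial stand-ins so the sketch runs end to end.
print(refine_prompt(
    "a cozy cabin in the woods at dusk",
    generate_image=lambda p: p,  # pretend the "image" is just the prompt text
    alignment=lambda img, ref: len(set(img.split()) & set(ref.split())),
    llm_rewrite=lambda p, s: p + ", warm lantern light",
    rounds=3,
))
```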
Beyond image generation, MILS can be used for style transfer and image editing, allowing AI models to apply artistic styles to images without needing extensive training on labeled datasets. The study demonstrated that MILS can seamlessly modify images to match different styles by using a gradient-free optimization approach that iterates over multiple variations. This enables AI to change the artistic appearance of an image, making it possible to generate painterly effects, retro aesthetics, or highly realistic modifications based purely on textual descriptions. Additionally, MILS can precisely edit image elements while preserving key content, making it a powerful tool for AI-driven photo editing, graphic design, and digital media transformation.
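A rough illustration of the gradient-free idea: generate several edited versions of a photo from different candidate instructions and keep the one that best matches a target style description. InstructPix2Pix and CLIP are used here purely as convenient stand-ins for an editing backbone and a scorer; the model names, prompts, and file paths are illustrative, and a GPU is assumed.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline
from transformers import CLIPModel, CLIPProcessor

# Try several candidate edit instructions and keep the result that best
# matches the target style, judged by CLIP similarity (no gradients needed).
edit_pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

source = Image.open("photo.jpg").convert("RGB")  # placeholder path
target_style = "a watercolor painting with soft pastel colors"

# Candidate instructions an LLM might propose across iterations.
instructions = [
    "turn it into a watercolor painting",
    "repaint it with soft pastel watercolors",
    "make it look like a delicate watercolor sketch",
]

def style_score(image, text):
    # CLIP similarity between an edited image and the target style description.
    inputs = clip_proc(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        return clip(**inputs).logits_per_image.item()

edited = [(edit_pipe(instr, image=source).images[0], instr) for instr in instructions]
best_image, best_instr = max(edited, key=lambda pair: style_score(pair[0], target_style))
print("best instruction:", best_instr)
best_image.save("stylized.jpg")
```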
Cross-modal arithmetic: Combining image, audio, and text
One of the most groundbreaking applications of MILS is cross-modal arithmetic, where different types of media - such as images, audio, and text - can be combined to create new, unique outputs. The study demonstrated how MILS translates different modalities into textual representations, allowing AI to merge concepts in a logical and creative way. For example, MILS can take an image of a bird and an audio clip of ocean waves, then generate an AI-rendered image of a bird by the shore, combining elements from both media formats. This process enables AI to synthesize and reinterpret multimodal information, leading to potential applications in interactive storytelling, virtual world-building, and AI-driven content synthesis.
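Conceptually, the recipe is: caption each input, merge the captions, and feed the merged text to an image generator. The sketch below captures that flow; all four helper functions are hypothetical placeholders (real versions would be MILS captioning loops, an LLM, and a diffusion model), and the lambdas exist only so the example runs.

```python
def cross_modal_combine(image_path, audio_path,
                        caption_image, caption_audio, llm_merge, generate_image):
    """Turn each input into text, merge the texts, and render the result."""
    image_text = caption_image(image_path)   # e.g. "a small bird on a branch"
    audio_text = caption_audio(audio_path)   # e.g. "ocean waves breaking on a shore"
    merged_prompt = llm_merge(image_text, audio_text)
    return generate_image(merged_prompt)

# Placeholder stand-ins so the sketch runs end to end.
result = cross_modal_combine(
    "bird.jpg", "waves.wav",
    caption_image=lambda path: "a small bird perched on a branch",
    caption_audio=lambda path: "ocean waves breaking on a sandy shore",
    llm_merge=lambda a, b: f"{a}, at the seaside, with {b} in the background",
    generate_image=lambda prompt: f"<image rendered from: {prompt}>",
)
print(result)
```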
Ethical considerations and future challenges
While MILS offers groundbreaking capabilities, it also presents ethical and practical challenges that must be addressed. One major concern is bias in AI models, as MILS relies on existing LLMs and scoring models, which may inherit pre-existing biases from their training data. This could result in inaccurate or skewed outputs, affecting reliability in real-world applications. Another challenge is computational efficiency, as the iterative optimization process of MILS requires significant processing power, making it less feasible for real-time applications.
Additionally, the study emphasizes the need for human oversight to verify AI-generated outputs, ensuring accuracy and preventing misinformation or unintended distortions. Moving forward, researchers suggest focusing on developing faster, more efficient scoring mechanisms, improving bias mitigation strategies, and enhancing real-time adaptability, allowing MILS to be deployed in more interactive and high-speed AI applications.
- FIRST PUBLISHED IN:
- Devdiscourse