E-learning 2.0: AI can now create, narrate, and lip-sync entire lessons in minutes



CO-EDP, VisionRI | Updated: 02-04-2025 09:56 IST | Created: 02-04-2025 09:56 IST

Researchers have developed a fully automated AI system capable of creating e-learning videos featuring photorealistic digital human instructors. Designed by a team from Kobe University and Waseda University, the platform combines cutting-edge text-to-speech synthesis, natural language generation, and computer vision to deliver customized educational content with minimal human input, marking a significant leap toward scalable, personalized, and immersive learning.

The AI-powered web service transforms raw teaching materials, like lecture notes or textbooks, into fully rendered microlecture videos. These videos are narrated by virtual instructors whose facial expressions, lip movements, and gestures are synchronized to the audio. The system is designed to generate high-quality content in various presentation styles, including formal, casual, and bullet-point formats, appealing to different learning preferences.

The service begins by prompting a generative language model to rewrite or summarize the input material. Users can specify a desired tone or teaching objective, allowing the system to tailor its output to the target audience. This is followed by converting the revised text into spoken narration using deep learning-based speech synthesis. The audio is then used to animate a digital human avatar, which is selected or generated using AI-driven image and motion modeling.
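The workflow can be pictured as a three-stage pipeline. The Python sketch below is an illustrative reconstruction of that flow, not the authors' code: the function names, the LessonRequest structure, and the prompt wording are hypothetical stand-ins, and the model calls are stubbed out.

```python
# Illustrative sketch of the three-stage flow described above. The function
# names, the LessonRequest structure, and the prompt wording are hypothetical
# stand-ins, not the authors' actual API; model calls are stubbed out.
from dataclasses import dataclass


@dataclass
class LessonRequest:
    source_text: str        # raw lecture notes or a textbook excerpt
    style: str = "casual"   # "formal", "casual", or "bullet-point"
    objective: str = ""     # optional teaching objective supplied by the user


def rewrite_script(req: LessonRequest) -> str:
    """Stage 1: prompt a generative language model to rewrite or summarize the material."""
    prompt = (
        f"Rewrite the following material as a {req.style} microlecture script. "
        f"Teaching objective: {req.objective or 'general overview'}.\n\n{req.source_text}"
    )
    return prompt  # a call to the language model would replace this stub


def synthesize_speech(script: str) -> bytes:
    """Stage 2: convert the revised script to narration with a TTS model (stubbed)."""
    return script.encode("utf-8")  # placeholder for synthesized audio


def animate_avatar(audio: bytes, avatar_id: str = "default") -> str:
    """Stage 3: drive a digital-human avatar with the audio and render the video (stubbed)."""
    return f"lecture_{avatar_id}.mp4"  # placeholder for the rendered file path


def build_microlecture(req: LessonRequest) -> str:
    script = rewrite_script(req)
    audio = synthesize_speech(script)
    return animate_avatar(audio)


if __name__ == "__main__":
    print(build_microlecture(LessonRequest("Isaac Newton formulated the laws of motion.")))
```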

To validate the system’s capabilities, the researchers conducted a case study on the life of Isaac Newton. The system successfully produced several versions of a short educational video, each with distinct stylistic differences and emotional tones. Content accuracy, voice quality, and visual realism were assessed through both automated evaluations and a user survey. The casual style of delivery, in particular, achieved the best results in terms of speech clarity, viewer engagement, and lip-sync performance.

Quantitative testing was performed using established metrics. In terms of audio, the casual delivery style reached an automatic speech recognition (ASR) score of 66.25, indicating high intelligibility. It also exhibited greater variation in short-time energy, suggesting more dynamic intonation and expressiveness. All three presentation styles shared a perfect pitch rating, and tempo measurements fell within an optimal teaching range of 50 to 200 words per minute.
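Two of the reported audio measures are straightforward to reproduce in principle: speaking tempo in words per minute and the spread of short-time energy. The snippet below is a generic reconstruction of those measures, not the evaluation code used in the paper; the frame length, hop size, and synthetic test signal are assumptions.

```python
# Generic reconstruction of two reported audio measures: speaking tempo
# (words per minute) and variation in short-time energy. Frame length,
# hop size, and the synthetic test signal are illustrative assumptions.
import numpy as np


def words_per_minute(transcript: str, duration_s: float) -> float:
    """Tempo check against the cited 50-200 wpm teaching range."""
    return len(transcript.split()) / (duration_s / 60.0)


def short_time_energy_std(signal: np.ndarray, frame_len: int = 1024, hop: int = 512) -> float:
    """Standard deviation of per-frame energy; larger values suggest more dynamic delivery."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len, hop)]
    energies = np.array([np.sum(f.astype(np.float64) ** 2) for f in frames])
    return float(np.std(energies))


rng = np.random.default_rng(0)
narration = rng.normal(0.0, 0.1, 16000 * 30)  # 30 s of noise at 16 kHz as a stand-in signal
print(words_per_minute("Isaac Newton was born in 1643 and studied at Cambridge. " * 12, 30.0))
print(short_time_energy_std(narration))
```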

To evaluate digital human performance, the team applied the Wav2Lip lip synchronization model. The casual style achieved the lowest error distance and highest synchronization confidence, confirming strong alignment between the audio track and the avatar's lip movements. This contributes to a more natural and convincing delivery, enhancing student immersion and comprehension.
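The error-distance and confidence figures come from a SyncNet-style evaluation, in which embeddings of the audio and of the mouth region are compared across small temporal offsets. The sketch below illustrates that scoring idea only; the random embeddings are placeholders, and real scores require a pretrained SyncNet/Wav2Lip evaluation model.

```python
# Conceptual sketch of SyncNet-style lip-sync scoring: audio and mouth-region
# embeddings are compared across small frame offsets. The distance at the best
# offset approximates the "error distance", and the gap between the median and
# the minimum distance approximates the "sync confidence". Random embeddings
# stand in for the outputs of a pretrained model.
import numpy as np


def sync_scores(audio_emb: np.ndarray, video_emb: np.ndarray, max_offset: int = 15):
    """audio_emb, video_emb: (frames, dim) embedding sequences for the same clip."""
    dists = []
    for off in range(-max_offset, max_offset + 1):
        a = audio_emb[off:] if off >= 0 else audio_emb
        v = video_emb if off >= 0 else video_emb[-off:]
        n = min(len(a), len(v))
        dists.append(np.mean(np.linalg.norm(a[:n] - v[:n], axis=1)))
    dists = np.array(dists)
    error_distance = float(dists.min())                  # lower is better
    confidence = float(np.median(dists) - dists.min())   # higher is better
    return error_distance, confidence


rng = np.random.default_rng(1)
print(sync_scores(rng.normal(size=(100, 512)), rng.normal(size=(100, 512))))
```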

User feedback from 25 participants echoed these findings. Respondents gave the system an average satisfaction score of 5.7 out of 10, with the highest marks in content consistency and overall experience. While visual quality and attractiveness received slightly lower ratings, participants acknowledged the platform’s potential and appreciated the range of delivery styles.

The system also outperformed leading commercial tools in key areas. Compared to platforms like Elai.io, Vyond, and AI Studios, the Kobe-Waseda platform offered superior automation, content personalization, and visual coherence. It provides better lip-sync accuracy, support for multiple languages, and customizable teaching styles, all within a user-friendly interface.

Technically, the system operates in a modular fashion, beginning with text processing, followed by speech synthesis, image generation, facial and gesture animation, and final video rendering. It supports both static and dynamic content types and is designed for scalability and adaptability. For instance, teachers can input either plain text or annotated PowerPoint slides, and the platform can automatically adjust the pacing, visuals, and delivery tone.
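As an example of what such a modular front end might look like, the sketch below accepts either plain text or a PowerPoint file and estimates per-segment narration length at a target tempo. The python-pptx extraction is standard; the segmentation and pacing heuristics are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a modular front end that accepts plain text or PowerPoint slides
# and estimates narration pacing. The python-pptx calls are standard; the
# segmentation and pacing heuristics are illustrative assumptions only.
from pptx import Presentation  # pip install python-pptx


def extract_segments(path_or_text: str) -> list[str]:
    """Return one script segment per slide (.pptx input) or per paragraph (plain text)."""
    if path_or_text.lower().endswith(".pptx"):
        segments = []
        for slide in Presentation(path_or_text).slides:
            text = " ".join(shape.text for shape in slide.shapes if shape.has_text_frame)
            if slide.has_notes_slide:  # include speaker notes when annotated slides are supplied
                text += " " + slide.notes_slide.notes_text_frame.text
            segments.append(text.strip())
        return segments
    return [p.strip() for p in path_or_text.split("\n\n") if p.strip()]


def estimate_pacing(segments: list[str], target_wpm: int = 130) -> list[float]:
    """Estimated narration length (seconds) per segment at a target speaking tempo."""
    return [len(s.split()) / target_wpm * 60.0 for s in segments]


notes = "Newton's first law describes inertia.\n\nHis second law relates force to acceleration."
print(estimate_pacing(extract_segments(notes)))
```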

While the current implementation focuses on upper-body animations and 30-second video segments, the researchers plan to expand support for longer, more interactive lessons. They also aim to enhance real-time capabilities, reduce processing time, and offer broader stylistic and linguistic diversity. Future iterations may include haptic feedback, student-avatar interaction, and adaptive teaching strategies based on learner responses.

Limitations identified include the current system’s dependence on pre-trained language models and a narrow range of supported languages. Additionally, generation time can vary significantly based on video length and resolution, posing scalability challenges for large educational institutions. Semantic abstraction and hierarchical structuring also remain hurdles for complex academic subjects.

Despite these limitations, the research represents a significant advancement in the field of AI-powered education. By integrating digital humans into the e-learning pipeline, the system not only reduces the time and labor required to create instructional content but also enhances student engagement through high-quality, personalized delivery.

The researchers plan to release a web-based version of the service in the coming months, targeting educators, developers, and institutions alike. The platform will include tools for customizing digital avatars, selecting delivery styles, and adjusting audio parameters such as pitch and emotion. A long-term roadmap includes integration with learning management systems and support for live virtual classrooms powered by responsive AI agents.
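No API details have been released, so any interface is speculative; the snippet below merely sketches what a request carrying the parameters mentioned above (avatar choice, delivery style, pitch, emotion) might look like. Every field name and default value is a hypothetical assumption.

```python
# Hypothetical request structure for the planned web service. Every field name
# and default value here is an assumption for illustration; no public API has
# been published yet.
from dataclasses import dataclass, field


@dataclass
class RenderSettings:
    avatar: str = "default"      # which digital-human avatar to use
    style: str = "casual"        # "formal", "casual", or "bullet-point"
    pitch_shift: float = 0.0     # semitones relative to the avatar's base voice
    emotion: str = "neutral"     # e.g. "neutral", "enthusiastic"
    language: str = "en"


@dataclass
class LectureJob:
    source_text: str
    settings: RenderSettings = field(default_factory=RenderSettings)


job = LectureJob("Lecture notes on Newton's laws of motion.", RenderSettings(emotion="enthusiastic"))
print(job)
```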

The team's findings are published in Applied Sciences in a paper titled "Digital Human Technology in E-Learning: Custom Content Solutions".

FIRST PUBLISHED IN: Devdiscourse