Tutorial 10: A Human-AI Collaborative Framework for Agentic Video Synthesis and Evaluation

Speakers

 

Abstract

The advent of high-fidelity text-to-video models presents a paradigm shift in digital media creation. However, harnessing these models for coherent, controllable narrative production remains a significant technical challenge. This tutorial introduces a systematic, agent-based framework for end-to-end generative video synthesis, emphasizing a human-on-the-loop collaborative workflow built with the Google Agent Development Kit (ADK). We conceptualize the creative pipeline as a multi-agent system, where specialized AI agents—a Scripter, a Storyboard Artist, a Cinematographer, and a Composer—work in concert under the supervision of a human “Creative Director.” A cornerstone of our methodology is an autonomous Continuity Supervisor Agent, a multimodal model tasked with automated quality assurance. This agent programmatically evaluates generated clips for temporal consistency in characters, settings, and style—a critical open problem in the field. Attendees will learn to architect and implement this agentic workflow, from structured script generation to a closed-loop, human-on-the-loop refinement process driven by the critic agent’s feedback. The tutorial provides a rigorous technical roadmap for building controllable, high-quality generative video systems and illuminates key research challenges in agentic creativity and automated evaluation.

 

Target Audience

  • AI/ML Researchers and Graduate Students: Individuals specializing in generative models (diffusion, transformers), multi-agent systems, content creation, and AI-driven creativity. This tutorial will provide a novel, cohesive framework for their research and expose them to critical open problems (e.g., temporal consistency, scalable evaluation) suitable for theses and publications.
  • Machine Learning Engineers and Applied Scientists: Practitioners in industry R&D labs who are tasked with building, deploying, and scaling generative media pipelines. The focus on system architecture, structured data flow (JSON), programmatic evaluation, and the “human-on-the-loop” workflow provides a practical blueprint for creating reliable and controllable content generation systems.

 

Outline and Description of the Tutorial

This tutorial provides a comprehensive technical deep-dive into a modular, agent-based framework for generative content creation. It is designed for an audience of researchers, engineers, and advanced practitioners interested in the architectural and algorithmic challenges of controllable media synthesis. The session will be structured around the end-to-end production of a 30-60 second short film, focusing on the design of the agentic system and the human-AI interaction protocols.

Detailed Outline:

1. Introduction: From Generative Models to Agentic Systems

  • State-of-the-Art in Generative Video Synthesis: Core concepts, models (e.g., Imagen, Veo), and fundamental limitations.
  • The Agentic Abstraction: Why frame content creation as a multi-agent system? We introduce the architecture of our collaborative framework, defining the roles and interfaces for specialized agents (LLM Scripter, T2I Storyboarder, T2V Cinematographer). A minimal code sketch of this agent roster follows this list.
  • Human-as-Director: Defining the role of the human collaborator in a “human-on-the-loop” system: setting high-level narrative goals, resolving agent ambiguity, and providing final creative judgment.
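
To ground the agentic abstraction, the following minimal sketch wires the specialized roles into a sequential pipeline using the google-adk Python package. It is an illustration under stated assumptions, not the tutorial's reference implementation: the model id, instructions, and output_key names are placeholders, the {script} / {storyboard_prompts} syntax assumes ADK's instruction templating over session state, and each stage is represented by a prompt-producing LlmAgent for brevity (the image and video models themselves are invoked downstream).

# Minimal sketch: specialized agents composed into an ADK pipeline.
# Model ids, instructions, and state keys are illustrative assumptions.
from google.adk.agents import LlmAgent, SequentialAgent

scripter = LlmAgent(
    name="scripter",
    model="gemini-2.0-flash",  # assumed model id
    instruction="Draft a short-film script from the director's concept.",
    output_key="script",  # result stored in session state under this key
)

storyboarder = LlmAgent(
    name="storyboard_artist",
    model="gemini-2.0-flash",
    instruction="Derive one storyboard image prompt per shot from: {script}",
    output_key="storyboard_prompts",
)

cinematographer = LlmAgent(
    name="cinematographer",
    model="gemini-2.0-flash",
    instruction=("Turn {storyboard_prompts} into text-to-video prompts "
                 "with explicit camera directives."),
    output_key="video_prompts",
)

# The human Creative Director sits outside this pipeline, approving each
# intermediate artifact before the next agent consumes it.
pipeline = SequentialAgent(
    name="preproduction_pipeline",
    sub_agents=[scripter, storyboarder, cinematographer],
)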

2. AI-Powered Pre-Production: A Human-AI Collaborative Workflow

  • Conceptual Development (Human-Director & Scripter Agent): The human provides a high-level narrative concept. The Scripter Agent (LLM) generates multiple script drafts and loglines for human selection and refinement.
  • Structured Scene Decomposition (The ‘Show, Don’t Tell’ Agent): A specialized LLM agent parses the approved natural-language script into a structured, machine-readable format (JSON). This output defines scenes, shot compositions, character actions, and camera directives, forming a “digital shot list” that is critical for downstream control; a schema sketch follows this list.
  • Agent-Assisted Storyboarding (Director & Storyboard Artist Agent): The human director guides the Storyboard Artist Agent (e.g., Imagen 4) using prompts derived from the structured shot list. This step establishes character and environmental visual anchors for later consistency checking.
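
One way to pin down the “digital shot list” is a typed schema that every downstream agent consumes. The sketch below uses Pydantic; the field names are assumptions chosen for illustration, and the commented LlmAgent usage assumes ADK's output_schema parameter for constrained JSON output.

# Hedged sketch of a machine-readable shot list. Field names are
# illustrative; what matters is that pre-production output is structured.
from typing import List
from pydantic import BaseModel

class Shot(BaseModel):
    shot_id: int
    scene: str                # e.g. "EXT. HARBOR - DUSK"
    characters: List[str]
    action: str               # what happens on screen
    camera: str               # e.g. "slow dolly-in, rack focus to face"
    style: str                # visual anchor, e.g. "35mm film, muted palette"
    duration_seconds: float

class ShotList(BaseModel):
    title: str
    shots: List[Shot]

# The decomposition agent can be constrained to emit this structure
# (assuming ADK's output_schema parameter on LlmAgent):
# decomposer = LlmAgent(name="decomposer", model="gemini-2.0-flash",
#                       instruction="Parse the approved script into a shot list.",
#                       output_schema=ShotList, output_key="shot_list")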

3. Generative Production: Executing the Cinematic Plan

  • The Cinematographer Agent: From Shot List to Moving Image: This agent (e.g., Veo) ingests the structured shot data and storyboard images to generate video clips. We will explore prompt-engineering techniques for enforcing cinematic control (e.g., specifying dolly zooms, rack focus, and artistic styles); a generation sketch follows this list.
  • The Composer Agent: Algorithmic Music Generation: An audio generation agent (e.g., Lyria) creates a musical score based on high-level textual prompts from the director regarding mood, genre, and synchronization cues from the scene data.
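
The sketch below shows what the Cinematographer step might look like with the google-genai Python SDK, which exposes Veo as a long-running generate_videos operation. The model id, config fields, and example prompt are assumptions drawn from public examples, not the tutorial's exact code; the Composer step is analogous and omitted here.

# Hedged sketch of the Cinematographer agent's core call: submit one shot
# prompt to a Veo model and download the clip once the operation completes.
import time

from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

def render_shot(prompt: str, out_path: str) -> None:
    """Render a single shot; prompts come from the structured shot list."""
    operation = client.models.generate_videos(
        model="veo-2.0-generate-001",  # assumed model id
        prompt=prompt,
        config=types.GenerateVideosConfig(aspect_ratio="16:9"),
    )
    # Video generation is asynchronous: poll the long-running operation.
    while not operation.done:
        time.sleep(20)
        operation = client.operations.get(operation)
    video = operation.response.generated_videos[0].video
    client.files.download(file=video)
    video.save(out_path)

# Example call with a hypothetical prompt assembled from one Shot record:
render_shot(
    "Slow dolly-in on a lighthouse keeper at dusk, 35mm film look, rack focus",
    "shot_05.mp4",
)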

4. The Director’s Cut: An Agentic Framework for Automated Evaluation and Refinement

  • The Temporal Coherence Problem: A technical discussion on why character, object, and style consistency is a fundamental open research problem for generative models.
  • The Continuity Supervisor Agent: A Multimodal Critic for Automated QA: This is the core technical contribution of the tutorial. We demonstrate how to implement a “critic” agent using a multimodal LLM (a code sketch follows this list). This agent programmatically “watches” the generated clips and:
    • Performs object/character re-identification across shots.
    • Scores clips for visual style and environmental consistency against the storyboard anchors.
    • Outputs a JSON report detailing inconsistencies (e.g., “Character A’s shirt color changed in Shot 5,” “Background setting differs between Shots 7 and 8”).
  • Human-on-the-Loop: Closing the Quality Control Loop: We showcase the complete feedback mechanism: the Continuity Supervisor Agent flags an issue, the system automatically suggests a revised prompt, and the human director reviews the agent’s findings and suggestions, then approves re-generation, modifies the prompt, or accepts the flaw. This illustrates a practical and powerful human-AI teaming workflow.
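
To make the critic concrete, the sketch below asks a multimodal model to compare frames sampled from a generated clip against the approved storyboard anchor and to return a structured report. The report schema, model id, and prompt are illustrative assumptions; frame sampling (e.g., via ffmpeg or OpenCV) is left out.

# Hedged sketch of the Continuity Supervisor: a multimodal critic that
# "watches" sampled frames and emits a machine-readable continuity report.
from typing import List

from pydantic import BaseModel
from google import genai
from google.genai import types

class ContinuityIssue(BaseModel):
    shot_id: int
    severity: str       # "minor" | "major"
    description: str    # e.g. "Character A's shirt color changed in Shot 5"

class ContinuityReport(BaseModel):
    consistent: bool
    style_score: float  # 0-1 agreement with the storyboard anchor
    issues: List[ContinuityIssue]

client = genai.Client()

def review_shot(anchor_jpeg: bytes, frame_jpegs: List[bytes],
                shot_id: int) -> ContinuityReport:
    """Compare sampled clip frames against the storyboard anchor image."""
    parts = [types.Part.from_bytes(data=anchor_jpeg, mime_type="image/jpeg")]
    parts += [types.Part.from_bytes(data=f, mime_type="image/jpeg")
              for f in frame_jpegs]
    parts.append(types.Part.from_text(text=(
        f"The first image is the approved storyboard anchor for shot {shot_id}; "
        "the rest are frames sampled from the generated clip. Re-identify "
        "characters and objects, score style and environment consistency, "
        "and report every inconsistency.")))
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # assumed model id
        contents=parts,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=ContinuityReport,  # constrained JSON output
        ),
    )
    return response.parsed  # parsed into the Pydantic schema

# In the human-on-the-loop workflow, a report with consistent == False
# triggers a suggested prompt revision that the director approves, edits,
# or rejects before the shot is re-generated.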

5. Open Problems and Future Directions

  • Recap: Synthesizing the end-to-end agentic methodology.
  • Research Frontiers: Discussing open problems through the lens of agent-based systems:
    • Long-Range Coherence as Agent Memory: How can agents maintain state across an entire film?
    • Interactive Editing as Human-Agent Negotiation: Developing interfaces for fine-grained, intuitive control.
    • Scalable Evaluation Metrics: Moving beyond simple consistency checks to nuanced narrative and emotional arc evaluation.
  • Audience Q&A

 

Vertical

Generative AI Models, AI in Education, and Agentic AI

Timeline

4 hours