Tutorial 13: Generative AI for Multimodal Sensing and 3D Perception

Speakers

  • Prof. Henry Leung (Department of Electrical and Software Engineering, University of Calgary, Canada)

 

Abstract

Accurate sensing and perception are fundamental to intelligent systems, enabling machines to interpret and interact with complex environments. Traditional sensing approaches rely on dedicated hardware such as LiDAR, RGB-D cameras, and magnetic or optical sensors to capture spatial or modality-specific information. However, these systems are often costly, power-intensive, and difficult to integrate across domains. Generative AI introduces a paradigm shift by learning to infer or synthesize missing sensory data—such as predicting depth from monocular images or generating equivalent sensor outputs across modalities—thereby reducing reliance on specialized hardware while enhancing environmental understanding.

This tutorial explores AI-driven multimodal sensing, unifying depth prediction and sensor-to-sensor transformation within a single generative framework. We investigate how deep generative models, including diffusion networks and transformer-based architectures, can learn joint latent representations that enable cross-modal synthesis, such as optical-to-SAR or RGB-to-infrared conversion, alongside 3D structure estimation from single or sparse inputs. Applications include autonomous navigation, remote sensing, industrial inspection, and medical imaging.

The study further examines challenges in data alignment, real-time inference, and generalization across sensor types, emphasizing strategies for robust multimodal training and seamless deployment without additional hardware modifications. By bridging 3D perception and sensor transformation through generative AI, this work aims to establish a unified framework for cost-efficient, adaptive, and perceptually rich sensing systems of the future.

 

Target Audience

This tutorial is designed for a broad spectrum of professionals, including researchers, engineers, and practitioners, who are interested in sensing, computer vision, and artificial intelligence. It welcomes individuals eager to deepen their understanding of novel generative-AI-based approaches to soft sensing and to explore the transformative potential of generative/predictive AI in the sensor and sensing domains.

Experienced professionals seeking to expand their knowledge and stay updated on the latest advancements in generative AI and predictive sensing will find the tutorial beneficial.

Newcomers keen on delving into the intersection of sensing, computer vision and machine learning will discover valuable insights and practical guidance. The tutorial aims to foster a collaborative learning environment conducive to knowledge exchange and skill enhancement by accommodating participants with diverse backgrounds and expertise levels.

Attendees will gain actionable insights into implementing generative AI for predictive sensing algorithms with various multimodal sensing and 3D perception applications. The tutorial provides a platform for attendees to engage in hands-on exercises and discuss implementation challenges.

 

Outline and Description of the Tutorial

1. Introduction: From Physical Sensors to AI-Driven Perception

  • Definition of 3D perception and its role in intelligent systems.
  • Overview of traditional sensing modalities (LiDAR, RGB-D, stereo, SAR, etc.) and their integration challenges.
  • Motivation for AI-based multimodal perception: reducing hardware cost, improving adaptability, and enabling richer environmental understanding.

2. Foundations of Multimodal and Depth-Aware Sensing

  • Physical principles of traditional 3D sensors (time-of-flight, stereo, structured light, LiDAR).
  • Overview of multimodal data (optical, infrared, SAR, magnetic, etc.) and alignment issues.
  • Why depth and modality translation are conceptually linked: both infer missing spatial or cross-sensor information.
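The geometric principles behind the stereo and time-of-flight sensors listed above reduce to simple formulas; the following is a minimal numerical sketch, with illustrative focal-length and baseline values not tied to any particular sensor.

```python
# Stereo depth from disparity: Z = f * B / d
# f: focal length in pixels, B: baseline in metres, d: disparity in pixels.
# The parameter values below are illustrative assumptions.

def depth_from_disparity(disparity_px, focal_px=700.0, baseline_m=0.12):
    """Return depth in metres for a given pixel disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Time-of-flight depth: Z = c * t / 2 (round trip at the speed of light).
def depth_from_tof(round_trip_s, c=299_792_458.0):
    return c * round_trip_s / 2.0

print(depth_from_disparity(42.0))  # nearer objects yield larger disparity
print(depth_from_tof(20e-9))       # a 20 ns round trip is roughly 3 m
```

Both formulas infer distance from an indirect measurement (pixel offset or travel time), which is the same inverse-problem structure that generative depth prediction attacks with learned priors instead of dedicated optics.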

3. Generative AI for Predictive Sensing

  • Concept of generative multimodal sensing: predicting or synthesizing missing sensory data.
  • Deep architectures for multimodal learning: diffusion models, transformers, variational autoencoders.
  • Joint latent representation learning: how shared embeddings enable both 3D reconstruction and cross-modal synthesis.
  • Examples: predicting depth from RGB, generating SAR from optical, or fusing sparse and dense modalities.
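As an illustration of the shared-embedding idea above, the toy sketch below routes one modality through a common latent vector and decodes it in another modality's feature space. All dimensions and the random weights are placeholders; a practical system would learn these mappings with the diffusion or transformer architectures discussed in this section.

```python
import numpy as np

# Toy sketch of a shared latent space for cross-modal synthesis.
# An encoder maps RGB-domain features into a joint latent vector;
# a decoder maps that latent vector into SAR-domain features.
# Weights are random placeholders, not trained models.

rng = np.random.default_rng(0)
D_RGB, D_SAR, D_LATENT = 12, 8, 4  # illustrative feature dimensions

enc_rgb = rng.standard_normal((D_LATENT, D_RGB)) * 0.1
dec_sar = rng.standard_normal((D_SAR, D_LATENT)) * 0.1

def rgb_to_sar(rgb_feat):
    """Cross-modal synthesis: encode RGB features, decode as SAR features."""
    z = np.tanh(enc_rgb @ rgb_feat)  # shared latent embedding
    return dec_sar @ z               # synthesized SAR-domain features

rgb_feat = rng.standard_normal(D_RGB)
sar_pred = rgb_to_sar(rgb_feat)
print(sar_pred.shape)  # (8,): an SAR-dimension output from an RGB input
```

The key design point is that every modality-specific encoder and decoder meets in the same latent space, so the same embedding can serve both 3D reconstruction and modality translation.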

4. Technical Challenges and Training Strategies

  • Data alignment, calibration, and synchronization across sensor types.
  • Handling sparse or noisy data and domain gaps.
  • Real-time inference constraints and model compression.
  • Strategies for robust multimodal training: contrastive learning, cross-domain consistency losses, and transfer learning.
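The contrastive-learning strategy listed above can be sketched with a minimal InfoNCE-style loss over paired modality embeddings; the array shapes and temperature value here are illustrative assumptions.

```python
import numpy as np

# Minimal InfoNCE-style contrastive loss between paired modality embeddings.
# Row i of each matrix is assumed to come from the same scene; the loss
# pulls matched pairs together and pushes mismatched pairs apart.

def info_nce(za, zb, temperature=0.1):
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature             # cosine similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # matched pairs on the diagonal

rng = np.random.default_rng(1)
za = rng.standard_normal((4, 16))
loss_aligned = info_nce(za, za)                        # perfectly aligned pairs
loss_random = info_nce(za, rng.standard_normal((4, 16)))
print(loss_aligned < loss_random)  # True: alignment lowers the loss
```

The same machinery underlies cross-domain consistency losses: embeddings of the same scene from different sensors are treated as positives, which encourages the shared latent space that Section 3 relies on.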

5. Advantages of Generative Multimodal Sensing

  • Hardware efficiency: reducing reliance on specialized sensors.
  • Seamless integration with existing perception systems.
  • Enhanced adaptability across environments and applications.
  • Improved interpretability and fusion for autonomous or remote systems.

6. Applications

  • Autonomous navigation and robotics (depth and modality fusion for obstacle avoidance).
  • Remote sensing and Earth observation (optical–SAR synthesis and terrain reconstruction).
  • Industrial inspection (cross-sensor inference in harsh environments).
  • Medical imaging (modality translation and 3D tissue reconstruction).
  • AR/VR and digital twins (monocular-to-3D and multimodal simulation).

 

Reading List

No prerequisites are needed for this tutorial. Additional reading materials will be provided during the tutorial for attendees interested in this topic.

 

Vertical

Generative AI Models, AI in Education and Agentic AI

Multimedia and Virtual Reality and Teleoperation

 

Timeline

2 hours