Speakers
- Antonio Zarauz Moreno (AI Research Tech Lead at Credicorp and Lecturer at the University of Buenos Aires)
- José Antonio Lagares Rodríguez
Abstract
The proliferation of large-scale document datasets has necessitated a paradigm shift from traditional Optical Character Recognition (OCR) methods to more scalable and semantically rich document understanding techniques. This tutorial introduces a novel framework that leverages generative AI for efficient and effective document intelligence at scale. We address the inherent inefficiencies of conventional OCR and of simplistic generative AI approaches with a solution that combines late-interaction retrieval strategies with advanced vision language models (VLMs). At the core of the framework is the use of ColPali-style architectures, which treat entire document pages as images for retrieval, thereby preserving the visual and structural context that is often lost in text-extraction pipelines. The tutorial provides a comprehensive overview of the current landscape of VLMs for document understanding, including state-of-the-art open-source models and toolkits such as Docling, SmolVLM, InternVL, Qwen-VL, and the MiniCPM-V series. A significant portion of the tutorial is dedicated to a deep dive into the framework, demonstrating how integrating ColPali-style late-interaction retrieval with these VLMs yields substantial gains in both retrieval accuracy and computational efficiency. We also cover practical deployment at scale using inference-optimization libraries such as vLLM, which is designed to maximize throughput and minimize latency in production environments. Attendees will gain both theoretical insights and practical skills to implement these generative AI strategies in their own document-intensive applications.
Target Audience
This tutorial is designed for AI researchers, practitioners, and tech leads with an interest in document intelligence, natural language processing, computer vision, and large-scale AI systems. The content is tailored to an intermediate to advanced audience who are familiar with the fundamentals of deep learning and have some experience with large language models (LLMs) and/or computer vision models.
Expected prior knowledge:
- Mathematical Foundations: A solid understanding of linear algebra, calculus, and probability theory, as they underpin modern neural network architectures.
- Machine Learning Concepts: Familiarity with concepts such as transfer learning, fine-tuning, and attention mechanisms is essential.
- Models and Methods: Prior exposure to transformer architectures (e.g., BERT, GPT) and convolutional neural networks (CNNs) is expected. Knowledge of vision transformer (ViT) architectures would be beneficial but not strictly required.
- Programming Skills: Proficiency in Python and experience with at least one major deep learning framework (e.g., PyTorch, TensorFlow) are necessary to follow the practical components of the tutorial. Familiarity with the Hugging Face ecosystem is highly recommended.
- Adjacent Areas: A basic understanding of information retrieval concepts (e.g., dense retrieval, vector databases) would be advantageous.
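To make the dense-retrieval prerequisite concrete, here is a minimal NumPy sketch of the idea (function and variable names are ours; real systems would use learned encoder embeddings and a vector database): each document is a single vector, and retrieval ranks documents by cosine similarity to the query vector.

```python
import numpy as np

def cosine_rank(query_vec, doc_vecs):
    """Rank documents by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # one similarity score per document
    return np.argsort(-scores), scores  # best-first ordering

# Toy 4-dimensional "embeddings" standing in for encoder outputs.
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.7, 0.7, 0.0, 0.0]])
order, scores = cosine_rank(np.array([1.0, 0.1, 0.0, 0.0]), docs)
```

This single-vector ("dense") setup is the baseline that the late-interaction methods in Part 2 improve upon.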
Outline and Description of the Tutorial
- Part 1: Foundations of Modern Document Understanding (60 minutes)
- The Limitations of Traditional Document Processing: A critical review of OCR-based pipelines and their shortcomings in handling complex layouts, non-textual elements, and scalability.
- The Rise of Vision Language Models for Document AI: An introduction to the architecture of modern VLMs and how they enable a more holistic understanding of documents. We will survey prominent open-source models, including:
- Docling and SmolDocling: For efficient and structured document conversion.
- SmolVLM: A compact and efficient VLM for on-device applications.
- InternVL and Qwen-VL: High-performance models for a wide range of vision-language tasks.
- MiniCPM-V series: Efficient and capable models for end-device deployment.
- Hands-on Session 1: Information Extraction with a Pre-trained VLM: A guided coding exercise on using a state-of-the-art VLM (e.g., Qwen-VL) for structured information extraction from invoices and reports.
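A recurring step in Hands-on Session 1 is turning the VLM's free-form answer into validated structured data. The sketch below (ours; the schema fields and the hand-written `answer` string stand in for a real model call) extracts the first JSON object from a model reply and checks it against an expected invoice schema:

```python
import json
import re

REQUIRED_FIELDS = {"invoice_number", "date", "total"}  # demo schema, not a standard

def parse_vlm_answer(raw_answer: str) -> dict:
    """Pull the first JSON object out of a free-form VLM answer and
    validate it against the expected invoice schema."""
    match = re.search(r"\{.*\}", raw_answer, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    record = json.loads(match.group(0))
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record

# A plausible hand-written model answer, standing in for a real VLM call.
answer = 'Sure! {"invoice_number": "INV-001", "date": "2024-05-01", "total": "1,250.00"}'
record = parse_vlm_answer(answer)
```

In the session itself, `raw_answer` would come from prompting a VLM such as Qwen-VL with the invoice image and an instruction to answer in JSON.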
- Part 2: A Novel Framework with Late-Interaction Retrieval (75 minutes)
- 2.1. The “Colpali” Architectural Paradigm: A deep dive into the principles of late-interaction retrieval (inspired by ColBERT) and its adaptation to the visual modality of documents with the ColPali architecture. We will contrast this with traditional dense retrieval methods.
- 2.2. Framework Overview: Presentation of our novel framework that integrates a colpali-style retriever with a generative VLM for end-to-end document understanding.
- 2.3. Hands-on Session 2: Building a Colpali-style Document Retriever: Participants will implement a simplified version of a colpali retriever, learning how to generate multi-vector representations for document pages and perform late-interaction scoring.
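The late-interaction scoring exercised in Hands-on Session 2 can be sketched in a few lines of NumPy (names and toy embeddings are ours). Each query token keeps its own vector and each page is a bag of patch vectors; the MaxSim score sums, over query tokens, the best dot product against any patch:

```python
import numpy as np

def maxsim_score(query_emb, page_emb):
    """Late-interaction (MaxSim) score: for each query token vector,
    take its best match among the page's patch vectors, then sum."""
    sim = query_emb @ page_emb.T          # (n_query_tokens, n_patches)
    return sim.max(axis=1).sum()          # best match per token, summed

def rank_pages(query_emb, pages):
    scores = [maxsim_score(query_emb, p) for p in pages]
    return np.argsort(scores)[::-1], scores

# Toy embeddings: 2 query token vectors; pages with varying patch counts.
query = np.eye(2, 4)                                  # two orthogonal token vectors
page_a = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]])   # matches both tokens
page_b = np.array([[0, 0, 1.0, 0]])                   # matches neither
order, scores = rank_pages(query, [page_a, page_b])
```

Unlike the single-vector dense baseline, the page representation stays multi-vector until query time, which is what preserves fine-grained visual and textual detail.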
- Part 3: Scaling Document Intelligence with vLLM (60 minutes)
- The Challenge of LLM Inference: An overview of the computational costs and latency issues associated with deploying large generative models.
- High-Throughput Serving with vLLM: A detailed explanation of the key innovations in vLLM, such as PagedAttention and continuous batching, that enable efficient inference.
- Hands-on Session 3: Deploying a Document Understanding Service with vLLM: A practical demonstration of how to serve the VLM from our framework using vLLM to handle concurrent requests efficiently.
- Q&A and Future Directions: Open discussion and exploration of future research avenues in generative AI for document understanding.
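The PagedAttention idea discussed in Part 3 can be illustrated with a toy block allocator (ours, not vLLM's actual implementation): the KV cache is split into fixed-size blocks, and each sequence is handed blocks on demand, so memory grows in pages instead of one contiguous max-length buffer per request:

```python
class PagedKVCache:
    """Toy block allocator in the spirit of PagedAttention (not vLLM's code)."""
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}        # sequence id -> tokens written so far

    def append_token(self, seq_id: int) -> int:
        """Reserve space for one more token; grab a new block on a boundary."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:   # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("cache exhausted: preempt or swap a sequence")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return self.block_tables[seq_id][-1]  # physical block holding this token

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):        # sequence 0 writes three tokens -> needs two blocks
    cache.append_token(0)
cache.append_token(1)     # a second sequence draws from the same pool
```

Because finished sequences return their blocks immediately, new requests can be admitted continuously, which is the memory-side enabler of continuous batching.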
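For the deployment demonstration in Part 3, one way to stand up the service is vLLM's OpenAI-compatible server; the model id, port, and prompt below are illustrative, and flags may differ across vLLM versions:

```shell
# Launch an OpenAI-compatible endpoint for a vision-language model
# (model id and flags are illustrative; check the vLLM docs for your version).
vllm serve Qwen/Qwen2-VL-7B-Instruct --port 8000

# Query it with the standard chat-completions API.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2-VL-7B-Instruct",
       "messages": [{"role": "user", "content": "Summarize this invoice."}]}'
```

In the hands-on session, document images would be attached to the request as image content parts rather than plain text.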
Reading List
Introductory Readings (to be read before the tutorial):
- Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. arXiv preprint arXiv:2004.12832.
- Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020.
- Dosovitskiy, A., et al. (2020). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.
- Bai, J., et al. (2023). Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966.
Advanced Readings (for a deeper dive into the tutorial content):
- Faysse, M., et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv preprint arXiv:2407.01342.
- Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th ACM Symposium on Operating Systems Principles.
- Chen, Z., et al. (2024). InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. arXiv preprint arXiv:2412.05271.
- Marafioti, A., et al. (2025). SmolVLM: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299.
- Yao, Y., et al. (2024). MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv preprint arXiv:2408.01800.
- Auer, C., et al. (2025). Docling Technical Report. arXiv preprint arXiv:2501.07887.
- Bai, S., et al. (2025). Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.
- Lin, W., et al. (2023). Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering. Findings of the Association for Computational Linguistics: EMNLP 2023.
Vertical
Generative AI Models, AI in Education and Agentic AI
Industrial AI
Timeline
4 hours


