Speakers
- José Enrique Pons (NTTData)
Abstract
Autonomous agents that can solve complex problems and make decisions in unexpected situations are a promising prospect, which is why businesses across many industries are currently exploring agentic AI.
This tutorial presents a research-grounded, practical guide to designing enterprise-ready multi-agent systems, drawing on our experience building our Agentic Platform.
We will describe design principles derived from theoretical concepts and research papers. First, agents, their basic architecture, and their capabilities will be introduced; their suitability for long-running, complex tasks motivates their use in industries such as supply chain management and finance. Second, agentic patterns, frameworks, and emerging communication protocols will be discussed. Third, we will cover different types of short-term and long-term memory and how to use them so that agents follow business rules. With a basic agent design in place, we will then consider agent observability and techniques for online and offline evaluation. Finally, we will discuss trustworthy AI, including the implementation of guardrails, fact checking, and objective alignment.
Attendees will learn the design principles for secure, reliable, and production-ready agent systems that reflect the current state of the art in research.
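To make the memory layering described above concrete, here is a minimal sketch, assuming illustrative class and method names (not taken from any framework covered in the tutorial): short-term memory is a bounded buffer of recent turns for the current objective, while long-term memory stores durable business rules retrieved here by naive keyword overlap as a stand-in for semantic search.

```python
# Two-layer agent memory sketch: a bounded short-term buffer for the
# current task, plus a long-term store of business rules retrieved by
# naive keyword overlap (a stand-in for real semantic retrieval).
from collections import deque


class AgentMemory:
    def __init__(self, short_term_size: int = 10):
        self.short_term = deque(maxlen=short_term_size)  # recent turns
        self.long_term: list[str] = []                   # durable rules

    def remember_turn(self, turn: str) -> None:
        # Oldest turns fall off automatically once the buffer is full.
        self.short_term.append(turn)

    def store_rule(self, rule: str) -> None:
        self.long_term.append(rule)

    def relevant_rules(self, query: str) -> list[str]:
        # Keyword overlap; a production system would use embeddings.
        words = set(query.lower().split())
        return [r for r in self.long_term if words & set(r.lower().split())]

    def build_context(self, query: str) -> str:
        # Assemble the prompt context: applicable rules, then recent turns.
        rules = "\n".join(self.relevant_rules(query))
        recent = "\n".join(self.short_term)
        return f"Rules:\n{rules}\n\nRecent turns:\n{recent}"


memory = AgentMemory(short_term_size=3)
memory.store_rule("Refunds above 500 EUR require manager approval.")
memory.remember_turn("User: I want a refund of 800 EUR.")
print(memory.build_context("refunds policy"))
```

The design choice worth noting is the asymmetry: the short-term layer is written on every turn and evicted automatically, while the long-term layer is written deliberately and consulted selectively, which is what keeps agents aligned with business rules over long sessions.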
Target Audience
This tutorial is aimed at AI/ML engineers, software architects, and AI researchers who are interested in understanding agentic architectures and how the latest research applies to enterprise applications.
Expected prior knowledge:
- Basic understanding of how an LLM works and basic knowledge about current architectures.
- Fundamental programming skills in any language; the code shown will be in Python.
- Some principles of composable software architecture.
Outline and Description of the Tutorial
- Introduction to agents: Defining agentic systems and the motivation to use them in the Enterprise. (20 min)
- Agentic patterns, frameworks, and protocols: Explores common agentic patterns (ReAct, Reflection, CoT). Overview of popular frameworks (SmolAgents, LangGraph, LlamaIndex, CrewAI). Emerging communication protocols (A2A, MCP), their benefits, and open concerns. (40 min)
- Agentic memory: Defining a multi-layer memory structure: short-term for objectives, long-term for business alignment. Emerging tools for semantic caching. (20 min)
- Observability and evaluation: Introduction to agent observability; motivation for online and offline evaluation. Emerging tools for agentic observability. (20 min)
- Trustworthy AI: Implementation of safety layers: Guardrails, Fact Checks, and objective alignment. (20 min)
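The patterns session centers on the ReAct loop. The sketch below is a minimal, self-contained illustration of that loop, with a hard-coded `fake_llm` function standing in for a real model call and a toy `calculator` tool; these names are assumptions for the example, not the API of any framework listed above.

```python
# Minimal ReAct-style loop: the model alternates between emitting a
# Thought/Action pair and receiving an Observation from a tool, until
# it emits a final Answer. `fake_llm` scripts the model's two turns.

def calculator(expression: str) -> str:
    """Toy tool: evaluate an arithmetic expression (no builtins exposed)."""
    return str(eval(expression, {"__builtins__": {}}, {}))


TOOLS = {"calculator": calculator}


def fake_llm(transcript: str) -> str:
    # A real agent would call an LLM here; we script two turns instead.
    if "Observation:" not in transcript:
        return "Thought: I need to compute this.\nAction: calculator[17 * 3]"
    return "Answer: 51"


def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = fake_llm(transcript)
        if step.startswith("Answer:"):
            return step.removeprefix("Answer:").strip()
        # Parse "Action: tool[argument]" and dispatch to the tool registry.
        action = step.split("Action:")[1].strip()
        tool_name, arg = action.split("[", 1)
        observation = TOOLS[tool_name.strip()](arg.rstrip("]"))
        # Feed the observation back so the next turn can reason over it.
        transcript += f"\n{step}\nObservation: {observation}"
    return "No answer within step budget."


print(react("What is 17 * 3?"))  # → 51
```

The `max_steps` budget is the essential production detail: without it, a model that never emits an Answer loops forever, which is one of the reliability concerns the observability and guardrails sessions address.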
Reading List
Introductory papers: These papers provide an overview of agentic AI through surveys of several topics covered in the tutorial:
- Luo, Junyu, Weizhi Zhang, Ye Yuan, et al. “Large Language Model Agent: A Survey on Methodology, Applications and Challenges.” arXiv:2503.21460. Preprint, arXiv, March 27, 2025. https://doi.org/10.48550/arXiv.2503.21460.
- Du, Shangheng, Jiabao Zhao, Jinxin Shi, et al. “A Survey on the Optimization of Large Language Model-Based Agents.” arXiv:2503.12434. Preprint, arXiv, March 16, 2025. https://doi.org/10.48550/arXiv.2503.12434.
- Yang, Yingxuan, Huacan Chai, Yuanyi Song, et al. “A Survey of AI Agent Protocols.” arXiv:2504.16736. Preprint, arXiv, June 21, 2025. https://doi.org/10.48550/arXiv.2504.16736.
- Pradhan, Anu, Alexandra Ortan, Apurv Verma, and Madhavan Seshadri. “LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation.” arXiv:2509.12382. Preprint, arXiv, September 15, 2025. https://doi.org/10.48550/arXiv.2509.12382.
- Yehudai, Asaf, Lilach Eden, Alan Li, et al. “Survey on Evaluation of LLM-Based Agents.” arXiv:2503.16416. Preprint, arXiv, March 20, 2025. https://doi.org/10.48550/arXiv.2503.16416.
There are short courses from Hugging Face that can help the attendees get hands-on practice:
- https://huggingface.co/agents-course: Agents course: Covers, from a developer perspective, the most popular frameworks: SmolAgents, LangGraph, and LlamaIndex.
- https://huggingface.co/mcp-course: MCP course: The popular Model Context Protocol is gaining traction as a way for agents to use tools and act in the real world.
- https://huggingface.co/learn/llm-course/: Fine-tuning Language Models: Covers instruction tuning, preference alignment, and reinforcement learning.
Books
Two books by Chip Huyen are recommended: “AI Engineering: Building Applications with Foundation Models” for an overview of building generative AI applications, and “Designing Machine Learning Systems” for an overview of classical ML. Her blog is also a valuable source of information: https://huyenchip.com/blog/
The following reading list includes and extends the topics that will be covered during the tutorial.
Foundation models, architecture, and problems:
- Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. “Attention Is All You Need.” arXiv:1706.03762. Preprint, August 2, 2023. https://doi.org/10.48550/arXiv.1706.03762.
- Zhang, Muru, Ofir Press, William Merrill, Alisa Liu, and Noah A. Smith. “How Language Model Hallucinations Can Snowball.” arXiv:2305.13534. Preprint, May 22, 2023. https://doi.org/10.48550/arXiv.2305.13534.
- Ouyang, Long, Jeff Wu, Xu Jiang, et al. “Training Language Models to Follow Instructions with Human Feedback.” arXiv:2203.02155. Preprint, March 4, 2022. https://doi.org/10.48550/arXiv.2203.02155.
- Villalobos, Pablo, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. “Will We Run out of Data? Limits of LLM Scaling Based on Human-Generated Data.” arXiv:2211.04325. Preprint, June 4, 2024. https://doi.org/10.48550/arXiv.2211.04325.
Evaluation methodologies:
- Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space.” arXiv:1301.3781. Preprint, September 7, 2013. https://doi.org/10.48550/arXiv.1301.3781.
- Gehrmann, Sebastian, Abhik Bhattacharjee, Abinaya Mahendiran, et al. “GEMv2: Multilingual NLG Benchmarking in a Single Line of Code.” arXiv:2206.11249. Preprint, June 24, 2022. https://doi.org/10.48550/arXiv.2206.11249.
- Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” arXiv:1804.07461. Preprint, February 22, 2019. https://doi.org/10.48550/arXiv.1804.07461.
- Wang, Yubo, Xueguang Ma, Ge Zhang, et al. “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark.” arXiv:2406.01574. Preprint, November 6, 2024. https://doi.org/10.48550/arXiv.2406.01574.
- Muennighoff, Niklas, Nouamane Tazi, Loïc Magne, and Nils Reimers. “MTEB: Massive Text Embedding Benchmark.” arXiv:2210.07316. Preprint, March 19, 2023. https://doi.org/10.48550/arXiv.2210.07316.
- Wang, Alex, Yada Pruksachatkun, Nikita Nangia, et al. “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems.” arXiv:1905.00537. Preprint, February 13, 2020. https://doi.org/10.48550/arXiv.1905.00537.
AI as a Judge:
- Zhu, Lianghui, Xinggang Wang, and Xinlong Wang. “JudgeLM: Fine-Tuned Large Language Models Are Scalable Judges.” arXiv:2310.17631. Preprint, March 1, 2025. https://doi.org/10.48550/arXiv.2310.17631.
- Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” arXiv:2306.05685. Preprint, December 24, 2023. https://doi.org/10.48550/arXiv.2306.05685.
- Valmeekam, Karthik, Matthew Marquez, and Subbarao Kambhampati. “Can Large Language Models Really Improve by Self-Critiquing Their Own Plans?” arXiv:2310.08118. Preprint, October 12, 2023. https://doi.org/10.48550/arXiv.2310.08118.
Ranking Foundation Models:
- Boubdir, Meriem, Edward Kim, Beyza Ermis, Sara Hooker, and Marzieh Fadaee. “Elo Uncovered: Robustness and Best Practices in Language Model Evaluation.” arXiv:2311.17295. Preprint, November 29, 2023. https://doi.org/10.48550/arXiv.2311.17295.
- Munos, Rémi, Michal Valko, Daniele Calandriello, et al. “Nash Learning from Human Feedback.” arXiv:2312.00886. Preprint, June 11, 2024. https://doi.org/10.48550/arXiv.2312.00886.
Evaluation of AI systems:
- Zhong, Wanjun, Ruixiang Cui, Yiduo Guo, et al. “AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models.” arXiv:2304.06364. Preprint, September 18, 2023. https://doi.org/10.48550/arXiv.2304.06364.
- Luo, Zheheng, Qianqian Xie, and Sophia Ananiadou. “ChatGPT as a Factual Inconsistency Evaluator for Text Summarization.” arXiv:2303.15621. Preprint, April 13, 2023. https://doi.org/10.48550/arXiv.2303.15621.
- Sprague, Zayne, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. “MuSR: Testing the Limits of Chain-of-Thought with Multistep Soft Reasoning.” arXiv:2310.16049. Preprint, March 23, 2024. https://doi.org/10.48550/arXiv.2310.16049.
Prompt engineering, attacks, and defenses:
- Huang, Jie, Hanyin Shao, and Kevin Chen-Chuan Chang. “Are Large Pre-Trained Language Models Leaking Your Personal Information?” arXiv:2205.12628. Preprint, October 20, 2022. https://doi.org/10.48550/arXiv.2205.12628.
- Carlini, Nicholas, Florian Tramer, Eric Wallace, et al. “Extracting Training Data from Large Language Models.” arXiv:2012.07805. Preprint, June 15, 2021. https://doi.org/10.48550/arXiv.2012.07805.
- Chao, Patrick, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. “Jailbreaking Black Box Large Language Models in Twenty Queries.” arXiv:2310.08419. Preprint, July 18, 2024. https://doi.org/10.48550/arXiv.2310.08419.
- Wallace, Eric, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions.” arXiv:2404.13208. Preprint, April 19, 2024. https://doi.org/10.48550/arXiv.2404.13208.
- Zou, Andy, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. “Universal and Transferable Adversarial Attacks on Aligned Language Models.” arXiv:2307.15043. Preprint, December 20, 2023. https://doi.org/10.48550/arXiv.2307.15043.
- Zhu, Kaijie, Jindong Wang, Jiaheng Zhou, et al. “PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts.” arXiv:2306.04528. Preprint, July 16, 2024. https://doi.org/10.48550/arXiv.2306.04528.
Agents, patterns, RAG, and memory management:
- Lewis, Patrick, Ethan Perez, Aleksandra Piktus, et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” arXiv:2005.11401. Preprint, April 12, 2021. https://doi.org/10.48550/arXiv.2005.11401.
- Shinn, Noah, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. “Reflexion: Language Agents with Verbal Reinforcement Learning.” arXiv:2303.11366. Preprint, arXiv, October 10, 2023. https://doi.org/10.48550/arXiv.2303.11366.
- Yang, John, Carlos E. Jimenez, Alexander Wettig, et al. “SWE-Agent: Agent-Computer Interfaces Enable Automated Software Engineering.” arXiv:2405.15793. Preprint, arXiv, November 11, 2024. https://doi.org/10.48550/arXiv.2405.15793.
- Liu, Lei, Xiaoyan Yang, Yue Shen, et al. “Think-in-Memory: Recalling and Post-Thinking Enable LLMs with Long-Term Memory.” arXiv:2311.08719. Preprint, arXiv, November 15, 2023. https://doi.org/10.48550/arXiv.2311.08719.
- Bae, Sanghwan, Donghyun Kwak, Soyoung Kang, et al. “Keep Me Updated! Memory Management in Long-Term Conversations.” arXiv:2210.08750. Preprint, arXiv, October 17, 2022. https://doi.org/10.48550/arXiv.2210.08750.
Vertical
Generative AI Models, AI in Education and Agentic AI
Industrial AI
Timeline
2 hours


