C08: Deep Foundation Models for Multimodal Analysis


Monday, 23 June 2025, 08:30 - 12:30 CEST (Central European Summer Time - Sweden)

Björn W. Schuller (short bio)
CHI - Chair of Health Informatics, Technical University of Munich, Germany
GLAM - Group on Language, Audio & Music, Imperial College London, UK

Andreas Triantafyllopoulos (short bio)
CHI - Chair of Health Informatics, Technical University of Munich, Germany

Modality

Online

Target Audience

The target audience is the broad audience of HCI International. The course first introduces general principles of deep learning in an introductory manner and then moves to recent approaches for fusion with deep foundation models. In this way, we also target intermediate- to advanced-level participants from general HCI and intelligent interaction, as well as deep learning and machine learning expert attendees.

Abstract

Multimodal foundation models (MFMs) have taken the world by storm in recent months. Initial versions, known as large language models (LLMs), were focused on textual inputs and outputs, but the research community quickly co-opted these models for processing multimodal inputs. Following this initial success, new models were released that could natively process and produce multimodal information. In this course, we present the recent history of deep multimodal methods and outline recent developments in the state of the art for multimodal fusion.

Objectives

The objective of this course is to introduce recent deep learning methods for the fusion of multimodal information streams. Text, audio, facial features, and gestures are some of the modalities typically available during human-machine and human-human interactions. Traditional methods processed each of them independently, with custom pipelines whose outputs were integrated at a late, decision-level fusion stage. Modern methods relying on deep learning can accommodate multiple modalities in parallel. In particular, state-of-the-art foundation models are increasingly multimodal, capable of analysing inputs and synthesising outputs in several different formats.
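As a minimal illustration of the traditional decision-level (late) fusion mentioned above, consider two independent unimodal classifiers whose class-probability outputs are combined only at the very end. The sketch below uses NumPy; the probability vectors and weights are hypothetical placeholders, not outputs of any actual model discussed in the course.

```python
import numpy as np

def late_fusion(probs_per_modality, weights=None):
    """Decision-level (late) fusion: combine each modality's
    class-probability vector into a single prediction by a
    (weighted) average."""
    probs = np.stack(probs_per_modality)  # (n_modalities, n_classes)
    if weights is None:
        weights = np.ones(len(probs))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = (weights[:, None] * probs).sum(axis=0)
    return fused, int(fused.argmax())

# Hypothetical outputs of two independently trained unimodal models
audio_probs = np.array([0.7, 0.2, 0.1])  # e.g. an audio emotion classifier
text_probs = np.array([0.3, 0.6, 0.1])   # e.g. a text sentiment classifier

fused, label = late_fusion([audio_probs, text_probs])
```

Each pipeline here can be developed and tuned in isolation; the price is that the models never share information before the final decision, which is precisely the limitation that deep, parallel multimodal processing addresses.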

We will present deep fusion methods for a variety of multimodal information streams. We begin by reviewing earlier fusion methods, such as early and late fusion, and progress to modern-era deep, intermediate-level fusion. Following that, we will present contemporary architectures for multimodal integration relying on foundation models, where we will differentiate between native and cascade models. Additional emphasis will be placed on the different stages of training that such models undergo, distinguishing between unsupervised, supervised, and reinforcement-learning-based training.
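The distinction between early and intermediate fusion drawn above can be sketched in a few lines. In the toy NumPy example below, early fusion concatenates raw feature vectors before any modelling, whereas intermediate fusion first passes each modality through its own (here, deliberately tiny) encoder and concatenates the learned representations; the feature dimensions and random weights are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w):
    """Toy per-modality encoder: a single linear layer with ReLU."""
    return np.maximum(x @ w, 0.0)

# Dummy features for two modalities (a batch of 4 samples)
audio = rng.normal(size=(4, 16))  # e.g. acoustic frame features
text = rng.normal(size=(4, 32))   # e.g. token embeddings

# Early fusion: concatenate raw features, then feed one joint model
early = np.concatenate([audio, text], axis=1)  # shape (4, 48)

# Intermediate fusion: encode each modality separately first,
# then concatenate the learned representations
w_audio = rng.normal(size=(16, 8))
w_text = rng.normal(size=(32, 8))
intermediate = np.concatenate(
    [encoder(audio, w_audio), encoder(text, w_text)], axis=1
)  # shape (4, 16)
```

In practice the per-modality encoders are deep networks trained jointly with the fusion layers, which is what distinguishes deep intermediate fusion from the fixed, handcrafted pipelines of the traditional approaches.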

Benefits for attendees

Attendees are expected to obtain a comprehensive understanding of both traditional and contemporary multimodal fusion methods. In particular, they will leave the course with a deep understanding of the multimodal foundation models that have been spearheading advances in artificial intelligence over the past few years.

Course Content


Table of contents

  1. Multimodal Fusion
    1. Objectives
    2. Brief History
  2. Traditional Fusion Methodologies
    1. Early Fusion
    2. Late Fusion
    3. Deep Intermediate Fusion
  3. Multimodal Foundation Models
    1. Introduction to Large Language Models
    2. Introduction to Foundation Models
  4. Conclusion
    1. Summary
    2. Comparison: Traditional vs. Foundation Fusion
    3. Future Perspectives
  5. Q&A

Bio Sketch of Course instructors

Björn W. Schuller received his diploma, doctoral degree, habilitation, and Adjunct Teaching Professorship in Machine Intelligence and Signal Processing, all in EE/IT, from TUM in Munich/Germany, where he is Full Professor and Chair of Health Informatics. He is also Full Professor of Artificial Intelligence and the Head of GLAM at Imperial College London/UK, co-founding CEO and current CSO of audEERING – an Audio Intelligence company based near Munich and in Berlin/Germany, Core Member in the Munich Data Science Institute (MDSI), Principal Investigator in the Munich Center for Machine Learning (MCML), Fellow of the Imperial Data Science Institute, and permanent Honourable Dean at TJNU/China and Visiting Professor at HIT/China, amongst other professorships and affiliations. Previous stays include Full Professor and Chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg/Germany, independent research leader within the Alan Turing Institute as part of the UK Health Security Agency, Guest Professor at Southeast University in Nanjing/China, Full Professor at the University of Passau/Germany, Key Researcher at Joanneum Research in Graz/Austria, and the CNRS-LIMSI in Orsay/France. He is a Fellow of the ACM, Fellow of the IEEE and Golden Core Awardee of the IEEE Computer Society, Fellow of the BCS, Fellow of the ELLIS, Fellow of the ISCA, Fellow and President-Emeritus of the AAAC, and Elected Full Member of Sigma Xi. He (co-)authored 1,500+ publications (60,000+ citations, h-index >110, ranking him number 8 in the UK for Computer Science), is Field Chief Editor of Frontiers in Digital Health, Editor in Chief of AI Open, and was Editor in Chief of the IEEE Transactions on Affective Computing, amongst manifold further commitments and service to the community. His 50+ awards include having been honoured as one of 40 extraordinary scientists under the age of 40 by the WEF in 2015.
He was recently named an ACM Distinguished Speaker for the term 2024-2027 and an IEEE Signal Processing Society Distinguished Lecturer for 2024. He has served as Coordinator/PI in 20+ European Projects, is an ERC Starting and DFG Reinhart-Koselleck Grantee, and a consultant for companies such as Barclays, GN, Huawei, Informetis, and Samsung. Schuller counts more than 300 public press appearances, including in Business Insider, Guardian, International Business Times, Newsweek, Scientific American, Times, The Economist 1843, and UK Daily Mail, as well as national and international podcast, radio, and television contributions such as in MIT Technology Review and “The World” and “The Why”.

Andreas Triantafyllopoulos is a doctoral candidate with the Chair of Health Informatics of the Technical University of Munich. He received his diploma (M.Sc. equivalent) in Electrical and Computer Engineering from the University of Patras in 2017. He has co-authored 30+ publications (800+ citations, h-index 13), and his core research focus lies in multimodal intelligence for affective computing and healthcare.

Course history

  • Nicholas Cummins, Björn Schuller: Invited half-day Tutorial “Latest Advances in Deep Learning for Multimodal and Multisensorial Signal Analysis”, 23rd International Conference on Human-Computer Interaction (HCII 2021), Washington, DC, 24.-29.07.2021.
  • Björn Schuller, Nicholas Cummins: Invited half-day Tutorial “Deep Learning for Multimodal and Multisensorial Interaction”, 21st International Conference on Human-Computer Interaction (HCII 2019), Orlando, FL, 26.-31.07.2019.
  • Björn Schuller: Tutorial “Deep Learning for Multimodal and Multisensorial Interaction”, 20th ACM International Conference on Multimodal Interaction (ICMI), ACM, Boulder, CO, 16.-20.10.2018.
  • Nicholas Cummins, Björn Schuller: Tutorial “Recent advances in multisensory intelligent signal analysis”, IEEE 10th Sensor Array and Multichannel Signal Processing Workshop (SAM 2018), IEEE, Sheffield, UK, 08.-11.07.2018.