The 2nd Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models


NeurIPS 2025 (Tentative)

San Diego, USA

Prior edition: NeurIPS 2024

Introduction

In recent years, the importance of multimodal approaches integrating language, images, video, and audio has grown rapidly, driven by their impact in fields such as robotics. However, these technologies present critical challenges in reliability and security, with large language models (LLMs) producing "hallucinations" and text-to-image models potentially generating harmful content. These challenges have become particularly evident with the release of models like GPT-4o and Veo 3.

Current efforts to address these issues often create new challenges of their own, leading to resource-intensive cycles of problem-solving. This highlights the urgent need for preemptive and sustainable approaches to developing the next generation of multimodal foundational models. It is crucial to establish robust design principles before building models that can understand and generate multiple modalities.

Our workshop serves as a platform to unite the community in shaping responsible design principles for next-generation multimodal models. This year, we especially encourage discussion and research on multimodal unified architectures, agentic AI systems, and human-value alignment. Our focus is on early detection and prevention of reliability issues, promoting security, equity, and sustainability in AI progress.

The goals of this workshop are to:

  • Discuss the next generation of multimodal foundational models beyond GPT-4o and Veo 3.
  • Discuss methodologies that enhance the reliability of multimodal models, tackling key issues such as fairness, security, misinformation, and hallucinations.
  • Enhance the robustness of these models against adversarial and backdoor attacks, thereby securing their integrity in adversarial environments.
  • Identify the sources of reliability concerns, whether they stem from data quality, model architecture, or pre-training strategies.
  • Explore novel design principles emphasizing responsibility and sustainability in multimodal models, aiming to reduce their extensive data and computational demands.


More details are coming soon! Stay tuned!



Keynotes

Stefano Ermon


Stanford & Inception Labs

Website | Google Scholar
Chaowei Xiao


University of Wisconsin–Madison & NVIDIA

Website | Google Scholar
Lama Nachman


Intel Labs

Website | Google Scholar
Sander Dieleman


Google DeepMind

Website | Google Scholar

Panel

Rising Stars Lightning Talks

Organizers