Semantic Data Engineering for Robustness Under Multimodal Settings

Website for SERUM Tutorial at WACV 2023, January 7, 2PM to 5PM

Hosted by Tejas Gokhale and Yezhou Yang (Arizona State University)


In the past decade, we have witnessed a paradigm shift in computer vision: the connection between vision and language (V+L) is now an integral part of AI. V+L comprises human-interactive tasks such as visual question answering, image captioning, visual dialog, visual entailment and grounding, V+L navigation, and text-to-image generation. The field has already influenced other research communities such as NLP, robotics, and graphics, and has direct industrial implications for software, arts, media, and journalism. As V+L models become widely adopted, new types of challenges and failure modes are emerging that have not been studied by previous work on robustness. Multimodal tasks involving both vision and language inputs open up intriguing domain discrepancies that can affect model performance at test time.

In this tutorial, we will show how semantic data transformation – i.e., data transformation guided by knowledge of the logical and semantic features of natural language – can improve the robustness of multimodal models.

Tentative Schedule

Time (UTC-10)  Topic  Presenter
1400--1415  Welcome and Introduction  Yezhou Yang (Associate Professor, ASU)
1415--1515  Plenary Talk: Towards Building Multimodal Foundation Models  Zhe Gan (Staff Research Scientist, Apple)
1515--1600  Robust Semantic Vision with Knowledge-Guided Data Transformation  Tejas Gokhale (Ph.D. Candidate, ASU)
1600--1620  Enhancing Video Captioning with Commonsense Descriptions  Yezhou Yang (Associate Professor, ASU)
1620--1645  Visual-Retriever-Reader for Knowledge-based Question Answering  Man Luo (Ph.D. Candidate, ASU)
1645--1700  Concluding Remarks  Tejas Gokhale (Ph.D. Candidate, ASU)

This website will be updated closer to the event date.

We acknowledge support from NSF Robust Intelligence grant #2132724.