Website for SERUM Tutorial at WACV 2023, January 7, 2PM to 5PM
Hosted by Tejas Gokhale and Yezhou Yang (Arizona State University)
In the past decade, we have witnessed a paradigm shift in computer vision – the connection between vision and language (V+L) is now an integral part of AI. V+L comprises human-interactive tasks such as visual question answering, image captioning, visual dialog, visual entailment and grounding, V+L navigation, and text-to-image generation. This field has already had an impact on other research communities such as NLP, robotics, and graphics, and has direct industrial implications for software, arts, media, and journalism. As V+L models become widely adopted, new types of challenges and failure modes are emerging that have not been studied by previous work on robustness. Multi-modal tasks involving both vision and language inputs open up intriguing domain discrepancies that can affect model performance at test time.
In this tutorial, we will show how semantic data transformation – i.e., data transformation guided by knowledge of the logical and semantic features of natural language – can be used to improve the robustness of V+L models to such domain shifts.
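As a concrete illustration (a minimal sketch, not taken from the tutorial materials), one simple logic-guided transformation composes two yes/no questions about the same image with a logical connective and derives the new answer from the logic, producing additional training pairs that probe whether a model respects the logical structure of language. The function name and example questions below are hypothetical.

```python
# Minimal sketch of a logic-guided data transformation for yes/no VQA pairs.
# Two questions about the same image are joined with "and"; the new answer
# follows from the logical conjunction of the original answers.

def compose_with_and(qa1: tuple[str, str], qa2: tuple[str, str]) -> tuple[str, str]:
    """Combine two (question, answer) pairs into a single conjunctive pair."""
    q1, a1 = qa1
    q2, a2 = qa2
    new_question = q1.rstrip("?") + " and " + q2[0].lower() + q2[1:].rstrip("?") + "?"
    new_answer = "yes" if (a1 == "yes" and a2 == "yes") else "no"
    return new_question, new_answer


if __name__ == "__main__":
    print(compose_with_and(("Is the man holding an umbrella?", "yes"),
                           ("Is it raining?", "no")))
    # -> ('Is the man holding an umbrella and is it raining?', 'no')
```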
Time (UTC-10) | Topic | Presenter
---|---|---
1400--1415 | Welcome and Introduction | Yezhou Yang (Associate Professor, ASU)
1415--1515 | Plenary Talk: Towards Building Multimodal Foundation Models | Zhe Gan (Staff Research Scientist, Apple)
1515--1600 | Robust Semantic Vision with Knowledge-Guided Data Transformation | Tejas Gokhale (Ph.D. Candidate, ASU)
1600--1620 | Enhancing Video Captioning with Commonsense Descriptions | Yezhou Yang (Associate Professor, ASU)
1620--1645 | Visual-Retriever-Reader for Knowledge-based Question Answering | Man Luo (Ph.D. Candidate, ASU)
1645--1700 | Concluding Remarks | Tejas Gokhale (Ph.D. Candidate, ASU)
This website will be updated closer to the event date.
We acknowledge support from NSF Robust Intelligence grant #2132724.