O-DRUM Workshop @ CVPR 2023

Open-Domain Reasoning Under Multi-Modal Settings


Video Recording

June 19, 2023 | 8:30 AM – 5:00 PM PDT | Vancouver

ODRUM 2022 Archive: [Webpage] YouTube


Description

AI has undergone a paradigm shift in the past decade: the connection between vision and language (V+L) is now an integral part of AI, with deep impact beyond vision and NLP. Robotics, graphics, cybersecurity, and HCI increasingly rely on V+L tools, and there are direct industrial implications for software, arts, and media. The link between vision and language is much more complex than simple image-text alignment; researchers are now pursuing the use of language for reasoning beyond the visible, for example physical reasoning, spatial reasoning, commonsense reasoning, and embodied reasoning. Open-Domain Reasoning Under Multi-Modal Settings (O-DRUM 2023) provides a platform for discussion of multimodal (vision + language) topics, with special emphasis on reasoning capabilities.

The aim of O-DRUM 2023 is to address the emerging topic of visual reasoning using multiple modalities (such as text, images, videos, and audio). The workshop features invited talks by experts on topics such as embodied AI, navigation, learning via interaction and collaboration with humans, building large V+L models that can perform multiple tasks, visual grounding, and the use of language to instruct robots. Participants and speakers will then convene for a panel discussion on the importance of reasoning (a core AI topic with a rich history dating back to the 1950s) to computer vision, its relevance to recent progress in visual reasoning, and the trends and challenges in open-domain reasoning, from the perspectives of NLP, vision, machine learning, and robotics researchers.



Invited Speakers and Panelists


Jiajun Wu
Assistant Professor
Stanford University

(Talk: 10:40 – 11:30 PDT)
(Panel: 16:00 – 17:00 PDT)

Alane Suhr
Young Investigator
Allen Institute for AI

(Talk: 13:20 – 14:10 PDT)
(Panel: 16:00 – 17:00 PDT)

Angel Xuan Chang
Assistant Professor
Simon Fraser University

(Talk: 14:10 – 15:00 PDT)
(Panel: 16:00 – 17:00 PDT)

Schedule

08:30 – 08:45    Welcome and Introduction
08:45 – 09:35    Karel Lenc: Evaluating and Training Large Language Models with Vision Capabilities
09:35 – 10:00    Spotlight Talks
10:00 – 10:40    Poster Session + Coffee Break
10:40 – 11:30    Jiajun Wu: Concept Learning Across Domains and Modalities
11:30 – 12:20    Srinath Sridhar: Multi-modality in 3D Scene Understanding
12:20 – 13:20    Lunch
13:20 – 14:10    Alane Suhr: Two Approaches to Grounded Language Evaluation
14:10 – 15:00    Angel Chang: Reasoning with language in 3D
15:00 – 16:00    Poster Session 2 + Coffee Break + Socials
16:00 – 17:15    Panel Discussion + Concluding Remarks


Accepted Papers

Link to O-DRUM 2023 Proceedings (on CVF Open Access)

  • Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
    Pan Lu (The University of California, Los Angeles)*; Swaroop Ranjan Mishra (self); Tony Xia (UCLA); Liang Qiu (University of California, Los Angeles); Kai-Wei Chang (UCLA); Song-Chun Zhu (UCLA); Oyvind Tafjord (AI2); Peter Clark (Allen Institute for AI); Ashwin Kalyan (Georgia Institute of Technology)
  • Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning
    Pan Lu (The University of California, Los Angeles)*; Liang Qiu (University of California, Los Angeles); Kai-Wei Chang (UCLA); Ying Nian Wu (University of California, Los Angeles); Song-Chun Zhu (UCLA); Tanmay Rajpurohit (Georgia Institute of Technology); Peter Clark (Allen Institute for AI); Ashwin Kalyan (Georgia Institute of Technology)
  • TEVAD: Improved video anomaly detection with captions
    Weiling Chen (Hyundai Motor Group Innovation Center in Singapore)*; Keng Teck Ma (Hyundai Motor Group Innovation Center in Singapore); Zi Jian Yew (Hyundai Motor Group Innovation Center in Singapore); Minhoe Hur (AIRS Company, Hyundai Motor Group); David AA Khoo (Hyundai Motor Group Innovation Center in Singapore)
  • VLMs and LLMs Can Help Detect Human-Object Interactions with Weak Supervision
    Mesut Erhan Unal (University of Pittsburgh)*; Adriana Kovashka (University of Pittsburgh)
  • Enhancing the Role of Context in Region-Word Alignment for Object Detection
    Kyle R Buettner (University of Pittsburgh)*; Adriana Kovashka (University of Pittsburgh)
  • eP-ALM: Efficient Perceptual Augmentation of Language Models
    Mustafa Shukor (Sorbonne University)*; Corentin Dancette (Sorbonne Université); Matthieu Cord (Sorbonne University)
  • Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
    Ryan D Burgert (Stony Brook University); Kanchana N Ranasinghe (Stony Brook University)*; Xiang Li (Stony Brook University); Michael S Ryoo (Stony Brook/Google)
  • Generative Bias for Robust Visual Question Answering
    Jae Won Cho (KAIST)*; Dong-Jin Kim (Hanyang University); Hyeonggon Ryu (KAIST); In So Kweon (KAIST)
  • Improving language-supervised object detection with linguistic structure analysis
    Arushi Rai (University of Pittsburgh)*; Adriana Kovashka (University of Pittsburgh)
  • VEIL: Vetting Extracted Image Labels from In-the-Wild Captions for Weakly-Supervised Object Detection
    Arushi Rai (University of Pittsburgh)*; Adriana Kovashka (University of Pittsburgh)
  • Boosting Weakly Supervised Object Detection using Hallucinated Depth
    Cagri Gungor (University of Pittsburgh)*; Adriana Kovashka (University of Pittsburgh)
  • BMRN: Boundary Matching and Refinement Network for Temporal Moment Localization with Natural Language
    Muah Seol (ETRI); Jonghee Kim (ETRI); Jinyoung Moon (Electronics and Telecommunications Research Institute)*
  • Making the V in Text-VQA Matter
    Shamanthak Hegde (KLE Technological University, Hubballi)*; Soumya Shamarao Jahagirdar (International Institute of Information Technology Hyderabad); Shankar Gangisetty (IIIT Hyderabad)
  • Weakly Supervised Visual Question Answer Generation
    Charani Alampalle (AlphaICs); Shamanthak Hegde (KLE Technological University, Hubballi)*; Soumya Shamarao Jahagirdar (International Institute of Information Technology Hyderabad); Shankar Gangisetty (IIIT Hyderabad)
  • Visual Semantic Relatedness Dataset for Image Captioning
    Ahmed A Sabir (Universitat Politècnica de Catalunya)*; Francesc Moreno (IRI); Lluís Padró (Universitat Politècnica de Catalunya)
  • CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes
    Maria Parelli (ETH Zurich); Alexandros Delitzas (ETH Zurich)*; Nikolas Hars (ETH Zurich); Georgios Vlassis (ETH Zurich); Sotirios-Konstantinos Anagnostidis (ETH Zurich); Gregor Bachmann (ETH Zurich); Thomas Hofmann (ETH Zurich)
  • T2V2T: Text-to-Video-to-Text Fusion for Text-to-Video Retrieval
    Jonghee Kim (ETRI)*; Youngwan Lee (ETRI); Jinyoung Moon (Electronics and Telecommunications Research Institute)
  • An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
    Saba Ahmadi (University of Montreal, Mila)*; Aishwarya Agrawal (University of Montreal, Mila, DeepMind)
  • Distilling from Vision-Language Models for Improved OOD Generalization in Image Classification
    Sravanti Addepalli (Indian Institute of Science)*; Ashish R Asokan (Indian Institute of Science); Lakshay Sharma (Indian Institute of Science); Venkatesh Babu Radhakrishnan (Indian Institute of Science)
  • RMLVQA: A Margin Loss Approach For Visual Question Answering with Language Biases
    Abhipsa Basu (Indian Institute of Science, Bangalore)*; Sravanti Addepalli (Indian Institute of Science); Venkatesh Babu Radhakrishnan (Indian Institute of Science)
  • ViperGPT: Visual Inference via Python Execution for Reasoning
    Dídac Surís (Columbia University); Sachit Menon (Columbia University)*; Carl Vondrick (Columbia University)
  • Curriculum Learning for Data-Efficient Vision-Language Alignment
    Tejas Srinivasan (University of Southern California)*; Xiang Ren (University of Southern California); Jesse Thomason (University of Southern California)
  • VLN Pretraining Still Works with Nonsensical or Irrelevant Instructions
    Wang Zhu (University of Southern California)*; Ishika Singh (University of Southern California); Yuan Huang (University of Southern California); Robin Jia; Jesse Thomason (University of Southern California)
  • Distinguish and Rank hard-negatives to enhance compositional understanding
    Le Zhang (Mila)*; Aishwarya Agrawal (University of Montreal, Mila, DeepMind)
  • MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
    Oscar Mañas (Mila - Quebec AI Institute, Université de Montréal)*; Pau Rodriguez (Apple); Saba Ahmadi (Mila - Quebec AI Institute, Université de Montréal); Aida Nematzadeh (DeepMind); Yash Goyal (Samsung - SAIT AI Lab Montreal); Aishwarya Agrawal (University of Montreal, Mila, DeepMind)
  • In Search For A Good Prompt in Zero-shot Visual Question Answering
    Md Rabiul Awal (Mila)*; Le Zhang (Mila); Aishwarya Agrawal (University of Montreal, Mila, DeepMind)
  • SQA3D: Situated Question Answering in 3D Scenes
    Xiaojian Ma (University of California, Los Angeles)*

Poster Gallery

[Poster images for accepted papers]

Organizers


Tejas Gokhale

Assistant Professor
UMBC

Man Luo

Postdoctoral Researcher
Mayo Clinic

Yezhou Yang

Associate Professor
ASU

Chitta Baral

Professor
ASU

Kenneth Marino

Research Scientist
DeepMind

Zhiyuan Fang

Applied Scientist
Amazon Alexa AI

Pratyay Banerjee

Applied Scientist
Amazon Alexa AI

Please contact Man Luo (mluo26@asu.edu) or Tejas Gokhale (tgokhale@asu.edu) for additional details.
The workshop is supported by US National Science Foundation grants 1816039 and 2132724 as part of their Research, Education, and Outreach activities.

Website maintained by Tejas Gokhale