Seminar Series

Spring 2021: Frontier topics in Vision and/or Language

Active Perception Group

About the Seminar

We are excited to host the first installment of the seminar series, virtually via Zoom. In Spring 2021, we will feature researchers working at the forefront of topics in vision and language.

YouTube Playlist


Date Speaker Talk Summary Video
Jan 18

Zhou Yu

Zhou Yu is an Assistant Professor in the CS department at Columbia University. She obtained her Ph.D. from Carnegie Mellon University in 2017. Zhou has built various dialog systems that have a real impact, such as a job interview training system, a depression screening system, and a second language learning system. Her research interests include dialog systems, language understanding and generation, vision and language, human-computer interaction, and social robots. Zhou received an ACL 2019 best paper nomination, was featured in Forbes 2018 30 under 30 in Science, and won the 2018 Amazon Alexa Prize.

Personalized Persuasive Dialog Systems

Abstract: Dialog systems such as Alexa and Siri are everywhere in our lives. They can complete tasks such as booking flights, making restaurant reservations, and training people for interviews. These systems passively follow along with human needs. What if a dialog system has a different goal than its users? We introduce dialog systems that can persuade users to donate to charities. We further improve the dialog model's coherence by tracking both semantic actions and conversational strategies from dialog history using finite-state transducers. Finally, we analyze some ethical concerns and human factors in deploying personalized persuasive dialog systems.
Jan 27

Fuxin Li

Dr. Fuxin Li is currently an assistant professor in the School of Electrical Engineering and Computer Science at Oregon State University. Before that, he held research positions at the University of Bonn and the Georgia Institute of Technology. He obtained a Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2009. He has won the NSF CAREER award, (co-)won the PASCAL VOC semantic segmentation challenges from 2009-2012, and led a team to a 4th place finish in the DAVIS Video Segmentation challenge 2017. He has published more than 50 papers in computer vision, machine learning and natural language processing. His main research interests are deep learning, video object segmentation, multi-target tracking, point cloud deep networks, adversarial deep learning and human understanding of deep learning.

Some Understandings and New Designs of Convolutional Networks

Abstract: We will talk about our recent work in understanding and creating new architectures for convolutional networks, which have achieved great success in computer vision, but might require an overhaul to drive deep learning to a new level. We will start by summarizing our various endeavors on understanding the mechanisms of decision-making in convolutional networks, with a focus on gradually unveiling more structural elements of the decision-making and walking away from traditional saliency maps. The first direction is to break the grid structure that these networks usually rely on, which leads to a lack of flexibility for scale and rotation invariance, as well as restricted generalization to more types of data. Our proposed PointConv overcomes many such limitations by allowing convolution to be applied on point sets that are distributed arbitrarily in a low-dimensional Euclidean space. This includes, but is not limited to, 3D point clouds, as the work can also be applied back to 2D images to achieve greater robustness to scale, as well as to other types of data such as weather stations and units in real-time strategy games. Finally, we will introduce a new approach to better estimate uncertainty in neural networks, applicable to many scenarios such as outlier detection, adversarial robustness and better exploration in reinforcement learning.
Feb 03

Vicente Ordóñez Román

Dr Vicente Ordóñez Román is a tenure-track Assistant Professor in the Department of Computer Science at the University of Virginia. His research interests lie at the intersection of computer vision, natural language processing and machine learning. His focus is on building efficient visual recognition models that can perform tasks that leverage both images and text. He is a recipient of a Best Paper Award at the conference on Empirical Methods in Natural Language Processing (EMNLP) 2017 and the Best Paper Award -- Marr Prize at the International Conference on Computer Vision (ICCV) 2013, an IBM Faculty Award, a Google Faculty Research Award, and a Facebook Research Award. Vicente obtained his PhD in Computer Science at the University of North Carolina at Chapel Hill in 2015, an MS at Stony Brook University, and an engineering degree at the Escuela Superior Politécnica del Litoral in Ecuador. He has also been a Visiting Fellow at the Allen Institute for Artificial Intelligence and a Visiting Professor at Adobe Research.

Compositional Representations for Visual Recognition

Abstract: Compositionality is the ability of a model to recognize a concept based on its parts or constituents. This ability is essential to using language effectively, as there exists a very large number of plausible combinations of objects, attributes, and actions in the world. We posit that visual recognition models should be trained under these considerations. Furthermore, we argue that this property would enable models that are more robust, can be trained with fewer samples, and mitigate the impact of spurious correlations that could introduce and amplify societal biases. This talk will expand on this idea and present examples of our current and ongoing work, among them: (1) Text2Scene, a model for text-to-image composition, (2) Drill-down, an interactive instance-aware image retrieval system, and (3) Genderless, an adversarial filtering module useful for disentangling and visualizing potentially spurious features correlated with an orthogonal task. For more information, please visit our group's website:
Feb 10

Xiaolong Wang

Dr. Xiaolong Wang is an Assistant Professor in the ECE department at the University of California, San Diego. He is affiliated with the CSE department, Center for Visual Computing, Contextual Robotics Institute, and Artificial Intelligence Group. He received his Ph.D. in Robotics at Carnegie Mellon University. His postdoctoral training was at the University of California, Berkeley. His research focuses on the intersection between computer vision and robotics. He is particularly interested in learning visual representations from videos in a self-supervised manner and using these representations to guide robots to learn. Xiaolong is an Area Chair for CVPR, AAAI, and ICCV. More details are available on his homepage:

Generalization in Robot Learning with Self-Supervision

Abstract: While there has been a lot of recent progress in robot learning, it is still very challenging to generalize the learned policies to environments/domains with unseen observations and different physics. On the other hand, self-supervised learning in computer vision has shown very promising results in learning generalizable representations. In this talk, I will bridge self-supervised learning and robot learning, and show how self-supervision can help robots generalize to unseen environments. Specifically, I will first introduce test-time training, which allows self-supervised visual learning even during test time and continuously improves Reinforcement Learning (RL) agents in unknown environments. Beyond adapting to unseen visual scenes, I will talk about how to learn correspondence across domains differing in physics parameters (mass and friction) and morphology (number of agent's joints) using self-supervision. Once this correspondence is found, we can directly transfer and generalize the policy trained on one domain to the other. Finally, I will also introduce our recent efforts on generalizing RL across multiple tasks using modular networks.
Feb 15

Abhishek Das

Abhishek Das is a Research Scientist at Facebook AI Research (FAIR). He was previously a Computer Science PhD student at Georgia Institute of Technology, advised by Dhruv Batra and working closely with Devi Parikh. During and prior to his PhD, he held visiting research positions at Queensland Brain Institute, Virginia Tech, Facebook AI Research, DeepMind, and Tesla Autopilot. He graduated from Indian Institute of Technology Roorkee in 2015 with a Bachelor's degree in Electrical Engineering. His research focuses on deep learning and its applications in climate change, and in building agents that can see (computer vision), think (reasoning/interpretability), talk (language modeling), and act (reinforcement learning). He has published at top-tier conferences -- CVPR, ICCV, ICML, IJCAI, CoRL, ECCV, EMNLP, ICASSP -- and journals -- IJCV, PAMI, CVIU. He is a recipient of graduate fellowships from Facebook, Adobe, and Snap, and of top reviewer awards at CVPR and NeurIPS.

Building Agents that can See, Talk, and Act

Abstract: Building intelligent agents that possess the ability to perceive the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and execute actions in a physical environment is a long-term goal of Artificial Intelligence. In this talk, I will present some of my recent work at various points on this spectrum in connecting vision and language to actions: from Visual Dialog (CVPR17, ICCV17, ECCV20) -- where we develop models capable of holding free-form visually-grounded natural language conversation towards a downstream goal and ways to evaluate them -- to Embodied Question Answering (CVPR18, CoRL18, ICML20) -- where we augment these models to actively navigate in simulated environments and gather visual information necessary for answering questions.
Feb 22

Xin (Eric) Wang

Xin (Eric) Wang is an Assistant Professor of Computer Science and Engineering at UC Santa Cruz. He obtained his Ph.D. from UC Santa Barbara and his Bachelor's degree from Zhejiang University. His research interests include Natural Language Processing, Computer Vision, and Machine Learning, with an emphasis on building embodied AI agents that can communicate with humans using natural language to perform real-world tasks. Xin has served as Area Chair for NAACL 2021 and EMNLP 2020, Senior Program Committee (SPC) member for IJCAI 2021, and Session Chair for EMNLP 2020 and AAAI 2019, and has organized multiple workshops and tutorials at ACL, CVPR, ICCV, AACL, etc. He received the CVPR Best Student Paper Award in 2019.

Tackling Data Scarcity in Vision and Language

Abstract: Data is the fuel of deep learning. We have witnessed incredible progress in deep learning everywhere, empowered by big data. However, for challenging vision-and-language tasks that involve multi-turn human interactions, data collection is prohibitively expensive and time-consuming. In this talk, I will discuss our recent efforts on tackling data scarcity in tasks such as vision-and-language navigation and iterative text-to-image editing: (1) Self-supervised counterfactual reasoning, which incorporates counterfactual thinking to augment out-of-distribution data; (2) Multimodal text style transfer, which learns to better leverage external resources to mitigate data scarcity in outdoor navigation tasks; (3) Environment-agnostic multitask navigation, which transfers knowledge across different language-grounded navigation tasks.
Mar 01

Damien Teney

Dr Damien Teney is a Senior Researcher at the Australian Institute for Machine Learning, part of the University of Adelaide. He is about to join the Idiap research lab in Martigny, Switzerland. He was previously affiliated with Carnegie Mellon University (USA), the University of Bath (UK), and the University of Innsbruck (Austria). Damien received his Ph.D. in Computer Science at the University of Liège (Belgium), advised by Justus Piater. His research interests are at the intersection of computer vision and machine learning.

Visual Question Answering and Deep Learning: Are we building a ladder to the moon?

Abstract: The task of visual question answering is an exciting opportunity for researchers from computer vision and NLP to tackle a practical AI-complete task. Advances in deep learning have made aspects of the task seemingly within reach. However, the more we look back, the more issues we find: biased datasets, metrics that do not evaluate desired behaviours, models giving right answers for the wrong reasons, etc. This talk proposes to take a step back and reflect on capabilities that are within reach of current machine learning methods, and those that will require radically different approaches. We look through the lens of causal reasoning to identify fundamental limitations of observational training data. We will also discuss how the success of data augmentation, multi-environment training, and counterfactual training examples can all be explained with fundamental causal principles. The analysis is enlightening on the type of information missing from typical datasets, where else to find it, and how to test our models for the behaviours we really care about.
Mar 08

Muhao Chen

Muhao Chen is a researcher at USC ISI. Prior to USC, he was a postdoctoral fellow at UPenn, hosted by Dan Roth. He received his Ph.D. degree from the UCLA Computer Science Department in 2019. His research focuses on data-driven machine learning approaches for processing structured data, and knowledge acquisition from unstructured data. In particular, he is interested in developing knowledge-aware learning systems that generalize well and require minimal supervision, with concrete applications to natural language understanding, knowledge base construction, computational biology and medicine. Muhao has published over 40 papers in leading AI, NLP and Comp. Bio/med venues. His work has received a best student paper award at ACM BCB, and a best paper award nomination at CoNLL. Additional information is available at

Understanding Event Processes in Natural Language

Abstract: Human languages evolve to communicate about real-world events. Therefore, understanding events plays a critical role in natural language understanding (NLU). A key challenge to this mission lies in the fact that events are not just simple, standalone predicates. Rather, they are often described at different granularities, form different temporal orders, and are directed by specific central goals in the context. This talk will present two parts of our recent studies on Event-Centric NLU. In the first part, I will talk about how logically-constrained learning can teach machines to understand temporal relations, membership relations and coreference of events (e.g., what should be the right process of “defending a dissertation”, “taking courses”, “publishing papers” regarding “earning a PhD”?). In the second part, I will talk about how to teach machines to understand the intents and central goals behind event processes (e.g., do machines understand that “making a dough”, “adding toppings”, “preheating the oven” and “baking the dough” lead to “cooking pizza”?). I will also briefly discuss some recent advances and open problems in event-centric NLU, along with a system demonstration.
Mar 22

Stefan Lee

Stefan Lee is an assistant professor in the School of Electrical Engineering and Computer Science at Oregon State University and a core member of the Collaborative Robotics and Intelligent Systems (CoRIS) Institute there. His work addresses problems at the intersection of computer vision and natural language processing. He is the recipient of the DARPA Rising Research Plenary Speaker Selection (DARPA 2019), two best paper awards (EMNLP 2017, CVPR 2014 Workshop on Egocentric Vision), multiple awards for review quality (CVPR 2017,2019,2020; ICCV 2017; NeurIPS 2017-2018; ICLR 2018-2019, ECCV 2020), the Bradley Postdoctoral Fellowship (Virginia Tech), and an Outstanding Research Scientist Award (Georgia Tech -- College of Computing).

Learning Transferable Visiolinguistic Representations

Abstract: Vision-and-language research consists of a diverse set of tasks. While underlying image datasets, input/output APIs, and model architectures vary across tasks, there exists a common need to associate imagery and text -- i.e., to perform visual grounding. Despite this, the standard paradigm has been to treat each task in isolation -- starting from separately pretrained vision and language models and then learning to associate their outputs as part of task training. This siloed approach fails to leverage grounding supervision between tasks and can result in myopic groundings when datasets are small or biased. In this talk, I'll discuss a line of work focused on learning task-agnostic visiolinguistic representations that can serve as a common foundation for many vision-and-language tasks. First, I'll cover recent work on learning a generic multimodal encoder (ViLBERT) from large-scale web data and transferring this pretrained model to a range of vision-and-language tasks. Second, I'll show how multitask training from this base architecture further improves task performance while unifying twelve vision-and-language tasks in a single model.
Mar 29

Yin Li

Yin Li is an Assistant Professor in the Department of Biostatistics and Medical Informatics and affiliate faculty in the Department of Computer Sciences at the University of Wisconsin-Madison. Previously, he obtained his PhD in computer science from Georgia Tech and was a postdoctoral fellow in the Robotics Institute at Carnegie Mellon University. His primary research focus is computer vision. He is also interested in the applications of vision and learning for mobile health. Specifically, his group develops methods and systems to automatically analyze human activities to address challenges related to healthcare. He has served as an area chair for the top vision and AI conferences, including CVPR, ICCV, ECCV and IJCAI. He was the co-recipient of the best student paper awards at MobiHealth 2014 and IEEE Face and Gesture 2015. His work has been covered by MIT Tech Review, WIRED UK, New Scientist, BBC, and Forbes.

Learning Visual Knowledge from Image-Text Pairs

Abstract: Images and their text descriptions (i.e., captions) are readily available in great abundance over the Internet, creating a unique opportunity to develop AI models for image and text understanding. Consequently, learning from these image-text data has received surging interest from the vision and AI community. An image contains millions of pixels capturing the intensity and color of a visual scene. Yet the same scene can oftentimes be summarized using dozens of words in a natural language. How can we bridge the gap between visual and text data? And what can we learn from these image-text pairs? In this talk, I will describe our attempts to address these research questions, with a focus on learning visual knowledge from images and their captions. First, I will talk about our early work on learning joint representations to match images and sentences and to further align regions within an image and phrases from the image caption. Our latest development demonstrates the learning of these representations with merely image-text pairs and without knowing region-phrase correspondences. Moving forward, I will present our recent work on learning to detect visual concepts (e.g., object categories) and their relationships (e.g., predicates) -- in the form of localized scene graphs, again from only image-sentence pairs. Lastly, I will describe our method that leverages image scene graphs to generate accurate, diverse, and controllable image captions. If time permits, I will briefly cover our efforts on wearable visual sensing and first person vision.
Apr 05

Adriana Kovashka

Adriana Kovashka is an Assistant Professor in Computer Science at the University of Pittsburgh. Her research interests are in computer vision and machine learning. She has authored over twenty publications in top-tier computer vision and artificial intelligence conferences and journals (CVPR, ICCV, ECCV, NeurIPS, AAAI, ACL, TPAMI, IJCV) and ten second-tier conference publications (BMVC, ACCV, WACV). Her research is funded by the National Science Foundation, Google, Amazon and Adobe. She received the NSF CAREER award in 2021. She has served as an Area Chair for CVPR in 2018-2021, NeurIPS 2020, ICLR 2021, AAAI 2021, and will serve as co-Program Chair of ICCV 2025. She has been on program committees for over twenty conferences and journals, and has co-organized seven workshops.

Reasoning about Complex Media from Weak Multi-modal Supervision

Abstract: In a world of abundant information targeting multiple senses, and increasingly powerful media, we need new mechanisms to model content. Techniques for representing individual channels, such as visual data or textual data, have greatly improved, and some techniques exist to model the relationship between channels that are “mirror images” of each other and contain the same semantics. However, multimodal data in the real world contains little redundancy; the visual and textual channels complement each other. We examine the relationship between multiple channels in complex media, in two domains: advertisements and political articles.
First, we collect a large dataset of advertisements and public service announcements, covering almost forty topics (ranging from automobiles and clothing to health and domestic violence). We pose decoding the ads as automatically answering the questions “What should the viewer do, according to the ad?” (the suggested action), and “Why should the viewer do the suggested action, according to the ad?” (the suggested reason). We train a variety of algorithms to choose the appropriate action-reason statement, given the ad image and potentially a slogan embedded in it. The task is challenging because of the great diversity in how different users annotate an ad, even if they draw similar conclusions.


Yezhou Yang, Chitta Baral, Tejas Gokhale, Shailaja Sampat, Pratyay Banerjee, Zhiyuan Fang

Website maintained by Tejas Gokhale