V2C (Video-to-Commonsense)

Understanding videos by enriching captions with commonsense knowledge.

About Video-to-Commonsense

Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes, such as movements, manipulations, and transformations of the objects in the scene, are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent.

Thus, to understand videos, such as when captioning them or answering questions about them, a model must grasp these commonsense aspects. We present the first work on generating commonsense captions directly from videos, describing latent aspects such as intentions, effects, and attributes. We also introduce a new dataset, "Video-to-Commonsense (V2C)", containing ∼9k videos of human agents performing various actions, annotated with three types of commonsense descriptions.


Download the V2C dataset



Videos in V2C are inherited from the MSR-VTT video dataset. We use the commonly used 2D visual features extracted by an ImageNet pre-trained ResNet-152 model.
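
For reference, below is a minimal sketch of how such frame-level 2D features can be extracted with torchvision. The preprocessing values are the standard ImageNet settings; the frame paths and the helper function name are illustrative assumptions, not part of the released pipeline.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load an ImageNet pre-trained ResNet-152 and drop the final classification
# layer, so each frame yields a 2048-d pooled feature vector.
resnet = models.resnet152(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_frame_features(frame_paths):
    """Return a (num_frames, 2048) tensor of ResNet-152 features.

    frame_paths: paths to frames sampled from one video (illustrative).
    """
    frames = torch.stack([preprocess(Image.open(p).convert("RGB"))
                          for p in frame_paths])
    return feature_extractor(frames).flatten(1)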

Implementations

We provide PyTorch implementations for the V2C completion tasks.
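
As an illustration of how the commonsense annotations can be consumed, here is a minimal loader sketch. The file name and the field names ("caption", "intention", "effect", "attribute") are assumptions for illustration only; consult the released dataset for the actual schema.

import json

def load_v2c(path="v2c_annotations.json"):  # hypothetical file name
    """Yield one annotated example per entry in a V2C-style JSON file."""
    with open(path) as f:
        annotations = json.load(f)
    for entry in annotations:
        video_id = entry["video_id"]      # MSR-VTT video identifier (assumed key)
        caption = entry["caption"]        # conventional video caption
        intention = entry["intention"]    # why the action takes place
        effect = entry["effect"]          # what changes due to the action
        attribute = entry["attribute"]    # description of the agent
        yield video_id, caption, intention, effect, attribute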

Distribution and Usage

V2C is curated from multiple online resources (the MSR-VTT Video Dataset and the ATOMIC Person Commonsense Dataset). The creation of V2C is purely research-oriented. If you find our dataset or model helpful, please cite our paper:

@article{fang2020video2commonsense,
 title={Video2commonsense: Generating commonsense descriptions to enrich video captioning},
 author={Fang, Zhiyuan and Gokhale, Tejas and Banerjee, Pratyay and Baral, Chitta and Yang, Yezhou},
 journal={arXiv preprint arXiv:2003.05162},
 year={2020}
}


Annotation Examples