Although recent advances in computer vision have effectively boosted the performance of multimedia systems, a core question still cannot be explicitly answered: does the machine understand what is happening in a video, and are the results of its analysis interpretable by human users? Another way to look at this limitation is to evaluate how many facts the machine can recognize from a video. In many AI and knowledge-based systems, a fact is represented by a relation between a subject entity and an object entity (i.e. <subject, predicate, object>), which forms the fundamental building block for complex inference and decision-making tasks.
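To make this notion concrete, a minimal sketch of such a fact as a typed triple might look like the following; the class and field names are illustrative only and are not part of any challenge toolkit:

```python
# A minimal sketch of a fact as a typed triple (illustrative names only,
# not part of any challenge toolkit).
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationTriplet:
    subject: str    # e.g. "adult"
    predicate: str  # e.g. "hold"
    object: str     # e.g. "toy"

fact = RelationTriplet("adult", "hold", "toy")
print(fact)  # RelationTriplet(subject='adult', predicate='hold', object='toy')
```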
As a key aspect of recognizing facts, Video Relation Understanding (VRU) is very challenging, since it requires the system to understand many perspectives of the two entities, including their appearance, actions, speech, and the interactions between them. To detect and recognize relations in videos accurately, a system must not only recognize the features from these perspectives, but also handle the large variance in how relations are expressed. This year's VRU challenge encourages researchers to explore and develop innovative models and algorithms that detect object entities and the relations between each pair of them in a given video.
This benchmark dataset contains 10,000 user-generated videos (98.6 hours) from YFCC100M. It is spatio-temporally annotated with 80 categories of objects (e.g. adult, dog, toy) and 50 categories of relationships (e.g. next to, watch, hold).
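For illustration, a single annotation entry could conceptually pair per-frame bounding-box trajectories with relation instances as sketched below; the field names and layout are assumptions for exposition, not the dataset's actual schema:

```python
# Hypothetical annotation record (field names and layout are assumptions for
# exposition, not the dataset's actual schema): each object has a per-frame
# bounding-box trajectory, and each relation instance links two objects over
# a frame range.
annotation = {
    "video_id": "example_video",
    "objects": {
        0: {"category": "adult",
            "trajectory": {120: [34, 50, 180, 320],   # frame -> [x1, y1, x2, y2]
                           121: [36, 51, 182, 322]}},
        1: {"category": "dog",
            "trajectory": {120: [200, 210, 300, 330],
                           121: [198, 212, 298, 331]}},
    },
    "relation_instances": [
        {"subject_id": 0, "predicate": "next to", "object_id": 1,
         "begin_frame": 120, "end_frame": 122},
    ],
}
```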
This task is to detect relation triplets (i.e. <subject, predicate, object>) of interest and spatio-temporally localize the subject and object of each detected relation triplet using bounding-box trajectories. For each testing video, we compute Average Precision (AP) to evaluate the detection performance and rank submissions by the mean AP over all testing videos.
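As a rough illustration of this protocol, the sketch below matches predicted triplets to ground truth by label and by trajectory overlap, then accumulates a per-video AP. The voluminal-IoU matching criterion, the 0.5 threshold, and the data layout are assumptions for exposition, not the official evaluation code:

```python
# Hedged sketch of the per-video evaluation protocol for relation detection.
# A prediction counts as a true positive if its <subject, predicate, object>
# labels match an unmatched ground-truth instance and both of its trajectories
# sufficiently overlap the ground-truth ones. The voluminal-IoU criterion and
# the 0.5 threshold are assumptions, not the official evaluation code.
import numpy as np

def trajectory_viou(traj_a, traj_b):
    """Voluminal IoU of two trajectories given as {frame: [x1, y1, x2, y2]}."""
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    inter, union = 0.0, 0.0
    for f in set(traj_a) | set(traj_b):
        a, b = traj_a.get(f), traj_b.get(f)
        if a and b:
            iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
            ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
            i = iw * ih
            inter += i
            union += area(a) + area(b) - i
        else:
            union += area(a or b)
    return inter / union if union > 0 else 0.0

def relation_ap(predictions, ground_truth, viou_thr=0.5):
    """AP for one video.

    predictions: list of (score, triplet, subj_traj, obj_traj), where triplet
                 is a (subject, predicate, object) tuple of category names.
    ground_truth: list of (triplet, subj_traj, obj_traj).
    """
    matched = [False] * len(ground_truth)
    hits = []
    for score, triplet, s_traj, o_traj in sorted(predictions, key=lambda p: -p[0]):
        hit = False
        for k, (gt_triplet, gt_s, gt_o) in enumerate(ground_truth):
            if matched[k] or triplet != gt_triplet:
                continue
            if (trajectory_viou(s_traj, gt_s) >= viou_thr and
                    trajectory_viou(o_traj, gt_o) >= viou_thr):
                matched[k] = hit = True
                break
        hits.append(hit)
    tp = np.cumsum(hits)
    recall = tp / max(len(ground_truth), 1)
    precision = tp / np.arange(1, len(hits) + 1)
    # Non-interpolated AP: area under the precision-recall curve.
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

# The per-video scores are then averaged into the mean AP over all testing videos.
```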
As the first step in relation detection, this task is to detect objects of certain categories and spatio-temporally localize each detected object using a bounding-box trajectory in videos. For each object category, we compute Average Precision (AP) to evaluate the detection performance and rank submissions by the mean AP over all categories.
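Analogously, a hedged sketch of per-category AP averaged over categories could look like the following; it reuses the hypothetical `trajectory_viou` helper from the sketch above, and the data layout and threshold are again assumptions:

```python
# Hedged sketch: per-category AP for object trajectory detection, averaged over
# categories. Reuses the hypothetical trajectory_viou helper from the sketch
# above; the data layout and the 0.5 threshold are again assumptions.
import numpy as np

def category_ap(detections, gt_trajectories, viou_thr=0.5):
    """detections: list of (score, trajectory); gt_trajectories: list of trajectories."""
    matched = [False] * len(gt_trajectories)
    hits = []
    for score, traj in sorted(detections, key=lambda d: -d[0]):
        hit = False
        for k, gt in enumerate(gt_trajectories):
            if not matched[k] and trajectory_viou(traj, gt) >= viou_thr:
                matched[k] = hit = True
                break
        hits.append(hit)
    tp = np.cumsum(hits)
    recall = tp / max(len(gt_trajectories), 1)
    precision = tp / np.arange(1, len(hits) + 1)
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

def mean_ap_over_categories(dets_by_category, gts_by_category):
    """Average the per-category APs over all annotated object categories."""
    aps = [category_ap(dets_by_category.get(cat, []), gts)
           for cat, gts in gts_by_category.items()]
    return sum(aps) / len(aps) if aps else 0.0
```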
This challenge is a team-based competition. Each team can have one or more members, and an individual cannot be a member of multiple teams. To register, please create an account and form a team on the submission server. More guidance on using the server can be found in the FAQs. Note that each team must select a final submission on the server before the submission deadline; the final evaluation will be conducted based on that selection.
At the end of the challenge, all teams will be ranked based on the objective evaluation metrics, and the leaderboards of both tasks will be made public on this website. To be eligible for the ACM MM'20 grand challenge award competition, each team must further submit a 4-page overview paper (plus a 1-page reference) to the conference's grand challenge track. The top three teams, judged on both solution novelty and ranking in the main task, will receive award certificates.
| Rank | Team Name | Performance: mean AP* | Team Members |
|---|---|---|---|
| 1 | colab-BUAA | 0.1174 | Wentao Xie, Guanghui Ren, Si Liu [paper]<br>Beihang University & YITU Technology |
| 2 | ETRI_DGRC | 0.0665 | Kwang-Ju Kim, Pyong-Kun Kim, Kil-Taek Lim, Jong Taek Lee<br>Electronics and Telecommunications Research Institute |
| 3 | Fudan-BigVid | 0.0599 | Zixuan Su, Jingjing Chen, Yu-Gang Jiang<br>Fudan University |
| 4 | GKBU | 0.0328 | Renmin University of China |
| 5 | DeepBlueAI | 0.0024 | DeepBlue Technology (Shanghai) Co., Ltd |

| Rank | Team Name | Performance: mean AP* | Team Members |
|---|---|---|---|
| 1 | DeepBlueAI | 0.0966 | Zhipeng Luo, Zhiguang Zhang, Lixuan Che, Yuehan Yao, Zhenyu Xu [paper]<br>DeepBlue Technology (Shanghai) Co., Ltd |
| 2 | IVL | 0.0742 | Jinjin Shi, Zhihao Chen, Youxin Chen<br>Samsung |
| 3 | ARC | 0.0071 | Lijian Lin<br>Xiamen University |
* The two mean AP metrics are computed in different ways. Please refer to the evaluation details in each task description.
** This year's challenge had 41 registered teams from around the world. However, due to the difficulty of both the tasks and the dataset, as well as the challenging COVID-19 period, only a few outstanding teams successfully submitted results by the final deadline.
For general information about this challenge, please contact: