Object detection is the first step towards relation understanding in videos. This task requires participants to develop robust object detectors that not only localize objects of certain categories with bounding boxes in every video frame, but also link the bounding boxes that indicate the same object entity into a trajectory. This helps machines understand the identities and dynamics of object entities at the video level, which can benefit many applications that require fine-grained video understanding. Furthermore, we hope it will accelerate research on robust video object detection in the wild by providing a large number of user-generated videos with annotations. An expected challenge of this task is that the detectors should be able to re-identify an object that disappears for some time in a video. Technically, the detectors have to overcome difficulties arising from video quality (e.g., free camera motion, illumination changes, and blurring) and object variation (e.g., occlusion and deformable shapes).
The VidOR dataset for this task consists of user-generated videos from Flickr, annotated with 80 object categories (e.g., "adult", "child", "dog", "table"). Bounding boxes are annotated for objects in each frame, and object identities across frames are also provided. The training/validation/testing splits contain 7,000, 835, and 2,165 videos, respectively. In particular, an object is annotated even if only part of it is visible in a frame (e.g., a person's hand).
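To make the annotation concrete, the sketch below shows one possible in-memory representation of a trajectory: a category label plus per-frame bounding boxes sharing one identity. The field names are illustrative placeholders, not the official annotation schema, which is defined by the downloaded files.

```python
# Hypothetical representation of a linked object trajectory; field names are
# placeholders and do not mirror the official VidOR annotation format.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Trajectory:
    tid: int        # identity linking boxes of the same object across frames
    category: str   # one of the 80 object categories, e.g. "dog"
    boxes: Dict[int, List[float]] = field(default_factory=dict)  # frame -> [x1, y1, x2, y2]


# Example: a "dog" visible in frames 10-12 of a video
dog = Trajectory(tid=3, category="dog")
dog.boxes[10] = [120.0, 80.0, 260.0, 310.0]
dog.boxes[11] = [122.0, 78.0, 262.0, 308.0]
dog.boxes[12] = [125.0, 77.0, 266.0, 305.0]
```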
The videos and annotations of the training and validation sets can be downloaded directly from here. Note that the downloaded annotations also contain relation annotations; participants are allowed to use them to train their object detection models.
We adopt average precision (AP) as the metric to evaluate the detection performance for each object category. The trajectory-level mean AP (mAP) is defined as follows:
Given a predicted trajectory (a.k.a. tubelet) $\mathcal{T}_p$ and a ground truth trajectory $\mathcal{T}_g$ of a certain category, the temporal Intersection over Union (tIoU) between the two trajectories is calculated as: $$\text{tIoU}(\mathcal{T}_p, \mathcal{T}_g)=\frac{D_p \cap D_g}{D_p\cup D_g}, $$ where $D_p$ and $D_g$ denote the time durations of the predicted trajectory and the ground truth trajectory, respectively. In our settings, the threshold for \( \text{tIoU} \) is 0.5, meaning that any result with \( \text{tIoU} \geq 0.5 \) is regarded as a true positive prediction. In addition, the \( \text{IoU} \) thresholds for frame-level bounding boxes are set to 0.5, 0.7, and 0.9. For each trajectory pair, the \( \text{tIoU} \) is averaged over the three frame-level IoU thresholds.
We only consider the top-20 predictions for each video. The final mAP is obtained by averaging the APs across all the object categories.
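The sketch below illustrates one reading of the matching criterion above: at each frame-level IoU threshold, a frame counts towards the temporal intersection only if both trajectories cover it and their boxes overlap above that threshold, and the resulting tIoU values are averaged over the three thresholds. This is a minimal sketch for intuition only; the official evaluation code linked below is authoritative.

```python
# Minimal sketch of the trajectory-level matching, assuming each trajectory is
# a dict mapping frame index -> [x1, y1, x2, y2] box. The way the three
# frame-level IoU thresholds are combined follows our reading of the text above.

def box_iou(a, b):
    """Standard IoU between two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tiou(pred, gt, box_thresholds=(0.5, 0.7, 0.9)):
    """Temporal IoU between two trajectories, averaged over the box thresholds."""
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    scores = []
    for thr in box_thresholds:
        # A frame counts towards the intersection only if both trajectories
        # cover it and their boxes overlap with IoU >= thr.
        matched = sum(1 for f in shared_frames if box_iou(pred[f], gt[f]) >= thr)
        scores.append(matched / len(union_frames))
    return sum(scores) / len(scores)


# A predicted trajectory is a true positive for its category if it matches an
# unmatched ground-truth trajectory with tiou(...) >= 0.5; only the top-20
# predictions per video are kept, AP is computed per category, and the final
# mAP averages the APs over all categories.
```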
The evaluation code used by the evaluation server can be found here.
Please use the following JSON format to submit your results to the submission server. Before uploading, please compress the JSON file using XZ and make sure the final file size does not exceed 100 MB.
The example above is illustrative. Comments must be removed in your submission. A sample submission file is available here.
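As a convenience, the sketch below shows one way to write the results as JSON and compress them with XZ before uploading. The `results` dict here is only a placeholder; its actual structure should follow the sample submission file.

```python
# Minimal packaging sketch; the contents of `results` must follow the JSON
# format shown by the sample submission file (placeholder below).
import json
import lzma
import os

results = {}  # placeholder: fill in predictions per the sample submission file

# Write the JSON and compress it with XZ in one step.
with lzma.open("submission.json.xz", "wt", encoding="utf-8") as f:
    json.dump(results, f)

# The compressed file must not exceed 100 MB.
size_mb = os.path.getsize("submission.json.xz") / (1024 * 1024)
assert size_mb <= 100, f"submission too large: {size_mb:.1f} MB"
```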