Within action recognition, action detection is the most difficult task because it requires spatio-temporal localization of the actions of interest. This task evaluates the ability of an algorithm to 1) recognize actions from 42 common categories performed by humans and animals (which differs slightly from human-centric action detection); and 2) localize the actions in space and time, where multiple subjects may appear and each of them may perform several actions at a time. Hence, detectors need creative solutions that leverage any input modality, overcome the large variation within each action category, and learn the intention behind an action.

### Dataset

The dataset for this task contains 10K user-generated videos and ~69,000 annotated action instances in 42 categories of atomic visual actions (e.g. "watch", "grab", and "hug"). In the annotations, each action instance is temporally localized by a starting and an ending frame index, and spatially localized in the frames in between by a bounding-box trajectory over the subject of the action. Note that the subjects are not limited to humans (adult, child, baby), but also include several common animals listed on the introduction page of the dataset. The training, validation and testing splits contain 7,000, 835 and 2,165 videos respectively.

The videos and annotations can be downloaded directly from here. To obtain the action annotations for this task, parse each entry of "relation_instances" in the annotations and extract the "predicate" (the action type) and the corresponding "subject_tid" to form an action instance. You can use the script here for this processing.
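The parsing step described above might look roughly like the following minimal sketch. Only "relation_instances", "predicate" and "subject_tid" are named in the text; the `begin_fid` and `end_fid` field names, the function name `extract_action_instances`, and the idea of passing the action categories in as a set are assumptions made for illustration. The provided script remains the reference.

```python
from typing import Dict, Iterable, List


def extract_action_instances(annotation: Dict, action_predicates: Iterable[str]) -> List[Dict]:
    """Collect action instances from one video's annotation dictionary.

    `action_predicates` is the set of action category names to keep;
    relations whose predicate is not in this set are skipped.
    """
    wanted = set(action_predicates)
    instances = []
    for rel in annotation.get("relation_instances", []):
        if rel["predicate"] not in wanted:
            continue  # not one of the action categories
        instances.append({
            "category": rel["predicate"],
            "subject_tid": rel["subject_tid"],
            # Assumed field names for the temporal extent of the instance.
            "begin_fid": rel.get("begin_fid"),
            "end_fid": rel.get("end_fid"),
        })
    return instances
```

This sketch does not merge instances that share the same subject, predicate and time span; whether such merging is needed depends on the annotation files.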

Please note that 10.7% of the videos in the training split and 12.5% of the videos in the validation split do not contain any annotated action instance. On the other hand, the annotations contain additional annotations of objects and relations. Participants are allowed to use these extra data to train their action detection models.

### Evaluation Metric

We use Average Precision (AP) to evaluate the detection performance per action category, and compute the mean AP (mAP) over all categories as the final ranking score. A predicted action instance $(c^p, \mathcal{T}^p)$ counts as a true positive if it can be matched to a ground truth $(c^g, \mathcal{T}^g)$ that meets the following requirements:

• their action categories are the same, i.e. $c^p = c^g$;
• their bounding-box trajectories overlap s.t. $\text{vIoU}(\mathcal{T}^p, \mathcal{T}^g) \geq 0.5$;
• the overlap of the trajectory pair $\text{ov}_{pg}=\text{vIoU}(\mathcal{T}^p, \mathcal{T}^g)$ is the maximum among those paired with the other unmatched ground truths $\mathcal{G}$, i.e. $\text{ov}_{pg} \geq \text{ov}_{pg'} (g' \in \mathcal{G})$.

Here, $c$ denotes the action category of an action instance; $\mathcal{T}$ is the bounding-box trajectory defined by a starting and an ending frame index and a sequence of bounding boxes; and $\text{vIoU}$ refers to the voluminal Intersection over Union.
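The matching criteria above can be sketched as follows. This is an illustrative approximation, not the evaluation server's code: the trajectory representation (a dictionary from frame index to an `(xmin, ymin, xmax, ymax)` box), the function names, and the choice to process predictions in descending score order are all assumptions. The official evaluation code may differ in details such as pixel conventions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax), assumed layout
Trajectory = Dict[int, Box]               # frame index -> box, assumed layout


def box_area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def box_intersection(a: Box, b: Box) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def viou(tp: Trajectory, tg: Trajectory) -> float:
    """Voluminal IoU: box areas are accumulated over frames to form volumes."""
    inter = sum(box_intersection(tp[f], tg[f]) for f in tp.keys() & tg.keys())
    vol_p = sum(box_area(b) for b in tp.values())
    vol_g = sum(box_area(b) for b in tg.values())
    union = vol_p + vol_g - inter
    return inter / union if union > 0 else 0.0


def match_predictions(preds: List[Tuple[str, float, Trajectory]],
                      gts: List[Tuple[str, Trajectory]],
                      thresh: float = 0.5) -> List[bool]:
    """Return true-positive flags for predictions, in descending score order.

    preds: (category, score, trajectory); gts: (category, trajectory).
    """
    order = sorted(range(len(preds)), key=lambda i: -preds[i][1])
    matched = set()  # indices of ground truths already claimed
    flags = []
    for i in order:
        c_p, _, t_p = preds[i]
        best_ov, best_g = 0.0, None
        for g, (c_g, t_g) in enumerate(gts):
            if g in matched or c_g != c_p:
                continue  # already matched, or category differs
            ov = viou(t_p, t_g)
            if ov >= thresh and ov > best_ov:
                best_ov, best_g = ov, g  # best unmatched ground truth so far
        if best_g is not None:
            matched.add(best_g)
        flags.append(best_g is not None)
    return flags
```

The resulting flags, together with the prediction scores, are the usual inputs for computing a precision-recall curve and AP per category.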

The evaluation code used by the evaluation server can be found here. At most 500 predictions are allowed per video.
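One simple way to respect the per-video limit is to keep only the highest-scoring predictions before writing the submission. The sketch below assumes each prediction is a dictionary with a numeric `score` field; that field name and the helper `keep_top_predictions` are hypothetical and not taken from the submission specification.

```python
from typing import Dict, List


def keep_top_predictions(predictions: List[Dict], limit: int = 500) -> List[Dict]:
    """Return at most `limit` predictions, highest score first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return ranked[:limit]
```

Applied per video, this leaves lists that are already within the limit unchanged apart from ordering.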

### Submission Format

Please use the following JSON format when submitting your results for the challenge:

The example above is illustrative. Comments must be removed in your submission.
A sample submission file is available here.