Abstract
Detecting the temporal extents of human actions in videos is a challenging computer vision problem that requires detailed manual supervision, including frame-level labels. This expensive annotation process restricts the deployment of action detectors to a small number of categories. We propose a novel method, called WSGN, that learns to detect actions from weak supervision, using only video-level labels. WSGN learns to exploit both video-specific and dataset-wide statistics to predict the relevance of each frame to an action category. This strategy leads to significant gains in action detection on two standard benchmarks, THUMOS14 and Charades. Our method obtains excellent results on the THUMOS14 dataset compared to state-of-the-art methods that use similar features and loss functions. Similarly, on the challenging Charades dataset, our weakly supervised method is only 0.3% mAP behind a state-of-the-art supervised method for action localization.
| Original language | English |
|---|---|
| Title of host publication | 2020 IEEE Winter Conference on Applications of Computer Vision |
| Publisher | Institute of Electrical and Electronics Engineers |
| Pages | 526-535 |
| Number of pages | 10 |
| ISBN (Electronic) | 978-1-7281-6553-0 |
| ISBN (Print) | 978-1-7281-6554-7 |
| Publication status | Published - 14 May 2020 |
| Event | 2020 Winter Conference on Applications of Computer Vision, Aspen, United States. Duration: 1 Mar 2020 → 5 Mar 2020. https://wacv20.wacv.net/ |
Publication series
| Publisher | IEEE |
|---|---|
| ISSN (Print) | 2472-6737 |
| ISSN (Electronic) | 2642-9381 |
Conference
| Conference | 2020 Winter Conference on Applications of Computer Vision |
|---|---|
| Abbreviated title | WACV 2020 |
| Country/Territory | United States |
| City | Aspen |
| Period | 1/03/20 → 5/03/20 |
| Internet address | https://wacv20.wacv.net/ |