Efficient pre-training for localized instruction generation of procedural videos

Anil Kumar Batra*, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Procedural videos, exemplified by recipe demonstrations, are instrumental in conveying step-by-step instructions. However, understanding such videos is challenging as it involves the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance but demands significant computational resources. Furthermore, transcripts contain irrelevant content and differ in style from human-written instructions. To mitigate these issues, we propose a novel technique, Sieve & Swap, to automatically generate high-quality training data for the recipe domain: (i) Sieve filters irrelevant transcripts, and (ii) Swap acquires high-quality text by replacing transcripts with human-written instructions from a text-only recipe dataset. The resulting dataset is three orders of magnitude smaller than current web-scale datasets but enables efficient training of large-scale models. Alongside Sieve & Swap, we propose Procedure Transformer (ProcX), a model for end-to-end step localization and instruction generation for procedural videos. When pre-trained on our curated dataset, this model achieves state-of-the-art performance on YouCook2 and Tasty while using a fraction of the training data. We have released the code and dataset (https://github.com/anilbatra2185/sns_procx).
Original language: English
Title of host publication: European Conference on Computer Vision, Proceedings
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol
Publisher: Springer
Pages: 347-363
Number of pages: 17
ISBN (Electronic): 9783031732324
ISBN (Print): 9783031732317
Publication status: Published - 30 Sept 2024
Event: The 34th European Conference on Computer Vision - MiCo Milano Convention Centre, Milan, Italy
Duration: 29 Sept 2024 to 4 Oct 2024
https://eccv.ecva.net/

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 15097
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: The 34th European Conference on Computer Vision
Abbreviated title: ECCV 2024
Country/Territory: Italy
City: Milan
Period: 29/09/24 to 4/10/24
