Self-supervised Learning of Interpretable Keypoints from Unlabelled Videos

Tomas Jakab, Ankush Gupta, Hakan Bilen, Andrea Vedaldi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


We propose a new method for recognizing the pose of objects from a single image that, for learning, uses only unlabelled videos and a weak empirical prior on the object poses. Video frames differ primarily in the pose of the objects they contain, so our method distils the pose information by analyzing the differences between frames. The distillation uses a new dual representation of the geometry of objects as a set of 2D keypoints, and as a pictorial representation, i.e. a skeleton image. This has three benefits: (1) it provides a tight ‘geometric bottleneck’ which disentangles pose from appearance, (2) it can leverage powerful image-to-image translation networks to map between photometry and geometry, and (3) it allows empirical pose priors to be incorporated in the learning process. The pose priors are obtained from unpaired data, for example from a different dataset or a different modality such as mocap, so that no annotated image is ever used in learning the pose recognition network. In standard benchmarks for pose recognition for humans and faces, our method achieves state-of-the-art performance among methods that do not require any labelled images for training. Project page:
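
To make the dual representation in the abstract concrete, the following is a minimal sketch of how a set of 2D keypoints could be drawn as a skeleton image. It is not the authors' implementation: the function names, the edge list, the exponential falloff around each bone, and the `gamma` sharpness parameter are all assumptions chosen for illustration.

```python
import math


def point_segment_sq_dist(u, a, b):
    """Squared distance from 2D point u to the line segment a-b."""
    ax, ay = a
    bx, by = b
    ux, uy = u
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        t = 0.0  # degenerate segment: both endpoints coincide
    else:
        # Project u onto the line, then clamp to stay on the segment.
        t = ((ux - ax) * dx + (uy - ay) * dy) / seg_len_sq
        t = max(0.0, min(1.0, t))
    px, py = ax + t * dx, ay + t * dy
    return (ux - px) ** 2 + (uy - py) ** 2


def render_skeleton(keypoints, edges, height, width, gamma=0.5):
    """Render keypoints joined by edges as a soft skeleton image.

    Each pixel gets exp(-gamma * d), where d is the squared distance to
    the nearest bone, so values lie in (0, 1] and peak on the bones.
    The falloff and gamma are illustrative assumptions.
    """
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            d = min(
                point_segment_sq_dist((x, y), keypoints[i], keypoints[j])
                for i, j in edges
            )
            row.append(math.exp(-gamma * d))
        image.append(row)
    return image


# Hypothetical three-joint chain in (x, y) pixel coordinates.
keypoints = [(1.0, 1.0), (4.0, 1.0), (4.0, 5.0)]
edges = [(0, 1), (1, 2)]
skeleton = render_skeleton(keypoints, edges, height=7, width=7)
```

The resulting image is indexed as `skeleton[y][x]`. Because it is a function of the keypoint coordinates alone, it carries geometry but no appearance, which is the sense in which such a representation can act as a bottleneck.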
Original language: English
Title of host publication: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Place of Publication: Seattle, WA, USA
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 11
ISBN (Electronic): 978-1-7281-7168-5
ISBN (Print): 978-1-7281-7169-2
Publication status: Published - 5 Aug 2020
Event: IEEE Conference on Computer Vision and Pattern Recognition 2020 - Seattle, United States
Duration: 16 Jun 2020 to 18 Jun 2020

Publication series

ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075


Conference: IEEE Conference on Computer Vision and Pattern Recognition 2020
Abbreviated title: CVPR 2020
Country: United States
Internet address
