Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio

Stanislaw Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

We show that the dynamics and convergence properties of SGD are set by the ratio of learning rate to batch size. We observe that this ratio is a key determinant of the generalization error, which we suggest is mediated by controlling the width of the final minima found by SGD. We verify our analysis experimentally on a range of deep neural networks and datasets.
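As a rough illustration of the claim in the abstract (not code from the paper), the sketch below runs SGD on a toy 1-D quadratic loss with fresh Gaussian minibatches. The helper name sgd_stationary_std and the chosen learning rates and batch sizes are hypothetical; the point is only that the spread of the iterates around the optimum tracks the ratio learning rate / batch size, so two configurations sharing that ratio produce similar noise.

```python
import numpy as np

# Toy illustration (assumption, not the paper's experiment): for a 1-D
# quadratic loss 0.5*(theta - x)^2 averaged over data x ~ N(0, 1), the
# stationary spread of SGD iterates scales with lr / batch_size.

def sgd_stationary_std(lr, batch_size, steps=50_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = 5.0                      # start away from the optimum at 0
    trace = []
    for t in range(steps):
        batch = rng.normal(0.0, 1.0, size=batch_size)  # fresh minibatch
        grad = theta - batch.mean()                    # minibatch gradient
        theta -= lr * grad                             # SGD update
        if t > steps // 2:                             # discard burn-in
            trace.append(theta)
    return np.std(trace)

# Equal lr/B ratios give similar iterate noise; a larger ratio gives a wider spread.
for lr, bs in [(0.01, 10), (0.04, 40), (0.04, 10)]:
    print(f"lr={lr:<5} B={bs:<3} lr/B={lr / bs:.4f} "
          f"stationary std ~ {sgd_stationary_std(lr, bs):.4f}")
```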
Original language: English
Title of host publication: Proceedings of 27th International Conference on Artificial Neural Networks
Place of publication: Rhodes, Greece
Publisher: Springer
Pages: 392-402
Number of pages: 10
ISBN (Electronic): 978-3-030-01424-7
ISBN (Print): 978-3-030-01423-0
DOIs
Publication status: Published - Oct 2018
Event: 27th International Conference on Artificial Neural Networks - Rhodes, Greece
Duration: 4 Oct 2018 - 7 Oct 2018
https://e-nns.org/icann2018/

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer, Cham
Volume: 11141
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349
Name: Theoretical Computer Science and General Issues
Volume: 11141

Conference

Conference: 27th International Conference on Artificial Neural Networks
Abbreviated title: ICANN 2018
Country/Territory: Greece
City: Rhodes
Period: 4/10/18 - 7/10/18
