Decentralized SGD and Average-direction SAM are Asymptotically Equivalent

Tongtian Zhu, Fengxiang He*, Kaixuan Chen, Mingli Song, Dacheng Tao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Decentralized stochastic gradient descent (D-SGD) enables simultaneous collaborative learning across massive numbers of devices without control by a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction sharpness-aware minimization (SAM) algorithm under general non-convex, non-β-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) there exists a free uncertainty evaluation mechanism in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as the total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios.
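The D-SGD update the abstract refers to can be sketched in a few lines: each worker gossip-averages its parameters with its neighbours and then takes a local gradient step. The sketch below is a toy illustration only, not the paper's experimental setup; the quadratic local losses, ring topology, and all parameter names are assumptions made for the example.

```python
import numpy as np

# Toy sketch of decentralized SGD (D-SGD), assuming n workers with
# hypothetical local losses f_i(x) = 0.5 * ||x - c_i||^2, connected by
# a doubly stochastic gossip matrix W over a ring topology.
# D-SGD update: x_i <- sum_j W[i, j] * x_j - lr * grad f_i(x_i).

n, d, lr, steps = 4, 3, 0.1, 200
rng = np.random.default_rng(0)
centers = rng.normal(size=(n, d))   # local minimizers c_i

# Doubly stochastic mixing matrix for a ring: self plus two neighbours.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

x = np.zeros((n, d))                # one parameter copy per worker
for _ in range(steps):
    grads = x - centers             # gradient of 0.5 * ||x_i - c_i||^2
    x = W @ x - lr * grads          # gossip average, then local step

# Because W is doubly stochastic, the worker average tracks C-SGD on the
# global loss and converges to the mean of the c_i; individual workers
# retain a small consensus gap, which the paper links to an implicit
# average-direction sharpness regularizer.
print(np.allclose(x.mean(axis=0), centers.mean(axis=0), atol=1e-6))
```

The consensus gap between workers (the spread of the rows of `x`) is the quantity whose second-order effect the paper identifies with average-direction SAM.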
Original language: English
Title of host publication: Proceedings of the 40th International Conference on Machine Learning
Number of pages: 32
Publication status: Published - 10 Jul 2023
Event: The Fortieth International Conference on Machine Learning - Honolulu, United States
Duration: 23 Jul 2023 - 29 Jul 2023
Conference number: 40

Publication series

Name: Proceedings of Machine Learning Research
ISSN (Electronic): 2640-3498


Conference: The Fortieth International Conference on Machine Learning
Abbreviated title: ICML 2023
Country/Territory: United States

