Addressing Skewness in Iterative ML Jobs with Parameter Partition

Shaoqiang Wang, Wei Chen, Xiaobo Zhou, Sang-Yoon Chang, Hua Ji

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Computational skewness is a significant challenge in multi-tenant data-parallel clusters that introduce dynamic heterogeneity of machine capacity in distributed data processing. Previous efforts to address skewness mostly focus on batch jobs, based on the assumption that processing time is linearly dependent on the size of partitioned data. However, they are ill-suited for iterative machine learning (ML) jobs, which (1) exhibit a non-linear relationship between the size of partitioned parameters and processing time within each iteration, and (2) show an explicit binding relationship between input data and parameters for parameter update.

In this paper, we present FlexPara, a parameter partition approach that leverages the non-linear relationship and provisions adaptive tasks to match the distinct machine capacity so as to address the skewness in iterative ML jobs on data-parallel clusters. FlexPara first predicts task processing time based on a capacity model designed for iterative ML jobs without the linear assumption. It then partitions parameters to parallel tasks through proactive parameter reassignment. Such reassignment can significantly reduce network transmission cost incurred by input data movement due to the binding relationship. We implement FlexPara in Spark and evaluate it with various ML jobs. Experimental results show that compared to hash partition, FlexPara speeds up the execution by up to 54% and 43% in private and NSF Chameleon clusters, respectively.

Original language: English
Title of host publication: IEEE INFOCOM 2019 - IEEE Conference on Computer Communications
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 1261-1269
Number of pages: 9
ISBN (Electronic): 978-1-7281-0515-4
ISBN (Print): 978-1-7281-0516-1
Publication status: Published - 17 Jun 2019
Event: IEEE International Conference on Computer Communications - Paris, France
Duration: 29 Apr 2019 to 2 May 2019
https://infocom2019.ieee-infocom.org/

Publication series

ISSN (Print): 0743-166X
ISSN (Electronic): 2641-9874

Conference

Conference: IEEE International Conference on Computer Communications
Abbreviated title: IEEE INFOCOM 2019
Country/Territory: France
City: Paris
Period: 29/04/19 to 2/05/19