Abstract
Computational skewness is a significant challenge in multi-tenant data-parallel clusters, which introduce dynamic heterogeneity of machine capacity into distributed data processing. Previous efforts to address skewness mostly focus on batch jobs, based on the assumption that processing time is linearly dependent on the size of the partitioned data. However, they are ill-suited for iterative machine learning (ML) jobs, which (1) exhibit a non-linear relationship between the size of partitioned parameters and the processing time within each iteration, and (2) show an explicit binding relationship between input data and parameters for parameter update.
In this paper, we present FlexPara, a parameter partition approach that leverages the non-linear relationship and provisions adaptive tasks to match the distinct machine capacities, so as to address the skewness in iterative ML jobs on data-parallel clusters. FlexPara first predicts task processing time based on a capacity model designed for iterative ML jobs without the linear assumption. It then partitions parameters across parallel tasks through proactive parameter reassignment. Such reassignment can significantly reduce the network transmission cost incurred by input data movement due to the binding relationship. We implement FlexPara in Spark and evaluate it with various ML jobs. Experimental results show that, compared to hash partitioning, FlexPara speeds up the execution by up to 54% and 43% on a private cluster and the NSF Chameleon cluster, respectively.
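The abstract does not spell out FlexPara's partitioning algorithm, so the following is only a minimal, hypothetical sketch of the general idea it describes: given a per-worker capacity model that predicts per-iteration processing time as a (possibly non-linear) function of the number of assigned parameters, greedily assign parameter blocks to whichever worker keeps the predicted completion time lowest. The function `predict_time`, the `block` granularity, and the toy slowdown factors are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch (not the paper's algorithm): balance parameter partitions
# across heterogeneous workers using a non-linear per-worker time model.

from typing import Callable, Dict, List


def partition_parameters(
    num_params: int,
    workers: List[str],
    predict_time: Callable[[str, int], float],
    block: int = 1000,
) -> Dict[str, int]:
    """Greedily assign parameter blocks to the worker whose predicted
    per-iteration time stays lowest. `predict_time(worker, n)` is an assumed
    capacity model mapping a worker and its parameter count to predicted
    processing time; it need not be linear in n."""
    assigned = {w: 0 for w in workers}  # parameters assigned to each worker
    remaining = num_params
    while remaining > 0:
        chunk = min(block, remaining)
        # Pick the worker with the smallest predicted time after adding the chunk.
        best = min(workers, key=lambda w: predict_time(w, assigned[w] + chunk))
        assigned[best] += chunk
        remaining -= chunk
    return assigned


if __name__ == "__main__":
    # Toy non-linear model: a fixed per-task overhead plus a superlinear term,
    # scaled by an (assumed) per-worker slowdown factor.
    slowdown = {"w1": 1.0, "w2": 1.8, "w3": 1.2}
    model = lambda w, n: slowdown[w] * (0.5 + (n / 1000.0) ** 1.3)
    print(partition_parameters(100_000, list(slowdown), model))
```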
Original language | English |
---|---|
Title of host publication | IEEE INFOCOM 2019 - IEEE Conference on Computer Communications |
Publisher | Institute of Electrical and Electronics Engineers |
Pages | 1261-1269 |
Number of pages | 9 |
ISBN (Electronic) | 978-1-7281-0515-4 |
ISBN (Print) | 978-1-7281-0516-1 |
DOIs | |
Publication status | Published - 17 Jun 2019 |
Event | IEEE International Conference on Computer Communications, Paris, France. Duration: 29 Apr 2019 → 2 May 2019. https://infocom2019.ieee-infocom.org/ |
Publication series
Name | |
---|---|
ISSN (Print) | 0743-166X |
ISSN (Electronic) | 2641-9874 |
Conference
Conference | IEEE International Conference on Computer Communications |
---|---|
Abbreviated title | IEEE INFOCOM 2019 |
Country/Territory | France |
City | Paris |
Period | 29/04/19 → 2/05/19 |
Internet address | https://infocom2019.ieee-infocom.org/ |