TY - JOUR
T1 - Data-driven mixed-Integer linear programming-based optimisation for efficient failure detection in large-scale distributed systems
AU - Er-Rahmadi, Btissam
AU - Ma, Tiejun
N1 - Funding Information:
This piece of research was partially sponsored by Huawei Ltd.
Funding Information:
The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated support services at the University of Southampton, UK, in the completion of this work. The authors would like to thank Prof. Tolga Bektas, Professor of Logistics Management at the University of Liverpool, UK, for assisting this research by providing valuable advice in the optimisation modelling. Finally, the authors would like to thank the anonymous reviewers for their valuable feedback.
Publisher Copyright:
© 2022 Elsevier B.V.
PY - 2022/6/8
Y1 - 2022/6/8
N2 - Failure detectors (FDs) are fundamental building blocks for distributed systems. An FD detects whether a process has crashed or not based on the reception of heartbeats’ messages sent by this process over a communication channel. A key challenge of FDs is to tune their parameters to achieve optimal performance which satisfies the desired system requirements. This is challenging due to the complexities of large-scale networks. Existing FDs ignore such optimisation and adopt ad-hoc parameters. In this paper, we propose a new Mixed Integer Linear Programming (MILP) optimisation-based FD algorithm. We obtain the MILP formulation via piecewise linearisation relaxations. The MILP involves obtaining optimal FD parameters that meet the optimal trade-off between its performance metrics requirements, network conditions and system parameters. The MILP maximises our FD’s accuracy under bounded failure detection time while considering network and system conditions as constraints. The MILP’s solution represents optimised FD parameters that maximise FD’s expected performance. To adapt to real-time network changes, our proposed MILP-based FD fits the probability distribution of heartbeats’ inter-arrivals. To address our FD scalability challenge in large-scale systems where the MILP model needs to compute approximate optimal solutions quickly, we also propose a heuristic algorithm. To test our proposed approach, we adopt Amazon Cloud as a realistic testing environment and develop a simulator for robustness tests. Our results show consistent improvement of overall FD performance and scalability. To the best of our knowledge, this is the first attempt to combine the MILP-based optimisation modelling with FD to achieve performance guarantees.
AB - Failure detectors (FDs) are fundamental building blocks for distributed systems. An FD detects whether a process has crashed or not based on the reception of heartbeats’ messages sent by this process over a communication channel. A key challenge of FDs is to tune their parameters to achieve optimal performance which satisfies the desired system requirements. This is challenging due to the complexities of large-scale networks. Existing FDs ignore such optimisation and adopt ad-hoc parameters. In this paper, we propose a new Mixed Integer Linear Programming (MILP) optimisation-based FD algorithm. We obtain the MILP formulation via piecewise linearisation relaxations. The MILP involves obtaining optimal FD parameters that meet the optimal trade-off between its performance metrics requirements, network conditions and system parameters. The MILP maximises our FD’s accuracy under bounded failure detection time while considering network and system conditions as constraints. The MILP’s solution represents optimised FD parameters that maximise FD’s expected performance. To adapt to real-time network changes, our proposed MILP-based FD fits the probability distribution of heartbeats’ inter-arrivals. To address our FD scalability challenge in large-scale systems where the MILP model needs to compute approximate optimal solutions quickly, we also propose a heuristic algorithm. To test our proposed approach, we adopt Amazon Cloud as a realistic testing environment and develop a simulator for robustness tests. Our results show consistent improvement of overall FD performance and scalability. To the best of our knowledge, this is the first attempt to combine the MILP-based optimisation modelling with FD to achieve performance guarantees.
KW - Nonlinear programming
KW - Mixed integer linear programming
KW - Distributed systems
KW - Failure detection
KW - Heartbeats
U2 - 10.1016/j.ejor.2022.02.006
DO - 10.1016/j.ejor.2022.02.006
M3 - Article
SN - 0377-2217
VL - 303
SP - 337
EP - 353
JO - European Journal of Operational Research
JF - European Journal of Operational Research
IS - 1
ER -