Abstract / Description of output
Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. In spite of many success stories, a key challenge for running workflows in distributed systems is failure prediction, detection, and recovery. In this paper, we propose an approach that uses control theory, developed as part of autonomic computing, to predict failures before they happen and mitigate them when possible. The proposed approach applies the proportional-integral-derivative (PID) controller, a control loop mechanism widely used in industrial control systems, to mitigate faults by adjusting the inputs of the mechanism. The PID controller aims at detecting the possibility of a fault far enough in advance that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of the Big Data era: data footprint and memory usage. We define, implement, and evaluate simple PID controllers to autonomously manage data and memory usage of a bioinformatics workflow that consumes/produces over 4.4 TB of data and requires over 24 TB of memory to run all tasks concurrently. Experimental results indicate that workflow executions may significantly benefit from PID controllers, in particular under online and unknown conditions. Simulation results show that nearly optimal executions (slowdown of 1.01) can be attained when using our proposed method, and faults are detected and mitigated far in advance.
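For readers unfamiliar with the control loop named in the abstract, the following is a minimal sketch of a textbook discrete PID controller. It is not the authors' implementation: the class name `PIDController`, the gains, the setpoint, and the idea of feeding it a usage fraction are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDController:
    """Textbook discrete PID controller (illustrative; all values hypothetical)."""

    kp: float        # proportional gain (assumed value, not from the paper)
    ki: float        # integral gain (assumed value, not from the paper)
    kd: float        # derivative gain (assumed value, not from the paper)
    setpoint: float  # desired value of the monitored quantity
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, measured: float, dt: float = 1.0) -> float:
        """Return the control output for one new measurement."""
        error = self.setpoint - measured
        # Integral term: accumulated error over time.
        self.integral += error * dt
        # Derivative term: rate of change of the error; zero on the
        # first call because no previous error exists yet.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


if __name__ == "__main__":
    # Hypothetical scenario: keep storage usage near 80% of capacity.
    controller = PIDController(kp=1.0, ki=0.5, kd=2.0, setpoint=0.8)
    for usage in (0.5, 0.6, 0.9):
        print(f"usage={usage:.2f} output={controller.update(usage):+.3f}")
```

In this sketch a positive output means the monitored quantity is below the setpoint and a negative output means it has overshot. How such a signal is mapped to a concrete corrective action in a workflow system is outside the scope of the example.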
Original language | English |
---|---|
Title of host publication | Proceedings of the 11th Workshop on Workflows in Support of Large-Scale Science co-located with The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2016) |
Place of Publication | Salt Lake City, Utah, USA |
Publisher | CEUR Workshop Proceedings (CEUR-WS.org) |
Pages | 15-24 |
Number of pages | 10 |
Publication status | Published - 28 Feb 2017 |
Event | 11th Workshop on Workflows in Support of Large-Scale Science co-located with The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2016), Salt Lake City, United States. Duration: 14 Nov 2016 → 14 Nov 2016. http://ceur-ws.org/Vol-1800/ |
Publication series
Name | |
---|---|
Publisher | CEUR Workshop Proceedings |
Volume | 1800 |
ISSN (Print) | 1613-0073 |
Conference
Conference | 11th Workshop on Workflows in Support of Large-Scale Science co-located with The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2016) |
---|---|
Abbreviated title | WORKS 2016 |
Country/Territory | United States |
City | Salt Lake City |
Period | 14/11/16 → 14/11/16 |
Internet address |