Abstract / Description of output
Understanding and troubleshooting distributed systems in the cloud is difficult because the execution of a single user request is spread across multiple machines. The multi-tenant nature of cloud environments also introduces interference that causes performance issues. Most existing troubleshooting tools focus on either log analysis or intrusive tracing, leaving resource usage monitoring unexplored.
We propose and implement LRTrace, a non-intrusive tracing and feedback control tool for distributed applications in lightweight virtualized environments. LRTrace profiles both the log messages and the actual resource consumption of an application at runtime in a fine-grained manner, which is made possible by lightweight container-based virtualization. By correlating these two kinds of information, LRTrace lets users relate changes in resource consumption to application events. Furthermore, LRTrace allows users to define and implement their own feedback control plug-ins to manage the cluster in a semi-automatic manner. In the system evaluation, we run Spark and MapReduce applications in a multi-tenant cluster and show that LRTrace can diagnose performance issues caused by interference, bugs, or both. It also helps users understand the workflows of data-parallel applications.
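The correlation step described in the abstract can be illustrated with a small sketch. This is not LRTrace's implementation: the `Sample` and `LogEvent` types, the `correlate` function, and the fixed time window are all hypothetical, chosen only to show one simple way of lining up log events with resource samples by timestamp.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Sample:
    """One resource reading for a container (hypothetical schema)."""
    ts: float       # seconds since start of trace
    cpu: float      # CPU utilisation in percent
    mem_mb: float   # resident memory in MB


@dataclass
class LogEvent:
    """One parsed log line (hypothetical schema)."""
    ts: float
    message: str


def correlate(events, samples, window=2.0):
    """Pair each log event with the resource change around it.

    For every event, take the last sample at or before the event and the
    last sample no later than `window` seconds after it, then report the
    CPU and memory deltas. Events without usable samples are skipped.
    """
    samples = sorted(samples, key=lambda s: s.ts)
    times = [s.ts for s in samples]
    result = []
    for ev in events:
        i = bisect_right(times, ev.ts) - 1            # last sample <= event time
        j = bisect_right(times, ev.ts + window) - 1   # last sample <= event time + window
        if i < 0 or j <= i or ev.ts - times[i] > window:
            continue  # no sample close enough before, or none after
        before, after = samples[i], samples[j]
        result.append((ev.message,
                       after.cpu - before.cpu,
                       after.mem_mb - before.mem_mb))
    return result


if __name__ == "__main__":
    samples = [Sample(0, 10, 100), Sample(1, 10, 100), Sample(2, 50, 300),
               Sample(3, 80, 500), Sample(4, 80, 500)]
    events = [LogEvent(1.0, "task started")]
    for message, d_cpu, d_mem in correlate(events, samples):
        print(f"{message}: cpu {d_cpu:+.1f}%, mem {d_mem:+.1f} MB")
```

A real tool would read samples from the container runtime and parse events from application logs; the sketch assumes both are already available as in-memory lists with comparable timestamps.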
Original language | English |
---|---|
Title of host publication | Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing |
Place of Publication | New York, NY, USA |
Publisher | ACM |
Pages | 168-179 |
Number of pages | 12 |
ISBN (Print) | 978-1-4503-5785-2 |
DOIs | |
Publication status | Published - 11 Jun 2018 |
Event | 27th International Symposium on High-Performance Parallel and Distributed Computing - Tempe, United States. Duration: 11 Jun 2018 → 15 Jun 2018. http://www.hpdc.org/2018/ |
Publication series
Name | HPDC '18 |
---|---|
Publisher | ACM |
Conference
Conference | 27th International Symposium on High-Performance Parallel and Distributed Computing |
---|---|
Abbreviated title | HPDC 2018 |
Country/Territory | United States |
City | Tempe |
Period | 11/06/18 → 15/06/18 |
Internet address |
Keywords / Materials (for Non-textual outputs)
- data-parallel applications
- lightweight virtualization
- logs
- resource metrics
- troubleshooting