Abstract
Incremental and approximate computations are increasingly being adopted for data analytics to achieve low-latency execution and efficient utilization of computing resources. Incremental computation updates the output incrementally instead of re-computing everything from scratch for successive runs of a job with input changes. Approximate computation returns an approximate output for a job instead of the exact output. Both paradigms rely on computing over a subset of data items instead of computing over the entire dataset, but they differ in their means for skipping parts of the computation. Incremental computing relies on the memoization of intermediate results of sub-computations, and reusing these memoized results across jobs. Approximate computing relies on representative sampling of the entire dataset to compute over a subset of data items.
In this paper, we observe that these two paradigms are complementary, and can be married together! Our idea is quite simple: design a sampling algorithm that biases the sample selection to the memoized data items from previous runs. To realize this idea, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. We implemented our algorithm in a data analytics system called IncApprox based on Apache Spark Streaming. Our evaluation using micro-benchmarks and real-world case-studies shows that IncApprox achieves the benefits of both incremental and approximate computing.
In this paper, we observe that these two paradigms are complementary, and can be married together! Our idea is quite simple: design a sampling algorithm that biases the sample selection to the memoized data items from previous runs. To realize this idea, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. We implemented our algorithm in a data analytics system called IncApprox based on Apache Spark Streaming. Our evaluation using micro-benchmarks and real-world case-studies shows that IncApprox achieves the benefits of both incremental and approximate computing.
Original language | English |
---|---|
Title of host publication | WWW '16 Proceedings of the 25th International Conference on World Wide Web |
Place of Publication | Republic and Canton of Geneva, Switzerland |
Publisher | International World Wide Web Conferences Steering Committee |
Pages | 1133-1144 |
Number of pages | 12 |
ISBN (Print) | 978-1-4503-4143-1 |
DOIs | |
Publication status | Published - 2016 |
Event | 25th International Conference Companion on World Wide Web - Montreal, Canada Duration: 11 Apr 2016 → 15 Apr 2016 http://www2016.ca/ |
Publication series
Name | WWW '16 |
---|---|
Publisher | International World Wide Web Conferences Steering Committee |
Conference
Conference | 25th International Conference Companion on World Wide Web |
---|---|
Abbreviated title | WWW 2016 |
Country/Territory | Canada |
City | Montreal |
Period | 11/04/16 → 15/04/16 |
Internet address |