SAND Join — A skew handling join algorithm for Google's MapReduce framework

F. Atta, S. D. Viglas, S. Niazi

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

The simplicity and flexibility of the MapReduce framework have motivated programmers of large scale distributed data processing applications to develop their applications using this framework. However, the implementations of this framework, including Hadoop, do not handle skew in the input data effectively. Skew in the input data results in poor load balancing which can swamp the benefits achievable by parallelization of applications on such parallel processing frameworks. The performance of join operation, which is the most expensive and most frequently executed operation, is severely degraded in the presence of heavy skew in the input datasets to be joined. Hadoop's implementation of the join operation cannot effectively handle such skewed joins, attributed to the use of hash partitioning for load distribution. In this work, we introduce “Skew hANDling Join” (SAND Join) that employs range partitioning instead of hash partitioning for load distribution. Experiments show that SAND Join algorithm can efficiently perform joins on the datasets that are sufficiently skewed. We also compare the performance of this algorithm with that of Hadoop's join algorithms.
Original languageEnglish
Title of host publicationMultitopic Conference (INMIC), 2011 IEEE 14th International
PublisherInstitute of Electrical and Electronics Engineers (IEEE)
Pages170-175
Number of pages6
ISBN (Print)978-1-4577-0654-7
DOIs
Publication statusPublished - 1 Dec 2011

Fingerprint Dive into the research topics of 'SAND Join — A skew handling join algorithm for Google's MapReduce framework'. Together they form a unique fingerprint.

Cite this