Neither of the two broad classes of fault models considered by traditional fault tolerance techniques --- crash and Byzantine faults --- suit the environment of systems that run in today's data centers. On the one hand, assuming Byzantine faults is considered overkill due to the assumption of a worst-case adversarial behavior, and the use of other techniques to guard against malicious attacks. On the other hand, the crash fault model is insufficient since it does not capture non-crash faults that may result from a variety of unexpected conditions that are commonplace in this setting. In this paper, we present the case for a more practical approach at handling non-crash (but non-adversarial) faults in data-center scale computations. In this context, we discuss how such problem can be tackled for an important class of data-center scale systems: systems for large-scale processing of data, with a particular focus on the Pig programming framework. Such an approach not only covers a significant fraction of the processing jobs that run in today's data centers, but is potentially applicable to a broader class of applications.
|Title of host publication||Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware|
|Place of Publication||New York, NY, USA|
|Number of pages||6|
|Publication status||Published - Jun 2010|