EPFL just published my thesis online. You can check it out here (EPFL Infoscience).
Amid a data revolution that is transforming industries around the globe, computing systems have undergone a paradigm shift: many applications are now scaled out to run on multiple computers in a computing cluster. As the storage and processing capabilities of a single machine cannot keep pace with the amount of data, companies turn to distributed solutions to organize, persist, and analyze this Big Data. These software solutions divide datasets into partitions that are processed in parallel on separate machines.

A common problem in current cluster computing frameworks is load imbalance and limited parallelism due to skewed data distributions, processing times, and machine speeds. Load imbalance occurs when a few machines have more work and take longer to finish while the others remain idle, resulting in reduced overall performance and low resource utilization. The underlying cause of these issues is data locality: because each machine processes the data stored locally, a partition is tightly coupled to the machine on which it is placed.

This dissertation proposes a novel scatter architecture for cluster applications that addresses the load imbalance problem and improves resource utilization. Scatter systems abandon data locality in favor of load balance. Existing systems first target locality, bringing computation to the data, and only then attempt to minimize load imbalance. We instead disregard locality and focus solely on achieving load balance by scattering all data across machines and bringing data to the computation as needed. The scatter architecture disaggregates compute and storage resources and pools them across machines. We ensure storage load balance by spreading each partition's data in small blocks across all storage devices, and we allow machines to retrieve data using an efficient, decentralized scheme.
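To give a flavor of what a decentralized placement scheme can look like, here is a minimal sketch in Python. It assumes a simple hash-based mapping (all names here are hypothetical, not the thesis's actual scheme): each block of a partition is assigned to a storage device by hashing its identifier, so any machine can compute a block's location locally without consulting a central directory, and the blocks of one partition spread roughly evenly across all devices.

```python
import hashlib
from collections import Counter

def block_location(partition_id: int, block_index: int, num_devices: int) -> int:
    """Deterministically map a block to a storage device by hashing its
    (partition, block) identifier. Every machine computes the same answer
    independently, so lookups need no coordination."""
    key = f"{partition_id}:{block_index}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_devices

# Spread one partition's 1000 blocks across 8 storage devices.
placements = [block_location(partition_id=3, block_index=i, num_devices=8)
              for i in range(1000)]

# Each device ends up with close to 1000 / 8 = 125 blocks,
# which is the storage load balance the architecture relies on.
counts = Counter(placements)
```

In a real system the hash would typically feed a consistent-hashing or rendezvous scheme so that adding or removing devices moves only a small fraction of blocks, but the key property is the same: placement is a pure function of the block's identity.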
Scatter systems achieve compute load balance at runtime by dynamically adjusting the parallelism within a partition, either through work-sharing or by offloading background tasks to other machines. We design and implement three cluster applications inspired by the scatter architecture: a graph processing system that scales out using secondary storage, a general-purpose analytics framework, and a filesystem substrate that mitigates load imbalance for existing distributed databases. We demonstrate that each application can gracefully handle significant load imbalance with minimal performance degradation and that these applications can perform up to an order of magnitude faster than existing systems.
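The work-sharing idea above can be sketched with a toy example (names hypothetical, and far simpler than the systems in the thesis): instead of binding a skewed partition to a single worker, its tasks sit in a shared queue that idle workers help drain, so parallelism within the partition grows with the number of free machines.

```python
from collections import deque
from threading import Lock, Thread

class SharedPartitionQueue:
    """Tasks for one partition, drained cooperatively. Any idle worker
    can take a task, so a skewed partition no longer serializes on the
    single machine that happens to host it."""
    def __init__(self, tasks):
        self._tasks = deque(tasks)
        self._lock = Lock()

    def take(self):
        with self._lock:
            return self._tasks.popleft() if self._tasks else None

def worker(queue, results, worker_id):
    # Pull tasks until the partition's queue is empty.
    while True:
        task = queue.take()
        if task is None:
            return
        results.append((worker_id, task * task))  # stand-in for real work

# One skewed partition with 100 tasks, processed by 4 cooperating workers.
queue = SharedPartitionQueue(range(100))
results = []
threads = [Thread(target=worker, args=(queue, results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The sketch ignores data movement entirely; in a scatter system the worker would first fetch the task's input blocks from the shared storage pool, which is what makes this kind of dynamic work-sharing possible in the first place.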