Redundancy scheduling for fast, predictable, and timely response in distributed systems
Description
TitleRedundancy scheduling for fast, predictable, and timely response in distributed systems
Date Created2022
Other Date2022-05 (degree)
Extent164 pages : illustrations
DescriptionPerformance improvement in distributed systems has been under study for decades, and the proposed solutions are diverse. Among all, redundancy has shown to be effective both in theory and in practice. With redundancy, more than one server can handle each portion of a job. Therefore, the system may complete a request without relying on the faultless behavior of all the servers. This thesis studies redundancy scheduling to achieve a faster, more predictable, and timely response in distributed systems. Moreover, by implementing a reinforcement learning system, we show that timeliness plays a role in the convergence speed of such systems.The first problem of this thesis considers the redundancy scheduling problem to im- prove the compute time of a master-worker distributed computing. First, we find the opti- mum assignment policy of a set of tasks to workers at a given redundancy level. Using the results from majorization theory, we show that if the service time of tasks is a stochastically (decreasing and) convex random variable in the number of workers assigned to each task, a balanced assignment of non-overlapping batches of tasks minimizes the average job com- pute time. Knowing the optimum assignment for a given redundancy level, we then study the optimum level of redundancy from the perspective of average compute time and com- pute time predictability. We derive the optimum redundancy level as a function of workers’ service time distribution. We show that, when optimizing for redundancy level, there is a trade-off between the average value and the predictability of compute time. In other words, the redundancy level that minimizes the average compute time is not necessarily the same as the redundancy level that maximizes compute time predictability. We run experiments on Google cluster traces and show that a careful assignment of redundancy (according to the workers’ service time distribution) can speed up the computations by order of magnitude.
The second chapter considers redundancy scheduling in queuing systems. We aim at bringing insight into the performance of nonadaptive scheduling techniques in queuing systems with redundancy. To that end, we develop an analogy to a classical urns and balls problem and use it as a proxy to understand the performance of two well-known nonadap- tive scheduling policies: random and round-robin. We introduce performance indicators in the analogous model and argue that they are good predictors of the queuing time of non- adaptive scheduling policies. We then propose a novel nonadaptive scheduling policy based on combinatorial designs and show that it has better performance indicators. Simulation results confirm that the proposed scheduling policy, as the performance indicators sug- gest, reduces the average queuing time of jobs compared to both random and round-robin scheduling.
The third chapter considers the timeliness of retrieved data from a distributed storage with redundancy. We study a distributed storage architecture where a combination of syn- chronous and asynchronous replication is used. Such architectures are widely deployed in commercial storage systems, such as Google Cloud Spanner and Amazon Aurora. With Age of Information (AoI) as the freshness metric, we characterize the average timeliness of the data retrieved from such storage systems as a function of replica synchronism. We de- rive the average age under two different replication scenarios, corresponding to real-world systems with different constraints. We derive the closed-form average age when the writ- ing times to replicas follow i.i.d exponential distribution. Our numerical results show that, depending on the relative speed of the write operation to the two groups of replicas, there exists an optimal number of synchronous replicas, which minimizes the average age of the retrieved data.
The last chapter of this thesis considers the role of timely experience samples in the convergence of a reinforcement learning algorithm. In particular, in a setting where the collection of experience samples and learning of the model occurs at two separate entities, respectively called actor and learner, we investigate the effect of timeliness of experiences used by the learner on the convergence of Q-learning. With the Windy Grid-World as the task to solve, we observe that the fresher the experiences used by the learner, the faster the convergence of Q-learning. Moreover, our result shows that a small number of fresh experiences could result in faster convergence than many stale experiences.
NotePh.D.
NoteIncludes bibliographical references
Genretheses
LanguageEnglish
CollectionSchool of Graduate Studies Electronic Theses and Dissertations
Organization NameRutgers, The State University of New Jersey
RightsThe author owns the copyright to this work.