TY  - JOUR
TI  - A programming model and execution system for adaptive ensemble applications on high performance computing systems
DO  - https://doi.org/doi:10.7282/t3-0yx8-xx61
PY  - 2019
AB  - Traditionally, advances in high-performance scientific computing have focused on the scale, performance, and optimization of an application with a large, single task, and less on applications composed of multiple tasks. However, many scientific problems are expressed as applications that require the collective outcome of one or more ensembles of computational tasks in order to provide insight into the problem being studied. Depending on the scientific problem, a task of an ensemble can be any type of program: from a molecular simulation to a data analysis or a machine learning model. With different communication and coordination patterns, both within and across ensembles, the number and type of applications that can be formulated as ensembles is vast and spans many scientific domains, including biophysics, climate science, polar science and earth science. The performance of ensemble applications can be improved further by using partial results to adapt the application at runtime. Partial results of ongoing executions can be analyzed with multiple methods to adapt the application to focus on relevant portions of the problem space or to reduce the application's time to execution. These benefits are confirmed by the increasing role played by adaptivity in ensemble applications developed to support several domain sciences, including biophysics and climate science. Although HPC systems provide the computational power required for ensemble applications, their design and policies tend to privilege the execution of single, very large tasks. On the biggest and busiest systems in the world, queue waiting time for each task can reach days, while the lack of elastic coordination and communication infrastructure makes distributing the execution of ensemble applications difficult.
Further, access, submission and execution methods vary across HPC systems, alongside their policies and performance. HPC systems are increasingly displaying performance dynamism and fluctuations due to aggressive thermal management and throttling. Together, these factors make using HPC systems for adaptive and non-adaptive ensemble applications challenging. Existing solutions to express and execute ensemble applications on HPC systems range from complex scripts and domain-specific workflow systems to general-purpose workflow systems. Scripts and domain-specific workflow systems serve as point solutions, often limited in functionality, usability and performance to the scope of the specific application and the HPC system. General-purpose systems, on the other hand, require retrofitting the ensemble applications using the tools and interfaces provided by the system, which can be challenging even when feasible. The goal of this research is to advance the state of the art by simplifying the programmability of ensemble applications, abstracting the complexity of their scalable, efficient and robust execution on HPC systems, and, most importantly, enabling domain scientists to focus on the computational campaigns and algorithmic innovations that are of importance to their science domains. In this dissertation, we describe several science drivers that employ ensemble applications to address some of the most challenging scientific problems of our time.
We address three main challenges of executing ensemble applications at scale on HPC resources: (i) we address application diversity and programmability by developing a generic programming model that treats an ‘ensemble’ as a first-order concern and provides constructs specifically to express ensemble applications; (ii) we develop a software system, called Ensemble Toolkit, to provide scalable and robust execution of ensemble applications while abstracting the user from the architecture and policies of HPC systems, and from resource and execution management; and (iii) we propose and evaluate scheduling strategies to manage the effect of workload heterogeneity and resource dynamism on the time to execute ensemble applications on HPC systems. We discuss several achievements and results obtained in various scientific domains as a consequence of the research and development described in this dissertation.
KW  - High performance computing
KW  - Electrical and Computer Engineering
LA  - English
ER  -