Zhang, Fan. Programming and runtime support for enabling data-intensive coupled scientific simulation workflows. Retrieved from https://doi.org/doi:10.7282/T31N8307
Description: Emerging coupled scientific simulation workflows are composed of multiple component applications that interact and exchange data at runtime. Coupled simulation workflows enable multi-physics, multi-model code coupling and online data analysis, which have the potential to provide high-fidelity modeling and accelerate the path from simulation data to insight. However, running coupled simulation workflows on extreme-scale computing systems presents several challenges. First, most workflow component applications were originally developed as programs that execute independently. Composing a workflow requires programming support to glue the component applications together, orchestrate their execution, and express data exchange. Second, simulation workflows require extracting and moving data between coupled applications. As data volumes and generation rates keep growing, the traditional disk-I/O-based approach to data movement becomes cost prohibitive, and workflows require a more scalable and efficient approach. Third, the cost of moving large volumes of data over the system interconnection network becomes dominant and significantly impacts workflow execution time. Minimizing the amount of network data movement and localizing data transfers within the network topology are critical for reducing this cost. To achieve this, workflow task placement should exploit data locality to the extent possible and move computation closer to the data. This thesis addresses these challenges in workflow composition, data management, and task placement, and makes the following contributions: (1) This thesis presents the DIMES data management framework to support memory-to-memory data movement between coupled applications. DIMES co-locates in-memory staging on application compute nodes to store data that needs to be shared or exchanged, and enables access to the data through an array-based query interface.
(2) This thesis presents the CoDS task execution framework to support workflow composition and execution; it implements a task execution programming interface for composing customized workflows and orchestrating the execution of component applications. (3) This thesis presents communication- and topology-aware task mapping, which implements a holistic approach to mapping the workflow communication graph onto the physical network topology. The method effectively reduces the total volume of network data movement and the workflow communication time. The research concepts and software prototypes have been evaluated using real application workflows on extreme-scale computing systems.
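The idea behind contribution (3) can be illustrated with a small sketch. It is not the algorithm from the thesis: it is a simple greedy heuristic, written under the assumptions that traffic between task pairs is known in bytes, that each node has a fixed number of task slots, and that network distance is given by a hop-count function. All names (`comm_cost`, `greedy_map`, `hops`, the task labels) are hypothetical and chosen only for illustration.

```python
def comm_cost(placement, traffic, hops):
    """Total bytes moved, weighted by hop distance between hosting nodes."""
    return sum(nbytes * hops(placement[a], placement[b])
               for (a, b), nbytes in traffic.items())


def greedy_map(tasks, traffic, nodes, slots_per_node, hops):
    """Greedy heuristic: handle the heaviest task pairs first and put each
    unplaced task on the free node that adds the least weighted traffic.
    Assumes len(tasks) <= len(nodes) * slots_per_node."""
    placement = {}
    free = {n: slots_per_node for n in nodes}

    def added_cost(task, node):
        # Cost of traffic between `task` and its already-placed partners.
        cost = 0
        for (a, b), nbytes in traffic.items():
            if a == task and b in placement:
                cost += nbytes * hops(node, placement[b])
            elif b == task and a in placement:
                cost += nbytes * hops(node, placement[a])
        return cost

    def place(task):
        candidates = [n for n in nodes if free[n] > 0]
        best = min(candidates, key=lambda n: added_cost(task, n))
        placement[task] = best
        free[best] -= 1

    # Heaviest communicating pairs are placed first so they can share a node.
    for a, b in sorted(traffic, key=traffic.get, reverse=True):
        for task in (a, b):
            if task not in placement:
                place(task)
    # Tasks with no recorded traffic go wherever a slot remains.
    for task in tasks:
        if task not in placement:
            place(task)
    return placement


# Hypothetical workflow: two simulation tasks, each feeding one analysis task.
tasks = ["sim0", "sim1", "ana0", "ana1"]
traffic = {("sim0", "ana0"): 100, ("sim1", "ana1"): 100, ("sim0", "sim1"): 10}
nodes = [0, 1]


def hops(x, y):
    # Assumed topology: nodes on a line, distance is the index difference.
    return abs(x - y)


mapped = greedy_map(tasks, traffic, nodes, slots_per_node=2, hops=hops)
# Baseline that keeps each application on its own node.
naive = {"sim0": 0, "sim1": 0, "ana0": 1, "ana1": 1}
print(comm_cost(mapped, traffic, hops), comm_cost(naive, traffic, hops))
```

In this toy case, placing each analysis task next to the simulation task that feeds it keeps the two heavy exchanges on-node, so only the light simulation-to-simulation exchange crosses the network. A production mapper would also need to account for link contention and real topology information, which this sketch leaves out.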