Docan, Ciprian. Asynchronous data transfers on large scale HPC systems with experiments on the CRAY XT3/XT4. Retrieved from https://doi.org/doi:10.7282/T35Q4WF0
Description: A key challenge faced by emerging large-scale scientific and engineering simulations is effectively and efficiently managing the large volumes of heterogeneous data they generate. This includes offloading the data from the compute nodes at runtime and transferring it to service nodes or remote clusters for online monitoring, analysis, or archiving. To be effective, these I/O operations should not impose additional synchronization penalties on the simulation, should have minimal impact on computational performance, should maintain overall Quality of Service, and should ensure that no data is lost.
This thesis describes the design, implementation, and operation of DART (Decoupled Asynchronous Remote Transfers). DART is a thin software layer built on RDMA (Remote Direct Memory Access) communication technology, specifically the Portals RDMA library, to allow fast, low-overhead access to data from simulations running on compute elements and to support high-throughput, low-latency asynchronous I/O transfers of this data.
DART is part of the infrastructure for an integrated simulation of fusion plasma in a Tokamak being developed at the Center for Plasma Edge Simulation (CPES), a DoE Office of Fusion Energy Science (OFES) Fusion Simulation Project (FSP). A performance evaluation on the Cray XT3/XT4 system at Oak Ridge National Laboratory demonstrates that DART can be used to offload expensive I/O operations to dedicated service nodes, allowing more efficient utilization of the compute elements.