Application-aware on-line failure recovery for extreme-scale HPC environments

Gamell Balmana, Marc

doi:10.7282/T37P927J

RUcore: Rutgers University Community Repository

Simple citation

Gamell Balmana, Marc. Application-aware on-line failure recovery for extreme-scale HPC environments. Retrieved from https://doi.org/doi:10.7282/T37P927J

Description

Title: Application-aware on-line failure recovery for extreme-scale HPC environments

Name: Gamell Balmana, Marc (author); Parashar, Manish (chair); Marsic, Ivan (internal member); Silver, Deborah (internal member); Teranishi, Keita (outside member); Rutgers University; Graduate School - New Brunswick

Date Created: 2017

Other Date: 2017-05 (degree)

Subject: Electrical and Computer Engineering, Fault tolerance (Engineering), High performance computing

Extent: 1 online resource (xxvi, 235 p. : ill.)

Description: High Performance Computing (HPC) brings with it the promise of deeper insight into complex phenomena through the execution of various extreme-scale applications, especially those in the fields of science and engineering. The increasing computational demands of these applications continue to push the limits of current extreme-scale HPC systems. As a result, the community is working toward exascale systems able to compute 10^18 floating point operations per second (FLOPS). Since these systems are expected to contain a large number of components, reliability is one of the key anticipated challenges. Because complex applications run for extended periods of time, future systems will likely see an increase in process and node failures during application execution. These failures, also known as hard failures, are currently handled by terminating the execution and restarting it from the last stored checkpoint. This checkpoint-restart methodology requires the application to periodically save its distributed state to centralized, stable storage, an approach that is not expected to scale to future extreme-scale systems. While the illusion of a failure-free machine, implemented via either hardware or system software strategies, is adequate for current HPC systems, it may prove too costly in future extreme-scale machines. Resilience is, therefore, a key challenge that must be addressed in order to realize the exascale vision. This dissertation explores new models that leverage application-awareness to enable on-line failure recovery. On-line recovery, which does not require the interruption of surviving processes in order to collectively restart the entire application, offers better cost/performance tradeoffs by reducing recovery overheads. Recovering processes on-line enables application-specific data recovery strategies and optimized in-memory checkpointing while avoiding the repetition of initialization procedures (the least optimized part of most production-level applications) on all processes. This dissertation addresses three areas of research in on-line failure recovery. First, it explores a generic global on-line recovery model in which all processes take part in recovery. Second, it explores optimized local recovery, in which the communication characteristics of certain application classes are leveraged to reduce the overhead caused by failures. In particular, finite difference partial differential equation solvers using stencil operators are used as the driving application class. Third, this dissertation demonstrates how the overhead of multiple, independent failures can be masked to effectively reduce the impact on total execution time. The models presented in this dissertation are implemented and evaluated in Fenix and FenixLR, a pair of generic and extensible frameworks used to demonstrate the concepts.

Note: Ph.D.

Note: Includes bibliographical references

Note: by Marc Gamell Balmana

Genre: theses, ETD doctoral

Persistent URL: https://doi.org/doi:10.7282/T37P927J

Language: eng

Collection: Graduate School - New Brunswick Electronic Theses and Dissertations

Organization Name: Rutgers, The State University of New Jersey

Rights: The author owns the copyright to this work.
