Failure analysis, modeling, and prediction for BlueGene/L

Liang, Yinglung

doi:10.7282/T32J6C8N

RUcore: Rutgers University Community Repository

Simple citation

Liang, Yinglung. Failure analysis, modeling, and prediction for BlueGene/L. Retrieved from https://doi.org/doi:10.7282/T32J6C8N

Description

Uniform Title: Failure analysis, modeling, and prediction for BlueGene/L

Name: Liang, Yinglung (author); Zhang, Yanyong (chair); Trappe, Wade (internal member); Parashar, Manish (internal member); Xiong, Hui (outside member); Rutgers University; Graduate School - New Brunswick

Date Created: 2007

Other Date: 2007 (degree)

Subject: Electrical and Computer Engineering, Supercomputers, Computers--Reliability

Extent: xii, 129 pages

Description: The growing computational and storage needs of scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, a 64K dual-core processor system. One of the challenges of designing and deploying such systems in a production setting is the need to take failure occurrences into account. Once a large-scale system is equipped with failure prediction capability, its fault tolerance and resource management strategies can be improved significantly, and its performance can increase accordingly.
This dissertation is based on the Reliability, Availability, and Serviceability (RAS) events generated by IBM BlueGene/L over a period of 142 days. Using these logs, we performed failure analysis, modeling, and prediction: filters are created to reveal the system's failure behaviors, three preliminary models are identified for the failures, and finally, three failure predictors are established for the system. We make heavy use of data mining and time series analysis techniques. Our comprehensive evaluation demonstrates that our Bi-Modal Nearest Neighbor predictor greatly outperforms the other two (RIPPER- and LIBSVM-based), leading to F-measures of 70% and 50% for 12-hour and 6-hour prediction windows, respectively.
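For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how an F-measure can be computed for window-based failure prediction. It assumes the balanced (F1) form and one boolean label per prediction window; this record does not state the dissertation's exact definition or evaluation procedure, and the function names `score_windows` and `f_measure` are hypothetical.

```python
def score_windows(predicted, actual):
    """Count true positives, false positives, and false negatives.

    `predicted` and `actual` are aligned sequences of booleans, one per
    prediction window (assumed layout, not taken from the dissertation).
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    return tp, fp, fn


def f_measure(tp, fp, fn):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if tp == 0:
        # No correct predictions: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only; these are not data from the BlueGene/L logs.
predicted = [True, True, False, False, True]
actual = [True, False, True, False, True]
tp, fp, fn = score_windows(predicted, actual)
score = f_measure(tp, fp, fn)
```

Under this definition, a predictor scores well only when it both catches most failures (recall) and raises few false alarms (precision).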

Note: Ph.D.

Note: Includes bibliographical references (p. 123-127).

Genre: theses, ETD doctoral

Persistent URL: https://doi.org/doi:10.7282/T32J6C8N

Language: English

Collection: Graduate School - New Brunswick Electronic Theses and Dissertations

Organization Name: Rutgers, The State University of New Jersey

Rights: The author owns the copyright to this work.
