Designing and Modelling Selective Replication for Fault-tolerant HPC Applications



Subasi O., YALÇIN G., Zyulkyarov F., Unsal O., Labarta J.

17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Madrid, Spain, 14 - 17 May 2017, pp.452-457

  • Publication Type: Conference Paper / Full Text
  • Doi Number: 10.1109/ccgrid.2017.40
  • City: Madrid
  • Country: Spain
  • Page Numbers: pp.452-457

Abstract

Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Some studies address fail-stop errors and others address SDCs, but few address both types of errors together. In this paper, we propose a software-based selective replication technique for HPC applications that handles both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user.