Designing and Modelling Selective Replication for Fault-tolerant HPC Applications



Subasi O., YALÇIN G., Zyulkyarov F., Unsal O., Labarta J.

17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Madrid, Spain, 14-17 May 2017, pp. 452-457

  • DOI: 10.1109/ccgrid.2017.40
  • City of Publication: Madrid
  • Country of Publication: Spain
  • Pages: pp. 452-457

Abstract

Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Existing studies typically address either fail-stop errors or SDCs; few address both types of errors together. In this paper, we propose a software-based selective replication technique for HPC applications that handles both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user.