A runtime heuristic to selectively replicate tasks for application-specific reliability targets


Creative Commons License

SUBASI O., YALÇIN G., ZYULKYAROV F., UNSAL O., LABARTA J.

IEEE International Conference on Cluster Computing (CLUSTER), Taipei, Taiwan, 13-15 September 2016, pp. 498-505

  • DOI: 10.1109/cluster.2016.54
  • City of Publication: Taipei
  • Country of Publication: Taiwan
  • Pages: 498-505

Abstract

In this paper, we propose a runtime-based selective task replication technique for task-parallel high-performance computing applications. The technique is automatic and does not require modifying or recompiling the OS, the compiler, or the application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, the results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
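
The abstract does not spell out how App_FIT makes its decisions, so the following is only an illustrative sketch of one way a runtime could replicate tasks selectively against a reliability target. It is not the authors' algorithm. It assumes, hypothetically, that each task carries an estimated failure-rate contribution (`fit`), that the target is expressed as a budget on the total failure rate left unprotected (`fit_budget`), and that a replicated task contributes nothing to that total. The names `Task`, `fit`, `fit_budget`, and `select_for_replication` are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    fit: float  # hypothetical estimated failure-rate contribution of this task


def select_for_replication(tasks, fit_budget):
    """Greedy, in-order selection (illustrative assumption, not App_FIT itself).

    A task runs unprotected while the accumulated unprotected failure rate
    stays within fit_budget; otherwise it is marked for replication.
    Simplification: a replicated task is assumed to add no residual risk.
    """
    unprotected_fit = 0.0
    replicated = []
    for task in tasks:
        if unprotected_fit + task.fit <= fit_budget:
            unprotected_fit += task.fit   # cheap path: run a single copy
        else:
            replicated.append(task.name)  # budget would be exceeded: replicate
    return replicated, unprotected_fit


if __name__ == "__main__":
    tasks = [Task("a", 4.0), Task("b", 3.0), Task("c", 5.0),
             Task("d", 2.0), Task("e", 1.0)]
    replicated, residual = select_for_replication(tasks, fit_budget=8.0)
    print(replicated, residual)  # ['c', 'd'] 8.0
```

In this toy run only two of the five tasks are replicated while the unprotected total stays within the budget, which mirrors the abstract's point that complete replication is not needed to meet a reliability target.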