Title: Algorithm-based fault-tolerant parallel sorting

Authors: Edson T. Camargo; Elias P. Duarte Junior

Addresses: Federal University of Technology – Parana (UTFPR), Toledo, PR, Brazil ' Department of Informatics, Federal University of Parana (UFPR), Curitiba, PR, Brazil

Abstract: High performance computing (HPC) systems often require substantial resources, and can take up to several hours or days to execute. Upon a failure, it is important to loose as little computation as possible. In this work we present an algorithm-based fault tolerance (ABFT) strategy for hypercube-based parallel algorithms. The strategy assumes the virtual VCube topology, which has several logarithmic properties that are preserved even as nodes fail. The strategy guarantees that the algorithm does not halt even after up to (N - 1) nodes crash, in a system of N nodes. We use parallel sorting as a case study, describing how to make a fault-tolerant version of three parallel sorting algorithms: HyperQuickSort, QuickMerge and Bitonic Sort. The algorithms were implemented in MPI using ULMF to handle faults. Experimental results are presented showing the performance and robustness of the solution for sorting up to a billion integers in scenarios with faults.

Keywords: high performance computing; HPC; algorithm-based fault tolerance; ABFT; user level failure mitigation; ULFM; fault tolerance.

DOI: 10.1504/IJCCBS.2024.139097

International Journal of Critical Computer-Based Systems, 2024 Vol.11 No.1/2, pp.2 - 29

Received: 01 Apr 2023
Accepted: 10 Aug 2023

Published online: 13 Jun 2024 *

Full-text access for editors Full-text access for subscribers Purchase this article Comment on this article