Abstract: The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Θ(log n) to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model, and the non-constant synchronization cost makes the task challenging. In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blow-up in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT). Our major results are as follows: (1) A randomized comparison-based sorting algorithm with optimal O(log n) span and O(n log n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication (MM) with only a Θ(log log n)-factor blow-up in work, as well as a near-optimal O(log n log log log n) span algorithm with no asymptotic blow-up in work. (3) A near-optimal O(log n log log log n) span Fast Fourier Transform (FFT) algorithm with less than a log n-factor blow-up in work for all practical values of n (i.e., n ≤ 10^10,000).
(Paper by Zafar Ahmad, Rezaul Chowdhury, Rathish Das, Pramod Ganapathi, Aaron Gregory, and Mohammad Mahdi Javanmard)
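To illustrate the model described in the abstract, here is a minimal sketch (not taken from the paper) of how spawning n tasks works under binary forking: each call forks a single child thread and recurses on the two halves, so the fork/join tree has depth Θ(log n), matching the stated cost of spawning or synchronizing n tasks. The function names `spawn_all` and `do_task` are illustrative placeholders, and `std::thread` stands in for whatever fork primitive a real runtime (e.g., a work-stealing scheduler) would provide.

```cpp
// Minimal sketch of binary forking: spawning n leaf tasks via recursive
// halving, giving a fork/join tree of depth Theta(log n).
#include <thread>

// Hypothetical task body: the work performed at leaf index i.
void do_task(int i) { /* ... user work for task i ... */ }

// Recursively spawn tasks for indices [lo, hi) using a binary fork tree.
void spawn_all(int lo, int hi) {
    if (hi - lo == 1) { do_task(lo); return; }
    int mid = lo + (hi - lo) / 2;
    std::thread child(spawn_all, lo, mid);  // fork: child thread handles the left half
    spawn_all(mid, hi);                     // parent continues with the right half
    child.join();                           // synchronize: joins mirror the fork tree
}

int main() {
    spawn_all(0, 16);  // 16 leaf tasks, fork/join depth ~ log2(16) = 4
    return 0;
}
```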