Fast algorithms for matrix multiplication --- i.e.,
algorithms that compute less than O(N^3) operations--- are
becoming attractive for two simple reasons:
State of the artIn the era of parallel processors on a chip, we will be able to solve larger and larger problems faster than we could do before. In practice, we will have a small parallel system (i.e., a single 50-100 GFLOPS node), and we must have more and faster memory (i.e., GBytes and GBytes/sec) to make these systems scalable and, if not, possible at all.These systems will be complex to build and understand. A developer or a compiler will have hard time in generating good codes. We foresee that experimentation will be a fundamental part of the next generation of libraries and code generators (and softare tools in general). The choice of the algorithm and its implementations will be a function of the system (i.e., Athlon vs. Pentium 4), problem size (i.e., Matrix multiplication of size N=10,000) and problem type (i.e., double or single precision). Our ContributionWe propose the hybrid implementation of fast algorithms with the classic algorithms. These are divide-and-conquer algorithms that are simple to formulate and to understand, and easy to deploy. They naturally exploit parallelism and data locality.The simplicity of the code leads to easy maintanance and portatibility, as well as to a simple performance model, where few terms must be tuned for a (specific) systems. In fact, these hybrid fast algorithms split explicitly the computation into parts (i.e., kernel) that are compute bound and others that are memory bound. By a carefull but straigtforward Experimentation + modeling, we can develop high performance fast codes with no hassle. |