Fast sparse matrix-vector multiplication by partitioning and reordering

A.N. Yzelman

    Research output: Thesis › Doctoral thesis 1 (Research UU / Graduation UU)

    Abstract

    The thesis introduces a cache-oblivious method for sparse matrix-vector (SpMV) multiplication, an important computational kernel in many applications. The method permutes the rows and columns of the input matrix so that the reordered matrix induces cache-friendly behaviour during the SpMV multiply. The reordering is obtained from a recursive sparse matrix partitioning scheme based on hypergraph partitioning, and is closely related to earlier partitioning schemes for distributed-memory parallel SpMV multiplication. Initially, the input sparse matrix is modelled as a one-dimensional hypergraph.

    Extending this to a standard two-dimensional model, while still allowing the reordering method to perform only row and column permutations, requires generalising the standard Compressed Row Storage data structure to a block-based data structure; it is shown which recursive block ordering is asymptotically optimal. Using the techniques developed for the two-dimensional method, a further cache-oblivious method based on the Hilbert curve is developed. All these methods are cache-oblivious in that the reordering algorithm does not depend on the cache parameters of specific target architectures. Experiments on three different architectures show that the one-dimensional method can improve the SpMV multiplication performance of unstructured sparse matrices by up to a factor of two. The adapted Hilbert scheme is somewhat less effective, whereas the two-dimensional method typically performs better; there, the largest gain exceeds a factor of three.

    To further speed up the SpMV multiplication on multicore architectures, the performance of parallel SpMV multiplication is investigated using the Bulk Synchronous Parallel (BSP) model and the proof-of-concept MulticoreBSP library. This library is written in Java and explicitly targets shared-memory systems; it is also applied to the fast Fourier transform and to dense LU decomposition. The parallel scheme is combined with the cache-oblivious reordering to obtain a fully parallel cache-oblivious scheme. Plain parallelisation yields a maximum speedup of 3.5 on four threads, showing that the BSP model applies well to shared-memory multicore systems. Combined with reordering, superlinear speedups become possible: in one case a speedup of 2.8 was attained using only two threads.
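
    To make the reordering idea above concrete, the minimal Java sketch below stores a sparse matrix in Compressed Row Storage (CRS), applies given row and column permutations, and then runs a plain CRS-based SpMV multiply on the permuted matrix. It is illustrative only: the class and method names are invented here, and the permutations are assumed to be supplied by some external partitioner rather than computed by the hypergraph-based scheme of the thesis.

        import java.util.Arrays;

        /**
         * Minimal CRS sparse matrix with a reordered SpMV.
         * The permutations are assumed to be supplied externally, e.g. by a
         * hypergraph-based partitioner; this class only applies them.
         */
        public final class ReorderedCrs {
            private final int[] rowStart;   // length m+1, start index of each row
            private final int[] colIndex;   // length nnz, column index per nonzero
            private final double[] value;   // length nnz, numerical values

            public ReorderedCrs(int[] rowStart, int[] colIndex, double[] value) {
                this.rowStart = rowStart;
                this.colIndex = colIndex;
                this.value = value;
            }

            /** y = A x using a straightforward CRS traversal. */
            public double[] multiply(double[] x) {
                final int m = rowStart.length - 1;
                final double[] y = new double[m];
                for (int i = 0; i < m; i++) {
                    double sum = 0.0;
                    for (int k = rowStart[i]; k < rowStart[i + 1]; k++) {
                        sum += value[k] * x[colIndex[k]];
                    }
                    y[i] = sum;
                }
                return y;
            }

            /**
             * Build the permuted matrix, where rowPerm[i] is the original row placed
             * at new position i and colPos[j] is the new position of original column j.
             * Afterwards, multiply() traverses the input vector in the new column order.
             */
            public ReorderedCrs permute(int[] rowPerm, int[] colPos) {
                final int m = rowStart.length - 1;
                final int nnz = value.length;
                final int[] newRowStart = new int[m + 1];
                final int[] newColIndex = new int[nnz];
                final double[] newValue = new double[nnz];
                int k = 0;
                for (int i = 0; i < m; i++) {
                    final int oldRow = rowPerm[i];
                    newRowStart[i] = k;
                    for (int p = rowStart[oldRow]; p < rowStart[oldRow + 1]; p++) {
                        newColIndex[k] = colPos[colIndex[p]];
                        newValue[k] = value[p];
                        k++;
                    }
                }
                newRowStart[m] = k;
                return new ReorderedCrs(newRowStart, newColIndex, newValue);
            }

            public static void main(String[] args) {
                // 3x3 example: [[4,0,1],[0,2,0],[3,0,5]] in CRS form.
                ReorderedCrs a = new ReorderedCrs(
                        new int[] {0, 2, 3, 5},
                        new int[] {0, 2, 1, 0, 2},
                        new double[] {4, 1, 2, 3, 5});
                int[] rowPerm = {2, 0, 1};            // new row i <- old row rowPerm[i]
                int[] colPos  = {1, 2, 0};            // old column j -> new column colPos[j]
                double[] xPermuted = {1.0, 1.0, 1.0}; // x must be permuted consistently with colPos
                System.out.println(Arrays.toString(
                        a.permute(rowPerm, colPos).multiply(xPermuted))); // [8.0, 5.0, 2.0]
            }
        }

    The key point of the method is visible in the sketch: the multiply kernel itself is unchanged, ordinary CRS code; only the nonzero ordering and the vector layouts differ, which is what allows the approach to remain oblivious to cache parameters.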
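
    The parallel results above were obtained with the MulticoreBSP library, whose API is not reproduced here. The sketch below only mimics the BSP structure of a shared-memory parallel SpMV, in which each thread multiplies its own block of rows during a computation superstep and then synchronises at a barrier, using plain java.util.concurrent primitives. All names are hypothetical and the static row distribution is an assumption made for illustration, not the partitioning used in the thesis.

        import java.util.concurrent.CyclicBarrier;

        /**
         * BSP-style shared-memory SpMV sketch: each thread computes a contiguous
         * block of rows (one superstep), then all threads meet at a barrier.
         * This mimics the structure of a BSP program but does NOT use the
         * MulticoreBSP API.
         */
        public final class ParallelSpmv {
            public static void multiply(int[] rowStart, int[] colIndex, double[] value,
                                        double[] x, double[] y, int threads)
                    throws InterruptedException {
                final int m = rowStart.length - 1;
                final CyclicBarrier barrier = new CyclicBarrier(threads);
                final Thread[] workers = new Thread[threads];
                for (int t = 0; t < threads; t++) {
                    final int id = t;
                    workers[t] = new Thread(() -> {
                        // Static 1D row distribution: thread id owns rows [lo, hi).
                        final int lo = (int) ((long) m * id / threads);
                        final int hi = (int) ((long) m * (id + 1) / threads);
                        for (int i = lo; i < hi; i++) {
                            double sum = 0.0;
                            for (int k = rowStart[i]; k < rowStart[i + 1]; k++) {
                                sum += value[k] * x[colIndex[k]];
                            }
                            y[i] = sum;
                        }
                        try {
                            barrier.await(); // end of the computation superstep
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    });
                    workers[t].start();
                }
                for (Thread w : workers) {
                    w.join();
                }
            }

            public static void main(String[] args) throws InterruptedException {
                // Same 3x3 example matrix as above, multiplied with two threads.
                int[] rowStart = {0, 2, 3, 5};
                int[] colIndex = {0, 2, 1, 0, 2};
                double[] value = {4, 1, 2, 3, 5};
                double[] x = {1, 1, 1};
                double[] y = new double[3];
                multiply(rowStart, colIndex, value, x, y, 2);
                System.out.println(java.util.Arrays.toString(y)); // [5.0, 2.0, 8.0]
            }
        }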
    Original language: English
    Qualification: Doctor of Philosophy
    Awarding Institution
    • Utrecht University
    Supervisors/Advisors
    • Bisseling, Rob, Primary supervisor
    Award date: 3 Oct 2011
    Place of Publication: Utrecht
    Publisher
    Print ISBNs: 978-90-393-5590-9
    Publication status: Published - 3 Oct 2011
