@Article{DAlberto:2009:AWM, author = "Paolo D'Alberto and Alexandru Nicolau", title = "Adaptive {Winograd's} Matrix Multiplications", journal = "{ACM} Transactions on Mathematical Software", volume = "36", number = "1", month = mar, year = 2009, pages = "3:1--3:23", URL = "http://doi.acm.org/10.1145/1486525.1486528", accepted = "30 June 2008", abstract = " Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multi-cores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex trade-off, not just a simple matter of counting cost of simple CPU operations. \par We present a novel, hybrid, and adaptive recursive Strassen-Winograd's matrix multiplication (MM) that uses automatically tuned linear algebra software (ATLAS) or GotoBLAS. Our algorithm applies to any size and shape matrices stored in either row or column major layout (in double-precision in this work) and thus is efficiently applicable to both C and FORTRAN implementations. In addition, our algorithm divides the computation into equivalent in-complexity sub-MMs and does not require any extra computation to combine the intermediary sub-MM results. \par We achieve up to 22 percent execution-time reduction versus GotoBLAS/ATLAS alone for a single core system and up to 19 percent for a 2 dual-core processor system. Most importantly, even for small matrices such as $1500 \times 1500$, our approach attains already 10 percent execution-time reduction and, for MM of matrices larger than $3000 \times 3000$, it delivers performance that would correspond, for a classic $O(n^3)$ algorithm, to faster-than-processor peak performance (i.e., our algorithm delivers the equivalent of 5 GFLOPS performance on a system with 4.4 GFLOPS peak performance and where GotoBLAS achieves only 4 GFLOPS). This is a result of the savings in operations (and thus FLOPS). Therefore, our algorithm is faster than any {\em classic} MM algorithms could ever be for matrices of this size. Furthermore, we present experimental evidence based on established methodologies found in the literature that our algorithm is, for a family of matrices, as accurate as the classic algorithms.", }