# High-performance implementation of the level-3 BLAS

@article{Goto2008HighperformanceIO, title={High-performance implementation of the level-3 BLAS}, author={Kazushige Goto and Robert A. van de Geijn}, journal={ACM Trans. Math. Softw.}, year={2008}, volume={35}, pages={4:1-4:14} }

A simple but highly effective approach for transforming high-performance implementations on cache-based architectures of matrix-matrix multiplication into implementations of other commonly used matrix-matrix computations (the level-3 BLAS) is presented. Exceptional performance is demonstrated on various architectures.
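The paper's central idea, that other level-3 operations can be cast largely in terms of a highly tuned GEMM, can be illustrated with a symmetric rank-k update (SYRK). The sketch below is mine, not GotoBLAS code: the function name, blocking factor, and use of NumPy are assumptions for illustration.

```python
import numpy as np

def syrk_lower(A, C, nb=4):
    """Update the lower triangle of C with A @ A.T, casting most of the
    work as GEMM-like block multiplies (illustrative sketch only)."""
    n = C.shape[0]
    for i in range(0, n, nb):
        ib = min(nb, n - i)
        # small symmetric update on the diagonal block
        C[i:i+ib, i:i+ib] += A[i:i+ib, :] @ A[i:i+ib, :].T
        # the off-diagonal panel below it is a plain GEMM
        C[i+ib:, i:i+ib] += A[i+ib:, :] @ A[i:i+ib, :].T
    return C
```

As the block size grows, nearly all flops land in the panel GEMM, which is why a single optimized GEMM kernel carries the rest of the level-3 BLAS.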

#### 336 Citations

Anatomy of high-performance matrix multiplication

- Computer Science
- TOMS
- 2008

We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified…

Implementing high-performance complex matrix multiplication via the 3m and 4m methods

- 2017

of these so-called “induced” methods, and observe that the assembly-level method actually resides along the 4M spectrum of algorithmic variants. Implementations are developed within the BLIS…
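The 3m method alluded to above forms a complex product from three real multiplications instead of four. A minimal sketch, assuming explicit whole-matrix real/imaginary temporaries (BLIS induces this inside its packing and kernels, not with temporaries as here):

```python
import numpy as np

def gemm3m(A, B):
    """Complex matrix product via the 3m trick: three real GEMMs
    instead of the four of the naive expansion (illustrative sketch)."""
    Ar, Ai = A.real, A.imag
    Br, Bi = B.real, B.imag
    P1 = Ar @ Br
    P2 = Ai @ Bi
    P3 = (Ar + Ai) @ (Br + Bi)   # the cross terms fall out by subtraction
    return (P1 - P2) + 1j * (P3 - P1 - P2)
```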

GotoBLAS - Anatomy of a fast matrix multiplication High performance libraries in computational science

- Computer Science
- 2008

This paper summarizes the theoretical and practical approaches used to develop high-performance BLAS code, and shows how ideas such as implementation analysis and efficient memory usage carry over to many real-world problems.

Attaining High Performance in General-Purpose Computations on Current Graphics Processors

- Computer Science
- VECPAR
- 2008

This paper evaluates the performance of linear algebra and image processing routines, both on classical and unified GPU architectures and traditional processors (CPUs).

BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing

- Computer Science
- ICS
- 2016

Existing multi-GPU level-3 BLAS implementations are investigated; issues such as improper load balancing, inefficient communication, insufficient GPU stream-level concurrency, and poor data caching impede them from fully harnessing heterogeneous computing resources, and inter-GPU peer-to-peer (P2P) communication remains unexplored.

High-Performance Matrix Multiply on a Massively Multithreaded Fiteng1000 Processor

- Computer Science
- ICA3PP
- 2012

This paper presents parallel algorithms that share the A or B matrix in memory for the massively multithreaded Fiteng1000 processor, and shows that the algorithms achieve good parallel scalability and near-peak performance.

Implementing High-Performance Complex Matrix Multiplication via the 1M Method

- Computer Science, Mathematics
- SIAM J. Sci. Comput.
- 2020

Almost all efforts to optimize high-performance matrix-matrix multiplication have been focused on the case where matrices contain real elements. The community's collective assumption appears to hav...
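The 1M method reduces a complex multiply to a single real GEMM. A whole-matrix sketch of the underlying identity (the actual method applies it during packing, at the micro-kernel level; the explicit temporaries built with `np.block` here are for illustration only):

```python
import numpy as np

def gemm1m(A, B):
    """Complex product from ONE real GEMM, in the spirit of the 1M method.
    A is expanded into its 2x2 real-block form and B into stacked
    real/imaginary parts (illustrative sketch, not BLIS code)."""
    m = A.shape[0]
    # [[Ar, -Ai],   [[Br],     [[Cr],
    #  [Ai,  Ar]] @  [Bi]]  =   [Ci]]
    Aexp = np.block([[A.real, -A.imag],
                     [A.imag,  A.real]])   # (2m) x (2k)
    Bexp = np.vstack([B.real, B.imag])     # (2k) x n
    Cexp = Aexp @ Bexp                     # the single real GEMM
    return Cexp[:m] + 1j * Cexp[m:]
```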

Task Parallel Implementation of Matrix Multiplication on Multi-socket Multi-core Architectures

- Computer Science
- ICA3PP
- 2015

This paper presents a parallel implementation framework for dense matrix multiplication on multi-socket multi-core architectures that combines the Winograd algorithm and the classical algorithm to achieve dynamic load balancing and enforce data locality.
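The hybrid of a fast algorithm with the classical one can be sketched as a Strassen-style recursion that falls back to a plain product below a cutoff. This is an illustration of the general idea only, not the paper's task-parallel framework; the cutoff and the even-sized square blocks are assumptions.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's 7-multiply recursion on square matrices, falling back
    to the classical product at small sizes (illustrative sketch)."""
    n = A.shape[0]
    if n <= cutoff or n % 2:
        return A @ B   # classical base case
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The cutoff is where a hybrid scheme hands the sub-blocks to the tuned classical GEMM, which dominates at cache-friendly sizes.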

GEMM Optimization for a Decoupled Access/Execute Architecture Processor

- Computer Science
- 2015

The GEMM kernel for the DAE processor was divided into four levels; several levels of the new algorithm can self-adjust, effectively improving its performance.

Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor

- Computer Science
- ICPADS
- 2012

A variety of methods were employed to avoid L1 cache misses in single-thread optimization, including cache and register blocking, the Loongson 3A 128-bit memory accessing extension instructions, software prefetching, and single precision floating-point SIMD instructions.
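Cache and register blocking of the kind described amounts to tiling the three GEMM loops so that each operand block stays resident in some level of the hierarchy while it is reused. A loop-level sketch (block sizes and loop order here are illustrative; production kernels do the innermost work in registers and assembly):

```python
import numpy as np

def gemm_blocked(A, B, mc=32, kc=32, nc=32):
    """Triple-blocked matrix multiply: each (mc x kc) block of A is
    reused against a (kc x nc) panel of B, the loop structure behind
    cache blocking (illustrative sketch of the technique)."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for jc in range(0, n, nc):        # panel of B chosen for reuse
        for pc in range(0, k, kc):    # block of A sized for a cache level
            for ic in range(0, m, mc):
                C[ic:ic+mc, jc:jc+nc] += (
                    A[ic:ic+mc, pc:pc+kc] @ B[pc:pc+kc, jc:jc+nc])
    return C
```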

#### References

Showing 1–10 of 17 references

Anatomy of high-performance matrix multiplication

- Computer Science
- TOMS
- 2008

We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified…

GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

- Computer Science
- TOMS
- 1998

This work states that it is possible to develop a portable and high-performance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation.
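A triangular solve (TRSM) is a standard example of this GEMM-based approach: only small diagonal blocks need a true triangular solve, while the trailing update is a GEMM. A sketch under an assumed block size, using NumPy solves (not the paper's model implementation):

```python
import numpy as np

def trsm_blocked(L, B, nb=4):
    """Solve L X = B for X, with L lower triangular, blocked so that the
    bulk of the flops land in GEMM updates (illustrative sketch)."""
    n = B.shape[0]
    X = B.copy()
    for i in range(0, n, nb):
        ib = min(nb, n - i)
        # small triangular solve on the diagonal block
        X[i:i+ib] = np.linalg.solve(L[i:i+ib, i:i+ib], X[i:i+ib])
        # rank-ib GEMM update of the trailing rows
        X[i+ib:] -= L[i+ib:, i:i+ib] @ X[i:i+ib]
    return X
```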

Toward Scalable Matrix Multiply on Multithreaded Architectures

- Computer Science
- Euro-Par
- 2007

We show empirically that some of the issues that affected the design of linear algebra libraries for distributed memory architectures will also likely affect such libraries for shared memory…

A set of level 3 basic linear algebra subprograms

- Mathematics, Computer Science
- TOMS
- 1990

This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrix-matrix operations that should provide for efficient and portable…

FLAME: Formal Linear Algebra Methods Environment

- Computer Science
- TOMS
- 2001

This paper illustrates the observations by looking at the Formal Linear Algebra Methods Environment (FLAME), a framework that facilitates the derivation and implementation of linear algebra algorithms on sequential architectures, and demonstrates that lessons learned in the distributed-memory world can guide us toward better approaches even in the sequential world.

Automatically Tuned Linear Algebra Software

- Computer Science
- Proceedings of the IEEE/ACM SC98 Conference
- 1998

An approach is presented for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units, using the widely used linear algebra kernels known as the Basic Linear Algebra Subprograms (BLAS).

Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software

- Mathematics, Computer Science
- SIAM Rev.
- 2004

Some of the recent advances made by applying the paradigm of recursion to dense matrix computations on today's memory-tiered computer systems are reviewed in detail.

LAPACK Users' Guide

- Computer Science
- 1995

The third edition of the LAPACK Users' Guide covers troubleshooting and installation of the routines, and provides examples of how to convert from LINPACK or EISPACK to LAPACK.

ACM Transactions on Mathematical Software, 2008. Received May 2006.