The GotoBLAS/BLIS Approach to Optimizing Matrix-Matrix Multiplication - Step-by-Step

This page leads you, step by step, through the optimization of matrix-matrix multiplication. For now, it is assumed that you have a Linux account on an Intel-processor-based computer. We will use the gcc compiler as part of the exercise.

If you are taking Linear Algebra - Foundations to Frontiers, post comments and questions on the discussion board.


If you use these materials for a class project, you MUST disclose where you found this information; otherwise you risk being accused of academic dishonesty.


This work is based on two publications. You will want to read these before you start the exercise (don't worry if at first you only understand a fraction of them), and then again when you are done with it. If you use information on this page in other research, please reference these papers.

Set Up

This wiki page assumes that you have access to an Intel-based processor, the gcc compiler, and Octave (an open-source MATLAB-like environment that is part of a typical Linux or Unix install).

To be able to follow along with the below examples, you will want to download some routines, as described on the Set Up page.

Make sure that the makefile starts with the following lines:

This indicates that the performance of the version of matrix-matrix multiplication in MMult0.c is measured (by virtue of the statement OLD := MMult0).
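For reference, MMult0.c contains the textbook triple-loop implementation. A minimal sketch (column-major storage with leading dimensions, as used throughout the exercise):

```c
#define A(i, j) a[(j) * lda + (i)]
#define B(i, j) b[(j) * ldb + (i)]
#define C(i, j) c[(j) * ldc + (i)]

/* C := C + A * B, where A is m x k, B is k x n, and C is m x n, all
   stored in column-major order with leading dimensions lda, ldb, ldc. */
void MMult0(int m, int n, int k,
            double *a, int lda,
            double *b, int ldb,
            double *c, int ldc)
{
  for (int i = 0; i < m; i++)        /* loop over the rows of C */
    for (int j = 0; j < n; j++)      /* loop over the columns of C */
      for (int p = 0; p < k; p++)    /* dot product that updates C(i,j) */
        C(i, j) += A(i, p) * B(p, j);
}
```

This performs 2mnk floating point operations, which is the operation count used when the performance graphs report GFLOPS.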

Next, to make sure that the graphs are properly scaled when plotted, set certain parameters in the file proc_parameters.m. See the comments in that file. (Setting these parameters ensures that the y-axis of the plots ranges from 0 to the peak performance of the architecture.)

Picking the right clock speed is a bit tricky, given that modern architectures have a feature called 'turbo boost' that changes the clock speed. For example, the Intel Core i5 in my laptop has a base clock speed of 1.7 GHz, but a turbo boost rate of 2.6 GHz. I chose to indicate in proc_parameters.m that the processor has a clock speed of 2.6 GHz, since otherwise some of the results would show the implementation attaining greater than the peak speed of the processor.
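The peak used for scaling is simply the clock rate times the number of floating point operations that can be completed per cycle. A sketch of the arithmetic (the 2.6 GHz and 4 flops/cycle figures are those of the laptop described above):

```c
/* Peak performance used to scale the y-axis of the plots:
   clock rate (GHz) times floating point operations per cycle.
   For the laptop above: 2.6 GHz (turbo boost) x 4 flops/cycle. */
double peak_gflops(double clock_ghz, int flops_per_cycle)
{
  return clock_ghz * flops_per_cycle;
}
```

With these values the y-axis tops out at 10.4 GFLOPS; scaling with the 1.7 GHz base clock instead would make some measurements appear to exceed peak.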


The performance graph (on my 1.7GHz Intel Core i5 MacBook Air) looks something like

Notice that the two curves are right on top of each other because data for the same implementation are being compared. From the fact that the top of the graph represents peak performance, it is obvious that this simple implementation achieves only a fraction of the ideal performance.

The question, of course, is: is this the best we can do? We are going to walk through a sequence of optimizations, culminating in the performance marked "NEW" in the following graph:

Step-by-step optimizations

We will now lead the visitor through a series of optimizations. In some cases, a new implementation (optimization) is merely a small step in the right direction. We change the code a little at a time so that it is easy to check that it remains correct.

Computing four elements of C at a time

Hiding computation in a subroutine

This does not yield better performance:

It does set us up for the next step.
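The idea of this step is to isolate the update of a single element of C — a dot product — in its own routine. A sketch along the lines of the tutorial's AddDot routine (the exact argument list here is an assumption):

```c
#define X(i) x[(i) * incx]

/* gamma := x' y + gamma, where x has stride incx (a row of A, stored
   column-major, has stride lda) and y is a contiguous column of B. */
void AddDot(int k, double *x, int incx, double *y, double *gamma)
{
  for (int p = 0; p < k; p++)
    *gamma += X(p) * y[p];
}
```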

Computing four elements at a time

At this point, we are starting to see some performance improvements:
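The 1x4 idea: the loop indexed by p is shared by the updates of four adjacent elements in a row of C, so each element of the row of A is fetched once per four multiply-adds. A sketch (the routine name follows the tutorial's AddDot1x4 naming; column-major storage assumed):

```c
#define A(i, j) a[(j) * lda + (i)]
#define B(i, j) b[(j) * ldb + (i)]
#define C(i, j) c[(j) * ldc + (i)]

/* Update C(0,0), C(0,1), C(0,2), C(0,3) -- four elements in one row of
   C -- in a single loop.  a points at a row of A, b at a k x 4 slice
   of B, and c at a 1 x 4 slice of C. */
void AddDot1x4(int k, double *a, int lda,
               double *b, int ldb,
               double *c, int ldc)
{
  for (int p = 0; p < k; p++) {
    C(0, 0) += A(0, p) * B(p, 0);
    C(0, 1) += A(0, p) * B(p, 1);
    C(0, 2) += A(0, p) * B(p, 2);
    C(0, 3) += A(0, p) * B(p, 3);
  }
}
```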

Further optimizing

There is considerable improvement for problem sizes that fit (at least partially) in the L2 cache. Still, there is a lot of room for improvement.
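The further 1x4 optimizations accumulate the four results in scalar variables (which the compiler can keep in registers) and step pointers through the columns of B, so that elements of C and B are not re-addressed on every iteration. A sketch of the combined effect of these steps (names are illustrative):

```c
/* 1x4 update with register accumulators and pointer arithmetic
   (column-major storage; a points at a row of A). */
void AddDot1x4_regs(int k, double *a, int lda,
                    double *b, int ldb,
                    double *c, int ldc)
{
  register double c_00 = 0.0, c_01 = 0.0, c_02 = 0.0, c_03 = 0.0;
  register double a_0p;
  double *b_p0 = &b[0 * ldb], *b_p1 = &b[1 * ldb],
         *b_p2 = &b[2 * ldb], *b_p3 = &b[3 * ldb];

  for (int p = 0; p < k; p++) {
    a_0p = a[p * lda];       /* element A(0,p) of the row of A */
    c_00 += a_0p * *b_p0++;  /* accumulate into registers */
    c_01 += a_0p * *b_p1++;
    c_02 += a_0p * *b_p2++;
    c_03 += a_0p * *b_p3++;
  }
  c[0 * ldc] += c_00;        /* write back to C once, at the end */
  c[1 * ldc] += c_01;
  c[2 * ldc] += c_02;
  c[3 * ldc] += c_03;
}
```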

Computing a 4 x 4 block of C at a time

We now compute a 4 x 4 block of C at a time in order to use vector instructions and vector registers effectively. The idea is as follows: the SSE3 instruction set includes special instructions that allow one to perform two 'multiply accumulate' operations (two multiplies and two adds) per clock cycle, for a total of four floating point operations per clock cycle. To use these, one has to place data in 'vector registers'. There are sixteen of these, each of which can hold two double-precision numbers, so we can keep 32 double-precision numbers in registers. Sixteen of these numbers (eight vector registers) will hold the elements of C, a 4 x 4 block.
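A sketch of the resulting micro-kernel, written with __m128d intrinsics: each vector register holds two elements of a column of the 4 x 4 block of C, so the block occupies eight registers. (This sketch uses SSE2-level _mm_set1_pd/_mm_mul_pd/_mm_add_pd calls rather than the exact SSE3 instructions of the tutorial's final routine, and the routine name is illustrative.)

```c
#include <emmintrin.h>  /* __m128d intrinsics */

/* C(0:3, 0:3) += A(0:3, :) * B(:, 0:3), column-major storage.
   Each __m128d holds two doubles: eight registers for the 4x4 block
   of C, two for a column of A, and broadcast values of B. */
void AddDot4x4_sse(int k, double *a, int lda,
                   double *b, int ldb,
                   double *c, int ldc)
{
  __m128d c00_c10 = _mm_setzero_pd(), c01_c11 = _mm_setzero_pd(),
          c02_c12 = _mm_setzero_pd(), c03_c13 = _mm_setzero_pd();
  __m128d c20_c30 = _mm_setzero_pd(), c21_c31 = _mm_setzero_pd(),
          c22_c32 = _mm_setzero_pd(), c23_c33 = _mm_setzero_pd();

  for (int p = 0; p < k; p++) {
    __m128d a0p_a1p = _mm_loadu_pd(&a[p * lda + 0]); /* A(0,p), A(1,p) */
    __m128d a2p_a3p = _mm_loadu_pd(&a[p * lda + 2]); /* A(2,p), A(3,p) */
    __m128d b_p0 = _mm_set1_pd(b[0 * ldb + p]);      /* duplicate B(p,0) */
    __m128d b_p1 = _mm_set1_pd(b[1 * ldb + p]);
    __m128d b_p2 = _mm_set1_pd(b[2 * ldb + p]);
    __m128d b_p3 = _mm_set1_pd(b[3 * ldb + p]);

    c00_c10 = _mm_add_pd(c00_c10, _mm_mul_pd(a0p_a1p, b_p0));
    c01_c11 = _mm_add_pd(c01_c11, _mm_mul_pd(a0p_a1p, b_p1));
    c02_c12 = _mm_add_pd(c02_c12, _mm_mul_pd(a0p_a1p, b_p2));
    c03_c13 = _mm_add_pd(c03_c13, _mm_mul_pd(a0p_a1p, b_p3));
    c20_c30 = _mm_add_pd(c20_c30, _mm_mul_pd(a2p_a3p, b_p0));
    c21_c31 = _mm_add_pd(c21_c31, _mm_mul_pd(a2p_a3p, b_p1));
    c22_c32 = _mm_add_pd(c22_c32, _mm_mul_pd(a2p_a3p, b_p2));
    c23_c33 = _mm_add_pd(c23_c33, _mm_mul_pd(a2p_a3p, b_p3));
  }

  /* Add the accumulators into C, two elements of a column at a time. */
  _mm_storeu_pd(&c[0*ldc+0], _mm_add_pd(_mm_loadu_pd(&c[0*ldc+0]), c00_c10));
  _mm_storeu_pd(&c[0*ldc+2], _mm_add_pd(_mm_loadu_pd(&c[0*ldc+2]), c20_c30));
  _mm_storeu_pd(&c[1*ldc+0], _mm_add_pd(_mm_loadu_pd(&c[1*ldc+0]), c01_c11));
  _mm_storeu_pd(&c[1*ldc+2], _mm_add_pd(_mm_loadu_pd(&c[1*ldc+2]), c21_c31));
  _mm_storeu_pd(&c[2*ldc+0], _mm_add_pd(_mm_loadu_pd(&c[2*ldc+0]), c02_c12));
  _mm_storeu_pd(&c[2*ldc+2], _mm_add_pd(_mm_loadu_pd(&c[2*ldc+2]), c22_c32));
  _mm_storeu_pd(&c[3*ldc+0], _mm_add_pd(_mm_loadu_pd(&c[3*ldc+0]), c03_c13));
  _mm_storeu_pd(&c[3*ldc+2], _mm_add_pd(_mm_loadu_pd(&c[3*ldc+2]), c23_c33));
}
```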

Repeating the same optimizations

At this point, we are again starting to see some performance improvements:

Further optimizing

We now optimize differently than we did for the 1x4 case.

We notice a considerable performance boost:

Still, there is a lot of room for improvement.

Blocking to maintain performance

Now, performance is maintained:
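Blocking partitions the m and k dimensions so that each roughly mc x kc block of A (and the panel of B it multiplies) fits in the L2 cache while an inner kernel works on it. A sketch of the outer blocking loops (the inner_kernel here is a plain triple loop standing in for the optimized kernel, and mc, kc would be tuned to the cache):

```c
#define min(i, j) ((i) < (j) ? (i) : (j))

/* Unoptimized stand-in for the inner kernel: C := C + A*B on an
   m x k block of A and a k x n panel of B (column-major). */
static void inner_kernel(int m, int n, int k,
                         double *a, int lda,
                         double *b, int ldb,
                         double *c, int ldc)
{
  for (int j = 0; j < n; j++)
    for (int i = 0; i < m; i++)
      for (int p = 0; p < k; p++)
        c[j * ldc + i] += a[p * lda + i] * b[j * ldb + p];
}

/* Outer blocking loops: carve A into mc x kc blocks so each block
   stays resident in cache while the inner kernel uses it. */
void MMult_blocked(int m, int n, int k,
                   double *a, int lda,
                   double *b, int ldb,
                   double *c, int ldc,
                   int mc, int kc)
{
  for (int p = 0; p < k; p += kc) {
    int pb = min(k - p, kc);      /* block size in the k dimension */
    for (int i = 0; i < m; i += mc) {
      int ib = min(m - i, mc);    /* block size in the m dimension */
      inner_kernel(ib, n, pb, &a[p * lda + i], lda, &b[p], ldb,
                   &c[i], ldc);
    }
  }
}
```

Because each block of A is reused against all n columns of C, the cost of bringing it into cache is amortized over many floating point operations.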

Packing into contiguous memory

We now attain 90% of the turbo boost peak of the processor!
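Packing copies the 4-row slivers of A (and, similarly, panels of B) into contiguous buffers so that the micro-kernel streams through memory with unit stride. A sketch along the lines of the tutorial's PackMatrixA (column-major storage; a 4-row sliver is assumed):

```c
/* Copy a 4 x k sliver of A (4 rows, k columns, column-major with
   leading dimension lda) into the contiguous buffer a_to, so the
   micro-kernel reads it with unit stride. */
void PackMatrixA(int k, double *a, int lda, double *a_to)
{
  for (int p = 0; p < k; p++) {
    double *a_pp = &a[p * lda];   /* points at A(0, p) */
    *a_to++ = a_pp[0];
    *a_to++ = a_pp[1];
    *a_to++ = a_pp[2];
    *a_to++ = a_pp[3];
  }
}
```

The copy is not free, but each packed sliver is reused many times by the micro-kernel, so the packing cost is amortized.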

rvdgWiki: HowToOptimizeGemm (last edited 2015-10-06 14:42:58 by RobertVanDeGeijn)