# A quick trick for faster naïve matrix multiplication

If you need to multiply some matrices together very quickly, usually it's best to use a highly optimized library like ATLAS.
But sometimes adding such a dependency isn't worth it, if you're worried about portability, code size, etc.
If you just need good performance, rather than the *best possible* performance, it can make sense to hand-roll your own matrix multiplication function.

Unfortunately, the way that matrix multiplication is usually taught:

```
C_{i,j} = \sum_k A_{i,k} \, B_{k,j}
```

`$\Bigg($`

`$\Bigg) = \Bigg($`

`$\Bigg) \, \Bigg($`

`$\Bigg)$`

leads to a slow implementation:

```
void matmul(double *dest, const double *lhs, const double *rhs,
size_t rows, size_t mid, size_t cols) {
for (size_t i = 0; i < rows; ++i) {
for (size_t j = 0; j < cols; ++j) {
const double *rhs_row = rhs;
double sum = 0.0;
for (size_t k = 0; k < mid; ++k) {
sum += lhs[k] * rhs_row[j];
rhs_row += cols;
}
*dest++ = sum;
}
lhs += mid;
}
}
```

This function multiplies a `rows`

×`mid`

matrix with a `mid`

×`cols`

matrix using the "linear algebra 101" algorithm.
Unfortunately, it has a bad memory access pattern: we loop over `dest`

and `lhs`

pretty much in order, but jump all over the place in `rhs`

, since it's stored row-major but we need its columns.

Luckily there's a simple fix that's dramatically faster: instead of computing each cell of the destination separately, we can update whole rows of it at a time. Effectively, we do this:

```
C_{i} = \sum_j A_{i,j} \, B_j
```

`$\Bigg($`

`$\Bigg) = \Bigg($`

`$\Bigg) \, \Bigg($`

`$\Bigg)$`

In code, it looks like this:

```
void matmul(double *dest, const double *lhs, const double *rhs,
size_t rows, size_t mid, size_t cols) {
memset(dest, 0, rows * cols * sizeof(double));
for (size_t i = 0; i < rows; ++i) {
const double *rhs_row = rhs;
for (size_t j = 0; j < mid; ++j) {
for (size_t k = 0; k < cols; ++k) {
dest[k] += lhs[j] * rhs_row[k];
}
rhs_row += cols;
}
dest += cols;
lhs += mid;
}
}
```

On my computer, that drops the time to multiply two 256×256 matrices from 37ms to 13ms (with `gcc -O3`

).
ATLAS does it in 5ms though, so always use something like it if it's available.