Optimizing matmul on cpu!
Also writing an accompanying blog.
I am using a machine with a 10-core CPU (4 performance cores + 6 efficiency cores).
Consider two matrices $A$ and $B$ with product $C = AB$. For simplicity and without loss of generality, let's consider the matrices to be square, so $A, B, C \in \mathbb{R}^{n \times n}$ and

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

This is a total of $2n - 1$ floating point operations (FLOPs) required to calculate one element of $C$ -- $n$ multiplications and $n - 1$ additions. As $C$ has $n^2$ elements, the whole matmul takes roughly $2n^3$ FLOPs. For the purpose of this repo, I consider $n = 4096$ and $n = 8192$.
The matmul loop is this:

```c
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
```

When compiled with the `-O3` flag, which is the highest optimization level with only safe optimizations, the latency is:
| Technique | 4096 | 8192 | Speedup |
|---|---|---|---|
| Baseline (np) | 0.10 s | 0.78 s | – |
| Naive implementation | 203 s | 46 min | – |
The speedup column in all further tables is calculated with respect to the row (optimization technique) just above, so it indicates how much each new intervention improves on the previous one.
Full compilation command below:
```
gcc -Wall -O3 sgemm-cpu/matmuls/naive.c -o sgemm-cpu/matmuls/naive
```
All the further optimizations use the same flags to compile the code.
To make sure the compiler never issues a separate store and load for the running partial sum, we can accumulate it in a local variable, which the compiler can keep in a register:

```c
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        float running_sum = 0.0f;
        for (int k = 0; k < N; k++) {
            running_sum += A[i][k] * B[k][j];
        }
        C[i][j] = running_sum;
    }
}
```

That brings the latency to:
| Technique | 4096 | 8192 | Speedup (4096) | Speedup (8192) |
|---|---|---|---|---|
| Baseline (np) | 0.10s | 0.74s | – | – |
| Naive implementation | 203s | 46min | – | – |
| Naive w/ register accumulation | 199s | 27min | 1.02x | 1.7x |
Experimenting w/ different loop orders, the lowest latency comes from the ikj order:
| Technique | 4096 | 8192 | Speedup (4096) | Speedup (8192) |
|---|---|---|---|---|
| Baseline (np) | 0.10s | 0.74s | – | – |
| Naive implementation | 203s | 46min | – | – |
| Naive w/ register accumulation | 199s | 27min | 1.02x | 1.7x |
| Loop reordering (ikj) | 4.31s | 34.28s | 46x | 47x |
```c
for (int i = 0; i < N; i++) {
    for (int k = 0; k < N; k++) {
        for (int j = 0; j < N; j++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
```

Tiled on all three loops ijk:
```c
for (int i_tile = 0; i_tile < N; i_tile += TILE_I) {
    int iend = (i_tile + TILE_I < N) ? i_tile + TILE_I : N;
    for (int j_tile = 0; j_tile < N; j_tile += TILE_J) {
        int jend = (j_tile + TILE_J < N) ? j_tile + TILE_J : N;
        for (int k_tile = 0; k_tile < N; k_tile += TILE_K) {
            int kend = (k_tile + TILE_K < N) ? k_tile + TILE_K : N;
            for (int i = i_tile; i < iend; i++) {
                for (int k = k_tile; k < kend; k++) {
                    float a_ik = A[i][k];
                    for (int j = j_tile; j < jend; j++) {
                        C[i][j] += a_ik * B[k][j];
                    }
                }
            }
        }
    }
}
```

This doesn't help much for matrices of size 4096 x 4096 due to the already large caches on my machine. Full results:
| Technique | 4096 | 8192 | Speedup (4096) | Speedup (8192) |
|---|---|---|---|---|
| Baseline (np) | 0.10s | 0.74s | – | – |
| Naive implementation | 203s | 46min | – | – |
| Naive w/ register accumulation | 199s | 27min | 1.02x | 1.7x |
| Loop reordering (ikj) | 4.31s | 34.28s | 46x | 47x |
| ijk tiling (best tile sizes) | 3.16s | 26.20s | 1.36x | 1.3x |
The best tile size for
| Technique | 4096 | 8192 | Speedup (4096) | Speedup (8192) |
|---|---|---|---|---|
| Baseline (np) | 0.10s | 0.74s | – | – |
| Naive implementation | 203s | 46min | – | – |
| Naive w/ register accumulation | 199s | 27min | 1.02x | 1.7x |
| Loop reordering (ikj) | 4.31s | 34.28s | 46x | 47x |
| ijk tiling (best tile sizes) | 3.16s | 26.20s | 1.36x | 1.3x |
| Multithreading | 1.19s | 9.88s | 2.6x | 2.6x |