This project analyzed matrix multiplication algorithms in an HPC environment. The analysis was aided by Intel model-specific registers (MSRs), read through the LLNL msr-safe kernel module and its interface. Intel SSE 4.1 intrinsics were used to help optimize.
The experiment was developed for and executed on an Intel Xeon E5-2670 @ 2.6 GHz. The node had two sockets, each with an 8-core CPU running 2 threads per core.
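For reference, here is a minimal sketch of reading the fixed performance counters through msr-safe. The /dev/cpu/N/msr_safe device node and the pread-at-MSR-address convention are standard for the module, but which counters this project actually sampled, and how, is an assumption:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* IA32_FIXED_CTR0 counts retired instructions; IA32_FIXED_CTR1 counts
     * unhalted core cycles. IPC = delta(CTR0) / delta(CTR1).
     * Assumes the fixed counters were already enabled (IA32_FIXED_CTR_CTRL,
     * IA32_PERF_GLOBAL_CTRL) and that the msr-safe allowlist permits reads. */
    #define IA32_FIXED_CTR0 0x309
    #define IA32_FIXED_CTR1 0x30A

    static uint64_t read_msr(int fd, off_t msr) {
        uint64_t val = 0;
        /* msr-safe (like the stock msr driver) uses the MSR address
         * as the file offset */
        if (pread(fd, &val, sizeof(val), msr) != sizeof(val))
            perror("pread");
        return val;
    }

    int main(void) {
        int fd = open("/dev/cpu/0/msr_safe", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        uint64_t inst = read_msr(fd, IA32_FIXED_CTR0);
        uint64_t cyc  = read_msr(fd, IA32_FIXED_CTR1);
        printf("instructions=%llu cycles=%llu\n",
               (unsigned long long)inst, (unsigned long long)cyc);
        close(fd);
        return 0;
    }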
Standard test matrices are generated and written to files by an R script; they serve both as inputs and as known-correct answers for checking each algorithm's output. To generate the test matrices, type:
Rscript generate_test_cases.R 256
where 256 is the matrix dimension (N for an NxN matrix).
To build the matrix multiplication without the Intel Intrinsics libraries, type:
make naive
To run the matrix multiplication without Intel Intrinsics, type:
make runnaive N=256 THREADS=8
where N is the dimension of the NxN matrix and THREADS is the number of threads to run with.
To build the naive algorithm with the Intel Intrinsics libraries, type:
make naive_intrinsics
To run the naive algorithm with Intel Intrinsics, type:
make runnaive_intrinsics N=256 THREADS=8
where N is the dimension of the NxN matrix and THREADS is the number of threads to run with.
To build the naive algorithm with single-precision floating-point numbers instead of double precision, type:
make naive_float
To run the naive algorithm with single-precision floating point, type:
make runnaive_float N=256 THREADS=8
where N is the dimension of the NxN matrix and THREADS is the number of threads to run with. This may cause the correctness check to fail due to the reduced precision.
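Since the reference answers are generated in double precision, an exact comparison against single-precision results will generally fail; a tolerance-based check along these lines is the usual fix (the thresholds here are illustrative, not necessarily the ones this project's checker uses):

    #include <math.h>

    /* Compare a computed float matrix against a double-precision reference
     * entrywise, allowing a small relative error to absorb float rounding.
     * Returns 1 on match, 0 on mismatch. Link with -lm. */
    int matrices_match(const float *c, const double *ref, int n, double rtol) {
        for (int i = 0; i < n * n; i++) {
            double err = fabs((double)c[i] - ref[i]);
            if (err > rtol * fabs(ref[i]) + 1e-6)  /* small absolute floor */
                return 0;
        }
        return 1;
    }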
To build Strassen multiplication, type:
make strassen
To run the Strassen multiplication algorithm, type:
make runstrassen N=256
where N is the dimension of the NxN matrix.
On the initial unoptimized matrix multiplication, instructions per cycle drop dramatically as the matrix size increases. At the same time, cache misses per instruction rise. Therefore, as the matrix size increases, the function appears to become increasingly memory bound; the access pattern responsible is sketched below.
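This is a generic sketch of a triple-loop multiply (the project's exact loop order may differ): the inner loop strides through B one full row at a time, so once N is large enough that a column of B no longer fits in cache, almost every B access misses.

    /* Naive N x N multiply, row-major storage. The B[k*n + j] access walks
     * down a column of B with stride n doubles, which defeats the cache
     * for large n. */
    void matmul_naive(const double *A, const double *B, double *C, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }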
To address this, the code reads ahead and saves 6 values at a time, making fewer, faster trips to RAM by pulling those values into cache up front. This dramatically sped up the runtime; a sketch of the idea follows.
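A minimal sketch of the technique, assuming row-major storage and an unroll factor of 6 (the code this project actually generated may differ in detail):

    /* Inner product for C[i][j] with the k-loop unrolled by 6: six values
     * of A are pulled into locals (likely registers) up front, so each
     * fetched cache line is fully consumed before moving on. */
    static double row_col_unrolled6(const double *A, const double *B,
                                    int n, int i, int j) {
        double sum = 0.0;
        int k = 0;
        for (; k + 6 <= n; k += 6) {
            double a0 = A[i*n + k],     a1 = A[i*n + k + 1], a2 = A[i*n + k + 2];
            double a3 = A[i*n + k + 3], a4 = A[i*n + k + 4], a5 = A[i*n + k + 5];
            sum += a0 * B[(k    )*n + j] + a1 * B[(k + 1)*n + j]
                 + a2 * B[(k + 2)*n + j] + a3 * B[(k + 3)*n + j]
                 + a4 * B[(k + 4)*n + j] + a5 * B[(k + 5)*n + j];
        }
        for (; k < n; k++)  /* remainder when n is not a multiple of 6 */
            sum += A[i*n + k] * B[k*n + j];
        return sum;
    }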
Additionally, I experimented with Intel SSE intrinsics. Given the size of the cache, it seemed that more than 6 values could be kept available for fast access. To sidestep the question of whether the compiler would unroll the loops, I used the generate_code.py script to generate the multiply functions.
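To illustrate the kind of intrinsics involved, here is a minimal SIMD dot-product sketch. It assumes B has been transposed so both operands stream contiguously, and these particular operations only require SSE2's two-wide double vectors; the functions emitted by generate_code.py presumably unroll much further:

    #include <emmintrin.h>  /* SSE2: _mm_loadu_pd, _mm_mul_pd, _mm_add_pd */

    /* Dot product of one row of A with one row of Bt (B pre-transposed),
     * processing two doubles per vector operation. */
    double dot_sse(const double *a, const double *bt, int n) {
        __m128d acc = _mm_setzero_pd();
        int k = 0;
        for (; k + 2 <= n; k += 2)
            acc = _mm_add_pd(acc, _mm_mul_pd(_mm_loadu_pd(a + k),
                                             _mm_loadu_pd(bt + k)));
        double buf[2];
        _mm_storeu_pd(buf, acc);          /* reduce the two lanes */
        double sum = buf[0] + buf[1];
        for (; k < n; k++)                /* scalar remainder */
            sum += a[k] * bt[k];
        return sum;
    }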
Focusing on the beginning of the graph, the most consistently fast number of saved entries is 8, which is far fewer than the cache could hold; this merits further investigation.
Strassen's matrix multiplication algorithm is a recursive algorithm that trades fewer multiplications for more addition and subtraction operations, per the block identities below.
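Concretely, each level of recursion splits A and B into 2x2 grids of blocks and forms seven block products instead of the eight a direct split would use, giving O(N^log2(7)) ~ O(N^2.807) multiplications instead of O(N^3). The standard identities are:

    M1 = (A11 + A22)(B11 + B22)
    M2 = (A21 + A22) B11
    M3 = A11 (B12 - B22)
    M4 = A22 (B21 - B11)
    M5 = (A11 + A12) B22
    M6 = (A21 - A11)(B11 + B12)
    M7 = (A12 - A22)(B21 + B22)

    C11 = M1 + M4 - M5 + M7      C12 = M3 + M5
    C21 = M2 + M4                C22 = M1 - M2 + M3 + M6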
Initially, I set the base case at N=1. Compared to the naive algorithm, Strassen lagged behind until the matrices grew past N=4096, at which point the unoptimized single-threaded Strassen became faster than the single-threaded naive implementation.
(Figure: both axes plotted in log base 2.)
To optimize, I moved the base case out to N=2, used Intel intrinsics for the base-case multiplication, and used intrinsics to form the intermediate matrices, which dramatically lowered the execution time. A sketch of an intrinsics base case follows.
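A minimal sketch of what such a base case can look like, assuming row-major 2x2 blocks of doubles (these operations only require SSE2; the project's actual base-case code may differ):

    #include <emmintrin.h>

    /* 2x2 double multiply C = A*B with SSE: each row of C is a
     * scalar-times-vector combination of the two rows of B. */
    static void matmul2x2_sse(const double *A, const double *B, double *C) {
        __m128d b0 = _mm_loadu_pd(B);      /* [b00 b01] */
        __m128d b1 = _mm_loadu_pd(B + 2);  /* [b10 b11] */
        __m128d r0 = _mm_add_pd(_mm_mul_pd(_mm_set1_pd(A[0]), b0),
                                _mm_mul_pd(_mm_set1_pd(A[1]), b1));
        __m128d r1 = _mm_add_pd(_mm_mul_pd(_mm_set1_pd(A[2]), b0),
                                _mm_mul_pd(_mm_set1_pd(A[3]), b1));
        _mm_storeu_pd(C, r0);
        _mm_storeu_pd(C + 2, r1);
    }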
(Figure: execution time plotted on a natural log scale.)