Profile-Guided Optimization (PGO) benchmark report #343

zamazan4ik · 2024-07-06T16:37:25Z

Hi!

Thank you for the project! I have evaluated Profile-Guided Optimization (PGO) on many projects; all the results are available at https://github.com/zamazan4ik/awesome-pgo. Since this compiler optimization works well in many places, including various parsers, I decided to apply it to this project. Here are my benchmark results.

Test environment

Fedora 40
Linux kernel 6.9.7
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.80.0-nightly
amber version: master branch on commit 49bd5e3d5abfee66ac457efd5e7fd9f8347dbc8f
Turbo Boost disabled

Benchmark

For benchmarking, I use the benchmarks built into the project. For PGO, I use the cargo-pgo tool. The Release results were obtained with the taskset -c 0 cargo +nightly bench command. The PGO training (instrumentation) phase was run with taskset -c 0 cargo +nightly pgo bench, and the PGO optimization phase with taskset -c 0 cargo +nightly pgo optimize bench.

taskset -c 0 is used for reducing the OS scheduler influence on the results. All measurements are done on the same machine, with the same background "noise" (as much as I can guarantee).

Results

I got the following results:

Release:

test normal_brute_force   ... bench:   2,388,357.20 ns/iter (+/- 4,791.51)
test normal_fjs           ... bench:     838,203.50 ns/iter (+/- 2,861.21)
test normal_quick_search  ... bench:   1,063,104.70 ns/iter (+/- 4,052.30)
test normal_tbm           ... bench:     824,669.00 ns/iter (+/- 2,831.21)
test thread2_fjs          ... bench:     916,545.90 ns/iter (+/- 9,251.22)
test thread2_quick_search ... bench:     972,761.50 ns/iter (+/- 10,755.98)
test thread2_tbm          ... bench:     900,198.90 ns/iter (+/- 10,378.47)
test thread4_fjs          ... bench:     993,167.10 ns/iter (+/- 19,304.66)
test thread4_quick_search ... bench:   1,096,666.30 ns/iter (+/- 17,765.52)
test thread4_tbm          ... bench:     975,198.00 ns/iter (+/- 24,880.23)
test thread8_fjs          ... bench:   1,031,909.50 ns/iter (+/- 32,014.73)
test thread8_quick_search ... bench:   1,094,980.20 ns/iter (+/- 30,592.85)
test thread8_tbm          ... bench:   1,010,250.50 ns/iter (+/- 32,807.65)

PGO optimized compared to Release:

test normal_brute_force   ... bench:   2,377,768.00 ns/iter (+/- 4,203.28)
test normal_fjs           ... bench:     832,123.90 ns/iter (+/- 1,175.01)
test normal_quick_search  ... bench:     781,434.90 ns/iter (+/- 1,337.30)
test normal_tbm           ... bench:     817,408.82 ns/iter (+/- 1,533.72)
test thread2_fjs          ... bench:     911,620.60 ns/iter (+/- 9,724.17)
test thread2_quick_search ... bench:     859,772.50 ns/iter (+/- 9,159.61)
test thread2_tbm          ... bench:     895,979.00 ns/iter (+/- 9,644.95)
test thread4_fjs          ... bench:     989,742.60 ns/iter (+/- 61,258.67)
test thread4_quick_search ... bench:     934,455.00 ns/iter (+/- 24,774.49)
test thread4_tbm          ... bench:     972,370.80 ns/iter (+/- 27,719.42)
test thread8_fjs          ... bench:   1,024,729.00 ns/iter (+/- 33,593.14)
test thread8_quick_search ... bench:     966,772.20 ns/iter (+/- 34,357.41)
test thread8_tbm          ... bench:   1,005,971.50 ns/iter (+/- 30,121.34)

(just for reference) PGO instrumented compared to Release:

test normal_brute_force   ... bench:   2,397,680.80 ns/iter (+/- 3,403.57)
test normal_fjs           ... bench:     836,002.80 ns/iter (+/- 1,344.55)
test normal_quick_search  ... bench:   1,626,714.75 ns/iter (+/- 2,539.17)
test normal_tbm           ... bench:     825,404.20 ns/iter (+/- 11,977.07)
test thread2_fjs          ... bench:     916,865.60 ns/iter (+/- 11,711.78)
test thread2_quick_search ... bench:   1,715,809.70 ns/iter (+/- 8,874.30)
test thread2_tbm          ... bench:     904,512.20 ns/iter (+/- 9,680.85)
test thread4_fjs          ... bench:     994,223.10 ns/iter (+/- 125,324.43)
test thread4_quick_search ... bench:   1,822,633.10 ns/iter (+/- 37,125.17)
test thread4_tbm          ... bench:     982,140.20 ns/iter (+/- 28,045.62)
test thread8_fjs          ... bench:   1,029,430.90 ns/iter (+/- 32,401.76)
test thread8_quick_search ... bench:   1,854,443.30 ns/iter (+/- 30,437.64)
test thread8_tbm          ... bench:   1,015,415.20 ns/iter (+/- 31,493.83)

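According to the results, PGO measurably improves at least the quick_search case: about 26% less time per iteration in the single-threaded benchmark (1,063,104.70 ns to 781,434.90 ns) and roughly 12-15% less in the multi-threaded ones. The other benchmarks improve by less than 1%.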
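For anyone who wants to recompute the relative change from the raw bench output, here is a minimal sketch in Rust. It is not part of amber or cargo-pgo: parse_bench_line and relative_change_percent are hypothetical helper names, and the parser assumes the line layout shown in the results above.

```rust
/// Parses one bench result line into (benchmark name, ns per iteration).
/// Assumes the layout: `test <name> ... bench: <number> ns/iter (+/- <number>)`.
fn parse_bench_line(line: &str) -> Option<(String, f64)> {
    let mut parts = line.split_whitespace();
    if parts.next()? != "test" {
        return None;
    }
    let name = parts.next()?.to_string();
    // Skip everything up to the "bench:" token, then read the number after it.
    let mut parts = parts.skip_while(|token| *token != "bench:");
    parts.next()?; // consume "bench:"
    let value = parts.next()?.replace(',', ""); // drop thousands separators
    value.parse::<f64>().ok().map(|ns| (name, ns))
}

/// Relative change of `candidate` against `baseline`, in percent.
/// Negative values mean the candidate needs less time per iteration.
fn relative_change_percent(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

fn main() {
    // Two lines copied from the Release and PGO optimized results above.
    let release = "test normal_quick_search  ... bench:   1,063,104.70 ns/iter (+/- 4,052.30)";
    let pgo = "test normal_quick_search  ... bench:     781,434.90 ns/iter (+/- 1,337.30)";

    let (name, base) = parse_bench_line(release).expect("release line should parse");
    let (_, optimized) = parse_bench_line(pgo).expect("PGO line should parse");

    println!("{name}: {:+.1}%", relative_change_percent(base, optimized));
}
```

The same two functions can be applied to every pair of lines from the Release and PGO optimized runs to get a per-benchmark comparison.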

Further steps

I can suggest the following action points:

Perform more PGO benchmarks with other datasets in various scenarios (if you are interested enough in it). If they show improvements, add a note to the documentation (the README file, I guess) about the possible gains in the library's performance with PGO.
You could try to get some insight into how the code can be optimized further, based on the changes the compiler made with PGO. This can be done by comparing flamegraphs before and after applying PGO, or by checking the assembly/LLVM IR differences before and after PGO.
You might also consider integrating a PGO build option into the project's build scripts.

I would be happy to answer your questions about PGO.

P.S. This is just a benchmark report with an idea for improving the project. I created the Issue only because Discussions are disabled for the repository.
