1BRC - 0.201s default dataset, 0.384s 10k dataset, 64/96 threads, no-fork, C++, m7i.48xlarge #495
Replies: 7 comments 27 replies
-
I also notice that even at the time of writing, there's quite a bit of discussion on the #138 that I'm not caught up on, so this solution may already be yesterdays news. |
Beta Was this translation helpful? Give feedback.
-
Also I'm not sure I fully understand how to properly leverage hyperthreading, as folks seems to be getting great speedups using it, and my performance pretty much universally degrades when I go past physical core count 🤔 |
Beta Was this translation helpful? Give feedback.
-
|
Beta Was this translation helpful? Give feedback.
-
Hi, here are 2 example strings where hash collision happen:
I think this code doesn't check for string equality and just relies on the hash? If so, it will not pass all tests. See: https://twitter.com/nietras1/status/1746162801729564812?t=HiDNrbLli0QYwJRPzlm9RQ&s=19
Yup, that's expected. In my post next to the benchmark, I mention test version |
Beta Was this translation helpful? Give feedback.
-
@lehuyduc @noahfalk, I think you both have the winning solutions (in various dimensions) at present. But you seem to have access to more "compliant" hardware than I do. If possible would love to get some results for the default dataset on competion hardware, or whatever hardware you got your benchmarks on. 10k is probably less interesting as it's not optimised for the 10k case, just compliant. Also, there's quite a lot of moving parts (hardware/cores/threads/code versions), so if any of my benchmarks above are not right for some reason, please advise! |
Beta Was this translation helpful? Give feedback.
-
This is a run of your code on the CCX33 I have access to:
The first one is without any THREADS env var set, the 2nd is when I explicitly set THREADS=8. Also I noticed when looking at the output that your entry has an error message when THREADS were set to 8 so I don't know if it was doing what it was supposed to be doing.
Here is the 10K with THREADS=8:
If you are looking for the fastest known solution, @austindonisan's is the fastest one I am aware of on the default data. Looking at the code his entry appears to exploit SIMD parallelism more thoroughly than any others I've seen. Its lovely engineering. |
Beta Was this translation helpful? Give feedback.
-
These is some discussion of @austindonisan's solution in @lehuyduc's thread: #138 (reply in thread) |
Beta Was this translation helpful? Give feedback.
-
https://github.com/charlielye/1brc
I offer a solution with a best runtime of 200.9ms on a m7i.48xlarge, for the default dataset. The result is compliant (as far as I'm aware), but not optimised for dealing with the 10k dataset. I don't have access to competition hardware.
This time includes unmapping. I decided not pursue the
fork
trick to:This does however set an upper bound on performance, as at some point there is no point optimising beyond getting the best unmap performance. Indeed, the implementation of this solution was mostly expended on dealing with unmap. If we exclude the cost of unmap we can see results in the 0.158s mark.
100 runs in hyperfine, note the min time recorded.
No effort was spent improving performance for the 10k dataset. 96 threads offers the best time here.
Comparing against: https://github.com/lehuyduc/1brc-simd
main_small.cpp
with the default dataset on the same hardware, it performs less well in a single run against the fork trick. Note the best time of 0.181s, but outperforms in the average case, most likely due to unmap optimisation.If we run both solutions with 8 threads, default dataset, this solution seems to outperform, even in the minimum case (avx512 might be helping here, but it's not available on competition hardware):
With the 10k dataset, we see @lehuyduc solution outperform in the best case, but it degrades in continous runs, again probably due to forks.
Tricks in this solution:
memcmp
for comparision and further crc32 instructions for hashing. This could be improved with simd, or maybe further use of masking and integer comparison.I'm writing this at the end of the competition time. I got completely nerd sniped on this over the past month and wish I could say I won't spend any more time on it, but it's been a lot of fun.
I'll probably have to adopt the fork trick if I want to try and compete with the best times of other solutions.
Beta Was this translation helpful? Give feedback.
All reactions