1BRC - 0.201s default dataset, 0.384s 10k dataset, 64/96 threads, no-fork, C++, m7i.48xlarge #495

charlielye · 2024-01-19T18:22:45Z

charlielye
Jan 19, 2024

I offer a solution with a best runtime of 200.9ms on an m7i.48xlarge, for the default dataset. The result is compliant (as far as I'm aware), but not optimised for dealing with the 10k dataset. I don't have access to competition hardware.
This time includes unmapping. I decided not to pursue the fork trick, in order to:

Offer something different to some other solutions.
Try to be "honest" about the true cost: with the fork trick you can't get reliable hyperfine results, as the previous process is technically still "running" between test runs (maybe as a zombie process, or the kernel is just still busy).

This does however set an upper bound on performance, as at some point there is no point optimising beyond getting the best unmap performance. Indeed, most of the implementation effort for this solution went into dealing with unmap. If we exclude the cost of unmap we can see results around the 0.158s mark.

ubuntu@ip-172-31-35-24:~$ THREADS=64 time -p ./a.out
   Chunks: 8
  Threads: 64
    Setup: 15.5635
 Parallel: 137.125
Combining: 4.25785
  Writing: 0.763731
    Total: 157.711
  w/Unmap: 198.606
Processed: 1000000000
real 0.20
user 7.96
sys 0.85

100 runs in hyperfine, note the min time recorded.

ubuntu@ip-172-31-35-24:~$ THREADS=64 hyperfine --warmup 1 --runs 100 ./a.out
Benchmark 1: ./a.out
  Time (mean ± σ):     206.8 ms ±   3.6 ms    [User: 7807.1 ms, System: 1112.0 ms]
  Range (min … max):   200.9 ms … 221.5 ms    100 runs

No effort was spent improving performance for the 10k dataset. 96 threads offers the best time here.

ubuntu@ip-172-31-35-24:~$ THREADS=96 hyperfine --warmup 1 --runs 100 "./a.out measurements_10k.txt"
Benchmark 1: ./a.out measurements_10k.txt
  Time (mean ± σ):     390.8 ms ±   4.3 ms    [User: 26219.3 ms, System: 1569.5 ms]
  Range (min … max):   383.7 ms … 406.8 ms    100 runs

Compared against https://github.com/lehuyduc/1brc-simd main_small.cpp with the default dataset on the same hardware, this solution does less well in the best single run, where the fork trick wins (note the best time of 0.181s below), but it outperforms in the average case, most likely due to the unmap optimisation.

ubuntu@ip-172-31-35-24:~$ hyperfine --warmup 1 --runs 100 ./leh_small64
Benchmark 1: ./leh_small64
  Time (mean ± σ):     219.5 ms ±  32.0 ms    [User: 1.5 ms, System: 0.4 ms]
  Range (min … max):   181.1 ms … 345.1 ms    100 runs

If we run both solutions with 8 threads, default dataset, this solution seems to outperform, even in the minimum case (avx512 might be helping here, but it's not available on competition hardware):

ubuntu@ip-172-31-35-24:~$ THREADS=8 hyperfine --warmup 1 --runs 10 ./a.out
Benchmark 1: ./a.out
  Time (mean ± σ):     958.6 ms ±   6.0 ms    [User: 6805.2 ms, System: 474.7 ms]
  Range (min … max):   949.7 ms … 967.9 ms    10 runs

ubuntu@ip-172-31-35-24:~$ hyperfine --warmup 1 --runs 10 ./leh_small8
Benchmark 1: ./leh_small8
  Time (mean ± σ):      1.107 s ±  0.015 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    1.093 s …  1.132 s    10 runs

With the 10k dataset, we see @lehuyduc's solution outperform in the best case, but it degrades in continuous runs, again probably due to forks.

ubuntu@ip-172-31-35-24:~$ THREADS=64 hyperfine --warmup 1 --runs 20 "./a.out measurements_10k.txt"
Benchmark 1: ./a.out measurements_10k.txt
  Time (mean ± σ):     459.5 ms ±   1.6 ms    [User: 24476.8 ms, System: 1069.6 ms]
  Range (min … max):   457.0 ms … 462.4 ms    20 runs

ubuntu@ip-172-31-35-24:~$ hyperfine --warmup 1 --runs 20 "./leh64 measurements_10k.txt"
Benchmark 1: ./leh64 measurements_10k.txt
  Time (mean ± σ):     569.7 ms ± 100.6 ms    [User: 1.4 ms, System: 0.6 ms]
  Range (min … max):   414.5 ms … 851.6 ms    20 runs

Tricks in this solution:

Implements a basic scheduler over a pool of core-pinned threads to simplify some tricks.
Memory map the file in. Assumes we've done a warmup run to pull the file into kernel page cache.
Process the files in linear chunks, so we can unmap one chunk when we begin processing the next.
Initialize thread hash table memory over thread pool.
SIMD for looping over chunks of data to detect semi-colons.
Use of compiler hints for expected branches.
All value processing as int not float.
Use of crc32 cpu instructions for hashing (not sure how widely available, but present on amd/intel previous/current gen aws instances).
Use of lookup tables.
- For converting the number string to a value, we assume most are 4 bytes, and do a single crc32 hash. Edge case is 5 bytes for an additional byte of hashing.
- For masking off bytes in station names <= 16 bytes, allowing for 2x 64 bit integer comparison when comparing and hashing station names. >16 bytes is considered an edge case and currently uses inefficient memcmp for comparison and further crc32 instructions for hashing. This could be improved with simd, or maybe further use of masking and integer comparison.
We align hash table entries (128 bytes) to cache-lines, and size the table such that it should mostly fit in L2 cache.
Avoid some unnecessary branching, where it seemed to help performance.
Combine hash results in log2(threads) levels of reduction.
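
As an illustration of the mask lookup table for names of 16 bytes or fewer, here is a minimal sketch rather than the actual code. It loads 16 bytes at the name's start, zeroes everything past the name length using a 9-entry mask table, and compares two names with two 64-bit integer operations. The names Name16, load_name16 and same_name are hypothetical. It assumes a little-endian CPU and that 16 bytes are always readable at the pointer, which holds inside a mapped file except very near its end.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 16-byte key: the station name, zero-padded past its length.
struct Name16 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Loads a name of len <= 16 bytes starting at p and masks off the bytes that
// belong to the rest of the line. The caller must guarantee that 16 bytes are
// readable at p. Assumes little-endian byte order.
inline Name16 load_name16(const char* p, std::size_t len) {
    // kMask[n] keeps the first n bytes of a 64-bit word.
    static constexpr std::uint64_t kMask[9] = {
        0x0000000000000000ULL, 0x00000000000000FFULL, 0x000000000000FFFFULL,
        0x0000000000FFFFFFULL, 0x00000000FFFFFFFFULL, 0x000000FFFFFFFFFFULL,
        0x0000FFFFFFFFFFFFULL, 0x00FFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL};

    Name16 k;
    std::memcpy(&k.lo, p, 8);      // memcpy avoids unaligned-access UB
    std::memcpy(&k.hi, p + 8, 8);
    k.lo &= kMask[len < 8 ? len : 8];
    k.hi &= kMask[len > 8 ? len - 8 : 0];
    return k;
}

// Two masked names are equal exactly when both 64-bit halves match.
inline bool same_name(const Name16& a, const Name16& b) {
    return ((a.lo ^ b.lo) | (a.hi ^ b.hi)) == 0;
}
```

The same masked words can be fed straight into the hash, so the name bytes are read once for both hashing and comparison.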

I'm writing this at the end of the competition time. I got completely nerd sniped on this over the past month and wish I could say I won't spend any more time on it, but it's been a lot of fun.
I'll probably have to adopt the fork trick if I want to try and compete with the best times of other solutions.

charlielye · 2024-01-19T18:25:51Z

charlielye
Jan 19, 2024
Author

I also notice that even at the time of writing, there's quite a bit of discussion on #138 that I'm not caught up on, so this solution may already be yesterday's news.

0 replies

charlielye · 2024-01-19T18:43:21Z

charlielye
Jan 19, 2024
Author

Also I'm not sure I fully understand how to properly leverage hyperthreading, as folks seem to be getting great speedups using it, and my performance pretty much universally degrades when I go past physical core count 🤔

1 reply

sharpobject Jan 19, 2024

Hyperthreading lets the core schedule 2 streams of instructions instead of 1. If there's not actually any room for more similar work because you're saturating some port already, you won't gain anything.

dzaima · 2024-01-19T19:00:37Z

dzaima
Jan 19, 2024

int sum isn't enough for 1B records if, e.g., all of them are a;99.9 ((float)sum should be changed to double too); even with that changed, your output seems off by a bit on a 100M record/1.38GB test, especially noticeable when changing to print the sum instead of the average

6 replies

charlielye Jan 19, 2024
Author

Hmm. I'm not sure I follow actually, int is a 64 bit signed integer, so should be big enough. And changing float to double makes no difference to output hash (and shouldn't actually matter either).

charlielye Jan 19, 2024
Author

Maybe I just have test data that isn't triggering an issue that your test data is. I plan to run against other official test vectors, so hopefully can catch any edge bugs there...

dzaima Jan 19, 2024

seems like your number parser parses -0.5 (and any -0.X?) as 0
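
For reference, a minimal sketch of an integer parser that keeps the sign for values such as -0.5. It is not the parser from this solution and makes no claim about where the bug above comes from. It assumes the challenge's input format: an optional minus sign, one or two integer digits, a dot, and exactly one fractional digit. The function name parse_tenths is made up; the result is in tenths of a degree.

```cpp
// Hypothetical parser: reads a temperature such as "-0.5", "7.2" or "-12.3"
// starting at p and returns it as an integer number of tenths.
inline int parse_tenths(const char* p) {
    const bool negative = (*p == '-');
    p += negative;                 // skip the sign if present

    int value = *p++ - '0';        // first integer digit
    if (*p != '.') {
        value = value * 10 + (*p++ - '0');  // optional second integer digit
    }
    ++p;                           // skip the '.'
    value = value * 10 + (*p - '0');        // the single fractional digit

    // The sign is applied last, after the fraction has been added, so a zero
    // integer part cannot swallow it.
    return negative ? -value : value;
}
```

With this shape, parse_tenths("-0.5") gives -5 and parse_tenths("99.9") gives 999.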

dzaima Jan 19, 2024

int is usually a 32-bit signed integer. Do you have some configuration that makes that not be the case or something? (the sum size and float not affecting the output for the default generated dataset makes sense, but the challenge rules state that it must work on more input types, which allow for a case of 1B repeated a;99.9; float perhaps is even less consequential as there's no description on the needed rounding, but it's probably good to change it anyway)
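
The arithmetic behind the worst case can be checked at compile time. This is only a sketch with made-up names (kRecords, kMaxTenths, kWorstSumTenths, mean_from_tenths), assuming values are stored as integer tenths: one billion records of 99.9 sum to 999,000,000,000 tenths, which is far above the 32-bit signed maximum of 2,147,483,647 and comfortably inside 64 bits.

```cpp
#include <cstdint>
#include <limits>

// Worst case allowed by the rules: 1B records, all equal to 99.9.
constexpr std::int64_t kRecords = 1'000'000'000;
constexpr std::int64_t kMaxTenths = 999;  // 99.9 stored as tenths
constexpr std::int64_t kWorstSumTenths = kRecords * kMaxTenths;

static_assert(kWorstSumTenths > std::numeric_limits<std::int32_t>::max(),
              "a 32-bit sum overflows in the worst case");
static_assert(kWorstSumTenths < std::numeric_limits<std::int64_t>::max(),
              "a 64-bit sum has ample headroom");

// Computing the mean in double keeps the full sum exact: every integer below
// 2^53 is representable, whereas float only has a 24-bit significand.
inline double mean_from_tenths(std::int64_t sum_tenths, std::int64_t count) {
    return static_cast<double>(sum_tenths) / 10.0 / static_cast<double>(count);
}
```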

charlielye Jan 19, 2024
Author

You're right. derp. Although changing to uint64_t didn't change my hash so I guess my test data is lucky.
It didn't impact timings either way. Will investigate parser bug in a bit 🙏

lehuyduc · 2024-01-20T00:37:51Z

lehuyduc
Jan 20, 2024

Hi, here are 2 example strings where a hash collision happens:

string 1: 000111100100000000001011110
string 2: 111001110
hash of both:1169095459

I think this code doesn't check for string equality and just relies on the hash? If so, it will not pass all tests. See: https://twitter.com/nietras1/status/1746162801729564812?t=HiDNrbLli0QYwJRPzlm9RQ&s=19

I would also note that none of these runs was able to reproduce the 0.362s time seen on the 128 core threadripper run.

Yup, that's expected. In my post next to the benchmark, I mention test version 1brc_valid16. Version 17 and later focused on improving 10K key dataset performance, at the cost of base dataset performance. It's slower due to larger hash map (NUM_BINS), and at 128 threads that cost a lot.

11 replies

sharpobject Jan 20, 2024

If you use a large and cryptographically secure hash then the solution is likely to be slower for the normal input than one using a small and weak hash and handling collisions correctly, so it doesn't seem like the answer to this concern matters very much.

lehuyduc Jan 20, 2024

Is there any ambiguity around the hash/collision issue?

No, the contest requires you to handle hash collision properly. That means you have to either:

Store the string in your hash table to compare
Use a cryptographically secure hash 👀
I forgot a 3rd option: perfect hashing. @RagnarGrootKoerkamp has a very cool library that guarantees no hash collision. The only problem is, you have to rebuild the perfect hash table every time you meet a new key.

If you use a 256-bit hash, then it's a totally valid solution. But a hash modulo HASH_TABLE_SIZE = 2^20 for example, will 100% fail against a hash collision attack. The judge has clarified that solutions like that are invalid.

Hiya, my solution passes this test. Same hash as above.

Yes that's expected. The test generator we have is really simple, it's mostly to check the average case, not the edge cases.

Your code will give wrong result for this test for example:

000111100100000000001011110:-16.0
111001110:15.0
000111100100000000001011110:-16.0
111001110:15.0
000111100100000000001011110:-16.0
111001110:15.0
000111100100000000001011110:-16.0
111001110:15.0
... copy above 100 times

RagnarGrootKoerkamp Jan 20, 2024

Yes one can do perfect hashing, but this still requires collision detection. So sadly one still has to store strings or hashes.

dzaima Jan 20, 2024

Also looks like a hash of 0 is broken; here are two records whose names hash to 0 and don't show up in the result anywhere:

9000908890899098;10.0
mkgcgmgkceakkggc;20.0

charlielye Jan 31, 2024
Author

This should now be addressed, as I now perform proper string comparison in bucket collisions.
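
A minimal sketch of what string comparison on bucket collisions can look like, assuming a linear-probing table. This is not the code from the repository; StationStats, StationTable, add and find are hypothetical names, and the hash is passed in so that no particular hash function is implied. The stored name is compared on every probe, and an empty slot is recognised by count == 0 rather than by a hash value, so names that hash to 0 need no special handling.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

struct StationStats {
    std::string name;         // stored key, compared on every bucket hit
    std::int64_t sum = 0;     // in tenths of a degree
    std::uint32_t count = 0;  // 0 marks an empty slot
    int min = 0;
    int max = 0;
};

class StationTable {
public:
    // slot_count must be a power of two and larger than the number of
    // distinct names, otherwise probing would never find a free slot.
    explicit StationTable(std::size_t slot_count) : slots_(slot_count) {}

    void add(std::string_view name, std::uint32_t hash, int tenths) {
        StationStats& s = slots_[probe(name, hash)];
        if (s.count == 0) {  // first sighting: claim the slot
            s.name.assign(name.data(), name.size());
            s.min = s.max = tenths;
        } else {
            if (tenths < s.min) s.min = tenths;
            if (tenths > s.max) s.max = tenths;
        }
        s.sum += tenths;
        ++s.count;
    }

    // Returns nullptr when the name has never been added.
    const StationStats* find(std::string_view name, std::uint32_t hash) const {
        const StationStats& s = slots_[probe(name, hash)];
        return s.count == 0 ? nullptr : &s;
    }

private:
    // Walks from the home bucket until it finds either the matching name or
    // an empty slot. Equal hashes alone never count as a match.
    std::size_t probe(std::string_view name, std::uint32_t hash) const {
        const std::size_t mask = slots_.size() - 1;
        std::size_t i = hash & mask;
        while (slots_[i].count != 0 && std::string_view(slots_[i].name) != name) {
            i = (i + 1) & mask;
        }
        return i;
    }

    std::vector<StationStats> slots_;
};
```

Feeding it the two colliding names from above with the same hash value yields two separate entries, since the second name fails the comparison and moves on to the next slot.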

charlielye · 2024-01-31T08:32:37Z

charlielye
Jan 31, 2024
Author

@lehuyduc @noahfalk, I think you both have the winning solutions (in various dimensions) at present. But you seem to have access to more "compliant" hardware than I do. If possible would love to get some results for the default dataset on competition hardware, or whatever hardware you got your benchmarks on. 10k is probably less interesting as it's not optimised for the 10k case, just compliant.

Also, there's quite a lot of moving parts (hardware/cores/threads/code versions), so if any of my benchmarks above are not right for some reason, please advise!

5 replies

austindonisan Jan 31, 2024

My implementation is the fastest I've seen, running in about 82% of the time of @noahfalk's solution (~50% of yours and @lehuyduc's).

https://github.com/austindonisan/1brc/tree/master

With 8 threads it averages 660ms with the flag set to not leave orphan processes:

It was written with the competition Zen2 CPU in mind. I've just been renting a GCP instance (N2D) with a similar CPU. You could rent an EC2 C5a instance for something similar. Targeting Sapphire Rapids would look totally different with not only AVX512, but usable GATHER and PDEP/PEXT as well.

charlielye Jan 31, 2024
Author

Oh wow. I missed your impl. Light speed! 🙇

charlielye Jan 31, 2024
Author

Hmm, although running your code on my hardware I think I'm seeing output corruption? This is start of output:

{2
Alexandra=2.5/2.5/2.5, 2.5
Fresno=12.7/12.7/12.7, 9
Tromsø=7.2/7.2/7.2, Abha=-33.4/18.0/67.4, Abidjan=-26.0/26.0/76.1, Abéché=-19.0/29.4/80.2, Accra=-25.2/26.4/81.1, Addis Ababa=-37.0/16.0/63.7, Adelaide=-30.9/17.3/65.2, Aden=-19.9/29.1/82.9, Ahvaz=-23.8/25.4/78.2, Albuquerque=-38.2/14.0/65.7, Alexandra=-36.3/11.0/59.3, Alexandria=-29.1/20.0/68.7, Algiers=-32.8/18.2/67.5, Alice Springs=-30.0/21.0/73.2, Almaty=-41.6/10.0/59.0

Is there a discussion thread for your solution? I'd like to understand more the details of your strategy.

austindonisan Jan 31, 2024

Hmm, although running your code on my hardware I think I'm seeing output corruption? This is start of output:

Edit: I fixed it

I'll make a post summarizing the major points of my solution later today.

austindonisan Feb 1, 2024

I wrote up my implementation here:
#710

noahfalk · 2024-01-31T13:51:36Z

noahfalk
Jan 31, 2024

If possible would love to get some results for the default dataset on competition hardware, or whatever hardware you got your benchmarks on.

This is a run of your code on the CCX33 I have access to:

root@ubuntu-32gb-hil-1:~/git/charlielye_1brc# git log
commit 30ec82e2d7db2f3db9f7376272d570236ecf7b96 (HEAD -> main, origin/main, origin/HEAD)
Author: Charlie Lye <[email protected]>
Date:   Wed Jan 31 08:47:49 2024 +0000

    README
    
root@ubuntu-32gb-hil-1:~/git/charlielye_1brc# hyperfine -w 1 -r 5 "./a.out ~/git/1brc_data/measurements.txt"
Benchmark 1: ./a.out ~/git/1brc_data/measurements.txt
  Time (mean ± σ):      3.849 s ±  0.024 s    [User: 14.523 s, System: 0.962 s]
  Range (min … max):    3.827 s …  3.884 s    5 runs

root@ubuntu-32gb-hil-1:~/git/charlielye_1brc# export THREADS=8
root@ubuntu-32gb-hil-1:~/git/charlielye_1brc# hyperfine -w 1 -r 5 "./a.out ~/git/1brc_data/measurements.txt"
Benchmark 1: ./a.out ~/git/1brc_data/measurements.txt
  Time (mean ± σ):      2.111 s ±  0.016 s    [User: 14.677 s, System: 0.984 s]
  Range (min … max):    2.091 s …  2.131 s    5 runs

The first one is without any THREADS env var set, the 2nd is when I explicitly set THREADS=8. Also I noticed when looking at the output that your entry has an error message when THREADS were set to 8 so I don't know if it was doing what it was supposed to be doing.

root@ubuntu-32gb-hil-1:~/git/charlielye_1brc# ./a.out ~/git/1brc_data/measurements.txt
Error setting thread affinity: 8Invalid argument
   Chunks: 8
  Threads: 8
    Setup: 3.92179
 Parallel: 2109.72
Combining: 1.0249
  Writing: 0.565064
    Total: 2115.23
  w/Unmap: 2149.04
Processed: 1000000000

Here is the 10K with THREADS=8:

root@ubuntu-32gb-hil-1:~/git/charlielye_1brc# hyperfine -w 1 -r 5 "./a.out ~/git/1brc_data/measurements-10K.txt"
Benchmark 1: ./a.out ~/git/1brc_data/measurements-10K.txt
  Time (mean ± σ):      5.268 s ±  0.092 s    [User: 38.798 s, System: 1.637 s]
  Range (min … max):    5.149 s …  5.381 s    5 runs

I think you both have the winning solutions (in various dimensions) at present

If you are looking for the fastest known solution, @austindonisan's is the fastest one I am aware of on the default data. Looking at the code his entry appears to exploit SIMD parallelism more thoroughly than any others I've seen. It's lovely engineering.

3 replies

charlielye Jan 31, 2024
Author

Thanks so much for running!
Ok, so I’ve definitely benefited from physical cores. My solution is quite a bit worse than others on the competition machine as I don’t think I benefit much from hyper-threading. Much to learn here.

Austin's solution is mind blowing. I’m tempted to pursue his speed, but think I have quite a knowledge gap.

Would anyone have recommendations for tooling to help with this? Or is it really just a better understanding of the machine limits in terms of memory bandwidth, cpu instruction cycles, etc.? One could imagine that with the right knowledge you can do the back-of-the-envelope math on a solution's performance before even getting into the code. But there is also so much abstraction in modern hardware that it feels like a lot of it comes down to data gathering, where analysis with tools would help.

noahfalk Jan 31, 2024

I did some profiling with vtune as part of developing my solution, though I see no reason you couldn't use another profiler. Certainly having access to hardware counters and assembly instruction level attribution of costs was useful. When I developed my solution there was some back and forth between simulating things in my head with back of the envelope calculations and then implementing things to make empirical measurements. Certainly modern hardware is quite complex and it is challenging to predict how it will run once the workload isn't trivial. Another part of it though is just spotting opportunities on what is possible. Austin spotted a variety of places where additional parallelism was possible with SIMD that I simply hadn't noticed.

charlielye Feb 2, 2024
Author

That makes a lot of sense, thank you!

noahfalk · 2024-01-31T13:58:22Z

noahfalk
Jan 31, 2024

Is there a discussion thread for your solution? I'd like to understand more the details of your strategy.

There is some discussion of @austindonisan's solution in @lehuyduc's thread: #138 (reply in thread)

1 reply

sharpobject Jan 31, 2024

This is wonderful reading. I haven't really considered using most of these toys because Java doesn't have them. I appreciate the reminder that I cannot use pdep or pext on Zen2 as well.
