I made this for fun, taking a shot at the Java 1BRC in C++. I managed to get a 64x speedup over the naive solution (from 196 seconds down to 3).
It also contains much more performant code for generating the input data.
| Solution | Average | Best | Worst | Cold |
|---|---|---|---|---|
| markusaksli_fast_threaded | 3.03s | 3.02s | 3.04s | 3.03s |
| markusaksli_default_threaded | 4.03s | 4.02s | 4.04s | 4.04s |
| markusaksli_fast | 22.18s | 22.16s | 22.21s | 22.16s |
| markusaksli_default | 28.66s | 27.19s | 31.42s | 28.21s |
| naive_plus | 95.87s | 94.57s | 97.76s | 94.57s |
| naive | 209.66s | 198.41s | 233.22s | 198.41s |
- CPU: AMD Ryzen 9 7950X 16-Core Processor
- RAM: 64.0 GB
- Storage: Samsung SSD 990 PRO 2TB (NVMe)
  - Sequential Read: 7124 MB/s
  - Random Read: 1,359,375 IOPS
`bin/gen.exe` is the built Windows binary for `src/gen.cpp`.
There is a `gen_data` script in the repository root that will run this exe and generate the default `1brc.txt` file in the `data` folder.
To find out what args the exe has, run it or the script with `-h`, but the main ones are:
- `-stations [int (default 100)]` - Number of station names to use (up to 41343)
- `-lines [int (default 1000000000)]` - Number of lines to generate
- `-buffersize [double (default 4.0)]` - The size of the generation buffer in GB. The bigger the better; at around 16 GB the buffer only needs to be filled once.
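For example, generating the full one-billion-line file with a 16 GB buffer might look like this (illustrative; run with `-h` to confirm the exact argument format):

```
bin\gen.exe -stations 100 -lines 1000000000 -buffersize 16.0
```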
The `build_all.bat` script will build every solution in `solutions`, and `benchmark.bat` will benchmark each solution and save the results in a CSV file. To run `benchmark.bat` you need the Sysinternals `sync.exe` tool in your `Path`, and you will need admin privileges so it can flush the file system cache between solutions.
To add a new solution, open a PR with
- The new solution
- Updated `build_all.bat`
- Updated `run_benchmark.ps1`
Worth noting that the longest station name actually present in the input data is Dolores Hidalgo Cuna de la Independencia Nacional at 49 bytes, even though the stated limit is < 100 bytes.
The original challenge specified that the file would be read from a RAM disk, but I preferred to optimize for something actually being read from disk.
The canonical simple way to solve the challenge in C++ (although it already needs some platform-specific code on Windows to output in UTF-8).
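A minimal sketch of what that naive approach looks like (illustrative, not the exact code in `solutions`; it assumes the challenge's `name;temperature` line format):

```cpp
#include <algorithm>
#include <fstream>
#include <iostream>
#include <limits>
#include <map>
#include <string>

struct Stats {
    double min = std::numeric_limits<double>::max();
    double max = std::numeric_limits<double>::lowest();
    double sum = 0.0;
    long long count = 0;
};

int main() {
    std::ifstream file("data/1brc.txt");
    std::map<std::string, Stats> stations; // std::map keeps the output sorted
    std::string line;
    while (std::getline(file, line)) {
        const size_t sep = line.find(';');
        const std::string name = line.substr(0, sep);        // copies the name...
        const double temp = std::stod(line.substr(sep + 1)); // ...and the number
        Stats& s = stations[name];
        s.min = std::min(s.min, temp);
        s.max = std::max(s.max, temp);
        s.sum += temp;
        s.count++;
    }
    for (const auto& [name, s] : stations)
        std::cout << name << '=' << s.min << '/' << (s.sum / s.count)
                  << '/' << s.max << '\n';
}
```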
Basic attempt at optimizing the naive solution, primarily reducing string copying.
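One copy-reducing trick of this kind (an illustration, not necessarily the exact change made in `naive_plus`) is a transparent comparator, which lets the map be probed with a `std::string_view` so the station name is only copied on first insertion:

```cpp
#include <map>
#include <string>
#include <string_view>

struct Stats {
    double min = 1e9, max = -1e9, sum = 0.0;
    long long count = 0;
};

// std::less<> is transparent, so find() accepts a std::string_view directly
// instead of forcing construction of a temporary std::string per lookup.
Stats& lookup(std::map<std::string, Stats, std::less<>>& stations,
              std::string_view name) {
    auto it = stations.find(name);
    if (it == stations.end())
        it = stations.emplace(std::string(name), Stats{}).first;
    return it->second;
}
```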
My default code for parsing text and using my simple base layer data structures. This is mostly how I would write my basic solution to any text parsing problem.
Improvements
- Memory mapping the input file
- Seeking to the delimiter `;` using SIMD
- Fast double parsing (if I needed full double parsing I would use a library, but in this case it's simpler to just write the char-by-char parsing yourself; a sketch follows this list)
- No string copying (just using views into the file data)
- Simple inlineable hash function and table lookup
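As an illustration of the hand-rolled number parsing, here is a sketch that assumes the challenge's fixed format (optional sign, one or two integer digits, exactly one fractional digit); the actual code in the repo may differ:

```cpp
// Parses a temperature like "-12.3" or "5.7" and advances p past the digits.
// The fixed format lets us skip all the generality of std::stod.
inline double parse_temp(const char*& p) {
    const bool negative = (*p == '-');
    if (negative) ++p;
    int value = *p++ - '0';                // first integer digit
    if (*p != '.')
        value = value * 10 + (*p++ - '0'); // optional second integer digit
    ++p;                                   // skip the '.'
    value = value * 10 + (*p++ - '0');     // the single fractional digit
    const double result = value / 10.0;
    return negative ? -result : result;
}
```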
An actual attempt at optimizing my default approach.
Improvements
- Parsing and storing only minimal int station data (16 bytes) with a single 8-byte load per temperature
- Using a fixed-size flat power-of-2 hash map with linear probing for even simpler lookups (sketched after this list)
- Custom compact 4-byte key structure for the map so more of them can fit in cache
- Did some experimentation on the quickest way to parse up to the delimiter; it turns out a basic char-by-char loop that calculates the hash as it goes worked better than anything smarter.
- Didn't do any hash seed searching over the source station names for a perfect hash function; that felt too hacky and cheap.
- Didn't manage to get any I/O improvements by touching pages or prefetching.
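A sketch of the flat, power-of-two, linear-probing lookup idea (the record layout and sizes here are assumptions for illustration, not the exact 16-byte structure used in the solution):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One slot per station; a power-of-two table size turns modulo into a mask.
struct Station {
    uint32_t key;      // compact key derived from the name (e.g. hash + length)
    int16_t  min, max; // temperatures stored as value * 10
    int64_t  sum;      // also value * 10; divided back out when printing
    uint32_t count;    // 0 marks an empty slot
};

constexpr size_t TABLE_SIZE = 1 << 16; // comfortably above the station count
constexpr size_t TABLE_MASK = TABLE_SIZE - 1;

// Returns either the matching slot or the empty slot where the caller
// should insert a new station (by writing its key and first sample).
Station& lookup(std::vector<Station>& table, uint32_t hash, uint32_t key) {
    size_t i = hash & TABLE_MASK;
    while (table[i].count != 0 && table[i].key != key)
        i = (i + 1) & TABLE_MASK; // linear probing with wrap-around
    return table[i];
}
```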
Multithreaded version of markusaksli_default.
- Partitions the file by dividing the content evenly between the threads and seeking each split point forward to a line boundary (sketched below)
- Each thread fills its own hash map
- Simple merge where we loop through each entry in a thread's hash map and look it up in the main thread's hash map
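A sketch of the partitioning step (simplified, error handling omitted; `data` would point at the memory-mapped file contents):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct Chunk { const char* begin; const char* end; };

// Splits [data, data + size) into roughly equal chunks, moving each split
// point forward to just past the next '\n' so no line straddles two chunks.
std::vector<Chunk> partition(const char* data, size_t size, unsigned threads) {
    std::vector<Chunk> chunks;
    const char* begin = data;
    for (unsigned t = 1; t <= threads; ++t) {
        const char* end = data + (size * t) / threads;
        if (t < threads)
            end = static_cast<const char*>(
                      std::memchr(end, '\n', data + size - end)) + 1;
        else
            end = data + size;
        chunks.push_back({begin, end});
        begin = end;
    }
    return chunks;
}
```

Each thread then runs the same single-threaded parsing loop over its own chunk and its own hash map.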
Multithreaded version of markusaksli_fast with the same principles.
Final findings
- 97% of CPU time spent in the parsing function
  - 70% in parsing numbers and adding to station data
  - 17% in parsing and hashing station names
  - 8% in hashmap lookup
- 1% spent merging results and sorting
- 3030 ms over 16,290.6 MB is 5376.4 MB/s, which is 75.5% of my benchmarked SSD sequential read speed
- 3030 ms at an average all-core clock rate of 5100 MHz is 15,483,600,000 CPU cycles, which over 16,290,602,421 bytes is 0.95 CPU cycles/byte
- Running a search to make a perfect hash function (probably the biggest improvement?)
- Merge and sort improvements
  - Post-parse per-thread restructuring and partial sorting for a faster merge at the end
  - Replacing std::sort (quicksort) with a radix sort
  - Basically pointless based on the CPU time distribution