I made this for fun, taking a shot at the Java 1BRC in C++. I managed to get a 64x speedup over the naive solution (from 196 seconds down to 3).
It also contains much more performant code for generating the input data.
| Solution | Average | Best | Worst | Cold |
|---|---|---|---|---|
| markusaksli_fast_threaded | 3.03s | 3.02s | 3.04s | 3.03s |
| markusaksli_default_threaded | 4.03s | 4.02s | 4.04s | 4.04s |
| markusaksli_fast | 22.18s | 22.16s | 22.21s | 22.16s |
| markusaksli_default | 28.66s | 27.19s | 31.42s | 28.21s |
| naive_plus | 95.87s | 94.57s | 97.76s | 94.57s |
| naive | 209.66s | 198.41s | 233.22s | 198.41s |
- CPU: AMD Ryzen 9 7950X 16-Core Processor
- RAM: 64.0 GB
- Storage: Samsung SSD 990 PRO 2TB (NVMe)
  - Sequential Read: 7124 MB/s
  - Random Read: 1,359,375 IOPS
`bin/gen.exe` is the built Windows binary for `src/gen.cpp`.
There is a `gen_data` script in the repository root that will run this exe and generate the default `1brc.txt` file in the `data` folder.
To find out what args the exe has, run it or the script with `-h`, but the main ones are:
- `-stations [int (default 100)]` - Number of station names to use (up to 41343)
- `-lines [int (default 1000000000)]` - Number of lines to generate
- `-buffersize [double (default 4.0)]` - The size of the generation buffer in GB. The bigger the better; at around 16 GB the buffer only needs to be filled once.
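For example, generating the full one-billion-line file with a 16 GB buffer might look like this (illustrative; run with `-h` to confirm the exact argument format):

```
bin\gen.exe -stations 100 -lines 1000000000 -buffersize 16.0
```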
The `build_all.bat` script will build every solution in `solutions`, and `benchmark.bat` will benchmark each solution and save the results in a CSV file. To run `benchmark.bat` you need the Sysinternals `sync.exe` tool in your `Path`, and you will need admin privileges so it can flush the file system cache between solutions.
To add a new solution, open a PR with
- The new solution
- Updated `build_all.bat`
- Updated `run_benchmark.ps1`
Worth noting that the longest station name actually present in the input data is Dolores Hidalgo Cuna de la Independencia Nacional at 49 bytes, even though the stated limit is < 100 bytes.
The original challenge specified that the file would be read from a RAM disk, but I preferred to optimize for something actually being read from disk.
The canonical simple way to solve the challenge in C++ (although it already needs some platform-specific code on Windows to output in UTF-8).
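A minimal sketch of what that naive approach looks like (illustrative, not the exact code in `solutions`; it assumes the challenge's `name;temperature` line format):

```cpp
#include <algorithm>
#include <fstream>
#include <iostream>
#include <limits>
#include <map>
#include <string>

struct Stats {
    double min = std::numeric_limits<double>::max();
    double max = std::numeric_limits<double>::lowest();
    double sum = 0.0;
    long long count = 0;
};

int main() {
    std::ifstream file("data/1brc.txt");
    std::map<std::string, Stats> stations; // std::map keeps the output sorted
    std::string line;
    while (std::getline(file, line)) {
        const size_t sep = line.find(';');
        const std::string name = line.substr(0, sep);        // copies the name...
        const double temp = std::stod(line.substr(sep + 1)); // ...and the number
        Stats& s = stations[name];
        s.min = std::min(s.min, temp);
        s.max = std::max(s.max, temp);
        s.sum += temp;
        s.count++;
    }
    for (const auto& [name, s] : stations)
        std::cout << name << '=' << s.min << '/' << (s.sum / s.count)
                  << '/' << s.max << '\n';
}
```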
Basic attempt at optimizing the naive solution, primarily reducing string copying.
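One copy-reducing trick of this kind (an illustration, not necessarily the exact change made in `naive_plus`) is a transparent comparator, which lets the map be probed with a `std::string_view` so the station name is only copied on first insertion:

```cpp
#include <map>
#include <string>
#include <string_view>

struct Stats {
    double min = 1e9, max = -1e9, sum = 0.0;
    long long count = 0;
};

// std::less<> is transparent, so find() accepts a std::string_view directly
// instead of forcing construction of a temporary std::string per lookup.
Stats& lookup(std::map<std::string, Stats, std::less<>>& stations,
              std::string_view name) {
    auto it = stations.find(name);
    if (it == stations.end())
        it = stations.emplace(std::string(name), Stats{}).first;
    return it->second;
}
```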
My default code for parsing text and using my simple base layer data structures. This is mostly how I would write my basic solution to any text parsing problem.
Improvements
- Memory mapping the input file
- Seeking to the delimiter `;` using SIMD
- Fast double parsing (if I needed full double parsing I would use a library, but in this case it's simpler to just write the char-by-char parsing yourself; a sketch follows this list)
- No string copying (just using views into the file data)
- Simple inlineable hash function and table lookup
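As an illustration of the hand-rolled number parsing, here is a sketch that assumes the challenge's fixed format (optional sign, one or two integer digits, exactly one fractional digit); the actual code in the repo may differ:

```cpp
// Parses a temperature like "-12.3" or "5.7" and advances p past the digits.
// The fixed format lets us skip all the generality of std::stod.
inline double parse_temp(const char*& p) {
    const bool negative = (*p == '-');
    if (negative) ++p;
    int value = *p++ - '0';                // first integer digit
    if (*p != '.')
        value = value * 10 + (*p++ - '0'); // optional second integer digit
    ++p;                                   // skip the '.'
    value = value * 10 + (*p++ - '0');     // the single fractional digit
    const double result = value / 10.0;
    return negative ? -result : result;
}
```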
An actual attempt at optimizing my default approach.
Improvements
- Parsing and storing only minimal int station data (16 bytes) with a single 8-byte load per temperature
- Using a fixed-size flat power-of-2 hash map with linear probing for even simpler lookups (sketched after this list)
- Custom compact 4-byte key structure for the map so more of them can fit in cache
- Did some experimentation on the quickest way to parse up to the delimiter; it turns out a basic char-by-char loop that calculates the hash as it goes worked better than anything smarter.
- Didn't do any hash seed searching over the source station names for a perfect hash function; that felt too hacky and cheap.
- Didn't manage to get any I/O improvements by touching pages or prefetching.
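A sketch of the flat, power-of-two, linear-probing lookup idea (the record layout and sizes here are assumptions for illustration, not the exact 16-byte structure used in the solution):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One slot per station; a power-of-two table size turns modulo into a mask.
struct Station {
    uint32_t key;      // compact key derived from the name (e.g. hash + length)
    int16_t  min, max; // temperatures stored as value * 10
    int64_t  sum;      // also value * 10; divided back out when printing
    uint32_t count;    // 0 marks an empty slot
};

constexpr size_t TABLE_SIZE = 1 << 16; // comfortably above the station count
constexpr size_t TABLE_MASK = TABLE_SIZE - 1;

// Returns either the matching slot or the empty slot where the caller
// should insert a new station (by writing its key and first sample).
Station& lookup(std::vector<Station>& table, uint32_t hash, uint32_t key) {
    size_t i = hash & TABLE_MASK;
    while (table[i].count != 0 && table[i].key != key)
        i = (i + 1) & TABLE_MASK; // linear probing with wrap-around
    return table[i];
}
```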
Multithreaded version of markusaksli_default.
- Partitions the file by dividing the content evenly between the threads and seeking each split point forward to a line boundary (sketched below)
- Each thread fills its own hash map
- Simple merge where we loop through each entry in a thread's hash map and look it up in the main thread's hash map
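A sketch of the partitioning step (simplified, error handling omitted; `data` would point at the memory-mapped file contents):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct Chunk { const char* begin; const char* end; };

// Splits [data, data + size) into roughly equal chunks, moving each split
// point forward to just past the next '\n' so no line straddles two chunks.
std::vector<Chunk> partition(const char* data, size_t size, unsigned threads) {
    std::vector<Chunk> chunks;
    const char* begin = data;
    for (unsigned t = 1; t <= threads; ++t) {
        const char* end = data + (size * t) / threads;
        if (t < threads)
            end = static_cast<const char*>(
                      std::memchr(end, '\n', data + size - end)) + 1;
        else
            end = data + size;
        chunks.push_back({begin, end});
        begin = end;
    }
    return chunks;
}
```

Each thread then runs the same single-threaded parsing loop over its own chunk and its own hash map.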
Multithreaded version of markusaksli_fast with the same principles.
Final findings
- 97% of CPU time spent in the parsing function
  - 70% in parsing numbers and adding to station data
  - 17% in parsing and hashing station names
  - 8% in hashmap lookup
- 1% spent merging results and sorting
- 3030 ms over 16,290.6 MB is 5376.4 MB/s, which is 75.5% of my benchmarked SSD sequential read speed
- 3030 ms at an average all-core clock rate of 5100 MHz is 15,483,600,000 CPU cycles, which over 16,290,602,421 bytes is 0.95 CPU cycles/byte
- Running a search to make a perfect hash function (probably the biggest improvement?)
- Merge and sort improvements
  - Post-parse per-thread restructuring and partial sorting for a faster merge at the end
  - Replacing std::sort (quicksort) with a radix sort
  - Basically pointless based on the CPU time distribution