zsort is a high-performance, parallel, in-memory sorter for modern Linux systems. It is designed to be a faster drop-in replacement for LC_ALL=C sort for datasets that fit in RAM, optimizing heavily for cache locality and instruction pipelining.
Benchmarks were performed on an 88-core / 330 GB RAM Linux machine.
zsort outperforms both the standard GNU sort and the commercial/industry-standard nsort.
Input: 100M lines, 100-byte records (generated via gensort -a).
| Sorter | Time (Wall) | Throughput | vs zsort |
|---|---|---|---|
| zsort | 3.30s | ~3.03 GB/s | -- |
| nsort | 9.52s | ~1.05 GB/s | 2.9x slower |
| GNU sort (parallel) | 55.4s | ~0.18 GB/s | 16.8x slower |
| GNU sort (default) | 62.1s | ~0.16 GB/s | 18.8x slower |
Input: 10M lines, fixed 10-char alphanumeric strings.
| Sorter | Time (Wall) | vs zsort |
|---|---|---|
| zsort | 0.22s | -- |
| nsort | 0.87s | 4.0x slower |
| GNU sort (parallel) | 4.18s | 19x slower |
All benchmarks run with
LC_ALL=C, outputting to/dev/nullto measure pure sorting throughput without disk write bottlenecks.
zsort achieves this performance by minimizing CPU stalls and maximizing memory bandwidth:
- Prefix Key Optimization: Instead of dereferencing string pointers for every comparison (which causes cache misses), zsort stores the first 8 bytes of each line as a
uint64_tdirectly in the sorting structure.- Result: 99% of comparisons happen in CPU registers.
- Benefit: Eliminates pointer chasing and branch mispredictions.
- Zero-Copy & Hugepages: Uses
mmapwithMADV_HUGEPAGEto reduce TLB misses and avoid copying data from kernel to user space during the read phase. - Parallel Write: Uses
pwrite()with pre-calculated offsets to allow all threads to blast data to disk simultaneously, saturating I/O bandwidth without file descriptor locking. - Cache-Friendly Layout: Data structures are aligned to cache lines to prevent false sharing between cores.
Requires GCC (with pthreads support) and Make.
git clone https://github.com/yourusername/zsort.git
cd zsort
makezsort reads from a file and writes to a file or stdout.
# Sort to a file
./zsort input.txt output.txt
# Sort to stdout
./zsort input.txt > output.txt
# Benchmark (sort-only, no disk I/O)
./zsort input.txt /dev/null- Any text file with
\n-delimited lines - Variable-length lines supported
- Binary content supported
- Files with or without trailing newline handled correctly
Parallel sample sort with 8-byte prefix keys across all available CPU cores.
typedef struct {
uint64_t key; // first 8 bytes as big-endian uint64 (prefix compare)
const char *ptr; // pointer to line start in mmap'd input
uint32_t len; // line length excluding newline
uint32_t _pad;
} line_t;-
mmap input —
MAP_PRIVATEwithMADV_SEQUENTIAL | MADV_HUGEPAGE. -
Parallel line discovery — Each thread scans its byte chunk for
\n, counts lines, then a prefix sum assigns write offsets, and each thread buildsline_trecords into a shared array. -
Sample sort partitioning — Thread 0 picks
num_buckets * 64evenly-spaced samples, sorts them with introsort, and selectsnum_buckets - 1splitters. All threads classify their lines via binary search into per-thread histograms, a prefix sum computes global bucket boundaries, and all threads scatter records into a second array.num_buckets = nthreads * 8for good load balancing. -
Per-bucket introsort (work-stealing) — An atomic counter distributes buckets. Each thread sorts in-place using introsort: Hoare partition with median-of-3 pivot, insertion sort at n <= 32, heapsort fallback at depth
2*log2(n). All comparisons inlined (no function pointer overhead).
-
Parallel output — Each thread computes its output byte range via prefix sum, then copies sorted lines into an output buffer with software prefetching.
| Allocation | Size |
|---|---|
| Input (mmap) | 10 GB |
| Record array A | 2.4 GB |
| Record array B | 2.4 GB |
| Output buffer | 10 GB |
| Total | ~25 GB |
# Small file
./zsort arquivo.txt out.txt
LC_ALL=C sort arquivo.txt > ref.txt
cmp out.txt ref.txt && echo "CORRECT"
# 10 GB file
./zsort dataset_10gb.txt out.txt
LC_ALL=C sort dataset_10gb.txt > ref.txt
cmp out.txt ref.txt && echo "CORRECT"MIT — see LICENSE for details.