Align vectors on disk for optimal performance on ARM CPUs #15341
Conversation
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
I wrote a small JMH benchmark to "pad" float vectors on disk with some leading bytes and measure scoring throughput at each offset. My machine uses the 256-bit variant of Panama to score vectors, so I saw optimal performance when floats are aligned to 32 bytes -- but keeping it 64 here as the max case.
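For context, here is a minimal sketch of the padding arithmetic being discussed (the class and method names are my own illustration, not the Lucene API): how many zero bytes must be written so that the next value starts on an alignment boundary.

```java
// Hypothetical sketch (names not from Lucene) of the arithmetic behind
// padding a file pointer to an alignment boundary, mirroring the idea
// behind IndexOutput.alignFilePointer.
public class AlignmentPad {
    // Distance to the next multiple of `alignment` (0 if already aligned).
    public static long padBytes(long filePointer, int alignment) {
        return (alignment - (filePointer % alignment)) % alignment;
    }

    public static long align(long filePointer, int alignment) {
        return filePointer + padBytes(filePointer, alignment);
    }

    public static void main(String[] args) {
        // A pointer at 100 needs 28 pad bytes to reach the next 64-byte boundary.
        System.out.println(padBytes(100, 64)); // prints 28
        System.out.println(align(100, 64));    // prints 128
        System.out.println(padBytes(128, 64)); // prints 0
    }
}
```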
cc @mikemccand who found this^ byte-misalignment possibility offline!
Also noting that for byte vectors, I saw no impact of padding -- so I'm not changing their alignment in this PR.
Wow, alignment still matters, and it matters a lot (24 -> 33 ops/us)! Thank you @kaivalnp for testing. Was this an ARM CPU? It's frustrating how the CPU just silently runs slower ... but what else could it do. I wonder whether modern x86-64 (Intel, AMD) CPUs also show this effect. I'll test this PR on the nightly Lucene benchy box.
Yes, it was a Graviton3.
I tested on beast3 (the nightly benchmarking box). I applied this PR, built it, and ran the benchmarks. Net/net it seems like alignment of the mapped in-RAM (virtual address space) doesn't matter? I also tested a newer CPU (Raptor Lake) -- I'll post that shortly.
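A small arithmetic sketch (my illustration, not from the PR) of why the on-disk alignment carries over to the mapped virtual address at all: mmap bases are page-aligned (typically 4096 bytes), so a mapped byte's virtual address is congruent to its file offset modulo the page size, and any alignment that divides the page size is preserved.

```java
// Illustration with hypothetical values: mmap returns page-aligned base
// addresses, so virtualAddress % 64 == fileOffset % 64 whenever 64 divides
// the page size.
public class MmapAlignment {
    static final long PAGE_SIZE = 4096; // a typical page size; an assumption here

    // Virtual address of the byte at `fileOffset`, given a page-aligned base.
    static long virtualAddress(long pageAlignedBase, long fileOffset) {
        return pageAlignedBase + fileOffset;
    }

    public static void main(String[] args) {
        long base = 140_000_000_000L / PAGE_SIZE * PAGE_SIZE; // some page-aligned base
        long fileOffset = 64 * 123;                           // a 64-byte-aligned file offset
        long addr = virtualAddress(base, fileOffset);
        // Because base % 4096 == 0 and 64 divides 4096, the 64-alignment survives.
        System.out.println(addr % 64); // prints 0
    }
}
```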
The Raptor Lake box is an i9-13900K. Results: there might be some small mis-alignment penalty for float SIMD?
From the CPU's optimization guide:
Thanks @mikemccand, there doesn't seem to be any performance penalty on beast3 (the nightly benchmarking box) -- a Ryzen Threadripper 3990X. There's definitely some impact of alignment on the Raptor Lake box (i9-13900K), but it is lower than on my machine (<10%) -- so this alignment issue is mostly on Graviton, or ARM CPUs in general, as @rmuir shared?
I tried running an end-to-end indexing + search benchmark comparing the baseline against this PR (64-byte alignment): indexing was sped up by ~7.6%, while search was sped up by ~3.8%.
I see another action item from this benchmark: I wasn't aligning the output inside the merge function, which is used by HNSW-based vector formats for merging.
Thanks @rmuir. How does the Panama Vector API handle alignment? Does it have methods to allocate aligned on-heap or off-heap vectors? Hmm, it looks like...
Oh good catch! I wonder what other places might write the flat vectors? Is the alignment also (or maybe less) important for the quantized cases? (Your results above are for float vectors?) Maybe at least...
This reverts commit 5764ac8.
Hmm, this did not help for some reason (merge time increased with this PR's 64-byte alignment). I'll still add a commit + revert, so people can see what I tried, and comment if I'm missing something!
I think alignment is less important for quantized vectors (which are stored as byte vectors on disk) -- because none of the JMH benchmarks show non-trivial variation with padding?
Yeah, those benchmarks^ are for float vectors
I added some print statements to complain if non-64-byte-aligned addresses were used. Not committing because it may not be needed after this PR?
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
Sorry for the delay here. I ran benchmarks a few more times offline, and the differences in indexing and merge times were inconsistent across runs. The only constant is the speedup in search time (3-4%). Another recent run with 1M Cohere vectors, 768d, compared the baseline against this PR.
mikemccand
left a comment
Thanks @kaivalnp -- this looks good to me. Thank you for also improving JMH benchy so we can measure these alignment impacts.
I left a few small comments.
I suppose the added (up to 15) bytes won't matter in practice -- the padding is constant per .vec file, so any added bytes are amortized against the often massive .vec files in larger and larger segments as they are merged...
```java
}

private static long alignOutput(IndexOutput output, VectorEncoding encoding) throws IOException {
    return output.alignFilePointer(
```
Oh how nice that we already had a method to alignFilePointer -- I wonder who uses that today?
All usages (except here in the compound format) are in Lucene*VectorsWriter classes :)
That was my invention and was specifically made for CFS files. But keep in mind, we may/should make this configurable in CFS files, too, because if you have a CFS file wrapped around your vectors, it gets unaligned if you require 64 bytes.
The compound file seems to align all files to 8 bytes (Long.BYTES) inside the "blob" -- which works because all other alignFilePointer calls I see use factors of 8 (I only see 4-byte alignment, using Float.BYTES).
we may/should make this configurable in CFS files, too
Would this involve aligning all files to 64 bytes instead? (because we do not retain the intended alignment in the index, so we'd have to pick the LCM of all occurrences of alignFilePointer in the codebase?)
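To illustrate the LCM point numerically (my sketch, not code from the PR): a single CFS-wide alignment preserves every per-file alignment only if it is a common multiple of them, so the smallest safe choice is their least common multiple.

```java
// Sketch: the smallest CFS-wide alignment that preserves all per-file
// alignments is the LCM of those alignments.
public class AlignmentLcm {
    static long gcd(long a, long b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    static long lcm(long a, long b) {
        return a / gcd(a, b) * b;
    }

    public static void main(String[] args) {
        // Alignments requested via alignFilePointer today: 4 (Float.BYTES),
        // 8 (Long.BYTES), plus the 64 proposed in this PR.
        long safe = lcm(lcm(4, 8), 64);
        System.out.println(safe); // prints 64
    }
}
```

Since 4 and 8 both divide 64, aligning everything to 64 happens to be the LCM here, though it pads files that only needed 4 or 8.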
aligning all files to 64 bytes instead?
This is slightly wasteful because most files do not need the high alignment, but at the same time something like an extension-specific alignment (i.e. for .vec files) seems leaky.
```diff
- long vectorDataOffset = vectorData.alignFilePointer(Float.BYTES);
  switch (fieldData.fieldInfo.getVectorEncoding()) {
+ VectorEncoding encoding = fieldData.fieldInfo.getVectorEncoding();
+ long vectorDataOffset = alignOutput(vectorData, encoding);
```
Hmm, should we not bother aligning if the vectors themselves won't stay aligned? I.e. the vector dimension is not 0 mod 16 or so?
Ah nice catch, we can at least align to 4 bytes (slightly better than no alignment) -- as it was before this PR
I've added a commit to apply 64 byte alignment when dimension % 16 == 0, and 4 byte alignment otherwise.
Looking back, I wonder if we can just keep it to 64 in all cases, for code simplicity? (vectors with arbitrary dimensions will implicitly be 4 byte aligned in this case).
No strong opinions though!
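The conditional rule from that commit could be sketched like this (a hypothetical helper of my own, not the actual patch): 64-byte alignment when the dimension is a multiple of 16 floats, falling back to the pre-existing 4-byte (Float.BYTES) alignment otherwise.

```java
// Hypothetical sketch of the conditional-alignment rule discussed above;
// not the actual Lucene code.
public class ChooseAlignment {
    static int alignmentFor(int dimension) {
        // 16 floats == 64 bytes: only then does every vector start 64-aligned.
        return dimension % 16 == 0 ? 64 : Float.BYTES;
    }

    public static void main(String[] args) {
        System.out.println(alignmentFor(768)); // prints 64 (768 = 16 * 48)
        System.out.println(alignmentFor(300)); // prints 4
    }
}
```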
Yeah OK +1 to just keep it simple -- always align.
```java
    return total;
}

private static long alignOutput(IndexOutput output, VectorEncoding encoding) throws IOException {
```
Can we add a NOTE somewhere in the javadocs (maybe in the ...Format.java?) that this format tries to preserve alignment to a 64-byte boundary, because on at least ARM Neoverse this matters? And that means, to maximize performance, the incoming vectors (if they are float[]) should ideally have 0 mod 16 dimensionality?
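The "0 mod 16 dimensionality" point can be checked numerically (my sketch, not from the PR): vector i starts at base + i * dimension * Float.BYTES, so with a 64-byte-aligned base, every vector stays on a 64-byte boundary exactly when dimension * 4 is a multiple of 64, i.e. dimension % 16 == 0.

```java
// Sketch: with a 64-byte-aligned base, vector i begins at
// base + i * dimension * Float.BYTES; all vectors stay 64-aligned
// iff the per-vector byte length is itself a multiple of 64.
public class VectorOffsets {
    static boolean allVectorsAligned(int dimension, int numVectors) {
        long base = 0; // assume a 64-byte-aligned start, as this PR arranges
        for (int i = 0; i < numVectors; i++) {
            long offset = base + (long) i * dimension * Float.BYTES;
            if (offset % 64 != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allVectorsAligned(768, 10)); // prints true  (768 % 16 == 0)
        System.out.println(allVectorsAligned(300, 10)); // prints false (300 % 16 != 0)
    }
}
```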
Makes sense, added!
i.e. only applied when dimension is a multiple of 16. Also add Javadoc comment about the alignment.
```java
@Param({"1", "128", "207", "256", "300", "512", "702", "1024"})
public int size;

@Param({"0", "1", "2", "4", "6", "8", "16", "20", "32", "50", "64", "100", "128", "255", "256"})
```
Q: I added a bunch of values during testing, should we reduce while committing? (say 0, 1, 4, 64)
+1
Also maybe add a comment about why we have this alignment benchy if you didn't already?
Added a small comment!
…ectors". Also add comment about padBytes in VectorScorerBenchmark (used to capture performance impact of byte alignment).


Description
Today, float vectors are aligned to 4 bytes in a Lucene index, but with Panama we can work with up to 512 bits (== 64 bytes, or 16 floats) at the same time.
I wonder if we should change this alignment to 64 bytes, in order to get optimal vector search performance with Panama?