## The Idea

CommitBee's splitter currently operates at file granularity: each file goes into exactly one commit group. This works well when different concerns live in separate files, but falls apart when a single file contains changes belonging to multiple logical concerns.
Consider a real-world scenario: you refactored error handling in `provider.rs` and added a response size cap in the same file. The streaming loop got a new size check (concern A), and the `new()` constructor got `Result` return type changes (concern B). Meanwhile, `mod.rs` has a new constant for the size cap (concern A) and updated `create_provider()` calls for the `Result` change (concern B). The current splitter sees two files with overlapping concerns and merges them into one group, even though a human would clearly create two separate commits.
Hunk-level splitting means CommitBee would analyze individual diff hunks within files and group related hunks across files into separate commits. The LLM (or a pre-LLM heuristic layer) would determine that hunk 1 of `provider.rs` + hunk 2 of `mod.rs` form one logical change, while hunk 2 of `provider.rs` + hunk 1 of `mod.rs` form another.
## Why This Matters
### Current Limitations
The file-level splitter already does sophisticated work (diff-shape fingerprinting, Jaccard similarity, symbol dependency merging), but it hits a hard ceiling: the atom of splitting is the file. When two concerns touch the same file, they're inseparable.
This happens constantly in practice:
- **Refactors that touch call sites:** changing a function signature requires updating the definition (one concern) and every call site (same concern), but those call sites live in files that also have unrelated changes
- **Cross-cutting infrastructure:** adding error handling, logging, or a shared constant affects multiple hunks across multiple files, interleaved with feature work
- **Related API changes:** a new field in a struct, its serialization impl, and the code that populates it might each be one hunk in three different files, alongside other hunks in those same files
### What Good Commits Look Like
The gold standard is `git add -p`: a skilled developer staging individual hunks. Hunk-level splitting would automate this judgment, producing commits where:
- Every hunk in the commit is there for the same reason
- The commit is self-contained (doesn't break the build if checked out alone)
- The commit message accurately describes all changes because they share a single purpose
## Proposed Approach
### Phase 1: Hunk Extraction and Annotation
Parse each file's diff into individual hunks with metadata:
```rust
struct DiffHunk {
    file: PathBuf,
    hunk_index: usize,
    header: String,                    // @@ -10,5 +10,7 @@
    start_line: usize,
    end_line: usize,
    added_lines: Vec<String>,
    removed_lines: Vec<String>,
    // Enrichment from tree-sitter
    enclosing_symbol: Option<String>,  // e.g., "AnthropicProvider::generate"
    symbol_kind: Option<SymbolKind>,
}
```
The key enrichment is the enclosing symbol: mapping each hunk to the function/struct/impl it lives inside. This is already partially available: tree-sitter parses full files and the analyzer maps hunks to symbol spans. The infrastructure exists; it just needs to flow into the splitter.
Language coverage matters here. The quality of hunk clustering depends directly on how well tree-sitter can identify enclosing symbols. CommitBee currently supports Rust, TypeScript, JavaScript, Python, and Go. Expanding to more languages (Java, C/C++, Ruby, Swift, Kotlin, etc.) would make hunk-level splitting viable for more codebases. Without enclosing symbol data, hunk clustering falls back to weaker signals like token overlap: still useful, but less precise.
### Phase 2: Hunk Similarity and Clustering
Group hunks by semantic relatedness using signals CommitBee already computes:
| Signal | Weight | Example |
| --- | --- | --- |
| Same enclosing symbol (cross-file) | High | `Provider::new()` changed in 3 files |
| Shared token vocabulary (Jaccard) | Medium | All hunks mention `MAX_RESPONSE_BYTES` |
| Caller-callee relationship | High | Hunk adds function, another hunk calls it |
| Same diff shape (balanced add/remove) | Low | Mechanical refactors |
| Same symbol kind modified | Low | All hunks modify `struct` fields |
The clustering algorithm would be similar to the current `group_by_diff_shape` but operate on hunks instead of files, with the enclosing symbol providing a much stronger signal than file-level module detection.
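As an illustration, the weighted-signal clustering could be sketched as below. Only two of the table's signals are shown, and the field names, weights, and the 0.5 threshold are invented for the example, not CommitBee's actual values:

```rust
use std::collections::HashSet;

// Illustrative hunk metadata (a subset of the Phase 1 DiffHunk enrichment).
struct Hunk {
    enclosing_symbol: Option<String>,
    tokens: HashSet<String>,
}

// Pairwise relatedness from two signals: same enclosing symbol (high
// weight) and shared token vocabulary via Jaccard similarity (medium).
fn similarity(a: &Hunk, b: &Hunk) -> f64 {
    let mut score = 0.0;
    if let (Some(x), Some(y)) = (&a.enclosing_symbol, &b.enclosing_symbol) {
        if x == y {
            score += 0.6;
        }
    }
    let inter = a.tokens.intersection(&b.tokens).count() as f64;
    let union = a.tokens.union(&b.tokens).count() as f64;
    if union > 0.0 {
        score += 0.3 * (inter / union);
    }
    score
}

// Union-find clustering: hunk pairs scoring above the threshold end up in
// the same group. Returns one representative cluster id per hunk.
fn cluster(hunks: &[Hunk], threshold: f64) -> Vec<usize> {
    fn find(parent: &mut Vec<usize>, i: usize) -> usize {
        let p = parent[i];
        if p == i {
            i
        } else {
            let root = find(parent, p);
            parent[i] = root; // path compression
            root
        }
    }
    let mut parent: Vec<usize> = (0..hunks.len()).collect();
    for i in 0..hunks.len() {
        for j in i + 1..hunks.len() {
            if similarity(&hunks[i], &hunks[j]) >= threshold {
                let (ri, rj) = (find(&mut parent, i), find(&mut parent, j));
                parent[ri] = rj;
            }
        }
    }
    (0..hunks.len()).map(|i| find(&mut parent, i)).collect()
}
```

A real implementation would fold in the caller-callee and diff-shape signals and tune the weights; structurally it mirrors the existing file-level flow, just over hunks.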
### Phase 3: Atomic Staging and Committing
The current split flow has a known weakness: it's non-atomic. The `unstage_all → stage_files → commit` loop means a failure mid-sequence leaves earlier commits applied with no rollback. Hunk-level splitting makes this worse: partial file staging via `git apply --cached` has more failure modes than whole-file staging.
This needs to be solved properly rather than carried forward:
**Index snapshot approach:**

```text
1. Save full index state: git stash create
2. Reset to clean index: git reset
3. For each hunk group:
   a. Construct unified patch from group's hunks
   b. Apply to index: git apply --cached <patch>
   c. Verify (optional): cargo check / tree-sitter parse
   d. If verify fails: abort, restore snapshot, report error
   e. Commit
   f. Reset index for next group
4. On any failure: git stash apply <snapshot> to restore original state
```
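Step 3a, reassembling a unified patch for one group, might look like this sketch. It uses a simplified take on the Phase 1 `DiffHunk` (illustrative field names, body lines kept with their raw prefixes); in practice hunk headers may need recounting (e.g. `git apply --recount`) once hunks from other groups are omitted:

```rust
// Sketch: rebuild a unified patch for one hunk group so it can be piped
// to `git apply --cached`.
struct GroupHunk {
    file: String,
    header: String,      // e.g. "@@ -10,5 +10,7 @@"
    lines: Vec<String>,  // hunk body with original "+"/"-"/" " prefixes
}

fn build_patch(hunks: &[GroupHunk]) -> String {
    let mut patch = String::new();
    let mut current_file: Option<&str> = None;
    for h in hunks {
        // Emit one ---/+++ header per file; hunks must arrive grouped
        // by file and in their original order within each file.
        if current_file != Some(h.file.as_str()) {
            patch.push_str(&format!("--- a/{0}\n+++ b/{0}\n", h.file));
            current_file = Some(h.file.as_str());
        }
        patch.push_str(&h.header);
        patch.push('\n');
        for line in &h.lines {
            patch.push_str(line);
            patch.push('\n');
        }
    }
    patch
}
```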
The key difference from today: a saved snapshot allows full rollback. If commit 3/5 fails, we restore the index to its pre-split state rather than leaving two orphan commits. This also eliminates the TOCTOU window: the snapshot captures the index atomically before any mutations begin.
For file-level splitting (the current behavior), this same approach replaces the fragile `unstage_all` loop. It's worth implementing even before hunk-level splitting ships.
**Verification options per group:**

| Check | Latency | Catches |
| --- | --- | --- |
| None (current) | 0s | Nothing |
| Tree-sitter parse | ~10ms | Syntax errors from split hunks |
| `cargo check` / `tsc --noEmit` | 2-30s | Type errors, missing imports |
| Full test suite | Minutes | Behavioral regressions |
A reasonable default: tree-sitter parse (fast, catches the most common hunk-split failure). Full compilation check as an opt-in flag (`--verify-splits`).
```text
Staged changes:
  file_a.rs: [hunk1, hunk2, hunk3]
  file_b.rs: [hunk1, hunk2]
  file_c.rs: [hunk1]

After hunk-level analysis:
  Group 1 (refactor: error handling):  file_a.rs hunk1 + file_b.rs hunk2
  Group 2 (feat: response size cap):   file_a.rs hunk2 + file_b.rs hunk1 + file_c.rs hunk1
  Group 3 (fix: EOF buffer parsing):   file_a.rs hunk3
```

### Phase 4: Hybrid Model Pipeline
Hunk classification and commit message generation are fundamentally different tasks with different complexity requirements. This maps naturally to a hybrid model pipeline where work is distributed by task weight:
**Local-first classification (small model):**
The hunk grouping task is structurally simple: given N hunks with metadata, assign each to a cluster. A small local model (e.g., `qwen3.5:4b`) can handle this because the input is compact (hunk headers, enclosing symbols, token lists) and the output is just group assignments, not prose.
```text
Input:  Hunk A: modifies Provider::new(),      tokens: [Result, Error, client]
        Hunk B: modifies Provider::generate(), tokens: [MAX_RESPONSE_BYTES, len]
        Hunk C: modifies Provider::new(),      tokens: [Result, Error, build]

Output: Group 1: [A, C]
        Group 2: [B]
```

**Heavier model for message generation:**
Once groups are established, the commit message prompt per group is smaller and more focused than the original all-files prompt. This is where a larger model adds value: either a bigger local model (`qwen3.5:8b`, higher thinking token budget) or a cloud API for higher-quality prose.
**Possible configurations:**
| Classification | Message Generation | Use Case |
| --- | --- | --- |
| Heuristic only | `qwen3.5:4b` local | Fastest, fully offline |
| `qwen3.5:4b` local | `qwen3.5:4b` local | Current default behavior |
| `qwen3.5:4b` local | `qwen3.5:8b` local | Better messages, still offline |
| `qwen3.5:4b` local | Cloud API (Anthropic/OpenAI) | Best message quality |
| Heuristic only | Cloud API | Skip classification LLM cost |
The key insight: classification doesn't need the expensive model, and message generation doesn't need to see all the hunks. Splitting the pipeline lets each stage use the right-sized tool.
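One possible shape for such a pipeline config, sketched in hypothetical TOML (the section and key names are invented for illustration, not an existing CommitBee schema; the models come from the table above):

```toml
# Hypothetical two-stage pipeline configuration -- key names are illustrative.
[pipeline]
# Stage 1: small local model assigns hunks to groups.
classification = { provider = "ollama", model = "qwen3.5:4b" }
# Stage 2: larger model writes each group's commit message.
message = { provider = "ollama", model = "qwen3.5:8b" }
```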
With no pipeline config present, behavior stays simple with a single provider (the current behavior).
## Challenges
### Build Coherence
The hardest problem: will each hunk group compile independently? Splitting at file boundaries guarantees each commit has complete files. Splitting at hunk boundaries means a commit might have half a function signature change without the corresponding call-site update in the same file.
**Mitigation strategies:**
- **Symbol-aware merging:** if two hunks in the same file modify the same function, they must stay together
- **Verification gate:** tree-sitter parse (fast) or `cargo check` (thorough) after staging each group; merge a failing group with the most related remaining group
- **Conservative default:** when uncertain, keep hunks from the same file together (graceful degradation to current behavior)
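The symbol-aware merging rule can be sketched as a pre-pass that computes inseparable hunk units before any similarity clustering runs (field names are illustrative, borrowed from the Phase 1 sketch):

```rust
use std::collections::HashMap;

// Illustrative subset of the Phase 1 hunk metadata.
struct Hunk {
    file: String,
    enclosing_symbol: Option<String>,
}

/// Pre-pass for symbol-aware merging: hunks that modify the same symbol in
/// the same file are returned as one inseparable unit; hunks without symbol
/// info stay as singleton units.
fn mandatory_units(hunks: &[Hunk]) -> Vec<Vec<usize>> {
    let mut by_symbol: HashMap<(String, String), Vec<usize>> = HashMap::new();
    let mut singletons: Vec<Vec<usize>> = Vec::new();
    for (i, h) in hunks.iter().enumerate() {
        match &h.enclosing_symbol {
            // Same (file, symbol) pair => must land in the same commit.
            Some(sym) => by_symbol
                .entry((h.file.clone(), sym.clone()))
                .or_default()
                .push(i),
            // No tree-sitter symbol (top-level code, unsupported language):
            // be conservative and keep the hunk as its own unit.
            None => singletons.push(vec![i]),
        }
    }
    by_symbol.into_values().chain(singletons).collect()
}
```

Clustering would then operate on these units instead of raw hunks, so a split can never separate two edits to the same function.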
### Hunk Interdependencies
Hunks within a file can depend on each other through local variables, control flow, or type definitions that span multiple hunks. Context lines in the diff help, but aren't always sufficient.
The enclosing-symbol approach handles the common case: hunks inside the same function body almost always belong together. The tricky case is hunks in different functions within the same file that are logically related (e.g., a helper function and its caller, both in the same module).
### Language Coverage Gap
Hunk clustering quality degrades for unsupported languages. Without tree-sitter grammars, there's no enclosing symbol data, and the system falls back to token-only similarity, which can't distinguish "two hunks that happen to use the same variable name" from "two hunks that modify the same function."
Priority languages to add (by ecosystem prevalence):

| Language | Grammar | Difficulty |
| --- | --- | --- |
| Java | `tree-sitter-java` | |
| C/C++ | `tree-sitter-c` / `tree-sitter-cpp` | Medium: preprocessor complicates symbol extraction |
| Ruby | `tree-sitter-ruby` | Low: method/class detection straightforward |
| Swift | `tree-sitter-swift` | Low |
| Kotlin | `tree-sitter-kotlin` | Low |
| PHP | `tree-sitter-php` | Medium |
Each grammar addition follows the same pattern: add the `tree-sitter-*` crate, write a `queries/*.scm` capture file for symbol extraction, and register it in `AnalyzerService`. The analyzer's rayon-parallel architecture handles new languages without structural changes.
### UX: Showing Hunk Groups
The current split suggestion UI shows file lists per group. With hunk-level splitting, we'd need to show which parts of each file go where β potentially with abbreviated diff snippets or line ranges. This needs to be informative without overwhelming the user.
## Incremental Delivery
This doesn't need to ship all at once:
- **v0.4 (Foundations):** Hunk extraction (parse diffs into `DiffHunk` structs, enrich with enclosing symbols). Atomic index snapshot for the existing file-level split flow. No behavioral change to splitting logic yet.
- **v0.5 (Hunk-aware file grouping):** Use hunk-level signals in the splitter, but still commit at file granularity (assign each file to the group that claims the majority of its hunks). Strictly better than today with zero staging complexity.
- **v0.6 (True hunk-level staging):** `git apply --cached` for per-hunk staging with index snapshot rollback. Verification gate (tree-sitter parse by default, `--verify-splits` for a full compilation check).
- **v0.7 (Hybrid pipeline):** Optional two-stage model configuration: small/local model for hunk classification, configurable model for message generation.
- **Ongoing (Language expansion):** Add tree-sitter grammars incrementally. Each new language improves hunk clustering quality for projects using it.
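The v0.5 majority-vote assignment is small enough to sketch directly (input shape is illustrative; a real version would read the clustering output from Phase 2):

```rust
use std::collections::HashMap;

/// `assignments` maps (file, hunk_index) -> cluster id from hunk analysis.
/// Returns file -> winning cluster id, so each file is still committed
/// whole but lands in the group that claims the majority of its hunks.
/// Ties break toward the lower cluster id for determinism.
fn file_level_groups(assignments: &[((String, usize), usize)]) -> HashMap<String, usize> {
    // Count how many hunks each cluster claims per file.
    let mut votes: HashMap<String, HashMap<usize, usize>> = HashMap::new();
    for ((file, _hunk), cluster) in assignments {
        *votes
            .entry(file.clone())
            .or_default()
            .entry(*cluster)
            .or_insert(0) += 1;
    }
    votes
        .into_iter()
        .map(|(file, counts)| {
            let winner = counts
                .into_iter()
                // Max by vote count; on equal counts, the smaller cluster
                // id compares as "greater" so it wins the tie.
                .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)))
                .map(|(cluster, _)| cluster)
                .unwrap();
            (file, winner)
        })
        .collect()
}
```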
v0.5 is the sweet spot for early value: it uses hunk analysis to make better file-level decisions without any staging machinery changes. v0.4's atomic index snapshot is independently valuable: it fixes the existing split flow's rollback problem.
## Open Questions
- **How often do mixed-concern files actually occur?** Is this a 10% or a 60% problem in real workflows? Should we instrument the current splitter to log cases where all files end up in one group despite multiple hunks?
- **Verification default:** tree-sitter parse is fast but only catches syntax errors. Is that sufficient, or should the default be `cargo check` with a `--no-verify-splits` opt-out?
- **Interaction with `--no-split`:** should hunk-level splitting be a separate flag, or fold into the existing split toggle?
- **Conflict with manual staging:** if the user has carefully staged specific hunks via `git add -p`, should CommitBee detect and respect that intent rather than re-splitting?
- **Hybrid pipeline complexity:** is the two-provider config worth the UX cost, or should the hybrid approach only be exposed as presets (e.g., `pipeline = "local-fast"` / `"local-quality"` / `"hybrid-cloud"`)?
## Feedback Welcome
This is an early-stage design discussion. If you have thoughts on:
- Real-world scenarios where this would (or wouldn't) help
- Alternative approaches to hunk grouping
- The hybrid pipeline model: is split classification/generation useful or over-engineered?

Please share them.