## The Idea

CommitBee's splitter currently operates at file granularity: each file goes into exactly one commit group. This works well when different concerns live in separate files, but falls apart when a single file contains changes belonging to multiple logical concerns.
Consider a real-world scenario: you refactored error handling in `provider.rs` and added a response size cap in the same file. The streaming loop got a new size check (concern A), and the `new()` constructor got `Result` return type changes (concern B). Meanwhile, `mod.rs` has a new constant for the size cap (concern A) and updated `create_provider()` calls for the `Result` change (concern B). The current splitter sees two files with overlapping concerns and merges them into one group, even though a human would clearly create two separate commits.
Hunk-level splitting means CommitBee would analyze individual diff hunks within files and group related hunks across files into separate commits. The LLM (or a pre-LLM heuristic layer) would determine that hunk 1 of `provider.rs` + hunk 2 of `mod.rs` form one logical change, while hunk 2 of `provider.rs` + hunk 1 of `mod.rs` form another.
## Why This Matters
### Current Limitations
The file-level splitter already does sophisticated work (diff-shape fingerprinting, Jaccard similarity, symbol dependency merging), but it hits a hard ceiling: the atom of splitting is the file. When two concerns touch the same file, they're inseparable.
This happens constantly in practice:
- **Refactors that touch call sites:** changing a function signature requires updating the definition (one concern) and every call site (same concern), but those call sites live in files that also have unrelated changes
- **Cross-cutting infrastructure:** adding error handling, logging, or a shared constant affects multiple hunks across multiple files, interleaved with feature work
- **Related API changes:** a new field in a struct, its serialization impl, and the code that populates it might each be one hunk in three different files, alongside other hunks in those same files
### What Good Commits Look Like
The gold standard is `git add -p`: a skilled developer staging individual hunks. Hunk-level splitting would automate this judgment, producing commits where:
- Every hunk in the commit is there for the same reason
- The commit is self-contained (doesn't break the build if checked out alone)
- The commit message accurately describes all changes because they share a single purpose
## Proposed Approach
### Phase 1: Hunk Extraction and Annotation
Parse each file's diff into individual hunks with metadata:
```rust
struct DiffHunk {
    file: PathBuf,
    hunk_index: usize,
    header: String,                    // @@ -10,5 +10,7 @@
    start_line: usize,
    end_line: usize,
    added_lines: Vec<String>,
    removed_lines: Vec<String>,
    // Enrichment from tree-sitter
    enclosing_symbol: Option<String>,  // e.g., "AnthropicProvider::generate"
    symbol_kind: Option<SymbolKind>,
}
```
The key enrichment is the enclosing symbol: mapping each hunk to the function/struct/impl it lives inside. This is already partially available: tree-sitter parses full files and the analyzer maps hunks to symbol spans. The infrastructure exists; it just needs to flow into the splitter.
Language coverage matters here. The quality of hunk clustering depends directly on how well tree-sitter can identify enclosing symbols. CommitBee currently supports Rust, TypeScript, JavaScript, Python, and Go. Expanding to more languages (Java, C/C++, Ruby, Swift, Kotlin, etc.) would make hunk-level splitting viable for more codebases. Without enclosing symbol data, hunk clustering falls back to weaker signals like token overlap: still useful, but less precise.
### Phase 2: Hunk Similarity and Clustering
Group hunks by semantic relatedness using signals CommitBee already computes:
| Signal | Weight | Example |
| --- | --- | --- |
| Same enclosing symbol (cross-file) | High | `Provider::new()` changed in 3 files |
| Shared token vocabulary (Jaccard) | Medium | All hunks mention `MAX_RESPONSE_BYTES` |
| Caller-callee relationship | High | Hunk adds function, another hunk calls it |
| Same diff shape (balanced add/remove) | Low | Mechanical refactors |
| Same symbol kind modified | Low | All hunks modify `struct` fields |
The clustering algorithm would be similar to the current `group_by_diff_shape` but operate on hunks instead of files, with the enclosing symbol providing a much stronger signal than file-level module detection.
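As an illustration, the weighted-signal clustering could be sketched as below. Only two of the table's signals are shown, and the field names, weights, and the 0.5 threshold are invented for the example, not CommitBee's actual values:

```rust
use std::collections::HashSet;

// Illustrative hunk metadata (a subset of the Phase 1 DiffHunk enrichment).
struct Hunk {
    enclosing_symbol: Option<String>,
    tokens: HashSet<String>,
}

// Pairwise relatedness from two signals: same enclosing symbol (high
// weight) and shared token vocabulary via Jaccard similarity (medium).
fn similarity(a: &Hunk, b: &Hunk) -> f64 {
    let mut score = 0.0;
    if let (Some(x), Some(y)) = (&a.enclosing_symbol, &b.enclosing_symbol) {
        if x == y {
            score += 0.6;
        }
    }
    let inter = a.tokens.intersection(&b.tokens).count() as f64;
    let union = a.tokens.union(&b.tokens).count() as f64;
    if union > 0.0 {
        score += 0.3 * (inter / union);
    }
    score
}

// Union-find clustering: hunk pairs scoring above the threshold end up in
// the same group. Returns one representative cluster id per hunk.
fn cluster(hunks: &[Hunk], threshold: f64) -> Vec<usize> {
    fn find(parent: &mut Vec<usize>, i: usize) -> usize {
        let p = parent[i];
        if p == i {
            i
        } else {
            let root = find(parent, p);
            parent[i] = root; // path compression
            root
        }
    }
    let mut parent: Vec<usize> = (0..hunks.len()).collect();
    for i in 0..hunks.len() {
        for j in i + 1..hunks.len() {
            if similarity(&hunks[i], &hunks[j]) >= threshold {
                let (ri, rj) = (find(&mut parent, i), find(&mut parent, j));
                parent[ri] = rj;
            }
        }
    }
    (0..hunks.len()).map(|i| find(&mut parent, i)).collect()
}
```

A real implementation would fold in the caller-callee and diff-shape signals and tune the weights; structurally it mirrors the existing file-level flow, just over hunks.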
### Phase 3: Atomic Staging and Committing
The current split flow has a known weakness: it's non-atomic. The `unstage_all → stage_files → commit` loop means a failure mid-sequence leaves earlier commits applied with no rollback. Hunk-level splitting makes this worse: partial file staging via `git apply --cached` has more failure modes than whole-file staging.
This needs to be solved properly rather than carried forward:
**Index snapshot approach:**

```text
1. Save full index state: git stash create
2. Reset to clean index: git reset
3. For each hunk group:
   a. Construct unified patch from group's hunks
   b. Apply to index: git apply --cached <patch>
   c. Verify (optional): cargo check / tree-sitter parse
   d. If verify fails: abort, restore snapshot, report error
   e. Commit
   f. Reset index for next group
4. On any failure: git stash apply <snapshot> to restore original state
```
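Step 3a, reassembling a unified patch for one group, might look like this sketch. It uses a simplified take on the Phase 1 `DiffHunk` (illustrative field names, body lines kept with their raw prefixes); in practice hunk headers may need recounting (e.g. `git apply --recount`) once hunks from other groups are omitted:

```rust
// Sketch: rebuild a unified patch for one hunk group so it can be piped
// to `git apply --cached`.
struct GroupHunk {
    file: String,
    header: String,      // e.g. "@@ -10,5 +10,7 @@"
    lines: Vec<String>,  // hunk body with original "+"/"-"/" " prefixes
}

fn build_patch(hunks: &[GroupHunk]) -> String {
    let mut patch = String::new();
    let mut current_file: Option<&str> = None;
    for h in hunks {
        // Emit one ---/+++ header per file; hunks must arrive grouped
        // by file and in their original order within each file.
        if current_file != Some(h.file.as_str()) {
            patch.push_str(&format!("--- a/{0}\n+++ b/{0}\n", h.file));
            current_file = Some(h.file.as_str());
        }
        patch.push_str(&h.header);
        patch.push('\n');
        for line in &h.lines {
            patch.push_str(line);
            patch.push('\n');
        }
    }
    patch
}
```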
The key difference from today: a saved snapshot allows full rollback. If commit 3/5 fails, we restore the index to its pre-split state rather than leaving two orphan commits. This also eliminates the TOCTOU window: the snapshot captures the index atomically before any mutations begin.
For file-level splitting (the current behavior), this same approach replaces the fragile `unstage_all` loop. It's worth implementing even before hunk-level splitting ships.
**Verification options per group:**

| Check | Latency | Catches |
| --- | --- | --- |
| None (current) | 0s | Nothing |
| Tree-sitter parse | ~10ms | Syntax errors from split hunks |
| `cargo check` / `tsc --noEmit` | 2-30s | Type errors, missing imports |
| Full test suite | Minutes | Behavioral regressions |
A reasonable default: tree-sitter parse (fast, catches the most common hunk-split failure). Full compilation check as an opt-in flag (`--verify-splits`).
```text
Staged changes:
  file_a.rs: [hunk1, hunk2, hunk3]
  file_b.rs: [hunk1, hunk2]
  file_c.rs: [hunk1]

After hunk-level analysis:
  Group 1 (refactor: error handling):  file_a.rs hunk1 + file_b.rs hunk2
  Group 2 (feat: response size cap):   file_a.rs hunk2 + file_b.rs hunk1 + file_c.rs hunk1
  Group 3 (fix: EOF buffer parsing):   file_a.rs hunk3
```

### Phase 4: Hybrid Model Pipeline
Hunk classification and commit message generation are fundamentally different tasks with different complexity requirements. This maps naturally to a hybrid model pipeline where work is distributed by task weight:
**Local-first classification (small model):**
The hunk grouping task is structurally simple: given N hunks with metadata, assign each to a cluster. A small local model (e.g., `qwen3.5:4b`) can handle this because the input is compact (hunk headers, enclosing symbols, token lists) and the output is just group assignments, not prose.
```text
Input:  Hunk A: modifies Provider::new(),      tokens: [Result, Error, client]
        Hunk B: modifies Provider::generate(), tokens: [MAX_RESPONSE_BYTES, len]
        Hunk C: modifies Provider::new(),      tokens: [Result, Error, build]

Output: Group 1: [A, C]
        Group 2: [B]
```

**Heavier model for message generation:**
Once groups are established, the commit message prompt per group is smaller and more focused than the original all-files prompt. This is where a larger model adds value: either a bigger local model (`qwen3.5:8b`, higher thinking token budget) or a cloud API for higher-quality prose.
**Possible configurations:**
| Classification | Message Generation | Use Case |
| --- | --- | --- |
| Heuristic only | `qwen3.5:4b` local | Fastest, fully offline |
| `qwen3.5:4b` local | `qwen3.5:4b` local | Current default behavior |
| `qwen3.5:4b` local | `qwen3.5:8b` local | Better messages, still offline |
| `qwen3.5:4b` local | Cloud API (Anthropic/OpenAI) | Best message quality |
| Heuristic only | Cloud API | Skip classification LLM cost |
The key insight: classification doesn't need the expensive model, and message generation doesn't need to see all the hunks. Splitting the pipeline lets each stage use the right-sized tool.
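One possible shape for such a pipeline config, sketched in hypothetical TOML (the section and key names are invented for illustration, not an existing CommitBee schema; the models come from the table above):

```toml
# Hypothetical two-stage pipeline configuration -- key names are illustrative.
[pipeline]
# Stage 1: small local model assigns hunks to groups.
classification = { provider = "ollama", model = "qwen3.5:4b" }
# Stage 2: larger model writes each group's commit message.
message = { provider = "ollama", model = "qwen3.5:8b" }
```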
With no pipeline config present, behavior stays simple with a single provider (the current behavior).
## Challenges
### Build Coherence
The hardest problem: will each hunk group compile independently? Splitting at file boundaries guarantees each commit has complete files. Splitting at hunk boundaries means a commit might have half a function signature change without the corresponding call-site update in the same file.
**Mitigation strategies:**
- **Symbol-aware merging:** if two hunks in the same file modify the same function, they must stay together
- **Verification gate:** tree-sitter parse (fast) or `cargo check` (thorough) after staging each group; merge a failing group with the most related remaining group
- **Conservative default:** when uncertain, keep hunks from the same file together (graceful degradation to current behavior)
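The symbol-aware merging rule can be sketched as a pre-pass that computes inseparable hunk units before any similarity clustering runs (field names are illustrative, borrowed from the Phase 1 sketch):

```rust
use std::collections::HashMap;

// Illustrative subset of the Phase 1 hunk metadata.
struct Hunk {
    file: String,
    enclosing_symbol: Option<String>,
}

/// Pre-pass for symbol-aware merging: hunks that modify the same symbol in
/// the same file are returned as one inseparable unit; hunks without symbol
/// info stay as singleton units.
fn mandatory_units(hunks: &[Hunk]) -> Vec<Vec<usize>> {
    let mut by_symbol: HashMap<(String, String), Vec<usize>> = HashMap::new();
    let mut singletons: Vec<Vec<usize>> = Vec::new();
    for (i, h) in hunks.iter().enumerate() {
        match &h.enclosing_symbol {
            // Same (file, symbol) pair => must land in the same commit.
            Some(sym) => by_symbol
                .entry((h.file.clone(), sym.clone()))
                .or_default()
                .push(i),
            // No tree-sitter symbol (top-level code, unsupported language):
            // be conservative and keep the hunk as its own unit.
            None => singletons.push(vec![i]),
        }
    }
    by_symbol.into_values().chain(singletons).collect()
}
```

Clustering would then operate on these units instead of raw hunks, so a split can never separate two edits to the same function.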
### Hunk Interdependencies
Hunks within a file can depend on each other through local variables, control flow, or type definitions that span multiple hunks. Context lines in the diff help, but aren't always sufficient.
The enclosing-symbol approach handles the common case: hunks inside the same function body almost always belong together. The tricky case is hunks in different functions within the same file that are logically related (e.g., a helper function and its caller, both in the same module).
### Language Coverage Gap
Hunk clustering quality degrades for unsupported languages. Without tree-sitter grammars, there's no enclosing symbol data, and the system falls back to token-only similarity, which can't distinguish "two hunks that happen to use the same variable name" from "two hunks that modify the same function."
Priority languages to add (by ecosystem prevalence):

| Language | Grammar | Difficulty |
| --- | --- | --- |
| Java | `tree-sitter-java` | |
| C/C++ | `tree-sitter-c` / `tree-sitter-cpp` | Medium: preprocessor complicates symbol extraction |
| Ruby | `tree-sitter-ruby` | Low: method/class detection straightforward |
| Swift | `tree-sitter-swift` | Low |
| Kotlin | `tree-sitter-kotlin` | Low |
| PHP | `tree-sitter-php` | Medium |
Each grammar addition follows the same pattern: add the `tree-sitter-*` crate, write a `queries/*.scm` capture file for symbol extraction, and register it in `AnalyzerService`. The analyzer's rayon-parallel architecture handles new languages without structural changes.
### UX: Showing Hunk Groups
The current split suggestion UI shows file lists per group. With hunk-level splitting, we'd need to show which parts of each file go where β potentially with abbreviated diff snippets or line ranges. This needs to be informative without overwhelming the user.
## Incremental Delivery
This doesn't need to ship all at once:
- **v0.4 (Foundations):** Hunk extraction (parse diffs into `DiffHunk` structs, enrich with enclosing symbols). Atomic index snapshot for the existing file-level split flow. No behavioral change to splitting logic yet.
- **v0.5 (Hunk-aware file grouping):** Use hunk-level signals in the splitter, but still commit at file granularity (assign each file to the group that claims the majority of its hunks). Strictly better than today with zero staging complexity.
- **v0.6 (True hunk-level staging):** `git apply --cached` for per-hunk staging with index snapshot rollback. Verification gate (tree-sitter parse by default, `--verify-splits` for a full compilation check).
- **v0.7 (Hybrid pipeline):** Optional two-stage model configuration: small/local model for hunk classification, configurable model for message generation.
- **Ongoing (Language expansion):** Add tree-sitter grammars incrementally. Each new language improves hunk clustering quality for projects using it.
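The v0.5 majority-vote assignment is small enough to sketch directly (input shape is illustrative; a real version would read the clustering output from Phase 2):

```rust
use std::collections::HashMap;

/// `assignments` maps (file, hunk_index) -> cluster id from hunk analysis.
/// Returns file -> winning cluster id, so each file is still committed
/// whole but lands in the group that claims the majority of its hunks.
/// Ties break toward the lower cluster id for determinism.
fn file_level_groups(assignments: &[((String, usize), usize)]) -> HashMap<String, usize> {
    // Count how many hunks each cluster claims per file.
    let mut votes: HashMap<String, HashMap<usize, usize>> = HashMap::new();
    for ((file, _hunk), cluster) in assignments {
        *votes
            .entry(file.clone())
            .or_default()
            .entry(*cluster)
            .or_insert(0) += 1;
    }
    votes
        .into_iter()
        .map(|(file, counts)| {
            let winner = counts
                .into_iter()
                // Max by vote count; on equal counts, the smaller cluster
                // id compares as "greater" so it wins the tie.
                .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)))
                .map(|(cluster, _)| cluster)
                .unwrap();
            (file, winner)
        })
        .collect()
}
```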
v0.5 is the sweet spot for early value: it uses hunk analysis to make better file-level decisions without any staging machinery changes. v0.4's atomic index snapshot is independently valuable: it fixes the existing split flow's rollback problem.
## Open Questions
- **How often do mixed-concern files actually occur?** Is this a 10% or a 60% problem in real workflows? Should we instrument the current splitter to log cases where all files end up in one group despite multiple hunks?
- **Verification default:** tree-sitter parse is fast but only catches syntax errors. Is that sufficient, or should the default be `cargo check` with a `--no-verify-splits` opt-out?
- **Interaction with `--no-split`:** should hunk-level splitting be a separate flag, or fold into the existing split toggle?
- **Conflict with manual staging:** if the user has carefully staged specific hunks via `git add -p`, should CommitBee detect and respect that intent rather than re-splitting?
- **Hybrid pipeline complexity:** is the two-provider config worth the UX cost, or should the hybrid approach only be exposed as presets (e.g., `pipeline = "local-fast"` / `"local-quality"` / `"hybrid-cloud"`)?
## Feedback Welcome
This is an early-stage design discussion. If you have thoughts on:
- Real-world scenarios where this would (or wouldn't) help
- Alternative approaches to hunk grouping
- The hybrid pipeline model: is split classification/generation useful or over-engineered?

Please share them.