feat: implement syntax-aware code chunking with Tree-sitter #434
Conversation
Summary of Changes

Hello @danielfrey63, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly upgrades the code processing capabilities by introducing a syntax-aware code chunking mechanism powered by Tree-sitter. The new approach intelligently splits code documents into semantically coherent units, such as functions and classes, across more than 20 programming languages. It features an intelligent routing system that automatically applies the appropriate splitting strategy (syntax-aware for code, line-based for other text) and includes robust error handling with graceful fallbacks. This enhancement aims to improve the quality of context provided to AI models, leading to more meaningful embeddings and a better developer experience, all while maintaining backward compatibility.
Code Review
This pull request introduces a significant and valuable feature: syntax-aware code chunking using Tree-sitter. The implementation is well-structured, providing support for multiple languages and including robust fallback mechanisms. My review focuses on several opportunities to enhance the code's efficiency, maintainability, and robustness. Key suggestions include optimizing byte-to-line conversions to improve performance, refining exception handling to be more specific, and refactoring duplicated code to improve maintainability. Overall, this is a strong contribution that will greatly improve context quality for code-related tasks.
/gemini review
Code Review
This pull request introduces a significant new feature for syntax-aware code chunking using Tree-sitter, which is a great improvement for context quality in code-related tasks. The implementation is robust, with good fallback mechanisms and configuration options.
My review focuses on improving code clarity, robustness, and maintainability in the new api/code_splitter.py file. I've suggested refactoring the Tree-sitter import logic for better type safety, improving the keyword matching heuristic for more accuracy, and making the exception handling for monkey-patching safer and more informative. These changes will make the new component even more solid.
The integration into the data pipeline and configuration files is well-done. The dependency management in pyproject.toml is also handled correctly with appropriate version pinning.
Overall, this is a well-executed feature addition.
/gemini review
Code Review
This pull request introduces a significant and well-designed feature for syntax-aware code chunking using Tree-sitter. The implementation is robust, with support for multiple languages and a graceful fallback mechanism. The configuration and integration into the existing data pipeline are clean. My review focuses on improving the long-term maintainability and robustness of the code. The main points are to replace the risky practice of monkey-patching the adalflow library with a more explicit serialization strategy, to make exception handling more specific to avoid masking potential bugs, and to address the brittle error handling that relies on string matching in exception messages. These changes will make the new code splitter even more reliable and easier to maintain in the future.
- Extract duplicated chunk metadata logic to a helper method (DRY principle)
- Optimize byte encoding by pre-encoding text once for large files
- Improve error handling with specific exception types
- Enhance node type matching to avoid partial matches
- Add proper logging for adalflow component registration

Fixed Gemini Code Assist proposals.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
93f649c to 78b01ce (Compare)
/gemini review
Code Review
This pull request introduces an excellent feature for syntax-aware code chunking using Tree-sitter. The implementation is robust, with thoughtful error handling, fallback mechanisms, and clear configuration. The integration into the existing data pipeline is also well-executed. I have two main points of feedback. The most critical is a suggestion to improve the splitting logic for large code blocks to avoid falling back to simple line-based splitting, which would make the feature even more powerful. The second is a minor point to ensure metadata consistency across all generated chunks. Overall, this is a very strong and valuable contribution.
```python
def _split_code_text(self, text: str, meta: Dict[str, Any], file_type: str) -> List[Document]:
    parser = self._try_get_parser(file_type)
    if parser is None:
        return self._fallback_line_split(text, meta)

    text_bytes = text.encode("utf-8", errors="replace")
    try:
        tree = parser.parse(text_bytes)
    except Exception:
        return self._fallback_line_split(text, meta)

    root = getattr(tree, "root_node", None)
    if root is None:
        return self._fallback_line_split(text, meta)

    nodes = list(_iter_definition_like_nodes(root))
    if not nodes:
        return self._fallback_line_split(text, meta)

    pieces: List[Tuple[str, int]] = []
    for node in nodes:
        try:
            start_b = int(getattr(node, "start_byte"))
            end_b = int(getattr(node, "end_byte"))
        except (AttributeError, ValueError, TypeError) as e:
            logger.debug("Could not process a tree-sitter node for file type '%s': %s", file_type, e)
            continue
        snippet = _slice_text_by_bytes_preencoded(text_bytes, start_b, end_b)
        start_line = _byte_offset_to_line_preencoded(text_bytes, start_b)
        pieces.append((snippet, start_line))

    if not pieces:
        return self._fallback_line_split(text, meta)

    docs: List[Document] = []
    for snippet, start_line in pieces:
        snippet_lines = snippet.splitlines(True)
        if len(snippet_lines) < self.config.min_chunk_lines:
            continue

        if len(snippet_lines) <= self.config.chunk_size_lines:
            docs.append(self._make_chunk_doc(snippet, meta, start_line))
            continue

        for sub, sub_start_idx in _split_lines_with_overlap(
            snippet_lines,
            chunk_size_lines=self.config.chunk_size_lines,
            chunk_overlap_lines=self.config.chunk_overlap_lines,
        ):
            sub_text = "".join(sub)
            docs.append(self._make_chunk_doc(sub_text, meta, start_line + sub_start_idx))

    if not docs:
        return self._fallback_line_split(text, meta)
    else:
        return self._add_chunk_metadata(docs)
```
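The helpers `_slice_text_by_bytes_preencoded`, `_byte_offset_to_line_preencoded`, and `_split_lines_with_overlap` are referenced above but not shown in this excerpt. A minimal sketch of what they might look like, inferred from their call sites; the bodies are assumptions, not the PR's actual implementation:

```python
from typing import Iterator, List, Tuple


def _slice_text_by_bytes_preencoded(text_bytes: bytes, start_b: int, end_b: int) -> str:
    # Decode only the requested byte range of the pre-encoded source.
    return text_bytes[start_b:end_b].decode("utf-8", errors="replace")


def _byte_offset_to_line_preencoded(text_bytes: bytes, offset: int) -> int:
    # 1-based line number containing the byte offset: count newlines before it.
    return text_bytes.count(b"\n", 0, offset) + 1


def _split_lines_with_overlap(
    lines: List[str],
    chunk_size_lines: int,
    chunk_overlap_lines: int,
) -> Iterator[Tuple[List[str], int]]:
    # Yield (window_of_lines, start_index) pairs; consecutive windows share
    # chunk_overlap_lines lines so context is not lost at the boundaries.
    step = max(1, chunk_size_lines - chunk_overlap_lines)
    start = 0
    while start < len(lines):
        yield lines[start:start + chunk_size_lines], start
        if start + chunk_size_lines >= len(lines):
            break
        start += step
```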
The current implementation for splitting large code blocks has a significant limitation. When a semantic block (like a class) is larger than chunk_size_lines, it falls back to simple line-based splitting (lines 226-232). This can break up methods or other logical units within that block, which undermines the goal of syntax-aware chunking.
For example, a 500-line Java class will be identified as a single semantic block. If chunk_size_lines is 200, this class will be split into line-based chunks, ignoring the method boundaries within it.
A more robust approach would be to recursively split large nodes: if a node is too large, the splitter should attempt to split it along its child definition nodes (e.g., methods within a class) before falling back to line-based splitting. This would keep the chunking syntax-aware at deeper levels of the code structure.
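A minimal sketch of that recursive idea, reusing the names from the diff above; the child-iteration helper and the `max_depth` parameter are hypothetical, not part of the PR as reviewed here:

```python
def _split_node_recursively(self, node, text_bytes: bytes, meta: dict,
                            depth: int = 0, max_depth: int = 5) -> list:
    snippet = _slice_text_by_bytes_preencoded(text_bytes, node.start_byte, node.end_byte)
    snippet_lines = snippet.splitlines(True)
    start_line = _byte_offset_to_line_preencoded(text_bytes, node.start_byte)

    # Small enough: keep the whole semantic block as one chunk.
    if len(snippet_lines) <= self.config.chunk_size_lines:
        return [self._make_chunk_doc(snippet, meta, start_line)]

    # Too large: try the node's own definition-like children first
    # (e.g. methods inside a class). `_iter_child_definition_nodes` is a
    # hypothetical helper mirroring `_iter_definition_like_nodes`.
    children = list(_iter_child_definition_nodes(node)) if depth < max_depth else []
    if not children:
        # No nested definitions left: only now fall back to line-based splitting.
        return self._fallback_line_split(snippet, meta)

    docs = []
    for child in children:
        docs.extend(self._split_node_recursively(child, text_bytes, meta, depth + 1, max_depth))
    return docs
```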
```python
if not docs:
    return [Document(text=text, meta_data=dict(meta))]
else:
    return self._add_chunk_metadata(docs)
```
There's a small inconsistency in metadata handling. When _fallback_line_split is unable to create any chunks (e.g., because the text is shorter than min_chunk_lines), it returns the original document as a single-element list. However, this document lacks the chunk_index and chunk_total metadata that are added to all other chunks via _add_chunk_metadata. To ensure all returned documents have consistent metadata, you should process this single-document case through _add_chunk_metadata as well. This can be simplified by unifying the return path.
```python
if not docs:
    docs = [Document(text=text, meta_data=dict(meta))]
return self._add_chunk_metadata(docs)
```

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…h context preservation

- Implemented _split_node_recursively to break down large semantic blocks.
- Added 'parent shell' extraction to keep class headers and logic between definitions.
- Introduced max_recursion_depth in CodeSplitterConfig to prevent stack overflows.
- Refined AST traversal to correctly recurse into block-like nodes (e.g., JS class bodies).
- Expanded test suite with complex, deeply nested, and scattered member scenarios.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…deepwiki-open into feature/functional-chunker
Looks awesome! Do you mind showing me a quick demo of how this works?

Not at all, let's have a look together. Or I'll deliver a test file to investigate the resulting embeddings, as you wish. Sorry for the failed Docker build. It could have to do with the fact that I am maintaining conda envs, not venvs, locally.
/gemini review
Code Review
This pull request introduces a significant and well-implemented feature for syntax-aware code chunking using Tree-sitter. The implementation is robust, with good fallback mechanisms and comprehensive test coverage for various languages and complex code structures. The integration into the existing data pipeline is clean.
My review includes a couple of suggestions to improve maintainability and consistency. Specifically, I've pointed out an opportunity to simplify dataclass serialization and an inconsistency in how minimum chunk size is handled, which could lead to the creation of tiny, low-context chunks.
Overall, this is a great addition that will significantly improve the quality of code chunking.
```python
# api/code_splitter.py (new file, +375 lines)
from __future__ import annotations

from dataclasses import dataclass
```
```python
if parent_text.strip():  # Only add if there's non-whitespace content
    docs.append(self._make_chunk_doc(parent_text, meta, start_line))
```
The creation of the parent shell chunk does not respect min_chunk_lines. This can result in very small chunks (e.g., 1-2 lines for a class definition), which is inconsistent with the fallback splitter's behavior and the stated goal of preventing tiny fragments.
Applying the min_chunk_lines filter here would enforce consistency. However, please be aware that this might lead to loss of context if a small parent shell (like a class declaration) is discarded. An alternative could be to prepend small parent shells to the first child chunk.
Here is a suggestion to apply the filter for consistency:
```diff
-if parent_text.strip():  # Only add if there's non-whitespace content
-    docs.append(self._make_chunk_doc(parent_text, meta, start_line))
+if parent_text.strip() and len(parent_text.splitlines(True)) >= self.config.min_chunk_lines:
+    docs.append(self._make_chunk_doc(parent_text, meta, start_line))
```
| "code_splitter_config": { | ||
| "chunk_size_lines": self._code_splitter.config.chunk_size_lines, | ||
| "chunk_overlap_lines": self._code_splitter.config.chunk_overlap_lines, | ||
| "min_chunk_lines": self._code_splitter.config.min_chunk_lines, | ||
| "max_recursion_depth": self._code_splitter.config.max_recursion_depth, | ||
| "enabled": self._code_splitter.config.enabled, | ||
| } |
Using asdict (which can be imported from dataclasses) simplifies the serialization of the CodeSplitterConfig dataclass. This makes the code more concise and easier to maintain, as you won't need to manually update this dictionary if the dataclass fields change.
"code_splitter_config": asdict(self._code_splitter.config),
Summary
This PR introduces intelligent code chunking using Tree-sitter parsers to prevent splitting code in the middle of functions or class definitions. The implementation provides syntax-aware splitting for 20+ programming languages while maintaining backward compatibility through fallback mechanisms.
Key Features
- Syntax-Aware Splitting
- Intelligent Routing: uses is_code and type metadata to determine the optimal splitting strategy (a sketch of the idea follows this list)
- Configurable Parameters
- Robust Error Handling
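A minimal sketch of the routing idea, assuming each document carries the is_code and type metadata mentioned above; the splitter objects and method names here are illustrative, not the PR's actual API:

```python
def split_document(doc, code_splitter, text_splitter):
    # Route code files to the Tree-sitter-based splitter and everything
    # else to the existing line/text splitter.
    meta = doc.meta_data or {}
    if code_splitter.config.enabled and meta.get("is_code"):
        return code_splitter.split_text(doc.text, meta, meta.get("type", ""))
    return text_splitter.split(doc)
```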
Files Changed
- Core Implementation
- Integration: new code_splitter configuration section

Dependencies
- tree-sitter = ">=0.21.0,<0.22.0" (compatible with tree-sitter-languages)
- tree-sitter-languages = {version = ">=1.10.0", python = "<3.13"}

Configuration
The code splitter can be configured via api/config/embedder.json:
{ "code_splitter": { "enabled": true, "chunk_size_lines": 200, "chunk_overlap_lines": 20, "min_chunk_lines": 5 } }Technical Details
Technical Details

Tree-sitter Compatibility
- Uses tree-sitter and tree-sitter-languages
- Pins tree-sitter to >=0.21.0,<0.22.0 for compatibility with tree-sitter-languages 1.10.2

Chunking Algorithm
Performance Considerations
Benefits
- Improved Context Quality
- Developer Experience
- Scalability
Migration Notes