Conversation

@danielfrey63
Contributor

Summary

This PR introduces intelligent code chunking using Tree-sitter parsers to prevent splitting code in the middle of functions or class definitions. The implementation provides syntax-aware splitting for 20+ programming languages while maintaining backward compatibility through fallback mechanisms.

Key Features

Syntax-Aware Splitting

  • Tree-sitter Integration: Uses Tree-sitter parsers to understand code structure and split at logical boundaries (functions, classes, methods, etc.)
  • Multi-Language Support: Supports Python, JavaScript/TypeScript, Java, C/C++, C#, Go, Rust, PHP, Ruby, Swift, Kotlin, Scala, Lua, Bash, HTML, CSS, JSON, YAML, TOML, Markdown
  • Smart Node Detection: Identifies definition-like nodes using keyword matching (function, class, method, interface, struct, enum, trait, impl, module, namespace, type)
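
The keyword matching described above can be sketched roughly as follows. This is an illustrative predicate, not the actual code in api/code_splitter.py; the function name and the whole-word matching detail are assumptions:

```python
# Hypothetical sketch of the definition-node heuristic; the real
# predicate in api/code_splitter.py may differ in detail.
DEFINITION_KEYWORDS = (
    "function", "class", "method", "interface", "struct",
    "enum", "trait", "impl", "module", "namespace", "type",
)

def is_definition_like(node_type: str) -> bool:
    """Return True if a Tree-sitter node type looks like a definition.

    Matches whole keywords inside the snake_case node type (e.g.
    "function_definition", "class_declaration") rather than substrings,
    to avoid accidental partial matches.
    """
    parts = node_type.lower().replace("-", "_").split("_")
    return any(keyword in parts for keyword in DEFINITION_KEYWORDS)
```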

Intelligent Routing

  • CodeAwareSplitter: Automatically routes code documents through Tree-sitter chunker and non-code documents through the existing TextSplitter
  • Metadata-Driven: Uses is_code and type metadata to determine optimal splitting strategy
  • Seamless Integration: Works with existing data pipeline without breaking changes
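
The metadata-driven routing could look roughly like this; the return labels and exact metadata checks are assumptions based on the is_code and type keys named above:

```python
# Illustrative sketch of CodeAwareSplitter's routing decision; the
# actual implementation in api/code_splitter.py may differ.
def route_document(doc_meta: dict) -> str:
    """Pick a splitting strategy from document metadata."""
    if doc_meta.get("is_code") or doc_meta.get("type") == "code":
        return "tree_sitter"    # syntax-aware code chunking
    return "text_splitter"      # existing text chunking path
```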

Configurable Parameters

  • Line-based Chunking: Configurable chunk size (default: 200 lines) and overlap (default: 20 lines)
  • Minimum Chunk Size: Prevents tiny fragments (default: 5 lines minimum)
  • Enable/Disable: Can be toggled on/off via configuration

Robust Error Handling

  • Graceful Fallback: Falls back to line-based splitting if Tree-sitter is unavailable or parsing fails
  • Import Safety: Robust module importing with error handling for missing dependencies
  • Parser Availability Check: Automatically detects if Tree-sitter parsers are available for the target language
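
A minimal sketch of the lazy, fault-tolerant parser lookup, assuming the tree-sitter-languages package as pinned in this PR; the function name is illustrative:

```python
# Hypothetical helper illustrating import safety plus parser
# availability checking; not the PR's actual implementation.
def try_get_parser(language: str):
    """Return a Tree-sitter parser for `language`, or None if unavailable."""
    try:
        # Lazy import so a missing optional dependency does not break startup.
        from tree_sitter_languages import get_parser
    except ImportError:
        return None
    try:
        return get_parser(language)
    except Exception:
        # Unknown language or incompatible parser build: signal fallback.
        return None
```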

Files Changed

Core Implementation

  • api/code_splitter.py (new): Complete Tree-sitter based code splitting implementation
    • TreeSitterCodeSplitter: Main chunking engine with syntax awareness
    • CodeAwareSplitter: Intelligent router for code vs. non-code documents
    • Helper functions for parsing, chunking, and metadata handling

Integration

Dependencies

  • api/pyproject.toml: Added Tree-sitter dependencies with version pinning for compatibility
    • tree-sitter = ">=0.21.0,<0.22.0" (compatible with tree-sitter-languages)
    • tree-sitter-languages = {version = ">=1.10.0", python = "<3.13"}

Configuration

The code splitter can be configured via api/config/embedder.json:

{
  "code_splitter": {
    "enabled": true,
    "chunk_size_lines": 200,
    "chunk_overlap_lines": 20,
    "min_chunk_lines": 5
  }
}
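
Reading such a block with defaults applied could be sketched as below; the loader function is an assumption, only the key names and default values follow the PR description:

```python
import json

# Defaults as stated in this PR's description.
DEFAULTS = {
    "enabled": True,
    "chunk_size_lines": 200,
    "chunk_overlap_lines": 20,
    "min_chunk_lines": 5,
}

def load_code_splitter_config(raw_json: str) -> dict:
    """Merge a JSON config document (e.g. embedder.json) over the defaults."""
    data = json.loads(raw_json) if raw_json.strip() else {}
    cfg = dict(DEFAULTS)
    cfg.update(data.get("code_splitter", {}))
    return cfg
```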

Technical Details

Tree-sitter Compatibility

  • Resolved API incompatibility between tree-sitter and tree-sitter-languages
  • Pinned tree-sitter to version >=0.21.0,<0.22.0 for compatibility with tree-sitter-languages 1.10.2
  • Conditional dependency installation for Python < 3.13 (tree-sitter-languages compatibility)
  • An open issue (#75, open since July 2024) requests a new binary wheel for tree-sitter-languages 1.10.2

Chunking Algorithm

  1. Parse: Use Tree-sitter to parse code into AST
  2. Extract: Identify definition-like nodes (functions, classes, etc.)
  3. Split: Create chunks based on node boundaries
  4. Fallback: Use line-based splitting if parsing fails
  5. Metadata: Enrich chunks with line numbers and indexing information
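
The line-based fallback in step 4 can be sketched as a sliding window with overlap. The helper name and exact window semantics here are assumptions; the defaults match this PR's configuration:

```python
# Illustrative sketch of line-based splitting with overlap; the PR's
# actual helper (_split_lines_with_overlap) may differ in detail.
def split_lines_with_overlap(lines, chunk_size_lines=200, chunk_overlap_lines=20):
    """Yield (chunk_lines, start_index) pairs over overlapping windows."""
    step = max(1, chunk_size_lines - chunk_overlap_lines)
    for start in range(0, len(lines), step):
        chunk = lines[start:start + chunk_size_lines]
        if chunk:
            yield chunk, start
        if start + chunk_size_lines >= len(lines):
            break
```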

Performance Considerations

  • Lazy import of Tree-sitter modules to avoid startup overhead
  • Efficient byte-to-line offset conversion
  • Minimal memory footprint through streaming processing
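
The byte-to-line conversion mentioned above can be done efficiently by counting newlines in the pre-encoded byte buffer; this is one plausible implementation, and the PR's actual helper (_byte_offset_to_line_preencoded) may differ:

```python
# Sketch of byte-offset-to-line conversion over a pre-encoded buffer.
def byte_offset_to_line(text_bytes: bytes, offset: int) -> int:
    """Return the 1-based line number containing the given byte offset."""
    return text_bytes.count(b"\n", 0, offset) + 1
```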

Benefits

Improved Context Quality

  • Code chunks maintain semantic coherence
  • Better context for AI models and embeddings
  • Reduced fragmentation of logical code units

Developer Experience

  • More meaningful search results
  • Better code comprehension in AI responses
  • Preserved function/class boundaries in documentation

Scalability

  • Handles large codebases efficiently
  • Configurable chunk sizes for different use cases
  • Graceful degradation for unsupported languages

Migration Notes

  • Backward Compatible: Existing functionality remains unchanged
  • Configuration Optional: Code splitter works with sensible defaults
  • Gradual Rollout: Can be enabled/disabled per environment
  • No Breaking Changes: Existing API contracts preserved

@gemini-code-assist
Contributor

Summary of Changes

Hello @danielfrey63, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the code processing capabilities by introducing a syntax-aware code chunking mechanism powered by Tree-sitter. This new approach intelligently splits code documents into semantically coherent units, such as functions and classes, across over 20 programming languages. It features an intelligent routing system that automatically applies the appropriate splitting strategy (syntax-aware for code, line-based for other text) and includes robust error handling with graceful fallbacks. This enhancement aims to improve the quality of context provided to AI models, leading to more meaningful embeddings and better developer experience, all while maintaining backward compatibility.

Highlights

  • Syntax-Aware Code Chunking: Introduces intelligent code chunking using Tree-sitter parsers to prevent splitting code in the middle of functions or class definitions, supporting over 20 programming languages.
  • Intelligent Document Routing: A new CodeAwareSplitter automatically routes code documents through the Tree-sitter chunker and non-code documents through the existing TextSplitter based on metadata.
  • Configurable Parameters: The code splitter is highly configurable, allowing adjustments for chunk size, overlap, and minimum chunk size, and can be enabled or disabled via configuration.
  • Robust Error Handling and Fallback: Includes graceful fallback to line-based splitting if Tree-sitter is unavailable or parsing fails, ensuring the data pipeline remains robust.
  • Improved Context Quality: By maintaining semantic coherence in code chunks, this change provides better context for AI models and embeddings, leading to more meaningful search results and AI responses.
  • Backward Compatibility: The implementation is designed to be backward compatible, preserving existing API contracts and functionality without breaking changes.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and valuable feature: syntax-aware code chunking using Tree-sitter. The implementation is well-structured, providing support for multiple languages and including robust fallback mechanisms. My review focuses on several opportunities to enhance the code's efficiency, maintainability, and robustness. Key suggestions include optimizing byte-to-line conversions to improve performance, refining exception handling to be more specific, and refactoring duplicated code to improve maintainability. Overall, this is a strong contribution that will greatly improve context quality for code-related tasks.

@danielfrey63
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature for syntax-aware code chunking using Tree-sitter, which is a great improvement for context quality in code-related tasks. The implementation is robust, with good fallback mechanisms and configuration options.

My review focuses on improving code clarity, robustness, and maintainability in the new api/code_splitter.py file. I've suggested refactoring the Tree-sitter import logic for better type safety, improving the keyword matching heuristic for more accuracy, and making the exception handling for monkey-patching safer and more informative. These changes will make the new component even more solid.

The integration into the data pipeline and configuration files is well-done. The dependency management in pyproject.toml is also handled correctly with appropriate version pinning.

Overall, this is a well-executed feature addition.

@danielfrey63
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-designed feature for syntax-aware code chunking using Tree-sitter. The implementation is robust, with support for multiple languages and a graceful fallback mechanism. The configuration and integration into the existing data pipeline are clean. My review focuses on improving the long-term maintainability and robustness of the code. The main points are to replace the risky practice of monkey-patching the adalflow library with a more explicit serialization strategy, to make exception handling more specific to avoid masking potential bugs, and to address the brittle error handling that relies on string matching in exception messages. These changes will make the new code splitter even more reliable and easier to maintain in the future.

- Extract duplicated chunk metadata logic to helper method (DRY principle)
- Optimize byte encoding by pre-encoding text once for large files
- Improve error handling with specific exception types
- Enhance node type matching to avoid partial matches
- Add proper logging for adalflow component registration

Fixed Gemini Code Assist proposals

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@danielfrey63 force-pushed the feature/functional-chunker branch from 93f649c to 78b01ce on January 1, 2026 at 22:54
@danielfrey63
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an excellent feature for syntax-aware code chunking using Tree-sitter. The implementation is robust, with thoughtful error handling, fallback mechanisms, and clear configuration. The integration into the existing data pipeline is also well-executed. I have two main points of feedback. The most critical is a suggestion to improve the splitting logic for large code blocks to avoid falling back to simple line-based splitting, which would make the feature even more powerful. The second is a minor point to ensure metadata consistency across all generated chunks. Overall, this is a very strong and valuable contribution.

Comment on lines 182 to 237
def _split_code_text(self, text: str, meta: Dict[str, Any], file_type: str) -> List[Document]:
    parser = self._try_get_parser(file_type)
    if parser is None:
        return self._fallback_line_split(text, meta)

    text_bytes = text.encode("utf-8", errors="replace")
    try:
        tree = parser.parse(text_bytes)
    except Exception:
        return self._fallback_line_split(text, meta)

    root = getattr(tree, "root_node", None)
    if root is None:
        return self._fallback_line_split(text, meta)

    nodes = list(_iter_definition_like_nodes(root))
    if not nodes:
        return self._fallback_line_split(text, meta)

    pieces: List[Tuple[str, int]] = []
    for node in nodes:
        try:
            start_b = int(getattr(node, "start_byte"))
            end_b = int(getattr(node, "end_byte"))
        except (AttributeError, ValueError, TypeError) as e:
            logger.debug("Could not process a tree-sitter node for file type '%s': %s", file_type, e)
            continue
        snippet = _slice_text_by_bytes_preencoded(text_bytes, start_b, end_b)
        start_line = _byte_offset_to_line_preencoded(text_bytes, start_b)
        pieces.append((snippet, start_line))

    if not pieces:
        return self._fallback_line_split(text, meta)

    docs: List[Document] = []
    for snippet, start_line in pieces:
        snippet_lines = snippet.splitlines(True)
        if len(snippet_lines) < self.config.min_chunk_lines:
            continue

        if len(snippet_lines) <= self.config.chunk_size_lines:
            docs.append(self._make_chunk_doc(snippet, meta, start_line))
            continue

        for sub, sub_start_idx in _split_lines_with_overlap(
            snippet_lines,
            chunk_size_lines=self.config.chunk_size_lines,
            chunk_overlap_lines=self.config.chunk_overlap_lines,
        ):
            sub_text = "".join(sub)
            docs.append(self._make_chunk_doc(sub_text, meta, start_line + sub_start_idx))

    if not docs:
        return self._fallback_line_split(text, meta)
    else:
        return self._add_chunk_metadata(docs)

high

The current implementation for splitting large code blocks has a significant limitation. When a semantic block (like a class) is larger than chunk_size_lines, it falls back to simple line-based splitting (lines 226-232). This can break up methods or other logical units within that block, which undermines the goal of syntax-aware chunking.

For example, a 500-line Java class will be identified as a single semantic block. If chunk_size_lines is 200, this class will be split into line-based chunks, ignoring the method boundaries within it.

A more robust approach would be to recursively split large nodes. If a node is too large, the splitter should attempt to split it based on its children definition nodes (e.g., methods within a class) before falling back to line-based splitting. This would ensure that the chunking remains syntax-aware at deeper levels of the code structure.
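
One possible shape for that recursive strategy, using the Tree-sitter Python node attributes (type, children, start_point, end_point); the helper names and keyword list here are illustrative, not the PR's actual code:

```python
# Sketch of recursive splitting of oversized definition nodes; a
# hypothetical design following the review suggestion above.
DEFINITION_KEYWORDS = ("function", "class", "method", "interface", "struct",
                       "enum", "trait", "impl", "module", "namespace", "type")

def _is_definition_like(node_type: str) -> bool:
    # Whole-word keyword match on the snake_case node type.
    parts = node_type.lower().replace("-", "_").split("_")
    return any(k in parts for k in DEFINITION_KEYWORDS)

def split_node_recursively(node, max_lines: int, depth: int = 0, max_depth: int = 5):
    """Yield definition nodes that fit within max_lines, recursing into
    child definitions (e.g. methods inside an oversized class)."""
    n_lines = node.end_point[0] - node.start_point[0] + 1
    if n_lines <= max_lines or depth >= max_depth:
        yield node
        return
    children = [c for c in node.children if _is_definition_like(c.type)]
    if not children:
        # No child definitions: the caller falls back to line-based splitting.
        yield node
        return
    for child in children:
        yield from split_node_recursively(child, max_lines, depth + 1, max_depth)
```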

Comment on lines 259 to 262
if not docs:
    return [Document(text=text, meta_data=dict(meta))]
else:
    return self._add_chunk_metadata(docs)

medium

There's a small inconsistency in metadata handling. When _fallback_line_split is unable to create any chunks (e.g., because the text is shorter than min_chunk_lines), it returns the original document as a single-element list. However, this document lacks the chunk_index and chunk_total metadata that are added to all other chunks via _add_chunk_metadata. To ensure all returned documents have consistent metadata, you should process this single-document case through _add_chunk_metadata as well. This can be simplified by unifying the return path.

        if not docs:
            docs = [Document(text=text, meta_data=dict(meta))]
        return self._add_chunk_metadata(docs)

danielfrey63 and others added 7 commits on January 3, 2026 at 14:23
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…h context preservation

- Implemented _split_node_recursively to break down large semantic blocks.
- Added 'parent shell' extraction to keep class headers and logic between definitions.
- Introduced max_recursion_depth in CodeSplitterConfig to prevent stack overflows.
- Refined AST traversal to correctly recurse into block-like nodes (e.g., JS class bodies).
- Expanded test suite with complex, deeply nested, and scattered member scenarios.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…deepwiki-open into feature/functional-chunker
@sng-asyncfunc
Contributor

Looks awesome! Would you mind showing me a quick demo of how this works?

@danielfrey63
Contributor Author

Looks awesome! Would you mind showing me a quick demo of how this works?

Not at all, let's have a look together. Or I'll deliver a test file to investigate the resulting embeddings, as you wish.

Sorry for the failed Docker build. It could have to do with the fact that I maintain conda envs, not venvs, locally.

@danielfrey63
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-implemented feature for syntax-aware code chunking using Tree-sitter. The implementation is robust, with good fallback mechanisms and comprehensive test coverage for various languages and complex code structures. The integration into the existing data pipeline is clean.

My review includes a couple of suggestions to improve maintainability and consistency. Specifically, I've pointed out an opportunity to simplify dataclass serialization and an inconsistency in how minimum chunk size is handled, which could lead to the creation of tiny, low-context chunks.

Overall, this is a great addition that will significantly improve the quality of code chunking.

@@ -0,0 +1,375 @@
from __future__ import annotations

from dataclasses import dataclass

medium

To improve maintainability, you can also import asdict here. It can be used later to simplify the serialization of the CodeSplitterConfig dataclass into a dictionary.

Suggested change
from dataclasses import dataclass
from dataclasses import dataclass, asdict

Comment on lines +270 to +271
if parent_text.strip():  # Only add if there's non-whitespace content
    docs.append(self._make_chunk_doc(parent_text, meta, start_line))

medium

The creation of the parent shell chunk does not respect min_chunk_lines. This can result in very small chunks (e.g., 1-2 lines for a class definition), which is inconsistent with the fallback splitter's behavior and the stated goal of preventing tiny fragments.

Applying the min_chunk_lines filter here would enforce consistency. However, please be aware that this might lead to loss of context if a small parent shell (like a class declaration) is discarded. An alternative could be to prepend small parent shells to the first child chunk.

Here is a suggestion to apply the filter for consistency:

Suggested change
if parent_text.strip():  # Only add if there's non-whitespace content
    docs.append(self._make_chunk_doc(parent_text, meta, start_line))
if parent_text.strip() and len(parent_text.splitlines(True)) >= self.config.min_chunk_lines:
    docs.append(self._make_chunk_doc(parent_text, meta, start_line))

Comment on lines +360 to +366
"code_splitter_config": {
    "chunk_size_lines": self._code_splitter.config.chunk_size_lines,
    "chunk_overlap_lines": self._code_splitter.config.chunk_overlap_lines,
    "min_chunk_lines": self._code_splitter.config.min_chunk_lines,
    "max_recursion_depth": self._code_splitter.config.max_recursion_depth,
    "enabled": self._code_splitter.config.enabled,
}

medium

Using asdict (which can be imported from dataclasses) simplifies the serialization of the CodeSplitterConfig dataclass. This makes the code more concise and easier to maintain, as you won't need to manually update this dictionary if the dataclass fields change.

            "code_splitter_config": asdict(self._code_splitter.config),
