feat: implement syntax-aware code chunking with Tree-sitter #434
Conversation
Summary of Changes

Hello @danielfrey63, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly upgrades the code processing capabilities by introducing a syntax-aware code chunking mechanism powered by Tree-sitter. The new approach intelligently splits code documents into semantically coherent units, such as functions and classes, across more than 20 programming languages. It features an intelligent routing system that automatically applies the appropriate splitting strategy (syntax-aware for code, line-based for other text) and includes robust error handling with graceful fallbacks. This enhancement aims to improve the quality of context provided to AI models, leading to more meaningful embeddings and a better developer experience, all while maintaining backward compatibility.
Code Review
This pull request introduces a significant and valuable feature: syntax-aware code chunking using Tree-sitter. The implementation is well-structured, providing support for multiple languages and including robust fallback mechanisms. My review focuses on several opportunities to enhance the code's efficiency, maintainability, and robustness. Key suggestions include optimizing byte-to-line conversions to improve performance, refining exception handling to be more specific, and refactoring duplicated code to improve maintainability. Overall, this is a strong contribution that will greatly improve context quality for code-related tasks.
/gemini review
Code Review
This pull request introduces a significant new feature for syntax-aware code chunking using Tree-sitter, which is a great improvement for context quality in code-related tasks. The implementation is robust, with good fallback mechanisms and configuration options.
My review focuses on improving code clarity, robustness, and maintainability in the new api/code_splitter.py file. I've suggested refactoring the Tree-sitter import logic for better type safety, improving the keyword matching heuristic for more accuracy, and making the exception handling for monkey-patching safer and more informative. These changes will make the new component even more solid.
The integration into the data pipeline and configuration files is well-done. The dependency management in pyproject.toml is also handled correctly with appropriate version pinning.
Overall, this is a well-executed feature addition.
/gemini review
Code Review
This pull request introduces a significant and well-designed feature for syntax-aware code chunking using Tree-sitter. The implementation is robust, with support for multiple languages and a graceful fallback mechanism. The configuration and integration into the existing data pipeline are clean. My review focuses on improving the long-term maintainability and robustness of the code. The main points are to replace the risky practice of monkey-patching the adalflow library with a more explicit serialization strategy, to make exception handling more specific to avoid masking potential bugs, and to address the brittle error handling that relies on string matching in exception messages. These changes will make the new code splitter even more reliable and easier to maintain in the future.
- Extract duplicated chunk metadata logic to a helper method (DRY principle)
- Optimize byte encoding by pre-encoding text once for large files
- Improve error handling with specific exception types
- Enhance node type matching to avoid partial matches
- Add proper logging for adalflow component registration

Fixed Gemini Code Assist proposals.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
93f649c to 78b01ce (Compare)
/gemini review
Code Review
This pull request introduces an excellent feature for syntax-aware code chunking using Tree-sitter. The implementation is robust, with thoughtful error handling, fallback mechanisms, and clear configuration. The integration into the existing data pipeline is also well-executed. I have two main points of feedback. The most critical is a suggestion to improve the splitting logic for large code blocks to avoid falling back to simple line-based splitting, which would make the feature even more powerful. The second is a minor point to ensure metadata consistency across all generated chunks. Overall, this is a very strong and valuable contribution.
```python
def _split_code_text(self, text: str, meta: Dict[str, Any], file_type: str) -> List[Document]:
    parser = self._try_get_parser(file_type)
    if parser is None:
        return self._fallback_line_split(text, meta)

    text_bytes = text.encode("utf-8", errors="replace")
    try:
        tree = parser.parse(text_bytes)
    except Exception:
        return self._fallback_line_split(text, meta)

    root = getattr(tree, "root_node", None)
    if root is None:
        return self._fallback_line_split(text, meta)

    nodes = list(_iter_definition_like_nodes(root))
    if not nodes:
        return self._fallback_line_split(text, meta)

    pieces: List[Tuple[str, int]] = []
    for node in nodes:
        try:
            start_b = int(getattr(node, "start_byte"))
            end_b = int(getattr(node, "end_byte"))
        except (AttributeError, ValueError, TypeError) as e:
            logger.debug("Could not process a tree-sitter node for file type '%s': %s", file_type, e)
            continue
        snippet = _slice_text_by_bytes_preencoded(text_bytes, start_b, end_b)
        start_line = _byte_offset_to_line_preencoded(text_bytes, start_b)
        pieces.append((snippet, start_line))

    if not pieces:
        return self._fallback_line_split(text, meta)

    docs: List[Document] = []
    for snippet, start_line in pieces:
        snippet_lines = snippet.splitlines(True)
        if len(snippet_lines) < self.config.min_chunk_lines:
            continue

        if len(snippet_lines) <= self.config.chunk_size_lines:
            docs.append(self._make_chunk_doc(snippet, meta, start_line))
            continue

        for sub, sub_start_idx in _split_lines_with_overlap(
            snippet_lines,
            chunk_size_lines=self.config.chunk_size_lines,
            chunk_overlap_lines=self.config.chunk_overlap_lines,
        ):
            sub_text = "".join(sub)
            docs.append(self._make_chunk_doc(sub_text, meta, start_line + sub_start_idx))

    if not docs:
        return self._fallback_line_split(text, meta)
    else:
        return self._add_chunk_metadata(docs)
```
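The helpers `_slice_text_by_bytes_preencoded`, `_byte_offset_to_line_preencoded`, and `_split_lines_with_overlap` are referenced above but not shown in this excerpt. A minimal sketch of what they might look like, inferred from their call sites; the bodies are assumptions, not the PR's actual implementation:

```python
from typing import Iterator, List, Tuple


def _slice_text_by_bytes_preencoded(text_bytes: bytes, start_b: int, end_b: int) -> str:
    # Decode only the requested byte range of the pre-encoded source.
    return text_bytes[start_b:end_b].decode("utf-8", errors="replace")


def _byte_offset_to_line_preencoded(text_bytes: bytes, offset: int) -> int:
    # 1-based line number containing the byte offset: count newlines before it.
    return text_bytes.count(b"\n", 0, offset) + 1


def _split_lines_with_overlap(
    lines: List[str],
    chunk_size_lines: int,
    chunk_overlap_lines: int,
) -> Iterator[Tuple[List[str], int]]:
    # Yield (window_of_lines, start_index) pairs; consecutive windows share
    # chunk_overlap_lines lines so context is not lost at the boundaries.
    step = max(1, chunk_size_lines - chunk_overlap_lines)
    start = 0
    while start < len(lines):
        yield lines[start:start + chunk_size_lines], start
        if start + chunk_size_lines >= len(lines):
            break
        start += step
```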
The current implementation for splitting large code blocks has a significant limitation. When a semantic block (like a class) is larger than chunk_size_lines, it falls back to simple line-based splitting (lines 226-232). This can break up methods or other logical units within that block, which undermines the goal of syntax-aware chunking.
For example, a 500-line Java class will be identified as a single semantic block. If chunk_size_lines is 200, this class will be split into line-based chunks, ignoring the method boundaries within it.
A more robust approach would be to recursively split large nodes: if a node is too large, the splitter should attempt to split it along its child definition nodes (e.g., methods within a class) before falling back to line-based splitting. This would keep the chunking syntax-aware at deeper levels of the code structure.
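A minimal sketch of that recursive idea, reusing the names from the diff above; the child-iteration helper and the `max_depth` parameter are hypothetical, not part of the PR as reviewed here:

```python
def _split_node_recursively(self, node, text_bytes: bytes, meta: dict,
                            depth: int = 0, max_depth: int = 5) -> list:
    snippet = _slice_text_by_bytes_preencoded(text_bytes, node.start_byte, node.end_byte)
    snippet_lines = snippet.splitlines(True)
    start_line = _byte_offset_to_line_preencoded(text_bytes, node.start_byte)

    # Small enough: keep the whole semantic block as one chunk.
    if len(snippet_lines) <= self.config.chunk_size_lines:
        return [self._make_chunk_doc(snippet, meta, start_line)]

    # Too large: try the node's own definition-like children first
    # (e.g. methods inside a class). `_iter_child_definition_nodes` is a
    # hypothetical helper mirroring `_iter_definition_like_nodes`.
    children = list(_iter_child_definition_nodes(node)) if depth < max_depth else []
    if not children:
        # No nested definitions left: only now fall back to line-based splitting.
        return self._fallback_line_split(snippet, meta)

    docs = []
    for child in children:
        docs.extend(self._split_node_recursively(child, text_bytes, meta, depth + 1, max_depth))
    return docs
```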
```python
if not docs:
    return [Document(text=text, meta_data=dict(meta))]
else:
    return self._add_chunk_metadata(docs)
```
There's a small inconsistency in metadata handling. When _fallback_line_split is unable to create any chunks (e.g., because the text is shorter than min_chunk_lines), it returns the original document as a single-element list. However, this document lacks the chunk_index and chunk_total metadata that are added to all other chunks via _add_chunk_metadata. To ensure all returned documents have consistent metadata, you should process this single-document case through _add_chunk_metadata as well. This can be simplified by unifying the return path.
```python
if not docs:
    docs = [Document(text=text, meta_data=dict(meta))]
return self._add_chunk_metadata(docs)
```

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…h context preservation

- Implemented _split_node_recursively to break down large semantic blocks.
- Added 'parent shell' extraction to keep class headers and logic between definitions.
- Introduced max_recursion_depth in CodeSplitterConfig to prevent stack overflows.
- Refined AST traversal to correctly recurse into block-like nodes (e.g., JS class bodies).
- Expanded test suite with complex, deeply nested, and scattered member scenarios.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…deepwiki-open into feature/functional-chunker
Looks awesome! Do you mind showing me a quick demo of how this works?

Not at all, let's have a look together. Or I'll deliver a test file to investigate the resulting embeddings, as you wish. Sorry for the failed Docker build. It could have to do with the fact that I am maintaining conda envs, not venvs, locally.
/gemini review
Code Review
This pull request introduces a significant and well-implemented feature for syntax-aware code chunking using Tree-sitter. The implementation is robust, with good fallback mechanisms and comprehensive test coverage for various languages and complex code structures. The integration into the existing data pipeline is clean.
My review includes a couple of suggestions to improve maintainability and consistency. Specifically, I've pointed out an opportunity to simplify dataclass serialization and an inconsistency in how minimum chunk size is handled, which could lead to the creation of tiny, low-context chunks.
Overall, this is a great addition that will significantly improve the quality of code chunking.
```python
# api/code_splitter.py (new file, +375 lines)
from __future__ import annotations

from dataclasses import dataclass
```
```python
if parent_text.strip():  # Only add if there's non-whitespace content
    docs.append(self._make_chunk_doc(parent_text, meta, start_line))
```
The creation of the parent shell chunk does not respect min_chunk_lines. This can result in very small chunks (e.g., 1-2 lines for a class definition), which is inconsistent with the fallback splitter's behavior and the stated goal of preventing tiny fragments.
Applying the min_chunk_lines filter here would enforce consistency. However, please be aware that this might lead to loss of context if a small parent shell (like a class declaration) is discarded. An alternative could be to prepend small parent shells to the first child chunk.
Here is a suggestion to apply the filter for consistency:
```diff
-if parent_text.strip():  # Only add if there's non-whitespace content
-    docs.append(self._make_chunk_doc(parent_text, meta, start_line))
+if parent_text.strip() and len(parent_text.splitlines(True)) >= self.config.min_chunk_lines:
+    docs.append(self._make_chunk_doc(parent_text, meta, start_line))
```
| "code_splitter_config": { | ||
| "chunk_size_lines": self._code_splitter.config.chunk_size_lines, | ||
| "chunk_overlap_lines": self._code_splitter.config.chunk_overlap_lines, | ||
| "min_chunk_lines": self._code_splitter.config.min_chunk_lines, | ||
| "max_recursion_depth": self._code_splitter.config.max_recursion_depth, | ||
| "enabled": self._code_splitter.config.enabled, | ||
| } |
Using asdict (which can be imported from dataclasses) simplifies the serialization of the CodeSplitterConfig dataclass. This makes the code more concise and easier to maintain, as you won't need to manually update this dictionary if the dataclass fields change.
"code_splitter_config": asdict(self._code_splitter.config),
Summary
This PR introduces intelligent code chunking using Tree-sitter parsers to prevent splitting code in the middle of functions or class definitions. The implementation provides syntax-aware splitting for 20+ programming languages while maintaining backward compatibility through fallback mechanisms.
Key Features
- Syntax-Aware Splitting
- Intelligent Routing: uses is_code and type metadata to determine the optimal splitting strategy (a sketch of the idea follows this list)
- Configurable Parameters
- Robust Error Handling
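A minimal sketch of the routing idea, assuming each document carries the is_code and type metadata mentioned above; the splitter objects and method names here are illustrative, not the PR's actual API:

```python
def split_document(doc, code_splitter, text_splitter):
    # Route code files to the Tree-sitter-based splitter and everything
    # else to the existing line/text splitter.
    meta = doc.meta_data or {}
    if code_splitter.config.enabled and meta.get("is_code"):
        return code_splitter.split_text(doc.text, meta, meta.get("type", ""))
    return text_splitter.split(doc)
```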
Files Changed
- Core Implementation
- Integration: new code_splitter configuration section

Dependencies
- tree-sitter = ">=0.21.0,<0.22.0" (compatible with tree-sitter-languages)
- tree-sitter-languages = {version = ">=1.10.0", python = "<3.13"}

Configuration
The code splitter can be configured via api/config/embedder.json:
{ "code_splitter": { "enabled": true, "chunk_size_lines": 200, "chunk_overlap_lines": 20, "min_chunk_lines": 5 } }Technical Details
Technical Details

Tree-sitter Compatibility
- Uses tree-sitter and tree-sitter-languages
- Pins tree-sitter to >=0.21.0,<0.22.0 for compatibility with tree-sitter-languages 1.10.2

Chunking Algorithm
Performance Considerations
Benefits
- Improved Context Quality
- Developer Experience
- Scalability
Migration Notes