-
Notifications
You must be signed in to change notification settings - Fork 102
feat: add code chunking functionality #398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add code chunking functionality #398
Conversation
|
✅ DCO Check Passed Thanks @bridgetmcg, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updatesWonderful, this rule succeeded.When test data is updated, we require two reviewers
|
5c6fc1e to
32b120d
Compare
vagenas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Let me share some first thoughts, mostly on how these capabilities can be packaged and exposed:
I see the PR contains various language-specific chunkers, e.g. Java, Python etc.
My recommendation would be:
- to encapsulate these capabilities under a single component, which would include the language detection inside it, and,
- to ensure composability with existing chunkers, instead of introducing a new chunker, I'd rather provide this as a capability pluggable into the
HierarchicalChunker(good fit because (1) it follows Item boundaries, so CodeItems can be nicely delegated, and (2) is itself composable into theHybridChunker). Let us still define the interface specifics, but it could look like an optional kwarg inHierarchicalChunker.chunk(), e.g.code_chunking_strategy, adhering to a matching interface.
A more minor comment is regarding CodeDocMeta, which I see inherits from BaseMeta. Some application code may expect to interact with DocMeta (also a BaseMeta child), so if possible we should best extend that.
This point goes hand in hand with the interface specifics TBD above.
(For now I would focus on the present PR & the points above — the idea of introducing a code backend as per the docling repo PR can be discussed at a second step.)
|
hi @vagenas, I added some logic to address your suggestions. |
vagenas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @bridgetmcg, many thanks for the new iteration, incorporating feedback from above!
I only have a couple last points from my side (@dolfim-ibm I don't know if you want to add anything):
- I see the actual integration meanwhile occurs via
HierarchicalChunker->DefaultCodeChunkingStrategy->CodeChunkingStrategyFactorywhich returns aCodeChunker(subclass ofBaseChunker). Since the chunker primitive should be runnable on any document, it could be confusing to expose a component that operates only on e.g. Python, as a "chunker". Ideally one could just satisfy the newly introduced interface/protocol (and notBaseChunkerwhich may require additional points), but perhaps the fastest way to address this point is just to mark theCodeChunkerclass hierarchy "internal" by prepending with_. Then it's clear all these classes are implementation internals, and users only need to care about the strategy they can optionally pass toHierarchicalChunker. - the defined
CodeChunkingStrategy(Protocol)does not seem to be used anywhere — shouldn't this be somewhere in the typing of fieldcode_chunking_strategywithinHierarchicalChunker? - to allow for extensibility we support the notion of customizable serializers; while this perhaps makes little sense in
CodeItems, for consistency, inHierarchicalChunkerwe should still best use the result fromdoc_serializer(instead of justitem.text)
Hope that makes sense. Otherwise we can also have a quick call to clarify & finalize.
| meta: CodeDocMeta | ||
|
|
||
|
|
||
| class ChunkType(str, Enum): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to avoid users thinking this is a generic ChunkType, we could rename the class to CodeChunkType.
|
Looking at the discussion and the proposed |
What about the obvious way to expand this, i.e. adding an optional |
I was just looking up what we do in the serializers, there we are indeed using this approach. So let's go with it. |
|
@vagenas I believe I addressed your comments! Let me know if not. Many thanks! |
vagenas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bridgetmcg I made some in-line comments incl. code suggestions.
Please also install the pre-commit hooks, so all checks are verified locally before pushing.
(E.g. I think some tests are still not up-to-date. FYI to generate the data, set env var DOCLING_GEN_TEST_DATA=1, e.g. DOCLING_GEN_TEST_DATA=1 uv run pytest)
|
@bridgetmcg sounds good, now we still need to address the conflicts on uv.lock (not possible manually), which means:
Can you take care of these? Otherwise let me know and I could try to look into it. |
68bbf44 to
b417cae
Compare
|
@vagenas I had to restrict the tree-sitter versioning due to python compatibility. treesitter > 0.24 requires python 3.10+ and 0.23 requires all the treesitter language libraries to be <0.24 as well. |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
|
||
| def _strip_markdown_code_formatting(self, text: str) -> str: | ||
| """Strip markdown code block formatting from text.""" | ||
| if not text.startswith("```") or not text.endswith("```"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These backticks look like a bug, perhaps to best be addressed on a different level.
| CODE_BLOCK = "code_block" | ||
|
|
||
|
|
||
| class CodeChunkingStrategy(Protocol): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can make a BaseCodeChunkingStrategy(ABC) out of this, with abstract method
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 334811a Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: Bridget <bridget.mcginn@ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: Bridget <bridget.mcginn@ibm.com>
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 46bb88a I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 10e9ed8 I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: d9827c7 I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 814dc61 Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: a4a21e9 I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 0266c63 I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 336dd6a I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 68890e9 I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 3c65eef Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
607347f to
b9dec1b
Compare
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
docling_core/types/doc/labels.py
Outdated
| } | ||
| return mapping.get(self, None) | ||
|
|
||
| def is_supported_for_chunking(self) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this not used currently?
Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
|
@bridgetmcg can you fix the DCO? |
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 63c7739 I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 431d357 I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: f3175c2 I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 1a01de8 I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 025aea3 Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
- encapsulated code chunking specifics to separate package - clearly separated public vs internal API via module and method naming conventions - simplified or removed parts not stricly necessary for public API (e.g. lang support querying, noopstrategy) - split chunk data model to separate modules to prevent circular dependencies - renamed DefaultCodeChunkingStrategy to Standard... for clarity as it need not be the default strategy - fixed some issues (e.g. gen flag in test) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @bridgetmcg, I did some refactoring & made couple improvements — details in 641ea9c commit message (also removed the previous change of doc_items type in CodeDocMeta).
PeterStaar-IBM
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wonderful! 🎖️
This PR introduces code chunking functionality to
docling-core, enabling intelligent parsing and chunking of source code files across multiple programming languages. The implementation leverages tree-sitter for accurate parsing and provides language-specific chunkers for Python, TypeScript, JavaScript, Java, and C.Features
Core
Language Support
Testing