feat: add code chunking functionality #398

bridgetmcg · 2025-10-03T13:24:52Z

This PR introduces code chunking functionality to docling-core, enabling intelligent parsing and chunking of source code files across multiple programming languages. The implementation leverages tree-sitter for accurate parsing and provides language-specific chunkers for Python, TypeScript, JavaScript, Java, and C.

Features

Core

- CodeChunker - Base abstract class for code chunking with Tree-sitter integration
- Language-specific chunkers - Specialized implementations for 5 major programming languages
- Smart chunk splitting - Automatic splitting of large functions while preserving context

Language Support

- Python (PythonFunctionChunker) - Functions, classes, imports, module variables
- TypeScript (TypeScriptFunctionChunker) - Functions, classes, interfaces, imports
- JavaScript (JavaScriptFunctionChunker) - Inherits from TypeScript chunker
- Java (JavaFunctionChunker) - Methods, constructors, classes, enums, interfaces
- C (CFunctionChunker) - Functions, structs, macros, preprocessor definitions

Testing

test_code_chunker.py - multi-language, real code samples

github-actions · 2025-10-03T13:25:03Z

✅ DCO Check Passed

Thanks @bridgetmcg, all your commits are properly signed off. 🎉

mergify · 2025-10-03T13:25:29Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

dosubot · 2025-10-03T13:26:32Z

Related Documentation

Checked 3 published document(s) in 1 knowledge base(s). No updates required.


vagenas

Nice! Let me share some first thoughts, mostly on how these capabilities can be packaged and exposed:

I see the PR contains various language-specific chunkers, e.g. Java, Python etc.
My recommendation would be:

- to encapsulate these capabilities under a single component, which would include the language detection inside it, and,
- to ensure composability with existing chunkers, instead of introducing a new chunker, I'd rather provide this as a capability pluggable into the HierarchicalChunker (good fit because (1) it follows Item boundaries, so CodeItems can be nicely delegated, and (2) is itself composable into the HybridChunker). Let us still define the interface specifics, but it could look like an optional kwarg in HierarchicalChunker.chunk(), e.g. code_chunking_strategy, adhering to a matching interface.

A more minor comment is regarding CodeDocMeta, which I see inherits from BaseMeta. Some application code may expect to interact with DocMeta (also a BaseMeta child), so if possible we should best extend that.
This point goes hand in hand with the interface specifics TBD above.

(For now I would focus on the present PR & the points above — the idea of introducing a code backend as per the docling repo PR can be discussed at a second step.)
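
To make the second suggestion concrete, here is a minimal sketch of a strategy passed in through an optional keyword argument. It does not use docling-core: `Item`, `SimpleHierarchicalChunker`, `LineSplitStrategy`, and the `chunk_code` method are hypothetical stand-ins, and only the `code_chunking_strategy` name comes from the comment above.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Protocol


@dataclass
class Item:
    """Hypothetical stand-in for a document item (not a docling-core type)."""

    text: str
    is_code: bool = False


class CodeChunkingStrategy(Protocol):
    """Assumed interface: turn the text of one code item into chunks."""

    def chunk_code(self, text: str) -> Iterator[str]: ...


class LineSplitStrategy:
    """Toy strategy: one chunk per blank-line-separated block."""

    def chunk_code(self, text: str) -> Iterator[str]:
        for block in text.split("\n\n"):
            if block.strip():
                yield block.strip()


class SimpleHierarchicalChunker:
    """Follows item boundaries; delegates code items when a strategy is given."""

    def chunk(
        self,
        items: List[Item],
        code_chunking_strategy: Optional[CodeChunkingStrategy] = None,
    ) -> Iterator[str]:
        for item in items:
            if item.is_code and code_chunking_strategy is not None:
                # Code items are handed to the pluggable strategy.
                yield from code_chunking_strategy.chunk_code(item.text)
            else:
                # Everything else stays one chunk per item.
                yield item.text


items = [
    Item("Intro paragraph."),
    Item("def a():\n    pass\n\ndef b():\n    pass", is_code=True),
]
chunks = list(
    SimpleHierarchicalChunker().chunk(items, code_chunking_strategy=LineSplitStrategy())
)
```

With no strategy passed, the same call yields one chunk per item, so existing callers would see no change in behaviour under this design.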

bridgetmcg · 2025-10-10T16:38:42Z

hi @vagenas, I added some logic to address your suggestions.

vagenas

Hi @bridgetmcg, many thanks for the new iteration, incorporating feedback from above!

I only have a couple last points from my side (@dolfim-ibm I don't know if you want to add anything):

- I see the actual integration meanwhile occurs via HierarchicalChunker -> DefaultCodeChunkingStrategy -> CodeChunkingStrategyFactory which returns a CodeChunker (subclass of BaseChunker). Since the chunker primitive should be runnable on any document, it could be confusing to expose a component that operates only on e.g. Python, as a "chunker". Ideally one could just satisfy the newly introduced interface/protocol (and not BaseChunker which may require additional points), but perhaps the fastest way to address this point is just to mark the CodeChunker class hierarchy "internal" by prepending with _. Then it's clear all these classes are implementation internals, and users only need to care about the strategy they can optionally pass to HierarchicalChunker.
- the defined CodeChunkingStrategy(Protocol) does not seem to be used anywhere; shouldn't this be somewhere in the typing of field code_chunking_strategy within HierarchicalChunker?
- to allow for extensibility we support the notion of customizable serializers; while this perhaps makes little sense in CodeItems, for consistency, in HierarchicalChunker we should still best use the result from doc_serializer (instead of just item.text)

Hope that makes sense. Otherwise we can also have a quick call to clarify & finalize.
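
A small sketch of the first two points, assuming nothing about docling-core's real classes: the protocol is used as the type of the strategy field, and the concrete implementation carries a leading underscore to mark it as internal. `ChunkerConfig`, `_PythonChunker`, and `chunk_code` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Protocol, runtime_checkable


@runtime_checkable
class CodeChunkingStrategy(Protocol):
    """Structural interface: any object with a matching chunk_code() qualifies."""

    def chunk_code(self, text: str) -> Iterator[str]: ...


class _PythonChunker:
    """Leading underscore marks this as an implementation detail.

    It does not inherit from the protocol; it only provides the method.
    """

    def chunk_code(self, text: str) -> Iterator[str]:
        yield text


@dataclass
class ChunkerConfig:
    """Hypothetical holder; the field is typed with the protocol."""

    code_chunking_strategy: Optional[CodeChunkingStrategy] = None


config = ChunkerConfig(code_chunking_strategy=_PythonChunker())
# runtime_checkable protocols support isinstance(), checking method presence only.
satisfies = isinstance(config.code_chunking_strategy, CodeChunkingStrategy)
```

Users would then only need to know the protocol and the field, not the underscore-prefixed classes behind it.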

dolfim-ibm · 2025-10-22T09:38:37Z

docling_core/transforms/chunker/hierarchical_chunker.py

+    meta: CodeDocMeta
+
+
+class ChunkType(str, Enum):


to avoid users thinking this is a generic ChunkType, we could rename the class to CodeChunkType.
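
The rename could look roughly like this. The `FUNCTION` member is purely illustrative, and placing `CODE_BLOCK` in this enum is an assumption based on a later diff excerpt in this thread.

```python
from enum import Enum


class CodeChunkType(str, Enum):
    """Code-specific chunk kinds; the name signals it is not a generic chunk type."""

    FUNCTION = "function"      # illustrative member, not taken from the PR
    CODE_BLOCK = "code_block"  # value as it appears in a diff excerpt in this thread


# Because the enum mixes in str, members compare equal to their string values.
kind = CodeChunkType("code_block")
```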

dolfim-ibm · 2025-10-22T09:54:26Z

Looking at the discussion and the proposed code_chunking_strategy, I think we want to consider how this will scale to other types. For example, multiple chunking strategies for tables are a recurrent topic.

vagenas · 2025-10-22T11:20:59Z

> Looking at the discussion and the proposed code_chunking_strategy, I think we want to consider how this will scale to other types. For example, multiple chunking strategies for tables are a recurrent topic.

What about the obvious way to expand this, i.e. adding an optional table_chunking_strategy to the HierarchicalChunker?

dolfim-ibm · 2025-10-22T11:32:56Z

> > Looking at the discussion and the proposed code_chunking_strategy, I think we want to consider how this will scale to other types. For example, multiple chunking strategies for tables are a recurrent topic.
>
> What about the obvious way to expand this, i.e. adding an optional table_chunking_strategy to the HierarchicalChunker?

I was just looking up what we do in the serializers, there we are indeed using this approach. So let's go with it.

bridgetmcg · 2025-10-22T20:42:22Z

@vagenas I believe I addressed your comments! Let me know if not. Many thanks!

vagenas

@bridgetmcg I made some in-line comments incl. code suggestions.
Please also install the pre-commit hooks, so all checks are verified locally before pushing.
(E.g. I think some tests are still not up-to-date. FYI to generate the data, set env var DOCLING_GEN_TEST_DATA=1, e.g. DOCLING_GEN_TEST_DATA=1 uv run pytest)
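
For readers unfamiliar with this kind of flag, the sketch below shows one common way a "generate test data" switch is wired into tests. How docling-core actually reads `DOCLING_GEN_TEST_DATA` is not shown in this thread, so `verify_or_generate` and the accepted values are assumptions.

```python
import os
import tempfile
from pathlib import Path

# Assumption: the flag is read from the environment and "1" (or "true") enables it.
GEN_TEST_DATA = os.getenv("DOCLING_GEN_TEST_DATA", "").lower() in ("1", "true")


def verify_or_generate(
    actual: str, expected_path: Path, generate: bool = GEN_TEST_DATA
) -> None:
    """Hypothetical helper: write the reference file in generate mode, else compare."""
    if generate:
        expected_path.write_text(actual, encoding="utf-8")
    else:
        assert actual == expected_path.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "expected.txt"
    verify_or_generate("chunk output", ref, generate=True)   # writes the reference
    verify_or_generate("chunk output", ref, generate=False)  # passes: contents match
    reference_text = ref.read_text(encoding="utf-8")
```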

pyproject.toml

docling_core/transforms/chunker/hierarchical_chunker.py

bridgetmcg · 2025-10-24T01:33:26Z

@vagenas I ran all the pre-commit checks which showed that now with the language identification we can correctly label some code snippets from other tests. Those were updated in 814dc61

vagenas · 2025-10-24T06:33:57Z

@bridgetmcg sounds good, now we still need to address the conflicts on uv.lock (not possible manually), which means:

1. branch needs to get up-to-date with latest main
2. uv.lock needs to be regenerated

(if 1. is done by rebasing, a force-push would be needed)

Can you take care of these? Otherwise let me know and I could try to look into it.

bridgetmcg · 2025-10-24T18:09:03Z

@vagenas I had to restrict the tree-sitter version range for Python compatibility: tree-sitter > 0.24 requires Python 3.10+, and 0.23 requires all the tree-sitter language libraries to be < 0.24 as well.

codecov · 2025-10-27T08:35:10Z

Codecov Report

❌ Patch coverage is 85.46410% with 166 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...s/chunker/code_chunking/_language_code_chunkers.py | 83.73% | 157 Missing ⚠️ |
| ...ng_core/transforms/chunker/code_chunking/_utils.py | 91.93% | 5 Missing ⚠️ |
| ...r/code_chunking/standard_code_chunking_strategy.py | 90.00% | 3 Missing ⚠️ |
| docling_core/transforms/chunker/doc_chunk.py | 97.14% | 1 Missing ⚠️ |

📢 Thoughts on this report? Let us know!

vagenas · 2025-10-30T13:48:54Z

docling_core/transforms/chunker/hierarchical_chunker.py

+
+    def _strip_markdown_code_formatting(self, text: str) -> str:
+        """Strip markdown code block formatting from text."""
+        if not text.startswith("```") or not text.endswith("```"):


These backticks look like a bug, perhaps to best be addressed on a different level.
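
The diff excerpt shows only the guard clause of the helper. For context, a standalone sketch of what such a function might do is given below; the body after the guard is an assumption, not the PR's code, and the fence is built as `"`" * 3` only to keep this example readable.

```python
def strip_markdown_code_formatting(text: str) -> str:
    """Return the body of a fenced block, or the text unchanged if it is not fenced."""
    fence = "`" * 3
    if not text.startswith(fence) or not text.endswith(fence):
        return text
    lines = text.split("\n")
    if len(lines) < 2:
        return text  # a single line cannot hold both an opening and a closing fence line
    # Drop the opening fence line (which may carry a language tag) and the closing line.
    return "\n".join(lines[1:-1])
```

Stripping fences inside the chunker treats the symptom; the review comment suggests the fences should not reach this point in the first place.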

vagenas · 2025-10-30T13:52:55Z

docling_core/transforms/chunker/hierarchical_chunker.py

+    CODE_BLOCK = "code_block"
+
+
+class CodeChunkingStrategy(Protocol):


Can make a BaseCodeChunkingStrategy(ABC) out of this, with abstract method
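
A minimal sketch of the ABC variant follows. Only the class name `BaseCodeChunkingStrategy` comes from the comment; the `chunk_code` method and `WholeItemStrategy` are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class BaseCodeChunkingStrategy(ABC):
    """Nominal base class: strategies must subclass it and implement chunk_code()."""

    @abstractmethod
    def chunk_code(self, text: str) -> Iterator[str]:
        """Yield chunks for one code item; the method name is an assumption."""


class WholeItemStrategy(BaseCodeChunkingStrategy):
    """Toy subclass: returns the whole item as a single chunk."""

    def chunk_code(self, text: str) -> Iterator[str]:
        yield text


try:
    BaseCodeChunkingStrategy()  # abstract classes cannot be instantiated
    instantiable = True
except TypeError:
    instantiable = False
```

Compared with a Protocol, an ABC makes the contract explicit through inheritance and fails early when a subclass forgets the abstract method.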

I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 334811a Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: Bridget <bridget.mcginn@ibm.com>

I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 46bb88a
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 10e9ed8
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: d9827c7
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 814dc61
Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: a4a21e9
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 0266c63
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 336dd6a
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 68890e9
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 3c65eef
Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

vagenas · 2025-11-07T16:02:51Z

docling_core/types/doc/labels.py

        }
-        return mapping.get(self, None)
+
+    def is_supported_for_chunking(self) -> bool:


is this not used currently?

docling_core/types/doc/labels.py

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

vagenas · 2025-11-10T16:04:03Z

@bridgetmcg can you fix the DCO?

I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 63c7739
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 431d357
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: f3175c2
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 1a01de8
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 025aea3
Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

- encapsulated code chunking specifics to separate package
- clearly separated public vs internal API via module and method naming conventions
- simplified or removed parts not strictly necessary for public API (e.g. lang support querying, noopstrategy)
- split chunk data model to separate modules to prevent circular dependencies
- renamed DefaultCodeChunkingStrategy to Standard... for clarity as it need not be the default strategy
- fixed some issues (e.g. gen flag in test)

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

vagenas

Hi @bridgetmcg, I did some refactoring and made a couple of improvements; details are in the 641ea9c commit message (I also removed the previous change of the doc_items type in CodeDocMeta).

PeterStaar-IBM

wonderful! 🎖️

bridgetmcg changed the title ~~Add Code Chunking Functionality~~ feat: add code chunking functionality Oct 3, 2025

bridgetmcg force-pushed the feat/code-chunking branch from 5c6fc1e to 32b120d Compare October 3, 2025 13:51

bridgetmcg mentioned this pull request Oct 3, 2025

feat: code chunking backend for docling docling-project/docling#2378

Open

3 tasks

vagenas reviewed Oct 7, 2025

vagenas reviewed Oct 22, 2025

dolfim-ibm reviewed Oct 22, 2025

vagenas reviewed Oct 23, 2025

pyproject.toml (outdated)

pyproject.toml (outdated)

docling_core/transforms/chunker/hierarchical_chunker.py (outdated)

bridgetmcg requested a review from vagenas October 24, 2025 13:17

bridgetmcg force-pushed the feat/code-chunking branch 2 times, most recently from 68bbf44 to b417cae Compare October 24, 2025 18:03

vagenas reviewed Oct 30, 2025

bridgetmcg and others added 6 commits October 30, 2025 10:51

initial code chunking for docling-core

63c7739

DCO Remediation Commit for Bridget McGinn <bridget.mcginn@ibm.com>

23be660

I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 334811a Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

include language detections, add code chunking into hierarchical chunker

431d357

add serializer, internal marking of chunkers, typing

f3175c2

Update pyproject.toml

f41474d

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: Bridget <bridget.mcginn@ibm.com>

Update docling_core/transforms/chunker/hierarchical_chunker.py

850c5cc

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: Bridget <bridget.mcginn@ibm.com>

bridgetmcg added 6 commits October 30, 2025 10:51

run all pre-commit less pytest

1a01de8

update test files for code ID

025aea3

update uv.lock

c801a2f

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

revert to stricter treesitter versioning due to compatibility

c2ba93c

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

bridgetmcg force-pushed the feat/code-chunking branch from 607347f to b9dec1b Compare October 30, 2025 14:55

vagenas and others added 3 commits October 30, 2025 16:25

remove language detection (to be run by client, i.e. docling)

455654d

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

align new dependency specs

8a2e61f

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

address backticks, ABC, and supported languages feedback

11340ab

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

vagenas reviewed Nov 7, 2025

docling_core/types/doc/labels.py Show resolved Hide resolved

remove Language class and reuse CodeLanguageLabel

fc64987

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

bridgetmcg and others added 2 commits November 10, 2025 13:26

vagenas approved these changes Nov 12, 2025

vagenas requested a review from PeterStaar-IBM November 12, 2025 08:48

PeterStaar-IBM approved these changes Nov 12, 2025

PeterStaar-IBM merged commit 3097645 into docling-project:main Nov 12, 2025
12 checks passed
