@bridgetmcg
Contributor

This PR introduces code chunking functionality to docling-core, enabling intelligent parsing and chunking of source code files across multiple programming languages. The implementation leverages tree-sitter for accurate parsing and provides language-specific chunkers for Python, TypeScript, JavaScript, Java, and C.

Features

Core

  • CodeChunker - Base abstract class for code chunking with Tree-sitter integration
  • Language-specific chunkers - Specialized implementations for 5 major programming languages
  • Smart chunk splitting - Automatic splitting of large functions while preserving context

Language Support

  • Python (PythonFunctionChunker) - Functions, classes, imports, module variables
  • TypeScript (TypeScriptFunctionChunker) - Functions, classes, interfaces, imports
  • JavaScript (JavaScriptFunctionChunker) - Inherits from TypeScript chunker
  • Java (JavaFunctionChunker) - Methods, constructors, classes, enums, interfaces
  • C (CFunctionChunker) - Functions, structs, macros, preprocessor definitions
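
For illustration only, here is a minimal sketch of what function-level chunking means for Python source. The PR itself parses with tree-sitter; this sketch swaps in the standard-library `ast` module so that it runs without extra dependencies, and the name `chunk_python_functions` is hypothetical rather than part of docling-core.

```python
import ast


def chunk_python_functions(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in the source."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            chunks.append({"kind": kind, "name": node.name, "text": text})
    return chunks


SAMPLE = '''import os

def greet(name):
    return f"hello {name}"

class Greeter:
    def run(self):
        return greet("world")
'''

for chunk in chunk_python_functions(SAMPLE):
    print(chunk["kind"], chunk["name"])
```

Imports and module-level variables are ignored in this sketch; the chunkers in the PR are described as handling those as well.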

Testing

  • test_code_chunker.py - multi-language, real code samples

@github-actions
Contributor

github-actions bot commented Oct 3, 2025

DCO Check Passed

Thanks @bridgetmcg, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Oct 3, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@dosubot

dosubot bot commented Oct 3, 2025

Related Documentation

Checked 3 published document(s) in 1 knowledge base(s). No updates required.


@bridgetmcg bridgetmcg changed the title from "Add Code Chunking Functionality" to "feat: add code chunking functionality" Oct 3, 2025
Collaborator

@vagenas vagenas left a comment

Nice! Let me share some first thoughts, mostly on how these capabilities can be packaged and exposed:

I see the PR contains various language-specific chunkers, e.g. Java, Python etc.
My recommendation would be:

  • to encapsulate these capabilities under a single component, which would include the language detection inside it, and,
  • to ensure composability with existing chunkers: instead of introducing a new chunker, I'd rather provide this as a capability pluggable into the HierarchicalChunker. It is a good fit because (1) it follows Item boundaries, so CodeItems can be nicely delegated, and (2) it is itself composable into the HybridChunker. We still need to define the interface specifics, but it could look like an optional kwarg in HierarchicalChunker.chunk(), e.g. code_chunking_strategy, adhering to a matching interface.

A more minor comment concerns CodeDocMeta, which I see inherits from BaseMeta. Some application code may expect to interact with DocMeta (also a BaseMeta child), so if possible we should extend that instead.
This point goes hand in hand with the interface specifics TBD above.

(For now I would focus on the present PR & the points above — the idea of introducing a code backend as per the docling repo PR can be discussed at a second step.)
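
To make the optional-kwarg idea concrete, here is a rough sketch under stated assumptions. `Item`, `chunk`, and `split_on_blank_lines` are hypothetical stand-ins, not the docling-core types; only the kwarg name `code_chunking_strategy` comes from the suggestion above.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a document item (not the docling-core type)."""

    kind: str  # "text" or "code"
    text: str


# Assumed strategy shape for this sketch: code text in, chunk texts out.
CodeStrategy = Callable[[str], Iterator[str]]


def split_on_blank_lines(code: str) -> Iterator[str]:
    """Toy strategy: one chunk per blank-line-separated block."""
    for block in code.split("\n\n"):
        if block.strip():
            yield block.strip()


def chunk(
    items: list[Item],
    code_chunking_strategy: Optional[CodeStrategy] = None,
) -> Iterator[str]:
    """Follow item boundaries; delegate code items only when a strategy is given."""
    for item in items:
        if item.kind == "code" and code_chunking_strategy is not None:
            yield from code_chunking_strategy(item.text)
        else:
            yield item.text


items = [Item("text", "Intro"), Item("code", "a = 1\n\nb = 2")]
print(list(chunk(items)))
print(list(chunk(items, code_chunking_strategy=split_on_blank_lines)))
```

Without a strategy the behaviour is unchanged, which is what keeps the capability opt-in for existing callers.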

@bridgetmcg
Contributor Author

bridgetmcg commented Oct 10, 2025

hi @vagenas, I added some logic to address your suggestions.

Collaborator

@vagenas vagenas left a comment

Hi @bridgetmcg, many thanks for the new iteration, incorporating feedback from above!

I only have a couple of last points from my side (@dolfim-ibm I don't know if you want to add anything):

  1. I see the actual integration now occurs via HierarchicalChunker -> DefaultCodeChunkingStrategy -> CodeChunkingStrategyFactory, which returns a CodeChunker (a subclass of BaseChunker). Since the chunker primitive should be runnable on any document, it could be confusing to expose a component that operates only on e.g. Python as a "chunker". Ideally one could just satisfy the newly introduced interface/protocol (and not BaseChunker, which may require additional points), but perhaps the fastest way to address this is to mark the CodeChunker class hierarchy as "internal" by prefixing the names with _. Then it's clear that all these classes are implementation internals, and users only need to care about the strategy they can optionally pass to HierarchicalChunker.
  2. The defined CodeChunkingStrategy(Protocol) does not seem to be used anywhere. Shouldn't it appear in the typing of the field code_chunking_strategy within HierarchicalChunker?
  3. To allow for extensibility we support the notion of customizable serializers. While this perhaps makes little sense for CodeItems, for consistency HierarchicalChunker should still use the result from doc_serializer (instead of just item.text).

Hope that makes sense. Otherwise we can also have a quick call to clarify & finalize.

meta: CodeDocMeta


class ChunkType(str, Enum):
Contributor

to avoid users thinking this is a generic ChunkType, we could rename the class to CodeChunkType.

@dolfim-ibm
Contributor

Looking at the discussion and the proposed code_chunking_strategy, I think we want to consider how this will scale to other types. For example, multiple chunking strategies for tables are a recurrent topic.

@vagenas
Collaborator

vagenas commented Oct 22, 2025

> Looking at the discussion and the proposed code_chunking_strategy, I think we want to consider how this will scale to other types. For example, multiple chunking strategies for tables are a recurrent topic.

What about the obvious way to expand this, i.e. adding an optional table_chunking_strategy to the HierarchicalChunker?

@dolfim-ibm
Contributor

> > Looking at the discussion and the proposed code_chunking_strategy, I think we want to consider how this will scale to other types. For example, multiple chunking strategies for tables are a recurrent topic.

> What about the obvious way to expand this, i.e. adding an optional table_chunking_strategy to the HierarchicalChunker?

I was just looking up what we do in the serializers, there we are indeed using this approach. So let's go with it.

@bridgetmcg
Contributor Author

@vagenas I believe I addressed your comments! Let me know if not. Many thanks!

Collaborator

@vagenas vagenas left a comment

@bridgetmcg I made some in-line comments incl. code suggestions.
Please also install the pre-commit hooks, so all checks are verified locally before pushing.
(E.g. I think some tests are still not up-to-date. FYI to generate the data, set env var DOCLING_GEN_TEST_DATA=1, e.g. DOCLING_GEN_TEST_DATA=1 uv run pytest)

@bridgetmcg
Contributor Author

@vagenas I ran all the pre-commit checks which showed that now with the language identification we can correctly label some code snippets from other tests. Those were updated in 814dc61

@vagenas
Collaborator

vagenas commented Oct 24, 2025

@bridgetmcg sounds good, now we still need to address the conflicts on uv.lock (not possible manually), which means:

  1. branch needs to get up-to-date with latest main
  2. uv.lock needs to be regenerated
  3. (if 1. is done by rebasing, a force-push would be needed)

Can you take care of these? Otherwise let me know and I could try to look into it.

@bridgetmcg bridgetmcg requested a review from vagenas October 24, 2025 13:17
@bridgetmcg bridgetmcg force-pushed the feat/code-chunking branch 2 times, most recently from 68bbf44 to b417cae October 24, 2025 18:03
@bridgetmcg
Contributor Author

@vagenas I had to restrict the tree-sitter version range due to Python compatibility: tree-sitter > 0.24 requires Python 3.10+, and 0.23 requires all the tree-sitter language libraries to be < 0.24 as well.

@codecov

codecov bot commented Oct 27, 2025


def _strip_markdown_code_formatting(self, text: str) -> str:
    """Strip markdown code block formatting from text."""
    if not text.startswith("```") or not text.endswith("```"):
Collaborator

These backticks look like a bug, perhaps best addressed at a different level.
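
For reference, a standalone sketch of what stripping a Markdown fence from code text could look like. This is not the PR's implementation: `strip_markdown_code_fence` is a hypothetical free function, and the fence marker is built from single backticks purely so that the example contains no literal fence of its own.

```python
FENCE = "`" * 3  # built this way to avoid a literal fence inside the example


def strip_markdown_code_fence(text: str) -> str:
    """Return the body of a fenced block, or the text unchanged if unfenced."""
    stripped = text.strip()
    fenced = stripped.startswith(FENCE) and stripped.endswith(FENCE)
    if len(stripped) < 2 * len(FENCE) or not fenced:
        return text
    lines = stripped.splitlines()
    if len(lines) == 1:
        # Single-line form: fence, code, fence on one line.
        return stripped[len(FENCE) : -len(FENCE)].strip()
    # The first line is the opening fence plus an optional language tag.
    body = lines[1:-1]
    last = lines[-1].strip()
    if last != FENCE:
        # The closing fence shares a line with code; keep the code part.
        body.append(last[: -len(FENCE)].rstrip())
    return "\n".join(body)


print(strip_markdown_code_fence(FENCE + "python\nx = 1\n" + FENCE))
```

If the fences come from a serializer upstream, fixing that source would make a helper like this unnecessary, which seems to be the point of the comment above.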

CODE_BLOCK = "code_block"


class CodeChunkingStrategy(Protocol):
Collaborator

Can make a BaseCodeChunkingStrategy(ABC) out of this, with abstract method
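
A minimal sketch of the ABC variant, assuming a single abstract method. `BaseCodeChunkingStrategy` is the name proposed above; the method name `chunk_code_item`, its signature, and the `WholeItemStrategy` subclass are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class BaseCodeChunkingStrategy(ABC):
    """Nominal base class: subclasses must implement the abstract method."""

    @abstractmethod
    def chunk_code_item(self, code_text: str) -> Iterator[str]:
        """Yield chunk texts for one code item (signature is an assumption)."""


class WholeItemStrategy(BaseCodeChunkingStrategy):
    """Toy strategy that returns the code item as a single chunk."""

    def chunk_code_item(self, code_text: str) -> Iterator[str]:
        yield code_text


print(list(WholeItemStrategy().chunk_code_item("x = 1")))
```

Unlike a structural protocol, the ABC refuses instantiation of any subclass that omits the method, so a missing implementation fails early with a `TypeError`.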

bridgetmcg and others added 6 commits October 30, 2025 10:51
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 334811a

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Bridget <bridget.mcginn@ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Bridget <bridget.mcginn@ibm.com>
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 46bb88a
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 10e9ed8
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: d9827c7
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 814dc61

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: a4a21e9
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 0266c63
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 336dd6a
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 68890e9
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 3c65eef

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
vagenas and others added 3 commits October 30, 2025 16:25
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
}
return mapping.get(self, None)

def is_supported_for_chunking(self) -> bool:
Collaborator

is this not used currently?

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
@vagenas
Collaborator

vagenas commented Nov 10, 2025

@bridgetmcg can you fix the DCO?

bridgetmcg and others added 2 commits November 10, 2025 13:26
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 63c7739
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 431d357
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: f3175c2
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 1a01de8
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 025aea3

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
- encapsulated code chunking specifics to separate package
- clearly separated public vs internal API via module and method naming conventions
- simplified or removed parts not strictly necessary for public API (e.g. lang support querying, noopstrategy)
- split chunk data model to separate modules to prevent circular dependencies
- renamed DefaultCodeChunkingStrategy to Standard... for clarity as it need not be the default strategy
- fixed some issues (e.g. gen flag in test)

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Collaborator

@vagenas vagenas left a comment

Hi @bridgetmcg, I did some refactoring and made a couple of improvements; details are in the 641ea9c commit message (I also removed the previous change of doc_items type in CodeDocMeta).

Contributor

@PeterStaar-IBM PeterStaar-IBM left a comment

wonderful! 🎖️

@PeterStaar-IBM PeterStaar-IBM merged commit 3097645 into docling-project:main Nov 12, 2025
12 checks passed