It seems that the default separator list of RecursiveCharacterTextSplitter should include sentence-splitting characters (".", "!", "?"). Otherwise the documentation is misleading: it says "This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible", but the splitter currently makes no attempt to keep sentences together. https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter
Replies: 1 comment
You're absolutely right that the documentation could make this clearer. The default `RecursiveCharacterTextSplitter` does not include sentence-level separators like `"."`, `"!"`, or `"?"` in its `separators` list. Instead, it prioritizes structural boundaries (paragraphs, newlines, spaces) for general-purpose use cases.

This design choice was intentional: the default splitter aims to work well across non-natural-language content (e.g., code, markdown, data tables), where punctuation-based splitting might cause unwanted fragmentation.

If you want sentence-aware behavior (to actually "keep sentences together" as the docs suggest), you can explicitly override the `separators` list, for example:

```python
# Recent releases ship the splitters in the langchain-text-splitters package;
# older releases exposed the same class as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    # Tried in order: paragraphs, lines, sentence punctuation, words, characters.
    separators=["\n\n", "\n", ".", "!", "?", " ", ""],
    chunk_size=1000,
    chunk_overlap=100,
)
```

This tries sentence boundaries before falling back to finer splits. That said, your observation about the documentation wording is valid: it could better reflect that sentence-level separators are not part of the default configuration. A small docs clarification, or a PR adding these as optional defaults, might help align expectations.
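To see why the order of the separator list matters, here is a minimal pure-Python sketch of recursive splitting. It is a simplified illustration, not the library's implementation: `recursive_split` is a hypothetical helper, it ignores chunk overlap, and it keeps each separator at the end of the preceding piece, which may differ from how the library places separators.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified sketch: split on the first separator found in the text,
    recurse into oversized pieces with the remaining separators, and
    greedily merge small pieces up to chunk_size characters."""
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []

    # Pick the first separator that occurs in the text ("" always matches).
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        return [text.strip()]  # nothing left to split on
    remaining = separators[i + 1:]

    if sep == "":
        pieces = list(text)  # last resort: individual characters
    else:
        parts = text.split(sep)
        # Assumption of this sketch: the separator stays with the piece before it.
        pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Flush what we have, then split the oversized piece more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, remaining, chunk_size))
        elif len(current) + len(piece) <= chunk_size:
            current += piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


text = ("The cat sat on the mat. The dog barked at the moon! "
        "Did anyone hear it? Nobody did.")

# Separator lists as described in this thread.
default_chunks = recursive_split(text, ["\n\n", "\n", " ", ""], 40)
sentence_chunks = recursive_split(
    text, ["\n\n", "\n", ".", "!", "?", " ", ""], 40
)

print(default_chunks)
print(sentence_chunks)
```

With the paragraph-and-space list, this sketch's first chunk ends at "The dog barked", in the middle of a sentence, because the only separator present is a space. With the sentence-aware list, every chunk ends at sentence punctuation.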