Add Row Text Splitter node for line-based document chunking #6138

Open
Dexterity104 wants to merge 2 commits into FlowiseAI:main from Dexterity104:feature/row-text-splitter-node
Conversation

@Dexterity104

Summary

Add a new Row Text Splitter node under Text Splitters in packages/components.
It's meant for line-based content (CSV-style data, table exports, logs) where each row should become its own document.

Closes #6112

What's included

  • New RowTextSplitter node wired into the existing TextSplitter ecosystem
  • Config options:
    • Line Separator (defaults to \n, supports escaped values like \r\n, \t)
    • Trim Whitespace toggle
    • Include Empty Lines toggle
  • Implementation extends LangChain’s TextSplitter, so it works with:
    • splitText
    • splitDocuments
    • createDocuments
  • Keeps incoming metadata and adds line-based loc info
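
The separator and metadata handling listed above could look roughly like the following. This is a minimal sketch, not the PR's code: `unescapeSeparator`, `rowsToDocuments`, and `RowDocument` are hypothetical names, the sketch does not extend LangChain's TextSplitter so that it stays self-contained, and the `loc.lines.from`/`loc.lines.to` shape is an assumption about what "line-based loc info" means here.

```typescript
// Hypothetical document shape, standing in for LangChain's Document type.
interface RowDocument {
    pageContent: string
    metadata: Record<string, unknown>
}

// Convert escaped text typed into a UI field (backslash + letter)
// into the real control character. Only \r, \n and \t are handled.
function unescapeSeparator(value: string): string {
    return value.replace(/\\r/g, '\r').replace(/\\n/g, '\n').replace(/\\t/g, '\t')
}

// Split text into one document per non-empty row, copying the incoming
// metadata and recording the 1-based position of the row in the input.
function rowsToDocuments(
    text: string,
    metadata: Record<string, unknown> = {},
    separatorInput = '\\n'
): RowDocument[] {
    // Fall back to a newline if the configured separator is empty.
    const separator = unescapeSeparator(separatorInput) || '\n'
    const docs: RowDocument[] = []

    text.split(separator).forEach((raw, index) => {
        const row = raw.trim() // trim() also removes a trailing \r from CRLF input
        if (row.length === 0) return // skip empty rows (assumed default)
        docs.push({
            pageContent: row,
            metadata: { ...metadata, loc: { lines: { from: index + 1, to: index + 1 } } }
        })
    })
    return docs
}

const docs = rowsToDocuments('a\r\n\r\nb', { source: 'x.csv' })
console.log(docs.map((d) => d.pageContent))
```

Row numbers here count skipped empty rows too, so `loc` still points back at the original position in the input; whether the PR numbers rows this way is not stated.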

Behavior

For input:

id,name,age
1,Alice,30
2,Bob,25
3,Charlie,40

With defaults (separator \n, trim on, empty lines off), the splitter produces four chunks:

  1. id,name,age
  2. 1,Alice,30
  3. 2,Bob,25
  4. 3,Charlie,40

Each line is a separate document, aligned exactly with the original rows.
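
The behavior described above can be reproduced with a small standalone function. This is a sketch under stated assumptions: `splitRows` and the option names `trimWhitespace` and `includeEmptyLines` are illustrative stand-ins for the node's toggles, and the separator is passed as an already-unescaped string.

```typescript
// Hypothetical option names mirroring the node's three settings.
interface RowSplitOptions {
    lineSeparator?: string
    trimWhitespace?: boolean
    includeEmptyLines?: boolean
}

function splitRows(text: string, options: RowSplitOptions = {}): string[] {
    const separator = options.lineSeparator || '\n'
    const trim = options.trimWhitespace ?? true
    const includeEmpty = options.includeEmptyLines ?? false
    const rows: string[] = []

    for (let raw of text.split(separator)) {
        // With a plain newline separator, drop the \r left over from CRLF input.
        if (separator === '\n') raw = raw.replace(/\r$/, '')
        const row = trim ? raw.trim() : raw
        if (row.length === 0 && !includeEmpty) continue
        rows.push(row)
    }
    return rows
}

const csv = 'id,name,age\n1,Alice,30\n2,Bob,25\n3,Charlie,40'
console.log(splitRows(csv))
// → [ 'id,name,age', '1,Alice,30', '2,Bob,25', '3,Charlie,40' ]
```

With `includeEmptyLines: true`, a trailing newline in the input would yield a fifth, empty row, which is one reason the toggle is off by default in this sketch.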

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the RowTextSplitter component, which enables splitting text into individual rows based on a configurable separator. The implementation includes options for trimming whitespace and filtering empty lines, as well as a node wrapper for integration. A review comment suggests optimizing the line-splitting logic by using a regular expression to handle both LF and CRLF line endings more efficiently during the split operation, rather than performing replacements within a loop.

Comment on lines +34 to +40
const rawLines = text.split(this.lineSeparator)
const lines: string[] = []

for (let raw of rawLines) {
    if (this.lineSeparator === '\n') {
        raw = raw.replace(/\r$/, '')
    }

medium

The current implementation of line splitting and carriage return handling can be optimized. Performing a regex replacement inside a loop for every line is less efficient than handling it during the split operation, especially for large documents like logs or CSVs. Using a regex in the split method when the separator is a newline is a more performant and idiomatic approach to handle both LF and CRLF line endings.

Suggested change
- const rawLines = text.split(this.lineSeparator)
- const lines: string[] = []
- for (let raw of rawLines) {
-     if (this.lineSeparator === '\n') {
-         raw = raw.replace(/\r$/, '')
-     }
+ const rawLines = this.lineSeparator === '\n' ? text.split(/\r?\n/) : text.split(this.lineSeparator)
+ const lines: string[] = []
+ for (const raw of rawLines) {
References
  1. Prioritize code readability and understandability over conciseness. A series of simple, chained operations can be preferable to a single, more complex one (e.g., a complex regex with a replacer function) if it improves understandability and reduces the potential for future errors.
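
To compare the two approaches discussed in this review, here is a small sketch that isolates each one for the `\n` separator case. The function names `splitThenStrip` and `splitWithRegex` are hypothetical and the surrounding class is omitted. On typical LF or CRLF input the two agree, but they may differ when the text ends in a lone `\r` with no following newline.

```typescript
// Approach in the PR: split on '\n', then strip one trailing '\r' per row.
function splitThenStrip(text: string): string[] {
    return text.split('\n').map((raw) => raw.replace(/\r$/, ''))
}

// Approach in the suggestion: let the split regex consume an optional '\r'.
function splitWithRegex(text: string): string[] {
    return text.split(/\r?\n/)
}

// Mixed LF and CRLF endings: both approaches give the same rows.
console.log(splitThenStrip('a\r\nb\nc')) // → [ 'a', 'b', 'c' ]
console.log(splitWithRegex('a\r\nb\nc')) // → [ 'a', 'b', 'c' ]

// Text ending in a lone '\r': only the per-row strip removes it.
console.log(JSON.stringify(splitThenStrip('a\r\nb\r'))) // → ["a","b"]
console.log(JSON.stringify(splitWithRegex('a\r\nb\r'))) // → ["a","b\r"]
```

When Trim Whitespace is on, the leftover `\r` is removed by trimming anyway, so the difference would only be visible with trimming disabled.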

Development

Successfully merging this pull request may close these issues.

The document is accurately split into lines