Commit 6aee131
⚡️ Speed up function `group_broken_paragraphs` by 30% (#4088)
### 📄 30% (0.30x) speedup for ***`group_broken_paragraphs` in `unstructured/cleaners/core.py`***
⏱️ Runtime : **`21.2 milliseconds`** **→** **`16.3 milliseconds`** (best of `66` runs)
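The headline percentage can be reproduced from the two runtimes. The sketch below assumes the report computes speedup as `original / optimized - 1`; that formula is inferred from the reported numbers, not taken from codeflash documentation, and `speedup_percent` is a hypothetical helper name.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Relative speedup in percent; negative values mean the new code is slower.

    Assumed formula (inferred from the report): original / optimized - 1.
    Both arguments must use the same time unit.
    """
    return (original / optimized - 1.0) * 100.0


# Headline figure: 21.2 ms -> 16.3 ms
print(f"{speedup_percent(21.2, 16.3):.1f}%")

# A regression from the generated tests: 1.38 us -> 2.69 us
print(f"{speedup_percent(1.38, 2.69):.1f}%")
```

Under this assumption the first call gives roughly 30% and the second roughly -48.7%, which lines up with the "48.7% slower" annotation on the empty-string test.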
### 📝 Explanation and details
Here’s an optimized version of your code, preserving all function
signatures, return values, and comments.
**Key improvements:**
- **Precompile regexes** inside the functions where they are used
repeatedly.
- **Avoid repeated `.strip()` and `.split()`** calls in tight loops by
working with stripped data directly.
- **Reduce intermediate allocations** (like unnecessary list comps).
- **Optimize `all_lines_short` computation** by short-circuiting
iteration (`any` instead of `all` and negating logic).
- Minimize calls to regex replace by using direct substitution when
possible.
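A minimal sketch of the first three ideas (compile once, strip and split once, stop at the first long line). This is not the code from the PR: `LINE_BREAK_RE`, `SHORT_LINE_WORDS`, and the five-word threshold are assumptions made for illustration, and the real patterns live in `unstructured.nlp.patterns`.

```python
import re

# Hypothetical stand-in for the library's line-break pattern (an assumption).
LINE_BREAK_RE = re.compile(r"\s*\n\s*")  # compiled once at import time
SHORT_LINE_WORDS = 5  # assumed threshold: a "short" line has fewer than 5 words


def all_lines_short(paragraph: str) -> bool:
    """Return True only if every line has fewer than SHORT_LINE_WORDS words."""
    lines = LINE_BREAK_RE.split(paragraph.strip())  # strip once, split once
    # any() over a generator stops at the first long line it finds.
    return not any(len(line.split()) >= SHORT_LINE_WORDS for line in lines)


print(all_lines_short("Apache License\nVersion 2.0, January 2004"))  # True
print(all_lines_short("The big red fox\nis walking down the lane."))  # False
```

The generator passed to `any` is what makes the early exit possible; building a full list first would count the words of every line before checking any of them.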
**Summary of key speedups:**
- Precompiled regex references up-front—no repeated compile.
- Reordered bullet-matching logic for early fast-path continue.
- Short-circuit `all_lines_short`: break on the first long line.
- Avoids unnecessary double stripping/splitting.
- Uses precompiled regexes even when constants may be strings.
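The gains concentrate in bullet-heavy input: in the generated regression tests below, bullet paragraphs run roughly 60% to 100% faster, while plain paragraphs range from a few percent slower to about 15% faster. Empty, whitespace-only, and single-word inputs are up to about 49% slower, although each such call still takes only a few microseconds.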
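The reordering described above can be sketched as follows: strip each paragraph once, test for a bullet first, and only run the word-count scan when the bullet test fails. Everything here is a simplified assumption rather than the library's implementation. The patterns are hypothetical reductions of those in `unstructured.nlp.patterns`, and the handling of short-line paragraphs is a choice made for this sketch.

```python
import re

# Hypothetical simplified patterns (assumptions, not the library's own).
PARAGRAPH_BREAK_RE = re.compile(r"\n\s*\n")
LINE_BREAK_RE = re.compile(r"\s*\n\s*")
BULLET_RE = re.compile(r"[•○·]")
E_BULLET_RE = re.compile(r"^e(?=\s)")  # OCR artifact: a lone "e" in place of a bullet


def group_paragraphs_sketch(text: str) -> str:
    cleaned = []
    for paragraph in PARAGRAPH_BREAK_RE.split(text):
        stripped = paragraph.strip()  # strip once and reuse the result
        if not stripped:
            continue
        # Fast path: bullet paragraphs never reach the word-count scan.
        if BULLET_RE.match(stripped) or E_BULLET_RE.match(stripped):
            bullet = E_BULLET_RE.sub("·", stripped)
            cleaned.append(LINE_BREAK_RE.sub(" ", bullet))
            continue
        lines = LINE_BREAK_RE.split(stripped)
        if any(len(line.split()) >= 5 for line in lines):
            cleaned.append(" ".join(lines))  # at least one long line: join
        else:
            cleaned.extend(lines)  # all short: keep lines separate (sketch choice)
    return "\n\n".join(cleaned)


sample = "e The big red fox\nis walking down the lane.\n\nAt the end of the lane\nthe fox met a bear."
print(group_paragraphs_sketch(sample))
```

Putting the bullet check ahead of the line scan would explain why the bullet tests in the report improve the most, since those paragraphs skip the per-line work entirely.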
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **58 Passed** |
| 🌀 Generated Regression Tests | ✅ **49 Passed** |
| ⏪ Replay Tests | ✅ **6 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
|📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| `cleaners/test_core.py::test_group_broken_paragraphs` | 19.5μs | 16.1μs | ✅21.0% |
| `cleaners/test_core.py::test_group_broken_paragraphs_non_default_settings` | 23.9μs | 21.7μs | ✅10.2% |
| `partition/test_text.py::test_partition_text_groups_broken_paragraphs` | 1.97ms | 1.96ms | ✅0.347% |
| `test_tracer_py__replay_test_0.py::test_unstructured_cleaners_core_group_broken_paragraphs` | 161μs | 119μs | ✅34.9% |
</details>
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>
```python
from __future__ import annotations

import re

# imports
import pytest  # used for our unit tests
from unstructured.cleaners.core import group_broken_paragraphs

# Dummy patterns for testing (since unstructured.nlp.patterns is unavailable)
# These are simplified versions for the sake of testing
DOUBLE_PARAGRAPH_PATTERN_RE = re.compile(r"\n\s*\n")
E_BULLET_PATTERN = re.compile(r"^\s*e\s+", re.MULTILINE)
PARAGRAPH_PATTERN = re.compile(r"\n")
PARAGRAPH_PATTERN_RE = re.compile(r"\n")
# Unicode bullets for test
UNICODE_BULLETS_RE = re.compile(r"^\s*[•○·]", re.MULTILINE)

# unit tests
# -------------------- BASIC TEST CASES --------------------

def test_empty_string():
    # Test that empty input returns empty string
    codeflash_output = group_broken_paragraphs('') # 1.38μs -> 2.69μs (48.7% slower)

def test_single_line():
    # Test that a single line is returned unchanged
    codeflash_output = group_broken_paragraphs('Hello world.') # 6.58μs -> 6.83μs (3.68% slower)

def test_two_paragraphs_with_double_newline():
    # Test that two paragraphs separated by double newline are preserved
    text = "First paragraph.\nSecond line.\n\nSecond paragraph.\nAnother line."
    expected = "First paragraph. Second line.\n\nSecond paragraph. Another line."
    codeflash_output = group_broken_paragraphs(text) # 13.7μs -> 14.2μs (3.07% slower)

def test_paragraphs_with_single_line_breaks():
    # Test that lines in a paragraph are joined with spaces
    text = "The big red fox\nis walking down the lane.\n\nAt the end of the lane\nthe fox met a bear."
    expected = "The big red fox is walking down the lane.\n\nAt the end of the lane the fox met a bear."
    codeflash_output = group_broken_paragraphs(text) # 18.8μs -> 16.2μs (15.7% faster)

def test_bullet_points():
    # Test bullet points are handled and line breaks inside bullets are joined
    text = "• The big red fox\nis walking down the lane.\n\n• At the end of the lane\nthe fox met a bear."
    expected = [
        "• The big red fox is walking down the lane.",
        "• At the end of the lane the fox met a bear."
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 33.4μs -> 19.7μs (69.7% faster)

def test_e_bullet_points():
    # Test pytesseract e-bullet conversion is handled
    text = "e The big red fox\nis walking down the lane.\n\ne At the end of the lane\nthe fox met a bear."
    # e should be converted to ·
    expected = [
        "· The big red fox is walking down the lane.",
        "· At the end of the lane the fox met a bear."
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 27.8μs -> 16.9μs (64.3% faster)

def test_short_lines_not_grouped():
    # Test that lines with <5 words are not grouped
    text = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
    expected = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
    codeflash_output = group_broken_paragraphs(text) # 10.5μs -> 11.5μs (8.37% slower)

def test_mixed_bullet_and_normal():
    # Test that a mix of bullets and normal paragraphs works
    text = (
        "• First bullet\nis split\n\n"
        "A normal paragraph\nwith line break.\n\n"
        "• Second bullet\nis also split"
    )
    expected = [
        "• First bullet is split",
        "A normal paragraph with line break.",
        "• Second bullet is also split"
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 31.2μs -> 21.3μs (46.3% faster)
# -------------------- EDGE TEST CASES --------------------

def test_all_whitespace():
    # Test input of only whitespace returns empty string
    codeflash_output = group_broken_paragraphs(' \n ') # 3.52μs -> 4.19μs (16.1% slower)

def test_only_newlines():
    # Test input of only newlines returns empty string
    codeflash_output = group_broken_paragraphs('\n\n\n') # 2.44μs -> 3.46μs (29.7% slower)

def test_single_bullet_with_no_linebreaks():
    # Test bullet point with no line breaks is preserved
    text = "• A bullet point with no line breaks."
    codeflash_output = group_broken_paragraphs(text) # 15.3μs -> 8.46μs (81.1% faster)

def test_paragraph_with_multiple_consecutive_newlines():
    # Test that multiple consecutive newlines are treated as paragraph breaks
    text = "First para.\n\n\nSecond para.\n\n\n\nThird para."
    expected = "First para.\n\nSecond para.\n\nThird para."
    codeflash_output = group_broken_paragraphs(text) # 11.4μs -> 11.6μs (1.56% slower)

def test_leading_and_trailing_newlines():
    # Test that leading and trailing newlines are ignored
    text = "\n\nFirst para.\nSecond line.\n\nSecond para.\n\n"
    expected = "First para. Second line.\n\nSecond para."
    codeflash_output = group_broken_paragraphs(text) # 11.9μs -> 12.5μs (4.58% slower)

def test_bullet_point_with_leading_spaces():
    # Test bullet with leading whitespace is handled
    text = " • Bullet with leading spaces\nand a line break."
    expected = "• Bullet with leading spaces and a line break."
    codeflash_output = group_broken_paragraphs(text) # 18.4μs -> 10.6μs (73.3% faster)

def test_unicode_bullets():
    # Test that various unicode bullets are handled
    text = "○ Unicode bullet\nline two.\n\n· Another unicode bullet\nline two."
    expected = [
        "○ Unicode bullet line two.",
        "· Another unicode bullet line two."
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 27.7μs -> 15.7μs (75.8% faster)

def test_short_lines_with_blank_lines():
    # Test that short lines with blank lines are preserved and not grouped
    text = "Title\n\nSubtitle\n\n2024"
    expected = "Title\n\nSubtitle\n\n2024"
    codeflash_output = group_broken_paragraphs(text) # 9.66μs -> 10.1μs (4.73% slower)

def test_mixed_short_and_long_lines():
    # Test a paragraph with both short and long lines
    text = "Title\nThis is a long line that should be grouped with the next.\nAnother long line."
    expected = "Title This is a long line that should be grouped with the next. Another long line."
    codeflash_output = group_broken_paragraphs(text) # 14.9μs -> 13.2μs (13.3% faster)

def test_bullet_point_with_inner_blank_lines():
    # Test bullet points with inner blank lines
    text = "• Bullet one\n\n• Bullet two\n\n• Bullet three"
    expected = [
        "• Bullet one",
        "• Bullet two",
        "• Bullet three"
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 24.9μs -> 13.7μs (81.4% faster)

def test_paragraph_with_tabs_and_spaces():
    # Test paragraphs with tabs and spaces are grouped correctly
    text = "First\tparagraph\nis here.\n\n\tSecond paragraph\nis here."
    expected = "First\tparagraph is here.\n\n\tSecond paragraph is here."
    codeflash_output = group_broken_paragraphs(text) # 12.4μs -> 12.4μs (0.314% slower)
# -------------------- LARGE SCALE TEST CASES --------------------

def test_large_number_of_paragraphs():
    # Test function with 500 paragraphs
    paras = ["Paragraph {} line 1\nParagraph {} line 2".format(i, i) for i in range(500)]
    text = "\n\n".join(paras)
    expected = "\n\n".join(["Paragraph {} line 1 Paragraph {} line 2".format(i, i) for i in range(500)])
    codeflash_output = group_broken_paragraphs(text) # 1.79ms -> 1.69ms (5.66% faster)

def test_large_number_of_bullets():
    # Test function with 500 bullet points, each split over two lines
    bullets = ["• Bullet {} part 1\nBullet {} part 2".format(i, i) for i in range(500)]
    text = "\n\n".join(bullets)
    expected = "\n\n".join(["• Bullet {} part 1 Bullet {} part 2".format(i, i) for i in range(500)])
    codeflash_output = group_broken_paragraphs(text) # 3.72ms -> 1.88ms (97.3% faster)

def test_large_mixed_content():
    # Test function with 200 normal paragraphs and 200 bullet paragraphs
    paras = ["Normal para {} line 1\nNormal para {} line 2".format(i, i) for i in range(200)]
    bullets = ["• Bullet {} part 1\nBullet {} part 2".format(i, i) for i in range(200)]
    # Interleave them
    text = "\n\n".join([item for pair in zip(paras, bullets) for item in pair])
    # Since the input is interleaved, the expected output is interleaved as well
    expected = "\n\n".join([
        val for pair in zip(
            ["Normal para {} line 1 Normal para {} line 2".format(i, i) for i in range(200)],
            ["• Bullet {} part 1 Bullet {} part 2".format(i, i) for i in range(200)]
        ) for val in pair
    ])
    codeflash_output = group_broken_paragraphs(text) # 2.48ms -> 1.59ms (55.8% faster)

def test_performance_on_large_text():
    # Test that the function can handle a large block of text efficiently (not a correctness test)
    big_text = "This is a line in a very big paragraph.\n" * 999
    # Should be grouped into a single paragraph with spaces
    expected = " ".join(["This is a line in a very big paragraph."] * 999)
    codeflash_output = group_broken_paragraphs(big_text) # 2.62ms -> 2.62ms (0.161% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

# ---- What follows appears to be a second generated test module, ----
# ---- concatenated into the same listing (note the repeated header). ----
from __future__ import annotations

import re

# imports
import pytest  # used for our unit tests
from unstructured.cleaners.core import group_broken_paragraphs

# Dummy regexes for test purposes (since we don't have unstructured.nlp.patterns)
DOUBLE_PARAGRAPH_PATTERN_RE = re.compile(r"\n\s*\n")
E_BULLET_PATTERN = re.compile(r"^e\s")
PARAGRAPH_PATTERN = re.compile(r"\n")
PARAGRAPH_PATTERN_RE = re.compile(r"\n")
UNICODE_BULLETS_RE = re.compile(r"^[\u2022\u2023\u25E6\u2043\u2219\u25AA\u25CF\u25CB\u25A0\u25A1\u25B2\u25B3\u25BC\u25BD\u25C6\u25C7\u25C9\u25CB\u25D8\u25D9\u25E6\u2605\u2606\u2765\u2767\u29BE\u29BF\u25A0-\u25FF]")

# unit tests
# -------------------------------
# 1. Basic Test Cases
# -------------------------------

def test_single_paragraph_joined():
    # Should join lines in a single paragraph into one line
    text = "The big red fox\nis walking down the lane."
    expected = "The big red fox is walking down the lane."
    codeflash_output = group_broken_paragraphs(text) # 11.2μs -> 9.78μs (14.9% faster)

def test_multiple_paragraphs():
    # Should join lines in each paragraph, and keep paragraphs separate
    text = "The big red fox\nis walking down the lane.\n\nAt the end of the lane\nthe fox met a bear."
    expected = "The big red fox is walking down the lane.\n\nAt the end of the lane the fox met a bear."
    codeflash_output = group_broken_paragraphs(text) # 17.7μs -> 15.7μs (13.0% faster)

def test_preserve_double_newlines():
    # Double newlines should be preserved as paragraph breaks
    text = "Para one line one\nPara one line two.\n\nPara two line one\nPara two line two."
    expected = "Para one line one Para one line two.\n\nPara two line one Para two line two."
    codeflash_output = group_broken_paragraphs(text) # 13.8μs -> 14.0μs (1.43% slower)

def test_short_lines_not_joined():
    # Short lines (less than 5 words) should not be joined, but kept as separate lines
    text = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
    expected = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
    codeflash_output = group_broken_paragraphs(text) # 10.7μs -> 11.2μs (4.59% slower)

def test_bullet_points_grouped():
    # Bullet points with line breaks should be joined into single lines per bullet
    text = "• The big red fox\nis walking down the lane.\n\n• At the end of the lane\nthe fox met a bear."
    expected = "• The big red fox is walking down the lane.\n\n• At the end of the lane the fox met a bear."
    codeflash_output = group_broken_paragraphs(text) # 35.4μs -> 21.1μs (68.0% faster)

def test_e_bullet_points_grouped():
    # 'e' as bullet should be replaced and grouped
    text = "e The big red fox\nis walking down the lane."
    expected = "· The big red fox is walking down the lane."
    codeflash_output = group_broken_paragraphs(text) # 17.5μs -> 10.9μs (61.7% faster)
# -------------------------------
# 2. Edge Test Cases
# -------------------------------

def test_empty_string():
    # Empty string should return empty string
    codeflash_output = group_broken_paragraphs("") # 1.13μs -> 2.03μs (44.3% slower)

def test_only_newlines():
    # String of only newlines should return empty string
    codeflash_output = group_broken_paragraphs("\n\n\n") # 2.70μs -> 3.52μs (23.1% slower)

def test_spaces_and_newlines():
    # String of spaces and newlines should return empty string
    codeflash_output = group_broken_paragraphs(" \n \n\n ") # 2.91μs -> 3.90μs (25.4% slower)

def test_single_word():
    # Single word should be returned as is
    codeflash_output = group_broken_paragraphs("Hello") # 5.77μs -> 6.09μs (5.24% slower)

def test_single_line_paragraphs():
    # Multiple single-line paragraphs separated by double newlines
    text = "First para.\n\nSecond para.\n\nThird para."
    expected = "First para.\n\nSecond para.\n\nThird para."
    codeflash_output = group_broken_paragraphs(text) # 11.3μs -> 12.0μs (5.89% slower)

def test_paragraph_with_trailing_newlines():
    # Paragraph with trailing newlines should be handled
    text = "The big red fox\nis walking down the lane.\n\n"
    expected = "The big red fox is walking down the lane."
    codeflash_output = group_broken_paragraphs(text) # 12.7μs -> 11.1μs (13.6% faster)

def test_bullet_with_extra_spaces():
    # Bullet with extra spaces and newlines
    text = " • The quick brown\nfox jumps over\n the lazy dog. "
    expected = "• The quick brown fox jumps over the lazy dog. "
    codeflash_output = group_broken_paragraphs(text) # 22.5μs -> 12.6μs (78.1% faster)

def test_mixed_bullets_and_normal():
    # Mixed bullet and non-bullet paragraphs
    text = "• Bullet one\ncontinues here.\n\nNormal para\ncontinues here."
    expected = "• Bullet one continues here.\n\nNormal para continues here."
    codeflash_output = group_broken_paragraphs(text) # 22.0μs -> 15.6μs (40.8% faster)

def test_multiple_bullet_styles():
    # Multiple Unicode bullet styles
    text = "• Bullet A\nline two.\n\n◦ Bullet B\nline two."
    expected = "• Bullet A line two.\n\n◦ Bullet B line two."
    codeflash_output = group_broken_paragraphs(text) # 23.7μs -> 12.4μs (90.4% faster)

def test_short_and_long_lines_mixed():
    # A paragraph with both short and long lines
    text = "Short\nThis is a much longer line that should be joined\nAnother short"
    # Only the first and last lines are short, but the presence of a long line means the paragraph will be joined
    expected = "Short This is a much longer line that should be joined Another short"
    codeflash_output = group_broken_paragraphs(text) # 14.1μs -> 12.7μs (10.9% faster)

def test_paragraph_with_tabs():
    # Paragraph with tabs instead of spaces
    text = "The big red fox\tis walking down the lane."
    expected = "The big red fox\tis walking down the lane."
    codeflash_output = group_broken_paragraphs(text) # 9.45μs -> 7.96μs (18.7% faster)

def test_bullet_with_leading_newline():
    # Bullet point with a leading newline
    text = "\n• Bullet with leading newline\ncontinues here."
    expected = "• Bullet with leading newline continues here."
    codeflash_output = group_broken_paragraphs(text) # 18.7μs -> 9.98μs (87.2% faster)

def test_bullet_with_trailing_newline():
    # Bullet point with a trailing newline
    text = "• Bullet with trailing newline\ncontinues here.\n"
    expected = "• Bullet with trailing newline continues here."
    codeflash_output = group_broken_paragraphs(text) # 17.2μs -> 9.58μs (79.6% faster)

def test_unicode_bullet_variants():
    # Test with a variety of Unicode bullets
    text = "● Unicode bullet one\ncontinues\n\n○ Unicode bullet two\ncontinues"
    expected = "● Unicode bullet one continues\n\n○ Unicode bullet two continues"
    codeflash_output = group_broken_paragraphs(text) # 24.3μs -> 13.8μs (76.7% faster)

def test_multiple_empty_paragraphs():
    # Multiple empty paragraphs between text
    text = "First para.\n\n\n\nSecond para."
    expected = "First para.\n\nSecond para."
    codeflash_output = group_broken_paragraphs(text) # 9.26μs -> 9.85μs (6.00% slower)
# -------------------------------
# 3. Large Scale Test Cases
# -------------------------------

def test_large_number_of_paragraphs():
    # 500 paragraphs, each with two lines to be joined
    paras = ["Line one {}\nLine two {}".format(i, i) for i in range(500)]
    text = "\n\n".join(paras)
    expected = "\n\n".join(["Line one {} Line two {}".format(i, i) for i in range(500)])
    codeflash_output = group_broken_paragraphs(text) # 1.36ms -> 1.29ms (5.79% faster)

def test_large_number_of_bullets():
    # 300 bullet points, each with two lines
    paras = ["• Bullet {}\ncontinues here.".format(i) for i in range(300)]
    text = "\n\n".join(paras)
    expected = "\n\n".join(["• Bullet {} continues here.".format(i) for i in range(300)])
    codeflash_output = group_broken_paragraphs(text) # 1.98ms -> 969μs (104% faster)

def test_large_mixed_content():
    # Mix of 200 normal paras and 200 bullets
    normal_paras = ["Normal {}\ncontinues".format(i) for i in range(200)]
    bullet_paras = ["• Bullet {}\ncontinues".format(i) for i in range(200)]
    all_paras = []
    for i in range(200):
        all_paras.append(normal_paras[i])
        all_paras.append(bullet_paras[i])
    text = "\n\n".join(all_paras)
    # The function processes paragraphs in order, so the expected output is interleaved too
    interleaved = []
    for i in range(200):
        interleaved.append("Normal {} continues".format(i))
        interleaved.append("• Bullet {} continues".format(i))
    expected = "\n\n".join(interleaved)
    codeflash_output = group_broken_paragraphs(text)

def test_large_short_lines():
    # 1000 short lines, all should be preserved as is (not joined)
    text = "\n".join(["A {}".format(i) for i in range(1000)])
    expected = "\n".join(["A {}".format(i) for i in range(1000)])
    codeflash_output = group_broken_paragraphs(text) # 605μs -> 565μs (7.11% faster)

def test_large_paragraph_with_long_lines():
    # One paragraph with 1000 long lines (should be joined into one)
    text = "\n".join(["This is a long line number {}".format(i) for i in range(1000)])
    expected = " ".join(["This is a long line number {}".format(i) for i in range(1000)])
    codeflash_output = group_broken_paragraphs(text) # 2.11ms -> 2.09ms (1.10% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
</details>
To edit these changes, run `git checkout codeflash/optimize-group_broken_paragraphs-mcg8s57e` and push.
---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Saurabh Misra <misra.saurabh1@gmail.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: Alan Bertl <alan@unstructured.io>

PR #4088, 1 parent 1030a69, commit 6aee131

3 files changed: +25 -14 lines changed