fix: improve Calibre artifact cleanup #5

Closed
fredchu wants to merge 1 commit into deusyu:main from fredchu:fix/calibre-text-cleanup

@fredchu fredchu commented Mar 22, 2026

Summary

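Extends clean_calibre_markers() to handle more Pandoc/Calibre conversion residue. Each rule comes with a before/after sample.

This is the second change split out of PR #2 (text cleanup). The overly aggressive global replacements have been fixed as suggested in review.

Changes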

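| Rule | Before | After |
| --- | --- | --- |
| Attribute block (including #id) | `## Title {#calibre_link-0 .calibre3}` | `## Title` |
| Escaped bracket wrapping | `\[Some paragraph.\]` | `Some paragraph.` |
| Empty heading | `##` | (removed) |
| Heading `[*text*]` wrapping | `## [*Some Title*]` | `## Some Title` |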
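
As a rough illustration of the attribute-block and empty-heading rows, here is a minimal sketch. The PR does not show the regexes it uses for these two rules, so `ATTR_BLOCK`, `EMPTY_HEADING`, and `strip_attrs_and_empty_headings` are hypothetical names and patterns, not the code in `clean_calibre_markers()`.

```python
import re

# Hypothetical pattern: a Pandoc-style attribute block such as
# {#calibre_link-0 .calibre3} or {.calibre5}, plus any spaces before it.
# Each token must start with # or . followed by a letter or underscore,
# so something like {.5} is not treated as an attribute block.
ATTR_BLOCK = re.compile(
    r"[ \t]*\{[#.][A-Za-z_][\w-]*(?:[ \t]+[#.][A-Za-z_][\w-]*)*\}"
)

# Hypothetical pattern: a heading line that contains only # characters
# (and optional trailing spaces), together with its newline.
EMPTY_HEADING = re.compile(r"(?m)^#{1,6}[ \t]*$\n?")


def strip_attrs_and_empty_headings(text: str) -> str:
    """Remove attribute blocks first, then drop headings left with no text."""
    text = ATTR_BLOCK.sub("", text)
    return EMPTY_HEADING.sub("", text)


sample = "## Title {#calibre_link-0 .calibre3}\n##\ntext{.calibre5}\n"
cleaned = strip_attrs_and_empty_headings(sample)
```

Running the attribute rule before the empty-heading rule means a heading that consisted only of an attribute block is also removed in the same pass.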

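Fixes in response to review feedback

  1. \[ / \] are no longer replaced globally. The patterns (?m)^\\?\\\[ and (?m)\\?\\\]$ are used instead; they match only at the start / end of a line, so mid-line LaTeX display math \[...\] is left intact.
  2. strip('[]* ') has been removed. Heading cleanup now uses re.match(r'^\[\s*\*(.+?)\*\s*\]$', text) to match the [*text*] pattern exactly, so legitimate characters are no longer stripped by mistake.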
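
The two fixes above can be sketched as follows. The three regexes are the ones quoted in this description; the helper names `unwrap_escaped_brackets` and `clean_heading_text` are hypothetical and only serve the illustration.

```python
import re

# Patterns quoted in the PR description.
LEADING_BRACKET = re.compile(r"(?m)^\\?\\\[")    # \[ at the start of a line
TRAILING_BRACKET = re.compile(r"(?m)\\?\\\]$")   # \] at the end of a line
HEADING_WRAP = re.compile(r"^\[\s*\*(.+?)\*\s*\]$")


def unwrap_escaped_brackets(text: str) -> str:
    """Drop \\[ and \\] only where they sit at a line boundary."""
    text = LEADING_BRACKET.sub("", text)
    return TRAILING_BRACKET.sub("", text)


def clean_heading_text(text: str) -> str:
    """Unwrap [*text*]; return anything else unchanged."""
    m = HEADING_WRAP.match(text)
    return m.group(1) if m else text


wrapped = unwrap_escaped_brackets(r"\[Some paragraph.\]")
inline_math = unwrap_escaped_brackets(r"where \[x+1\] holds")

# The old global replacement would have stripped the inline math delimiters.
globally_replaced = r"where \[x+1\] holds".replace(r"\[", "").replace(r"\]", "")

# strip() removes any of the listed characters from both ends,
# so it also eats a legitimate closing bracket.
stripped = "Appendix [A]".strip("[]* ")
matched = clean_heading_text("Appendix [A]")
```

Here `inline_math` keeps its delimiters while `globally_replaced` loses them, and `stripped` loses the closing bracket of `[A]` while `matched` is unchanged. Note that a `\[` or `\]` that happens to sit at the very start or end of a line would still be removed by these patterns, so the protection applies to mid-line math only.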

Test plan

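  • Attribute block removal: text{.calibre5} → text
  • Escaped brackets: \[...\] at line start / line end is removed
  • LaTeX protection: mid-line \[x+1\] is left untouched
  • Heading cleanup: ## [*Some Title*] → ## Some Title
  • Bold brackets: [**Chapter One**] → **Chapter One**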

🤖 Generated with Claude Code

Expand cleanup to handle more Pandoc/Calibre conversion artifacts:
- Attribute blocks with #id and .class (e.g. {#calibre_link-0 .calibre3})
- Escaped bracket wrapping \[...\] — only at line boundaries to
  preserve LaTeX display math mid-line
- Empty headings (# with no text)
- Heading [*Title*] wrapping via specific regex match

Each rule includes before/after samples in the docstring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

deusyu (Owner) commented Apr 23, 2026

Thanks again for the PR.

After reviewing this against the current main branch, I do not think this is a small patch anymore. Improving Calibre/Pandoc cleanup needs to be handled more carefully under the current regression expectations, so I do not want to merge this directly in its current form.

I am going to close this for now, roll the underlying idea into the roadmap on my side, and handle any follow-up work from the maintainer side.

Thanks again for the contribution.

@deusyu deusyu closed this Apr 23, 2026
deusyu added a commit that referenced this pull request Apr 23, 2026
- CLAUDE.md: promote pipeline artifact names (`book.html`, `book.epub`,
  ...) to a project Convention so future changes don't silently rename
  them; if title-based filenames are added, they must be optional
  aliases/copies, not replacements
- README{,.zh-CN}.md: append `(context: closed #N)` to each backlog
  bullet so the rationale stays traceable to the PR discussions that
  produced it (#3, #4, #5, #6)
- tests/baselines/README.md: codify "a baseline that has never been run
  is not a baseline" — SOURCE.md numbers come from a measured run, not
  estimates
- .gitignore: ignore local AI agent / tooling dirs (.agents, .claude,
  _bmad) so they stop showing up as untracked