
fix: faulty TOC import/export (SD-2183) #2371

Merged
harbournick merged 6 commits into main from luccas/sd-2183-bug-file-becomes-corrupted-on-export
Mar 13, 2026


@luccas-harbour
Contributor

Summary

Fixes the DOCX export/import issues behind SD-2183 for table of contents content controls.

This PR addresses two related problems:

  • exported TOC content controls could emit invalid/empty w:id values
  • TOC field instructions stored inside a content control were not reliably round-tripped, which caused the exported document to lose the w:fldChar / w:instrText structure Word expects

What Changed

1. Prevent empty SDT IDs during export

  • stop emitting w:id for document part SDTs when the value is empty
  • sanitize passthrough sdtPr data so empty w:id nodes are not re-exported
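As a rough illustration of the passthrough sanitizing described above, here is a minimal sketch. It assumes an xml-js-style element shape (`name`, `attributes`, `elements`); the `XmlElement` type and the `dropEmptySdtIds` name are hypothetical and are not taken from this PR.

```typescript
// Hypothetical xml-js-style element shape; the real exporter types may differ.
interface XmlElement {
  name: string;
  attributes?: Record<string, string>;
  elements?: XmlElement[];
}

// Returns a copy of the sdtPr children without any w:id node whose
// w:val is missing or blank, so the exporter never writes <w:id w:val=""/>.
function dropEmptySdtIds(sdtPrChildren: XmlElement[]): XmlElement[] {
  return sdtPrChildren.filter((el) => {
    if (el.name !== "w:id") return true;
    const val = el.attributes ? el.attributes["w:val"] : undefined;
    return typeof val === "string" && val.trim() !== "";
  });
}

const cleaned = dropEmptySdtIds([
  { name: "w:id", attributes: { "w:val": "" } },
  { name: "w:docPartObj", elements: [] },
]);
console.log(cleaned.map((el) => el.name)); // only the w:docPartObj node remains
```

Filtering into a new array, rather than splicing the input, keeps the imported passthrough data untouched for any other consumer.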

2. Handle complex fields stored in a single run

  • update preProcessNodesForFldChar to correctly process fields when begin, instrText, separate, and end are all stored in the same w:r
  • preserve unknown-field fallback behavior for those compressed single-run cases
  • add regression coverage for:
    • TOC fields in a single run
    • unknown fields in a single run
    • drawing/pict content inside active field collection
    • nested/recursive field boundaries
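The single-run case can be pictured as splitting one w:r into several runs before the usual field processing. The sketch below is only an approximation under assumptions: the xml-js-style `XmlElement` shape and the `splitFieldRun` helper are hypothetical, and the actual change to preProcessNodesForFldChar may work differently.

```typescript
// Hypothetical xml-js-style element shape; the real importer types may differ.
interface XmlElement {
  name: string;
  attributes?: Record<string, string>;
  elements?: XmlElement[];
}

const FIELD_MARKERS = new Set(["w:fldChar", "w:instrText"]);

// If a run holds more than one field marker, emit one run per content child,
// each carrying a shallow copy of the original w:rPr. Other runs pass through.
function splitFieldRun(run: XmlElement): XmlElement[] {
  const children = run.elements || [];
  const rPr = children.find((c) => c.name === "w:rPr");
  const content = children.filter((c) => c.name !== "w:rPr");
  const markerCount = content.filter((c) => FIELD_MARKERS.has(c.name)).length;
  if (markerCount <= 1) return [run];
  return content.map((child) => {
    const elements = rPr ? [{ ...rPr }, child] : [child];
    return run.attributes
      ? { name: "w:r", attributes: { ...run.attributes }, elements }
      : { name: "w:r", elements };
  });
}

const compressed: XmlElement = {
  name: "w:r",
  elements: [
    { name: "w:fldChar", attributes: { "w:fldCharType": "begin" } },
    { name: "w:instrText" },
    { name: "w:fldChar", attributes: { "w:fldCharType": "separate" } },
    { name: "w:t" },
    { name: "w:fldChar", attributes: { "w:fldCharType": "end" } },
  ],
};
console.log(splitFieldRun(compressed).length); // one run per content child
```

After the split, each marker sits in its own run, which is the layout the spec's canonical examples use.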

3. Preserve TOC structure inside document part objects

  • update TOC docPartObject import to hoist sd:tableOfContents out of wrapper paragraphs
  • keep the imported PM structure aligned with what the layout/export pipeline expects:
    • heading paragraph
    • nested tableOfContents block
  • avoid creating empty wrapper paragraphs when a paragraph only contains pPr plus the TOC block
  • relax the tableOfContents node content model from paragraph+ to paragraph* so empty imported TOCs remain valid
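To make the hoisting step concrete, here is a simplified sketch over ProseMirror-like JSON nodes. The `PmNode` shape and the `hoistTableOfContents` function are hypothetical stand-ins; the real importer works on the DOCX XML and handles pPr, which this sketch leaves out.

```typescript
// Simplified ProseMirror-like JSON node; an assumption for illustration only.
interface PmNode {
  type: string;
  attrs?: Record<string, unknown>;
  content?: PmNode[];
}

// Lifts tableOfContents blocks out of wrapper paragraphs. Content before and
// after the TOC stays in paragraphs of its own; a paragraph that held nothing
// but the TOC disappears, so no empty wrapper is produced.
function hoistTableOfContents(blocks: PmNode[]): PmNode[] {
  const result: PmNode[] = [];
  for (const block of blocks) {
    const children = block.content || [];
    const hasToc =
      block.type === "paragraph" &&
      children.some((c) => c.type === "tableOfContents");
    if (!hasToc) {
      result.push(block);
      continue;
    }
    let pending: PmNode[] = [];
    const flush = () => {
      if (pending.length > 0) {
        result.push({ ...block, content: pending });
        pending = [];
      }
    };
    for (const child of children) {
      if (child.type === "tableOfContents") {
        flush();
        result.push(child);
      } else {
        pending.push(child);
      }
    }
    flush();
  }
  return result;
}

const hoisted = hoistTableOfContents([
  { type: "paragraph", content: [{ type: "text" }, { type: "tableOfContents", content: [] }] },
]);
console.log(hoisted.map((n) => n.type)); // heading paragraph, then the TOC block
```

The empty `content: []` on the TOC node in the usage example is the case the relaxed paragraph* content model is meant to allow.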

Why

Word expects the TOC field to round-trip as a real complex field sequence. Without that, exported files can lose the TOC field markers and fail to behave correctly when reopened in Word.

@linear

linear bot commented Mar 11, 2026

@github-actions
Contributor

Status: PASS

The OOXML elements and attributes in this PR are all spec-compliant. Here's what I checked:

w:id (§17.5.2.18) — The val attribute is typed as ST_DecimalNumber. The old code was emitting <w:id w:val=""/> when the id was empty, which is invalid. The new sanitizeId helper correctly omits the element rather than writing an empty value. Per spec, omission of w:id is explicitly allowed — the processor will assign a new unique ID on open. Good fix.

One minor note: sanitizeId accepts any non-empty string, not just decimals. If upstream data ever contains a non-numeric id (e.g. "abc"), the emitted <w:id w:val="abc"/> would still be schema-invalid. That said, this is a pre-existing concern: the PR only tightens the empty-string guard and does not introduce a new hole.
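If a stricter guard were wanted later, one possible shape is sketched below. This is not the PR's sanitizeId; `toDecimalIdOrNull` is a hypothetical helper, and it assumes ST_DecimalNumber values are written as an optional sign followed by digits.

```typescript
// Hypothetical stricter variant: returns the id as a string only when it
// looks like an integer (optional sign, then digits); otherwise null,
// which a caller would treat as "omit w:id".
function toDecimalIdOrNull(value: unknown): string | null {
  if (typeof value !== "string" && typeof value !== "number") return null;
  const text = String(value).trim();
  return /^[+-]?\d+$/.test(text) ? text : null;
}

console.log(toDecimalIdOrNull("12345"));
console.log(toDecimalIdOrNull("abc"));
```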

w:docPartObj / w:docPartGallery / w:docPartUnique (§17.5.2.13, §17.5.2.11, §17.5.2.14) — All used correctly. The boolean presence-equals-true pattern for w:docPartUnique matches spec, and the w:docPartGallery val is ST_String in the SDT context (not the restricted ST_DocPartGallery enum, which applies to glossary entries).

w:fldChar / w:fldCharType (§17.16.18, §17.18.29) — The three type values begin, separate, end are exactly right. The spec's canonical examples show each w:fldChar in its own w:r, but nothing prohibits co-location in a single run. expandNodeForFieldProcessing handling this case is a reasonable robustness measure for real-world documents.

w:instrText — Correctly named and used within w:r.
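For reference, the begin / separate / end ordering discussed above can be checked with a small stack-based walk. The sketch below is illustrative only; `isBalancedFieldSequence` is a hypothetical helper, not code from this PR. It treats separate as optional, since a field may have no result part.

```typescript
type FldCharType = "begin" | "separate" | "end";

// Checks that a flat list of fldChar types forms balanced complex fields:
// every "begin" is closed by an "end", nested fields close before their
// parent does, and each open field sees at most one "separate".
function isBalancedFieldSequence(types: FldCharType[]): boolean {
  const separated: boolean[] = []; // one flag per currently open field
  for (const t of types) {
    if (t === "begin") {
      separated.push(false);
    } else if (separated.length === 0) {
      return false; // "separate" or "end" with no open field
    } else if (t === "separate") {
      const top = separated.length - 1;
      if (separated[top]) return false; // second "separate" in the same field
      separated[top] = true;
    } else {
      separated.pop(); // "end" closes the innermost open field
    }
  }
  return separated.length === 0;
}

console.log(isBalancedFieldSequence(["begin", "separate", "end"]));
console.log(isBalancedFieldSequence(["begin", "separate"]));
```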

@luccas-harbour luccas-harbour marked this pull request as ready for review March 11, 2026 19:47
Contributor

@caio-pizzol caio-pizzol left a comment


@luccas-harbour nice fix — splitting runs with multiple field markers and pulling TOC blocks out of wrapper paragraphs handles the corruption well.

one edge case to be aware of in the field splitter, and one spot where we should copy instead of modify in place — left inline comments on both.

on tests: the existing visual/layout test data covers general TOC and fldChar rendering, but nothing exercises the specific case this PR fixes (all field markers in one run). two unit tests worth adding: one where a single paragraph has both regular content and a TOC element, and one that checks the schema accepts a TOC with no children. a behavior test importing a doc with single-run fldChar fields would also be a nice regression guard but not blocking.

Collaborator

@harbournick harbournick left a comment


LGTM

@harbournick harbournick merged commit 45b4452 into main Mar 13, 2026
7 checks passed
@harbournick harbournick deleted the luccas/sd-2183-bug-file-becomes-corrupted-on-export branch March 13, 2026 17:46
@superdoc-bot
Contributor

superdoc-bot bot commented Mar 13, 2026

🎉 This PR is included in superdoc-cli v0.2.0-next.131

The release is available on GitHub release

@superdoc-bot
Contributor

superdoc-bot bot commented Mar 13, 2026

🎉 This PR is included in superdoc v1.18.0-next.56

The release is available on GitHub release

