Skip to content

fix: import of w:sdt nested in paragraph (SD-2190)#2380

Open
luccas-harbour wants to merge 9 commits intomainfrom
luccas/sd-2190-bug-import-fails-when-a-list-of-figurestables-is-the-first
Open

fix: import of w:sdt nested in paragraph (SD-2190)#2380
luccas-harbour wants to merge 9 commits intomainfrom
luccas/sd-2190-bug-import-fails-when-a-list-of-figurestables-is-the-first

Conversation

@luccas-harbour
Copy link
Contributor

Summary

Fix DOCX import failures caused by w:sdt w:docPartObj nodes being parsed as block content inside paragraph nodes.

This change makes paragraph import resilient when a block-level docPart SDT appears inside w:p, especially for cases like Table of Figures / Table of Contents placeholders at the start of a document.

Problem

Some documents contain XML like:

<w:p>
  <w:sdt>
    <w:sdtPr>
      <w:docPartObj>
        <w:docPartGallery w:val="Table of Figures"/>
      </w:docPartObj>
    </w:sdtPr>
    <w:sdtContent>...</w:sdtContent>
  </w:sdt>
</w:p>

Our importer translated that w:sdt into a documentPartObject block node, but still attempted to keep it inside paragraph.content.

Since paragraph only allows inline content, import could fail with:

Invalid content for node type paragraph

What changed

  • Added a shared isInlineNode() helper for schema-aware inline/block classification.
  • Strengthened the fallback inline detection used when schema metadata is unavailable.
  • Updated paragraph import normalization so block-producing children are hoisted out of w:p instead of being inserted into paragraph.content.
  • Split mixed paragraph content into sibling nodes when needed:
    • paragraph fragments for inline runs
    • block nodes for hoisted content like documentPartObject
  • Adjusted paragraph attribute distribution for split results:
    • paraId / textId stay on only one paragraph fragment
    • rsid* attrs are still copied to paragraph fragments
  • Preserved wrapper paragraph metadata on block-only hoisted documentPartObject nodes so paragraph formatting can round-trip through export.
  • Re-export wrapped w:p > w:sdt when wrapper paragraph metadata is present.
  • Preserved sectPr placement for hoisted trailing/block-only docPart cases.

Tests

Added and updated regressions for:

  • block-only docPartObj inside w:p
  • mixed inline text around docPartObj
  • no-schema import paths
  • split paragraph identity handling
  • block-only wrapper paragraph formatting
  • docPart export with preserved paragraph wrapper metadata

Limitations

The following limitations are known and accepted for now:

  • Wrapper paragraph metadata preserved on documentPartObject is primarily used for round-trip/export and is not fully reflected in rendering behavior.
  • Some paragraph-level attrs may still be duplicated across split paragraph fragments in edge cases.
  • The no-schema inline fallback is still heuristic-based and may not cover every inline media type.

Allow paragraph translators to return fragment arrays during import and
normalize them in the v2 paragraph importer.

When the legacy paragraph handler returns mixed fragment output, apply
encoded paragraph attributes only to paragraph nodes so embedded
documentPartObject fragments remain unchanged.

Add coverage for the array-return path in the paragraph translator tests.
@linear
Copy link

linear bot commented Mar 12, 2026

@github-actions
Copy link
Contributor

Status: FAIL

One spec violation in the export path, plus a minor default value issue.


1. translate-document-part-obj.js — inline SDT containing block content

The new wrapDocumentPartInParagraph function produces:

<w:p w:rsidRDefault="...">
  <w:pPr>...</w:pPr>
  <w:sdt>
    <w:sdtContent>
      <w:p>...</w:p>   <!-- paragraph child! -->
    </w:sdtContent>
  </w:sdt>
</w:p>

Per §17.5.2.31, a w:sdt inside a w:p is an inline-level SDT (CT_SdtRun) whose sdtContent must hold run-level content. A w:docPartObj SDT is explicitly defined (§17.5.2.13) as a block-level construct for tagging "a set of block-level objects." Nesting it inside w:p violates both CT_SdtRun's content model and docPartObj's semantics. See https://ooxml.dev/spec?q=sdtContent+inline+block and https://ooxml.dev/spec?q=docPartObj.

This is clearly intentional — the PR is round-tripping documents that already had this unusual structure — but the export output is technically non-conformant OOXML. If Word generated the original file, it probably placed the docPartObj SDT at body level. The safer export would be to emit the block SDT at body level (with the paragraph attributes forwarded as a preceding w:p if needed) rather than re-creating the malformed wrapping.

2. document-part-object.jsdocPartUnique default

docPartUnique: {
  default: true,   // pre-existing, not introduced here
},

This is pre-existing and not changed by this PR, but it interacts directly with the new wrapperParagraph machinery: any documentPartObject node created without an explicit docPartUnique: false will export <w:docPartUnique/>. Per §17.5.2.14, the element being absent is meaningful (the SDT is not a built-in document part container). Defaulting to true causes the exporter to assert built-in status on every SDT, including ones where the original <w:docPartUnique/> was absent. The default should be false or null. This PR doesn't introduce the problem but it does widen the surface where it matters since more nodes now go through this export path.


Everything else checks out — w:id/@w:val as ST_DecimalNumber (§17.5.2.18), w:docPartGallery/@w:val as ST_String (§17.5.2.11), w:docPartUnique as a toggle (§17.17.4 — present-with-no-val implies true), and w:keepNext/@w:val="true" are all handled correctly.

@luccas-harbour
Copy link
Contributor Author

luccas-harbour commented Mar 12, 2026

About the OOXML Spec Review failure above, the docx file that originated this PR has exactly the structure that the report is saying isn't valid but was created with Word:

<w:p w14:paraId="41964671" w14:textId="04598795" w:rsidR="00233D7B" w:rsidRPr="003104CE" w:rsidRDefault="00233D7B" w:rsidP="003104CE">
  <w:sdt>
    <w:sdtPr>
      <w:id w:val="123456789"/>
      <w:docPartObj>
        <w:docPartGallery w:val="Table of Figures"/>
        <w:docPartUnique/>
      </w:docPartObj>
    </w:sdtPr>
    <w:sdtContent>
      <w:p w14:paraId="11111111" w14:textId="11111111">
        <w:r>
          <w:t>Table of Figures</w:t>
        </w:r>
      </w:p>
      <w:p w14:paraId="22222222" w14:textId="22222222">
        <w:r>
          <w:t>Figure 1</w:t>
        </w:r>
        <w:r>
          <w:tab/>
        </w:r>
        <w:r>
          <w:t>1</w:t>
        </w:r>
      </w:p>
    </w:sdtContent>
  </w:sdt>
</w:p>

Therefore I don't think the analysis result is valid.

@luccas-harbour luccas-harbour marked this pull request as ready for review March 12, 2026 19:59
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2772dfcb84

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@github-actions
Copy link
Contributor

Status: PASS

The OOXML handling in this PR is spec-compliant. Here's what I checked:

w:docPartObj / w:docPartGallery / w:docPartUnique (§17.5.2.13–14): All three are used correctly. w:docPartGallery takes a string w:val, w:docPartUnique is treated as a presence-based boolean (element present → true), and both must be children of w:docPartObj inside w:sdtPr. The test fixtures match this structure exactly.

w:id on w:sdtPr (§17.5.2.18): Used with w:val="123456789" — a valid ST_DecimalNumber integer. ✓

w:keepNext serialization: The export test expects { name: 'w:keepNext', attributes: {} } (i.e., <w:keepNext/>). Per the common boolean property definition (§17.17.4), omitting w:val on a present element implies true. ✓

w:spacing attributes: w:after, w:line as integers and w:lineRule="auto" are all valid. ✓

Paragraph rsid attributes (w:rsidRDefault, w:rsidR, etc.) and w14:paraId / w14:textId: All valid w:p attributes. The identity-attribute partitioning (IDENTITY_ATTR_NAMES) correctly avoids duplicating paraId/textId across synthetic paragraph fragments, since each w:p must have a unique paraId per the Word 2010 namespace spec.

The wrapping behavior (wrapperParagraph): The round-trip produces <w:p><w:sdt w:docPartObj>...</w:sdt></w:p> — this is technically an inline-level SDT (§17.5.2.31) containing block content, which is non-standard. But this mirrors the structure of real-world documents being imported, not a new violation introduced by this PR. The code is faithfully preserving what was there.

No non-existent attributes, no incorrect defaults, and no new spec violations introduced. The wrapperParagraph: { default: null } addition is purely an internal PM attribute with no OOXML representation of its own.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant