fix(copyright): clean up Boost markup regressions by mstykow · Pull Request #565 · mstykow/provenant

mstykow · 2026-04-04T13:55:32Z

Summary

- clean up shared copyright/author detection regressions exposed by the boostorg/boost compare-outputs --profile common run
- add durable copyright golden regressions for XML author attributes, DocBook authorgroup extraction, and Boost CSS selector noise, and update stale golden expectations that were treating file names/scripts as authors
- record the verified boostorg/boost result in the C++ parser verification scorecard

Scope and exclusions

Included:
- markup author extraction and entity decoding improvements in the shared copyright detector
- junk filtering for CSS/prose/path-like author and holder noise
- focused detector/refiner tests, narrow copyright golden suite coverage, docs scorecard update, and repeated Boost compare validation

Explicit exclusions:
- no parser-family feature expansion beyond the shared detection fixes surfaced by the compare run
- no changes to the kept Provenant-better normalization/improvement deltas unless they were clear regressions
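
For readers unfamiliar with the markup cases listed under "Included", the following is a minimal Python sketch of what author extraction with entity decoding can look like. It is not Provenant's code: the function names, the regular expressions, and the sample markup are all assumptions made for illustration, and the real detector likely handles many more cases.

```python
import html
import re

# Hypothetical patterns, chosen for this sketch only.
AUTHOR_ATTR = re.compile(r'\bauthors?\s*=\s*"([^"]*)"', re.IGNORECASE)
AUTHOR_ELEM = re.compile(r"<author>(.*?)</author>", re.DOTALL | re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")
NAME_SEPARATORS = re.compile(r"\s*(?:,|;|\band\b|&)\s*")


def _normalize(text: str) -> str:
    """Decode entities such as &amp; or &#233; and collapse whitespace."""
    return " ".join(html.unescape(text).split())


def authors_from_attributes(markup: str) -> list[str]:
    """Collect names from author="..." / authors="..." attributes."""
    found: list[str] = []
    for raw in AUTHOR_ATTR.findall(markup):
        # Decode first so an encoded "&amp;" can act as a separator.
        for name in NAME_SEPARATORS.split(html.unescape(raw)):
            name = _normalize(name)
            if name and name not in found:
                found.append(name)
    return found


def authors_from_authorgroup(markup: str) -> list[str]:
    """Collect names from DocBook <author> elements inside an <authorgroup>."""
    found: list[str] = []
    for inner in AUTHOR_ELEM.findall(markup):
        # Replace child tags (<firstname>, <surname>, ...) with spaces.
        name = _normalize(ANY_TAG.sub(" ", inner))
        if name and name not in found:
            found.append(name)
    return found
```

Decoding entities before splitting and deduplicating is the design point here: it keeps an encoded and a literal spelling of the same name from being reported twice.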

Intentional differences from Python

- keep cleaner Provenant-better results from the Boost compare run, including the extra real Boost copyright/author detections and cleaner normalization/deduplication in markup-heavy files
- drop bogus golden “authors” that were actually file names or generator scripts (DynamicClockGatingTable.ctb, EnableASIC_StaticPwrMgtTable.ctb, EnableDispPowerGatingTable.ctb, createinit.py)
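
The kind of junk filtering that rejects those file-name "authors" can be sketched roughly as below. This is an illustrative Python approximation only: the function name and the heuristics are assumptions, not the rules the shared author filter actually applies.

```python
import re

# Hypothetical heuristics for this sketch only.
# A single token with a short extension, e.g. "createinit.py".
FILE_NAME = re.compile(r"^[\w\-]+\.[A-Za-z0-9]{1,5}$")
# CSS punctuation anywhere, or a leading class/id selector such as ".copyright".
CSS_NOISE = re.compile(r"[{};]|^[.#][\w-]+")


def looks_like_junk_author(candidate: str) -> bool:
    """Return True when an author candidate looks like code or path noise."""
    text = candidate.strip()
    if not text:
        return True
    if "/" in text or "\\" in text:  # path-like
        return True
    if FILE_NAME.match(text):  # bare file name
        return True
    if CSS_NOISE.search(text):  # CSS selector or declaration residue
        return True
    return False
```

A filter this simple would also reject a person written as `John.Doe`, which is one reason a real filter needs more context than this sketch uses.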

Expected-output fixture changes

Files changed:
- testdata/copyright-golden/authors/boost_xml_author_attr_entities.xml.yml
- testdata/copyright-golden/authors/boost_docbook_authorgroup.html.yml
- testdata/copyright-golden/copyrights/boostbook_css_noise.css.yml
- testdata/copyright-golden/copyrights/misco4/linux-copyrights/drivers/gpu/drm/amd/include/atombios.h.yml
- testdata/copyright-golden/copyrights/misco4/linux-copyrights/drivers/gpu/drm/radeon/atombios.h.yml
- testdata/copyright-golden/copyrights/misco4/linux-copyrights/drivers/media/usb/dvb-usb/af9005-script.h.yml

Why the new expected output is correct:
- the new Boost fixtures lock in the compare-run regressions this branch fixed
- the updated Linux fixture expectations now match the improved shared author filter, which correctly treats those prior “authors” as code/path noise rather than people

Improve shared copyright and author detection exposed by the Boost compare run by extracting markup authors directly, suppressing CSS and prose noise, and adding durable golden regressions. Record the verified C++ scorecard result so the compare evidence and intentional differences stay documented.

mstykow merged commit 08cb04a into main Apr 4, 2026
13 checks passed

mstykow deleted the fix/boost-markup-copyright-regressions branch April 4, 2026 14:07
