Redesign of repo tooling for mixed URI prefixes in 7.1#729
Redesign of repo tooling for mixed URI prefixes in 7.1#729tychonievich wants to merge 7 commits intov7.1from
Conversation
Closes #727 Unblocks #722 and #723 There's a lot going on here, and I spent several weeks trying to minimize this PR, split it into pieces, etc, but couldn't find a way that worked. So here's an outline of what's changed. 1. In the spec, many tables (notably enumsets) now have an explicit URI entry. I made the URIs their own column, not the two-line cells used for URIs of events and attributes earlier, mostly because I thought we have space. We can refactor that into using <br> if we prefer. 2. I re-wrote `build/uri-def.py` from the ground up as `build/extract-yaml-tsv.py`. `uri-def.py` was a 50-line script I wrote that I slowly changed into a 500-line monstristy that I find almost impossible to read or maintain. And it had many assumptions about the spec's format not only in its code (that's inevitable when parsing human text into machine-readable files) ubt also in its design. Trying to add the more versatile URIs was such a headache I decided to use it as an excuse for a long-verdue re-write. `extract-yaml-tsv.py` uses classes to have the code mirror the files it is designed to produce. It uses type annotations (which pass `mypy`) to avoid various typing bugs. It tries to do one thing in one place instead of doing all things possible everywhere. It's 100 lines longer than `uri-def.py` but faster to run and easier to maintain. I spent some effort to match ideosynracies of `uri-def.py`, like when to use `strings`, `'strings'`, or `"strings"` in the emitted YAML; but sometimes dong that was not possible without replicating `uri-def.py`'s worse parts, so there are some trivial changes (notably in the order of lists) in the files it creates. 3. I (accidentally) fixed various bugs of omission. `uri-def.py` was sloppy in how it handled the textual explantions in sections that had either gedstruct rules with alternatives or tables. Fixing that with a more principled approach means adding extra `specification:` entries for around half of all YAML files. 4. I tried but failed to match the sorting of the various .tsv files from `uri-def.py`, which was based on some kind of sorting of structures before they were used to generate multiple lines each. I was tempted to scrap that and simply sort the final lines, which is easy and well-defined, but this commit does not do that, instead doing a half-hearted effort at mimickiing the old sorting. I did all this on 7.1 because that was the original motivation, but if we like it I can port it back to the 7.0 branch (either by adding explicit URIs to the 7.0 spec or by adding special-case handling of tag-to-URI) with relative ease.
|
|
||
| type: structure | ||
|
|
||
| uri: https://gedcom.io/terms/v7/REPO |
There was a problem hiding this comment.
If the payload URI (line 23) is a v7.1 URI, shouldn't the structure URI also be v7.1?
There was a problem hiding this comment.
Discussion in GEDCOM Steering Committee meeting:
- The script should detect this and report it as an error
- Luther will update this PR to do so
There was a problem hiding this comment.
Pull request overview
This PR redesigns the repository tooling to support mixed URI prefixes (g7: and g7.1:) in version 7.1 of the GEDCOM specification. The changes address issue #727 by making URI handling more explicit and flexible.
Changes:
- Adds explicit URI columns to enumeration and month tables in specification markdown files
- Completely rewrites
uri-def.pyasextract-yaml-tsv.pywith better structure using classes and type annotations - Fixes bugs where textual explanations were omitted from YAML files
Reviewed changes
Copilot reviewed 140 out of 140 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| specification/gedcom-6-appendix-calendars.md | Added URI column to month tables for all calendars |
| specification/gedcom-3-structures-4-enumerations.md | Added URI column to all enumeration value tables and removed concatenation text |
| specification/gedcom-3-structures-3-meaning.md | Added deprecation notice for HEAD.SOUR.DATA |
| build/extract-yaml-tsv.py | New Python script with class-based design to extract YAML/TSV files from spec |
| build/uri-def.py | Deleted (replaced by extract-yaml-tsv.py) |
| build/Makefile | Updated to use new extract-yaml-tsv.py script |
| extracted-files/* | Updated YAML and TSV files with new URIs, labels, and specification text |
| out.count('[') | ||
|
|
There was a problem hiding this comment.
Line 41 has a statement out.count('[') that has no effect. This appears to be leftover debug code that should be removed.
| out.count('[') | |
| for k,v in pfx.items(): | ||
| md = re.sub(rf'\b{k}', v, md) |
There was a problem hiding this comment.
The variable pfx is used on line 29 but is not defined in the function signature or within the function. This appears to be a global variable reference that should be passed as a parameter to maintain proper function design.
Co-authored-by: Dave Thaler <dthaler1968@gmail.com>
Co-authored-by: Dave Thaler <dthaler1968@gmail.com>
Co-authored-by: Dave Thaler <dthaler1968@gmail.com>
Co-authored-by: Dave Thaler <dthaler1968@gmail.com>
Co-authored-by: Dave Thaler <dthaler1968@gmail.com>
Co-authored-by: Dave Thaler <dthaler1968@gmail.com>
Per committee discussion summarized at FamilySearch/GEDCOM#729 (comment) Signed-off-by: Dave Thaler <dthaler@armidalesoftware.com>
* substructures.tsv can have empty string in first column Per committee discussion summarized at FamilySearch/GEDCOM#729 (comment) Signed-off-by: Dave Thaler <dthaler@armidalesoftware.com> * Apply suggestion from @dthaler --------- Signed-off-by: Dave Thaler <dthaler@armidalesoftware.com> Co-authored-by: Dave Thaler <dthaler@armidalesoftware.com>
|
Based on FamilySearch/GEDCOM.io#187, this should be refactored to split out the YAML and TSV processing, extracting the YAML first and then creating the TSV from the YAML, using the same YAML-to-TSV script for this repo and the registries repo. |
Closes #727
Unblocks #722 and #723
There's a lot going on here, and I spent several weeks trying to minimize this PR, split it into pieces, etc, but couldn't find a way that worked. So here's an outline of what's changed.
In the spec, many tables (notably enumsets) now have an explicit URI entry.
I made the URIs their own column, not the two-line cells used for URIs of events and attributes earlier, mostly because I thought we have space. We can refactor that into using
<br>if we prefer.I re-wrote
build/uri-def.pyfrom the ground up asbuild/extract-yaml-tsv.py.uri-def.pywas a 50-line script I wrote that I slowly changed into a 500-line monstristy that I find almost impossible to read or maintain. And it had many assumptions about the spec's format not only in its code (that's inevitable when parsing human text into machine-readable files) but also in its design. Trying to add the more versatile URIs was such a headache I decided to use it as an excuse for a long-overdue re-write.extract-yaml-tsv.pyuses classes to have the code mirror the files it is designed to produce. It uses type annotations (which passmypy) to avoid various typing bugs. It tries to do one thing in one place instead of doing all things possible everywhere. It's 100 lines longer thanuri-def.pybut faster to run and easier to maintain.I spent some effort to match idiosyncrasies of
uri-def.py, like when to usestrings,'strings', or"strings"in the emitted YAML; but sometimes dong that was not possible without replicatinguri-def.py's worse parts, so there are some trivial changes (notably in the order of lists) in the files it creates.I (accidentally) fixed various bugs of omission.
uri-def.pywas sloppy in how it handled the textual explanations in sections that had either gedstruct rules with alternatives or tables. Fixing that with a more principled approach means adding extraspecification:entries for around half of all YAML files.I tried but failed to match the sorting of the various .tsv files from
uri-def.py, which was based on some kind of sorting of structures before they were used to generate multiple lines each. I was tempted to scrap that and simply sort the final lines, which is easy and well-defined, but this commit does not do that, instead doing a half-hearted effort at mimicking the old sorting.I did all this on 7.1 because that was the original motivation, but if we like it I can port it back to the 7.0 branch (either by adding explicit URIs to the 7.0 spec or by adding special-case handling of tag-to-URI) with relative ease.