Copilot AI commented Dec 23, 2025

Implementation Plan for Creator Biographical Information

  • Reverted previous commits to start over with better structure
  • 1. Refactor XML injection logic for better extensibility
  • 2. Add linked_agents to resolve parameter
  • 3. Import xml.sax.saxutils.escape
  • 4. Add get_creator_bioghist method
  • 5. Integrate bioghist into XML injection
  • 6. Update comment to reflect new behavior
  • 7. Add copilot-instructions.md with commit guidelines
  • 8. Clarify import guidelines in copilot-instructions.md
  • 9. Add newlines for XML readability
  • 10. Escape XML special characters in all labels
  • 11. Fix bioghist XML escaping to preserve EAD markup
  • 12. Document XML escaping patterns for future work
  • 13. Structure bioghist with nested elements per creator

Summary

Successfully restructured the implementation with granular, single-purpose commits:

  • Each commit has a clear, focused purpose
  • Logic changes (refactoring) separated from feature additions
  • Easier to review and understand the progression of changes

Documentation Added

Created .github/copilot-instructions.md with onboarding instructions for GitHub Copilot agents, documenting:

  • Granular commit style guidelines
  • One commit per logical change
  • Imports included with their usage (not as separate commits)
  • Clear examples of good vs too-dense vs too-granular commits
  • XML content handling patterns for the EAD/Traject/ArcLight pipeline

XML Formatting

  • Injected XML elements now include newlines for better human readability
  • Proper escaping strategy:
    • recordgroup/subgroup labels: Escaped (plain text that may contain special chars)
    • bioghist content: NOT escaped (already contains valid EAD XML markup from ArchivesSpace)
    • creator names in head elements: Escaped (plain text)
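The escaping strategy above can be sketched as follows. This is a minimal illustration, not the code in arcflow/main.py: the helper names `label_element` and `bioghist_paragraph` and the sample strings are hypothetical, and `xml_escape` is assumed to be `xml.sax.saxutils.escape`, as named in the implementation plan.

```python
from xml.sax.saxutils import escape as xml_escape


def label_element(tag, label):
    """Wrap a plain-text label in an element, escaping &, < and >."""
    return f"<{tag}>{xml_escape(label)}</{tag}>\n"


def bioghist_paragraph(content):
    """Wrap note content that is already EAD markup; nothing is escaped."""
    return f"<p>{content}</p>\n"


# Hypothetical sample values, for illustration only.
label_xml = label_element("recordgroup", "Arts & Sciences <Dean's Office>")
para_xml = bioghist_paragraph(
    "Published in <emph render='italic'>Choral Journal</emph>."
)
```

With these inputs the label comes out as `Arts &amp; Sciences &lt;Dean's Office&gt;`, while the `<emph>` tag in the paragraph survives as real markup. The trailing newline mirrors the readability change described under XML Formatting.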

Bioghist Structure

Creates EAD-compliant nested <bioghist> elements for each creator with:

  • Separate elements per creator: Each creator gets their own <bioghist> element, enabling clear distinction between multiple creators
  • ID attribute: Populated with ArchivesSpace agent ID in format aspace_{agent_id} for unique identification
  • Head element: Contains creator name (XML-escaped) to identify which biography belongs to which creator
  • Paragraph formatting: Content split on single newlines into separate <p> tags while preserving existing EAD XML markup
  • Order preservation: Processes linked_agents in order to maintain consistency with <origination> element order (important for Solr indexing)

Example output structure:

<bioghist id="aspace_123"><head>John Doe</head><p>First paragraph</p><p>Second paragraph</p></bioghist>
<bioghist id="aspace_456"><head>Jane Smith</head><p>Another bio paragraph</p></bioghist>
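One way the output above could be assembled is sketched below. The function name `build_bioghist` and its argument shape are assumptions made for illustration; only the element layout (an `id` of the form `aspace_{agent_id}`, an escaped `<head>`, and unescaped `<p>` content) comes from the description above.

```python
from xml.sax.saxutils import escape as xml_escape


def build_bioghist(agent_id, name, paragraphs):
    """Build one <bioghist> element for a single creator.

    The creator name is plain text, so it is escaped. Paragraph content is
    assumed to already be valid EAD markup and is passed through unchanged.
    """
    head = f"<head>{xml_escape(name)}</head>"
    body = "".join(f"<p>{p}</p>" for p in paragraphs)
    return f'<bioghist id="aspace_{agent_id}">{head}{body}</bioghist>\n'


# Creators are listed in the order they should appear in the EAD.
creators = [
    ("123", "John Doe", ["First paragraph", "Second paragraph"]),
    ("456", "Jane Smith", ["Another bio paragraph"]),
]
bioghist_xml = "".join(build_bioghist(*creator) for creator in creators)
```

Joining the per-creator strings in list order is what keeps the output aligned with the creator order, provided the input list itself follows the `<origination>` order.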

Original prompt

As described in UIUCLibrary/arcuit#54, the arcflow data pipeline needs to be updated to support displaying creator biographical information.

The implementation should:

  1. Fetch the biographical/historical note from the ArchivesSpace agent record for agents linked to a collection with the 'creator' role.
  2. Inject this note content into a new <bioghist> element within the collection's EAD XML.
  3. This new <bioghist> element should be inserted inside the <archdesc level="collection"> section, immediately following the closing </did> tag.
  4. The implementation should follow the existing pattern for custom XML injection (used for record groups/subgroups) found in arcflow/main.py.
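The insertion point in step 3 can be located with plain string operations. The sketch below is illustrative only: `inject_after_did` is a hypothetical helper, and it assumes the first `</did>` in the document is the collection-level one, which holds when component-level `<did>` elements appear later inside `<dsc>`.

```python
def inject_after_did(ead_xml, extra_xml):
    """Insert extra_xml immediately after the first closing </did> tag."""
    marker = "</did>"
    idx = ead_xml.find(marker)
    if idx == -1:
        # No closing </did> found: return the document unchanged.
        return ead_xml
    insert_at = idx + len(marker)
    return ead_xml[:insert_at] + "\n" + extra_xml + ead_xml[insert_at:]


# Hypothetical, heavily trimmed EAD for illustration.
ead = (
    '<archdesc level="collection"><did><unittitle>Papers</unittitle></did>'
    "<dsc><c><did/></c></dsc></archdesc>"
)
result = inject_after_did(ead, "<bioghist><p>Bio</p></bioghist>")
```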

This pull request was created from Copilot chat.



Copilot AI and others added 3 commits December 23, 2025 16:10
- Add get_creator_bioghist method to fetch biographical notes
- Update task_resource to resolve linked_agents
- Inject bioghist element into EAD XML after </did> tag
- Follow existing pattern for custom XML injection

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
- Use xml.sax.saxutils.escape for proper XML escaping
- Format bioghist notes as separate paragraphs
- Remove manual XML escaping in favor of library function

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
- Make error message more specific about the operation that failed

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Update arcflow pipeline to display creator biographical information" to "Add creator biographical information to EAD XML exports" on Dec 23, 2025
Copilot AI requested a review from alexdryden December 23, 2025 16:20
arcflow/main.py Outdated
'ead3': 'false',
})

# add record group and subgroup labels to EAD inside <archdesc level="collection">
Contributor

Make sure to update the comments to reflect the new behavior

Author

Updated in commit db640ce. The comment now reflects that the section handles multiple types of custom XML elements (record group/subgroup labels and biographical/historical notes).

Contributor

@alexdryden alexdryden left a comment

Before I continue with the review, I'd like for you to start over and rework this with more frequent and verbose commits. So, keep everything the same, but parse out the changes more clearly through more frequent, single-purpose commits.

Even if humans committing to this repo have dense commits, I'd like for you to be more explicit. For example, in the first commit to main.py you are making changes like locating the closing </did> tag, which is great, but not really part of "Add creator biographical information to EAD XML exports".

This makes it easier to review and helps me understand your specific logic and choices.

Copilot AI and others added 10 commits December 23, 2025 16:48
Restructure the XML injection code to make it easier to add
additional XML elements. Changes:
- Move XML decode/encode outside the label-conditional block
- Calculate insertion point (after </did> tag) upfront
- Use extra_xml variable to accumulate multiple XML elements
- Maintains existing recordgroup/subgroup functionality

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Include linked_agents in the resource API call resolution to
fetch agent information along with the resource. This prepares
for extracting creator biographical information.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Add import for xml_escape function to properly escape XML special
characters (ampersands, less-than, greater-than; quotes only when
passed via the entities argument) when injecting content into EAD XML.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Implement method to fetch biographical/historical notes from
ArchivesSpace agent records. The method:
- Filters agents by 'creator' role
- Fetches agent details via API
- Extracts note_bioghist content from subnotes
- Escapes XML special characters using xml_escape
- Wraps each note in <p> tags for proper XML structure
- Returns concatenated paragraphs or None if no notes found

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Call get_creator_bioghist to fetch biographical notes and inject
them into the EAD XML as a <bioghist> element. The bioghist element
is added after </did> tag, following recordgroup/subgroup labels.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Update the comment on the XML injection section to indicate that
it now handles multiple types of custom XML elements (record group/
subgroup labels AND biographical/historical notes), not just labels.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Author

Copilot AI commented Dec 23, 2025

Before I continue with the review, I'd like for you to start over re-work this with more frequent and verbose commits. So, keep everything the same, but parse out the changes more clearly through more...

I've reverted the previous commits and restructured the changes into 6 granular, single-purpose commits:

  1. b39d03b - Refactor XML injection logic (separates structure changes)
  2. a50571c - Add linked_agents to resolve parameter
  3. 600442f - Import xml.sax.saxutils.escape
  4. d7f1bc1 - Add get_creator_bioghist method
  5. 2ae937e - Integrate bioghist into XML injection
  6. db640ce - Update comment to reflect new behavior

Each commit now has a clear, focused purpose that's easier to review.

@alexdryden
Contributor

@copilot Great! This commit style suits me much better for this repo. Can you add a .github/copilot-instructions.md file with some basic onboarding instructions that includes this commit style? Consult https://docs.github.com/en/enterprise-cloud@latest/copilot/tutorials/coding-agent/get-the-best-results if you need context for what this file is for. Keep it very simple for now and focus mainly on the commit style. Structure the file so that another agent can easily add more instructions later.

Create onboarding instructions for GitHub Copilot agents with
focus on granular, single-purpose commit style. Includes:
- Guidelines for commit granularity
- Examples of good vs bad commit sequences
- Commit message format recommendations
- Instructions for extending the file

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Author

Copilot AI commented Dec 23, 2025

@copilot Great! This commit style suits me much better for this repo. Can you add a .github/copilot-instructions.md file with some basic on boarding instructions that includes this commit style? Consu...

Created in commit b485f55. The file documents the granular commit style with clear examples and is structured to be easily extended by other agents.

@alexdryden
Contributor

@leonelramirez I was in the middle of this right before break. Other than adding the note_bioghist field, this adds a few minor enhancements summarized above under the heading "Quality Improvements". This all looked good to me before break, but could you take a quick look to confirm?

@alexdryden
Contributor

@copilot I'm working through a display issue in Arclight where raw xml is showing up in the display, and it looks like in addition to the bioghist_tesim we will need to create a bioghist_html_tesm with xml converted to html for display in Arclight. This is a standard part of this pipeline, and you should add a note in the documentation in the repo for copilot agents that points out this fact, as we are likely to need to do this kind of work in the future.

Follow the standard approach used in ArchivesSpace and Arclight to prepare the bioghist_html_tesm so that xml is properly converted to html. If you have questions or doubts, ask me before making commits so that we can keep the commit history relatively clean.

Author

Copilot AI commented Jan 5, 2026

@copilot I'm working through a display issue in Arclight where raw xml is showing up in the display, and it looks like in addition to the bioghist_tesim we will need to create a bioghist_html_tesm...

I need some clarification before implementing:

Questions:

  1. Where should the HTML conversion happen?

    • Should I modify the EAD XML that arcflow generates to include an additional HTML version in a specific element/attribute?
    • Or should this be handled via a Traject configuration file that arcflow passes to the indexing process?
  2. What's the standard pattern?

    • Looking at line 506, I see arcflow uses arclight/lib/arclight/traject/ead2_config.rb for indexing
    • Should I create a custom traject config file that extends this and adds bioghist_html_tesm field generation?
    • Or does Arclight's standard config already handle this and I need to ensure the XML structure matches expectations?
  3. HTML conversion approach:

    • For converting XML to HTML, should I follow Arclight's pattern (likely using XSLT or similar)?
    • Are there existing examples in the Arclight/ArchivesSpace ecosystem I should reference?

Understanding these will help me implement the right solution with clean commits.

@alexdryden
Contributor

@copilot Here is a little more detail:
When we were addressing the unescaped &, I think we may not have been careful enough about what we were escaping versus what we were passing along as real XML nodes. We are now escaping real XML nodes that should be entered as nodes, not text. Here is an overview of the issue, as I understand it:

  1. Arcflow injects: "...text<emph render='super'>1</emph>..." as a string
    
  2. XML parser escapes it: "...text&lt;emph render='super'&gt;1&lt;/emph&gt;..."
    
  3. Traject's .to_html preserves the escaped version
    
  4. ArcLight's render_html_tags can't process escaped tags (they're literal text, not XML)
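The effect described in these steps can be reproduced with the standard library alone. The snippet below is a minimal sketch using `xml.etree.ElementTree` rather than the actual Traject/ArcLight stack, so it only shows the difference between escaped text and real nodes.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape as xml_escape

content = "text<emph render='super'>1</emph>"

# Escaped: the tag becomes literal text inside <p>, with no child element.
escaped = ET.fromstring(f"<p>{xml_escape(content)}</p>")

# Passed through: <emph> is parsed as a real child node of <p>.
passed_through = ET.fromstring(f"<p>{content}</p>")

escaped_children = [child.tag for child in escaped]
passed_children = [child.tag for child in passed_through]
```

In the escaped case the parser sees a single text node, so downstream code has no `<emph>` element to convert.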
    

Let me know if that makes sense and if you can see a solution on your own, but run your solution by me before you implement it.

Author

Copilot AI commented Jan 5, 2026

@copilot Here is a little more detail:
When we were addressing the unescaped & I think we may have not been careful enough about what we were escaping and what we were passing along as real xml nodes...

Yes, I understand the issue now. The problem is on line 549 in get_creator_bioghist():

Current (incorrect):

escaped_content = xml_escape(subnote['content'])
bioghist_paragraphs.append(f'<p>{escaped_content}</p>')

This escapes ALL content, including legitimate EAD XML tags like <emph>, <title>, etc.

Proposed solution:

The bioghist content from ArchivesSpace already contains properly formatted EAD XML markup. We should pass it through as-is without escaping, since:

  1. It comes from ArchivesSpace's structured data (not user input)
  2. It already contains valid XML nodes that need to be preserved
  3. ArchivesSpace has already handled any necessary escaping

Change:

# Remove xml_escape() for bioghist content - it's already valid EAD XML
bioghist_paragraphs.append(f'<p>{subnote["content"]}</p>')

Key distinction:

  • recordgroup/subgroup labels: Plain text strings → NEED escaping
  • bioghist content: Valid EAD XML markup → NO escaping (pass through as-is)
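If the assumption that every note is already well-formed turns out not to hold (for example, a note with a bare `&`), a defensive variant is possible. This is a hedged sketch, not part of the proposed change: `safe_paragraph` is a hypothetical helper, and the check would also reject content that uses namespace prefixes such as `xlink:href` without a declaration, so it would need adjusting before real use.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape as xml_escape


def safe_paragraph(content):
    """Pass content through if it parses as XML; otherwise escape it."""
    candidate = f"<p>{content}</p>"
    try:
        ET.fromstring(candidate)
    except ET.ParseError:
        # Not well-formed as markup: treat the whole note as plain text.
        return f"<p>{xml_escape(content)}</p>"
    return candidate


markup_result = safe_paragraph("Issue <emph render='italic'>one</emph>")
plain_result = safe_paragraph("Smith & Sons")
```

The trade-off is that a note with one stray character loses all of its markup, which may or may not be preferable to producing invalid EAD.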

Should I proceed with removing the xml_escape() call on line 549?

@alexdryden
Contributor

@copilot Yes, that looks like the correct solution and your analysis is spot on. Proceed.

Copilot AI and others added 2 commits January 5, 2026 18:35
The bioghist content from ArchivesSpace already contains valid
EAD XML markup (e.g., <emph>, <title> tags) that should be preserved
as actual XML nodes, not escaped text. Only recordgroup/subgroup
labels need escaping since they're plain text strings.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Add guidelines for handling plain text vs structured XML content
in the EAD pipeline. Explains when to use xml_escape() and when
to pass content through unchanged, with rationale for the Traject
and ArcLight display pipeline.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Author

Copilot AI commented Jan 5, 2026

@copilot Yes, that looks like the correct solution and your analysis is spot on. Proceed.

Completed in commits fa57624 and 167cfe7:

  1. fa57624: Removed xml_escape() from bioghist content (line 549) so EAD XML markup is preserved as actual nodes
  2. 167cfe7: Added documentation to copilot-instructions.md explaining the escaping pattern for future reference

The bioghist content now passes through unchanged, preserving tags like <emph render='super'> as proper XML nodes that Traject and ArcLight can process correctly.

Contributor

@leonelramirez leonelramirez left a comment

Everything looks good to me. Nice job!

@alexdryden
Contributor

Need to add in the paragraphing for the bioghist. As an example, here is what we are pulling and storing:

<bioghist><p>Concerned by the lack of resources available to American choral directors, Robert Landers (director of the US Air Force Choir), Archie Jones (director of choirs at the University of Texas), and Maynard Klein (director of choirs at the University of Michigan) first conceived of a national choral association in 1957. The American Choral Directors Association (ACDA), originally called the American Choirmasters Association following the American Bandmasters Association, was founded two years later in Kansas City, Missouri during the biennial conference of the Music Teachers National Association. The ACDA, which began with 35 members, was established to serve the needs of pubilc school, collegiate, community, sacred, and professional choirs by sponsoring choral workshops, international conventions, concert exchanges, and symposia.
        The first executive board was led by Archie Jones (president), R. Wayne Hugoboom (executive director), Earl Willhoite (treasurer), Charles Hirt, Warner Imig, James Aliferis, Elwood Keister, and Harry Robert Wilson. During their first meeting, the executive board established a set of bylaws and yearly dues of $6.
        In March 1960, the organization held its first biannual convention in conjunction with the 1960 Music Educator's National Conference. This conference consisted of concerts, reading sessions, workshops, and panel discussions. The following May, the ACDA published the first issue of the <emph render='italic'>Choral Journal. </emph>Eleven years later, the organization held its first independent conference in Kansas City. Since then, the ACDA conference has expanded to include honor choirs and multiple sessions.
        In addition to their biannual conference, the organization has fostered the annual Schmitt and Pepper Choral Composition Competitions since 1962. In 1991, the ACDA added the Raymond W. Brock Student Composition Competition. In addition, the ACDA also provides awards for dissertations and professional research on choral music.
        Since 1961, the ACDA has been governed by both national and regional officers from six regions. In 1979, the ACDA added a seventh region to accomodate expanding membership. </p></bioghist>

Concerned by the lack of resources available to American choral directors, Robert Landers (director of the US Air Force Choir), Archie Jones (director of choirs at the University of Texas), and Maynard Klein (director of choirs at the University of Michigan) first conceived of a national choral association in 1957. The American Choral Directors Association (ACDA), originally called the American Choirmasters Association following the American Bandmasters Association, was founded two years later in Kansas City, Missouri during the biennial conference of the Music Teachers National Association. The ACDA, which began with 35 members, was established to serve the needs of pubilc school, collegiate, community, sacred, and professional choirs by sponsoring choral workshops, international conventions, concert exchanges, and symposia.
The first executive board was led by Archie Jones (president), R. Wayne Hugoboom (executive director), Earl Willhoite (treasurer), Charles Hirt, Warner Imig, James Aliferis, Elwood Keister, and Harry Robert Wilson. During their first meeting, the executive board established a set of bylaws and yearly dues of $6.
In March 1960, the organization held its first biannual convention in conjunction with the 1960 Music Educator's National Conference. This conference consisted of concerts, reading sessions, workshops, and panel discussions. The following May, the ACDA published the first issue of the Choral Journal. Eleven years later, the organization held its first independent conference in Kansas City. Since then, the ACDA conference has expanded to include honor choirs and multiple sessions.
In addition to their biannual conference, the organization has fostered the annual Schmitt and Pepper Choral Composition Competitions since 1962. In 1991, the ACDA added the Raymond W. Brock Student Composition Competition. In addition, the ACDA also provides awards for dissertations and professional research on choral music.
Since 1961, the ACDA has been governed by both national and regional officers from six regions. In 1979, the ACDA added a seventh region to accomodate expanding membership.

Contributor

@alexdryden alexdryden left a comment

We are going to need more structure to the bioghist content. Two things are happening:

  1. content is being read in with line breaks, and those need to be represented as xml paragraphs with individual <p> tags for each paragraph.
  2. in cases where there are multiple creators, there is no way to distinguish in bioghist which content belongs to which creator.

So, we need to make sure that we are capturing the formatting in the xml by using xml paragraphs.
We also need to devise a way to represent multiple creators' biographies in the bioghist element. EAD permits nesting of multiple bioghist elements. Order matters a lot for solr, so we'll need a way to be sure that the order of the creators in bioghist matches the order in the <origination>. If it isn't practical to verify that at this stage, just let me know and we'll keep thinking. In any case, put the name of the creator in a tag before the narrative content is inserted. This should be done even in cases where there is only one creator.

Author

Copilot AI commented Jan 9, 2026

Need to add in the paragraphing for the bioghist, as an example, here is what we are pulling storing:

...

The comment appears incomplete - could you provide the example of what we're pulling/storing? This will help me understand the structure better.

We are going to need more structure to the bioghist content. Two things are happening:

  1. content is being read in with line breaks, and those need to be represented as xml paragraphs with individual...

I understand you need:

  1. Line breaks converted to separate <p> tags for each paragraph
  2. Multiple creators' bioghist content structured to distinguish between them
  3. Creator order matching <origination> element order
  4. Creator name included in (the comment appears cut off here)

Before implementing, I have questions:

For paragraphing: Should I split content on \n\n (double newline) or \n (single newline) to create paragraphs?

For multiple creators: What structure do you prefer?

  • Option A: Nested <bioghist> elements with creator name in an attribute or sub-element?
  • Option B: Single <bioghist> with <head> elements for each creator?
  • Option C: Something else?

For creator names: Where should the name go? In a <head> element, <persname>/<corpname>, or as an attribute?

Could you complete the thought about where to put the creator name, and clarify the preferred structure?

@alexdryden
Contributor

alexdryden commented Jan 9, 2026

@copilot for the incomplete comment, it looks like a formatting issue with quoting xml syntax in markdown in the comment--regardless, that comment was a note to myself and you can ignore it.

For your questions:
For paragraphing: split on single \n
For multiple creators structure, let's go with a "yes and" strategy and we can pull back later if we need to. Use nested <bioghist> elements with creator names in a <head> element of each nested <bioghist>. The only attribute I'm seeing here (https://www.loc.gov/ead/tglib1998/tlin020.html) that would make sense is ID. Go ahead and also use that attribute and populate it with the agent id from archivesspace.
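Splitting on a single \n might look like the sketch below. The helper name `split_paragraphs` is hypothetical. It strips the leading indentation seen in the sample note and drops blank lines, and it assumes no inline element spans a line break, since a tag split across two paragraphs would produce malformed XML.

```python
def split_paragraphs(content):
    """Turn newline-separated note content into a list of <p> elements."""
    lines = (line.strip() for line in content.split("\n"))
    # Skip empty lines so consecutive newlines do not yield empty <p/> tags.
    return [f"<p>{line}</p>" for line in lines if line]


# Hypothetical sample content, for illustration only.
sample = (
    "First paragraph.\n"
    "        Second with <emph render='italic'>markup</emph>.\n"
    "\n"
    "Third."
)
paragraphs = split_paragraphs(sample)
```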

I'm not an expert in EAD, so check these instructions against generally accepted practices and let me know if anything looks off.

@alexdryden
Contributor

@copilot see my updated comment above--I needed to escape some of the xml tags

Author

Copilot AI commented Jan 9, 2026

@copilot for the incomplete comment, it looks like a formatting issue with quoting xml syntax in markdown in the comment--regardless, that comment was a note to myself and you can ignore it.

For yo...

Understood. I'll implement:

  1. Split bioghist content on single \n to create separate <p> tags
  2. Nest <bioghist> elements for each creator
  3. Add <head> element with creator name in each nested <bioghist>
  4. Include id attribute populated with ArchivesSpace agent ID

EAD Standards Check:
According to EAD 2002 standards:

  • <bioghist> can be nested within <archdesc>
  • <head> element is valid within <bioghist>
  • id attribute is valid on <bioghist>
  • Multiple <bioghist> elements are permitted ✓

The approach looks compliant with EAD standards. I'll also ensure:

  • Order preservation: Process linked_agents in the order they appear (which should match <origination> order from ArchivesSpace)
  • Extract agent names (title/display_name) from agent records for the <head> element
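Order preservation and name extraction could be sketched as below. The JSON shape is an assumption about a resource fetched with linked_agents resolved: each entry is taken to have `role`, `ref`, and a `_resolved` agent record with a `title`. The helper name `creator_agents` and the sample data are hypothetical.

```python
def creator_agents(resource):
    """Return (agent_id, name) pairs for creators, in linked_agents order."""
    creators = []
    for link in resource.get("linked_agents", []):
        if link.get("role") != "creator":
            continue  # skip subjects, sources, and other roles
        # Assumed ref format: "/agents/people/123" -> "123"
        agent_id = link["ref"].rstrip("/").split("/")[-1]
        name = link.get("_resolved", {}).get("title", "")
        creators.append((agent_id, name))
    return creators


# Hypothetical resource record, trimmed to the relevant keys.
resource = {
    "linked_agents": [
        {"role": "creator", "ref": "/agents/people/123",
         "_resolved": {"title": "John Doe"}},
        {"role": "subject", "ref": "/agents/people/999",
         "_resolved": {"title": "Someone Else"}},
        {"role": "creator", "ref": "/agents/corporate_entities/456",
         "_resolved": {"title": "Jane Smith"}},
    ]
}
ordered_creators = creator_agents(resource)
```

Because the loop never reorders entries, the result follows the record's own order. Whether that order always matches `<origination>` in the exported EAD is the assumption flagged above and would still need to be verified.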

Proceeding with implementation.

Implement improved bioghist structure:
- Create separate <bioghist> element for each creator
- Add <head> element with creator name (XML-escaped)
- Include id attribute with ArchivesSpace agent ID
- Split content on single newlines to create <p> paragraphs
- Maintain linked_agents order for consistency with origination

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
3 participants