
Conversation

@brianaydemir
Contributor

@brianaydemir brianaydemir commented Jan 22, 2026

Summary

The main innovation of this PR is introducing a scheme by which a developer can run Pelican's build-and-test workflow on a branch in their fork of the main PelicanPlatform/pelican repo. This includes running the workflow in a version of the testing container that has been updated to take into account changes the branch has made to Pelican's dependencies and the workflows themselves.

As a demonstration and test of this scheme, this PR includes an update to the definition of the testing container so that the build-and-test workflow actually runs the TestHTCondorPlugin unit test, and also fixes the test so that it runs as intended.

The Scheme

Requirements:

  • The developer has a project on hub.opensciencegrid.org in which they can:

    1. Push and pull container images
    2. Create a "robot" account with similar push/pull privileges

The basic idea:

  1. The developer updates build-and-test.yml with their GitHub fork and corresponding project on hub.opensciencegrid.org. (This needs to be done only once, as the change can be merged into the PelicanPlatform/pelican repo.)

  2. The developer enables the running of GitHub Actions on their fork. (They are disabled by default.)

  3. The developer adds PELICAN_HARBOR_ROBOT_USER and PELICAN_HARBOR_ROBOT_PASSWORD repository secrets to their fork, with contents befitting their names.

  4. As needed, the developer creates a new (pre-)release on their fork. The tag should be a semver tag following Pelican's release tagging scheme. The branch should be the one they will be using to create pull requests against PelicanPlatform/pelican. (A rough CLI sketch of steps 3 and 4 follows.)
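
For concreteness, steps 3 and 4 might look something like the following with the GitHub CLI. This is an untested sketch; the fork name, Harbor robot account, tag, and branch are placeholders to substitute with your own.

FORK="<github-user>/pelicanplatform-pelican"

# Step 3: add the Harbor robot credentials as repository secrets on the fork.
gh secret set PELICAN_HARBOR_ROBOT_USER     --repo "$FORK" --body 'robot$<harbor-project>+<robot-name>'
gh secret set PELICAN_HARBOR_ROBOT_PASSWORD --repo "$FORK" --body "$(cat /path/to/robot-password)"

# Step 4: cut a pre-release on the fork from the branch under test, using a
# semver tag that follows Pelican's release tagging scheme (the tag is made up here).
gh release create v7.24.0-rc.1 --repo "$FORK" --target my-feature-branch --prerelease \
    --notes "Fork-only pre-release for exercising the build-and-test workflow"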

Note

To be more precise, the developer will still need to make two PRs against PelicanPlatform/pelican: one to update dependencies or similar, so that a new version of the testing container is built, and a second with all of the changes that need to be tested in that updated container. The point here is to give the developer a way to test the combined effect of both PRs before submitting them.

While this isn't quite what #2502 asked for, I feel like it's close enough.

Other Details

The entirety of the build-and-test workflows could use a bit of love, which I intend to take care of when I tackle #2997.

In the meantime, the number of files that #2995 had to touch is disappointing, so I've gone through and updated everything else to use go.mod and package.json as the authoritative version specifiers for Go and Node.

@brianaydemir brianaydemir added this to the v7.24 milestone Jan 22, 2026
@brianaydemir brianaydemir added bug Something isn't working test Improvements to the test suite infrastructure GitHub Actions, Release management, and CI internal Internal code improvements, not user-facing labels Jan 22, 2026
@github-advanced-security

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

@brianaydemir
Contributor Author

@jhiemstrawisc I would suggest reviewing this PR commit-by-commit, because the commit messages contain more information about what I had in mind for each of the files being touched here. If your eyes glaze over, I'd focus on TestHTCondorPlugin, because that's something that was added specifically to catch release blockers sooner, so it's unfortunate that it hasn't actually been running in our CI workflow.

As I write this comment, the history at https://github.com/brianaydemir/pelicanplatform-pelican/actions has the record of me dogfooding this PR, so to speak.

@brianaydemir brianaydemir marked this pull request as ready for review January 22, 2026 23:47
@brianaydemir
Contributor Author

To be more precise, the developer will still need to make two PRs against PelicanPlatform/pelican

Ironically, this PR isn't structured like that. But perhaps we can take this opportunity to demonstrate an alternative workflow: at reviewer discretion, they can accept the workflow runs from the fork on which the PR is based.

There's also a lesson to be learned here: Take care to structure commits so that there's one that updates dependencies, container image definitions, and the like; and a second that changes code as needed. (I accidentally squashed too many commits together during development….)

Member

@jhiemstrawisc jhiemstrawisc left a comment

Overall this feels pretty solid; most of my comments are minor nits, questions, or additional threads I hope to tempt you into pulling 😉

Can you also document the process for developers to trigger these tests in our CONTRIBUTE.md so its description lives somewhere other than this PR?

This isn't yet a complete review because I haven't yet tested the workflow. I hope to do that shortly.


// Set XRD_PLUGINCONFDIR so XRootD can load the pelican protocol handler
// This allows the cache to handle pelican:// URLs
t.Setenv("XRD_PLUGINCONFDIR", "/Users/bbockelm/projects/xrdcl-curl/build/release_dir/etc/xrootd/client.plugins.d/")
Member

I've been trying to leave breadcrumbs wherever bad AI code was merged, especially in cases where I think a reviewer really should have caught it. It looks like this was merged without a second human review in #2900.

Contributor Author

Is this just an observation, or a request for me to leave a comment somewhere? If the latter, I'm not sure where you had in mind.

go.mod Outdated

go 1.25.0

toolchain go1.25.5
Member

I'm going through commit-by-commit like you suggested, so maybe I'll see the answer to this later, but why is pinning the toolchain to this specific version needed?

I ask because it looks like BrianB recently made the explicit choice to remove the toolchain definition. See:
2f798fa#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6R6

Collaborator

Indeed, I'd like to avoid setting toolchain unless necessary (unless I'm misunderstanding the point of the toolchain line?).

In my understanding, toolchain pins a quite specific version of the toolchain (as in, "you shall use 1.25.5, not 1.25.4 or the latest 1.25.x patch release"). Unless there's a reason why we don't want to pick up things like security fixes in patch releases, I think it's better to keep the line out.

Dug a little in GitHub history and it seems like #2247 introduced pinning the toolchain. It was labelled as a dependabot roll-up ... but it might have been inadvertent?

(Again, this is just based on my understanding of the toolchain line; let me know if I got it wrong)

Contributor Author

So, the goal here is to have one location that defines the Go version to use, and go.mod seems as good a place as any because it already expresses an opinion on that.

I think the actual problem is that unless Go and/or GoReleaser automatically update to the latest version of the toolchain, even when the currently installed version already satisfies any stated requirements (e.g., the go line in go.mod), we've effectively been pinning the toolchain version in images/Dockerfile since at least #1877.

That said, there's an API that can be used to look up what the latest version is.
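
For instance, something like the following, as a rough and untested sketch: it queries https://go.dev/dl/?mode=json (the same endpoint the setup-go action uses), and assumes curl and jq are available and that go.mod carries a three-part go directive.

GO_LANG_VERSION="$(sed -n 's/^go //p' go.mod)"    # e.g. "1.25.0"
GO_MINOR="${GO_LANG_VERSION%.*}"                  # e.g. "1.25"
LATEST="$(curl -fsSL 'https://go.dev/dl/?mode=json' \
    | jq -r --arg m "go${GO_MINOR}." '.[] | select(.stable) | .version | select(startswith($m))' \
    | head -n 1)"
echo "Latest toolchain for go ${GO_MINOR}: ${LATEST}"    # e.g. "go1.25.5"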

Collaborator

Honestly, not 100% sure what Go's behavior is here. One of the nifty (scary?) things introduced last year is the ability to have the runtime and compiler be independent -- go version N compiler can compile code for N+2, for example.

So, despite the CI/CD using 1.25.5 at https://github.com/PelicanPlatform/pelican/blob/main/images/Dockerfile#L39, it's a touch unclear what the behavior is.

If we do want to start freezing the toolchain (which I'm still not 100% convinced on), it seems like we need another metafile so we can set it in precisely one place (contrast with the various XRootD plugins...)



def main():
    with open(DEFAULTS_FILE, mode="r", encoding="utf-8") as fp:
Member

Since you're touching this, is there any reason not to also double check parameters that define osdf_default against config/resources/osdf.yaml?

We also define some other keys in the parameters.yaml file that I'm not totally sure what to do about, validation-wise. They are root_default, client_default and server_default. I don't think we can address these here.

Contributor Author

@brianaydemir brianaydemir Feb 1, 2026

On the one hand, I don't think anything other than default shows up at https://docs.pelicanplatform.org/parameters.

On the other hand, what's the worst that could happen if I add the check:

RuntimeError: 
./config/resources/osdf.yaml: Xrootd.ManagerHost: The 'osdf_default' key is not in './docs/parameters.yaml'
./config/resources/osdf.yaml: Xrootd.SummaryMonitoringHost: The 'osdf_default' key is not in './docs/parameters.yaml'
./config/resources/osdf.yaml: Xrootd.DetailedMonitoringHost: The 'osdf_default' key is not in './docs/parameters.yaml'
./config/resources/osdf.yaml: Federation.DiscoveryUrl: The 'osdf_default' key is not in './docs/parameters.yaml'
./config/resources/osdf.yaml: Federation.TopologyNamespaceUrl: Expected 'https://topology.opensciencegrid.org/osdf/namespaces?production=1', but found 'https://topology.opensciencegrid.org/osdf/namespaces' for 'osdf_default' in './docs/parameters.yaml'
./config/resources/osdf.yaml: Topology.DisableDowntime: The 'osdf_default' key is not in './docs/parameters.yaml'
./config/resources/osdf.yaml: Topology.DisableCacheX509: The 'osdf_default' key is not in './docs/parameters.yaml'
./config/resources/osdf.yaml: Topology.DisableOriginX509: The 'osdf_default' key is not in './docs/parameters.yaml'
./config/resources/osdf.yaml: Topology.DisableCaches: The 'osdf_default' key is not in './docs/parameters.yaml'
./config/resources/osdf.yaml: Topology.DisableOrigins: The 'osdf_default' key is not in './docs/parameters.yaml'

As for the rest, see my responses to other comments.

# If the defaults file has a value, it must match the parameters
# file. The reverse need not be true (the default value may be set
# via other means, i.e., code).
if default != parameters[key]["default"]:
Member

This script still feels inadequate to me since the bulk of our defaults aren't actually defined in config/resources/defaults.yaml.

What if instead the solution did something like:

  1. Build Pelican.
  2. In a clean environment with no existing Pelican configuration, run pelican config dump or pelican config dump --json (whichever format is easier to work with).
  3. Check the output of this against docs/parameters.yaml.

Not only might this work with params that aren't defined via config/resources/defaults.yaml, but because some of our defaults use bashisms that reference other parameters:

name: Origin.Url
description: |+
  The origin's configured URL, as reported to XRootD. This is the file transfer endpoint for the origin.
type: url
default: https://${Server.Hostname}:${Origin.Port}
components: ["origin"]

we could actually evaluate correctness in these more complex cases.

Feel free to say "out of scope," but I thought I'd bring it up since you took the time to clean things up here.
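
Roughly, and only as an untested sketch of steps 1 and 2 (the comparison in step 3 would still need to be written):

# 1. Build Pelican however it is normally built (make / goreleaser / go build).
# 2. Dump the effective defaults from a clean environment, so that nothing from
#    the developer's own configuration or environment variables leaks in.
SCRATCH="$(mktemp -d)"
env -i HOME="${SCRATCH}" PATH="${PATH}" \
    ./pelican config dump --json > "${SCRATCH}/defaults.json"
# 3. Compare "${SCRATCH}/defaults.json" against docs/parameters.yaml, which would
#    also cover computed defaults like https://${Server.Hostname}:${Origin.Port}.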

Contributor Author

Yeah, I would rather defer this work to another issue and PR. It's a good suggestion, but it's significantly beyond the current approach of some basic text comparisons.

Contributor Author

And from another comment:

We also define some other keys in the parameters.yaml file that I'm not totally sure what to do about, validation-wise. They are root_default, client_default and server_default. I don't think we can address these here.

For the moment, I'd like the validation to be of the form: Obvious what it intends to check, obvious that it does so correctly, and (to the cognoscenti) obvious where it is inadequate.

# file. The reverse need not be true (the default value may be set
# via other means, i.e., code).
if default != parameters[key]["default"]:
errors.append(f"Default value does not match the parameters file: {key}")
Member

Can you modify the error message to provide both the deduced and expected values, and leave a note about which two files should be inspected to correct the issue?

If you worry about clutter from printing the filepaths multiple times in the case that multiple defaults fail this check, you could add that part in the join under if errors a few lines below.

Contributor Author

See this comment for the new phrasing.

I'm not concerned about repetitiveness or verbosity because the only time these errors should appear is for newly added parameters (or the mythical day when these checks are improved even further).

#
# Reference: https://docs.docker.com/reference/dockerfile/#automatic-platform-args-in-the-global-scope
#
# ARG BUILDPLATFORM
Member

Not that you touched these here, but it wasn't clear to me why these ARGs are left commented -- are you trying to enumerate the variables used to distinguish cross-compilation stuff like "building on" and "building for"?

Contributor Author

I'll try to clarify the comment.

Basically, I'm too lazy to open up a web browser, so it's sometimes nice to have a "quick reference" or "cheatsheet" of sorts in the file itself.

And if you're not me, I imagine that it might be nice to have all the possible build args listed in a single location, at the top of the file, even if you don't need to ever set them. (The only thing worse than needing to update a magic string is… updating a magic string buried in the middle of a file you otherwise don't care about.)

RUN --mount=type=bind,source=go.mod,target=/context/go.mod \
--mount=type=bind,source=web_ui/frontend/package.json,target=/context/package.json \
--mount=type=cache,id=dnf-${BUILDPLATFORM},target=/var/cache/dnf,sharing=locked \
<<ENDRUN
Member

I think I get what's going on here (keeping image layers clean by avoiding COPY of these context files, providing a common cache for dnf), but I've never had to cook something like this up for myself and definitely don't know how to assess whether this accomplishes its goal, so I'm limited in my ability to review this section very effectively. I trust you know what you're doing!

Contributor Author

A lot of that magic revolves around controlling what the commands being run inside the container being built have access to vs. what files are actually persisted in the final container image.

--mount=type=bind is fairly straightforward in that it just makes files and directories available from the build context (here, the Git repo, essentially) inside the container being built. It's like a COPY line, but without the persistence. (To paraphrase Einstein.)

--mount=type=cache is a bit trickier. Here, I'm using it as a persistent scratch or temp space: Multiple build stages benefit from the dnf cache (it is not exactly fast to rebuild from scratch), but actually including the cache in the final image is a waste of space. Better to let the Docker daemon hold on to that directory itself. And if it loses the cache, that's not a functional or correctness problem.

useradd -m -s /bin/bash alice
# Add some normal user accounts for testing.
# The inspiration for their names: https://en.wikipedia.org/wiki/Alice_and_Bob#Cast_of_characters.
for name in alice bob carol dave; do
Member

Apparently you don't care about security because you didn't let Eve join in the fun 😛

Contributor Author

640k users should be enough for anyone.

if ${PUSH_DEV} || ${PUSH_SERVER}; then
IS_PRIMARY_GITHUB_REPO=false

case '${{ github.repository }}' in
Member

This makes me want to learn more bash!

Contributor Author

Shell quoting is a fraught exercise, doubly so in a workflow like this, where the script that's being run is what we get after GitHub mangles the text.
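
To make the hazard concrete, here is a contrived sketch with a made-up fork name. GitHub substitutes ${{ ... }} expressions textually before the shell ever starts, so the quoting in the workflow file has to survive that substitution:

# What the workflow file says:  case '${{ github.repository }}' in
# What bash actually runs after GitHub's substitution:
case 'someuser/pelicanplatform-pelican' in
    PelicanPlatform/pelican)
        echo "primary repo" ;;
    *)
        echo "a fork" ;;
esac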

# images somewhere.

brianaydemir/pelicanplatform-pelican)
REGISTRY_REPO="brian.aydemir.r1" ;;
Member

Little hard to piece this together -- is REGISTRY_REPO always going to be tied to hub.opensciencegrid.org later on?

Contributor Author

As a practical matter, I would bet yes.

As a theoretical matter, there's no need to assume it. In fact, the way this workflow constructs image names in general is in need of cleaning up, which I'm doing in #3051. Look forward to the "fix" there.

- Update to Python 3.14

- Update the action to trigger on pushes to 'main' and release branches

- Format the scripts using `black`

- Fix assorted logic errors and inadequacies
- Update to v4

- Add TypeScript

- Fix the regex for release branches to allow multi-digit components

- Remove comments that are out-of-date; add a link to GitHub Docs
- Use 'package.json' to determine the Node.js version

- Use 'go.mod' to determine the Go version

- Restrict them to running on the PelicanPlatform/pelican repo
- Use 'go.mod' to determine the Go version

- Update the action to trigger on pushes to 'main' and release branches
- Use 'package.json' to determine the Node.js version

- Use 'go.mod' to determine the Go version

- Update the action to trigger on pushes to 'main' and release branches
- Use 'go.mod' to determine the Go version

- Fix up the Next.js cache action based on the current guide
Also, install HTCondor into the correct container. We need it to be
available in the GitHub Actions build-and-test workflow.

I'm also making some assorted tweaks to try and reduce the layer count of
the final images, while also trying to avoid putting things in layers where
they're not really needed.

I'm also trying to be consistent about creating user accounts before
installing software that might rely on those accounts existing.
- Use 'package.json' to determine the Node.js version

- Use 'go.mod' to determine the Go version

- Set up Go, because the testing container image no longer does so

- Fix up the Next.js cache action based on the current guide
Now that the testing container actually has Condor installed
into it, it turns out that this test was assuming that it was
running as non-root.

When running as root, we need to be more precise about which
user is running what, and what they need access to.

Because we're assuming the existence of an HTCondor installation,
we might as well use its condor_config_val to determine the
LIBEXEC directory, rather than (incorrectly) trying to piece the
path together ourselves.

Also, given how this is run within GitHub Actions, it's arguably
better to depend on how 'go test' runs tests than it is to assume
that the repo checkout has been marked as safe.
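
(As an aside, and only as a rough sketch: condor_config_val ships with HTCondor, so asking it for LIBEXEC avoids hard-coding a path; the directory shown is just an example.)

LIBEXEC_DIR="$(condor_config_val LIBEXEC)"    # e.g. /usr/libexec/condor
ls "${LIBEXEC_DIR}"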
More precisely, the most recent version for the language version specified
in `go.mod`.

While I have some concerns about the underlying API endpoint not being well
documented, it's also used by the setup-go action in our GitHub Actions
workflows, so as a practical matter, any changes to the API would have quite
the blast radius.

An alternative is `dnf install golang-1.25.*` (or similar), but I don't know
how quickly AlmaLinux's AppStream updates.
@brianaydemir brianaydemir force-pushed the actions-fixes branch 4 times, most recently from b315415 to 47d446e Compare February 2, 2026 18:06
@brianaydemir
Contributor Author

brianaydemir commented Feb 3, 2026

I believe I've addressed all the comments, even if I didn't necessarily add a reply. (If there's a nice, concise listing of the comments that fits on one screen, I haven't managed to find it in GitHub's UI….)

I shouldn't have force-pushed over any previously existing commits.

The significant change is getting rid of toolchain in go.mod and using the go.dev API in images/Dockerfile to find the most recent version of the toolchain to install. The configuration for the setup-go action has been updated similarly.



Development

Successfully merging this pull request may close these issues.

PRs that touch Dockerfile should run tests in a container image based on that Dockerfile
