@claudia-lola
Contributor

Add tasks to eessi/configure.yml and compute-init.yml to run the EESSI link_nvidia_host_libraries.sh script on GPU nodes that have NVIDIA drivers installed. The tasks run either when site.yml is run or when a rebuild via Slurm completes.
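
For illustration only, tasks of this kind might look roughly like the sketch below. This is not the code added by this PR: the driver check via `nvidia-smi`, the register name `_nvidia_smi_check`, and the variable `eessi_link_nvidia_script` (standing in for the full path to link_nvidia_host_libraries.sh) are all assumptions.

```yaml
# Hypothetical sketch, not the tasks from this PR.
# Assumes eessi_link_nvidia_script holds the full path to
# link_nvidia_host_libraries.sh on the node.
- name: Check whether NVIDIA drivers are usable on this node
  ansible.builtin.command: nvidia-smi
  register: _nvidia_smi_check
  changed_when: false   # a pure check, never reports a change
  failed_when: false    # non-GPU nodes must not fail the play

- name: Link NVIDIA host libraries into EESSI
  ansible.builtin.command: "{{ eessi_link_nvidia_script }}"
  when: _nvidia_smi_check.rc == 0
```

Gating the second task on the return code of the check means the same task list can be included for every node, with non-GPU nodes simply skipping the link step.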

@claudia-lola claudia-lola requested a review from a team as a code owner October 28, 2025 15:31
@claudia-lola claudia-lola self-assigned this Oct 28, 2025
@claudia-lola
Contributor Author

fat image build

@claudia-lola claudia-lola requested a review from sjpb November 12, 2025 13:45
Collaborator

@sjpb sjpb left a comment

Looks fine, except the built images need adding back into the CI config.

@sjpb sjpb self-requested a review November 19, 2025 09:21
@sjpb
Collaborator

sjpb commented Nov 19, 2025

Failed CI run above (note that previous attempts failed because Cloudflare was down):

  • compute node has reimaged to openhpc-rl9-251112-1307-e34d64c4, so that's ok
  • ansible-init failed:
Nov 19 10:31:27 slurmci-RL9-357-compute-0 ansible-init[34868]: TASK [Add base CVMFS config] ***************************************************
Nov 19 10:31:27 slurmci-RL9-357-compute-0 ansible-init[34868]: fatal: [127.0.0.1]: FAILED! => {"msg": "Unable to look up a name or access an attribute in template string ({{ cvmfs_config | dict2items }}).\nMake sure your variable name does not contain invalid characters like '-': dict2items requires a dictionary, got <class 'ansible.template.AnsibleUndefined'> instead.. dict2items requires a dictionary, got <class 'ansible.template.AnsibleUndefined'> instead.. Unable to look up a name or access an attribute in template string ({{ cvmfs_config | dict2items }}).\nMake sure your variable name does not contain invalid characters like '-': dict2items requires a dictionary, got <class 'ansible.template.AnsibleUndefined'> instead.. dict2items requires a dictionary, got <class 'ansible.template.AnsibleUndefined'> instead."}
Nov 19 10:31:27 slurmci-RL9-357-compute-0 ansible-init[34868]: PLAY RECAP *********************************************************************
Nov 19 10:31:27 slurmci-RL9-357-compute-0 ansible-init[34868]: 127.0.0.1                  : ok=19   changed=3    unreachable=0    failed=1    skipped=29   rescued=0    ignored=0
Nov 19 10:31:27 slurmci-RL9-357-compute-0 ansible-init[34821]: Traceback (most recent call last):
Nov 19 10:31:27 slurmci-RL9-357-compute-0 ansible-init[34821]:  File "/usr/bin/ansible-init", line 91, in <module>
Nov 19 10:31:27 slurmci-RL9-357-compute-0 ansible-init[34821]:    ansible_exec(
Nov 19 10:31:27 slurmci-RL9-357-compute-0 ansible-init[34821]:  File "/usr/bin/ansible-init", line 39, in ansible_exec
Nov 19 10:31:27 slurmci-RL9-357-compute-0 ansible-init[34821]:    subprocess.run([cmd, *args], env = environ, check = True, **kwargs)
Nov 19 10:31:27 slurmci-RL9-357-compute-0 ansible-init[34821]:  File "/usr/lib64/python3.9/subprocess.py", line 528, in run
Nov 19 10:31:27 slurmci-RL9-357-compute-0 ansible-init[34821]:    raise CalledProcessError(retcode, process.args,
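
The log shows `dict2items` receiving an undefined `cvmfs_config`. As a general illustration of that failure mode (not the fix that was applied in this PR, which is not shown in this conversation), a loop over such a variable can be guarded with the `default` filter. The `debug` module and the task name below are placeholders chosen for the sketch.

```yaml
# Illustrative sketch only; the real "Add base CVMFS config" task is not shown here.
# default({}) turns an undefined cvmfs_config into an empty dict,
# so dict2items yields an empty list and the loop runs zero times.
- name: Show each CVMFS config entry (placeholder task)
  ansible.builtin.debug:
    msg: "{{ item.key }}={{ item.value }}"
  loop: "{{ cvmfs_config | default({}) | dict2items }}"
```

Whether silently skipping is the right behaviour depends on the role: if the variable is expected to be set, defining it where the play can see it is the better fix than masking its absence.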

Collaborator

@sjpb sjpb left a comment

My suggested refactoring was wrong, sorry! Let's try again; this will need a new image build.

@claudia-lola claudia-lola requested a review from sjpb November 19, 2025 15:20
Collaborator

@sjpb sjpb left a comment

LGTM

Collaborator

@sjpb sjpb left a comment

LGTM

@sjpb sjpb merged commit 06503bd into main Nov 19, 2025
41 of 42 checks passed
@sjpb sjpb deleted the configure-gpus branch November 19, 2025 15:25
4 participants