-
Notifications
You must be signed in to change notification settings - Fork 6
feat: install the Azure HPC Diagnostics script #76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: install the Azure HPC Diagnostics script #76
Conversation
Reviewer's GuideAdds automated installation of the Azure HPC Diagnostics tool to the role, including package dependencies, archive download, templated patching of the upstream script, relocation of diagnostics output, and documentation of the new hpc_install_diagnostics toggle. Sequence diagram for Azure HPC Diagnostics installation via Ansible rolesequenceDiagram
participant ControlNode
participant TargetVM
participant GitHubAzhpcDiagnostics
ControlNode->>TargetVM: Run role tasks/main.yml
rect rgb(235,235,235)
Note over ControlNode,TargetVM: Install Azure HPC Diagnostics tool block (when hpc_install_diagnostics)
ControlNode->>TargetVM: stat path=__hpc_azure_tools_dir/gather_azhpc_vm_diagnostics.sh
TargetVM-->>ControlNode: __hpc_azure_diags_installed
alt diagnostics_not_installed
ControlNode->>TargetVM: package name=__hpc_azure_diagnostics_packages state=present
loop until_success
TargetVM-->>ControlNode: __hpc_azure_diagnostics_packages_install
end
ControlNode->>TargetVM: include_tasks download_extract_package.yml
ControlNode->>GitHubAzhpcDiagnostics: HTTP GET hpcdiag_archive (url in __hpc_azhpc_diags_info)
GitHubAzhpcDiagnostics-->>ControlNode: tarball
ControlNode->>TargetVM: Upload and extract archive to __hpc_pkg_extracted.path
ControlNode->>ControlNode: tempfile hpc_diags_XXXX.patch
ControlNode->>ControlNode: template azhpc_vm_diagnostics.sh.patch.j2 -> __hpc_diags_patch_file.path
ControlNode->>TargetVM: patch src=__hpc_diags_patch_file.path dest=__hpc_pkg_extracted.path/Linux/src/gather_azhpc_vm_diagnostics.sh
ControlNode->>TargetVM: copy patched script -> __hpc_azure_tools_dir/gather_azhpc_vm_diagnostics.sh mode=0755
ControlNode->>ControlNode: file state=absent path=__hpc_diags_patch_file.path
ControlNode->>TargetVM: file state=absent path=__hpc_pkg_extracted.path
else diagnostics_already_installed
Note over ControlNode,TargetVM: Skips download, patch, and install steps
end
end
ControlNode-->>TargetVM: Continue with remaining role tasks
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
f12a9f3 to
bd20207
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey - I've found 2 issues
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `tasks/main.yml:1447-1451` </location>
<code_context>
+ dest: "{{ __hpc_diags_patch_file.path }}"
+ mode: '0644'
+
+ - name: Patch Diagnostics script
+ patch:
+ src: "{{ __hpc_diags_patch_file.path }}"
+ dest: "{{ __hpc_pkg_extracted.path }}/Linux/src/gather_azhpc_vm_diagnostics.sh"
+ remote_src: false
+ state: present
+ strip: 1
</code_context>
<issue_to_address>
**issue (bug_risk):** Patch task likely mixes up controller/remote paths via `remote_src: false`.
Since `__hpc_diags_patch_file.path` is created via `tempfile` on the remote host, `src` points to a remote path. With `remote_src: false`, the `patch` module will look for this file on the controller instead and fail with file-not-found. Set `remote_src: true` so `patch` reads the file from the remote host where it actually exists.
</issue_to_address>
### Comment 2
<location> `README.md:203` </location>
<code_context>
+
+The Azure HPC Diagnostics tool gathers system information for triage and
+debugging purposes. It collects information and state from the hardware, OS,
+azure envinroment and installed applications and packages it into a tarball
+to simplify the process of system support of bug triage.
+
</code_context>
<issue_to_address>
**issue (typo):** Fix typo and capitalization in "azure envinroment".
Suggest using "Azure environment" here to fix the spelling and align capitalization with earlier usage.
```suggestion
Azure environment and installed applications and packages it into a tarball
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
02bd5e0 to
2c6078e
Compare
Define up the Azure HPC diagnostics tarball download location, set up a SHA256 sum for it (as github does not supply one), then add the config documentation and infrastructure to pull it down and unpack it, ready to extract the diagnostics script from it. Signed-off-by: Dave Chinner <dchinner@redhat.com>
The script itself installs certain tools via repository deep links. Some of these tools are provided by OS pacakages, so pull them in via the system-role as a dependent package. Signed-off-by: Dave Chinner <dchinner@redhat.com>
The Azure HPC diagnostics script captures information about the
system hardware and software for dianostic purposes. It is intended
to supplement the RHEL sosreport diagnostics to cover the Azure
specific hardware and software that the sosreport does not capture.
This script will be used for information gathering in support
contexts, it is not intended to be run on active HPC nodes.
Before we install the downloaded diagnostic script, we need to
change a few things in the script:
- the output should be in {{ __hpc_azure_runtime_dir }}/diagnostics
- permanently disable the auto update code
- fix the version number instead of assuming the script it running
from a local git repository
- change from defaulting to online mode (requires internet access)
to offline mode. --offline option goes away, replaced by --online
option
- Indicate that the diagnostic log files should be passed on to
Red Hat, not Microsoft.
To make this easy, we will add a patch file to the system role that
contains the code changes we need to make to the script. This is
much simpler to apply that needing to do complex parser based
matches and replacements to make the changes we need.
The resultant patch file will then need to be treated as a
template to do path substitution for the runtime output directory.
This will place the diagnostic output in a well known place by
default, rather than where-ever the script was run from.
The script will be installed to {{ __hpc_azure_tools_dir }}. If the
script is already present in this location, then we will skip over
the installation entirely.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
2c6078e to
686a52d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well done with patch templating!
| - name: Clean up temporary patch file | ||
| file: | ||
| path: "{{ __hpc_diags_patch_file.path }}" | ||
| state: absent | ||
|
|
||
| - name: Remove extracted temp directory | ||
| file: | ||
| path: "{{ __hpc_pkg_extracted.path }}" | ||
| state: absent | ||
| changed_when: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - name: Clean up temporary patch file | |
| file: | |
| path: "{{ __hpc_diags_patch_file.path }}" | |
| state: absent | |
| - name: Remove extracted temp directory | |
| file: | |
| path: "{{ __hpc_pkg_extracted.path }}" | |
| state: absent | |
| changed_when: false | |
| - name: Clean up temporary patch file | |
| file: | |
| path: "{{ item }}" | |
| state: absent | |
| loop: | |
| - "{{ __hpc_diags_patch_file.path }}" | |
| - "{{ __hpc_pkg_extracted.path }}" |
Can be a loop here to avoid repetition.
| version: 0.3.4 | ||
| sha256: bab588b37f9a7d03fff82ff22d8a24c18a64e18eb2dad31f447a67b6fb76bd4c | ||
| url: https://github.com/Azure/Moneo/archive/refs/tags/v0.3.4.tar.gz | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| version: 20220316 | ||
| sha256: bcecba0ff8999131f45508718ac6eec8615550e046c77d69c148d3947647849f | ||
| url: https://github.com/Azure/azhpc-diagnostics/archive/refs/tags/hpcdiag-20220316.tar.gz | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - kernel-devel | ||
| - kernel-headers | ||
| __hpc_azure_diagnostics_packages: | ||
| # lsvmbus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it commented here? Can you either comment why this might eventually be needed or remove?
| dest: "{{ __hpc_diags_patch_file.path }}" | ||
| mode: '0644' | ||
|
|
||
| - name: Patch Diagnostics script |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only concern is what if dest script got updated and our patch stops working?
Enhancement:
The Azure HPC diagnostics script captures information about the
system hardware and software for dianostic purposes. It is intended
to supplement the RHEL sosreport diagnostics to cover the Azure
specific hardware and software that the sosreport does not capture.
This script will be used for information gathering in support
contexts, it is not intended to be run on active HPC nodes.
The script will be installed to /opt/hpc/azure/tools, and should be run
from there.
The log files generated from this script must be treated with the same
care as sosreport log files as they may contain customer information
and/or PII which falls under various data protection laws.
Result:
When run, a tarball containing all the log files will be deposited in /var/hpc/azure/diagnostics/.
The last lines output by the script will indicate the name and location of the tarball containing
the diagnostic information.
Issue Tracker Tickets (Jira or BZ if any): https://issues.redhat.com/browse/RHELHPC-110
Summary by Sourcery
Install and integrate the Azure HPC Diagnostics tool into the Azure HPC role for automated deployment and configuration.
New Features:
Enhancements:
Documentation: