Skip to content

Conversation

@alexandraBara
Copy link
Collaborator

@alexandraBara alexandraBara commented Sep 23, 2025

Functionality supported:

   version = self._get_amdsmi_version()
            processes = self.get_process()
            partition = self.get_partition()
            firmware = self.get_firmware()
            gpu_list = self.get_gpu_list()
            statics = self.get_static()

How to run:

node-scraper run-plugin AmdSmiPlugin

It is the users responsibility to have Python API installed: https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html#

except Exception:
return None

def _vu(self, v: object, unit: str, *, required: bool = False) -> ValueUnit | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: may be good to add some more descriptive function names for _vu, _vu_req, _nz.


return out

def _smi_try(self, fn, *a, default=None, **kw):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hints should be added here


return out

def _get_soc_pstate(self, h) -> StaticSocPstate | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hints should be added, and docstring updated to reflect them.

except ValidationError:
return None

def _get_xgmi_plpd(self, h) -> StaticXgmiPlpd | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hints should be added, and docstring updated to reflect them.

except ValidationError:
return None

def _get_cache_info(self, h) -> list[StaticCacheInfoItem]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hints should be added, and docstring updated to reflect them. Would also be good to use a more descriptive variable name.


return out

def _get_clock(self, h) -> StaticClockData | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hints should be added, and docstring updated to reflect them.

"""Collect AmdSmi data from system
Args:
args (_type_, optional): _description_. Defaults to None.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring placeholder needs to be updated.

self._log_event(
category=EventCategory.PLATFORM,
description=f"{key} is not consistent across all GPUs",
priority=EventPriority.ERROR,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is inconsistency here always an error condition? Just wondering if maybe 'warning' would be more appropriate for the priority here

Comment on lines +432 to +437
if args.l0_to_recovery_count_error_threshold is None:
args.l0_to_recovery_count_error_threshold = self.L0_TO_RECOVERY_COUNT_ERROR_THRESHOLD
if args.l0_to_recovery_count_warning_threshold is None:
args.l0_to_recovery_count_warning_threshold = (
self.L0_TO_RECOVERY_COUNT_WARNING_THRESHOLD
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be better to have these default values in the analyzer args model itself rather than having them as class variables and setting them manually like this.

Comment on lines +454 to +459
if args.expected_memory_partition_mode or args.expected_compute_partition_mode:
self.check_expected_memory_partition_mode(
data.partition,
args.expected_memory_partition_mode,
args.expected_compute_partition_mode,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this check does not depend on the 'static' attribute of data, why is it being skipped when static is None?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants