
Conversation

@SaboniAmine
Member

Description

In the start_task / stop_task lifecycle, some operations slow down global execution, which can be a non-negligible overhead in the context of real-time inference, for example.

start_task():
  min:  0.156 ms
  p50:  0.872 ms
  p95:  3.450 ms
  p99:  18.234 ms

stop_task():
  min:  0.287 ms
  p50:  1.234 ms
  p95:  5.020 ms
  p99:  22.456 ms

In particular, the get_global_energy_mix_data() method is called on each invocation to update data that shouldn't change over the whole execution lifetime.

[Screenshot 2025-12-25 at 20:29:55]
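Caching this lookup at process level removes the repeated cost. A minimal sketch of the approach, assuming get_global_energy_mix_data() reads a static JSON file bundled with the package (the file path and return shape below are illustrative, not the exact codecarbon internals):

import functools
import json

@functools.lru_cache(maxsize=1)
def get_global_energy_mix_data():
    # The energy mix data is static for the lifetime of the run,
    # so repeated start_task()/stop_task() calls hit the cache
    # instead of re-reading and re-parsing the file on every call.
    with open("codecarbon/data/global_energy_mix.json") as f:  # illustrative path
        return json.load(f)

Note that lru_cache hands back the same dict object on every call, so callers must treat the result as read-only to avoid corrupting the cached data.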

Similarly, at tracker init the CPU detection runs multiple times on Apple silicon, as it is used in the resource tracker (ResourceTracker._get_install_instructions) and in the PowerMetrics CLI setup (ApplePowermetrics._setup_cli), in addition to the CPU hardware detection in cpu.py.

Total __init__ time: 157ms

Top cumulative time:
├─ subprocess.poll():           140ms (89%)
├─ detect_cpu_model() x3:       133ms (85%)
├─ is_powermetrics_available():  60ms (38%)
├─ TDP.__init__():               49ms (31%)
└─ powermetrics._setup_cli():    45ms (29%)
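The repeated detection can be collapsed into a single call with the same caching pattern. A minimal sketch, assuming detect_cpu_model() wraps cpuinfo.get_cpu_info() (the "brand_raw" key is py-cpuinfo's; the rest is illustrative):

import functools

import cpuinfo

@functools.lru_cache(maxsize=1)
def detect_cpu_model():
    # cpuinfo.get_cpu_info() shells out and parses its output, which
    # dominates the init profile above. The CPU model cannot change
    # mid-run, so ResourceTracker, ApplePowermetrics._setup_cli and
    # cpu.py can all share one detection result.
    info = cpuinfo.get_cpu_info()
    return info.get("brand_raw")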

As a first step towards better internal data management, a quick way to reduce the I/O and the costly cpuinfo.get_cpu_info() execution is to cache the extracted data. Start / stop task operations gained 0.4-0.5 ms at p50:

start_task():
  min:  0.142 ms
  p50:  0.487 ms
  p95:  1.650 ms
  p99:  15.582 ms
  avg:  0.804 ms

stop_task():
  min:  0.239 ms
  p50:  0.750 ms
  p95:  3.325 ms
  p99:  9.258 ms
  avg:  1.014 ms

Results also improve on the slower tail, and overall for get_global_energy_mix_data() post-cache:

  • start_task() p95: 3.45ms → 1.65ms (-52%)
  • stop_task() p95: 5.02ms → 3.33ms (-34%)
  • get_global_energy_mix_data(): 2-3ms → 0.0007ms

Motivation and Context

In preparation for the integration of a real-time inference framework based on the start / stop task API, blocking operations can be problematic when performance is optimized server-side. At 1-2 ms average execution time per start / stop operation, the overhead seems limited for the moment. An asynchronous implementation could be considered, but with caution: when wrapping a single inference, the data reads at start and stop time must stay synchronous, otherwise the delta would measure an incorrect computation time if run in a background thread.
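To make that constraint concrete (hypothetical helper, not the tracker API): the two readings that bound a single inference have to happen on the caller's thread, so the measured window exactly brackets the wrapped call.

import time

def measure(fn, *args, **kwargs):
    # Synchronous reads: the window exactly brackets the wrapped call.
    start = time.perf_counter()          # what start_task() must capture inline
    result = fn(*args, **kwargs)
    delta = time.perf_counter() - start  # what stop_task() must capture inline
    return result, delta

If those two reads were instead sampled by a background thread, the recorded boundaries would drift from the actual inference, and the energy delta would be attributed to the wrong duration.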

How Has This Been Tested?

A unit test has been added, and benchmarks were re-run post-implementation.
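A sketch of how the percentiles above can be reproduced (the harness is illustrative; only time.perf_counter() and the standard statistics module are assumed):

import statistics
import time

def bench(fn, n=10_000):
    # Time n invocations of fn and report latency percentiles in ms.
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000)
    qs = statistics.quantiles(samples, n=100)  # 99 cut points
    return {
        "min": min(samples),
        "p50": statistics.median(samples),
        "p95": qs[94],
        "p99": qs[98],
        "avg": statistics.fmean(samples),
    }

When unit-testing the cached functions, functools.lru_cache exposes cache_clear(), which should be called between test cases so one test's cached state does not leak into the next.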

Screenshots (if appropriate):

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

Go over all the following points, and put an x in all the boxes that apply.

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@SaboniAmine SaboniAmine requested a review from a team as a code owner December 25, 2025 19:45
…task execution

feat: add cache to detect cpu model function
@SaboniAmine SaboniAmine force-pushed the feat/add_caching_to_data_sources branch from c5bc726 to cb079e2 Compare December 25, 2025 19:56
