-
Couldn't load subscription status.
- Fork 3.5k
Linux device discovery for TRT-RTX Ep #26210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
||
| // metadata | ||
| gpu_device.metadata.Add("card_idx", MakeString(path_info.card_idx)); | ||
| gpu_device.metadata.Add("bus_id", std::filesystem::read_symlink(sysfs_path / "device").filename().string()); // e.g. 0000:65:00.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to infer whether the GPU is discrete from the bus_id? I'm trying to figure out how to determine that - any pointers would be appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi! I don't know whether this is reliable, but it seems that the parent bus of a iGPU seems to have a parent bus with id of pattern picXXXX:00. dGPU seems to have names like card1 or higher while iGPU would be named card0. So it's rather the topology of the bus than the bus id of the GPU itself
dGPU
realpath /sys/class/drm/card1/device
/sys/devices/pci0000:64/0000:64:00.0/0000:65:00.0
iGPUs would often have the parent path and be named card0
/sys/devices/picXXXX:00/XXXX:00:00.0/
I don't have a iGPU so my node
00:00.0 Host bridge: Intel Corporation Sky Lake-E DMI3 Registers (rev 07)
or
/sys/devices/pci0000:00/
wouldn't have a any drm nodes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good to know, thanks for sharing that observation
8e33417 to
bc7f4bb
Compare
| // Tests autoEP feature to automatically select an EP that supports the GPU. | ||
| // Currently only works on Windows. | ||
| TEST(NvExecutionProviderTest, AutoEp_PreferGpu) { | ||
| PathString model_name = ORT_TSTR("nv_execution_provider_auto_ep.onnx"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is currently still failing: NvTensorRtRtxEpFactory::CreateEp is called but not implemented https://github.com/theHamsta/onnxruntime/blob/bc7f4bbc92ed0609e18e1c4c80c3c8d560ce1729/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc#L698. The other test passes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where would would the internal factories on Windows come from
https://github.com/theHamsta/onnxruntime/blob/20d2c8912708b37fb914fbb0c719914a4040830c/onnxruntime/core/session/provider_policy_context.cc#L351-L358 ? I don't have them on Linux when no execution providers are appended to the session options. My desired device get selected, but NvEp::CreateEp is called.
NvEp like other EPs does not implement CreateEp. I guess IExecutionProvider.CreateProviders is the modern version of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a plugin EP should implement CreateEp. the internal factory is not used for plugin EPs.
8820257 to
6c77185
Compare
5f49245 to
0bf0c6b
Compare
| #endif | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@edgchen1 This resolves the failing test under Linux. I could alternatively put std::filesystem::read_symlink(std::filesystem::path("/proc/self/exe")).parent_path() in the unit test instead and move PosixEnv::GetRuntimePath to a separate PR.
I'm not sure whether the Windows behavior to set runtime path to current exe's dir is always desired (but it would be needed to behave the same in tests like on Windows)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could alternatively put std::filesystem::read_symlink(std::filesystem::path("/proc/self/exe")).parent_path() in the unit test instead and move PosixEnv::GetRuntimePath to a separate PR.
maybe for this PR, just have the unit test pass in an absolute path to the plugin EP library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, I think it would be useful to implement PosixEnv::GetRuntimePath() too. it would also help fix another test on Linux (only enabled for Windows now).
onnxruntime/onnxruntime/test/autoep/test_selection.cc
Lines 171 to 175 in b3ba580
| TEST(AutoEpSelection, CudaEP) { | |
| Ort::KeyValuePairs provider_options; | |
| provider_options.Add("prefer_nhwc", "1"); | |
| RunBasicTest(kCudaExecutionProvider, "onnxruntime_providers_cuda", provider_options); | |
| } |
I think we can just include it in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created #26424 since enabling the tests with CUDA EP required another small tweak. Let's do this PR after enabling the PosixEnv::GetRuntimePath
0bf0c6b to
b4a38fc
Compare
| #endif | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could alternatively put std::filesystem::read_symlink(std::filesystem::path("/proc/self/exe")).parent_path() in the unit test instead and move PosixEnv::GetRuntimePath to a separate PR.
maybe for this PR, just have the unit test pass in an absolute path to the plugin EP library.
onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc
Outdated
Show resolved
Hide resolved
| // Licensed under the MIT License. | ||
|
|
||
| #include "core/platform/device_discovery.h" | ||
| #include "core/framework/ortdevice.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usually, we want to stick to this dependency ordering:
onnxruntime/cmake/onnxruntime.cmake
Lines 238 to 254 in 5ef8ce2
| # This list is a reversed topological ordering of library dependencies. | |
| # Earlier entries may depend on later ones. Later ones should not depend on earlier ones. | |
| set(onnxruntime_INTERNAL_LIBRARIES | |
| onnxruntime_session | |
| ${onnxruntime_libs} | |
| ${onnxruntime_INTERNAL_PROVIDER_LIBRARIES} | |
| ${onnxruntime_winml} | |
| onnxruntime_optimizer | |
| onnxruntime_providers | |
| onnxruntime_lora | |
| onnxruntime_framework | |
| onnxruntime_graph | |
| onnxruntime_util | |
| ${ONNXRUNTIME_MLAS_LIBS} | |
| onnxruntime_common | |
| onnxruntime_flatbuffers | |
| ) |
so onnxruntime_common (which contains core/platform/linux/device_discovery.cc) should not depend on onnxruntime_framework (core/framework/ortdevice.h).
we can move the vendor id enum and helper function to be a separate header, in onnxruntime_common perhaps, or just move the helper function to this file for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've put the helper to the linux file. I will try to generalize in a follow-up.
| // Tests autoEP feature to automatically select an EP that supports the GPU. | ||
| // Currently only works on Windows. | ||
| TEST(NvExecutionProviderTest, AutoEp_PreferGpu) { | ||
| PathString model_name = ORT_TSTR("nv_execution_provider_auto_ep.onnx"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a plugin EP should implement CreateEp. the internal factory is not used for plugin EPs.
b4a38fc to
8a1ec61
Compare
3ffffed to
60264fa
Compare
60264fa to
d3ee0a6
Compare
Description
This change adds vendor id and PCIe bus_id to the properties detected during Linux device discovery.
Those properties are used to enable device discovery on Linux for the TRT-RTX execution provider.
Motivation and Context
I want to use device discovery for TRT-EP also on Linux.
This changes have already been tested with the newly added inference samples microsoft/onnxruntime-inference-examples#529 . Some further testing is required to see whether this repo's CI passes (therefore setting draft for now)
@gedoensmax for visibilty