Description
NIM_IMAGE: nvcr.io/nim/meta/llama-3.1-70b-instruct:1.3.0
page: nim-deploy/kserve at main · NVIDIA/nim-deploy · GitHub
Steps:
1. Install k8s and KServe (cloud-native-stack/playbooks at master · NVIDIA/cloud-native-stack (github.com))
2. Clone the repo (nim-deploy/kserve at main · NVIDIA/nim-deploy · GitHub)
3. export NGC_API_KEY=<NGC_API_KEY>
4. export HF_TOKEN=<HF_TOKEN>
5. export NODE_NAME=<NODE_NAME>
6. cd ~/nim-deploy/kserve
7. bash scripts/setup.sh
8. Download the model: kubectl create -f download-profile.yaml
9. kubectl apply -f llama-3.1-70b-instruct_2xgpu_1.1.0.yaml
Issue 1: Permission error when downloading the model to the cache folder
After step 8, observe:
INFO 2024-11-26 10:19:04.812 pre_download.py:87] Fetching contents for profile tensorrt_llm-h100_nvl-fp8-tp2-pp1-throughput
INFO 2024-11-26 10:19:04.812 pre_download.py:93] {
  "feat_lora": "false",
  "gpu": "H100_NVL",
  "gpu_device": "2321:10de",
  "llm_engine": "tensorrt_llm",
  "pp": "1",
  "precision": "fp8",
  "profile": "throughput",
  "tp": "2"
}
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:128] One or more errors fetching files:
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
Traceback (most recent call last):
  File "/opt/nim/llm/.venv/bin/download-to-cache", line 6, in <module>
    sys.exit(download_to_cache())
  File "/opt/nim/llm/nim_llm_sdk/hub/pre_download.py", line 97, in download_to_cache
    cached_files = repo.get_all()
Exception: I/O error Permission denied (os error 13)
Analysis:
After granting 777 permissions on /raid/nvidia-nim (sudo chmod -R 777 /raid/nvidia-nim), the NIM model downloads successfully.
Suggestion: add sudo chmod -R 777 /raid/nvidia-nim to the setup.sh script, or remind users to grant 777 permissions on the model cache folder.
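One way the suggested check could look, as a minimal sketch only: the function name `ensure_cache_writable` and the probe file are hypothetical and are not part of the existing setup.sh. It applies the same `chmod -R 777` workaround reported above and then confirms the folder is writable by creating a temporary file.

```shell
# Hypothetical helper; not taken from nim-deploy's setup.sh.
# Applies the chmod workaround from this report, then verifies write access.
ensure_cache_writable() {
  local dir="$1"
  # Create the cache folder if it does not exist yet.
  mkdir -p "$dir" || return 1
  # Same workaround as in the report: open permissions recursively.
  chmod -R 777 "$dir" || return 1
  # Verify by creating and removing a probe file.
  local probe="$dir/.write_probe.$$"
  if touch "$probe" 2>/dev/null; then
    rm -f "$probe"
    echo "writable: $dir"
  else
    echo "not writable: $dir" >&2
    return 1
  fi
}
```

On the setup described here it would presumably be run with elevated privileges against /raid/nvidia-nim before step 8. A 777 mode is the broad fix that worked in this report; a narrower option, if the UID the NIM container runs as is known, would be to chown the folder to that UID instead.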
Issue 2: OOM and crash after step 9
Observe:
[screenshot attachment: 6b8cd071-2f06-4025-8c51-1fd5b17e6ee3]
Analysis:
When CPU memory in the runtime YAML file is set below 77G (the default is 32G), the deployment always hits OOM and crashes.
When CPU memory in the runtime YAML file is set to 77G or more, NIM deploys without issue.
[screenshot attachment: 8cc731b2-3e45-42b0-878d-09659707d59d]
Suggestion: tell users to increase the CPU and GPU memory settings if they encounter this issue.
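A rough sketch of how a user might confirm the OOM kill and sanity-check the memory value before deploying. Both function names are hypothetical, the pod name is whatever `kubectl get pods` shows for the deployment, and the 77 threshold is only the value observed in this report for this model and profile, not a documented requirement. Whether the runtime YAML uses `G` or `Gi` is not stated above, so the helper accepts either suffix and compares the number only.

```shell
# Hypothetical helpers; names and the 77 default come from this report only.

# Print why the pod's containers last terminated ("OOMKilled" for an OOM kill).
last_termination_reason() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Succeed if a quantity such as "32Gi" or "80G" is at least the floor.
# Only whole-number G/Gi quantities are handled; anything else returns 2.
meets_memory_floor() {
  local qty="$1" floor="${2:-77}" num
  num="${qty%Gi}"   # strip a trailing "Gi" if present
  num="${num%G}"    # otherwise strip a trailing "G"
  case "$num" in
    ''|*[!0-9]*) echo "unsupported quantity: $qty" >&2; return 2 ;;
  esac
  [ "$num" -ge "$floor" ]
}
```

For example, `meets_memory_floor 32Gi` fails for the default value mentioned above, while `meets_memory_floor 80Gi` succeeds.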