Guide on deploying a Llama model with FastAPI and Dockerising it.
Dependencies are captured with pip freeze > requirements.txt.
I'm using a Windows machine with Docker Desktop (WSL2). I have a GPU (NVIDIA 3060 Ti) running CUDA 11.6. A little outdated, I know.
Use https://pytorch.org/get-started/previous-versions/ to map your CUDA version to a PyTorch version.
Error 1. Windows testing error.
AssertionError: Torch not compiled with CUDA enabled
Fix: pip install torch==1.13.1+cu116 -f https://download.pytorch.org/whl/torch_stable.html
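A quick sanity check after reinstalling (a minimal sketch, not part of the original notes) confirms the CUDA build is active:
import torch

# Expect something like 1.13.1+cu116, True, and the GPU name (e.g. NVIDIA GeForce RTX 3060 Ti)
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))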
Error 2. When running docker compose up --build, pip inside the Linux container cannot download the +cu116 wheel. Just remove the suffix. Linux fix:
torch==1.13.1  # removed +cu116 from requirements.txt for the docker compose build
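If you want to pin a specific CUDA build inside the container as well (the plain torch==1.13.1 wheel from PyPI on Linux bundles its own CUDA runtime), an alternative, untested in this setup, is to point pip at the PyTorch cu116 index from requirements.txt itself:
--extra-index-url https://download.pytorch.org/whl/cu116
torch==1.13.1+cu116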
Error 3.
Fix for the dtype error (torch.triu does not support bfloat16 on CUDA) in PyTorch versions below 2.1.0. Since we are using 1.13.1, this error occurs.
Source: https://github.com/meta-llama/llama3/issues/110
Error message:
Traceback (most recent call last):
File "C:\Users\Admin\Desktop\Learning\RDAI\model.py", line 45, in <module>
output = pipe(messages, **generation_args)
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\transformers\pipelines\text_generation.py", line 267, in __call__
return super().__call__(Chat(text_inputs), **kwargs)
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\transformers\pipelines\base.py", line 1302, in __call__
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\transformers\pipelines\base.py", line 1309, in run_single
model_outputs = self.forward(model_inputs, **forward_params)
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\transformers\pipelines\base.py", line 1209, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\transformers\pipelines\text_generation.py", line 370, in _forward
generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\transformers\generation\utils.py", line 2215, in generate
result = self._sample(
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\transformers\generation\utils.py", line 3206, in _sample
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 1190, in forward
outputs = self.model(
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 915, in forward
causal_mask = self._update_causal_mask(
causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
File "C:\Users\Admin\Desktop\Learning\RDAI\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 1090, in _prepare_4d_causal_attention_mask_with_cache_position
causal_mask = torch.triu(causal_mask, diagonal=1)
Fix: Line 1089
if sequence_length != 1:
    # causal_mask = torch.triu(causal_mask, diagonal=1)  # original line; fails for bfloat16 on torch 1.13.1
    causal_mask = causal_mask.to(torch.float32)  # cast to float32 so triu is supported
    causal_mask = torch.triu(causal_mask, diagonal=1)
    causal_mask = causal_mask.to('cuda', dtype=torch.bfloat16)  # cast back to bfloat16 on the GPU
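If you would rather not edit site-packages at all, the same workaround can be applied from your own code by wrapping torch.triu before the pipeline runs. This is only a sketch, assuming the bfloat16 triu call in the causal-mask code above is the one that fails; it is not what the original notes did:
import torch

_orig_triu = torch.triu

def _bf16_safe_triu(input, diagonal=0, *, out=None):
    # torch 1.13.1 cannot run triu on CUDA bfloat16 tensors, so round-trip through float32.
    if input.is_cuda and input.dtype == torch.bfloat16:
        result = _orig_triu(input.float(), diagonal=diagonal).to(torch.bfloat16)
        if out is not None:
            out.copy_(result)
            return out
        return result
    if out is not None:
        return _orig_triu(input, diagonal=diagonal, out=out)
    return _orig_triu(input, diagonal=diagonal)

torch.triu = _bf16_safe_triu  # apply before building the transformers pipeline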
Error 4. When running docker compose up, the model dies at the same point because the container has its own unpatched copy of transformers (the image build did not account for Errors 2/3).
Patched file (Windows side): C:\Users\Admin\Desktop\Learning\RDAI\venv\Lib\site-packages\transformers\models\llama\modeling_llama.py
Fix: Run this either in the Dockerfile or by exec'ing into the container, to copy the patched file over the container's copy.
cp /app/venv/Lib/site-packages/transformers/models/llama/modeling_llama.py /usr/local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py
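If exec'ing into the container gets tedious, an alternative (not what the original notes did) is to bake the patched file into the image at build time; modeling_llama_patched.py is a hypothetical local copy of the edited file kept next to the Dockerfile:
COPY modeling_llama_patched.py /usr/local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py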
Error 5. Can't find the Rust compiler (pip is building maturin from source). Add the fix to the Dockerfile.
1176.6 warning: no files found matching '*.json' under directory 'src/python_interpreter'
1176.6 writing manifest file 'maturin.egg-info/SOURCES.txt'
1176.6 warning: build_py: byte-compiling is disabled, skipping.
1176.6
1176.6 running build_ext
1176.6 running build_rust
1176.6 error: can't find Rust compiler
Fix: https://stackoverflow.com/questions/75085152/cant-find-rust-compiler-to-install-transformers
# Install Rust compiler
RUN curl https://sh.rustup.rs -sSf | bash -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
Error 6. No NVIDIA GPU visible inside Docker.
Error message:
File "/usr/local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
server-1 | torch._C._cuda_init()
server-1 | RuntimeError: Found no NVIDIA driver on you
Fix: https://docs.docker.com/compose/how-tos/gpu-support/ and https://stackoverflow.com/questions/57066162/how-to-get-docker-to-recognize-nvidia-drivers
For plain docker run, use: docker run --gpus all -it rdai-server
# Added new config into the docker compose file, under the server service. Ran with docker compose up.
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all  # use all visible GPUs; alternatively list specific GPUs with device_ids
          capabilities: [gpu]
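To catch this earlier next time, a small guard at the top of the app (hypothetical, not in the original code) makes the container fail fast with a clear message instead of dying on the first request:
import torch

# Hypothetical startup guard: refuse to start if the compose GPU reservation did not take effect.
if not torch.cuda.is_available():
    raise RuntimeError(
        "No CUDA device visible inside the container - check the compose 'deploy' section and the NVIDIA Container Toolkit."
    )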
Running with uvicorn
uvicorn main:app --reload --host 0.0.0.0 --port 8000
Using the FastAPI /docs UI to test the endpoint.
http://localhost:8000/docs#/default/ask_ask_post
Looks good when running locally.
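For reference, the serving code is roughly shaped like the sketch below. This is a minimal reconstruction, not the actual main.py: only the POST /ask route and the pipe(messages, **generation_args) call pattern come from the notes above; the model id, field names and generation arguments are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import pipeline

app = FastAPI()

# Load the model once at import time; model id and generation args are placeholders.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device=0,  # first CUDA GPU
)
generation_args = {"max_new_tokens": 256, "return_full_text": False}

class AskRequest(BaseModel):
    question: str

@app.post("/ask")
def ask(req: AskRequest):
    messages = [{"role": "user", "content": req.question}]
    output = pipe(messages, **generation_args)
    # With return_full_text=False the pipeline returns only the newly generated text.
    return {"answer": output[0]["generated_text"]}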
Added a simple post-processing flag. It's a little tricky to retrain the model to embed this behaviour, so this will do for now.
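The notes do not say what the flag does, so the shape below is entirely hypothetical: a boolean that toggles a cleanup step on the generated text before it is returned from /ask.
def postprocess(text: str, apply_postprocessing: bool = True) -> str:
    # Hypothetical cleanup: trim whitespace and cut off a trailing partial sentence.
    if not apply_postprocessing:
        return text
    text = text.strip()
    if "." in text:
        text = text[: text.rfind(".") + 1]
    return text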
Docker compose up after building. Everything looks fine now.
GitHub error fix (wrong user when pushing from VS Code):
https://carldesouza.com/wrong-user-when-pushing-to-github-from-visual-studio-code/