@brianrudolf (Contributor) commented Nov 13, 2025

Proposed changes

This change adds optional configuration to the engine container spec for setting a postStart lifecycle hook command.

The motivation for this change comes from difficulty running the Flux model on Google Kubernetes Engine. The GKE guide for running GPUs does not use NVIDIA's GPU Operator to configure nodes for CUDA applications. Instead, it follows a similar but slightly different configuration approach that relies on the LD_LIBRARY_PATH environment variable to locate the NVIDIA drivers and CUDA libraries.

Due to the technical complexities of operating Flux, its startup process clears this environment variable early (but not immediately) in the startup sequence. A simple solution is for the engine container to run ldconfig to create the necessary runtime bindings prior to engine startup, which Kubernetes facilitates with this postStart hook.

Use of the toJson function ensures proper formatting of the command value:

          lifecycle:
            postStart:
              exec:
                command: ["/sbin/ldconfig"]
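For illustration, a template along the following lines could render the block above. This is a minimal sketch, not the chart's actual template: the `engine.lifecycle.postStart.command` value path is a hypothetical name, and the sketch assumes the chart's default values define that path as an empty list so the lookup never hits a missing key.

```yaml
# values.yaml override (hypothetical key names, not taken from the chart)
engine:
  lifecycle:
    postStart:
      command: ["/sbin/ldconfig"]

---
# Fragment of the engine container spec in a Deployment template.
# Assumes the default values set `command: []`, so `with` skips the
# whole block when no command is configured.
{{- with .Values.engine.lifecycle.postStart.command }}
lifecycle:
  postStart:
    exec:
      # toJson emits the list as a single inline JSON array,
      # e.g. ["/sbin/ldconfig"], which is also valid YAML.
      command: {{ toJson . }}
{{- end }}
```

Rendering the list with `toJson` avoids the indentation handling that a multi-line YAML list would need inside the template.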

Relevant information from Google's documentation:

About the NVIDIA CUDA-X libraries

To use CUDA applications, the image that you use must have the libraries. To add the NVIDIA CUDA-X libraries, you can build and use your own image by including the following values in the LD_LIBRARY_PATH environment variable in your container specification:

/usr/local/nvidia/lib64: the location of the NVIDIA device drivers.
/usr/local/cuda-CUDA_VERSION/lib64: the location of the NVIDIA CUDA-X libraries on the node.
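As a rough illustration of the quoted guidance, a container spec could set the variable as shown here. The container name and image are placeholders, and `CUDA_VERSION` is left as the placeholder used in Google's documentation.

```yaml
# Illustrative container spec fragment; name and image are placeholders.
containers:
  - name: engine
    image: example.com/your-cuda-image:tag
    env:
      - name: LD_LIBRARY_PATH
        # Driver path first, then the CUDA-X libraries on the node.
        # Replace CUDA_VERSION with the CUDA version present on the node.
        value: /usr/local/nvidia/lib64:/usr/local/cuda-CUDA_VERSION/lib64
```

If the process later clears `LD_LIBRARY_PATH`, as described for Flux above, these paths are no longer searched unless they are registered in the dynamic linker cache.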

Types of changes

What types of changes does your code introduce to the Deepgram self-hosted resources?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update or tests (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have tested my changes in my local self-hosted environment
    • I have deployed this chart with Flux enabled and see the engine container start successfully using the GPU
  • I have added necessary documentation (if appropriate)

Further comments

@brianrudolf brianrudolf marked this pull request as ready for review November 13, 2025 16:20
@brianrudolf brianrudolf requested review from a team and therealevanhenry as code owners November 13, 2025 16:20
@pcgeek86 (Contributor) left a comment

LGTM
