Conversation

@arturbalabanov commented Aug 19, 2025

Description

Use Helm hooks for database migrations in all the services that use alembic, clickhouse or mongodb migrations:

  • rest_api - alembic
  • auth - alembic
  • herald - alembic
  • jira_bus - alembic
  • slacker - alembic
  • katara - alembic
  • risp_worker - clickhouse
  • metroculus_worker - clickhouse
  • gemini_worker - clickhouse
  • insider_worker - mongo
  • diworker - mongo

There is a separate hook for each of the above services. That way we can:

  1. ensure that the migrations are run only once per deployment (no matter how many pods are spawned for each of them)
  2. have per-service dependencies for running the migrations. For example, the diworker's migrations require not only mongo, clickhouse and rabbitmq but also rest_api to be running before they are applied, because some of these migrations make calls to rest_api. And of course rest_api's own migrations can't require the service to be running, as they are executed before the server starts.

Related issue number

MPT-12683: https://softwareone.atlassian.net/browse/MPT-12683

Special notes

In our Optscale deployment we're facing issues with migrations depending on EtcdLock, mostly because we have a custom deployment of etcd. So, to work around that (and not rely on etcd), we're instead running the migrations in a Helm hook job.

As I needed to apply this to multiple services, I also extracted the migrate.py scripts into a common place -- tools/db (and made some minor changes to allow them to be run for any service and inside a Helm job).
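
Roughly, each hook job looks like the sketch below (auth as an example; the init containers and values here are simplified for illustration, see the chart templates for the actual definitions):

apiVersion: batch/v1
kind: Job
metadata:
  name: auth-migrations
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
        # per-service dependencies, e.g. wait for MariaDB before running the alembic migrations
        - name: wait-mariadb
          image: busybox:1.30.0
          command: ["sh", "-c", "until nc -z mariadb.default.svc.cluster.local 3306 -w 2; do sleep 2; done"]
      containers:
        - name: auth-migrations
          image: auth:local
          command: ["/bin/sh", "-c"]
          args: ['uv run --project "auth" db migrate "auth"']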

Checklist

  • The pull request title is a good summary of the changes
  • Unit tests for the changes exist N/A
  • New and existing unit tests pass locally

@ffaraone left a comment

just some comments to start the usual discussion about how to name things :)

please check the DB connection string URL-encoding, otherwise we will have problems in the presence of some characters.

@arturbalabanov force-pushed the feat/MPT-12683-use-helm-hooks-for-db-migrations branch from 624819c to 40c05e2 on September 1, 2025 13:25
@arturbalabanov commented Sep 18, 2025

Looks like I missed a few services:

  • risp_worker
  • insider_worker
  • metroculus_worker
  • gemini_worker
  • diworker

Let me know if others are missing still. I'm marking this PR as Draft until this is finished

@arturbalabanov changed the title from "feat: Use Helm hooks for applying database migrations (MPT-12683)" to "draft: feat: Use Helm hooks for applying database migrations (MPT-12683)" on Sep 18, 2025
@arturbalabanov marked this pull request as draft on September 18, 2025 13:01
@arturbalabanov changed the title from "draft: feat: Use Helm hooks for applying database migrations (MPT-12683)" to "feat: Use Helm hooks for applying database migrations (MPT-12683)" on Sep 18, 2025
@sd-hystax

> Looks like I missed a few services:
>
>   • risp_worker
>   • insider_worker
>   • metroculus_worker
>   • gemini_worker
>   • diworker
>
> Let me know if others are missing still. I'm marking this PR as Draft until this is finished

I checked all services that use migrations (both with and without locks). Your list of missed services is correct.

@arturbalabanov force-pushed the feat/MPT-12683-use-helm-hooks-for-db-migrations branch from 40c05e2 to fef00d7 on September 25, 2025 14:22
@arturbalabanov marked this pull request as ready for review on October 6, 2025 12:20
@ffaraone left a comment

🥇

@arturbalabanov force-pushed the feat/MPT-12683-use-helm-hooks-for-db-migrations branch from 6df48f4 to 13bbf61 on October 6, 2025 14:08
@nexusriot commented Oct 14, 2025

I've tried to start the cluster, but it is failing with:

(.venv) vlad@ops-experimental:~/optscale/optscale-deploy$ ./runkube.py --no-pull --with-elk  -o overlay/user_template.yml -- optscale local
21:50:05.120: Connecting to ctd daemon 172.25.1.157:2376
21:50:05.120: Comparing local images for 172.25.1.157
21:50:10.870: Generating base overlay...
21:50:10.878: Connecting to ctd daemon 172.25.1.157:2376
21:50:13.760: Creating component_versions.yaml file to insert it into configmap
21:50:13.762: Deleting /configured key
21:50:13.765: etcd pod not found
21:50:13.775: Waiting for job deletion...
21:50:13.775: Starting helm chart optscale with name optscale on k8s cluster 172.25.1.157
Error: UPGRADE FAILED: pre-upgrade hooks failed: 1 error occurred:
  * timed out waiting for the condition


Traceback (most recent call last):
  File "/home/vlad/optscale/optscale-deploy/./runkube.py", line 485, in <module>
    acr.start(args.check, args.update_only)
  File "/home/vlad/optscale/optscale-deploy/./runkube.py", line 394, in start
    subprocess.run(update_cmd.split(), check=True)
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['helm', 'upgrade', '--install', '-f', 'tmp/base_overlay', '-f', 'overlay/user_template.yml', 'optscale', 'optscale']' returned non-zero exit status 1.

this is some debug info

(.venv) vlad@ops-experimental:~/optscale/optscale-deploy$ helm list -A --all
helm status optscale -n default
helm history optscale -n default
kubectl get all -n default
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -n 100
NAME      NAMESPACE REVISION  UPDATED                                 STATUS    CHART                             APP VERSION
ngingress default   1         2025-09-26 06:45:39.953910084 +0000 UTC deployed  nginx-ingress-controller-11.3.17  1.11.1     
optscale  default   2         2025-10-14 21:50:13.935580134 +0000 UTC failed    optscale-0.1.0                               
NAME: optscale
LAST DEPLOYED: Tue Oct 14 21:50:13 2025
NAMESPACE: default
STATUS: failed
REVISION: 2
TEST SUITE: None
REVISION  UPDATED                   STATUS  CHART           APP VERSION DESCRIPTION                                                           
1         Tue Oct 14 12:51:44 2025  failed  optscale-0.1.0              Release "optscale" failed: failed pre-install: 1 error occurred:      
                                                                          * t...                                                                
2         Tue Oct 14 21:50:13 2025  failed  optscale-0.1.0              Upgrade "optscale" failed: pre-upgrade hooks failed: 1 error occurr...
NAME                                                                  READY   STATUS     RESTARTS   AGE
pod/auth-migrations-hs8bq                                             0/1     Init:0/3   0          5m37s
pod/ngingress-nginx-ingress-controller-62zsh                          1/1     Running    0          18d
pod/ngingress-nginx-ingress-controller-default-backend-78ccb69cdxsz   1/1     Running    0          18d

NAME                                                         TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
service/kubernetes                                           ClusterIP      10.96.0.1      <none>        443/TCP                      18d
service/ngingress-nginx-ingress-controller                   LoadBalancer   10.96.242.0    <pending>     80:29656/TCP,443:25388/TCP   18d
service/ngingress-nginx-ingress-controller-default-backend   ClusterIP      10.96.204.63   <none>        80/TCP                       18d

NAME                                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/ngingress-nginx-ingress-controller   1         1         1       1            1           <none>          18d

NAME                                                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ngingress-nginx-ingress-controller-default-backend   1/1     1            1           18d

NAME                                                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/ngingress-nginx-ingress-controller-default-backend-78ccb69796   1         1         1       18d

NAME                        STATUS    COMPLETIONS   DURATION   AGE
job.batch/auth-migrations   Running   0/1           5m37s      5m37s
LAST SEEN   TYPE     REASON             OBJECT                      MESSAGE
5m37s       Normal   Killing            pod/auth-migrations-6rq5w   Stopping container wait-elk
5m37s       Normal   Scheduled          pod/auth-migrations-hs8bq   Successfully assigned default/auth-migrations-hs8bq to ops-experimental
5m37s       Normal   SuccessfulCreate   job/auth-migrations         Created pod: auth-migrations-hs8bq
5m36s       Normal   Pulled             pod/auth-migrations-hs8bq   Container image "busybox:1.30.0" already present on machine
5m36s       Normal   Created            pod/auth-migrations-hs8bq   Created container: wait-elk
5m36s       Normal   Started            pod/auth-migrations-hs8bq   Started container wait-elk



(.venv) vlad@ops-experimental:~/optscale/optscale-deploy$ kubectl get pods -n default --field-selector=status.phase!=Running -o wide
NAME                    READY   STATUS     RESTARTS   AGE    IP           NODE               NOMINATED NODE   READINESS GATES
auth-migrations-hs8bq   0/1     Init:0/3   0          9m1s   10.254.0.5   ops-experimental   <none>           <none>


(.venv) vlad@ops-experimental:~/optscale/optscale-deploy$ kubectl describe pod auth-migrations-hs8bq
Name:             auth-migrations-hs8bq
Namespace:        default
Priority:         0
Service Account:  default
Node:             ops-experimental/172.25.1.157
Start Time:       Tue, 14 Oct 2025 21:50:15 +0000
Labels:           batch.kubernetes.io/controller-uid=898579ed-47ef-4af3-a39f-9c5b81082b6f
                  batch.kubernetes.io/job-name=auth-migrations
                  controller-uid=898579ed-47ef-4af3-a39f-9c5b81082b6f
                  job-name=auth-migrations
Annotations:      <none>
Status:           Pending
IP:               10.254.0.5
IPs:
  IP:           10.254.0.5
Controlled By:  Job/auth-migrations
Init Containers:
  wait-elk:
    Container ID:  containerd://1e33925de202b130455f029fabfa30246e0783dff15ac671a2705fd388abbe9e
    Image:         busybox:1.30.0
    Image ID:      docker.io/library/busybox@sha256:7964ad52e396a6e045c39b5a44438424ac52e12e4d5a25d94895f2058cb863a0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      until nc -z elk.default.svc.cluster.local 9200 -w 2; do sleep 2; done && until nc -z elk.default.svc.cluster.local 12201 -w 2; do sleep 2; done
    State:          Running
      Started:      Tue, 14 Oct 2025 21:50:16 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4mfwf (ro)
  wait-etcd-client:
    Container ID:  
    Image:         busybox:1.30.0
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      until nc -z etcd-client.default.svc.cluster.local 2379 -w 2; do sleep 2; done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4mfwf (ro)
  wait-mariadb:
    Container ID:  
    Image:         mariadb:local
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      until mysql --connect-timeout=2 -h mariadb.default.svc.cluster.local -p$MYSQL_ROOT_PASSWORD -e "SELECT 1"; do sleep 2; done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      MYSQL_ROOT_PASSWORD:  <set to the key 'password' in secret 'mariadb-secret'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4mfwf (ro)
Containers:
  auth-migrations:
    Container ID:  
    Image:         auth:local
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      uv run --project "auth" db migrate "auth"
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      HX_ETCD_HOST:  etcd-client
      HX_ETCD_PORT:  2379
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4mfwf (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  kube-api-access-4mfwf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  10m   default-scheduler  Successfully assigned default/auth-migrations-hs8bq to ops-experimental
  Normal  Pulled     10m   kubelet            Container image "busybox:1.30.0" already present on machine
  Normal  Created    10m   kubelet            Created container: wait-elk
  Normal  Started    10m   kubelet            Started container wait-elk

It looks to me like the cluster is trying to start, but it is stuck waiting in the wait-elk init container, and elk itself cannot start before the hook completes.

I'm not absolutely sure, but maybe we need to wait for Jobs without a deletion policy of hook-succeeded or hook-failed.

"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded

So, Helm will:

  • Create the Job
  • Mark the hook "executed" immediately
  • Not wait for Job completion
  • Delete the Job after it succeeds (or before the next upgrade)

This hook will run in the background, and Helm won't block waiting for it, but it looks like disruption is possible here, because the services may start in parallel.

Also a couple of concerns from my side:

  1. Helm hooks can run DB migrations reliably, but we need to be 100% sure the migrations are idempotent, transactional, retry-safe, and serialized. Helm can't "roll back" a database: if a release fails after a schema change, Helm's rollback won't undo that change.
  2. In our case several services touch one DB and one service touches many DBs, and currently we don't have a single owner of each database's schema and migrations. It looks like we need locking so that only one migration runs at a time (tool-level locks, or DB advisory locks).

So, in my opinion, in the current case we need to render one migration Job per service, not just one per DB.
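
For example, something like the following annotations (illustrative values, not what the PR currently renders) could order the per-service migration hooks instead of letting them start in parallel, and keep failed Jobs around for debugging:

# Hooks with lower weights are created first, so per-service migration Jobs
# can be ordered; omitting hook-failed from the delete policy keeps a failed
# Job around for inspection.
annotations:
  "helm.sh/hook": pre-install,pre-upgrade
  "helm.sh/hook-weight": "10"   # e.g. rest_api=0, auth=10, diworker=20, ...
  "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded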

@nexusriot commented Oct 20, 2025

After investigating the problem, I would not suggest triggering migrations with Helm hooks. Hooks are synchronous relative to Helm, not to our cluster. IMHO, if we want to follow best practice and separate migrations from the services, we need to ship migrations as a Kubernetes Job (not a hook): run it as a batch/v1 Job with a sane backoffLimit (and we need to make sure all our changes are idempotent and retriable).

This keeps migrations visible/observable and decoupled from Helm’s lifecycle. It’s also a widely recommended approach.
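
A rough sketch of such a standalone Job for one service (reusing the image and migrate command from this PR; the name, backoffLimit and TTL below are just examples):

apiVersion: batch/v1
kind: Job
metadata:
  name: auth-migrations
spec:
  backoffLimit: 3                  # retry a few times; the migrations must be idempotent and retriable
  ttlSecondsAfterFinished: 3600    # keep the finished Job around for inspection
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: auth-migrations
          image: auth:local
          command: ["/bin/sh", "-c"]
          args: ['uv run --project "auth" db migrate "auth"']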

https://www.linkedin.com/pulse/navigating-database-migrations-kubernetes-helm-hooks-vs-bdour-akram-bpaoe

https://medium.com/@inchararlingappa/handling-migration-with-helm-28b9884c94a6

https://devops.stackexchange.com/questions/15261/helm-long-running-jobs-vs-long-running-hooks

When are Helm hooks okay?

For small projects, quick one-off tasks, or install-time sanity checks. (Not our case, with complicated waiters and startup logic.)
