Locked tables at the final stage with FATAL critical-load met #1596

@sirsova-readdle

Description

command:

gh-ost \
  --max-load=Threads_running=300,Threads_connected=600 \
  --critical-load=Threads_running=1000,Threads_connected=1000 \
  --chunk-size=20000 \
  --max-lag-millis=5000 \
  --dml-batch-size=200 \
  --user="root" \
  --password="SECRET" \
  --host=127.0.0.1 \
  --port=3306 \
  --throttle-control-replicas=127.0.0.1:3307 \
  --gcp \
  --allow-on-master \
  --database="db" \
  --table="users" \
  --verbose \
  --switch-to-rbr \
  --allow-master-master \
  --cut-over=default \
  --exact-rowcount \
  --concurrent-rowcount \
  --default-retries=120 \
  --panic-flag-file=/tmp/ghost.panic.flag \
  --postpone-cut-over-flag-file=/tmp/ghost.postpone.flag \
  --alter="MODIFY name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL" \
  --execute

The command ran successfully, and all that remained was to perform the cut-over to swap the tables.

2025-10-20 08:59:28 INFO Locking `db`.`users`, `db`.`_users_del`
Copy: 4025114/4025114 100.0%; Applied: 35738; Backlog: 44/1000; Time: 14m36s(total), 12m31s(copy); streamer: mysql-bin.3335455:24883896; Lag: 0.33s, HeartbeatLag: 0.25s, State: migrating; ETA: due
2025-10-20 08:59:28 INFO Copy: 4025114/4025114 100.0%; Applied: 35738; Backlog: 44/1000; Time: 14m36s(total), 12m31s(copy); streamer: mysql-bin.3335455:24883896; Lag: 0.33s, HeartbeatLag: 0.25s, State: migrating; ETA: due []
Copy: 4025114/4025114 100.0%; Applied: 35738; Backlog: 0/1000; Time: 14m37s(total), 12m31s(copy); streamer: mysql-bin.3335455:27128516; Lag: 0.47s, HeartbeatLag: 0.52s, State: migrating; ETA: due
2025-10-20 08:59:29 INFO Copy: 4025114/4025114 100.0%; Applied: 35738; Backlog: 0/1000; Time: 14m37s(total), 12m31s(copy); streamer: mysql-bin.3335455:27128516; Lag: 0.47s, HeartbeatLag: 0.52s, State: migrating; ETA: due []
2025-10-20 08:59:29 FATAL critical-load met: Threads_connected=1019, >=1000

However, at that exact moment Threads_connected spiked past the critical-load limit and the process was aborted with a FATAL error.
The table lock appeared to hang for 1-2 minutes, and a connection spike was visible on the dashboard at that time.
The worst part is that the migration cannot be resumed after this...
Perhaps gh-ost should not fail immediately at this stage, but instead allow the limit to be raised through the socket so the cut-over can complete?
Or do you have a better idea for preventing such aborts at the very end?
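
For context, this is the kind of on-the-fly adjustment I mean. gh-ost already accepts a critical-load command on its interactive socket, but once the FATAL fires there is no process left to send it to. A rough sketch, assuming the default socket path for this migration and a netcat build that supports unix sockets (-U):

# check the migration status over the socket
echo "status" | nc -U /tmp/gh-ost.db.users.sock

# raise the threshold on the fly, without restarting the migration
echo "critical-load=Threads_running=1000,Threads_connected=2000" | nc -U /tmp/gh-ost.db.users.sock

If gh-ost paused at this point (or re-checked the load after a short grace interval) instead of aborting, the second command would be enough to let the cut-over finish.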
