KAFKA-19960: Close pending tasks on shutdown. #21365
Nikita-Shupletsov wants to merge 6 commits into apache:trunk from
Added logic to close pending tasks to init. Made standby task closure similar to the one for active tasks. Added a separate method for getting standby tasks from task registry. Added an integration test that reproduces the issue.
mjsax
left a comment
Made a first pass.
Can you update the PR description to add context on what exactly the bug is and when we hit it? It seems to be related to not closing "pending tasks", but it would be good to give some more context.
.../src/test/java/org/apache/kafka/streams/integration/RebalanceTaskClosureIntegrationTest.java
Outdated
.../src/test/java/org/apache/kafka/streams/integration/RebalanceTaskClosureIntegrationTest.java
Outdated
.../src/test/java/org/apache/kafka/streams/integration/RebalanceTaskClosureIntegrationTest.java
Outdated
.../src/test/java/org/apache/kafka/streams/integration/RebalanceTaskClosureIntegrationTest.java
.../src/test/java/org/apache/kafka/streams/integration/RebalanceTaskClosureIntegrationTest.java
.../src/test/java/org/apache/kafka/streams/integration/RebalanceTaskClosureIntegrationTest.java
...ntegration-tests/src/test/java/org/apache/kafka/streams/integration/KafkaStreamsWrapper.java
streams/src/main/java/org/apache/kafka/streams/processor/internals/Tasks.java
Outdated
streams/src/main/java/org/apache/kafka/streams/processor/internals/TaskManager.java
streams/src/main/java/org/apache/kafka/streams/processor/internals/Tasks.java
…ms/integration/RebalanceTaskClosureIntegrationTest.java Co-authored-by: Matthias J. Sax <mjsax@apache.org>
Minor refactoring.
        } else {
            standbyTasks.add(pendingTask);
        }
    }
Thanks for updating the PR description. It says "shutdown during rebalance when active tasks become standby tasks", but it seems this can go either way, and the PR is actually fixing both directions?
streams/src/main/java/org/apache/kafka/streams/processor/internals/TaskManager.java
streams/src/main/java/org/apache/kafka/streams/processor/internals/Tasks.java
Outdated
streams/src/main/java/org/apache/kafka/streams/processor/internals/Tasks.java
lucasbru
left a comment
Good find, thanks for identifying the problem.
An alternative way to fix it would be, I think, to consider pendingTasksToInit as a subset of Tasks.allTasks. Then conceptually, it may be simpler to say "close all tasks". But we'd have to check carefully what other places Tasks.allTasks and related methods are used, and if we are changing one of those places. In that case, we'd still need the "remove" fix you did.
streams/src/main/java/org/apache/kafka/streams/processor/internals/Tasks.java
Outdated
streams/src/main/java/org/apache/kafka/streams/processor/internals/Tasks.java
On second thought, that would probably create more problems than it solves. But I wonder if we should rename "allTasks" to "allInitializedTasks" then?
The problem with drain is that we always try to remove the task we are closing from the task registry: https://github.com/Nikita-Shupletsov/kafka/blob/8ee99ee82df2cf89fbb769d26c66395fd3a63761/streams/src/main/java/org/apache/kafka/streams/processor/internals/TaskManager.java#L1405. If we drain the pending tasks, there will be no way for us to confirm that they ever existed there. So we either need to relax the check in the remove method or have a separate branch for closing pending-to-init tasks.
Yes, that is what I mean by this -- after we drain the pending-tasks-to-init, we would add them into some new "pending-task-to-close" member inside. It's to some extent "cosmetics", as we only move all "pending tasks" from one collection into a different one, but it preserves the invariant we have and makes the code cleaner.
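To make the invariant discussed here concrete, below is a minimal sketch. It is not the Kafka Streams `Tasks` class: `RegistrySketch`, its string task ids, and the simplified `removeTask` check are all hypothetical, and only the method names `addPendingTasksToClose` and the drain idea are borrowed from this thread. It shows why draining alone breaks a strict "remove" check, and how moving the drained tasks into a separate collection keeps the check intact.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for the task registry (not the real Tasks class).
final class RegistrySketch {
    // Task id -> true if the pending task is active, false if it is a standby.
    private final Map<String, Boolean> pendingTasksToInit = new HashMap<>();
    private final Set<String> pendingTasksToClose = new HashSet<>();

    void addPendingTaskToInit(final String taskId, final boolean active) {
        pendingTasksToInit.put(taskId, active);
    }

    // Removes and returns all pending tasks of the requested kind.
    Set<String> drainPendingTasksToInit(final boolean active) {
        final Set<String> drained = new HashSet<>();
        pendingTasksToInit.entrySet().removeIf(entry -> {
            if (entry.getValue() == active) {
                drained.add(entry.getKey());
                return true;
            }
            return false;
        });
        return drained;
    }

    // Keeps drained tasks known to the registry until they are actually closed.
    void addPendingTasksToClose(final Set<String> taskIds) {
        pendingTasksToClose.addAll(taskIds);
    }

    // Strict check (assumed for this sketch): removing an unknown task is treated as a bug.
    void removeTask(final String taskId) {
        if (!pendingTasksToClose.remove(taskId)) {
            throw new IllegalStateException("Attempted to remove unknown task " + taskId);
        }
    }
}
```

In this sketch, calling `removeTask` right after `drainPendingTasksToInit` throws, while calling `addPendingTasksToClose` with the drained ids first lets the same `removeTask` call succeed without relaxing the check.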
No problem, it's a minor thing.
mjsax
left a comment
A few follow-up comments. I am still working through the integration test, but it might be good to share what I have right away.
    final Set<Task> standbyTasks = new TreeSet<>(Comparator.comparing(Task::id));
    standbyTasks.addAll(tasks.standbyTasks());

    Set<Task> pendingActiveTasks = tasks.drainPendingActiveTasksToInit();
Seems this should be final -- wondering why the build did not fail with a checkstyle error? Can we check this?
    Set<Task> pendingActiveTasks = tasks.drainPendingActiveTasksToInit();
    activeTasks.addAll(pendingActiveTasks);
    tasks.addPendingTasksToClose(pendingActiveTasks);
    Set<Task> pendingStandbyTasks = tasks.drainPendingStandbyTasksToInit();
    }

    @Override
    public void addPendingTasksToClose(Collection<Task> tasks) {
    // TODO: change type to `StreamTask`
    final Set<Task> activeTasks = new TreeSet<>(Comparator.comparing(Task::id));
    activeTasks.addAll(tasks.activeTasks());
    final Set<Task> standbyTasks = new TreeSet<>(Comparator.comparing(Task::id));
nit: should this be of StandbyTask type (compare the comment above about the StreamTask type)? We could also just update the comment and do a follow-up PR to improve the types later.
    final StreamTask activeTask1 = statefulTask(taskId00, taskId00ChangelogPartitions)
        .inState(State.RUNNING)
        .withInputPartitions(taskId00Partitions).build();
    final StreamTask activeTask2 = statefulTask(taskId01, taskId01ChangelogPartitions)
Why are we removing this task from the test? Is using a single task sufficient (and it's unclear why we did use two tasks to begin with)?
    final TaskManager taskManager = setUpTaskManager(ProcessingMode.AT_LEAST_ONCE, tasks);

    final StandbyTask standbyTask00 = standbyTask(taskId00, taskId00ChangelogPartitions)
        .inState(State.RUNNING)
        .build();

    final StreamTask activeTask01 = statefulTask(taskId01, taskId00ChangelogPartitions)
        .inState(State.RUNNING)
    tasks.addPendingTasksToInit(Set.of(activeTask1, activeTask2, standbyTask1, standbyTask2));

    final Set<Task> standbyTasksToInit = tasks.drainPendingStandbyTasksToInit();
    assertEquals(2, standbyTasksToInit.size());
    assertEquals(2, standbyTasksToInit.size());
    assertEquals(2, standbyTasksToInit.size());
    assertFalse(tasks.allTasks().contains(activeTask1));

    tasks.addPendingTasksToClose(List.of(activeTask1));
    assertTrue(tasks.pendingTasksToClose().contains(activeTask1));
Why do we add it back and verify? Could we not do the verification above, right after we added the task the first time?
     * The conditions that we need to meet:
     * <p><ul>
     * <li>There is a task with an open store in {@link org.apache.kafka.streams.processor.internals.TasksRegistry#pendingTasksToInit}</li>
     * <li>StreamThread gets into PENDING_SHUTDOWN state, so that {@link StreamThread#isStartingRunningOrPartitionAssigned} return false
     * <li>StreamThread gets into PENDING_SHUTDOWN state, so that {@link StreamThread#isStartingRunningOrPartitionAssigned} return false
     * <li>StreamThread gets into PENDING_SHUTDOWN state, so that {@link StreamThread#isStartingRunningOrPartitionAssigned} returns false
This PR fixes a bug where KS doesn't close stores if the shutdown is
triggered during a rebalance in which an active task gets converted to a
standby one and put into pendingTasksToInit.
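
As a rough illustration of the behavior described above, here is a minimal sketch of a shutdown path that also closes tasks still waiting for initialization. Everything in it is hypothetical (`ShutdownSketch`, `SketchTask`, the `storeClosed` flag, and the choice to close active tasks before standby tasks); it is not the `TaskManager` code from this PR, only a toy model of "pending tasks hold open stores, so shutdown must close them too".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model of the shutdown path (not the real TaskManager).
final class ShutdownSketch {

    // Hypothetical task: tracks only its id, its kind, and whether its store was closed.
    static final class SketchTask {
        final String id;
        final boolean active;
        boolean storeClosed = false;

        SketchTask(final String id, final boolean active) {
            this.id = id;
            this.active = active;
        }

        void close() {
            storeClosed = true;
        }
    }

    // Closes initialized tasks and tasks still pending initialization.
    // Drains pendingToInit and returns the ids in the order they were closed.
    static List<String> closeAll(final List<SketchTask> initialized,
                                 final List<SketchTask> pendingToInit) {
        final Set<SketchTask> activeTasks = new TreeSet<>(Comparator.comparing((SketchTask t) -> t.id));
        final Set<SketchTask> standbyTasks = new TreeSet<>(Comparator.comparing((SketchTask t) -> t.id));

        for (final SketchTask task : initialized) {
            if (task.active) {
                activeTasks.add(task);
            } else {
                standbyTasks.add(task);
            }
        }
        // Without this loop, a task converted during a rebalance would keep its store open.
        for (final SketchTask pendingTask : pendingToInit) {
            if (pendingTask.active) {
                activeTasks.add(pendingTask);
            } else {
                standbyTasks.add(pendingTask);
            }
        }
        pendingToInit.clear();

        final List<String> closedIds = new ArrayList<>();
        for (final SketchTask task : activeTasks) {
            task.close();
            closedIds.add(task.id);
        }
        for (final SketchTask task : standbyTasks) {
            task.close();
            closedIds.add(task.id);
        }
        return closedIds;
    }
}
```

The sketch handles pending tasks of both kinds, which matches the reviewer's observation that the conversion can go in either direction.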