
Conversation


Copilot AI commented Jun 5, 2025

Thanks for assigning this issue to me. I'm starting to work on it and will keep this PR's description up to date as I form a plan and make progress.

Original issue description:

We have successfully operated all our Deckard instances with a single housekeeper pod for years. To enhance the scalability of the housekeeper tasks, I propose the following improvements to the housekeeper feature:

  1. Implement a distributed locking mechanism for each task to support running multiple housekeeper pods simultaneously. While most tasks can run concurrently due to their atomic nature, running the same task in parallel on different housekeeper instances wastes resources (a rough sketch of this idea follows the list below).

  2. Address potential issues such as Prometheus metrics duplication. When the housekeeper is enabled, a Deckard instance exposes numerous queue metrics on its /metrics endpoint. Since the housekeeper is responsible for measuring many of these metrics, deploying several housekeeper pods with /metrics enabled can duplicate them. This mainly affects gauge metrics (queue size, age of the oldest elements, etc.). We must define how to guarantee these metrics are not duplicated, since a single housekeeper may be responsible for generating metrics that summarize the whole environment.

  3. Some jobs, like unlocking, currently have a performance limitation. If too many elements are locked, elements waiting to be unlocked start to pile up, which may impact lock-time precision. Today we prefer using the score filters (max_score and min_score) to act as a locker (filtering with min_score as now() and then adding the lock time to the score when acking/adding a message), but we want a fully functional locking mechanism (a small sketch of this score arithmetic appears after the closing paragraph below).
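
For item 1, a minimal sketch of what a per-task lock could look like, assuming the Redis cache already shared by the Deckard pods can serve as the coordination store. TaskLock, TryAcquire, and Release are illustrative names, not an existing Deckard API:

```go
// Hypothetical per-task distributed lock backed by Redis. This is a sketch of
// the idea from item 1, not code that exists in the repository today.
package housekeeper

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

type TaskLock struct {
	client *redis.Client
	owner  string // unique per housekeeper instance, e.g. the pod name
}

// TryAcquire attempts to take the lock for a task. Only the instance that
// wins the SET NX race gets true; the TTL guarantees the lock is eventually
// released even if the owner crashes mid-task.
func (l *TaskLock) TryAcquire(ctx context.Context, task string, ttl time.Duration) (bool, error) {
	return l.client.SetNX(ctx, "housekeeper:lock:"+task, l.owner, ttl).Result()
}

// releaseScript deletes the lock only if this instance still owns it, so a
// slow task cannot remove a lock that has already expired and been re-acquired.
var releaseScript = redis.NewScript(`
if redis.call("GET", KEYS[1]) == ARGV[1] then
	return redis.call("DEL", KEYS[1])
end
return 0
`)

func (l *TaskLock) Release(ctx context.Context, task string) error {
	return releaseScript.Run(ctx, l.client, []string{"housekeeper:lock:" + task}, l.owner).Err()
}
```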

By incorporating these enhancements, we aim to achieve better scalability, improved fault tolerance, and better overall performance in our distributed Deckard setup.
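
As a toy illustration of the score-based "locking" described in item 3, the arithmetic looks roughly like the following. The numbers and field names are simplified for illustration and are not Deckard's actual request types:

```go
// Toy illustration of locking a message by pushing its score into the future
// and filtering pulls on the current timestamp. Not Deckard's actual API.
package main

import (
	"fmt"
	"time"
)

func main() {
	now := time.Now().UnixMilli()
	lockFor := 5 * time.Minute

	// When acking/adding, the message score is pushed lockFor into the future.
	lockedScore := now + lockFor.Milliseconds()

	// Subsequent pulls filter on the current timestamp (via min_score/max_score),
	// so a message scored in the future is skipped until the window elapses,
	// behaving like a lock without a dedicated unlock job.
	fmt.Printf("message hidden until ~%s (score=%d, now=%d)\n",
		time.UnixMilli(lockedScore).Format(time.RFC3339), lockedScore, now)
}
```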

Fixes #21.



Copilot finished work on behalf of lucasoares June 5, 2025 17:53
@lucasoares

@copilot where is the implementation?


Copilot AI commented Jun 5, 2025

@copilot where is the implementation?

You're right - I only committed an initial plan but haven't implemented the distributed execution improvements yet. I'm starting to work on the implementation now. The current housekeeper (in internal/queue/queue_housekeeper.go and internal/cmd/deckard/main.go) runs tasks serially in a single instance. I'll be adding distributed locking to allow multiple housekeeper instances to run concurrently without conflicts.
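
To make the integration point concrete, here is a rough sketch of how the serial task loop might be wrapped with a distributed lock so several instances can share the same task list. The Locker interface, Task type, and RunTasks function are hypothetical names for illustration, not the actual code in internal/queue/queue_housekeeper.go:

```go
// Sketch of a housekeeper task loop guarded by a distributed lock; whichever
// instance wins the lock runs the task, the others skip it instead of
// duplicating the work. Names are illustrative, not Deckard's real code.
package housekeeper

import (
	"context"
	"log"
	"time"
)

type Locker interface {
	TryAcquire(ctx context.Context, task string, ttl time.Duration) (bool, error)
	Release(ctx context.Context, task string) error
}

type Task struct {
	Name string
	Run  func(ctx context.Context) error
}

func RunTasks(ctx context.Context, locker Locker, tasks []Task, ttl time.Duration) {
	for _, t := range tasks {
		ok, err := locker.TryAcquire(ctx, t.Name, ttl)
		if err != nil || !ok {
			continue // another instance holds the lock (or the lock store errored)
		}

		if err := t.Run(ctx); err != nil {
			log.Printf("housekeeper task %s failed: %v", t.Name, err)
		}

		_ = locker.Release(ctx, t.Name)
	}
}
```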

@copilot where is the implementation?

The distributed housekeeper implementation is now complete! 🎉

Key features implemented:

  • ✅ Distributed locking for task coordination across multiple instances
  • ✅ Metrics leader election to prevent Prometheus duplication
  • ✅ Parallel unlocking for improved performance
  • ✅ Full backward compatibility with single-instance deployments

Check out the implementation in commits 93f37e7 and da6f860, plus comprehensive documentation in docs/distributed-housekeeper.md with configuration examples and deployment guides.
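
Illustrating the "metrics leader election" bullet above, the general shape of the guard could look like the sketch below. It assumes a renewable key in the shared Redis cache and is not the exact code from the commits referenced above:

```go
// Sketch of gating gauge collection behind a leader key so only one
// housekeeper exports queue-wide metrics. Assumes Redis as the shared store;
// names are illustrative.
package housekeeper

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

type MetricsLeader struct {
	client *redis.Client
	self   string // unique instance id, e.g. the pod name
}

// IsLeader tries to claim (or keep) the metrics-leader key. Only the instance
// holding the key computes queue-wide gauges such as queue size and oldest
// element age, so the /metrics endpoints of the other pods never duplicate them.
func (m *MetricsLeader) IsLeader(ctx context.Context, ttl time.Duration) bool {
	ok, err := m.client.SetNX(ctx, "housekeeper:metrics-leader", m.self, ttl).Result()
	if err != nil {
		return false
	}
	if ok {
		return true
	}
	// Not the winner of SET NX: check whether we already own the key and,
	// if so, extend the lease so leadership is sticky between scrapes.
	owner, err := m.client.Get(ctx, "housekeeper:metrics-leader").Result()
	if err != nil || owner != m.self {
		return false
	}
	return m.client.Expire(ctx, "housekeeper:metrics-leader", ttl).Val()
}
```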

Copilot AI and others added 2 commits June 5, 2025 19:53
Co-authored-by: lucasoares <10624972+lucasoares@users.noreply.github.com>
Co-authored-by: lucasoares <10624972+lucasoares@users.noreply.github.com>
Copilot finished work on behalf of lucasoares June 5, 2025 19:56
