
Conversation


Copilot AI commented Jun 5, 2025

Thanks for assigning this issue to me. I'm starting to work on it and will keep this PR's description up to date as I form a plan and make progress.

Original issue description:

We have successfully operated all our Deckard instances with a single housekeeper pod for years. To enhance the scalability of the housekeeper tasks, I propose the following improvements to the housekeeper feature:

  1. Implement a distributed locking mechanism for each task to support running multiple housekeeper pods simultaneously. While most tasks can run concurrently due to their atomic nature, running the same task in parallel on different housekeeper instances wastes resources (a rough sketch of this idea follows the list below).

  2. Address potential issues such as Prometheus metrics duplication. When the housekeeper is enabled, a Deckard instance exposes numerous queue metrics on its /metrics endpoint. Since the housekeeper is responsible for measuring many of these metrics, deploying several housekeeper pods with /metrics enabled can duplicate them. This mainly affects gauge metrics (queue size, age of the oldest elements, etc.). We must define how to guarantee these metrics are not duplicated, since a single housekeeper may be responsible for generating metrics that summarize the whole environment.

  3. Some jobs, like unlocking, currently have a performance limitation. If too many elements are locked, elements waiting to be unlocked start to pile up, which may impact lock-time precision. Today we prefer using the score filters (max_score and min_score) to act as a locker (filtering with min_score as now() and then adding the lock time to the score when acking/adding a message), but we want a fully functional locking mechanism (a small sketch of this score arithmetic appears after the closing paragraph below).
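
For item 1, a minimal sketch of what a per-task lock could look like, assuming the Redis cache already shared by the Deckard pods can serve as the coordination store. TaskLock, TryAcquire, and Release are illustrative names, not an existing Deckard API:

```go
// Hypothetical per-task distributed lock backed by Redis. This is a sketch of
// the idea from item 1, not code that exists in the repository today.
package housekeeper

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

type TaskLock struct {
	client *redis.Client
	owner  string // unique per housekeeper instance, e.g. the pod name
}

// TryAcquire attempts to take the lock for a task. Only the instance that
// wins the SET NX race gets true; the TTL guarantees the lock is eventually
// released even if the owner crashes mid-task.
func (l *TaskLock) TryAcquire(ctx context.Context, task string, ttl time.Duration) (bool, error) {
	return l.client.SetNX(ctx, "housekeeper:lock:"+task, l.owner, ttl).Result()
}

// releaseScript deletes the lock only if this instance still owns it, so a
// slow task cannot remove a lock that has already expired and been re-acquired.
var releaseScript = redis.NewScript(`
if redis.call("GET", KEYS[1]) == ARGV[1] then
	return redis.call("DEL", KEYS[1])
end
return 0
`)

func (l *TaskLock) Release(ctx context.Context, task string) error {
	return releaseScript.Run(ctx, l.client, []string{"housekeeper:lock:" + task}, l.owner).Err()
}
```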

By incorporating these enhancements, we aim to achieve better scalability, improved fault tolerance, and better overall performance in our distributed Deckard setup.
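
As a toy illustration of the score-based "locking" described in item 3, the arithmetic looks roughly like the following. The numbers and field names are simplified for illustration and are not Deckard's actual request types:

```go
// Toy illustration of locking a message by pushing its score into the future
// and filtering pulls on the current timestamp. Not Deckard's actual API.
package main

import (
	"fmt"
	"time"
)

func main() {
	now := time.Now().UnixMilli()
	lockFor := 5 * time.Minute

	// When acking/adding, the message score is pushed lockFor into the future.
	lockedScore := now + lockFor.Milliseconds()

	// Subsequent pulls filter on the current timestamp (via min_score/max_score),
	// so a message scored in the future is skipped until the window elapses,
	// behaving like a lock without a dedicated unlock job.
	fmt.Printf("message hidden until ~%s (score=%d, now=%d)\n",
		time.UnixMilli(lockedScore).Format(time.RFC3339), lockedScore, now)
}
```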

Fixes #21.



Copilot finished work on behalf of lucasoares June 5, 2025 17:53
@lucasoares

@copilot where is the implementation?


Copilot AI commented Jun 5, 2025

@copilot where is the implementation?

You're right - I only committed an initial plan but haven't implemented the distributed execution improvements yet. I'm starting to work on the implementation now. The current housekeeper (in internal/queue/queue_housekeeper.go and internal/cmd/deckard/main.go) runs tasks serially in a single instance. I'll be adding distributed locking to allow multiple housekeeper instances to run concurrently without conflicts.
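
To make the integration point concrete, here is a rough sketch of how the serial task loop might be wrapped with a distributed lock so several instances can share the same task list. The Locker interface, Task type, and RunTasks function are hypothetical names for illustration, not the actual code in internal/queue/queue_housekeeper.go:

```go
// Sketch of a housekeeper task loop guarded by a distributed lock; whichever
// instance wins the lock runs the task, the others skip it instead of
// duplicating the work. Names are illustrative, not Deckard's real code.
package housekeeper

import (
	"context"
	"log"
	"time"
)

type Locker interface {
	TryAcquire(ctx context.Context, task string, ttl time.Duration) (bool, error)
	Release(ctx context.Context, task string) error
}

type Task struct {
	Name string
	Run  func(ctx context.Context) error
}

func RunTasks(ctx context.Context, locker Locker, tasks []Task, ttl time.Duration) {
	for _, t := range tasks {
		ok, err := locker.TryAcquire(ctx, t.Name, ttl)
		if err != nil || !ok {
			continue // another instance holds the lock (or the lock store errored)
		}

		if err := t.Run(ctx); err != nil {
			log.Printf("housekeeper task %s failed: %v", t.Name, err)
		}

		_ = locker.Release(ctx, t.Name)
	}
}
```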

@copilot where is the implementation?

The distributed housekeeper implementation is now complete! 🎉

Key features implemented:

  • ✅ Distributed locking for task coordination across multiple instances
  • ✅ Metrics leader election to prevent Prometheus duplication
  • ✅ Parallel unlocking for improved performance
  • ✅ Full backward compatibility with single-instance deployments

Check out the implementation in commits 93f37e7 and da6f860, plus comprehensive documentation in docs/distributed-housekeeper.md with configuration examples and deployment guides.
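
Illustrating the "metrics leader election" bullet above, the general shape of the guard could look like the sketch below. It assumes a renewable key in the shared Redis cache and is not the exact code from the commits referenced above:

```go
// Sketch of gating gauge collection behind a leader key so only one
// housekeeper exports queue-wide metrics. Assumes Redis as the shared store;
// names are illustrative.
package housekeeper

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

type MetricsLeader struct {
	client *redis.Client
	self   string // unique instance id, e.g. the pod name
}

// IsLeader tries to claim (or keep) the metrics-leader key. Only the instance
// holding the key computes queue-wide gauges such as queue size and oldest
// element age, so the /metrics endpoints of the other pods never duplicate them.
func (m *MetricsLeader) IsLeader(ctx context.Context, ttl time.Duration) bool {
	ok, err := m.client.SetNX(ctx, "housekeeper:metrics-leader", m.self, ttl).Result()
	if err != nil {
		return false
	}
	if ok {
		return true
	}
	// Not the winner of SET NX: check whether we already own the key and,
	// if so, extend the lease so leadership is sticky between scrapes.
	owner, err := m.client.Get(ctx, "housekeeper:metrics-leader").Result()
	if err != nil || owner != m.self {
		return false
	}
	return m.client.Expire(ctx, "housekeeper:metrics-leader", ttl).Val()
}
```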

Copilot AI and others added 2 commits June 5, 2025 19:53
Co-authored-by: lucasoares <10624972+lucasoares@users.noreply.github.com>
Co-authored-by: lucasoares <10624972+lucasoares@users.noreply.github.com>
Copilot finished work on behalf of lucasoares June 5, 2025 19:56
