Skip to content

Conversation

@zzzeid
Copy link
Contributor

@zzzeid zzzeid commented Dec 3, 2021

This is a work in progress, do not merge!

All commits will be squashed into the minimum number of commits before merging.

  • add abstract Worker class (bug 1744327)
  • add main worker flag and capacity/throttle flags
  • add many to many fields + association to revisions/landing jobs
  • add method to parse diff and list affected files
  • add more test coverage for revision_worker.py
  • add mots integration (bug 1740107)
  • add new RevisionWorker that pre-processes revisions (bug 1788728)
  • add new RevisionWorker that pre-processes revisions (bug 1788728)
  • add new start/stop commands to manage workers
  • add new flags to stop workers gracefully (*_WORKER_STOPPED)
  • add patch caching on disk
  • add proper loop/process functionality to workers
  • add repo.use_revision_worker feature flag (bug 1788732)
  • add mots hashes check
  • improved edge search functionality
  • implement stack hashes to detect changes in revisions (via get_stack_hashes)
  • include new Lando revision info via API endpoint
  • refactor dependency and stack fetching and parsing using networkx
  • refactored revision worker and landing worker to use Worker class
  • remove s3/boto/etc. dependencies (bug 1753728)
  • rename old command lando-cli landing-worker to lando-cli start-landing-worker
  • run pre/post mots query
  • store mots output in revision model

TODO:

  • separate commit that moved landing_worker to workers module

@zzzeid zzzeid changed the title workers: create abstract Worker class, use asyncio (bug 1744327) workers: abstract Worker class and RevisionWorker (bugs 1740107, 1744327) Dec 3, 2021
@zzzeid zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch 2 times, most recently from 93dab06 to 4182a52 Compare December 6, 2021 17:30
@cgsheeh
Copy link
Member

cgsheeh commented Dec 6, 2021

@zzzeid Curious about the motivation for using asyncio here - is the revision worker going to be an IO-bound process? Making the workers use async is a decent size change in terms of how the app functions and the programming model, I think we should make sure the benefits are tangible. 👍

If this is still exploratory work (it is a WIP after all) that's fine too of course. :)

@zzzeid
Copy link
Contributor Author

zzzeid commented Dec 6, 2021

@zzzeid Curious about the motivation for using asyncio here - is the revision worker going to be an IO-bound process? Making the workers use async is a decent size change in terms of how the app functions and the programming model, I think we should make sure the benefits are tangible. 👍

If this is still exploratory work (it is a WIP after all) that's fine too of course. :)

This is exploratory work, though the main motivation is around graceful shutdown, which is half-broken right now. It's a relatively small change to get graceful shutdown to work if we are mid-processing a job (at least as long as the job is not stalled). In the future we will have to shore-up this implementation to allow for cancelling jobs at will or having a timeout (which will require asyncio) to avoid what happened in bug 1740791.

This is still a WIP, there's a couple more changes to add to get the graceful shutdown to work completely as expected on both workers, and there's more testing I am going to be doing. However there should be no functional change to how jobs are processed. Rather than factoring out a buggy functionality in the new abstract worker, figured I would fix this while I'm there.

Edit: this is only exploratory in the sense that it's still not fully tested and proven to behave perfectly well. Once that's done, if there aren't any problems uncovered, this will likely land as-is unless there's a good reason not to.

@cgsheeh
Copy link
Member

cgsheeh commented Dec 7, 2021

Changing a sync app to instead run as async isn't really a small change, it might be in terms of the diff size but it changes how the app works on a fundamental level. Before we add async features we should consciously decide that it's worth the added complexity in Lando. It's also unclear to me why "to allow for cancelling jobs at will or having a timeout" will require asyncio. Perhaps if you could elucidate this I could get on board with this change, but in my experience using the async features of Python is only rarely worth the trouble and this use case doesn't seem appropriate.

@zzzeid
Copy link
Contributor Author

zzzeid commented Dec 7, 2021

in my experience using the async features of Python is only rarely worth the trouble

Can you give an example of what sort of trouble you'd run into? I think using asyncio can complicate debugging, so that could be a big detractor, but otherwise it's a pretty mature feature that I think could be leveraged in beneficial ways -- specifically, the revision worker could be used to run multiple jobs asynchronously since it doesn't modify the state of the remote repo and will need to process a large number of revisions concurrently/continuously. We could possibly also have two types of workers and keep the landing worker sync if we anticipate problems.

It's also unclear to me why "to allow for cancelling jobs at will or having a timeout" will require asyncio

There are other alternatives of solving this problem (specifically timeouts), it's not a requirement. However, with the current reliance on the flags that control the running, paused, or processing state of the worker, this implementation is more of a bug fix. For example, if you send a SIGTERM or SIGKILL while a job is processing, the worker will never exit gracefully since the worker state is modified after the signal handler is executed. This particular problem is solved with the event loop.

Of course, we could also track this state elsewhere (e.g. in a lock file or DB, etc.) but that also adds different complexity -- I'll be considering other approaches here, but would be curious to hear what kinds of problems you think we'll run into with asyncio.

@zzzeid zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch from 4182a52 to 9c13862 Compare December 7, 2021 21:22
@cgsheeh
Copy link
Member

cgsheeh commented Dec 7, 2021

Complications in debugging are one example. We'll also likely have to re-write large parts of the app in async if we start writing async-specific functionality (/me shudders thinking of previous personal projects where this exact thing happened...) since you can't call async functions from non-async functions. Once we do that, we'll need to make sure our tests account for the change to async code flow. It's also worth noting that calling sync functions from async functions is blocking anyways and kills the benefit of having the event loop. I haven't checked but some of our dependencies are likely blocking/sync specific, for example accessing the database. We would likely need to update all our subprocess calls to use asyncio's subprocess module, among a plethora of other modifications. Not that anything I've mentioned is impossible or out of our capabilities, but it's definitely a lot to consider when adding code to achieve a graceful shutdown.

My point isn't that we shouldn't be doing things concurrently/in-parallel, or that we shouldn't use async because it isn't a mature feature, it's that we should consider other alternatives as async is a special use case that adds a lot of moving parts to a Python app. If accessing revision data is slow we can create a threadpool and run hg within subprocesses managed by the threadpool (as an example). Running the entire app inside an event loop so that one specific section of code can run asynchronously or to get a graceful shutdown by killing tasks within the event loop adds more complexity than it's worth and changes our basic assumptions about code flow in Lando.

However, with the current reliance on the flags that control the running, paused, or processing state of the worker, this implementation is more of a bug fix. For example, if you send a SIGTERM or SIGKILL while a job is processing, the worker will never exit gracefully since the worker state is modified after the signal handler is executed.

Yeah, this is an unfortunate situation. Could we avoid making any changes to the DB until the end of the worker's execution (ie a db transaction that only commits once the full task is complete?) Is there other worker state to consider here?

@zzzeid zzzeid closed this Feb 8, 2022
@zzzeid zzzeid deleted the zeid/bug-1744327-refactor-workers branch February 8, 2022 20:29
@zzzeid zzzeid restored the zeid/bug-1744327-refactor-workers branch February 8, 2022 20:33
@zzzeid zzzeid reopened this Feb 8, 2022
@zzzeid zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch from 9c13862 to 7f2a24e Compare June 14, 2022 20:26
@zzzeid zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch 3 times, most recently from bac1509 to 7add66f Compare June 27, 2022 16:18
@zzzeid zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch from 7add66f to d251d17 Compare July 7, 2022 15:58
@zzzeid zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch 2 times, most recently from c5d2cab to 2962439 Compare July 14, 2022 17:55
@zzzeid zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch from 2962439 to 1c52dec Compare August 12, 2022 19:25
@zzzeid zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch 3 times, most recently from e01494c to 8b62298 Compare August 25, 2022 16:45
@zzzeid zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch 3 times, most recently from 0907457 to f849935 Compare September 1, 2022 16:04
Copy link
Member

@cgsheeh cgsheeh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, mostly some code-level nits that stood out to me.

I'm sure you are aware but this is quite a lot of code on top of being a large design change. :) Splitting out some of the trivial changes into very small PRs would be desirable IMO, just to keep the review process moving along. I'll try and dig deeper into this tomorrow.

@@ -0,0 +1,448 @@
# This Source Code Form is subject to the terms of the Mozilla Public
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be highly desirable for worker.py or landing_worker.py to be registered as a move in Git instead of a new file, to keep git blame history for the file. My vote would be for the landing_worker.py.

@zzzeid zzzeid marked this pull request as ready for review September 15, 2022 18:54
- use updated response from differential.revision.search endpoint
- translate `stackGraph` field to phids and edges
- update `tests.mocks` such that revision response includes new field
This is a work in progress, do not merge!

- add new RevisionWorker that pre-processes revisions (bug 1788728)
- add new command (lando-cli revision-worker) to start new worker
- add new flags to stop workers gracefully
- add method to parse diff and list affected files
- add mots integration (bug 1740107)
- store mots output in revision model
- run pre/post mots query
- check mots hashes
- add mots to requirements
- include new Lando revision info via API endpoint
- create abstract Worker class (bug 1744327)
- add repo.use_revision_worker feature flag (bug 1788732)
- add proper loop/process functionality to workers
- refactored revision worker and landing worker
- implement stack hashes to detect changes in revisions
- functional multi edge search
- remove s3/boto/etc. dependencies (bug 1753728)
- add new "created" landing job status to handle revision propagation
- add patch caching on disk
- add many to many fields + association to revisions/landing jobs
- refactor dependency and stack fetching and parsing using networkx
- add main worker flag and capacity/throttle flags

TODO:

- add more test coverage for revision_worker.py
@zzzeid zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch from ec9962d to 2215e14 Compare September 26, 2022 15:52
@dklawren dklawren requested review from dklawren and removed request for dklawren September 27, 2022 23:59
zzzeid added 12 commits October 4, 2022 11:58
NOTE: commit to be cleaned up / squashed.

- update project cache to be more streamlined
- add support for cache when testing locally
- refactor build_stack_graph to use revision directly
- reduce the number of calls to phabricator API
- improve stack discovery
- fix tests
This way, processor and revision workers are separated into their own
classes with shared functionality in the base class.
- add ability to specify max loops in workers
- simplify sleep/throttle/stop/pause methods
- add various unit tests
- add integration test to test Supervisor/Processor functionality
- landing worker workflow
- merge conflict
- add ability to modify edges in phabdouble
- test updating stack layout
- add some logging for debugging
- lock table when updating revision
- modify repo config
@zzzeid zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch from 342f979 to 45c76ef Compare November 7, 2022 19:45
@zzzeid
Copy link
Contributor Author

zzzeid commented Nov 15, 2022

Closing this PR and moving to #224 so I can modify PR dependencies more easily.

@zzzeid zzzeid closed this Nov 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants