workers: revision worker implementation #178

zzzeid · 2021-12-03T20:55:43Z

This is a work in progress, do not merge!

All commits will be squashed into the minimum number of commits before merging.

add abstract Worker class (bug 1744327)
add main worker flag and capacity/throttle flags
add many to many fields + association to revisions/landing jobs
add method to parse diff and list affected files
add more test coverage for revision_worker.py
add mots integration (bug 1740107)
add new RevisionWorker that pre-processes revisions (bug 1788728)
add new RevisionWorker that pre-processes revisions (bug 1788728)
add new start/stop commands to manage workers
add new flags to stop workers gracefully (*_WORKER_STOPPED)
add patch caching on disk
add proper loop/process functionality to workers
add repo.use_revision_worker feature flag (bug 1788732)
add mots hashes check
improved edge search functionality
implement stack hashes to detect changes in revisions (via get_stack_hashes)
include new Lando revision info via API endpoint
refactor dependency and stack fetching and parsing using networkx
refactored revision worker and landing worker to use Worker class
remove s3/boto/etc. dependencies (bug 1753728)
rename old command lando-cli landing-worker to lando-cli start-landing-worker
run pre/post mots query
store mots output in revision model

TODO:

separate commit that moved landing_worker to workers module

cgsheeh · 2021-12-06T19:34:54Z

@zzzeid Curious about the motivation for using asyncio here - is the revision worker going to be an IO-bound process? Making the workers use async is a decent size change in terms of how the app functions and the programming model, I think we should make sure the benefits are tangible. 👍

If this is still exploratory work (it is a WIP after all) that's fine too of course. :)

zzzeid · 2021-12-06T19:49:34Z

@zzzeid Curious about the motivation for using asyncio here - is the revision worker going to be an IO-bound process? Making the workers use async is a decent size change in terms of how the app functions and the programming model, I think we should make sure the benefits are tangible. 👍

If this is still exploratory work (it is a WIP after all) that's fine too of course. :)

This is exploratory work, though the main motivation is around graceful shutdown, which is half-broken right now. It's a relatively small change to get graceful shutdown to work if we are mid-processing a job (at least as long as the job is not stalled). In the future we will have to shore-up this implementation to allow for cancelling jobs at will or having a timeout (which will require asyncio) to avoid what happened in bug 1740791.

This is still a WIP, there's a couple more changes to add to get the graceful shutdown to work completely as expected on both workers, and there's more testing I am going to be doing. However there should be no functional change to how jobs are processed. Rather than factoring out a buggy functionality in the new abstract worker, figured I would fix this while I'm there.

Edit: this is only exploratory in the sense that it's still not fully tested and proven to behave perfectly well. Once that's done, if there aren't any problems uncovered, this will likely land as-is unless there's a good reason not to.

cgsheeh · 2021-12-07T01:03:36Z

Changing a sync app to instead run as async isn't really a small change, it might be in terms of the diff size but it changes how the app works on a fundamental level. Before we add async features we should consciously decide that it's worth the added complexity in Lando. It's also unclear to me why "to allow for cancelling jobs at will or having a timeout" will require asyncio. Perhaps if you could elucidate this I could get on board with this change, but in my experience using the async features of Python is only rarely worth the trouble and this use case doesn't seem appropriate.

zzzeid · 2021-12-07T15:03:27Z

in my experience using the async features of Python is only rarely worth the trouble

Can you give an example of what sort of trouble you'd run into? I think using asyncio can complicate debugging, so that could be a big detractor, but otherwise it's a pretty mature feature that I think could be leveraged in beneficial ways -- specifically, the revision worker could be used to run multiple jobs asynchronously since it doesn't modify the state of the remote repo and will need to process a large number of revisions concurrently/continuously. We could possibly also have two types of workers and keep the landing worker sync if we anticipate problems.

It's also unclear to me why "to allow for cancelling jobs at will or having a timeout" will require asyncio

There are other alternatives of solving this problem (specifically timeouts), it's not a requirement. However, with the current reliance on the flags that control the running, paused, or processing state of the worker, this implementation is more of a bug fix. For example, if you send a SIGTERM or SIGKILL while a job is processing, the worker will never exit gracefully since the worker state is modified after the signal handler is executed. This particular problem is solved with the event loop.

Of course, we could also track this state elsewhere (e.g. in a lock file or DB, etc.) but that also adds different complexity -- I'll be considering other approaches here, but would be curious to hear what kinds of problems you think we'll run into with asyncio.

cgsheeh · 2021-12-07T23:02:20Z

Complications in debugging are one example. We'll also likely have to re-write large parts of the app in async if we start writing async-specific functionality (/me shudders thinking of previous personal projects where this exact thing happened...) since you can't call async functions from non-async functions. Once we do that, we'll need to make sure our tests account for the change to async code flow. It's also worth noting that calling sync functions from async functions is blocking anyways and kills the benefit of having the event loop. I haven't checked but some of our dependencies are likely blocking/sync specific, for example accessing the database. We would likely need to update all our subprocess calls to use asyncio's subprocess module, among a plethora of other modifications. Not that anything I've mentioned is impossible or out of our capabilities, but it's definitely a lot to consider when adding code to achieve a graceful shutdown.

My point isn't that we shouldn't be doing things concurrently/in-parallel, or that we shouldn't use async because it isn't a mature feature, it's that we should consider other alternatives as async is a special use case that adds a lot of moving parts to a Python app. If accessing revision data is slow we can create a threadpool and run hg within subprocesses managed by the threadpool (as an example). Running the entire app inside an event loop so that one specific section of code can run asynchronously or to get a graceful shutdown by killing tasks within the event loop adds more complexity than it's worth and changes our basic assumptions about code flow in Lando.

However, with the current reliance on the flags that control the running, paused, or processing state of the worker, this implementation is more of a bug fix. For example, if you send a SIGTERM or SIGKILL while a job is processing, the worker will never exit gracefully since the worker state is modified after the signal handler is executed.

Yeah, this is an unfortunate situation. Could we avoid making any changes to the DB until the end of the worker's execution (ie a db transaction that only commits once the full task is complete?) Is there other worker state to consider here?

landoapi/api/revisions.py

landoapi/models/landing_job.py

cgsheeh

First pass, mostly some code-level nits that stood out to me.

I'm sure you are aware but this is quite a lot of code on top of being a large design change. :) Splitting out some of the trivial changes into very small PRs would be desirable IMO, just to keep the review process moving along. I'll try and dig deeper into this tomorrow.

landoapi/hg.py

landoapi/models/revisions.py

landoapi/patches.py

landoapi/stacks.py

cgsheeh · 2022-09-14T20:58:19Z

landoapi/workers/landing_worker.py

@@ -0,0 +1,448 @@
+# This Source Code Form is subject to the terms of the Mozilla Public


I think it would be highly desirable for worker.py or landing_worker.py to be registered as a move in Git instead of a new file, to keep git blame history for the file. My vote would be for the landing_worker.py.

landoapi/workers/landing_worker.py

requirements.in

landoapi/models/revisions.py

landoapi/workers/revision_worker.py

- use updated response from differential.revision.search endpoint - translate `stackGraph` field to phids and edges - update `tests.mocks` such that revision response includes new field

This is a work in progress, do not merge! - add new RevisionWorker that pre-processes revisions (bug 1788728) - add new command (lando-cli revision-worker) to start new worker - add new flags to stop workers gracefully - add method to parse diff and list affected files - add mots integration (bug 1740107) - store mots output in revision model - run pre/post mots query - check mots hashes - add mots to requirements - include new Lando revision info via API endpoint - create abstract Worker class (bug 1744327) - add repo.use_revision_worker feature flag (bug 1788732) - add proper loop/process functionality to workers - refactored revision worker and landing worker - implement stack hashes to detect changes in revisions - functional multi edge search - remove s3/boto/etc. dependencies (bug 1753728) - add new "created" landing job status to handle revision propagation - add patch caching on disk - add many to many fields + association to revisions/landing jobs - refactor dependency and stack fetching and parsing using networkx - add main worker flag and capacity/throttle flags TODO: - add more test coverage for revision_worker.py

NOTE: commit to be cleaned up / squashed. - update project cache to be more streamlined - add support for cache when testing locally - refactor build_stack_graph to use revision directly - reduce the number of calls to phabricator API - improve stack discovery - fix tests

This way, processor and revision workers are separated into their own classes with shared functionality in the base class.

- add ability to specify max loops in workers - simplify sleep/throttle/stop/pause methods - add various unit tests - add integration test to test Supervisor/Processor functionality

- landing worker workflow - merge conflict - add ability to modify edges in phabdouble - test updating stack layout

- add some logging for debugging - lock table when updating revision - modify repo config

zzzeid · 2022-11-15T14:39:21Z

Closing this PR and moving to #224 so I can modify PR dependencies more easily.

zzzeid changed the title ~~workers: create abstract Worker class, use asyncio (bug 1744327)~~ workers: abstract Worker class and RevisionWorker (bugs 1740107, 1744327) Dec 3, 2021

zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch 2 times, most recently from 93dab06 to 4182a52 Compare December 6, 2021 17:30

zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch from 4182a52 to 9c13862 Compare December 7, 2021 21:22

zzzeid closed this Feb 8, 2022

zzzeid deleted the zeid/bug-1744327-refactor-workers branch February 8, 2022 20:29

zzzeid restored the zeid/bug-1744327-refactor-workers branch February 8, 2022 20:33

zzzeid reopened this Feb 8, 2022

zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch from 9c13862 to 7f2a24e Compare June 14, 2022 20:26

zzzeid commented Jun 14, 2022

View reviewed changes

landoapi/api/revisions.py Outdated Show resolved Hide resolved

zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch 3 times, most recently from bac1509 to 7add66f Compare June 27, 2022 16:18

zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch from 7add66f to d251d17 Compare July 7, 2022 15:58

zzzeid commented Jul 7, 2022

View reviewed changes

landoapi/models/landing_job.py Outdated Show resolved Hide resolved

zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch 2 times, most recently from c5d2cab to 2962439 Compare July 14, 2022 17:55

zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch from 2962439 to 1c52dec Compare August 12, 2022 19:25

zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch 3 times, most recently from e01494c to 8b62298 Compare August 25, 2022 16:45

zzzeid mentioned this pull request Aug 25, 2022

docker: add support for revision worker (bug 1787300) mozilla-conduit/suite#43

Closed

zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch 3 times, most recently from 0907457 to f849935 Compare September 1, 2022 16:04

cgsheeh reviewed Sep 14, 2022

View reviewed changes

zzzeid commented Sep 15, 2022

View reviewed changes

landoapi/models/revisions.py Outdated Show resolved Hide resolved

zzzeid marked this pull request as ready for review September 15, 2022 18:54

zzzeid commented Sep 15, 2022

View reviewed changes

landoapi/workers/revision_worker.py Outdated Show resolved Hide resolved

zzzeid commented Sep 15, 2022

View reviewed changes

landoapi/workers/revision_worker.py Outdated Show resolved Hide resolved

zzzeid commented Sep 15, 2022

View reviewed changes

landoapi/workers/revision_worker.py Show resolved Hide resolved

zzzeid added 4 commits September 21, 2022 11:23

requirements: upgrade flask-caching to 2.0.1 (bug 1791807)

a704523

stacks: update build_stack_graph to use new endpoint (bug 1787516)

0ccc49d

- use updated response from differential.revision.search endpoint - translate `stackGraph` field to phids and edges - update `tests.mocks` such that revision response includes new field

code review feedback

2215e14

zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch from ec9962d to 2215e14 Compare September 26, 2022 15:52

dklawren requested review from dklawren and removed request for dklawren September 27, 2022 23:59

zzzeid added 12 commits October 4, 2022 11:58

Refactor revision worker to have its own base class.

c47af6c

This way, processor and revision workers are separated into their own classes with shared functionality in the base class.

Add integration tests (part 1)

806d387

- add ability to specify max loops in workers - simplify sleep/throttle/stop/pause methods - add various unit tests - add integration test to test Supervisor/Processor functionality

Add integration test

c1f4523

Add more tests

c233330

- landing worker workflow - merge conflict - add ability to modify edges in phabdouble - test updating stack layout

Small bug fix

a74c7a4

remove "CREATED" status and redo migrations

1a90c95

Add check to result response from Phabricator

f776586

minor fix

767bc8b

enable revision worker flags

2d14b58

use diff ID directly from phabricator response

a8e25ba

Various fixes

45c76ef

- add some logging for debugging - lock table when updating revision - modify repo config

zzzeid force-pushed the zeid/bug-1744327-refactor-workers branch from 342f979 to 45c76ef Compare November 7, 2022 19:45

zzzeid added 2 commits November 8, 2022 14:48

code review feedback

e4db70c

Merge branch 'main' into zeid/bug-1744327-refactor-workers

96d8db7

zzzeid closed this Nov 15, 2022

		@@ -0,0 +1,448 @@
		# This Source Code Form is subject to the terms of the Mozilla Public

workers: revision worker implementation #178

workers: revision worker implementation #178

Uh oh!

Conversation

zzzeid commented Dec 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cgsheeh commented Dec 6, 2021

Uh oh!

zzzeid commented Dec 6, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cgsheeh commented Dec 7, 2021

Uh oh!

zzzeid commented Dec 7, 2021

Uh oh!

cgsheeh commented Dec 7, 2021

Uh oh!

Uh oh!

Uh oh!

cgsheeh left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cgsheeh Sep 14, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zzzeid commented Nov 15, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zzzeid commented Dec 3, 2021 •

edited

Loading

zzzeid commented Dec 6, 2021 •

edited

Loading