Let's rework PAPR's architecture! #62
Description
Rough problem statement
Right now, our architecture is very much like a waterfall.
Events on GitHub cause a linear cascade of events that
eventually fires off PAPR to run for those specific events.
This has severe limitations:
- monolithic architecture makes it harder to try out locally
and thus harder to contribute
- reliance on multiple linear infrastructures (CI bus,
Jenkins, OpenStack) results in a higher fault rate
- no easy way to scale horizontally for HA
Other issues plaguing the current architecture:
- queue is not easily visible/not public so it's hard to
tell what's going on without manually checking the
internal Jenkins queue (also homu queue is sort of visible
but could be way better)
- no job prioritization, purely FIFO. But ideally, we want
to be able to apply a set of rules as to how jobs should be
prioritized. E.g. 'auto' and 'try' branch before PRs, PRs
with certain labels before others, etc...
- the combination of GHPRB and Homu is confusing and creates
an inconsistent user experience
Rough solution
We split up the architecture into multiple small services:
- Homu/PAPR Scheduler (PAPRQ)
  - Proposed to run in CentOS CI
- PAPR Workers
  - Need either OpenStack or Docker/Kubernetes
  - Bin packing problem - can we e.g. reuse Kubernetes
    https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
- PAPR Publishers
The scheduler receives events from GitHub and queues up the
jobs. It understands .papr.yml and splits them into
individual jobs each representing a testsuite. This allows
e.g. workers that only support containerized workloads to
still participate in the pool. It also allows container work
to be weighed differently from VM/cluster work.
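A rough sketch of the splitting step, assuming the .papr.yml has already been parsed into a list of dicts, one per testsuite. The key names (`context`, `container`, `host`, `cluster`), the `Job` type, and both helper functions are placeholders for illustration, not the actual PAPR schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

# Hypothetical keys naming the environment a testsuite needs; the real
# .papr.yml schema may differ.
ENV_KINDS = ("container", "host", "cluster")


@dataclass(frozen=True)
class Job:
    repo: str
    ref: str
    context: str  # testsuite name, e.g. "f26-sanity"
    kind: str     # one of ENV_KINDS


def split_into_jobs(repo: str, ref: str, suites: list[dict[str, Any]]) -> list[Job]:
    """Turn the parsed testsuites of one .papr.yml into one Job per testsuite."""
    jobs = []
    for suite in suites:
        kinds = [k for k in ENV_KINDS if k in suite]
        if len(kinds) != 1:
            raise ValueError(
                f"testsuite {suite.get('context')!r} must define exactly one of {ENV_KINDS}"
            )
        jobs.append(Job(repo, ref, suite["context"], kinds[0]))
    return jobs


def jobs_for_worker(jobs: list[Job], supported: set[str]) -> list[Job]:
    """Keep only the jobs a worker can run, e.g. {"container"} for a container-only worker."""
    return [job for job in jobs if job.kind in supported]


if __name__ == "__main__":
    suites = [
        {"context": "f26-sanity", "container": {"image": "example-image"}},
        {"context": "f26-atomic", "host": {"distro": "example-distro"}},
    ]
    jobs = split_into_jobs("example/repo", "example-ref", suites)
    print(jobs_for_worker(jobs, {"container"}))
```

Because each job carries its own `kind`, the scheduler could also use it to weigh container work differently from VM/cluster work.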
Workers periodically (with forced polls on events if
implementable) query PAPRQ for available jobs. PAPRQ
prioritizes jobs by a given set of rules.
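One way the rule-based prioritization could look, as a minimal sketch. The `QueuedJob` fields, the `urgent` label, and the specific rule order are assumptions made up for illustration; only the idea that 'auto' and 'try' go before PRs, and labeled PRs before others, comes from the problem statement.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class QueuedJob:
    branch: str                       # e.g. "auto", "try", or a PR identifier
    labels: frozenset = frozenset()   # labels on the PR, if any
    context: str = ""                 # testsuite name


def priority(job, urgent_labels=frozenset({"urgent"})):
    """Lower values are served first. The rules here are illustrative only."""
    if job.branch in ("auto", "try"):
        return 0
    if job.labels & urgent_labels:
        return 1
    return 2


class JobQueue:
    def __init__(self):
        self._heap = []
        # Monotonic counter: keeps FIFO order among jobs of equal priority
        # and guarantees the job objects themselves are never compared.
        self._counter = itertools.count()

    def push(self, job):
        heapq.heappush(self._heap, (priority(job), next(self._counter), job))

    def pop(self):
        """Return the next job for a polling worker, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A worker poll would then boil down to a call to `pop()`; an empty queue returns `None` so the worker can just sleep and retry.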
What it would entail
- The largest piece of work will be to enhance (fork?) Homu
to also handle PR events and add them to its queue. This
naturally resolves the confusing UX, and makes
optimizations like servo/homu#54 (Add status-based test
exemptions) trivial to implement.
E.g. @rh-atomic-bot retry will actually know whether to
retry testing the PR, or retry testing the merge.
It also allows for more sophisticated syntax, like:
@rh-atomic-bot retry f26-sanity
- Teach PAPR to connect to PAPRQ for jobs. This is either a
long-running service that polls, or is periodically started
by an external service (e.g. Jenkins, OCP).
- This can come later. Rather than the workers publishing
to e.g. S3 themselves, do something similar to what Cockpit
does and stream logs and updates back to PAPRQ itself. This
allows us to (1) have publicly visible streaming logs, and
(2) keep all the secrets in PAPRQ and only require workers
to have a single token.
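For the extended retry syntax mentioned above, parsing could be as simple as the sketch below. This is not Homu's actual command parser; `RetryCommand` and `parse_retry` are hypothetical names, and deciding whether to retry the PR or the merge would still be up to the scheduler's own state.

```python
import re
from typing import NamedTuple, Optional

BOT_NAME = "rh-atomic-bot"


class RetryCommand(NamedTuple):
    testsuite: Optional[str]  # None means "retry everything"


# Matches "@rh-atomic-bot retry" with an optional single testsuite name.
_RETRY_RE = re.compile(r"@" + re.escape(BOT_NAME) + r"[ \t]+retry(?:[ \t]+(\S+))?")


def parse_retry(comment_body: str) -> Optional[RetryCommand]:
    """Return the first retry command found on its own line, or None."""
    for line in comment_body.splitlines():
        match = _RETRY_RE.fullmatch(line.strip())
        if match is not None:
            return RetryCommand(testsuite=match.group(1))
    return None
```

Requiring the command to sit on its own line keeps ordinary comments that merely mention the bot from triggering a retry.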
Let's finalize this work and split it up amongst team
members so that everyone understands how it works, and can
help manage it.
Risks
- Contributions/blocking on Servo/Homu team
  - getting review time is hard
Sub-alternative:
Sidecar/wrapper for Homu - PAPR intercepts GitHub
events and forwards them to Homu as well, but also builds
its own state.
(Investigate organization-wide GitHub events)
Transition plan
Can take down per-PR testing while keeping up testing on auto branch.
Alternatives
Customize Jenkins (Integrate better with CentOS CI)
Relationship with GHPRB there?
Hop on http://prow.k8s.io/ with Origin
Rely on Travis more
Other discussion
Standard test roles vs PAPR
PAPR describes more things, handles tasks like provisioning more declaratively
Could have stdtest in upstream git?
Colin: PAPR runs stdtest?
Jonathan: Problem: Test in separate git repo. Unless upstream repo also holds stdtest definition?