This repository was archived by the owner on Mar 22, 2021. It is now read-only.

Let's rework PAPR's architecture! #62

@jlebon

Rough problem statement

Right now, our architecture is very much like a waterfall.
Events on GitHub cause a linear cascade of events that
eventually fires off PAPR to run for those specific events.

This has severe limitations:

  • monolithic architecture makes it harder to try out locally
    and thus harder to contribute
  • reliance on multiple linear infrastructures (CI bus,
    Jenkins, OpenStack) results in a higher fault rate
  • no easy way to scale horizontally for HA

Other issues plaguing the current architecture:

  • the queue is not public, so it's hard to tell what's going
    on without manually checking the internal Jenkins queue
    (the Homu queue is somewhat visible, but could be much
    better)
  • no job prioritization, purely FIFO. Ideally, we want to be
    able to apply a set of rules as to how jobs should be
    prioritized. E.g. 'auto' and 'try' branches before PRs, PRs
    with certain labels before others, etc.
  • the combination of GHPRB and Homu is confusing and creates
    an inconsistent user experience

Rough solution

We split up the architecture into multiple small services:

  1. Homu/PAPR Scheduler (PAPRQ)
    Proposed to run in CentOS CI
  2. PAPR Workers
    Need either OpenStack or Docker/Kubernetes
    Bin packing problem: can we reuse e.g. Kubernetes Jobs for this?
    https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
  3. PAPR Publishers

The scheduler receives events from GitHub and queues up the
jobs. It understands .papr.yml and splits them into
individual jobs each representing a testsuite. This allows
e.g. workers that only support containerized workloads to
still participate in the pool. It also allows container work
to be weighed differently from VM/cluster work.
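As a rough illustration of the splitting step, here is a minimal Python sketch. It assumes the `.papr.yml` file has already been parsed into a list of dicts, one per testsuite. The `Job` type, the `KINDS` and `WEIGHTS` tables, and the two function names are hypothetical and exist only for this example; they are not part of PAPR.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Set

# Assumed provisioning kinds a testsuite can ask for (illustrative only).
KINDS = ("container", "host", "cluster")

# Assumed relative weights: container work is cheaper than VM/cluster work.
WEIGHTS = {"container": 1, "host": 4, "cluster": 8}


@dataclass(frozen=True)
class Job:
    context: str  # testsuite name, e.g. "f26-sanity"
    kind: str     # what a worker must be able to provision
    weight: int   # relative cost of running the job


def split_into_jobs(testsuites: List[Dict[str, Any]]) -> List[Job]:
    """Turn each parsed testsuite definition into one independent job."""
    jobs = []
    for index, suite in enumerate(testsuites):
        # The first provisioning key found decides the job kind.
        kind = next((k for k in KINDS if k in suite), None)
        if kind is None:
            raise ValueError(f"testsuite #{index} declares no provisioning kind")
        context = suite.get("context", f"suite-{index}")
        jobs.append(Job(context=context, kind=kind, weight=WEIGHTS[kind]))
    return jobs


def jobs_for_worker(jobs: List[Job], supported_kinds: Set[str]) -> List[Job]:
    """Keep only the jobs a worker with the given capabilities can run."""
    return [job for job in jobs if job.kind in supported_kinds]
```

A container-only worker would call `jobs_for_worker(jobs, {"container"})` and simply never see the VM or cluster jobs.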

Workers periodically (with forced polls on events if
implementable) query PAPRQ for available jobs. PAPRQ
prioritizes jobs by a given set of rules.

What it would entail

  1. The largest piece of work will be to enhance (fork?) Homu
    to also handle PR events and add them to its queue. This
    naturally resolves the confusing user experience, and makes
    optimizations like servo/homu#54 ("Add status-based test
    exemptions") trivial to implement.

E.g. @rh-atomic-bot retry will actually know whether to
retry testing the PR, or retry testing the merge.

It also allows for more sophisticated syntax, like:
@rh-atomic-bot retry f26-sanity
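To make the extended syntax concrete, here is a small Python sketch of how such a comment could be parsed. The `Command` type, the regular expression, and `parse_command` are hypothetical; Homu's actual comment parsing works differently and is not reproduced here.

```python
import re
from typing import NamedTuple, Optional

BOT = "rh-atomic-bot"


class Command(NamedTuple):
    action: str               # e.g. "retry"
    testsuite: Optional[str]  # e.g. "f26-sanity", or None for "everything"


# The command must sit on its own line; only spaces and tabs separate the
# parts, so text on the following line is never mistaken for a testsuite.
_PATTERN = re.compile(
    r"^@" + re.escape(BOT) + r"[ \t]+(\w+)(?:[ \t]+([\w.-]+))?[ \t]*\r?$",
    re.MULTILINE,
)


def parse_command(comment: str) -> Optional[Command]:
    """Extract the first bot command from a comment body, if any."""
    match = _PATTERN.search(comment)
    if match is None:
        return None
    return Command(action=match.group(1), testsuite=match.group(2))
```

With this shape, a missing testsuite name means "retry whatever was last tested", and the scheduler can decide from its own state whether that was the PR or the merge.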

  2. Teach PAPR to connect to PAPRQ for jobs. This is either a
    long-running service that polls, or is periodically started
    by an external service (e.g. Jenkins, OCP).

  3. This can come later. Rather than the workers publishing
    to e.g. S3 themselves, do something similar to what Cockpit
    does and stream logs and updates back to PAPRQ itself. This
    allows us to (1) have publicly visible streaming logs, and
    (2) keep all the secrets in PAPRQ and only require workers
    to have a single token.

Let's finalize this work and split it up amongst team
members so that everyone understands how it works, and can
help manage it.

Risks

  • Contributions/blocking on Servo/Homu team - getting review time is hard

Sub-alternative:
Sidecar/wrapper for Homu - PAPR intercepts GitHub
events and forwards them to Homu as well, but also builds
its own state.
(Investigate organization-wide GitHub events)
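A minimal Python sketch of the sidecar idea follows. The `Sidecar` class and its `forward` callback are hypothetical, and the state it keeps (open PR number to head SHA) is just one example of what PAPR might track. It reads only the `action`, `number`, and `pull_request.head.sha` fields of a pull request event, which is a small subset of the real payload.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]


class Sidecar:
    """Record state from each event, then pass the event on unchanged."""

    def __init__(self, forward: Callable[[Event], None]) -> None:
        self._forward = forward          # e.g. a function that POSTs to Homu
        self.open_prs: Dict[int, str] = {}  # PR number -> head commit SHA

    def handle(self, event: Event) -> None:
        try:
            self._record(event)
        finally:
            # Homu must still see the event even if our bookkeeping fails.
            self._forward(event)

    def _record(self, event: Event) -> None:
        pr = event.get("pull_request")
        if pr is None:
            return  # not a pull request event; nothing to track
        number = event["number"]
        if event.get("action") == "closed":
            self.open_prs.pop(number, None)
        else:
            self.open_prs[number] = pr["head"]["sha"]
```

Forwarding in a `finally` block is a deliberate choice in this sketch: a bug in the sidecar's own state tracking should not stop merges from being processed.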

Transition plan

We can take down per-PR testing during the transition while keeping testing on the auto branch running.

Alternatives

Customize Jenkins (Integrate better with CentOS CI)
Relationship with GHPRB there?

Hop on http://prow.k8s.io/ with Origin

Rely on Travis more

Other discussion

Standard test roles vs PAPR
PAPR describes more things, handles tasks like provisioning more declaratively
Could have stdtest in upstream git?

Colin: PAPR runs stdtest?
Jonathan: Problem: the tests live in a separate git repo. Unless the upstream repo also holds the stdtest definition?
