FLPATH-3294: Kessel ReBAC authorization integration for on-prem#5933
FLPATH-3294: Kessel ReBAC authorization integration for on-prem#5933jordigilh wants to merge 12 commits intoproject-koku:mainfrom
Conversation
905f8ad to
4708a4c
Compare
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces Kessel ReBAC authorization for on-prem Koku deployments, providing a more flexible and granular access control mechanism. It includes a new Django app, SpiceDB schema, middleware, and API changes to integrate Kessel with Koku. The changes aim to remove the dependency on the SaaS RBAC service for on-prem deployments and enable workspace-based access control. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Ignored Files
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces significant changes to integrate Kessel (ReBAC) authorization for on-premise deployments, marked by a major Django version upgrade to 5.2.0 and the addition of kessel-sdk and sqlalchemy dependencies. Key changes include the introduction of an ONPREM environment variable across various ClowdApp and Kustomize configurations to toggle Kessel functionality, alongside comprehensive documentation and Docker Compose setups for local Kessel development. Worker scaling configurations in clowdapp.yaml and related patches were updated to use autoScaler with Prometheus triggers. The /status API endpoint was simplified to return {"status": "OK"}, which is a breaking change for clients expecting detailed information. Additionally, a JVM setting related to GC locker retry allocation was removed from the Trino configuration, and trino_query.py was refactored to use a centralized report database accessor. The removal of the JVM setting is concerning as it was previously noted to mitigate a JDK bug, and its reintroduction of performance or stability issues should be verified.
Note: Security Review did not run due to the size of the PR.
I am having trouble creating individual review comments. Click here to see my feedback.
dev/containers/trino/etc/jvm.config (22-23)
The removal of -XX:GCLockerRetryAllocationCount=8 is a bit concerning. The original comment indicated it was added to mitigate a specific JDK bug related to thread starvation. Removing this setting might reintroduce performance or stability issues unless the underlying bug has been fixed by a JVM upgrade. Could you please provide some context for this removal or confirm that it's safe to do so?
docs/specs/openapi.json (11514-11573)
The response schema for the /status endpoint has been significantly simplified. This is a breaking change for any clients that were parsing the previous detailed response (which included api_version, commit, platform_info, etc.). While the new {"status": "OK"} response is simpler for basic health checks, this change should be clearly communicated as a breaking API change in the release notes.
4708a4c to
6d22155
Compare
Remove 4 ephemeral/working documents from kessel-integration/: - kessel-handover-session-2.md (handoff session notes) - FLPATH-3319-description.md (Jira ticket body copy) - kessel-ocp-test-plan-retired.md (dead retired test scenarios) - kessel-hld-gaps-and-updates.md (working notes, all applied to HLD) Fix broken links: - Redirect 6 kessel-ocp-implementation-guide.md links in HLD to kessel-ocp-detailed-design.md (which supersedes the impl guide) - Remove stale kessel-hld-gaps-and-updates.md references from HLD and detailed design - Replace kessel-ocp-test-plan-retired.md links in test plan with inline notes Closes PR project-koku#5887 which is fully superseded by PR project-koku#5933. Made-with: Cursor
|
@lcouzens @myersCody PTAL |
Remove 4 ephemeral/working documents from kessel-integration/: - kessel-handover-session-2.md (handoff session notes) - FLPATH-3319-description.md (Jira ticket body copy) - kessel-ocp-test-plan-retired.md (dead retired test scenarios) - kessel-hld-gaps-and-updates.md (working notes, all applied to HLD) Fix broken links: - Redirect 6 kessel-ocp-implementation-guide.md links in HLD to kessel-ocp-detailed-design.md (which supersedes the impl guide) - Remove stale kessel-hld-gaps-and-updates.md references from HLD and detailed design - Replace kessel-ocp-test-plan-retired.md links in test plan with inline notes Closes PR project-koku#5887 which is fully superseded by PR project-koku#5933. Made-with: Cursor
f7ef99c to
8a8083b
Compare
Remove 4 ephemeral/working documents from kessel-integration/: - kessel-handover-session-2.md (handoff session notes) - FLPATH-3319-description.md (Jira ticket body copy) - kessel-ocp-test-plan-retired.md (dead retired test scenarios) - kessel-hld-gaps-and-updates.md (working notes, all applied to HLD) Fix broken links: - Redirect 6 kessel-ocp-implementation-guide.md links in HLD to kessel-ocp-detailed-design.md (which supersedes the impl guide) - Remove stale kessel-hld-gaps-and-updates.md references from HLD and detailed design - Replace kessel-ocp-test-plan-retired.md links in test plan with inline notes Closes PR project-koku#5887 which is fully superseded by PR project-koku#5933. Made-with: Cursor
d2019d2 to
b93275d
Compare
|
Cross-references for the Kessel ReBAC integration:
|
|
Upstream schema PRs created (companion to this integration):
Both PRs upstream changes that were validated end-to-end using local copies bundled in the Helm chart. |
|
There is a meeting internally to discuss this one, so we will hold reviewing until after the meeting has happened. |
Hey Cody, when is that meeting happening so that I know when the review will resume? |
449e764 to
d94d70c
Compare
upadhyeammit
left a comment
There was a problem hiding this comment.
And I think we need to define SpiceDB loss vs Postgres loss? restore order for same?
| PermClass->>AuthAdapter: has_permission(<br/>user="alice",<br/>resource_type="openshift.cluster",<br/>action="read") | ||
| activate AuthAdapter | ||
|
|
||
| Note over AuthAdapter: Check cache (30s TTL) |
There was a problem hiding this comment.
I do see the TTL has mentioned inconsistently, 30 seconds to 300 seconds. HLD diagrams/Open Questions mention 30s in places; DD and KESSEL_CACHE_TIMEOUT use 300s. Which is authoritative for operators, and can Appendix D (KESSEL_CACHE_TTL default 30) be aligned or removed?
There was a problem hiding this comment.
Great catch, thank you. I aligned the docs to the implemented behavior in this branch: cache uses KESSEL_CACHE_TIMEOUT with a default of 300 seconds (from settings.py), and I removed stale KESSEL_CACHE_TTL/30s guidance from the HLD sections and appendix.
| ```bash | ||
| # Resync all OCP providers | ||
| python manage.py kessel_sync_resources --provider-type ocp | ||
|
|
||
| # Resync specific provider | ||
| python manage.py kessel_sync_resources --provider-id 123 | ||
|
|
||
| # Dry-run mode (show what would be synced) | ||
| python manage.py kessel_sync_resources --dry-run | ||
| ``` |
There was a problem hiding this comment.
Where these commands would be implemented? I dont see these on the same branch at least.
There was a problem hiding this comment.
You are absolutely right. Those manage.py kessel_* command examples were not implemented in this PR branch. I updated the docs to avoid presenting them as shipped behavior, clarified the current operational tooling, and removed command-based runbook steps that implied implementation in this branch.
| - `cost-management-inventory` — used by the Inventory API pod for its outbound Relations API (gRPC) calls | ||
| - **Relations API** enables JWT validation via `ENABLEAUTH=true` + `JWKSURL` pointing to Keycloak's JWKS endpoint | ||
| - **Inventory API** switches from `allow-unauthenticated: true` to an OIDC authenticator chain, and enables `enable-oidc-auth: true` for its outbound Relations API calls with SA credentials | ||
| - **Koku** uses a thread-safe `TokenProvider` ([`koku_rebac/kessel_auth.py`](../../../koku/koku_rebac/kessel_auth.py)) that acquires tokens via client_credentials grant and caches them until 30s before expiry |
There was a problem hiding this comment.
hmm.. how the operational rotation of Kessel Credentials would be handled?
Better explanation of situation using AI:
Access tokens (JWTs) — Koku gets them with OAuth2 client_credentials, caches them, and refreshes them in memory before they expire. That is automatic while the app runs; the workspace ADR describes this.
Client secrets — The Keycloak client secret (and similar mounted env vars) is read when the process starts. Rotating it in Keycloak/Kubernetes does not hot-reload inside a running pod; you normally update the Secret and roll the deployment so processes pick up the new value.
So: “token refresh” in the docs means short-lived JWT rotation, not rotating the OAuth client secret without a restart. A full operational runbook for secret rotation (order of updates, restarts, checks) is not spelled out in those architecture docs.
There was a problem hiding this comment.
Agreed, and thanks for the clear breakdown. I added an explicit credential-rotation runbook in the ADR that separates token refresh (automatic in-process) from client-secret rotation (Secret update plus rollout/restart plus verification plus old-secret revocation). This now documents the operational order explicitly.
| | Environment | Authorization backend | How selected | | ||
| |-------------|-----------------------|------------------------------------------------------------------| | ||
| | On-prem | ReBAC (Kessel) | `ONPREM=true` forces `AUTHORIZATION_BACKEND=rebac` at startup | | ||
| | SaaS | RBAC (default) | `AUTHORIZATION_BACKEND=rbac` (default, no env var needed) | | ||
| | SaaS future | ReBAC (Kessel) | Unleash flag overrides to `rebac` per org (documented hook only) | | ||
|
|
There was a problem hiding this comment.
Should we define Koku x Kessel x SpiceDB (or OCP) compatibility, I dont see thats mentioned anywhere? This also brings more questions about how we would flag out incompatible versions and force the upgrade?
There was a problem hiding this comment.
Good point. I added a dedicated compatibility/upgrade guardrails section in the DD covering Koku to Inventory API, Koku to Relations API, and Relations to SpiceDB contracts, plus operator actions on incompatibility. I also documented that this branch does not add a hard startup version gate in Koku; enforcement remains deployment-layer.
|
@upadhyeammit Thanks for the additional review note on SpiceDB vs Postgres loss/restore ordering. I addressed this in the HLD by adding an explicit failure-mode recovery matrix and ordered restore steps for: (1) Kessel/SpiceDB lost with Postgres intact, (2) Postgres lost with Kessel intact, and (3) both lost, plus a consistency-validation checklist.\n\nAlso pushed a doc-only follow-up commit with all review-driven updates: c2e0313. |
|
@upadhyeammit @myersCody Pushed a doc-only follow-up (c2e0313) addressing all review comments from today:
PTAL when you get a chance. |
…emas Add kessel-relations-client-python and kessel-inventory-client-python SDK packages. Add dev/kessel/ stack with docker-compose, SpiceDB schema, seed-roles.yaml, inventory config, and resource schemas for all Cost Management resource types. Made-with: Cursor
Add koku_rebac package: KesselAccessProvider, client, config, resource_reporter, workspace management, and exception handling. Add integration resource as first-class Kessel type with sources access permissions. Add comprehensive unit, integration, contract, and E2E tests. Includes fixes for SQL tuple extraction, cache variance on identity header, User.is_authenticated for DRF, resource tuple cleanup, schema pre-provisioning, and workspace-level Check preference over StreamedListObjects. Made-with: Cursor
…ream Add JWT/OIDC authentication for Kessel API calls with configurable token endpoint. Add connection failure detection and exponential backoff retry for gRPC calls. Sync seed-roles.yaml with ros-helm-chart and add integration:read permission. Fix ZED schema to make cost_management_all_read/write computed permissions. Fix resource reporting to use real cluster_id instead of provider UUID. Made-with: Cursor
Add architecture documents: HLD (kessel-ocp-integration.md), DD (kessel-ocp-detailed-design.md), test plan, and authorization delegation DD. Add ADRs for on-prem authorization backend selection, workspace management, and rbac-config reuse. Add insights-rbac feasibility analysis, ZED schema upstream delta tracker, and Kessel development guide. Consolidate all docs into kessel-integration/ directory with consistent cross-references. Made-with: Cursor
Add rebac-bridge-design.md: Go microservice exposing insights-rbac v1 compatible REST endpoints backed by Kessel/SpiceDB, with workspace abstraction, group-level resource assignment, and UI extraction strategy. Add IEEE 829 test plan (rebac-bridge-test-plan.md). Includes auth/authz hardening, pseudocode replacement with validated source, and ADR updates for pure Kessel architecture. Made-with: Cursor
Add README.md index with reading paths for architects, developers, and the Kessel team. Add Kafka-based resource reporting detailed design. Update kessel-ocp-detailed-design.md status to Implemented. Made-with: Cursor
- Fix wrong org_id derivation in cost model creation (use customer.org_id)
- Add missing on_resource_deleted call when cost models are destroyed
- Guard against None access dict crash in cost model queryset
- Remove spurious on_resource_created("settings") from middleware
(no cost_management/settings definition exists in schema.zed)
- Fix sources/api/view.py gating: use AUTHORIZATION_BACKEND == "rebac"
instead of ONPREM for on_resource_deleted and integration access filter
Made-with: Cursor
- Add auth headers to Relations API POST calls in _create_resource_tuples and create_structural_tuple (parity with existing DELETE path) - Add provider_uuid field to KesselSyncedResource model (folded into 0001_initial migration since it hasn't shipped upstream yet) - Thread provider_uuid through on_resource_created and _track_synced_resource - Update all call sites (provider_builder, ocp/aws/gcp/azure report accessors, aws_org_unit_crawler) to pass provider_uuid - Scope cleanup_orphaned_kessel_resources to filter by provider_uuid, preventing accidental deletion of resources from other providers - Replace fragile org_id extraction from schema_name in remove_expired.py with a proper Customer model lookup Made-with: Cursor
- Guard user.customer being None in KesselAccessProvider - Add token expiry tracking and refresh to RbacV2Resolver (was cached forever, mirroring the pattern already used in kessel_auth.py) - Validate token URL in kessel_auth.py and workspace.py to fail fast when KESSEL_AUTH_OIDC_ISSUER is unconfigured - Handle missing CA file in client.py with fallback to system roots - Add debug logging to _reset_client exception handler Made-with: Cursor
- Change relations-api depends_on from service_started to service_healthy so it waits for SpiceDB to be actually ready - Add HTTP healthchecks to relations-api and inventory-api services - Update inventory-api depends_on relations-api to use service_healthy - Add dev-only comment to inventory-config.yaml allow-unauthenticated - Add integration to resource types list in README (was missing) Made-with: Cursor
Resolve reviewer concerns by aligning cache and env-var docs with implemented settings, removing references to non-shipped management commands, adding Postgres↔Kessel recovery ordering guidance, documenting credential rotation runbook, and clarifying compatibility guardrails. Made-with: Cursor
3bfa133 to
883ada0
Compare
- Reformat all koku_rebac/ files with black - Reorder python imports across affected modules - Remove unused imports (F401): RbacService, time, caches, log_json, PropertyMock, MagicMock, patch, settings, override_settings - Fix JSON formatting in kessel resource schemas - Fix trailing whitespace in docs Made-with: Cursor
|
Closing for now — will reopen after resolving test failures from rebase. |
Summary
Adds Kessel ReBAC (Relationship-Based Access Control) as the authorization backend for on-prem Koku deployments, replacing the SaaS RBAC service dependency. This enables fine-grained, workspace-based access control using SpiceDB as the underlying authorization engine.
Builds on top of: #5895 (Sources API compatibility endpoints by @ELK4N4)
Key Changes
koku_rebacDjango app: New authorization module withKesselAccessProviderthat integrates with Kessel Inventory (gRPC) forStreamedListObjectsand Kessel Relations (REST) for tuple managementdev/kessel/schema.zed): Defines the full authorization model including workspaces, roles, role bindings, and cost management resource types (clusters, projects, nodes, VMs, integrations, and all cloud provider resources)cost_management/integrationresources with structural relationships (has_cluster,has_project) enabling computed permission cascading — access to a namespace automatically grants visibility of its parent cluster and integrationSettingsAccessPermissionwithSourcesAccessPermissionthat filters sources by Kesselintegration.readaccess; addsvary_on_headersfor per-user cache isolationIsAuthenticatedfor on-prem reads (non-sensitive settings), fixing 403s for non-admin usersKokuTenantMiddlewareresolves authorization viaKesselAccessProviderorRBACAccessProviderbased onAUTHORIZATION_BACKENDsettingArchitecture Documentation
docs/architecture/kessel-integration/onprem-workspace-management-adr.md— covers workspace hierarchy design, integration resource type rationale, PRD12 requirements coverage, Kessel gap inventory, and future expansion plans (VMs, AWS/Azure/GCP, split metering)docs/architecture/kessel-integration/kessel-development-guide.mdSpiceDB Schema Highlights
cost_management/integrationwith structural containment (has_cluster,has_account,has_subscription, etc.)openshift_project.readon a namespace automatically provides visibility of parent clusters and integrationsopenshift_vm,aws_account,aws_organizational_unit,azure_subscription_guid,gcp_account,gcp_projectt_parentinheritanceConfiguration
AUTHORIZATION_BACKENDrbackesselfor on-premKESSEL_INVENTORY_HOSTlocalhostKESSEL_INVENTORY_PORT9081KESSEL_RELATIONS_URLhttp://localhost:8100ENHANCED_ORG_ADMINFalseTest Plan
koku_rebacmodule (client, config, access_provider, resource_reporter, middleware)Kessel API Gaps Identified
Documented in the ADR for the Kessel team:
t_workspace; structural tuples (has_cluster,has_project) require direct SpiceDB accessintegrationtype with cascadingreadpermission is not expressible via Kessel Inventory APIzedCLI; no Kessel API for schema managementMade with Cursor
Related PRs