bug: APISIX Ingress Controller v2.0.0 - Service Inline Upstreams Not Updated on Endpoint Changes #2689

@jasaulakh1988

Current Behavior

Summary

APISIX Ingress Controller v2.0.0 (with ADC sidecar) creates Services with inline upstreams, but does NOT update these inline upstreams when Kubernetes Endpoints change (e.g., pod restarts, rescheduling). This causes traffic to be routed to stale/non-existent pod IPs, resulting in 504 Gateway Timeout errors.

Environment

  • APISIX Gateway Version: 3.11.0
  • APISIX Ingress Controller Version: 2.0.0 (stable release)
  • Helm Chart Version: 1.1.0 (official Bitnami chart)
  • Kubernetes Version: OVH Managed Kubernetes
  • etcd: 3-node cluster

Controller Configuration

provider:
  type: apisix
  syncPeriod: 1m
  initSyncDelay: 30s

## Workaround

Manually update the Service entries in etcd:

```bash
kubectl exec -n apisix-system apisix-etcd-0 -- \
  etcdctl put /apisix/services/<service-id> '<updated-json-with-correct-ip>'
```

Then send a HUP signal to the APISIX pods to reload the configuration:

```bash
kubectl exec -n apisix-system <apisix-pod> -- kill -HUP 1
```
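
The `<updated-json-with-correct-ip>` payload can be generated by patching the stale Service value instead of editing it by hand. The following is a minimal Python sketch, assuming the Service value has the shape shown in the Evidence section of this report. `patch_service_nodes` is a hypothetical helper and is not part of APISIX or ADC.

```python
import json
import time
from typing import List, Optional


def patch_service_nodes(service_json: str, new_hosts: List[str],
                        now: Optional[int] = None) -> str:
    """Return the Service JSON with its inline upstream nodes replaced.

    Port and weight are copied from the first existing node, which assumes
    every node of the Service shares them (true for the example in this report).
    """
    service = json.loads(service_json)
    template = service["upstream"]["nodes"][0]
    service["upstream"]["nodes"] = [
        {"host": host, "port": template["port"], "weight": template["weight"]}
        for host in new_hosts
    ]
    # Assumption: refreshing update_time is only a convenience for later
    # inspection; the report does not say APISIX requires it.
    service["update_time"] = int(time.time()) if now is None else now
    return json.dumps(service, separators=(",", ":"))


# Stale Service value from the Evidence section (comments removed so it parses).
stale_service = json.dumps({
    "id": "f08f5c87",
    "name": "default_beta-websocket-routes_0",
    "update_time": 1766750592,
    "upstream": {
        "type": "roundrobin",
        "nodes": [{"host": "10.2.4.10", "port": 4000, "weight": 100}],
    },
})

# Current endpoint IP as reported by Kubernetes.
print(patch_service_nodes(stale_service, ["10.2.16.3"]))
```

The printed string is what would go between the single quotes of the `etcdctl put` command. The patch has to be repeated after every pod restart, so this remains a stopgap.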

Impact

  • Severity: Critical for production use
  • Impact: Complete service outage for affected routes when pods restart
  • Affected: Any route using the ADC sync pattern with inline service upstreams

Additional Observations

  1. Hot reload not working: Even though APISIX supports hot reload from etcd, the Service updates are never pushed to etcd in the first place.

  2. Pattern difference: Routes created with the newer controller version use upstream_id references instead of inline upstreams. These DO get updated correctly. The issue affects routes/services created before this pattern change.

  3. No errors in controller logs: The controller doesn't log any errors about failing to update services. The sync appears to complete successfully but simply doesn't update inline upstreams.

Requested Action

  1. Investigate why Service inline upstreams are not updated on endpoint changes
  2. Either:
    • Fix the ADC sync to update inline upstreams in Services, OR
    • Change the sync pattern to always use upstream_id references instead of inline upstreams
  3. Document this limitation if it's expected behavior
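
To illustrate the second option, here is a small Python sketch of what rewriting a Service from an inline upstream to an `upstream_id` reference could look like at the data level. It assumes the Service may carry `upstream_id` in place of `upstream`, and that the matching Upstream object uses the same id as the Service, as in the Evidence section. `to_upstream_reference` is a hypothetical name and is not controller code.

```python
import copy


def to_upstream_reference(service: dict, upstream_id: str) -> dict:
    """Return a copy of the Service that points at a standalone Upstream."""
    converted = copy.deepcopy(service)
    # Drop the inline node list so only the Upstream object holds pod IPs.
    converted.pop("upstream", None)
    converted["upstream_id"] = upstream_id
    return converted


inline_service = {
    "id": "f08f5c87",
    "name": "default_beta-websocket-routes_0",
    "upstream": {
        "type": "roundrobin",
        "nodes": [{"host": "10.2.4.10", "port": 4000, "weight": 100}],
    },
}

referenced = to_upstream_reference(inline_service, upstream_id="f08f5c87")
print(referenced)
```

With this shape the pod IPs live in one place, the Upstream object, which the report says is already refreshed on each sync.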

Related Information

  • Controller logs show sync completing with correct service count
  • syncPeriod: 1m is being respected (syncs every minute)
  • Separate Upstream objects are updated on each sync cycle
  • Service objects are NOT updated after initial creation
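
The last two observations can be checked from the `update_time` fields alone. The sketch below is a Python illustration using the timestamps from the Evidence section. The helper names are hypothetical. A lag longer than one sync period is only a hint, since it assumes both objects would normally be rewritten on each cycle.

```python
from datetime import datetime, timezone

SYNC_PERIOD_SECONDS = 60  # syncPeriod: 1m from the controller configuration


def update_lag_seconds(service: dict, upstream: dict) -> int:
    """Seconds by which the Service's last update trails the Upstream's."""
    return upstream["update_time"] - service["update_time"]


def utc(ts: int) -> str:
    """Render a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


service = {"id": "f08f5c87", "update_time": 1766750592}
upstream = {"id": "f08f5c87", "update_time": 1767016352}

lag = update_lag_seconds(service, upstream)
print(utc(service["update_time"]), utc(upstream["update_time"]))
print(f"lag: {lag} s ({lag / 86400:.2f} days), "
      f"suspicious: {lag > SYNC_PERIOD_SECONDS}")
```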

Contact

Happy to provide additional logs, configurations, or test scenarios to help debug this issue.

Expected Behavior

When a Kubernetes pod restarts and gets a new IP address, the APISIX Ingress Controller should update the upstream nodes in APISIX to reflect the new pod IP. Traffic should continue flowing to the new pod IP without interruption.

Actual Behavior

  1. Controller creates Services with inline upstreams containing pod IPs
  2. Controller also creates separate Upstream objects with the same pod IPs
  3. When pods restart and get new IPs:
    • The separate Upstream objects ARE updated with new IPs ✅
    • The inline upstreams inside Services are NOT updated
  4. Routes reference Services (via service_id), not the separate Upstream objects
  5. Traffic continues to be routed to stale pod IPs that no longer exist
  6. Results in 504 Gateway Timeout errors

Error Logs

Evidence

Service in etcd (NOT updated - shows stale IP):

{
  "id": "f08f5c87",
  "name": "default_beta-websocket-routes_0",
  "update_time": 1766750592,  // December 26 - 3 days old!
  "upstream": {
    "type": "roundrobin",
    "nodes": [
      {
        "host": "10.2.4.10",   // OLD IP - pod no longer exists!
        "port": 4000,
        "weight": 100
      }
    ]
  }
}

Upstream object in etcd (updated correctly):

{
  "id": "f08f5c87",
  "name": "default_beta-websocket-routes_0",
  "update_time": 1767016352,  // Today - recently updated
  "nodes": [
    {
      "host": "10.2.16.3",    // CORRECT new IP
      "port": 4000,
      "weight": 100
    }
  ]
}

Route configuration:

{
  "name": "default_beta-websocket-routes_beta-game-core-api",
  "service_id": "f08f5c87",   // References Service, not Upstream
  "upstream_id": null         // Not using separate upstream
}

APISIX error logs:

upstream timed out (110: Connection timed out) while connecting to upstream,
upstream: "http://10.2.4.10:4000/...",  // Stale IP!

Kubernetes endpoint (actual pod IP):

NAME            ENDPOINTS        AGE
game-core-web   10.2.16.3:4000   25d
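
The comparison made by hand above can be automated. The following Python sketch flags inline upstream nodes that no longer match any live endpoint. It assumes nodes use the list-of-objects form shown in this report, and that the endpoint addresses have already been collected as `ip:port` strings. `stale_nodes` is a hypothetical helper.

```python
from typing import List, Set


def stale_nodes(service: dict, live_endpoints: Set[str]) -> List[str]:
    """Return inline upstream nodes (as ip:port) absent from live endpoints."""
    nodes = service.get("upstream", {}).get("nodes", [])
    addresses = [f'{node["host"]}:{node["port"]}' for node in nodes]
    return [addr for addr in addresses if addr not in live_endpoints]


service = {
    "id": "f08f5c87",
    "upstream": {
        "type": "roundrobin",
        "nodes": [{"host": "10.2.4.10", "port": 4000, "weight": 100}],
    },
}
live = {"10.2.16.3:4000"}  # from the Kubernetes endpoint listing

print(stale_nodes(service, live))  # ['10.2.4.10:4000']
```

An empty list would mean the Service's inline upstream agrees with Kubernetes.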

Root Cause Analysis

The ADC (APISIX Declarative Configuration) sync mechanism appears to:

  1. Watch for EndpointSlice changes in Kubernetes
  2. Update the separate Upstream objects when endpoints change
  3. NOT update the inline upstream configuration inside Service objects

Since Routes reference Services (which have inline upstreams), the stale IPs persist even though the separate Upstream objects have correct IPs.
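
The suspected behavior can be reproduced in a toy model. This Python sketch is not controller or ADC code. It assumes, as described above, that a sync refreshes only the standalone Upstream, and that a Route without its own `upstream_id` resolves through the inline upstream of its Service. All function names are hypothetical.

```python
store = {
    "services": {
        "f08f5c87": {"upstream": {"nodes": [{"host": "10.2.4.10", "port": 4000}]}},
    },
    "upstreams": {
        "f08f5c87": {"nodes": [{"host": "10.2.4.10", "port": 4000}]},
    },
    "routes": {
        "beta-game-core-api": {"service_id": "f08f5c87", "upstream_id": None},
    },
}


def sync_endpoints(store: dict, object_id: str, hosts: list, port: int) -> None:
    """Mimic the reported behavior: refresh the standalone Upstream only."""
    store["upstreams"][object_id]["nodes"] = [
        {"host": host, "port": port} for host in hosts
    ]
    # The Service's inline upstream is deliberately left untouched here.


def resolve_route(store: dict, route_name: str) -> list:
    """Return the nodes a Route would send traffic to."""
    route = store["routes"][route_name]
    if route.get("upstream_id"):
        return store["upstreams"][route["upstream_id"]]["nodes"]
    return store["services"][route["service_id"]]["upstream"]["nodes"]


# The pod restarts and comes back with a new IP.
sync_endpoints(store, "f08f5c87", ["10.2.16.3"], 4000)

print(resolve_route(store, "beta-game-core-api"))  # still 10.2.4.10
print(store["upstreams"]["f08f5c87"]["nodes"])     # now 10.2.16.3
```

In this model the Route keeps resolving to the old IP until the Service's inline nodes are rewritten or the Route is given an `upstream_id`, which matches the two fixes proposed in this report.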

Steps to Reproduce

  1. Deploy APISIX Ingress Controller v2.0.0 with ADC sidecar
  2. Create an ApisixRoute resource pointing to a Kubernetes Service
  3. Wait for controller to sync (creates Service with inline upstream in APISIX)
  4. Note the pod IP in the APISIX Service's inline upstream
  5. Delete the pod (e.g., kubectl delete pod <pod-name>)
  6. Wait for new pod to start with a new IP
  7. Check the APISIX Service - inline upstream still has OLD IP
  8. Check the APISIX Upstream object - has correct NEW IP
  9. Traffic fails with 504 timeout to the old IP
