feat: enable localdns hosts plugin to cache critical AKS FQDNs#8161
feat: enable localdns hosts plugin to cache critical AKS FQDNs#8161
Conversation
Add aks-hosts-setup.sh, aks-hosts-setup.service, and aks-hosts-setup.timer to resolve critical AKS FQDNs via LocalDNS hosts plugin. This enables authoritative DNS responses for MCR and other endpoints, reducing dependency on external DNS servers during node bootstrap. Changes include: - New systemd units for hosts file setup and periodic refresh - CSE integration: enableAKSHostsSetup() with VHD-presence guards - CoreDNS corefile generation with hosts plugin support - aks-node-controller scriptless path support - E2E tests for Ubuntu 2204/2404 and AzureLinux V3 - ShellSpec unit tests for all new shell scripts - Proto/pb.go updates for EnableHostsPlugin field
Add file provisioners for aks-hosts-setup.sh, aks-hosts-setup.service, and aks-hosts-setup.timer to all 10 packer JSON templates, and add cpAndMode entries to packer_source.sh to place them at: - /opt/azure/containers/aks-hosts-setup.sh (0755) - /etc/systemd/system/aks-hosts-setup.service (0644) - /etc/systemd/system/aks-hosts-setup.timer (0644) Without this, enableAKSHostsSetup() in CSE silently skips because the VHD-presence guard finds the files missing.
|
The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).
|
There was a problem hiding this comment.
Pull request overview
This PR extends LocalDNS to optionally include a CoreDNS hosts plugin backed by a dynamically maintained /etc/localdns/hosts file (populated by a new systemd timer/service), updates VHD build steps to ship the new units/scripts, and adds unit/e2e coverage around the hosts-plugin behavior. It also introduces initial “Teleport” plumbing in templates/config (containerd config + CSE env), alongside LocalDNS work.
Changes:
- Add
aks-hosts-setupscript + systemd unit/timer to periodically resolve critical AKS FQDNs into/etc/localdns/hosts. - Generate and persist two LocalDNS Corefile variants (with-hosts / no-hosts) and add runtime selection logic in
localdns.sh. - Update Go datamodel/templates + shellspec/e2e tests to support and validate
enableHostsPlugin.
Reviewed changes
Copilot reviewed 44 out of 44 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| vhdbuilder/packer/vhd-image-builder-mariner.json | Ship aks-hosts-setup assets into Mariner VHD build context. |
| vhdbuilder/packer/vhd-image-builder-mariner-cvm.json | Ship aks-hosts-setup assets into Mariner CVM VHD build context. |
| vhdbuilder/packer/vhd-image-builder-mariner-arm64.json | Ship aks-hosts-setup assets into Mariner arm64 VHD build context. |
| vhdbuilder/packer/vhd-image-builder-flatcar.json | Ship aks-hosts-setup assets into Flatcar VHD build context. |
| vhdbuilder/packer/vhd-image-builder-flatcar-arm64.json | Ship aks-hosts-setup assets into Flatcar arm64 VHD build context. |
| vhdbuilder/packer/vhd-image-builder-cvm.json | Ship aks-hosts-setup assets into CVM VHD build context. |
| vhdbuilder/packer/vhd-image-builder-base.json | Ship aks-hosts-setup assets into base VHD build context. |
| vhdbuilder/packer/vhd-image-builder-arm64-gen2.json | Ship aks-hosts-setup assets into arm64 gen2 VHD build context. |
| vhdbuilder/packer/vhd-image-builder-acl.json | Ship aks-hosts-setup assets into ACL VHD build context. |
| vhdbuilder/packer/vhd-image-builder-acl-arm64.json | Ship aks-hosts-setup assets into ACL arm64 VHD build context. |
| vhdbuilder/packer/packer_source.sh | Copy aks-hosts-setup script + unit + timer onto the VHD filesystem. |
| spec/parts/linux/cloud-init/artifacts/localdns_spec.sh | Extend unit tests for regenerated corefile selection/behavior. |
| spec/parts/linux/cloud-init/artifacts/cse_main_spec.sh | New shellspec tests for corefile selection logic now living in localdns.sh. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Add tests for dual-corefile env file + VHD-asset guards + aks-hosts-setup enablement. |
| spec/parts/linux/cloud-init/artifacts/aks_hosts_setup_spec.sh | New shellspec coverage for aks-hosts-setup behavior and validation. |
| pkg/agent/datamodel/types_test.go | Add tests for ShouldEnableHostsPlugin behavior. |
| pkg/agent/datamodel/types.go | Add EnableHostsPlugin to LocalDNSProfile and ShouldEnableHostsPlugin() helper. |
| pkg/agent/baker_test.go | Update LocalDNS corefile generation tests for new includeHostsPlugin flag and content. |
| pkg/agent/baker.go | Render dual LocalDNS corefiles and gate hosts plugin blocks; add containerd Teleport templating. |
| parts/linux/cloud-init/artifacts/localdns.sh | Add dynamic corefile variant selection + best-effort node annotation for hosts-plugin usage. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Enable aks-hosts-setup + pass no-hosts corefile for initial LocalDNS startup; add Teleport install hook. |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Minor helper adjustments (ACL handling comment/logic). |
| parts/linux/cloud-init/artifacts/cse_config.sh | Persist both corefile variants to env file; guard LocalDNS enablement on VHD assets; add aks-hosts-setup enablement. |
| parts/linux/cloud-init/artifacts/cse_cmd.sh | Add env vars for hosts-plugin + dual corefiles; add Teleport env vars. |
| parts/linux/cloud-init/artifacts/aks-hosts-setup.timer | New timer for periodic hosts refresh. |
| parts/linux/cloud-init/artifacts/aks-hosts-setup.sh | New script to resolve critical FQDNs and atomically write /etc/localdns/hosts. |
| parts/linux/cloud-init/artifacts/aks-hosts-setup.service | New oneshot unit invoked by timer to run aks-hosts-setup.sh. |
| e2e/vmss.go | Simplify CustomData hack path and add (now-unused) MockUnknownCloud injection hook. |
| e2e/validators.go | Add validators for hosts file content, aks-hosts-setup unit/timer health, and hosts-plugin bypass behavior. |
| e2e/validation.go | Run new LocalDNS hosts-plugin validators when hosts plugin is enabled. |
| e2e/types.go | Add scenario helpers to detect hosts-plugin enabled and provide default FQDN validation sets. |
| e2e/scenario_localdns_hosts_test.go | New scenarios validating hosts-plugin behavior across distros and scriptless path. |
| e2e/cluster.go | Add garbage collection for tagged private DNS zones (test hygiene). |
| e2e/aks_model.go | Add helpers for Private DNS zone tagging and best-effort cleanup utilities. |
| aks-node-controller/proto/aksnodeconfig/v1/localdns_config.proto | Add enable_hosts_plugin field for scriptless config. |
| aks-node-controller/pkg/gen/aksnodeconfig/v1/localdns_config.pb.go | Regenerate protobuf output for enable_hosts_plugin. |
| aks-node-controller/parser/testdata/AKSUbuntu2204+LocalDNS/generatedCSECommand | Update parser testdata output for LocalDNS scenarios. |
| aks-node-controller/parser/testdata/AKSUbuntu2204+LocalDNS+HostsPlugin/generatedCSECommand | New parser testdata folder for LocalDNS + hosts plugin. |
| aks-node-controller/parser/templates/localdns.toml.gtpl | Support IncludeHostsPlugin and updated template data shape. |
| aks-node-controller/parser/parser_test.go | Add parser tests validating SHOULD_ENABLE_HOSTS_PLUGIN env wiring. |
| aks-node-controller/parser/parser.go | Emit SHOULD_ENABLE_HOSTS_PLUGIN and dual corefile env vars for scriptless path. |
| aks-node-controller/parser/helper_test.go | Update helper tests for new corefile generator signature. |
| aks-node-controller/parser/helper.go | Generate both corefile variants; add shouldEnableHostsPlugin helper. |
| .pipelines/scripts/verify_shell.sh | Mark aks-hosts-setup.sh as bash-only for validation. |
Comments suppressed due to low confidence (2)
pkg/agent/baker.go:1539
- When both Teleport and Artifact Streaming are enabled, this template will emit multiple
snapshotter = ...assignments (teleportd then overlaybd). That creates ambiguous behavior and likely results in an unintended snapshotter being used. Consider making these options mutually exclusive (validation) or using anelse if/precedence rule so only one snapshotter is configured.
{{- if TeleportEnabled }}
snapshotter = "teleportd"
disable_snapshot_annotations = false
{{- else}}
{{- if IsKata }}
disable_snapshot_annotations = false
{{- end}}
{{- end}}
{{- if IsArtifactStreamingEnabled }}
snapshotter = "overlaybd"
disable_snapshot_annotations = false
{{- end}}
pkg/agent/baker.go:1592
- If both TeleportEnabled and IsArtifactStreamingEnabled are true, the template will render two separate
[proxy_plugins]tables, which is invalid TOML (tables cannot be defined twice). Combine plugins under a single[proxy_plugins]block or ensure the features are mutually exclusive before rendering containerd config.
{{- if TeleportEnabled }}
[proxy_plugins]
[proxy_plugins.teleportd]
type = "snapshot"
address = "/run/teleportd/snapshotter.sock"
{{- end}}
{{- if IsArtifactStreamingEnabled }}
[proxy_plugins]
[proxy_plugins.overlaybd]
type = "snapshot"
address = "/run/overlaybd-snapshotter/overlaybd.sock"
{{- end}}
- Install dnsutils in shellspec.Dockerfile so nslookup is available in the CI container, enabling real DNS resolution tests. - Fix localdns_spec.sh: add missing End statement between two It blocks, remove duplicate rm of already-deleted file, and drop assertion for non-existent error message.
Remove teleportd/teleport references from cse_cmd.sh, parser.go, cse_config.sh, cse_helpers.sh, cse_main.sh, baker.go, and types.go. These were not part of the localdns hosts plugin work and were accidentally carried over from a prior merge with main.
Restore files that had unrelated changes leaked in from the old merge: - parser.go: restore SKIP_WAAGENT_HOLD entry that was accidentally deleted - vmss.go: restore CustomDataWithHack boothook template, CustomDataFlatcar path, and injectWriteFilesEntriesToCustomData (only add MockUnknownCloud) - types.go: restore CustomDataWriteFile type (only add MockUnknownCloud tag and localdns helper methods) - validators.go: restore ValidateNodeExporter and ValidateWaagentLog to main's versions (only add localdns hosts plugin validators) - cse_helpers.sh: restore to main's version (no localdns changes needed) - .env.sample: restore to main's version
- cse_main.sh: restore SKIP_WAAGENT_HOLD conditional that was accidentally removed (stale change from old merge) - aks_model.go: pass e2e-test=true tag when creating Private DNS zones so collectGarbagePrivateDNSZones can clean them up
Two additional SKIP_WAAGENT_HOLD guards in nodePrep (for the unhold calls) were still missing after the previous fix only restored the one in basePrep.
| if [ "$a" -le 255 ] && [ "$b" -le 255 ] && [ "$c" -le 255 ] && [ "$d" -le 255 ]; then | ||
| echo "${a}.${b}.${c}.${d}" | ||
| fi | ||
| done |
There was a problem hiding this comment.
Because this script runs with set -euo pipefail, the resolve_ipv4 pipeline will cause the whole service to exit when grep -E finds no IPv4 matches (exit status 1). That can happen for NXDOMAIN / AAAA-only responses and would prevent the timer from refreshing /etc/localdns/hosts. Make the pipeline non-fatal when no matches are found (e.g., ensure the grep/pipeline returns 0 on empty results) so resolve_ipv4 reliably returns an empty string instead of terminating the script.
| done | |
| done || true |
| # Step 4: Verify "recursion not available" flag in DNS response | ||
| # This proves the response came from the hosts plugin, not from forwarding to upstream DNS | ||
| # Note: We use nslookup without explicit server IP to preserve the recursion flag message | ||
| echo "Checking for 'recursion not available' flag in DNS response..." | ||
| nslookup_output=$(nslookup "$test_fqdn" 2>&1) | ||
| if ! echo "$nslookup_output" | grep -q "recursion not available"; then | ||
| echo "ERROR: Expected 'recursion not available' flag in DNS response" | ||
| echo "This indicates localdns forwarded the query upstream instead of using the hosts plugin" | ||
| echo "" | ||
| echo "Full nslookup output:" | ||
| echo "$nslookup_output" | ||
| exit 1 | ||
| fi | ||
| echo "✓ Found 'recursion not available' flag in DNS response" | ||
| echo "" | ||
|
|
||
| echo "=== SUCCESS ===" | ||
| echo "The localdns hosts plugin is working correctly:" | ||
| echo " 1. DNS resolution returned IPs from /etc/localdns/hosts" | ||
| echo " 2. Response included 'recursion not available' (not forwarded upstream)" | ||
| echo "" | ||
| echo "Full nslookup output:" | ||
| echo "$nslookup_output" |
There was a problem hiding this comment.
This validator relies on nslookup output containing the literal string "recursion not available" and also assumes nslookup "$test_fqdn" is querying localdns (it uses the node's default resolver, which may be systemd-resolved/127.0.0.53 depending on distro/config). Both behaviors are tool/version dependent and can fail even when the hosts plugin is working, making the e2e test flaky. Prefer a more deterministic check (e.g., query localdns directly via dig @169.254.10.10 and validate flags/behavior, or stop aks-hosts-setup.timer, inject a unique fake FQDN into /etc/localdns/hosts, resolve it via localdns, then restore).
| # Step 4: Verify "recursion not available" flag in DNS response | |
| # This proves the response came from the hosts plugin, not from forwarding to upstream DNS | |
| # Note: We use nslookup without explicit server IP to preserve the recursion flag message | |
| echo "Checking for 'recursion not available' flag in DNS response..." | |
| nslookup_output=$(nslookup "$test_fqdn" 2>&1) | |
| if ! echo "$nslookup_output" | grep -q "recursion not available"; then | |
| echo "ERROR: Expected 'recursion not available' flag in DNS response" | |
| echo "This indicates localdns forwarded the query upstream instead of using the hosts plugin" | |
| echo "" | |
| echo "Full nslookup output:" | |
| echo "$nslookup_output" | |
| exit 1 | |
| fi | |
| echo "✓ Found 'recursion not available' flag in DNS response" | |
| echo "" | |
| echo "=== SUCCESS ===" | |
| echo "The localdns hosts plugin is working correctly:" | |
| echo " 1. DNS resolution returned IPs from /etc/localdns/hosts" | |
| echo " 2. Response included 'recursion not available' (not forwarded upstream)" | |
| echo "" | |
| echo "Full nslookup output:" | |
| echo "$nslookup_output" | |
| # Step 4: Verify DNS response flags and that localdns answered the query directly | |
| # This proves the response came from localdns (hosts plugin) without recursion to upstream DNS | |
| echo "Checking DNS response flags from localdns..." | |
| dig_output=$(dig +norecurse @169.254.10.10 "$test_fqdn" +noall +answer +comments 2>&1) | |
| # Ensure the server that answered is localdns | |
| if ! echo "$dig_output" | grep -q "SERVER: 169.254.10.10#53"; then | |
| echo "ERROR: Expected DNS server 169.254.10.10 (localdns) in dig output" | |
| echo "" | |
| echo "Full dig output:" | |
| echo "$dig_output" | |
| exit 1 | |
| fi | |
| # Ensure recursion is not available (no 'ra' flag in the response flags) | |
| if echo "$dig_output" | grep -q "flags: .* ra"; then | |
| echo "ERROR: Expected recursion to be disabled (no 'ra' flag) in DNS response from localdns" | |
| echo "This indicates the response may have been obtained via recursion instead of the hosts plugin" | |
| echo "" | |
| echo "Full dig output:" | |
| echo "$dig_output" | |
| exit 1 | |
| fi | |
| echo "✓ DNS response came from localdns with recursion disabled (no 'ra' flag)" | |
| echo "" | |
| echo "=== SUCCESS ===" | |
| echo "The localdns hosts plugin is working correctly:" | |
| echo " 1. DNS resolution returned IPs from /etc/localdns/hosts" | |
| echo " 2. Response came directly from localdns (SERVER: 169.254.10.10) with recursion disabled" | |
| echo "" | |
| echo "Full dig output:" | |
| echo "$dig_output" |
| // ValidateLocalDNSHostsPluginBypass verifies that localdns resolves FQDNs from /etc/localdns/hosts | ||
| // without querying the upstream DNS server. This confirms the hosts plugin is working correctly. | ||
| // It injects a fake FQDN (that doesn't exist in public DNS) into the hosts file and verifies | ||
| // localdns can resolve it - proving the hosts plugin is functioning. |
There was a problem hiding this comment.
The doc comment says this validator "injects a fake FQDN" into the hosts file, but the implementation below validates using an existing FQDN (packages.microsoft.com) and does not modify /etc/localdns/hosts. Please update the comment to match the current behavior (or adjust the implementation to actually inject a synthetic FQDN if that’s the intended assertion).
| // ValidateLocalDNSHostsPluginBypass verifies that localdns resolves FQDNs from /etc/localdns/hosts | |
| // without querying the upstream DNS server. This confirms the hosts plugin is working correctly. | |
| // It injects a fake FQDN (that doesn't exist in public DNS) into the hosts file and verifies | |
| // localdns can resolve it - proving the hosts plugin is functioning. | |
| // ValidateLocalDNSHostsPluginBypass verifies that localdns resolves FQDNs using entries from | |
| // /etc/localdns/hosts via the CoreDNS hosts plugin, without relying on the upstream DNS server. | |
| // It does this by resolving a known FQDN that is expected to be present in the localdns hosts file | |
| // (for example, packages.microsoft.com) and confirming that the resolution succeeds. |
| ValidateAKSHostsSetupService(ctx, s) | ||
| // Validate hosts file contains resolved IPs for critical FQDNs (IPs resolved dynamically) | ||
| ValidateLocalDNSHostsFile(ctx, s, s.GetDefaultFQDNsForValidation()) | ||
| // Validate localdns resolves fake FQDN from hosts file (proves hosts plugin bypass) |
There was a problem hiding this comment.
The comment says this validator "resolves fake FQDN from hosts file", but ValidateLocalDNSHostsPluginBypass currently validates using a real FQDN (packages.microsoft.com) and does not inject a synthetic record. Update this comment to reflect the current validation approach to avoid confusion.
| // Validate localdns resolves fake FQDN from hosts file (proves hosts plugin bypass) | |
| // Validate localdns resolves a real FQDN via the hosts plugin bypass (currently packages.microsoft.com) |
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #