Skip to content

feat: enable localdns hosts plugin to cache critical AKS FQDNs#8161

Closed
saewoni wants to merge 7 commits intomainfrom
sakwa/localdns_poc
Closed

feat: enable localdns hosts plugin to cache critical AKS FQDNs#8161
saewoni wants to merge 7 commits intomainfrom
sakwa/localdns_poc

Conversation

@saewoni
Copy link
Contributor

@saewoni saewoni commented Mar 24, 2026

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Add aks-hosts-setup.sh, aks-hosts-setup.service, and aks-hosts-setup.timer
to resolve critical AKS FQDNs via LocalDNS hosts plugin. This enables
authoritative DNS responses for MCR and other endpoints, reducing
dependency on external DNS servers during node bootstrap.

Changes include:
- New systemd units for hosts file setup and periodic refresh
- CSE integration: enableAKSHostsSetup() with VHD-presence guards
- CoreDNS corefile generation with hosts plugin support
- aks-node-controller scriptless path support
- E2E tests for Ubuntu 2204/2404 and AzureLinux V3
- ShellSpec unit tests for all new shell scripts
- Proto/pb.go updates for EnableHostsPlugin field
Add file provisioners for aks-hosts-setup.sh, aks-hosts-setup.service,
and aks-hosts-setup.timer to all 10 packer JSON templates, and add
cpAndMode entries to packer_source.sh to place them at:
- /opt/azure/containers/aks-hosts-setup.sh (0755)
- /etc/systemd/system/aks-hosts-setup.service (0644)
- /etc/systemd/system/aks-hosts-setup.timer (0644)

Without this, enableAKSHostsSetup() in CSE silently skips because
the VHD-presence guard finds the files missing.
@github-actions
Copy link
Contributor

github-actions bot commented Mar 24, 2026

The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedMar 25, 2026, 3:58 AM

@saewoni saewoni changed the title Sakwa/localdns poc feat: enable localdns hosts plugin to cache critical AKS FQDNs Mar 24, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends LocalDNS to optionally include a CoreDNS hosts plugin backed by a dynamically maintained /etc/localdns/hosts file (populated by a new systemd timer/service), updates VHD build steps to ship the new units/scripts, and adds unit/e2e coverage around the hosts-plugin behavior. It also introduces initial “Teleport” plumbing in templates/config (containerd config + CSE env), alongside LocalDNS work.

Changes:

  • Add aks-hosts-setup script + systemd unit/timer to periodically resolve critical AKS FQDNs into /etc/localdns/hosts.
  • Generate and persist two LocalDNS Corefile variants (with-hosts / no-hosts) and add runtime selection logic in localdns.sh.
  • Update Go datamodel/templates + shellspec/e2e tests to support and validate enableHostsPlugin.

Reviewed changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
vhdbuilder/packer/vhd-image-builder-mariner.json Ship aks-hosts-setup assets into Mariner VHD build context.
vhdbuilder/packer/vhd-image-builder-mariner-cvm.json Ship aks-hosts-setup assets into Mariner CVM VHD build context.
vhdbuilder/packer/vhd-image-builder-mariner-arm64.json Ship aks-hosts-setup assets into Mariner arm64 VHD build context.
vhdbuilder/packer/vhd-image-builder-flatcar.json Ship aks-hosts-setup assets into Flatcar VHD build context.
vhdbuilder/packer/vhd-image-builder-flatcar-arm64.json Ship aks-hosts-setup assets into Flatcar arm64 VHD build context.
vhdbuilder/packer/vhd-image-builder-cvm.json Ship aks-hosts-setup assets into CVM VHD build context.
vhdbuilder/packer/vhd-image-builder-base.json Ship aks-hosts-setup assets into base VHD build context.
vhdbuilder/packer/vhd-image-builder-arm64-gen2.json Ship aks-hosts-setup assets into arm64 gen2 VHD build context.
vhdbuilder/packer/vhd-image-builder-acl.json Ship aks-hosts-setup assets into ACL VHD build context.
vhdbuilder/packer/vhd-image-builder-acl-arm64.json Ship aks-hosts-setup assets into ACL arm64 VHD build context.
vhdbuilder/packer/packer_source.sh Copy aks-hosts-setup script + unit + timer onto the VHD filesystem.
spec/parts/linux/cloud-init/artifacts/localdns_spec.sh Extend unit tests for regenerated corefile selection/behavior.
spec/parts/linux/cloud-init/artifacts/cse_main_spec.sh New shellspec tests for corefile selection logic now living in localdns.sh.
spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh Add tests for dual-corefile env file + VHD-asset guards + aks-hosts-setup enablement.
spec/parts/linux/cloud-init/artifacts/aks_hosts_setup_spec.sh New shellspec coverage for aks-hosts-setup behavior and validation.
pkg/agent/datamodel/types_test.go Add tests for ShouldEnableHostsPlugin behavior.
pkg/agent/datamodel/types.go Add EnableHostsPlugin to LocalDNSProfile and ShouldEnableHostsPlugin() helper.
pkg/agent/baker_test.go Update LocalDNS corefile generation tests for new includeHostsPlugin flag and content.
pkg/agent/baker.go Render dual LocalDNS corefiles and gate hosts plugin blocks; add containerd Teleport templating.
parts/linux/cloud-init/artifacts/localdns.sh Add dynamic corefile variant selection + best-effort node annotation for hosts-plugin usage.
parts/linux/cloud-init/artifacts/cse_main.sh Enable aks-hosts-setup + pass no-hosts corefile for initial LocalDNS startup; add Teleport install hook.
parts/linux/cloud-init/artifacts/cse_helpers.sh Minor helper adjustments (ACL handling comment/logic).
parts/linux/cloud-init/artifacts/cse_config.sh Persist both corefile variants to env file; guard LocalDNS enablement on VHD assets; add aks-hosts-setup enablement.
parts/linux/cloud-init/artifacts/cse_cmd.sh Add env vars for hosts-plugin + dual corefiles; add Teleport env vars.
parts/linux/cloud-init/artifacts/aks-hosts-setup.timer New timer for periodic hosts refresh.
parts/linux/cloud-init/artifacts/aks-hosts-setup.sh New script to resolve critical FQDNs and atomically write /etc/localdns/hosts.
parts/linux/cloud-init/artifacts/aks-hosts-setup.service New oneshot unit invoked by timer to run aks-hosts-setup.sh.
e2e/vmss.go Simplify CustomData hack path and add (now-unused) MockUnknownCloud injection hook.
e2e/validators.go Add validators for hosts file content, aks-hosts-setup unit/timer health, and hosts-plugin bypass behavior.
e2e/validation.go Run new LocalDNS hosts-plugin validators when hosts plugin is enabled.
e2e/types.go Add scenario helpers to detect hosts-plugin enabled and provide default FQDN validation sets.
e2e/scenario_localdns_hosts_test.go New scenarios validating hosts-plugin behavior across distros and scriptless path.
e2e/cluster.go Add garbage collection for tagged private DNS zones (test hygiene).
e2e/aks_model.go Add helpers for Private DNS zone tagging and best-effort cleanup utilities.
aks-node-controller/proto/aksnodeconfig/v1/localdns_config.proto Add enable_hosts_plugin field for scriptless config.
aks-node-controller/pkg/gen/aksnodeconfig/v1/localdns_config.pb.go Regenerate protobuf output for enable_hosts_plugin.
aks-node-controller/parser/testdata/AKSUbuntu2204+LocalDNS/generatedCSECommand Update parser testdata output for LocalDNS scenarios.
aks-node-controller/parser/testdata/AKSUbuntu2204+LocalDNS+HostsPlugin/generatedCSECommand New parser testdata folder for LocalDNS + hosts plugin.
aks-node-controller/parser/templates/localdns.toml.gtpl Support IncludeHostsPlugin and updated template data shape.
aks-node-controller/parser/parser_test.go Add parser tests validating SHOULD_ENABLE_HOSTS_PLUGIN env wiring.
aks-node-controller/parser/parser.go Emit SHOULD_ENABLE_HOSTS_PLUGIN and dual corefile env vars for scriptless path.
aks-node-controller/parser/helper_test.go Update helper tests for new corefile generator signature.
aks-node-controller/parser/helper.go Generate both corefile variants; add shouldEnableHostsPlugin helper.
.pipelines/scripts/verify_shell.sh Mark aks-hosts-setup.sh as bash-only for validation.
Comments suppressed due to low confidence (2)

pkg/agent/baker.go:1539

  • When both Teleport and Artifact Streaming are enabled, this template will emit multiple snapshotter = ... assignments (teleportd then overlaybd). That creates ambiguous behavior and likely results in an unintended snapshotter being used. Consider making these options mutually exclusive (validation) or using an else if/precedence rule so only one snapshotter is configured.
    {{- if TeleportEnabled }}
    snapshotter = "teleportd"
    disable_snapshot_annotations = false
    {{- else}}
      {{- if IsKata }}
      disable_snapshot_annotations = false
      {{- end}}
    {{- end}}
    {{- if IsArtifactStreamingEnabled }}
    snapshotter = "overlaybd"
    disable_snapshot_annotations = false
    {{- end}}

pkg/agent/baker.go:1592

  • If both TeleportEnabled and IsArtifactStreamingEnabled are true, the template will render two separate [proxy_plugins] tables, which is invalid TOML (tables cannot be defined twice). Combine plugins under a single [proxy_plugins] block or ensure the features are mutually exclusive before rendering containerd config.
{{- if TeleportEnabled }}
[proxy_plugins]
  [proxy_plugins.teleportd]
    type = "snapshot"
    address = "/run/teleportd/snapshotter.sock"
{{- end}}
{{- if IsArtifactStreamingEnabled }}
[proxy_plugins]
  [proxy_plugins.overlaybd]
    type = "snapshot"
    address = "/run/overlaybd-snapshotter/overlaybd.sock"
{{- end}}

- Install dnsutils in shellspec.Dockerfile so nslookup is available
  in the CI container, enabling real DNS resolution tests.
- Fix localdns_spec.sh: add missing End statement between two It
  blocks, remove duplicate rm of already-deleted file, and drop
  assertion for non-existent error message.
Remove teleportd/teleport references from cse_cmd.sh, parser.go,
cse_config.sh, cse_helpers.sh, cse_main.sh, baker.go, and types.go.
These were not part of the localdns hosts plugin work and were
accidentally carried over from a prior merge with main.
Restore files that had unrelated changes leaked in from the old merge:
- parser.go: restore SKIP_WAAGENT_HOLD entry that was accidentally deleted
- vmss.go: restore CustomDataWithHack boothook template, CustomDataFlatcar
  path, and injectWriteFilesEntriesToCustomData (only add MockUnknownCloud)
- types.go: restore CustomDataWriteFile type (only add MockUnknownCloud tag
  and localdns helper methods)
- validators.go: restore ValidateNodeExporter and ValidateWaagentLog to
  main's versions (only add localdns hosts plugin validators)
- cse_helpers.sh: restore to main's version (no localdns changes needed)
- .env.sample: restore to main's version
Copilot AI review requested due to automatic review settings March 24, 2026 23:42
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 44 out of 44 changed files in this pull request and generated 3 comments.

- cse_main.sh: restore SKIP_WAAGENT_HOLD conditional that was
  accidentally removed (stale change from old merge)
- aks_model.go: pass e2e-test=true tag when creating Private DNS zones
  so collectGarbagePrivateDNSZones can clean them up
Two additional SKIP_WAAGENT_HOLD guards in nodePrep (for the unhold
calls) were still missing after the previous fix only restored the
one in basePrep.
Copilot AI review requested due to automatic review settings March 25, 2026 03:58
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 44 out of 44 changed files in this pull request and generated 4 comments.

if [ "$a" -le 255 ] && [ "$b" -le 255 ] && [ "$c" -le 255 ] && [ "$d" -le 255 ]; then
echo "${a}.${b}.${c}.${d}"
fi
done
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this script runs with set -euo pipefail, the resolve_ipv4 pipeline will cause the whole service to exit when grep -E finds no IPv4 matches (exit status 1). That can happen for NXDOMAIN / AAAA-only responses and would prevent the timer from refreshing /etc/localdns/hosts. Make the pipeline non-fatal when no matches are found (e.g., ensure the grep/pipeline returns 0 on empty results) so resolve_ipv4 reliably returns an empty string instead of terminating the script.

Suggested change
done
done || true

Copilot uses AI. Check for mistakes.
Comment on lines +1726 to +1748
# Step 4: Verify "recursion not available" flag in DNS response
# This proves the response came from the hosts plugin, not from forwarding to upstream DNS
# Note: We use nslookup without explicit server IP to preserve the recursion flag message
echo "Checking for 'recursion not available' flag in DNS response..."
nslookup_output=$(nslookup "$test_fqdn" 2>&1)
if ! echo "$nslookup_output" | grep -q "recursion not available"; then
echo "ERROR: Expected 'recursion not available' flag in DNS response"
echo "This indicates localdns forwarded the query upstream instead of using the hosts plugin"
echo ""
echo "Full nslookup output:"
echo "$nslookup_output"
exit 1
fi
echo "✓ Found 'recursion not available' flag in DNS response"
echo ""

echo "=== SUCCESS ==="
echo "The localdns hosts plugin is working correctly:"
echo " 1. DNS resolution returned IPs from /etc/localdns/hosts"
echo " 2. Response included 'recursion not available' (not forwarded upstream)"
echo ""
echo "Full nslookup output:"
echo "$nslookup_output"
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validator relies on nslookup output containing the literal string "recursion not available" and also assumes nslookup "$test_fqdn" is querying localdns (it uses the node's default resolver, which may be systemd-resolved/127.0.0.53 depending on distro/config). Both behaviors are tool/version dependent and can fail even when the hosts plugin is working, making the e2e test flaky. Prefer a more deterministic check (e.g., query localdns directly via dig @169.254.10.10 and validate flags/behavior, or stop aks-hosts-setup.timer, inject a unique fake FQDN into /etc/localdns/hosts, resolve it via localdns, then restore).

Suggested change
# Step 4: Verify "recursion not available" flag in DNS response
# This proves the response came from the hosts plugin, not from forwarding to upstream DNS
# Note: We use nslookup without explicit server IP to preserve the recursion flag message
echo "Checking for 'recursion not available' flag in DNS response..."
nslookup_output=$(nslookup "$test_fqdn" 2>&1)
if ! echo "$nslookup_output" | grep -q "recursion not available"; then
echo "ERROR: Expected 'recursion not available' flag in DNS response"
echo "This indicates localdns forwarded the query upstream instead of using the hosts plugin"
echo ""
echo "Full nslookup output:"
echo "$nslookup_output"
exit 1
fi
echo "✓ Found 'recursion not available' flag in DNS response"
echo ""
echo "=== SUCCESS ==="
echo "The localdns hosts plugin is working correctly:"
echo " 1. DNS resolution returned IPs from /etc/localdns/hosts"
echo " 2. Response included 'recursion not available' (not forwarded upstream)"
echo ""
echo "Full nslookup output:"
echo "$nslookup_output"
# Step 4: Verify DNS response flags and that localdns answered the query directly
# This proves the response came from localdns (hosts plugin) without recursion to upstream DNS
echo "Checking DNS response flags from localdns..."
dig_output=$(dig +norecurse @169.254.10.10 "$test_fqdn" +noall +answer +comments 2>&1)
# Ensure the server that answered is localdns
if ! echo "$dig_output" | grep -q "SERVER: 169.254.10.10#53"; then
echo "ERROR: Expected DNS server 169.254.10.10 (localdns) in dig output"
echo ""
echo "Full dig output:"
echo "$dig_output"
exit 1
fi
# Ensure recursion is not available (no 'ra' flag in the response flags)
if echo "$dig_output" | grep -q "flags: .* ra"; then
echo "ERROR: Expected recursion to be disabled (no 'ra' flag) in DNS response from localdns"
echo "This indicates the response may have been obtained via recursion instead of the hosts plugin"
echo ""
echo "Full dig output:"
echo "$dig_output"
exit 1
fi
echo "✓ DNS response came from localdns with recursion disabled (no 'ra' flag)"
echo ""
echo "=== SUCCESS ==="
echo "The localdns hosts plugin is working correctly:"
echo " 1. DNS resolution returned IPs from /etc/localdns/hosts"
echo " 2. Response came directly from localdns (SERVER: 169.254.10.10) with recursion disabled"
echo ""
echo "Full dig output:"
echo "$dig_output"

Copilot uses AI. Check for mistakes.
Comment on lines +1544 to +1547
// ValidateLocalDNSHostsPluginBypass verifies that localdns resolves FQDNs from /etc/localdns/hosts
// without querying the upstream DNS server. This confirms the hosts plugin is working correctly.
// It injects a fake FQDN (that doesn't exist in public DNS) into the hosts file and verifies
// localdns can resolve it - proving the hosts plugin is functioning.
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment says this validator "injects a fake FQDN" into the hosts file, but the implementation below validates using an existing FQDN (packages.microsoft.com) and does not modify /etc/localdns/hosts. Please update the comment to match the current behavior (or adjust the implementation to actually inject a synthetic FQDN if that’s the intended assertion).

Suggested change
// ValidateLocalDNSHostsPluginBypass verifies that localdns resolves FQDNs from /etc/localdns/hosts
// without querying the upstream DNS server. This confirms the hosts plugin is working correctly.
// It injects a fake FQDN (that doesn't exist in public DNS) into the hosts file and verifies
// localdns can resolve it - proving the hosts plugin is functioning.
// ValidateLocalDNSHostsPluginBypass verifies that localdns resolves FQDNs using entries from
// /etc/localdns/hosts via the CoreDNS hosts plugin, without relying on the upstream DNS server.
// It does this by resolving a known FQDN that is expected to be present in the localdns hosts file
// (for example, packages.microsoft.com) and confirming that the resolution succeeds.

Copilot uses AI. Check for mistakes.
ValidateAKSHostsSetupService(ctx, s)
// Validate hosts file contains resolved IPs for critical FQDNs (IPs resolved dynamically)
ValidateLocalDNSHostsFile(ctx, s, s.GetDefaultFQDNsForValidation())
// Validate localdns resolves fake FQDN from hosts file (proves hosts plugin bypass)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says this validator "resolves fake FQDN from hosts file", but ValidateLocalDNSHostsPluginBypass currently validates using a real FQDN (packages.microsoft.com) and does not inject a synthetic record. Update this comment to reflect the current validation approach to avoid confusion.

Suggested change
// Validate localdns resolves fake FQDN from hosts file (proves hosts plugin bypass)
// Validate localdns resolves a real FQDN via the hosts plugin bypass (currently packages.microsoft.com)

Copilot uses AI. Check for mistakes.
@saewoni saewoni closed this Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants