mvp vm attestation by jordanhendricks · Pull Request #1091 · oxidecomputer/propolis

jordanhendricks · 2026-03-27T19:12:21Z

TODO:

understand why we see spurious attestation failures (@jordanhendricks) (edit: this was the system working as expected, failing requests before the boot digest is done)
enforce read-only-ness of the boot disk (@jordanhendricks)
understand why stopping an instance with this branch failed (example on berlin) (@iximeow) (edit: this became "make Attest an async trait" (dice-util#360) and associated work)
understand vsock issues (edit: will be merged in XXX close connections #1094)
attestation module comment
fix phd test failures (@jordanhendricks)

Testing notes

No boot disk

Steps to test: create instance, stop it (or don't auto-start it), then remove the boot disk as a boot disk. Send a challenge from inside the guest.

Result: attestation server used just the instance UUID for qualifying data

21:24:25.538Z INFO propolis-server (vm_state_driver): vm conf is ready = VmInstanceConf { uuid: 1f1ec2e3-c5cf-4eaf-8a19-aa25ec1f6895, boot_digest: None }

Failed boot disk

in progress

iximeow · 2026-03-27T23:01:54Z

Cargo.toml

+# Attestation
+#dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", branch = "jhendricks/update-sled-agent-types-versions", features = ["sled-agent"] }
+dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", features = ["sled-agent"] }
+vm-attest = { git = "https://github.com/oxidecomputer/vm-attest", rev = "a7c2a341866e359a3126aaaa67823ec5097000cd", default-features = false }


most of the Cargo.lock weirdness is from dice-verifier -> sled-agent-client -> omicron-common (some previous rev), and that's where the later API dependency stuff we saw in Omicron comes up when building the tuf. sled-agent-client re-exports items out of propolis-client, which means we end up in a situation where propolis-server depends on a different rev of propolis-client and everything's Weird.

i'm not totally sure what we want or need to do about this, particularly because we're definitely not using the propolis-client-related parts of sled-agent! we're just using one small part of the API for the RoT calls. but sled-agent and propolis are (i think?) updated in the same deployment unit so the cyclic dependency is fine.

this fixes issues (read: panics) related to AttestSledAgent's internal `rt`, block_on, and dropping.

actually stop the `AttestationSock` when we stop other Propolis devices/backends, and along the way `tcp_attest` -> `attest_handle`.

jordanhendricks · 2026-04-02T00:09:52Z

I want to add some comments in the attestation module but from a code-structure perspective @iximeow and I are happy with this. Ready for review!

papertigers · 2026-04-02T18:12:44Z

bin/propolis-server/src/main.rs

        api_runtime.block_on(async { vnc.halt().await });
    }

+    // TODO: clean up attestation server.


This can be removed now?

done in 014950e

hawkw

Some of the Tokio stuff felt a bit awkward here --- I'd be happy to open a PR against this branch changing some of the things I mentioned, if that's easier for you?

hawkw · 2026-04-02T20:10:29Z

bin/propolis-server/src/lib/initializer.rs

+        // TODO: early return if none?
        if let Some(vsock) = &self.spec.vsock {


fwiw, i think the TODO is as easy as changing this to

Suggested change

         // TODO: early return if none?
-        if let Some(vsock) = &self.spec.vsock {
+        let Some(vsock) = &self.spec.vsock else { return; };

and then un-indenting everything else in the function basically.
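
A minimal, self-contained sketch of the let-else shape suggested above. `Spec`, `Vsock`, and `init_vsock` are hypothetical stand-ins, not the real Propolis types, and the real function presumably does more work than formatting a string.

```rust
// Hypothetical stand-ins for the real spec types.
struct Vsock {
    guest_cid: u64,
}

struct Spec {
    vsock: Option<Vsock>,
}

/// Returns a description of what was set up, or `None` when the spec has
/// no vsock device.
fn init_vsock(spec: &Spec) -> Option<String> {
    // The early return replaces the `if let Some(..) { ... }` wrapper, so
    // the rest of the body sits one indentation level shallower.
    let Some(vsock) = &spec.vsock else {
        return None;
    };

    Some(format!("vsock configured with guest cid {}", vsock.guest_cid))
}

fn main() {
    let with = Spec { vsock: Some(Vsock { guest_cid: 3 }) };
    let without = Spec { vsock: None };
    println!("{:?}", init_vsock(&with));
    println!("{:?}", init_vsock(&without));
}
```

The behavior is the same as the `if let` form; the difference is only that the "nothing to do" case is handled up front.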

done in 014950e

hawkw · 2026-04-02T20:11:57Z

bin/propolis-server/src/lib/initializer.rs

not super important but this string could be better probably

done in 014950e

hawkw · 2026-04-02T20:12:58Z

bin/propolis-server/src/lib/initializer.rs

turbo nit:

Suggested change

// table should be we sized this appropriately in testing, so

added a couple of commas here in 014950e


hawkw · 2026-04-02T20:16:14Z

bin/propolis-server/src/lib/initializer.rs

+                    Some(backend.clone_volume())
+                } else {
+                    // Disk must be read-only to be used for attestation.
+                    slog::info!(self.log, "boot disk is not read-only");


maybe this should explicitly state that this means it will not be attested?

took a crack at this in 014950e

hawkw · 2026-04-02T20:35:26Z

lib/propolis/src/attestation/server.rs

+#[derive(Debug)]
+enum AttestationInitState {
+    Preparing {
+        vm_conf_send: oneshot::Sender<VmInstanceConf>,
+    },
+    /// A transient state while we're getting the initializer ready, having
+    /// taken `Preparing` and its `vm_conf_send`, but before we've got a
+    /// `JoinHandle` to track as running.
+    Initializing,
+    Running {
+        init_task: JoinHandle<()>,
+    },
+}
+
+/// This struct manages providing the requisite data for a corresponding
+/// `AttestationSock` to become fully functional.
+pub struct AttestationSockInit {
+    log: slog::Logger,
+    vm_conf_send: oneshot::Sender<VmInstanceConf>,
+    uuid: uuid::Uuid,
+    volume_ref: Option<crucible::Volume>,
+}
+
+impl AttestationSockInit {
+    /// Do any any remaining work of collecting VM RoT measurements in support
+    /// of this VM's attestation server.
+    pub async fn run(self) {
+        let AttestationSockInit { log, vm_conf_send, uuid, volume_ref } = self;
+
+        let mut vm_conf = vm_attest::VmInstanceConf { uuid, boot_digest: None };
+
+        if let Some(volume) = volume_ref {
+            // TODO(jph): make propolis issue, link to #1078 and add a log line
+            // TODO: load-bearing sleep: we have a Crucible volume, but we can
+            // be here and chomping at the bit to get a digest calculation
+            // started well before the volume has been activated; in
+            // `propolis-server` we need to wait for at least a subsequent
+            // instance start. Similar to the scrub task for Crucible disks,
+            // delay some number of seconds in the hopes that activation is done
+            // promptly.
+            //
+            // This should be replaced by awaiting for some kind of actual
+            // "activated" signal.
+            tokio::time::sleep(std::time::Duration::from_secs(10)).await;
+
+            let boot_digest =
+                match crate::attestation::boot_digest::boot_disk_digest(
+                    volume, &log,
+                )
+                .await
+                {
+                    Ok(digest) => digest,
+                    Err(e) => {
+                        // a panic here is unfortunate, but helps us debug for
+                        // now; if the digest calculation fails it may be some
+                        // retryable issue that a guest OS would survive. but
+                        // panicking here means we've stopped Propolis at the
+                        // actual error, rather than noticing the
+                        // `vm_conf_sender` having dropped elsewhere.
+                        panic!("failed to compute boot disk digest: {e:?}");
+                    }
+                };
+
+            vm_conf.boot_digest = Some(boot_digest);
+        } else {
+            slog::warn!(log, "not computing boot disk digest");
+        }
+
+        let send_res = vm_conf_send.send(vm_conf);
+        if let Err(_) = send_res {
+            slog::error!(
+                log,
+                "attestation server is not listening for its config?"
+            );
+        }
+    }
+}


Soo, it feels a bit funny to me that this thing is a task we spawn that, when it completes, sends a message over a oneshot channel and then exits, and then we have a JoinHandle<()> for that task. It kinda feels like this could just be a JoinHandle<VmInstanceConf> and make a bunch of this at least a bit simpler?

I'd be happy to throw together a patch that does that refactoring if it's too annoying.

That's fair. The JoinHandle was from a previous iteration of how we would structure things that looked more like the way we presently handle the VNC server. I'll take a look at how hard this is to remove.

Since this and also the change in this module that I suggested in #1091 (comment) are kinda just refactoring/tidying things up, I would be fine with leaving a lot of this as-is and then merge some refactoring later --- I'd be happy to open a follow-up PR after this has merged, if that makes life easier for you?
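
To illustrate the two shapes being compared in this thread, here is a small sketch that uses `std::thread` as a stand-in for Tokio tasks. That substitution is an assumption made to keep the example dependency-free; a Tokio `JoinHandle<T>` likewise yields the task's return value when awaited. `VmConf` and `compute_conf` are hypothetical and only loosely mirror `VmInstanceConf`.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for `VmInstanceConf`.
#[derive(Debug, PartialEq)]
struct VmConf {
    uuid: u32,
    boot_digest: Option<[u8; 4]>,
}

// Placeholder for the real measurement work.
fn compute_conf(uuid: u32) -> VmConf {
    VmConf { uuid, boot_digest: Some([0xde, 0xad, 0xbe, 0xef]) }
}

/// Current shape: the worker sends its result over a channel, and the
/// handle only carries `()`.
fn via_channel(uuid: u32) -> VmConf {
    let (tx, rx) = mpsc::channel();
    let handle: thread::JoinHandle<()> = thread::spawn(move || {
        // The receiver is held below, so a send error is not expected here.
        let _ = tx.send(compute_conf(uuid));
    });
    let conf = rx.recv().expect("worker dropped the sender");
    handle.join().expect("worker panicked");
    conf
}

/// Suggested shape: the handle itself carries the result.
fn via_join_handle(uuid: u32) -> VmConf {
    let handle: thread::JoinHandle<VmConf> =
        thread::spawn(move || compute_conf(uuid));
    handle.join().expect("worker panicked")
}

fn main() {
    assert_eq!(via_channel(7), via_join_handle(7));
    println!("{:?}", via_join_handle(7));
}
```

One trade-off to keep in mind: with a channel, the receiver can live somewhere other than wherever the handle is held, which may be why the original structure looked the way it did.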


jordanhendricks · 2026-04-03T00:03:28Z

lib/propolis/src/attestation/boot_digest.rs

+        let mut buffer =
+            Buffer::new(this_block_count as usize, block_size as usize);
+
+        // TODO(jph): We don't want to panic in the case of a failed read. How


I still need to do this and test on dublin.

jordanhendricks · 2026-04-03T00:03:38Z

lib/propolis/src/attestation/mod.rs

+// License, v. 2.0. If a copy of the MPL was not distributed with this
+// file, You can obtain one at https://mozilla.org/MPL/2.0/.
+
+//! TODO: block comment


in progress

hawkw

The Crucible retry stuff seems pretty much correct, I commented on some minor nitpicks. I think it's fine to defer some of the async refactoring to a subsequent PR, as there isn't anything wrong there, I just think we could maybe make the code a bit simpler. Beyond that, I think that pending whatever testing you need to do, I have no major concerns.

hawkw · 2026-04-03T23:07:47Z

bin/propolis-server/src/lib/initializer.rs

+                    // Disk must be read-only to be used for attestation.
+                    slog::info!(self.log, "boot disk is not read-only (and will not be used for attestations)");


turbo nitpick:

Suggested change

                     // Disk must be read-only to be used for attestation.
-                    slog::info!(self.log, "boot disk is not read-only (and will not be used for attestations)");
+                    slog::info!(
+                        self.log,
+                        "boot disk is not read-only (and will not be used for attestations)",
+                    );

hawkw · 2026-04-03T23:08:55Z

crates/propolis-config-toml/src/spec.rs

    name: &str,
    device: &super::Device,
 ) -> Result<VirtioSocket, TomlToSpecError> {
+    eprintln!("{:?}", device);


i'm guessing this was stuck in for temporary debugging purposes and ought to be removed before release?

Suggested change

-    eprintln!("{:?}", device);

hawkw · 2026-04-03T23:11:07Z

lib/propolis/src/attestation/boot_digest.rs

+        "starting hash of volume {:?} (total_size={}, block_size={} end_block={}, block_count={})",
+        vol_uuid,
+        vol_size,
+        block_size,
+        end_block,
+        block_count,


nitpicky, unimportant: should these perhaps be structured fields on the log record?

Suggested change

-        "starting hash of volume {:?} (total_size={}, block_size={} end_block={}, block_count={})",
-        vol_uuid,
-        vol_size,
-        block_size,
-        end_block,
-        block_count,
+        "starting hash of volume";
+        "volume_id" => %vol_uuid,
+        "volume_size" => vol_size,
+        "block_size" => block_size,
+        "end_block" => end_block,
+        "block_count" => block_count,

hawkw · 2026-04-03T23:13:14Z

lib/propolis/src/attestation/boot_digest.rs

+                slog::error!(
+                    log,
+                    "read failed: {e:?}.
+                offset={offset},
+                this_block_cout={this_block_count},
+                block_size={block_size},
+                end_block={end_block}"


super weird formatting here, can we do something about that? also perhaps these ought to be structured fields...

and also, perhaps this ought to include the retry count?
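
On the formatting point: a Rust string literal that spans several source lines keeps each newline and the following line's indentation in the resulting string, which is likely why the message above looks odd. A trailing backslash instead skips the newline and the leading whitespace after it. A small sketch, with illustrative message text rather than the PR's wording:

```rust
/// Literal broken across lines: the newline and the indentation become
/// part of the string.
fn raw_multiline() -> &'static str {
    "read failed.
        offset=0"
}

/// Same text with a line continuation: the trailing backslash drops the
/// newline and the leading whitespace on the next line.
fn continued() -> &'static str {
    "read failed. \
        offset=0"
}

fn main() {
    println!("{:?}", raw_multiline());
    println!("{:?}", continued());
}
```

Structured fields, as suggested above, avoid the problem entirely because the values no longer live inside the message string.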

hawkw · 2026-04-03T23:20:02Z

lib/propolis/src/attestation/boot_digest.rs

+        // XXX(JPH): Crucible scrub code also inserts a delay between reads. We probably
+        // don't want to do that but we'll see how this goes in production.


why would we not want to back off here?

No particular reason, but I see now this comment is quite confusing so I'll clean that up a bit.
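
For reference, a bounded retry loop with a delay between attempts could look roughly like the sketch below. It is synchronous and std-only so that it runs as-is; the real code is async and would presumably sleep with Tokio's timer instead. `read_with_retries`, the doubling backoff, and the delay values are illustrative assumptions, not what the PR implements.

```rust
use std::thread;
use std::time::Duration;

/// Calls `read` up to `max_attempts` times, sleeping between failed
/// attempts and doubling the delay each time. Returns the last error if
/// every attempt fails.
fn read_with_retries<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    mut read: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "need at least one attempt");
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match read() {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the most recent error.
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                thread::sleep(delay);
                delay *= 2;
                attempt += 1;
            }
        }
    }
}

fn main() {
    // A fake read that fails twice before succeeding.
    let mut calls = 0;
    let res: Result<u32, &str> =
        read_with_retries(5, Duration::from_millis(1), || {
            calls += 1;
            if calls < 3 { Err("transient") } else { Ok(42) }
        });
    println!("{res:?} after {calls} calls");
}
```

Whether a delay is worth it here depends on how transient the read failures tend to be, which is the open question in this thread.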

hawkw · 2026-04-03T23:20:26Z

lib/propolis/src/attestation/boot_digest.rs

+        // Read the whole disk. If a read fails, we'll retry a given number of times, but if those
+        // fail, we return an error to the attestation machinery. It's unlikely that instance is
+        // doing well in this case, anyway, if it's boot disk is erroring on reads.
        //
-        // Options:
-        //
-        // * retry indefinitely or some N times
-        // * you never get an attestation (no boot disk digest) if the hashing
-        //   fails
-        // * you can still get attestations but without a boot disk digest measurement
-        //
-        // Crucible scrub code also inserts a delay between reads. We probably
-        // don't want to do that but release testing will reveal that,
-        // hopefully..
-        let res = vol.read(block, &mut buffer).await;
-
-        if let Err(e) = res {
-            panic!(
-                "read failed: {e:?}.
+        // XXX(JPH): Crucible scrub code also inserts a delay between reads. We probably
+        // don't want to do that but we'll see how this goes in production.
+        let retry_count = 5;
+        let mut n_retries = 0;
+        loop {
+            if n_retries >= retry_count {
+                slog::error!(log, "failed to read from boot disk {n_retries} tries; aborting boot
+                    digest hash");


a bunch of long lines which i'd like to see wrapped (considered not important)

jordanhendricks and others added 17 commits March 20, 2026 18:59

something that compiles

f6d25c1

starting to sketch out sled-agent attest code

5dbf46c

mvp attestation??

e12a38f

remove dep on libipcc

6335323

make boot digest parseable

ef01e4b

ready for a racklette spin

591b9f5

paper over async/sync/async bits

4ca28cb

added recv channel for vm conf in attestation server

5f12a78

moved tcp attest server inside of vm objects

b1c710c

remove warning

e4b4a52

start adding boot digest stuff

1c55d2b

might have strung all the needful through propolis-server?

1c6ed47

clippy lints and cargo fmt

14122a2

racklette debug :(

449a3b2

more debugging

19cfbf7

restore 4ca28cb

d89273b

remove todo file from tree

2d0a0e4

iximeow reviewed Mar 27, 2026

iximeow and others added 10 commits March 30, 2026 20:36

bump dice-util/vm-attest for AttestAsync

fea9dbb

this fixes issues (read: panics) related to AttestSledAgent's internal `rt`, block_on, and dropping.

enforce read-only boot disk

60c8c04

rev dice-util and vm-attest further

9efdfb6

rev dice-util, vm-attest

b137a90

shuffle things around to be able to reign in a cancelled init task

cf55c6e

halt cleanup

7f84255

actually stop the `AttestationSock` when we stop other Propolis devices/backends, and along the way `tcp_attest` -> `attest_handle`.

cleaning up some todos

776795a

how had i not rebuilt the server...??

9af75aa

testing a phd fix

60935ca

my turn to not compile propolis-server

50c24ff

jordanhendricks marked this pull request as ready for review April 2, 2026 00:08

jordanhendricks requested a review from hawkw April 2, 2026 00:41

jordanhendricks self-assigned this Apr 2, 2026

papertigers reviewed Apr 2, 2026

hawkw reviewed Apr 2, 2026

jordanhendricks commented Apr 3, 2026

first round of review feedback: minor things

014950e

jordanhendricks mentioned this pull request Apr 3, 2026

ls-apis CI test didn't fail for new cyclic dependency oxidecomputer/omicron#10214

Closed

jordanhendricks added 2 commits April 3, 2026 12:27

compiling, my bad

71b14da

add retries for crucible reads

2d8818d

hawkw approved these changes Apr 3, 2026
