Skip to content

mvp vm attestation#1091

Open
jordanhendricks wants to merge 30 commits intomasterfrom
jhendricks/rfd-605
Open

mvp vm attestation#1091
jordanhendricks wants to merge 30 commits intomasterfrom
jhendricks/rfd-605

Conversation

@jordanhendricks
Copy link
Copy Markdown
Contributor

@jordanhendricks jordanhendricks commented Mar 27, 2026

closes #1067

TODO:

Testing notes

No boot disk

Steps to test: create instance, stop it (or don't auto-start it), then remove the boot disk as a boot disk. Send a challenge from inside the guest.

Result: attestation server used just the instance UUID for qualifying data

21:24:25.538Z INFO propolis-server (vm_state_driver): vm conf is ready = VmInstanceConf { uuid: 1f1ec2e3-c5cf-4eaf-8a19-aa25ec1f6895, boot_digest: None }

Failed boot disk

in progress

Cargo.toml Outdated
# Attestation
#dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", branch = "jhendricks/update-sled-agent-types-versions", features = ["sled-agent"] }
dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", features = ["sled-agent"] }
vm-attest = { git = "https://github.com/oxidecomputer/vm-attest", rev = "a7c2a341866e359a3126aaaa67823ec5097000cd", default-features = false }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the Cargo.lock weirdness from dice-verifier -> sled-agent-client -> omciron-common (some previous rev) and that's where the later API dependency stuff we saw in Omicron comes up when building the tuf. sled-agent-client re-exports items out of propolis-client which means we end up in a situation where propolis-server depends on a different rev of propolis-client and everything's Weird.

i'm not totally sure what we want or need to do about this, particularly because we're definitely not using the propolis-client-related parts of sled-agent! we're just using one small part of the API for the RoT calls. but sled-agent and propolis are (i think?) updated in the same deployment unit so the cyclic dependency is fine.

@jordanhendricks jordanhendricks marked this pull request as ready for review April 2, 2026 00:08
@jordanhendricks
Copy link
Copy Markdown
Contributor Author

I want to add some comments in the attestation module but from a code-structure perspective @iximeow and I are happy with this. Ready for review!

@jordanhendricks jordanhendricks requested a review from hawkw April 2, 2026 00:41
@jordanhendricks jordanhendricks self-assigned this Apr 2, 2026
api_runtime.block_on(async { vnc.halt().await });
}

// TODO: clean up attestation server.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed now?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 014950e

Copy link
Copy Markdown
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the Tokio stuff felt a bit awkward here --- I'd be happy to open a PR against this branch changing some of the things I mentioned, if that's easier for you?

Comment on lines 499 to 500
// TODO: early return if none?
if let Some(vsock) = &self.spec.vsock {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, i think the TODO is as easy as changing this to

Suggested change
// TODO: early return if none?
if let Some(vsock) = &self.spec.vsock {
// TODO: early return if none?
let Some(vsock) = &self.spec.vsock else { return; };

and then un-indenting everything else in the function basically.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 014950e

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not super important but this string could be better probably

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 014950e

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turbo nit:

Suggested change
// table should be we sized this appropriately in testing, so

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a couple of commas here in 014950e

Some(backend.clone_volume())
} else {
// Disk must be read-only to be used for attestation.
slog::info!(self.log, "boot disk is not read-only");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this should explicitly state that this means it will not be attested?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

took a crack at this in 014950e

Comment on lines +42 to +118
#[derive(Debug)]
enum AttestationInitState {
Preparing {
vm_conf_send: oneshot::Sender<VmInstanceConf>,
},
/// A transient state while we're getting the initializer ready, having
/// taken `Preparing` and its `vm_conf_send`, but before we've got a
/// `JoinHandle` to track as running.
Initializing,
Running {
init_task: JoinHandle<()>,
},
}

/// This struct manages providing the requisite data for a corresponding
/// `AttestationSock` to become fully functional.
pub struct AttestationSockInit {
log: slog::Logger,
vm_conf_send: oneshot::Sender<VmInstanceConf>,
uuid: uuid::Uuid,
volume_ref: Option<crucible::Volume>,
}

impl AttestationSockInit {
/// Do any any remaining work of collecting VM RoT measurements in support
/// of this VM's attestation server.
pub async fn run(self) {
let AttestationSockInit { log, vm_conf_send, uuid, volume_ref } = self;

let mut vm_conf = vm_attest::VmInstanceConf { uuid, boot_digest: None };

if let Some(volume) = volume_ref {
// TODO(jph): make propolis issue, link to #1078 and add a log line
// TODO: load-bearing sleep: we have a Crucible volume, but we can
// be here and chomping at the bit to get a digest calculation
// started well before the volume has been activated; in
// `propolis-server` we need to wait for at least a subsequent
// instance start. Similar to the scrub task for Crucible disks,
// delay some number of seconds in the hopes that activation is done
// promptly.
//
// This should be replaced by awaiting for some kind of actual
// "activated" signal.
tokio::time::sleep(std::time::Duration::from_secs(10)).await;

let boot_digest =
match crate::attestation::boot_digest::boot_disk_digest(
volume, &log,
)
.await
{
Ok(digest) => digest,
Err(e) => {
// a panic here is unfortunate, but helps us debug for
// now; if the digest calculation fails it may be some
// retryable issue that a guest OS would survive. but
// panicking here means we've stopped Propolis at the
// actual error, rather than noticing the
// `vm_conf_sender` having dropped elsewhere.
panic!("failed to compute boot disk digest: {e:?}");
}
};

vm_conf.boot_digest = Some(boot_digest);
} else {
slog::warn!(log, "not computing boot disk digest");
}

let send_res = vm_conf_send.send(vm_conf);
if let Err(_) = send_res {
slog::error!(
log,
"attestation server is not listening for its config?"
);
}
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Soo, it feels a bit funny to me that this thing is a task we spawn that, when it completes, sends a message over a oneshot channel and then exits, and then we have a JoinHandle<()> for that task. It kinda feels like this could just be a JoinHandle<VmInstanceConf> and make a bunch of this at least a bit simpler?

I'd be happy to throw together a patch that does that refactoring if it's too annoying.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. The JoinHandle was from a previous iteration of how we would structure things that looked more like the way we presently handle the VNC server. I'll take a look at how hard this is to remove.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this and also the change in this module that I suggested in #1091 (comment) are kinda just refactoring/tidying things up, I would be fine with leaving a lot of this as-is and then merge some refactoring later --- I'd be happy to open a follow-up PR after this has merged, if that makes life easier for you?

let mut buffer =
Buffer::new(this_block_count as usize, block_size as usize);

// TODO(jph): We don't want to panic in the case of a failed read. How
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to do this and test on dublin.

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! TODO: block comment
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in progress

Copy link
Copy Markdown
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Crucible retry stuff seems pretty much correct, I commented on some minor nitpicks. I think it's fine to defer some of the async refactoring to a subsequent PR, as there isn't anything wrong there, I just think we could maybe make the code a bit simpler. Beyond that, I think that pending whatever testing you need to do, I have no major concerns.

Comment on lines +763 to +764
// Disk must be read-only to be used for attestation.
slog::info!(self.log, "boot disk is not read-only (and will not be used for attestations)");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turbo nitpick:

Suggested change
// Disk must be read-only to be used for attestation.
slog::info!(self.log, "boot disk is not read-only (and will not be used for attestations)");
// Disk must be read-only to be used for attestation.
slog::info!(
self.log,
"boot disk is not read-only (and will not be used for attestations)",
);

name: &str,
device: &super::Device,
) -> Result<VirtioSocket, TomlToSpecError> {
eprintln!("{:?}", device);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm guessing this was stuck in for temporary debugging purposes and ought to be removed before release?

Suggested change
eprintln!("{:?}", device);

Comment on lines +36 to +41
"starting hash of volume {:?} (total_size={}, block_size={} end_block={}, block_count={})",
vol_uuid,
vol_size,
block_size,
end_block,
block_count,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicky, unimportant: should these perhaps be structured fields on the log record?

Suggested change
"starting hash of volume {:?} (total_size={}, block_size={} end_block={}, block_count={})",
vol_uuid,
vol_size,
block_size,
end_block,
block_count,
"starting hash of volume";
"volume_id" => %vol_uuid,,
"volume_size" => vol_size,
"block_size" => block_size,
"end_block" => end_block,
"block_count" => block_count,

Comment on lines +89 to +95
slog::error!(
log,
"read failed: {e:?}.
offset={offset},
this_block_cout={this_block_count},
block_size={block_size},
end_block={end_block}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super weird formatting here, can we do something about that? also perhaps these ought to be structured fields...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also, perhaps this ought to include the retry count?

Comment on lines +74 to +75
// XXX(JPH): Crucible scrub code also inserts a delay between reads. We probably
// don't want to do that but we'll see how this goes in production.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would we not want to back off here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason, but I see now this comment is quite confusing so I'll clean that up a bit.

Comment on lines +70 to +81
// Read the whole disk. If a read fails, we'll retry a given number of times, but if those
// fail, we return an error to the attestation machinery. It's unlikely that instance is
// doing well in this case, anyway, if it's boot disk is erroring on reads.
//
// Options:
//
// * retry indefinitely or some N times
// * you never get an attestation (no boot disk digest) if the hashing
// fails
// * you can still get attestations but without a boot disk digest measurement
//
// Crucible scrub code also inserts a delay between reads. We probably
// don't want to do that but release testing will reveal that,
// hopefully..
let res = vol.read(block, &mut buffer).await;

if let Err(e) = res {
panic!(
"read failed: {e:?}.
// XXX(JPH): Crucible scrub code also inserts a delay between reads. We probably
// don't want to do that but we'll see how this goes in production.
let retry_count = 5;
let mut n_retries = 0;
loop {
if n_retries >= retry_count {
slog::error!(log, "failed to read from boot disk {n_retries} tries; aborting boot
digest hash");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bunch of long lines which i'd like to see wrapped (considered not important)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

mvp vm attestation support in propolis-server (rfd 605)

4 participants