Conversation
Cargo.toml
Outdated
| # Attestation | ||
| #dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", branch = "jhendricks/update-sled-agent-types-versions", features = ["sled-agent"] } | ||
| dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", features = ["sled-agent"] } | ||
| vm-attest = { git = "https://github.com/oxidecomputer/vm-attest", rev = "a7c2a341866e359a3126aaaa67823ec5097000cd", default-features = false } |
There was a problem hiding this comment.
most of the Cargo.lock weirdness from dice-verifier -> sled-agent-client -> omciron-common (some previous rev) and that's where the later API dependency stuff we saw in Omicron comes up when building the tuf. sled-agent-client re-exports items out of propolis-client which means we end up in a situation where propolis-server depends on a different rev of propolis-client and everything's Weird.
i'm not totally sure what we want or need to do about this, particularly because we're definitely not using the propolis-client-related parts of sled-agent! we're just using one small part of the API for the RoT calls. but sled-agent and propolis are (i think?) updated in the same deployment unit so the cyclic dependency is fine.
this fixes issues (read: panics) related to AttestSledAgent's internal `rt`, block_on, and dropping.
actually stop the `AttestationSock` when we stop other Propolis devices/backends, and along the way `tcp_attest` -> `attest_handle`.
|
I want to add some comments in the attestation module but from a code-structure perspective @iximeow and I are happy with this. Ready for review! |
bin/propolis-server/src/main.rs
Outdated
| api_runtime.block_on(async { vnc.halt().await }); | ||
| } | ||
|
|
||
| // TODO: clean up attestation server. |
There was a problem hiding this comment.
This can be removed now?
hawkw
left a comment
There was a problem hiding this comment.
Some of the Tokio stuff felt a bit awkward here --- I'd be happy to open a PR against this branch changing some of the things I mentioned, if that's easier for you?
| // TODO: early return if none? | ||
| if let Some(vsock) = &self.spec.vsock { |
There was a problem hiding this comment.
fwiw, i think the TODO is as easy as changing this to
| // TODO: early return if none? | |
| if let Some(vsock) = &self.spec.vsock { | |
| // TODO: early return if none? | |
| let Some(vsock) = &self.spec.vsock else { return; }; |
and then un-indenting everything else in the function basically.
There was a problem hiding this comment.
not super important but this string could be better probably
There was a problem hiding this comment.
turbo nit:
| // table should be we sized this appropriately in testing, so |
There was a problem hiding this comment.
added a couple of commas here in 014950e
| Some(backend.clone_volume()) | ||
| } else { | ||
| // Disk must be read-only to be used for attestation. | ||
| slog::info!(self.log, "boot disk is not read-only"); |
There was a problem hiding this comment.
maybe this should explicitly state that this means it will not be attested?
| #[derive(Debug)] | ||
| enum AttestationInitState { | ||
| Preparing { | ||
| vm_conf_send: oneshot::Sender<VmInstanceConf>, | ||
| }, | ||
| /// A transient state while we're getting the initializer ready, having | ||
| /// taken `Preparing` and its `vm_conf_send`, but before we've got a | ||
| /// `JoinHandle` to track as running. | ||
| Initializing, | ||
| Running { | ||
| init_task: JoinHandle<()>, | ||
| }, | ||
| } | ||
|
|
||
| /// This struct manages providing the requisite data for a corresponding | ||
| /// `AttestationSock` to become fully functional. | ||
| pub struct AttestationSockInit { | ||
| log: slog::Logger, | ||
| vm_conf_send: oneshot::Sender<VmInstanceConf>, | ||
| uuid: uuid::Uuid, | ||
| volume_ref: Option<crucible::Volume>, | ||
| } | ||
|
|
||
| impl AttestationSockInit { | ||
| /// Do any any remaining work of collecting VM RoT measurements in support | ||
| /// of this VM's attestation server. | ||
| pub async fn run(self) { | ||
| let AttestationSockInit { log, vm_conf_send, uuid, volume_ref } = self; | ||
|
|
||
| let mut vm_conf = vm_attest::VmInstanceConf { uuid, boot_digest: None }; | ||
|
|
||
| if let Some(volume) = volume_ref { | ||
| // TODO(jph): make propolis issue, link to #1078 and add a log line | ||
| // TODO: load-bearing sleep: we have a Crucible volume, but we can | ||
| // be here and chomping at the bit to get a digest calculation | ||
| // started well before the volume has been activated; in | ||
| // `propolis-server` we need to wait for at least a subsequent | ||
| // instance start. Similar to the scrub task for Crucible disks, | ||
| // delay some number of seconds in the hopes that activation is done | ||
| // promptly. | ||
| // | ||
| // This should be replaced by awaiting for some kind of actual | ||
| // "activated" signal. | ||
| tokio::time::sleep(std::time::Duration::from_secs(10)).await; | ||
|
|
||
| let boot_digest = | ||
| match crate::attestation::boot_digest::boot_disk_digest( | ||
| volume, &log, | ||
| ) | ||
| .await | ||
| { | ||
| Ok(digest) => digest, | ||
| Err(e) => { | ||
| // a panic here is unfortunate, but helps us debug for | ||
| // now; if the digest calculation fails it may be some | ||
| // retryable issue that a guest OS would survive. but | ||
| // panicking here means we've stopped Propolis at the | ||
| // actual error, rather than noticing the | ||
| // `vm_conf_sender` having dropped elsewhere. | ||
| panic!("failed to compute boot disk digest: {e:?}"); | ||
| } | ||
| }; | ||
|
|
||
| vm_conf.boot_digest = Some(boot_digest); | ||
| } else { | ||
| slog::warn!(log, "not computing boot disk digest"); | ||
| } | ||
|
|
||
| let send_res = vm_conf_send.send(vm_conf); | ||
| if let Err(_) = send_res { | ||
| slog::error!( | ||
| log, | ||
| "attestation server is not listening for its config?" | ||
| ); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Soo, it feels a bit funny to me that this thing is a task we spawn that, when it completes, sends a message over a oneshot channel and then exits, and then we have a JoinHandle<()> for that task. It kinda feels like this could just be a JoinHandle<VmInstanceConf> and make a bunch of this at least a bit simpler?
I'd be happy to throw together a patch that does that refactoring if it's too annoying.
There was a problem hiding this comment.
That's fair. The JoinHandle was from a previous iteration of how we would structure things that looked more like the way we presently handle the VNC server. I'll take a look at how hard this is to remove.
There was a problem hiding this comment.
Since this and also the change in this module that I suggested in #1091 (comment) are kinda just refactoring/tidying things up, I would be fine with leaving a lot of this as-is and then merge some refactoring later --- I'd be happy to open a follow-up PR after this has merged, if that makes life easier for you?
| let mut buffer = | ||
| Buffer::new(this_block_count as usize, block_size as usize); | ||
|
|
||
| // TODO(jph): We don't want to panic in the case of a failed read. How |
There was a problem hiding this comment.
I still need to do this and test on dublin.
| // License, v. 2.0. If a copy of the MPL was not distributed with this | ||
| // file, You can obtain one at https://mozilla.org/MPL/2.0/. | ||
|
|
||
| //! TODO: block comment |
hawkw
left a comment
There was a problem hiding this comment.
The Crucible retry stuff seems pretty much correct, I commented on some minor nitpicks. I think it's fine to defer some of the async refactoring to a subsequent PR, as there isn't anything wrong there, I just think we could maybe make the code a bit simpler. Beyond that, I think that pending whatever testing you need to do, I have no major concerns.
| // Disk must be read-only to be used for attestation. | ||
| slog::info!(self.log, "boot disk is not read-only (and will not be used for attestations)"); |
There was a problem hiding this comment.
turbo nitpick:
| // Disk must be read-only to be used for attestation. | |
| slog::info!(self.log, "boot disk is not read-only (and will not be used for attestations)"); | |
| // Disk must be read-only to be used for attestation. | |
| slog::info!( | |
| self.log, | |
| "boot disk is not read-only (and will not be used for attestations)", | |
| ); |
| name: &str, | ||
| device: &super::Device, | ||
| ) -> Result<VirtioSocket, TomlToSpecError> { | ||
| eprintln!("{:?}", device); |
There was a problem hiding this comment.
i'm guessing this was stuck in for temporary debugging purposes and ought to be removed before release?
| eprintln!("{:?}", device); |
| "starting hash of volume {:?} (total_size={}, block_size={} end_block={}, block_count={})", | ||
| vol_uuid, | ||
| vol_size, | ||
| block_size, | ||
| end_block, | ||
| block_count, |
There was a problem hiding this comment.
nitpicky, unimportant: should these perhaps be structured fields on the log record?
| "starting hash of volume {:?} (total_size={}, block_size={} end_block={}, block_count={})", | |
| vol_uuid, | |
| vol_size, | |
| block_size, | |
| end_block, | |
| block_count, | |
| "starting hash of volume"; | |
| "volume_id" => %vol_uuid,, | |
| "volume_size" => vol_size, | |
| "block_size" => block_size, | |
| "end_block" => end_block, | |
| "block_count" => block_count, |
| slog::error!( | ||
| log, | ||
| "read failed: {e:?}. | ||
| offset={offset}, | ||
| this_block_cout={this_block_count}, | ||
| block_size={block_size}, | ||
| end_block={end_block}" |
There was a problem hiding this comment.
super weird formatting here, can we do something about that? also perhaps these ought to be structured fields...
There was a problem hiding this comment.
and also, perhaps this ought to include the retry count?
| // XXX(JPH): Crucible scrub code also inserts a delay between reads. We probably | ||
| // don't want to do that but we'll see how this goes in production. |
There was a problem hiding this comment.
why would we not want to back off here?
There was a problem hiding this comment.
No particular reason, but I see now this comment is quite confusing so I'll clean that up a bit.
| // Read the whole disk. If a read fails, we'll retry a given number of times, but if those | ||
| // fail, we return an error to the attestation machinery. It's unlikely that instance is | ||
| // doing well in this case, anyway, if it's boot disk is erroring on reads. | ||
| // | ||
| // Options: | ||
| // | ||
| // * retry indefinitely or some N times | ||
| // * you never get an attestation (no boot disk digest) if the hashing | ||
| // fails | ||
| // * you can still get attestations but without a boot disk digest measurement | ||
| // | ||
| // Crucible scrub code also inserts a delay between reads. We probably | ||
| // don't want to do that but release testing will reveal that, | ||
| // hopefully.. | ||
| let res = vol.read(block, &mut buffer).await; | ||
|
|
||
| if let Err(e) = res { | ||
| panic!( | ||
| "read failed: {e:?}. | ||
| // XXX(JPH): Crucible scrub code also inserts a delay between reads. We probably | ||
| // don't want to do that but we'll see how this goes in production. | ||
| let retry_count = 5; | ||
| let mut n_retries = 0; | ||
| loop { | ||
| if n_retries >= retry_count { | ||
| slog::error!(log, "failed to read from boot disk {n_retries} tries; aborting boot | ||
| digest hash"); |
There was a problem hiding this comment.
a bunch of long lines which i'd like to see wrapped (considered not important)
closes #1067
TODO:
Testing notes
No boot disk
Steps to test: create instance, stop it (or don't auto-start it), then remove the boot disk as a boot disk. Send a challenge from inside the guest.
Result: attestation server used just the instance UUID for qualifying data
Failed boot disk
in progress