
audit: module #268

Open

j-rafique wants to merge 8 commits into master from audit-module

Conversation

@j-rafique
Contributor

No description provided.

@roomote

roomote bot commented Feb 6, 2026


Reviewed latest changes (5a09a0a). All previously flagged issues remain resolved. No new issues found.

  • Case-sensitive proof hash comparison in VerifySliceProof handler could cause false negatives if hex characters arrive in uppercase (a case-insensitive comparison is sketched just after this list)
  • latestHeight only checks SdkBlock while estimateEpochDuration falls back to Block -- should use the same pattern
  • Synchronous chain transaction (SubmitEvidence) in the supernode retry loop adds latency per failed attempt and could consume the parent context's deadline
  • Mixed indentation in TestHandshakerConcurrentHandshakes -- the for loop header and first few body lines are one tab deeper than the rest of the loop body (gofmt violation)
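
For reference on the first item, a case-insensitive hex comparison is a one-liner with the standard library. The snippet below is a self-contained sketch; the function name and the trimming are illustrative, not taken from the handler.

package main

import (
    "fmt"
    "strings"
)

// proofHashMatches treats hex digests as equal regardless of letter case,
// so "ABCDEF" from a peer matches a locally computed "abcdef".
func proofHashMatches(got, want string) bool {
    return strings.EqualFold(strings.TrimSpace(got), strings.TrimSpace(want))
}

func main() {
    fmt.Println(proofHashMatches("ABCDEF", "abcdef")) // true
    fmt.Println(proofHashMatches("abc123", "abc124")) // false
}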

Mention @roomote in a comment to request specific changes to this pull request or fix all unresolved issues.

roomote[bot]
roomote bot previously approved these changes Feb 11, 2026

Copilot AI left a comment


Pull request overview

This PR integrates the audit module from the Lumera blockchain (v1.11.0-rc) into the Supernode, introducing epoch-based storage challenge functionality and host reporting capabilities. The changes replace the legacy metrics reporting system with a new audit-driven architecture that submits evidence and epoch reports on-chain.

Changes:

  • Upgraded Lumera dependency from v1.10.0 to v1.11.0-rc across all modules
  • Added audit module integration (query and message modules) to the Lumera client (a rough accessor sketch follows this list)
  • Implemented storage challenge service with deterministic selection, gRPC handlers, and evidence submission
  • Added host reporter service for epoch-scoped reporting (replacing legacy metrics collector)
  • Introduced cascade client failure evidence submission in the SDK
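
As orientation for the new accessors, the sketch below shows one plausible shape of the Audit()/AuditMsg() surface, inferred from the calls visible in the service code quoted later in this review (GetCurrentEpoch, GetEpochAnchor, GetEpochReport, GetAssignedTargets, SubmitEpochReport). The type names, fields, and signatures here are assumptions, not the module's actual API.

package auditsketch

import "context"

// Placeholder result types; the real ones live in the Lumera audit module and
// are not shown in this excerpt.
type (
    CurrentEpochResponse    struct{ EpochId uint64 }
    EpochAnchorResponse     struct{}
    EpochReportResponse     struct{}
    AssignedTargetsResponse struct {
        RequiredOpenPorts       []uint32
        TargetSupernodeAccounts []string
    }
    HostReport struct {
        CpuUsagePercent, MemUsagePercent, DiskUsagePercent float64
    }
    StorageChallengeObservation struct{}
    TxResponse                  struct{}
)

// AuditQuery mirrors the read-side calls used by the storage challenge and
// host reporter services; signatures are assumptions.
type AuditQuery interface {
    GetCurrentEpoch(ctx context.Context) (*CurrentEpochResponse, error)
    GetEpochAnchor(ctx context.Context, epochID uint64) (*EpochAnchorResponse, error)
    GetEpochReport(ctx context.Context, epochID uint64, account string) (*EpochReportResponse, error)
    GetAssignedTargets(ctx context.Context, account string, epochID uint64) (*AssignedTargetsResponse, error)
}

// AuditTx mirrors the write-side call used to publish an epoch report.
type AuditTx interface {
    SubmitEpochReport(ctx context.Context, epochID uint64, report HostReport, obs []*StorageChallengeObservation) (*TxResponse, error)
}

According to the file summary, pkg/lumera/interface.go and pkg/lumera/client.go expose these as Audit() and AuditMsg(); treat everything else above as placeholder naming.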

Reviewed changes

Copilot reviewed 46 out of 50 changed files in this pull request and generated 10 comments.

Summary per file:

| File | Description |
| --- | --- |
| go.mod, tests/system/go.mod, sn-manager/go.mod, cmd/sncli/go.mod | Bumped lumera dependency to v1.11.0-rc |
| go.sum, tests/system/go.sum, sn-manager/go.sum, cmd/sncli/go.sum | Updated checksums for new lumera version |
| pkg/lumera/interface.go, pkg/lumera/client.go | Added Audit() and AuditMsg() module accessors |
| pkg/lumera/modules/audit/*.go | New audit query module implementation |
| pkg/lumera/modules/audit_msg/*.go | New audit transaction module with evidence/report submission |
| supernode/storage_challenge/service.go | Core storage challenge service with epoch-driven execution |
| supernode/storage_challenge/README.md | Documentation for storage challenge protocol |
| supernode/transport/grpc/storage_challenge/handler.go | gRPC handlers for GetSliceProof and VerifySliceProof |
| supernode/host_reporter/service.go | Epoch report submission with port probing |
| pkg/storagechallenge/deterministic/*.go | Deterministic selection algorithms for challengers/files |
| supernode/config/*.go, supernode/config.yml | Added StorageChallengeConfig with defaults |
| supernode/cmd/start.go | Replaced metrics collector with host reporter and storage challenge services |
| sdk/task/evidence.go, sdk/adapters/lumera/adapter.go | Cascade failure evidence submission |
| proto/supernode/storage_challenge.proto | Protobuf definitions for storage challenge service |
| gen/supernode/storage_challenge*.go | Generated gRPC and protobuf code |
| pkg/testutil/lumera.go | Added mock audit modules to test utilities |
| tests/integration/securegrpc/secure_connection_test.go | Fixed test isolation by moving keyring creation into loop |
| .github/workflows/*.yml, .github/actions/setup-env/action.yml | Commented out bust_lumera_retag parameter |
| Makefile | Added storage_challenge.proto to generation targets |


Comment on lines +495 to +521
// parseHostAndPort parses a "host" or "host:port" string and returns a host and port.
// If a port is not present, defaultPort is returned. If a port is present but invalid,
func parseHostAndPort(address string, defaultPort int) (host string, port int, ok bool) {
    address = strings.TrimSpace(address)
    if address == "" {
        return "", 0, false
    }

    // If it looks like a URL, parse and use the host[:port] portion.
    if u, err := url.Parse(address); err == nil && u.Host != "" {
        address = u.Host
    }

    if h, p, err := net.SplitHostPort(address); err == nil {
        h = strings.TrimSpace(h)
        if h == "" {
            return "", 0, false
        }
        if n, err := strconv.Atoi(p); err == nil && n > 0 && n <= 65535 {
            return h, n, true
        }
        return h, defaultPort, true
    }

    // No port present; return default.
    return address, defaultPort, true
}

Copilot AI Feb 18, 2026


In parseHostAndPort, when net.SplitHostPort returns an error (line 508), the function still uses address as the host (line 520). If the address contains invalid characters or is otherwise malformed, this could lead to unexpected behavior. While the caller performs additional validation, consider adding a check that the address is a plausible host before returning it.
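
If such a check were added, it could look roughly like the sketch below. isPlausibleHost is a hypothetical helper, and the acceptance rules (IP literal, or letters/digits/'-'/'.' with no leading or trailing hyphen) are assumptions rather than the caller's actual validation.

package main

import (
    "fmt"
    "net"
    "strings"
)

// isPlausibleHost accepts an IP literal or a hostname-looking string.
// The rules here are deliberately conservative and purely illustrative.
func isPlausibleHost(h string) bool {
    if h == "" {
        return false
    }
    if net.ParseIP(h) != nil {
        return true
    }
    for _, r := range h {
        ok := r == '-' || r == '.' ||
            (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9')
        if !ok {
            return false
        }
    }
    return !strings.HasPrefix(h, "-") && !strings.HasSuffix(h, "-")
}

func main() {
    fmt.Println(isPlausibleHost("example.com")) // true
    fmt.Println(isPlausibleHost("bad host!"))   // false
}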

Comment on lines 4 to 46
@@ -33,17 +33,17 @@ runs:
sudo apt-get update
sudo apt-get install -y libwebp-dev make

- name: One-time reset retagged lumera checksums
if: ${{ inputs.bust_lumera_retag == 'true' }}
shell: bash
run: |
echo "Busting go.sum entries for github.com/LumeraProtocol/lumera v1.8.0 (one-time)"
# Remove stale checksums in all local modules
find . -name 'go.sum' -maxdepth 3 -print0 | xargs -0 -I{} sed -i \
'/github.com\/LumeraProtocol\/lumera v1.8.0/d' {}
# Clear module/build caches to avoid cached zips
go clean -modcache || true
rm -rf "$(go env GOCACHE)" || true
# - name: One-time reset retagged lumera checksums
# if: ${{ inputs.bust_lumera_retag == 'true' }}
# shell: bash
# run: |
# echo "Busting go.sum entries for github.com/LumeraProtocol/lumera v1.11.0-rc (one-time)"
# # Remove stale checksums in all local modules
# find . -name 'go.sum' -maxdepth 3 -print0 | xargs -0 -I{} sed -i \
# '/github.com\/LumeraProtocol\/lumera v1.11.0-rc/d' {}
# # Clear module/build caches to avoid cached zips
# go clean -modcache || true
# rm -rf "$(go env GOCACHE)" || true

Copilot AI Feb 18, 2026


The bust_lumera_retag input parameter has been commented out but not removed entirely. This suggests it may still be needed for future use. Consider either fully removing it if it's no longer needed, or adding a comment explaining why it's temporarily disabled and under what conditions it should be re-enabled.

Comment on lines +23 to +75
func (t *BaseTask) submitCascadeClientFailureEvidence(
    ctx context.Context,
    subjectAddress string,
    targetSupernodeAccounts []string,
    details map[string]string,
) {
    subjectAddress = strings.TrimSpace(subjectAddress)
    if subjectAddress == "" {
        return
    }

    submitter, ok := any(t.client).(cascadeClientFailureEvidenceSubmitter)
    if !ok {
        t.logger.Debug(ctx, "Cascade client failure evidence submitter not configured")
        return
    }

    if details == nil {
        details = map[string]string{}
    }
    if _, exists := details["task_id"]; !exists {
        details["task_id"] = t.TaskID
    }
    if _, exists := details["action_id"]; !exists {
        details["action_id"] = t.ActionID
    }

    targetsCopy := append([]string(nil), targetSupernodeAccounts...)
    detailsCopy := make(map[string]string, len(details))
    for k, v := range details {
        detailsCopy[k] = v
    }

    // Evidence submission should not block retry loops.
    go func(parent context.Context, subject string, actionID string, targets []string, metadata map[string]string) {
        submitCtx, cancel := context.WithTimeout(context.WithoutCancel(parent), cascadeEvidenceSubmitTimeout)
        defer cancel()

        if err := submitter.SubmitCascadeClientFailureEvidence(
            submitCtx,
            subject,
            actionID,
            targets,
            metadata,
        ); err != nil {
            t.logger.Warn(submitCtx, "Failed to submit cascade client failure evidence",
                "subject_address", subject,
                "targets", targets,
                "error", err,
            )
        }
    }(ctx, subjectAddress, t.ActionID, targetsCopy, detailsCopy)
}

Copilot AI Feb 18, 2026


The submitCascadeClientFailureEvidence function launches a goroutine with context.WithoutCancel(parent) to ensure evidence submission completes even if the parent context is cancelled. However, this could lead to goroutine leaks if the submission context never completes (e.g., network issues). The 10-second timeout provides some protection, but consider adding monitoring or limiting the number of concurrent evidence submissions to prevent resource exhaustion in high-failure scenarios.
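
If bounding ever becomes necessary, one common pattern is a semaphore that drops submissions when saturated rather than queueing goroutines. The sketch below is self-contained and illustrative: evidenceLimiter, trySubmit, and the drop-on-saturation policy are assumptions, and submitFn stands in for SubmitCascadeClientFailureEvidence.

package main

import (
    "context"
    "fmt"
    "time"
)

// evidenceLimiter caps the number of in-flight evidence submissions.
type evidenceLimiter struct {
    sem chan struct{}
}

func newEvidenceLimiter(max int) *evidenceLimiter {
    return &evidenceLimiter{sem: make(chan struct{}, max)}
}

// trySubmit runs submitFn in a goroutine only if a slot is free; otherwise it
// drops the submission instead of piling up goroutines.
func (l *evidenceLimiter) trySubmit(parent context.Context, timeout time.Duration, submitFn func(context.Context) error) bool {
    select {
    case l.sem <- struct{}{}:
    default:
        return false // saturated: drop rather than leak
    }
    go func() {
        defer func() { <-l.sem }()
        ctx, cancel := context.WithTimeout(context.WithoutCancel(parent), timeout)
        defer cancel()
        _ = submitFn(ctx)
    }()
    return true
}

func main() {
    l := newEvidenceLimiter(4)
    ok := l.trySubmit(context.Background(), 10*time.Second, func(ctx context.Context) error { return nil })
    fmt.Println("accepted:", ok)
    time.Sleep(100 * time.Millisecond) // give the background submission a moment in this demo
}

Whether dropping or queueing is the right saturation policy depends on how much evidence loss is acceptable; that choice is an assumption here, not something the PR decides.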

Comment on lines +125 to +259
func (s *Service) Run(ctx context.Context) error {
    if !s.cfg.Enabled {
        <-ctx.Done()
        return nil
    }

    if err := s.initClients(); err != nil {
        return err
    }

    // Effective knobs (production defaults). Jitter is bounded by the epoch length
    // to avoid sleeping past the epoch window on short epochs.
    lookbackEpochs := scCandidateKeysLookbackEpochs
    respTimeout := scResponseTimeout
    affirmTimeout := scAffirmationTimeout
    logtrace.Debug(ctx, "storage challenge runtime knobs", logtrace.Fields{
        "start_jitter_ms":         scStartJitterMs,
        "response_timeout_ms":     respTimeout.Milliseconds(),
        "affirmation_timeout_ms":  affirmTimeout.Milliseconds(),
        "submit_evidence_config":  s.cfg.SubmitEvidence,
        "poll_interval_ms":        s.cfg.PollInterval.Milliseconds(),
        "sc_files_per_challenger": scFilesPerChallenger,
        "sc_replica_count":        scReplicaCount,
        "sc_observer_threshold":   scObserverThreshold,
        "sc_keys_lookback_epochs": lookbackEpochs,
    })

    ticker := time.NewTicker(s.cfg.PollInterval)
    defer ticker.Stop()

    var lastRunEpoch uint64
    var lastRunOK bool
    var loggedAlreadyRanEpoch uint64
    var loggedNotSelectedEpoch uint64
    var loggedDisabledEpoch uint64

    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            height, ok := s.latestHeight(ctx)
            if !ok {
                continue
            }

            params, ok := s.auditParams(ctx)
            if !ok {
                continue
            }

            epochID, ok := deterministic.EpochID(height, params.EpochZeroHeight, params.EpochLengthBlocks)
            if !ok {
                continue
            }
            if !params.ScEnabled {
                if loggedDisabledEpoch != epochID {
                    logtrace.Debug(ctx, "storage challenge disabled by on-chain params", logtrace.Fields{"epoch_id": epochID})
                    loggedDisabledEpoch = epochID
                }
                lastRunEpoch = epochID
                lastRunOK = true
                continue
            }
            if lastRunOK && lastRunEpoch == epochID {
                if loggedAlreadyRanEpoch != epochID {
                    logtrace.Debug(ctx, "storage challenge already ran this epoch; skipping", logtrace.Fields{"epoch_id": epochID})
                    loggedAlreadyRanEpoch = epochID
                }
                continue
            }

            anchorResp, err := s.lumera.Audit().GetEpochAnchor(ctx, epochID)
            if err != nil || anchorResp == nil || anchorResp.Anchor.EpochId != epochID {
                // Anchor may not be committed yet at epoch boundary; retry on next tick.
                continue
            }
            anchor := anchorResp.Anchor

            challengers := deterministic.SelectChallengers(anchor.ActiveSupernodeAccounts, anchor.Seed, epochID, params.ScChallengersPerEpoch)
            if !containsString(challengers, s.identity) {
                if loggedNotSelectedEpoch != epochID {
                    logtrace.Debug(ctx, "storage challenge: not selected challenger; skipping", logtrace.Fields{
                        "epoch_id": epochID,
                        "identity": s.identity,
                        "selected": len(challengers),
                        "sc_param": params.ScChallengersPerEpoch,
                    })
                    loggedNotSelectedEpoch = epochID
                }
                lastRunEpoch = epochID
                lastRunOK = true
                continue
            }

            // Bound jitter by a conservative estimate of epoch duration (assume ~1s blocks).
            // This is intentionally simple and is primarily to avoid sleeping past the epoch window.
            jitterMaxMs := scStartJitterMs
            epochBudgetMs := uint64(params.EpochLengthBlocks) * 1000
            if epochBudgetMs > 0 && epochBudgetMs/2 < jitterMaxMs {
                jitterMaxMs = epochBudgetMs / 2
            }

            jitterMs := deterministic.DeterministicJitterMs(anchor.Seed, epochID, s.identity, jitterMaxMs)
            if jitterMs > 0 {
                logtrace.Debug(ctx, "storage challenge jitter sleep", logtrace.Fields{
                    "epoch_id":      epochID,
                    "jitter_ms":     jitterMs,
                    "jitter_max_ms": jitterMaxMs,
                    "challenger_id": s.identity,
                })
                timer := time.NewTimer(time.Duration(jitterMs) * time.Millisecond)
                select {
                case <-ctx.Done():
                    timer.Stop()
                    return ctx.Err()
                case <-timer.C:
                }
            }

            if err := s.runEpoch(ctx, anchor, params, lookbackEpochs, respTimeout, affirmTimeout); err != nil {
                logtrace.Warn(ctx, "storage challenge epoch run error", logtrace.Fields{
                    "epoch_id": epochID,
                    "error":    err.Error(),
                })
                lastRunEpoch = epochID
                lastRunOK = false
                continue
            }

            lastRunEpoch = epochID
            lastRunOK = true
        }
    }
}

Copilot AI Feb 18, 2026


The storage challenge service uses lastRunOK and lastRunEpoch variables to track whether a challenge has already run for an epoch (lines 155-156). However, these variables are not protected by any synchronization mechanism. If the Run method could be called concurrently (though it appears to be designed for single-threaded use), this could lead to race conditions. Consider documenting that Run must not be called concurrently, or add synchronization if concurrent calls are expected.
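
If documentation alone feels too weak, a cheap guard is an atomic flag that rejects overlapping calls. The sketch below uses a stand-in Service type; only the CompareAndSwap pattern is meant to carry over.

package main

import (
    "context"
    "errors"
    "fmt"
    "sync/atomic"
)

// Service is a stand-in for the storage challenge service; only the guard matters.
type Service struct {
    running atomic.Bool
}

var errAlreadyRunning = errors.New("storage challenge: Run already in progress")

// Run refuses to start if another invocation is active, which removes any race
// on loop-local state such as lastRunEpoch/lastRunOK.
func (s *Service) Run(ctx context.Context) error {
    if !s.running.CompareAndSwap(false, true) {
        return errAlreadyRunning
    }
    defer s.running.Store(false)

    // ... the existing ticker loop would run here ...
    <-ctx.Done()
    return ctx.Err()
}

func main() {
    s := &Service{}
    s.running.Store(true) // simulate an in-flight Run
    fmt.Println(errors.Is(s.Run(context.Background()), errAlreadyRunning)) // true
}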

}

func RegisterStorageChallengeServiceServer(s grpc.ServiceRegistrar, srv StorageChallengeServiceServer) {
// If the following call pancis, it indicates UnimplementedStorageChallengeServiceServer was

Copilot AI Feb 18, 2026


Typo in generated code comment: "pancis" should be "panics". This appears to be in generated code from protoc-gen-go-grpc. Consider updating the generator version or reporting this upstream if it persists.

Suggested change
// If the following call pancis, it indicates UnimplementedStorageChallengeServiceServer was
// If the following call panics, it indicates UnimplementedStorageChallengeServiceServer was

}

// parseHostAndPort parses a "host" or "host:port" string and returns a host and port.
// If a port is not present, defaultPort is returned. If a port is present but invalid,

Copilot AI Feb 18, 2026


The function comment states "If a port is not present, defaultPort is returned. If a port is present but invalid," but the sentence is incomplete. The comment should be completed to explain what happens when a port is present but invalid (the code shows it returns the host with defaultPort).
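
Based on the behavior visible in the quoted code, a completed comment could read roughly as follows (the wording is a suggestion):

// parseHostAndPort parses a "host" or "host:port" string and returns a host and port.
// If a port is not present, defaultPort is returned. If a port is present but invalid
// (non-numeric or outside 1-65535), the host is returned together with defaultPort.
// ok is false only when the trimmed host is empty.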

Comment on lines +114 to +194
func (s *Server) persistRecipientProof(ctx context.Context, req *supernode.GetSliceProofRequest, resp *supernode.GetSliceProofResponse) {
    if s.store == nil {
        return
    }

    challenge := types.MessageData{
        ChallengerID: req.ChallengerId,
        RecipientID:  req.RecipientId,
        Observers:    append([]string(nil), req.ObserverIds...),
        Challenge: types.ChallengeData{
            FileHash:   req.FileKey,
            StartIndex: int(req.RequestedStart),
            EndIndex:   int(req.RequestedEnd),
            Timestamp:  time.Now().UTC(),
        },
    }
    challengeBz, _ := json.Marshal(challenge)
    _ = s.store.InsertStorageChallengeMessage(types.StorageChallengeLogMessage{
        MessageType:     int(types.ChallengeMessageType),
        ChallengeID:     req.ChallengeId,
        Data:            challengeBz,
        Sender:          s.identity,
        SenderSignature: []byte{},
    })

    response := types.MessageData{
        ChallengerID: req.ChallengerId,
        RecipientID:  req.RecipientId,
        Observers:    append([]string(nil), req.ObserverIds...),
        Response: types.ResponseData{
            Hash:      resp.ProofHashHex,
            Timestamp: time.Now().UTC(),
        },
    }
    responseBz, _ := json.Marshal(response)
    _ = s.store.InsertStorageChallengeMessage(types.StorageChallengeLogMessage{
        MessageType:     int(types.ResponseMessageType),
        ChallengeID:     req.ChallengeId,
        Data:            responseBz,
        Sender:          s.identity,
        SenderSignature: []byte{},
    })

    logtrace.Debug(ctx, "storage challenge proof served", logtrace.Fields{
        "challenge_id": req.ChallengeId,
        "file_key":     req.FileKey,
        "start":        resp.Start,
        "end":          resp.End,
    })
}

func (s *Server) persistObserverVerification(ctx context.Context, req *supernode.VerifySliceProofRequest, resp *supernode.VerifySliceProofResponse) {
    if s.store == nil {
        return
    }

    eval := types.MessageData{
        ChallengerID: req.ChallengerId,
        RecipientID:  req.RecipientId,
        Observers:    []string{s.identity},
        ObserverEvaluation: types.ObserverEvaluationData{
            IsEvaluationResultOK: resp.Ok,
            Reason:               resp.Error,
            TrueHash:             req.ProofHashHex,
            Timestamp:            time.Now().UTC(),
        },
    }
    bz, _ := json.Marshal(eval)
    _ = s.store.InsertStorageChallengeMessage(types.StorageChallengeLogMessage{
        MessageType:     int(types.AffirmationMessageType),
        ChallengeID:     req.ChallengeId,
        Data:            bz,
        Sender:          s.identity,
        SenderSignature: []byte{},
    })

    logtrace.Debug(ctx, "storage challenge proof verified", logtrace.Fields{
        "challenge_id": req.ChallengeId,
        "ok":           resp.Ok,
    })
}

Copilot AI Feb 18, 2026


Error handling in persistRecipientProof and persistObserverVerification silently ignores errors from json.Marshal and InsertStorageChallengeMessage. While these operations are best-effort logging, consider at least logging these errors at debug level for troubleshooting purposes, similar to how other errors are handled in this codebase.
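
A hedged sketch of that approach is below. persistBestEffort is an illustrative helper, and logDebug/insert stand in for logtrace.Debug and s.store.InsertStorageChallengeMessage, whose exact signatures may differ.

package main

import (
    "context"
    "encoding/json"
    "fmt"
)

// persistBestEffort marshals and stores a payload, logging (rather than
// discarding) any failure so it shows up during troubleshooting.
func persistBestEffort(ctx context.Context, challengeID string, payload any,
    insert func([]byte) error, logDebug func(context.Context, string, map[string]any)) {

    bz, err := json.Marshal(payload)
    if err != nil {
        logDebug(ctx, "storage challenge log marshal failed", map[string]any{
            "challenge_id": challengeID, "error": err.Error(),
        })
        return
    }
    if err := insert(bz); err != nil {
        logDebug(ctx, "storage challenge log insert failed", map[string]any{
            "challenge_id": challengeID, "error": err.Error(),
        })
    }
}

func main() {
    persistBestEffort(context.Background(), "ch-1", map[string]string{"k": "v"},
        func(b []byte) error { fmt.Println("stored", len(b), "bytes"); return nil },
        func(ctx context.Context, msg string, f map[string]any) { fmt.Println(msg, f) },
    )
}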

Comment on lines +168 to +215
func (s *Service) buildStorageChallengeObservations(ctx context.Context, epochID uint64, requiredOpenPorts []uint32, targets []string) []*audittypes.StorageChallengeObservation {
    if len(targets) == 0 {
        return nil
    }

    out := make([]*audittypes.StorageChallengeObservation, len(targets))

    type workItem struct {
        index  int
        target string
    }

    work := make(chan workItem)
    done := make(chan struct{})

    worker := func() {
        defer func() { done <- struct{}{} }()
        for item := range work {
            out[item.index] = s.observeTarget(ctx, epochID, requiredOpenPorts, item.target)
        }
    }

    workers := maxConcurrentTargets
    if workers > len(targets) {
        workers = len(targets)
    }
    for i := 0; i < workers; i++ {
        go worker()
    }

    for i, t := range targets {
        work <- workItem{index: i, target: t}
    }
    close(work)

    for i := 0; i < workers; i++ {
        <-done
    }

    // ensure no nil elements (MsgSubmitEpochReport rejects nil observations)
    final := make([]*audittypes.StorageChallengeObservation, 0, len(out))
    for i := range out {
        if out[i] != nil {
            final = append(final, out[i])
        }
    }
    return final
}

Copilot AI Feb 18, 2026


The buildStorageChallengeObservations function uses goroutines to probe targets concurrently but writes to a shared out slice without synchronization. While the workers process distinct indices making this technically safe, consider using proper synchronization primitives (like sync.Mutex) or channels to make the concurrency pattern more explicit and maintainable. This would prevent potential data races if the code is refactored in the future.
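
For comparison, the same fan-out expressed with sync.WaitGroup plus a semaphore makes the one-writer-per-index invariant explicit. probe stands in for s.observeTarget and maxWorkers for maxConcurrentTargets; the rest is a self-contained sketch.

package main

import (
    "fmt"
    "sync"
)

// probeAll runs probe over targets with at most maxWorkers in flight.
// Each goroutine writes only its own index of out, so no extra locking is needed.
func probeAll(targets []string, maxWorkers int, probe func(string) string) []string {
    out := make([]string, len(targets))
    sem := make(chan struct{}, maxWorkers)
    var wg sync.WaitGroup

    for i, t := range targets {
        wg.Add(1)
        sem <- struct{}{}
        go func(i int, t string) {
            defer wg.Done()
            defer func() { <-sem }()
            out[i] = probe(t) // one writer per index
        }(i, t)
    }
    wg.Wait()
    return out
}

func main() {
    res := probeAll([]string{"a", "b", "c"}, 2, func(t string) string { return "ok:" + t })
    fmt.Println(res)
}

Functionally this matches the existing channel-based version; it simply states the concurrency contract more directly.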

Comment on lines +104 to +134
func SelectFileKeys(candidateKeys []string, seed []byte, epochID uint64, challengerID string, count uint32) []string {
    if count == 0 || len(candidateKeys) == 0 {
        return nil
    }

    keys := append([]string(nil), candidateKeys...)
    sort.Strings(keys)

    want := int(count)
    if want > len(keys) {
        want = len(keys)
    }

    out := make([]string, 0, want)
    seedHex := hex.EncodeToString(seed)

    for i := 0; i < want; i++ {
        msg := []byte("sc:files:" + challengerID + ":" + strconv.FormatUint(epochID, 10) + ":" + seedHex + ":" + strconv.Itoa(i))
        sum := blake3.Sum256(msg)
        idx := int(binary.BigEndian.Uint64(sum[0:8]) % uint64(len(keys)))
        out = append(out, keys[idx])

        // remove selected key (stable, deterministic)
        keys = append(keys[:idx], keys[idx+1:]...)
        if len(keys) == 0 {
            break
        }
    }

    return out
}

Copilot AI Feb 18, 2026


The SelectFileKeys function removes selected keys from the candidate list to avoid duplicates (line 127). However, this modifies the slice during iteration which could be problematic. While the current implementation appears correct (creating a copy with append and reassigning), consider adding a comment explaining this deduplication behavior to make the intent clearer for future maintainers.
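
The sketch below restates the selection-without-replacement idea with the deduplication spelled out in comments. It hashes with FNV purely to stay dependency-free, whereas the real code uses blake3.Sum256; the helper name and label format are illustrative.

package main

import (
    "fmt"
    "hash/fnv"
)

// pickWithoutReplacement deterministically draws up to `rounds` items from keys.
func pickWithoutReplacement(keys []string, rounds int, label string) []string {
    pool := append([]string(nil), keys...) // work on a copy; the caller's slice is untouched
    out := make([]string, 0, rounds)
    for i := 0; i < rounds && len(pool) > 0; i++ {
        h := fnv.New64a()
        fmt.Fprintf(h, "%s:%d", label, i)
        idx := int(h.Sum64() % uint64(len(pool)))
        out = append(out, pool[idx])
        // Remove the chosen key so later rounds cannot pick it again; this is the
        // deduplication step the review comment asks to document explicitly.
        pool = append(pool[:idx], pool[idx+1:]...)
    }
    return out
}

func main() {
    fmt.Println(pickWithoutReplacement([]string{"k1", "k2", "k3", "k4"}, 2, "sc:files:demo"))
}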

Comment on lines +105 to +155
func (s *Service) tick(ctx context.Context) {
    epochResp, err := s.lumera.Audit().GetCurrentEpoch(ctx)
    if err != nil || epochResp == nil {
        return
    }
    epochID := epochResp.EpochId
    reachability.SetCurrentEpochID(epochID)

    anchorResp, err := s.lumera.Audit().GetEpochAnchor(ctx, epochID)
    if err != nil || anchorResp == nil || anchorResp.Anchor.EpochId != epochID {
        // Anchor may not be committed yet at the epoch boundary; retry on next tick.
        return
    }

    // Idempotency: if a report exists for this epoch, do nothing.
    if _, err := s.lumera.Audit().GetEpochReport(ctx, epochID, s.identity); err == nil {
        return
    } else if status.Code(err) != codes.NotFound {
        return
    }

    assignResp, err := s.lumera.Audit().GetAssignedTargets(ctx, s.identity, epochID)
    if err != nil || assignResp == nil {
        return
    }

    storageChallengeObservations := s.buildStorageChallengeObservations(ctx, epochID, assignResp.RequiredOpenPorts, assignResp.TargetSupernodeAccounts)

    hostReport := audittypes.HostReport{
        // Intentionally submit 0% usage for CPU/memory so the chain treats these as "unknown".
        // Disk usage is reported accurately (legacy-aligned) so disk-based enforcement can work.
        CpuUsagePercent: 0,
        MemUsagePercent: 0,
    }
    if diskUsagePercent, ok := s.diskUsagePercent(ctx); ok {
        hostReport.DiskUsagePercent = diskUsagePercent
    }

    if _, err := s.lumera.AuditMsg().SubmitEpochReport(ctx, epochID, hostReport, storageChallengeObservations); err != nil {
        logtrace.Warn(ctx, "epoch report submit failed", logtrace.Fields{
            "epoch_id": epochID,
            "error":    err.Error(),
        })
        return
    }

    logtrace.Info(ctx, "epoch report submitted", logtrace.Fields{
        "epoch_id":                             epochID,
        "storage_challenge_observations_count": len(storageChallengeObservations),
    })
}

Copilot AI Feb 18, 2026


The tick function queries the epoch anchor and checks for an existing report, but there's a race condition window between the idempotency check (line 120) and the report submission (line 143). If multiple instances or goroutines call tick simultaneously, they could both pass the idempotency check and attempt to submit duplicate reports. Consider using a mutex or atomic flag to prevent concurrent execution of the tick logic for the same epoch.
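
If ticks can in practice run concurrently, a small process-local guard removes that window. The sketch below is illustrative (epochGuard and tryClaim are invented names), and the on-chain GetEpochReport check would remain the source of truth.

package main

import (
    "fmt"
    "sync"
)

// epochGuard hands out at most one submission claim per epoch ID within this process.
type epochGuard struct {
    mu        sync.Mutex
    submitted map[uint64]bool
}

func newEpochGuard() *epochGuard {
    return &epochGuard{submitted: make(map[uint64]bool)}
}

// tryClaim returns true exactly once per epoch ID.
func (g *epochGuard) tryClaim(epochID uint64) bool {
    g.mu.Lock()
    defer g.mu.Unlock()
    if g.submitted[epochID] {
        return false
    }
    g.submitted[epochID] = true
    return true
}

func main() {
    g := newEpochGuard()
    fmt.Println(g.tryClaim(42)) // true: this caller submits
    fmt.Println(g.tryClaim(42)) // false: duplicate attempt skipped
}

A production version would likely also release the claim when SubmitEpochReport fails so a later tick can retry.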
