pdpv0: self-service way for maintainers to get SP logs #712

@BigLep

Description

Done Criteria

Curio operators can easily opt in to sharing their logs with maintainers to enable faster troubleshooting of PDP service provider failures and issues reported from the Synapse SDK.

Why Important

Between now and our November GA launch, making our product robust and reliable is a key focus. We'll be spending significant time debugging issues across distributed service providers. Centralizing logs will dramatically reduce our debugging cycle time by allowing maintainers to:

  • Investigate issues without waiting for SP engagement
  • Identify patterns across multiple SPs
  • Reduce back-and-forth communication overhead

User/Customer

  1. Maintainers/implementers - Can debug issues independently without being blocked waiting for SPs to manually retrieve and share logs
  2. Service Providers (SPs) - Only get pulled into investigations when truly necessary; can opt in with minimal configuration overhead
  3. End users - Benefit from faster issue resolution

Notes

  1. This issue is created in Curio because it will presumably require changes or documentation in Curio.
  2. There is a corresponding issue in the Synapse SDK for getting client-side info as well: "Self-service way for maintainers to get client side errors/logs" (FilOzone/synapse-sdk#328)

Principles

  1. Speed over perfection - We don't want best to be the enemy of better. Focus on getting something functional quickly rather than getting lost in configuration of a bespoke system. Spending money to move quicker or be able to debug more effectively is likely a good tradeoff.
  2. Ease of operation - Current implementers/maintainers are not veteran infrastructure teams. We should strongly favor managed SaaS solutions over self-hosted infrastructure that requires ongoing operational expertise, at least for now in this crunch time.
  3. Opt-in and trust - SPs need to trust where logs are going. They should be sending logs to FOC WG, with clear documentation on data handling and retention.

Technical Considerations

  • This is useful even without request/trace IDs. That work has been pulled into a separate item: "pdpv0: tracing with requestIds" (#721)
  • Solution should ideally support structured logging (JSON) for easier querying
  • Need simple integration path - I assume ideally just configuration changes or a lightweight sidecar
  • I don't have Curio expertise, but I assume this should handle heterogeneous SP environments (different OS, container setups, etc.)

Potential Technology Options to Evaluate

Managed SaaS:

  • Better Stack (Logtail) - Simple HTTP ingestion, modern UI, quick setup
  • Datadog - Comprehensive observability, proven at scale
  • New Relic - Good balance of features and cost
  • Sumo Logic - Cloud-native focus

Self-hosted (probably not recommended given time and team experience):

  • Grafana Loki + Grafana
  • ELK Stack (Elasticsearch + Logstash + Kibana)

Endpoint for maintainers to poll PDP SPs:

  • Rather than SPs publishing logs to a centralized log store, maybe they could instead expose an endpoint where maintainers can request logs from the SP?

Potential Implementation Approach

  1. Select and set up centralized logging service (target: < 1 day)
  2. Provide SPs with simple configuration guide or lightweight logging agent
  3. Create runbook for maintainers on how to query logs for troubleshooting

Success Metrics

  • Time from issue report to log access < 5 minutes
  • Majority of PDP SPs successfully sending logs within 1 week of implementation
  • Reduced median time-to-resolution for debugging issues

Metadata

Assignees

No one assigned

Labels

team/fs-wg: Items being worked on or tracked by the "FS Working Group". See FilOzone/github-mgmt #10
