
Conversation

@hsmatulis hsmatulis commented Jun 27, 2025

This PR introduces a new package, secrets, to prometheus/common. This package provides a unified way to handle secrets within configuration files for Prometheus and its ecosystem components. It is designed to be extensible and observable. See the proposal here

@hsmatulis hsmatulis force-pushed the main branch 3 times, most recently from cf118e6 to 24ec0f7 Compare July 7, 2025 11:07
@hsmatulis hsmatulis force-pushed the main branch 19 times, most recently from 694bf26 to ef24c3d Compare October 2, 2025 07:40
@hsmatulis hsmatulis changed the title feat(secrets): Introduce new package for dynamic secret management feat(secrets): Add new secrets management package Oct 2, 2025
@hsmatulis hsmatulis marked this pull request as ready for review October 2, 2025 07:42
Author

hsmatulis commented Oct 2, 2025

@pintohutch @bernot-dev @bwplotka PTAL, should be ready!

@bwplotka bwplotka self-requested a review October 8, 2025 15:15
@bwplotka bwplotka (Member) left a comment

Thanks, amazing work!

Generally, it's great. Maybe the first big question is whether we want SecretField to store so much state, or whether we should move some of this state to the manager (and/or providers). That might be more composable and easier to reason about; Prometheus discovery does some of this. I mentioned one idea (A), parts 1 and 2, in the comments.

However, it's not a blocker; even in the current state I would say we could try this out on Prometheus (and someone might try on the AM side!). What matters is that I see a clean YAML surface format, clean code, idiomatic struct reflection to find fields, and a healthy amount of tests, so amazing! With this we can iterate.

No matter whether you want to try (A) or skip it for now, here's what I would try to check before merging:

Then I would like to review the manager code in more depth, but generally... we could start with this! 🎉 Thanks!

}

type SecretFieldSettings struct {
RefreshInterval time.Duration `yaml:"refreshInterval,omitempty"`
Member

question: Hmm, good question whether refresh is really relevant to all providers... E.g. it's not for inline and... also not for file-based? Refreshing in general is a key concept here, so we at least need good commentary in the code for this field on how it's supposed to be implemented and consumed.

Author

Hmm, I guess it's not relevant to inline but it could be relevant to file. Although we might want to think about listening to filesystem changes for a file-based provider instead, I found that complicates the API quite a bit...

Member

Again, a comment would be nice. I'm worried it's YAGNI: we don't know if refresh in this form would even be useful for other providers (we're guessing). Perhaps moving this inside the provider config is safer?

return secret.secret
}

func (m *Manager) triggerRefresh(s *SecretField) {
Member

note to myself: Manager code needs deeper review; quite a lot of locks 🙈 prone to errors, but perhaps needed.

Author

Yeah, it was very complicated before with the verify logic; I think it should be a bit simpler now. It still needs some cleanup, but it should be more readable.

@bwplotka bwplotka (Member) left a comment

Looks solid, thanks!

Still a few comments to address, but overall it looks close to being merged!

@hsmatulis hsmatulis force-pushed the main branch 2 times, most recently from 37e45e0 to 3abfb21 Compare November 20, 2025 06:36
@bwplotka bwplotka (Member) left a comment

Solid work; did an early pass, will do more review next week, thanks! 💪🏽

@@ -0,0 +1,18 @@
# Secret Management
Member

Nice!

One thing to discuss / put in the README: how does this relate to the existing

type Secret string

What would be the plan forward? Deprecate config.Secret?


// PolicyFunc returns true if secrets should be printed (insecure),
// and false if they should be scrubbed (secure).
type PolicyFunc func() bool
@bwplotka bwplotka (Member) Nov 21, 2025

suggestion: We have an epic opportunity to deglobalize this and have a Field.redact bool field, with perhaps a DAG-walking method to set it all on or off per config. WDYT?

Also, if we go with a bool, we could just do a simpler global, as in https://github.com/prometheus/common/blob/d80d8544703e59a080a204b6f7429ac6561fb24f/config/config.go#L33C5-L33C23

@bwplotka bwplotka (Member) left a comment

Thanks for this work! Solid, high-quality code.

Current Algorithm

To review properly I researched what the algorithm is and took notes along the way. Perhaps useful for other reviewers:

  1. Unmarshal instantiates secret fields in a raw state. It does NOT unmarshal the content of the secret YAML yet; we keep it raw in Field.rawConfig.
  2. secrets.Manager "registers" the unmarshalled config structure.
     • It uses value reflection (not type reflection) to gather all configured fields into a map by "path".
     • It goes through the paths (fields) and unmarshals them properly here (parseRawConfig):
       • It uses a hack, convertConfig, to marshal any to YAML and unmarshal it back into a concrete struct (wonder if there's an alternative, but good for now I guess?).
       • For complex providers (not inline) it constructs provider data and settings:
         • It retrieves the providerConfig to unmarshal.
         • It unmarshals providerConfig.
         • It unmarshals settings (refresh interval, if any).
         • It packs everything into a fieldState.
     • It registers each secret field's state as a "managedSecret" by path+provider (or updates the refresh interval).
  3. Manager.Start starts goroutines that actually fetch secrets asynchronously.
  4. You have to ask for readiness using manager.SecretsReady(&cfg).
     • It goes through all paths again (why? We have everything in m.secrets).
     • "// This method should be called before accessing any secret values to ensure that they are available."
  5. You can get the value without an error.

High Level Suggestions

Sorry for complaining. I do think we could slightly improve the flow. It's mostly great, but I suspect it might be awkward to fit into the Prometheus ApplyConfig reloading pattern. The main bottlenecks IMO are:

  • (5) Get tries to avoid returning an error, but I wonder whether this use case is even worth optimizing for. In the current flow you have to ask for SecretsReady (4) before every Get (reading from the comments)? That returns an error, so why not have Get() (ret, error) wait until ready or return the old value 🤔
  • Prometheus dynamic reloading means we unmarshal the config, sometimes every minute. Here it seems you need a new manager for every unmarshal? That might be a problem.
  • Do we envision problems with duplicates between provider names and field-level settings?
  • Smaller nits (inlined comments)

How to progress

Some of the above suggestions are opinionated; I might be wrong. So... have you tried adopting this PR in Prometheus? I think that would be the ultimate test of how it fits. Maybe you did and it fits well. Can we have a quick draft PR that imports this commit? I think that's a blocker to really merging this PR (proof this code will work). WDYT?

Also, such long iteration cycles make it harder to review (context gathering). Plus it's easy to complain. I wonder if we could do some pair programming, or I could propose changes in a PR against your PR? Would that help to make progress? Or would you prefer to continue with async reviews only?

// convertConfig takes a yaml-parsed any and unmarshals it into a typed struct.
// It achieves this by first marshalling the input to YAML and then unmarshalling
// it into the target struct.
func convertConfig[T any](source any, target T) error {
Member

Perhaps more readable?

Suggested change
func convertConfig[T any](source any, target T) error {
func unmarshalTo[T any](source any, target T) error {

// It achieves this by first marshalling the input to YAML and then unmarshalling
// it into the target struct.
func convertConfig[T any](source any, target T) error {
bytes, err := yaml.Marshal(source)
Member

This is interesting. I thought any was internally just the bytes of the unmarshalled blob; sounds like it's not so simple?


// splitProviderAndSettings separates provider-specific configuration from the generic SecretField settings.
func splitProviderAndSettings(provReg *ProviderRegistry, baseMap mapType) (providerName string, providerData interface{}, settingsMap mapType, err error) {
settingsMap = make(mapType)
Member

nit: In Go we typically just inline the map instantiation; there's no need for mapType.

providerName = k
providerData = v
} else {
// If it's not a provider key, treat it as a setting.
Member

For readability (I was confused reading this code):

Suggested change
// If it's not a provider key, treat it as a setting.
// If it's not a provider key, treat it as a setting. We will detect missing provider later.

providerData = v
} else {
// If it's not a provider key, treat it as a setting.
settingsMap[k] = v
Member

Does this mean settings fields can conflict with a provider name? Is that a problem? Do we need settings at all? The example does not show them. Can we apply YAGNI and descope (add them later when needed)?

Member

I also wouldn't mind if we killed FieldSettings and duplicated the refresh and other logic in the provider. We can share common things once we have at least two complex providers... (:

Member

I guess to avoid/handle dups one could try to unmarshal FieldSettings here? Anyway, it's just a bit of a pain to add more fields: you never know if one conflicts with a dynamic provider someone might implement in a fork. Perhaps it's easier to inline this with the provider?

metricLabels: labels,
provider: provider,
}
m.secrets[state.id()] = ms
Member

Can we call id() once and save it in a local var?

m.secrets[state.id()] = ms
m.secretState.With(labels).Set(stateInitializing)
m.fetchSuccessTotal.With(labels).Add(0)
m.fetchFailuresTotal.With(labels).Add(0)
Member

return early?


// Create a secret manager. This discovers and manages all Fields in cfg.
// The manager will handle refreshing secrets in the background.
manager, err := secrets.NewManager(promRegisterer, secrets.Providers, &cfg)
Member

question: What's the process for refreshing secrets on reconfiguration (very common in Prometheus -- ApplyConfig)? It sounds like with this abstraction we need to recreate the manager (secrets.NewManager has to be re-run)?


Signed-off-by: Henrique Matulis <hmatulis@google.com>
Copilot AI review requested due to automatic review settings December 4, 2025 11:29
Copilot finished reviewing on behalf of hsmatulis December 4, 2025 11:33
Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a new secrets package to prometheus/common, providing a unified and extensible framework for managing secrets in Prometheus configuration files. The implementation supports multiple secret providers (inline and file built-in, with custom provider support) and includes comprehensive observability through Prometheus metrics.

Key Changes

  • New secrets management framework with Field type for configuration structs
  • Provider registry system supporting pluggable secret providers (inline and file providers included)
  • Manager component that discovers, manages, and automatically refreshes secrets with configurable intervals
  • Comprehensive test coverage including unit tests and example tests

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 16 comments.

Summary per file:

  • secrets/field.go: Core Field type with YAML/JSON marshaling and visibility policy support
  • secrets/provider.go: Provider and ProviderConfig interfaces for extensibility
  • secrets/registry.go: Provider registry for registering and retrieving secret providers
  • secrets/internal_providers.go: Built-in inline and file providers with init registration
  • secrets/manager.go: Secret manager with automatic discovery, refresh loops, and Prometheus metrics
  • secrets/resolve.go: Reflection-based field discovery for traversing configuration structs
  • secrets/lifecycle.go: Secret preparation hooks for config preprocessing
  • secrets/util.go: Utility functions for counting non-zero values
  • secrets/*_test.go: Comprehensive test coverage for all components
  • secrets/doc.go: Package documentation
  • secrets/README.md: User-facing documentation with usage examples
  • config/secrets.go: Integration with existing config package visibility policy
  • go.mod: Dependency update moving client_golang from indirect to direct


func NewManager(ctx context.Context, r prometheus.Registerer, providers *ProviderRegistry, config interface{}) (*Manager, error) {
manager := &Manager{
secrets: make(map[string]*managedSecret),
providers: providers,
Copilot AI Dec 4, 2025

The refreshC channel is never initialized in NewManager. This will cause a nil channel panic when triggerRefresh is called (line 272). Add refreshC: make(chan struct{}, 1) to the Manager initialization.

Suggested change
providers: providers,
providers: providers,
refreshC: make(chan struct{}, 1),

})

t.Run("FetchSecret_NotFound", func(t *testing.T) {
badFPC := &FileProviderConfig{Path: filepath.Join(tempDir, "non-existant.txt")}
Copilot AI Dec 4, 2025

Spelling error: "non-existant" should be "non-existent".

Suggested change
badFPC := &FileProviderConfig{Path: filepath.Join(tempDir, "non-existant.txt")}
badFPC := &FileProviderConfig{Path: filepath.Join(tempDir, "non-existent.txt")}

@@ -0,0 +1,79 @@
package secrets
Copilot AI Dec 4, 2025

Missing copyright header. All other files in the package have a copyright header starting with // Copyright 2025 The Prometheus Authors. This file should include the same header for consistency.

@@ -0,0 +1,12 @@
package secrets
Copilot AI Dec 4, 2025

Missing copyright header. All other files in the package have a copyright header starting with // Copyright 2025 The Prometheus Authors. This file should include the same header for consistency.

if err != nil {
return fmt.Errorf("%s: %w", path, err)
}
ctx, cancel := context.WithTimeout(ctx, time.Millisecond)
Copilot AI Dec 4, 2025

The timeout of time.Millisecond for the initial fetch attempt is extremely short and may be insufficient for most providers. This could cause unnecessary failures during initialization. Consider using a more reasonable timeout (e.g., 1-5 seconds) or using fetchTimeout constant defined elsewhere in the file.

Suggested change
ctx, cancel := context.WithTimeout(ctx, time.Millisecond)
ctx, cancel := context.WithTimeout(ctx, fetchTimeout)

val: val,
depth: w.depth + 1,
}

Copilot AI Dec 4, 2025

There's an unnecessary blank line before the closing brace. Consider removing it for consistency with Go code style.

Suggested change

func (h Hydrator) OrFromFile(
secretField **Field, filePath string) error {
if h.manager == nil || h.manager.hydrating == false {
panic("secret field is sealed by the manager; call this from PrepareSecrets")
Copilot AI Dec 4, 2025

The panic message "secret field is sealed by the manager; call this from PrepareSecrets" is unclear. Consider rewording to something like "OrFromFile can only be called during PrepareSecrets; the manager has not yet started hydration" for better clarity.

Suggested change
panic("secret field is sealed by the manager; call this from PrepareSecrets")
panic("OrFromFile can only be called during PrepareSecrets; the manager has not yet started hydration")

m.mtx.RUnlock()

if !ok {
promslog.NewNopLogger()
Copilot AI Dec 4, 2025

The promslog.NewNopLogger() call has no effect since its result is not used or assigned. This appears to be leftover debug code. Consider either removing it or logging a warning message about the secret not being found.

Suggested change
promslog.NewNopLogger()
promslog.Warnf("triggerRefresh: secret with id %s not found; refresh not triggered", s.state.id())


}

func (h Hydrator) OrFromFile(
Copilot AI Dec 4, 2025

The function name OrFromFile is unclear. Consider renaming to something more descriptive like SetFromFieldOrFile or HydrateFromFieldOrFile to better convey that it's setting a field value from either the field itself or a file path.

Suggested change
func (h Hydrator) OrFromFile(
func (h Hydrator) SetFromFieldOrFile(

Comment on lines +148 to +155
m.fetchDuration = promauto.With(r).NewHistogramVec(
prometheus.HistogramOpts{
Name: "prometheus_remote_secret_fetch_duration_seconds",
Help: "Duration of secret fetch attempts.",
Buckets: prometheus.DefBuckets,
},
labels,
)
Copilot AI Dec 4, 2025

The fetchDuration metric is registered but never used in the code to record actual fetch durations. Consider adding timing instrumentation in the fetchAndStoreSecret method to observe the time taken for secret fetches.
