Implement a prototype that sends a "config expired" message from the core-agent to the rc-client when the config has expired #34329
base: main
Conversation
On the right track! But there are other places where you have to catch the TUF repository going bad on us, so that we don't just send an error back.
pkg/config/remote/service/service.go
Outdated
@@ -847,6 +855,7 @@ func (s *CoreAgentService) ClientGetConfigs(_ context.Context, request *pbgo.Cli
		Targets:       canonicalTargets,
		TargetFiles:   targetFiles,
		ClientConfigs: matchedClientConfigs,
		ConfigStatus:  s.configStatus,
I think the issue today is that you'll never get here. Attempting to access the TUF repository once it has expired causes an error, and we spit that error back out to the tracers as a result.
For example, here in the ClientGetConfigs handler, as we're preparing an update we try to access some TUF data:
datadog-agent/pkg/config/remote/service/service.go
Lines 817 to 820 in 654a85e
directorTargets, err := s.uptane.Targets()
if err != nil {
	return nil, err
}
This results in a call here:
datadog-agent/pkg/config/remote/uptane/client.go
Lines 310 to 323 in 654a85e
func (c *Client) unsafeTargets() (data.TargetFiles, error) {
	err := c.verify()
	if err != nil {
		return nil, err
	}
	return c.directorTUFClient.Targets()
}

// Targets returns the current targets of this uptane client
func (c *Client) Targets() (data.TargetFiles, error) {
	c.Lock()
	defer c.Unlock()
	return c.unsafeTargets()
}
As part of that second link, you can see we attempt to verify the TUF repository. If the TUF repository has become expired, that's where the error occurs.
I think adding a field to send the status makes perfect sense and is a good approach, but you're going to have to catch erroneous TUF states earlier and send a non-error message that lets them know the status is "expired" or "malformed".
I see! Does this mean, though, that we don't need to catch the bad response from the backend? I.e., it'd be enough to rely on the Uptane failure, since a bad response from the backend will simply not update the timestamps file.
Correct. If the RC backend is down or sending gibberish, the agent's TUF repo state won't change. If that goes on long enough, it expires. That's all we're trying to catch here, so we can instruct tracers to drop config.
If the backend sends us gibberish that doesn't validate we don't evict the agent's current repo state, we just drop the update and ignore it.
pkg/config/remote/service/service.go
Outdated
@@ -682,6 +685,10 @@ func (s *CoreAgentService) refresh() error {
	if err != nil {
		s.backoffErrorCount = s.backoffPolicy.IncError(s.backoffErrorCount)
		s.lastUpdateErr = fmt.Errorf("tuf: %v", err)
		var errDecodeFailed tufclient.ErrDecodeFailed
		if errors.As(err, &errDecodeFailed) {
			s.configStatus = state.ConfigStatusExpired
As mentioned in another comment, it's not just on a refresh of state from the backend that this can fail. We can receive a valid update from the backend, and then be unable to receive a new update, during which the current local repo expires. So you'll have to catch that too in addition to updates just being plain bad during an immediate refresh from the RC backend.
…an expired value from the backend
func (s *CoreAgentService) flushCacheResponse() (*pbgo.ClientGetConfigsResponse, error) {
	return &pbgo.ClientGetConfigsResponse{
		Roots:   nil,
		Targets: s.cachedTargets,
What's this for?
What does this PR do?
THIS IS A DRAFT. This is sketching an approach to an OKR for Jira RC-2051.
Motivation
Currently, when the agent loses connection to the backend and the configs expire, the clients have no indication that anything has gone wrong. Clients have different ideas of what they want to do, so this approach makes what happened explicit.
Describe how you validated your changes
Possible Drawbacks / Trade-offs
Additional Notes