Implement a prototype that sends a "config expired" message from the core-agent to the rc-client when the config has expired #34329
base: main
Conversation
On the right track! But there are other places where you have to catch the TUF repository going bad on us, so that we don't just send an error back.
pkg/config/remote/service/service.go
Outdated
@@ -847,6 +855,7 @@ func (s *CoreAgentService) ClientGetConfigs(_ context.Context, request *pbgo.Cli
		Targets:       canonicalTargets,
		TargetFiles:   targetFiles,
		ClientConfigs: matchedClientConfigs,
		ConfigStatus:  s.configStatus,
I think the issue today is that you'll never get here. Attempting to access the TUF repository once it has expired causes an error, and we spit that error back out to the tracers as a result.
For example, here in the ClientGetConfigs handler, as we're preparing an update we try to access some TUF data:
datadog-agent/pkg/config/remote/service/service.go
Lines 817 to 820 in 654a85e
directorTargets, err := s.uptane.Targets()
if err != nil {
	return nil, err
}
This results in a call here:
datadog-agent/pkg/config/remote/uptane/client.go
Lines 310 to 323 in 654a85e
func (c *Client) unsafeTargets() (data.TargetFiles, error) {
	err := c.verify()
	if err != nil {
		return nil, err
	}
	return c.directorTUFClient.Targets()
}

// Targets returns the current targets of this uptane client
func (c *Client) Targets() (data.TargetFiles, error) {
	c.Lock()
	defer c.Unlock()
	return c.unsafeTargets()
}
As part of that second link, you can see we attempt to verify the TUF repository. If the TUF repository has become expired, that's where the error occurs.
I think adding a field to send the status makes perfect sense and is a good approach, but you're going to have to catch erroneous TUF states earlier and send a non-error message that lets them know the status is "expired" or "malformed".
I see! Does this mean, though, that we don't need to catch the bad response from the backend? I.e., it'd be enough to rely on the Uptane failure, since a bad response from the backend will simply not update the timestamps file.
Correct. If the RC backend is down or sending gibberish, the agent's TUF repo state won't change. If that goes on long enough, it expires. That's all we're trying to catch here, so we can instruct tracers to drop config.
If the backend sends us gibberish that doesn't validate we don't evict the agent's current repo state, we just drop the update and ignore it.
pkg/config/remote/service/service.go
Outdated
@@ -682,6 +685,10 @@ func (s *CoreAgentService) refresh() error {
	if err != nil {
		s.backoffErrorCount = s.backoffPolicy.IncError(s.backoffErrorCount)
		s.lastUpdateErr = fmt.Errorf("tuf: %v", err)
		var errDecodeFailed tufclient.ErrDecodeFailed
		if errors.As(err, &errDecodeFailed) {
			s.configStatus = state.ConfigStatusExpired
As mentioned in another comment, it's not just on a refresh of state from the backend that this can fail. We can receive a valid update from the backend, and then be unable to receive a new update, during which the current local repo expires. So you'll have to catch that too in addition to updates just being plain bad during an immediate refresh from the RC backend.
…an expired value from the backend
func (s *CoreAgentService) flushCacheResponse() (*pbgo.ClientGetConfigsResponse, error) {
	return &pbgo.ClientGetConfigsResponse{
		Roots:   nil,
		Targets: s.cachedTargets,
What's this for?
What does this PR do?
THIS IS A DRAFT. This is sketching an approach to an OKR for Jira RC-2051.
Motivation
Currently, when the agent loses connection to the backend and the configs expire, the clients have no indication that anything has gone wrong. Clients have different ideas of what they want to do, so this approach makes what happened explicit.
Describe how you validated your changes
Possible Drawbacks / Trade-offs
Additional Notes