
Implement a prototype that sends a "config expired" message from the core-agent to the rc-client when the config has expired #34329

Draft

dave-fisher-datadog wants to merge 3 commits into base: main

Conversation


@dave-fisher-datadog dave-fisher-datadog commented Feb 21, 2025

What does this PR do?

THIS IS A DRAFT. This is sketching an approach to an OKR for Jira RC-2051.

Motivation

Currently, when the agent loses connection to the backend and the configs expire, the clients have no indication that anything has gone wrong. Different clients want to handle this differently, so this approach makes the expiration explicit.

Describe how you validated your changes

Possible Drawbacks / Trade-offs

Additional Notes

@bits-bot
Collaborator

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions github-actions bot added the team/remote-config and medium review (PR review might take time) labels Feb 21, 2025
Contributor

@ameske ameske left a comment


On the right track! But there are other levels where you have to catch the TUF repository going bad on us, so that we don't just send an error back.

@@ -847,6 +855,7 @@ func (s *CoreAgentService) ClientGetConfigs(_ context.Context, request *pbgo.Cli
Targets: canonicalTargets,
TargetFiles: targetFiles,
ClientConfigs: matchedClientConfigs,
ConfigStatus: s.configStatus,
Contributor


I think the issue today is that you'll never get here. Attempting to access the TUF repository once it has expired causes an error, and we spit an error back out to the tracers as a result.

For example, here in the ClientGetConfigs handler, as we're preparing an update we try to access some TUF data:

directorTargets, err := s.uptane.Targets()
if err != nil {
return nil, err
}

This results in a call here:

func (c *Client) unsafeTargets() (data.TargetFiles, error) {
err := c.verify()
if err != nil {
return nil, err
}
return c.directorTUFClient.Targets()
}
// Targets returns the current targets of this uptane client
func (c *Client) Targets() (data.TargetFiles, error) {
c.Lock()
defer c.Unlock()
return c.unsafeTargets()
}

As part of that second link you can see we attempt to verify the TUF repository. If the TUF repository has become expired, that's where the error occurs.

I think adding a field to send the status makes perfect sense and is a good approach, but you're going to have to catch erroneous TUF states earlier and send a non-error message that lets clients know the status is "expired" or "malformed".
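
To illustrate, a rough sketch of what catching it earlier in ClientGetConfigs could look like. isExpiredErr is a made-up helper name here; the real check would depend on which error type the uptane client surfaces when verification fails because the repository expired:

directorTargets, err := s.uptane.Targets()
if err != nil {
	// Hypothetical helper: decide whether this failure means the local TUF
	// repository has expired, rather than some other transient error.
	if isExpiredErr(err) {
		// Answer with a well-formed, non-error response carrying the status
		// so tracers know the config they hold is stale and can drop it.
		return &pbgo.ClientGetConfigsResponse{
			ConfigStatus: state.ConfigStatusExpired,
		}, nil
	}
	return nil, err
}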


I see! Does this mean, though, that we don't need to catch the bad response from the backend? I.e., it'd be enough to rely on Uptane failure, since a bad response from the backend will simply not update the timestamps file.

Contributor


Correct. If the RC backend is down or sending gibberish, the agent's TUF repo state won't change. If that goes on long enough, it expires. That's all we're trying to catch here, so we can instruct tracers to drop config.

If the backend sends us gibberish that doesn't validate, we don't evict the agent's current repo state; we just drop the update and ignore it.

@@ -682,6 +685,10 @@ func (s *CoreAgentService) refresh() error {
if err != nil {
s.backoffErrorCount = s.backoffPolicy.IncError(s.backoffErrorCount)
s.lastUpdateErr = fmt.Errorf("tuf: %v", err)
var errDecodeFailed tufclient.ErrDecodeFailed
if errors.As(err, &errDecodeFailed) {
s.configStatus = state.ConfigStatusExpired
Contributor


As mentioned in another comment, it's not just on a refresh of state from the backend that this can fail. We can receive a valid update from the backend and then be unable to receive a new one, during which time the current local repo expires. So you'll have to catch that too, in addition to updates being plain bad during an immediate refresh from the RC backend.
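
As a sketch of one way to cover that case: refresh() could also re-verify the repository it already holds, independent of whether the latest backend response decoded. Verify and isExpiredErr below are placeholder names, not existing uptane client API:

// Placeholder sketch: after the update attempt in refresh(), also check
// whether the repository we already hold has gone stale, even when nothing
// bad came back from the backend (e.g. the backend was simply unreachable).
if verifyErr := s.uptane.Verify(); verifyErr != nil && isExpiredErr(verifyErr) {
	s.configStatus = state.ConfigStatusExpired
}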

func (s *CoreAgentService) flushCacheResponse() (*pbgo.ClientGetConfigsResponse, error) {
return &pbgo.ClientGetConfigsResponse{
Roots: nil,
Targets: s.cachedTargets,
Contributor


What's this for?
