Move beat receiver component logic to the otel manager #8737
This pull request does not have a backport label. Could you fix it @swiatekm? 🙏
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
This pull request is now in conflicts. Could you fix it? 🙏
I left some comments, @swiatekm; take a look and tell me what you think. PS: I am more than happy to have a dedicated sync about that 🙂
Thanks for addressing my comments, @swiatekm. As you have already mentioned, there are points we should revisit in the future, but this PR feels to me like a step in the right direction:
- the final otel config is produced inside the otel manager
- statuses are fabricated and emitted from the manager. A future improvement here is to investigate whether we could emit only one status that satisfies both the collector and component status needs, which would probably help simplify the code
- sure, the manager loop got a little bit "heavier", but we can try to improve code readability in the future

So, code-changes-wise, this counts as an improvement. That said, I think we have an issue 😄
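The single-status idea in the second bullet could be as simple as a worst-wins reduction over all known statuses. A minimal sketch, assuming a worst-wins ordering; the `Status` type, its levels, and `aggregate` are all hypothetical names, not the actual elastic-agent API:

```go
package main

import "fmt"

// Status is a hypothetical component status level, ordered best to worst.
// The real elastic-agent/collector status types differ.
type Status int

const (
	Healthy Status = iota
	Starting
	Degraded
	Failed
)

func (s Status) String() string {
	return [...]string{"HEALTHY", "STARTING", "DEGRADED", "FAILED"}[s]
}

// aggregate returns the worst status among the inputs, which is one way a
// single emitted status could serve both collector-level and
// component-level consumers.
func aggregate(statuses []Status) Status {
	worst := Healthy
	for _, s := range statuses {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	fmt.Println(aggregate([]Status{Healthy, Starting, Healthy})) // prints STARTING
}
```

The appeal of worst-wins is that any consumer only needs the one value; the trade-off is that per-component detail has to be carried separately.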
I compiled and ran this on my machine, and for both execution modes I see only the following reported through `elastic-agent status --output full`:

```
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: c3189bbf-0e33-4b6b-aab7-80a4a3a0647b
   │  ├─ version: 9.2.0
   │  └─ commit: 86fc4fcea1209c63a3b810f941449b743bc0b440
   └─ system/metrics-default
      ├─ status: (STARTING) Starting: spawned pid '1284'
      ├─ system/metrics-default
      │  ├─ status: (STARTING) Starting: spawned pid '1284'
      │  └─ type: OUTPUT
      └─ system/metrics-default-unique-system-metrics-input
         ├─ status: (STARTING) Starting: spawned pid '1284'
         └─ type: INPUT
```
which makes me believe that a status update is getting lost somewhere in the status channels!? Is this reproducible on your end as well?
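For what it's worth, one classic way a status update goes missing in Go is a non-blocking send on a full channel. This is only a sketch of that failure mode, not a claim about where the bug actually is; `trySend` is a hypothetical helper:

```go
package main

import "fmt"

// trySend drops the update when the channel buffer is full. This keeps the
// sender from ever blocking, but silently loses intermediate statuses.
func trySend(ch chan string, update string) bool {
	select {
	case ch <- update:
		return true
	default:
		return false // buffer full: update silently dropped
	}
}

func main() {
	ch := make(chan string, 1)
	fmt.Println(trySend(ch, "STARTING")) // true: buffer had room
	fmt.Println(trySend(ch, "HEALTHY"))  // false: buffer full, update lost
}
```

If the receiver only reads the channel occasionally, the last meaningful status can be the one that gets dropped, which would look exactly like a component stuck in STARTING.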
💛 Build succeeded, but was flaky
cc @swiatekm
Can you post your exact configuration? I tested:

```yaml
outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    username: "elastic"
    password: "..."
agent:
  logging:
    to_stderr: true
  monitoring:
    enabled: false
inputs:
  - data_stream:
      namespace: default
    id: unique-system-metrics-input
    streams:
      - data_stream:
          dataset: system.cpu
        metricsets:
          - cpu
      - data_stream:
          dataset: system.memory
        metricsets:
          - memory
      - data_stream:
          dataset: system.network
        metricsets:
          - network
      - data_stream:
          dataset: system.filesystem
        metricsets:
          - filesystem
    type: system/metrics
    use_output: default
    _runtime_experimental: otel
```

and got:
and if I switch back to process mode for the system/metrics input:
@swiatekm 👋 this is my config as reported from the inspect sub-command
for the subprocess approach I would expect to see something like this (PS: I get the same for the in-process approach without the
The interesting thing here is that after some minutes (not sure how many) these do appear in the output, even with the code from this PR. I need to re-run elastic-agent from
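A periodic resync would explain exactly this delayed appearance: a dropped update never arrives on its own, but the next resync re-emits the last known state, so statuses show up minutes late rather than never. A toy illustration of that pattern (all names hypothetical, not the agent's actual code):

```go
package main

import "fmt"

// resyncer remembers the last state it was told about and can re-emit it
// later, even if the original delivery was dropped.
type resyncer struct {
	last      string
	delivered []string
}

// onUpdate records the latest state; dropped simulates a lost delivery.
func (r *resyncer) onUpdate(s string, dropped bool) {
	r.last = s
	if !dropped {
		r.delivered = append(r.delivered, s)
	}
}

// tick would run on a timer in a real system, re-emitting the last state.
func (r *resyncer) tick() {
	r.delivered = append(r.delivered, r.last)
}

func main() {
	r := &resyncer{}
	r.onUpdate("STARTING", false)
	r.onUpdate("HEALTHY", true) // this update was dropped
	fmt.Println(r.delivered)    // [STARTING]
	r.tick()                    // periodic resync re-emits last known state
	fmt.Println(r.delivered)    // [STARTING HEALTHY]
}
```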
I ran the code of this PR again with both execution modes of the collector (in-process and sub-process), and now all the corresponding statuses appear in time through `elastic-agent status --output full`, so I am going to unravel this mystery by saying that my previous attempt was on Friday and I probably missed something 🙂

LGTM
* Add initial otel component manager implementation
* Update coordinator to use the new manager
* Move logging to the coordinator
* Add more tests
* Don't use a real otel manager in tests
* Move the logic to the otel manager
* Ignore the test collector binary
* Rename some dangling attributes back
* Comment out temporarily unused code
* Restore manager e2e test
* Fix import order
* Write synthetic status updates directly into the external channel
* Update collector config and components in one call
* Rename the mutex in the otel manager
* Discard intermediate statuses
* Emit component updates in a single batch
* Undo timeout increase in test

(cherry picked from commit 503421f)

Conflicts:
- internal/pkg/agent/application/coordinator/coordinator.go
- internal/pkg/agent/application/coordinator/coordinator_unit_test.go
…l manager (#8990)

* Move beat receiver component logic to the otel manager (#8737) (cherry picked from commit 503421f)
* Fix conflicts in coordinator.go
* Fix conflicts in coordinator_unit_test.go

Co-authored-by: Mikołaj Świątek <[email protected]>
NOTE: This is an alternative implementation of #8529, which adds the new functionality to the existing Otel manager, instead of creating a new object for it.
What does this PR do?
Move beat receiver component logic to the otel manager.
This logic conceptually consists of two tasks:
- producing the final otel collector configuration from agent component configurations
- translating otel collector statuses back into component status updates
Up until now, the logic for these was haphazardly spread across the agent coordinator. This PR moves all of it into the OtelManager, which can now run both raw otel collector configurations and agent components in a single otel collector instance.
The OtelManager now encapsulates all the logic involved in interfacing between the agent coordinator and the otel collector. In the near future, it will also take on additional responsibilities, like generating diagnostics for components it runs.
The only new logic this PR introduces lives in the new manager's main loop and concerns how updates and configurations are moved around. The rest is either existing logic moved to a new location, or new tests for that old logic.
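The loop shape described above can be sketched as a single select that owns both the configuration and status plumbing. Every name below (`runLoop`, `update`, the channel layout) is illustrative, not the real OtelManager API:

```go
package main

import "fmt"

// update bundles the current config and status so consumers receive them
// together, in a single batch, as described in the PR.
type update struct {
	cfg    string
	status string
}

// runLoop is a sketch of a manager main loop: one goroutine receives new
// configurations and collector status updates, and emits merged updates on
// a single output channel.
func runLoop(cfgCh, statusCh <-chan string, out chan<- update, done <-chan struct{}) {
	var current update
	for {
		select {
		case c := <-cfgCh:
			current.cfg = c
			out <- current // config and latest status emitted together
		case s := <-statusCh:
			current.status = s
			out <- current
		case <-done:
			return
		}
	}
}

func main() {
	cfgCh := make(chan string)
	statusCh := make(chan string)
	out := make(chan update)
	done := make(chan struct{})
	go runLoop(cfgCh, statusCh, out, done)

	cfgCh <- "receivers: ..."
	fmt.Println((<-out).cfg) // prints receivers: ...
	statusCh <- "HEALTHY"
	fmt.Println((<-out).status) // prints HEALTHY
	close(done)
}
```

Keeping all state mutation inside one goroutine is what lets the manager drop the mutex-heavy coordination the coordinator previously needed, at the cost of the loop itself getting a bit heavier.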
Why is it important?
Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding change to the default configuration files
- [ ] I have added an entry in `./changelog/fragments` using the changelog tool
- [ ] I have added an integration test or an E2E test

Related issues