
Move beat receiver component logic to the otel manager #8737


Merged
merged 17 commits into main from chore/otel-manager-components on Jul 14, 2025

Conversation

swiatekm
Contributor

@swiatekm swiatekm commented Jun 30, 2025

NOTE: This is an alternative implementation of #8529, which adds the new functionality to the existing Otel manager instead of creating a new object for it.

What does this PR do?

Move beat receiver component logic to the otel manager.

Conceptually, this logic consists of two tasks:

  • Translating agent configurations into beats receiver configurations for the otel collector.
  • Translating otel collector statuses into component states.

Up until now, the logic for these was haphazardly spread across the agent coordinator. This PR moves all of it into the OtelManager, which can now run both raw otel collector configurations and agent components in a single otel collector instance.
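To make the first of those tasks concrete, here is a minimal sketch of what translating a single agent component into otel collector configuration entries could look like. Everything in it (the Component fields, the metricbeat/<id> receiver key, the overall config shape) is an illustrative assumption rather than the actual elastic-agent types or output:

package main

import "fmt"

// Component is a hypothetical, trimmed-down stand-in for an agent component
// that should run as a beat receiver inside the otel collector.
type Component struct {
	ID         string
	BeatConfig map[string]any // beat-style input configuration
	OutputType string         // e.g. "elasticsearch"
	OutputCfg  map[string]any // output settings for that component
}

// toCollectorConfig turns one component into receivers/exporters/pipelines
// entries of an otel collector configuration. The "metricbeat/<id>" and
// "<output>/<id>" naming is made up for this example.
func toCollectorConfig(c Component) map[string]any {
	receiverID := "metricbeat/" + c.ID
	exporterID := c.OutputType + "/" + c.ID
	return map[string]any{
		"receivers": map[string]any{receiverID: c.BeatConfig},
		"exporters": map[string]any{exporterID: c.OutputCfg},
		"service": map[string]any{
			"pipelines": map[string]any{
				"metrics/" + c.ID: map[string]any{
					"receivers": []string{receiverID},
					"exporters": []string{exporterID},
				},
			},
		},
	}
}

func main() {
	cfg := toCollectorConfig(Component{
		ID:         "system/metrics-default",
		BeatConfig: map[string]any{"metricsets": []string{"cpu", "memory"}},
		OutputType: "elasticsearch",
		OutputCfg:  map[string]any{"hosts": []string{"127.0.0.1:9200"}},
	})
	fmt.Printf("%v\n", cfg)
}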

The OtelManager now encapsulates all the logic involved in interfacing between the agent coordinator and the otel collector. In the near future, it will also take on additional responsibilities, like generating diagnostics for components it runs.

The only new logic this PR introduces lives in the new manager's main loop and has to do with how updates and configurations are moved around. The rest is either existing logic moved to a new location or new tests for that old logic.
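As a rough sketch of how that loop could be structured (all channel, type, and function names below are invented for illustration and are not the real OtelManager API):

package otelmanager

import "context"

// Update carries both kinds of input in a single call: the raw otel collector
// configuration and the agent components to run as beat receivers.
type Update struct {
	CollectorCfg map[string]any
	Components   []Component
}

// Component and ComponentState are hypothetical placeholders for the agent's
// component model and the states reported back to the coordinator.
type Component struct{ ID string }

type ComponentState struct {
	ID     string
	Status string // e.g. "STARTING", "HEALTHY", "DEGRADED"
}

// Run sketches the manager's main loop: it waits for configuration updates
// and collector status changes, re-renders the final collector configuration
// from both sources, and emits component states derived from the collector
// status as a single batch.
func Run(ctx context.Context, updates <-chan Update, collectorStatus <-chan string, states chan<- []ComponentState) {
	var current Update
	for {
		select {
		case <-ctx.Done():
			return
		case current = <-updates:
			// Merge the raw collector config with the configs translated from
			// components and (re)configure the embedded collector with it.
			_ = mergeConfig(current.CollectorCfg, current.Components)
		case s := <-collectorStatus:
			// Translate the collector-level status into per-component states
			// and publish them in one batch.
			states <- translateStatus(s, current.Components)
		}
	}
}

func mergeConfig(raw map[string]any, comps []Component) map[string]any {
	// In the real code this is where component configs would be rendered into
	// receivers, exporters, and pipelines; elided here.
	return raw
}

func translateStatus(status string, comps []Component) []ComponentState {
	out := make([]ComponentState, 0, len(comps))
	for _, c := range comps {
		out = append(out, ComponentState{ID: c.ID, Status: status})
	}
	return out
}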

Why is it important?

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Related issues

@swiatekm swiatekm changed the title from Chore/otel manager components to Move beat receiver component logic to the otel manager on Jun 30, 2025
Contributor

mergify bot commented Jun 30, 2025

This pull request does not have a backport label. Could you fix it @swiatekm? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@swiatekm swiatekm force-pushed the chore/otel-manager-components branch 4 times, most recently from 4302811 to 6ef024f Compare June 30, 2025 14:56
@swiatekm swiatekm force-pushed the chore/otel-manager-components branch from 6ef024f to 25658e3 Compare July 4, 2025 16:24
@swiatekm swiatekm added the enhancement (New feature or request), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team), and backport-8.19 (Automated backport to the 8.19 branch) labels on Jul 4, 2025
@swiatekm swiatekm marked this pull request as ready for review July 4, 2025 16:27
@swiatekm swiatekm requested a review from a team as a code owner July 4, 2025 16:27
@swiatekm swiatekm requested review from blakerouse and kaanyalti July 4, 2025 16:27
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@swiatekm swiatekm force-pushed the chore/otel-manager-components branch 2 times, most recently from 176ece7 to eeb2f75 Compare July 4, 2025 19:35
@leehinman leehinman self-requested a review July 8, 2025 13:05
Contributor

mergify bot commented Jul 9, 2025

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b chore/otel-manager-components upstream/chore/otel-manager-components
git merge upstream/main
git push upstream chore/otel-manager-components

@swiatekm swiatekm force-pushed the chore/otel-manager-components branch from eeb2f75 to 5da6654 Compare July 9, 2025 09:26
Contributor

@pkoutsovasilis pkoutsovasilis left a comment


I left some comments, @swiatekm; take a look and tell me what you think. PS: I am more than happy to have a dedicated sync about that 🙂

@swiatekm swiatekm force-pushed the chore/otel-manager-components branch from 5da6654 to 2edcb08 Compare July 11, 2025 11:05
@swiatekm swiatekm requested a review from pkoutsovasilis July 11, 2025 14:23
@swiatekm swiatekm force-pushed the chore/otel-manager-components branch from 8057361 to 86fc4fc Compare July 11, 2025 14:33
Contributor

@pkoutsovasilis pkoutsovasilis left a comment


thanks for addressing my comments @swiatekm. As you have already mentioned, there are points that we should revisit in the future, but this PR feels to me like a step in the right direction:

  • the final otel config is produced inside the otel manager
  • statuses are fabricated and emitted from the manager. A future improvement here is to investigate whether we could emit only one status and satisfy both the collector and component status needs, which would probably help simplify the code
  • sure, the manager loop got a little bit "heavier", but we can try to improve code readability in the future

So, code-changes-wise, this counts as an improvement. Now that said, I think we have an issue 😄

I compiled and ran this on my machine, and for both execution modes I only see the following reported through elastic-agent status --output full:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: c3189bbf-0e33-4b6b-aab7-80a4a3a0647b
   │  ├─ version: 9.2.0
   │  └─ commit: 86fc4fcea1209c63a3b810f941449b743bc0b440
   └─ system/metrics-default
      ├─ status: (STARTING) Starting: spawned pid '1284'
      ├─ system/metrics-default
      │  ├─ status: (STARTING) Starting: spawned pid '1284'
      │  └─ type: OUTPUT
      └─ system/metrics-default-unique-system-metrics-input
         ├─ status: (STARTING) Starting: spawned pid '1284'
         └─ type: INPUT

which makes me believe that something somewhere in the status channels is getting missed!? But is this also reproducible on your end?


@elasticmachine
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

History

cc @swiatekm

@swiatekm
Contributor Author


Can you post your exact configuration? I tested:

outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    username: "elastic"
    password: "..."

agent:
  logging:
    to_stderr: true
  monitoring:
    enabled: false

inputs:
- data_stream:
    namespace: default
  id: unique-system-metrics-input
  streams:
  - data_stream:
      dataset: system.cpu
    metricsets:
    - cpu
  - data_stream:
      dataset: system.memory
    metricsets:
    - memory
  - data_stream:
      dataset: system.network
    metricsets:
    - network
  - data_stream:
      dataset: system.filesystem
    metricsets:
    - filesystem
  type: system/metrics
  use_output: default
  _runtime_experimental: otel

and got:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: fd4b48da-1a8a-42f3-aca2-321df32b45d7
   │  ├─ version: 9.2.0
   │  └─ commit: 86fc4fcea1209c63a3b810f941449b743bc0b440
   └─ system/metrics-default
      ├─ status: (HEALTHY) HEALTHY
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-unique-system-metrics-input
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT

and if I switch back to process mode for the system/metrics input:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: fd4b48da-1a8a-42f3-aca2-321df32b45d7
   │  ├─ version: 9.2.0
   │  └─ commit: 86fc4fcea1209c63a3b810f941449b743bc0b440
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '517716'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-unique-system-metrics-input
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT

@pkoutsovasilis
Contributor

@swiatekm 👋 this is my config as reported by the inspect sub-command:

agent:
  logging:
    to_stderr: true
  monitoring:
    _runtime_experimental: otel
    enabled: true
inputs:
- data_stream:
    namespace: default
  id: unique-system-metrics-input
  streams:
  - data_stream:
      dataset: system.cpu
    metricsets:
    - cpu
  - data_stream:
      dataset: system.memory
    metricsets:
    - memory
  - data_stream:
      dataset: system.network
    metricsets:
    - network
  - data_stream:
      dataset: system.filesystem
    metricsets:
    - filesystem
  type: system/metrics
  use_output: default
outputs:
  default:
    api_key: <REDACTED>
    hosts:
    - 127.0.0.1:9200
    preset: balanced
    type: elasticsearch

For the subprocess approach, I would expect to see something like this (PS: I get the same for the in-process approach, without the extensions section):

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a degraded state
   ├─ info
   │  ├─ id: f24c5572-89c7-450e-87b8-14cfd8741a83
   │  ├─ version: 9.2.0
   │  └─ commit: 86fc4fcea1209c63a3b810f941449b743bc0b440
   ├─ beat/metrics-monitoring
   │  ├─ status: (DEGRADED) DEGRADED
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (DEGRADED) Elasticsearch request failed: dial tcp 127.0.0.1:9200: connect: connection refused
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (DEGRADED) DEGRADED
   │  ├─ filestream-monitoring
   │  │  ├─ status: (DEGRADED) Elasticsearch request failed: dial tcp 127.0.0.1:9200: connect: connection refused
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (DEGRADED) DEGRADED
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (DEGRADED) Elasticsearch request failed: dial tcp 127.0.0.1:9200: connect: connection refused
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ system/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '67276'
   │  ├─ system/metrics-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ system/metrics-default-unique-system-metrics-input
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ extensions
      ├─ status: StatusOK
      ├─ extension:healthcheckv2/a8aa1840-f3cd-4464-9b0b-68251f8ce626
      │  └─ status: StatusOK

The interesting thing here is that after some minutes (not sure how many), these do appear in the output even with the code from this PR. I need to re-run elastic-agent from main; I have the impression that these appear in the output much faster there.

Contributor

@pkoutsovasilis pkoutsovasilis left a comment


I ran the code of this PR again with both execution modes of the collector (in-process and sub-process), and now all the corresponding statuses appear in time through elastic-agent status --output full, so I am going to resolve this mystery by saying that my previous attempt was on Friday and I probably missed something 🙂

LGTM

@swiatekm swiatekm merged commit 503421f into main Jul 14, 2025
19 checks passed
@swiatekm swiatekm deleted the chore/otel-manager-components branch July 14, 2025 10:51
mergify bot pushed a commit that referenced this pull request Jul 14, 2025
* Add initial otel component manager implementation

* Update coordinator to use the new manager

* Move logging to the coordinator

* Add more tests

* Don't use a real otel manager in tests

* Move the logic to the otel manager

* Ignore the test collector binary

* Rename some dangling attributes back

* Comment out temporarily unused code

* Restore manager e2e test

* Fix import order

* Write synthetic status updates directly into the external channel

* Update collector config and components in one call

* Rename the mutex in the otel manager

* Discard intermediate statuses

* Emit component updates in a single batch

* Undo timeout increase in test

(cherry picked from commit 503421f)

# Conflicts:
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/application/coordinator/coordinator_unit_test.go
swiatekm added a commit that referenced this pull request Jul 14, 2025
…l manager (#8990)

* Move beat receiver component logic to the otel manager (#8737)

* Add initial otel component manager implementation

* Update coordinator to use the new manager

* Move logging to the coordinator

* Add more tests

* Don't use a real otel manager in tests

* Move the logic to the otel manager

* Ignore the test collector binary

* Rename some dangling attributes back

* Comment out temporarily unused code

* Restore manager e2e test

* Fix import order

* Write synthetic status updates directly into the external channel

* Update collector config and components in one call

* Rename the mutex in the otel manager

* Discard intermediate statuses

* Emit component updates in a single batch

* Undo timeout increase in test

(cherry picked from commit 503421f)

# Conflicts:
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/application/coordinator/coordinator_unit_test.go

* Fix conflicts in coordinator.go

* Fix conflicts in coordinator_unit_test.go

---------

Co-authored-by: Mikołaj Świątek <[email protected]>
khushijain21 pushed a commit to khushijain21/elastic-agent that referenced this pull request Jul 16, 2025
* Add initial otel component manager implementation

* Update coordinator to use the new manager

* Move logging to the coordinator

* Add more tests

* Don't use a real otel manager in tests

* Move the logic to the otel manager

* Ignore the test collector binary

* Rename some dangling attributes back

* Comment out temporarily unused code

* Restore manager e2e test

* Fix import order

* Write synthetic status updates directly into the external channel

* Update collector config and components in one call

* Rename the mutex in the otel manager

* Discard intermediate statuses

* Emit component updates in a single batch

* Undo timeout increase in test
Labels
backport-8.19 (Automated backport to the 8.19 branch), enhancement (New feature or request), skip-changelog, Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)