
Move beat receiver component logic to the otel manager #8737


Merged
merged 17 commits into main from chore/otel-manager-components on Jul 14, 2025

Conversation

swiatekm
Contributor

@swiatekm swiatekm commented Jun 30, 2025

NOTE: This is an alternative implementation of #8529, which adds the new functionality to the existing Otel manager instead of creating a new object for it.

What does this PR do?

Move beat receiver component logic to the otel manager.

Conceptually, this logic consists of two tasks:

  • Translating agent configurations into beats receiver configurations for the otel collector.
  • Translating otel collector statuses into component states.

Up until now, the logic for these was haphazardly spread across the agent coordinator. This PR moves all of it into the OtelManager, which can now run both raw otel collector configurations and agent components in a single otel collector instance.
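To make the first of those tasks concrete, here is a minimal sketch of what translating a single agent component into otel collector configuration entries could look like. Everything in it (the Component fields, the metricbeat/<id> receiver key, the overall config shape) is an illustrative assumption rather than the actual elastic-agent types or output:

package main

import "fmt"

// Component is a hypothetical, trimmed-down stand-in for an agent component
// that should run as a beat receiver inside the otel collector.
type Component struct {
	ID         string
	BeatConfig map[string]any // beat-style input configuration
	OutputType string         // e.g. "elasticsearch"
	OutputCfg  map[string]any // output settings for that component
}

// toCollectorConfig turns one component into receivers/exporters/pipelines
// entries of an otel collector configuration. The "metricbeat/<id>" and
// "<output>/<id>" naming is made up for this example.
func toCollectorConfig(c Component) map[string]any {
	receiverID := "metricbeat/" + c.ID
	exporterID := c.OutputType + "/" + c.ID
	return map[string]any{
		"receivers": map[string]any{receiverID: c.BeatConfig},
		"exporters": map[string]any{exporterID: c.OutputCfg},
		"service": map[string]any{
			"pipelines": map[string]any{
				"metrics/" + c.ID: map[string]any{
					"receivers": []string{receiverID},
					"exporters": []string{exporterID},
				},
			},
		},
	}
}

func main() {
	cfg := toCollectorConfig(Component{
		ID:         "system/metrics-default",
		BeatConfig: map[string]any{"metricsets": []string{"cpu", "memory"}},
		OutputType: "elasticsearch",
		OutputCfg:  map[string]any{"hosts": []string{"127.0.0.1:9200"}},
	})
	fmt.Printf("%v\n", cfg)
}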

The OtelManager now encapsulates all the logic involved in interfacing between the agent coordinator and the otel collector. In the near future, it will also take on additional responsibilities, like generating diagnostics for components it runs.

The only new logic this PR introduces lives in the new manager's main loop and has to do with how updates and configurations are moved around. The rest is either existing logic moved to a new location or new tests for that old logic.
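As a rough sketch of how that loop could be structured (all channel, type, and function names below are invented for illustration and are not the real OtelManager API):

package otelmanager

import "context"

// Update carries both kinds of input in a single call: the raw otel collector
// configuration and the agent components to run as beat receivers.
type Update struct {
	CollectorCfg map[string]any
	Components   []Component
}

// Component and ComponentState are hypothetical placeholders for the agent's
// component model and the states reported back to the coordinator.
type Component struct{ ID string }

type ComponentState struct {
	ID     string
	Status string // e.g. "STARTING", "HEALTHY", "DEGRADED"
}

// Run sketches the manager's main loop: it waits for configuration updates
// and collector status changes, re-renders the final collector configuration
// from both sources, and emits component states derived from the collector
// status as a single batch.
func Run(ctx context.Context, updates <-chan Update, collectorStatus <-chan string, states chan<- []ComponentState) {
	var current Update
	for {
		select {
		case <-ctx.Done():
			return
		case current = <-updates:
			// Merge the raw collector config with the configs translated from
			// components and (re)configure the embedded collector with it.
			_ = mergeConfig(current.CollectorCfg, current.Components)
		case s := <-collectorStatus:
			// Translate the collector-level status into per-component states
			// and publish them in one batch.
			states <- translateStatus(s, current.Components)
		}
	}
}

func mergeConfig(raw map[string]any, comps []Component) map[string]any {
	// In the real code this is where component configs would be rendered into
	// receivers, exporters, and pipelines; elided here.
	return raw
}

func translateStatus(status string, comps []Component) []ComponentState {
	out := make([]ComponentState, 0, len(comps))
	for _, c := range comps {
		out = append(out, ComponentState{ID: c.ID, Status: status})
	}
	return out
}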

Why is it important?

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Related issues

@swiatekm swiatekm changed the title from Chore/otel manager components to Move beat receiver component logic to the otel manager on Jun 30, 2025
Contributor

mergify bot commented Jun 30, 2025

This pull request does not have a backport label. Could you fix it @swiatekm? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@swiatekm swiatekm force-pushed the chore/otel-manager-components branch 4 times, most recently from 4302811 to 6ef024f Compare June 30, 2025 14:56
@swiatekm swiatekm force-pushed the chore/otel-manager-components branch from 6ef024f to 25658e3 Compare July 4, 2025 16:24
@swiatekm swiatekm added the enhancement (New feature or request), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team), and backport-8.19 (Automated backport to the 8.19 branch) labels on Jul 4, 2025
@swiatekm swiatekm marked this pull request as ready for review July 4, 2025 16:27
@swiatekm swiatekm requested a review from a team as a code owner July 4, 2025 16:27
@swiatekm swiatekm requested review from blakerouse and kaanyalti July 4, 2025 16:27
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@swiatekm swiatekm force-pushed the chore/otel-manager-components branch 2 times, most recently from 176ece7 to eeb2f75 Compare July 4, 2025 19:35
@leehinman leehinman self-requested a review July 8, 2025 13:05
Contributor

mergify bot commented Jul 9, 2025

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b chore/otel-manager-components upstream/chore/otel-manager-components
git merge upstream/main
git push upstream chore/otel-manager-components

@swiatekm swiatekm force-pushed the chore/otel-manager-components branch from eeb2f75 to 5da6654 Compare July 9, 2025 09:26
Contributor

@pkoutsovasilis pkoutsovasilis left a comment


I left some comments, @swiatekm; take a look and tell me what you think. PS: I am more than happy to have a dedicated sync about that 🙂

@swiatekm swiatekm force-pushed the chore/otel-manager-components branch from 5da6654 to 2edcb08 Compare July 11, 2025 11:05
@swiatekm swiatekm requested a review from pkoutsovasilis July 11, 2025 14:23
@swiatekm swiatekm force-pushed the chore/otel-manager-components branch from 8057361 to 86fc4fc Compare July 11, 2025 14:33
Contributor

@pkoutsovasilis pkoutsovasilis left a comment


thanks for addressing my comments @swiatekm. As you have already mentioned, there are points that we should revisit in the future, but this PR feels to me like a step in the right direction:

  • the final otel config is produced inside the otel manager
  • statuses are fabricated and emitted from the manager. A future improvement here is to investigate whether we could emit only one status and satisfy both the collector and component status needs, which would probably help simplify the code
  • sure, the manager loop got a little bit "heavier", but we can try to improve code readability in the future

So, code-changes-wise, this counts as an improvement. Now that said, I think we have an issue 😄

I compiled and ran this on my machine, and for both execution modes I only see the following reported through elastic-agent status --output full:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: c3189bbf-0e33-4b6b-aab7-80a4a3a0647b
   │  ├─ version: 9.2.0
   │  └─ commit: 86fc4fcea1209c63a3b810f941449b743bc0b440
   └─ system/metrics-default
      ├─ status: (STARTING) Starting: spawned pid '1284'
      ├─ system/metrics-default
      │  ├─ status: (STARTING) Starting: spawned pid '1284'
      │  └─ type: OUTPUT
      └─ system/metrics-default-unique-system-metrics-input
         ├─ status: (STARTING) Starting: spawned pid '1284'
         └─ type: INPUT

which makes me believe that something somewhere in the status channels is getting missed!? But is this also reproducible on your end?


@elasticmachine
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

History

cc @swiatekm

@swiatekm
Contributor Author


Can you post your exact configuration? I tested:

outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    username: "elastic"
    password: "..."

agent:
  logging:
    to_stderr: true
  monitoring:
    enabled: false

inputs:
- data_stream:
    namespace: default
  id: unique-system-metrics-input
  streams:
  - data_stream:
      dataset: system.cpu
    metricsets:
    - cpu
  - data_stream:
      dataset: system.memory
    metricsets:
    - memory
  - data_stream:
      dataset: system.network
    metricsets:
    - network
  - data_stream:
      dataset: system.filesystem
    metricsets:
    - filesystem
  type: system/metrics
  use_output: default
  _runtime_experimental: otel

and got:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: fd4b48da-1a8a-42f3-aca2-321df32b45d7
   │  ├─ version: 9.2.0
   │  └─ commit: 86fc4fcea1209c63a3b810f941449b743bc0b440
   └─ system/metrics-default
      ├─ status: (HEALTHY) HEALTHY
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-unique-system-metrics-input
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT

and if I switch back to process mode for the system/metrics input:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: fd4b48da-1a8a-42f3-aca2-321df32b45d7
   │  ├─ version: 9.2.0
   │  └─ commit: 86fc4fcea1209c63a3b810f941449b743bc0b440
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '517716'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-unique-system-metrics-input
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT

@pkoutsovasilis
Contributor

@swiatekm 👋 this is my config as reported by the inspect sub-command:

agent:
  logging:
    to_stderr: true
  monitoring:
    _runtime_experimental: otel
    enabled: true
inputs:
- data_stream:
    namespace: default
  id: unique-system-metrics-input
  streams:
  - data_stream:
      dataset: system.cpu
    metricsets:
    - cpu
  - data_stream:
      dataset: system.memory
    metricsets:
    - memory
  - data_stream:
      dataset: system.network
    metricsets:
    - network
  - data_stream:
      dataset: system.filesystem
    metricsets:
    - filesystem
  type: system/metrics
  use_output: default
outputs:
  default:
    api_key: <REDACTED>
    hosts:
    - 127.0.0.1:9200
    preset: balanced
    type: elasticsearch

For the subprocess approach, I would expect to see something like this (PS: I get the same for the in-process approach, without the extensions section):

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a degraded state
   ├─ info
   │  ├─ id: f24c5572-89c7-450e-87b8-14cfd8741a83
   │  ├─ version: 9.2.0
   │  └─ commit: 86fc4fcea1209c63a3b810f941449b743bc0b440
   ├─ beat/metrics-monitoring
   │  ├─ status: (DEGRADED) DEGRADED
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (DEGRADED) Elasticsearch request failed: dial tcp 127.0.0.1:9200: connect: connection refused
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (DEGRADED) DEGRADED
   │  ├─ filestream-monitoring
   │  │  ├─ status: (DEGRADED) Elasticsearch request failed: dial tcp 127.0.0.1:9200: connect: connection refused
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (DEGRADED) DEGRADED
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (DEGRADED) Elasticsearch request failed: dial tcp 127.0.0.1:9200: connect: connection refused
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ system/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '67276'
   │  ├─ system/metrics-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ system/metrics-default-unique-system-metrics-input
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ extensions
      ├─ status: StatusOK
      ├─ extension:healthcheckv2/a8aa1840-f3cd-4464-9b0b-68251f8ce626
      │  └─ status: StatusOK

The interesting thing here is that after some minutes (not sure how many), these do appear in the output even with the code from this PR. I need to re-run elastic-agent from main; I have the impression that these appear in the output much faster there.

Contributor

@pkoutsovasilis pkoutsovasilis left a comment


I ran the code of this PR again with both execution modes of the collector (in-process and sub-process), and now all the corresponding statuses appear in time through elastic-agent status --output full, so I am going to resolve this mystery by saying that my previous attempt was on Friday and I probably missed something 🙂

LGTM

@swiatekm swiatekm merged commit 503421f into main Jul 14, 2025
19 checks passed
@swiatekm swiatekm deleted the chore/otel-manager-components branch July 14, 2025 10:51
mergify bot pushed a commit that referenced this pull request Jul 14, 2025
* Add initial otel component manager implementation

* Update coordinator to use the new manager

* Move logging to the coordinator

* Add more tests

* Don't use a real otel manager in tests

* Move the logic to the otel manager

* Ignore the test collector binary

* Rename some dangling attributes back

* Comment out temporarily unused code

* Restore manager e2e test

* Fix import order

* Write synthetic status updates directly into the external channel

* Update collector config and components in one call

* Rename the mutex in the otel manager

* Discard intermediate statuses

* Emit component updates in a single batch

* Undo timeout increase in test

(cherry picked from commit 503421f)

# Conflicts:
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/application/coordinator/coordinator_unit_test.go
swiatekm added a commit that referenced this pull request Jul 14, 2025
…l manager (#8990)

* Move beat receiver component logic to the otel manager (#8737)

* Add initial otel component manager implementation

* Update coordinator to use the new manager

* Move logging to the coordinator

* Add more tests

* Don't use a real otel manager in tests

* Move the logic to the otel manager

* Ignore the test collector binary

* Rename some dangling attributes back

* Comment out temporarily unused code

* Restore manager e2e test

* Fix import order

* Write synthetic status updates directly into the external channel

* Update collector config and components in one call

* Rename the mutex in the otel manager

* Discard intermediate statuses

* Emit component updates in a single batch

* Undo timeout increase in test

(cherry picked from commit 503421f)

# Conflicts:
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/application/coordinator/coordinator_unit_test.go

* Fix conflicts in coordinator.go

* Fix conflicts in coordinator_unit_test.go

---------

Co-authored-by: Mikołaj Świątek <[email protected]>
khushijain21 pushed a commit to khushijain21/elastic-agent that referenced this pull request Jul 16, 2025
* Add initial otel component manager implementation

* Update coordinator to use the new manager

* Move logging to the coordinator

* Add more tests

* Don't use a real otel manager in tests

* Move the logic to the otel manager

* Ignore the test collector binary

* Rename some dangling attributes back

* Comment out temporarily unused code

* Restore manager e2e test

* Fix import order

* Write synthetic status updates directly into the external channel

* Update collector config and components in one call

* Rename the mutex in the otel manager

* Discard intermediate statuses

* Emit component updates in a single batch

* Undo timeout increase in test
Labels
backport-8.19 (Automated backport to the 8.19 branch), enhancement (New feature or request), skip-changelog, Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)