[BUG] RDS Aurora autodiscovery doesn't work when using Docker labels to configure it #29717

Open
public opened this issue Oct 2, 2024 · 1 comment


public commented Oct 2, 2024

Agent Environment
Version: 7.57.1

Running in an ECS container on EC2.

Describe what happened:

The documentation says that autodiscovery works with Aurora based on AWS tags, and that it can be configured using Docker labels.

https://docs.datadoghq.com/database_monitoring/guide/aurora_autodiscovery/?tab=postgres
https://docs.datadoghq.com/containers/docker/integrations/?tab=labels

With this setup the agent reports no errors, but it also performs no autodiscovery and never connects to the databases.

Here is my label configuration:

com.datadoghq.ad.checks: '{
  "postgres":
    {
      "ad_identifiers": [
        "_dbm_postgres_aurora"
      ],
      "instances": [
        {
          "host": "%%host%%",
          "port": "%%port%%",
          "username": "datadog",
          "dbm": true,
          "aws": {
            "instance_endpoint": "%%host%%",
            "region": "%%extra_region%%",
            "managed_authentication": {
              "enabled": "%%extra_managed_authentication_enabled%%"
            }
          },
          "tags": [
            "dbclusteridentifier:%%extra_dbclusteridentifier%%",
            "region:%%extra_region%%"
          ]
        }
      ]
    }
  }'

Disabling managed_authentication has no effect.
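
As a quick sanity check that the label value itself is well-formed, the JSON can be parsed locally and its `%%...%%` template variables listed. This is only a sketch: `LABEL_VALUE` is a copy of the label above, `template_variables` is a hypothetical helper, and it says nothing about how the Agent actually resolves these variables.

```python
import json
import re

# Copy of the com.datadoghq.ad.checks label value shown above.
LABEL_VALUE = """
{
  "postgres": {
    "ad_identifiers": ["_dbm_postgres_aurora"],
    "instances": [
      {
        "host": "%%host%%",
        "port": "%%port%%",
        "username": "datadog",
        "dbm": true,
        "aws": {
          "instance_endpoint": "%%host%%",
          "region": "%%extra_region%%",
          "managed_authentication": {
            "enabled": "%%extra_managed_authentication_enabled%%"
          }
        },
        "tags": [
          "dbclusteridentifier:%%extra_dbclusteridentifier%%",
          "region:%%extra_region%%"
        ]
      }
    ]
  }
}
"""


def template_variables(label_value: str) -> list[str]:
    """Hypothetical helper: validate the JSON and list its %%var%% names."""
    json.loads(label_value)  # raises json.JSONDecodeError if the label is malformed
    return sorted(set(re.findall(r"%%([a-z_]+)%%", label_value)))


if __name__ == "__main__":
    for name in template_variables(LABEL_VALUE):
        print(name)
```

If this parses cleanly, a syntax problem in the label is unlikely to be the reason nothing is discovered.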

Describe what you expected:

I expected to be able to provide the autodiscovery information via Docker labels and that it would discover my tagged AWS Aurora clusters automatically and try to connect to them.

Steps to reproduce the issue:

Follow the documentation.

Additional environment details (Operating System, Cloud provider, etc):

The agent is running inside an ECS container. This works fine for normal monitoring, and database monitoring configured via Docker labels also works without autodiscovery: if I manually specify every DB to connect to, it connects and reports metrics just fine.


viktorvsk-dualentry commented Jan 10, 2025

I think I have a similar issue, with a few more details that may help.
I'm running an Aurora cluster, and my datadog-agent is deployed as a sidecar container in ECS (Fargate).
If I set host to the actual hostname, everything works. If I use %%host%%, it doesn't.
(The Aurora cluster has the scrape tag.)

I decided to launch datadog-agent in Docker from my local machine,
which is part of my VPC and has access to Aurora, ECS, etc.
I used this command (note that the same command with the actual hostname works as expected):

docker run -e "DD_API_KEY=${DD_API_KEY}" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -l com.datadoghq.ad.check_names='["postgres"]' \
  -l com.datadoghq.ad.init_configs='[{}]' \
  -l com.datadoghq.ad.instances='[{
    "dbm": true,
    "host": "%%host%%",
    "port": 5432,
    "username": "datadog",
    "password": "datadog",
    "ignore_databases": ["template0", "template1", "postgres"]
  }]' \
  --platform linux/amd64 \
  gcr.io/datadoghq/agent:latest

But I get this error:

2025-01-10 12:21:36 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:71 in Error) | check:postgres | Error running check: [{"message":"connection to server at \"172.17.0.2\", port 5432 failed: Connection refused
\tIs the server running on that host and accepting TCP/IP connections?
","traceback":"Traceback (most recent call last):
  File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/datadog_checks/base/checks/base.py\", line 1276, in run
    initialization()
  File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/datadog_checks/postgres/postgres.py\", line 854, in _connect
    with self.db() as conn:
         ^^^^^^^^^
  File \"/opt/datadog-agent/embedded/lib/python3.12/contextlib.py\", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/datadog_checks/postgres/postgres.py\", line 226, in db
    self._db = self._new_connection(self._config.dbname)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/datadog_checks/postgres/postgres.py\", line 839, in _new_connection
    conn = psycopg2.connect(**args)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/psycopg2/__init__.py\", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           psycopg2.OperationalError: connection to server at \"172.17.0.2\", port 5432 failed: Connection refused
\tIs the server running on that host and accepting TCP/IP connections?

"}]

Notice this line: psycopg2.OperationalError: connection to server at \"172.17.0.2\", port 5432 failed: Connection refused\n\tIs the server running on that host and accepting TCP/IP connections?

It looks like %%host%% resolves to a Docker bridge IP rather than the Aurora endpoint.
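
That guess is easy to check with a few lines of Python. The sketch below assumes the local Docker daemon uses its default bridge subnet, 172.17.0.0/16 (it can be reconfigured); `in_default_bridge` is a hypothetical helper name.

```python
import ipaddress

# Assumption: Docker's default bridge network (docker0) is 172.17.0.0/16.
# A daemon configured with a different bridge subnet would need another value here.
DEFAULT_DOCKER_BRIDGE = ipaddress.ip_network("172.17.0.0/16")


def in_default_bridge(address: str) -> bool:
    """Return True if the address lies inside the assumed default bridge subnet."""
    return ipaddress.ip_address(address) in DEFAULT_DOCKER_BRIDGE


if __name__ == "__main__":
    # Address taken from the error message above.
    print(in_default_bridge("172.17.0.2"))
```

The address from the error falls inside that range, which is consistent with a container address on the local Docker host and not with anything in the VPC.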
Somewhere in the docs I found that %%hostname%% should be used in awsvpc environments, so I changed the docker run command to:

docker run -e "DD_API_KEY=${DD_API_KEY}" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -l com.datadoghq.ad.check_names='["postgres"]' \
  -l com.datadoghq.ad.init_configs='[{}]' \
  -l com.datadoghq.ad.instances='[{
    "dbm": true,
    "host": "%%hostname%%",
    "port": 5432,
    "username": "datadog",
    "password": "datadog",
    "ignore_databases": ["template0", "template1", "postgres"]
  }]' \
  --platform linux/amd64 \
  gcr.io/datadoghq/agent:latest

And I get this:

2025-01-10 12:17:21 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:71 in Error) | check:postgres | Error running check: [{"message":"connection to server at \"098e03a7c8ff\" (172.17.0.2), port 5432 failed: Connection refused
\tIs the server running on that host and accepting TCP/IP connections?
","traceback":"Traceback (most recent call last):
  File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/datadog_checks/base/checks/base.py\", line 1276, in run
    initialization()
  File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/datadog_checks/postgres/postgres.py\", line 854, in _connect
    with self.db() as conn:
         ^^^^^^^^^
  File \"/opt/datadog-agent/embedded/lib/python3.12/contextlib.py\", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/datadog_checks/postgres/postgres.py\", line 226, in db
    self._db = self._new_connection(self._config.dbname)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/datadog_checks/postgres/postgres.py\", line 839, in _new_connection
    conn = psycopg2.connect(**args)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/psycopg2/__init__.py\", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\npsycopg2.OperationalError: connection to server at \"098e03a7c8ff\" (172.17.0.2), port 5432 failed: Connection refused\n\tIs the server running on that host and accepting TCP/IP connections?\n\n"}]

Notice this line: psycopg2.OperationalError: connection to server at \"098e03a7c8ff\" (172.17.0.2), port 5432 failed: Connection refused\n\tIs the server running on that host and accepting TCP/IP connections?

So it seems some hostname was found, but the wrong one.
I've tried many different combinations and datadog-agent versions from 7.36.1 to latest (some versions gave a different error, but I don't remember which versions or what errors), so I assume the problem is most likely on the datadog-agent side.
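
For what it's worth, the hostname in the second error has the shape of a Docker short container ID (12 hex characters), which Docker uses as a container's hostname by default. Here is a small sketch that pulls the target out of both error messages and flags that pattern. The `ERRORS` strings are the two messages above with the JSON escaping removed, and `describe` is a hypothetical helper. A match only suggests a container ID, it does not prove it.

```python
import re

# The two error messages from above, with the \" escaping removed.
ERRORS = [
    'connection to server at "172.17.0.2", port 5432 failed: Connection refused',
    'connection to server at "098e03a7c8ff" (172.17.0.2), port 5432 failed: Connection refused',
]

# Captures the quoted host, an optional resolved IP in parentheses, and the port.
TARGET = re.compile(r'connection to server at "([^"]+)"(?: \(([^)]+)\))?, port (\d+)')
# Shape of a Docker short container ID: exactly 12 lowercase hex characters.
SHORT_ID = re.compile(r"[0-9a-f]{12}")


def describe(error: str) -> dict:
    """Hypothetical helper: extract the connection target from a libpq error."""
    match = TARGET.search(error)
    if match is None:
        raise ValueError("no connection target found in error message")
    host, resolved_ip, port = match.groups()
    return {
        "host": host,
        "resolved_ip": resolved_ip,
        "port": int(port),
        "looks_like_container_id": SHORT_ID.fullmatch(host) is not None,
    }


if __name__ == "__main__":
    for error in ERRORS:
        print(describe(error))
```

In both runs the check ends up pointed at the same 172.17.0.2 address, once directly and once through a name that resolves to it.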
