Setup monitoring of ECS nodes #104

Open
2 tasks
hellais opened this issue Sep 25, 2024 · 3 comments
Labels: epic (a large user story that needs to be broken down), funder/otffoss2025, priority/medium (normal priority issue)

Comments

Member

hellais commented Sep 25, 2024

Currently we have no observability into the container hosts of the ECS cluster. Moreover, we can only scrape aggregate metrics from the services behind the load balancer, which means the metrics end up "flapping".
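One way to picture the flapping is the toy simulation below. It is only a sketch under an assumption the issue does not spell out: that the balancer rotates successive scrapes across containers, each of which keeps its own counters. The `Container` class and the function names are hypothetical and exist only for illustration.

```python
from itertools import cycle


class Container:
    """Stand-in for one service container that keeps its own request counter."""

    def __init__(self, name, requests_total):
        self.name = name
        self.requests_total = requests_total

    def metrics(self):
        return {"instance": self.name, "requests_total": self.requests_total}


def scrape_through_balancer(containers, scrapes):
    """Scrape one endpoint that round-robins over the containers (assumed behavior)."""
    backend = cycle(containers)
    return [next(backend).metrics()["requests_total"] for _ in range(scrapes)]


def scrape_each_container(containers):
    """Scrape every container directly, giving one stable series per instance."""
    return {c.name: c.metrics()["requests_total"] for c in containers}


containers = [Container("task-a", 100), Container("task-b", 250)]

# Through the balancer, a single series alternates between the two counters.
print(scrape_through_balancer(containers, 4))
# Scraped directly, each container has its own consistent value.
print(scrape_each_container(containers))
```

Scraping each container directly keeps the series separate, which is the behavior the rest of this issue works toward.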

Ideally we would have a way of scraping metrics for the container hosts, and also for the per-service Docker containers.

In summary we would like to collect two classes of metrics:

  • Container host metrics (the EC2 nodes that run Docker and that we deploy Docker containers to), using node_exporter
  • Docker container application metrics, which are exposed using the instrumentator and which we would like to scrape independently on each container host
hellais self-assigned this Sep 25, 2024
hellais added the priority/medium label Sep 25, 2024
DecFox self-assigned this Oct 7, 2024
hellais added the epic label Dec 9, 2024
hellais added this to Roadmap Jan 7, 2025
hellais moved this to Backlog in Roadmap Jan 13, 2025
hellais assigned LDiazN and unassigned hellais and DecFox Jan 22, 2025
Contributor

LDiazN commented Jan 23, 2025

I think this might be the way: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#ec2_sd_config

The problem I'm seeing right now is that the monitoring server is not in AWS, so we have some issues with the connection between that server and the EC2 instances:

  1. We have to set up IAM credentials for the server, as mentioned in the link above.
  2. The Prometheus server needs a way to reach the EC2 instances (not the load balancer), but they're probably not open to internet traffic (and I don't think they should be). What can we do about this?

Contributor

LDiazN commented Jan 29, 2025

To solve 2), I talked with @hellais, and we had the idea of using nginx as a proxy. Prometheus sends its scrape requests to the nginx proxy, which decides which host to forward each request to based on the request path.

For example, if you want to request metrics from the node foo:

  1. The Prometheus server would send the request: GET nginx.proxy.com/foo/metrics/
  2. nginx would then perform GET foo/metrics using proxy_pass and forward the response back to the Prometheus server

After testing this with a docker compose with nginx and two go servers, I think it's possible using the following nginx configuration:

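```nginx
events { }

http {
    server {
        listen 80;

        # $1 captures the target host (first path segment),
        # $2 the path segment that follows it (e.g. "metrics")
        location ~ /([a-zA-Z0-9_\.]+)/([a-zA-Z0-9_]*) {
            # Forward the request to the captured host on port 8080
            proxy_pass http://$1:8080;
        }
    }
}
```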

As a side note, you can't use the hostname of the Docker service you want to reach; you have to use its local IP address.
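To make the location pattern in the configuration above more concrete, here is a small Python sketch that applies the same regular expression to a request path and shows which host and path segment it captures. It only mirrors the regex and the port from the config; the `route` function and the example addresses are hypothetical, and it does not model how nginx builds the upstream request.

```python
import re

# Same pattern as the nginx `location ~` block. nginx regex locations are
# unanchored, so re.search (not re.fullmatch) is the closest equivalent.
LOCATION = re.compile(r"/([a-zA-Z0-9_\.]+)/([a-zA-Z0-9_]*)")
UPSTREAM_PORT = 8080  # port hard-coded in the proxy_pass directive


def route(path):
    """Return (upstream base URL, captured path segment), or None if no match."""
    match = LOCATION.search(path)
    if match is None:
        return None
    host, segment = match.group(1), match.group(2)
    return f"http://{host}:{UPSTREAM_PORT}", segment


print(route("/10.0.1.5/metrics"))     # host given as a local IP address
print(route("/foo/metrics/"))         # the example from the steps above
print(route("/ip-10-0-1-5/metrics"))  # hyphens are not in the character class
```

Note that the first character class contains no hyphen, so a target name such as `ip-10-0-1-5` would not match this location at all. That is not a problem when targets are addressed by IP, as the side note requires.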

hellais moved this from Sprint Backlog to Epic in Roadmap Jan 29, 2025
hellais moved this from 🏗 Planned to 🚀 In Progress in Roadmap Jan 31, 2025
LDiazN added a commit that referenced this issue Feb 13, 2025
This PR adds the node exporter service to ECS cluster machines. This is
necessary for #178 (and therefore for #104).

Node exporter is installed by the user-data script that initializes the
cluster machines.
LDiazN added a commit that referenced this issue Feb 18, 2025
This PR adds support for scraping application-level metrics straight
from the ECS nodes, without going through the load balancer, solving the
flapping behavior described in #104.

To achieve this we had to solve the following problems: 

- Reach the cluster nodes in AWS. This was solved by #182.
- Discover ECS tasks with their corresponding port and IP address. This
was a bit trickier: we solved it by adding a cronjob that runs a
Python script, which requests ECS data from AWS using the `boto3`
client and stores that information in a [Prometheus file-based
discovery](https://prometheus.io/docs/guides/file-sd/) compatible file.
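As a rough illustration of the second point, the sketch below turns a list of task records into the JSON layout that Prometheus file-based discovery reads: a list of groups, each with `targets` and `labels`. It is not the script from the PR. The record shape (`service`, `ip`, `port`), the `service` label, and the function names are assumptions made for the example. The `boto3` calls that would fetch the data are left out, as is the way targets map onto the nginx proxy.

```python
import json
import os
import tempfile


def to_file_sd(tasks):
    """Group task endpoints by service into file_sd target groups."""
    groups = {}
    for task in tasks:
        groups.setdefault(task["service"], []).append(f'{task["ip"]}:{task["port"]}')
    return [
        {"targets": sorted(targets), "labels": {"service": service}}
        for service, targets in sorted(groups.items())
    ]


def write_file_sd(tasks, path):
    """Write the target file atomically so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(to_file_sd(tasks), tmp, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, path)  # atomic rename within the same directory


# Hypothetical records, shaped as a discovery step might produce them.
example_tasks = [
    {"service": "api", "ip": "10.0.1.5", "port": 32768},
    {"service": "api", "ip": "10.0.2.7", "port": 32770},
    {"service": "worker", "ip": "10.0.1.5", "port": 32769},
]
print(json.dumps(to_file_sd(example_tasks), indent=2))
```

Writing to a temporary file and renaming it is a design choice for the sketch: Prometheus re-reads the discovery file when it changes, so replacing it in one step avoids exposing a half-written file when the cronjob runs.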

So, this PR will add: 
- A Python script that collects ECS task information from AWS and writes
it to a file
- An Ansible configuration for creating and running this script with a
cronjob
- An update to the Nginx configuration used to proxy metrics requests
from the monitoring host to the EC2 instances in AWS
- Security group configuration to allow traffic from the proxy host to
the ECS cluster nodes
- IAM credentials used for requesting task information from AWS

This PR solves ooni/backend#937 and
ooni/backend#938 and is related to
#104