[GSoC Project Proposal]: Prometheus server- graphs and monitoring for ERDDAP™ #72

Open
ChrisJohnNOAA opened this issue Jan 31, 2025 · 17 comments
Labels
GSoC25 project idea Designates a proposed project idea

Comments

@ChrisJohnNOAA

ChrisJohnNOAA commented Jan 31, 2025

Project Description

We recently started adding Prometheus metrics to ERDDAP™. The main goal of this project is to build an example Prometheus Server which can monitor one or more ERDDAP™ instances. This may involve adding new metrics to the ERDDAP™ project which is where Java would be used.

Expected Outcomes

A Prometheus Server configuration that is runnable through Docker and can be used to monitor one or more ERDDAP™ instances. This will help ERDDAP™ admins monitor their servers and provide usage insight that can help guide future ERDDAP™ development.
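
As a rough sketch of what such a configuration might look like (an illustration, not the project's actual config): the hostnames below are placeholders, and the `/erddap/metrics` path is an assumption about where each instance exposes its Prometheus metrics.

```yaml
# prometheus.yml: minimal sketch, not the project's actual configuration.
global:
  scrape_interval: 60s              # how often Prometheus pulls metrics from each target

scrape_configs:
  - job_name: erddap
    scheme: https
    metrics_path: /erddap/metrics   # assumed location of the metrics endpoint
    static_configs:
      - targets:                    # placeholder hosts; one entry per ERDDAP instance
          - erddap-one.example.org:443
          - erddap-two.example.org:443
```

Monitoring a single instance would just mean listing one target; adding servers is a matter of appending to `targets`.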

Skills Required

Java, Prometheus, YML, Docker

Additional Background/Issues

The main ERDDAP™ repo is here.

Mentor(s)

Chris John (@ChrisJohnNOAA) [email protected]

Expected Project Size

175 hours

Project Difficulty

Intermediate

@ChrisJohnNOAA ChrisJohnNOAA added GSoC25 project idea Designates a proposed project idea labels Jan 31, 2025
@7yl4r
Contributor

7yl4r commented Feb 4, 2025

I'm running a Grafana monitoring stack for the MBON dashboard server. I've read about Prometheus but haven't tried it. Could be a great solution for improved ERDDAP monitoring and reporting.

@ayushsingh01042003

Is there an architecture in mind for this? Will we have an erddap-monitoring module in the current repository, or something like a new repository for everything related to monitoring?

And since this example server will be containerized and runnable through Docker, do we use the existing Prometheus image on Docker Hub, or do we get the entire Prometheus server binary and configure that to be run through Docker?

@ChrisJohnNOAA
Author

@ayushsingh01042003 There are two main use cases I see for this. The more I think about it, the more I think they are distinct enough that it will require 2 separate configs (and ways of running).

  1. ERDDAP™ admins (and their organizations) monitoring their own ERDDAP™ instance. This should be easy to run with an ERDDAP™ server, but not required. I think the easiest way to do that would be using the official Prometheus server binary + configuration, and Docker Compose to make it runnable on Docker alongside the ERDDAP™ instance. It needs to be possible to run the Prometheus server/container on a separate machine to support organizations that want to be able to monitor/alert if the primary machine goes down. Ideally this would be usable/helpful for admins that aren't using Docker at all, but that might be limited to being able to share config files. My instinct is this should be in the ERDDAP™ repo, but I don't know if there are complications (like with how it gets published to Docker Hub) that would make this better in a separate repo.

  2. The second use case is for the core ERDDAP™ team to collect data from a number of servers. This would likely be run locally to aid in debugging specific problems across the fleet, investigating current configurations (like what percent of the servers have a flag turned on), and investigating usage of various features to aid in development decisions. I don't believe this will be published on Docker Hub, so most likely this should just be in the main ERDDAP™ repo.

@ayushsingh01042003

  • I understand the second approach. The use cases could be pretty interesting for this, so configuring a Prometheus binary from scratch in a separate repository is the approach to go with for sure.

  • I don't understand the use of the Prometheus binary in the ERDDAP repo. Could we not just use the official Prometheus Docker image and configure that image to make it runnable through docker-compose to monitor that ERDDAP instance? The Prometheus image alone can be published to Docker Hub, making it possible to run it on a separate server that can monitor multiple ERDDAP instances.
    At least, this is what I have in mind after trying to understand the above use cases.

@ChrisJohnNOAA
Author

Yes, sorry my language was unspecific. We should be able to use the official Prometheus image and provide configuration for it.
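
A minimal sketch of that setup with Docker Compose, for illustration only: the service names, published ports, and the ERDDAP image name are assumptions, while `prom/prometheus` is the official Prometheus image and the mount paths are that image's defaults.

```yaml
# docker-compose.yml: sketch only; service names, ports, and the ERDDAP image are assumptions.
services:
  erddap:
    image: erddap/erddap:latest       # assumed image name; substitute the image you deploy
    ports:
      - "8080:8080"

  prometheus:
    image: prom/prometheus:latest     # official Prometheus image on Docker Hub
    volumes:
      # Provide our own configuration at the image's default config path.
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      # Persist the time-series database across container restarts.
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    depends_on:
      - erddap

volumes:
  prometheus-data:
```

To run Prometheus on a separate machine, the `erddap` service and the `depends_on` entry would be dropped and the scrape targets in `prometheus.yml` pointed at the remote host instead.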

As for single or multiple ERDDAP™ instances, I think there are going to be different data/graphs that are important for the different audiences. For example, for monitoring the fleet of ERDDAP™ servers, I'd love to be able to see what percent have new feature flags turned on. Admins monitoring a single ERDDAP™ server (most admins are in this situation) likely don't care at all to see the feature flag metrics. They likely care about how much traffic each dataset id is getting, but for monitoring many ERDDAP™s I'd want to see how much traffic dataset types and protocols are getting, not individual dataset ids.

There are different needs for monitoring ERDDAP™ overall vs a single ERDDAP™ instance, so we're going to need 2 configs for Prometheus.
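
To illustrate how the two audiences might aggregate the same data differently, here is a hedged sketch of Prometheus recording rules. Every metric and label name (`erddap_feature_flag_enabled`, `erddap_requests_total`, `flag`, `protocol`, `dataset_id`) is hypothetical and would need to be replaced with whatever Metrics.java actually exports.

```yaml
# rules.yml: illustrative only; all metric and label names are hypothetical.
groups:
  # Fleet view for the core team: aggregate across servers.
  - name: erddap_fleet
    rules:
      # Fraction of scraped servers with each feature flag on,
      # assuming the flag is exported as a 0/1 gauge.
      - record: fleet:feature_flag_enabled:ratio
        expr: avg by (flag) (erddap_feature_flag_enabled)
      # Request rate per protocol (griddap, tabledap, ...) over all servers.
      - record: fleet:requests:rate5m
        expr: sum by (protocol) (rate(erddap_requests_total[5m]))

  # Single-server view for admins: break traffic down per dataset.
  - name: erddap_single_server
    rules:
      - record: dataset:requests:rate5m
        expr: sum by (dataset_id) (rate(erddap_requests_total[5m]))
```

In practice the two groups would probably live in separate rule files, one shipped with each of the two configs.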

@ayushsingh01042003

Understood, thanks for clearing that up.

@lareinahu-2023

Hi @ChrisJohnNOAA,

I’m Jiahui Hu, a Master’s student in Computer Software Engineering at Northeastern University (Seattle). I’m very interested in contributing to this Prometheus Monitoring for ERDDAP™ project for GSoC.

I have experience with Java, Docker, and system monitoring. During my software engineering internship, I configured Prometheus and Grafana to track system performance, reducing anomaly detection time by 50%. I also optimized Kafka-based data pipelines, cutting latency by 35%, and worked on API traffic management using AWS API Gateway, improving request handling by 20%. These experiences align well with setting up a Prometheus server, defining useful metrics, and containerizing the deployment for ERDDAP™.

I reviewed ERDDAP’s existing metrics and saw that while many Prometheus metrics are available, areas like query response times, memory usage, and data request patterns could provide deeper insights. My plan is to integrate custom Prometheus metrics within ERDDAP™ to give admins better visibility into performance and bottlenecks.

Would it be beneficial to include automated alerts for key performance thresholds, or is the focus primarily on visualization? Also, are there specific performance bottlenecks in ERDDAP™ that need more attention?

Looking forward to your thoughts!

Best,
Jiahui Hu
Email: [email protected]
GitHub: github.com/lareinahu-2023

@ChrisJohnNOAA
Author

Hi @lareinahu-2023, thanks for your interest. Where did you do your internship?

I recently added some additional Prometheus metrics which includes query response times and request patterns. The main file for the Prometheus metrics is here: https://github.com/ERDDAP/erddap/blob/main/WEB-INF/classes/gov/noaa/pfel/erddap/util/Metrics.java

I agree there are more metrics that would be beneficial to add, even after those changes.

As for automated alerts, they may be useful, in particular for individual server admins. Nobody has requested alerts (there have been many requests for visualization, particularly for things like what datasets are popular and how much traffic), but I do imagine some admins would appreciate having alerts.
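
For reference, a basic availability alert could be sketched as below. `up` is the metric Prometheus records for every scrape target; the job name `erddap`, the 5 minute window, and the severity label are assumptions chosen for illustration.

```yaml
# alerts.yml: sketch of one alert rule; job name, threshold, and labels are assumptions.
groups:
  - name: erddap_availability
    rules:
      - alert: ErddapTargetDown
        # `up` is 0 whenever the most recent scrape of a target failed.
        expr: up{job="erddap"} == 0
        for: 5m                       # only fire after 5 minutes of continuous failure
        labels:
          severity: critical
        annotations:
          summary: "ERDDAP target {{ $labels.instance }} has been unreachable for 5 minutes"
```

Delivering the notification (email, chat, and so on) would additionally require an Alertmanager instance, which is outside this sketch.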

@lareinahu-2023

Hi @ChrisJohnNOAA,
Thank you for your reply. I did my internship at Terra Byte X, a company focused on developing AI-powered EdTech tools.

After reviewing the Prometheus metrics you've already implemented, I have a few clarification questions:

  1. You mentioned that many have requested visualization for dataset popularity and traffic data. Could you elaborate on what specific aspects of this data would be most valuable to visualize? For example, are administrators more interested in time-based trends, geographic distribution of requests, or comparative metrics between datasets?

  2. I've reviewed the Metrics.java file you shared. While it implements several Prometheus metrics including query response times and request patterns, I'd like to understand which specific monitoring gaps still exist that would be most impactful to address in this project.

  3. The project description mentions monitoring "one or more ERDDAP instances." In your experience, what's the typical deployment scenario for ERDDAP administrators - are they usually managing single instances, or would the solution need to scale to monitoring clusters of instances from the beginning?

Thanks again for your guidance. I'm excited about the possibility of contributing to this project!

@ayushsingh01042003

ayushsingh01042003 commented Mar 4, 2025

@ChrisJohnNOAA I had a similar question. For the Prometheus config used by administrators, I understand that metrics like JVM and dataset traffic will be included, and my guess is that some but not all metrics from the status page are going to be implemented.

However, for the Prometheus config used by the core team, are there other metrics that you or other members of the core team have in mind, apart from the percentage of certain flags being used across the fleet of servers?

@ChrisJohnNOAA
Author

@lareinahu-2023

  1. There have been several discussions and projects around this. For example @callumrollo led a project to parse ERDDAP™ logs (GitHub, blog, YouTube video). There was also a discussion about this on the ERDDAP™ GitHub. There are also multiple issues that have been filed essentially asking for a machine readable format (Log all significant events in a standard format ERDDAP/erddap#80, Record requests in a structured format ERDDAP/erddap#118). The most consistent request I've heard is to be able to track dataset usage (partially to defend continued funding of collecting datasets I believe).
  2. The new metrics haven't launched yet (next build happening soon). So nobody has really used them yet. I imagine admins and myself will have a better idea what's missing once we've had a chance to use these. I would be interested in getting additional information about errors and failures to help detect problems.
  3. Generally each ERDDAP™ server is pretty unique. Even if an admin is running multiple servers it might not make sense to aggregate their stats (though it's possible some would want to). There's also the possibility that a larger organization (like NOAA) might want some kind of aggregated view of all their ERDDAP™ servers. In that situation I'd expect the organization to construct their own configuration. However my main goal of monitoring multiple instances is to collect data to assist the ERDDAP™ development team in making decisions about future work. Things like whether feature flags are enabled on servers, what dataset types are used, what file download types are used, which ERDDAP™ protocols (griddap, tabledap, files, wms, etc.) are used, whether browser or script based querying is more common, and whether we should put effort into mobile device support. Another thing useful for the core ERDDAP™ team would be detecting errors/issues.

@ayushsingh01042003 My thoughts on metrics for the core team are in point 3 above. Several of the metrics for the status page are in the new Metrics. We could add others, but need to be careful about cardinality explosion for some of them.
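
One way the fleet-wide config could limit cardinality, sketched under assumptions: drop per-dataset series at scrape time so that only the lower-cardinality aggregates are stored. The `erddap_dataset_.*` name pattern and the target host are hypothetical, and `/erddap/metrics` is an assumed endpoint path.

```yaml
# Fragment of a fleet-wide prometheus.yml: sketch only.
scrape_configs:
  - job_name: erddap_fleet
    metrics_path: /erddap/metrics         # assumed metrics endpoint path
    static_configs:
      - targets:
          - erddap.example.org:443        # placeholder host
    scheme: https
    metric_relabel_configs:
      # Discard whole per-dataset series before they are stored.
      # The metric name pattern is hypothetical.
      - source_labels: [__name__]
        regex: "erddap_dataset_.*"
        action: drop
```

Dropping entire series is used here rather than removing only the dataset label, because stripping a label can leave several series with identical label sets within one scrape.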

@lareinahu-2023

lareinahu-2023 commented Mar 5, 2025

@ChrisJohnNOAA
Understood. Thanks a lot for your detailed guidance.

@lareinahu-2023

Hi @ChrisJohnNOAA,
I have drafted an initial project plan that addresses several key areas and would greatly appreciate your feedback.
1. ERDDAP™ Metrics Enhancement
Extend gov.noaa.pfel.erddap.util.Metrics.java with Prometheus-compatible metrics using Micrometer API. Implement counters, histograms, gauges, cardinality-aware metrics, custom collectors, and JVM metrics integration.
2. Prometheus Server Configuration
Develop Prometheus server configuration in YAML with intelligent scraping intervals, job configurations, relabeling rules, recording rules, retention policies, and service discovery mechanisms.
3. Docker Deployment Architecture
Build containerized monitoring stack using Docker Compose with Prometheus, Alertmanager, and Grafana. Implement volume mounts, parameterized environment configuration, health checks, container resource limits, and network configurations.
4. Grafana Dashboard Suite
Develop specialized Grafana dashboards using PromQL including dataset popularity dashboard, operational dashboard, protocol usage dashboard, comparison dashboard, and template variables.
5. Alerting Framework
Implement tiered alerting system using Prometheus Alertmanager with predefined alert rules, intelligent thresholds, notification channels, and alert grouping and silencing policies.
6. Feature Usage Analytics
Develop metrics for feature flag adoption, file format counters (NetCDF, CSV, JSON), client device tracking, protocol utilization metrics (griddap, tabledap, WMS), and development team dashboards.
7. Performance Optimization Insights
Implement query performance tracking with histogram metrics, resource utilization metrics, connection pool metrics, heat map dashboards, and resource saturation alerts.
8. Documentation & Deployment Guide
Create documentation for monitoring stack, deployment guide for various platforms (local, cloud, Kubernetes), metrics explanations, dashboard usage tutorials, and resource allocation recommendations.

I'm eager to refine this proposal based on your expertise and project vision. I'm flexible and open to adjusting the scope and focus based on your guidance. 


@ChrisJohnNOAA
Author

@lareinahu-2023 Why do you recommend using Micrometer? What benefit would it provide ERDDAP™?

Many of the metrics you mention adding are already collected in Metrics.java. For example JVM metrics are included through JvmMetrics. In early February I also added a number of additional metrics which include feature flag state, request information (including file format, protocol, response times, and more - you can see the metrics for our canary server here). There may be other metrics that would be useful to collect, but I don't want to duplicate metrics we already have.

Something I don't see called out in the proposal is that we likely need two different configurations for dashboards. One for the ERDDAP™ team to better understand usage across servers from different organizations and the other for administrators running one to a small handful of servers to monitor their server(s).
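
If Grafana ends up being the dashboard tool (an assumption taken from the draft plan, not something decided here), the two dashboard sets could be kept apart with a dashboard provisioning file along these lines. The provider names, folder names, and paths are all placeholders.

```yaml
# Grafana dashboard provisioning file: sketch only; names and paths are placeholders.
apiVersion: 1

providers:
  # Dashboards for admins watching their own server(s).
  - name: erddap-admin
    folder: ERDDAP Admin
    type: file
    options:
      path: /var/lib/grafana/dashboards/admin

  # Dashboards for the core team looking across many servers.
  - name: erddap-fleet
    folder: ERDDAP Fleet
    type: file
    options:
      path: /var/lib/grafana/dashboards/fleet
```

An admin deployment could ship only the first provider, while the core team's local setup loads the second.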

@moShehata-1811

Dear Chris John,

I hope you’re doing well. My name is Mohamed Shehata, and I am from Egypt. As a student at the Faculty of Science, Menoufia University, I have developed a strong foundation in Java, which is the primary language used in my studies. My passion for Java, along with the certification I obtained and my experience with Docker, has driven me to explore Prometheus in depth.

After conducting thorough research on Prometheus and its integration with monitoring systems, I believe I am well-suited for this project. I am eager to contribute and further enhance my skills while making a meaningful impact. This opportunity aligns perfectly with both my academic and professional aspirations, and I am excited about the possibility of joining the project.

I would love to discuss how I can contribute effectively. Looking forward to your thoughts!

Best regards,
Mohamed Shehata

@lareinahu-2023

Hi @ChrisJohnNOAA,
Thank you for your feedback on my draft idea. I have revised it according to your suggestions and the template structure.
I have also submitted my proposal, "Prometheus Monitoring Implementation for ERDDAP™: Enabling Data-Driven Server Management", on the GSoC website under the contributor name Jiahui Hu.


I'm looking forward to receiving your feedback on my proposal. I'm excited about the opportunity to contribute to this project and would be grateful for any guidance you can provide.

@lareinahu-2023

Hi @ChrisJohnNOAA,
I created a PR: ERDDAP/erddap#266
I am looking forward to any further feedback from you.
Jiahui Hu
