Releases: m-lab/prometheus-support
One-off bug fix release
This release fixes a bug introduced in the previous release that caused floods of spurious LameDuckMetricMissingForNode alerts to fire.
Weekly release: 2018-09-10 to 2018-09-18
This release introduces a new k8s deployment, service and ingress for the Github Maintenance Exporter.
Weekly release: 2018-08-28 to 2018-09-10
This release features:
- A number of improvements to alerting, including fixes that make some existing alerts less noisy, plus some new alerts.
- An update of data-processing-cluster's Prometheus instance to v2.3.2.
- A new BQ exporter query to check for completeness of NDT test annotations.
Weekly release: 2018-08-23 to 2018-08-28
This release includes:
- A typo fix for the ParserDailyVolumeTooLow dashboard.
- An increase of the timeout for the SnmpScrapingDownAtSite alert to 60m.
Weekly release: 2018-08-14 to 2018-08-23
This release increases the default RAM allocated to Prometheus in mlab-oti and increases the cache index flag parameters to improve interactive query support.
In addition, the ParserDailyVolumeTooLow alert is now built on a recording rule, which should make its evaluation much more efficient.
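As a rough illustration of the recording-rule approach, the sketch below precomputes an expensive aggregate once per evaluation interval and lets the alert compare against the stored result. The metric name, recorded series name, threshold, and `for` duration are all hypothetical and are not taken from this repository.

```yaml
groups:
  - name: parser-volume-example
    rules:
      # Recording rule: evaluate the costly range query once and store the result.
      # "hypothetical_parser_rows_total" is an assumed metric name.
      - record: job:hypothetical_parser_rows:increase1d
        expr: sum by (job) (increase(hypothetical_parser_rows_total[1d]))
      # The alert reads the cheap, precomputed series instead of re-running
      # the range query. The threshold of 1000 is illustrative only.
      - alert: ParserDailyVolumeTooLow
        expr: job:hypothetical_parser_rows:increase1d < 1000
        for: 1h
```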
33 new commits with various new features, improvements and bug fixes
Overview
- Repos prometheus-snmp-exporter and prometheus-script-exporter were renamed to snmp-exporter-support and script-exporter-support, respectively.
- A major bug was fixed in a couple of AlertManager inhibit rules. It went undiscovered because AM was never reloaded to read in the new configmap that introduced the breakage. This is fixed now.
- Grafana data sources as code! Data sources are now managed through YAML files, similar to how dashboards are managed through JSON files.
- Ad-hoc monitoring for new platform k8s cluster nodes mlab3.lga03 and mlab3.lax02.
- Ad-hoc monitoring for the new U.S. ndt-cloud VMs.
- ConfigMap reloader added to the AlertManager pod.
- Automatic Google Cloud DNS entries created for properly configured k8s services.
- New monitoring/metrics for "fast sidestream" and "tcp-info".
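For readers unfamiliar with data sources as code, a Grafana provisioning file generally looks like the minimal sketch below. The data source name and URL are placeholders, not the values used in this repository.

```yaml
# Grafana data source provisioning file (placeholder values).
apiVersion: 1
datasources:
  - name: Prometheus (example)
    type: prometheus
    access: proxy
    # Hypothetical in-cluster address; substitute the real Prometheus service URL.
    url: http://prometheus.example.internal:9090
    isDefault: true
```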
Client-geohash query + metric_relabel_configs for sidestream
This release contains three main changes:
- A new bq_exporter query to aggregate NDT geospatially. This will allow us to create and maybe publish heatmaps for NDT performance.
- Label rewrite rules for legacy targets to set the experiment index label correctly. The index numbers will be used to associate Sidestream aggregate traffic on a per-experiment basis.
- We now scrape node_exporters running on dns.measurementlab.net and mirror.measurementlab.net, and alert on them too.
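The label rewrite rules above rely on Prometheus `metric_relabel_configs`. A minimal sketch of the mechanism follows; the job name, target, label names, and values are hypothetical and only show the shape of such a rule.

```yaml
# All job, label, and value names here are hypothetical.
scrape_configs:
  - job_name: legacy-targets-example
    static_configs:
      - targets: ['legacy-node.example.net:9100']
    metric_relabel_configs:
      # If a scraped series carries experiment="experiment-a", set index="1".
      # The regex is fully anchored, so only an exact match is rewritten.
      - source_labels: [experiment]
        regex: experiment-a
        target_label: index
        replacement: '1'
```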
Auto-deploy cloud VM targets + AM inhibit rules
This release contains three main changes:
- Prometheus targets for ndt-cloud GCE VMs are now stored as static files in this repository, and are deployed to k8s persistent storage automatically by Travis builds. From here forward, if ndt-cloud targets need to change, edit them in this repo.
- A couple of new AlertManager inhibit rules were put in place to prevent cascading failure alerts related to the snmp-exporter and script-exporter.
- Three Grafana JSON dashboards were edited so that their template variables refresh every time the dashboard is loaded. Previously the variables were set to never refresh, so they were static lists that went out of date.
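To illustrate how an inhibit rule prevents cascading alerts, the sketch below suppresses a dependent alert while the alert for the exporter itself is firing. The source alert name and the `site` label are assumptions made for illustration; the actual rules live in this repository's AlertManager configuration.

```yaml
inhibit_rules:
  # While the (hypothetically named) exporter-down alert is firing...
  - source_match:
      alertname: ExampleSnmpExporterDown
    # ...mute the alerts that depend on that exporter being up.
    target_match:
      alertname: SnmpScrapingDownAtSite
    # Only inhibit when both alerts share the same value for these labels.
    equal: ['site']
```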
Grafana 5.1.4 + auto configmap reloads + Grafana Worldmap plugin
This release has 3 main components, along with a number of smaller changes and bugfix commits:
- We now use Grafana 5.1.4. The interface has roughly the same feel, but has many new features and better panel layout options. It also uses the new Grafana provisioning feature for dashboards, instead of the deprecated dashboards.json feature we were using before.
- Configmaps should now get reloaded by their applications when there are changes. Previously, we had to manually reload, for example, Prometheus when its configuration changed. With this new feature, services should reload their configmaps automatically.
- Grafana now has the Worldmap plugin, which when tied to some data from the bigquery_exporter, can display very interesting mapped data.
Weekly release: 2018-06-11 to 2018-06-25
This release adds a number of significant changes and new additions to monitoring:
Additions:
- new Ops_PodOverview dashboard.
- new mlabns stackdriver metric collection and accompanying dashboard.
- node exporter is a standard deployment for prometheus clusters.
Changes:
- SNMP exporter and Script exporter are scraped using their private VPC network DNS names
- Prometheus RAM & CPU allocs are increased for all projects so RAM == 2x HEAP size.
- Github Receiver runs with "inmemory" mode for sandbox and staging. Previously it was only run in mlab-oti.
- Github Receiver supports native prometheus metrics, for error rates and available API rate limits.
- Enhancements to Prometheus SelfMetrics dashboard
- kube-state-metrics is the latest version and as useful as possible.
- blackbox exporters run redundantly