
Commit fc83e6a

Add postgres-exporter alerts

1 parent 47d5d05 commit fc83e6a

File tree

  postgres/system-alerts.yaml.tmpl
  postgres/team-alerts.yaml.tmpl

2 files changed: +169 -0 lines changed

postgres/system-alerts.yaml.tmpl

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# PROMETHEUS RULES
# DO NOT REMOVE line above, used in `pre-commit` hook

groups:
  - name: postgres-exporter
    rules:
      - alert: "PostgresExporterDown"
        expr: |
          up{job="postgres-exporter"} == 0
        for: 15m
        labels:
          team: infra
        annotations:
          summary: "Postgres Exporter is down"
          impact: "Postgres instances are not monitored"
          qonto_runbook: https://qonto.github.io/database-monitoring-framework/latest/runbooks/postgresql/SQLExporterDown
          action: |
            Check if postgres-exporter is running in sys-prom namespace.
            Check the logs.
            Restart pods.

      - alert: "PostgresExporterScrapingLimit"
        expr: |
          avg_over_time(pg_exporter_last_scrape_duration_seconds{job="postgres-exporter", instance!=""}[10m]) > 30
        for: 5m
        labels:
          alerttype: stock
          alertgroup: Postgres
        annotations:
          summary: "Postgres Exporter scraping is taking a long time"
          impact: "Postgres instances are not monitored."
          runbook_url: https://qonto.github.io/database-monitoring-framework/latest/runbooks/postgresql/SQLExporterScrapingLimit
          action: |
            Check postgres-exporter logs and resource usage.
            Check postgres database for any long-running queries.
            Restart pods.

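The rules above can be exercised before rollout with Prometheus rule unit tests. Below is a minimal sketch of a `promtool test rules` file for the PostgresExporterDown alert; the test file path, the rendered rule-file path, and the sample instance label are assumptions for illustration and are not part of this commit.

  # tests/system-alerts_test.yaml (hypothetical path)
  # Run with: promtool test rules tests/system-alerts_test.yaml
  rule_files:
    - ../postgres/system-alerts.yaml   # rendered template; the path is an assumption
  evaluation_interval: 1m

  tests:
    - interval: 1m
      input_series:
        # The exporter answers twice, then disappears: `up` drops to 0 and stays there.
        - series: 'up{job="postgres-exporter", instance="postgres-exporter:9187"}'
          values: '1 1 0x30'
      alert_rule_test:
        - eval_time: 20m
          alertname: PostgresExporterDown
          exp_alerts:
            - exp_labels:
                job: postgres-exporter
                instance: postgres-exporter:9187
                team: infra
              exp_annotations:
                summary: "Postgres Exporter is down"
                impact: "Postgres instances are not monitored"
                qonto_runbook: https://qonto.github.io/database-monitoring-framework/latest/runbooks/postgresql/SQLExporterDown
                action: |
                  Check if postgres-exporter is running in sys-prom namespace.
                  Check the logs.
                  Restart pods.

At eval_time 20m the expression has been true for more than the 15m `for` window, so the test expects the alert to be firing with the exporter's series labels plus the rule's `team: infra` label.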
postgres/team-alerts.yaml.tmpl

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
# PROMETHEUS RULES
# DO NOT REMOVE line above, used in `pre-commit` hook

groups:
  # Uses `uw_rds_owner_team` recording rule created in `system-alerts/rds` for team detection.
  # Based on https://github.com/qonto/database-monitoring-framework/blob/main/charts/prometheus-postgresql-alerts/values.yaml
  - name: Postgres
    rules:
      - alert: "PostgreSQLInactiveLogicalReplicationSlot"
        expr: |
          max by (target, slot_name) (pg_replication_slots_active{slot_type="logical"}) < 1
          + on (dbidentifier) group_left (team) uw_rds_owner_team
        for: 10m
        labels:
          alerttype: stock
          alertgroup: Postgres
        annotations:
          summary: "Logical replication slot {{ $labels.slot_name }} on {{ $labels.target }} is inactive"
          impact: "Potential disk space saturation and replication slot no longer being usable."
          qonto_runbook: https://qonto.github.io/database-monitoring-framework/latest/runbooks/postgresql/PostgreSQLInactiveLogicalReplicationSlot
          action: |
            Check the replication slot disk space consumption trend.
            Identify the non-running logical replication slot.
            Investigate and fix the replication slot client.

      - alert: "PostgreSQLInactivePhysicalReplicationSlot"
        expr: |
          max by (target, slot_name) (pg_replication_slots_active{slot_type="physical"}) < 1
          + on (dbidentifier) group_left (team) uw_rds_owner_team
        for: 10m
        labels:
          alerttype: stock
          alertgroup: Postgres
        annotations:
          summary: "Physical replication slot {{ $labels.slot_name }} on {{ $labels.target }} is inactive"
          impact: "Potential disk space saturation and replication slot no longer being usable."
          qonto_runbook: https://qonto.github.io/database-monitoring-framework/latest/runbooks/postgresql/PostgreSQLInactivePhysicalReplicationSlot
          action: |
            Check the replication slot disk space consumption trend.
            Check replica lag and instance logs.
            Increase disk space on the primary instance if necessary.

      - alert: "PostgreSQLInvalidIndex"
        # pint disable promql/series
        expr: |
          count by (datname, relname, indexrelname, server) (
            pg_stat_user_indexes_idx_scan{indisvalid="false"}
          ) > 0
          + on (dbidentifier) group_left (team) uw_rds_owner_team
        for: 1h
        labels:
          alerttype: stock
          alertgroup: Postgres
        annotations:
          summary:
            "Index {{ $labels.indexrelname }} of {{ $labels.relname }} table on {{ $labels.datname
            }} database on {{ $labels.target }} is invalid"
          impact: "PostgreSQL does not use the index for query execution, which could degrade query performance."
          qonto_runbook: https://qonto.github.io/database-monitoring-framework/latest/runbooks/postgresql/PostgreSQLInvalidIndex
          action: |
            Diagnose the root cause.
            Delete and recreate the index.

      - alert: "PostgreSQLLongRunningQueries"
        expr: |
          pg_long_running_transactions_oldest_timestamp_seconds > 1800
          + on (dbidentifier) group_left (team) uw_rds_owner_team
        for: 1m
        labels:
          alerttype: stock
          alertgroup: Postgres
        annotations:
          summary: "Long running query on {{ $labels.instance }} for >30 minutes."
          impact: "Potential block on other queries, WAL file rotation and vacuum operations."
          qonto_runbook: https://qonto.github.io/database-monitoring-framework/latest/runbooks/postgresql/PostgreSQLLongRunningQueries
          action: |
            Identify and diagnose the blocking queries.
            Terminate these queries if you are sure it is safe to do so.

      - alert: "PostgreSQLMaxConnections"
        expr: |
          max by (target) (pg_stat_connections_count)
          * 100
          / max by (target) (pg_settings_max_connections)
          > 80
          + on (dbidentifier) group_left (team) uw_rds_owner_team
        for: 10m
        labels:
          alerttype: stock
          alertgroup: Postgres
        annotations:
          summary: "{{ $labels.target }} uses >80% of the maximum database connections"
          impact: "New clients might not be able to connect."
          qonto_runbook: https://qonto.github.io/database-monitoring-framework/latest/runbooks/postgresql/PostgreSQLMaxConnections
          action: |
            Reduce the number of clients.
            Increase max_connections (check memory first!).

      - alert: "PostgreSQLReplicationSlotStorageLimit"
        expr: |
          max by (target, slot_name) (pg_replication_slots_available_storage_percent{}) < 20
          + on (dbidentifier) group_left (team) uw_rds_owner_team
        for: 5m
        labels:
          alerttype: stock
          alertgroup: Postgres
        annotations:
          summary: "{{ $labels.slot_name }} on {{ $labels.target }} is close to its storage limit"
          impact: "Potential disk space saturation and replication slot no longer being usable."
          qonto_runbook: https://qonto.github.io/database-monitoring-framework/latest/runbooks/postgresql/PostgreSQLReplicationSlotStorageLimit
          action: |
            Check replication slot client logs and performance.
            Correct the root cause on the replication slot client.
            Increase max_slot_wal_keep_size to allow more disk space for the replication slot (check free storage first!).
            Increase server storage.

      - alert: "PostgresExporterMissingTarget"
        expr: |
          min(up{job="postgres-exporter", instance!=""}) by (instance) == 0
          + on (dbidentifier) group_left (team) uw_rds_owner_team
        for: 15m
        labels:
          alerttype: stock
          alertgroup: Postgres
        annotations:
          summary: "Postgres Exporter scrape for {{ $labels.target }} failed"
          impact: "{{ $labels.target }} instance is not monitored."
          qonto_runbook: https://qonto.github.io/database-monitoring-framework/latest/runbooks/postgresql/SQLExporterMissingTarget
          action: |
            Check postgres-exporter logs in sys-prom namespace.
            Check if the postgres instance is down.
            Check postgres database connection logs.
            Check if there's an issue with the prometheus_postgres_exporter user.
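
For context on the `+ on (dbidentifier) group_left (team)` joins above: every alert expression is joined against the `uw_rds_owner_team` recording rule (defined in `system-alerts/rds`, not in this commit) so that the owning team's label is attached to each firing alert. The sketch below shows the general shape such a recording rule could take; the source metric `aws_rds_instance_info` and its labels are assumptions for illustration only.

  # Illustrative only -- the real rule lives in `system-alerts/rds`.
  # Assumes a source metric `aws_rds_instance_info` (hypothetical) that already
  # carries `dbidentifier` and `team` labels for every RDS instance.
  groups:
    - name: rds-ownership
      rules:
        - record: uw_rds_owner_team
          # One series per instance, keyed by `dbidentifier` and carrying `team`,
          # so alert expressions can join with `on (dbidentifier) group_left (team)`.
          expr: |
            max by (dbidentifier, team) (aws_rds_instance_info)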
