Commit 4aa8a64

Merge pull request #7 from rhythmictech/ENGB360-22
Monitor improvements for multi-env
2 parents 96642b7 + 5b1cbf0 commit 4aa8a64

45 files changed (+929 -406 lines changed)

README.md

Lines changed: 26 additions & 0 deletions
@@ -27,3 +27,29 @@ module "monitor" {
 ```

 ## About
+
+<!-- BEGIN_TF_DOCS -->
+## Requirements
+
+No requirements.
+
+## Providers
+
+No providers.
+
+## Modules
+
+No modules.
+
+## Resources
+
+No resources.
+
+## Inputs
+
+No inputs.
+
+## Outputs
+
+No outputs.
+<!-- END_TF_DOCS -->

aws/alb/README.md

Lines changed: 9 additions & 2 deletions
@@ -20,7 +20,7 @@ Configures the following for ALBs based on tags matches:

 | Name | Version |
 |------|---------|
-| <a name="provider_datadog"></a> [datadog](#provider\_datadog) | >= 3.37 |
+| <a name="provider_datadog"></a> [datadog](#provider\_datadog) | 3.37.0 |

 ## Modules

@@ -46,23 +46,26 @@ No modules.
 | <a name="input_base_tags"></a> [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | <pre>[<br> "resource:alb"<br>]</pre> | no |
 | <a name="input_cost_center"></a> [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
 | <a name="input_dashboard_link"></a> [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
-| <a name="input_env"></a> [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+| <a name="input_env"></a> [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
 | <a name="input_evaluation_delay"></a> [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no |
 | <a name="input_http_5xx_responses_enabled"></a> [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no |
 | <a name="input_http_5xx_responses_evaluation_window"></a> [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
 | <a name="input_http_5xx_responses_no_data_window"></a> [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
 | <a name="input_http_5xx_responses_threshold_critical"></a> [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
 | <a name="input_http_5xx_responses_threshold_warning"></a> [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
+| <a name="input_http_5xx_responses_use_message"></a> [http\_5xx\_responses\_use\_message](#input\_http\_5xx\_responses\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no |
 | <a name="input_http_5xx_tg_responses_enabled"></a> [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no |
 | <a name="input_http_5xx_tg_responses_evaluation_window"></a> [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
 | <a name="input_http_5xx_tg_responses_no_data_window"></a> [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
 | <a name="input_http_5xx_tg_responses_threshold_critical"></a> [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
 | <a name="input_http_5xx_tg_responses_threshold_warning"></a> [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
+| <a name="input_http_5xx_tg_responses_use_message"></a> [http\_5xx\_tg\_responses\_use\_message](#input\_http\_5xx\_tg\_responses\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no |
 | <a name="input_latency_enabled"></a> [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no |
 | <a name="input_latency_evaluation_window"></a> [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
 | <a name="input_latency_no_data_window"></a> [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
 | <a name="input_latency_threshold_critical"></a> [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no |
 | <a name="input_latency_threshold_warning"></a> [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no |
+| <a name="input_latency_use_message"></a> [latency\_use\_message](#input\_latency\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no |
 | <a name="input_monitor_exclude_tags"></a> [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
 | <a name="input_monitor_include_tags"></a> [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
 | <a name="input_new_group_delay"></a> [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
@@ -71,10 +74,14 @@ No modules.
 | <a name="input_no_healthy_instances_no_data_window"></a> [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
 | <a name="input_no_healthy_instances_threshold_critical"></a> [no\_healthy\_instances\_threshold\_critical](#input\_no\_healthy\_instances\_threshold\_critical) | Critical threshold (percentage) | `number` | `0` | no |
 | <a name="input_no_healthy_instances_threshold_warning"></a> [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage) | `number` | `null` | no |
+| <a name="input_no_healthy_instances_use_message"></a> [no\_healthy\_instances\_use\_message](#input\_no\_healthy\_instances\_use\_message) | Whether to use the query alert base message | `bool` | `true` | no |
 | <a name="input_notify_alert_override"></a> [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| <a name="input_notify_crit_override"></a> [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
 | <a name="input_notify_default"></a> [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
 | <a name="input_notify_no_data"></a> [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
 | <a name="input_notify_nodata_override"></a> [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| <a name="input_notify_nonprod_override"></a> [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
+| <a name="input_notify_prod_override"></a> [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
 | <a name="input_notify_recovery_override"></a> [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
 | <a name="input_notify_warn_override"></a> [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
 | <a name="input_renotify_interval"></a> [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
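
To illustrate how a caller might use the inputs added in this README diff, here is a minimal sketch. The module label, the `source` path, and every notification handle are placeholders that do not appear in the commit. Only `notify_default` is marked required in the rows shown; the hidden part of the table may list other required inputs.

```hcl
# Hypothetical caller of the aws/alb module; names and handles are placeholders.
module "alb_monitors" {
  source = "../../aws/alb" # assumed relative path, adjust to your layout

  # `env` now defaults to null, so it is left unset here.
  notify_default = ["@slack-ops-alerts"]

  # Routing lists added by this commit. Per the README, each one
  # falls back to notify_default when left empty.
  notify_crit_override    = ["@pagerduty-oncall"]     # 24x7 critical alerts
  notify_prod_override    = ["@slack-prod-alerts"]    # 12x5 prod alerts
  notify_nonprod_override = ["@slack-nonprod-alerts"] # non-prod alerts

  # Per-monitor toggles for the shared query alert base message.
  latency_use_message              = true
  no_healthy_instances_use_message = true
}
```

Whether the handles need the leading `@` depends on how the module builds its message, which this excerpt does not show.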

aws/alb/main.tf

Lines changed: 19 additions & 19 deletions
@@ -4,16 +4,16 @@ locals {
   monitor_warn_default_priority = null
   monitor_nodata_default_priority = null

-  title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+  title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
   title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
 }

 resource "datadog_monitor" "http_5xx_responses" {
   count = var.http_5xx_responses_enabled ? 1 : 0

   name = join("", [local.title_prefix, "ALB 5xx Responses - {{loadbalancer.name}}", local.title_suffix])
-  include_tags = true
-  message = local.query_alert_base_message
+  include_tags = false
+  message = var.http_5xx_responses_use_message ? local.query_alert_base_message : ""
   tags = concat(local.common_tags, var.base_tags, var.additional_tags)
   type = "query alert"

@@ -27,8 +27,8 @@ resource "datadog_monitor" "http_5xx_responses" {

   query = <<END
     min(${var.http_5xx_responses_evaluation_window}):
-      default(avg:aws.applicationelb.httpcode_elb_5xx${local.query_filter} by {aws_account,env,loadbalancer,region}.as_rate(), 0) / (
-      default(avg:aws.applicationelb.request_count${local.query_filter} by {aws_account,env,loadbalancer,region}.as_rate(), 1)
+      default(avg:aws.applicationelb.httpcode_elb_5xx${local.query_filter} by {aws_account,env,datadog_managed,loadbalancer,region}.as_rate(), 0) / (
+      default(avg:aws.applicationelb.request_count${local.query_filter} by {aws_account,env,datadog_managed,loadbalancer,region}.as_rate(), 1)
     ) * 100 > ${var.http_5xx_responses_threshold_critical}
 END

@@ -42,8 +42,8 @@ resource "datadog_monitor" "http_5xx_tg_responses" {
   count = var.http_5xx_tg_responses_enabled ? 1 : 0

   name = join("", [local.title_prefix, "ALB Target Group 5xx Responses - {{loadbalancer.name}}", local.title_suffix])
-  include_tags = true
-  message = local.query_alert_base_message
+  include_tags = false
+  message = var.http_5xx_tg_responses_use_message ? local.query_alert_base_message : ""
   tags = concat(local.common_tags, var.base_tags, var.additional_tags)
   type = "query alert"

@@ -57,8 +57,8 @@ resource "datadog_monitor" "http_5xx_tg_responses" {

   query = <<END
     min(${var.http_5xx_tg_responses_evaluation_window}):
-      default(avg:aws.applicationelb.httpcode_elb_5xx${local.query_filter} by {loadbalancer,region,aws_account,targetgroup,env}.as_rate(), 0) / (
-      default(avg:aws.applicationelb.request_count${local.query_filter} by {loadbalancer,region,aws_account,targetgroup,env}.as_rate(), 1)
+      default(avg:aws.applicationelb.httpcode_elb_5xx${local.query_filter} by {loadbalancer,region,aws_account,targetgroup,env,datadog_managed}.as_rate(), 0) / (
+      default(avg:aws.applicationelb.request_count${local.query_filter} by {loadbalancer,region,aws_account,targetgroup,env,datadog_managed}.as_rate(), 1)
     ) * 100 > ${var.http_5xx_tg_responses_threshold_critical}
 END

@@ -72,9 +72,9 @@ END
 resource "datadog_monitor" "latency" {
   count = var.latency_enabled ? 1 : 0

-  name = join("", [local.title_prefix, "{{loadbalancer.name}} ALB latency - {{value}}s ", local.title_suffix])
-  include_tags = true
-  message = local.query_alert_base_message
+  name = join("", [local.title_prefix, "ALB latency - {{loadbalancer.name}} {{value}}s", local.title_suffix])
+  include_tags = false
+  message = var.latency_use_message ? local.query_alert_base_message : ""
   tags = concat(local.common_tags, var.base_tags, var.additional_tags)
   type = "query alert"

@@ -88,7 +88,7 @@ resource "datadog_monitor" "latency" {

   query = <<END
     avg(${var.latency_evaluation_window}):
-      default(avg:aws.applicationelb.target_response_time.average${local.query_filter} by {aws_account,env,loadbalancer,region}, 0
+      default(avg:aws.applicationelb.target_response_time.average${local.query_filter} by {aws_account,env,datadog_managed,loadbalancer,region}, 0
     ) > ${var.latency_threshold_critical}
 END

@@ -101,9 +101,9 @@ END
 resource "datadog_monitor" "no_healthy_instances" {
   count = var.no_healthy_instances_enabled ? 1 : 0

-  name = join("", [local.title_prefix, "{{loadbalancer.name}} ALB healthy instances is at {{value}}%", local.title_suffix])
-  include_tags = true
-  message = local.query_alert_base_message
+  name = join("", [local.title_prefix, "ALB available healthy instances - {{loadbalancer.name}} {{value}}%", local.title_suffix])
+  include_tags = false
+  message = var.no_healthy_instances_use_message ? local.query_alert_base_message : ""
   tags = concat(local.common_tags, var.base_tags, var.additional_tags)
   type = "query alert"

@@ -117,9 +117,9 @@ resource "datadog_monitor" "no_healthy_instances" {

   query = <<END
     min(${var.no_healthy_instances_evaluation_window}): (
-      sum:aws.applicationelb.healthy_host_count.minimum${local.query_filter} by {aws_account,env,region,loadbalancer} / (
-      sum:aws.applicationelb.healthy_host_count.minimum${local.query_filter} by {aws_account,env,region,loadbalancer} +
-      sum:aws.applicationelb.un_healthy_host_count.maximum${local.query_filter} by {aws_account,env,region,loadbalancer} )
+      sum:aws.applicationelb.healthy_host_count.minimum${local.query_filter} by {aws_account,env,datadog_managed,region,loadbalancer} / (
+      sum:aws.applicationelb.healthy_host_count.minimum${local.query_filter} by {aws_account,env,datadog_managed,region,loadbalancer} +
+      sum:aws.applicationelb.un_healthy_host_count.maximum${local.query_filter} by {aws_account,env,datadog_managed,region,loadbalancer} )
     ) * 100 <= ${var.no_healthy_instances_threshold_critical}
 END
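
The `message` changes in this file all follow one pattern: a boolean toggle selects between the shared base message and an empty string. The standalone sketch below reproduces that ternary outside the module so it can be tried with `terraform apply -var example_use_message=true`. The variable, local, and output names are hypothetical, and the message text is only a stand-in because the definition of `local.query_alert_base_message` is not part of this diff.

```hcl
# Hypothetical toggle, mirroring the *_use_message variables added in this commit.
variable "example_use_message" {
  description = "Whether to attach the shared base message (illustrative name)"
  type        = bool
  default     = false
}

locals {
  # Stand-in text; the real local.query_alert_base_message is not shown in the diff.
  example_base_message = "{{#is_alert}}Threshold crossed on {{loadbalancer.name}}{{/is_alert}}"
}

output "example_message" {
  # Same ternary shape as the diff: base message when enabled, empty string otherwise.
  value = var.example_use_message ? local.example_base_message : ""
}
```

Adding `datadog_managed` to each `by {...}` clause makes that tag a grouping dimension of the multi alert, so it should also become available to notification templates as `{{datadog_managed.name}}`, in the same way the monitor names already use `{{loadbalancer.name}}`. How the module consumes it is not visible in this excerpt.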

aws/alb/variables.tf

Lines changed: 28 additions & 4 deletions
@@ -17,7 +17,7 @@ variable "base_tags" {
 # HTTP 5xx Response Codes (ALB)
 ########################################
 variable "http_5xx_responses_enabled" {
-  default = false
+  default = true
   description = "Enable HTTP 5xx response monitor"
   type = bool
 }
@@ -46,11 +46,17 @@ variable "http_5xx_responses_threshold_warning" {
   type = number
 }

+variable "http_5xx_responses_use_message" {
+  description = "Whether to use the query alert base message"
+  type = bool
+  default = false
+}
+
 ########################################
 # HTTP 5xx Response Codes (Target Group)
 ########################################
 variable "http_5xx_tg_responses_enabled" {
-  default = false
+  default = true
   description = "Enable HTTP 5xx response monitor (target group)"
   type = bool
 }
@@ -79,11 +85,17 @@ variable "http_5xx_tg_responses_threshold_warning" {
   type = number
 }

+variable "http_5xx_tg_responses_use_message" {
+  description = "Whether to use the query alert base message"
+  type = bool
+  default = false
+}
+
 ########################################
 # Latency Instances
 ########################################
 variable "latency_enabled" {
-  default = false
+  default = true
   description = "Enable latency monitor"
   type = bool
 }
@@ -101,7 +113,7 @@ variable "latency_no_data_window" {
 }

 variable "latency_threshold_critical" {
-  default = null
+  default = 3
   description = "Critical threshold (seconds)"
   type = number
 }
@@ -112,6 +124,12 @@ variable "latency_threshold_warning" {
   type = number
 }

+variable "latency_use_message" {
+  description = "Whether to use the query alert base message"
+  type = bool
+  default = false
+}
+
 ########################################
 # No Healthy Instances
 ########################################
@@ -144,3 +162,9 @@ variable "no_healthy_instances_threshold_warning" {
   description = "Warning threshold (percentage)"
   type = number
 }
+
+variable "no_healthy_instances_use_message" {
+  description = "Whether to use the query alert base message"
+  type = bool
+  default = true
+}
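
Because this diff flips three `*_enabled` defaults from `false` to `true` and gives `latency_threshold_critical` a default of `3`, callers that relied on the old defaults would start creating monitors after upgrading. A caller wanting the previous behavior could pin the old values explicitly, roughly as sketched here. The module label, `source` path, and notification handle are assumptions, not taken from the commit.

```hcl
# Hypothetical caller that pins the pre-change defaults.
module "alb_monitors_pinned" {
  source = "../../aws/alb" # assumed relative path, adjust to your layout

  notify_default = ["@slack-ops-alerts"] # placeholder handle

  # These three defaulted to false before this commit.
  http_5xx_responses_enabled    = false
  http_5xx_tg_responses_enabled = false
  latency_enabled               = false
}
```

With `latency_enabled = false` the latency monitor's `count` is 0, so the new `latency_threshold_critical` default is never interpolated into a query.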
