The `significant_terms` aggregation lets you spot unusual or interesting term occurrences in a filtered subset relative to the rest of the data in an index.
The `significant_terms` aggregation identifies terms that occur unusually frequently in a subset of documents (the foreground set) compared to a broader reference set (the background set). By default, the background set consists of all documents in the target indexes; you can narrow it with `background_filter`. Use this aggregation to find the *most overrepresented* values, which a plain `terms` aggregation, returning only the *most common* values, cannot surface.
Each result bucket includes:
- `key`: The term value.
- `doc_count`: The number of foreground documents containing the term.
- `bg_count`: The number of background documents containing the term.
- `score`: Specifies how strongly the term stands out in the foreground relative to the background. For more information, see [Heuristics and scoring](#heuristics-and-scoring).
If the aggregation returns no buckets, it usually means that the foreground isn't filtered (for example, you used a `match_all` query) or the term distribution in the foreground is the same as in the background.
{: .note}
## Basic example: Identify distinctive terms in high‑value returns for an e-commerce application
Create an index that contains customer orders:
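A minimal mapping and sample dataset consistent with the example response in this section (12 orders, 5 of which are returns over $500, with `gift_card` used as payment in 3 of them) might look like the following. The field values and order totals are illustrative:

```json
PUT /retail_orders
{
  "mappings": {
    "properties": {
      "status":         { "type": "keyword" },
      "order_total":    { "type": "double" },
      "payment_method": { "type": "keyword" }
    }
  }
}
```
{% include copy-curl.html %}

Index the sample orders:

```json
POST /retail_orders/_bulk
{ "index": {} }
{ "status": "RETURNED", "order_total": 950, "payment_method": "gift_card" }
{ "index": {} }
{ "status": "RETURNED", "order_total": 720, "payment_method": "gift_card" }
{ "index": {} }
{ "status": "RETURNED", "order_total": 610, "payment_method": "gift_card" }
{ "index": {} }
{ "status": "RETURNED", "order_total": 890, "payment_method": "credit_card" }
{ "index": {} }
{ "status": "RETURNED", "order_total": 540, "payment_method": "paypal" }
{ "index": {} }
{ "status": "RETURNED", "order_total": 80, "payment_method": "credit_card" }
{ "index": {} }
{ "status": "COMPLETED", "order_total": 980, "payment_method": "credit_card" }
{ "index": {} }
{ "status": "COMPLETED", "order_total": 310, "payment_method": "credit_card" }
{ "index": {} }
{ "status": "COMPLETED", "order_total": 120, "payment_method": "paypal" }
{ "index": {} }
{ "status": "COMPLETED", "order_total": 45, "payment_method": "debit_card" }
{ "index": {} }
{ "status": "COMPLETED", "order_total": 260, "payment_method": "debit_card" }
{ "index": {} }
{ "status": "COMPLETED", "order_total": 670, "payment_method": "paypal" }
```
{% include copy-curl.html %}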
Run the following query to identify `payment_method` values that are unusually common among orders that were returned and cost over $500, compared to the entire index:
```json
GET /retail_orders/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "RETURNED" } },
        { "range": { "order_total": { "gte": 500 } } }
      ]
    }
  },
  "aggs": {
    "payment_signals": {
      "significant_terms": {
        "field": "payment_method"
      }
    }
  }
}
```
{% include copy-curl.html %}
The returned aggregation shows that among the five high-value returns, `gift_card` appears 3 times (60%), compared to 3 out of 12 times in the entire index (25%). As a result, it is flagged as the most overrepresented payment method:
```json
{
  ...
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "payment_signals": {
      "doc_count": 5,
      "bg_count": 12,
      "buckets": [
        {
          "key": "gift_card",
          "doc_count": 3,
          "score": 0.84,
          "bg_count": 3
        }
      ]
    }
  }
}
```
## Multi‑set analysis
You can determine the unusual values for each category by first grouping documents into buckets and then running a `significant_terms` aggregation within each bucket.
### Example: Unusual `cancel_reason` per region
The following example groups documents by region using a `terms` aggregation and, within each bucket, runs a `significant_terms` aggregation to identify cancellation reasons that are disproportionately common in that region:
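A sketch of such a request follows; the `region` and `cancel_reason` keyword fields are assumptions for illustration:

```json
GET /retail_orders/_search
{
  "size": 0,
  "aggs": {
    "regions": {
      "terms": { "field": "region" },
      "aggs": {
        "unusual_cancel_reasons": {
          "significant_terms": { "field": "cancel_reason" }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Each `regions` bucket acts as its own foreground set, compared against the index-wide background, so the significant terms differ per region.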
Suppose that you have a dataset of field incidents at sites across a country. Each document contains a point location `site.location` of type `geo_point` and a categorical field `issue.keyword` (for example, `POWER_OUTAGE`, `FIBER_CUT`, or `VANDALISM`). You want to identify the issue types that are overrepresented within specific map tiles compared to a broader reference set. You can use a `geotile_grid` to divide the map into zoom‑level tiles. Higher `precision` produces smaller tiles, such as street or city blocks, while lower `precision` produces larger tiles, such as a city or region. Run a `significant_terms` aggregation within each tile to identify the local outliers.
Segment the data by map tiles and identify the `issue.keyword` values that are unusually frequent in those tiles:
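One possible request is shown below; the `field_incidents` index name and the `precision` value are illustrative:

```json
GET /field_incidents/_search
{
  "size": 0,
  "aggs": {
    "tiles": {
      "geotile_grid": {
        "field": "site.location",
        "precision": 7
      },
      "aggs": {
        "local_issues": {
          "significant_terms": { "field": "issue.keyword" }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}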
## Use a `background_filter` to narrow the background set
By default, the background contains the entire index. Use a `background_filter` to restrict background documents for more precise results.
### Example: Compare Toronto to the rest of Canada
The following example filters the foreground to "Toronto" and sets a `background_filter` for "Canada". `significant_terms` highlights topics that are unusually frequent in Toronto relative to other Canadian cities:
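A sketch of such a request follows; the `news` index name and the `city`, `country`, and `topic` field names are assumptions for illustration:

```json
GET /news/_search
{
  "size": 0,
  "query": { "term": { "city": "Toronto" } },
  "aggs": {
    "toronto_topics": {
      "significant_terms": {
        "field": "topic",
        "background_filter": {
          "term": { "country": "Canada" }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Without the `background_filter`, Toronto would be compared against the whole index, so globally common Canadian topics could mask what is distinctive about Toronto specifically.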
Using a custom background requires additional processing because the background frequency for each candidate term must be computed by applying the filter. This can be slower than using the default index-wide counts.
{: .warning}
## Field type considerations
`significant_terms` aggregations work best on exact-value fields (for example, `keyword` or `numeric`). Running `significant_terms` aggregations on heavily tokenized text can be memory intensive. For analyzed text, consider using [`significant_text` aggregations]({{site.url}}{{site.baseurl}}/aggregations/bucket/significant-text/), which are designed for full-text fields and support the same significance heuristics.
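For example, to surface significant words from an analyzed field, a `significant_text` aggregation can be used in place of `significant_terms`; the `logs` index and `message` field names here are assumptions for illustration:

```json
GET /logs/_search
{
  "size": 0,
  "query": { "match": { "message": "timeout" } },
  "aggs": {
    "keywords": {
      "significant_text": { "field": "message" }
    }
  }
}
```
{% include copy-curl.html %}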
## Heuristics and scoring
The `score` ranks terms based on how much their foreground frequency differs from the background frequency. It has no units and is meaningful only for comparison within the same request and heuristic.
You can select one heuristic per request by specifying it under `significant_terms`. The following heuristics are supported.
### JLH
The Jensen–Shannon Lift Heuristic (JLH) is suitable for most general‑purpose scenarios. It balances both the absolute frequency of a term and its relative overrepresentation compared to the background set, favoring terms that increase both *absolutely* and *relatively*.
```json
"significant_terms": {
  "field": "payment_method.keyword",
  "jlh": {}
}
```
Because JLH multiplies the absolute change in a term's frequency by its relative change, terms that are both common in the foreground and strongly overrepresented relative to the background score highest. A rare term with a large relative jump but a negligible absolute change ranks lower than it would under the [Percentage](#percentage) heuristic.
#### Score example calculation using JLH
Suppose that your foreground set (high‑value returns) contains `2,000` orders, and the background set (all orders) contains `120,000` orders. Consider a single term in the `significant_terms` aggregation, which has the following counts:
- `doc_count = 160`
- `bg_count = 3,200`
Percentages of documents containing the term are calculated as follows:

- Foreground: 160 / 2,000 = 8%
- Background: 3,200 / 120,000 ≈ 2.67%

The term appears about three times more frequently among high‑value returns than in the index as a whole, producing a positive score. This positive score means that the term is notably more prevalent in high‑value returns than it is overall. Scores are relative: use them to rank terms, not as absolute probabilities.
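As an illustration, assuming the commonly documented JLH formula (the absolute change in term frequency multiplied by the relative change), the score for this term can be reproduced with a short calculation:

```python
# Sketch of the JLH formula: (fg_share - bg_share) * (fg_share / bg_share)
foreground_size = 2_000    # high-value returns
background_size = 120_000  # all orders
doc_count = 160            # foreground documents containing the term
bg_count = 3_200           # background documents containing the term

fg_share = doc_count / foreground_size   # 0.08 (8%)
bg_share = bg_count / background_size    # ~0.0267 (~2.67%)

score = (fg_share - bg_share) * (fg_share / bg_share)
print(round(score, 2))  # prints 0.16
```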
### Mutual information
254
+
255
+
Mutual information (MI) prefers frequent terms and identifies popular but still distinctive terms. Set `include_negatives: false` to ignore terms that are less common in the foreground than the background. If your background is not a superset of the foreground, set `background_is_superset: false`:
```json
"significant_terms": {
  "field": "product.keyword",
  "mutual_information": {
    "include_negatives": false,
    "background_is_superset": true
  }
}
```
### Chi‑square
Chi-square is a statistical test that measures how much the observed frequency of a term in a subset (foreground) deviates from the expected frequency based on a reference set (background). Similarly to [MI](#mutual-information), chi-square supports `include_negatives` and `background_is_superset`:
```json
"significant_terms": {
  "field": "error.keyword",
  "chi_square": { "include_negatives": false }
}
```
### Google Normalized Distance
Google Normalized Distance (GND) favors strong co‑occurrence. It is useful for synonym discovery or items that tend to appear together:
```json
"significant_terms": {
  "field": "tag.keyword",
  "gnd": {}
}
```
### Percentage
Percentage scores each term as the ratio of its foreground hits to its background hits (`doc_count`/`bg_count`). It doesn't account for the overall sizes of the two sets, so very rare terms can dominate:
```json
"significant_terms": {
  "field": "sku.keyword",
  "percentage": {}
}
```
### Scripted heuristic
To provide a custom heuristic formula, use the following variables:
- `_subset_freq`: The number of documents containing the term in the foreground set.
- `_superset_freq`: The number of documents containing the term in the background set.
- `_subset_size`: The total number of documents in the foreground set.
- `_superset_size`: The total number of documents in the background set.
The following request runs a `significant_terms` aggregation on `field.keyword` using a custom script heuristic to score terms based on their frequency in the foreground relative to the background:
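A sketch of such a request follows; the script shown, which rewards terms that are frequent in the foreground but rare elsewhere, is only one possible formula:

```json
GET /retail_orders/_search
{
  "size": 0,
  "query": { "term": { "status": "RETURNED" } },
  "aggs": {
    "custom_significance": {
      "significant_terms": {
        "field": "field.keyword",
        "script_heuristic": {
          "script": {
            "lang": "painless",
            "source": "params._subset_freq / (params._superset_freq - params._subset_freq + 1)"
          }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}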