Commit 4ead5cb

AntonEliatra, kolchfa-aws, and natebower authored
expanding on significant terms (#11178)
* expanding on significant terms Signed-off-by: Anton Rubin <[email protected]>
* fixing vale errors Signed-off-by: Anton Rubin <[email protected]>
* addressing PR comments Signed-off-by: Anton Rubin <[email protected]>
* Apply suggestions from code review Co-authored-by: kolchfa-aws <[email protected]> Signed-off-by: AntonEliatra <[email protected]>
* addressing PR comments Signed-off-by: Anton Rubin <[email protected]>
* Apply suggestions from code review Co-authored-by: kolchfa-aws <[email protected]> Signed-off-by: AntonEliatra <[email protected]>
* Apply suggestions from code review Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: AntonEliatra <[email protected]>
* addressing PR comments Signed-off-by: Anton Rubin <[email protected]>
---------
Signed-off-by: Anton Rubin <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
1 parent 9b930ce commit 4ead5cb

1 file changed

+284
-32
lines changed

_aggregations/bucket/significant-terms.md

Lines changed: 284 additions & 32 deletions
@@ -9,63 +9,315 @@ redirect_from:
99

1010
# Significant terms aggregations
1111

12-
The `significant_terms` aggregation lets you spot unusual or interesting term occurrences in a filtered subset relative to the rest of the data in an index.
12+
The `significant_terms` aggregation identifies terms that occur unusually frequently in a subset of documents (the foreground set) compared to a broader reference set (the background set). By default, the background set includes all documents in the target indexes. You can narrow it using `background_filter`. Use this aggregation when you need the *most overrepresented* values rather than the *most common* values that a plain `terms` aggregation returns.
1313

14-
A foreground set is the set of documents that you filter. A background set is a set of all documents in an index.
15-
The `significant_terms` aggregation examines all documents in the foreground set and finds a score for significant occurrences in contrast to the documents in the background set.
14+
Each result bucket includes:
1615

17-
In the sample web log data, each document has a field containing the `user-agent` of the visitor. This example searches for all requests from an iOS operating system. A regular `terms` aggregation on this foreground set returns Firefox because it has the most number of documents within this bucket. On the other hand, a `significant_terms` aggregation returns Internet Explorer (IE) because IE has a significantly higher appearance in the foreground set as compared to the background set.
16+
- `key`: The term value.
17+
- `doc_count`: The number of foreground documents containing the term.
18+
- `bg_count`: The number of background documents containing the term.
19+
- `score`: A measure of how strongly the term stands out in the foreground relative to the background. For more information, see [Heuristics and scoring](#heuristics-and-scoring).
20+
21+
If the aggregation returns no buckets, it usually means that the foreground isn't filtered (for example, you used a `match_all` query) or the term distribution in the foreground is the same as in the background.
22+
{: .note}
23+
24+
## Basic example: Identify distinctive terms in high‑value returns for an e-commerce application
25+
26+
Create an index that contains customer orders:
1827

1928
```json
20-
GET opensearch_dashboards_sample_data_logs/_search
29+
PUT /retail_orders
30+
{
31+
"mappings": {
32+
"properties": {
33+
"status": { "type": "keyword" },
34+
"order_total": { "type": "double" },
35+
"payment_method": { "type": "keyword" }
36+
}
37+
}
38+
}
39+
```
40+
{% include copy-curl.html %}
41+
42+
Ingest sample documents into the index:
43+
44+
```json
45+
POST _bulk
46+
{ "index": { "_index": "retail_orders" } }
47+
{ "status":"RETURNED", "order_total": 950, "payment_method":"gift_card" }
48+
{ "index": { "_index": "retail_orders" } }
49+
{ "status":"RETURNED", "order_total": 720, "payment_method":"gift_card" }
50+
{ "index": { "_index": "retail_orders" } }
51+
{ "status":"RETURNED", "order_total": 540, "payment_method":"gift_card" }
52+
{ "index": { "_index": "retail_orders" } }
53+
{ "status":"RETURNED", "order_total": 820, "payment_method":"credit_card" }
54+
{ "index": { "_index": "retail_orders" } }
55+
{ "status":"RETURNED", "order_total": 500, "payment_method":"paypal" }
56+
{ "index": { "_index": "retail_orders" } }
57+
{ "status":"DELIVERED", "order_total": 130, "payment_method":"credit_card" }
58+
{ "index": { "_index": "retail_orders" } }
59+
{ "status":"DELIVERED", "order_total": 75, "payment_method":"paypal" }
60+
{ "index": { "_index": "retail_orders" } }
61+
{ "status":"DELIVERED", "order_total": 260, "payment_method":"paypal" }
62+
{ "index": { "_index": "retail_orders" } }
63+
{ "status":"DELIVERED", "order_total": 45, "payment_method":"credit_card" }
64+
{ "index": { "_index": "retail_orders" } }
65+
{ "status":"DELIVERED", "order_total": 310, "payment_method":"credit_card" }
66+
{ "index": { "_index": "retail_orders" } }
67+
{ "status":"DELIVERED", "order_total": 220, "payment_method":"credit_card" }
68+
{ "index": { "_index": "retail_orders" } }
69+
{ "status":"DELIVERED", "order_total": 410, "payment_method":"paypal" }
70+
```
71+
{% include copy-curl.html %}
72+
73+
Run the following query to identify `payment_method` values that are unusually common among orders that were returned and cost $500 or more, compared to the entire index:
74+
75+
```json
76+
GET /retail_orders/_search
2177
{
2278
"size": 0,
2379
"query": {
24-
"terms": {
25-
"machine.os.keyword": [
26-
"ios"
80+
"bool": {
81+
"filter": [
82+
{ "term": { "status": "RETURNED" } },
83+
{ "range": { "order_total": { "gte": 500 } } }
2784
]
2885
}
2986
},
3087
"aggs": {
31-
"significant_response_codes": {
88+
"payment_signals": {
3289
"significant_terms": {
33-
"field": "agent.keyword"
90+
"field": "payment_method"
3491
}
3592
}
3693
}
3794
}
3895
```
3996
{% include copy-curl.html %}
4097

41-
#### Example response
98+
The returned aggregation shows that `gift_card` appears in 3 of the 5 high-value returns (60%) but in only 3 of the 12 documents in the entire index (25%). As a result, it is flagged as the most overrepresented payment method:
4299

43100
```json
44-
...
45-
"aggregations" : {
46-
"significant_response_codes" : {
47-
"doc_count" : 2737,
48-
"bg_count" : 14074,
49-
"buckets" : [
50-
{
51-
"key" : "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)",
52-
"doc_count" : 818,
53-
"score" : 0.01462731514608217,
54-
"bg_count" : 4010
55-
},
56-
{
57-
"key" : "Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1",
58-
"doc_count" : 1067,
59-
"score" : 0.009062566630410223,
60-
"bg_count" : 5362
101+
{
102+
...
103+
"hits": {
104+
"total": {
105+
"value": 5,
106+
"relation": "eq"
107+
},
108+
"max_score": null,
109+
"hits": []
110+
},
111+
"aggregations": {
112+
"payment_signals": {
113+
"doc_count": 5,
114+
"bg_count": 12,
115+
"buckets": [
116+
{
117+
"key": "gift_card",
118+
"doc_count": 3,
119+
"score": 0.84,
120+
"bg_count": 3
121+
}
122+
]
123+
}
124+
}
125+
}
126+
```
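
The counts in this response can be checked by hand. The following Python sketch is only an illustration, not how OpenSearch computes the aggregation: it copies the twelve sample orders from the bulk request, applies the same filter as the query, and tallies foreground and background counts per payment method. The names `ORDERS`, `term_counts`, and `overrepresented` are made up for this example.

```python
from collections import Counter

# Sample orders copied from the bulk request: (status, order_total, payment_method).
ORDERS = [
    ("RETURNED", 950, "gift_card"),
    ("RETURNED", 720, "gift_card"),
    ("RETURNED", 540, "gift_card"),
    ("RETURNED", 820, "credit_card"),
    ("RETURNED", 500, "paypal"),
    ("DELIVERED", 130, "credit_card"),
    ("DELIVERED", 75, "paypal"),
    ("DELIVERED", 260, "paypal"),
    ("DELIVERED", 45, "credit_card"),
    ("DELIVERED", 310, "credit_card"),
    ("DELIVERED", 220, "credit_card"),
    ("DELIVERED", 410, "paypal"),
]


def term_counts(orders):
    """Count how many orders use each payment method."""
    return Counter(method for _status, _total, method in orders)


# Foreground: same filter as the query (status RETURNED and order_total >= 500).
foreground = [o for o in ORDERS if o[0] == "RETURNED" and o[1] >= 500]
fg_counts = term_counts(foreground)   # per-term doc_count
bg_counts = term_counts(ORDERS)       # per-term bg_count (whole index)

# A term is overrepresented when its foreground share exceeds its background share.
overrepresented = [
    term
    for term in bg_counts
    if fg_counts[term] / len(foreground) > bg_counts[term] / len(ORDERS)
]

for term in sorted(bg_counts):
    print(term, fg_counts[term], bg_counts[term])
print("overrepresented:", overrepresented)
```

Only `gift_card` has a larger share in the foreground than in the background, which is consistent with it being the single bucket in the response.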
127+
128+
## Multi‑set analysis
129+
130+
You can determine the unusual values for each category by first grouping documents into buckets and then running a `significant_terms` aggregation within each bucket.
131+
132+
### Example: Unusual `cancel_reason` per region
133+
134+
The following example groups documents by region using a `terms` aggregation and runs `significant_terms` within each bucket to identify cancellation reasons that are disproportionately common in that region:
135+
136+
```json
137+
GET /rides/_search
138+
{
139+
"size": 0,
140+
"aggs": {
141+
"by_region": {
142+
"terms": { "field": "region.keyword", "size": 5 },
143+
"aggs": {
144+
"odd_cancellations": {
145+
"significant_terms": { "field": "cancel_reason.keyword" }
146+
}
147+
}
148+
}
149+
}
150+
}
151+
```
152+
{% include copy-curl.html %}
153+
154+
### Example: Hotspots on a map
155+
156+
Suppose that you have a dataset of field incidents at sites across a country. Each document contains a point location `site.location` of type `geo_point` and a categorical field `issue.keyword` (for example, `POWER_OUTAGE`, `FIBER_CUT`, or `VANDALISM`). You want to identify the issue types that are overrepresented within specific map tiles compared to a broader reference set. You can use a `geotile_grid` to divide the map into zoom‑level tiles. Higher `precision` produces smaller tiles, such as street or city blocks, while lower `precision` produces larger tiles, such as a city or region. Run a `significant_terms` aggregation within each tile to identify the local outliers.
157+
158+
Segment the data by map tiles and identify the `issue.keyword` values that are unusually frequent in those tiles:
159+
160+
```json
161+
GET field_ops/_search
162+
{
163+
"size": 0,
164+
"aggs": {
165+
"tiles": {
166+
"geotile_grid": { "field": "site.location", "precision": 6 },
167+
"aggs": {
168+
"odd_issues": {
169+
"significant_terms": { "field": "issue.keyword" }
170+
}
171+
}
172+
}
173+
}
174+
}
175+
```
176+
{% include copy-curl.html %}
177+
178+
## Use a `background_filter` to narrow the background set
179+
180+
By default, the background contains the entire index. Use a `background_filter` to restrict background documents for more precise results.
181+
182+
### Example: Compare Toronto to the rest of Canada
183+
184+
The following example filters the foreground to "Toronto" and sets a `background_filter` for "Canada". The `significant_terms` aggregation highlights topics that are unusually frequent in Toronto relative to all Canadian documents (the background still includes Toronto):
185+
186+
```json
187+
GET /news/_search
188+
{
189+
"size": 0,
190+
"query": { "term": { "city.keyword": "Toronto" } },
191+
"aggs": {
192+
"unusual_topics": {
193+
"significant_terms": {
194+
"field": "topic.keyword",
195+
"background_filter": {
196+
"term": { "country.keyword": "Canada" }
197+
}
61198
}
62-
]
199+
}
63200
}
64-
}
201+
}
202+
```
203+
{% include copy-curl.html %}
204+
205+
Using a custom background requires additional processing because the background frequency for each candidate term must be computed by applying the filter. This can be slower than using the default index-wide counts.
206+
{: .warning}
207+
208+
## Field type considerations
209+
210+
`significant_terms` aggregations work best on exact-value fields (for example, `keyword` or numeric fields). Running `significant_terms` aggregations on heavily tokenized text can be memory intensive. For analyzed text, consider using [`significant_text` aggregations]({{site.url}}{{site.baseurl}}/aggregations/bucket/significant-text/), which are designed for full-text fields and support the same significance heuristics.
211+
212+
## Heuristics and scoring
213+
214+
The `score` ranks terms based on how much their foreground frequency differs from the background frequency. It has no units and is meaningful only for comparison within the same request and heuristic.
215+
216+
You can select one heuristic per request by specifying it under `significant_terms`. The following heuristics are supported.
217+
218+
### JLH
219+
220+
The Jensen–Shannon Lift Heuristic (JLH) is suitable for most general‑purpose scenarios. It balances both the absolute frequency of a term and its relative overrepresentation compared to the background set, favoring terms that increase both *absolutely* and *relatively*.
221+
222+
```json
223+
"significant_terms": {
224+
"field": "payment_method",
225+
"jlh": {}
65226
}
66227
```
67228

68-
If the `significant_terms` aggregation doesn't return any result, you might have not filtered the results with a query. Alternatively, the distribution of terms in the foreground set might be the same as the background set, implying that there isn't anything unusual in the foreground set.
229+
#### JLH scoring
230+
231+
The JLH score is calculated as follows:
232+
233+
`fg_pct = doc_count / foreground_total` and `bg_pct = bg_count / background_total`. JLH ≈ `(fg_pct − bg_pct) * (fg_pct / bg_pct)`.
234+
235+
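Because the lift factor `fg_pct / bg_pct` grows as the background share shrinks, a term with a given absolute increase from a very small background share scores higher than a term with the same absolute increase from a large baseline. For example, a rise from 1% to 6% yields `0.05 * 6 = 0.3`, while a rise from 50% to 55% yields `0.05 * 1.1 = 0.055`.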
236+
237+
#### Score example calculation using JLH
238+
239+
Suppose that your foreground set (high‑value returns) contains `2,000` orders, and the background set (all orders) contains `120,000` orders. Consider a single term in the `significant_terms` aggregation, which has the following counts:
240+
241+
- `doc_count = 160`
242+
- `bg_count = 3,200`
243+
244+
Percentages of documents containing the term are calculated as follows:
245+
246+
- `fg_pct = 160 / 2000 = 0.08`
247+
- `bg_pct = 3200 / 120000 ≈ 0.026666…`
248+
249+
JLH ≈ `(0.08 − 0.026666…) * (0.08 / 0.026666…) ≈ 0.053333… * 3 ≈ 0.16`
250+
251+
This positive score means that the searched term is notably more prevalent in high‑value returns than it is overall. Scores are relative: use them to rank terms, not as absolute probabilities.
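
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch of the approximate formula given in this section, not the OpenSearch implementation; the function name `jlh_score` and its parameter names are assumptions chosen for this example.

```python
def jlh_score(doc_count, foreground_total, bg_count, background_total):
    """Approximate JLH: (fg_pct - bg_pct) * (fg_pct / bg_pct)."""
    if bg_count == 0:
        # The ratio fg_pct / bg_pct is undefined without background hits.
        raise ValueError("bg_count must be greater than 0")
    fg_pct = doc_count / foreground_total
    bg_pct = bg_count / background_total
    return (fg_pct - bg_pct) * (fg_pct / bg_pct)


# Worked example from this section: 160 of 2,000 versus 3,200 of 120,000.
worked = jlh_score(160, 2000, 3200, 120000)

# Basic example: gift_card appears in 3 of 5 foreground and 3 of 12 background documents.
basic = jlh_score(3, 5, 3, 12)

print(round(worked, 2))  # 0.16
print(round(basic, 2))   # 0.84
```

The second value matches the `score` of `0.84` returned for `gift_card` in the basic example.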
252+
253+
### Mutual information
254+
255+
Mutual information (MI) prefers frequent terms and identifies popular but still distinctive terms. Set `include_negatives: false` to ignore terms that are less common in the foreground than the background. If your background is not a superset of the foreground, set `background_is_superset: false`:
256+
257+
```json
258+
"significant_terms": {
259+
"field": "product.keyword",
260+
"mutual_information": {
261+
"include_negatives": false,
262+
"background_is_superset": true
263+
}
264+
}
265+
```
266+
267+
### Chi‑square
268+
269+
Chi-square is a statistical test that measures how much the observed frequency of a term in a subset (foreground) deviates from the expected frequency based on a reference set (background). Similarly to [MI](#mutual-information), chi-square supports `include_negatives` and `background_is_superset`:
270+
271+
```json
272+
"significant_terms": {
273+
"field": "error.keyword",
274+
"chi_square": { "include_negatives": false }
275+
}
276+
```
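
To get a feel for what a chi-square statistic measures, the following Python sketch computes the textbook Pearson chi-square for a 2x2 table (term present or absent, document inside or outside the foreground). It assumes that the background is a superset of the foreground. It is an illustration of the general statistic only: the exact formula and normalization that OpenSearch applies may differ, and the function name `chi_square_2x2` is made up for this example.

```python
def chi_square_2x2(doc_count, foreground_total, bg_count, background_total):
    """Textbook Pearson chi-square for term presence vs. foreground membership.

    Assumes the background set contains the foreground set.
    """
    a = doc_count                                # foreground docs with the term
    b = bg_count - doc_count                     # other docs with the term
    c = foreground_total - doc_count             # foreground docs without the term
    d = (background_total - foreground_total) - b  # other docs without the term
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Same share in the foreground (10/20) and in the background (100/200): no deviation.
same_share = chi_square_2x2(10, 20, 100, 200)

# Counts from the JLH worked example: 8% in the foreground vs. about 2.7% overall.
skewed = chi_square_2x2(160, 2000, 3200, 120000)

print(same_share, skewed > same_share)
```

When the term's share is identical in both sets, the statistic is zero. The further the observed foreground count is from the count expected based on the background, the larger the value.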
277+
278+
### Google Normalized Distance
279+
280+
Google Normalized Distance (GND) favors strong co‑occurrence. It is useful for synonym discovery or items that tend to appear together:
281+
282+
```json
283+
"significant_terms": {
284+
"field": "tag.keyword",
285+
"gnd": {}
286+
}
287+
```
288+
289+
### Percentage
290+
291+
Percentage sorts terms by the `doc_count`/`bg_count` ratio and identifies the number of foreground hits that a term has relative to its background hits. It doesn't account for the overall sizes of the two sets, so very rare terms can dominate:
292+
293+
```json
294+
"significant_terms": {
295+
"field": "sku.keyword",
296+
"percentage": {}
297+
}
298+
```
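
The following Python sketch illustrates why rare terms can dominate under this heuristic. It ranks two hypothetical terms by the `doc_count`/`bg_count` ratio described above. The term names and counts are invented for illustration, and `percentage_score` is not an OpenSearch function.

```python
def percentage_score(doc_count, bg_count):
    """Ratio of foreground hits to background hits for one term."""
    return doc_count / bg_count


# Hypothetical (doc_count, bg_count) pairs, invented for illustration.
terms = {
    "common_sku": (400, 10000),  # many foreground hits, but a low ratio (0.04)
    "rare_sku": (2, 2),          # only two documents, all in the foreground (1.0)
}

ranked = sorted(terms, key=lambda t: percentage_score(*terms[t]), reverse=True)
print(ranked)  # ['rare_sku', 'common_sku']
```

A term that appears only twice outranks one with 400 foreground hits, so it may be worth pairing this heuristic with a minimum document count when rare terms are not of interest.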
299+
300+
### Scripted heuristic
301+
302+
To provide a custom heuristic formula, use the following variables:
303+
304+
- `_subset_freq`: The number of documents containing the term in the foreground set.
305+
- `_superset_freq`: The number of documents containing the term in the background set.
306+
- `_subset_size`: The total number of documents in the foreground set.
307+
- `_superset_size`: The total number of documents in the background set.
308+
309+
The following request runs a `significant_terms` aggregation on `field.keyword` using a custom script heuristic to score terms based on their frequency in the foreground relative to the background:
310+
311+
```json
312+
"significant_terms": {
313+
"field": "field.keyword",
314+
"script_heuristic": {
315+
"script": {
316+
"lang": "painless",
317+
"source": "params._subset_freq / (params._superset_freq - params._subset_freq + 1)"
318+
}
319+
}
320+
}
321+
```
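
To see how the script's formula behaves before using it in a request, you can evaluate the same arithmetic offline. The following Python sketch mirrors the expression using floating-point division. It does not model Painless numeric semantics, and the function name `script_score` is an assumption made for this example.

```python
def script_score(subset_freq, superset_freq, subset_size, superset_size):
    """Mirror of the example script: subset_freq / (superset_freq - subset_freq + 1).

    subset_size and superset_size are accepted to match the four available
    variables, but this particular formula does not use them.
    """
    return subset_freq / (superset_freq - subset_freq + 1)


# Counts from the JLH worked example: 160 foreground hits, 3,200 background hits.
frequent = script_score(160, 3200, 2000, 120000)

# gift_card in the basic example: all 3 background hits are in the foreground.
exclusive = script_score(3, 3, 5, 12)

print(frequent, exclusive)
```

The `+ 1` in the denominator prevents division by zero when every background occurrence of a term falls in the foreground, as in the `gift_card` case.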
69322

70-
The default source of statistical information for background term frequencies is the entire index. You can narrow this scope with a background filter for more focus
71323
