MAIN-2893: prometheus + graph generating capability #295

Open · wants to merge 15 commits into base: master
136 changes: 136 additions & 0 deletions FEATURES.md
@@ -0,0 +1,136 @@

# Features

This page documents HolmesGPT's features and describes its behaviour for each of them.


## Root Cause Analysis

Also called Investigation, Root Cause Analysis (RCA) is HolmesGPT's ability to investigate alerts,
typically from Prometheus' Alertmanager.

### Sectioned output

HolmesGPT generates structured output by default. It can also generate custom sections on request.

Here is an example of a request payload to run an investigation:

```json
{
  "source": "prometheus",
  "source_instance_id": "some-instance",
  "title": "Pod is crash looping.",
  "description": "Pod default/oomkill-deployment-696dbdbf67-d47z6 (main2) is in waiting state (reason: 'CrashLoopBackOff').",
  "subject": {
    "name": "oomkill-deployment-696dbdbf67-d47z6",
    "subject_type": "deployment",
    "namespace": "default",
    "node": "some-node",
    "container": "main2",
    "labels": {
      "x": "y",
      "p": "q"
    },
    "annotations": {}
  },
  "context": {
    "robusta_issue_id": "5b3e2fb1-cb83-45ea-82ec-318c94718e44"
  },
  "include_tool_calls": true,
  "include_tool_call_results": true,
  "sections": {
    "Alert Explanation": "1-2 sentences explaining the alert itself - note don't say \"The alert indicates a warning event related to a Kubernetes pod doing blah\" rather just say \"The pod XYZ did blah\" because that is what the user actually cares about",
    "Conclusions and Possible Root causes": "What conclusions can you reach based on the data you found? what are possible root causes (if you have enough conviction to say) or what uncertainty remains. Don't say root cause but 'possible root causes'. Be clear to distinguish between what you know for certain and what is a possible explanation",
    "Related logs": "Truncate and share the most relevant logs, especially if these explain the root cause. For example: \nLogs from pod robusta-holmes:\n```\n<logs>```\n. Always embed the surrounding +/- 5 log lines to any relevant logs."
  }
}
```

Notice that the "sections" field contains three different sections. The text value for each section should be a prompt telling the LLM what that section should contain.
You can then expect the following in return:

```
{
  "analysis": <monolithic text response. Contains all the sections aggregated together>,
  "sections": {
    "Alert Explanation": <A markdown text with the explanation of the alert>,
    "Conclusions and Possible Root causes": <Conclusions reached by the LLM>,
    "Related logs": <Any related logs the LLM could find through tools>
  },
  "tool_calls": <tool calls>,
  "instructions": <Specific instructions used for this investigation>
}
```

In some cases, the LLM may set a section to `null`, add extra sections, or ignore some of the requested ones.
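Because sections can be null, missing, or extra, client code should iterate over them defensively. A minimal sketch, assuming the response above has already been parsed into a Python dict (the values below are illustrative, not real HolmesGPT output):

```python
# Walk the "sections" of a parsed investigation response, skipping nulls.
response = {
    "analysis": "aggregated text",
    "sections": {
        "Alert Explanation": "The pod was OOM-killed.",
        "Conclusions and Possible Root causes": None,  # the LLM may null a section
        "Related logs": "logs from pod robusta-holmes",
    },
}

# Keep only the sections the LLM actually filled in.
filled = {
    title: body
    for title, body in (response.get("sections") or {}).items()
    if body
}
print(sorted(filled))  # → ['Alert Explanation', 'Related logs']
```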


## PromQL

If the `prometheus/metrics` toolset is enabled, HolmesGPT can generate and embed graphs in conversations (ask holmes).

For example, here is a scenario in which the LLM answers with a graph:


User question:

```
Show me the http request latency over time for the service customer-orders-service?
```


HolmesGPT text response:
```
Here's the average HTTP request latency over time for the `customer-orders-service`:

<< {type: "promql", tool_name: "execute_prometheus_range_query", random_key: "9kLK"} >>
```
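Post-processing needs to locate these `<< {...} >>` markers in the text. The markers are not strict JSON (keys are unquoted), so a hypothetical parser might extract the fields with regular expressions; a sketch:

```python
import re

# Find << {...} >> embed markers, then pull out their key: "value" pairs.
EMBED_RE = re.compile(r"<<\s*\{(.*?)\}\s*>>")
FIELD_RE = re.compile(r'(\w+):\s*"([^"]*)"')

def extract_embeds(text: str):
    """Return one dict per << {...} >> marker found in `text`."""
    return [dict(FIELD_RE.findall(body)) for body in EMBED_RE.findall(text)]

answer = (
    "Here's the latency graph:\n"
    '<< {type: "promql", tool_name: "execute_prometheus_range_query", random_key: "9kLK"} >>'
)
print(extract_embeds(answer))
# → [{'type': 'promql', 'tool_name': 'execute_prometheus_range_query', 'random_key': '9kLK'}]
```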

In addition to this text response, the returned JSON contains one or more tool calls, including the Prometheus query:

```json
"tool_calls": [
  {
    "tool_call_id": "call_lKI7CQW6Y2n1ZQ5dlxX79TcM",
    "tool_name": "execute_prometheus_range_query",
    "description": "Prometheus query_range. query=rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m]), start=1739705559, end=1739791959, step=300, description=HTTP request latency for customer-orders-service",
    "result": "{\n \"status\": \"success\",\n \"random_key\": \"9kLK\",\n \"tool_name\": \"execute_prometheus_range_query\",\n \"description\": \"Average HTTP request latency for customer-orders-service\",\n \"query\": \"rate(http_request_duration_seconds_sum{service=\\\"customer-orders-service\\\"}[5m]) / rate(http_request_duration_seconds_count{service=\\\"customer-orders-service\\\"}[5m])\",\n \"start\": \"1739705559\",\n \"end\": \"1739791959\",\n \"step\": 60\n}"
  }
],
```

The result of this tool call contains the details of the [Prometheus range query](https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries) used to build the graph returned by HolmesGPT:

```json
{
  "status": "success",
  "random_key": "9kLK",
  "tool_name": "execute_prometheus_range_query",
  "description": "Average HTTP request latency for customer-orders-service",
  "query": "rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m])",
  "start": "1739705559", // Can be rfc3339 or a unix timestamp
  "end": "1739791959", // Can be rfc3339 or a unix timestamp
  "step": 60 // Query resolution step width in seconds
}
```
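These fields map directly onto the parameters of Prometheus' `/api/v1/query_range` HTTP endpoint, so a client can re-run the query to render the chart. A sketch that only builds the request without sending it (the Prometheus base URL is an assumed placeholder):

```python
# Map a range-query tool result onto /api/v1/query_range request parameters.
PROMETHEUS_URL = "http://prometheus:9090"  # placeholder address, an assumption

def build_range_query_request(tool_result: dict):
    """Return the (url, params) pair for a Prometheus range query."""
    return (
        f"{PROMETHEUS_URL}/api/v1/query_range",
        {
            "query": tool_result["query"],
            "start": tool_result["start"],  # rfc3339 or unix timestamp
            "end": tool_result["end"],      # rfc3339 or unix timestamp
            "step": tool_result["step"],    # resolution step, in seconds
        },
    )

tool_result = {
    "query": "up",  # stand-in for the longer latency query above
    "start": "1739705559",
    "end": "1739791959",
    "step": 60,
}
url, params = build_range_query_request(tool_result)
print(url)  # → http://prometheus:9090/api/v1/query_range
```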

In addition to `execute_prometheus_range_query`, HolmesGPT can produce similar results with `execute_prometheus_instant_query`, which runs an [instant query](https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries):

```
Here's the average HTTP request latency over time for the `customer-orders-service`:

<< {type: "promql", tool_name: "execute_prometheus_instant_query", random_key: "9kLK"} >>
```

```json
{
  "status": "success",
  "random_key": "2KiL",
  "tool_name": "execute_prometheus_instant_query",
  "description": "Average HTTP request latency for customer-orders-service",
  "query": "rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m])"
}
```

Unlike the range query, the instant query result lacks the `start`, `end` and `step` arguments.
51 changes: 51 additions & 0 deletions holmes/core/openai_formatting.py
@@ -0,0 +1,51 @@
import re

# parses both simple types: "int", "array", "string"
# but also arrays of those simpler types: "array[int]", "array[string]", etc.
# (non-capturing outer group so ^ and $ anchor the whole string)
pattern = r"^(?:array\[(?P<inner_type>\w+)\]|(?P<simple_type>\w+))$"


def type_to_open_ai_schema(type_value):
    match = re.match(pattern, type_value.strip())

    if not match:
        raise ValueError(f"Invalid type format: {type_value}")

    if match.group("inner_type"):
        return {"type": "array", "items": {"type": match.group("inner_type")}}

    else:
        return {"type": match.group("simple_type")}


def format_tool_to_open_ai_standard(
    tool_name: str, tool_description: str, tool_parameters: dict
):
    tool_properties = {}
    for param_name, param_attributes in tool_parameters.items():
        tool_properties[param_name] = type_to_open_ai_schema(param_attributes.type)
        if param_attributes.description is not None:
            tool_properties[param_name]["description"] = param_attributes.description

    result = {
        "type": "function",
        "function": {
            "name": tool_name,
            "description": tool_description,
            "parameters": {
                "properties": tool_properties,
                "required": [
                    param_name
                    for param_name, param_attributes in tool_parameters.items()
                    if param_attributes.required
                ],
                "type": "object",
            },
        },
    }

    # gemini doesn't accept a "parameters" object for tools without params;
    # tool_properties is a dict, so test emptiness rather than None
    if not tool_properties:
        result["function"].pop("parameters")

    return result
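The new type parser can be exercised in isolation. A standalone mirror of `type_to_open_ai_schema` for illustration (reproduced here so the snippet runs without the `holmes` package):

```python
import re

# Standalone mirror of type_to_open_ai_schema for illustration.
# The non-capturing group keeps ^ and $ anchoring the whole string.
pattern = r"^(?:array\[(?P<inner_type>\w+)\]|(?P<simple_type>\w+))$"

def type_to_open_ai_schema(type_value: str) -> dict:
    match = re.match(pattern, type_value.strip())
    if not match:
        raise ValueError(f"Invalid type format: {type_value}")
    if match.group("inner_type"):
        return {"type": "array", "items": {"type": match.group("inner_type")}}
    return {"type": match.group("simple_type")}

print(type_to_open_ai_schema("array[string]"))  # → {'type': 'array', 'items': {'type': 'string'}}
print(type_to_open_ai_schema("int"))  # → {'type': 'int'}
```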
33 changes: 22 additions & 11 deletions holmes/core/performance_timing.py
@@ -49,14 +49,25 @@ def end(self):
)


def log_function_timing(func):
    @wraps(func)
    def function_timing_wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        total_time = int((end_time - start_time) * 1000)
        logging.info(f'Function "{func.__name__}()" took {total_time}ms')
        return result

    return function_timing_wrapper
def log_function_timing(label=None):
    def decorator(func):
        @wraps(func)
        def function_timing_wrapper(*args, **kwargs):
            start_time = time.perf_counter()
            result = func(*args, **kwargs)
            end_time = time.perf_counter()
            total_time = int((end_time - start_time) * 1000)

            function_identifier = (
                f'"{label}: {func.__name__}()"' if label else f'"{func.__name__}()"'
            )
            logging.info(f"Function {function_identifier} took {total_time}ms")
            return result

        return function_timing_wrapper

    if callable(label):
        func = label
        label = None
        return decorator(func)
    return decorator
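The reworked decorator supports both the bare and the labeled form. A standalone copy with example usage (reproduced here for illustration; the log line goes to `logging`, not stdout):

```python
import logging
import time
from functools import wraps

# Standalone copy of the updated decorator: works both as
# @log_function_timing (bare) and @log_function_timing("label").
def log_function_timing(label=None):
    def decorator(func):
        @wraps(func)
        def function_timing_wrapper(*args, **kwargs):
            start_time = time.perf_counter()
            result = func(*args, **kwargs)
            total_time = int((time.perf_counter() - start_time) * 1000)
            function_identifier = (
                f'"{label}: {func.__name__}()"' if label else f'"{func.__name__}()"'
            )
            logging.info(f"Function {function_identifier} took {total_time}ms")
            return result

        return function_timing_wrapper

    # Bare usage passes the decorated function itself in as `label`.
    if callable(label):
        func = label
        label = None
        return decorator(func)
    return decorator

@log_function_timing
def plain():
    return 1

@log_function_timing("investigation")
def labeled():
    return 2

print(plain() + labeled())  # → 3
```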
39 changes: 8 additions & 31 deletions holmes/core/tools.py
@@ -18,6 +18,8 @@
model_validator,
)

from holmes.core.openai_formatting import format_tool_to_open_ai_standard


ToolsetPattern = Union[Literal["*"], List[str]]

@@ -80,36 +82,11 @@ class Tool(ABC, BaseModel):
    additional_instructions: Optional[str] = None

    def get_openai_format(self):
        tool_properties = {}
        for param_name, param_attributes in self.parameters.items():
            tool_properties[param_name] = {"type": param_attributes.type}
            if param_attributes.description is not None:
                tool_properties[param_name]["description"] = (
                    param_attributes.description
                )

        result = {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": {
                    "properties": tool_properties,
                    "required": [
                        param_name
                        for param_name, param_attributes in self.parameters.items()
                        if param_attributes.required
                    ],
                    "type": "object",
                },
            },
        }

        # gemini doesnt have parameters object if it is without params
        if tool_properties is None:
            result["function"].pop("parameters")

        return result
        return format_tool_to_open_ai_standard(
            tool_name=self.name,
            tool_description=self.description,
            tool_parameters=self.parameters,
        )

    @abstractmethod
    def invoke(self, params: Dict) -> str:
@@ -415,7 +392,7 @@ def invoke(self, tool_name: str, params: Dict) -> str:
        tool = self.get_tool_by_name(tool_name)
        return tool.invoke(params) if tool else ""

    def get_tool_by_name(self, name: str) -> Optional[YAMLTool]:
    def get_tool_by_name(self, name: str) -> Optional[Tool]:
        if name in self.tools_by_name:
            return self.tools_by_name[name]
        logging.warning(f"could not find tool {name}. skipping")
19 changes: 19 additions & 0 deletions holmes/plugins/prompts/generic_ask_conversation.jinja2
@@ -8,6 +8,25 @@ Use conversation history to maintain continuity when appropriate, ensuring effic

{% include '_general_instructions.jinja2' %}

Prometheus/PromQL queries
* Use prometheus to execute promql queries with the tools `execute_prometheus_instant_query` and `execute_prometheus_range_query`
* ALWAYS embed the execution results into your answer
* You only need to embed the partial result in your response. Include the `tool_name` and `random_key`. For example: << {type: "promql", tool_name: "execute_prometheus_range_query", random_key: "92jf2hf"} >>
* Use these tools to generate charts that users can see. Here are standard metrics but you can use different ones:
** For memory consumption: `container_memory_working_set_bytes`
** For CPU usage: `container_cpu_usage_seconds_total`
** For CPU throttling: `container_cpu_cfs_throttled_periods_total`
** For latencies, prefer using `<metric>_sum` / `<metric>_count` over a sliding window
** Avoid using `<metric>_bucket` unless you know the bucket's boundaries are configured correctly
** Prefer individual averages like `rate(<metric>_sum) / rate(<metric>_count)`
** Avoid global averages like `sum(rate(<metric>_sum)) / sum(rate(<metric>_count))` because it hides data and is not generally informative
* Post processing will parse your response, re-run the query from the tool output and create a chart visible to the user
* Only generate and execute a prometheus query after checking what metrics are available with the `list_available_metrics` tool
* Check that any node, service, pod, container, app, namespace, etc. mentioned in the query exist in the kubernetes cluster before making a query. Use any appropriate kubectl tool(s) for this
* The toolcall will return no data to you. That is expected. You MUST however ensure that the query is successful.
* You can get the current time before executing a prometheus range query
* ALWAYS embed the execution results into your answer

Style guide:
* Reply with terse output.
* Be painfully concise.
6 changes: 5 additions & 1 deletion holmes/plugins/toolsets/__init__.py
@@ -4,13 +4,15 @@
from typing import List, Optional

from holmes.core.supabase_dal import SupabaseDal
from holmes.plugins.toolsets.datetime import DatetimeToolset
from holmes.plugins.toolsets.robusta import RobustaToolset
from holmes.plugins.toolsets.grafana.toolset_grafana_loki import GrafanaLokiToolset
from holmes.plugins.toolsets.grafana.toolset_grafana_tempo import GrafanaTempoToolset
from holmes.plugins.toolsets.internet import InternetToolset
from holmes.plugins.toolsets.opensearch import OpenSearchToolset
from holmes.plugins.toolsets.prometheus import PrometheusToolset

from holmes.core.tools import Toolset, YAMLToolset
from holmes.plugins.toolsets.opensearch import OpenSearchToolset
import yaml

THIS_DIR = os.path.abspath(os.path.dirname(__file__))
@@ -46,6 +48,8 @@ def load_python_toolsets(dal: Optional[SupabaseDal]) -> List[Toolset]:
        OpenSearchToolset(),
        GrafanaLokiToolset(),
        GrafanaTempoToolset(),
        PrometheusToolset(),
        DatetimeToolset(),
    ]

    return toolsets
35 changes: 35 additions & 0 deletions holmes/plugins/toolsets/datetime.py
@@ -0,0 +1,35 @@
from holmes.core.tools import ToolsetTag
from typing import Dict
from holmes.core.tools import Tool, Toolset
import datetime


class CurrentTime(Tool):
    def __init__(self):
        super().__init__(
            name="get_current_time",
            description="Return current time information. Useful to build queries that require a time information",
            parameters={},
        )

    def invoke(self, params: Dict) -> str:
        now = datetime.datetime.now(datetime.timezone.utc)
        return f"The current UTC date and time are {now}. The current UTC timestamp in seconds is {int(now.timestamp())}."

    def get_parameterized_one_liner(self, params) -> str:
        return "fetched current time"


class DatetimeToolset(Toolset):
    def __init__(self):
        super().__init__(
            name="datetime",
            enabled=True,
            description="Current date and time information",
            docs_url="https://docs.robusta.dev/master/configuration/holmesgpt/toolsets/datetime.html",
            icon_url="https://upload.wikimedia.org/wikipedia/commons/8/8b/OOjs_UI_icon_calendar-ltr.svg",
            prerequisites=[],
            tools=[CurrentTime()],
            tags=[ToolsetTag.CORE],
            is_default=True,
        )
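A quick sanity check of what the new tool returns. A standalone mirror of `CurrentTime.invoke` for illustration (stdlib only, no `holmes` imports):

```python
import datetime

# Standalone mirror of CurrentTime.invoke: report UTC time and unix timestamp.
def current_time_message() -> str:
    now = datetime.datetime.now(datetime.timezone.utc)
    return (
        f"The current UTC date and time are {now}. "
        f"The current UTC timestamp in seconds is {int(now.timestamp())}."
    )

print(current_time_message())
```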