Commit 0ddefeb

Add monitoring tool document
1 parent 384b9db commit 0ddefeb

6 files changed

+682
-0
lines changed

src/.vuepress/sidebar/V2.0.x/zh-Table.ts

Lines changed: 1 addition & 0 deletions
@@ -99,6 +99,7 @@ export const zhSidebar = {
9999
// children: 'structure',
100100
children: [
101101
{ text: '命令行工具', link: 'CLI' },
102+
{ text: '监控工具', link: 'Monitor-Tool_apache' },
102103
],
103104
},
104105
{

src/.vuepress/sidebar_timecho/V2.0.x/zh-Table.ts

Lines changed: 1 addition & 0 deletions
@@ -101,6 +101,7 @@ export const zhSidebar = {
101101
// children: 'structure',
102102
children: [
103103
{ text: '命令行工具', link: 'CLI' },
104+
{ text: '监控工具', link: 'Monitor-Tool_timecho' },
104105
],
105106
},
106107
{
Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
1+
<!--
2+
3+
Licensed to the Apache Software Foundation (ASF) under one
4+
or more contributor license agreements. See the NOTICE file
5+
distributed with this work for additional information
6+
regarding copyright ownership. The ASF licenses this file
7+
to you under the Apache License, Version 2.0 (the
8+
"License"); you may not use this file except in compliance
9+
with the License. You may obtain a copy of the License at
10+
11+
http://www.apache.org/licenses/LICENSE-2.0
12+
13+
Unless required by applicable law or agreed to in writing,
14+
software distributed under the License is distributed on an
15+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
16+
KIND, either express or implied. See the License for the
17+
specific language governing permissions and limitations
18+
under the License.
19+
20+
-->
21+
22+
23+
# Prometheus
24+
25+
For deploying the monitoring tool, see the [Monitoring Panel Deployment](../Deployment-and-Maintenance/Monitoring-panel-deployment.md) chapter.
26+
27+
## Mapping of Monitoring Metrics to Prometheus
28+
29+
> A metric with Metric Name `name` and Tags K1=V1, ..., Kn=Vn is mapped as follows, where `value` is the concrete sample value.
30+
31+
| Metric type      | Mapping                        |
32+
| ---------------- |--------------------------------|
33+
| Counter | name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value |
34+
| AutoGauge, Gauge | name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value |
35+
| Histogram | name_max{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_sum{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_count{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", quantile="0.5"} value <br> name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", quantile="0.99"} value |
36+
| Rate | name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", rate="m1"} value <br> name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", rate="m5"} value <br> name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", rate="m15"} value <br> name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", rate="mean"} value |
37+
| Timer | name_seconds_max{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_seconds_sum{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_seconds_count{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", quantile="0.5"} value <br> name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", quantile="0.99"} value |
38+
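The mapping table can be read as a small expansion rule: each metric type turns one metric name into one or more Prometheus series. The sketch below only mirrors the table; `series_names` and its arguments are hypothetical names chosen for illustration and are not part of IoTDB.

```python
# Hypothetical helper that mirrors the mapping table; not an IoTDB API.
def series_names(metric_type, name, labels):
    """Return the Prometheus series that a metric of the given type expands to."""

    def fmt(metric, extra=None):
        # Merge the common labels with any suffix-specific label (quantile, rate).
        merged = dict(labels)
        if extra:
            merged.update(extra)
        body = ",".join(f'{key}="{value}"' for key, value in merged.items())
        return f"{metric}{{{body}}}"

    if metric_type == "Counter":
        return [fmt(f"{name}_total")]
    if metric_type in ("Gauge", "AutoGauge"):
        return [fmt(name)]
    if metric_type == "Histogram":
        return ([fmt(f"{name}_{suffix}") for suffix in ("max", "sum", "count")]
                + [fmt(name, {"quantile": q}) for q in ("0.5", "0.99")])
    if metric_type == "Rate":
        return ([fmt(f"{name}_total")]
                + [fmt(f"{name}_total", {"rate": r})
                   for r in ("m1", "m5", "m15", "mean")])
    if metric_type == "Timer":
        return ([fmt(f"{name}_seconds_{suffix}") for suffix in ("max", "sum", "count")]
                + [fmt(f"{name}_seconds", {"quantile": q}) for q in ("0.5", "0.99")])
    raise ValueError(f"unknown metric type: {metric_type}")
```

For example, calling `series_names("Timer", "x", {...})` yields five series: three with the `_seconds_max`, `_seconds_sum`, and `_seconds_count` suffixes and two `x_seconds` series distinguished by the `quantile` label.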
39+
## Modify the Configuration File
40+
41+
1) Taking DataNode as an example, modify the iotdb-system.properties configuration file as follows:
42+
43+
```properties
44+
dn_metric_reporter_list=PROMETHEUS
45+
dn_metric_level=CORE
46+
dn_metric_prometheus_reporter_port=9091
47+
```
48+
49+
2) Start the IoTDB DataNode.
50+
51+
3) Open a browser or use `curl` to access `http://server_ip:9091/metrics`. You should get metric data like the following:
52+
53+
```
54+
...
55+
# HELP file_count
56+
# TYPE file_count gauge
57+
file_count{name="wal",} 0.0
58+
file_count{name="unseq",} 0.0
59+
file_count{name="seq",} 2.0
60+
...
61+
```
62+
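To check the endpoint output programmatically, the text can be split into samples. The following is a minimal sketch rather than a full parser for the Prometheus exposition format: it assumes one sample per line, ignores comment lines, and does not handle escaped quotes inside label values. `parse_samples` is a hypothetical name.

```python
import re

# Matches sample lines such as: file_count{name="wal",} 0.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (metric_name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a sample, e.g. an elision
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

Applied to the three `file_count` lines shown above, this returns three tuples whose labels are `{"name": "wal"}`, `{"name": "unseq"}`, and `{"name": "seq"}`.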
63+
## Prometheus + Grafana
64+
65+
As shown above, IoTDB exposes its monitoring metrics in the standard Prometheus format. You can use Prometheus to collect and store the metrics and Grafana
66+
to visualize them.
67+
68+
The relationship among IoTDB, Prometheus, and Grafana is shown in the figure below:
69+
70+
![iotdb_prometheus_grafana](https://alioss.timecho.com/docs/img/UserGuide/System-Tools/Metrics/iotdb_prometheus_grafana.png)
71+
72+
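1. IoTDB continuously collects monitoring metrics while it runs.
73+
2. Prometheus pulls the metrics from IoTDB's HTTP interface at a fixed, configurable interval.
74+
3. Prometheus stores the pulled metrics in its own TSDB.
75+
4. Grafana queries the metrics from Prometheus at a fixed, configurable interval and plots them.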
76+
77+
As the interaction flow shows, some extra work is needed to deploy and configure Prometheus and Grafana.
78+
79+
For example, you can configure Prometheus as follows (adjust the parameters as needed) to pull monitoring data from IoTDB:
80+
81+
```yaml
82+
job_name: pull-metrics
83+
honor_labels: true
84+
honor_timestamps: true
85+
scrape_interval: 15s
86+
scrape_timeout: 10s
87+
metrics_path: /metrics
88+
scheme: http
89+
follow_redirects: true
90+
static_configs:
91+
  - targets:
92+
      - localhost:9091
93+
```
94+
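The job above is a fragment; in a Prometheus configuration file, jobs are listed under the top-level `scrape_configs` key. As a rough sketch, the snippet below renders such a file for several IoTDB nodes using plain string formatting. `render_scrape_config` is a hypothetical helper, and any target other than `localhost:9091` is an assumption made for illustration.

```python
# Hypothetical helper: renders a minimal Prometheus scrape configuration.
def render_scrape_config(job_name, targets, scrape_interval="15s"):
    """Return YAML text with one scrape job covering the given targets."""
    lines = [
        "scrape_configs:",
        f"  - job_name: {job_name}",
        f"    scrape_interval: {scrape_interval}",
        "    metrics_path: /metrics",
        "    scheme: http",
        "    static_configs:",
        "      - targets:",
    ]
    # One list entry per host:port pair exposing /metrics.
    lines += [f"          - {target}" for target in targets]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The second target is an assumed extra node, not taken from the source.
    print(render_scrape_config("pull-metrics", ["localhost:9091", "localhost:9092"]))
```

Generating the text this way keeps the target list in one place when a cluster has many nodes; a YAML library would be a more robust choice for anything beyond a sketch.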
95+
For more details, see the following documents:
96+
97+
[Prometheus installation and usage documentation](https://prometheus.io/docs/prometheus/latest/getting_started/)
98+
99+
[Prometheus configuration for pulling metrics from an HTTP interface](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)
100+
101+
[Grafana installation and usage documentation](https://grafana.com/docs/grafana/latest/getting-started/getting-started/)
102+
103+
[Documentation on querying and plotting Prometheus data in Grafana](https://prometheus.io/docs/visualization/grafana/#grafana-support-for-prometheus)
104+
105+
## Apache IoTDB Dashboard
106+
107+
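`Apache IoTDB Dashboard` is a companion product of IoTDB Enterprise Edition. It supports unified, centralized operations and maintenance, and can monitor multiple clusters from a single monitoring panel. To obtain the Dashboard JSON file, contact the sales team.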
108+
109+
110+
![Apache IoTDB Dashboard](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20default%20cluster.png)
111+
112+
![Apache IoTDB Dashboard](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20cluster2.png)
113+
114+
115+
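### Cluster Overview
116+
117+
Monitored items include, but are not limited to:
118+
- Total CPU cores, total memory, and total disk space of the cluster
119+
- Number of ConfigNodes and DataNodes in the cluster
120+
- Cluster uptime
121+
- Cluster write speed
122+
- Current CPU, memory, and disk usage of each node
123+
- Per-node information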
124+
125+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E6%A6%82%E8%A7%88.png)
126+
127+
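### Data Writing
128+
129+
Monitored items include, but are not limited to:
130+
- Average, median, and 99th-percentile write latency
131+
- Number and size of WAL files
132+
- Time each node spends on WAL flush SyncBuffer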
133+
134+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E5%86%99%E5%85%A5.png)
135+
136+
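### Data Querying
137+
138+
Monitored items include, but are not limited to:
139+
- Time each node spends loading time series metadata during queries
140+
- Time each node spends reading time series during queries
141+
- Time each node spends modifying time series metadata during queries
142+
- Time each node spends loading the Chunk metadata list during queries
143+
- Time each node spends modifying Chunk metadata during queries
144+
- Time each node spends filtering by Chunk metadata during queries
145+
- Average time each node spends constructing the Chunk Reader during queries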
146+
147+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E6%9F%A5%E8%AF%A2.png)
148+
149+
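### Storage Engine
150+
151+
Monitored items include, but are not limited to:
152+
- Number and size of files by type
153+
- Number and size of TsFiles at each stage
154+
- Number and duration of each kind of task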
155+
156+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E5%AD%98%E5%82%A8%E5%BC%95%E6%93%8E.png)
157+
158+
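### System Monitoring
159+
160+
Monitored items include, but are not limited to:
161+
- System memory, swap memory, and process memory
162+
- Disk space, file count, and file size
163+
- JVM GC time ratio, GC count by type, GC data volume, and heap memory usage by generation
164+
- Network transfer rate and packet send rate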
165+
166+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E7%B3%BB%E7%BB%9F%20%E5%86%85%E5%AD%98%E4%B8%8E%E7%A1%AC%E7%9B%98.png)
167+
168+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E7%B3%BB%E7%BB%9Fjvm.png)
169+
170+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E7%B3%BB%E7%BB%9F%20%E7%BD%91%E7%BB%9C.png)
Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
1+
<!--
2+
3+
Licensed to the Apache Software Foundation (ASF) under one
4+
or more contributor license agreements. See the NOTICE file
5+
distributed with this work for additional information
6+
regarding copyright ownership. The ASF licenses this file
7+
to you under the Apache License, Version 2.0 (the
8+
"License"); you may not use this file except in compliance
9+
with the License. You may obtain a copy of the License at
10+
11+
http://www.apache.org/licenses/LICENSE-2.0
12+
13+
Unless required by applicable law or agreed to in writing,
14+
software distributed under the License is distributed on an
15+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
16+
KIND, either express or implied. See the License for the
17+
specific language governing permissions and limitations
18+
under the License.
19+
20+
-->
21+
22+
23+
# Monitoring Tool
24+
25+
For deploying the monitoring tool, see the [Monitoring Panel Deployment](../Deployment-and-Maintenance/Monitoring-panel-deployment.md) chapter.
26+
27+
## Mapping of Monitoring Metrics to Prometheus
28+
29+
> A metric with Metric Name `name` and Tags K1=V1, ..., Kn=Vn is mapped as follows, where `value` is the concrete sample value.
30+
31+
| Metric type      | Mapping                                                      |
32+
| ---------------- | ------------------------------------------------------------ |
33+
| Counter | name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value |
34+
| AutoGauge, Gauge | name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value |
35+
| Histogram | name_max{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_sum{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_count{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", quantile="0.5"} value <br> name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", quantile="0.99"} value |
36+
| Rate | name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", rate="m1"} value <br> name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", rate="m5"} value <br> name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", rate="m15"} value <br> name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", rate="mean"} value |
37+
| Timer | name_seconds_max{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_seconds_sum{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_seconds_count{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn"} value <br> name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", quantile="0.5"} value <br> name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", ..., Kn="Vn", quantile="0.99"} value |
38+
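The `_sum` and `_count` series in the Timer and Histogram rows can be combined into a mean. The sketch below assumes both series accumulate over the lifetime of the process, which this document does not state; the function names are hypothetical.

```python
# Hypothetical helpers; they assume *_sum and *_count are cumulative values.
def mean_seconds(sum_seconds, count):
    """Average duration per recorded event, or None when nothing was recorded."""
    if count == 0:
        return None
    return sum_seconds / count


def mean_over_window(sum_then, count_then, sum_now, count_now):
    """Average duration of the events recorded between two scrapes."""
    return mean_seconds(sum_now - sum_then, count_now - count_then)
```

Differencing two scrapes gives the mean for that window only, which is usually more useful on a dashboard than a mean over the whole process lifetime.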
39+
## Modify the Configuration File
40+
41+
1) Taking DataNode as an example, modify the iotdb-system.properties configuration file as follows:
42+
43+
```properties
44+
dn_metric_reporter_list=PROMETHEUS
45+
dn_metric_level=CORE
46+
dn_metric_prometheus_reporter_port=9091
47+
```
48+
49+
2) Start the IoTDB DataNode.
50+
51+
3) Open a browser or use `curl` to access `http://server_ip:9091/metrics`. You should get metric data like the following:
52+
53+
```
54+
...
55+
# HELP file_count
56+
# TYPE file_count gauge
57+
file_count{name="wal",} 0.0
58+
file_count{name="unseq",} 0.0
59+
file_count{name="seq",} 2.0
60+
...
61+
```
62+
63+
## Prometheus + Grafana
64+
65+
As shown above, IoTDB exposes its monitoring metrics in the standard Prometheus format. You can use Prometheus to collect and store the metrics and Grafana
66+
to visualize them.
67+
68+
The relationship among IoTDB, Prometheus, and Grafana is shown in the figure below:
69+
70+
![iotdb_prometheus_grafana](https://alioss.timecho.com/docs/img/UserGuide/System-Tools/Metrics/iotdb_prometheus_grafana.png)
71+
72+
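1. IoTDB continuously collects monitoring metrics while it runs.
73+
2. Prometheus pulls the metrics from IoTDB's HTTP interface at a fixed, configurable interval.
74+
3. Prometheus stores the pulled metrics in its own TSDB.
75+
4. Grafana queries the metrics from Prometheus at a fixed, configurable interval and plots them.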
76+
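Step 4 amounts to sending PromQL to the Prometheus HTTP API. As a small sketch, the helper below builds an instant-query URL for the `/api/v1/query` endpoint. The base address `http://localhost:9090` is an assumed Prometheus location, and `instant_query_url` is a hypothetical name.

```python
from urllib.parse import urlencode


# Hypothetical helper: builds a Prometheus instant-query URL.
def instant_query_url(prometheus_base, promql):
    """Return the URL that evaluates the PromQL expression at the current time."""
    query = urlencode({"query": promql})  # percent-encodes braces and quotes
    return f"{prometheus_base.rstrip('/')}/api/v1/query?{query}"


if __name__ == "__main__":
    # Assumed Prometheus address; adjust to your deployment.
    print(instant_query_url("http://localhost:9090", 'file_count{name="seq"}'))
```

Grafana issues requests of this kind on your behalf once a Prometheus data source is configured, so building the URL by hand is mainly useful for ad hoc checks.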
77+
As the interaction flow shows, some extra work is needed to deploy and configure Prometheus and Grafana.
78+
79+
For example, you can configure Prometheus as follows (adjust the parameters as needed) to pull monitoring data from IoTDB:
80+
81+
```yaml
82+
job_name: pull-metrics
83+
honor_labels: true
84+
honor_timestamps: true
85+
scrape_interval: 15s
86+
scrape_timeout: 10s
87+
metrics_path: /metrics
88+
scheme: http
89+
follow_redirects: true
90+
static_configs:
91+
  - targets:
92+
      - localhost:9091
93+
```
94+
95+
For more details, see the following documents:
96+
97+
[Prometheus installation and usage documentation](https://prometheus.io/docs/prometheus/latest/getting_started/)
98+
99+
[Prometheus configuration for pulling metrics from an HTTP interface](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)
100+
101+
[Grafana installation and usage documentation](https://grafana.com/docs/grafana/latest/getting-started/getting-started/)
102+
103+
[Documentation on querying and plotting Prometheus data in Grafana](https://prometheus.io/docs/visualization/grafana/#grafana-support-for-prometheus)
104+
105+
## Apache IoTDB Dashboard
106+
107+
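We provide the Apache IoTDB Dashboard, which supports unified, centralized operations and maintenance, and can monitor multiple clusters from a single monitoring panel.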
108+
109+
![Apache IoTDB Dashboard](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20default%20cluster.png)
110+
111+
![Apache IoTDB Dashboard](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20cluster2.png)
112+
113+
The Dashboard JSON file is available in the Enterprise Edition.
114+
115+
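### Cluster Overview
116+
117+
Monitored items include, but are not limited to:
118+
- Total CPU cores, total memory, and total disk space of the cluster
119+
- Number of ConfigNodes and DataNodes in the cluster
120+
- Cluster uptime
121+
- Cluster write speed
122+
- Current CPU, memory, and disk usage of each node
123+
- Per-node information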
124+
125+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E6%A6%82%E8%A7%88.png)
126+
127+
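### Data Writing
128+
129+
Monitored items include, but are not limited to:
130+
- Average, median, and 99th-percentile write latency
131+
- Number and size of WAL files
132+
- Time each node spends on WAL flush SyncBuffer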
133+
134+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E5%86%99%E5%85%A5.png)
135+
136+
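### Data Querying
137+
138+
Monitored items include, but are not limited to:
139+
- Time each node spends loading time series metadata during queries
140+
- Time each node spends reading time series during queries
141+
- Time each node spends modifying time series metadata during queries
142+
- Time each node spends loading the Chunk metadata list during queries
143+
- Time each node spends modifying Chunk metadata during queries
144+
- Time each node spends filtering by Chunk metadata during queries
145+
- Average time each node spends constructing the Chunk Reader during queries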
146+
147+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E6%9F%A5%E8%AF%A2.png)
148+
149+
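### Storage Engine
150+
151+
Monitored items include, but are not limited to:
152+
- Number and size of files by type
153+
- Number and size of TsFiles at each stage
154+
- Number and duration of each kind of task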
155+
156+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E5%AD%98%E5%82%A8%E5%BC%95%E6%93%8E.png)
157+
158+
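### System Monitoring
159+
160+
Monitored items include, but are not limited to:
161+
- System memory, swap memory, and process memory
162+
- Disk space, file count, and file size
163+
- JVM GC time ratio, GC count by type, GC data volume, and heap memory usage by generation
164+
- Network transfer rate and packet send rate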
165+
166+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E7%B3%BB%E7%BB%9F%20%E5%86%85%E5%AD%98%E4%B8%8E%E7%A1%AC%E7%9B%98.png)
167+
168+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E7%B3%BB%E7%BB%9Fjvm.png)
169+
170+
![](https://alioss.timecho.com/docs/img/%E7%9B%91%E6%8E%A7%20%E7%B3%BB%E7%BB%9F%20%E7%BD%91%E7%BB%9C.png)
