Update index.mdx

This commit is contained in:
Yibo Liu 2024-12-05 09:55:21 +08:00 committed by GitHub
parent 57500f1ffc
commit c8d418d93a
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
1 changed files with 14 additions and 14 deletions

View File

@ -148,8 +148,8 @@ TDinsight 仪表盘旨在提供 TDengine 相关资源的使用情况和状态,
### 预配置告警规则自动导入
涛思总结用户使用经验整理出14个常用的告警规则alert rule能够对集群关键指标进行监测并及时上报指标异常、超限等告警信息。
从TDengine-server 3.3.4.3版本tdengine-datasource 3.6.3开始TDengine Datasource 支持预配置告警规则自动导入功能用户可将14个告警规则一键导入Grafana直接使用。
预配置告警规则导入方法如下图所示在tdengine-datasource setting界面打开Load TDengine Alert开关即可导入所有预配置告警规则如不需要点击Clear TDengine Alert按钮即可删除所有预配置告警规则。
从TDengine-server 3.3.4.3版本tdengine-datasource 3.6.3开始TDengine Datasource 支持预配置告警规则自动导入功能用户可将14个告警规则一键导入Grafana11.x版本,直接使用。
预配置告警规则导入方法如下图所示在tdengine-datasource setting界面打开 “Load Tengine Alert” 开关,点击 “Save & test” 按钮后,插件会自动加载上述告警规则, 规则会放入以数据源名称 + “-alert” 的 grafana 告警目录中。如不需要关闭Load TDengine Alert开关。点击 “Clear TDengine Alert” 旁边的按钮则会清除此数据源导入的所有告警。
![TDengine Alert](./assets/TDengine-Alert.webp)
@ -164,18 +164,18 @@ TDinsight 仪表盘旨在提供 TDengine 相关资源的使用情况和状态,
| ------ | --------- | ---------------- | ----------- |------- |----------------------|
|dnode 节点的CPU负载|均值 > 80%|触发告警|5分钟|5分钟 |`select now(), dnode_id, last(cpu_system) as cup_use from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts < now partition by dnode_id having first(_ts) > 0 `|
|dnode 节点的的内存 |均值 > 60%|触发告警|5分钟|5分钟|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|dnode 节点的磁盘容量占用 | > 80%|触发告警|5分钟|5分钟|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|集群授权到期 |< 60天|触发告警|1天|0秒|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|测点数达到授权测点数|>= 90%|触发告警|1天|0秒|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|查询并发请求数 | > 100|不触发报警|1分钟|0秒|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|慢查询执行最长时间 (无时间窗口) |> 300秒|不触发报警|1分钟|0秒|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|dnode下线 |total != alive|触发告警|30秒|0秒|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|vnode下线 |total != alive|触发告警|30秒|0秒|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|数据删除请求数 |> 0|不触发报警|30秒|0秒|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|Adapter RESTful 请求失败 |> 5|不触发报警|30秒|0秒|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|Adapter WebSocket 请求失败 |> 5|不触发报警|30秒|0秒|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|dnode 数据上报缺少 |< 3|触发告警|180秒|0秒|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|dnode 重启 |max(update_time) > last(update_time)|触发告警|90秒|0秒|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
|dnode 节点的磁盘容量占用 | > 80%|触发告警|5分钟|5分钟|`select now(), dnode_id, data_dir_level, data_dir_name, last(used) / last(total) * 100 as used from log.taosd_dnodes_data_dirs where _ts >= (now - 5m) and _ts < now partition by dnode_id, data_dir_level, data_dir_name`|
|集群授权到期 |< 60天|触发告警|1天|0秒|`select now(), cluster_id, last(grants_expire_time) / 86400 as expire_time from log.taosd_cluster_info where _ts >= (now - 24h) and _ts < now partition by cluster_id having first(_ts) > 0 `|
|测点数达到授权测点数|>= 90%|触发告警|1天|0秒|`select now(), cluster_id, CASE WHEN max(grants_timeseries_total) > 0.0 THEN max(grants_timeseries_used) /max(grants_timeseries_total) * 100.0 ELSE 0.0 END AS result from log.taosd_cluster_info where _ts >= (now - 30s) and _ts < now partition by cluster_id having timetruncate(first(_ts), 1m) > 0`|
|查询并发请求数 | > 100|不触发报警|1分钟|0秒|`select now() as ts, count(*) as slow_count from performance_schema.perf_queries`|
|慢查询执行最长时间 (无时间窗口) |> 300秒|不触发报警|1分钟|0秒|`select now() as ts, count(*) as slow_count from performance_schema.perf_queries where exec_usec>300000000`|
|dnode下线 |total != alive|触发告警|30秒|0秒|`select now(), cluster_id, last(dnodes_total) - last(dnodes_alive) as dnode_offline from log.taosd_cluster_info where _ts >= (now -30s) and _ts < now partition by cluster_id having first(_ts) > 0`|
|vnode下线 |total != alive|触发告警|30秒|0秒|`select now(), cluster_id, last(vnodes_total) - last(vnodes_alive) as vnode_offline from log.taosd_cluster_info where _ts >= (now - 30s) and _ts < now partition by cluster_id having first(_ts) > 0 `|
|数据删除请求数 |> 0|不触发报警|30秒|0秒|`select now(), count(`count`) as `delete_count` from log.taos_sql_req where sql_type = 'delete' and _ts >= (now -30s) and _ts < now`|
|Adapter RESTful 请求失败 |> 5|不触发报警|30秒|0秒|`select now(), sum(`fail`) as `Failed` from log.adapter_requests where req_type=0 and ts >= (now -30s) and ts < now;`|
|Adapter WebSocket 请求失败 |> 5|不触发报警|30秒|0秒|`select now(), sum(`fail`) as `Failed` from log.adapter_requests where req_type=1 and ts >= (now -30s) and ts < now`|
|dnode 数据上报缺少 |< 3|触发告警|180秒|0秒|`select now(), cluster_id, count(*) as dnode_report from log.taosd_cluster_info where _ts >= (now -180s) and _ts < now partition by cluster_id having timetruncate(first(_ts), 1h) > 0`|
|dnode 重启 |max(update_time) > last(update_time)|触发告警|90秒|0秒|`select now(), dnode_id, max(uptime) - last(uptime) as dnode_restart from log.taosd_dnodes_info where _ts >= (now - 90s) and _ts < now partition by dnode_id`|
用户可参考上述告警规则,根据自己业务需求进行修改与完善。
Grafana7.5及以下版本Dashboards与Alert rules功能合在一起而之后的新版本两个功能是分开的。为兼容Grafana7.5及以下版本TDinsight面板中增加了Alert Used Only面板仅Grafana7.5及以下版本需要使用。