Update index.mdx
parent 57500f1ffc
commit c8d418d93a
@@ -148,8 +148,8 @@ The TDinsight dashboard is designed to show the usage and status of TDengine-related resources,
### Automatic Import of Preconfigured Alert Rules
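
Drawing on user experience, TAOS Data has compiled 14 commonly used alert rules that monitor key cluster metrics and promptly report alerts such as metric anomalies and threshold violations.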
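-Starting with TDengine-server 3.3.4.3 (tdengine-datasource 3.6.3), TDengine Datasource supports automatic import of preconfigured alert rules; users can import all 14 alert rules into Grafana with one click and use them directly.
-The import procedure is shown in the figure below: on the tdengine-datasource settings page, turn on the "Load TDengine Alert" switch to import all preconfigured alert rules; if they are not needed, click the "Clear TDengine Alert" button to delete all preconfigured alert rules.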
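+Starting with TDengine-server 3.3.4.3 (tdengine-datasource 3.6.3), TDengine Datasource supports automatic import of preconfigured alert rules; users can import all 14 alert rules into Grafana (version 11.x) with one click and use them directly.
+The import procedure is shown in the figure below: on the tdengine-datasource settings page, turn on the "Load TDengine Alert" switch and click the "Save & test" button; the plugin then automatically loads the alert rules above and places them in a Grafana alert folder named after the data source with the suffix "-alert". If they are not needed, turn off the "Load TDengine Alert" switch. Clicking the button next to "Clear TDengine Alert" clears all alerts imported by this data source.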
@@ -164,18 +164,18 @@ The TDinsight dashboard is designed to show the usage and status of TDengine-related resources,
| ------ | --------- | ---------------- | ----------- |------- |----------------------|
|CPU load of dnode|average > 80%|Trigger alert|5 minutes|5 minutes |`select now(), dnode_id, last(cpu_system) as cup_use from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts < now partition by dnode_id having first(_ts) > 0 `|
|Memory of dnode |average > 60%|Trigger alert|5 minutes|5 minutes|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
-|Disk capacity usage of dnode | > 80%|Trigger alert|5 minutes|5 minutes|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
-|Cluster license expiration |< 60 days|Trigger alert|1 day|0 seconds|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
-|Measurement points reaching the licensed limit|>= 90%|Trigger alert|1 day|0 seconds|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
-|Concurrent query requests | > 100|Do not trigger alert|1 minute|0 seconds|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
-|Maximum slow query execution time (no time window) |> 300 seconds|Do not trigger alert|1 minute|0 seconds|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
-|dnode offline |total != alive|Trigger alert|30 seconds|0 seconds|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
-|vnode offline |total != alive|Trigger alert|30 seconds|0 seconds|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
-|Data deletion requests |> 0|Do not trigger alert|30 seconds|0 seconds|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
-|Adapter RESTful request failures |> 5|Do not trigger alert|30 seconds|0 seconds|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
-|Adapter WebSocket request failures |> 5|Do not trigger alert|30 seconds|0 seconds|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
-|Missing dnode data reports |< 3|Trigger alert|180 seconds|0 seconds|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
-|dnode restart |max(update_time) > last(update_time)|Trigger alert|90 seconds|0 seconds|`select now(), dnode_id, last(mem_engine) / last(mem_total) * 100 as taosd from log.taosd_dnodes_info where _ts >= (now- 5m) and _ts <now partition by dnode_id`|
+|Disk capacity usage of dnode | > 80%|Trigger alert|5 minutes|5 minutes|`select now(), dnode_id, data_dir_level, data_dir_name, last(used) / last(total) * 100 as used from log.taosd_dnodes_data_dirs where _ts >= (now - 5m) and _ts < now partition by dnode_id, data_dir_level, data_dir_name`|
+|Cluster license expiration |< 60 days|Trigger alert|1 day|0 seconds|`select now(), cluster_id, last(grants_expire_time) / 86400 as expire_time from log.taosd_cluster_info where _ts >= (now - 24h) and _ts < now partition by cluster_id having first(_ts) > 0 `|
+|Measurement points reaching the licensed limit|>= 90%|Trigger alert|1 day|0 seconds|`select now(), cluster_id, CASE WHEN max(grants_timeseries_total) > 0.0 THEN max(grants_timeseries_used) /max(grants_timeseries_total) * 100.0 ELSE 0.0 END AS result from log.taosd_cluster_info where _ts >= (now - 30s) and _ts < now partition by cluster_id having timetruncate(first(_ts), 1m) > 0`|
+|Concurrent query requests | > 100|Do not trigger alert|1 minute|0 seconds|`select now() as ts, count(*) as slow_count from performance_schema.perf_queries`|
+|Maximum slow query execution time (no time window) |> 300 seconds|Do not trigger alert|1 minute|0 seconds|`select now() as ts, count(*) as slow_count from performance_schema.perf_queries where exec_usec>300000000`|
+|dnode offline |total != alive|Trigger alert|30 seconds|0 seconds|`select now(), cluster_id, last(dnodes_total) - last(dnodes_alive) as dnode_offline from log.taosd_cluster_info where _ts >= (now -30s) and _ts < now partition by cluster_id having first(_ts) > 0`|
+|vnode offline |total != alive|Trigger alert|30 seconds|0 seconds|`select now(), cluster_id, last(vnodes_total) - last(vnodes_alive) as vnode_offline from log.taosd_cluster_info where _ts >= (now - 30s) and _ts < now partition by cluster_id having first(_ts) > 0 `|
+|Data deletion requests |> 0|Do not trigger alert|30 seconds|0 seconds|``select now(), count(`count`) as `delete_count` from log.taos_sql_req where sql_type = 'delete' and _ts >= (now -30s) and _ts < now``|
+|Adapter RESTful request failures |> 5|Do not trigger alert|30 seconds|0 seconds|``select now(), sum(`fail`) as `Failed` from log.adapter_requests where req_type=0 and ts >= (now -30s) and ts < now;``|
+|Adapter WebSocket request failures |> 5|Do not trigger alert|30 seconds|0 seconds|``select now(), sum(`fail`) as `Failed` from log.adapter_requests where req_type=1 and ts >= (now -30s) and ts < now``|
+|Missing dnode data reports |< 3|Trigger alert|180 seconds|0 seconds|`select now(), cluster_id, count(*) as dnode_report from log.taosd_cluster_info where _ts >= (now -180s) and _ts < now partition by cluster_id having timetruncate(first(_ts), 1h) > 0`|
+|dnode restart |max(update_time) > last(update_time)|Trigger alert|90 seconds|0 seconds|`select now(), dnode_id, max(uptime) - last(uptime) as dnode_restart from log.taosd_dnodes_info where _ts >= (now - 90s) and _ts < now partition by dnode_id`|

Users can refer to the alert rules above and modify or extend them to suit their business needs.
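
As one possible way to adapt these rules, the sketch below rebuilds two of the queries from the table with an adjustable time window and threshold. The helper names (`window_predicate`, `slow_query_sql`, `dnode_offline_sql`) are hypothetical and are not part of TDengine or the Grafana plugin; the table and column names are copied from the rules in this change. It assumes `exec_usec` is measured in microseconds, which matches the 300-second rule using `300000000`.

```python
def window_predicate(seconds: int, ts_col: str = "_ts") -> str:
    """Return the half-open time filter [now - seconds, now) used by the rules."""
    return f"{ts_col} >= (now - {seconds}s) and {ts_col} < now"


def slow_query_sql(threshold_seconds: int) -> str:
    """Count currently running queries slower than the given threshold.

    Assumption: exec_usec is in microseconds, so seconds are scaled by 1e6.
    """
    threshold_usec = threshold_seconds * 1_000_000
    return (
        "select now() as ts, count(*) as slow_count "
        "from performance_schema.perf_queries "
        f"where exec_usec>{threshold_usec}"
    )


def dnode_offline_sql(window_seconds: int) -> str:
    """Report total minus alive dnodes per cluster over the given window."""
    return (
        "select now(), cluster_id, "
        "last(dnodes_total) - last(dnodes_alive) as dnode_offline "
        "from log.taosd_cluster_info "
        f"where {window_predicate(window_seconds)} "
        "partition by cluster_id having first(_ts) > 0"
    )


if __name__ == "__main__":
    # Reproduces the 300-second slow-query rule, then a 60-second offline check.
    print(slow_query_sql(300))
    print(dnode_offline_sql(60))
```

The generated string can be pasted into the query field of a copied alert rule; the alert condition itself (for example `> 0` for offline dnodes) is still set in Grafana.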
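
In Grafana 7.5 and earlier, Dashboards and Alert rules are combined into one feature, while in later versions they are separate. For compatibility with Grafana 7.5 and earlier, an "Alert Used Only" panel has been added to the TDinsight dashboard; it is needed only on Grafana 7.5 and earlier.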